operator/docs/prepare-server.md

Here's the guide formatted as a `README.md` file, ready for a GitHub repository or local documentation.

```markdown
# Optimizing Debian for K3s

This guide outlines steps to optimize a Debian server for running K3s (Lightweight Kubernetes). Optimization involves a combination of general Linux best practices, K3s-specific recommendations, and considerations for your specific workload.

## Table of Contents

- [1. Debian Base System Optimization](#1-debian-base-system-optimization)
  - [a. Kernel Parameters (sysctl.conf)](#a-kernel-parameters-sysctlconf)
  - [b. User Limits (ulimit)](#b-user-limits-ulimit)
  - [c. Disable Unnecessary Services](#c-disable-unnecessary-services)
  - [d. Update System](#d-update-system)
  - [e. Swap Configuration](#e-swap-configuration)
- [2. K3s Specific Optimizations](#2-k3s-specific-optimizations)
  - [a. Choose a Performant Storage Backend](#a-choose-a-performant-storage-backend)
  - [b. Containerd Tuning](#b-containerd-tuning)
  - [c. K3s Server and Agent Configuration](#c-k3s-server-and-agent-configuration)
  - [d. CNI Choice](#d-cni-choice)
- [3. General Server Best Practices](#3-general-server-best-practices)
  - [a. Fast Storage](#a-fast-storage)
  - [b. Adequate RAM and CPU](#b-adequate-ram-and-cpu)
  - [c. Network Configuration](#c-network-configuration)
  - [d. Monitoring](#d-monitoring)
  - [e. Logging](#e-logging)
- [4. Post-Optimization Verification](#4-post-optimization-verification)

---

## 1. Debian Base System Optimization

These steps are generally beneficial for any server, but particularly important for containerized environments like K3s.

### a. Kernel Parameters (sysctl.conf)

Edit `/etc/sysctl.conf` and apply changes with `sudo sysctl -p`.

```ini
# Increase maximum open files (for container processes, K3s components)
fs.inotify.max_user_watches = 524288 # For fs-based operations within containers
fs.inotify.max_user_instances = 8192 # For fs-based operations within containers
fs.file-max = 2097152 # Increase overall system file handle limit

# Increase limits for network connections
net.core.somaxconn = 65535 # Max backlog of pending connections
net.ipv4.tcp_tw_reuse = 1 # Allow reuse of TIME_WAIT sockets (caution: can sometimes mask issues)
net.ipv4.tcp_fin_timeout = 30 # Reduce TIME_WAIT duration
net.ipv4.tcp_max_syn_backlog = 65535 # Max number of remembered connection requests
net.ipv4.tcp_keepalive_time=600 # Shorter keepalive interval
net.ipv4.tcp_keepalive_intvl=60 # Keepalive interval
net.ipv4.tcp_keepalive_probes=3 # Keepalive probes

# Increase memory limits for network buffers (especially if high network traffic)
net.core.rmem_max = 26214400
net.core.wmem_max = 26214400
net.core.rmem_default = 26214400
net.core.wmem_default = 26214400

# Other useful parameters
vm.max_map_count = 262144 # Essential for Elasticsearch, MongoDB, etc.
vm.dirty_ratio = 5 # Reduce dirty page percentage for better write performance
vm.dirty_background_ratio = 10 # Reduce dirty page percentage for better write performance
kernel.pid_max = 4194304 # Increase max PIDs
```

**Explanation:**
- `fs.file-max`: K3s and its deployed containers can open a large number of files. Increasing this prevents "Too many open files" errors.
- `net.*`: These parameters help in handling a high number of concurrent network connections crucial for a Kubernetes cluster.
- `vm.max_map_count`: Required by some applications that run on Kubernetes (e.g., Elasticsearch).

### b. User Limits (ulimit)

Edit `/etc/security/limits.conf` (or create a file like `/etc/security/limits.d/k3s.conf`) for all users, or specifically for the user K3s runs as (often `root` by default or a dedicated `k3s` user).

```
# For all users (or a specific k3s user if you configure it)
*    soft nofile 65536
*    hard nofile 131072
*    soft nproc  65536
*    hard nproc  131072
```
**Note:** A reboot or logging out/in is often required for these changes to take effect for user sessions. Services typically pick up new limits upon restart.

**Explanation:**
- `nofile` (number of open files): Sets the per-user/per-process limit. K3s and pods need a high limit.
- `nproc` (number of processes): Each container consumes processes. A high limit prevents hitting a ceiling.

### c. Disable Unnecessary Services

Reducing background services frees up CPU, RAM, and I/O.
```bash
sudo systemctl disable --now apache2 # Example, replace with actual unused services
sudo systemctl disable --now nginx    # Example
sudo systemctl disable --now cups     # If not using printing
sudo systemctl disable --now modemmanager # If not using a modem
sudo systemctl disable --now bluetooth # If no bluetooth devices
# Review active services using:
# systemctl list-unit-files --type=service --state=enabled
```

### d. Update System

Keep your system packages up-to-date for security and performance bug fixes.
```bash
sudo apt update
sudo apt upgrade -y
sudo apt dist-upgrade -y # For major version changes (if applicable)
sudo apt autoremove -y
sudo reboot # After significant kernel or base system updates
```

### e. Swap Configuration

**It is generally recommended to disable swap on K3s nodes, especially worker nodes.** Swapping can severely degrade performance in containerized environments due to unpredictable latency.

If you absolutely must have swap (e.g., very low memory server, not recommended for production):
*   Reduce swappiness: `sudo sysctl vm.swappiness=10` (or even `1`). Add `vm.swappiness = 10` to `/etc/sysctl.conf`.
*   Preferably, disable swap entirely if you have sufficient RAM:
    ```bash
    sudo swapoff -a
    sudo sed -i '/ swap / s/^/#/' /etc/fstab
    ```
    **WARNING:** Only disable swap if your system has sufficient RAM to handle its workload without it. If nodes run out of memory without swap, processes will be OOM-killed, leading to instability.

## 2. K3s Specific Optimizations

### a. Choose a Performant Storage Backend

The choice of K3s's data store significantly impacts performance and availability.

*   **SQLite (Default):** Good for single-node setups or small, non-critical clusters. Performance can degrade with many changes or large clusters.
*   **External Database (MariaDB/MySQL, PostgreSQL):**
    *   **Recommended for Production:** Offers high availability and better performance than embedded SQLite for multi-node K3s server configurations.
    *   **Placement:** Place the external database on a separate server or on a dedicated, fast storage volume.
*   **External etcd:** Offers the best performance and scalability, but is more complex to manage and requires its own dedicated etcd cluster.

### b. Containerd Tuning

K3s uses containerd as its container runtime.

*   **Fast Storage for Containerd:** Ensure the directories where containerd stores its data are on fast storage (NVMe SSDs are ideal).
    *   `/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs` (K3s specific)
    *   (`/var/lib/containerd` if using a standalone containerd setup)
    This is critical for image pulls, container startup, and overlayfs performance.

### c. K3s Server and Agent Configuration

Configure K3s using a configuration file (e.g., `/etc/rancher/k3s/config.yaml`) or command-line flags.

*   **Disable Unused Components:** Reduce resource consumption by disabling features you don't need.
    *   `--disable traefik`: If using Nginx Ingress Controller or another ingress.
    *   `--disable servicelb`: If using a cloud provider Load Balancer, MetalLB, or another solution.
    *   `--disable local-storage`: If using cloud provider storage, NFS, or another remote storage solution.
    *   `--disable metrics-server`: If using a different metrics solution or don't need it.
    *   `--disable helm-controller`: If exclusively using `kubectl` for deployments.

    **Example `/etc/rancher/k3s/config.yaml` for a server node:**
    ```yaml
    # /etc/rancher/k3s/config.yaml
    server: true
    disable:
      - traefik
      - servicelb
      - local-storage
      - metrics-server
    # Example for external database
    # datastore-endpoint: "mysql://k3s:password@tcp(db-server:3306)/kube?parseTime=true"
    ```

### d. CNI Choice

K3s defaults to Flannel (with VXLAN), which is performant for many use cases.
*   **Alternative CNIs (Calico, Cilium):** If you require advanced network policies, superior performance in high-throughput scenarios, or specific networking features, consider replacing Flannel. These can offer better raw throughput or lower latency but add complexity.
    *   If installing K3s, you'd typically skip Flannel installation (`--flannel-backend=none`) then install your chosen CNI.
    *   Ensure your chosen CNI is optimized with the correct kernel modules and sysctls.

## 3. General Server Best Practices

### a. Fast Storage

*   **SSD/NVMe:** Absolutely crucial for K3s performance, especially for the K3s data directory (`$K3S_DATA_DIR`, default: `/var/lib/rancher/k3s`), `/var/lib/containerd`, and the operating system itself. Pod startup times, image pulls, and database operations are heavily I/O bound.
*   **RAID:** If using multiple drives, consider RAID1 or RAID10 for redundancy and increased I/O performance.

### b. Adequate RAM and CPU

*   **RAM:** K3s servers (especially with embedded SQLite) require more RAM. Worker nodes also need ample RAM for their pods. Err on the side of more RAM.
*   **CPU:** Ensure sufficient CPU cores for K3s components, containers, and your workloads.

### c. Network Configuration

*   **Gigabit Ethernet (at least):** 10Gbps or faster is ideal for larger clusters or high-bandwidth applications.
*   **MTU:** Ensure consistent MTU settings across all nodes and your network infrastructure. K3s default CNI (Flannel VXLAN) might use a smaller MTU (e.g., 1450) due to encapsulation overhead. Misconfigured MTU can lead to packet fragmentation and performance issues.
*   **Jumbo Frames:** If your network supports it and all components are configured for it, jumbo frames (e.g., 9000 bytes MTU) can reduce overhead and improve throughput, but requires careful and consistent configuration.

### d. Monitoring

*   **Prometheus/Grafana:** Essential for monitoring resource usage (CPU, RAM, disk I/O, network) of your nodes and K3s components. This helps identify and diagnose bottlenecks.
*   **Kube-state-metrics:** Provides metrics about Kubernetes objects.
*   **Node Exporter:** Provides system-level metrics.
*   **cAdvisor (usually bundled with container runtimes):** Provides container-level metrics.

### e. Logging

*   **Centralized Logging (ELK Stack, Loki, etc.):** Stream logs from K3s components and pods to a central logging system for easier debugging, troubleshooting, and performance analysis.

## 4. Post-Optimization Verification

1.  **Reboot:** After making changes to kernel parameters or `limits.conf`, a full system reboot is often the safest way to ensure all changes are fully applied.
2.  **Verify sysctl settings:** `sudo sysctl -a | grep -i <parameter_name>` (e.g., `sudo sysctl -a | grep -i fs.file-max`)
3.  **Verify ulimits:** Check `ulimit -n` and `ulimit -u` in a new shell. For specific running processes, inspect `/proc/<pid>/limits`.
4.  **Monitor Performance:** Use tools like `htop`, `iostat`, `netstat`, `dstat`, and your installed monitoring stack (Prometheus/Grafana) to observe the impact of your changes. Look for reduced CPU usage, lower I/O wait, improved network throughput, and stable memory usage.
5.  **Test Workloads:** Deploy your actual applications and perform load testing to ensure the optimizations yield the desired performance benefits under realistic conditions.

By diligently following these steps, you can establish a robust and highly performant Debian environment for your K3s cluster. Always test changes in a staging or development environment before applying them to production systems.
```