operator/docs/prepare-server.md at abdd4b81c4515b3bd8c429bf300be28728873447

homelab/operator

Fork 0

mirror of https://github.com/morten-olsen/homelab-operator.git synced 2026-02-08 01:36:28 +01:00

Files

Morten Olsen 9fe279b1b5 docs

2025-09-05 08:56:04 +02:00

12 KiB

Raw Blame History

Here's the guide formatted as a README.md file, ready for a GitHub repository or local documentation.

# Optimizing Debian for K3s

This guide outlines steps to optimize a Debian server for running K3s (Lightweight Kubernetes). Optimization involves a combination of general Linux best practices, K3s-specific recommendations, and considerations for your specific workload.

## Table of Contents

- [1. Debian Base System Optimization](#1-debian-base-system-optimization)
  - [a. Kernel Parameters (sysctl.conf)](#a-kernel-parameters-sysctlconf)
  - [b. User Limits (ulimit)](#b-user-limits-ulimit)
  - [c. Disable Unnecessary Services](#c-disable-unnecessary-services)
  - [d. Update System](#d-update-system)
  - [e. Swap Configuration](#e-swap-configuration)
- [2. K3s Specific Optimizations](#2-k3s-specific-optimizations)
  - [a. Choose a Performant Storage Backend](#a-choose-a-performant-storage-backend)
  - [b. Containerd Tuning](#b-containerd-tuning)
  - [c. K3s Server and Agent Configuration](#c-k3s-server-and-agent-configuration)
  - [d. CNI Choice](#d-cni-choice)
- [3. General Server Best Practices](#3-general-server-best-practices)
  - [a. Fast Storage](#a-fast-storage)
  - [b. Adequate RAM and CPU](#b-adequate-ram-and-cpu)
  - [c. Network Configuration](#c-network-configuration)
  - [d. Monitoring](#d-monitoring)
  - [e. Logging](#e-logging)
- [4. Post-Optimization Verification](#4-post-optimization-verification)

---

## 1. Debian Base System Optimization

These steps are generally beneficial for any server, but particularly important for containerized environments like K3s.

### a. Kernel Parameters (sysctl.conf)

Edit `/etc/sysctl.conf` and apply changes with `sudo sysctl -p`.

```ini
# Increase maximum open files (for container processes, K3s components)
fs.inotify.max_user_watches = 524288 # For fs-based operations within containers
fs.inotify.max_user_instances = 8192 # For fs-based operations within containers
fs.file-max = 2097152 # Increase overall system file handle limit

# Increase limits for network connections
net.core.somaxconn = 65535 # Max backlog of pending connections
net.ipv4.tcp_tw_reuse = 1 # Allow reuse of TIME_WAIT sockets (caution: can sometimes mask issues)
net.ipv4.tcp_fin_timeout = 30 # Reduce TIME_WAIT duration
net.ipv4.tcp_max_syn_backlog = 65535 # Max number of remembered connection requests
net.ipv4.tcp_keepalive_time=600 # Shorter keepalive interval
net.ipv4.tcp_keepalive_intvl=60 # Keepalive interval
net.ipv4.tcp_keepalive_probes=3 # Keepalive probes

# Increase memory limits for network buffers (especially if high network traffic)
net.core.rmem_max = 26214400
net.core.wmem_max = 26214400
net.core.rmem_default = 26214400
net.core.wmem_default = 26214400

# Other useful parameters
vm.max_map_count = 262144 # Essential for Elasticsearch, MongoDB, etc.
vm.dirty_ratio = 5 # Reduce dirty page percentage for better write performance
vm.dirty_background_ratio = 10 # Reduce dirty page percentage for better write performance
kernel.pid_max = 4194304 # Increase max PIDs

Explanation:

fs.file-max: K3s and its deployed containers can open a large number of files. Increasing this prevents "Too many open files" errors.
net.*: These parameters help in handling a high number of concurrent network connections crucial for a Kubernetes cluster.
vm.max_map_count: Required by some applications that run on Kubernetes (e.g., Elasticsearch).

b. User Limits (ulimit)

Edit /etc/security/limits.conf (or create a file like /etc/security/limits.d/k3s.conf) for all users, or specifically for the user K3s runs as (often root by default or a dedicated k3s user).

# For all users (or a specific k3s user if you configure it)
*    soft nofile 65536
*    hard nofile 131072
*    soft nproc  65536
*    hard nproc  131072

Note: A reboot or logging out/in is often required for these changes to take effect for user sessions. Services typically pick up new limits upon restart.

Explanation:

nofile (number of open files): Sets the per-user/per-process limit. K3s and pods need a high limit.
nproc (number of processes): Each container consumes processes. A high limit prevents hitting a ceiling.

c. Disable Unnecessary Services

Reducing background services frees up CPU, RAM, and I/O.

sudo systemctl disable --now apache2 # Example, replace with actual unused services
sudo systemctl disable --now nginx    # Example
sudo systemctl disable --now cups     # If not using printing
sudo systemctl disable --now modemmanager # If not using a modem
sudo systemctl disable --now bluetooth # If no bluetooth devices
# Review active services using:
# systemctl list-unit-files --type=service --state=enabled

d. Update System

Keep your system packages up-to-date for security and performance bug fixes.

sudo apt update
sudo apt upgrade -y
sudo apt dist-upgrade -y # For major version changes (if applicable)
sudo apt autoremove -y
sudo reboot # After significant kernel or base system updates

e. Swap Configuration

It is generally recommended to disable swap on K3s nodes, especially worker nodes. Swapping can severely degrade performance in containerized environments due to unpredictable latency.

If you absolutely must have swap (e.g., very low memory server, not recommended for production):

Reduce swappiness: sudo sysctl vm.swappiness=10 (or even 1). Add vm.swappiness = 10 to /etc/sysctl.conf.
Preferably, disable swap entirely if you have sufficient RAM:
```
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
```
WARNING: Only disable swap if your system has sufficient RAM to handle its workload without it. If nodes run out of memory without swap, processes will be OOM-killed, leading to instability.

2. K3s Specific Optimizations

a. Choose a Performant Storage Backend

The choice of K3s's data store significantly impacts performance and availability.

SQLite (Default): Good for single-node setups or small, non-critical clusters. Performance can degrade with many changes or large clusters.
External Database (MariaDB/MySQL, PostgreSQL):
- Recommended for Production: Offers high availability and better performance than embedded SQLite for multi-node K3s server configurations.
- Placement: Place the external database on a separate server or on a dedicated, fast storage volume.
External etcd: Offers the best performance and scalability, but is more complex to manage and requires its own dedicated etcd cluster.

b. Containerd Tuning

K3s uses containerd as its container runtime.

Fast Storage for Containerd: Ensure the directories where containerd stores its data are on fast storage (NVMe SSDs are ideal).
- /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs (K3s specific)
- (/var/lib/containerd if using a standalone containerd setup) This is critical for image pulls, container startup, and overlayfs performance.

c. K3s Server and Agent Configuration

Configure K3s using a configuration file (e.g., /etc/rancher/k3s/config.yaml) or command-line flags.

Disable Unused Components: Reduce resource consumption by disabling features you don't need.
- --disable traefik: If using Nginx Ingress Controller or another ingress.
- --disable servicelb: If using a cloud provider Load Balancer, MetalLB, or another solution.
- --disable local-storage: If using cloud provider storage, NFS, or another remote storage solution.
- --disable metrics-server: If using a different metrics solution or don't need it.
- --disable helm-controller: If exclusively using kubectl for deployments.
Example /etc/rancher/k3s/config.yaml for a server node:
```
# /etc/rancher/k3s/config.yaml
server: true
disable:
  - traefik
  - servicelb
  - local-storage
  - metrics-server
# Example for external database
# datastore-endpoint: "mysql://k3s:password@tcp(db-server:3306)/kube?parseTime=true"
```

d. CNI Choice

K3s defaults to Flannel (with VXLAN), which is performant for many use cases.

Alternative CNIs (Calico, Cilium): If you require advanced network policies, superior performance in high-throughput scenarios, or specific networking features, consider replacing Flannel. These can offer better raw throughput or lower latency but add complexity.
- If installing K3s, you'd typically skip Flannel installation (--flannel-backend=none) then install your chosen CNI.
- Ensure your chosen CNI is optimized with the correct kernel modules and sysctls.

3. General Server Best Practices

a. Fast Storage

SSD/NVMe: Absolutely crucial for K3s performance, especially for the K3s data directory ($K3S_DATA_DIR, default: /var/lib/rancher/k3s), /var/lib/containerd, and the operating system itself. Pod startup times, image pulls, and database operations are heavily I/O bound.
RAID: If using multiple drives, consider RAID1 or RAID10 for redundancy and increased I/O performance.

b. Adequate RAM and CPU

RAM: K3s servers (especially with embedded SQLite) require more RAM. Worker nodes also need ample RAM for their pods. Err on the side of more RAM.
CPU: Ensure sufficient CPU cores for K3s components, containers, and your workloads.

c. Network Configuration

Gigabit Ethernet (at least): 10Gbps or faster is ideal for larger clusters or high-bandwidth applications.
MTU: Ensure consistent MTU settings across all nodes and your network infrastructure. K3s default CNI (Flannel VXLAN) might use a smaller MTU (e.g., 1450) due to encapsulation overhead. Misconfigured MTU can lead to packet fragmentation and performance issues.
Jumbo Frames: If your network supports it and all components are configured for it, jumbo frames (e.g., 9000 bytes MTU) can reduce overhead and improve throughput, but requires careful and consistent configuration.

d. Monitoring

Prometheus/Grafana: Essential for monitoring resource usage (CPU, RAM, disk I/O, network) of your nodes and K3s components. This helps identify and diagnose bottlenecks.
Kube-state-metrics: Provides metrics about Kubernetes objects.
Node Exporter: Provides system-level metrics.
cAdvisor (usually bundled with container runtimes): Provides container-level metrics.

e. Logging

Centralized Logging (ELK Stack, Loki, etc.): Stream logs from K3s components and pods to a central logging system for easier debugging, troubleshooting, and performance analysis.

4. Post-Optimization Verification

Reboot: After making changes to kernel parameters or limits.conf, a full system reboot is often the safest way to ensure all changes are fully applied.
Verify sysctl settings: sudo sysctl -a | grep -i <parameter_name> (e.g., sudo sysctl -a | grep -i fs.file-max)
Verify ulimits: Check ulimit -n and ulimit -u in a new shell. For specific running processes, inspect /proc/<pid>/limits.
Monitor Performance: Use tools like htop, iostat, netstat, dstat, and your installed monitoring stack (Prometheus/Grafana) to observe the impact of your changes. Look for reduced CPU usage, lower I/O wait, improved network throughput, and stable memory usage.
Test Workloads: Deploy your actual applications and perform load testing to ensure the optimizations yield the desired performance benefits under realistic conditions.

By diligently following these steps, you can establish a robust and highly performant Debian environment for your K3s cluster. Always test changes in a staging or development environment before applying them to production systems.

12 KiB Raw Blame History