Chapter 1: The Local Engineering Environment
Welcome to the first chapter of Kubernetes: Zero to Hero in Production. By the end of this chapter, you will have a deep understanding of container runtimes, Linux kernel primitives, and multiple local Kubernetes cluster architectures that simulate production conditions — all running on your workstation.
1.1 Container Runtimes: containerd Architecture
Before we touch a single kubectl command, we must understand what runs our containers. Since Kubernetes 1.24 removed dockershim, containerd has become the default container runtime in most production clusters.
1.1.1 containerd Component Architecture
containerd is a graduated CNCF project that manages the complete container lifecycle. It exposes a gRPC API that implements the Kubernetes Container Runtime Interface (CRI).
                                  +--------------------+
                                  |      kubelet       |
                                  |   (on each node)   |
                                  +---------+----------+
                                            |
                                   CRI gRPC |
                                            v
+----------------------+      +----------------------------+
|   ctr CLI (debug)    |      |         containerd         |
+----------------------+      |  +----------------------+  |
                              |  |   gRPC API Server    |  |
+----------------------+      |  +----------------------+  |
|    nerdctl (user)    |------|  |   CRI Plugin (cri)   |  |
+----------------------+      |  +----------------------+  |
                              |  |    Content Store     |  |
+----------------------+      |  |    (Blob storage)    |  |
|    crictl (debug)    |------|  +----------------------+  |
+----------------------+      |  |  Metadata DB (bolt)  |  |
                              |  +----------------------+  |
                              |  |    Image Service     |  |
                              |  +----------------------+  |
                              |  |     Snapshotter      |  |
                              |  |     (overlayfs)      |  |
                              |  +----------------------+  |
                              |  |    Shim (per pod)    |  |
                              +--------------|-------------+
                                             |
                                        runc |
                                             v
                              +----------------------------+
                              |            runc            |
                              |  +----------------------+  |
                              |  |      cgroups v2      |  |
                              |  |      namespaces      |  |
                              |  |  rootfs (overlayfs)  |  |
                              |  +----------------------+  |
                              +----------------------------+
Key internal subsystems:
| Component | Responsibility | Storage Backend |
|---|---|---|
| gRPC API Server | Accepts CRI and distribution API calls over Unix socket | In-memory |
| CRI Plugin (cri) | Translates Kubernetes CRI calls to containerd operations | BoltDB metadata |
| Content Store | Stores raw blob content (layer compressed tarballs) | Filesystem (/var/lib/containerd/io.containerd.content.v1.content/) |
| Metadata DB | Tracks images, containers, snapshots, and namespaces | BoltDB (/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db) |
| Image Service | Pulls, unpacks, and manages images via distribution spec | Content Store + Snapshotters |
| Snapshotter | Manages filesystem snapshots (default: overlayfs) | Filesystem (/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/) |
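| Shim (containerd-shim-runc-v2) | Daemon that parents the container processes, keeps their stdio streams open, and reports exit codes after runc exits; under CRI, one shim typically serves all containers of a pod | N/A |
SRE Warning: The shim process is what lets pods survive a containerd restart. With the v2 shim, the containers of a pod normally share one shim process; if that shim dies, its containers die with it. Monitor the shim process count as a health signal: pgrep -c -f containerd-shim (a plain ps aux | grep containerd-shim | wc -l also counts the grep process itself).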
1.1.2 CRI Flow: From kubelet to Container
When the Kubernetes scheduler assigns a pod to a node, the following flow occurs:
- kubelet calls RunPodSandbox() via CRI gRPC on the containerd socket (/run/containerd/containerd.sock or /var/run/containerd/containerd.sock).
- containerd's CRI plugin creates a pod sandbox (an infra container running the pause image), which holds the pod's network namespace.
- For each container in the pod spec, kubelet checks the image through the CRI image service and calls PullImage() if needed; the CRI plugin resolves the image reference, checks the content store, and pulls any missing layers.
- kubelet calls CreateContainer() and then StartContainer() for each container.
- The snapshotter creates an overlayfs mount from the image layers plus a writable container layer.
- containerd launches containerd-shim-runc-v2, which invokes runc to create the container.
- runc uses cgroups v2 to set resource constraints and namespaces for isolation.
- The shim keeps the container's stdio streams connected and reports exit codes.
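The flow above can be replayed by hand with crictl, which speaks the same CRI API as the kubelet. The following is a minimal sketch, not the kubelet's actual code path: the function names, the demo-sandbox and demo names, and the busybox:1.36 image are illustrative assumptions, and run_cri_flow needs root on a node with a running containerd.

```shell
#!/usr/bin/env bash
# Sketch: replay the kubelet -> CRI call sequence manually with crictl.
# All names below (demo-sandbox, demo, busybox:1.36) are illustrative.

# Write a minimal pod sandbox config for "crictl runp".
write_sandbox_config() {
  cat > "$1" <<'EOF'
{
  "metadata": {"name": "demo-sandbox", "namespace": "default", "attempt": 1, "uid": "demo-sandbox-uid"},
  "log_directory": "/tmp",
  "linux": {}
}
EOF
}

# Write a minimal container config for "crictl create".
write_container_config() {
  cat > "$1" <<'EOF'
{
  "metadata": {"name": "demo"},
  "image": {"image": "busybox:1.36"},
  "command": ["sleep", "3600"],
  "log_path": "demo.0.log",
  "linux": {}
}
EOF
}

# Run the CRI steps in order. Requires root and a running containerd.
run_cri_flow() {
  local dir pod_id ctr_id
  dir=$(mktemp -d)
  write_sandbox_config "$dir/pod.json"
  write_container_config "$dir/container.json"
  pod_id=$(sudo crictl runp "$dir/pod.json")                  # RunPodSandbox
  sudo crictl pull busybox:1.36                               # PullImage
  ctr_id=$(sudo crictl create "$pod_id" "$dir/container.json" "$dir/pod.json")  # CreateContainer
  sudo crictl start "$ctr_id"                                 # StartContainer
  echo "sandbox=$pod_id container=$ctr_id"
}
```

Calling run_cri_flow on a test node and then running sudo crictl pods and sudo crictl ps should show the sandbox and the container. Clean up with crictl stopp and crictl rmp on the sandbox ID.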
1.1.3 Complete containerd Configuration (Production Template)
Below is a production-grade config.toml for containerd. This file lives at /etc/containerd/config.toml on Linux nodes.
# /etc/containerd/config.toml
# Production-grade containerd configuration tuned for Kubernetes
version = 2
# Root directory for containerd persistent data
root = "/var/lib/containerd"
# State directory for containerd transient data
state = "/run/containerd"
# OOM score adjustment for the containerd daemon.
# This is a top-level key, so it must appear before the first [table] header.
oom_score = -999
# Unix socket path for CRI gRPC communication
grpc.address = "/run/containerd/containerd.sock"
grpc.uid = 0
grpc.gid = 0
grpc.max_recv_message_size = 16777216
grpc.max_send_message_size = 16777216
# Debug and metrics
debug.level = "info"
metrics.address = "127.0.0.1:1338"
metrics.grpc_histogram = false
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
# Sandbox image: keep this pinned to a specific version
sandbox_image = "registry.k8s.io/pause:3.9"
# Maximum size of a single log line in bytes (longer lines are split);
# log file rotation itself is handled by the kubelet
max_container_log_line_size = 16384
# Enable SELinux if your nodes use it
enable_selinux = false
# CDI (Container Device Interface) for GPU devices
enable_cdi = true
cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
# Default runtime and snapshotter
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"
snapshotter = "overlayfs"
discard_unpacked_layers = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
runtime_engine = ""
runtime_root = ""
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
# CRITICAL: Must match the kubelet cgroup driver. Use true with the systemd
# driver, which is the setup used on cgroups v2 hosts throughout this book.
SystemdCgroup = true
BinaryName = "runc"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
# GPU-accelerated runtime for AI/ML workloads (Chapter 9)
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
SystemdCgroup = true
BinaryName = "/usr/bin/nvidia-container-runtime"
# Registry mirror configuration
[plugins."io.containerd.grpc.v1.cri".registry]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = [
"https://registry-1.docker.io",
"https://mirror.gcr.io",
]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."ghcr.io"]
endpoint = ["https://ghcr.io"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."k8s.gcr.io"]
endpoint = ["https://registry.k8s.io"]
# Registry authentication (use Kubernetes image pull secrets instead)
[plugins."io.containerd.grpc.v1.cri".registry.configs]
# TLS key pair for the CRI streaming server (empty values keep the default)
[plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
tls_cert_file = ""
tls_key_file = ""
SRE Warning: SystemdCgroup = true is required whenever the kubelet uses the systemd cgroup driver, which is the expected setup on modern systemd-based distributions running cgroups v2. Leaving it at false on such a host can cause pod creation failures with failed to write "max" to "pids.max" errors. Verify your cgroup mode with stat -fc %T /sys/fs/cgroup/; cgroup2fs means v2.
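The warning above can be turned into a quick node check. This is a minimal sketch under stated assumptions: the function names are hypothetical helpers, and the check only looks for an uncommented SystemdCgroup = true line rather than parsing TOML.

```shell
#!/usr/bin/env bash
# Sketch: verify the cgroup mode and the SystemdCgroup setting on a node.
# Function names are illustrative helpers, not part of containerd.

# Print the filesystem type of the cgroup mount ("cgroup2fs" means cgroups v2).
cgroup_fs_type() {
  stat -fc %T "${1:-/sys/fs/cgroup/}"
}

# Succeed if the config file contains an uncommented "SystemdCgroup = true".
has_systemd_cgroup() {
  grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$1"
}

# Report whether the containerd config matches a systemd cgroup driver setup.
check_containerd_cgroup_driver() {
  local config="${1:-/etc/containerd/config.toml}"
  if has_systemd_cgroup "$config"; then
    echo "OK: SystemdCgroup = true found in $config"
  else
    echo "WARN: SystemdCgroup = true not found in $config"
    return 1
  fi
}
```

On a node, run cgroup_fs_type first and then check_containerd_cgroup_driver. A cgroup2fs result combined with a WARN is the mismatch the warning describes.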
1.1.4 Verification Commands
# Check containerd status
sudo systemctl status containerd --no-pager
# Verify containerd is listening on its socket
sudo ctr version
sudo ctr plugins list | grep -E "cri|runc"
# Check the CRI runtime status and effective configuration
sudo crictl info | jq .
# List all pods tracked by containerd
sudo crictl pods
# List all images cached by containerd
sudo crictl images
# Export containerd metrics
curl -s http://127.0.0.1:1338/metrics | head -50
# Check cgroup mode
stat -fc %T /sys/fs/cgroup/
# Verify containerd config is valid
sudo containerd config dump
1.2 Linux cgroups v2 Deep Dive
Control Groups (cgroups) are the Linux kernel feature that limits, accounts for, and isolates resource usage for process hierarchies. Kubernetes relies on cgroups for CPU, memory, and PID enforcement.
1.2.1 cgroups v2 Unified Hierarchy
cgroups v2 (introduced in Linux 4.5, production-ready since 5.x) uses a single unified hierarchy instead of the multiple hierarchies in v1. All resource controllers are mounted at /sys/fs/cgroup/.
/sys/fs/cgroup/
├── cgroup.controllers              # Available controllers
├── cgroup.subtree_control          # Controllers active for children
├── cpu.stat                        # CPU usage statistics
├── memory.current                  # Current memory usage in bytes
├── memory.max                      # Memory hard limit ("max" if unlimited)
├── memory.min                      # Memory protection floor
├── memory.low                      # Memory best-effort floor
├── io.stat                         # I/O statistics
├── io.max                          # I/O limits
├── pids.current                    # Current number of PIDs/Tasks
├── pids.max                        # Maximum number of PIDs
├── kubepods/                       # Kubernetes pod cgroups (kubepods.slice/ with the systemd driver)
│   ├── burstable/                  # Burstable QoS class pods
│   │   └── pod<UID>/               # Per-pod cgroup
│   │       └── <containerID>/
│   │           ├── cpu.max
│   │           ├── memory.current
│   │           ├── memory.max
│   │           ├── memory.min
│   │           ├── memory.high
│   │           ├── pids.max
│   │           └── io.max
│   ├── pod<UID>/                   # Guaranteed QoS class pods (directly under kubepods/)
│   └── besteffort/                 # BestEffort QoS class pods
└── system.slice/                   # System services
1.2.2 Resource Controllers
| Controller | Interface File | Purpose | Kubernetes Mapping |
|---|---|---|---|
| CPU | cpu.max | Hard limit on CPU time (quota and period, in microseconds) | resources.limits.cpu |
| CPU | cpu.weight | Relative CPU weight (1-10000) | resources.requests.cpu |
| Memory | memory.max | Hard memory limit in bytes | resources.limits.memory |
| Memory | memory.high | Memory throttling threshold | Throttling before OOM (set only with the MemoryQoS feature gate) |
| Memory | memory.min | Memory protection floor | resources.requests.memory (MemoryQoS feature gate) |
| I/O | io.max | I/O bandwidth limits (rbps/wbps/riops/wiops) | Not native in K8s |
| PID | pids.max | Maximum number of processes/threads | Kubelet --pod-max-pids (podPidsLimit) |
| cpuset | cpuset.cpus | CPU affinity mask | cpuManagerPolicy=static |
| hugetlb | hugetlb.1GB.max | HugeTLB usage limit | resources.limits.hugepages-1Gi |
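To make the CPU rows concrete, the sketch below converts Kubernetes CPU values in millicores into the strings written to cpu.max and cpu.weight. It assumes the default 100000 microsecond period and the linear shares-to-weight mapping long used by runc and the kubelet; newer releases may use a different curve, so treat the weight values as illustrative. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: map Kubernetes CPU values (millicores) to cgroups v2 interface values.

# limits.cpu -> cpu.max, formatted as "<quota> <period>" in microseconds.
cpu_max_for_millicores() {
  local millicores="$1" period="${2:-100000}"
  if [ "$millicores" -le 0 ]; then
    echo "max $period"                      # no CPU limit set
  else
    echo "$(( millicores * period / 1000 )) $period"
  fi
}

# requests.cpu -> cgroup v1 shares -> cpu.weight (assumed linear mapping).
cpu_weight_for_millicores() {
  local shares=$(( $1 * 1024 / 1000 ))
  if [ "$shares" -lt 2 ]; then shares=2; fi
  if [ "$shares" -gt 262144 ]; then shares=262144; fi
  echo $(( 1 + (shares - 2) * 9999 / 262142 ))
}

cpu_max_for_millicores 500      # prints: 50000 100000
cpu_weight_for_millicores 500   # prints: 20
```

A container with limits.cpu: 500m therefore gets 50 ms of CPU time per 100 ms period, while requests.cpu only influences its relative share under contention.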
1.2.3 Runtime Inspection Commands
# Show available controllers on your system
cat /sys/fs/cgroup/cgroup.controllers
# Show which controllers are active for new children in the root cgroup
cat /sys/fs/cgroup/cgroup.subtree_control
# Find a specific Kubernetes pod cgroup (works with both the systemd and cgroupfs drivers)
POD_UID=$(kubectl get pod my-app -n default -o jsonpath='{.metadata.uid}')
POD_CGROUP=$(sudo find /sys/fs/cgroup -type d -name "*${POD_UID:0:8}*" 2>/dev/null | head -n1)
echo "Pod cgroup: $POD_CGROUP"
# Resolve the container's cgroup directory from its PID.
# With the systemd driver, cgroupsPath in crictl inspect is a slice:prefix:name triple, not a filesystem path.
CONTAINER_ID=$(sudo crictl ps --name my-app -q | head -n1)
PID=$(sudo crictl inspect "$CONTAINER_ID" | jq -r '.info.pid')
CGROUP_DIR="/sys/fs/cgroup$(cut -d: -f3- "/proc/$PID/cgroup")"
echo "Container cgroup: $CGROUP_DIR"
# Inspect CPU limits for a running container
cat "$CGROUP_DIR/cpu.max"
# Watch memory pressure in real-time
watch -n 2 "cat $CGROUP_DIR/memory.current $CGROUP_DIR/memory.max"
# Check for OOM kills in a cgroup
grep oom "$CGROUP_DIR/memory.events"
SRE Warning: When a container hits memory.max and is OOM-killed, the kernel increments oom_kill in memory.events. This is your first signal. Always alert on any increase of oom_kill; container_memory_working_set_bytes approaching the limit is the leading indicator.
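The oom_kill counter is easy to check from a script. The following sketch reads it from a memory.events file; the function names are hypothetical, and the path you pass in is whatever cgroup directory you resolved for the container.

```shell
#!/usr/bin/env bash
# Sketch: read the oom_kill counter from a cgroups v2 memory.events file.

# Print the value of the "oom_kill" key (exact key match, so "oom" and
# "oom_group_kill" are ignored).
oom_kill_count() {
  awk '$1 == "oom_kill" { print $2 }' "$1"
}

# Report and return non-zero when at least one OOM kill was recorded.
check_oom() {
  local events_file="$1" count
  count=$(oom_kill_count "$events_file")
  if [ "${count:-0}" -gt 0 ]; then
    echo "ALERT: ${count} OOM kill(s) recorded in ${events_file}"
    return 1
  fi
  echo "OK: no OOM kills recorded in ${events_file}"
}
```

For example, check_oom /sys/fs/cgroup/<container cgroup path>/memory.events can run from a cron job or a node-level health check, with the return code driving the alert.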
1.3 Linux Namespaces
Namespaces provide the isolation pillar of containerization. Each namespace wraps a global system resource so that processes within the namespace see an isolated instance of that resource.
1.3.1 Namespace Types
| Namespace | Flag | Isolates | Kernel Version |
|---|---|---|---|
| PID | CLONE_NEWPID | Process ID number space | 2.6.24 |
| Network | CLONE_NEWNET | Network devices, stacks, ports, routing tables | 2.6.29 |
| Mount | CLONE_NEWNS | Mount points, filesystem hierarchy | 2.4.19 |
| UTS | CLONE_NEWUTS | Hostname and NIS domain name | 2.6.19 |
| IPC | CLONE_NEWIPC | System V IPC, POSIX message queues | 2.6.19 |
| User | CLONE_NEWUSER | User and group ID space | 3.8 |
| Cgroup | CLONE_NEWCGROUP | Cgroup root directory | 4.6 |
| Time | CLONE_NEWTIME | Boot and monotonic clocks | 5.6 |
1.3.2 Verifying Namespace Isolation
# Find the PID of a running container's init process
CONTAINER_ID=$(sudo crictl ps --name my-app -q)
PID=$(sudo crictl inspect $CONTAINER_ID | jq -r '.info.pid')
echo "Container PID: $PID"
# Inspect namespaces for the container
sudo lsns -p $PID
# Output example:
# NS TYPE NPROCS PID USER COMMAND
# 4026531835 cgroup 1 12345 /app
# 4026531837 pid 1 12345 /app
# 4026531838 net 1 12345 /app
# 4026531840 mnt 1 12345 /app
# Run a command inside the container's network namespace
sudo nsenter -t $PID -n ip addr
# List the container's namespace handles; compare them with /proc/1/ns/ to see which are shared with the host
sudo ls -la /proc/$PID/ns/
# Output example:
# lrwxrwxrwx 1 root root 0 ... cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 root root 0 ... ipc -> 'ipc:[4026531839]'
# lrwxrwxrwx 1 root root 0 ... mnt -> 'mnt:[4026532160]'
# lrwxrwxrwx 1 root root 0 ... net -> 'net:[4026531993]'
# lrwxrwxrwx 1 root root 0 ... pid -> 'pid:[4026531836]'
# lrwxrwxrwx 1 root root 0 ... user -> 'user:[4026531837]'
# lrwxrwxrwx 1 root root 0 ... uts -> 'uts:[4026531838]'
SRE Warning: If two containers share the same network namespace (as in a pod's sidecar pattern), they share the same loopback interface. Port conflicts are possible. This is how pods achieve localhost communication between containers: they share the pod's network namespace via the infra/pause container.
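Whether two processes share a namespace comes down to comparing the identifiers behind /proc/<pid>/ns/*. The sketch below does that comparison for two PIDs; ns_compare is a hypothetical helper name, and reading another process's namespace links generally requires root.

```shell
#!/usr/bin/env bash
# Sketch: report which namespaces two processes share.

ns_compare() {             # usage: ns_compare <pid_a> <pid_b>
  local pid_a="$1" pid_b="$2" ns a b
  for ns in cgroup ipc mnt net pid user uts; do
    # Each link resolves to "<type>:[<inode>]"; equal targets mean a shared namespace.
    a=$(readlink "/proc/${pid_a}/ns/${ns}") || continue
    b=$(readlink "/proc/${pid_b}/ns/${ns}") || continue
    if [ "$a" = "$b" ]; then
      echo "${ns}: shared"
    else
      echo "${ns}: isolated"
    fi
  done
}

# A process compared with itself shares every namespace.
ns_compare "$$" "$$"
```

Run as root, ns_compare 1 "$PID" compares a container with the host's init process, and passing the PIDs of two containers from the same pod should report net as shared.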
1.4 Minikube in Production-Simulation
Minikube is the most accessible local Kubernetes environment. With the right configuration, it simulates a production-grade multi-node cluster on a single machine.
1.4.1 Driver Selection Matrix
| Driver | Platform | Performance | Nested Virt | Best For |
|---|---|---|---|---|
| Docker | Linux, macOS, Windows | Medium | No | Quick start, resource-constrained laptops |
| KVM2 | Linux | High | Yes | Heavy workloads, multi-node simulation, storage testing |
| Hyperkit | macOS | High | Yes | Multi-node on macOS |
| VirtualBox | All | Low | Yes | Cross-platform consistency |
| None (bare metal) | Linux | Highest | No | CI/CD runners, advanced users with existing Docker |
1.4.2 Multi-Node Minikube Cluster Setup
This is our reference configuration for the book — a 3-node cluster using KVM2 on Linux:
#!/usr/bin/env bash
# =============================================================================
# minikube-multi-node-setup.sh
# Creates a 3-node Minikube cluster with production-simulated configuration
# =============================================================================
set -euo pipefail
# --- Driver detection ---
# Auto-detect the best driver for the platform
case "$(uname -s)" in
Linux)
if command -v virsh &>/dev/null && virsh list --name &>/dev/null; then
DRIVER="kvm2"
else
DRIVER="docker"
echo "[WARN] KVM2 not detected, falling back to docker driver"
echo "[WARN] Install libvirt: sudo apt install libvirt-daemon-system libvirt-clients qemu-kvm"
fi
;;
Darwin)
if command -v hyperkit &>/dev/null; then
DRIVER="hyperkit"
else
DRIVER="docker"
echo "[WARN] hyperkit not detected, falling back to docker driver"
echo "[WARN] Install hyperkit: brew install hyperkit"
fi
;;
*)
DRIVER="docker"
;;
esac
# --- Start the multi-node cluster ---
minikube start \
--driver="${DRIVER}" \
--nodes=3 \
--cpus=4 \
--memory=8192 \
--disk-size=40g \
--kubernetes-version=v1.30.0 \
--cni=cilium \
--container-runtime=containerd \
--service-cluster-ip-range="10.96.0.0/16" \
--extra-config=kubelet.cgroup-driver=systemd \
--extra-config=kubelet.cgroup-root=/ \
--extra-config=kubelet.pod-max-pids=4096 \
--extra-config=apiserver.service-node-port-range=30000-32767 \
--ports=127.0.0.1:8443:8443 \
--ports=127.0.0.1:10080:80 \
--ports=127.0.0.1:10443:443
# --- Verify cluster health ---
echo ""
echo "[INFO] Verifying cluster health..."
kubectl cluster-info
kubectl get nodes -o wide
echo ""
echo "[INFO] Cluster nodes:"
kubectl get nodes --show-labels | grep -E "node-role|topology"
echo ""
echo "[INFO] Node resource capacity:"
kubectl describe node minikube | grep -A5 "Capacity"
SRE Warning: The default Minikube CPU/memory allocation (2 CPUs, 4 GB RAM) is insufficient for multi-node clusters. The --cpus and --memory flags apply per node, so always allocate at least 4 CPUs and 8 GB RAM per node for realistic workload testing.
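Because the allocation is per node, the host needs enough capacity for every node combined. The following pre-flight sketch does that arithmetic before you start the cluster. It assumes a Linux host (nproc and /proc/meminfo), and the function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: pre-flight sizing check for a multi-node Minikube cluster (Linux host).

# Total of a per-node resource across all nodes.
cluster_total() {          # usage: cluster_total <nodes> <per_node>
  echo $(( $1 * $2 ))
}

# Succeed when the requested amount fits into what the host provides.
fits() {                   # usage: fits <needed> <available>
  [ "$1" -le "$2" ]
}

# Compare the cluster's total request with the host's CPUs and memory.
preflight() {              # usage: preflight <nodes> <cpus_per_node> <mem_mb_per_node>
  local need_cpus need_mem host_cpus host_mem
  need_cpus=$(cluster_total "$1" "$2")
  need_mem=$(cluster_total "$1" "$3")
  host_cpus=$(nproc)
  host_mem=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
  echo "need: ${need_cpus} CPUs, ${need_mem} MB; host: ${host_cpus} CPUs, ${host_mem} MB"
  fits "$need_cpus" "$host_cpus" || echo "[WARN] CPUs oversubscribed"
  fits "$need_mem" "$host_mem" || echo "[WARN] memory oversubscribed"
}

cluster_total 3 8192       # prints: 24576
```

Running preflight 3 4 8192 shows that the reference configuration asks for 12 CPUs and 24 GB in total. CPU oversubscription is usually tolerable on a workstation; memory oversubscription is not.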
1.4.3 Enabling Minikube Addons
#!/usr/bin/env bash
# =============================================================================
# minikube-addons-setup.sh
# Enables all addons required for production simulation
# =============================================================================
set -euo pipefail
echo "[INFO] Enabling Minikube addons..."
# Ingress controller (nginx-ingress)
minikube addons enable ingress
# MetalLB for LoadBalancer service emulation
minikube addons enable metallb
# Storage provisioner for dynamic PV provisioning
minikube addons enable storage-provisioner
# Kubernetes Dashboard for visual monitoring
minikube addons enable dashboard
minikube addons enable metrics-server
# Log viewer for troubleshooting
minikube addons enable logviewer
# Registry for local image hosting
minikube addons enable registry
# Headlamp for advanced UI
minikube addons enable headlamp
echo ""
echo "[INFO] Verifying addons..."
minikube addons list | grep -E "ingress|metallb|storage|dashboard"
echo ""
echo "[INFO] Ingress controller pods:"
kubectl get pods -n ingress-nginx
echo ""
echo "[INFO] MetalLB pods:"
kubectl get pods -n metallb-system
echo ""
echo "[INFO] Storage provisioner pods:"
kubectl get pods -n kube-system | grep storage
echo ""
echo "[INFO] Dashboard:"
minikube dashboard --url &
1.4.4 MetalLB Configuration for Local LoadBalancer
MetalLB provides LoadBalancer IPs in a local environment where no cloud LB exists. Configure it with an explicit IP pool:
# =============================================================================
# Configure MetalLB IP address pool for Minikube
# =============================================================================
# First, verify MetalLB is running
kubectl get pods -n metallb-system
# Create a Layer2 IP address pool
# The range must sit inside the subnet of the Minikube node network and must not
# overlap the node addresses. Derive it from your cluster instead of guessing:
#   MINIKUBE_IP=$(minikube ip) && echo "${MINIKUBE_IP%.*}.200-${MINIKUBE_IP%.*}.240"
# Typical defaults: 192.168.49.0/24 (Docker), 192.168.39.0/24 (KVM2),
# 192.168.59.0/24 or 192.168.99.0/24 (VirtualBox, depending on the Minikube version)
# Note: the IPAddressPool and L2Advertisement kinds require MetalLB v0.13 or newer.
# If the addon installs an older release, set the range with: minikube addons configure metallb
cat <<'EOF' | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: local-pool
  namespace: metallb-system
spec:
  addresses:
    # Example for the 192.168.49.0/24 network; adjust to your minikube ip output
    - 192.168.49.200-192.168.49.240
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: local-l2-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - local-pool
EOF
# Verify MetalLB configuration
kubectl describe ipaddresspool -n metallb-system local-pool
# Test with a LoadBalancer service
kubectl create deployment test-lb --image=nginx:alpine --replicas=2
kubectl expose deployment test-lb --type=LoadBalancer --port=80 --name=test-lb-svc
sleep 10
kubectl get svc test-lb-svc
# Expected output: EXTERNAL-IP column shows an IP from the pool (e.g., 192.168.49.200)
1.4.5 Local Storage Provisioner Configuration
The storage-provisioner addon provides dynamic PV provisioning. However, for production simulation, you may want to configure a dedicated storage class with explicit reclaim policies:
# =============================================================================
# Configure storage classes for local development
# =============================================================================
# Create a fast SSD-like storage class (uses the default storage-provisioner)
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-local
provisioner: k8s.io/minikube-hostpath
reclaimPolicy: Retain
volumeBindingMode: Immediate
allowVolumeExpansion: true
EOF
# Create a standard HDD-like storage class
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-local
provisioner: k8s.io/minikube-hostpath
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: false
EOF
# Verify storage classes
kubectl get sc
# Test dynamic provisioning with a PVC
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: fast-local
EOF
kubectl get pvc test-pvc
kubectl get pv
1.4.6 Complete Verification Runbook
Run this checklist after setting up your Minikube cluster:
#!/usr/bin/env bash
# =============================================================================
# minikube-verify.sh
# Complete verification runbook for Minikube multi-node cluster
# =============================================================================
set -euo pipefail
echo "=========================================="
echo " Minikube Multi-Node Verification"
echo "=========================================="
echo ""
echo "1. Cluster Status"
echo "------------------"
kubectl cluster-info
echo ""
echo "2. Node Health"
echo "--------------"
kubectl get nodes -o wide
NODE_COUNT=$(kubectl get nodes --no-headers | wc -l)
echo "Node count: ${NODE_COUNT}"
if [ "${NODE_COUNT}" -lt 3 ]; then
echo "[WARN] Expected 3 nodes, found ${NODE_COUNT}. Multi-node sim may be degraded."
fi
echo ""
echo "3. CoreDNS"
echo "----------"
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl run -it --rm dns-test --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 --restart=Never -- nslookup kubernetes.default.svc.cluster.local
echo ""
echo "4. Ingress Controller"
echo "---------------------"
kubectl get pods -n ingress-nginx
kubectl get svc -n ingress-nginx
echo ""
echo "5. MetalLB"
echo "----------"
kubectl get pods -n metallb-system
kubectl get ipaddresspool -n metallb-system
echo ""
echo "6. Storage"
echo "----------"
kubectl get sc
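# Under set -euo pipefail a grep without matches would abort the script, so report it instead
kubectl get pods -n kube-system | grep storage || echo "[WARN] storage-provisioner pod not found"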
echo ""
echo "7. Network (Cilium/CNI)"
echo "-----------------------"
kubectl get pods -n kube-system -l k8s-app=cilium 2>/dev/null || echo "Cilium not in kube-system, checking..."
kubectl get pods --all-namespaces -l k8s-app=cilium 2>/dev/null || echo "Cilium not running, check CNI with: kubectl get pods -n kube-system | grep -i cni"
kubectl run -it --rm ping-test --image=busybox:1.36 --restart=Never -- ping -c 3 8.8.8.8
echo ""
echo "8. Dashboard"
echo "-----------"
# minikube dashboard --url keeps running until interrupted, so it is not called inline here
echo "Access with (in a separate terminal): minikube dashboard --url"
echo ""
echo "=========================================="
echo " Verification Complete"
echo "=========================================="
1.5 kind: Architecting Multi-Node Topologies
kind (Kubernetes-in-Docker) runs Kubernetes nodes as Docker containers. It is the gold standard for CI/CD and controlled multi-node local simulations because of its repeatability and container-native architecture.
1.5.1 Complete Multi-Node kind Configuration
Below is our reference kind-config.yaml defining a 1 control-plane + 3 worker node cluster with ingress emulation and explicit host port mappings:
# =============================================================================
# kind-config.yaml
# Multi-node kind cluster configuration
# 1 control-plane + 3 worker nodes, ingress-ready, with host port mapping
# =============================================================================
#
# Usage:
# kind create cluster --config kind-config.yaml --name production-sim
#
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
# Cluster name (used for context and Docker container naming)
name: production-sim
nodes:
  # --------------------------------------------------------------------------
  # Control Plane Node
  # --------------------------------------------------------------------------
  - role: control-plane
    # Kubernetes version for this node. Pin every node to the same patch version
    # and, for full reproducibility, append the @sha256:<digest> listed in the
    # release notes of your kind version.
    image: kindest/node:v1.30.0
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            # Label this node as ingress-ready for the nginx ingress controller
            node-labels: "ingress-ready=true"
      - |
        kind: ClusterConfiguration
        # Admission plugins. PodSecurityPolicy was removed in Kubernetes 1.25;
        # use Pod Security Admission, OPA Gatekeeper, or Kyverno instead.
        apiServer:
          extraArgs:
            enable-admission-plugins: "NodeRestriction,NamespaceLifecycle"
        controllerManager:
          extraArgs:
            node-cidr-mask-size: "24"
    # Map container ports to host ports for ingress access
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        protocol: TCP
      - containerPort: 30000
        hostPort: 30000
        protocol: TCP
      - containerPort: 30001
        hostPort: 30001
        protocol: TCP
    # Host directory mounted into the node for local persistent volumes
    extraMounts:
      - hostPath: /var/lib/kind-local-pv
        containerPath: /mnt/local-storage
        # Enable bidirectional mount propagation for stateful workloads
        propagation: Bidirectional
  # --------------------------------------------------------------------------
  # Worker Node 1
  # --------------------------------------------------------------------------
  - role: worker
    image: kindest/node:v1.30.0
    extraMounts:
      - hostPath: /var/lib/kind-local-pv/worker1
        containerPath: /mnt/local-storage
        propagation: Bidirectional
  # --------------------------------------------------------------------------
  # Worker Node 2
  # --------------------------------------------------------------------------
  - role: worker
    image: kindest/node:v1.30.0
    extraMounts:
      - hostPath: /var/lib/kind-local-pv/worker2
        containerPath: /mnt/local-storage
        propagation: Bidirectional
  # --------------------------------------------------------------------------
  # Worker Node 3
  # --------------------------------------------------------------------------
  - role: worker
    image: kindest/node:v1.30.0
    extraMounts:
      - hostPath: /var/lib/kind-local-pv/worker3
        containerPath: /mnt/local-storage
        propagation: Bidirectional
# ----------------------------------------------------------------------------
# Networking configuration
# ----------------------------------------------------------------------------
networking:
  # The IP family to use
  ipFamily: ipv4
  # Service subnet CIDR
  serviceSubnet: "10.96.0.0/16"
  # Pod subnet CIDR
  podSubnet: "10.244.0.0/16"
  # Set to true to skip the default CNI and install your own (Cilium, Calico)
  disableDefaultCNI: false
  # API server address: 0.0.0.0 exposes the API server beyond localhost;
  # prefer 127.0.0.1 unless you need remote access
  apiServerAddress: "0.0.0.0"
  # API server port on the host
  apiServerPort: 6443
# ----------------------------------------------------------------------------
# kubeadm configuration patches applied globally to all nodes
# ----------------------------------------------------------------------------
kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    metadata:
      name: config
    controllerManager:
      extraArgs:
        node-monitor-grace-period: "30s"
        node-monitor-period: "5s"
    scheduler:
      extraArgs:
        bind-timeout-seconds: "30"
  - |
    kind: KubeletConfiguration
    # Use systemd cgroup driver for kubelet
    cgroupDriver: systemd
    # Protect against PID exhaustion
    podPidsLimit: 4096
    # Eviction thresholds
    evictionHard:
      memory.available: "256Mi"
      nodefs.available: "10%"
      nodefs.inodesFree: "5%"
    evictionSoft:
      memory.available: "512Mi"
      nodefs.available: "15%"
    evictionSoftGracePeriod:
      memory.available: "1m30s"
      nodefs.available: "2m"
    evictionMaxPodGracePeriod: 60
1.5.2 Creating and Verifying the kind Cluster
#!/usr/bin/env bash
# =============================================================================
# kind-cluster-workflow.sh
# Complete kind cluster creation and verification workflow
# =============================================================================
set -euo pipefail
# --- Prerequisites ---
echo "[PRE] Checking prerequisites..."
for cmd in kind kubectl docker; do
if ! command -v $cmd &>/dev/null; then
echo "[ERROR] $cmd not found. Please install it first."
exit 1
fi
done
# Create the local host directories for PV mounts
sudo mkdir -p /var/lib/kind-local-pv/worker{1,2,3}
sudo chmod -R 777 /var/lib/kind-local-pv
# --- Create the cluster ---
echo ""
echo "[INFO] Creating multi-node kind cluster 'production-sim'..."
kind create cluster --config kind-config.yaml --name production-sim --wait 5m
# --- Verify nodes ---
echo ""
echo "[INFO] Cluster nodes:"
kubectl get nodes -o wide
# --- Label worker nodes for workload scheduling ---
echo ""
echo "[INFO] Labeling worker nodes..."
for NODE in $(kubectl get nodes --no-headers -l '!node-role.kubernetes.io/control-plane' -o name); do
kubectl label --overwrite $NODE node.kubernetes.io/worker=true
done
# --- Install ingress controller (without Helm for simplicity) ---
echo ""
echo "[INFO] Installing nginx-ingress controller..."
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
# Wait for the ingress controller to be ready
kubectl wait --namespace ingress-nginx \
--for=condition=ready pod \
--selector=app.kubernetes.io/component=controller \
--timeout=180s
# --- Install MetalLB for LoadBalancer support ---
echo ""
echo "[INFO] Installing MetalLB..."
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.5/config/manifests/metallb-native.yaml
kubectl wait --namespace metallb-system \
--for=condition=ready pod \
--selector=app=metallb \
--timeout=120s
# Get the IPv4 subnet of the kind Docker network for the MetalLB IP pool
# (the network may also list an IPv6 subnet, so filter for the IPv4 entry)
NETWORK_CIDR=$(docker network inspect kind -f '{{range .IPAM.Config}}{{println .Subnet}}{{end}}' | grep -v ':' | head -n1)
NETWORK_BASE=$(echo "${NETWORK_CIDR}" | cut -d'.' -f1-3)
cat <<EOF | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: kind-pool
  namespace: metallb-system
spec:
  addresses:
    - ${NETWORK_BASE}.200-${NETWORK_BASE}.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: kind-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - kind-pool
EOF
# --- Install local path provisioner for storage ---
echo ""
echo "[INFO] Installing Local Path Provisioner..."
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.28/deploy/local-path-storage.yaml
kubectl wait --namespace local-path-storage \
--for=condition=ready pod \
--selector=app=local-path-provisioner \
--timeout=120s
# --- Final verification ---
echo ""
echo "=========================================="
echo " kind Cluster 'production-sim' Ready"
echo "=========================================="
echo ""
echo "Control Plane: $(kubectl get nodes --no-headers -l node-role.kubernetes.io/control-plane -o name)"
echo "Worker Nodes: $(kubectl get nodes --no-headers -l '!node-role.kubernetes.io/control-plane' -o name | wc -l)"
echo "Ingress: http://localhost:80"
echo "Dashboard: kubectl proxy --port=8080 &"
# Save the kubeconfig for later use
kind get kubeconfig --name production-sim > ~/.kube/kind-production-sim-config
echo ""
echo "[INFO] Kubeconfig saved to ~/.kube/kind-production-sim-config"
echo "[INFO] Use: export KUBECONFIG=~/.kube/kind-production-sim-config"
SRE Warning: kind nodes share the Docker host kernel. A workload that loads kernel modules or eBPF programs (e.g., for iptables or eBPF-based networking) affects the host as well. kind is ideal for control plane testing and CI, but never assume kernel isolation; use Minikube with KVM2 or a real VM for true kernel-level testing.
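The shared kernel is easy to confirm by comparing kernel releases on the host and inside a node container. A minimal sketch follows: the function names are hypothetical, and production-sim-control-plane is the container name kind derives from the cluster name used in this chapter. On Docker Desktop the comparison is against the Linux VM that runs Docker, not against macOS or Windows.

```shell
#!/usr/bin/env bash
# Sketch: check whether a kind node runs on the same kernel as the Docker host.

# Kernel release as seen from inside a kind node container.
node_kernel() {            # usage: node_kernel <container_name>
  docker exec "$1" uname -r
}

# Succeed when both kernel release strings are identical.
same_kernel() {            # usage: same_kernel <release_a> <release_b>
  [ "$1" = "$2" ]
}

check_kind_kernel() {      # usage: check_kind_kernel [container_name]
  local node="${1:-production-sim-control-plane}" host_k node_k
  host_k=$(uname -r)
  node_k=$(node_kernel "$node")
  if same_kernel "$host_k" "$node_k"; then
    echo "shared kernel: ${host_k}"
  else
    echo "different kernels: host=${host_k} node=${node_k}"
  fi
}
```

On a Linux host, check_kind_kernel should report a shared kernel for every node, which is the reason kernel-level experiments belong in a VM-based cluster.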
1.6 k3d/k3s for Lightweight Edge & HA Simulation
k3s is a CNCF-certified Kubernetes distribution optimized for resource-constrained environments. k3d wraps k3s in Docker containers, providing instant cluster creation with built-in load balancing.
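High availability with embedded etcd depends on quorum: a majority of server nodes must stay reachable. The arithmetic below is a small sketch of that rule (the function names are illustrative) and shows why HA setups use an odd number of servers.

```shell
#!/usr/bin/env bash
# Sketch: etcd quorum arithmetic for an N-server control plane.

# Smallest majority of N members.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while a majority remains.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
  echo "servers=${n} quorum=$(etcd_quorum "$n") tolerates=$(etcd_fault_tolerance "$n")"
done
```

Two servers tolerate no failures, just like one, while three tolerate the loss of a single server.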
1.6.1 Multi-Server HA k3d Configuration
This configuration creates a 3-server (control-plane) HA cluster with 2 agent (worker) nodes and embedded etcd:
# =============================================================================
# k3d-multi-server-ha.yaml
# Multi-server HA k3d cluster for edge computing simulation
# 3 server nodes (embedded etcd) + 2 agent nodes
# =============================================================================
#
# Usage:
#   k3d cluster create --config k3d-multi-server-ha.yaml
#
apiVersion: k3d.io/v1alpha5
kind: Simple
metadata:
  name: edge-ha
# Number of server (control-plane) nodes for HA
servers: 3
# Number of agent (worker) nodes
agents: 2
# Container image to use for all nodes
image: rancher/k3s:v1.30.1-k3s1
# Kubernetes API (load balanced across servers) published on host port 6443
kubeAPI:
  hostPort: "6443"
# Port mappings: traffic hits the built-in load balancer,
# which forwards it to the cluster nodes
ports:
  # HTTP ingress
  - port: 80:80
    nodeFilters:
      - loadbalancer
  # HTTPS ingress
  - port: 443:443
    nodeFilters:
      - loadbalancer
  # Metrics port
  - port: 37000:30000
    nodeFilters:
      - loadbalancer
# Registry configuration for local image mirroring
registries:
  create:
    name: k3d-registry.localhost
    host: "0.0.0.0"
    hostPort: "5000"
# Options for k3s configuration
options:
  k3s:
    # Extra arguments passed to k3s; nodeFilters selects the receiving nodes.
    # Embedded etcd needs no flag here: k3d passes --cluster-init to the
    # first server automatically when servers > 1.
    extraArgs:
      # Disable built-in Traefik; we'll install our own ingress controller
      - arg: "--disable=traefik"
        nodeFilters:
          - server:*
      # Disable local-storage to use our own provisioner
      - arg: "--disable=local-storage"
        nodeFilters:
          - server:*
      # Disable metrics-server to use our own from components
      - arg: "--disable=metrics-server"
        nodeFilters:
          - server:*
      # Enable Pod Security Standards
      - arg: "--kube-apiserver-arg=enable-admission-plugins=NodeRestriction,PodSecurity"
        nodeFilters:
          - server:*
      # Snapshot etcd every 30 minutes and keep the 24 most recent snapshots
      - arg: "--etcd-snapshot-schedule-cron=*/30 * * * *"
        nodeFilters:
          - server:*
      - arg: "--etcd-snapshot-retention=24"
        nodeFilters:
          - server:*
      # Set service CIDR (must not conflict with Docker networks)
      - arg: "--service-cidr=10.96.0.0/16"
        nodeFilters:
          - server:*
      # Set cluster CIDR
      - arg: "--cluster-cidr=10.244.0.0/16"
        nodeFilters:
          - server:*
  kubeconfig:
    updateDefaultKubeconfig: true
    switchCurrentContext: true
  # Runtime configuration
  runtime:
    # Open-file limits passed to the container runtime for the node containers
    ulimits:
      - name: nofile
        soft: 65536
        hard: 65536
1.6.2 Creating and Verifying the HA k3d Cluster
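Before creating the cluster, it can be worth confirming that the service and cluster CIDRs chosen in the configuration do not collide with each other or with existing Docker networks. The following is a minimal sketch, not part of k3d or Docker: ip_to_int and cidr_overlap are hypothetical helpers, and only IPv4 blocks written without leading zeros are handled.

```shell
#!/usr/bin/env bash
# ip_to_int (hypothetical helper) converts a dotted-quad IPv4 address
# into a 32-bit integer.
ip_to_int() {
  local IFS=.
  # With IFS set to ".", the unquoted expansion splits into four octets
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# cidr_overlap (hypothetical helper) prints "overlap" if two IPv4 CIDR blocks
# share any address and "disjoint" otherwise.
cidr_overlap() {
  local ip1="${1%/*}" len1="${1#*/}"
  local ip2="${2%/*}" len2="${2#*/}"
  local len mask
  # Two CIDR blocks overlap exactly when they agree on the shorter prefix
  if [ "$len1" -lt "$len2" ]; then len=$len1; else len=$len2; fi
  mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  if [ $(( $(ip_to_int "$ip1") & mask )) -eq $(( $(ip_to_int "$ip2") & mask )) ]; then
    echo "overlap"
  else
    echo "disjoint"
  fi
}

# Service CIDR vs. cluster CIDR from the configuration
cidr_overlap 10.96.0.0/16 10.244.0.0/16
```

To compare against real Docker networks, the subnets can be listed with docker network inspect (for example, with a --format template over .IPAM.Config) and each one passed to cidr_overlap alongside 10.96.0.0/16 and 10.244.0.0/16.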
# =============================================================================
# k3d-ha-cluster-setup.sh
# Complete k3d HA cluster creation and verification workflow
# =============================================================================
set -euo pipefail
# --- Prerequisites ---
echo "[PRE] Checking prerequisites..."
for cmd in k3d kubectl docker; do
  if ! command -v "$cmd" &>/dev/null; then
    echo "[ERROR] $cmd not found." >&2
    exit 1
  fi
done
# --- Create the HA cluster ---
echo ""
echo "[INFO] Creating HA k3d cluster 'edge-ha'..."
k3d cluster create --config k3d-multi-server-ha.yaml --wait
# --- Verify cluster topology ---
echo ""
echo "[INFO] Cluster nodes:"
kubectl get nodes -o wide
echo ""
echo "[INFO] Server (control-plane) nodes:"
kubectl get nodes --no-headers -l 'node-role.kubernetes.io/control-plane' -o custom-columns=NAME:.metadata.name
echo ""
echo "[INFO] Agent (worker) nodes:"
kubectl get nodes --no-headers -l '!node-role.kubernetes.io/control-plane' -o custom-columns=NAME:.metadata.name
# --- Verify etcd cluster health ---
echo ""
echo "[INFO] Checking embedded etcd membership and health..."
# The k3s binary has no etcdctl subcommand, so query through the API server.
# Servers running embedded etcd carry the node-role.kubernetes.io/etcd label.
kubectl get nodes --no-headers -l 'node-role.kubernetes.io/etcd' -o custom-columns=NAME:.metadata.name
# The API server's readiness endpoint includes a dedicated etcd check
kubectl get --raw='/readyz/etcd'
echo ""
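# --- Install ingress controller ---
echo ""
echo "[INFO] Installing ingress-nginx controller..."
# NOTE: this is the kind-specific manifest and it tracks the main branch.
# Pin a release tag for reproducible runs. The kind variant may rely on
# kind-specific node labels, so verify the controller schedules on k3d nodes.
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
kubectl wait --namespace ingress-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=180s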
# --- Deploy a test multi-service application ---
echo ""
echo "[INFO] Deploying test application..."
cat <<'APPLICATION_EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: edge-app
  labels:
    app: edge-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: edge-app
  template:
    metadata:
      labels:
        app: edge-app
    spec:
      topologySpreadConstraints:
        # Spread pods evenly across nodes (the topology key is the hostname)
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: edge-app
      containers:
        - name: app
          image: nginx:alpine
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 3
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: edge-app-svc
spec:
  ports:
    - port: 80
      targetPort: 80
  selector:
    app: edge-app
  type: LoadBalancer
APPLICATION_EOF
# Wait for the rollout to finish; unlike kubectl wait on pods, this does not
# fail if the pods have not been created yet
kubectl rollout status deployment/edge-app --timeout=120s
# Verify pod distribution across nodes
echo ""
echo "[INFO] Pod distribution across nodes:"
kubectl get pods -l app=edge-app -o wide
echo ""
echo "=========================================="
echo " HA k3d Cluster 'edge-ha' Ready"
echo "=========================================="
echo ""
echo "Control Plane HA: 3 servers"
echo "Workers: 2 agents"
echo "API Endpoint: https://localhost:6443"
echo "HTTP Ingress: http://localhost:80"
echo "Local Registry: localhost:5000"
SRE Warning: k3s embeds etcd for multi-server clusters. Use an odd number of server nodes (1, 3, 5): etcd needs a majority of members (quorum) to accept writes. Two servers provide NO HA benefit, because quorum for two members is two, so losing either one halts the cluster. Always use 3 or 5 server nodes for real HA simulation.
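The quorum arithmetic behind this warning can be checked with a few lines of shell. This is a small sketch of the majority rule (quorum is floor(n/2) + 1); the function names etcd_quorum and etcd_fault_tolerance are hypothetical and not part of k3s or etcd.

```shell
#!/usr/bin/env bash
# etcd_quorum prints how many members must be healthy for writes to succeed:
# floor(n/2) + 1, where n is the number of server nodes.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# etcd_fault_tolerance prints how many members can fail before quorum is lost.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  printf 'servers=%d quorum=%d tolerated_failures=%d\n' \
    "$n" "$(etcd_quorum "$n")" "$(etcd_fault_tolerance "$n")"
done
```

The loop shows that 2 servers tolerate as many failures as 1 (none), and 4 tolerate as many as 3 (one), which is why even-sized control planes add cost without adding resilience.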
Chapter Summary
In this chapter, you have:
- Mastered containerd internals — architecture, CRI flow, and production configuration tuning
- Understood cgroups v2 — unified hierarchy, resource controllers, and real-time inspection commands
- Traced Linux namespaces — the nine namespace types and how Kubernetes leverages each
- Built a 3-node Minikube cluster — with ingress, MetalLB, storage, and a complete verification runbook
- Architected a 4-node kind cluster — with explicit node configurations, host path mounts, and multiple port mappings
- Simulated an HA edge cluster — using k3d with 3-server embedded etcd and LoadBalancer integration
These local environments are not toys — they are production-simulation sandboxes. Every concept tested here translates directly to your production clusters.
Next Steps
Proceed to Chapter 2: Core Kubernetes Primitives & Imperative Control, where you will learn to control the cluster imperatively with kubectl and understand the fundamental workload primitives: Pods, Deployments, ReplicaSets, Services, and self-healing mechanisms.
Appendix: Common Troubleshooting
| Symptom | Likely Cause | Resolution |
|---|---|---|
| failed to create containerd task: cgroup | SystemdCgroup = false with cgroups v2 | Set SystemdCgroup = true in containerd config |
| The connection to the server localhost:8080 was refused | Kubeconfig not set | minikube update-context or kind export kubeconfig --name production-sim |
| Node shows NotReady | CNI not installed | kubectl apply -f https://raw.githubusercontent.com/.../weave-daemonset.yaml |
| LoadBalancer Service stays pending forever | MetalLB not configured | Deploy IPAddressPool and L2Advertisement resources |
| OOMKilled pod in CrashLoopBackOff | Memory limit too low | Check memory.events and increase resources.limits.memory |
| Pods stuck in Pending with an unbound PVC | StorageClass missing | Install local-path-provisioner or similar |
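For the OOMKilled row, the memory.events check can be scripted. The sketch below assumes the cgroups v2 file format of one "key value" pair per line; oom_kill_count is a hypothetical helper, and the path to a pod's cgroup directory is passed in as an argument because it depends on the node's cgroup driver and the pod's QoS class.

```shell
#!/usr/bin/env bash
# oom_kill_count (hypothetical helper) prints the oom_kill counter from a
# cgroups v2 memory.events file, or 0 if the key is not present.
oom_kill_count() {
  awk '$1 == "oom_kill" { n = $2 } END { print n + 0 }' "$1"
}

# Usage (run on the node; the path argument is a placeholder):
#   oom_kill_count /sys/fs/cgroup/<pod-cgroup-path>/memory.events
```

A non-zero count indicates the kernel killed at least one process in that cgroup for exceeding its memory limit, which points to raising resources.limits.memory or reducing the workload's footprint.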