Chapter 1: The Local Engineering Environment

Welcome to the first chapter of Kubernetes: Zero to Hero in Production. By the end of this chapter, you will have a deep understanding of container runtimes, Linux kernel primitives, and multiple local Kubernetes cluster architectures that simulate production conditions — all running on your workstation.


1.1 Container Runtimes: containerd Architecture

Before we touch a single kubectl command, we must understand what runs our containers. Since Kubernetes 1.24 removed dockershim, containerd has become the default container runtime in most production clusters.

1.1.1 containerd Component Architecture

containerd is a graduated CNCF project that manages the complete container lifecycle. It exposes a gRPC API that implements the Kubernetes Container Runtime Interface (CRI).

                              +--------------------+
                              |    kubelet         |
                              |  (on each node)    |
                              +--------+-----------+
                                       |
                              CRI gRPC |
                                       v
+----------------------+      +----------------------------+
|   ctr CLI (debug)    |      |     containerd             |
+----------------------+      |  +----------------------+  |
                              |  |  gRPC API Server     |  |
+----------------------+      |  +----------------------+  |
|   nerdctl (user)     |------+  |  CRI Plugin (cri)    |  |
+----------------------+      |  +----------------------+  |
                              |  |  Content Store       |  |
+----------------------+      |  |  (Blob storage)      |  |
|   crictl (debug)     |------+  +----------------------+  |
+----------------------+      |  |  Metadata DB (bolt)  |  |
                              |  +----------------------+  |
                              |  |  Image Service       |  |
                              |  +----------------------+  |
                              |  |  Snapshotter         |  |
                              |  |  (overlayfs)         |  |
                              |  +----------------------+  |
                              |  |  Shim (per pod)      |  |
                              +----------------------|-----+
                                                     |
                                             runc    |
                                                     v
                              +----------------------------+
                              |     runc                   |
                              |  +----------------------+  |
                              |  |  cgroups v2          |  |
                              |  |  namespaces          |  |
                              |  |  rootfs (overlayfs)  |  |
                              |  +----------------------+  |
                              +----------------------------+

Key internal subsystems:

| Component | Responsibility | Storage Backend |
|---|---|---|
| gRPC API Server | Accepts CRI and containerd API calls over a Unix socket | In-memory |
| CRI Plugin (cri) | Translates Kubernetes CRI calls into containerd operations | BoltDB metadata |
| Content Store | Stores raw blob content (compressed layer tarballs) | Filesystem (/var/lib/containerd/io.containerd.content.v1.content/) |
| Metadata DB | Tracks images, containers, snapshots, and namespaces | BoltDB (/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db) |
| Image Service | Pulls, unpacks, and manages images via the distribution spec | Content Store + Snapshotters |
| Snapshotter | Manages filesystem snapshots (default: overlayfs) | Filesystem (/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/) |
| Shim (containerd-shim-runc-v2) | Per-pod process that keeps container stdio open and reports exit status after runc exits | N/A |

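SRE Warning: The shim process is what lets pods survive a containerd restart. With containerd-shim-runc-v2, each pod sandbox gets one shim that supervises every container in that pod. If the shim dies, the pod's containers die with it. Monitor the shim process count as a health signal: pgrep -fc containerd-shim should track the number of running pod sandboxes. Avoid ps aux | grep containerd-shim | wc -l, which also counts the grep process itself.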

1.1.2 CRI Flow: From kubelet to Container

When the Kubernetes scheduler assigns a pod to a node, the following flow occurs:

  1. kubelet calls RunPodSandbox() via CRI gRPC to the containerd socket (/run/containerd/containerd.sock or /var/run/containerd/containerd.sock)
  2. containerd’s CRI plugin creates a pod sandbox (infra container using pause image) which holds the network namespace
  3. kubelet calls CreateContainer() for each container in the pod spec
  4. CRI plugin resolves the image reference, checks the content store, and pulls if necessary
  5. Snapshotter creates an overlayfs mount from the image layers + container writable layer
  6. containerd launches containerd-shim-runc-v2 which invokes runc to create the container
  7. runc uses cgroups v2 to set resource constraints and namespaces for isolation
  8. The shim keeps the container’s stdio streams connected and reports exit codes
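
The flow above can be reproduced by hand with crictl, which speaks the same CRI gRPC API as kubelet. The following is a minimal sketch rather than a production procedure. The sandbox name demo-sandbox, the busybox image, the scratch directory, and the CRI_DEMO_RUN switch are assumptions made for illustration; the crictl calls run only when CRI_DEMO_RUN=1 is set on a node where crictl is installed.

```shell
#!/usr/bin/env bash
# Sketch: drive the CRI flow manually with crictl.
# All names and images below are illustrative assumptions.

WORKDIR="${WORKDIR:-$(mktemp -d)}"

# Input for step 1: sandbox config consumed by RunPodSandbox()
write_pod_config() {
  cat > "$1/pod-config.json" <<'EOF'
{
  "metadata": {
    "name": "demo-sandbox",
    "namespace": "default",
    "attempt": 1,
    "uid": "demo-sandbox-uid-0001"
  },
  "log_directory": "/tmp",
  "linux": {}
}
EOF
}

# Input for step 3: container config consumed by CreateContainer()
write_container_config() {
  cat > "$1/container-config.json" <<'EOF'
{
  "metadata": { "name": "demo-busybox" },
  "image": { "image": "docker.io/library/busybox:1.36" },
  "command": ["sleep", "3600"],
  "log_path": "demo-busybox.log",
  "linux": {}
}
EOF
}

write_pod_config "$WORKDIR"
write_container_config "$WORKDIR"
echo "CRI configs written to $WORKDIR"

if [ "${CRI_DEMO_RUN:-0}" = "1" ] && command -v crictl >/dev/null 2>&1; then
  # Steps 1-2: create the pod sandbox (pause container + network namespace)
  POD_ID=$(sudo crictl runp "$WORKDIR/pod-config.json")
  # Step 4: make sure the image is in the content store
  sudo crictl pull docker.io/library/busybox:1.36
  # Steps 3, 5: create the container inside the sandbox
  CTR_ID=$(sudo crictl create "$POD_ID" \
    "$WORKDIR/container-config.json" "$WORKDIR/pod-config.json")
  # Steps 6-8: the shim invokes runc and keeps supervising the process
  sudo crictl start "$CTR_ID"
  sudo crictl ps --pod "$POD_ID"
  # Cleanup: stop and remove the sandbox and everything in it
  sudo crictl stopp "$POD_ID" && sudo crictl rmp "$POD_ID"
fi
```

Run it with CRI_DEMO_RUN=1 on a disposable node only; on a live cluster node kubelet may garbage-collect sandboxes it did not create.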

1.1.3 Complete containerd Configuration (Production Template)

Below is a production-grade config.toml for containerd. This file lives at /etc/containerd/config.toml on Linux nodes.

# /etc/containerd/config.toml
# Production-grade containerd configuration tuned for Kubernetes
# (config schema version 2, as used by containerd 1.6 and 1.7)

version = 2

# Root directory for containerd persistent data
root = "/var/lib/containerd"

# State directory for containerd transient data
state = "/run/containerd"

# OOM score adjustment for the containerd daemon.
# Top-level keys must appear before the first [table] header,
# otherwise TOML assigns them to the preceding table.
oom_score = -999

# Unix socket for the containerd gRPC API (CRI is served on the same socket)
[grpc]
  address = "/run/containerd/containerd.sock"
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

# Debug and metrics
[debug]
  level = "info"

[metrics]
  address = "127.0.0.1:1338"
  grpc_histogram = false

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    # Sandbox image: keep this pinned to a specific version
    sandbox_image = "registry.k8s.io/pause:3.9"

    # Maximum length of a single container log line in bytes; longer lines are split
    max_container_log_line_size = 16384

    # Enable SELinux if your nodes use it
    enable_selinux = false

    # CDI (Container Device Interface) for GPU devices (containerd 1.7+)
    enable_cdi = true
    cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]

    # Runtime and snapshotter settings
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "runc"
      snapshotter = "overlayfs"
      discard_unpacked_layers = true

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            # CRITICAL: must match the kubelet cgroupDriver.
            # Use true on systemd hosts running cgroups v2.
            SystemdCgroup = true
            BinaryName = "runc"

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
          # GPU-accelerated runtime for AI/ML workloads (Chapter 9)
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
            SystemdCgroup = true
            BinaryName = "/usr/bin/nvidia-container-runtime"

    # Registry mirror configuration (endpoints are tried in order)
    [plugins."io.containerd.grpc.v1.cri".registry]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = [
            "https://mirror.gcr.io",
            "https://registry-1.docker.io",
          ]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."ghcr.io"]
          endpoint = ["https://ghcr.io"]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."k8s.gcr.io"]
          endpoint = ["https://registry.k8s.io"]

      # Registry authentication (use Kubernetes image pull secrets instead)
      [plugins."io.containerd.grpc.v1.cri".registry.configs]

    # TLS key pair for the CRI streaming server (exec, attach, port-forward).
    # Only used when enable_tls_streaming = true.
    [plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
      tls_cert_file = ""
      tls_key_file = ""

SRE Warning: SystemdCgroup must match the kubelet's cgroupDriver. On systemd-based hosts running cgroups v2, which covers most current Linux distributions, set both to systemd. A mismatch leaves two cgroup managers competing for the same hierarchy and typically shows up as pods that fail to start or sandboxes that restart repeatedly. Verify your cgroup mode with stat -fc %T /sys/fs/cgroup/; the output cgroup2fs means v2.
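
The cgroup mode check can be wrapped in a small helper so that node bootstrap scripts can branch on it. This is a minimal sketch: the function name cgroup_mode_from_fstype is hypothetical, and it assumes the usual convention that a tmpfs at /sys/fs/cgroup indicates a v1 (or hybrid) layout.

```shell
#!/usr/bin/env bash
# Sketch: classify the cgroup mode from the filesystem type at /sys/fs/cgroup.

# Maps the output of `stat -fc %T /sys/fs/cgroup` to a cgroup mode.
cgroup_mode_from_fstype() {
  case "$1" in
    cgroup2fs) echo "v2" ;;       # unified hierarchy
    tmpfs)     echo "v1" ;;       # legacy or hybrid: tmpfs holding per-controller mounts
    *)         echo "unknown" ;;
  esac
}

if [ -d /sys/fs/cgroup ]; then
  FSTYPE=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo "unavailable")
  echo "cgroup filesystem: ${FSTYPE} -> mode $(cgroup_mode_from_fstype "${FSTYPE}")"
fi
```

A bootstrap script could refuse to continue when the result is not v2 instead of discovering the mismatch later through failing pods.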

1.1.4 Verification Commands

# Check containerd status
sudo systemctl status containerd --no-pager
 
# Verify containerd is listening on its socket
sudo ctr version
sudo ctr plugins list | grep -E "cri|runc"
 
# Check the CRI runtime status and effective config through the CRI socket
sudo crictl info | jq .
 
# List all pods tracked by containerd
sudo crictl pods
 
# List all images cached by containerd
sudo crictl images
 
# Export containerd metrics
curl -s http://127.0.0.1:1338/metrics | head -50
 
# Check cgroup mode
stat -fc %T /sys/fs/cgroup/
 
# Verify containerd config is valid
sudo containerd config dump

1.2 Linux cgroups v2 Deep Dive

Control Groups (cgroups) are the Linux kernel feature that limits, accounts for, and isolates resource usage for process hierarchies. Kubernetes relies on cgroups for CPU, memory, and PID enforcement.

1.2.1 cgroups v2 Unified Hierarchy

cgroups v2 (introduced in Linux 4.5, production-ready since 5.x) uses a single unified hierarchy instead of the multiple hierarchies in v1. All resource controllers are mounted at /sys/fs/cgroup/.

/sys/fs/cgroup/
├── cgroup.controllers      # Available controllers
├── cgroup.subtree_control  # Controllers enabled for children
├── cpu.stat                # CPU usage statistics
├── io.stat                 # I/O statistics
├── kubepods.slice/                                  # Kubernetes pod cgroups (systemd driver naming)
│   ├── kubepods-burstable.slice/                    # Burstable QoS class pods
│   │   └── kubepods-burstable-pod<UID>.slice/       # Per-pod cgroup
│   │       └── cri-containerd-<containerID>.scope/  # Per-container cgroup
│   │           ├── cpu.max         # CPU quota and period
│   │           ├── memory.current  # Current memory usage in bytes
│   │           ├── memory.max      # Memory hard limit ("max" if unlimited)
│   │           ├── memory.min      # Memory protection floor
│   │           ├── memory.low      # Memory best-effort floor
│   │           ├── memory.high     # Memory throttling threshold
│   │           ├── pids.current    # Current number of PIDs/tasks
│   │           ├── pids.max        # Maximum number of PIDs
│   │           └── io.max          # I/O limits
│   ├── kubepods-besteffort.slice/                   # BestEffort QoS class pods
│   └── kubepods-pod<UID>.slice/                     # Guaranteed pods sit directly under kubepods.slice
└── system.slice/           # System services

Limit and usage files such as memory.max, memory.current, and pids.max exist only in non-root cgroups. In systemd slice names the dashes of the pod UID are replaced with underscores. With the cgroupfs driver the same hierarchy appears as kubepods/burstable/pod<UID>/<containerID>/.
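
Before relying on this hierarchy, it is worth confirming that the controllers Kubernetes depends on are actually available on the host. The sketch below is one way to do that. The function name missing_controllers and the required list (cpu, memory, pids) are assumptions chosen to match the controllers discussed in this chapter.

```shell
#!/usr/bin/env bash
# Sketch: report which required controllers are absent from cgroup.controllers.

# $1: space-separated available controllers, $2: space-separated required ones.
# Prints the missing controllers, separated by spaces (empty if none).
missing_controllers() {
  local available=" $1 " required="$2" missing="" c
  for c in $required; do
    case "$available" in
      *" $c "*) ;;                                # whole-word match, so "cpu" != "cpuset"
      *) missing="${missing:+$missing }$c" ;;
    esac
  done
  echo "$missing"
}

REQUIRED="cpu memory pids"
if [ -r /sys/fs/cgroup/cgroup.controllers ]; then
  MISSING=$(missing_controllers "$(cat /sys/fs/cgroup/cgroup.controllers)" "$REQUIRED")
  if [ -z "$MISSING" ]; then
    echo "[OK] required controllers available: $REQUIRED"
  else
    echo "[WARN] missing controllers: $MISSING"
  fi
fi
```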

1.2.2 Resource Controllers

| Controller | Interface File | Purpose | Kubernetes Mapping |
|---|---|---|---|
| CPU | cpu.max | Hard limit on CPU time (quota and period, in microseconds) | resources.limits.cpu |
| CPU | cpu.weight | Relative CPU weight (1-10000) | resources.requests.cpu |
| Memory | memory.max | Hard memory limit in bytes | resources.limits.memory |
| Memory | memory.high | Memory throttling threshold | Throttling before OOM (set only with the MemoryQoS feature gate) |
| Memory | memory.min | Memory protection floor | Guaranteed memory floor (resources.requests.memory with the MemoryQoS feature gate) |
| I/O | io.max | I/O bandwidth limits (rbps/wbps/riops/wiops) | Not native in K8s |
| PID | pids.max | Maximum number of processes/threads | Kubelet --pod-max-pids (podPidsLimit) |
| cpuset | cpuset.cpus | CPU affinity mask | cpuManagerPolicy=static |
| hugetlb | hugetlb.1GB.max | HugeTLB usage limit | resources.limits.hugepages-<size> (e.g., hugepages-1Gi) |
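
To make the CPU rows concrete, the sketch below converts Kubernetes CPU values in millicores into the strings that end up in cpu.max and cpu.weight. It assumes the default 100 ms CFS period and the linear shares-to-weight mapping described for the original cgroups v2 support in Kubernetes; runtime versions may use a different curve, so treat the weight as indicative. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: map Kubernetes CPU requests/limits (millicores) to cgroup v2 values.

CPU_PERIOD_US=100000   # assumed default CFS period (100 ms)

# resources.limits.cpu (millicores) -> cpu.max "<quota> <period>"
cpu_max_from_millicores() {
  local milli="$1"
  if [ "$milli" -le 0 ]; then
    echo "max ${CPU_PERIOD_US}"       # no limit set
    return
  fi
  echo "$(( milli * CPU_PERIOD_US / 1000 )) ${CPU_PERIOD_US}"
}

# resources.requests.cpu (millicores) -> cpu.weight
# Step 1: millicores -> cgroup v1 shares (1000m = 1024 shares, clamped to [2, 262144])
# Step 2: shares -> weight via 1 + ((shares - 2) * 9999) / 262142
cpu_weight_from_millicores() {
  local milli="$1" shares
  shares=$(( milli * 1024 / 1000 ))
  [ "$shares" -lt 2 ] && shares=2
  [ "$shares" -gt 262144 ] && shares=262144
  echo $(( 1 + (shares - 2) * 9999 / 262142 ))
}

echo "limit 500m    -> cpu.max    = $(cpu_max_from_millicores 500)"
echo "request 1000m -> cpu.weight = $(cpu_weight_from_millicores 1000)"
```

A 500m limit therefore allows 50 ms of CPU time in every 100 ms period, regardless of how many cores the node has.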

1.2.3 Runtime Inspection Commands

# Show available controllers on your system
cat /sys/fs/cgroup/cgroup.controllers
 
# Show which controllers are active for new children in the root cgroup
cat /sys/fs/cgroup/cgroup.subtree_control
 
# Find a specific Kubernetes pod cgroup
# (the name pattern matches both the systemd and cgroupfs driver layouts)
POD_UID=$(kubectl get pod my-app -n default -o jsonpath='{.metadata.uid}')
POD_CGROUP=$(sudo find /sys/fs/cgroup -maxdepth 4 -type d -name "*pod${POD_UID:0:8}*" 2>/dev/null | head -n 1)
echo "Pod cgroup: $POD_CGROUP"
 
# Resolve the container's cgroup directory from its init PID.
# With the systemd driver, cgroupsPath in the runtime spec is a
# "slice:prefix:name" triple, not a filesystem path, so read /proc instead.
CONTAINER_ID=$(sudo crictl ps --name my-app -q | head -n 1)
PID=$(sudo crictl inspect "$CONTAINER_ID" | jq -r '.info.pid')
CGROUP_DIR="/sys/fs/cgroup$(awk -F: '$1 == "0" {print $3}' "/proc/${PID}/cgroup")"
echo "Container cgroup: $CGROUP_DIR"
 
# Inspect CPU limits for a running container
cat "$CGROUP_DIR/cpu.max"
 
# Watch memory pressure in real-time
watch -n 2 "cat $CGROUP_DIR/memory.current $CGROUP_DIR/memory.max"
 
# Check for OOM kills in a cgroup
grep oom "$CGROUP_DIR/memory.events"

SRE Warning: When a container hits memory.max and is OOM-killed, the kernel increments oom_kill in memory.events. This is your first signal. Alert whenever the cAdvisor metric container_oom_events_total, exposed through the kubelet, increases; container_memory_working_set_bytes approaching the memory limit is the leading indicator.
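
For ad-hoc checks on a node, the oom_kill counter can be extracted from memory.events directly. The following is a small sketch; the helper name oom_kill_count and the sample content are illustrative, not output captured from a real system.

```shell
#!/usr/bin/env bash
# Sketch: extract the oom_kill counter from memory.events content.

# Reads memory.events on stdin and prints the oom_kill value (0 if absent).
# The exact field match avoids picking up related keys such as oom or oom_group_kill.
oom_kill_count() {
  awk '$1 == "oom_kill" { n = $2 } END { print n + 0 }'
}

# Illustrative sample in the flat "key value" format used by memory.events
SAMPLE_EVENTS='low 0
high 12
max 3
oom 1
oom_kill 1'

COUNT=$(printf '%s\n' "$SAMPLE_EVENTS" | oom_kill_count)
if [ "$COUNT" -gt 0 ]; then
  echo "[ALERT] ${COUNT} OOM kill(s) recorded in this cgroup"
else
  echo "[OK] no OOM kills recorded"
fi
```

On a node, feed it the real file for a container cgroup directory, for example oom_kill_count < <cgroup-dir>/memory.events.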


1.3 Linux Namespaces

Namespaces provide the isolation pillar of containerization. Each namespace wraps a global system resource so that processes within the namespace see an isolated instance of that resource.

1.3.1 Namespace Types

| Namespace | Flag | Isolates | Kernel Version |
|---|---|---|---|
| PID | CLONE_NEWPID | Process ID number space | 2.6.24 |
| Network | CLONE_NEWNET | Network devices, stacks, ports, routing tables | 2.6.29 |
| Mount | CLONE_NEWNS | Mount points, filesystem hierarchy | 2.4.19 |
| UTS | CLONE_NEWUTS | Hostname and NIS domain name | 2.6.19 |
| IPC | CLONE_NEWIPC | System V IPC, POSIX message queues | 2.6.19 |
| User | CLONE_NEWUSER | User and group ID space | 3.8 |
| Cgroup | CLONE_NEWCGROUP | Cgroup root directory | 4.6 |
| Time | CLONE_NEWTIME | Boot and monotonic clocks | 5.6 |

1.3.2 Verifying Namespace Isolation

# Find the PID of a running container's init process
CONTAINER_ID=$(sudo crictl ps --name my-app -q)
PID=$(sudo crictl inspect $CONTAINER_ID | jq -r '.info.pid')
echo "Container PID: $PID"
 
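# Inspect namespaces for the container
sudo lsns -p $PID
# Output example (illustrative; inode numbers differ on every system):
#           NS TYPE   NPROCS   PID USER COMMAND
#   4026531835 cgroup      1 12345 root /app
#   4026531837 pid         1 12345 root /app
#   4026531838 net         1 12345 root /app
#   4026531840 mnt         1 12345 root /app
 
# Run a command inside the container's network namespace
sudo nsenter -t $PID -n ip addr
 
# List the namespace handles of the process. An inode that matches
# /proc/1/ns/<type> on the host means that namespace is shared with the host.
sudo ls -la /proc/$PID/ns/
# Output example (illustrative):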
#   lrwxrwxrwx 1 root root 0 ... cgroup -> 'cgroup:[4026531835]'
#   lrwxrwxrwx 1 root root 0 ... ipc    -> 'ipc:[4026531839]'
#   lrwxrwxrwx 1 root root 0 ... mnt    -> 'mnt:[4026532160]'
#   lrwxrwxrwx 1 root root 0 ... net    -> 'net:[4026531993]'
#   lrwxrwxrwx 1 root root 0 ... pid    -> 'pid:[4026531836]'
#   lrwxrwxrwx 1 root root 0 ... user   -> 'user:[4026531837]'
#   lrwxrwxrwx 1 root root 0 ... uts    -> 'uts:[4026531838]'

SRE Warning: If two containers share the same network namespace (as in a pod’s sidecar pattern), they share the same loopback interface. Port conflicts are possible. This is how pods achieve localhost communication between containers — they share the pod’s network namespace via the infra/pause container.
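
Namespace sharing can be checked mechanically by comparing the inode numbers behind /proc/<pid>/ns/<type>. The sketch below shows one way to do it. The helper names ns_inode and same_namespace are hypothetical, and the built-in demonstration only compares the current shell with itself.

```shell
#!/usr/bin/env bash
# Sketch: decide whether two processes share a namespace by comparing inodes.

# Extracts the inode from a namespace link target such as "net:[4026531993]".
ns_inode() {
  local link="$1"
  link="${link#*\[}"      # drop everything up to and including "["
  echo "${link%\]}"       # drop the trailing "]"
}

# Returns 0 when PIDs $1 and $2 share the namespace of type $3 (net, pid, mnt, ...).
# Returns 2 when a namespace link cannot be read (missing process or permissions).
same_namespace() {
  local a b
  a=$(readlink "/proc/$1/ns/$3") || return 2
  b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$(ns_inode "$a")" = "$(ns_inode "$b")" ]
}

# Self-check: a process always shares every namespace with itself.
if [ -e "/proc/$$/ns/net" ]; then
  same_namespace "$$" "$$" net && echo "net namespace shared (self-check)"
fi
```

For two containers of the same pod, pass the PIDs reported by crictl inspect: same_namespace "$PID_A" "$PID_B" net should succeed, while the mnt comparison should fail because each container has its own mount namespace. Reading another user's namespace links requires root.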


1.4 Minikube in Production-Simulation

Minikube is the most accessible local Kubernetes environment. With the right configuration, it simulates a production-grade multi-node cluster on a single machine.

1.4.1 Driver Selection Matrix

| Driver | Platform | Performance | Nested Virt | Best For |
|---|---|---|---|---|
| Docker | Linux, macOS, Windows | Medium | No | Quick start, resource-constrained laptops |
| KVM2 | Linux | High | Yes | Heavy workloads, multi-node simulation, storage testing |
| Hyperkit | macOS (Intel only) | High | Yes | Multi-node on macOS |
| VirtualBox | All | Low | Yes | Cross-platform consistency |
| None (bare metal) | Linux | Highest | No | CI/CD runners, advanced users with existing Docker |

1.4.2 Multi-Node Minikube Cluster Setup

This is our reference configuration for the book — a 3-node cluster using KVM2 on Linux:

#!/usr/bin/env bash
# =============================================================================
# minikube-multi-node-setup.sh
# Creates a 3-node Minikube cluster with production-simulated configuration
# =============================================================================
 
set -euo pipefail
 
# --- Driver detection ---
# Auto-detect the best driver for the platform
case "$(uname -s)" in
  Linux)
    if command -v virsh &>/dev/null && virsh list --name &>/dev/null 2>&1; then
      DRIVER="kvm2"
    else
      DRIVER="docker"
      echo "[WARN] KVM2 not detected, falling back to docker driver"
      echo "[WARN] Install libvirt: sudo apt install libvirt-daemon-system libvirt-clients qemu-kvm"
    fi
    ;;
  Darwin)
    if command -v hyperkit &>/dev/null; then
      DRIVER="hyperkit"
    else
      DRIVER="docker"
      echo "[WARN] hyperkit not detected, falling back to docker driver"
      echo "[WARN] Install hyperkit: brew install hyperkit"
    fi
    ;;
  *)
    DRIVER="docker"
    ;;
esac
 
# --- Start the multi-node cluster ---
# --cpus and --memory apply to each node, so three nodes need 12 CPUs and 24 GB.
# --ports is honored only by the docker and podman drivers.
PORT_ARGS=()
if [ "${DRIVER}" = "docker" ]; then
  PORT_ARGS=(
    --ports=127.0.0.1:8443:8443
    --ports=127.0.0.1:10080:80
    --ports=127.0.0.1:10443:443
  )
fi
 
# --cni replaces the deprecated --network-plugin flag
minikube start \
  --driver="${DRIVER}" \
  --nodes=3 \
  --cpus=4 \
  --memory=8192 \
  --disk-size=40g \
  --kubernetes-version=v1.30.0 \
  --cni=cilium \
  --container-runtime=containerd \
  --service-cluster-ip-range="10.96.0.0/16" \
  --extra-config=kubelet.cgroup-driver=systemd \
  --extra-config=kubelet.cgroup-root=/ \
  --extra-config=kubelet.pod-max-pids=4096 \
  --extra-config=apiserver.service-node-port-range=30000-32767 \
  ${PORT_ARGS[@]+"${PORT_ARGS[@]}"}
 
# --- Verify cluster health ---
echo ""
echo "[INFO] Verifying cluster health..."
kubectl cluster-info
kubectl get nodes -o wide
 
echo ""
echo "[INFO] Cluster nodes:"
kubectl get nodes --show-labels | grep -E "node-role|topology"
 
echo ""
echo "[INFO] Node resource capacity:"
kubectl describe node minikube | grep -A5 "Capacity"

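SRE Warning: Minikube's default allocation (2 CPUs and a small memory budget per node that varies by version and host) is insufficient for multi-node clusters. Always allocate at least 4 CPUs and 8 GB RAM per node for realistic workload testing. Because --cpus and --memory apply to every node, the 3-node reference cluster needs 12 CPUs and 24 GB on the host.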
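
A quick pre-flight check avoids starting a cluster that the workstation cannot hold. The sketch below multiplies the per-node allocation by the node count and compares it with the host. The helper names are hypothetical, and host detection assumes Linux (nproc and /proc/meminfo).

```shell
#!/usr/bin/env bash
# Sketch: check whether the host can hold an N-node Minikube cluster.

# Prints "<total_cpus> <total_memory_mib>" for a cluster of identical nodes.
cluster_budget() {
  local nodes="$1" cpus="$2" mem_mib="$3"
  echo "$(( nodes * cpus )) $(( nodes * mem_mib ))"
}

# Returns 0 when the host has at least the required CPUs and memory (MiB).
fits_on_host() {
  local need_cpus="$1" need_mem="$2" host_cpus="$3" host_mem="$4"
  [ "$host_cpus" -ge "$need_cpus" ] && [ "$host_mem" -ge "$need_mem" ]
}

# Reference cluster from this chapter: 3 nodes, 4 CPUs and 8192 MiB each
BUDGET=$(cluster_budget 3 4 8192)
NEED_CPUS=${BUDGET% *}
NEED_MEM=${BUDGET#* }
echo "Reference cluster needs ${NEED_CPUS} CPUs and ${NEED_MEM} MiB of RAM"

if command -v nproc >/dev/null 2>&1 && [ -r /proc/meminfo ]; then
  HOST_CPUS=$(nproc)
  HOST_MEM=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
  if fits_on_host "$NEED_CPUS" "$NEED_MEM" "$HOST_CPUS" "$HOST_MEM"; then
    echo "[OK] host has ${HOST_CPUS} CPUs and ${HOST_MEM} MiB"
  else
    echo "[WARN] host has only ${HOST_CPUS} CPUs and ${HOST_MEM} MiB; reduce --nodes, --cpus, or --memory"
  fi
fi
```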

1.4.3 Enabling Minikube Addons

#!/usr/bin/env bash
# =============================================================================
# minikube-addons-setup.sh
# Enables all addons required for production simulation
# =============================================================================
 
set -euo pipefail
 
echo "[INFO] Enabling Minikube addons..."
 
# Ingress controller (nginx-ingress)
minikube addons enable ingress
 
# MetalLB for LoadBalancer service emulation
minikube addons enable metallb
 
# Storage provisioner for dynamic PV provisioning
minikube addons enable storage-provisioner
 
# Kubernetes Dashboard for visual monitoring
minikube addons enable dashboard
minikube addons enable metrics-server
 
# Log viewer for troubleshooting
minikube addons enable logviewer
 
# Registry for local image hosting
minikube addons enable registry
 
# Headlamp for advanced UI
minikube addons enable headlamp
 
echo ""
echo "[INFO] Verifying addons..."
minikube addons list | grep -E "ingress|metallb|storage|dashboard"
 
echo ""
echo "[INFO] Ingress controller pods:"
kubectl get pods -n ingress-nginx
 
echo ""
echo "[INFO] MetalLB pods:"
kubectl get pods -n metallb-system
 
echo ""
echo "[INFO] Storage provisioner pods:"
kubectl get pods -n kube-system | grep storage
 
echo ""
echo "[INFO] Dashboard:"
minikube dashboard --url &

1.4.4 MetalLB Configuration for Local LoadBalancer

MetalLB provides LoadBalancer IPs in a local environment where no cloud LB exists. Configure it with an explicit IP pool:

# =============================================================================
# Configure MetalLB IP address pool for Minikube
# =============================================================================
 
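# First, verify MetalLB is running
kubectl get pods -n metallb-system
 
# The IPAddressPool and L2Advertisement CRDs require MetalLB v0.13 or later.
# The Minikube addon may ship an older, ConfigMap-based release; if the CRDs
# are missing, swap the addon for the upstream manifests.
if ! kubectl get crd ipaddresspools.metallb.io &>/dev/null; then
  minikube addons disable metallb
  kubectl wait --for=delete namespace/metallb-system --timeout=120s || true
  kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.5/config/manifests/metallb-native.yaml
  kubectl wait --namespace metallb-system --for=condition=ready pod \
    --selector=app=metallb --timeout=120s
fi
 
# Create a Layer2 IP address pool
# The range must sit inside the subnet of the Minikube node network.
# Typical defaults: 192.168.49.0/24 (Docker), 192.168.39.0/24 (KVM2),
# 192.168.59.0/24 (VirtualBox; 192.168.99.0/24 on older releases)
MINIKUBE_IP=$(minikube ip)
POOL_RANGE="${MINIKUBE_IP%.*}.200-${MINIKUBE_IP%.*}.240"
echo "MetalLB pool: ${POOL_RANGE}"
 
cat <<EOF | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: local-pool
  namespace: metallb-system
spec:
  addresses:
    # Derived from "minikube ip"; assumes a /24 node network
    - ${POOL_RANGE}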
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: local-l2-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - local-pool
EOF
 
# Verify MetalLB configuration
kubectl describe ipaddresspool -n metallb-system local-pool
 
# Test with a LoadBalancer service
kubectl create deployment test-lb --image=nginx:alpine --replicas=2
kubectl expose deployment test-lb --type=LoadBalancer --port=80 --name=test-lb-svc
sleep 10
kubectl get svc test-lb-svc
# Expected output: EXTERNAL-IP column shows an IP from the pool (e.g., 192.168.49.200)
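
Rather than eyeballing the EXTERNAL-IP column, the assignment can be validated against the pool. This is a minimal sketch assuming IPv4 and a pool written as start-end; the helper names ip_to_int and ip_in_pool are hypothetical, and the pool value mirrors the example range used above.

```shell
#!/usr/bin/env bash
# Sketch: verify that a LoadBalancer IP falls inside the MetalLB pool.

# Converts a dotted-quad IPv4 address to an integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# Returns 0 when IP $1 lies within the inclusive "start-end" range $2.
ip_in_pool() {
  local ip="$1" range="$2" start end n
  start="${range%-*}"
  end="${range#*-}"
  n=$(ip_to_int "$ip")
  [ "$n" -ge "$(ip_to_int "$start")" ] && [ "$n" -le "$(ip_to_int "$end")" ]
}

POOL="${POOL:-192.168.49.200-192.168.49.240}"   # example range; adjust to your pool

if command -v kubectl >/dev/null 2>&1; then
  EXTERNAL_IP=$(kubectl get svc test-lb-svc --request-timeout=5s \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null || true)
  if [ -n "$EXTERNAL_IP" ]; then
    if ip_in_pool "$EXTERNAL_IP" "$POOL"; then
      echo "[OK] ${EXTERNAL_IP} is inside ${POOL}"
    else
      echo "[WARN] ${EXTERNAL_IP} is outside ${POOL}"
    fi
  else
    echo "[INFO] test-lb-svc has no external IP yet"
  fi
fi
```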

1.4.5 Local Storage Provisioner Configuration

The storage-provisioner addon provides dynamic PV provisioning. However, for production simulation, you may want to configure a dedicated storage class with explicit reclaim policies:

# =============================================================================
# Configure storage classes for local development
# =============================================================================
 
# Create a fast SSD-like storage class (uses the default storage-provisioner)
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-local
provisioner: k8s.io/minikube-hostpath
reclaimPolicy: Retain
volumeBindingMode: Immediate
allowVolumeExpansion: true
EOF
 
# Create a standard HDD-like storage class
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-local
provisioner: k8s.io/minikube-hostpath
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: false
EOF
 
# Verify storage classes
kubectl get sc
 
# Test dynamic provisioning with a PVC
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: fast-local
EOF
 
kubectl get pvc test-pvc
kubectl get pv

1.4.6 Complete Verification Runbook

Run this checklist after setting up your Minikube cluster:

#!/usr/bin/env bash
# =============================================================================
# minikube-verify.sh
# Complete verification runbook for Minikube multi-node cluster
# =============================================================================
 
set -euo pipefail
 
echo "=========================================="
echo "  Minikube Multi-Node Verification"
echo "=========================================="
 
echo ""
echo "1. Cluster Status"
echo "------------------"
kubectl cluster-info
 
echo ""
echo "2. Node Health"
echo "--------------"
kubectl get nodes -o wide
NODE_COUNT=$(kubectl get nodes --no-headers | wc -l)
echo "Node count: ${NODE_COUNT}"
if [ "${NODE_COUNT}" -lt 3 ]; then
  echo "[WARN] Expected 3 nodes, found ${NODE_COUNT}. Multi-node sim may be degraded."
fi
 
echo ""
echo "3. CoreDNS"
echo "----------"
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl run --rm -i dns-test --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 --restart=Never -- nslookup kubernetes.default.svc.cluster.local
 
echo ""
echo "4. Ingress Controller"
echo "---------------------"
kubectl get pods -n ingress-nginx
kubectl get svc -n ingress-nginx
 
echo ""
echo "5. MetalLB"
echo "----------"
kubectl get pods -n metallb-system
kubectl get ipaddresspool -n metallb-system
 
echo ""
echo "6. Storage"
echo "----------"
kubectl get sc
kubectl get pods -n kube-system | grep storage || echo "[WARN] storage-provisioner pod not found"
 
echo ""
echo "7. Network (Cilium/CNI)"
echo "-----------------------"
# kubectl exits 0 even when no pods match, so test the output instead of the exit code
CILIUM_PODS=$(kubectl get pods --all-namespaces -l k8s-app=cilium --no-headers 2>/dev/null || true)
if [ -n "${CILIUM_PODS}" ]; then
  echo "${CILIUM_PODS}"
else
  echo "[WARN] Cilium not running, check CNI with: kubectl get pods -n kube-system | grep -i cni"
fi
kubectl run --rm -i ping-test --image=busybox:1.36 --restart=Never -- ping -c 3 8.8.8.8 \
  || echo "[WARN] External connectivity test failed (requires outbound ICMP)"
 
echo ""
echo "8. Dashboard"
echo "------------"
# "minikube dashboard --url" stays in the foreground, so print the command instead of capturing it
echo "Run in a separate terminal: minikube dashboard --url"
 
echo ""
echo "=========================================="
echo "  Verification Complete"
echo "=========================================="

1.5 kind: Architecting Multi-Node Topologies

kind (Kubernetes IN Docker) runs Kubernetes nodes as Docker containers. It is widely used for CI/CD and controlled multi-node local simulations because of its repeatability and container-native architecture.

1.5.1 Complete Multi-Node kind Configuration

Below is our reference kind-config.yaml defining a 1 control-plane + 3 worker node cluster with ingress emulation and explicit host port mappings:

# =============================================================================
# kind-config.yaml
# Multi-node kind cluster configuration
# 1 control-plane + 3 worker nodes, ingress-ready, with host port mapping
# =============================================================================
#
# Usage:
#   kind create cluster --config kind-config.yaml --name production-sim
#
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
 
# Cluster name (used for context and Docker container naming)
name: production-sim
 
# Kubernetes version to use for all nodes
# Pin to a specific patch version for reproducibility
nodes:
  # --------------------------------------------------------------------------
  # Control Plane Node
  # --------------------------------------------------------------------------
  - role: control-plane
    # Explicit node image: always pin to a specific version. For full
    # reproducibility, append the @sha256 digest published in the release
    # notes of your kind version.
    image: kindest/node:v1.30.0
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            # Label this node as ingress-ready for the nginx ingress controller
            node-labels: "ingress-ready=true"
      - |
        kind: ClusterConfiguration
        # Admission plugins. PodSecurityPolicy was removed in 1.25; use Pod Security Admission, OPA Gatekeeper, or Kyverno instead
        apiServer:
          extraArgs:
            enable-admission-plugins: "NodeRestriction,NamespaceLifecycle"
        controllerManager:
          extraArgs:
            node-cidr-mask-size: "24"
    # Map container ports to host ports for ingress access
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        protocol: TCP
      - containerPort: 30000
        hostPort: 30000
        protocol: TCP
      - containerPort: 30001
        hostPort: 30001
        protocol: TCP
    # Host directory mounted into the node container for local PVs
    extraMounts:
      - hostPath: /var/lib/kind-local-pv
        containerPath: /mnt/local-storage
        # Enable direct mount propagation for stateful workloads
        propagation: Bidirectional
 
  # --------------------------------------------------------------------------
  # Worker Node 1
  # --------------------------------------------------------------------------
  - role: worker
    image: kindest/node:v1.30.0
    extraMounts:
      - hostPath: /var/lib/kind-local-pv/worker1
        containerPath: /mnt/local-storage
        propagation: Bidirectional
 
  # --------------------------------------------------------------------------
  # Worker Node 2
  # --------------------------------------------------------------------------
  - role: worker
    image: kindest/node:v1.30.0
    extraMounts:
      - hostPath: /var/lib/kind-local-pv/worker2
        containerPath: /mnt/local-storage
        propagation: Bidirectional
 
  # --------------------------------------------------------------------------
  # Worker Node 3
  # --------------------------------------------------------------------------
  - role: worker
    image: kindest/node:v1.30.0
    extraMounts:
      - hostPath: /var/lib/kind-local-pv/worker3
        containerPath: /mnt/local-storage
        propagation: Bidirectional
 
# ----------------------------------------------------------------------------
# Networking configuration
# ----------------------------------------------------------------------------
networking:
  # The IP family to use
  ipFamily: ipv4
  # Service subnet CIDR
  serviceSubnet: "10.96.0.0/16"
  # Pod subnet CIDR
  podSubnet: "10.244.0.0/16"
  # Disable default CNI if we want to install our own (Cilium, Calico)
  disableDefaultCNI: false
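  # API server address: 127.0.0.1 keeps the API server local to this machine.
  # Binding to 0.0.0.0 exposes it on every host interface; avoid that outside isolated networks.
  apiServerAddress: "127.0.0.1"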
  # API server port on the host
  apiServerPort: 6443
 
# ----------------------------------------------------------------------------
# kubeadm configuration patches applied globally to all nodes
# ----------------------------------------------------------------------------
kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    metadata:
      name: config
    controllerManager:
      extraArgs:
        node-monitor-grace-period: "30s"
        node-monitor-period: "5s"
  - |
    kind: KubeletConfiguration
    # Use systemd cgroup driver for kubelet
    cgroupDriver: systemd
    # Protect against PID exhaustion
    podPidsLimit: 4096
    # Eviction thresholds
    evictionHard:
      memory.available: "256Mi"
      nodefs.available: "10%"
      nodefs.inodesFree: "5%"
    evictionSoft:
      memory.available: "512Mi"
      nodefs.available: "15%"
    evictionSoftGracePeriod:
      memory.available: "1m30s"
      nodefs.available: "2m"
    evictionMaxPodGracePeriod: 60

1.5.2 Creating and Verifying the kind Cluster

#!/usr/bin/env bash
# =============================================================================
# kind-cluster-workflow.sh
# Complete kind cluster creation and verification workflow
# =============================================================================
 
set -euo pipefail
 
# --- Prerequisites ---
echo "[PRE] Checking prerequisites..."
for cmd in kind kubectl docker; do
  if ! command -v $cmd &>/dev/null; then
    echo "[ERROR] $cmd not found. Please install it first."
    exit 1
  fi
done
 
# Create the local host directories for PV mounts
sudo mkdir -p /var/lib/kind-local-pv/worker{1,2,3}
sudo chmod -R 777 /var/lib/kind-local-pv
 
# --- Create the cluster ---
echo ""
echo "[INFO] Creating multi-node kind cluster 'production-sim'..."
kind create cluster --config kind-config.yaml --name production-sim --wait 5m
 
# --- Verify nodes ---
echo ""
echo "[INFO] Cluster nodes:"
kubectl get nodes -o wide
 
# --- Label worker nodes for workload scheduling ---
echo ""
echo "[INFO] Labeling worker nodes..."
for NODE in $(kubectl get nodes --no-headers -l '!node-role.kubernetes.io/control-plane' -o name); do
  kubectl label --overwrite $NODE node.kubernetes.io/worker=true
done
 
# --- Install ingress controller (without Helm for simplicity) ---
echo ""
echo "[INFO] Installing nginx-ingress controller..."
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
 
# Wait for the ingress controller to be ready
kubectl wait --namespace ingress-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=180s
 
# --- Install MetalLB for LoadBalancer support ---
echo ""
echo "[INFO] Installing MetalLB..."
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.5/config/manifests/metallb-native.yaml
kubectl wait --namespace metallb-system \
  --for=condition=ready pod \
  --selector=app=metallb \
  --timeout=120s
 
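# Get the IPv4 subnet of the Docker network for the MetalLB IP pool.
# The "kind" network can also list an IPv6 subnet, so keep only IPv4 entries.
NETWORK_CIDR=$(docker network inspect kind -f '{{range .IPAM.Config}}{{println .Subnet}}{{end}}' | grep -E '^[0-9]+\.' | head -n 1)
NETWORK_BASE=$(echo "${NETWORK_CIDR}" | cut -d'.' -f1-3)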
cat <<EOF | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: kind-pool
  namespace: metallb-system
spec:
  addresses:
  - ${NETWORK_BASE}.200-${NETWORK_BASE}.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: kind-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - kind-pool
EOF
 
# --- Install local path provisioner for storage ---
echo ""
echo "[INFO] Installing Local Path Provisioner..."
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.28/deploy/local-path-storage.yaml
kubectl wait --namespace local-path-storage \
  --for=condition=ready pod \
  --selector=app=local-path-provisioner \
  --timeout=120s
 
# --- Final verification ---
echo ""
echo "=========================================="
echo "  kind Cluster 'production-sim' Ready"
echo "=========================================="
echo ""
echo "Control Plane:  $(kubectl get nodes --no-headers -l node-role.kubernetes.io/control-plane -o name)"
echo "Worker Nodes:   $(kubectl get nodes --no-headers -l '!node-role.kubernetes.io/control-plane' -o name | wc -l)"
echo "Ingress:        http://localhost:80"
echo "API proxy:      kubectl proxy --port=8080 &"
 
# Save the kubeconfig for later use
kind get kubeconfig --name production-sim > ~/.kube/kind-production-sim-config
echo ""
echo "[INFO] Kubeconfig saved to ~/.kube/kind-production-sim-config"
echo "[INFO] Use: export KUBECONFIG=~/.kube/kind-production-sim-config"
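
The subnet handling in the MetalLB step is the part of this script most likely to behave differently on another machine, so it helps to look at it in isolation. The following is a minimal sketch, not part of the original workflow: first_ipv4_subnet and metallb_range are hypothetical helper names, the sample subnets are made up, and the .200 to .250 host range simply mirrors the values used above. It assumes the subnet is a /24 or larger; for a smaller subnet the range could fall outside it.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (illustrative names, not part of kind or MetalLB).

# Read subnets from stdin, one per line, and print the first IPv4 entry.
first_ipv4_subnet() {
  grep -E '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/[0-9]+$' | head -n 1
}

# Turn an IPv4 CIDR into a "<first>-<last>" range inside its first three octets.
# $1 = CIDR, $2 = first host octet (default 200), $3 = last host octet (default 250)
metallb_range() {
  local cidr="$1" first="${2:-200}" last="${3:-250}" base
  case "$cidr" in
    *.*.*.*/*) ;;                                        # looks like an IPv4 CIDR
    *) echo "not an IPv4 CIDR: $cidr" >&2; return 1 ;;   # e.g. an IPv6 subnet
  esac
  base="${cidr%/*}"    # strip the prefix length: 172.18.0.0
  base="${base%.*}"    # strip the last octet:    172.18.0
  printf '%s.%s-%s.%s\n' "$base" "$first" "$base" "$last"
}

# Demo with sample data: an IPv6 entry listed before the IPv4 one.
printf 'fc00:f853:ccd:e793::/64\n172.18.0.0/16\n' | first_ipv4_subnet
metallb_range 172.18.0.0/16
```

Keeping the range derivation in a function makes it easy to fail early with a clear message when the Docker network reports an unexpected subnet, instead of applying an invalid IPAddressPool.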

SRE Warning: kind nodes are containers that share the Docker host's kernel. A workload that loads kernel modules or eBPF programs (for example, netfilter modules or an eBPF-based CNI) changes that shared kernel, so it affects the host and every other node. kind is ideal for control plane testing and CI, but never assume kernel isolation; use Minikube with the KVM2 driver or a real VM for true kernel-level testing.
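
The shared-kernel claim is easy to check for yourself. The sketch below compares the kernel release reported on the host with the one reported inside a kind node container. It assumes the cluster name production-sim used in this section (kind then names the control-plane container production-sim-control-plane) and a native Linux Docker host; on Docker Desktop the nodes share the kernel of Docker's Linux VM, so the comparison with the host will not match. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper names; only `uname` and `docker exec` are real commands here.

# Succeeds when two kernel release strings are identical.
same_kernel() {
  [ "$1" = "$2" ]
}

# Compare the host kernel with the kernel seen inside a kind node container.
# $1 = node container name (assumed default: production-sim-control-plane)
check_kind_kernel() {
  local node="${1:-production-sim-control-plane}"
  local host_release node_release
  host_release="$(uname -r)"
  node_release="$(docker exec "$node" uname -r)" || return 2
  if same_kernel "$host_release" "$node_release"; then
    echo "shared kernel: $host_release"
  else
    echo "different kernels: host=$host_release node=$node_release"
    return 1
  fi
}

# Usage (requires a running kind cluster):
#   check_kind_kernel production-sim-worker
```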


1.6 k3d/k3s for Lightweight Edge & HA Simulation

k3s is a CNCF-certified Kubernetes distribution optimized for resource-constrained environments. k3d wraps k3s in Docker containers, providing instant cluster creation with built-in load balancing.

1.6.1 Multi-Server HA k3d Configuration

This configuration creates a 3-server (control-plane) HA cluster with 2 agent (worker) nodes. The servers use the k3s embedded etcd datastore:

# =============================================================================
# k3d-multi-server-ha.yaml
# Multi-server HA k3d cluster for edge computing simulation
# 3 server nodes (embedded etcd) + 2 agent nodes
# =============================================================================
#
# Usage:
#   k3d cluster create --config k3d-multi-server-ha.yaml
#
apiVersion: k3d.io/v1alpha5
kind: Simple
metadata:
  name: edge-ha
 
# Number of server (control-plane) nodes for HA
servers: 3
 
# Number of agent (worker) nodes
agents: 2
 
# Container image to use for all nodes
image: rancher/k3s:v1.30.1-k3s1
 
# Kubernetes API: published on the host through the built-in load balancer,
# which distributes requests across all server nodes
kubeAPI:
  hostPort: "6443"
 
# Port mappings: traffic hits the built-in load balancer,
# which forwards it to the cluster nodes
ports:
  # HTTP ingress
  - port: 80:80
    nodeFilters:
      - loadbalancer
  # HTTPS ingress
  - port: 443:443
    nodeFilters:
      - loadbalancer
  # Metrics port
  - port: 37000:30000
    nodeFilters:
      - loadbalancer
 
# Registry configuration for local image mirroring
registries:
  create:
    name: k3d-registry.localhost
    host: "0.0.0.0"
    hostPort: "5000"
 
# Options for k3s configuration
options:
  k3s:
    # Extra arguments passed to k3s. The v1alpha5 schema expects one entry
    # per argument, each with its own nodeFilters list.
    # k3d adds --cluster-init to the first server itself when servers > 1,
    # which enables embedded etcd, so it is not listed here.
    extraArgs:
      # Disable built-in Traefik; we'll install our own ingress controller
      - arg: "--disable=traefik"
        nodeFilters:
          - server:*
      # Disable local-storage to use our own provisioner
      - arg: "--disable=local-storage"
        nodeFilters:
          - server:*
      # Disable metrics-server to use our own from components
      - arg: "--disable=metrics-server"
        nodeFilters:
          - server:*
      # Enable Pod Security Standards
      - arg: "--kube-apiserver-arg=enable-admission-plugins=NodeRestriction,PodSecurity"
        nodeFilters:
          - server:*
      # Snapshot embedded etcd every 30 minutes and keep the last 24 snapshots
      - arg: "--etcd-snapshot-schedule-cron=*/30 * * * *"
        nodeFilters:
          - server:*
      - arg: "--etcd-snapshot-retention=24"
        nodeFilters:
          - server:*
      # Set service CIDR (must not conflict with Docker networks)
      - arg: "--service-cidr=10.96.0.0/16"
        nodeFilters:
          - server:*
      # Set cluster CIDR
      - arg: "--cluster-cidr=10.244.0.0/16"
        nodeFilters:
          - server:*
 
  kubeconfig:
    updateDefaultKubeconfig: true
    switchCurrentContext: true
 
  # Runtime configuration
  runtime:
    # Open-file limits applied to the node containers
    ulimits:
      - name: nofile
        soft: 65536
        hard: 65536
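
The comment on --service-cidr says the service range must not conflict with Docker networks. One way to reason about that is to compare the address ranges numerically. The sketch below is an illustration in plain shell arithmetic with hypothetical function names. It handles IPv4 only, and 172.18.0.0/16 is just a typical Docker bridge subnet used as sample input, not a value read from your machine.

```shell
#!/usr/bin/env bash
# Hypothetical IPv4-only helpers for checking whether two CIDR blocks overlap.

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  local a b c d rest
  a="${1%%.*}"; rest="${1#*.}"
  b="${rest%%.*}"; rest="${rest#*.}"
  c="${rest%%.*}"; d="${rest#*.}"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Print the first address of a CIDR block as an integer.
cidr_first() {
  local ip="${1%/*}" prefix="${1#*/}" mask
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  echo $(( $(ip_to_int "$ip") & mask ))
}

# Print the last address of a CIDR block as an integer.
cidr_last() {
  local prefix="${1#*/}" size
  size=$(( 1 << (32 - prefix) ))
  echo $(( $(cidr_first "$1") + size - 1 ))
}

# Succeeds when the two blocks share at least one address.
cidrs_overlap() {
  [ "$(cidr_first "$1")" -le "$(cidr_last "$2")" ] &&
  [ "$(cidr_first "$2")" -le "$(cidr_last "$1")" ]
}

service_cidr="10.96.0.0/16"
cluster_cidr="10.244.0.0/16"
docker_cidr="172.18.0.0/16"   # sample value; inspect your own Docker networks

for pair in "$service_cidr $cluster_cidr" "$service_cidr $docker_cidr" "$cluster_cidr $docker_cidr"; do
  set -- $pair                # split the pair into $1 and $2
  if cidrs_overlap "$1" "$2"; then
    echo "OVERLAP: $1 and $2"
  else
    echo "ok: $1 and $2"
  fi
done
```

Two ranges overlap exactly when each one starts no later than the other ends, which is why the check needs only the first and last address of each block.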

1.6.2 Creating and Verifying the HA k3d Cluster

#!/usr/bin/env bash
# =============================================================================
# k3d-ha-cluster-setup.sh
# Complete k3d HA cluster creation and verification workflow
# =============================================================================
 
set -euo pipefail
 
# --- Prerequisites ---
echo "[PRE] Checking prerequisites..."
for cmd in k3d kubectl docker; do
  if ! command -v "$cmd" &>/dev/null; then
    echo "[ERROR] $cmd not found."
    exit 1
  fi
done
 
# --- Create the HA cluster ---
echo ""
echo "[INFO] Creating HA k3d cluster 'edge-ha'..."
k3d cluster create --config k3d-multi-server-ha.yaml --wait
 
# --- Verify cluster topology ---
echo ""
echo "[INFO] Cluster nodes:"
kubectl get nodes -o wide
 
echo ""
echo "[INFO] Server (control-plane) nodes:"
kubectl get nodes --no-headers -l 'node-role.kubernetes.io/control-plane' -o custom-columns=NAME:.metadata.name
 
echo ""
echo "[INFO] Agent (worker) nodes:"
kubectl get nodes --no-headers -l '!node-role.kubernetes.io/control-plane' -o custom-columns=NAME:.metadata.name
 
# --- Verify etcd cluster health ---
echo ""
echo "[INFO] Checking etcd cluster health..."
# k3s has no etcdctl subcommand, so query the API server's etcd readiness
# check and list the nodes that carry the etcd role label.
kubectl get --raw='/readyz/etcd'
echo ""
kubectl get nodes -l 'node-role.kubernetes.io/etcd=true'
 
# --- Install ingress controller ---
echo ""
echo "[INFO] Installing nginx-ingress controller..."
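# Use the generic "cloud" manifest: the kind-specific one schedules the controller
# only onto nodes labeled ingress-ready=true, a label k3d nodes do not carry.
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/cloud/deploy.yaml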
kubectl wait --namespace ingress-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=180s
 
# --- Deploy a test multi-service application ---
echo ""
echo "[INFO] Deploying test application..."
cat <<'APPLICATION_EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: edge-app
  labels:
    app: edge-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: edge-app
  template:
    metadata:
      labels:
        app: edge-app
    spec:
      topologySpreadConstraints:
        # Spread pods evenly across nodes (topologyKey is the node hostname)
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: edge-app
      containers:
        - name: app
          image: nginx:alpine
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 3
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: edge-app-svc
spec:
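  ports:
    # Port 80 is left to the ingress controller: with k3s ServiceLB, two
    # LoadBalancer Services cannot claim the same host port.
    - port: 8080
      targetPort: 80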
  selector:
    app: edge-app
  type: LoadBalancer
APPLICATION_EOF
 
# Wait for pods to be ready across all nodes
kubectl wait --for=condition=ready pod -l app=edge-app --timeout=120s
 
# Verify pod distribution across nodes
echo ""
echo "[INFO] Pod distribution across nodes:"
kubectl get pods -l app=edge-app -o wide
 
echo ""
echo "=========================================="
echo "  HA k3d Cluster 'edge-ha' Ready"
echo "=========================================="
echo ""
echo "Control Plane HA: 3 servers"
echo "Workers:          2 agents"
echo "API Endpoint:     https://localhost:6443"
echo "HTTP Ingress:     http://localhost:80"
echo "Local Registry:   localhost:5000"
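
The pod listing printed by the script shows placement, but checking maxSkew by eye gets tedious as replica counts grow. As a rough sketch (function names are hypothetical), the helpers below count pods per node from a list of node names on stdin and report the difference between the busiest and the least busy node. Nodes with zero matching pods never appear in the input, so the result is a lower bound on the real skew.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for summarizing pod placement.

# stdin: one node name per pod. stdout: "<node> <pod count>" per node.
pods_per_node() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# stdin: "<node> <pod count>" lines. stdout: max count minus min count.
max_skew() {
  awk '{ n = $2 + 0
         if (NR == 1 || n < min) min = n
         if (NR == 1 || n > max) max = n }
       END { print max - min }'
}

# Usage against the edge-app Deployment (requires the cluster from this section):
#   kubectl get pods -l app=edge-app --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | pods_per_node | max_skew

# Self-contained demo with made-up node names.
printf '%s\n' node-a node-b node-a node-c node-b node-d | pods_per_node
printf '%s\n' node-a node-b node-a node-c node-b node-d | pods_per_node | max_skew
```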

SRE Warning: k3s servers in HA mode run embedded etcd, which needs a majority of members (quorum) to accept writes. Use an odd number of servers (1, 3, 5). Two servers provide NO HA benefit: quorum for two members is two, so if either fails, you lose quorum. Always use 3 or 5 server nodes for real HA simulation.
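
The arithmetic behind this warning is short enough to write down. A cluster of n etcd members needs floor(n/2) + 1 members for quorum and therefore tolerates floor((n-1)/2) failures. The sketch below, with hypothetical function names, prints both values for small clusters.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: majority-quorum arithmetic for an n-member etcd cluster.

# Members required for quorum: floor(n/2) + 1
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Member failures the cluster survives: floor((n-1)/2)
etcd_fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

printf '%-8s %-7s %s\n' servers quorum tolerated_failures
for n in 1 2 3 4 5; do
  printf '%-8s %-7s %s\n' "$n" "$(etcd_quorum "$n")" "$(etcd_fault_tolerance "$n")"
done
```

The table shows why even sizes are wasteful: going from 3 to 4 servers raises the quorum from 2 to 3 while the number of tolerated failures stays at 1.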


Chapter Summary

In this chapter, you have:

  1. Mastered containerd internals — architecture, CRI flow, and production configuration tuning
  2. Understood cgroups v2 — unified hierarchy, resource controllers, and real-time inspection commands
  3. Traced Linux namespaces — the nine namespace types and how Kubernetes leverages each
  4. Built a 3-node Minikube cluster — with ingress, MetalLB, storage, and a complete verification runbook
  5. Architected a 4-node kind cluster — with explicit node configurations, host path mounts, and multiple port mappings
  6. Simulated an HA edge cluster — using k3d with 3-server embedded etcd and LoadBalancer integration

These local environments are not toys — they are production-simulation sandboxes. Every concept tested here translates directly to your production clusters.


Next Steps

Proceed to Chapter 2: Core Kubernetes Primitives & Imperative Control, where you will learn to control the cluster imperatively with kubectl and understand the fundamental workload primitives: Pods, Deployments, ReplicaSets, Services, and self-healing mechanisms.


Appendix: Common Troubleshooting

Symptom | Likely Cause | Resolution
--------|--------------|-----------
failed to create containerd task: cgroup | SystemdCgroup = false with cgroups v2 | Set SystemdCgroup = true in containerd config
The connection to the server localhost:8080 was refused | Kubeconfig not set | minikube update-context or kind export kubeconfig --name <cluster>
Node shows NotReady | CNI not installed | kubectl apply -f https://raw.githubusercontent.com/.../weave-daemonset.yaml
LoadBalancer pending forever | MetalLB not configured | Deploy IPAddressPool and L2Advertisement resources
OOMKilled pod in CrashLoopBackOff | Memory limit too low | Check memory.events and increase resources.limits.memory
Pods stuck in Pending (unbound PVC) | StorageClass missing | Install local-path-provisioner or similar
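
For the OOMKilled row, the value to look at in memory.events is the oom_kill counter. The sketch below (hypothetical function name) extracts it from a file in the cgroup v2 memory.events format. The path you pass in depends on where the pod's cgroup lives on the node, which this example does not try to guess; the demo uses a sample file with made-up values instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the oom_kill counter from a cgroup v2
# memory.events file, or 0 if the file has no such line.
oom_kill_count() {
  awk '$1 == "oom_kill" { print $2; found = 1 }
       END { if (!found) print 0 }' "$1"
}

# Demo with a sample file (values are made up).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
low 0
high 0
max 12
oom 3
oom_kill 3
EOF
oom_kill_count "$sample"
rm -f "$sample"
```

A counter that keeps rising between restarts points at the memory limit rather than at an application crash, which is the case where raising resources.limits.memory is the right fix.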