Kubernetes HA Cluster on Proxmox LXC – MetalLB, HAproxy, embedded etcd

Introduction

I wanted to set up a highly available Kubernetes cluster on my Proxmox datacenter. I went with HA because I have more than one Proxmox server, so my idea was to deploy the containers on different servers and use HA to keep my services running even if one server fails. We also use HAproxy and keepalived on a pair of containers (one on each Proxmox host) so the server nodes are reachable through a single virtual IP.

I decided to go with K3s as my Kubernetes distribution because it's lightweight and easy to set up.

In K3s, there are two approaches to HA: one with an external database and one with embedded etcd. I decided against the external database because I would need to set up two VMs/containers with replication on my Proxmox hosts just to avoid introducing another single point of failure.

The setup I'm using, in short:

  • Rocky Linux 9 LXCs on 2 Proxmox Hosts
  • K3s, 3 server nodes and 3 agent nodes
  • HAproxy, 2 LXCs
  • no traefik, no servicelb – MetalLB instead
  • HA through embedded etcd

Architecture

General preparation

In this setup I use 6 LXCs for the 6 nodes and 2 LXCs for HAproxy, which requires 9 IP addresses (8 containers plus one virtual IP). We also reserve a range of 11 IPs for the load balancer, resulting in 20 IPs total.

I’m using:

Name                   IP             Host
server-1               10.1.20.40     Proxmox 1
server-2               10.1.20.41     Proxmox 2
server-3               10.1.20.42     Proxmox 2
agent-1                10.1.20.43     Proxmox 1
agent-2                10.1.20.44     Proxmox 2
agent-3                10.1.20.45     Proxmox 2
haproxy1               10.1.20.46     Proxmox 1
haproxy2               10.1.20.47     Proxmox 2
HAproxy VIP            10.1.20.5      HAproxys
Loadbalancer IP range  10.1.20.50-60  K3s LB

Make sure to install kubectl and helm on your machine! All the .yaml files I use below are in my home directory; make sure to be in the directory of the files when using kubectl later on.

Preparing LXC on Proxmox

First, we need to download the Rocky Linux image and upload it to a Proxmox storage that holds CT templates.

The download can be found here:

https://us.lxd.images.canonical.com/images/rockylinux/9/amd64/default/

The newest builds can be found in the dated folders there. The file we need is "rootfs.tar.xz". Download it, rename it to something like rocky9.tar.xz and upload it to your Proxmox storage.
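From the shell this can also be done directly on the Proxmox host; a sketch assuming the default local storage (the URL is a placeholder – pick the newest build folder on the page above):

```shell
# On the Proxmox host: download the newest build, rename it, and drop it
# into the CT template cache of the "local" storage (adjust the path if
# you use a different storage).
wget "<URL of the newest rootfs.tar.xz>" -O rocky9.tar.xz
mv rocky9.tar.xz /var/lib/vz/template/cache/rocky9.tar.xz
```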

Now we can start creating the LXCs.

Click on Create CT in Proxmox and check Advanced in the bottom:

Make sure to uncheck "Unprivileged container", enter a root password and optionally insert a public SSH key. Set the hostname to the node name, e.g. server-1.

In the next window select your Storage with the CT Template, select the Rocky 9 template and continue.

Select the storage where the LXC should be created and assign a disk size. I just use 16 GB; adjust this to fit your storage/needs.

Same as before, assign CPUs – I use 4.

Again, assign memory. I'm using 4 GB with 0 MB swap.

In the next two windows assign your network settings: bridge, VLAN ID, IP (from the table above), gateway and DNS server. I also disable the firewall here.

Once you're on the Confirm page, make sure to uncheck "Start after created"! We have to adjust some settings before booting the LXC.

In the /etc/pve/lxc directory, you’ll find files called XXX.conf, where XXX are the ID numbers of the containers we just created. Using your text editor of choice, edit the files for the containers we created to add the following lines:

lxc.apparmor.profile: unconfined
lxc.cgroup.devices.allow: a
lxc.cap.drop:
lxc.mount.auto: "proc:rw sys:rw"

Note: It’s important that the container is stopped when you try to edit the file, otherwise Proxmox’s network filesystem will prevent you from saving it.

In order, these options (1) disable AppArmor, (2) allow the container’s cgroup to access all devices, (3) prevent dropping any capabilities for the container, and (4) mount /proc and /sys as read-write in the container.

Next, we need to publish the kernel boot configuration into the container. Normally, this isn’t needed by the container since it runs using the host’s kernel, but the Kubelet uses the configuration to determine various settings for the runtime, so we need to copy it into the container. To do this, first start the container using the Proxmox web UI, then run the following command on the Proxmox host:

pct push <container id> /boot/config-$(uname -r) /boot/config-$(uname -r)

Finally, in each of the containers, we need to make sure that /dev/kmsg exists. Kubelet uses this for some logging functions, and it doesn’t exist in the containers by default. For our purposes, we’ll just alias it to /dev/console. In each container, create the file /usr/local/bin/conf-kmsg.sh with the following contents:

#!/bin/sh -e
if [ ! -e /dev/kmsg ]; then
    ln -s /dev/console /dev/kmsg
fi

mount --make-rshared /

This script symlinks /dev/console to /dev/kmsg if the latter does not exist. We configure it to run when the container starts with a systemd one-shot service. Create the file /etc/systemd/system/conf-kmsg.service with the following contents:

[Unit]
Description=Make sure /dev/kmsg exists

[Service]
Type=simple
RemainAfterExit=yes
ExecStart=/usr/local/bin/conf-kmsg.sh
TimeoutStartSec=0

[Install]
WantedBy=default.target

Then make the script executable and enable the service:

chmod +x /usr/local/bin/conf-kmsg.sh
systemctl daemon-reload
systemctl enable --now conf-kmsg

Repeat these steps for all six containers: three times for the server nodes and three times for the agent nodes.
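To avoid doing this by hand six times, the config tweaks and the kernel-config push can be scripted on the Proxmox host. The container IDs 101–106 below are hypothetical – substitute the IDs Proxmox actually assigned, and run the loop on the host that owns each container:

```shell
# Append the required LXC options to each container's config
# (run while the containers are stopped).
for id in 101 102 103 104 105 106; do
  cat >> /etc/pve/lxc/${id}.conf <<'EOF'
lxc.apparmor.profile: unconfined
lxc.cgroup.devices.allow: a
lxc.cap.drop:
lxc.mount.auto: "proc:rw sys:rw"
EOF
done

# After starting the containers, push the host's kernel config into each.
for id in 101 102 103 104 105 106; do
  pct push ${id} /boot/config-$(uname -r) /boot/config-$(uname -r)
done
```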

Setting up HAproxy LXCs

Set up two LXC containers with the Rocky 9 image as before, but skip the post-creation steps – the Kubernetes-specific tweaks are not needed here. You can boot the two containers right after creation.

Do the following on both containers.

We start by installing haproxy and keepalived on the LXC.

dnf install haproxy keepalived

After this, we have to enable IP forwarding and non-local address binding so the VIP can be held and packets forwarded to the backend servers.

IP forwarding:

sed -i 's/#net.ipv4.ip_forward=1/net.ipv4.ip_forward=1/' /etc/sysctl.conf

Bind to non-local addresses:

echo "net.ipv4.ip_nonlocal_bind = 1" >> /etc/sysctl.conf

Reload sysctl settings:

sysctl -p

Now we edit the haproxy config file with vi:

vi /etc/haproxy/haproxy.cfg

Remove the default config and insert the following:

global
    log /dev/log local0 warning
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    stats socket /var/lib/haproxy/stats

defaults
    log     global
    option  httplog
    option  dontlognull
    timeout connect 5000
    timeout client 50000
    timeout server 50000

frontend kube-apiserver
    bind *:6443
    mode tcp
    option tcplog
    default_backend kube-apiserver

backend kube-apiserver
    mode tcp
    option tcp-check
    balance roundrobin
    default-server inter 10s downinter 5s rise 2 fall 2 slowstart 60s maxconn 250 maxqueue 256 weight 100
    server kube-controller-server-1 10.1.20.40:6443 check
    server kube-controller-server-2 10.1.20.41:6443 check
    server kube-controller-server-3 10.1.20.42:6443 check
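Before restarting HAproxy, it's worth validating the file – a typo here takes the apiserver frontend down on both load balancers:

```shell
# Syntax-check the configuration without touching the running instance;
# haproxy reports the offending line if something is wrong.
haproxy -c -f /etc/haproxy/haproxy.cfg
```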

Save the file and run the following command to restart HAproxy.

systemctl restart haproxy

Make it persist through reboots:

systemctl enable haproxy

Make sure to configure HAproxy on the other LXC as well.

Keepalived Configuration

Keepalived must be installed on both machines; the configuration differs slightly between them.

Run the following command to configure Keepalived.

vi /etc/keepalived/keepalived.conf

Here is the configuration (haproxy1) for your reference:

global_defs {
  notification_email {
  }
  router_id LVS_DEVEL
  vrrp_skip_check_adv_addr
  vrrp_garp_interval 0
  vrrp_gna_interval 0
}
   
vrrp_script chk_haproxy {
  script "killall -0 haproxy"
  interval 2
  weight 2
}
   
vrrp_instance haproxy-vip {
  state MASTER   # Master on haproxy1, BACKUP on second
  priority 200   # 200 on haproxy1, 100 on second
  interface eth0                       # Network card name
  virtual_router_id 60
  advert_int 1
  authentication {
    auth_type PASS
    auth_pass 1111
  }
  unicast_src_ip 10.1.20.46      # The IP address of this machine (haproxy1)
  unicast_peer {
    10.1.20.47                 # The IP address of the peer machine (haproxy2)
  }
   
  virtual_ipaddress {
    10.1.20.5/24                 # The VIP address
  }
   
  track_script {
    chk_haproxy
  }
}

The config on haproxy2 looks a little different:

global_defs {
  notification_email {
  }
  router_id LVS_DEVEL
  vrrp_skip_check_adv_addr
  vrrp_garp_interval 0
  vrrp_gna_interval 0
}
   
vrrp_script chk_haproxy {
  script "killall -0 haproxy"
  interval 2
  weight 2
}
   
vrrp_instance haproxy-vip {
  state BACKUP   # Master on haproxy1, BACKUP on second
  priority 100   # 200 on haproxy1, 100 on second
  interface eth0                       # Network card name
  virtual_router_id 60
  advert_int 1
  authentication {
    auth_type PASS
    auth_pass 1111
  }
  unicast_src_ip 10.1.20.47      # The IP address of this machine (haproxy2)
  unicast_peer {
    10.1.20.46                 # The IP address of the peer machine (haproxy1)
  }
   
  virtual_ipaddress {
    10.1.20.5/24                 # The VIP address
  }
   
  track_script {
    chk_haproxy
  }
}

Save the files and run the following command to restart keepalived.

systemctl restart keepalived

Make it persist through reboots:

systemctl enable keepalived
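To check that the failover works, assuming the interface and addresses from the configs above:

```shell
# On haproxy1 (the MASTER) the VIP should be attached to eth0:
ip addr show eth0 | grep 10.1.20.5

# Simulate a failure by stopping haproxy on haproxy1; within a few
# seconds chk_haproxy fails, keepalived drops its priority, and the
# same grep should succeed on haproxy2 instead.
systemctl stop haproxy
```

Don't forget to start haproxy on haproxy1 again afterwards.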

Setting up the Container OS & K3s

Now that we’ve got the containers up and running, we will set up Rancher K3s on them. Luckily, Rancher intentionally makes this pretty easy.

Setting up server nodes

Starting on the first server node, we'll run the following command to set up K3s:

curl -sfL https://get.k3s.io | sh -s - server --token=YOURTOKENHERE  --tls-san dns.name.lab.local --tls-san 10.1.20.5 --cluster-init --disable servicelb --disable traefik

We run the setup with a token (just generate one or use a random string), --cluster-init on the FIRST NODE ONLY, and disable the default load balancer (servicelb) and the Traefik ingress. --tls-san is needed so the virtual IP and DNS name are added to the certificates.
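For the token, any hard-to-guess string works; one way to generate it:

```shell
# Print a 32-character random hex string to use as the cluster token.
openssl rand -hex 16
```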

On the next 2 nodes we basically run the same command – minus the --cluster-init – but with --server to point those nodes at the first node we set up.

curl -sfL https://get.k3s.io | sh -s - server --token=YOURTOKENHERE --tls-san dns.name.lab.local --tls-san 10.1.20.5 --server https://10.1.20.40:6443 --disable servicelb --disable traefik

Once everything is done, you can copy /etc/rancher/k3s/k3s.yaml to ~/.kube/config on your local machine.

Edit the k3s.yaml: the server IP in there is probably set to 127.0.0.1 – change this to 10.1.20.5 (the VIP of the HAproxy pair) and you should be able to see your new cluster using kubectl get nodes!

NAME       STATUS   ROLES                       AGE   VERSION
server-1   Ready    control-plane,etcd,master   17h   v1.27.7+k3s2
server-2   Ready    control-plane,etcd,master   17h   v1.27.7+k3s2
server-3   Ready    control-plane,etcd,master   17h   v1.27.7+k3s2
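Scripted from the workstation, the copy and edit look roughly like this (assuming root SSH access to server-1):

```shell
# Fetch the kubeconfig from the first server node ...
scp root@10.1.20.40:/etc/rancher/k3s/k3s.yaml ~/.kube/config
# ... and point it at the HAproxy VIP instead of localhost.
sed -i 's/127.0.0.1/10.1.20.5/' ~/.kube/config
kubectl get nodes
```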

Adding the agent nodes

Now we go back to our remaining three LXCs, the agent nodes. We just have to join them to the cluster, pointing them at the HA IP (10.1.20.5). For this we run the same command on all three nodes:

curl -sfL https://get.k3s.io | sh -s - agent --token=YOURTOKENHERE --server https://10.1.20.5:6443

Do this step on all nodes and finally check with kubectl get nodes:

NAME       STATUS   ROLES                       AGE   VERSION
agent-1    Ready    <none>                      59s   v1.27.7+k3s2
agent-2    Ready    <none>                      29s   v1.27.7+k3s2
agent-3    Ready    <none>                      14s   v1.27.7+k3s2
server-1   Ready    control-plane,etcd,master   17h   v1.27.7+k3s2
server-2   Ready    control-plane,etcd,master   17h   v1.27.7+k3s2
server-3   Ready    control-plane,etcd,master   17h   v1.27.7+k3s2

At this point our HA cluster is up and running! In the next step, we set up MetalLB as our load balancer to make services reachable for "end users".

Load balancer

Since K3s is fully compatible with Helm out of the box, we just use Helm to install MetalLB.

We assign a range of IPs that our load balancer can use to expose services. In my case I just use 10.1.20.50 to 10.1.20.60.

We start by adding the MetalLB repo:

helm repo add metallb https://metallb.github.io/metallb

In the next step we install it and then configure it. Create a file metallb-values.yaml with the following content – these are two MetalLB custom resources (an IPAddressPool and an L2Advertisement) that we will apply with kubectl after the install:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
  - 10.1.20.50-10.1.20.60
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
spec:
  ipAddressPools:
  - default-pool

Next, we install MetalLB.

helm install metallb metallb/metallb --create-namespace \
--namespace metallb-system --wait

And finally, we apply the metallb-values.yaml to it.

kubectl apply -f metallb-values.yaml

Now, verify all pods are there with kubectl get pods -n metallb-system:

NAME                                  READY   STATUS    RESTARTS   AGE
metallb-controller-6cb58c6c9b-jtd58   1/1     Running   0          3m24s
metallb-speaker-4s862                 4/4     Running   0          3m24s
metallb-speaker-lv2bf                 4/4     Running   0          3m24s
metallb-speaker-m2wcz                 4/4     Running   0          3m24s
metallb-speaker-rrdw5                 4/4     Running   0          3m24s
metallb-speaker-tg2xr                 4/4     Running   0          3m24s
metallb-speaker-vn24x                 4/4     Running   0          3m24s

Kubernetes Dashboard

To test everything, we install the Kubernetes Dashboard and expose it through our load balancer.

kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.7.0/aio/deploy/recommended.yaml

The Dashboard is now running; we just have to apply some more config to expose it through the load balancer.

Create a file dashboard-lb.yaml and fill it with the following code:

apiVersion: v1
kind: Service
metadata:
  name: kubernetes-dashboard-lb
  namespace: kubernetes-dashboard
spec:
  type: LoadBalancer
  ports:
    - port: 443
      protocol: TCP
      targetPort: 8443
  selector:
    k8s-app: kubernetes-dashboard

And now apply it with kubectl apply -f dashboard-lb.yaml

After a few seconds, check the IP it got assigned with kubectl -n kubernetes-dashboard get svc:

NAME                        TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
dashboard-metrics-scraper   ClusterIP      10.43.16.74   <none>        8000/TCP        2m6s
kubernetes-dashboard        ClusterIP      10.43.99.34   <none>        443/TCP         2m6s
kubernetes-dashboard-lb     LoadBalancer   10.43.125.0   10.1.20.50    443:30167/TCP   119s

Now we can open https://10.1.20.50 in the web browser and we are presented with the login page:

As per the documentation from the Kubernetes Dashboard:

To protect your cluster data, Dashboard deploys with a minimal RBAC configuration by default. Currently, Dashboard only supports logging in with a Bearer Token. To create a token for this demo, you can follow our guide on creating a sample user.

https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/

Because I want to access the Dashboard frequently, I went with a long-lived bearer token. I created a file db-admin.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: kubernetes-dashboard
---
apiVersion: v1
kind: Secret
metadata:
  name: admin-user
  namespace: kubernetes-dashboard
  annotations:
    kubernetes.io/service-account.name: "admin-user"   
type: kubernetes.io/service-account-token  

And applied it with kubectl apply -f db-admin.yaml. Next step is to create a ClusterRoleBinding, crb.yaml:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: admin-user
  namespace: kubernetes-dashboard

Apply it, as always: kubectl apply -f crb.yaml

Now we extract the token with:

kubectl get secret admin-user -n kubernetes-dashboard -o jsonpath='{.data.token}' | base64 -d

The string we get now can be used to access our dashboard. Once we’re in, we can check out all the nodes:

The next thing we really need is an ingress controller. K3s ships with Traefik, but we disabled it during installation because I wanted to use the NGINX ingress controller.

NGINX Ingress Controller

With the ingress controller we can assign DNS names to services and expose them. Later on we can also automatically generate certificates for those hostnames to use HTTPS securely.

First, we add the ingress-nginx repo to helm (the Kubernetes community controller, not the commercial one from nginx.com):

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx

After this, we can install nginx. We use the default settings but create a namespace for it:

helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace

Now, we create a service that maps ports 80 and 443 of nginx to MetalLB. This way nginx gets one of our load balancer IPs and is reachable via HTTP(S) on that IP. We create ingress-controller-lb.yaml:

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller-loadbalancer
  namespace: ingress-nginx
spec:
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 80
    - name: https
      port: 443
      protocol: TCP
      targetPort: 443
  type: LoadBalancer

And apply it: kubectl apply -f ingress-controller-lb.yaml. Now we can deploy services and pods and expose them on nginx. We can do one final kubectl get services -n ingress-nginx to see what IP from our range the ingress controller got:

NAME                                 TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller             LoadBalancer   10.43.147.154   10.1.20.51    80:30240/TCP,443:32350/TCP   5h31m
ingress-nginx-controller-admission   ClusterIP      10.43.78.109    <none>        443/TCP                      5h31m

In my example we got 10.1.20.51.

Example: deploying the uptime-kuma container

To run a container on our cluster we have to create a Deployment, a Service that maps the port(s), and an Ingress to expose it through nginx.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: uptime-kuma
  namespace: uptime-kuma
spec:
  selector:
    matchLabels:
      name: uptime-kuma-nginx-backend
  template:
    metadata:
      labels:
        name: uptime-kuma-nginx-backend
    spec:
      containers:
        - name: backend
          image: louislam/uptime-kuma:1
          imagePullPolicy: Always
          ports:
            - containerPort: 3001
---
apiVersion: v1
kind: Service
metadata:
  name: uptime-kuma-nginx-service
  namespace: uptime-kuma
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 3001
  selector:
    name: uptime-kuma-nginx-backend
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: uptime-kuma-nginx-ingress
  namespace: uptime-kuma
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/use-regex: "true"
spec:
  rules:
  - host: kuma.homelab.local
    http:
      paths:
        - path: /(.*)
          pathType: Prefix
          backend:
            service:
              name: uptime-kuma-nginx-service
              port:
                number: 8080

We save this as kuma-http.yaml and, before we deploy it, create the namespace with kubectl create namespace uptime-kuma. Then deploy it with kubectl apply -f kuma-http.yaml -n uptime-kuma.

Since we set the host to the FQDN kuma.homelab.local, we have to make sure our DNS server points this name to 10.1.20.51. Now when we open it in our browser we are greeted by the setup page of Uptime Kuma!
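If the DNS record isn't in place yet, curl can pin the hostname to the ingress IP for a quick test (hostname and IP from the example above):

```shell
# Resolve kuma.homelab.local to the ingress LB IP for this request only;
# a working ingress answers with the Uptime Kuma page.
curl --resolve kuma.homelab.local:80:10.1.20.51 http://kuma.homelab.local/
```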