
High Available k3s kubernetes cluster with keepalived, galera and longhorn

Published: 09-07-2024 22:30 | Author: Remy van Elst



After my first adventure with Kubernetes, getting started with k3s on my small 3-node ARM cluster that boots via PXE / NFS, I noticed that there is only one k3s node that has the control-plane,master role. If that node fails you can no longer manage the cluster. Other nodes can fail and the workloads (pods) will eventually be restarted after 5 minutes, but this node is special. Time to change that and make it a high available cluster. K3s supports high-availability with embedded etcd and with external databases like MySQL and PostgreSQL. etcd will thrash your storage (SD cards) so I decided to go with a MySQL cluster using Galera for the database and keepalived for the High Available Cluster IP. This guide will show you how to configure the HA database and HA-IP, and I'll also set up longhorn for high-available block storage inside kubernetes. The end result is that I can pull the power from any two of the three nodes without the k3s cluster or workloads going down.

You need at least 3 nodes for a cluster! 2 nodes are prone to split-brain situations and are harder to set up. If you go beyond 3 nodes, make sure you always have an odd number of cluster members (3, 5, 7, etc.) so that in the case of a failure there is always a majority for quorum.

My k3s cluster now has 4 nodes, but one of them is only a worker node, not part of the high-available control plane setup, so technically, the high-available cluster uses 3 nodes.

The version of Kubernetes/k3s I use for this article is v1.29.6+k3s1.

Overview

Here is a high-level diagram of the setup:

ha-k3s.png

The blue parts (keepalived & galera) inside a node in the diagram provide the high availability for the kubernetes (k3s) control plane. For other high-availability options for k3s, consult their documentation.

keepalived provides a high-available IP via VRRP. All three nodes run keepalived, one of them being the MASTER and the other two being BACKUP nodes. This HA-IP is what the Galera database will use and what all k3s nodes (that are not part of the HA control plane) will communicate with. I'm also using it as the LoadBalancer IP to make services in the cluster accessible from the outside.

Galera is a synchronous multi-master replication plug-in for InnoDB. A client can read and write to any node in a cluster and changes are applied to all servers. Galera cluster support is built into MariaDB nowadays. I used to use Percona XtraDB Cluster, but since it's now built into MariaDB I decided to use that. Galera makes MySQL clustering very easy. Next to the multi-master part (no failover since there are no slaves) it also automatically syncs up any new nodes, skipping lengthy manual provisioning.

The rest of the components in the diagram run "inside" of kubernetes itself.

Longhorn provides high-available block storage. Any Persistent Volumes in your kubernetes cluster are replicated over all three nodes, including automated failover. In k3s, volumes use local storage by default, meaning that if a node goes down, the volume is not accessible on the other node where the pods will start up again. With Longhorn that problem is gone, making the k3s cluster truly high-available. Otherwise our control plane would survive a node failure, but the workloads would not function correctly afterwards.

The rest of the article will focus on setting up the individual components. For simplicity's sake I'm showing the manual setup. The parts are small enough to easily convert to Ansible or your favorite configuration management tool.

It's important that you install the components in the order shown here. First keepalived, then galera, then k3s. Before deploying any workload inside of kubernetes, make sure you have set up Longhorn for HA storage.

My cluster uses Armbian, a Debian distribution for small board computers. The guide should work on regular Debian as well.

The High-Available IP address (managed by keepalived) for the cluster is:

  • 192.0.2.50

The 3 cluster nodes have the following IP addresses:

  • 192.0.2.60
  • 192.0.2.61
  • 192.0.2.62

The worker node in the cluster has the following IP address:

  • 192.0.2.63

The HA-IP (192.0.2.50) must be set up before the database or k3s can be installed.

keepalived

I've been using keepalived for over 10 years; that article is from 2014. For a simple setup with just an HA-IP like this I prefer it over corosync, since it's easier to set up.

Install the required components on all 3 nodes:

apt install dirmngr software-properties-common rsync keepalived

Choose one of your 3 nodes to be the master node. This node will by default have the high-available IP. It does not matter which node it is, although if your cluster has different hardware it's best to choose the most powerful node.

Edit the configuration file:

vim /etc/keepalived/keepalived.conf

Place the following:

vrrp_instance VI_1 {
  state MASTER
  interface end0
  virtual_router_id 51
  priority 100
  advert_int 1
  authentication {
    auth_type PASS
    auth_pass 
  }
  virtual_ipaddress {
    192.0.2.50
  }
}

Change the following parameters:

  • interface: your physical network interface, use ip addr to find it
  • auth_pass to a secret password.
  • virtual_ipaddress to the HA-IP you want to use.

On the other two nodes, also edit the configuration file:

vim /etc/keepalived/keepalived.conf

Place the following:

vrrp_instance VI_1 {
  state BACKUP
  interface end0
  virtual_router_id 51
  priority 
  advert_int 1
  authentication {
    auth_type PASS
    auth_pass 
  }
  virtual_ipaddress {
    192.0.2.50
  }
}

Change the following parameters:

  • auth_pass to match the password in the master node config file
  • priority: each node must have a different (lower) priority. Node2 should be 50, Node3 should be 49 and so on.
  • interface: your physical network interface, use ip addr to find it
  • virtual_ipaddress to the HA-IP you configured on the master node.

After configuring keepalived restart the service on all three nodes:

systemctl restart keepalived

You can test the HA-IP by logging in to it via ssh, for example. Or you can send SIGUSR1 to the keepalived process and check /tmp/keepalived.data:

kill -SIGUSR1 $(pidof keepalived)
cat  /tmp/keepalived.data

Output:

------< Global definitions >------
[...]
 VRRP IPv4 mcast group = 224.0.0.18
 VRRP IPv6 mcast group = ff02::12
 Gratuitous ARP delay = 5
 Gratuitous ARP repeat = 5
 [...]
------< VRRP Topology >------
 VRRP Instance = VI_1
   VRRP Version = 2
   State = BACKUP
   Master router = 192.0.2.60
   Master priority = 100
   Flags: none
   Wantstate = BACKUP
   Number of config faults = 0
   Number of interface and track script faults = 0
   Number of track scripts init = 0
   Last transition = 1720124853.373151 (Thu Jul  4 20:27:33.373151 2024)
   Read timeout = 1720131680.650708 (Thu Jul  4 22:21:20.650708 2024)
   Master down timer = 3804687 usecs
   Interface = end0
   Using src_ip = 192.0.2.61
   Multicast address 224.0.0.18
   [...]
   Virtual Router ID = 51
   Priority = 50
   Effective priority = 50
   Total priority = 50
   Advert interval = 1 sec
   Virtual IP (1):
     192.0.2.50 dev end0 scope global
   fd_in 13, fd_out 14
  [...]

Test the keepalived failover by shutting down your master node. Check the above file on the two other nodes and login via SSH on the HA-IP to see that the other node has taken over.
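
You can also check directly which node currently holds the HA-IP, for example with this one-liner (assuming the interface is named end0, as in my setup):

ip -4 addr show dev end0 | grep 192.0.2.50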

MariaDB Galera cluster

On all three nodes, add the MariaDB APT repository:

add-apt-repository 'deb http://mirrors.digitalocean.com/mariadb/repo/11.5/debian bookworm main'
apt-key adv --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 0xF1656F24C74CD1D8

On their page there is also a setup script, but that requires piping curl to bash which is very insecure. Setting up the repository manually is not much more difficult.

Update the local cache and install mariadb:

apt update
apt install mariadb-server 

This article exposes MariaDB to your network. On a real production setup, please use separate VLANs for management and this kind of traffic. For a home-lab setup like mine, this is fine.

Make sure mariadb is accessible over the network (instead of localhost):

vim  /etc/mysql/mariadb.conf.d/50-server.cnf

Update bind-address, change 127.0.0.1 to 0.0.0.0:

# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
bind-address            = 0.0.0.0
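
After changing the bind address, restart MariaDB and verify that it now listens on all interfaces. The ss command below is just one way to check, assuming iproute2 is installed:

systemctl restart mariadb
ss -tlnp | grep 3306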

Make sure the root user has a secure password. By default that user has no password. Open a MySQL prompt:

mariadb -uroot

Set a secure password:

set password = password("SuperS3cUr3P@ssw0rd");
quit;

Now it's time to set up the Galera Cluster. Create the configuration file:

vim /etc/mysql/conf.d/galera.cnf

Paste the following:

[mysqld]
binlog_format=ROW
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
bind-address=0.0.0.0

# Galera Provider Configuration
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so

# Galera Cluster Configuration
wsrep_cluster_name="k3s_cluster"
wsrep_cluster_address="gcomm://192.0.2.60,192.0.2.61,192.0.2.62"

# Galera Synchronization Configuration
wsrep_sst_method=rsync

# Galera Node Configuration
wsrep_node_address="192.0.2.60"
wsrep_node_name="node1"

Change the following parameters to be the same on all 3 nodes:

  • wsrep_cluster_name: the name of your cluster
  • wsrep_cluster_address: add all IPs of the cluster nodes

Change the following parameters on each node, with values unique to that specific node (an example for node 2 is shown below the list):

  • wsrep_node_address: the IP address of that node
  • wsrep_node_name: the name of that node
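
For example, on node 2 the node-specific section would look like this, using the example IP addresses from this article:

# Galera Node Configuration (node 2)
wsrep_node_address="192.0.2.61"
wsrep_node_name="node2"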

Time to bootstrap the cluster. Stop MySQL on all nodes:

systemctl stop mysql  

On node 1, start up galera:

galera_new_cluster # no output

Check:

mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_size'"

Output:

+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 1     |
+--------------------+-------+

On node 2 (and node3):

systemctl start mariadb

Check cluster status on node 1:

+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 2     |
+--------------------+-------+

Repeat the above step on node3, the value should increase to 3.
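
To check a bit more than just the cluster size, you can also query a few other wsrep status variables; wsrep_cluster_status should report Primary and wsrep_ready should report ON:

mysql -u root -p -e "SHOW STATUS WHERE Variable_name IN ('wsrep_cluster_size','wsrep_cluster_status','wsrep_ready')"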

Galera Cluster auto-reboot

Normally you would not want to do this; you would simply assume the cluster stays up. In my case I turn off the Orange Pi Zero 3 computers when I'm done experimenting for the day. When restarting a Galera Cluster you need to follow specific steps and do checks to make sure everything starts up correctly without data loss. The same goes for when one node drops out.

But that also means that every time I start my cluster I need to manually intervene and fix the Galera cluster. In this case I don't care much about data loss or correct startup order; I want things to boot without manual intervention. So here is something you should never do in production.

Edit the mariadb systemd service:

systemctl edit mariadb  

Add the following (in between the comment lines):

### Editing /etc/systemd/system/mariadb.service.d/override.conf
### Anything between here and the comment below will become the new contents of the file

[Service]
Restart=on-failure
RestartSec=5s

### Lines below this comment will be discarded

By default mariadb only restarts on-abort, not on-failure. I want it to keep retrying to start up continuously.

Next is the most dangerous part, a cronjob to forcefully "fix" the galera cluster:

* *  *   *   *   sleep 30 ; /usr/bin/systemctl is-active --quiet mariadb.service || { /usr/bin/sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/g' /var/lib/mysql/grastate.dat ; /usr/bin/galera_new_cluster || { sleep 60 ; /sbin/reboot; } } 2>&1 | /usr/bin/logger

This cronjob waits 30 seconds and then, if the mariadb service is not running, sets safe_to_bootstrap to 1 and bootstraps the cluster. If that bootstrapping fails, it sleeps for a minute and reboots.

If mariadb is running it does nothing. If the bootstrapping succeeds, it also does not reboot.
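
If you prefer something more readable than a one-liner, the same logic could also live in a small script called from cron (a hypothetical /usr/local/bin/galera-force-bootstrap.sh), roughly like this:

#!/bin/sh
# Same logic as the cronjob above. Dangerous, never do this in production.
sleep 30
# Do nothing if mariadb is already running
if ! /usr/bin/systemctl is-active --quiet mariadb.service; then
  # Force the local node to be bootstrappable (risks data loss)
  /usr/bin/sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/g' /var/lib/mysql/grastate.dat
  # Try to bootstrap a new cluster; if that fails, wait a minute and reboot
  if ! /usr/bin/galera_new_cluster; then
    sleep 60
    /sbin/reboot
  fi
fi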

In normal operations I issue the shutdown command like this:

ssh -o ConnectTimeout=5 -o BatchMode=yes -o StrictHostKeyChecking=no root@192.0.2.60 poweroff; 
ssh -o ConnectTimeout=5 -o BatchMode=yes -o StrictHostKeyChecking=no root@192.0.2.61 poweroff;
sleep 30;
ssh -o ConnectTimeout=5 -o BatchMode=yes -o StrictHostKeyChecking=no root@192.0.2.62 poweroff; 

With this wait time it results in an "orderly" shutdown with one node being the "last" node for the Galera cluster, so most of the time this cronjob does nothing.

Continuing on with the k3s parts of the database setup.

k3s mysql database setup

This is simple and straightforward, just like you would normally create a database and user. Login to the database:

mysql -uroot

Create the specific database for k3s:

CREATE DATABASE k3s;

Create a user with password for k3s:

CREATE USER 'k3s'@'%' IDENTIFIED BY 'Supers3cr3tP4ssw0rd';

Grant that user all permissions on that database:

GRANT ALL PRIVILEGES ON k3s.* TO 'k3s'@'%';
FLUSH PRIVILEGES;
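
Optionally, verify the grants for the new user:

SHOW GRANTS FOR 'k3s'@'%';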

Installing K3S

On the first node (of the three) you install K3S like you would normally do but with two extra arguments.

The first is --tls-san, which you set to the keepalived high-available cluster IP (the HA-IP). All nodes will connect to k3s and the database via this high-available IP.

The second argument is --datastore-endpoint, which is where you provide your newly created database connection information, using the HA-IP as well.

curl -sfL https://get.k3s.io | sh -s - \
  --datastore-endpoint="mysql://k3s:Supers3cr3tP4ssw0rd@tcp(192.0.2.50:3306)/k3s" \
  --tls-san="192.0.2.50"

After installation finishes, get the token. You need that on the other nodes:

cat /var/lib/rancher/k3s/server/node-token

Output:

K10a[...]418::server:7a8[...]8e441

On the second and third master node, use the following command. It's the same as above, except that the token is also provided:

curl -sfL https://get.k3s.io | sh -s - server \
  --datastore-endpoint="mysql://k3s:Supers3cr3tP4ssw0rd@tcp(192.0.2.50:3306)/k3s" \
  --token="K10a[...]418::server:7a8[...]8e441" \
  --tls-san="192.0.2.50"

Installation of extra agent (worker) nodes is done like normal, just provide the token and the HA-IP:

curl -sfL https://get.k3s.io | \
  K3S_URL="https://192.0.2.50:6443" \
  K3S_TOKEN="K10a[...]418::server:7a8[...]8e441" \
  sh -s -

After installation you can issue the following command on any of the master nodes:

kubectl get nodes

Output:

NAME            STATUS   ROLES                  AGE     VERSION
opz3-1-onder    Ready    control-plane,master   7m23s   v1.29.6+k3s1
opz3-2-midden   Ready    control-plane,master   9m50s   v1.29.6+k3s1
opz3-3-boven    Ready    control-plane,master   7m18s   v1.29.6+k3s1
opz3-4-top      Ready    <none>                 14s     v1.29.6+k3s1

There are 3 master nodes now!

We're almost done with the high-available part. The only thing left to do is install longhorn in the cluster so that our pod storage (persistent volumes) is high-available as well.

Please consult my other article on how to set up your local workstation with kubectl and helm. You will need that for the rest of this article.

HA storage in k3s

longhorn.png

A screenshot of my longhorn dashboard

longhorn is a distributed block storage system which makes the persistent volumes in kubernetes available on multiple nodes and keeps them in sync. If one node fails, pods get started on another node and their storage is still available. Without longhorn the persistent volume would not be available, since in the case of k3s it lives on the local node.

This article exposes iSCSI to your network. On a real production setup, please use separate VLANs for storage traffic. For a home-lab setup like mine, this is fine.

Start by installing the following package on all k3s nodes:

apt-get install open-iscsi
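
Longhorn uses iSCSI for its volumes. Depending on your distribution you might also need to make sure the iscsid daemon is enabled and running; on my Debian-based nodes that would be:

systemctl enable --now iscsid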

On your local admin workstation you must create a folder for the yaml files:

mkdir longhorn
cd longhorn

Download the deployment file for longhorn:

wget https://raw.githubusercontent.com/longhorn/longhorn/v1.6.0/deploy/longhorn.yaml

Apply it with kubectl:

kubectl apply -f longhorn.yaml
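
The deployment can take a few minutes. You can watch the pods in the longhorn-system namespace come up with:

kubectl get pods --namespace longhorn-system --watch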

For testing, expose the dashboard on port 8877. There is no login required by default. In a follow-up article I'll show you how to set up an Ingress with a password, but for now this is fine.

kubectl expose service longhorn-frontend --type=LoadBalancer --port=8877 --target-port 8000  --name=longhorn-frontend-ext --namespace longhorn-system

You should now be able to navigate in your browser to http://ha-ip:8877 and see the longhorn dashboard.

We need to make longhorn the default storage class. Otherwise, unchanged helm charts and deployments would still use the local-path storage (per node). This step is k3s specific.

Check the current storage classes:

kubectl get storageclass -A

Output:

NAME                   PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path (default)   rancher.io/local-path   Delete          WaitForFirstConsumer   false                  18m
longhorn (default)     driver.longhorn.io      Delete          Immediate              true                   3m48s  

Use the following command to remove local-path as default option:

kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'

Output:

storageclass.storage.k8s.io/local-path patched

Check the storage classes again:

kubectl get storageclass -A

Output:

NAME                 PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path           rancher.io/local-path   Delete          WaitForFirstConsumer   false                  18m
longhorn (default)   driver.longhorn.io      Delete          Immediate              true                   4m23s
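
In my cluster longhorn was already marked as default. If yours is not, the same kind of patch, with the annotation set to true, should take care of it:

kubectl patch storageclass longhorn -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'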

The last thing you need to do is go to the dashboard, open the settings and search for Pod Deletion Policy When Node is Down. By default it's set to Do Nothing. In my case, I set it to delete-both-statefulset-and-deployment-pod.

Longhorn will force delete StatefulSet/Deployment terminating pods on nodes that are down to release Longhorn volumes so that Kubernetes can spin up replacement pods.

If you do not change this and a node fails, the other nodes will not be able to start the pods and will report the following error:

the volume is currently attached to different node (the node that is down)
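
If you'd rather not click through the dashboard, Longhorn also exposes its settings as custom resources. I have not verified the exact resource layout myself, but something along these lines should change the same setting from the command line:

kubectl patch settings.longhorn.io node-down-pod-deletion-policy \
  --namespace longhorn-system --type=merge \
  -p '{"value": "delete-both-statefulset-and-deployment-pod"}'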

The longhorn documentation explains what this setting does:

This section is aimed to inform users of what happens during a node failure and what is expected during the recovery.

After one minute, kubectl get nodes will report NotReady for the failure node.

After about five minutes, the states of all the pods on the NotReady node will change to either Unknown or NodeLost.

StatefulSets have a stable identity, so Kubernetes won't force delete the pod for the user. See the official Kubernetes documentation about forcing the deletion of a StatefulSet.

Deployments don't have a stable identity, but for the Read-Write-Once type of storage, since it cannot be attached to two nodes at the same time, the new pod created by Kubernetes won't be able to start due to the RWO volume still attached to the old pod, on the lost node.

In both cases, Kubernetes will automatically evict the pod (set the deletion timestamp for the pod) on the lost node, then try to recreate a new one with the old volumes. Because the evicted pod gets stuck in Terminating state and the attached volumes cannot be released/reused, the new pod will get stuck in ContainerCreating state, if there is no intervention from an admin or storage software.

Which basically means that there is no automated failover of storage, unless you change the aforementioned option Pod Deletion Policy When Node is Down to something other than do-nothing.

Testing

The most important part of a high-availability cluster is testing different scenarios. Just like with backups, if you don't regularly test restores, you don't actually have a backup.

The simplest scenario is powering off one node (ssh in, then poweroff). You should see that reflected in kubectl get nodes. After 5 minutes all pods should be started on another node. Also try shutting down the node that is the keepalived MASTER for the HA-IP. After a few seconds the IP will be available on another node and you should be able to reach any port forwards / loadbalancers again.
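
During these tests it helps to keep an eye on the cluster from your workstation, for example with these two commands in separate terminals:

watch -n 2 kubectl get nodes
kubectl get pods --all-namespaces -o wide --watch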

Here is a screenshot from headlamp showing one of the master nodes being down:

headlamp

The cluster should not be impacted and after 5 minutes all pods / workloads should be back up.

The second test should be unplugging the network cable: not powering off cleanly, but just unplugging the network. Same effect: after 5 minutes your cluster should be healthy again.

The third test is to power off two of the three nodes, one by one. Many more tests are possible, but this set should give you enough confidence in the high availability of the cluster.

Tags: armbian , cloud , cluster , helm , high-availability , iscsi , k3s , k8s , keepalived , kubernetes , linux , longhorn , mariadb , mysql , orange-pi , raspberry-pi , tutorials