This is a text-only version of the following page on https://raymii.org:

---
Title  : High Available k3s kubernetes cluster with keepalived, galera and longhorn
Author : Remy van Elst
Date   : 09-07-2024 22:30
URL    : https://raymii.org/s/tutorials/High_Available_k3s_kubernetes_cluster_with_keepalived_galera_and_longhorn.html
Format : Markdown/HTML
---

After my [first adventure with Kubernetes](/s/tutorials/My_First_Kubernetes_k3s_cluster_on_3_Orange_Pi_Zero_3s_including_k8s_dashboard_hello-node_and_failover.html), getting started with k3s on my small 3 node ARM cluster that [boots via PXE / NFS](/s/tutorials/Netboot_PXE_Armbian_on_an_Orange_Pi_Zero_3_from_SPI_with_NFS_root_filesystem.html), I noticed that there is only one k3s node that has the `control-plane,master` role. If that node fails you can no longer manage the cluster. Other nodes can fail and their workloads (pods) will eventually be restarted after 5 minutes, but this node is special. Time to change that and make it a high available cluster.

K3s [supports](https://web.archive.org/web/20240703112841/https://docs.k3s.io/datastore/ha) high availability with embedded `etcd` and with external databases like `MySQL` and `postgres`. `etcd` will thrash your storage (SD cards), so I decided to go with a `MySQL` cluster using `Galera` for the database and `keepalived` for the high-available cluster IP. This guide shows you how to configure the HA database and HA-IP, and I'll also set up [longhorn](https://web.archive.org/web/20240707025724/https://longhorn.io/) for high-available block storage inside kubernetes. The end result is that I can pull the power from any two of the three nodes without the k3s cluster or workloads going down.

**You need at least 3 nodes for a cluster!** 2 nodes are prone to split-brain situations and harder to set up. If you go beyond 3 nodes, make sure you always have an odd number of cluster members (3, 5, 7, etc.) so that in the case of a failure there is always a majority for the quorum. My k3s cluster now has 4 nodes, but one of them is only a worker node, not part of the high-available control plane setup, so technically the high-available cluster uses 3 nodes. The version of [Kubernetes/k3s](https://docs.k3s.io/release-notes/v1.29.X) I use for this article is `v1.29.6+k3s1`.

### Overview

Here is a high-level diagram of the setup:

![ha-k3s.png](/s/inc/img/ha-k3s-1.png)

The blue parts (`keepalived` & `galera`) in the diagram inside a node provide the high availability for the kubernetes (k3s) control plane. For other high-available options for k3s, consult [their documentation](https://web.archive.org/web/20240703112841/https://docs.k3s.io/datastore/ha).

`keepalived` provides a high-available IP via VRRP. All three nodes run `keepalived`, one of them being the `MASTER`, the other two being `BACKUP` nodes. This HA-IP is what the `Galera` database will use and what all k3s nodes (that are not part of the HA control plane) will communicate with. I'm also using it as a `LoadBalancer` to make services in the cluster accessible from the outside.

`Galera` is [a synchronous multi-master replication plug-in](https://galeracluster.com/products/) for InnoDB. A client can read from and write to any node in a cluster and changes are applied to all servers. Galera cluster is [built in to MariaDB](https://web.archive.org/web/20240707034001/https://mariadb.com/kb/en/what-is-mariadb-galera-cluster/) nowadays. I used to use Percona XtraDB, but since it's now built in to MariaDB I decided to use that. Galera makes MySQL clustering very easy. Next to the multi-master part (no failover since there are no slaves) it also automatically syncs up any new nodes, skipping lengthy manual provisioning.

The rest of the diagram inside of a node runs "inside" of kubernetes. [Longhorn](https://web.archive.org/web/20240707025724/https://longhorn.io/) provides high-available block storage. Any `Persistent Volumes` in your kubernetes cluster are replicated over all three nodes, including automated failover. In k3s, by default, volumes use local storage, meaning that if a node goes down, the volume is not accessible on the other node where the pods will start up again. With `Longhorn` that problem is gone, making the k3s cluster truly high-available. Otherwise our control plane would survive a node failure, but the workloads would not function correctly afterwards.

The rest of the article will focus on setting up the individual components. For simplicity's sake I'm showing the manual setup. The parts are small enough to easily convert to Ansible or your favorite configuration management tool. It's important that you install the components in the order shown here: first `keepalived`, then `galera`, then `k3s`. Before deploying any workload inside of kubernetes, make sure you have set up `Longhorn` for HA storage.

My cluster uses [Armbian](https://www.armbian.com/), a Debian distribution for small board computers. The guide should work on regular Debian as well.
The High-Available IP address (managed by `keepalived`) for the cluster is:

- `192.0.2.50`

The 3 cluster nodes have the following IP addresses:

- `192.0.2.60`
- `192.0.2.61`
- `192.0.2.62`

The worker node in the cluster has the following IP address:

- `192.0.2.63`

The HA-IP (`192.0.2.50`) must be set up before the database or k3s can be installed.

### keepalived

I've been using [keepalived for over 10 years](/s/tutorials/Keepalived-Simple-IP-failover-on-Ubuntu.html), that article is from 2014. For a simple setup with just an HA-IP like this I prefer it over `corosync` since it's easier to set up.

Install the required components on all 3 nodes:

    apt install dirmngr software-properties-common rsync keepalived

Choose one of your 3 nodes to be the master node. This node will hold the high-available IP by default. It does not matter which node it is, although if your cluster has different hardware it's best to choose the most powerful node. Edit the configuration file:

    vim /etc/keepalived/keepalived.conf

Place the following:

    vrrp_instance VI_1 {
        state MASTER
        interface end0
        virtual_router_id 51
        priority 100
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass
        }
        virtual_ipaddress {
            192.0.2.50
        }
    }

Change the following parameters:

- `interface`: your physical network interface, use `ip addr` to find it
- `auth_pass`: set this to a secret password
- `virtual_ipaddress`: the HA-IP you want to use

On the other two nodes, also edit the configuration file:

    vim /etc/keepalived/keepalived.conf

Place the following:

    vrrp_instance VI_1 {
        state BACKUP
        interface end0
        virtual_router_id 51
        priority
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass
        }
        virtual_ipaddress {
            192.0.2.50
        }
    }

Change the following parameters:

- `auth_pass`: must match the password in the master node config file
- `priority`: each node must have a different (lower) priority. Node 2 should be `50`, node 3 should be `49`, and so on.
- `interface`: your physical network interface, use `ip addr` to find it
- `virtual_ipaddress`: the HA-IP you configured on the master node

After configuring `keepalived`, restart the service on all three nodes:

    systemctl restart keepalived

You can test the HA-IP by logging in via `ssh` to the HA-IP, for example. Or you can send `SIGUSR1` to the `keepalived` process and check `/tmp/keepalived.data`:

    kill -SIGUSR1 $(pidof keepalived)
    cat /tmp/keepalived.data

Output:

    ------< Global definitions >------
    [...]
     VRRP IPv4 mcast group = 224.0.0.18
     VRRP IPv6 mcast group = ff02::12
     Gratuitous ARP delay = 5
     Gratuitous ARP repeat = 5
    [...]
    ------< VRRP Topology >------
     VRRP Instance = VI_1
       VRRP Version = 2
       State = BACKUP
       Master router = 192.0.2.60
       Master priority = 100
       Flags: none
       Wantstate = BACKUP
       Number of config faults = 0
       Number of interface and track script faults = 0
       Number of track scripts init = 0
       Last transition = 1720124853.373151 (Thu Jul  4 20:27:33.373151 2024)
       Read timeout = 1720131680.650708 (Thu Jul  4 22:21:20.650708 2024)
       Master down timer = 3804687 usecs
       Interface = end0
       Using src_ip = 192.0.2.61
       Multicast address 224.0.0.18
    [...]
       Virtual Router ID = 51
       Priority = 50
       Effective priority = 50
       Total priority = 50
       Advert interval = 1 sec
       Virtual IP (1):
         192.0.2.50 dev end0 scope global
       fd_in 13, fd_out 14
    [...]

Test the `keepalived` failover by shutting down your master node. Check the above file on the two other nodes and log in via SSH to the HA-IP to see that the other node has taken over.
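A quick way to see which node currently carries the HA-IP, without digging through `/tmp/keepalived.data`, is to check the interface addresses and the `keepalived` logs on each node. A small sketch, assuming `end0` is the interface name from my configuration:

    # the node that holds the HA-IP will list 192.0.2.50 on its interface
    ip -br addr show dev end0 | grep 192.0.2.50

    # state transitions (BACKUP -> MASTER) show up in the keepalived logs
    journalctl -u keepalived -f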
### MariaDB Galera cluster

On all three nodes, add the [MariaDB APT repository](https://mariadb.com/kb/en/installing-mariadb-deb-files/):

    add-apt-repository 'deb http://mirrors.digitalocean.com/mariadb/repo/11.5/debian bookworm main'
    apt-key adv --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 0xF1656F24C74CD1D8

On their page there is also a setup script, but that requires [piping curl to bash](https://web.archive.org/web/20240707041822/https://sysdig.com/blog/friends-dont-let-friends-curl-bash/), which is very insecure. Setting up the repository manually is not much more difficult.

Update the local cache and install mariadb:

    apt update
    apt install mariadb-server

**This article exposes MariaDB to your network. On a real production setup, please use separate VLAN's for management and this kind of traffic.** For a home-lab setup like mine, this is fine.

Make sure `mariadb` is accessible over the network (instead of just localhost):

    vim /etc/mysql/mariadb.conf.d/50-server.cnf

Update `bind-address`, change `127.0.0.1` to `0.0.0.0`:

    # Instead of skip-networking the default is now to listen only on
    # localhost which is more compatible and is not less secure.
    bind-address = 0.0.0.0

Make sure the `root` user has a secure password. By default that user has no password. Open a MySQL prompt:

    mariadb -uroot

Set a secure password:

    set password = password("SuperS3cUr3P@ssw0rd");
    quit;

Now it's time to set up the Galera cluster. Create the configuration file:

    vim /etc/mysql/conf.d/galera.cnf

Paste the following:

    [mysqld]
    binlog_format=ROW
    default-storage-engine=innodb
    innodb_autoinc_lock_mode=2
    bind-address=0.0.0.0

    # Galera Provider Configuration
    wsrep_on=ON
    wsrep_provider=/usr/lib/galera/libgalera_smm.so

    # Galera Cluster Configuration
    wsrep_cluster_name="k3s_cluster"
    wsrep_cluster_address="gcomm://192.0.2.60,192.0.2.61,192.0.2.62"

    # Galera Synchronization Configuration
    wsrep_sst_method=rsync

    # Galera Node Configuration
    wsrep_node_address="192.0.2.60"
    wsrep_node_name="node1"

Change the following parameters to be the same on all 3 nodes:

- `wsrep_cluster_name`: the name of your cluster
- `wsrep_cluster_address`: add all IP's of the cluster nodes

Change the following parameters on each node, unique for that specific node:

- `wsrep_node_address`: IP address of that node
- `wsrep_node_name`: name for that node

Time to bootstrap the cluster. Stop MySQL on all nodes:

    systemctl stop mysql

On node 1, start up galera:

    galera_new_cluster
    # no output

Check:

    mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_size'"

Output:

    +--------------------+-------+
    | Variable_name      | Value |
    +--------------------+-------+
    | wsrep_cluster_size | 1     |
    +--------------------+-------+

On node 2 (and node 3):

    systemctl start mariadb

Check the cluster status on node 1 again:

    +--------------------+-------+
    | Variable_name      | Value |
    +--------------------+-------+
    | wsrep_cluster_size | 2     |
    +--------------------+-------+

Repeat the above step on `node3`, the value should increase to 3.

#### Galera Cluster auto-reboot

Normally you would not want to do this and assume the cluster is up. In my case I turn off the Orange Pi Zero 3 computers when I'm experimenting for the day. When [restarting a Galera Cluster](https://web.archive.org/web/20240709040142/https://galeracluster.com/library/training/tutorials/restarting-cluster.html) you need to follow specific steps and do checks to make sure everything starts up correctly without data loss. The same goes for when one node drops out.
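Part of those checks is looking at the saved Galera state on each node to find the one that is safe to bootstrap from. A quick way to do that (the values in the file will of course differ on your cluster):

    # on each node, check the saved Galera state
    cat /var/lib/mysql/grastate.dat
    # the node with 'safe_to_bootstrap: 1' (normally the last one shut down)
    # is the one to start with 'galera_new_cluster'; the others can then
    # simply be started with 'systemctl start mariadb'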
But that also means that at every start of my cluster I would manually need to intervene and `fix` the Galera cluster. In this case I don't care much about data loss or correct startup order, I want things to boot without manual intervention. So here is something **you should never do in production**.

Edit the `mariadb` systemd service:

    systemctl edit mariadb

Add the following (in between the comment lines):

    ### Editing /etc/systemd/system/mariadb.service.d/override.conf
    ### Anything between here and the comment below will become the new contents of the file

    [Service]
    Restart=on-failure
    RestartSec=5s

    ### Lines below this comment will be discarded

By default `mariadb` only restarts `on-abort`, not `on-failure`. I want it to keep retrying to start up continuously.

Next is the **most dangerous part**, a cronjob to forcefully "fix" the galera cluster:

    * * * * * sleep 30 ; /usr/bin/systemctl is-active --quiet mariadb.service || { /usr/bin/sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/g' /var/lib/mysql/grastate.dat ; /usr/bin/galera_new_cluster || { sleep 60 ; /sbin/reboot; } } 2>&1 | /usr/bin/logger

This cronjob waits 30 seconds and then, if the `mariadb` service is not running, sets `safe_to_bootstrap` to `1`. Then it bootstraps the cluster. If that bootstrapping fails, it sleeps for a minute and reboots. If `mariadb` is running it does nothing. If the bootstrapping succeeds, it also does not reboot.

In normal operations I issue the shutdown command like this:

    ssh -o ConnectTimeout=5 -o BatchMode=yes -o StrictHostKeyChecking=no root@192.0.2.60 poweroff;
    ssh -o ConnectTimeout=5 -o BatchMode=yes -o StrictHostKeyChecking=no root@192.0.2.61 poweroff;
    sleep 30;
    ssh -o ConnectTimeout=5 -o BatchMode=yes -o StrictHostKeyChecking=no root@192.0.2.62 poweroff;

With this wait time it results in an "orderly" shutdown with one node being the "last" node for the Galera cluster, so most of the time this cronjob does nothing.

Continuing on with the k3s parts of the database setup.

### k3s mysql database setup

This is simple and straightforward, just like you would normally create a database and user. Log in to the database:

    mysql -uroot

Create the specific database for `k3s`:

    CREATE DATABASE k3s;

Create a user with password for `k3s`:

    CREATE USER 'k3s'@'%' IDENTIFIED BY 'Supers3cr3tP4ssw0rd';

Grant that user all permissions on that database:

    GRANT ALL PRIVILEGES ON k3s.* TO 'k3s'@'%';
    FLUSH PRIVILEGES;

### Installing K3S

On the first node (of the three) you install k3s like you normally would, but with two extra arguments. The first is `--tls-san="$HA-IP"`, the keepalived high-available cluster IP. All nodes will connect to k3s and the database via this high-available IP. The second argument is `--datastore-endpoint`, where you provide your newly created database connection information, using the HA-IP as well.

    curl -sfL https://get.k3s.io | sh -s - \
        --datastore-endpoint="mysql://k3s:Supers3cr3tP4ssw0rd@tcp(192.0.2.50:3306)/k3s" \
        --tls-san="192.0.2.50"

After installation finishes, get the token. You need that on the other nodes:

    cat /var/lib/rancher/k3s/server/node-token

Output:

    K10a[...]418::server:7a8[...]8e441
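Before joining the other nodes, it's worth checking that this first server came up and is actually using the external datastore through the HA-IP. A quick sanity check (the tables inside the `k3s` database are created and managed by k3s itself, so their exact names may differ between versions):

    # k3s should be active and this node should report Ready
    systemctl status k3s --no-pager
    kubectl get nodes

    # the k3s database, reached via the HA-IP, should now contain tables
    mysql -h 192.0.2.50 -u k3s -p -e "SHOW TABLES FROM k3s;"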
On the second and third master node, use the following command. It's the same as above, only the token is provided as an extra argument:

    curl -sfL https://get.k3s.io | sh -s - server \
        --datastore-endpoint="mysql://k3s:Supers3cr3tP4ssw0rd@tcp(192.0.2.50:3306)/k3s" \
        --token="K10a[...]418::server:7a8[...]8e441" \
        --tls-san="192.0.2.50"

Installation of extra agents is done like normal, with just the token and the HA-IP:

    curl -sfL https://get.k3s.io | \
        K3S_URL="https://192.0.2.50:6443" \
        K3S_TOKEN="K10a[...]418::server:7a8[...]8e441" \
        sh -s -

After installation you can issue the following command on any of the master nodes:

    kubectl get nodes

Output:

    NAME            STATUS   ROLES                  AGE     VERSION
    opz3-1-onder    Ready    control-plane,master   7m23s   v1.29.6+k3s1
    opz3-2-midden   Ready    control-plane,master   9m50s   v1.29.6+k3s1
    opz3-3-boven    Ready    control-plane,master   7m18s   v1.29.6+k3s1
    opz3-4-top      Ready    <none>                 14s     v1.29.6+k3s1

There are 3 master nodes now! We're almost done with the high-available part. The only thing left to do is install `longhorn` in the cluster so that our pod storage (persistent volumes) is high available as well.

Please [consult my other article](/s/tutorials/My_First_Kubernetes_k3s_cluster_on_3_Orange_Pi_Zero_3s_including_k8s_dashboard_hello-node_and_failover.html#toc_2) on how to set up your local workstation with `kubectl` and `helm`. You will need that for the rest of this article.

### HA storage in k3s

![longhorn.png](/s/inc/img/ha-k3s-2.png)

> A screenshot of my longhorn dashboard

`longhorn` is a distributed block storage system which makes the persistent volumes in kubernetes available on multiple nodes and keeps them in sync. If one node fails, pods get started on another node and their storage is still available. Without `longhorn`, the persistent volume would not be available, since that, in the case of k3s, is on the local node.

**This article exposes iSCSI to your network. On a real production setup, please use separate VLAN's for storage traffic.** For a home-lab setup like mine, this is fine.

Start by installing the following package on all `k3s` nodes:

    apt-get install open-iscsi

On your [local admin workstation](/s/tutorials/My_First_Kubernetes_k3s_cluster_on_3_Orange_Pi_Zero_3s_including_k8s_dashboard_hello-node_and_failover.html#toc_2) create a folder for the yaml files:

    mkdir longhorn
    cd longhorn

Download the deployment file for longhorn:

    wget https://raw.githubusercontent.com/longhorn/longhorn/v1.6.0/deploy/longhorn.yaml

Apply it with `kubectl`:

    kubectl apply -f longhorn.yaml

For testing, expose the dashboard on port `8877`. There is no login required by default. In a follow-up article I'll show you how to set up an `Ingress` with a password, but for now this is fine.

    kubectl expose service longhorn-frontend --type=LoadBalancer --port=8877 --target-port 8000 --name=longhorn-frontend-ext --namespace longhorn-system

You should now be able to navigate in your browser to `http://ha-ip:8877` and see the longhorn dashboard.
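At this point you can already verify that Longhorn provisions replicated volumes by creating a small test `PersistentVolumeClaim` that explicitly requests the `longhorn` storage class. This is just a sketch to check the setup; the claim name and size are arbitrary and you can delete the claim again afterwards:

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: longhorn-test-pvc
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: longhorn
      resources:
        requests:
          storage: 1Gi
    EOF

    # the claim should become Bound and show up in the longhorn dashboard
    kubectl get pvc longhorn-test-pvc

    # clean up the test volume
    kubectl delete pvc longhorn-test-pvc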
We need to make the `longhorn` storage class the default storage. Otherwise, unchanged helm charts and deployments would still use the `local-path` storage (per node). This step is `k3s` specific.

Check the current storage classes:

    kubectl get storageclass -A

Output:

    NAME                   PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
    local-path (default)   rancher.io/local-path   Delete          WaitForFirstConsumer   false                  18m
    longhorn (default)     driver.longhorn.io      Delete          Immediate              true                   3m48s

Use the following command to remove `local-path` as default option:

    kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'

Output:

    storageclass.storage.k8s.io/local-path patched

Check the storage classes again:

    kubectl get storageclass -A

Output:

    NAME                 PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
    local-path           rancher.io/local-path   Delete          WaitForFirstConsumer   false                  18m
    longhorn (default)   driver.longhorn.io      Delete          Immediate              true                   4m23s

The last thing you need to do is go to the dashboard settings and search for `Pod Deletion Policy When Node is Down`. By default it's set to `Do Nothing`. In my case, I set it to `delete-both-statefulset-and-deployment-pod`:

> Longhorn will force delete StatefulSet/Deployment terminating pods on nodes that are down to release Longhorn volumes so that Kubernetes can spin up replacement pods.

If you do not change this and a node fails, the other nodes will not be able to start the pods, giving the following error:

    the volume is currently attached to different node (the node that is down)

The [longhorn documentation](https://web.archive.org/web/20240709044003/https://longhorn.io/docs/1.6.0/high-availability/node-failure/#what-to-expect-when-a-kubernetes-node-fails) explains what this setting does:

> This section is aimed to inform users of what happens during a node failure and what is expected during the recovery.
>
> After one minute, kubectl get nodes will report NotReady for the failure node.
>
> After about five minutes, the states of all the pods on the NotReady node will change to either Unknown or NodeLost.
>
> StatefulSets have a stable identity, so Kubernetes won't force delete the pod for the user. See the official Kubernetes documentation about forcing the deletion of a StatefulSet.
>
> Deployments don't have a stable identity, but for the Read-Write-Once type of storage, since it cannot be attached to two nodes at the same time, the new pod created by Kubernetes won't be able to start due to the RWO volume still attached to the old pod, on the lost node.
>
> In both cases, Kubernetes will automatically evict the pod (set deletion timestamp for the pod) on the lost node, then try to recreate a new one with old volumes. Because the evicted pod gets stuck in Terminating state and the attached volumes cannot be released/reused, the new pod will get stuck in ContainerCreating state, if there is no intervene from admin or storage software.

Which basically means that there is no automated failover of storage, unless you change the aforementioned option `Pod Deletion Policy When Node is Down` to something else than `do-nothing`.

### Testing

The most important part of the high availability cluster is testing different scenario's. Just like with backups, if you don't regularly test restores, you don't actually have a backup.

The most simple scenario is powering off one node (`ssh`, then `poweroff`). You should see that reflected in `kubectl get nodes`. After 5 minutes all pods should be started on another node.

Try shutting down the node that is the `keepalived` `MASTER` node holding the HA-IP.
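To see how long the HA-IP is actually unreachable during that failover, you can watch it from a machine outside the cluster while you power the `MASTER` node off. A simple sketch; port `8877` is the Longhorn dashboard exposed earlier, but any exposed service will do:

    # plain ping shows when the HA-IP moves to another node
    ping 192.0.2.50

    # or watch an exposed service through the HA-IP
    while true; do
        curl -s -o /dev/null -m 2 -w '%{http_code}\n' http://192.0.2.50:8877
        sleep 1
    done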
Note that after a few seconds the IP will be available on the other node and you should be able to reach any port forwards / loadbalancers again.

Here is a screenshot from headlamp showing one of the master nodes being down:

![headlamp](/s/inc/img/ha-k3s-3.png)

The cluster should not be impacted and after 5 minutes all pods / workloads should be back up.

The second test should be unplugging the network cable: not powering off cleanly, but just unplugging the network. Same effect, after 5 minutes your cluster should be healthy again.

The third test is to power off two nodes out of three, one by one.

Many more tests are possible, but this set should give you enough confidence in the high availability of the cluster.
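During all of these scenario's it helps to keep a terminal open on your admin workstation that follows the cluster state, for example (assuming `kubectl` is set up as described earlier):

    watch -n 5 'kubectl get nodes -o wide; kubectl get pods -A -o wide'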