Raymii.org

אֶשָּׂא עֵינַי אֶל־הֶהָרִים מֵאַיִן יָבֹא עֶזְרִֽי׃
Home | About | All pages | Cluster Status | RSS Feed

Proxmox VE 7 Corosync QDevice in a Docker container

Published: 17-04-2022 | Last update: 29-01-2024 04:30 | Author: Remy van Elst | Text only version of this article

❗ This post is over one years old. It may no longer be up to date. Opinions may have changed.

Proxmox VE cluster QDevice technical explanation
- Corosync External Vote Support
- QDevice Technical Overview
Proxmox VE host setup, part 1
Docker container setup
Proxmox VE host setup, part 2

At home I have a 2 node Proxmox VE cluster consisting of 2 HP EliteDesk Mini machines, both running with 16 GB RAM and both an NVMe and SATA SSD with ZFS on root (256 GB). It's small enough (physically) and is just enough for my homelab needs specs wise. I have a few services at home, the rest runs on 'regular' virtual private servers online. Proxmox VE, a virtualization stack based on Debian, has support for clustering. Clustering can mean different things to different people, in this case the cluster support means a group of physical servers, but Proxmox also supports actual high-availability with fail over and (live) migration. I also have a small NAS with spinning disks for shared storage via NFS, which is helpful when experimenting with live migration.

Update 29-01-2024: The Docker image now contains a health check command and sets the sticky bit on /var/run so corosync-qnetd can create its runtime directory.

cluster

My modest two node Proxmox VE cluster, with extra quorum device

For a cluster (in any sense of the word), you need at least 3 nodes, otherwise there is no quorum. Meaning, if one node goes down, it (and the other node) cannot know if the problem is at their side or the other side. With an uneven number of nodes, one node can always ask another node, hey, it is just me or do so see the issue as well? If it receives no reply, it knows it's their problem, if the other node does reply, they know it's the third node that has the problem. Corosync, the cluster software used by Proxmox, supports an external Quorum device. This is a small piece of software running on a third node which provides an extra vote for the quorum (the extra vote) without being a Proxmox VE server. Any cluster with a even number of nodes can get such a split-brain situation, and in my experience, those are bad.

A two node proxmox cluster shuts down all virtual machines and containers in the case of one node failure, the cluster is in read-only mode. So if you power down one of your proxmox servers, the other one is offline as well, even if you've just migrated all machines to that node. With the extra quorum device, you can safely power down one node.

In my case I wanted to run this on my NAS, since (physical) space is a premium. The NAS supports Docker, this guide explains how to run the QDevice for Proxmox in a Docker container. The docker container is a Debian 11 container with systemd and openssh, since the proxmox QDevice setup requires that. More like an LXC container than a docker container. The NAS itself does not run Debian and has no corosync packages, which is why I resorted to this route.

If you can run an extra debian server, there is no need to go the Docker route, you can just follow the regular guide. There is a qdevice docker image but that guide does not work for Proxmox VE 7 and requires a lot of manual setup. Using my method involves a lot less steps, since you're basically running an extra debian VPS (a container with systemd and openssh).

I've been working with Corosync since 2012, have written a few posts in 2013 so I consider myself experienced in corosync usage. The Proxmox web interface makes clustering very easy.

Proxmox VE cluster QDevice technical explanation

My cluster has 2 nodes, both use ZFS as local storage. The Proxmox Docs explains all the details on running a cluster and the GUI makes it very easy to setup. For this guide I'm assuming you already have the 2 node cluster setup.

Quoting the documentation page with more information on the QDevice:

Corosync External Vote Support

This section describes a way to deploy an external voter in a Proxmox VE cluster. When configured, the cluster can sustain more node failures without violating safety properties of the cluster communication.

For this to work, there are two services involved:

A QDevice daemon which runs on each Proxmox VE node
An external vote daemon which runs on an independent server

As a result, you can achieve higher availability, even in smaller setups (for example 2+1 nodes).

QDevice Technical Overview

The Corosync Quorum Device (QDevice) is a daemon which runs on each cluster node. It provides a configured number of votes to the cluster's quorum subsystem, based on an externally running third-party arbitrator's decision. Its primary use is to allow a cluster to sustain more node failures than standard quorum rules allow. This can be done safely as the external device can see all nodes and thus choose only one set of nodes to give its vote. This will only be done if said set of nodes can have quorum (again) after receiving the third-party vote.

Currently, only QDevice Net is supported as a third-party arbitrator. This is a daemon which provides a vote to a cluster partition, if it can reach the partition members over the network. It will only give votes to one partition of a cluster at any time. It's designed to support multiple clusters and is almost configuration and state free. New clusters are handled dynamically and no configuration file is needed on the host running a QDevice.

The only requirements for the external host are that it needs network access to the cluster and to have a corosync-qnetd package available. We provide a package for Debian based hosts, and other Linux distributions should also have a package available through their respective package manager. Note In contrast to corosync itself, a QDevice connects to the cluster over TCP/IP. The daemon may even run outside of the cluster's LAN and can have longer latencies than 2 ms.

We support QDevices for clusters with an even number of nodes and recommend it for 2 node clusters, if they should provide higher availability. For clusters with an odd node count, we currently discourage the use of QDevices. The reason for this is the difference in the votes which the QDevice provides for each cluster type. Even numbered clusters get a single additional vote, which only increases availability, because if the QDevice itself fails, you are in the same position as with no QDevice at all.

On the other hand, with an odd numbered cluster size, the QDevice provides (N-1) votes, where N corresponds to the cluster node count. This alternative behavior makes sense; if it had only one additional vote, the cluster could get into a split-brain situation. This algorithm allows for all nodes but one (and naturally the QDevice itself) to fail. However, there are two drawbacks to this:

If the QNet daemon itself fails, no other node may fail or the cluster immediately loses quorum. For example, in a cluster with 15 nodes, 7 could fail before the cluster becomes inquorate. But, if a QDevice is configured here and it itself fails, no single node of the 15 may fail. The QDevice acts almost as a single point of failure in this case.
The fact that all but one node plus QDevice may fail sounds promising at first, but this may result in a mass recovery of HA services, which could overload the single remaining node. Furthermore, a Ceph server will stop providing services if only ((N-1)/2) nodes or less remain online.

(end quote)

Proxmox VE host setup, part 1

On all your Proxmox VE 7 machines you must manually install an extra package:

apt install corosync-qdevice

Again, I'm assuming you have your cluster configured already. I'm also using different VLAN's (extra USB 3 gigabit nic's) for the different networks, but I'm leaving that outside of the scope of this guide to keep it simple.

After installing the package we can continue on configuring our Docker container. When the docker container is running, we finish of the setup by configuring the new qdevice from one of our Proxmox VE hosts.

Docker container setup

The Docker container cannot run on one of your proxmox servers or inside a VM on proxmox. You must run it on a different, external server. (You technically could run it on proxmox, but that defeats the whole point.)

I'm using my NAS, which supports Docker. The container image is based on this image, which builds a debian image with OpenSSH and systemd. Mine is simplified to only run Debian Bullseye (11). As opposed to the example command, CAP_SYS_ADMIN is not required, the systemd version in Debian 11 is recent enough. The Docker image also installs the corosync-qnetd including changing the permissions on the /etc/corosync folder to the correct user.

The container should automatically start at boot of the docker host, but if you want to run it on another host, copy the data volume folder and start it there. It will then have all the config files required.

Create a folder on the Docker host where the Corosync container will store it's corosync cluster config:

mkdir -p /volume1/docker/qnetd/corosync-data

This can be any path you like, the container will mount it as a volume.

Navigate to the folder one level above, where I'll store the Dockerfile and some other info:

cd /volume1/docker/qnetd

On my docker host, all named containers get a folder under /volume1/docker/$container-name, with subfolders for each volume.

Create a Dockerfile:

vim Dockerfile

Place the following contents inside:

FROM debian:bullseye
RUN echo 'debconf debconf/frontend select teletype' | debconf-set-selections
RUN apt-get update && apt-get dist-upgrade -qy && apt-get install -qy --no-install-recommends systemd systemd-sysv corosync-qnetd  openssh-server && apt-get clean && rm -rf /var/lib/apt/lists/* /var/log/alternatives.log /var/log/apt/history.log /var/log/apt/term.log /var/log/dpkg.log
RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
# Change password to something secure
RUN echo 'root:password' | chpasswd    
RUN chmod 1777 /var/run
RUN chown -R coroqnetd:coroqnetd /etc/corosync/
RUN systemctl mask -- dev-hugepages.mount sys-fs-fuse-connections.mount
RUN rm -f /etc/machine-id /var/lib/dbus/machine-id
FROM debian:bullseye
COPY --from=0 / /
ENV container docker
STOPSIGNAL SIGRTMIN+3
VOLUME [ "/sys/fs/cgroup", "/run", "/run/lock", "/tmp" ]
HEALTHCHECK CMD corosync-qnetd-tool -s
EXPOSE 5403
CMD [ "/sbin/init" ]

Change the password part in the line 'root:password' to a secure password which will be used for SSH root login.

Build the docker image:

 docker build . -t debian-qdevice

The -t flag will give the image a recognizable name. You can now start a container based on that image:

docker run -d -it \
  --name qnetd \
  --net=macvlan \
  --ip=192.0.2.20 \
  -v /path/to/your/docker/folder/qnetd/corosync-data:/etc/corosync    \
  -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
  --restart=always \
  debian-qdevice:latest

I'm using macvlan to give my containers their own IP address. The Proxmox QDevice Setup code does not show support for a different SSH port, which is why I gave the container it's own IP. You could put a name/port combo in your ~/.ssh/config file on the Proxmox hosts like so:

Host 192.0.2.19
  HostName 192.0.2.19
  Port 2222
  User root

Then you can start the container without --net=macvlan --ip=192... but you must specify the ports used by corosync and SSH: -p 5403:5403 -p 2222:22. The setup command later on can use the IP of the docker host, but be extra careful that you have setup your SSH config file correctly. 192.0.2.19 is the example IP of my docker host, not of the docker container. You could unexpectedly run commands as root on your Docker host.

The start command will print out a UUID of the new container, you can check if it's running with the docker ps command:

docker ps

Example output:

CONTAINER ID   IMAGE                  COMMAND        CREATED              STATUS                PORTS     NAMES
45d861ce6acf   debian-qdevice:latest       "/sbin/init"   About a minute ago   Up About a minute               qnetd

Test if you can SSH into the container with the username root and your password. If SSH works, continue on with the guide.

Proxmox VE host setup, part 2

This part has to be done on one of your Proxmox VE servers, it is synced automatically to the other servers. Execute the following command:

pvecm qdevice setup 192.0.2.20

You're asked for the root password:

/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@192.0.2.20's password:

The rest of the output is all setup of corosync:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'root@192.0.2.20'"
and check to make sure that only the key(s) you wanted were added.


INFO: initializing qnetd server
Certificate database (/etc/corosync/qnetd/nssdb) already exists. Delete it to initialize new db

INFO: copying CA cert and initializing on all nodes

node 'pve1': Creating /etc/corosync/qdevice/net/nssdb
password file contains no data
node 'pve1': Creating new key and cert db
node 'pve1': Creating new noise file /etc/corosync/qdevice/net/nssdb/noise.txt
node 'pve1': Importing CA
node 'pve2': Creating /etc/corosync/qdevice/net/nssdb
password file contains no data
node 'pve2': Creating new key and cert db
node 'pve2': Creating new noise file /etc/corosync/qdevice/net/nssdb/noise.txt
node 'pve2': Importing CA
INFO: generating cert request
Creating new certificate request


Generating key.  This may take a few moments...

Certificate request stored in /etc/corosync/qdevice/net/nssdb/qdevice-net-node.crq

INFO: copying exported cert request to qnetd server

INFO: sign and export cluster cert
Signing cluster certificate
Certificate stored in /etc/corosync/qnetd/nssdb/cluster-cluster1.crt

INFO: copy exported CRT

INFO: import certificate
Importing signed cluster certificate
Notice: Trust flag u is set automatically if the private key is present.
pk12util: PKCS12 EXPORT SUCCESSFUL
Certificate stored in /etc/corosync/qdevice/net/nssdb/qdevice-net-node.p12

INFO: copy and import pk12 cert to all nodes

node 'pve1': Importing cluster certificate and key
node 'pve1': pk12util: PKCS12 IMPORT SUCCESSFUL
node 'pve2': Importing cluster certificate and key
node 'pve2': pk12util: PKCS12 IMPORT SUCCESSFUL
INFO: add QDevice to cluster configuration

INFO: start and enable corosync qdevice daemon on node 'pve1'...
Synchronizing state of corosync-qdevice.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable corosync-qdevice
Created symlink /etc/systemd/system/multi-user.target.wants/corosync-qdevice.service -> /lib/systemd/system/corosync-qdevice.service.

INFO: start and enable corosync qdevice daemon on node 'pve2'...
Synchronizing state of corosync-qdevice.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable corosync-qdevice
Created symlink /etc/systemd/system/multi-user.target.wants/corosync-qdevice.service -> /lib/systemd/system/corosync-qdevice.service.
Reloading corosync.conf...
Done

You can check the status of the cluster and quorum device with the following command:

pvecm status

Example output with the Quorum device setup:

Cluster information
-------------------
Name:             cluster1
Config Version:   7
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sun Apr 17 21:31:06 2022
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.98
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate Qdevice 

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.0.2.10 (local)
0x00000002          1    A,V,NMW 192.0.2.11
0x00000000          1            Qdevice

If you have not setup the QDevice correctly, the last part of the output will be different, showing no votes for the QDevice:

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2  
Flags:            Quorate Qdevice 

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.0.2.10 (local)
0x00000002          1    A,V,NMW 192.0.2.11
0x00000000          0            Qdevice (votes 1)

Check the Docker container (login via SSH), you should see the daemon running via ps auxf:

root@d74eb68b6507:~# ps auxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  17676  9568 ?        Ss   19:30   0:00 /sbin/init
coroqne+    30  0.0  0.2  17692 13572 ?        Ss   19:30   0:00 /usr/bin/corosync-qnetd -f
root        33  0.0  0.0   4332  2140 pts/0    Ss+  19:30   0:00 /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400
root        34  0.0  0.1  13272  7644 ?        Ss   19:30   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root        97  6.5  0.1  13736  8256 ?        Ss   19:34   0:00  \_ sshd: root@pts/1
root       103  2.0  0.0   3964  3420 pts/1    Ss   19:34   0:00      \_ -bash
root       106  0.0  0.0   6696  3000 pts/1    R+   19:34   0:00          \_ ps auxf

At my first attempt of Dockerizing the QDevice, I forgot to set the correct permissions on the /etc/corosync folder. The daemon failed to start. There is no logging inside the container, but after running the daemon manually as root (which worked) and inspecting the systemd unit file, the cause of the issue was clear.

The QDevice is not visible inside the web interface as far as I know.

Tags: cluster , corosync , docker , high-availability , homelab , kvm , linux , lxc , proxmox , proxmox-ve , qemu , sysadmin , tutorials , virtualization