High Available Mosquitto MQTT on Kubernetes
Published: 14-05-2025 22:11 | Author: Remy van Elst
In this post, we'll walk through a fully declarative, Kubernetes-native setup for running a highly available MQTT broker using Eclipse Mosquitto. This configuration leverages core Kubernetes primitives (Deployments, Services, ConfigMaps, and RBAC), alongside Traefik IngressRouteTCP, to expose MQTT traffic externally. It introduces a lightweight, self-healing failover mechanism that automatically reroutes traffic to a secondary broker if the primary becomes unhealthy. The setup also demonstrates internal MQTT bridging, allowing seamless message propagation between brokers. The big advantage over a single-Pod deployment (which, in case of node failure, Kubernetes only restarts after about 5 minutes) is that this setup has a downtime of only about 5 seconds and shared state, so all messages remain available after a failover.
Diagram of the setup
This guide assumes you have a working Kubernetes setup using Traefik. In my case the Kubernetes/k3s version I use for this article is v1.32.2+k3s1. If you haven't got such a cluster, maybe check out all my other Kubernetes posts.
In a typical Kubernetes deployment with a single Mosquitto pod, resilience is limited. If the node running the pod fails, Kubernetes can take up to 5 minutes to detect the failure and recover. This delay stems from the default node failure handling: after the node-monitor-grace-period the node is marked NotReady, but the pod is only evicted and rescheduled once the default 300-second (5 minute) toleration for unreachable nodes expires. During this window, MQTT clients lose connectivity, messages are dropped, and systems depending on real-time messaging may suffer degraded performance or enter fault modes.
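For reference, this is roughly what the DefaultTolerationSeconds admission plugin injects into every pod that doesn't declare its own tolerations; the 300-second values are what keep a pod bound to an unreachable node for 5 minutes before it is evicted:

tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300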
The configuration presented here avoids that downtime by deploying both a primary and a secondary Mosquitto broker, each in its own pod, scheduled on different nodes, and using a custom failover controller to handle traffic redirection. A lightweight controller monitors the readiness of the primary pod and, if it becomes unavailable, patches the Kubernetes Service to reroute traffic to the secondary broker within 5 seconds. This dramatically reduces recovery time and improves system responsiveness during failures.
Because the secondary broker is always running and bridged to the primary, it maintains near-real-time message state. Clients continue connecting to the same LoadBalancer endpoint (raymii-mosquitto-svc), with no need to update client configurations or manage DNS changes. The failover is transparent and fast, ensuring message flow continues even when the primary is offline.
This article targets k3s and Traefik. Adapting it for nginx shouldn't be hard.
Summary
This is a summary of what the YAML file does. The two Mosquitto instances are both accessible on their own ports (2883 for the primary, 3883 for the secondary) as well as on port 1883 (which has automatic failover). Clients should connect to port 1883; the other ports are available for monitoring.
The Mosquitto instances are configured to bridge all messages. Anything that gets published to the primary is published to the secondary, and vice versa. When a failover happens, clients lose their connection and must reconnect, but the secondary broker has all messages (including retained ones). When the primary is back online, one more reconnect is required by clients. You can tweak the failover controller to keep the secondary acting as the primary until the secondary itself fails, but that is out of scope for this article.
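A quick way to check the bridge, assuming the mosquitto client tools are installed and the direct ports are reachable (<broker-address> is a placeholder for your LoadBalancer or node address):

# publish a retained test message to the primary's direct listener
mosquitto_pub -h <broker-address> -p 2883 -t test/bridge -m "hello from primary" -r
# the bridge should have copied it to the secondary; read it back via port 3883
mosquitto_sub -h <broker-address> -p 3883 -t test/bridge -C 1 -v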
The Pod that runs the failover monitoring is scheduled on a different Node due to Affinity. That failover Pod only restarts every 5 minutes or so in case of failure. If the Node that runs the failover pod fails AND the Node running the primary pod fails, failover won't happen until the failover Pod is back up again. So in some rare cases failover might still take 5 minutes. Even then, due to the bridge config, fewer messages would be lost.
In my intended use case, clients reconnect whenever there is a failure and publish retained messages when connecting, so failing back is not a problem. Any retained messages that were published to the secondary during an outage are published back to the primary due to the bridge configuration.
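Because notifications true is set in the bridge config, Mosquitto publishes the bridge connection state as a retained message (1 = connected, 0 = disconnected). The topic below is Mosquitto's default notification topic for the configured clientid, which makes for a cheap bridge health check:

mosquitto_sub -h <broker-address> -p 3883 -t '$SYS/broker/connection/raymii-mosquitto-bridge/state' -C 1 -v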
1. Namespace & ConfigMaps
- Creates a raymii-mosquitto-dev namespace.
- Two ConfigMaps:
  - Primary broker ConfigMap: configures the primary broker to listen on ports 1883 (external) and 2883 (for the bridge to connect to).
  - Secondary broker ConfigMap: configures the secondary broker to bridge to the primary on port 2883 and to listen on ports 1883 (external) and 3883.
2. Deployments
- raymii-mosquitto-primary: listens on ports 1883 (external) and 2883 (for the bridge to connect to).
- raymii-mosquitto-secondary: bridges to the primary on port 2883 and listens on ports 1883 (external) and 3883.
- raymii-mosquitto-failover: pod with a shell loop checking the readiness of the primary broker. If the primary is not ready, it patches the selector of raymii-mosquitto-svc to point to the secondary broker, redirecting traffic to it.
3. Services
- raymii-mosquitto-svc: the main LoadBalancer service, dynamically routing traffic to either the primary or the secondary broker.
- raymii-mosquitto-primary-svc: directs traffic to the primary broker's second listener (2883).
- raymii-mosquitto-secondary-svc: directs traffic to the secondary broker's second listener (3883).
4. RBAC
- Role and RoleBinding allowing the failover pod to get, list, and patch pods and services.
- Used by the failover pod to check status every 5 seconds and, if needed, fail over.
5. Traefik IngressRouteTCP
- raymii-mosquitto-dev-mqtt: routes external MQTT traffic to raymii-mosquitto-svc.
- Direct routes for the primary and secondary brokers as well.
Why the Mosquitto Failover Pod Needs a Service Account
In Kubernetes, no pod can access cluster resources by default, not even to check the status of other pods or patch services. That's a problem when the Mosquitto failover pod needs to monitor broker health and switch traffic between the primary and the secondary. Without the right permissions, the failover logic silently fails.
Therefore we must create a ServiceAccount, bind it to a Role with get, list, and patch permissions for pods and services, and assign it to the failover pod. This RBAC setup is the only way to let the pod query health and dynamically reroute traffic via kubectl.
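You can verify the permissions before relying on them with kubectl's built-in authorization check (run from any admin context):

kubectl auth can-i get pods -n raymii-mosquitto-dev \
  --as=system:serviceaccount:raymii-mosquitto-dev:raymii-mosquitto-failover-sa
kubectl auth can-i patch services -n raymii-mosquitto-dev \
  --as=system:serviceaccount:raymii-mosquitto-dev:raymii-mosquitto-failover-sa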
How the Mosquitto Failover Pod Keeps the MQTT Service Alive
The raymii-mosquitto-failover pod is a lightweight control loop designed for one purpose: keep MQTT traffic flowing even when the primary broker goes down. It continuously checks the readiness of the primary Mosquitto pod. If the primary fails, it patches the raymii-mosquitto-svc service to route traffic to the secondary broker. When the primary recovers, traffic is automatically restored to the primary.
This pod runs kubectl inside a shell loop, using Kubernetes API calls to detect health and redirect traffic. It's deliberately simple: no operators, no sidecars, no custom resources.
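To try the mechanism by hand, you can run the same kubectl patch the loop issues (it appears in the full YAML below); remember to patch the selector back to role primary afterwards, since the loop only acts when the primary's readiness actually changes:

kubectl patch service raymii-mosquitto-svc -n raymii-mosquitto-dev \
  -p '{"spec":{"selector":{"app":"raymii-mosquitto","role":"secondary"}}}'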
I could build a custom controller, but that's overkill. Controllers bring overhead, extra code, CRDs, lifecycle management, and more complexity for something this specific. The failover pod trades abstraction for control: it's readable, auditable, debuggable, and deploys instantly. For one job done right, less is more.
Next: what happens during an actual node failure, followed by the full YAML file.
What happens when the primary node is down?
The setup ran on three nodes: one for the primary, one for the secondary and one for the failover monitoring.
I pulled the network cable of the k3s server running the primary pod. Clients disconnected, but reconnected after a few seconds to the raymii-mosquitto-svc Service, landing on the secondary broker.
After a few minutes, more than 5, I plugged the network cable back in to the k3s server that was hosting the primary pod. The failover Pod noticed and patched the service back again:
kubectl logs -n raymii-mosquitto-dev -l app=mosquitto-failover
Output:
service/raymii-mosquitto-svc patched (no change)
Wed May 14 18:58:30 UTC 2025 - Primary healthy, routing to primary.
service/raymii-mosquitto-svc patched
Wed May 14 19:11:54 UTC 2025 - Primary down, routing to secondary.
service/raymii-mosquitto-svc patched
Wed May 14 19:13:41 UTC 2025 - Primary healthy, routing to primary.
K8S Deployment YAML file
This is the YAML file, including the k3s 1.32 HelmChartConfig to expose ports other than 443 and 80. If you use NGINX, you must adapt that part to your setup.
The namespace is raymii-mosquitto-dev. Search and replace if you want a different namespace. Maybe attach a persistent volume if you use certificates or a custom CA for authentication, or to store the Mosquitto persistence database. For my use case, where clients publish retained messages on every connect, there is no need to save the raymii-mosquitto.db file. You might want to use something like Longhorn to save that state. For simplicity, I'm using a ConfigMap for the broker configuration.
---
apiVersion: v1
kind: Namespace
metadata:
name: raymii-mosquitto-dev
---
apiVersion: v1
kind: ConfigMap
metadata:
name: raymii-mosquitto-primary-config
namespace: raymii-mosquitto-dev
data:
raymii-mosquitto.conf: |
listener 1883
allow_anonymous true
listener 2883
allow_anonymous true
---
apiVersion: v1
kind: ConfigMap
metadata:
name: raymii-mosquitto-bridge-config
namespace: raymii-mosquitto-dev
data:
raymii-mosquitto.conf: |
listener 1883
allow_anonymous true
listener 3883
allow_anonymous true
connection bridge-to-primary
address raymii-mosquitto-primary-svc.raymii-mosquitto-dev.svc.cluster.local:2883
clientid raymii-mosquitto-bridge
topic # both 0
start_type automatic
try_private true
notifications true
restart_timeout 5
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: raymii-mosquitto-primary
namespace: raymii-mosquitto-dev
spec:
replicas: 1
selector:
matchLabels:
app: raymii-mosquitto
role: primary
template:
metadata:
labels:
app: raymii-mosquitto
role: primary
spec:
containers:
- name: raymii-mosquitto
image: eclipse-mosquitto:2.0.21
command: ["mosquitto"]
args: ["-c", "/raymii-mosquitto/config/raymii-mosquitto.conf"]
ports:
- containerPort: 1883
- containerPort: 2883
livenessProbe:
tcpSocket:
port: 1883
initialDelaySeconds: 5
periodSeconds: 10
readinessProbe:
tcpSocket:
port: 1883
initialDelaySeconds: 5
periodSeconds: 10
volumeMounts:
- name: primary-config
mountPath: /raymii-mosquitto/config/
volumes:
- name: primary-config
configMap:
name: raymii-mosquitto-primary-config
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- raymii-mosquitto
topologyKey: kubernetes.io/hostname
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: raymii-mosquitto-secondary
namespace: raymii-mosquitto-dev
spec:
replicas: 1
selector:
matchLabels:
app: raymii-mosquitto
role: secondary
template:
metadata:
labels:
app: raymii-mosquitto
role: secondary
spec:
containers:
- name: raymii-mosquitto
image: eclipse-mosquitto:2.0.21
command: ["mosquitto"]
args: ["-c", "/raymii-mosquitto/config/raymii-mosquitto.conf"]
ports:
- containerPort: 1883
- containerPort: 3883
livenessProbe:
tcpSocket:
port: 1883
initialDelaySeconds: 5
periodSeconds: 10
readinessProbe:
tcpSocket:
port: 1883
initialDelaySeconds: 5
periodSeconds: 10
volumeMounts:
- name: bridge-config
mountPath: /raymii-mosquitto/config/
volumes:
- name: bridge-config
configMap:
name: raymii-mosquitto-bridge-config
restartPolicy: Always
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- raymii-mosquitto
topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: Service
metadata:
name: raymii-mosquitto-svc
namespace: raymii-mosquitto-dev
spec:
type: LoadBalancer
selector:
app: raymii-mosquitto
role: primary
ports:
- port: 1883
targetPort: 1883
protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
name: raymii-mosquitto-primary-svc
namespace: raymii-mosquitto-dev
spec:
type: LoadBalancer
selector:
app: raymii-mosquitto
role: primary
ports:
- port: 2883
targetPort: 2883
protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
name: raymii-mosquitto-secondary-svc
namespace: raymii-mosquitto-dev
spec:
type: LoadBalancer
selector:
app: raymii-mosquitto
role: secondary
ports:
- port: 3883
targetPort: 3883
protocol: TCP
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: raymii-mosquitto-failover
namespace: raymii-mosquitto-dev
spec:
replicas: 1
selector:
matchLabels:
app: raymii-mosquitto-failover
template:
metadata:
labels:
app: raymii-mosquitto-failover
spec:
serviceAccountName: raymii-mosquitto-failover-sa
containers:
- name: failover
image: bitnami/kubectl
command:
- /bin/sh
- -c
- |
PREV_STATUS=""
while true; do
STATUS=$(kubectl get pod -l app=raymii-mosquitto,role=primary -n raymii-mosquitto-dev -o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].status}')
if [ "$STATUS" != "$PREV_STATUS" ]; then
if [ "$STATUS" != "True" ]; then
kubectl patch service raymii-mosquitto-svc -n raymii-mosquitto-dev -p '{"spec":{"selector":{"app":"raymii-mosquitto","role":"secondary"}}}'
echo "$(date) - Primary down, routing to secondary."
else
kubectl patch service raymii-mosquitto-svc -n raymii-mosquitto-dev -p '{"spec":{"selector":{"app":"raymii-mosquitto","role":"primary"}}}'
echo "$(date) - Primary healthy, routing to primary."
fi
PREV_STATUS="$STATUS"
fi
sleep 5
done
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- raymii-mosquitto
topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: raymii-mosquitto-failover-sa
namespace: raymii-mosquitto-dev
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: raymii-mosquitto-failover-role
namespace: raymii-mosquitto-dev
rules:
- apiGroups: [""]
resources: ["pods", "services"]
verbs: ["get", "patch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: raymii-mosquitto-failover-rb
namespace: raymii-mosquitto-dev
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: raymii-mosquitto-failover-role
subjects:
- kind: ServiceAccount
name: raymii-mosquitto-failover-sa
namespace: raymii-mosquitto-dev
---
apiVersion: traefik.io/v1alpha1
kind: IngressRouteTCP
metadata:
name: raymii-mosquitto-dev-mqtt
namespace: raymii-mosquitto-dev
spec:
entryPoints:
- raymii-mosquitto-dev-mqtt
routes:
- match: HostSNI(`*`)
services:
- name: raymii-mosquitto-svc
port: 1883
---
apiVersion: traefik.io/v1alpha1
kind: IngressRouteTCP
metadata:
name: raymii-mosquitto-dev-mqtt-primary
namespace: raymii-mosquitto-dev
spec:
entryPoints:
- raymii-mosquitto-dev-mqtt-primary
routes:
- match: HostSNI(`*`)
services:
- name: raymii-mosquitto-primary-svc
port: 2883
---
apiVersion: traefik.io/v1alpha1
kind: IngressRouteTCP
metadata:
name: raymii-mosquitto-dev-mqtt-secondary
namespace: raymii-mosquitto-dev
spec:
entryPoints:
- raymii-mosquitto-dev-mqtt-secondary
routes:
- match: HostSNI(`*`)
services:
- name: raymii-mosquitto-secondary-svc
port: 3883
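To deploy, save the manifest and apply it (the filename below is just an example); the wide output lets you confirm that the three pods land on different nodes:

kubectl apply -f raymii-mosquitto-ha.yaml
kubectl get pods -n raymii-mosquitto-dev -o wide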
Traefik Helm Chart Config for k3s
In k3s 1.32, Traefik is the default ingress controller, but by default it's wired only for HTTP(S) routing. The moment you need to route TCP services like MQTT (ports 1883, 2883, 3883), you hit a hard wall unless you explicitly configure Traefik to expose those ports. That's where the HelmChartConfig CRD becomes essential.
By creating a HelmChartConfig with the correct valuesContent, you're injecting custom values into the Traefik Helm chart managed by k3s itself. Without this, Traefik won't bind to additional TCP ports, won't route traffic to the MQTT services and won't even start listeners, because k3s uses its own embedded Helm controller and you can't patch the deployment directly. This configuration is the only supported way to modify the Traefik deployment in-place when using the bundled k3s setup.
K3s watches this HelmChartConfig, applies the changes during Traefik chart reconciliation, and ensures that ports like 1883, 2883 and 3883 are properly exposed at the node level and routed to the right IngressRouteTCP rules.
This is the YAML file:
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
name: traefik
namespace: kube-system
spec:
valuesContent: |-
logs:
general:
level: "DEBUG"
access:
enabled: false
ports:
web:
port: 80
expose:
default: true
websecure:
port: 443
expose:
default: true
raymii-mosquitto-dev-mqtt:
port: 1883
expose:
default: true
exposedPort: 1883
raymii-mosquitto-dev-mqtt-primary:
port: 2883
expose:
default: true
exposedPort: 2883
raymii-mosquitto-dev-mqtt-secondary:
port: 3883
expose:
default: true
exposedPort: 3883
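After k3s has reconciled the chart, you can check that the extra entrypoints show up as ports on the bundled Traefik service (this assumes the default k3s Traefik deployment in kube-system):

kubectl -n kube-system get svc traefik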
Tags: armbian, cloud, ha, high-availability, k3s, k8s, kubernetes, linux, mosquitto, mqtt, orange-pi, raspberry-pi, traefik, tutorials