Citus support

Patroni makes it extremely simple to deploy Multi-Node Citus clusters.

TL;DR

There are only a few simple rules you need to follow:

Citus database extension to PostgreSQL must be available on all nodes. Absolute minimum supported Citus version is 10.0, but, to take all benefits from transparent switchovers and restarts of workers we recommend using at least Citus 11.2.
Cluster name (scope) must be the same for all Citus nodes!
Superuser credentials must be the same on coordinator and all worker nodes, and pg_hba.conf should allow superuser access between all nodes.
REST API access should be allowed from worker nodes to the coordinator. E.g., credentials should be the same and if configured, client certificates from worker nodes must be accepted by the coordinator.
Add the following section to the patroni.yaml:

citus:
  group: X  # 0 for coordinator and 1, 2, 3, etc for workers
  database: citus  # must be the same on all nodes

After that you just need to start Patroni and it will handle the rest:

Patroni will set bootstrap.dcs.synchronous_mode to quorum if it is not explicitly set to any other value.
citus extension will be automatically added to shared_preload_libraries.
If max_prepared_transactions isn’t explicitly set in the global dynamic configuration Patroni will automatically set it to 2*max_connections.
The citus.local_hostname GUC value will be adjusted from localhost to the value that Patroni is using in order to connect to the local PostgreSQL instance. The value sometimes should be different from the localhost because PostgreSQL might be not listening on it.
The citus.database will be automatically created followed by CREATE EXTENSION citus.
Current superuser credentials will be added to the pg_dist_authinfo table to allow cross-node communication. Don’t forget to update them if later you decide to change superuser username/password/sslcert/sslkey!
The coordinator primary node will automatically discover worker primary nodes and add them to the pg_dist_node table using the citus_add_node() function.
Patroni will also maintain pg_dist_node in case failover/switchover on the coordinator or worker clusters occurs.

patronictl

Coordinator and worker clusters are physically different PostgreSQL/Patroni clusters that are just logically grouped together using the Citus database extension to PostgreSQL. Therefore in most cases it is not possible to manage them as a single entity.

It results in two major differences in patronictl behaviour when patroni.yaml has the citus section comparing with the usual:

The list and the topology by default output all members of the Citus formation (coordinators and workers). The new column Group indicates which Citus group they belong to.
For all patronictl commands the new option is introduced, named --group. For some commands the default value for the group might be taken from the patroni.yaml. For example, patronictl pause will enable the maintenance mode by default for the group that is set in the citus section, but for example for patronictl switchover or patronictl remove the group must be explicitly specified.

An example of patronictl list output for the Citus cluster:

postgres@coord1:~$ patronictl list demo
+ Citus cluster: demo ----------+----------------+---------+----+-------------+-----+------------+-----+
| Group | Member  | Host        | Role           | State   | TL | Receive LSN | Lag | Replay LSN | Lag |
+-------+---------+-------------+----------------+---------+----+-------------+-----+------------+-----+
|     0 | coord1  | 172.27.0.10 | Replica        | running |  1 |   0/41C0368 |   0 |  0/41C0368 |   0 |
|     0 | coord2  | 172.27.0.6  | Quorum Standby | running |  1 |   0/41C0368 |   0 |  0/41C0368 |   0 |
|     0 | coord3  | 172.27.0.4  | Leader         | running |  1 |             |     |            |     |
|     1 | work1-1 | 172.27.0.8  | Quorum Standby | running |  1 |   0/31D3198 |   0 |  0/31D3198 |   0 |
|     1 | work1-2 | 172.27.0.2  | Leader         | running |  1 |             |     |            |     |
|     2 | work2-1 | 172.27.0.5  | Quorum Standby | running |  1 |   0/31CDFC0 |   0 |  0/31CDFC0 |   0 |
|     2 | work2-2 | 172.27.0.7  | Leader         | running |  1 |             |     |            |     |
+-------+---------+-------------+----------------+---------+----+-------------+-----+------------+-----+

If we add the --group option, the output will change to:

postgres@coord1:~$ patronictl list demo --group 0
+ Citus cluster: demo (group: 0, 7179854923829112860) -+-------------+-----+------------+-----+
| Member | Host        | Role           | State   | TL | Receive LSN | Lag | Replay LSN | Lag |
+--------+-------------+----------------+---------+----+-------------+-----+------------+-----+
| coord1 | 172.27.0.10 | Replica        | running |  1 |   0/41C0368 |   0 |  0/41C0368 |   0 |
| coord2 | 172.27.0.6  | Quorum Standby | running |  1 |   0/41C0368 |   0 |  0/41C0368 |   0 |
| coord3 | 172.27.0.4  | Leader         | running |  1 |             |     |            |     |
+--------+-------------+----------------+---------+----+-------------+-----+------------+-----+

postgres@coord1:~$ patronictl list demo --group 1
+ Citus cluster: demo (group: 1, 7179854923881963547) -+-------------+-----+------------+-----+
| Member  | Host       | Role           | State   | TL | Receive LSN | Lag | Replay LSN | Lag |
+---------+------------+----------------+---------+----+-------------+-----+------------+-----+
| work1-1 | 172.27.0.8 | Quorum Standby | running |  1 |   0/31D3198 |   0 |  0/31D3198 |   0 |
| work1-2 | 172.27.0.2 | Leader         | running |  1 |             |     |            |     |
+---------+------------+----------------+---------+----+-------------+-----+------------+-----+

Citus worker switchover

When a switchover is orchestrated for a Citus worker node, Citus offers the opportunity to make the switchover close to transparent for an application. Because the application connects to the coordinator, which in turn connects to the worker nodes, then it is possible with Citus to pause the SQL traffic on the coordinator for the shards hosted on a worker node. The switchover then happens while the traffic is kept on the coordinator, and resumes as soon as a new primary worker node is ready to accept read-write queries.

An example of patronictl switchover on the worker cluster:

postgres@coord1:~$ patronictl switchover demo
+ Citus cluster: demo ----------+----------------+---------+----+-------------+-----+------------+-----+
| Group | Member  | Host        | Role           | State   | TL | Receive LSN | Lag | Replay LSN | Lag |
+-------+---------+-------------+----------------+---------+----+-------------+-----+------------+-----+
|     0 | coord1  | 172.27.0.10 | Replica        | running |  1 |   0/41C0368 |   0 |  0/41C0368 |   0 |
|     0 | coord2  | 172.27.0.6  | Quorum Standby | running |  1 |   0/41C0368 |   0 |  0/41C0368 |   0 |
|     0 | coord3  | 172.27.0.4  | Leader         | running |  1 |             |     |            |     |
|     1 | work1-1 | 172.27.0.8  | Leader         | running |  1 |             |     |            |     |
|     1 | work1-2 | 172.27.0.2  | Quorum Standby | running |  1 |   0/31D3198 |   0 |  0/31D3198 |   0 |
|     2 | work2-1 | 172.27.0.5  | Quorum Standby | running |  1 |   0/31CDFC0 |   0 |  0/31CDFC0 |   0 |
|     2 | work2-2 | 172.27.0.7  | Leader         | running |  1 |             |     |            |     |
+-------+---------+-------------+----------------+---------+----+-------------+-----+------------+-----+
Citus group: 2
Primary [work2-2]:
Candidate ['work2-1'] []:
When should the switchover take place (e.g. 2024-08-26T08:02 )  [now]:
Current cluster topology
+ Citus cluster: demo (group: 2, 7179854924063375386) -+-------------+-----+------------+-----+
| Member  | Host       | Role           | State   | TL | Receive LSN | Lag | Replay LSN | Lag |
+---------+------------+----------------+---------+----+-------------+-----+------------+-----+
| work2-1 | 172.27.0.5 | Quorum Standby | running |  1 |   0/31CDFC0 |   0 |  0/31CDFC0 |   0 |
| work2-2 | 172.27.0.7 | Leader         | running |  1 |             |     |            |     |
+---------+------------+----------------+---------+----+-------------+-----+------------+-----+
Are you sure you want to switchover cluster demo, demoting current primary work2-2? [y/N]: y
2024-08-26 07:02:40.33003 Successfully switched over to "work2-1"
+ Citus cluster: demo (group: 2, 7179854924063375386) --------+---------+------------+---------+
| Member  | Host       | Role    | State   | TL | Receive LSN |     Lag | Replay LSN |     Lag |
+---------+------------+---------+---------+----+-------------+---------+------------+---------+
| work2-1 | 172.27.0.5 | Leader  | running |  1 |             |         |            |         |
| work2-2 | 172.27.0.7 | Replica | stopped |    |     unknown | unknown |    unknown | unknown |
+---------+------------+---------+---------+----+-------------+---------+------------+---------+

postgres@coord1:~$ patronictl list demo
+ Citus cluster: demo ----------+----------------+---------+----+-------------+-----+------------+-----+
| Group | Member  | Host        | Role           | State   | TL | Receive LSN | Lag | Replay LSN | Lag |
+-------+---------+-------------+----------------+---------+----+-------------+-----+------------+-----+
|     0 | coord1  | 172.27.0.10 | Replica        | running |  1 |   0/41C0368 |   0 |  0/41C0368 |   0 |
|     0 | coord2  | 172.27.0.6  | Quorum Standby | running |  1 |   0/41C0368 |   0 |  0/41C0368 |   0 |
|     0 | coord3  | 172.27.0.4  | Leader         | running |  1 |             |     |            |     |
|     1 | work1-1 | 172.27.0.8  | Leader         | running |  1 |             |     |            |     |
|     1 | work1-2 | 172.27.0.2  | Quorum Standby | running |  1 |   0/31D3198 |   0 |  0/31D3198 |   0 |
|     2 | work2-1 | 172.27.0.5  | Leader         | running |  2 |             |     |            |     |
|     2 | work2-2 | 172.27.0.7  | Quorum Standby | running |  2 |   0/31CDFC0 |   0 |  0/31CDFC0 |   0 |
+-------+---------+-------------+----------------+---------+----+-------------+-----+------------+-----+

And this is how it looks on the coordinator side:

# The worker primary notifies the coordinator that it is going to execute "pg_ctl stop".
2024-08-26 07:02:38,636 DEBUG: query(BEGIN, ())
2024-08-26 07:02:38,636 DEBUG: query(SELECT pg_catalog.citus_update_node(%s, %s, %s, true, %s), (3, '172.19.0.7-demoted', 5432, 10000))
# From this moment all application traffic on the coordinator to the worker group 2 is paused.

# The old worker primary is assigned as a secondary.
2024-08-26 07:02:40,084 DEBUG: query(SELECT pg_catalog.citus_update_node(%s, %s, %s, true, %s), (7, '172.19.0.7', 5432, 10000))

# The future worker primary notifies the coordinator that it acquired the leader lock in DCS and about to run "pg_ctl promote".
2024-08-26 07:02:40,085 DEBUG: query(SELECT pg_catalog.citus_update_node(%s, %s, %s, true, %s), (3, '172.19.0.5', 5432, 10000))

# The new worker primary just finished promote and notifies coordinator that it is ready to accept read-write traffic.
2024-08-26 07:02:41,485 DEBUG: query(COMMIT, ())
# From this moment the application traffic on the coordinator to the worker group 2 is unblocked.

Secondary nodes

Starting from Patroni v4.0.0 Citus secondary nodes without noloadbalance tag are also registered in pg_dist_node. However, to use secondary nodes for read-only queries applications need to change citus.use_secondary_nodes GUC.

Peek into DCS

The Citus cluster (coordinator and workers) are stored in DCS as a fleet of Patroni clusters logically grouped together:

/service/batman/              # scope=batman
/service/batman/0/            # citus.group=0, coordinator
/service/batman/0/initialize
/service/batman/0/leader
/service/batman/0/members/
/service/batman/0/members/m1
/service/batman/0/members/m2
/service/batman/1/            # citus.group=1, worker
/service/batman/1/initialize
/service/batman/1/leader
/service/batman/1/members/
/service/batman/1/members/m3
/service/batman/1/members/m4
...

Such an approach was chosen because for most DCS it becomes possible to fetch the entire Citus cluster with a single recursive read request. Only Citus coordinator nodes are reading the whole tree, because they have to discover worker nodes. Worker nodes are reading only the subtree for their own group and in some cases they could read the subtree of the coordinator group.

Citus on Kubernetes

Since Kubernetes doesn’t support hierarchical structures we had to include the citus group to all K8s objects Patroni creates:

batman-0-leader  # the leader config map for the coordinator
batman-0-config  # the config map holding initialize, config, and history "keys"
...
batman-1-leader  # the leader config map for worker group 1
batman-1-config
...

I.e., the naming pattern is: ${scope}-${citus.group}-${type}.

All Kubernetes objects are discovered by Patroni using the label selector, therefore all Pods with Patroni&Citus and Endpoints/ConfigMaps must have similar labels, and Patroni must be configured to use them using Kubernetes settings or environment variables.

A couple of examples of Patroni configuration using Pods environment variables:

for the coordinator cluster

apiVersion: v1
kind: Pod
metadata:
  labels:
    application: patroni
    citus-group: "0"
    citus-type: coordinator
    cluster-name: citusdemo
  name: citusdemo-0-0
  namespace: default
spec:
  containers:
  - env:
    - name: PATRONI_SCOPE
      value: citusdemo
    - name: PATRONI_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: PATRONI_KUBERNETES_POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: PATRONI_KUBERNETES_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: PATRONI_KUBERNETES_LABELS
      value: '{application: patroni}'
    - name: PATRONI_CITUS_DATABASE
      value: citus
    - name: PATRONI_CITUS_GROUP
      value: "0"

for the worker cluster from the group 2

apiVersion: v1
kind: Pod
metadata:
  labels:
    application: patroni
    citus-group: "2"
    citus-type: worker
    cluster-name: citusdemo
  name: citusdemo-2-0
  namespace: default
spec:
  containers:
  - env:
    - name: PATRONI_SCOPE
      value: citusdemo
    - name: PATRONI_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: PATRONI_KUBERNETES_POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: PATRONI_KUBERNETES_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: PATRONI_KUBERNETES_LABELS
      value: '{application: patroni}'
    - name: PATRONI_CITUS_DATABASE
      value: citus
    - name: PATRONI_CITUS_GROUP
      value: "2"

As you may noticed, both examples have citus-group label set. This label allows Patroni to identify object as belonging to a certain Citus group. In addition to that, there is also PATRONI_CITUS_GROUP environment variable, which has the same value as the citus-group label. When Patroni creates new Kubernetes objects ConfigMaps or Endpoints, it automatically puts the citus-group: ${env.PATRONI_CITUS_GROUP} label on them:

apiVersion: v1
kind: ConfigMap
metadata:
  name: citusdemo-0-leader  # Is generated as ${env.PATRONI_SCOPE}-${env.PATRONI_CITUS_GROUP}-leader
  labels:
    application: patroni    # Is set from the ${env.PATRONI_KUBERNETES_LABELS}
    cluster-name: citusdemo # Is automatically set from the ${env.PATRONI_SCOPE}
    citus-group: '0'        # Is automatically set from the ${env.PATRONI_CITUS_GROUP}

You can find a complete example of Patroni deployment on Kubernetes with Citus support in the kubernetes folder of the Patroni repository.

There are two important files for you:

Dockerfile.citus
citus_k8s.yaml

Citus upgrades and PostgreSQL major upgrades

First, please read about upgrading Citus version in the documentation. There is one minor change in the process. When executing upgrade, you have to use patronictl restart instead of systemctl restart to restart PostgreSQL.

The PostgreSQL major upgrade with Citus is a bit more complex. You will have to combine techniques used in the Citus documentation about major upgrades and Patroni documentation about PostgreSQL major upgrade. Please keep in mind that Citus cluster consists of many Patroni clusters (coordinator and workers) and they all have to be upgraded independently.