# Cross-DC replication
The Cadence Global Domain feature provides clients with the capability to continue their workflow execution from another cluster in the event of a datacenter failover. Although you can configure a Global Domain to be replicated to any number of clusters, it is only considered active in a single cluster.
# Global Domains Architecture
Cadence has introduced a new top-level entity, Global Domains, which provides support for replicating workflow execution across clusters. A global domain can be configured with more than one cluster, but it can be active in only one of them at any point in time. In every other cluster the domain is called standby.
The number of standby clusters can be zero, if a global domain is configured with only one cluster. This is preferred/recommended.
Any workflow of a global domain can only make progress in its active cluster, and the workflow progress is replicated to the standby clusters. For example, starting a workflow by calling StartWorkflow, or starting an activity (through the PollForActivityTask API), can only be processed in the active cluster. After the active cluster has made progress, the standby clusters (if any) poll the history from the active cluster to replicate the workflow state.
However, standby clusters can also receive requests, e.g. for starting workflows or starting activities. They know in which cluster the domain is active, so the requests can be routed to the active cluster. This is called api-forwarding in Cadence. api-forwarding makes it possible to have no downtime during failover.
There are two api-forwarding policies: selected-api-forwarding and all-domain-api-forwarding.
With selected-api-forwarding, applications need to run a separate set of activity & workflow workers polling on every cluster. Cadence will only dispatch tasks on the current active cluster; workers on the standby cluster will sit idle until the Global Domain is failed over. This is recommended if XDC is being used across clusters running in very remote data centers (regions), where forwarding is expensive.
With all-domain-api-forwarding, applications only need to run activity & workflow workers polling on one cluster, which makes the application setup easier. This is recommended when the clusters are all in local or nearby datacenters. See more details in the discussion.
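The routing decision described above can be pictured with a toy model. This is only a sketch under stated assumptions: the function `route_request` and the set `FORWARDED_BY_SELECTED_POLICY` (including its contents) are hypothetical and are not Cadence's implementation. The only facts taken from this page are that start and signal requests can be forwarded, and that under selected-api-forwarding workers must poll on every cluster, so poll calls are modeled as staying local.

```python
# Toy model of api-forwarding. Names and API sets are illustrative assumptions.
FORWARDED_BY_SELECTED_POLICY = {"StartWorkflowExecution", "SignalWorkflowExecution"}


def route_request(api, local_cluster, active_cluster, policy):
    """Return the cluster that should handle `api` for a global domain."""
    if local_cluster == active_cluster:
        # Already on the active cluster, nothing to forward.
        return local_cluster
    if policy == "all-domain-api-forwarding":
        # Every call, including worker polls, goes to the active cluster.
        return active_cluster
    if policy == "selected-api-forwarding":
        # Only the selected APIs are forwarded; polls stay local, so workers
        # attached to a standby cluster sit idle until a failover.
        if api in FORWARDED_BY_SELECTED_POLICY:
            return active_cluster
        return local_cluster
    raise ValueError(f"unknown policy: {policy}")
```

In this model a worker polling a standby cluster is served locally under selected-api-forwarding, which is why that policy needs workers on every cluster, while all-domain-api-forwarding sends the same poll to the active cluster.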
# Conflict Resolution
Unlike local domains, which provide at-most-once semantics for activity execution, Global Domains can only support at-least-once semantics. Cadence Global Domains rely on asynchronous replication of events across clusters, so in the event of a failover it is possible that an activity gets dispatched again on the new active cluster because of replication task lag. This also means that whenever a workflow execution is updated after a failover by the new cluster, any previous replication tasks for that execution cannot be applied. This results in the loss of some progress made by the workflow execution in the previous active cluster. During such conflict resolution, Cadence re-injects any external events, like Signals, into the new history before discarding the replication tasks. Even though some progress could roll back during failovers, Cadence guarantees that workflows won’t get stuck and will continue to make forward progress.
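The idea of "discard stale replication tasks, but keep external events" can be illustrated with a purely conceptual sketch. The event dictionaries, their `id` and `type` fields, and the function `resolve_conflict` are all hypothetical; the real mechanism in Cadence is more involved and is not described on this page.

```python
def resolve_conflict(new_history, stale_events):
    """Keep the new active cluster's history and re-inject stale Signals.

    `new_history` is what the new active cluster has written since failover.
    `stale_events` arrived late from the previous active cluster and can no
    longer be applied as-is.
    """
    seen_signal_ids = {e["id"] for e in new_history if e["type"] == "Signal"}
    reinjected = [
        e for e in stale_events
        if e["type"] == "Signal" and e["id"] not in seen_signal_ids
    ]
    # Every other stale event (for example an activity completion) is
    # dropped, which is the "lost progress" described in the text.
    return new_history + reinjected


new_history = [
    {"id": "start", "type": "WorkflowStarted"},
    {"id": "act-1-retry", "type": "ActivityScheduled"},
]
stale_events = [
    {"id": "act-1-done", "type": "ActivityCompleted"},
    {"id": "sig-1", "type": "Signal"},
]
merged = resolve_conflict(new_history, stale_events)
```

Here the late activity completion is dropped, so the activity runs again on the new active cluster (at-least-once), while the signal survives in the merged history.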
# Global Domain Concepts, Configuration and Operation
# IsGlobal
This config is used to distinguish domains local to the cluster from global domains. It controls the creation of replication tasks on updates, allowing the state to be replicated across clusters. This is a read-only setting that can only be set when the domain is provisioned.
# Clusters
A list of clusters that the domain can fail over to, including the current active cluster. This is also a read-only setting that can only be set when the domain is provisioned. A re-replication feature on the roadmap will allow updating this config to add/remove clusters in the future.
# Active Cluster Name
Name of the current active cluster for the Global Domain. This config is updated each time the Global Domain is failed over to another cluster.
# Failover Version
A unique failover version that also identifies the current active cluster for the Global Domain. Cadence allows failover to be triggered from any cluster, so the failover version is designed to avoid conflicts if a failover is mistakenly triggered simultaneously on two clusters.
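One way to picture how a version number can identify a cluster without any coordination is modular arithmetic over the `initialFailoverVersion` and `failoverVersionIncrement` values that appear in the production configuration on this page. The sketch below assumes that each cluster only produces versions equal to its own `initialFailoverVersion` modulo `failoverVersionIncrement`. The helper names are hypothetical, and this illustrates the idea rather than reproducing Cadence's code.

```python
FAILOVER_VERSION_INCREMENT = 10
INITIAL_FAILOVER_VERSION = {"clusterDCA": 1, "clusterDCB": 0}


def next_failover_version(current_version, target_cluster):
    """Smallest version greater than current_version owned by target_cluster."""
    initial = INITIAL_FAILOVER_VERSION[target_cluster]
    steps = (current_version - initial) // FAILOVER_VERSION_INCREMENT + 1
    return initial + max(steps, 0) * FAILOVER_VERSION_INCREMENT


def cluster_for_version(version):
    """Recover the active cluster from a failover version."""
    remainder = version % FAILOVER_VERSION_INCREMENT
    for name, initial in INITIAL_FAILOVER_VERSION.items():
        if initial == remainder:
            return name
    raise ValueError(f"no cluster owns failover version {version}")


# Two failovers triggered at the same time from version 1 cannot collide,
# because each target cluster draws from its own sequence of versions.
to_dcb = next_failover_version(1, "clusterDCB")
to_dca = next_failover_version(1, "clusterDCA")
```

Under this assumption the two concurrent requests yield different versions, so they can always be ordered, and either version alone tells every cluster which cluster it names as active.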
# Operate by CLI
The Cadence CLI can also be used to query the domain config or perform failovers. Here are some useful commands.
# Describe Global Domain
The following command can be used to describe Global Domain metadata:
$ cadence --do cadence-canary-xdc d desc
Name: cadence-canary-xdc
Description: cadence canary cross dc testing domain
OwnerEmail: email@example.com
DomainData:
Status: REGISTERED
RetentionInDays: 7
EmitMetrics: true
ActiveClusterName: dc1
Clusters: dc1, dc2
# Failover Global Domain using domain update command (being deprecated in favor of managed graceful failover)
The following command can be used to fail over the Global Domain my-domain-global to the dc2 cluster:
$ cadence --do my-domain-global d up --ac dc2
# Failover Global Domain using Managed Graceful Failover
First, update each domain to enable this feature:
$ cadence --do test-global-domain-0 d update --domain_data IsManagedByCadence:true
$ cadence --do test-global-domain-1 d update --domain_data IsManagedByCadence:true
$ cadence --do test-global-domain-2 d update --domain_data IsManagedByCadence:true
...
Then you can start failing over those global domains using managed failover:
cadence admin cluster failover start --source_cluster dc1 --target_cluster dc2
This will fail over all the domains with IsManagedByCadence:true from dc1 to dc2.
You can provide more detailed options when using the command, and also watch the progress of the failover. Feel free to explore the cadence admin cluster failover command.
# Running Locally
The best way is to use Cadence docker-compose:
docker-compose -f docker-compose-multiclusters.yml up
# Running in Production
The global domain feature needs to be enabled in the static config.
Here we use clusterDCA and clusterDCB as an example. We pick clusterDCA as the primary (formerly called "master") cluster. The only difference for the primary cluster is that it is responsible for domain registration. The primary can be changed later, but it needs to be the same across all clusters.
The ClusterMeta config of clusterDCA should be:
dcRedirectionPolicy:
  policy: "selected-apis-forwarding"
clusterMetadata:
  enableGlobalDomain: true
  failoverVersionIncrement: 10
  masterClusterName: "clusterDCA"
  currentClusterName: "clusterDCA"
  clusterInformation:
    clusterDCA:
      enabled: true
      initialFailoverVersion: 1
      rpcName: "cadence-frontend"
      rpcAddress: "<>:<>"
    clusterDCB:
      enabled: true
      initialFailoverVersion: 0
      rpcName: "cadence-frontend"
      rpcAddress: "<>:<>"
And the ClusterMeta config of clusterDCB should be:
dcRedirectionPolicy:
  policy: "selected-apis-forwarding"
clusterMetadata:
  enableGlobalDomain: true
  failoverVersionIncrement: 10
  masterClusterName: "clusterDCA"
  currentClusterName: "clusterDCB"
  clusterInformation:
    clusterDCA:
      enabled: true
      initialFailoverVersion: 1
      rpcName: "cadence-frontend"
      rpcAddress: "<>:<>"
    clusterDCB:
      enabled: true
      initialFailoverVersion: 0
      rpcName: "cadence-frontend"
      rpcAddress: "<>:<>"
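Since the two files are meant to differ only in `currentClusterName`, a small consistency check can catch copy-paste mistakes before deployment. The sketch below works on plain dictionaries that mirror the `clusterMetadata` section (loading the YAML is left out). The function `check_cluster_metadata` and the specific rules it enforces, in particular that every `initialFailoverVersion` is unique and smaller than `failoverVersionIncrement`, are assumptions made for illustration and not a Cadence tool.

```python
SHARED_KEYS = (
    "enableGlobalDomain",
    "failoverVersionIncrement",
    "masterClusterName",
    "clusterInformation",
)


def check_cluster_metadata(configs):
    """Return a list of problems found across per-cluster clusterMetadata dicts."""
    problems = []
    first = configs[0]
    for cfg in configs[1:]:
        for key in SHARED_KEYS:
            if cfg[key] != first[key]:
                problems.append(
                    f"{key} differs between {first['currentClusterName']} "
                    f"and {cfg['currentClusterName']}"
                )
    names = [cfg["currentClusterName"] for cfg in configs]
    if len(set(names)) != len(names):
        problems.append("currentClusterName must be different in every file")
    for cfg in configs:
        if cfg["currentClusterName"] not in cfg["clusterInformation"]:
            problems.append(f"{cfg['currentClusterName']} is not in clusterInformation")
    versions = [c["initialFailoverVersion"] for c in first["clusterInformation"].values()]
    if len(set(versions)) != len(versions):
        problems.append("initialFailoverVersion values must be unique")
    if any(v >= first["failoverVersionIncrement"] for v in versions):
        problems.append("initialFailoverVersion must be below failoverVersionIncrement")
    return problems


def make_metadata(current_cluster):
    """Build the example metadata from this page for one cluster."""
    return {
        "enableGlobalDomain": True,
        "failoverVersionIncrement": 10,
        "masterClusterName": "clusterDCA",
        "currentClusterName": current_cluster,
        "clusterInformation": {
            "clusterDCA": {"enabled": True, "initialFailoverVersion": 1},
            "clusterDCB": {"enabled": True, "initialFailoverVersion": 0},
        },
    }


problems = check_cluster_metadata([make_metadata("clusterDCA"), make_metadata("clusterDCB")])
```

For the two example configurations the returned list is empty; changing `masterClusterName` in only one of them would be reported as a mismatch.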
After the configuration is deployed:
Register a global domain
cadence --do <domain_name> domain register --global_domain true --clusters clusterDCA clusterDCB --active_cluster clusterDCA
Run some workflows and fail over the domain from one cluster to the other
cadence --do <domain_name> domain update --active_cluster clusterDCB
Then the domain should be failed over to clusterDCB. Now workflows are read-only in clusterDCA, so your workers polling tasks from clusterDCA will become idle.
Note 1: Even though clusterDCA is standby/read-only for this domain, it can be active for another domain. Being active/standby is therefore on a per-domain basis, not per cluster. In other words, if you use XDC to handle a DC failure of clusterDCA, you need to fail over all domains from clusterDCA to clusterDCB.
Note 2: Even though a domain is standby/read-only in a cluster, say clusterDCA, sending write requests (startWF, signalWF, etc.) could still work because there is a forwarding component in the Frontend service. It will try to re-route the requests to the active cluster for the domain.
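Note 1 implies that recovering from the loss of a whole cluster starts with finding every domain that is currently active there. A minimal sketch of that bookkeeping follows, using a hypothetical in-memory mapping from domain name to its active cluster and cluster list (in practice this information would come from describing each domain). The function name and data layout are assumptions for illustration.

```python
def domains_to_fail_over(domains, failed_cluster):
    """Map each domain active in failed_cluster to a standby cluster it can move to."""
    plan = {}
    for name, info in domains.items():
        if info["active"] != failed_cluster:
            continue  # already active elsewhere, nothing to do
        standbys = [c for c in info["clusters"] if c != failed_cluster]
        # None marks a domain with no standby cluster to fail over to.
        plan[name] = standbys[0] if standbys else None
    return plan


domains = {
    "orders": {"active": "clusterDCA", "clusters": ["clusterDCA", "clusterDCB"]},
    "billing": {"active": "clusterDCB", "clusters": ["clusterDCA", "clusterDCB"]},
    "reports": {"active": "clusterDCA", "clusters": ["clusterDCA"]},
}
plan = domains_to_fail_over(domains, "clusterDCA")
```

In this example only `orders` and `reports` are affected by an outage of clusterDCA, and `reports` has no standby cluster because it is configured with a single cluster.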