Zonal Isolation for Cadence Workflows
At Uber, we want regional resilience: losing a zone within a region should be tolerated without a cross-region failover, and it should affect at most a subset of the workload rather than all of it. However, in Cadence-based systems, the workload in a region is distributed randomly across all workers in the region at task-level granularity, which means that any worker in the region where the domain is active may work on a given workflow. To address this, we introduced Zonal Isolation for Cadence Workflows, a feature that pins workflows to the zone they are started in, so that zonal isolation is achieved at the workflow level.
What is Zonal Isolation for Cadence Workflows?
At a high level, Zonal Isolation for Cadence Workflows can be thought of at two levels:
- Task-level isolation: All decision tasks and activity tasks of a workflow are processed only by workers in the zone where the workflow was started.
- Infrastructure-level isolation: Within a regional Cadence cluster, workflows are handled by server instances in the same zone where they were started, and the corresponding data is stored in that zone as well.
Infrastructure-level isolation is quite challenging to implement as it requires significant changes to the core design of the Cadence server. Due to the complexity involved, support for this feature is not planned for the foreseeable future.
As a result, the focus remains on achieving task-level zonal isolation outside the Cadence server, which offers a more practical and immediate way to improve system resilience. It ensures that an unhealthy zone (e.g., one with a bad deployment of workers) only affects a subset of workflows (those started from that zone) rather than every workflow in a Cadence domain.
How Does Zonal Isolation Work in Cadence?
Architecture
Here is what the architecture of a zonally isolated Cadence-based system looks like:
Fig: Workflows started in one zone are dispatched only to workers in the same zone. Colors emphasize the pinning.
Implementation
Determine the zone of a workflow and workers
To ensure that tasks are dispatched to workers in the same zone as the workflows, we must identify the origin zone of both. The zone of a workflow is determined by the origin zone of the StartWorkflowExecution request, while the zone of workers is determined by the origin zone of the PollForDecisionTask and PollForActivityTask requests. There are three possible ways to determine the origin zone for these requests:
- Uber's Approach: Let the Cadence SDK set the origin zone in the request headers before sending the request to Cadence.
- Preferred Approach: Get the origin zone of the requests from headers set by network infrastructure.
- Determine the origin zone of the requests from the zone of the cadence-frontend instance receiving the request, if the network layer has already achieved zonal isolation.
The second approach is ideal, but Uber's network infrastructure doesn't provide such headers, and the network layer is not ready for zonal isolation. As a result, we adopted the first approach. At Uber, we have internal libraries in Go and Java that wrap the Cadence SDK and inject the necessary configuration. These libraries have been updated to include the origin zone in the request headers, using a header called cadence-client-isolation-group.
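To make the header injection concrete, here is a minimal sketch in Go of what such a wrapper might do before a request leaves the client. Only the header name cadence-client-isolation-group comes from the text above. The helper withIsolationGroup, the ZONE environment variable, and the plain map used for headers are hypothetical stand-ins, not the API of the Cadence SDK or of Uber's internal libraries.

```go
package main

import (
	"fmt"
	"os"
)

// isolationGroupHeader is the header name mentioned in the text.
const isolationGroupHeader = "cadence-client-isolation-group"

// zoneEnvVar is a hypothetical environment variable that holds the
// zone the calling process runs in.
const zoneEnvVar = "ZONE"

// withIsolationGroup returns a copy of headers with the origin zone attached.
// Assumption for this sketch: when the zone is unknown (empty), the header
// is simply not added.
func withIsolationGroup(headers map[string]string, zone string) map[string]string {
	out := make(map[string]string, len(headers)+1)
	for k, v := range headers {
		out[k] = v
	}
	if zone != "" {
		out[isolationGroupHeader] = zone
	}
	return out
}

func main() {
	// A wrapper library would do this once per outgoing request, for
	// StartWorkflowExecution as well as for the poll requests.
	zone := os.Getenv(zoneEnvVar)
	headers := withIsolationGroup(map[string]string{"example-header": "value"}, zone)
	fmt.Println(headers)
}
```

Because the same helper would run for both start and poll requests, workflows and workers end up labeled with zones from the same source.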
Dispatch tasks to workers from the same zone
To implement task-level isolation, we introduce a new dimension to the tasklist: the isolation group. When a workflow is started, its origin zone is stored in the database. Each time a decision or activity task is dispatched to cadence-matching, the workflow's origin zone is used as the isolation group for that task.
When a worker sends a PollForDecisionTask or PollForActivityTask request to cadence-matching, the request is labeled with the worker's isolation group (i.e., the worker's zone). Tasks are then dispatched only to poller requests that have the same isolation group, ensuring that tasks are processed by workers in the same zone as the workflow's origin.
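The matching rule described above can be illustrated with a small in-memory sketch in Go. It is not the cadence-matching implementation: the Task, Poller, and TaskList types are hypothetical, and the sketch leaves out everything the text does not describe, such as persistence or what happens when a zone has no pollers. It only shows tasks being queued by isolation group and handed out to pollers from the same group.

```go
package main

import "fmt"

// Task is a simplified decision or activity task (hypothetical type).
type Task struct {
	WorkflowID     string
	IsolationGroup string // origin zone of the workflow, stored when it started
}

// Poller is a simplified poll request from a worker (hypothetical type).
type Poller struct {
	WorkerID       string
	IsolationGroup string // zone of the worker
}

// TaskList keeps one FIFO queue per isolation group.
type TaskList struct {
	queues map[string][]Task
}

// NewTaskList returns an empty task list.
func NewTaskList() *TaskList {
	return &TaskList{queues: make(map[string][]Task)}
}

// Add enqueues a task under the isolation group of its workflow.
func (tl *TaskList) Add(t Task) {
	tl.queues[t.IsolationGroup] = append(tl.queues[t.IsolationGroup], t)
}

// Poll returns the oldest task whose isolation group matches the poller's.
// The boolean is false when no task is waiting for that group.
func (tl *TaskList) Poll(p Poller) (Task, bool) {
	q := tl.queues[p.IsolationGroup]
	if len(q) == 0 {
		return Task{}, false
	}
	tl.queues[p.IsolationGroup] = q[1:]
	return q[0], true
}

func main() {
	tl := NewTaskList()
	tl.Add(Task{WorkflowID: "wf-1", IsolationGroup: "zone-a"})
	tl.Add(Task{WorkflowID: "wf-2", IsolationGroup: "zone-b"})

	// A worker in zone-b only ever receives the zone-b task.
	t, ok := tl.Poll(Poller{WorkerID: "w-1", IsolationGroup: "zone-b"})
	fmt.Println(t.WorkflowID, ok) // prints "wf-2 true"

	// The zone-a task stays queued for zone-a workers.
	_, ok = tl.Poll(Poller{WorkerID: "w-2", IsolationGroup: "zone-b"})
	fmt.Println(ok) // prints "false"
}
```

Keying the queues by isolation group means a bad worker deployment in one zone can only pick up tasks for workflows that started in that zone.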
