Zonal Isolation for Cadence Workflows
At Uber, we want to achieve regional resilience such that losing a zone within a region can be tolerated without requiring a cross-region failover. We also want to make sure that losing a zone only affects a subset of workload, at most, rather than everything. However, in Cadence-based systems, the workload in a region is distributed randomly across all workers in the region at a “task-level granularity”, which means a workflow may be worked on by any worker in the region where the domain is active. To achieve this goal, we introduced Zonal Isolation for Cadence Workflows - a feature designed to pin workflows to the zone they are started in, so that zonal isolation can be achieved at a workflow-level.
What is Zonal Isolation for Cadence Workflows?
At high-level, Zonal Isolation for Cadence Workflows can be thought in 2 levels:
- Task-level isolation: All decision tasks and activity tasks of a workflow are only processed by workers from the zone where the workflow was started
- Infrastructure-level isolation: Within a regional Cadence cluster, workflows are handled by server instances in the same zone where they were started, and the corresponding data is stored in that zone as well.
Infrastructure-level isolation is quite challenging to implement as it requires significant changes to the core design of the Cadence server. Due to the complexity involved, support for this feature is not planned for the foreseeable future.
As a result, the focus remains on achieving task-level zonal isolation outside the Cadence server, which offers a more practical and immediate way to improve system resilience. It provides the capability of ensuring that an unhealthy zone (i.e. bad deployment of workers) only affect a subset of workflows (started from a certain zone) rather than every workflow in a Cadence domain.