Zijian Chen
Software Engineer @ Uber

Adaptive Tasklist Scaler

· 5 min read

At Uber, we previously relied on a dynamic configuration service to manually control the number of partitions for scalable tasklists. This configuration approach introduced several operational challenges:

  • Error-prone: Manual updates and deployments were required.
  • Unresponsive: Adjustments were typically reactive, often triggered by customer reports or observed backlogs.
  • Irreversible: Once increased, the number of partitions was rarely decreased due to the complexity of the two-phase process, especially when anticipating future traffic spikes.

To address these issues, we introduced a new component in the Cadence Matching service: Adaptive Tasklist Scaler. This component dynamically monitors tasklist traffic and adjusts partition counts automatically. Since its rollout, we've seen a significant reduction in incidents and operational overhead caused by misconfigured tasklists.
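As a rough illustration of what "monitoring traffic and adjusting partition counts" could look like, here is a minimal Python sketch. It is not the Cadence Matching implementation: the function name `target_partitions`, the per-partition QPS thresholds, and the partition cap are all hypothetical values chosen for the example.

```python
import math


def target_partitions(
    current: int,
    qps: float,
    upscale_qps: float = 200.0,    # hypothetical per-partition upper bound
    downscale_qps: float = 100.0,  # hypothetical per-partition lower bound
    max_partitions: int = 64,      # hypothetical safety cap
) -> int:
    """Return a partition count for the observed tasklist QPS.

    Thresholds are illustration values, not Cadence defaults.
    """
    if current < 1:
        raise ValueError("current must be >= 1")

    per_partition = qps / current
    if per_partition > upscale_qps:
        # Overloaded: add partitions until each is back under the upper bound.
        return min(max_partitions, math.ceil(qps / upscale_qps))
    if per_partition < downscale_qps:
        # Underused: shrink, but keep enough partitions to stay under the
        # upper bound, and never go below one partition.
        return max(1, math.ceil(qps / upscale_qps))
    # Inside the band between the two thresholds: leave the count alone.
    return current


print(target_partitions(current=1, qps=1000))  # scale up
print(target_partitions(current=5, qps=300))   # scale down
```

The gap between the two thresholds is a design choice made for this sketch: it keeps the count stable when traffic hovers near a boundary, so partitions are not added and removed on every small fluctuation.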

Zonal Isolation for Cadence Workflows

· 9 min read

At Uber, we want regional resilience: losing a zone within a region should be tolerated without a cross-region failover, and should affect at most a subset of the workload rather than all of it. However, in Cadence-based systems, the workload in a region is distributed randomly across all workers in that region at “task-level granularity”, which means any worker in the region where the domain is active may work on a given workflow. To contain the impact of a zone failure, we introduced Zonal Isolation for Cadence Workflows, a feature designed to pin workflows to the zone they are started in, so that isolation can be achieved at the workflow level.

What is Zonal Isolation for Cadence Workflows?

At a high level, Zonal Isolation for Cadence Workflows can be thought of at two levels:

  1. Task-level isolation: All decision tasks and activity tasks of a workflow are processed only by workers from the zone where the workflow was started.
  2. Infrastructure-level isolation: Within a regional Cadence cluster, workflows are handled by server instances in the same zone where they were started, and the corresponding data is stored in that zone as well.

Infrastructure-level isolation is quite challenging to implement as it requires significant changes to the core design of the Cadence server. Due to the complexity involved, support for this feature is not planned for the foreseeable future.

As a result, the focus remains on achieving task-level zonal isolation outside the Cadence server, which offers a more practical and immediate way to improve system resilience. It ensures that an unhealthy zone (e.g. a bad deployment of workers) only affects a subset of workflows (those started in that zone) rather than every workflow in a Cadence domain.
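The task-level rule can be pictured with a small Python sketch. This is only an illustration under assumptions: the `Task` and `ZoneIsolatedTasklist` names are hypothetical, and the excerpt does not say how the origin zone is recorded or how a worker reports its own zone, so both are modeled here as plain strings.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    workflow_id: str
    zone: str  # zone where the workflow was started (assumed to be known)


class ZoneIsolatedTasklist:
    """Hypothetical tasklist that keeps one FIFO queue per origin zone."""

    def __init__(self) -> None:
        self._queues: dict[str, deque] = defaultdict(deque)

    def add(self, task: Task) -> None:
        # Tasks are grouped by the zone their workflow started in.
        self._queues[task.zone].append(task)

    def poll(self, worker_zone: str) -> Optional[Task]:
        # A worker only ever receives tasks from its own zone.
        queue = self._queues.get(worker_zone)
        return queue.popleft() if queue else None


tasklist = ZoneIsolatedTasklist()
tasklist.add(Task("wf-1", "zone-a"))
tasklist.add(Task("wf-2", "zone-b"))

# A zone-b worker never sees wf-1, so a bad zone-b deployment cannot touch it.
print(tasklist.poll("zone-b"))
print(tasklist.poll("zone-b"))
```

In this sketch a zone with no healthy workers simply leaves its tasks queued; how the real feature handles that case is not covered in this excerpt.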