At Uber, we want to achieve regional resilience such that losing a zone within a region can be tolerated without requiring a cross-region failover. We also want to make sure that losing a zone affects at most a subset of the workload rather than all of it. However, in Cadence-based systems, the workload in a region is distributed randomly across all workers in the region at a “task-level granularity”, which means a workflow may be worked on by any worker in the region where the domain is active. To achieve this goal, we introduced Zonal Isolation for Cadence Workflows - a feature designed to pin workflows to the zone they are started in, so that zonal isolation can be achieved at the workflow level.
What is Zonal Isolation for Cadence Workflows?
At a high level, Zonal Isolation for Cadence Workflows can be thought of at two levels:
Task-level isolation: All decision tasks and activity tasks of a workflow are only processed by workers from the zone where the workflow was started
...
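Task-level isolation as described above can be illustrated with a minimal sketch. It assumes a toy task list that keeps one queue per zone; the names `Task`, `ZonalTaskList`, and `origin_zone` are hypothetical and are not Cadence APIs. It also ignores what should happen when a zone has no healthy workers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    workflow_id: str
    origin_zone: str  # zone where the workflow was started (assumed field)


class ZonalTaskList:
    """Toy task list with one queue per zone (hypothetical model)."""

    def __init__(self):
        self._queues = defaultdict(list)

    def add(self, task):
        # Pin the task to the zone its workflow was started in.
        self._queues[task.origin_zone].append(task)

    def poll(self, worker_zone):
        # A worker only receives tasks queued for its own zone.
        queue = self._queues[worker_zone]
        return queue.pop(0) if queue else None


tasks = ZonalTaskList()
tasks.add(Task("wf-1", "zone-a"))
tasks.add(Task("wf-2", "zone-b"))

polled_a = tasks.poll("zone-a")        # the task for wf-1
polled_again_a = tasks.poll("zone-a")  # nothing left for zone-a
polled_b = tasks.poll("zone-b")        # the task for wf-2
```

In this model, losing `zone-b` would only strand `wf-2`, while `wf-1` keeps making progress in `zone-a`.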
We’ve heard your feedback: deploying Cadence has been a challenge, especially with limited documentation on operational aspects. So far, we’ve only provided a few Docker Compose files to help you get started on a development machine. However, deploying and managing Cadence at scale requires a deep understanding of the underlying services, their configurations, and their dependencies.
To address these challenges, we’re launching several initiatives to make it easier to deploy and operate Cadence clusters. These include deployment specs for common scenarios, monitoring dashboards, alerts, runbooks, and more comprehensive documentation.
Introducing Cadence Kubernetes Helm Chart v0
Today, we are happy to announce the release of Cadence Kubernetes Helm Chart v0. This will be the starting point for standardizing Cadence deployments on Kubernetes. We chose Kubernetes because it's the leading compute pla ...
At Uber, we run several large multi-tenant Cadence clusters with hundreds of domains in each. Because the clusters are multi-tenant, domains can suffer noisy neighbor effects from one another.
An essential aspect of avoiding this is managing how workflows interact with our infrastructure to prevent any single workflow from causing instability for the whole cluster. To this end, we are excited to introduce Workflow ID-based rate limits, a new feature designed to protect our clusters from problematic workflows and ensure stability across the board.
Why Workflow ID-based Rate Limits?
We already have rate limits for how many requests can be sent to a domain. However, since Cadence is sharded on the workflow ID, which is a user-provided input, an overused workflow with a particular ID might overwhelm a shard by making too many requests. There are two main ways this happens:
A user starts, or signals the ...
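The idea of limiting requests per workflow ID, in addition to per domain, can be sketched as follows. This is a toy fixed-window limiter and not Cadence's implementation; the class name, the `max_requests_per_second` parameter, and the injectable `clock` are assumptions made for illustration.

```python
import time
from collections import defaultdict


class WorkflowIDRateLimiter:
    """Toy fixed-window limiter keyed by workflow ID (illustrative only)."""

    def __init__(self, max_requests_per_second, clock=time.monotonic):
        self.limit = max_requests_per_second
        self.clock = clock
        # workflow_id -> [current one-second window, requests seen in it]
        self._windows = defaultdict(lambda: [None, 0])

    def allow(self, workflow_id):
        window = int(self.clock())
        state = self._windows[workflow_id]
        if state[0] != window:
            # A new window started, so reset the counter for this ID.
            state[0], state[1] = window, 0
        if state[1] >= self.limit:
            return False
        state[1] += 1
        return True


now = [100.0]  # fake clock so the example is deterministic
limiter = WorkflowIDRateLimiter(max_requests_per_second=2, clock=lambda: now[0])

hot = [limiter.allow("hot-workflow") for _ in range(3)]
other = limiter.allow("quiet-workflow")
now[0] += 1.0
after_reset = limiter.allow("hot-workflow")
```

Here the third request to `hot-workflow` within the same second is rejected, while `quiet-workflow` is unaffected because each workflow ID has its own budget.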
If you haven’t heard about Cadence, this section is for you. In short, Cadence is a code-driven workflow orchestration engine. That definition alone may not say enough, so it helps to split it into three parts:
What’s a workflow? (everyone has a different definition)
Why does it matter to be code-driven?
Benefits of Cadence
What is a Workflow?
By the simplest definition, a workflow is “a multi-step execution”. A step here is an individual operation that is a little heavier than a small in-process function call, although steps are not limited to that: a step could be a separate service call, processing a large dataset, a map-reduce job, a thread sleep, scheduling the next run, waiting for external input, starting a sub-workflow, and so on. It is anything a user thinks of as a single unit of logic in their code. Those steps often have dependencies among themselves. Some steps, including the very first step, might ...
NO. This change will not trigger a non-deterministic error.
An Activity is the smallest unit of execution in Cadence. What happens inside an activity is not recorded as history events and therefore is not replayed. In short, this change is deterministic and it is fine to modify logic inside activities.
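A toy model of replay may help show why activity internals are safe to change. The sketch below is not the Cadence client API; `replay`, `execute_activity`, and the history event layout are assumptions made for illustration. During replay, each activity call is answered from recorded history, so the activity body never runs.

```python
class NonDeterministicError(Exception):
    """Raised when replayed code diverges from recorded history."""


def replay(history, workflow_fn):
    """Re-run workflow code, answering activity calls from recorded history."""
    recorded = iter(history)

    def execute_activity(name, *args):
        event = next(recorded)
        if event["activity"] != name:
            raise NonDeterministicError(
                "history has %r, code asked for %r" % (event["activity"], name)
            )
        # The recorded result is returned; the activity body never runs here.
        return event["result"]

    return workflow_fn(execute_activity)


def charge_workflow(execute_activity):
    fare = execute_activity("compute_fare", "trip-1")
    return execute_activity("charge_card", fare)


history = [
    {"activity": "compute_fare", "result": 12},
    {"activity": "charge_card", "result": "ok"},
]
result = replay(history, charge_workflow)
```

Because only the activity name and its recorded result take part in replay, rewriting how `compute_fare` calculates its value has no effect on already-recorded executions in this model.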
Does changing the workflow definition trigger non-deterministic errors?
YES. This is a very typical non-deterministic error.
When a new workflow code change is deployed, Cadence checks whether it is compatible with the existing Cadence history. Changes to the workflow definition fail the replay process because Cadence finds the new workflow definition incompatible with the previously recorded history events.
Here is a list of common workflow definition changes.
Changing workflow parameter counts
Changing workflow parameter types
Changing workflow return types
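To make the parameter-count case concrete, here is a minimal sketch under the assumption that replay re-binds the arguments recorded at workflow start to the currently deployed definition. The helper `replay_start` and the two workflow functions are hypothetical, and the real check performed by Cadence may differ.

```python
import inspect


class NonDeterministicError(Exception):
    """Raised when the current definition cannot accept recorded input."""


def replay_start(recorded_args, workflow_fn):
    """Bind the arguments recorded at workflow start to the current definition."""
    try:
        inspect.signature(workflow_fn).bind(*recorded_args)
    except TypeError as exc:
        raise NonDeterministicError(str(exc)) from exc
    return workflow_fn(*recorded_args)


def workflow_v1(user_id):
    return "hello %s" % user_id


def workflow_v2(user_id, region):  # parameter added after v1 executions started
    return "hello %s in %s" % (user_id, region)


recorded_args = ("user-42",)  # what history captured when the workflow started
v1_result = replay_start(recorded_args, workflow_v1)

try:
    replay_start(recorded_args, workflow_v2)
    v2_failed = False
except NonDeterministicError:
    v2_failed = True
```

The recorded input still fits `workflow_v1`, but it cannot be bound to `workflow_v2`, which is the kind of incompatibility the list above describes. This sketch only covers parameter counts, not parameter or return types.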
The following changes are not categorized as definition changes and therefore will not
trigger non-deterministic e ...