Minimizing blast radius in Cadence: Introducing Workflow ID-based Rate Limits
At Uber, we run several large multi-tenant Cadence clusters, each hosting hundreds of domains. Because the clusters are multi-tenant, one domain's load can degrade others: the classic noisy-neighbor problem.
An essential part of avoiding this is controlling how workflows interact with our infrastructure, so that no single workflow can destabilize the whole cluster. To this end, we are excited to introduce Workflow ID-based rate limits, a new feature designed to protect our clusters from problematic workflows and keep them stable.
Why Workflow ID-based Rate Limits?
We already have rate limits on how many requests can be sent to a domain. However, Cadence is sharded on the workflow ID, which is a user-provided input, so a single overused workflow ID can overwhelm the shard it maps to by generating too many requests. There are two main ways this happens:
- A user starts or signals the same workflow ID too aggressively.
- A workflow starts too many activities over a short period of time (e.g. thousands of activities in seconds).