Jakob Haahr Taankvist
Senior Software Engineer @ Uber

Two Hidden Deadlocks in Cadence Matching: 1 Day, 2 Engineers, 6 Lines of Code

· 13 min read
Jakob Haahr Taankvist
Senior Software Engineer @ Uber
Eleonora Di Gregorio
Senior Software Engineer @ Uber

How the new Cadence Shard Manager Found and Mitigated Two Latent Deadlocks

We're rolling out a new Shard Manager service for Cadence that replaces the existing hash-ring-based routing, and it's coming to the open-source release soon. The new architecture gives us load balancing, graceful shard handovers, and the debuggability and observability that we used to find the two deadlocks in this post. During the rollout, the Shard Manager exposed two latent deadlocks in the Cadence Matching service. It moved traffic to the healthy instance and kept the system running while two engineers fixed both deadlocks in a day, with six lines of code in total.

To understand how this happened, we first need to understand the new architecture.

Minimizing blast radius in Cadence: Introducing Workflow ID-based Rate Limits

· 7 min read
Jakob Haahr Taankvist
Senior Software Engineer @ Uber

At Uber, we run several large multi-tenant Cadence clusters, each with hundreds of domains. Because the clusters are multi-tenant, one domain's load can degrade others through noisy-neighbor effects.

A key part of avoiding these effects is controlling how workflows interact with our infrastructure, so that no single workflow can destabilize the whole cluster. To this end, we are excited to introduce Workflow ID-based rate limits, a new feature designed to protect our clusters from problematic workflows and keep them stable.

Why Workflow ID-based Rate Limits?

We already have rate limits for how many requests can be sent to a domain. However, Cadence is sharded on the workflow ID, which is a user-provided input, so a single overused workflow ID can overwhelm its shard by generating too many requests. There are two main ways this happens:

  1. A user starts or signals the same workflow ID too aggressively.
  2. A workflow starts too many activities over a short period of time (e.g. thousands of activities in seconds).
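
To make the idea concrete, here is a minimal sketch of a rate limiter keyed on workflow ID rather than on the domain alone. It is not Cadence's implementation: the class names, the token-bucket algorithm, and the `rate` and `burst` parameters are all assumptions chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class TokenBucket:
    """Hypothetical token bucket: refills at `rate` tokens/second up to `burst`."""
    rate: float
    burst: float
    tokens: float
    updated_at: float

    def allow(self, now: float) -> bool:
        # Refill based on the time elapsed since the last call, capped at burst.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class WorkflowIDRateLimiter:
    """Illustrative limiter with one bucket per (domain, workflow ID) pair.

    Assumption: every workflow ID gets the same rate and burst. The bucket
    map is never pruned here; a real service would need to evict idle keys.
    """

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate
        self._burst = burst
        self._clock = clock
        self._buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def allow(self, domain: str, workflow_id: str) -> bool:
        now = self._clock()
        key = (domain, workflow_id)
        bucket = self._buckets.get(key)
        if bucket is None:
            # A new workflow ID starts with a full bucket.
            bucket = TokenBucket(self._rate, self._burst, self._burst, now)
            self._buckets[key] = bucket
        return bucket.allow(now)


if __name__ == "__main__":
    limiter = WorkflowIDRateLimiter(rate=1.0, burst=2.0)
    for i in range(4):
        # The same workflow ID is signalled repeatedly in a tight loop.
        print(i, limiter.allow("example-domain", "hot-workflow"))
    # A different workflow ID in the same domain has its own budget.
    print(limiter.allow("example-domain", "quiet-workflow"))
```

Because each workflow ID has its own bucket, a workflow that is started or signalled too aggressively exhausts only its own budget, while other workflows in the same domain are unaffected.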