Two Hidden Deadlocks in Cadence Matching: 1 Day, 2 Engineers, 6 Lines of Code

· 13 min read
Jakob Haahr Taankvist
Senior Software Engineer @ Uber
Eleonora Di Gregorio
Senior Software Engineer @ Uber

How the new Cadence Shard Manager Found and Mitigated Two Latent Deadlocks

We're rolling out a new Shard Manager service for Cadence that replaces the existing hash-ring based routing, and it's coming to the open-source release soon. The new architecture gives us load balancing, graceful shard handovers, and the debuggability and observability we used to find the two deadlocks described in this post. During the rollout, the Shard Manager exposed two latent deadlocks in the Cadence Matching service. It moved traffic to the healthy instance and kept the system running while two engineers fixed them in a day, with six lines of code in total.

To understand how this happened, we first need to understand the new architecture.

Shard Manager Architecture

Cadence has two sharded services, Matching and History. In this blog post we will focus on the Matching service.

In the Matching service the shards are the Cadence task lists. A single Cadence task list is owned by a single Matching instance, and all requests to that particular task list are routed to the owning instance.

The new architecture with the Shard Manager has several benefits:

  1. It's easier to reason about which instance of Cadence Matching owns a task list.
  2. Observability of the system is easier. With the hash-ring based routing, the state of the system is spread across all instances of all the Cadence services, meaning even subtle issues are very hard to debug and reason about.
  3. A centralized component makes it possible to manage shards intelligently. For example, we can now move shards between instances to balance the load. We can isolate bad shards, and we can drain bad hosts.

The Shard Manager architecture looks like this.

[Diagram: the Shard Manager, a Frontend service with an empty routing map, and Matching Instances 1–3 owning task lists TL1–TL2, TL3–TL4, and TL5–TL6]
  1. The Matching instances heartbeat periodically to the Shard Manager, so the Shard Manager always knows which instances are alive.
  2. The Shard Manager assigns shards to the instances and returns the current assignments in the heartbeat responses (see the sketch after this list).
  3. On every change to the shard assignments, the Shard Manager pushes the new routing map to the Frontend services, so they always have a full view of which instance owns which task list.
  4. A client request arrives at a Frontend service.
  5. The Frontend looks up the owning instance in its routing map and forwards the request directly to it.
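To make the flow concrete, here is a minimal sketch of the data exchanged in a heartbeat and of the routing map pushed to the Frontends. The type and field names are hypothetical, not the actual Cadence API.

// Hypothetical shapes for the exchange described above; the real Cadence
// types and RPC definitions differ.
type HeartbeatRequest struct {
  ExecutorID  string   // identity of the Matching instance sending the heartbeat
  OwnedShards []string // task lists the executor currently runs
}

type HeartbeatResponse struct {
  AssignedShards []string // task lists the Shard Manager wants this executor to own
}

// On every assignment change the Shard Manager pushes a full routing map to the
// Frontends; conceptually it is a map from task list to owning Matching instance.
type RoutingMap map[string]string // task list name -> Matching instance address

A Frontend that receives a request for TL3 simply looks it up in this map and forwards the call to Matching Instance 2.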

The Incident

We have been rolling out the new Shard Manager service to our environments for a few weeks. It is currently running in our customer-facing staging environments, and we will roll it out to production in the next few weeks.

On Thursday morning we woke up to this histogram of log messages in the Shard Manager service:

[Histogram: error and warn log counts (0–80k) in the Shard Manager between 22:00 and 22:26]

Two spikes of thousands of error and warn messages in a very short time frame, apparently resolving within 10 minutes. So what went wrong? First, let's look at what the logs are by grouping them by message and error:

| message | error | level | count() |
| --- | --- | --- | --- |
| Subscriber not keeping up with state updates, dropping update | null | warn | 147,405 |
| Internal service error | failed to assign ephemeral shard: no active executors available for namespace: cadence-matching-staging | error | 136,876 |
| No active executors found. Cannot assign shards. | null | error | 296 |

So first we get so many updates in the system that the instances cannot keep up, and then we start getting errors saying there are no active executors (Matching instances). Let's check the deployment system and see whether there were instances.

Number of instances according to the deployment system:

[Graph: number of Matching instances (0–3) according to the deployment system, 22:00–22:29]

So there were instances! Interestingly, right when the errors started, the number of instances dropped from 3 to 2, and when the errors stopped, it went back up to 3.

That's strange: did the instance removal trigger something in the Shard Manager?

A red herring: leader election

The Shard Manager elects a leader, and that leader is responsible for detecting stale instances and removing them, and reassigning the shards.
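Conceptually, each Shard Manager instance runs a loop along these lines. This is a purely illustrative sketch with hypothetical names, not the actual Shard Manager code; the roughly five-minute leadership period is inferred from the log timestamps below.

import (
  "context"
  "time"
)

// Illustrative sketch of the leadership rotation seen in the logs below.
type leaderElection interface {
  BecomeLeader(ctx context.Context) bool // blocks until this instance wins the election
  Resign()                               // logs "Leadership period ended, voluntarily resigning"
}

func leadershipLoop(ctx context.Context, election leaderElection, duties func(ctx context.Context)) {
  const leadershipPeriod = 5 * time.Minute // assumed from the ~5 minute gaps in the logs
  for ctx.Err() == nil {
    if !election.BecomeLeader(ctx) {
      continue
    }
    dutyCtx, cancel := context.WithTimeout(ctx, leadershipPeriod)
    duties(dutyCtx) // detect stale executors, remove them, and reassign their shards
    cancel()
    election.Resign()
  }
}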

Was the leader working? Let's check the leadership election logs of the Shard Manager:

| timestamp | message | hostname |
| --- | --- | --- |
| 21:48:05 | Leadership period ended, voluntarily resigning | host-1 |
| 21:48:05 | Became leader | host-2 |
| 21:53:05 | Leadership period ended, voluntarily resigning | host-2 |
| 21:53:05 | Became leader | host-1 |
| 21:58:05 | Leadership period ended, voluntarily resigning | host-1 |
| 21:58:05 | Became leader | host-2 |
| 22:03:05 | Leadership period ended, voluntarily resigning | host-2 |
| 22:08:05 | Leadership period ended, voluntarily resigning | host-1 |
| 22:08:06 | Became leader | host-2 |
| 22:13:06 | Leadership period ended, voluntarily resigning | host-2 |
| 22:13:06 | Became leader | host-1 |
| 22:18:06 | Leadership period ended, voluntarily resigning | host-1 |

We see that the leader resigns, then another instance becomes leader, and so on. However, at 22:03 something interesting happens: host-2 resigns as leader, but there is no new "Became leader" log. The next log message is host-1 resigning as leader. That's strange: how can host-1 resign when it never became leader?

This thread led to a lot of investigation, which ultimately brought us to the conclusion that the "Became leader" log was simply never delivered. Logging infrastructure in distributed systems is never 100% reliable. There was always a leader, it just wasn't logged.

Back to the missing executors

We had now found what would turn out to be the right question: are the executors (Matching instances) heartbeating? If not, the Shard Manager is doing exactly what it's supposed to do. Let's check the network metrics for the heartbeats.

[Graph: heartbeat requests per second (0–6) to the Shard Manager, 20:00–23:31]

This is interesting: up until 20:40 we clearly have 3 heartbeats per second, one for each Matching instance, exactly as we expect. But after 20:40 we suddenly only have one heartbeat per second.

And even more interesting: right when the errors started, the number of heartbeats dropped all the way to zero, and when the errors stopped, it went back up to 1.

The brief spikes to 4+ req/s at 22:09 and 22:26 are caused by shards moving around — when an executor receives a request for a shard it doesn't recognize, it heartbeats out of schedule to check if the shard was assigned to it in the meantime.

But we have 3 instances, so why did two of them suddenly stop heartbeating? And why did the third instance continue to heartbeat?

We also looked at the CPU utilization of the Matching instances, and we saw this:

[Graph: CPU utilization (%) of the three Matching instances, 20:00–23:41]

This is cool: we see that the CPU utilization was essentially the same for all three instances; then two stopped heartbeating, so the Shard Manager moved all the shards to the last instance, and its CPU utilization went up. We also see that as soon as replacement instances appeared, the Shard Manager assigned shards to them again.

The gaps in the graph are the deployment system's load balancer deciding we didn't need three instances and removing one, then rolling back that decision when it realized it had removed the instance doing all the work. Then, at around 22:50, there was a deploy of the Matching services, so all the instances were replaced and they all started heartbeating again.

The Deadlock

So, two out of three Matching instances stopped heartbeating. And as soon as the instances were replaced all instances started heartbeating again. This is the telltale sign of a deadlock.

We checked the logs of the locked Matching instances, and we saw that every second, when they should have heartbeated, they emitted a log: "still doing assignment, skipping heartbeat".

When an executor heartbeats the Shard Manager responds with a new assignment of shards. The executor then reconciles its state: it stops any shards it no longer owns and starts any newly assigned ones. During this reconciliation the executor skips heartbeating. Normally this completes in milliseconds, but in this case the reconciliation deadlocked, so the executor never heartbeated again.
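The important detail is that the heartbeat and the reconciliation are coupled: while a reconciliation is in flight, heartbeats are skipped. A minimal sketch of that coupling, with hypothetical names rather than the real Cadence code, looks like this:

import (
  "context"
  "log"
  "sync/atomic"
  "time"
)

// heartbeatLoop skips heartbeats while a reconciliation is still running. If a
// reconciliation never finishes, the executor never heartbeats again.
func heartbeatLoop(
  ctx context.Context,
  heartbeat func(context.Context) ([]string, error), // one heartbeat RPC, returns assigned shards
  reconcile func(assigned []string),                  // stop/start task list managers
) {
  var reconciling atomic.Bool
  ticker := time.NewTicker(time.Second)
  defer ticker.Stop()
  for {
    select {
    case <-ctx.Done():
      return
    case <-ticker.C:
      if reconciling.Load() {
        log.Println("still doing assignment, skipping heartbeat") // the log we saw every second
        continue
      }
      assigned, err := heartbeat(ctx)
      if err != nil {
        continue
      }
      reconciling.Store(true)
      go func() {
        defer reconciling.Store(false)
        // Stops task list managers we no longer own and starts newly assigned
        // ones. If a single Stop() never returns, this goroutine never finishes
        // and every future heartbeat is skipped.
        reconcile(assigned)
      }()
    }
  }
}

Normally the reconciliation finishes in milliseconds, so the skip branch is almost never hit; on the stuck instances it was hit every second, forever.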

It turns out that the Cadence Matching service was hiding not just one, but two latent deadlocks. While they had completely different scopes and triggers, they both resulted in the exact same catastrophic behavior for the Shard Manager: blocking the Shard Manager heartbeat loop and failing liveness checks.

Finding the first deadlock: The Database Unavailability

So, we know that two instances of the Matching service deadlocked at around 20:40. Let's check the logs for the Matching service around that time.

[Histogram: error and warn log counts (0–10k) in the Matching service between 20:38 and 20:43]

We see a spike in both errors and warnings. Let's group the logs by message and error:

| message | error | level | count() |
| --- | --- | --- | --- |
| adaptive task list scaler state changed | null | warn | 9,056 |
| get task list partition config from db | null | warn | 4,564 |
| Task list manager state changed | null | warn | 4,345 |
| DBUnavailable Error | LeaseTaskList: Cannot achieve consistency level LOCAL_SERIAL | error | 157 |

So we see errors about the database being unavailable, and then warnings about task list managers (the shards managed by the Shard Manager) stopping and restarting. As it turns out, the most interesting log message is "get task list partition config from db". Let's look at the code around it:

c.logger.Info("get task list partition config from db",
  tag.Dynamic("root-partition", c.taskListID.GetRoot()),
  tag.Dynamic("task-list-partition-config", c.partitionConfig))
if c.partitionConfig != nil {
  startConfig := c.partitionConfig
  // push update notification to all non-root partitions on start
  c.stopWG.Add(1)
  go func() {
    defer c.stopWG.Done()
    c.notifyPartitionConfig(context.Background(),
      nil, startConfig)
  }()
}

And here we see the issue.

  • On line 7 we add one to the stopWG wait group. In Go, a WaitGroup lets you wait for a set of goroutines to finish before proceeding. Here, stopWG is used during shutdown: the task list manager won't fully stop until every goroutine registered with stopWG has completed.
  • Then, in the goroutine, we call the notifyPartitionConfig method, and here is the problem: we call notifyPartitionConfig with context.Background(). The background context is never cancelled and never times out.
  • In the notifyPartitionConfig method we make an RPC call. Due to the DB unavailability this call hangs, and since we use the background context we wait forever for the response.

In the Stop method of the TaskListManager we then wait for the stopWG. However, since the goroutine running the notifyPartitionConfig call is blocked, the stopWG never finishes.
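In sketch form (the receiver name is a guess, the mechanism is as described above), the hang looks like this:

// Sketch of the hang, not the actual Cadence code: Stop waits on the same
// WaitGroup the stuck notification goroutine is registered with.
func (c *taskListManagerImpl) Stop() {
  // ... signal other components to shut down ...
  c.stopWG.Wait() // never returns: notifyPartitionConfig hangs on a context that
                  // cannot expire, so its deferred stopWG.Done() never runs
}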

This means that when the executor does its reconciliation, it tries to stop the task list manager, but the stop never finishes, and the reconciliation blocks heartbeating indefinitely!

The fix is simple: we just need to use a context that times out, as was introduced in this pull request.
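In sketch form, the change is just to give the notification a deadline. The 30-second value below is an assumption; the real value is in the pull request.

// Sketch of the fix: bound the notification with a context that can expire.
c.stopWG.Add(1)
go func() {
  defer c.stopWG.Done()
  ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) // timeout value assumed
  defer cancel()
  c.notifyPartitionConfig(ctx, nil, startConfig)
}()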

Finding the second deadlock: The Task Pump Self-Deadlock

Over the weekend, while the fix for the first deadlock was rolling out, we observed the exact same behavior again: missing executors and failing liveness checks. However, this time there were no database issues at play at all. By profiling the live service, we uncovered a second, cascading self-deadlock within the Cadence Matching service.
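Profiling here means taking a goroutine dump of the running process. The snippet below is the standard way to expose such dumps in any Go service via net/http/pprof; whether and how your Cadence deployment exposes this endpoint is an assumption.

package main

import (
  "log"
  "net/http"
  _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
  // With this running,
  //   curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
  // dumps every goroutine with its full stack, wait reason, and how long it has
  // been blocked, which is how a goroutine stuck in Stop() for hours stands out.
  log.Println(http.ListenAndServe("localhost:6060", nil))
}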

The root cause originates from a self-deadlock within the getTasksPump goroutine in Cadence Matching. Let's look at a simplified version of the execution chain that caused this trap:

func (tr *taskReader) getTasksPump() {
  tr.stopWg.Add(1)
  defer tr.stopWg.Done() // This must run to unblock Wait()
  
  for {
    // ... attempt to read tasks from DB ...
    if err != nil {
      tr.handleErr(err)
    }
  }
}

func (tr *taskReader) handleErr(err error) {
  if _, ok := err.(*persistence.ConditionFailedError); ok {
    // We lost ownership of the task list, shut down!
    tr.tlMgr.Stop() 
  }
}

func (tr *taskReader) Stop() {
  tr.cancel()
  tr.stopWg.Wait() // Wait for getTasksPump to finish
}

And here is how the execution fatally loops back on itself:

  • On line 2, getTasksPump adds to the stopWg. It guarantees it will call Done() via the defer on line 3 when the function exits.
  • During its periodic event loop, it encounters a ConditionFailedError (indicating the task list has been taken over by another host) and calls handleErr.
  • handleErr detects the ownership loss and synchronously calls Stop() on the parent task list manager (tlMgr.Stop()). Crucially, the getTasksPump goroutine is the one executing this synchronous code path.
  • The task list manager begins tearing down sub-components and eventually calls taskReader.Stop() to wait for background goroutines to finish gracefully.
  • Inside taskReader.Stop(), it cancels its context and then calls tr.stopWg.Wait() (line 22) to wait for the getTasksPump goroutine to finish.

The Trap: Because getTasksPump is the very goroutine executing Wait(), it is effectively waiting for itself to finish. It is blocked mid-execution, meaning it can never return, and the deferred tr.stopWg.Done() will never be called!

The result is a self-deadlock that can trap the goroutine for hours (we observed goroutines blocked for over 22 hours during the incident). Just like the first deadlock, this completely stalls the Stop execution, which in turn freezes the reconciliation and blocks heartbeating indefinitely.

Architectural Flaws & Remediation

It is crucial to note that the two deadlocks were independent of the new Shard Manager integration itself; they were actually pre-existing issues deeply embedded in the Cadence Matching teardown logic. However, the introduction of the Shard Manager acted as a catalyst, exposing these latent bugs. Seeing how a localized teardown hang could freeze the entire node made us realize that decoupling liveness from teardown was a necessary architectural change to prevent node-wide failures. This exposed two major design flaws in how we handled lifecycle management:

  • Coupled Liveness and Teardown Latency: The heartbeat is meant to be a lightweight RPC that fires on a strict interval to prove node liveness. By coupling this liveness signal to the synchronous teardown latency of potentially thousands of task list managers, a single stuck task list poisoned the entire executor's liveness signal.
  • Unbounded Synchronous Work: shardProcessorImpl.Stop() iterates through task list managers sequentially without any timeouts.

We rolled out fixes to solve both deadlocks and harden the architecture:

  • Fixing the Deadlocks (6 lines of code total): The fix for the first database-triggered deadlock was just three lines of code: replacing context.Background() with a context that has a timeout, as introduced in this pull request. The fix for the second self-deadlock was to rely only on context cancellation for a graceful shutdown, which was also accomplished in exactly three lines of code.
  • Hardening the Heartbeat Loop: To fix the exposed architectural flaw and decouple liveness from teardown latency, managedProcessor.processor.Stop() and Start() were moved to separate goroutines with non-blocking handoffs and strict timeouts. This guarantees that a single stuck task list manager can never hold an entire shard processor hostage indefinitely (see the sketch after this list).
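Conceptually, the hardened Stop path looks like the sketch below. The names and the exact timeout handling are illustrative, not the literal change.

import "time"

type taskListManager interface {
  Stop()
}

// stopWithTimeout runs Stop in its own goroutine and bounds how long the caller
// waits for it, so one stuck task list manager cannot block the heartbeat path.
func stopWithTimeout(m taskListManager, timeout time.Duration) {
  done := make(chan struct{})
  go func() {
    m.Stop() // may hang, but only this goroutine is held hostage
    close(done)
  }()
  select {
  case <-done:
  case <-time.After(timeout):
    // Give up waiting; reconciliation and heartbeating continue, and the stuck
    // goroutine is surfaced through logs and metrics instead of blocking liveness.
  }
}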

Conclusion

The new Shard Manager made both of these deadlocks visible — under the old hash-ring based routing, the same deadlocks would have caused silent degradation that would have been extremely difficult to diagnose.

Instead, the Shard Manager detected the missing heartbeats, automatically moved all shards to the one healthy instance, and kept the system running while we investigated. When new healthy instances appeared it immediately rebalanced the load across them.

Found in a day, by two engineers, thanks to the observability the Shard Manager provides.