Signals are how the outside world communicates with a running Cadence workflow without interrupting it. Consider a document approval workflow that runs for days or weeks: at any moment, a manager clicks "Approve" on a dashboard and that signal needs to reach the workflow immediately — without restarting it or losing its place. Whether you're building human-in-the-loop approvals, broadcasting configuration changes to thousands of running workflows, coordinating multi-stage pipelines, or implementing liveness checks — signals are the right tool.
Signal handling feature in python client is now available in the latest cadence-python-client release, v0.2.2 — see the v0.2.2 release notes on GitHub and install from PyPI.
How the new Cadence Shard Manager Found and Mitigated Two Latent Deadlocks
We're rolling out a new Shard Manager service for Cadence that replaces the
existing hash-ring based routing, and it's coming to
the open-source release soon. The new architecture gives us load balancing,
graceful shard handovers, and the debuggability and observability we've used to
find the two deadlocks in this post. During the rollout, the Shard Manager
exposed two latent deadlocks in the Cadence Matching service. It moved traffic
to the healthy instance and kept the system running while two engineers fixed
them in a day, with six lines of code total.
To understand how this happened, we first need to understand the new
architecture.
Cadence has two sharded services,
Matching and History. In this blog post we will focus on the Matching service.
In the Matching service the shards are the Cadence task lists. A single Cadence
task list is owned by a single Matching instance, and all requests to that
particular task list are routed to the owning instance.
The new architecture with the Shard Manager has several benefits:
It's easier to reason about which instance of Cadence Matching owns a
task list.
Observability of the system is easier. With the hash-ring based routing the
state of the system is spread across all instances of all the Cadence services.
Meaning even subtle issues are very hard to debug and reason about.
A centralized component makes it possible to manage shards intelligently. For
example, we can now move shards between instances to balance the load. We can
isolate bad shards, and we can drain bad hosts.
The Shard Manager architecture looks like this.
1. Matching instances send heartbeats to Shard Manager
The Matching instances heartbeat periodically to the Shard Manager,
so the Shard Manager always knows which instances are alive.
The Shard Manager assigns shards to the instances and returns the
current assignments in the heartbeat responses.
On every change to the shard assignments, the Shard Manager pushes
the new routing map to the Frontend services, so they always have a
full view of which instance owns which task list.
A client request arrives at a Frontend service.
The Frontend looks up the owning instance in its routing map and
forwards the request directly to it.
We have been rolling out the new Shard Manager service to our environments for a
few weeks, we have currently rolled out to our customer facing staging
environments, and will roll out to production in the next few weeks.
On Thursday morning we woke up to this histogram of log messages in the Shard
Manager service:
errorwarn
0
20k
40k
60k
80k
22:0022:0722:1322:2022:26
Two spikes of thousands of error and warn messages in a very short time frame.
Apparently resolving within 10 minutes. So what went wrong? First let's look at
what the logs are, let's group them by message and error:
message
error
level
count()
Subscriber not keeping up with state updates, dropping update
null
warn
147,405
Internal service error
failed to assign ephemeral shard: no active executors available for namespace: cadence-matching-staging
error
136,876
No active executors found. Cannot assign shards.
null
error
296
So first we get a lot of updates in the system, so many that the instances cannot keep
up, then we start getting errors saying there are no active executors (Matching
instances). Let's check the deployment system and see if there were instances.
Number of instances according to the deployment system:
So there were instances! Interestingly right when the errors started happening
the number of instances dropped from 3 to 2. And when the errors stopped happening
the number of instances went back up to 3.
So it's strange, did the instance removal trigger something in the Shard
Manager?
A red herring - leader election
The Shard Manager elects a leader, and that leader is responsible for
detecting stale instances and removing them, and reassigning the shards.
Was the leader working? Let's check the leadership election logs of the Shard
Manager:
timestamp
message
hostname
21:48:05
Leadership period ended, voluntarily resigning
host-1
21:48:05
Became leader
host-2
21:53:05
Leadership period ended, voluntarily resigning
host-2
21:53:05
Became leader
host-1
21:58:05
Leadership period ended, voluntarily resigning
host-1
21:58:05
Became leader
host-2
22:03:05
Leadership period ended, voluntarily resigning
host-2
22:08:05
Leadership period ended, voluntarily resigning
host-1
22:08:06
Became leader
host-2
22:13:06
Leadership period ended, voluntarily resigning
host-2
22:13:06
Became leader
host-1
22:18:06
Leadership period ended, voluntarily resigning
host-1
We see that the leader resigns then another instance becomes leader etc. However
at 22:03 something interesting happens. host-2 resigns as leader, but there is no new
"Became leader" log. The next log message is host-1 resigning as leader. That's
strange, how can host-1 resign when it didn't become leader?
This thread led to a lot of investigation, which ultimately led us to the
conclusion: the Became leader log was simply never delivered. Logging
infrastructure in distributed systems is never 100% reliable.
There was always a leader, it just wasn't logged.
We now found what would turn out to be the right question; are the
executors (Matching instances) heartbeating? If not the Shard Manager server is
doing what it's supposed to do. Let's check the network metrics for the
heartbeats.
This is interesting, up until 20:40 we clearly have 3 heartbeats per second, one
for each Matching instance, exactly like we expect, but then, after 20:40, we
suddenly only have one heartbeat per second.
And even more interesting, right when the errors started the number of
heartbeats dropped all the way to zero. And when the errors stopped happening
the number of heartbeats went back up to 1.
The brief spikes to 4+ req/s at 22:09 and 22:26 are caused by shards moving
around — when an executor receives a request for a shard it doesn't recognize,
it heartbeats out of schedule to check if the shard was assigned to it in the
meantime.
But we have 3 instances, so why did two of them suddenly stop heartbeating? And
why did the third instance continue to heartbeat?
We also looked at the CPU utilization of the Matching instances, and we saw
this:
This is cool, we see that the CPU utilization was exactly the same for all three
instances, then two stopped heartbeating, so Shard Manager moved all the shards
to the last instance, and its CPU utilization went up. We also see that as soon
as the replacement instances appeared the Shard Manager assigned all the shards to it.
The gaps between instances are the deployment system's load balancer deciding
we didn't need three instances and removing one — then rolling back that decision
when it realized it had removed the instance doing all the work. And then at
around 22:50 there was a deploy of the Matching services, so all the instances
got replaced and then they all started heartbeating again.
So, two out of three Matching instances stopped heartbeating. And as soon as
the instances were replaced all instances started heartbeating again. This is
the telltale sign of a deadlock.
We checked logs of the locked Matching instances, and we saw that every second when
they should have heartbeated they emitted a log; still doing assignment, skipping heartbeat.
When an executor heartbeats the Shard Manager responds with a new assignment of
shards. The executor then reconciles its state: it stops any shards it no longer
owns and starts any newly assigned ones. During this reconciliation the executor
skips heartbeating. Normally this completes in milliseconds, but in this case
the reconciliation deadlocked, so the executor never heartbeated again.
It turns out that the Cadence Matching service was hiding not just one, but two latent deadlocks.
While they had completely different scopes and triggers, they both resulted
in the exact same catastrophic behavior for the Shard Manager: blocking the
Shard Manager heartbeat loop and failing liveness checks.
Finding the first deadlock: The Database Unavailability
So, we know that two instances of the Matching service deadlocked at around
20:40, let's check the logs for the Matching service around that time.
errorwarn
0
2k
4k
6k
8k
10k
20:3820:3920:4120:4220:43
We see a spike in both errors and warnings, let's group the logs by the message,
and the error:
So we see errors about the database being unavailable, we then see warnings
about task list managers (the shards managed by Shard Manager) stopping and
restarting. As it turns out, the most interesting log message is the get task list partition config from db. Let's look at the code around
that:
c.logger.Info("get task list partition config from db",
// push update notification to all non-root partitions on start
c.stopWG.Add(1)
gofunc(){
defer c.stopWG.Done()
c.notifyPartitionConfig(context.Background(),
nil, startConfig)
}()
}
And here we see the issue.
On line 7 we add one to the stopWG wait group. In Go, a
WaitGroup lets you wait for a set of
goroutines to finish before proceeding. Here, stopWG is used during
shutdown — the task list manager won't fully stop until every goroutine
registered with stopWG has completed.
Then in the goroutine we call the notifyPartitionConfig method, and here
the problem is!. We are calling notifyPartitionConfig with
context.Background(). The background context is a context that is never
cancelled, and which never times out.
In the notifyPartitionConfig method we make an RPC call, and due to the DB
unavailability this call hangs, and since we use the background context
we wait forever for the response.
In the Stop method of the TaskListManager we then wait for the stopWG,
however since the goroutine with the notifyPartitionConfig call is blocked
the stopWG never finishes.
This means that when the Shard Manager does the reconciliation, it will try to
stop the task list manager, but this never finishes, and the reconciliation
blocks heartbeating indefinitely!
The fix is simple, we just need to use a context that times out, as was
introduced in this pull request.
Finding the second deadlock: The Task Pump Self-Deadlock
Over the weekend, while the fix for the first deadlock was rolling out, we observed
the exact same behavior again: missing executors and failing liveness checks.
However, this time there were absolutely zero database issues at play.
By profiling the live service, we uncovered a second, cascading self-deadlock within the Cadence Matching.
The root cause originates from a self-deadlock within the getTasksPump goroutine in Cadence Matching.
Let's look at a simplified version of the execution chain that caused this trap:
func(tr *taskReader)getTasksPump(){
tr.stopWg.Add(1)
defer tr.stopWg.Done()// This must run to unblock Wait()
for{
// ... attempt to read tasks from DB ...
if err !=nil{
tr.handleErr(err)
}
}
}
func(tr *taskReader)handleErr(err error){
if_, ok := err.(*persistence.ConditionFailedError); ok {
// We lost ownership of the task list, shut down!
tr.tlMgr.Stop()
}
}
func(tr *taskReader)Stop(){
tr.cancel()
tr.stopWg.Wait()// Wait for getTasksPump to finish
}
And here is how the execution fatally loops back on itself:
On line 2, getTasksPump adds to the stopWg. It guarantees it will call Done() via the defer on line 3 when the function exits.
During its periodic event loop, it encounters a ConditionFailedError (indicating the task list has been taken over by another host) and calls handleErr.
handleErr detects the ownership loss and synchronously calls Stop() on the parent task list manager (tlMgr.Stop()). Crucially, the getTasksPump goroutine is the one executing this synchronous code path.
The task list manager begins tearing down sub-components and eventually calls taskReader.Stop() to wait for background goroutines to finish gracefully.
Inside taskReader.Stop(), it cancels its context and then calls tr.stopWg.Wait() (line 22) to wait for the getTasksPump goroutine to finish.
The Trap: Because getTasksPump is the very goroutine executing Wait(), it is effectively waiting for itself to finish. It is blocked mid-execution,
meaning it can never return, and the deferred tr.stopWg.Done() will never be called!
The result is a self-deadlock that can trap the goroutine for hours (we observed goroutines blocked for over 22 hours during the incident).
Just like the first deadlock, this completely stalls the Stop execution, which in turn freezes the reconciliation and blocks heartbeating indefinitely.
It is crucial to note that the two deadlocks were independent of the new Shard Manager integration itself; they were actually pre-existing issues deeply embedded in the Cadence Matching teardown logic.
However, the introduction of the Shard Manager acted as a catalyst, exposing these latent bugs.
Seeing how a localized teardown hang could freeze the entire node made us realize that decoupling liveness from teardown was a necessary architectural change to prevent node-wide failures.
This exposed two major design flaws in how we handled lifecycle management:
Coupled Liveness and Teardown Latency: The heartbeat is meant to be a lightweight RPC that fires on a strict interval to prove node liveness. By coupling this liveness signal to the synchronous teardown latency of potentially thousands of task list managers, a single stuck task list poisoned the entire executor's liveness signal.
Unbounded Synchronous Work:shardProcessorImpl.Stop() iterates through task list managers sequentially without any timeouts.
We rolled out fixes to solve both deadlocks and harden the architecture:
Fixing the Deadlocks (6 lines of code total): The fix for the first database-triggered deadlock was just three lines of code: replacing context.Background() with a context that has a timeout, as introduced in this pull request. The fix for the second self-deadlock was to rely only on context cancellation for a graceful shutdown, which was also accomplished in exactly three lines of code.
Hardening the Heartbeat Loop: To fix the exposed architectural flaw and decouple liveness from teardown latency, managedProcessor.processor.Stop() and Start()were moved to separate goroutines with non-blocking handoffs and strict timeouts. This guarantees that a single stuck task list manager can never hold an entire shard processor hostage indefinitely.
The new Shard Manager made both of these deadlocks visible — under the old hash-ring based routing, the same deadlocks would have caused silent degradation that would have been extremely difficult to diagnose.
Instead, the Shard Manager detected the missing heartbeats, automatically moved
all shards to the one healthy instance, and kept the system running while we
investigated. When new healthy instances appeared it immediately rebalanced
the load across them.
Found in a day, by two engineers, thanks to the
observability the Shard Manager provides.
Senior Manager at Uber, Cadence. Author of the Software Engineering Handbook
Hi Cadence Community,
At the beginning of every year, we share about what happened in the last year. Here’s a look back at our 2025 improvements, explaining what’s new, and how you can use them. 2025 was one of the most exciting years for the Cadence project and we hope this will only get better in the future. Here are some highlights:
Cadence is an open-source workflow orchestration platform. Our community currently includes more than 150 companies and it continues to grow. Many critical services are built with Cadence as it simplifies complex systems by separating business logic from its distributed system concerns.
Recently, Cadence was donated to Cloud Native Computing Foundation (CNCF), which is a major step towards making it an industry standard and receiving support from all around the world. In other words, the Cadence project is now a Linux Foundation project, licensed under Apache v2 which allows anyone to use and extend it freely. The project operates with open, merit-based governance, giving everyone the opportunity to help shape its future.
For a long time, controlling Cadence workflows was primarily done through the client SDKs or the administrative CLI. Our vision for Cadence Web has been to evolve it from a read-only inspection tool into a Unified Operational Hub for all workflow management needs. This means putting every essential workflow action directly in the hands of operators, support engineers, and developers—right where they observe workflows.
Over the past releases, we’ve been steadily introducing interactive capabilities such as Terminate, Cancel, Restart, Signal, and Reset. Each of these brought Cadence Web closer to becoming a full-featured operational console.
Are you struggling with uncontrolled concurrency when trying to process thousands of activities or child workflows? Do you find yourself hitting rate limits or overwhelming downstream services when running bulk operations? We've got great news for you!
Today, we're thrilled to announce Batch Future, a powerful new feature in the Cadence Go client that provides controlled concurrency for bulk operations. You can now process multiple activities in parallel while maintaining precise control over how many run simultaneously.
Cadence users, especially new users, often struggle with failed/stuck workflows and are unable to understand what is wrong with their workflow. This can now be addressed by a tool that runs on demand to check the workflow and provide diagnostics with actionable information via clear runbooks that users can follow. The overarching goal is to help cadence users understand what is wrong with their workflow
At Uber, we manage billions of workflows with lifetimes ranging from seconds to years. Over the course of their lifetime, workflow code logic often requires changes. To prevent non-deterministic errors that changes may cause, Cadence offers a Versioning feature. However, the feature's usage is limited because changes are only backward-compatible, but not forward-compatible. This makes potential rollbacks or workflow execution rescheduling unsafe.
To address these issues, we have made recent enhancements to the Versioning API, enabling the safe deployment of versioned workflows by separating code changes from the activation of new logic.