
Controlling Workflows From Web

· 2 min read
Assem Hafez
Software Engineer @ Uber

For a long time, controlling Cadence workflows was primarily done through the client SDKs or the administrative CLI. Our vision for Cadence Web has been to evolve it from a read-only inspection tool into a Unified Operational Hub for all workflow management needs. This means putting every essential workflow action directly in the hands of operators, support engineers, and developers—right where they observe workflows.

Over the past releases, we’ve been steadily introducing interactive capabilities such as Terminate, Cancel, Restart, Signal, and Reset. Each of these brought Cadence Web closer to becoming a full-featured operational console.

Today, we’re completing that journey with the introduction of the Start Workflow action. Operators can now manage the full workflow lifecycle—from initiation to completion—entirely within the Web UI.


Where to Find the New Controls

All workflow actions are designed to be discoverable and intuitive, fitting naturally into existing pages and workflows.

1. Workflow Execution Page

Workflow actions menu

2. Domain Page (Starting New Workflows)

Start new workflow


Try It Today!

With Start Workflow now available, Cadence Web provides a truly end-to-end operational surface for workflows. Whether you're debugging incidents, restarting workflows after a fix, or manually kicking off test runs, everything you need is now in one place.

If you haven’t updated yet, we highly recommend upgrading to the latest Cadence Web release and exploring the new Workflow Actions.

Cadence Joins CNCF (Cloud Native Computing Foundation)

· 4 min read
Ender Demirkaya
Senior Manager at Uber, Cadence. Author of the Software Engineering Handbook


We’re proud to announce that the Cadence project has joined the CNCF (Cloud Native Computing Foundation)®, the open-source foundation that hosts and maintains critical components of modern cloud-native infrastructure including Kubernetes®, Prometheus®, and Envoy® under the Linux Foundation®.

Cadence is an open-source, fault-tolerant, and highly scalable workflow orchestration engine created at Uber to help developers build and run resilient applications. It’s been powering thousands of use cases at Uber and other companies. By managing distributed state, retries, scaling, and failure recovery, Cadence enables teams to focus on business logic rather than infrastructure complexity. Mission-critical applications across industries including finance, e-commerce, healthcare, and transportation depend on Cadence.

Joining CNCF marks a significant milestone for the Cadence project, emphasizing the project’s open source commitment. With its open governance, companies can join as maintainers, strengthening long-term confidence in the project. Increased transparency in roadmap and execution makes upcoming features more predictable.

Since its inception, the Cadence project’s ecosystem has reached over 150 companies and counting. Partners like NetApp® Instaclustr have adopted the project and offer it as a managed solution at scale. With CNCF’s support, the project aims to further its mission of simplifying distributed service development while delivering production-grade reliability at scale.

In the last several years, Cadence has made significant investments in its scalability, reliability, multitenancy, deployment safety, and portability, laying the necessary foundation to build enterprise-level features at scale, efficiently and reliably. It’s now a great time to build those features together, and we invite anyone to be a part of this future. Especially in the era of AI, Cadence will play a crucial role in durable orchestration.

What’s Changing in the Community?

We’ll stop using our Slack workspace (uber-cadence.slack.com). Going forward, we’ll use CNCF’s Slack workspace (cloud-native.slack.com). Join the new workspace using Community Inviter, then find us in the #cadence-users channel.

Our website (cadenceworkflow.io) and our GitHub org (github.com/cadence-workflow) will stay the same and we’ll continue sharing new features from there.

We’ll publish our roadmap at https://github.com/orgs/cadence-workflow/projects. We’ll hold community meetings to brainstorm about and prioritize upcoming features. Project tracking will move from internal tools to GitHub. Projects will have dedicated issues so you can track pull requests, updates, and timelines.

We’ll organize regular meetups (in-person and virtual) to showcase new features, have discussions, and learn from valuable guests.

For maintainers, we’ll hold regular meetings to update each other. If you’d like to become a maintainer, please contact us on Slack so we can help with starter tasks and larger projects as you gain experience.

How to Become a Maintainer?

We invite companies that are already using Cadence, or plan to adopt it in the future, to become official maintainers and help shape this critical piece of infrastructure for your organization.

With this important milestone, we are prioritizing the addition of new maintainers and working to make the onboarding experience as smooth as possible. Our goal is to scale the project responsibly across all areas including development, decision making, efficiency, modernization, prioritization, and more.

If you are interested, please reach out to us in the #cadence-users channel mentioned above, and we will help you find suitable projects to contribute to. If you already have something in mind, feel free to open an issue in the appropriate repository under github.com/cadence-workflow.

Acknowledgments

CNCF® and the CNCF logo design are registered trademarks of the Cloud Native Computing Foundation.

Envoy®, Kubernetes®, Prometheus®, and their logos are registered trademarks of The Linux Foundation® in the United States and other countries. No endorsement by The Linux Foundation is implied by the use of these marks.

Instaclustr® and NetApp® are trademarks of NetApp, Inc.

Introducing Batch Future with Concurrency Control

· 6 min read
Kevin Burns
Developer Advocate @ Uber

Are you struggling with uncontrolled concurrency when trying to process thousands of activities or child workflows? Do you find yourself hitting rate limits or overwhelming downstream services when running bulk operations? We've got great news for you!

Today, we're thrilled to announce Batch Future, a powerful new feature in the Cadence Go client that provides controlled concurrency for bulk operations. You can now process multiple activities in parallel while maintaining precise control over how many run simultaneously.
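To make the problem concrete, below is a minimal sketch of the kind of hand-rolled batching a workflow previously needed in order to cap how many activities run at once. It uses only standard Cadence Go client primitives; ProcessItemActivity, the items slice, and maxConcurrency are illustrative placeholders rather than part of the new Batch Future API.

package sample

import (
    "context"
    "time"

    "go.uber.org/cadence/workflow"
)

// ProcessItemActivity is a hypothetical activity that handles a single item.
func ProcessItemActivity(ctx context.Context, item string) error {
    // Call a downstream service here.
    return nil
}

// BulkProcessingWorkflow processes items in fixed-size batches so that at most
// maxConcurrency activities are in flight at any time.
func BulkProcessingWorkflow(ctx workflow.Context, items []string) error {
    ao := workflow.ActivityOptions{
        ScheduleToStartTimeout: time.Minute,
        StartToCloseTimeout:    5 * time.Minute,
    }
    ctx = workflow.WithActivityOptions(ctx, ao)

    const maxConcurrency = 10
    for start := 0; start < len(items); start += maxConcurrency {
        end := start + maxConcurrency
        if end > len(items) {
            end = len(items)
        }

        // Schedule one batch of activities in parallel...
        futures := make([]workflow.Future, 0, end-start)
        for _, item := range items[start:end] {
            futures = append(futures, workflow.ExecuteActivity(ctx, ProcessItemActivity, item))
        }
        // ...and wait for the whole batch to finish before moving on.
        for _, f := range futures {
            if err := f.Get(ctx, nil); err != nil {
                return err
            }
        }
    }
    return nil
}

Batch Future is intended to replace this kind of boilerplate with built-in concurrency control; see the Go client documentation and samples for the exact API.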

Workflow Diagnostics

· 3 min read
Sankari Gopalakrishnan
Senior Software Engineer @ Uber

Cadence users, especially new users, often struggle with failed or stuck workflows and are unable to understand what is wrong. This can now be addressed by a tool that runs on demand to check the workflow and provide diagnostics with actionable information via clear runbooks that users can follow. The overarching goal is to help Cadence users understand what is wrong with their workflows.

Safe deployments of Versioned workflows

· 8 min read
Seva Kaloshin
Software Engineer II @ Uber

At Uber, we manage billions of workflows with lifetimes ranging from seconds to years. Over the course of their lifetime, workflow code logic often requires changes. To prevent non-deterministic errors that changes may cause, Cadence offers a Versioning feature. However, the feature's usage is limited because changes are only backward-compatible, but not forward-compatible. This makes potential rollbacks or workflow execution rescheduling unsafe.

To address these issues, we have made recent enhancements to the Versioning API, enabling the safe deployment of versioned workflows by separating code changes from the activation of new logic.

What is a Versioned Workflow?

Cadence reconstructs a workflow's execution history by replaying past events against your workflow code, expecting the exact same outcome every time. If your workflow code changes in an incompatible way, this replaying process can lead to non-deterministic errors.

A versioned workflow uses the Versioning feature to avoid these errors. This allows developers to safely update their workflow code without breaking existing executions. The key is the workflow.GetVersion function (available in Go and Java). By using workflow.GetVersion, you can mark points in your code where changes occur, ensuring that future calls will return a specific version number.

For example:

v := workflow.GetVersion(ctx, "change-id", workflow.DefaultVersion, 1)
if v == workflow.DefaultVersion {
    err = workflow.ExecuteActivity(ctx, ActivityA, data).Get(ctx, &result1)
} else {
    err = workflow.ExecuteActivity(ctx, ActivityC, data).Get(ctx, &result1)
}

Deployment flow

Let’s consider an example deployment of a change from workflow code v0.1, where only FooActivity is supported.

// Git tag: v0.1
func MyWorkflow(ctx workflow.Context) error {
    return workflow.ExecuteActivity(ctx, FooActivity).Get(ctx, nil)
}

to workflow code v0.2, which introduces a new BarActivity and utilizes the Versioning feature:

// Git tag: v0.2
func MyWorkflow(ctx workflow.Context) error {
    version := workflow.GetVersion(ctx, "MyChange", workflow.DefaultVersion, 1)
    if version == workflow.DefaultVersion {
        return workflow.ExecuteActivity(ctx, FooActivity).Get(ctx, nil)
    }
    return workflow.ExecuteActivity(ctx, BarActivity).Get(ctx, nil)
}

Before the rollout, only instances of workflow code v0.1 existed:

old-deployment-flow-v0.1.png

Rollouts are typically performed gradually, with new workers replacing previous worker instances one at a time. This means that multiple workers with workflow code v0.1 and v0.2 can exist simultaneously. When a worker is replaced, a running workflow execution is rescheduled to another worker. Thanks to the Versioning feature, a worker with workflow code v0.2 can support a workflow execution started by a worker with workflow code v0.1.

old-deployment-flow-v0.1-v0.2.png

During rollouts, the service should continue to serve production traffic, allowing new workflows to be initiated. If a new worker processes a "Start Workflow Execution" request, it will execute a workflow based on the new version. However, if an old worker handles the request, it will start a workflow based on the old version.

old-deployment-flow-v0.1-v0.2-start-workflow.png

If a rollout is completed successfully, both the new and old workflows will continue to execute simultaneously.

old-deployment-flow-v0.2.png

Versioned Workflow Rescheduling Problem

Workflows typically execute on the same worker on which they started. However, various factors can necessitate rescheduling to a different worker:

  • Worker Shutdown: Occurs when a worker is shut down due to reasons such as rollouts, rollbacks, restarts, or instance crashes.
  • Worker Unavailability: Occurs when a worker is running but loses connection to the server, becoming unavailable.
  • High Traffic Load: Occurs when a worker's sticky cache is fully utilized, preventing further workflow execution and causing the server to reschedule the workflow to another worker.

During a rollout or rollback, workflow rescheduling for workflow executions with new versions becomes unsafe, especially during rollbacks:

workflow-rescheduling-problem.png

  • If an old workflow is rescheduled to either an old or a new worker, it generally processes correctly.
  • If a new workflow is rescheduled to an old worker, it will be blocked or even fail (depending on NonDeterministicWorkflowPolicy).

Why does this happen?

The old worker doesn't support the new version and cannot replay its history correctly, which leads to a non-deterministic error. The Versioning API allowed customers to make only backward-compatible changes to workflow code definitions; however, these changes were not forward-compatible.

At the same time, there were no workarounds allowing customers to make these changes forward-compatible, so they couldn't separate code changes from the activation of the new version.

What impact did we have at Uber?

To eliminate the negative impact of a rollback, a Cadence customer needed to identify all problematic workflows (the exact effort depended on the workflow code, the nature of the changes, and the blast radius), terminate any that did not fail automatically, and restart them. These steps resulted in a significant on-call burden, leading to possible SLO violations and incidents.

Based on customer impact, we introduced changes in the Versioning API, enabling customers to separate code changes from the activation of the new version.

ExecuteWithVersion and ExecuteWithMinVersion

The recent release of the Go SDK (Java soon) has extended the GetVersion function and introduced two new options:

// When it's executed for the first time, it returns 2, instead of 10 
version := workflow.GetVersion(ctx, "changeId", 1, 10, workflow.ExecuteWithVersion(2))

// When it's executed for the first time, it returns 1, instead of 10
version := workflow.GetVersion(ctx, "changeId", 1, 10, workflow.ExecuteWithMinVersion())

These two new options enable customers to choose which version should be returned when GetVersion is executed for the first time, instead of the maximum supported version.

  • ExecuteWithVersion returns the specified version.
  • ExecuteWithMinVersion returns the minimum supported version.

Let’s extend the example above and consider the deployment of versioned workflows with new functions:

Deployment of Versioned workflows

Step 0

The initial version remains v0.1

// Git tag: v0.1
// MyWorkflow supports: workflow.DefaultVersion
func MyWorkflow(ctx workflow.Context) error {
    return workflow.ExecuteActivity(ctx, FooActivity).Get(ctx, nil)
}

When a StartWorkflowExecution request is processed, a new workflow execution will have the DefaultVersion for the upcoming change ID.

new-deployment-flow-step-0.png

Step 1

GetVersion is still used; however, workflow.ExecuteWithVersion has also been added.

// Git tag: v0.2
// MyWorkflow supports: workflow.DefaultVersion and 1
func MyWorkflow(ctx workflow.Context) error {
    // When GetVersion is executed for the first time, workflow.DefaultVersion will be returned
    version := workflow.GetVersion(ctx, "MyChange", workflow.DefaultVersion, 1, workflow.ExecuteWithVersion(workflow.DefaultVersion))

    if version == workflow.DefaultVersion {
        return workflow.ExecuteActivity(ctx, FooActivity).Get(ctx, nil)
    }
    return workflow.ExecuteActivity(ctx, BarActivity).Get(ctx, nil)
}

Worker v0.2 contains the new workflow code definition that supports the new logic. However, when a StartWorkflowExecution request is processed, a new workflow execution will still have the default version of the “MyChange” change ID.

new-deployment-flow-step-1.png

This change enables customers to easily roll back to worker v0.1 without encountering any non-deterministic errors.

Step 2

Once all v0.1 workers have been replaced with v0.2 workers, we can deploy a new worker that begins workflow executions with the new version.

// Git tag: v0.3
// MyWorkflow supports: workflow.DefaultVersion and 1
func MyWorkflow(ctx workflow.Context) error {
    // When GetVersion is executed for the first time, Version #1 will be returned
    version := workflow.GetVersion(ctx, "MyChange", workflow.DefaultVersion, 1)

    if version == workflow.DefaultVersion {
        return workflow.ExecuteActivity(ctx, FooActivity).Get(ctx, nil)
    }
    return workflow.ExecuteActivity(ctx, BarActivity).Get(ctx, nil)
}

Worker v0.3 contains the new workflow code definition that supports the new logic while still supporting the previous logic. Therefore, when a StartWorkflowExecution request is processed, a new workflow execution will have Version #1 of the “MyChange” change ID.

new-deployment-flow-step-2.png

This change enables customers to easily roll back to worker v0.2 without any non-deterministic errors, as both worker versions support "DefaultVersion" and "Version #1" of the “MyChange” change ID.

Step 3

Once all v0.2 workers have been replaced with v0.3 workers and all workflows with the DefaultVersion of “MyChange” have finished, we can deploy a new worker that starts workflow executions with the new version and doesn’t support the previous logic.

// Git tag: v0.4
// MyWorkflow supports: 1
func MyWorkflow(ctx workflow.Context) error {
    // When GetVersion is executed for the first time, Version #1 will be returned
    _ = workflow.GetVersion(ctx, "MyChange", 1, 1)
    return workflow.ExecuteActivity(ctx, BarActivity).Get(ctx, nil)
}

Worker v0.4 contains the new workflow code definition that supports the new logic but does not support the previous logic. Therefore, when a StartWorkflowExecution request is processed, a new workflow execution will have Version #1 of the “MyChange” change ID.

new-deployment-flow-step-3.png

This change finalizes the safe rollout of the new versioned workflow. At each step, both versions of workers are fully compatible with one another, making rollouts and rollbacks safe.

Differences with the previous deployment flow

The previous deployment flow for versioned workflows included only Steps 0, 2, and 3. Therefore, a direct upgrade from Step 0 to Step 2 (skipping Step 1) was not safe due to the inability to perform a safe rollback. The new functions enabled customers to have Step 1, thereby making the deployment process safe.

Conclusion

The new options introduced into GetVersion address gaps in the Versioning logic that previously led to failed workflow executions. This enhancement improves the safety of deploying versioned workflows, allowing for the separation of code changes from the activation of new logic, making the deployment process more predictable. This extension of GetVersion is a significant improvement that opens the way for future optimizations.

Adaptive Tasklist Scaler

· 5 min read
Zijian Chen
Software Engineer @ Uber

At Uber, we previously relied on a dynamic configuration service to manually control the number of partitions for scalable tasklists. This configuration approach introduced several operational challenges:

  • Error-prone: Manual updates and deployments were required.
  • Unresponsive: Adjustments were typically reactive, often triggered by customer reports or observed backlogs.
  • Irreversible: Once increased, the number of partitions was rarely decreased due to the complexity of the two-phase process, especially when anticipating future traffic spikes.

To address these issues, we introduced a new component in the Cadence Matching service: Adaptive Tasklist Scaler. This component dynamically monitors tasklist traffic and adjusts partition counts automatically. Since its rollout, we've seen a significant reduction in incidents and operational overhead caused by misconfigured tasklists.


What is a Scalable Tasklist?

A scalable tasklist is one that supports multiple partitions. Since Cadence’s Matching service is sharded by tasklist, all requests to a specific tasklist are routed to a single Matching host. To avoid bottlenecks and enhance scalability, tasklists can be partitioned so that multiple Matching hosts handle traffic concurrently.

These partitions are transparent to clients. When a request arrives at the Cadence server for a scalable tasklist, the server selects an appropriate partition. More details can be found in this document.

How Is the Number of Partitions Manually Configured?

The number of partitions for a tasklist is controlled by two dynamic configuration properties:

  1. matching.numTasklistReadPartitions: Specifies the number of read partitions.
  2. matching.numTasklistWritePartitions: Specifies the number of write partitions.

To prevent misconfiguration, a guardrail is in place to ensure that the number of read partitions is never less than the number of write partitions.

When increasing the number of partitions, both properties are typically updated simultaneously. However, due to the guardrail, the order of updates doesn't matter—read and write partitions can be increased in any sequence.

In contrast, decreasing the number of partitions is more complex and requires a two-phase process:

  1. First, reduce the number of write partitions.
  2. Then, wait for any backlog in the decommissioned partitions to drain completely.
  3. Finally, reduce the number of read partitions.

Because this process is tedious, error-prone, and backlog-sensitive, it is rarely performed in production environments.


How Does Adaptive Tasklist Scaler Work?

The architecture of the adaptive tasklist scaler is shown below:

adaptive tasklist scaler architecture

1. Migrating Configuration to the Database

The first key change was migrating partition count configuration from dynamic config to the Cadence cluster’s database. This allows the configuration to be updated programmatically.

  • The adaptive tasklist scaler runs in the root partition only.
  • It reads and updates the partition count.
  • Updates propagate to non-root partitions via a push model, and to pollers and producers via a pull model.
  • A version number is associated with each config. The version only increments through scaler updates, ensuring monotonicity and consistency across components.

2. Monitoring Tasklist Traffic

The scaler periodically monitors the write QPS of each tasklist.

  • If QPS exceeds an upscale threshold for a sustained period, the number of read and write partitions is increased proportionally.
  • If QPS falls below a downscale threshold, only the write partitions are reduced initially. The system then waits for drained partitions to clear before reducing the number of read partitions, ensuring backlog-free downscaling (see the sketch below).
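
Below is a simplified, illustrative sketch of that decision logic. It is not the actual Cadence implementation; the exact formula, the hysteresis handling, and the names used here are assumptions made for illustration only.

package sketch

import "math"

// scalerConfig mirrors the dynamic configuration knobs described in the
// Configuration section below; field meanings are assumed for illustration.
type scalerConfig struct {
    PartitionUpscaleRPS      float64 // per-partition write QPS that triggers an upscale
    PartitionDownscaleFactor float64 // hysteresis factor (< 1) that lowers the downscale threshold
}

// desiredWritePartitions returns the partition count suggested by the sustained
// write QPS, leaving the count unchanged while QPS sits between the thresholds.
func desiredWritePartitions(cfg scalerConfig, currentPartitions int, sustainedQPS float64) int {
    upscaleThreshold := cfg.PartitionUpscaleRPS * float64(currentPartitions)
    downscaleThreshold := upscaleThreshold * cfg.PartitionDownscaleFactor

    switch {
    case sustainedQPS > upscaleThreshold:
        // Scale out proportionally to the observed traffic.
        return int(math.Ceil(sustainedQPS / cfg.PartitionUpscaleRPS))
    case sustainedQPS < downscaleThreshold && currentPartitions > 1:
        // Scale in: write partitions are reduced first; read partitions follow
        // only after the drained partitions have no backlog.
        target := int(math.Ceil(sustainedQPS / cfg.PartitionUpscaleRPS))
        if target < 1 {
            target = 1
        }
        return target
    default:
        return currentPartitions
    }
}

In the real scaler, a change is applied only after QPS stays beyond a threshold for the configured sustained duration, and evaluations happen on the matching.adaptiveScalerUpdateInterval cadence, as described in the Configuration section below.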

Enabling Adaptive Tasklist Scaler

Prerequisites

To use this feature, upgrade Cadence to v1.3.0 or later.

Also, migrate tasklist partition configurations to the database using this guide.

Configuration

The scaler is governed by the following dynamic configuration parameters:

  • matching.enableAdaptiveScaler: Enables the scaler at the tasklist level.
  • matching.partitionUpscaleSustainedDuration: Duration that QPS must stay above threshold before triggering upscale.
  • matching.partitionDownscaleSustainedDuration: Duration below threshold required before triggering downscale.
  • matching.adaptiveScalerUpdateInterval: Frequency at which the scaler evaluates and updates partition counts.
  • matching.partitionUpscaleRPS: QPS threshold per partition that triggers upscale.
  • matching.partitionDownscaleFactor: Factor applied to introduce hysteresis, lowering the QPS threshold for downscaling to avoid oscillations.

Monitoring and Metrics

Several metrics have been introduced to help monitor the scaler’s behavior:

QPS and Thresholds

  • estimated_add_task_qps_per_tl: Estimated QPS of task additions per tasklist.
  • tasklist_partition_upscale_threshold: Upscale threshold for task additions.
  • tasklist_partition_downscale_threshold: Downscale threshold for task additions.

The estimated_add_task_qps_per_tl value should remain between the upscale and downscale thresholds. If not, the scaler may not be functioning properly.

Partition Configurations

  • task_list_partition_config_num_read: Number of current read partitions.
  • task_list_partition_config_num_write: Number of current write partitions.
  • task_list_partition_config_version: Version of the current partition configuration.

These metrics are emitted by various components: root and non-root partitions, pollers, and producers. Their values should align under normal conditions, except immediately after updates.


Status at Uber

We enabled adaptive tasklist scaler across all Uber clusters in March 2025. Since its deployment:

  • Zero incidents have been reported due to misconfigured tasklists.
  • Operational workload related to manual scaling has been eliminated.
  • Scalability and resilience of the Matching service have improved significantly.

Introducing cadence-web v4.0.0

· 6 min read
Adhitya Mamallan
Software Engineer II @ Uber

We are excited to announce the release of cadence-web v4.0.0—a complete rewrite of the Cadence web app. Cadence has always been about empowering developers to manage complex workflows, and with this release, we not only modernize the web interface by embracing today’s cutting-edge technologies but also strengthen the open source community by aligning our tools with the broader trends seen across the industry.

What's new in cadence-web v4.0.0

  • Revamped UI & Experience – A fresh, modern interface designed for better usability and efficiency.
  • Multi-Cluster Support – The UI can now connect to multiple Cadence clusters.
  • Performance Improvements – Faster load times, optimised API calls, and a smoother experience.

Cadence Repositories Have Moved!

· One min read
Josué Alexander Ibarra
Developer Advocate @ Uber

We’re excited to announce that all Cadence GitHub repositories have been consolidated under the cadence-workflow organization! 🎉

Previously, Cadence repositories were distributed across multiple organizations at Uber: uber, uber-go, uber-common. To improve developer cohesiveness and simplify access, the Cadence Core team has migrated all open-source repositories to the cadence-workflow organization.

For example, our main repository has moved from:

👉 uber/cadence

To its new home:

👉 cadence-workflow/cadence

You can find the full list of Cadence repositories here 👉 orgs/cadence-workflow/repositories

Zonal Isolation for Cadence Workflows

· 9 min read
Zijian Chen
Software Engineer @ Uber

At Uber, we want to achieve regional resilience such that losing a zone within a region can be tolerated without requiring a cross-region failover. We also want to ensure that losing a zone affects only a subset of the workload rather than everything. However, in Cadence-based systems, the workload in a region is distributed randomly across all workers in the region at a “task-level granularity”, which means a workflow may be worked on by any worker in the region where the domain is active. To achieve this goal, we introduced Zonal Isolation for Cadence Workflows - a feature designed to pin workflows to the zone they are started in, so that zonal isolation can be achieved at the workflow level.

What is Zonal Isolation for Cadence Workflows?

At a high level, Zonal Isolation for Cadence Workflows can be thought of at 2 levels:

  1. Task-level isolation: All decision tasks and activity tasks of a workflow are only processed by workers from the zone where the workflow was started
  2. Infrastructure-level isolation: Within a regional Cadence cluster, workflows are handled by server instances in the same zone where they were started, and the corresponding data is stored in that zone as well.

Infrastructure-level isolation is quite challenging to implement as it requires significant changes to the core design of the Cadence server. Due to the complexity involved, support for this feature is not planned for the foreseeable future.

As a result, the focus remains on achieving task-level zonal isolation outside the Cadence server, which offers a more practical and immediate way to improve system resilience. It ensures that an unhealthy zone (e.g. a bad deployment of workers) affects only a subset of workflows (those started from a certain zone) rather than every workflow in a Cadence domain.

Announcement: Cadence Helm Charts v0 Release

· 3 min read
Taylan Isikdemir
Sr. Staff Software Engineer @ Uber

We’ve heard your feedback: deploying Cadence has been a challenge, especially with limited documentation on operational aspects. So far, we’ve only provided a few docker compose files to help you get started on a development machine. However, deploying and managing Cadence at scale requires a deep understanding of underlying services, configurations and their dependencies.

To address these challenges, we’re launching several initiatives to make it easier to deploy and operate Cadence clusters. These include deployment specs for common scenarios, monitoring dashboards, alerts, runbooks, and more comprehensive documentation.