Skip to main content

15 posts tagged with "Deep Dives"

Deep Dives tag description

View All Tags

Bypass the 2 MB Limit Without Shrinking Your Workflow

· 3 min read
Kevin Burns
Developer Advocate @ Uber

The 2 MB per-payload limit in Cadence does not come with a helpful error. Your workflow does not receive a graceful degradation notice. It just fails. And if you have never hit the limit before, the stack trace is not obvious about what happened.

The claim-check pattern solves this completely. Instead of compressing a large payload and hoping it fits, you offload it to an external blob store and write only a small reference into Cadence history. The limit no longer applies to your payload; only to the reference, which is always tiny.

Your Workflow History Is Storing More Than You Think

· 3 min read
Kevin Burns
Developer Advocate @ Uber

That customer order you passed into your fulfillment activity (the one with the email address, the shipping details, and the internal pricing fields) is sitting in your workflow history as plaintext JSON right now. Anyone with read access to Cadence history can see it. So can anyone with access to your history storage backend. Data pipeline: workflow history exposed vs. protected with DataConverter This is not a bug. It is how Cadence works by design, and it is the right default for most workloads. But three problems follow from it in production, and most teams hit at least one of them before they know the solution exists.

Signal Handling in the Cadence Python Client

· 6 min read
Jiahao Li
Software Engineer @ Uber

Signals are how the outside world communicates with a running Cadence workflow without interrupting it. Consider a document approval workflow that runs for days or weeks: at any moment, a manager clicks "Approve" on a dashboard and that signal needs to reach the workflow immediately — without restarting it or losing its place. Whether you're building human-in-the-loop approvals, broadcasting configuration changes to thousands of running workflows, coordinating multi-stage pipelines, or implementing liveness checks — signals are the right tool.

Signal handling feature in python client is now available in the latest cadence-python-client release, v0.2.2 — see the v0.2.2 release notes on GitHub and install from PyPI.

Safe deployments of Versioned workflows

· 8 min read
Seva Kaloshin
Senior Software Engineer @ Uber

At Uber, we manage billions of workflows with lifetimes ranging from seconds to years. Over the course of their lifetime, workflow code logic often requires changes. To prevent non-deterministic errors that changes may cause, Cadence offers a Versioning feature. However, the feature's usage is limited because changes are only backward-compatible, but not forward-compatible. This makes potential rollbacks or workflow execution rescheduling unsafe.

To address these issues, we have made recent enhancements to the Versioning API, enabling the safe deployment of versioned workflows by separating code changes from the activation of new logic.

Adaptive Tasklist Scaler

· 5 min read
Zijian Chen
Software Engineer @ Uber

At Uber, we previously relied on a dynamic configuration service to manually control the number of partitions for scalable tasklists. This configuration approach introduced several operational challenges:

  • Error-prone: Manual updates and deployments were required.
  • Unresponsive: Adjustments were typically reactive, often triggered by customer reports or observed backlogs.
  • Irreversible: Once increased, the number of partitions was rarely decreased due to the complexity of the two-phase process, especially when anticipating future traffic spikes.

To address these issues, we introduced a new component in the Cadence Matching service: Adaptive Tasklist Scaler. This component dynamically monitors tasklist traffic and adjusts partition counts automatically. Since its rollout, we've seen a significant reduction in incidents and operational overhead caused by misconfigured tasklists.

Zonal Isolation for Cadence Workflows

· 9 min read
Zijian Chen
Software Engineer @ Uber

At Uber, we want to achieve regional resilience such that losing a zone within a region can be tolerated without requiring a cross-region failover. We also want to make sure that losing a zone only affects a subset of workload, at most, rather than everything. However, in Cadence-based systems, the workload in a region is distributed randomly across all workers in the region at a “task-level granularity”, which means a workflow may be worked on by any worker in the region where the domain is active. To achieve this goal, we introduced Zonal Isolation for Cadence Workflows - a feature designed to pin workflows to the zone they are started in, so that zonal isolation can be achieved at a workflow-level.

What is Zonal Isolation for Cadence Workflows?

At high-level, Zonal Isolation for Cadence Workflows can be thought in 2 levels:

  1. Task-level isolation: All decision tasks and activity tasks of a workflow are only processed by workers from the zone where the workflow was started
  2. Infrastructure-level isolation: Within a regional Cadence cluster, workflows are handled by server instances in the same zone where they were started, and the corresponding data is stored in that zone as well.

Infrastructure-level isolation is quite challenging to implement as it requires significant changes to the core design of the Cadence server. Due to the complexity involved, support for this feature is not planned for the foreseeable future.

As a result, the focus remains on achieving task-level zonal isolation outside the Cadence server, which offers a more practical and immediate way to improve system resilience. It provides the capability of ensuring that an unhealthy zone (i.e. bad deployment of workers) only affect a subset of workflows (started from a certain zone) rather than every workflow in a Cadence domain.

Minimizing blast radius in Cadence: Introducing Workflow ID-based Rate Limits

· 7 min read
Jakob Haahr Taankvist
Senior Software Engineer @ Uber

At Uber, we run several big multitenant Cadence clusters with hundreds of domains in each. The clusters being multi-tenant means potential noisy neighbor effects between domains.

An essential aspect of avoiding this is managing how workflows interact with our infrastructure to prevent any single workflow from causing instability for the whole cluster. To this end, we are excited to introduce Workflow ID-based rate limits — a new feature designed to protect our clusters from problematic workflows and ensure stability across the board.

Why Workflow ID-based Rate Limits?

We already have rate limits for how many requests can be sent to a domain. However, since Cadence is sharded on the workflow ID, a user-provided input, an overused workflow with a particular id might overwhelm a shard by making too many requests. There are two main ways this happens:

  1. A user starts, or signals the same workflow ID too aggressively,
  2. A workflow starts too many activities over a short period of time (e.g. thousands of activities in seconds).

2024 Cadence Yearly Roadmap Update

· 17 min read
Ender Demirkaya
Senior Manager at Uber, Cadence. Author of the Software Engineering Handbook

Introduction

If you haven’t heard about Cadence, this section is for you. In a short description, Cadence is a code-driven workflow orchestration engine. The definition itself may not tell enough, so it would help splitting it into three parts:

  • What’s a workflow? (everyone has a different definition)
  • Why does it matter to be code-driven?
  • Benefits of Cadence

What is a Workflow?

workflow.png

In the simplest definition, it is “a multi-step execution”. Step here represents individual operations that are a little heavier than small in-process function calls. Although they are not limited to those: it could be a separate service call, processing a large dataset, map-reduce, thread sleep, scheduling next run, waiting for an external input, starting a sub workflow etc. It’s anything a user thinks as a single unit of logic in their code. Those steps often have dependencies among themselves. Some steps, including the very first step, might require external triggers (e.g. button click) or schedules. In the more broader meaning, any multi-step function or service is a workflow in principle.

Cadence non-deterministic errors common question Q&A (part 1)

· 3 min read
Chris Qin
Applications Developer @ Uber

If I change code logic inside an Cadence activity (for example, my activity is calling database A but now I want it to call database B), will it trigger an non-deterministic error?

NO. This change will not trigger non-deterministic error.

An Activity is the smallest unit of execution for Cadence and what happens inside activities are not recorded as historical events and therefore will not be replayed. In short, this change is deterministic and it is fine to modify logic inside activities.

Does changing the workflow definition trigger non-deterministic errors?

YES. This is a very typical non-deterministic error.

When a new workflow code change is deployed, Cadence will find if it is compatible with Cadence history. Changes to workflow definition will fail the replay process of Cadence as it finds the new workflow definition incompatible with previous historical events.

Here is a list of common workflow definition changes.

  • Changing workflow parameter counts
  • Changing workflow parameter types
  • Changing workflow return types

The following changes are not categorized as definition changes and therefore will not trigger non-deterministic errors.

  • Changes of workflow return values
  • Changing workflow parameter names as they are just positional

Non-deterministic errors, replayers and shadowers

· 3 min read
Chris Qin
Applications Developer @ Uber

It is conceivable that developers constantly update their Cadence workflow code based upon new business use cases and needs. However, the definition of a Cadence workflow must be deterministic because behind the scenes cadence uses event sourcing to construct the workflow state by replaying the historical events stored for this specific workflow. Introducing components that are not compatible with an existing running workflow will yield to non-deterministic errors and sometimes developers find it tricky to debug. Consider the following workflow that executes two activities.

func SampleWorkflow(ctx workflow.Context, data string) (string, error) {
ao := workflow.ActivityOptions{
ScheduleToStartTimeout: time.Minute,
StartToCloseTimeout: time.Minute,
}
ctx = workflow.WithActivityOptions(ctx, ao)
var result1 string
err := workflow.ExecuteActivity(ctx, ActivityA, data).Get(ctx, &result1)
if err != nil {
return "", err
}
var result2 string
err = workflow.ExecuteActivity(ctx, ActivityB, result1).Get(ctx, &result2)
return result2, err
}