Skip to main content

10 posts tagged with "Deep Dives"

Deep Dives tag description

View All Tags

Zonal Isolation for Cadence Workflows

· 8 min read
Zijian Chen
Software Engineer @ Uber

At Uber, we want to achieve regional resilience such that losing a zone within a region can be tolerated without requiring a cross-region failover. We also want to make sure that losing a zone only affects a subset of workload, at most, rather than everything. However, in Cadence-based systems, the workload in a region is distributed randomly across all workers in the region at a “task-level granularity”, which means a workflow may be worked on by any worker in the region where the domain is active. To achieve this goal, we introduced Zonal Isolation for Cadence Workflows - a feature designed to pin workflows to the zone they are started in, so that zonal isolation can be achieved at a workflow-level.

What is Zonal Isolation for Cadence Workflows?

At high-level, Zonal Isolation for Cadence Workflows can be thought in 2 levels:

  1. Task-level isolation: All decision tasks and activity tasks of a workflow are only processed by workers from the zone where the workflow was started
  2. Infrastructure-level isolation: Within a regional Cadence cluster, workflows are handled by server instances in the same zone where they were started, and the corresponding data is stored in that zone as well.

Infrastructure-level isolation is quite challenging to implement as it requires significant changes to the core design of the Cadence server. Due to the complexity involved, support for this feature is not planned for the foreseeable future.

As a result, the focus remains on achieving task-level zonal isolation outside the Cadence server, which offers a more practical and immediate way to improve system resilience. It provides the capability of ensuring that an unhealthy zone (i.e. bad deployment of workers) only affect a subset of workflows (started from a certain zone) rather than every workflow in a Cadence domain.

Minimizing blast radius in Cadence: Introducing Workflow ID-based Rate Limits

· 7 min read
Jakob Haahr Taankvist
Software Engineer II @ Uber

At Uber, we run several big multitenant Cadence clusters with hundreds of domains in each. The clusters being multi-tenant means potential noisy neighbor effects between domains.

An essential aspect of avoiding this is managing how workflows interact with our infrastructure to prevent any single workflow from causing instability for the whole cluster. To this end, we are excited to introduce Workflow ID-based rate limits — a new feature designed to protect our clusters from problematic workflows and ensure stability across the board.

Why Workflow ID-based Rate Limits?

We already have rate limits for how many requests can be sent to a domain. However, since Cadence is sharded on the workflow ID, a user-provided input, an overused workflow with a particular id might overwhelm a shard by making too many requests. There are two main ways this happens:

  1. A user starts, or signals the same workflow ID too aggressively,
  2. A workflow starts too many activities over a short period of time (e.g. thousands of activities in seconds).

2024 Cadence Yearly Roadmap Update

· 17 min read
Ender Demirkaya
Senior Manager at Uber, Cadence. Author of the Software Engineering Handbook

Introduction

If you haven’t heard about Cadence, this section is for you. In a short description, Cadence is a code-driven workflow orchestration engine. The definition itself may not tell enough, so it would help splitting it into three parts:

  • What’s a workflow? (everyone has a different definition)
  • Why does it matter to be code-driven?
  • Benefits of Cadence

What is a Workflow?

workflow.png

In the simplest definition, it is “a multi-step execution”. Step here represents individual operations that are a little heavier than small in-process function calls. Although they are not limited to those: it could be a separate service call, processing a large dataset, map-reduce, thread sleep, scheduling next run, waiting for an external input, starting a sub workflow etc. It’s anything a user thinks as a single unit of logic in their code. Those steps often have dependencies among themselves. Some steps, including the very first step, might require external triggers (e.g. button click) or schedules. In the more broader meaning, any multi-step function or service is a workflow in principle.

Cadence non-derministic errors common question Q&A (part 1)

· 3 min read
Chris Qin
Applications Developer @ Uber

If I change code logic inside an Cadence activity (for example, my activity is calling database A but now I want it to call database B), will it trigger an non-deterministic error?

NO. This change will not trigger non-deterministic error.

An Activity is the smallest unit of execution for Cadence and what happens inside activities are not recorded as historical events and therefore will not be replayed. In short, this change is deterministic and it is fine to modify logic inside activities.

Does changing the workflow definition trigger non-determinstic errors?

YES. This is a very typical non-deterministic error.

When a new workflow code change is deployed, Cadence will find if it is compatible with Cadence history. Changes to workflow definition will fail the replay process of Cadence as it finds the new workflow definition imcompatible with previous historical events.

Here is a list of common workflow definition changes.

  • Changing workflow parameter counts
  • Changing workflow parameter types
  • Changing workflow return types

The following changes are not categorized as definition changes and therefore will not trigger non-deterministic errors.

  • Changes of workflow return values
  • Changing workflow parameter names as they are just positional

Non-deterministic errors, replayers and shadowers

· 3 min read
Chris Qin
Applications Developer @ Uber

It is conceivable that developers constantly update their Cadence workflow code based upon new business use cases and needs. However, the definition of a Cadence workflow must be deterministic because behind the scenes cadence uses event sourcing to construct the workflow state by replaying the historical events stored for this specific workflow. Introducing components that are not compatible with an existing running workflow will yield to non-deterministic errors and sometimes developers find it tricky to debug. Consider the following workflow that executes two activities.

func SampleWorkflow(ctx workflow.Context, data string) (string, error) {
ao := workflow.ActivityOptions{
ScheduleToStartTimeout: time.Minute,
StartToCloseTimeout: time.Minute,
}
ctx = workflow.WithActivityOptions(ctx, ao)
var result1 string
err := workflow.ExecuteActivity(ctx, ActivityA, data).Get(ctx, &result1)
if err != nil {
return "", err
}
var result2 string
err = workflow.ExecuteActivity(ctx, ActivityB, result1).Get(ctx, &result2)
return result2, err
}

Write your first workflow with Cadence

· 3 min read
Chris Qin
Applications Developer @ Uber

We have covered basic components of Cadence and how to implement a Cadence worker on local environment in previous blogs. In this blog, let's write your very first HelloWorld workflow with Cadence. I've started the Cadence backend server in background and registered a domain named test-domain. You may use the code snippet for the worker service in this blog Let's first write a activity, which takes a single string argument and print a log in the console.

func helloWorldActivity(ctx context.Context, name string) (string, error) {
logger := activity.GetLogger(ctx)
logger.Info("helloworld activity started")
return "Hello " + name + "!", nil
}

Bad practices and Anti-patterns with Cadence (Part 1)

· 3 min read
Chris Qin
Applications Developer @ Uber

In the upcoming blog series, we will delve into a discussion about common bad practices and anti-patterns related to Cadence. As diverse teams often encounter distinct business use cases, it becomes imperative to address the most frequently reported issues in Cadence workflows. To provide valuable insights and guidance, the Cadence team has meticulously compiled these common challenges based on customer feedback.

  • Reusing the same workflow ID for very active/continuous running workflows

Cadence organizes workflows based on their unique IDs, using a process called partitioning. If a workflow receives a large number of updates in a short period of time or frequently starts new runs using the continueAsNew function, all these updates will be directed to the same shard. Unfortunately, the Cadence backend is not equipped to handle this concentrated workload efficiently. As a result, a situation known as a "hot shard" arises, overloading the Cadence backend and worsening the problem.

Solution: Well, the best way to avoid this is simply just design your workflow in the way such that each workflow owns a uniformly distributed workflow ID across your Cadence domain. This will make sure that Cadence backend is able to evenly distribute the traffic with proper partition on your workflowIDs.

Implement a Cadence worker service from scratch

· 4 min read
Chris Qin
Applications Developer @ Uber

In the previous blog, we have introduced three critical components for a Cadence application: the Cadence backend, domain, and worker. Among these, the worker service is the most crucial focus for developers as it hosts the activities and workflows of a Cadence application. In this blog, I will provide a short tutorial on how to implement a simple worker service from scratch in Go.

To finish this tutorial, there are two prerequisites you need to finish first

  1. Register a Cadence domain for your worker. For this tutorial, I've already registered a domain named test-domain
  2. Start the Cadence backend server in background.

To get started, let's simply use the native HTTP package built in Go to start a process listening to port 3000. You may customize the port for your worker, but the port you choose should not conflict with existing port for your Cadence backend.

package main

import (
"fmt"
"net/http"
)

func main(){
fmt.Println("Cadence worker started at port 3000")
http.ListenAndServe(":3000", nil)
}

Understanding components of Cadence application

· 2 min read
Chris Qin
Applications Developer @ Uber

Cadence is a powerful, scalable, and fault-tolerant workflow orchestration framework that helps developers implement and manage complex workflow tasks. In most cases, developers contribute activities and workflows directly to their codebases, and they may not have a full understanding of the components behind a running Cadence application. We receive numerous inquiries about setting up Cadence in a local environment from scratch for testing. Therefore, in this article, we will explore the components that power a Cadence cluster.

There are three critical components that are essential for any Cadence application:

  1. A running Cadence backend server.
  2. A registered Cadence domain.
  3. A running Cadence worker that registers all workflows and activities.

Let's go over these components in more details.

Moving to gRPC

· 5 min read
Vytautas Karpavicius
Software Engineer @ Uber

Background

Cadence historically has been using TChannel transport with Thrift encoding for both internal RPC calls and communication with client SDKs. gRPC is becoming a de-facto industry standard with much better adoption and community support. It offers features such as authentication and streaming that are very relevant for Cadence. Moreover, TChannel is being deprecated within Uber itself, pushing an effort for this migration. During the last year we’ve implemented multiple changes in server and SDK that allows users to use gRPC in Cadence, as well as to upgrade their existing Cadence cluster in a backward compatible way. This post tracks the completed work items and our future plans.

Our Approach

With ~500 services using Cadence at Uber and many more open source customers around the world, we had to think about the gRPC transition in a backwards compatible way. We couldn’t simply flip transport and encoding everywhere. Instead we needed to support both protocols as an intermediate step to ensure a smooth transition for our users.

Cadence was using Thrift/TChannel not just for the API with client SDKs. They were also used for RPC calls between internal Cadence server components and also between different data centers. When starting this migration we had a choice of either starting with public APIs first or all the internal things within the server. We chose the latter one, so that we could gain experience and iterate faster within the server without disruption to the clients. With server side done and listening for both protocols, dynamic config flag was exposed to switch traffic internally. It allowed gradual deployment and provided an option to rollback if needed.