Activity and Workflow Retries

Activities and workflows can fail due to various intermediate conditions. In those cases, we want to retry the failed activity or child workflow or even the parent workflow. This can be achieved by supplying an optional retry policy. A retry policy looks like the following:

// RetryPolicy defines the retry policy.
RetryPolicy struct {
    // Backoff interval for the first retry. If coefficient is 1.0 then it is used for all retries.
    // Required, no default value.
    InitialInterval time.Duration

    // Coefficient used to calculate the next retry backoff interval.
    // The next retry interval is previous interval multiplied by this coefficient.
    // Must be 1 or larger. Default is 2.0.
    BackoffCoefficient float64

    // Maximum backoff interval between retries. Exponential backoff leads to interval increase.
    // This value is the cap of the interval. Default is 100x of initial interval.
    MaximumInterval time.Duration

    // Maximum time to retry. Either ExpirationInterval or MaximumAttempts is required.
    // When exceeded the retries stop even if maximum retries is not reached yet.
    ExpirationInterval time.Duration

    // Maximum number of attempts. When exceeded the retries stop even if not expired yet.
    // If not set or set to 0, it means unlimited, and relies on ExpirationInterval to stop.
    // Either MaximumAttempts or ExpirationInterval is required.
    MaximumAttempts int32

    // Non-Retriable errors. This is optional. Cadence server will stop retry if error reason matches this list.
    // Error reason for custom error is specified when your activity/workflow returns cadence.NewCustomError(reason).
    // Error reason for panic error is "cadenceInternal:Panic".
    // Error reason for any other error is "cadenceInternal:Generic".
    // Error reason for timeouts is: "cadenceInternal:Timeout TIMEOUT_TYPE". TIMEOUT_TYPE could be START_TO_CLOSE or HEARTBEAT.
    // Note that cancellation is not a failure, so it won't be retried.
    NonRetriableErrorReasons []string
}

To enable retry, supply a custom retry policy to ActivityOptions or ChildWorkflowOptions when you execute them.

expiration := time.Minute * 10
retryPolicy := &cadence.RetryPolicy{
    InitialInterval:    time.Second,
    BackoffCoefficient: 2,
    MaximumInterval:    expiration,
    ExpirationInterval: time.Minute * 10,
    MaximumAttempts:    5,
}
ao := workflow.ActivityOptions{
    ScheduleToStartTimeout: expiration,
    StartToCloseTimeout:    expiration,
    HeartbeatTimeout:       time.Second * 30,
    RetryPolicy:            retryPolicy, // Enable retry.
}
ctx = workflow.WithActivityOptions(ctx, ao)
activityFuture := workflow.ExecuteActivity(ctx, SampleActivity, params)

If activity heartbeat its progress before it failed, the retry attempt will contain the progress so activity implementation could resume from failed progress like:

func SampleActivity(ctx context.Context, inputArg InputParams) error {
    startIdx := inputArg.StartIndex
    if activity.HasHeartbeatDetails(ctx) {
        // Recover from finished progress.
        var finishedIndex int
        if err := activity.GetHeartbeatDetails(ctx, &finishedIndex); err == nil {
            startIdx = finishedIndex + 1 // Start from next one.
        }
    }

    // Normal activity logic...
    for i:=startIdx; i<inputArg.EndIdx; i++ {
        // Code for processing item i goes here...
        activity.RecordHeartbeat(ctx, i) // Report progress.
    }
}

Like retry for an activity, you need to supply a retry policy for ChildWorkflowOptions to enable retry for a child workflow. To enable retry for a parent workflow, supply a retry policy when you start a workflow via StartWorkflowOptions.

There are some subtle changes to workflow’s history events when RetryPolicy is used. For an activity with RetryPolicy:

For a workflow with RetryPolicy: