Activity and workflow retries
Activities and workflows can fail due to various intermediate conditions. In those cases, we want to retry the failed activity or child workflow or even the parent workflow. This can be achieved by supplying an optional retry policy. A retry policy looks like the following:
// RetryPolicy defines the retry policy.
RetryPolicy struct {
// Backoff interval for the first retry. If coefficient is 1.0 then it is used for all retries.
// Required, no default value.
InitialInterval time.Duration
// Coefficient used to calculate the next retry backoff interval.
// The next retry interval is previous interval multiplied by this coefficient.
// Must be 1 or larger. Default is 2.0.
BackoffCoefficient float64
// Maximum backoff interval between retries. Exponential backoff leads to interval increase.
// This value is the cap of the interval. Default is 100x of initial interval.
MaximumInterval time.Duration
// Maximum time to retry. Either ExpirationInterval or MaximumAttempts is required.
// When exceeded the retries stop even if maximum retries is not reached yet.
// First (non-retry) attempt is unaffected by this field and is guaranteed to run
// for the entirety of the workflow timeout duration (ExecutionStartToCloseTimeoutSeconds).
ExpirationInterval time.Duration
// Maximum number of attempts. When exceeded the retries stop even if not expired yet.
// If not set or set to 0, it means unlimited, and relies on ExpirationInterval to stop.
// Either MaximumAttempts or ExpirationInterval is required.
MaximumAttempts int32
// Non-Retriable errors. This is optional. Cadence server will stop retry if error reason matches this list.
// Error reason for custom error is specified when your activity/workflow returns cadence.NewCustomError(reason).
// Error reason for panic error is "cadenceInternal:Panic".
// Error reason for any other error is "cadenceInternal:Generic".
// Error reason for timeouts is: "cadenceInternal:Timeout TIMEOUT_TYPE". TIMEOUT_TYPE could be START_TO_CLOSE or HEARTBEAT.
// Note that cancellation is not a failure, so it won't be retried.
NonRetriableErrorReasons []string
}
To enable retry, supply a custom retry policy to ActivityOptions
or ChildWorkflowOptions
when you execute them.
expiration := time.Minute * 10
retryPolicy := &cadence.RetryPolicy{
InitialInterval: time.Second,
BackoffCoefficient: 2,
MaximumInterval: expiration,
ExpirationInterval: time.Minute * 10,
MaximumAttempts: 5,
}
ao := workflow.ActivityOptions{
ScheduleToStartTimeout: expiration,
StartToCloseTimeout: expiration,
HeartbeatTimeout: time.Second * 30,
RetryPolicy: retryPolicy, // Enable retry.
}
ctx = workflow.WithActivityOptions(ctx, ao)
activityFuture := workflow.ExecuteActivity(ctx, SampleActivity, params)
If activity heartbeat its progress before it failed, the retry attempt will contain the progress so activity implementation could resume from failed progress like:
func SampleActivity(ctx context.Context, inputArg InputParams) error {
startIdx := inputArg.StartIndex
if activity.HasHeartbeatDetails(ctx) {
// Recover from finished progress.
var finishedIndex int
if err := activity.GetHeartbeatDetails(ctx, &finishedIndex); err == nil {
startIdx = finishedIndex + 1 // Start from next one.
}
}
// Normal activity logic...
for i:=startIdx; i<inputArg.EndIdx; i++ {
// Code for processing item i goes here...
activity.RecordHeartbeat(ctx, i) // Report progress.
}
}
Like retry for an activity, you need to supply a retry policy for ChildWorkflowOptions
to enable
retry for a child workflow. To enable retry for a parent workflow, supply a retry policy when
you start a workflow via StartWorkflowOptions
.
There are some subtle changes to workflow's history events when RetryPolicy
is used.
For an activity with RetryPolicy
:
- The
ActivityTaskScheduledEvent
will have extendedScheduleToStartTimeout
andScheduleToCloseTimeout
. These two timeouts will be overwritten by the server to be as long as the retry policy'sExpirationInterval
. If theExpirationInterval
is not specified, it will be overwritten to the workflow's timeout. - The
ActivityTaskStartedEvent
will not show up in history until the activity is completed or failed with no more retry. This is to avoid recording theActivityTaskStarted
event but later it failed and retried. Using theDescribeWorkflowExecution
API will return thePendingActivityInfo
and it will containattemptCount
if it is retrying.
For a workflow with RetryPolicy
:
- If a workflow failed and needs to retry, the workflow_execution will be closed with a
ContinueAsNew
event. The event will have theContinueAsNewInitiator
set toRetryPolicy
and the newRunID
for the next retry attempt. - The new attempt will be created immediately. But the first decision_task won't be scheduled until the backoff duration
which is also recorded in the new run's
WorkflowExecutionStartedEventAttributes
event asfirstDecisionTaskBackoffSeconds
.