# Cluster Setup
This section will help to understand what you need for setting up a Cadence cluster.
You need to understand some key config options in Cadence server. There are two main types of configs in Cadence server, static config and dynamic config.
Also, you need understand Cadence’s dependency --- a database(Cassandra or SQL based like MySQL/Postgres) and a metric server(typically Prometheus). Cadence also needs ElastiCache+Kafka if you need Advanced visibility feature to search workflows. And Cadence also depends on a blob store like S3 if you need to enable archival feature.
# Static configs
There are lots of configs in Cadence. Usually the default values or the recommended values in development.yaml should be good to go. Here are the most basic configuration that you should understand.
Config name | Explanation | Recommended value |
---|---|---|
numHistoryShards | This is the most important one in Cadence config.It will be a fixed number in the cluster forever. The only way to change it is to migrate to another cluster. Refer to Migrate cluster section. Some facts about it: 1. Each workflow will be mapped to a single shard. Within a shard, all the workflow creation/updates are serialized. 2. Each shard will be assigned to only one History node to own the shard, using a Consistent Hashing Ring. Each shard will consume a small amount of memory/CPU to do background processing. Therefore, a single History node cannot own too many shards. You may need to figure out a good number range based on your instance size(memory/CPU). 3. Also, you can’t add an infinite number of nodes to a cluster because this config is fixed. When the number of History nodes is closed or equal to numHistoryShards, there will be some History nodes that have no shards assigned to it. This will be wasting resources. Based on above, you don’t want to have a small number of shards which will limit the maximum size of your cluster. You also don’t want to have a too big number, which will require you to have a quite big initial size of the cluster. Also, typically a production cluster will start with a smaller number and then we add more nodes/hosts to it. But to keep high availability, it’s recommended to use at least 4 nodes for each service(Frontend/History/Matching) at the beginning. | 1K~16K depending on the size ranges of the cluster you expect to run, and the instance size. |
ringpop | This is the config to let all nodes of all services connected to each other. ALL the bootstrap nodes MUST be reachable by ringpop when a service is starting up, within a MaxJoinDuration. defaultMaxJoinDuration is 2 minutes. It’s not required that bootstrap nodes need to be Frontend/History or Matching. In fact, it can be running none of them as long as it runs Ringpop protocol. | For dns mode: Recommended to put the DNS of Frontend service For hosts or hostfile mode: A list of Frontend service node addresses if using hosts mode. Make sure all the bootstrap nodes are reachable at startup. |
publicClient | The Cadence Frontend service addresses that internal Cadence system(like system workflows) need to talk to. After connected, all nodes in Ringpop will form a ring with identifiers of what service they serve. Ideally Cadence should be able to get Frontend address from there. But Ringpop doesn’t expose this API yet. | Recommended be DNS of Frontend service, so that requests will be distributed to all Frontend nodes. Using localhost+Port or local container IP address+Port will not work if the IP/container is not running frontend |
services.NAME.rpc | Configuration of how to listen to network ports and serve traffic. bindOnLocalHost:true will bind on 127.0.0.1. It’s mostly for local development. In production usually you have to specify the IP that containers will use by using bindOnIP NAME is the matter for the “--services” option in the server startup command. | Name: Use as recommended in development.yaml. bindOnIP : an IP address that the container will serve the traffic with |
services.NAME.pprof | Golang profiling service , will bind on the same IP as RPC | a port that you want to serve pprof request |
services.Name.metrics | See Metrics&Logging section | cc |
clusterMetadata | Cadence cluster configuration. enableGlobalDomain:true will enable Cadence Cross datacenter replication(aka XDC) feature. failoverVersionIncrement: This decides the maximum clusters that you will have replicated to each other at the same time. For example 10 is sufficient for most cases. masterClusterName: a master cluster must be one of the enabled clusters, usually the very first cluster to start. It is only meaningful for internal purposes. currentClusterName: current cluster name using this config file. clusterInformation is a map from clusterName to the cluster configure initialFailoverVersion: each cluster must use a different value from 0 to failoverVersionIncrement-1. rpcName: must be “cadence-frontend”. Can be improved in this issue. rpcAddress: the address to talk to the Frontend of the cluster for inter-cluster replication. Note that even if you don’t need XDC replication right now, if you want to migrate data stores in the future, you should enable xdc from every beginning. You just need to use the same name of cluster for both masterClusterName and currentClusterName. See more details in Migration section. | As explanation. |
dcRedirectionPolicy | For allowing forwarding frontend requests from passive cluster to active clusters. | “selected-apis-forwarding” |
archival | This is for archival history feature, skip if you don’t need it. See more in History Archival section | N/A |
blobstore | This is also for archival history feature Default cadence server is using file based blob store implementation. | N/A |
domainDefaults | default config for each domain. Right now only being used for Archival feature. | N/A |
dynamicConfigClient | Dynamic config is a config manager that you can change config without restarting servers. It’s a good way for Cadence to keep high availability and make things easy to configure. Default cadence server is using FileBasedClientConfig. But you can implement the dynamic config interface if you have a better way to manage. | Same as the sample development config |
persistence | Configuration for data store / persistence layer. Values of DefaultStore VisibilityStore AdvancedVisibilityStore should be keys of map DataStores. DefaultStore is for core Cadence functionality. VisibilityStore is for basic visibility feature AdvancedVisibilityStore is for advanced visibility See persistence documentation (opens new window) about using different database for Cadence | As explanation |
# Dynamic configuration
There are more dynamic configurations than static configurations. Dynamic configs can be changed at the run time without restarting any server instances. The format of dynamic configuration is defined here (opens new window).
NOTE#1: the size related configuration numbers are based on byte.
NOTE#2: current default dynamic configuration is implemented as file based configuration. This feature (opens new window) will make it better to use as a real "dynamic" configuration.
NOTE#3: for <frontend,history,matching>.persistenceMaxQPS versus <frontend,history,matching>.persistenceGlobalMaxQPS --- persistenceMaxQPS is local for single node while persistenceGlobalMaxQPS is global for all node. persistenceGlobalMaxQPS is preferred if set as greater than zero. But by default it is zero so persistenceMaxQPS is being used.
TODO: some default values are N/A here because the default value is determined during the run time. For now you may look at Cadence from the code (opens new window) to understand the behavior. Or raise the question on Stack/GitHub/Slack if it's important for you.
# Dynamic Configuration shared for all four services: Frontend/Matching/History/Worker
Config Key | Explanation | Default Values |
---|---|---|
system.enableGlobalDomain | key for enable global domain | based on static config value: clusterMetadata.EnableGlobalDomain |
system.enableNewKafkaClient | key for using New Kafka client | #N/A |
system.enableVisibilitySampling | key for enable visibility sampling | TRUE |
system.enableReadFromClosedExecutionV2 | key for enable read from cadence_visibility.closed_executions_v2 | FALSE |
system.advancedVisibilityWritingMode | key for how to write to advanced visibility | common.GetDefaultAdvancedVisibilityWritingMode(isAdvancedVisConfigExist) |
history.emitShardDiffLog | whether emit the shard diff log | FALSE |
system.enableReadVisibilityFromES | key for enable read from elastic search | based on static config value: PersistenceConfig.AdvancedVisibilityStore |
frontend.disableListVisibilityByFilter | config to disable list open/close workflow using filter | FALSE |
system.historyArchivalStatus | key for the status of history archival | #N/A |
system.enableReadFromHistoryArchival | key for enabling reading history from archival store | #N/A |
system.visibilityArchivalStatus | key for the status of visibility archival | #N/A |
system.enableReadFromVisibilityArchival | key for enabling reading visibility from archival store | #N/A |
system.enableDomainNotActiveAutoForwarding | whether enabling DC auto forwarding to active cluster for signal / start / signal with start API if domain is not active | TRUE |
system.enableGracefulFailover | whether enabling graceful failover | FALSE |
system.transactionSizeLimit | the largest allowed transaction size to persistence | #N/A |
system.minRetentionDays | the minimal allowed retention days for domain | 1 |
system.maxDecisionStartToCloseSeconds | the minimal allowed decision start to close timeout in seconds | 240 |
system.disallowQuery | the key to disallow query for a domain | FALSE |
system.enablePriorityTaskProcessor | the key for enabling priority task processor | TRUE |
system.enableAuthorization | the key to enable authorization for a domain | #N/A |
limit.blobSize.error | the per event blob size limit, exceeding this will reject requests | 210241024 |
limit.blobSize.warn | the per event blob size limit for warning | 256*1024 |
limit.historySize.error | the per workflow execution history size limit, exceeding this will kill workflows | 20010241024 |
limit.historySize.warn | the per workflow execution history size limit for warning | 5010241024 |
limit.historyCount.error | the per workflow execution history event count limit, exceeding this will kill workflows | 200*1024 |
limit.historyCount.warn | the per workflow execution history event count limit for warning | 50*1024 |
limit.maxIDLength | the length limit for various IDs, including: Domain, TaskList, WorkflowID, ActivityID, TimerID,WorkflowType, ActivityType, SignalName, MarkerName, ErrorReason/FailureReason/CancelCause, Identity, RequestID | 1000 |
limit.maxIDWarnLength | the warn length limit for various IDs, including: Domain, TaskList, WorkflowID, ActivityID, TimerID, WorkflowType, ActivityType, SignalName, MarkerName, ErrorReason/FailureReason/CancelCause, Identity, RequestID | 150 |
# Dynamic Configuration for Frontend Service
Config Key | Explanation | Default Values |
---|---|---|
frontend.persistenceMaxQPS | the max qps frontend host can query DB | 2000 |
frontend.persistenceGlobalMaxQPS | the max qps frontend cluster can query DB | 0 |
frontend.visibilityMaxPageSize | default max size for ListWorkflowExecutions in one page | 1000 |
frontend.visibilityListMaxQPS | max qps frontend can list open/close workflows | 10 |
frontend.esVisibilityListMaxQPS | max qps frontend can list open/close workflows from ElasticSearch | 30 |
frontend.esIndexMaxResultWindow | ElasticSearch index setting max_result_window | 10000 |
frontend.historyMaxPageSize | default max size for GetWorkflowExecutionHistory in one page | common.GetHistoryMaxPageSize |
frontend.rps | workflow rate limit per second | 1200 |
frontend.domainrps | workflow domain rate limit per second per domain per frontend instance | 1200 |
frontend.globalDomainrps | workflow domain rate limit per second for the whole Cadence cluster | 0 |
frontend.historyMgrNumConns | for persistence cluster.NumConns | 10 |
frontend.throttledLogRPS | the rate limit on number of log messages emitted per second for throttled logger | 20 |
frontend.shutdownDrainDuration | the duration of traffic drain during shutdown | 0 |
frontend.enableClientVersionCheck | enables client version check for frontend | FALSE |
frontend.maxBadBinaries | the max number of bad binaries in domain config | domain.MaxBadBinaries |
frontend.validSearchAttributes | legal indexed keys that can be used in list APIs | definition.GetDefaultIndexedKeys() |
frontend.sendRawWorkflowHistory | whether to enable raw history retrieving | sendRawWorkflowHistory |
frontend.searchAttributesNumberOfKeysLimit | the limit of number of keys | 100 |
frontend.searchAttributesSizeOfValueLimit | the size limit of each value | 2*1024 |
frontend.searchAttributesTotalSizeLimit | the size limit of the whole map | 40*1024 |
frontend.visibilityArchivalQueryMaxPageSize | the maximum page size for a visibility archival query | 10000 |
frontend.visibilityArchivalQueryMaxRangeInDays | the maximum number of days for a visibility archival query | #N/A |
frontend.visibilityArchivalQueryMaxQPS | the timeout for a visibility archival query | #N/A |
frontend.domainFailoverRefreshInterval | the domain failover refresh timer | 10*time.Second |
frontend.domainFailoverRefreshTimerJitterCoefficient | the jitter for domain failover refresh timer jitter | 0.1 |
# Dynamic Configuration for Matching Service
Config Key | Explanation | Default Values |
---|---|---|
matching.rps | request rate per second for each matching host | 1200 |
matching.persistenceMaxQPS | the max qps matching host can query DB | 3000 |
matching.persistenceGlobalMaxQPS | the max qps matching cluster can query DB | 0 |
matching.minTaskThrottlingBurstSize | the minimum burst size for task list throttling | 1 |
matching.getTasksBatchSize | the maximum batch size to fetch from the task buffer | 1000 |
matching.longPollExpirationInterval | the long poll expiration interval in the matching service | time.Minute |
matching.enableSyncMatch | to enable sync match | TRUE |
matching.updateAckInterval | the interval for update ack | 1*time.Minute |
matching.idleTasklistCheckInterval | the IdleTasklistCheckInterval | 5*time.Minute |
matching.maxTasklistIdleTime | the max time tasklist being idle | 5*time.Minute |
matching.outstandingTaskAppendsThreshold | the threshold for outstanding task appends | 250 |
matching.maxTaskBatchSize | max batch size for task writer | 100 |
matching.maxTaskDeleteBatchSize | the max batch size for range deletion of tasks | 100 |
matching.throttledLogRPS | the rate limit on number of log messages emitted per second for throttled logger | 20 |
matching.numTasklistWritePartitions | the number of write partitions for a task list. It’s a little tricky to use this config. See Client worker setup section. | 1 |
matching.numTasklistReadPartitions | the number of read partitions for a task list. It’s a little tricky to use this config. See Client worker setup section. | 1 |
matching.forwarderMaxOutstandingPolls | the max number of inflight polls from the forwarder | 1 |
matching.forwarderMaxOutstandingTasks | the max number of inflight addTask/queryTask from the forwarder | 1 |
matching.forwarderMaxRatePerSecond | the max rate at which add/query can be forwarded | 10 |
matching.forwarderMaxChildrenPerNode | the max number of children per node in the task list partition tree | 20 |
matching.shutdownDrainDuration | the duration of traffic drain during shutdown | 0 |
# Dynamic Configuration for History Service
Config Key | Explanation | Default Values |
---|---|---|
history.rps | request rate per second for each history host | 3000 |
history.persistenceMaxQPS | the max qps history host can query DB | 9000 |
history.persistenceGlobalMaxQPS | the max qps history cluster can query DB | 0 |
history.historyVisibilityOpenMaxQPS | max qps one history host can write visibility open_executions | 300 |
history.historyVisibilityClosedMaxQPS | max qps one history host can write visibility closed_executions | 300 |
history.longPollExpirationInterval | the long poll expiration interval in the history service | time.Second*20 |
history.cacheInitialSize | initial size of history cache | 128 |
history.cacheMaxSize | max size of history cache | 512 |
history.cacheTTL | TTL of history cache | time.Hour |
history.shutdownDrainDuration | the duration of traffic drain during shutdown | 0 |
history.eventsCacheInitialSize | initial count of events cache | 128 |
history.eventsCacheMaxSize | max count of events cache | 512 |
history.eventsCacheMaxSizeInBytes | max size of events cache in bytes | 0 |
history.eventsCacheTTL | TTL of events cache | time.Hour |
history.eventsCacheGlobalEnable | enables global cache over all history shards | FALSE |
history.eventsCacheGlobalInitialSize | initial count of global events cache | 4096 |
history.eventsCacheGlobalMaxSize | max count of global events cache | 131072 |
history.acquireShardInterval | interval that timer used to acquire shard | time.Minute |
history.acquireShardConcurrency | number of goroutines that can be used to acquire shards in the shard controller. | 1 |
history.standbyClusterDelay | the artificial delay added to standby cluster's view of active cluster's time | 5*time.Minute |
history.standbyTaskMissingEventsResendDelay | the amount of time standby cluster's will wait (if events are missing)before calling remote for missing events | 15*time.Minute |
history.standbyTaskMissingEventsDiscardDelay | the amount of time standby cluster's will wait (if events are missing)before discarding the task | 25*time.Minute |
history.taskProcessRPS | the task processing rate per second for each domain | 1000 |
history.taskSchedulerType | the task scheduler type for priority task processor | int(task.SchedulerTypeWRR) |
history.taskSchedulerWorkerCount | the number of workers per host in task scheduler | 200 |
history.taskSchedulerShardWorkerCount | the number of worker per shard in task scheduler | 0 |
history.taskSchedulerQueueSize | the size of task channel for host level task scheduler | 10000 |
history.taskSchedulerShardQueueSize | the size of task channel for shard level task scheduler | 200 |
history.taskSchedulerDispatcherCount | the number of task dispatcher in task scheduler (only applies to host level task scheduler) | 1 |
history.taskSchedulerRoundRobinWeight | the priority weight for weighted round robin task scheduler | common.ConvertIntMapToDynamicConfigMapProperty(DefaultTaskPriorityWeight) |
history.activeTaskRedispatchInterval | the active task redispatch interval | 5*time.Second |
history.standbyTaskRedispatchInterval | the standby task redispatch interval | 30*time.Second |
history.taskRedispatchIntervalJitterCoefficient | the task redispatch interval jitter coefficient | 0.15 |
history.standbyTaskReReplicationContextTimeout | the context timeout for standby task re-replication | 3*time.Minute |
history.queueProcessorEnableSplit | indicates whether processing queue split policy should be enabled | FALSE |
history.queueProcessorSplitMaxLevel | the max processing queue level | 2 // 3 levels, start from 0 |
history.queueProcessorEnableRandomSplitByDomainID | indicates whether random queue split policy should be enabled for a domain | FALSE |
history.queueProcessorRandomSplitProbability | the probability for a domain to be split to a new processing queue | 0.01 |
history.queueProcessorEnablePendingTaskSplitByDomainID | indicates whether pending task split policy should be enabled | FALSE |
history.queueProcessorPendingTaskSplitThreshold | the threshold for the number of pending tasks per domain | common.ConvertIntMapToDynamicConfigMapProperty(DefaultPendingTaskSplitThreshold) |
history.queueProcessorEnableStuckTaskSplitByDomainID | indicates whether stuck task split policy should be enabled | FALSE |
history.queueProcessorStuckTaskSplitThreshold | the threshold for the number of attempts of a task | common.ConvertIntMapToDynamicConfigMapProperty(DefaultStuckTaskSplitThreshold) |
history.queueProcessorSplitLookAheadDurationByDomainID | the look ahead duration when spliting a domain to a new processing queue | 20*time.Minute |
history.queueProcessorPollBackoffInterval | the backoff duration when queue processor is throttled | 5*time.Second |
history.queueProcessorPollBackoffIntervalJitterCoefficient | backoff interval jitter coefficient | 0.15 |
history.queueProcessorEnablePersistQueueStates | indicates whether processing queue states should be persisted | FALSE |
history.queueProcessorEnableLoadQueueStates | indicates whether processing queue states should be loaded | FALSE |
history.timerTaskBatchSize | batch size for timer processor to process tasks | 100 |
history.timerTaskWorkerCount | number of task workers for timer processor | 10 |
history.timerTaskMaxRetryCount | max retry count for timer processor | 100 |
history.timerProcessorGetFailureRetryCount | retry count for timer processor get failure operation | 5 |
history.timerProcessorCompleteTimerFailureRetryCount | retry count for timer processor complete timer operation | 10 |
history.timerProcessorUpdateShardTaskCount | update shard count for timer processor | #N/A |
history.timerProcessorUpdateAckInterval | update interval for timer processor | 30*time.Second |
history.timerProcessorUpdateAckIntervalJitterCoefficient | the update interval jitter coefficient | 0.15 |
history.timerProcessorCompleteTimerInterval | complete timer interval for timer processor | 60*time.Second |
history.timerProcessorFailoverMaxPollRPS | max poll rate per second for timer processor | 1 |
history.timerProcessorMaxPollRPS | max poll rate per second for timer processor | 20 |
history.timerProcessorMaxPollInterval | max poll interval for timer processor | 5*time.Minute |
history.timerProcessorMaxPollIntervalJitterCoefficient | the max poll interval jitter coefficient | 0.15 |
history.timerProcessorSplitQueueInterval | the split processing queue interval for timer processor | 1*time.Minute |
history.timerProcessorSplitQueueIntervalJitterCoefficient | the split processing queue interval jitter coefficient | 0.15 |
history.timerProcessorMaxRedispatchQueueSize | the threshold of the number of tasks in the redispatch queue for timer processor | 10000 |
history.timerProcessorEnablePriorityTaskProcessor | indicates whether priority task processor should be used for timer processor | TRUE |
history.timerProcessorEnableMultiCursorProcessor | indicates whether multi-cursor queue processor should be used for timer processor | FALSE |
history.timerProcessorMaxTimeShift | the max shift timer processor can have | 1*time.Second |
history.timerProcessorHistoryArchivalSizeLimit | the max history size for inline archival | 500*1024 |
history.timerProcessorArchivalTimeLimit | the upper time limit for inline history archival | 1*time.Second |
history.transferTaskBatchSize | batch size for transferQueueProcessor | 100 |
history.transferProcessorFailoverMaxPollRPS | max poll rate per second for transferQueueProcessor | 1 |
history.transferProcessorMaxPollRPS | max poll rate per second for transferQueueProcessor | 20 |
history.transferTaskWorkerCount | number of worker for transferQueueProcessor | 10 |
history.transferTaskMaxRetryCount | max times of retry for transferQueueProcessor | 100 |
history.transferProcessorCompleteTransferFailureRetryCount | times of retry for failure | 10 |
history.transferProcessorUpdateShardTaskCount | update shard count for transferQueueProcessor | #N/A |
history.transferProcessorMaxPollInterval | max poll interval for transferQueueProcessor | 1*time.Minute |
history.transferProcessorMaxPollIntervalJitterCoefficient | the max poll interval jitter coefficient | 0.15 |
history.transferProcessorSplitQueueInterval | the split processing queue interval for transferQueueProcessor | 1*time.Minute |
history.transferProcessorSplitQueueIntervalJitterCoefficient | the split processing queue interval jitter coefficient | 0.15 |
history.transferProcessorUpdateAckInterval | update interval for transferQueueProcessor | 30*time.Second |
history.transferProcessorUpdateAckIntervalJitterCoefficient | the update interval jitter coefficient | 0.15 |
history.transferProcessorCompleteTransferInterval | complete timer interval for transferQueueProcessor | 60*time.Second |
history.transferProcessorMaxRedispatchQueueSize | the threshold of the number of tasks in the redispatch queue for transferQueueProcessor | 10000 |
history.transferProcessorEnablePriorityTaskProcessor | indicates whether priority task processor should be used for transferQueueProcessor | TRUE |
history.transferProcessorEnableMultiCursorProcessor | indicates whether multi-cursor queue processor should be used for transferQueueProcessor | FALSE |
history.transferProcessorVisibilityArchivalTimeLimit | the upper time limit for archiving visibility records | 200*time.Millisecond |
history.replicatorTaskBatchSize | batch size for ReplicatorProcessor | 100 |
history.replicatorTaskWorkerCount | number of worker for ReplicatorProcessor | 10 |
history.replicatorReadTaskMaxRetryCount | the number of read replication task retry time | 3 |
history.replicatorTaskMaxRetryCount | max times of retry for ReplicatorProcessor | 100 |
history.replicatorProcessorMaxPollRPS | max poll rate per second for ReplicatorProcessor | 20 |
history.replicatorProcessorUpdateShardTaskCount | update shard count for ReplicatorProcessor | #N/A |
history.replicatorProcessorMaxPollInterval | max poll interval for ReplicatorProcessor | 1*time.Minute |
history.replicatorProcessorMaxPollIntervalJitterCoefficient | the max poll interval jitter coefficient | 0.15 |
history.replicatorProcessorUpdateAckInterval | update interval for ReplicatorProcessor | 5*time.Second |
history.replicatorProcessorUpdateAckIntervalJitterCoefficient | the update interval jitter coefficient | 0.15 |
history.replicatorProcessorMaxRedispatchQueueSize | the threshold of the number of tasks in the redispatch queue for ReplicatorProcessor | 10000 |
history.replicatorProcessorEnablePriorityTaskProcessor | indicates whether priority task processor should be used for ReplicatorProcessor | FALSE |
history.executionMgrNumConns | persistence connections number for ExecutionManager | 50 |
history.historyMgrNumConns | persistence connections number for HistoryManager | 50 |
history.maximumBufferedEventsBatch | max number of buffer event in mutable state | 100 |
history.maximumSignalsPerExecution | max number of signals supported by single execution | 10000 |
history.shardUpdateMinInterval | the minimal time interval which the shard info can be updated | 5*time.Minute |
history.shardSyncMinInterval | the minimal time interval which the shard info should be sync to remote | 5*time.Minute |
history.shardSyncMinInterval | the sync shard jitter coefficient | #N/A |
history.defaultEventEncoding | the encoding type for history events | string(common.EncodingTypeThriftRW) |
history.numArchiveSystemWorkflows | key for number of archive system workflows running in total | 1000 |
history.archiveRequestRPS | the rate limit on the number of archive request per second | 300 // should be much smaller than frontend RPS |
history.enableAdminProtection | whether to enable admin checking | FALSE |
history.adminOperationToken | the token to pass admin checking | common.DefaultAdminOperationToken |
history.historyMaxAutoResetPoints | the key for max number of auto reset points stored in mutableState | DefaultHistoryMaxAutoResetPoints |
history.enableParentClosePolicy | whether to ParentClosePolicy | TRUE |
history.parentClosePolicyThreshold | decides that parent close policy will be processed by sys workers(if enabled) ifthe number of children greater than or equal to this threshold | 10 |
history.numParentClosePolicySystemWorkflows | key for number of parentClosePolicy system workflows running in total | 10 |
history.throttledLogRPS | the rate limit on number of log messages emitted per second for throttled logger | 4 |
history.stickyTTL | to expire a sticky tasklist if no update more than this duration | time.Hour24365 |
history.decisionHeartbeatTimeout | for decision heartbeat | time.Minute*30 |
history.DropStuckTaskByDomain | whether stuck timer/transfer task should be dropped for a domain | FALSE |
# Dynamic Configuration for System Worker Service
Config Key | Explanation | Default Values |
---|---|---|
worker.persistenceMaxQPS | the max qps worker host can query DB | 500 |
worker.persistenceGlobalMaxQPS | the max qps worker cluster can query DB | 0 |
worker.replicatorMetaTaskConcurrency | the number of coroutine handling metadata related tasks | #N/A |
worker.replicatorTaskConcurrency | the number of coroutine handling non metadata related tasks | #N/A |
worker.replicatorMessageConcurrency | the max concurrent tasks provided by messaging client | #N/A |
worker.replicatorActivityBufferRetryCount | the retry attempt when encounter retry error on activity | #N/A |
worker.replicatorHistoryBufferRetryCount | the retry attempt when encounter retry error on history | #N/A |
worker.replicationTaskMaxRetryCount | the max retry count for any task | #N/A |
worker.replicationTaskMaxRetryDuration | the max retry duration for any task | #N/A |
worker.replicationTaskContextDuration | the context timeout for apply replication tasks | #N/A |
worker.workerReReplicationContextTimeout | the context timeout for end to end re-replication process | #N/A |
worker.enableReplication | the feature flag for kafka replication | #N/A |
worker.indexerConcurrency | the max concurrent messages to be processed at any given time | 1000 |
worker.ESProcessorNumOfWorkers | num of workers for esProcessor | 1 |
worker.ESProcessorBulkActions | max number of requests in bulk for esProcessor | 1000 |
worker.ESProcessorBulkSize | max total size of bulk in bytes for esProcessor | 2<<24 // 16MB |
worker.ESProcessorFlushInterval | flush interval for esProcessor | 1*time.Second |
worker.EnableArchivalCompression | indicates whether blobs are compressed before they are archived | #N/A |
worker.WorkerHistoryPageSize | indicates the page size of history fetched from persistence for archival | #N/A |
worker.WorkerTargetArchivalBlobSize | indicates the target blob size in bytes for archival, actual blob size may vary | #N/A |
worker.ArchiverConcurrency | controls the number of coroutines handling archival work per archival workflow | 50 |
worker.ArchivalsPerIteration | controls the number of archivals handled in each iteration of archival workflow | 1000 |
worker.DeterministicConstructionCheckProbability | controls the probability of running a deterministic construction check for any given archival | #N/A |
worker.BlobIntegrityCheckProbability | controls the probability of running an integrity check for any given archival | #N/A |
worker.TimeLimitPerArchivalIteration | controls the time limit of each iteration of archival workflow | archiver.MaxArchivalIterationTimeout() |
worker.throttledLogRPS | the rate limit on number of log messages emitted per second for throttled logger | 20 |
worker.scannerPersistenceMaxQPS | the maximum rate of persistence calls from worker.Scanner | 100 |
worker.taskListScannerEnabled | indicates if task list scanner should be started as part of worker.Scanner | TRUE |
worker.historyScannerEnabled | indicates if history scanner should be started as part of worker.Scanner | TRUE |
worker.executionsScannerEnabled | indicates if executions scanner should be started as part of worker.Scanner | FALSE |
worker.executionsScannerConcurrency | indicates the concurrency of concrete execution scanner | 25 |
worker.executionsScannerBlobstoreFlushThreshold | indicates the flush threshold of blobstore in concrete execution scanner | 100 |
worker.executionsScannerActivityBatchSize | indicates the batch size of scanner activities | 25 |
worker.executionsScannerPersistencePageSize | indicates the page size of execution persistence fetches in concrete execution scanner | 1000 |
worker.executionsScannerInvariantCollectionMutableState | indicates if mutable state invariant checks should be run | TRUE |
worker.executionsScannerInvariantCollectionHistory | indicates if history invariant checks should be run | TRUE |
worker.currentExecutionsScannerEnabled | indicates if current executions scanner should be started as part of worker.Scanner | FALSE |
worker.currentExecutionsConcurrency | indicates the concurrency of current executions scanner | 25 |
worker.currentExecutionsBlobstoreFlushThreshold | indicates the flush threshold of blobstore in current executions scanner | 100 |
worker.currentExecutionsActivityBatchSize | indicates the batch size of scanner activities | 25 |
worker.currentExecutionsPersistencePageSize | indicates the page size of execution persistence fetches in current executions scanner | 1000 |
worker.currentExecutionsScannerInvariantCollectionHistory | indicates if history invariant checks should be run | FALSE |
worker.currentExecutionsInvariantCollectionMutableState | indicates if mutable state invariant checks should be run | TRUE |
worker.enableBatcher | decides whether start batcher in our worker | FALSE |
system.enableParentClosePolicyWorker | decides whether or not enable system workers for processing parent close policy task | TRUE |
system.enableStickyQuery | indicates if sticky query should be enabled per domain | TRUE |
# Dynamic Configuration for Cross DC replication feature
Config Key | Explanation | Default Values |
---|---|---|
history.ReplicationTaskFetcherParallelism | determines how many go routines we spin up for fetching tasks | 1 |
history.ReplicationTaskFetcherAggregationInterval | determines how frequently the fetch requests are sent | 2*time.Second |
history.ReplicationTaskFetcherTimerJitterCoefficient | the jitter for fetcher timer | 0.15 |
history.ReplicationTaskFetcherErrorRetryWait | the wait time when fetcher encounters error | time.Second |
history.ReplicationTaskFetcherServiceBusyWait | the wait time when fetcher encounters service busy error | 60*time.Second |
history.ReplicationTaskProcessorErrorRetryWait | the initial retry wait when we see errors in applying replication tasks | 50*time.Millisecond |
history.ReplicationTaskProcessorErrorRetryMaxAttempts | the max retry attempts for applying replication tasks | 5 |
history.ReplicationTaskProcessorNoTaskInitialWait | the wait time when not ask is returned | 2*time.Second |
history.ReplicationTaskProcessorCleanupInterval | determines how frequently the cleanup replication queue | 1*time.Minute |
history.ReplicationTaskProcessorCleanupJitterCoefficient | the jitter for cleanup timer | 0.15 |
history.ReplicationTaskProcessorReadHistoryBatchSize | the batch size to read history events | 5 |
history.ReplicationTaskProcessorStartWait | the wait time before each task processing batch | 5*time.Second |
history.ReplicationTaskProcessorStartWaitJitterCoefficient | the jitter for batch start wait timer | 0.9 |
history.ReplicationTaskProcessorHostQPS | the qps of task processing rate limiter on host level | 1500 |
history.ReplicationTaskProcessorShardQPS | the qps of task processing rate limiter on shard level | 5 |
history.ReplicationTaskGenerationQPS | the wait time between each replication task generation qps | 100 |
history.EnableConsistentQuery | indicates if consistent query is enabled for the cluster | TRUE |
history.EnableConsistentQueryByDomain | indicates if consistent query is enabled for a domain | FALSE |
history.MaxBufferedQueryCount | indicates the maximum number of queries which can be buffered at a given time for a single workflow | 1 |
history.mutableStateChecksumGenProbability | the probability [0-100] that checksum will be generated for mutable state | 0 |
history.mutableStateChecksumVerifyProbability | the probability [0-100] that checksum will be verified for mutable state | 0 |
history.mutableStateChecksumInvalidateBefore | the epoch timestamp before which all checksums are to be discarded | 0 |
history.ReplicationEventsFromCurrentCluster | a feature flag to allow cross DC replicate events that generated from the current cluster | FALSE |
history.NotifyFailoverMarkerInterval | determines the frequency to notify failover marker | 5*time.Second |
history.NotifyFailoverMarkerTimerJitterCoefficient | the jitter for failover marker notifier timer | 0.15 |