# Cluster Troubleshooting

This section covers some common operational issues as a runbook. Feel free to add more, raise an issue in the cadence-docs project to ask for additions, or talk to us in the Slack support channel!

We will keep adding more stuff. Any contribution is very welcome.

# Errors

  • Persistence Max QPS Reached for List Operations
    • Check metrics to see how many List operations are performed per second on the domain. Alternatively, if it's a staging/QA cluster, you can enable debug log level to see more details of how a List request is rate limited.
    • Raise the rate limit for the domain if you believe the default is too low (see the dynamic config sketch after this list).
  • `Failed to lock shard. Previous range ID: 132; new range ID: 133` and `Failed to update shard. Previous range ID: 210; new range ID: 212`
    • When this keeps happening, it is very likely a critical configuration error: either two clusters are using the same database, or two clusters are using the same ringpop (bootstrap hosts).
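
To raise the per-domain limit, a file-based dynamic config override is usually the simplest route. The snippet below is a minimal sketch: it assumes your release exposes a per-domain QPS key for visibility List operations (shown here as `frontend.visibilityListMaxQPS`) and that the cluster uses the file-based dynamic config client, so verify the exact key name and constraint fields against the dynamic config reference for your Cadence version.

```yaml
# dynamicconfig/development.yaml -- file-based dynamic config (sketch)
# Key name and constraint fields are assumptions; confirm them for your Cadence version.
frontend.visibilityListMaxQPS:
  - value: 50                    # allow up to 50 List calls/sec for this domain
    constraints:
      domainName: "your-domain"  # hypothetical domain name
  - value: 10                    # fallback for all other domains
    constraints: {}
```

The file-based client polls the file periodically, so a change like this normally takes effect without a restart; check the poll interval configured in your static config.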

# API high latency, timeouts, task dispatching slowness, or too many database operations and timeouts

  • If this happens after you attempted to truncate tables in order to reuse the same database/keyspace for a new cluster, it's possible that the data was not deleted completely. Make sure Cadence is shut down while truncating, and verify that the database is fully cleaned afterwards. Alternatively, using a different keyspace/database is the safer option.

  • Timeout pushing task to matching engine, e.g. `"Fail to process task","service":"cadence-history","shard-id":431,"address":"172.31.48.64:7934","component":"transfer-queue-processor","cluster-name":"active","shard-id":431,"queue-task-id":590357768,"queue-task-visibility-timestamp":1637356594382077880,"xdc-failover-version":-24,"queue-task-type":0,"wf-domain-id":"f4d6824f-9d24-4a82-81e0-e0e080be4c21","wf-id":"55d64d58-e398-4bf5-88bc-a4696a2ba87f:63ed7cda-afcf-41cd-9d5a-ee5e1b0f2844","wf-run-id":"53b52ee0-3218-418e-a9bf-7768e671f9c1","error":"code:deadline-exceeded message:timeout","lifecycle":"ProcessingFailed","logging-call-at":"task.go:331"`

    • If this happens after traffic increased for a certain domain, it's likely that a tasklist is overloaded. Consider scaling up the tasklist (see the dynamic config sketch at the end of this section).
    • If the increase in request volume is aligned with a traffic increase across all domains, consider scaling up the cluster.
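
Scaling up a tasklist generally means increasing its number of partitions through dynamic config. The snippet below is a minimal sketch assuming the `matching.numTasklistWritePartitions` / `matching.numTasklistReadPartitions` keys and the file-based dynamic config client; key names, constraint fields, and a sensible partition count all depend on your Cadence version and workload, so confirm them before rolling this out.

```yaml
# dynamicconfig/development.yaml -- file-based dynamic config (sketch)
# Key names and constraint fields are assumptions; confirm them for your Cadence version.
# A common guideline is to keep read partitions >= write partitions so every partition
# that can receive tasks is also being polled.
matching.numTasklistWritePartitions:
  - value: 4
    constraints:
      domainName: "your-domain"      # hypothetical domain
      taskListName: "your-tasklist"  # hypothetical tasklist
matching.numTasklistReadPartitions:
  - value: 4
    constraints:
      domainName: "your-domain"
      taskListName: "your-tasklist"
```

More partitions only help if the workers polling this tasklist run enough pollers to cover them, so consider raising worker poller counts (and the number of workers) alongside this change.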