Add panoptes-production-sidekiq-alt pod type #4425

Merged (1 commit) on Dec 23, 2024

lcjohnso
Member

To address high latency in lower-priority queues while UserSeenSubjectsWorker jobs in data_high dominate the available threads, this PR adds a single dedicated pod that services the default (most important), data_medium, and data_low queues. This raises the minimum total number of pods by one, but the cost makes sense in the short term (i.e., while the Panoptes replica DB is out of service during migration).

@lcjohnso
Member Author

On hold for now: more aggressive autoscaling is resolving the backups without the need for queue-specialized pods.

Collaborator

@yuenmichelle1 yuenmichelle1 left a comment

Approving and merging. The issue is still happening. I will post a write-up in the Slack thread.

Essentially: UserSeenSubjectsWorker hits a connection timeout on its update transaction, which can lead to data locking. On top of that, UserSeenSubjectsWorker is set to retry (I believe until success), so failed jobs back up in the queue.
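
To illustrate why retries inflate the backlog, here is a minimal sketch, assuming every failed job is retried until it succeeds and each attempt fails independently with the same probability. Under those assumptions the expected number of attempts per job is 1 / (1 - failure rate). The method names and the numbers are hypothetical, and the model ignores any backoff delay between retries.

```ruby
# Hypothetical model, not Panoptes code. Assumes retry-until-success and an
# independent, constant failure probability per attempt.
def expected_attempts(failure_rate)
  unless (0...1).cover?(failure_rate)
    raise ArgumentError, "failure_rate must be in [0, 1)"
  end

  # Mean of a geometric distribution: 1 / (probability of success).
  1.0 / (1.0 - failure_rate)
end

# Job executions per minute that the threads must absorb, counting retries.
def effective_load(arrivals_per_minute, failure_rate)
  arrivals_per_minute * expected_attempts(failure_rate)
end

puts effective_load(600, 0.0) # no failures: load equals arrivals
puts effective_load(600, 0.5) # half of all attempts time out: load doubles
```

If the effective load exceeds what the available threads can execute, the queue grows even though the rate of newly enqueued jobs has not changed.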

Current theory: although this query and others like it do not use the read replica, shutting the replica off moved the heavy queries that did use it onto the production DB. Those queries now share the production DB's resources, making it harder for these smaller transactions to clear.

However, because these workers are typically fast, CPU does not reach the autoscaler's threshold until the job queue backlog is very large. By then, data_medium and the lower-priority queues are hours behind.

To mitigate, as Cliff has mentioned, we create a dedicated pod to handle the lower-priority queues (data_medium and lower).
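
The effect of a dedicated pod can be sketched with a toy simulation. This is not the Panoptes configuration: the thread counts, arrival rates, pod counts, and the strict queue ordering below are all assumptions chosen only to show the starvation pattern described above.

```ruby
# Hypothetical model, not Panoptes code: each tick, new jobs arrive, then every
# thread takes one job from the first non-empty queue in its pod's list.
# Strict queue ordering and all numbers below are assumptions.
def simulate(pods, arrivals, ticks)
  backlog = Hash.new(0)
  ticks.times do
    arrivals.each { |queue, count| backlog[queue] += count }
    pods.each do |pod|
      pod[:threads].times do
        queue = pod[:queues].find { |name| backlog[name] > 0 }
        backlog[queue] -= 1 if queue # thread idles if all its queues are empty
      end
    end
  end
  backlog
end

# Jobs arriving per tick; data_high alone exceeds the shared pods' capacity.
ARRIVALS = { "default" => 1, "data_high" => 12, "data_medium" => 2, "data_low" => 1 }.freeze
ALL_QUEUES = %w[default data_high data_medium data_low].freeze

SHARED_POD = { threads: 5, queues: ALL_QUEUES }.freeze
# The dedicated pod never looks at data_high.
ALT_POD = { threads: 5, queues: %w[default data_medium data_low] }.freeze

SHARED_ONLY = simulate([SHARED_POD, SHARED_POD], ARRIVALS, 100)
WITH_ALT_POD = simulate([SHARED_POD, SHARED_POD, ALT_POD], ARRIVALS, 100)

puts "shared only:  #{SHARED_ONLY}"
puts "with alt pod: #{WITH_ALT_POD}"
```

In this toy model the data_high backlog is the same in both runs, but data_medium and data_low stop accumulating once a pod that skips data_high is present. That is the intent of the change: it bounds latency for the lower-priority queues without attempting to fix the data_high backlog itself.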

@yuenmichelle1 yuenmichelle1 merged commit 3cb6280 into master Dec 23, 2024
8 checks passed
@yuenmichelle1 yuenmichelle1 deleted the sidekiq-addaltpod-1 branch December 23, 2024 15:47
2 participants