Skip to content
This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

pollMessages() in PostgresQueueDAO does not break when popMessages() returns empty message list. #236

Open
Veera93 opened this issue Jun 20, 2023 · 1 comment

Comments

@Veera93
Copy link

Veera93 commented Jun 20, 2023

Describe the bug
In pollMessages implementation we need to address two issues when messagesSlices object is empty.

  1. The loop continues when there are no eligible messages in the queue. This leads to unnecessary querying of the database until the timeout is reached, and the sleep time of 100ms escalates the problem. The mix of these two together causes high queries per second at scale.
  2. When the number of messages is less than the count specified, the thread blocks and holds onto the previous messages in the memory until timeout happens or the count is reached.

Please share your thoughts on this. If we agree, I can raise a Merge Request for the same.

Details
Conductor version: any
Persistence implementation: Postgres
Queue implementation: Postgres

To Reproduce
Steps to reproduce the behavior:

  1. Run conductor with Postgres as the persistence layer for Queue implementation.
  2. Monitor QPS in Postgres using pg_stat_statements
    Note: Here we are only seeing the effect of pollMessages called from the WorkflowReconciler where the timeout is 2000ms and count depends on the CPU core.

Expected behavior

Return the messages that are present instead of waiting and retrying till the timeout or count.

Screenshots
Just started the service and execute pg_stat_statements. Below, is the number of queries executed per minute from one instance just for WorkflowReconciler. The QSP increases exponentially if there are multiple instances and if the workers poll for activity.

Screenshot 2023-06-20 at 3 55 16 PM

Note: Here we are only seeing the effect of pollMessages called from the WorkflowReconciler. There are other places where pollMessage is being invoked.

Additional context
The MySQL implementation of the poll does not have this while(true) loop.

@Veera93
Copy link
Author

Veera93 commented Jun 22, 2023

In the conductor version 3.13.7, I see that the default timeout is 2000 ms, but we can override it to 0 which will execute timeout < 1 the if block for the desired results. But ideally, this timeout should be the query timeout. If this is changed in the future it would cause an issue.

Just curious to know why we are not using this timeout value as a query timeout for the database call.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

1 participant