-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable Prometheus metrics on the RabbitMQ Worker #181
Enable Prometheus metrics on the RabbitMQ Worker #181
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, I know this is merged already, but I noticed a few things where either I’m not understanding something or this implementation might be incorrect in some details. Hopefully it’s the first case, but I wanted to double check that :)
# In this case, a task ID was found neither in the | ||
# list of Pending, nor the list of Running tasks | ||
# in Redis. It was never attempted. | ||
metrics.TASKS_FAILED.labels( | ||
stage=task.operation, | ||
retries=0, | ||
failed_permanently=True, | ||
).inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly should_execute()
will return False
if the task has been cancelled before it’s been executed. As the task is never executed in this branch, why would we increase the TASKS_FAILED
counter?
A task that has never been executed cannot have failed (at least that’s my understanding). If I understand correctly, in this implementation, something like this could happen:
- User uploads a large number of files
- Users notices they made a mistake and cancels all running tasks
- The metric tracking failed tasks will show a lot of failed tasks
This could cause confusion, especially when instance admins have alert rules set up to be notified when the error rate increases etc..
If we want to track tasks cancelled by users, we should set up a separate metric and ideally track that metric close to where the actual cancel action happens.
metrics.TASKS_FAILED.labels( | ||
stage=task.operation, | ||
retries=task_retry_count, | ||
failed_permanently=False, | ||
).inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should go into the except
block (after log.exception("Error in task handling")
to ensure the counter is incremented right after a task has failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great to keep the test cases/ensure that the existing tests are also executed against the new implementation. I can help port the existing test cases over to the RabbitMQ implementation if you like.
This fixes an issue with the metric for failed tasks and also adds test coverage for the metrics. Follow up to #181
This fixes an issue with the metric for failed tasks and also adds test coverage for the metrics. Follow up to #181
This fixes an issue with the metric for failed tasks and also adds test coverage for the metrics. Follow up to #181
No description provided.