Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable Prometheus metrics on the RabbitMQ Worker #181

Merged

Conversation

catileptic
Copy link
Contributor

No description provided.

@catileptic catileptic merged commit d6d6129 into release/1.23.0 Jun 5, 2024
1 check passed
Copy link
Contributor

@tillprochaska tillprochaska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I know this is merged already, but I noticed a few things where either I’m not understanding something or this implementation might be incorrect in some details. Hopefully it’s the first case, but I wanted to double check that :)

Comment on lines +532 to +539
# In this case, a task ID was found neither in the
# list of Pending, nor the list of Running tasks
# in Redis. It was never attempted.
metrics.TASKS_FAILED.labels(
stage=task.operation,
retries=0,
failed_permanently=True,
).inc()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly should_execute() will return False if the task has been cancelled before it’s been executed. As the task is never executed in this branch, why would we increase the TASKS_FAILED counter?

A task that has never been executed cannot have failed (at least that’s my understanding). If I understand correctly, in this implementation, something like this could happen:

  • User uploads a large number of files
  • Users notices they made a mistake and cancels all running tasks
  • The metric tracking failed tasks will show a lot of failed tasks

This could cause confusion, especially when instance admins have alert rules set up to be notified when the error rate increases etc..

If we want to track tasks cancelled by users, we should set up a separate metric and ideally track that metric close to where the actual cancel action happens.

Comment on lines +496 to +500
metrics.TASKS_FAILED.labels(
stage=task.operation,
retries=task_retry_count,
failed_permanently=False,
).inc()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should go into the except block (after log.exception("Error in task handling") to ensure the counter is incremented right after a task has failed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great to keep the test cases/ensure that the existing tests are also executed against the new implementation. I can help port the existing test cases over to the RabbitMQ implementation if you like.

tillprochaska added a commit that referenced this pull request Jul 4, 2024
This fixes an issue with the metric for failed tasks and also adds test coverage for the metrics.

Follow up to #181
tillprochaska added a commit that referenced this pull request Jul 4, 2024
This fixes an issue with the metric for failed tasks and also adds test coverage for the metrics.

Follow up to #181
tillprochaska added a commit that referenced this pull request Jul 8, 2024
This fixes an issue with the metric for failed tasks and also adds test coverage for the metrics.

Follow up to #181
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants