Enable Prometheus metrics on the RabbitMQ Worker #181

catileptic · 2024-06-04T13:02:25Z

No description provided.


tillprochaska

Hey, I know this is merged already, but I noticed a few things where either I’m not understanding something or this implementation might be incorrect in some details. Hopefully it’s the first case, but I wanted to double check that :)

tillprochaska · 2024-06-05T17:02:28Z

servicelayer/taskqueue.py

+                # In this case, a task ID was found neither in the
+                # list of Pending, nor the list of Running tasks
+                # in Redis. It was never attempted.
+                metrics.TASKS_FAILED.labels(
+                    stage=task.operation,
+                    retries=0,
+                    failed_permanently=True,
+                ).inc()


If I understand correctly, should_execute() will return False if the task has been cancelled before it’s been executed. As the task is never executed in this branch, why would we increment the TASKS_FAILED counter?

A task that has never been executed cannot have failed (at least that’s my understanding). With this implementation, something like this could happen:

1. A user uploads a large number of files.
2. The user notices they made a mistake and cancels all running tasks.
3. The metric tracking failed tasks shows a lot of failed tasks.

This could cause confusion, especially when instance admins have alert rules set up to notify them when the error rate increases.

If we want to track tasks cancelled by users, we should set up a separate metric and ideally track that metric close to where the actual cancel action happens.
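A minimal sketch of what a separate cancellation metric could look like. To stay dependency-free, it uses `collections.Counter` as a stand-in for labelled Prometheus counters. `TASKS_CANCELLED`, the `Dataset` class, and `handle()` are hypothetical names for illustration only, not part of servicelayer.

```python
from collections import Counter

# Stand-ins for labelled Prometheus counters, keyed by stage.
# Assumption: the real code would use prometheus_client.Counter objects.
TASKS_FAILED = Counter()
TASKS_CANCELLED = Counter()  # hypothetical metric, not part of this PR


class Dataset:
    """Hypothetical minimal dataset that tracks pending task IDs."""

    def __init__(self):
        self.pending = {}  # task_id -> stage

    def add_task(self, task_id, stage):
        self.pending[task_id] = stage

    def cancel(self):
        # Count cancellations where the cancel action happens,
        # not later when a worker notices that the task is gone.
        for stage in self.pending.values():
            TASKS_CANCELLED[stage] += 1
        self.pending.clear()

    def should_execute(self, task_id):
        return task_id in self.pending


def handle(dataset, task_id):
    """Return True if the task was picked up, False if it was skipped."""
    if not dataset.should_execute(task_id):
        # The task was cancelled before it ran: skip it and leave
        # TASKS_FAILED untouched, since nothing was attempted.
        return False
    dataset.pending.pop(task_id)
    return True
```

With this split, an alert rule on the failure counter would not fire when a user cancels a large batch of tasks; the cancellations show up only in the separate counter.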

tillprochaska · 2024-06-05T17:10:15Z

servicelayer/taskqueue.py

+                    metrics.TASKS_FAILED.labels(
+                        stage=task.operation,
+                        retries=task_retry_count,
+                        failed_permanently=False,
+                    ).inc()


I think this should go into the except block (after log.exception("Error in task handling")) to ensure the counter is incremented right after a task has failed.
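Roughly what I have in mind, as a simplified sketch rather than the actual worker code. `handle_task` and `dispatch` are hypothetical names, and a `collections.Counter` keyed by the label values stands in for `metrics.TASKS_FAILED` so the snippet runs on its own.

```python
import logging
from collections import Counter

log = logging.getLogger(__name__)

# Stand-in for metrics.TASKS_FAILED, keyed by
# (stage, retries, failed_permanently). Assumption: the real code
# would call metrics.TASKS_FAILED.labels(...).inc() instead.
TASKS_FAILED = Counter()


def handle_task(operation, task_retry_count, dispatch):
    """Run dispatch() and record a failure at the point where it is caught."""
    try:
        dispatch()
        return True
    except Exception:
        log.exception("Error in task handling")
        # Increment right where the failure is observed, so the counter
        # cannot be skipped by code paths that run after the except block.
        TASKS_FAILED[(operation, task_retry_count, False)] += 1
        return False
```

Keeping the increment next to the log call also means every logged task error has exactly one matching counter increment.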

tillprochaska · 2024-06-05T17:18:31Z

servicelayer/taskqueue.py

It would be great to keep the test cases and ensure that the existing tests are also executed against the new implementation. I can help port the existing test cases over to the RabbitMQ implementation if you like.


catileptic added 6 commits June 4, 2024 15:01

Enable Prometheus metrics on the RabbitMQ Worker

d0e4679

Extract Prom. metrics to separate module

d879e08

Fix linter errors, fix missing import

411f8e2

Fix import

cd82c66

Add comment

b64690f

Merge branch 'release/1.23.0' into feature/prometheus-metrics-rabbitm…

23ef2c8

…q-worker

catileptic merged commit d6d6129 into release/1.23.0 Jun 5, 2024
1 check passed

tillprochaska reviewed Jun 6, 2024


tillprochaska added a commit that referenced this pull request Jul 4, 2024

Fix Prometheus metrics

2be9117

This fixes an issue with the metric for failed tasks and also adds test coverage for the metrics. Follow up to #181

tillprochaska mentioned this pull request Jul 4, 2024

Fix Prometheus metrics #199

Merged

tillprochaska added a commit that referenced this pull request Jul 4, 2024

Fix Prometheus metrics

2493088

This fixes an issue with the metric for failed tasks and also adds test coverage for the metrics. Follow up to #181

tillprochaska added a commit that referenced this pull request Jul 8, 2024

Fix Prometheus metrics (#199)

decef6d

This fixes an issue with the metric for failed tasks and also adds test coverage for the metrics. Follow up to #181
