
[Bug] growth in alert_rule_version table #1639

Open
yaskinny opened this issue Aug 17, 2024 · 9 comments
Labels
bug (Something isn't working), triage/needs-information (Indicates an issue needs more information in order to work on it)

Comments

@yaskinny

Describe the Bug
The operator appears to cause unexpected growth in the alert_rule_version table. I haven't investigated the root cause deeply, but the table keeps growing even when no rules are updated. For example, I have set the re-evaluation interval for alerts to 10 minutes, and every 10 minutes 500 new records are added to the table. I haven't checked the diff between records to add more context, but the growth in the number of records is obvious. Additionally, when I delete a grafanaalertrule Custom Resource (CR) from the cluster, a large number of records are removed from this table, depending on how long the rule has existed, since multiple records are added for that specific grafanaalertrule every 10 minutes. After stopping the operator, the growth in the table ceased.
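A quick way to watch this growth over time (a generic PostgreSQL sketch; only the table name comes from above):

-- Total on-disk size and row count of the versions table; run repeatedly to see the trend.
SELECT pg_size_pretty(pg_total_relation_size('alert_rule_version')) AS total_size,
       count(*) AS row_count
FROM alert_rule_version;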

I haven't updated to the latest version yet because I haven't found any mention of this issue in the release notes or in the repository's issue tracker.
Version
v5.9.1

To Reproduce

  1. Create alerts.
  2. Set the evaluation interval to X minutes.
  3. Check the record count in the alert_rule_version table (a query sketch is included below).

(I'm using PostgreSQL 16 as the database.)
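A minimal query sketch for step 3, counting stored versions per rule. The rule_uid column name is an assumption about Grafana's alert_rule_version schema and may differ between Grafana versions:

-- Rules with the most stored versions (rule_uid is assumed; adjust to your schema).
SELECT rule_uid, count(*) AS versions
FROM alert_rule_version
GROUP BY rule_uid
ORDER BY versions DESC
LIMIT 20;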

pb82 (Collaborator) commented Aug 19, 2024

@yaskinny Could this be versioning applied by Grafana itself (as it does with dashboards)? In that case, it wouldn't be an Operator issue. Or does the growth not happen when the Grafana Operator isn't used?

yaskinny (Author) commented Aug 19, 2024

@pb82
I'm not sure whether this issue is the operator's fault or Grafana's.

I haven't had time yet to dig deeper and find the root cause, but the obvious thing is that as soon as I stop the operator, the table stops growing.

My suspicion is that there is a field in the alerts the operator sends to Grafana that makes Grafana think the alert has been updated and is newer, which causes a new record in the table. I'm not sure what that field is or where it should be handled (maybe it's the operator, which would have to derive that field from the rule state rather than something random each time, or maybe it's Grafana not comparing a field correctly).

If I get time, I'll investigate more and share the results with you.

Here is a sample alert that I'm using:

  - annotations:
      description: Rabbitmq node {{ index $labels "instance" }} has {{ $values.A.value |
        humanize }}
      summary: Rabbitmq Memory Limit
    condition: A
    execErrState: KeepLast
    for: 5m0s
    labels:
      severity: critical
      team: sre
    noDataState: OK
    title: RabbitmqMemoryLimitMetrics
    uid: ec0c9410a4c0af1ccf58cb23249de30d4addbb5b
    data:
    - datasourceUid: t-metrics
      model:
        datasource:
          type: prometheus
          uid: t-metrics
        editorMode: code
        expr: >-
          1 - (rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes ) < 0.25
        instant: true
        intervalMs: 25000
        legendFormat: __auto
        maxDataPoints: 43200
        range: false
        refId: A
      refId: A
      relativeTimeRange:
        from: 600

@theSuess (Member)

I'll try to reproduce the issue this week. If this is the case for all alerts, this should definitely be fixed soon.

DrDJIng commented Aug 28, 2024

I don't think this is an issue with the operator, but with Grafana itself. We have had similar issues with this table when using sidecar provisioning:

Grafana Issue

I never had the time to dive deep into the Grafana code to find the cause, but my gut says there's some logic issue when comparing alerts to their previous versions, causing unbounded growth.

We are seeing the same table growth now that we've switched to the Operator, though much, much slower.

Our admittedly bad workaround is an automated truncation of the table itself.
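For reference, a rough sketch of that kind of periodic cleanup, keeping only the five newest versions per rule. The id, rule_uid and version column names are assumptions about Grafana's schema; test on a copy of the data before running anything like this in production:

-- Delete everything except the 5 newest versions of each rule (column names assumed).
DELETE FROM alert_rule_version
WHERE id NOT IN (
    SELECT id FROM (
        SELECT id,
               row_number() OVER (PARTITION BY rule_uid ORDER BY version DESC) AS rn
        FROM alert_rule_version
    ) ranked
    WHERE rn <= 5
);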

@redisded

Hello, just to say we have the exact same problem here, using Argo CD and grafana-operator.
I've commented on the Grafana issue, as this seems more related to Grafana itself than to the operator, but I can provide more information about our installation or perform further tests if that helps reproduce the issue.

weisdd (Collaborator) commented Oct 9, 2024

From what I can see, the fix (grafana/grafana#89754) was merged a few days ago, so now it's a matter of waiting for a new version of Grafana to be released. After that, on the grafana-operator side, we should decide whether we only want to document how to configure the cleanup in Grafana, or whether we should also add some sane defaults to make it all automatic (although that would only help with self-hosted Grafana instances, as we have no way to configure the same in "managed" Grafanas, so it's up to the vendors to configure).

weisdd (Collaborator) commented Oct 22, 2024

Grafana v11.3.0 has just been released; it contains the configuration option that enables cleanup of old rule versions.

@BinjaFan

@weisdd
Where is this configuration? I'm running version 11.3.0, and my table keeps growing continuously.
I can't find anything related to it in the Grafana documentation.

weisdd (Collaborator) commented Oct 23, 2024

@BinjaFan it was added in grafana/grafana#89754:

[unified_alerting]
# Defines the limit of how many alert rule versions
# should be stored in the database for each alert rule in an organization including the current one.
# 0 value means no limit
rule_version_record_limit = 0

I haven't tried it myself yet, but I would assume it works once you adjust the setting accordingly.
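For Grafana instances managed by grafana-operator, the same setting can presumably also be passed through the Grafana CR's spec.config section. A minimal sketch, assuming the grafana.integreatly.org/v1beta1 API and that config values are passed as plain strings (untested):

apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
spec:
  config:
    unified_alerting:
      # Assumption: requires Grafana >= 11.3.0; keeps at most 5 versions per rule.
      rule_version_record_limit: "5"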
