Investigation: Support alerts based on Loki logs #3178

Closed
Tracked by #3362
AverageMarcus opened this issue Jan 23, 2024 · 23 comments

@AverageMarcus
Member

AverageMarcus commented Jan 23, 2024

Now that we have Loki logs from kube-system and giantswarm namespaces available in our MCs it'd be nice to be able to configure alerts based on log output.

For example:

We've recently hit an AWS quota limit in our testing environment which now prevents us from creating any new ACM certificates. This quota doesn't have any way to view current usage in AWS (that I've been able to find), so we have no way to monitor our usage against this limit. The only way we currently know that we've hit it is that our components (irsa-operator) fail to create an ACM certificate and produce an error in the logs. This doesn't bubble up to any metrics, so we currently can't be alerted to this problem happening.
It would be really useful if we could craft an alert that queried for LimitExceededException or similar in our operator logs and alerted us to this.
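
Something along these lines is what I have in mind (just a sketch; the namespace and container label values are guesses and would need checking against the log labels we actually ship):

sum(count_over_time({namespace="giantswarm", container="irsa-operator"} |= "LimitExceededException" [5m])) > 0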

@github-project-automation github-project-automation bot moved this to Inbox 📥 in Roadmap Jan 23, 2024
@QuentinBisson QuentinBisson added the needs/refinement Needs refinement in order to be actionable label Feb 6, 2024
@QuentinBisson

Needs to be documented as well:

  • Basic recording rules
  • Basic alerting

@Rotfuks Rotfuks removed this from Roadmap Mar 27, 2024
@Rotfuks Rotfuks removed the needs/refinement Needs refinement in order to be actionable label May 8, 2024
@hervenicol hervenicol self-assigned this Jul 1, 2024
@architectbot architectbot added the team/atlas Team Atlas label Jul 1, 2024
@hervenicol

hervenicol commented Jul 1, 2024

Exploration

Docs:

Tools:

Questions/remarks:

  • we may need to remote-write metrics produced by recording rules to mimir/prometheus
  • alertmanager config to be set up (https://grafana.com/docs/loki/latest/configure/#ruler as alertmanager_client); see the sketch after this list
    • how do we inject this loki config? Do we need an operator?
  • how to load rules?
    • one-shot with cortextool
    • how to automate this? Is there an operator that does it?
    • look for loki-sc-rules pod in loki-backend.
    • has to be done for each tenant
  • multi-tenancy (and tenants aggregation on read path) prevents reading rules from grafana.
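
As a starting point, the ruler section of the Loki config could look roughly like this (untested sketch; service names and ports are placeholders, and how to inject this config is the open question above):

ruler:
  # where the ruler sends firing alerts (alertmanager_client settings would go alongside this)
  alertmanager_url: http://alertmanager-operated.monitoring.svc:9093
  enable_alertmanager_v2: true
  # remote-write metrics produced by recording rules to Mimir
  remote_write:
    enabled: true
    client:
      url: http://mimir-distributor.mimir.svc:8080/api/v1/push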

@hervenicol

  • alertmanager config done ✔
  • remotewrite config done for mimir ✔
  • load rules ongoing ⏲
  • access to rules from grafana broken for now because of tenants aggregation ❗

@hervenicol

How to load Loki rules:

We can use the "alloy-rules" component that currently loads prometheus-rules to Mimir.
It watches for prometheusRules custom resources and can load them to Loki as well.

Caveat:
prometheus-operator checks that the rules use valid promQL syntax, and loki rules, although they are valid prometheusRules custom resources, use logQL syntax, which is slightly different.

So we have to disable the prometheus-operator validation webhook for loki rules.
We have to differentiate loki rules from mimir rules anyway for Alloy.
We could either:

  • use a namespaceSelector, so that rules in the loki namespace are Loki rules
  • use a labelSelector, so we could tag loki rules with a specific label like application.giantswarm.io/prometheusRule/kind: loki

I have tested the namespace selector successfully with this alloy config ($ k get cm alloy-rules -n mimir -oyaml | yq '.data."config.alloy"'):

mimir.rules.kubernetes "local" {
    address = "http://mimir-ruler.mimir.svc:8080/"
    tenant_id = "anonymous"
    rule_namespace_selector {
        match_expression {
          key = "kubernetes.io/metadata.name"
          operator = "NotIn"
          values = ["loki"]
        }
    }
}
loki.rules.kubernetes "local" {
    address = "http://loki-backend.loki.svc:3100"
    tenant_id = "golem"

    rule_namespace_selector {
        match_expression {
          key = "kubernetes.io/metadata.name"
          operator = "In"
          values = ["loki"]
        }
    }
}

And this addition to kube-prometheus-stack user-values:

k get cm -n org-giantswarm  golem-kube-prometheus-stack-user-values -oyaml | yq '.data.values' | yq '.kube-prometheus-stack.prometheusOperator.denyNamespaces'
- loki
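
For comparison, the labelSelector option mentioned above would presumably look something like this in the alloy config (untested sketch; it assumes the rule_selector block of loki.rules.kubernetes and the label we would put on loki rules):

loki.rules.kubernetes "local" {
    address   = "http://loki-backend.loki.svc:3100"
    tenant_id = "golem"

    // only pick up prometheusRules explicitly tagged as loki rules
    rule_selector {
        match_labels = {
            "application.giantswarm.io/prometheus-rule-kind" = "loki"
        }
    }
}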

@QuentinBisson

Have you tried the loki backend sidecar to load rules?

@hervenicol

Have you tried the loki backend sidecar to load rules?

No, because I was quite happy that Alloy could do it:
I like the idea that we have the same management of rules regardless of the ruler.
And also that loki and mimir rules could depend on each other and have similar CI. So in the end they could both be in the prometheus-rules repo; as they are distributed as prometheusRules CRs, that looks fine to me.

@QuentinBisson

Sure. I'm wondering if we could consider options like: https://loki-operator.dev/docs/ruler_support.md/

@QuentinBisson

Actually, I'm even more confused now :D

@hervenicol

We already have a complete pipeline for PrometheusRules that we are used to; why should we look for a different solution?

Loki Operator could be useful for dynamic alertmanager or remotewrite configs, but I don't think we need that.

@QuentinBisson

QuentinBisson commented Jul 10, 2024

So, coming from this comment, I thought we were still somewhat in the investigation phase and not the decision phase, so I'm asking: are there other solutions out there that might fit better than our current flow?

To be honest, I'm not really happy about the validation hack we need to do to be able to use Loki alerting, for a few reasons:

  1. It's hacky at best, we don't know if this will keep working in the future, and it's not per se a PrometheusRule; I'm a fan of good semantics :)
  2. What does it mean for validation in the prometheus-rules repo? Because I would assume pint does not support them, so we will have to create some hack there?
  3. Will that scale when we run tempo (there are talks about adding a ruler with TraceQL)?
  4. We already have that limitation (because alloy does not support it when loading rules), but that means we cannot use tenant federations for now, and we should be okay with that :)

I'm definitely not saying this is not the solution we will end up using, maybe as a temporary solution, maybe it's complete, who knows, I really wanted to know what's out there before moving on to enabling this :)

Now, for a future possible iteration, would it make sense to ask some Grafana people (maybe at community meetings) if they would like to sponsor or partner on some kind of Rules CRD in, say, Alloy, that could be of type LogQL, TraceQL, PromQL and so on? I think that would be a cool topic.

@hervenicol

Now you're saying what you don't like about my proposal and hope to improve with another solution, and I like that :)

To be honest, I'm not really happy about the validation hack we need to do to be able to use Loki alerting, for a few reasons:

  1. It's hacky at best, we don't know if this will keep working in the future, and it's not per se a PrometheusRule; I'm a fan of good semantics :)

I can't see what makes you think it wouldn't work in the future. But that's a point we could work on.
It's not "per se" prometheusRules, but that's how GrafanaLabs decided it should be managed whether we do it via lokiOperator or Alloy.

The rules loader sidecar probably does not rely on these custom resources, that's true.

  2. What does it mean for validation in the prometheus-rules repo? Because I would assume pint does not support them, so we will have to create some hack there?

It obviously means some changes in the prometheus-rules repo.
I was thinking about having some new directories like helm/prometheus-rules/templates/platform/atlas/loki-alerting-rules/ and helm/prometheus-rules/templates/platform/atlas/loki-recording-rules/

And these directories would have a different validation, based on cortextool (or the newly released lokitool), which properly checks loki rules and logql syntax.
That's not as powerful as pint, but that's all we can do for loki rules anyway.
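
For instance, CI could run something along these lines against the rendered rule files (a sketch; the directory paths are hypothetical, and it assumes the --backend=loki option that cortextool's rules subcommands provide):

# lint LogQL rule files with cortextool (the newer lokitool should work the same way)
cortextool rules lint --backend=loki rendered/loki-alerting-rules/*.yaml rendered/loki-recording-rules/*.yaml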

But we could have a separate repo for lokiRules if we have reasons to do that. This is a matter of how we package the rules, I was planning on doing it as a next step.

  3. Will that scale when we run tempo (there are talks about adding a ruler with TraceQL)?

I sure hope that, since GrafanaLabs chose that path for mimir and loki rules, they will be consistent and have a similar feature in Alloy for tempo rules.

  4. We already have that limitation (because alloy does not support it when loading rules), but that means we cannot use tenant federations for now, and we should be okay with that :)

I think that's more related to the ruler than to how we load rules, right?

I'm definitely not saying this is not the solution we will end up using, maybe as a temporary solution, maybe it's complete, who knows, I really wanted to know what's out there before moving on to enabling this :)

I agree with the "it's not real prometheusRules so we shouldn't mix these" point. lokiOperator won't solve it, but I can try loading rules from configMaps into loki with the sidecar loader.

Now, for a future possible iteration, would it make sense to ask some Grafana people (maybe at community meetings) if they would like to sponsor or partner on some kind of Rules CRD in, say, Alloy, that could be of type LogQL, TraceQL, PromQL and so on? I think that would be a cool topic.

That could be a thing for the future, but for now the only thing that would change would be how to validate the expr part of the rule.

@QuentinBisson

Thanks for going through that long one :D

Now you're saying what you don't like about my proposal and hope to improve with another solution, and I like that :)

To be honest, I'm not really happy about the validation hack we need to do to be able to use Loki alerting, for a few reasons:

  1. It's hacky at best, we don't know if this will keep working in the future, and it's not per se a PrometheusRule; I'm a fan of good semantics :)

I can't see what makes you think it wouldn't work in the future. But that's a point we could work on. It's not "per se" prometheusRules, but that's how GrafanaLabs decided it should be managed whether we do it via lokiOperator or Alloy.

The rules loader sidecar probably does not rely on these custom resources, that's true.

On that note, if we are not using it, we probably should disable the sidecar to save on a bit of resources :D
But you're right, I would hope they would add something else and not drop support for it :)

  2. What does it mean for validation in the prometheus-rules repo? Because I would assume pint does not support them, so we will have to create some hack there?

It obviously means some changes in the prometheus-rules repo. I was thinking about having some new directories like helm/prometheus-rules/templates/platform/atlas/loki-alerting-rules/ and helm/prometheus-rules/templates/platform/atlas/loki-recording-rules/

And these directories would have a different validation, based on cortextool (or the newly released lokitool), which properly checks loki rules and logql syntax. That's not as powerful as pint, but that's all we can do for loki rules anyway.

But we could have a separate repo for lokiRules if we have reasons to do that. This is a matter of how we package the rules, I was planning on doing it as a next step.

This is a good idea, but I'm not sure a directory makes it easier. What would you think about using a suffix for rules instead, the same way we do it today with test files?

So we could have something like this.

atlas/
  alerting-rules/
    alert.metrics.yaml
    alertLoki.logs.yaml
    alert.slo.yaml <- eventually

But that's a different topic, because maybe we want to have all alerts for a component in one place. We can discuss such things later :D I trust your expertise will find something.

  3. Will that scale when we run tempo (there are talks about adding a ruler with TraceQL)?

I sure hope that, since GrafanaLabs chose that path for mimir and loki rules, they will be consistent and have a similar feature in Alloy for tempo rules.

You're right, yes.

  4. We already have that limitation (because alloy does not support it when loading rules), but that means we cannot use tenant federations for now, and we should be okay with that :)

I think that's more related to the ruler than to how we load rules, right?

Actually, no: the Mimir ruler supports federated rule groups (https://grafana.com/docs/mimir/latest/references/architecture/components/optional-grafana-mimir-ruler/#federated-rule-groups), which the PrometheusRules solution would not allow us to use, and the possibility to set a tenant on the alert, or sourceTenants, is not there in PrometheusRules. But this is an issue Grafana has and not just us, so this is quite a light argument :)
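
For illustration, a federated rule group for the Mimir ruler looks roughly like this (sketch; the tenant, group and rule names are invented), and there is no field in PrometheusRules today to carry source_tenants:

namespace: example
groups:
  - name: federated-example
    # evaluated against the union of these tenants' data
    source_tenants: [tenant-a, tenant-b]
    rules:
      - record: example:up:sum
        expr: sum(up)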

I'm definitely not saying this is not the solution we will end up using, maybe as a temporary solution, maybe it's complete, who knows, I really wanted to know what's out there before moving on to enabling this :)

I agree with the "it's not real prometheusRules so we shouldn't mix these" point. lokiOperator won't solve it, but I can try loading rules from configMaps into loki with the sidecar loader.

Now, for a future possible iteration, would it make sense to ask some Grafana people (maybe at community meetings) if they would like to sponsor or partner on some kind of Rules CRD in, say, Alloy, that could be of type LogQL, TraceQL, PromQL and so on? I think that would be a cool topic.

That could be a thing for the future, but for now the only thing that would change would be how to validate the expr part of the rule.

Regarding the last point, I would think we could create a new set of CRs (maybe added to alloy with Grafana) that would for sure have a different validation, but also have a tenant field and a supportTenants one. Also, I quite like the distinction in the loki operator between alerting and recording rules, because they don't have the same expectations (I don't need labels on recording rules, for instance), but that is, as you said, something to work on later.

@hervenicol

So, I understand better why you think using prometheusRules does not cover all use cases with Mimir and Loki.
But do you agree that the plan I presented (a labelSelector would be best IMO) is a proper solution for now, until we understand what we want to do with multi-tenancy?

I also feel like loading Configmaps with the sidecar requires local rules storage, and may require some tweaks/hacks for multi-tenancy as well. Meaning, as-is, it does not work and we should disable it to save a few resources.

@QuentinBisson

QuentinBisson commented Jul 11, 2024

Yes I agree that it's the best solution short term and that was a really good exploration on your part :)

@hervenicol

Current status

  • For now, loki rules will only work on Mimir-enabled clusters (i.e. CAPI).
  • It only works for MC logs.
  • We can create prometheusRules with the label application.giantswarm.io/prometheus-rule-kind: loki; they will be picked up as Loki rules (see the sketch after this list).
  • BUT ... prometheus-operator's validation webhook will very probably deny them until olly-bundle is updated (i.e. v29.x).
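
A minimal sketch of what such a rule could look like (the names, labels and LogQL expression are placeholders, not a final alert):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: irsa-operator-loki-rules
  labels:
    # marks this as a Loki rule for Alloy and keeps it out of promQL validation
    application.giantswarm.io/prometheus-rule-kind: loki
spec:
  groups:
    - name: irsa-operator.logs
      rules:
        - alert: IRSAOperatorACMLimitExceeded
          # LogQL, not promQL: count error lines in the operator logs
          expr: 'sum(count_over_time({namespace="giantswarm", container="irsa-operator"} |= "LimitExceededException" [10m])) > 0'
          for: 15m
          labels:
            severity: notify
            team: phoenix  # placeholder
          annotations:
            description: irsa-operator logs show AWS LimitExceededException errors.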

Next steps

We will probably set loki rules in prometheus-rules, but that's still to be defined.

And in order to do it, I'd love to have a first useful rule. @AverageMarcus do you think you can provide a query? Even better if you can set it as a prometheusRule, with all the extra data: team, business hours, inhibitions, duration before paging...

Anyway, this won't be merged until we have a working v29.x release with the proper olly-bundle included.

@AverageMarcus
Member Author

It only works for MC logs

That's kind of a blocker for us to use this right now. 😞

I'll have a chat with the team later and see if there's anything on the stable-testing MCs that we might benefit from. I suspect we could come up with some stuff, but I'm not sure when we'll get back to you. Hope that's ok.

@hervenicol

hervenicol commented Aug 26, 2024

Ideas for Shield:

For all of these we'd first start with visibility, then define what anomalous looks like and decide on alerts:

  1. API server - kubectl execs - this should be rare, especially in prod environments => sum(count_over_time({scrape_job="audit-logs"} |~ "subresource.: *.exec"| json[1m])) by (objectRef_namespace, objectRef_name)
  2. API server - requests for the content of certain secrets (kubeconfigs, important tokens, Shield canaries, etc.)
  3. API server - webhook timeouts. See https://github.com/giantswarm/giantswarm/issues/31057
  4. workloads - we would have temporary alerts for ops issues (timeouts, we have some known errors in logs, etc.)
  5. SSH connections - who is connected, is it at a weird time or from a weird IP?
  6. API server + Kyverno + Cilium(?) - some customers see intermittent webhook timeouts when kubectl-ing resources, but all the pods are healthy. We haven't been able to nail down under what circumstances the timeouts occur, but theories are either that they're working with huge/problematic resources, or that individual nodes have some underlying issue (like a single node is swamped, so the timeouts are all for requests that are sent to that node). We'd need to first see the API server logging timeouts, and then see the logs of the Kyverno pod the request was sent to, along with potentially other logs from that node. But having some visibility of when the timeouts occur might already show a pattern. Related issue
  7. Risky logs. If we will be storing customer logs on MCs, we open up a lot of potential risk due to what could be in them. We could identify things that shouldn't be there (financial info, secrets, etc.) and help customers sanitize them
  8. API server / workloads - RBAC issues. We could identify if a workload is trying to do something it isn't allowed to do in order to catch RBAC bugs, also as a potential security mechanism.
  9. API server - same as above, but for humans. Are human users trying to do things they don't have RBAC for?
  10. API server - suspicious behaviors aside from kubectl exec. My team could come up with quite a few things that would look very weird.

As lots of these queries are heavy, I think it makes sense to generate metrics from recording rules that we can visualize efficiently afterwards.
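
As an example, the kubectl-exec query above could be turned into a Loki recording rule roughly like this (sketch; the group name and metric name are made up), with the resulting metric remote-written to Mimir by the ruler:

groups:
  - name: shield.audit-logs
    interval: 1m
    rules:
      # pre-computed metric we can graph and alert on cheaply
      - record: shield:kubectl_exec:count1m
        expr: 'sum by (objectRef_namespace, objectRef_name) (count_over_time({scrape_job="audit-logs"} |~ "subresource.: *.exec" | json [1m]))'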

@QuentinBisson

I wanted to try out using only giantswarm as a tenant so we could create rules for all giantswarm components, but it turns out Loki becomes really slow even on grizzly (giantswarm/logging-operator#215). I'll try to figure out why it is so slow.

@QuentinBisson

@hervenicol do you think we should refine this ticket to have clearer next steps? Or do we move it to another epic? cc @Rotfuks

@QuentinBisson

I would think the investigation is done here: we like it, it works, but we are limited by the multi-tenancy for now, so we should probably add this into the alertmanager migration epic instead?

@Rotfuks
Contributor

Rotfuks commented Sep 19, 2024

We can open up a specific epic for the alerts based on Events, adding the implementation, documentation and maybe some out-of-the-box alerts for our internal teams? Wdyt?

@QuentinBisson

I'm fine with anything, I just don't think it makes sense to prevent the "making Loki nicer" epic from being closed :D

@Rotfuks
Contributor

Rotfuks commented Sep 19, 2024

This story will be followed up with the epic around log-based alerting: #3678

Investigation therefore closed.

@Rotfuks Rotfuks closed this as completed Sep 19, 2024
@Rotfuks Rotfuks changed the title Support alerts based on Loki logs Investigation: Support alerts based on Loki logs Sep 23, 2024