Investigation: Support alerts based on Loki logs #3178

Closed
Tracked by #3362
AverageMarcus opened this issue Jan 23, 2024 · 23 comments

@AverageMarcus
Member

AverageMarcus commented Jan 23, 2024

Now that we have Loki logs from kube-system and giantswarm namespaces available in our MCs it'd be nice to be able to configure alerts based on log output.

For example:

We've recently hit an AWS quota limit in our testing environment which now prevents us from creating any new ACM certificates. This quota doesn't have any way to view current usage in AWS (that I've been able to find), so we have no way to monitor our usage against this limit. The only way we currently know that we've hit it is that our components (irsa-operator) fail to create an ACM certificate and produce an error in the logs. This doesn't bubble up to any metrics, so we currently can't be alerted to this problem happening.
It would be really useful if we could craft an alert that queried for LimitExceededException or similar in our operator logs and alerted us to this.
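
Something along these lines is what I have in mind (just a sketch; the namespace and container label values are guesses and would need checking against the log labels we actually ship):

sum(count_over_time({namespace="giantswarm", container="irsa-operator"} |= "LimitExceededException" [5m])) > 0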

@github-project-automation github-project-automation bot moved this to Inbox 📥 in Roadmap Jan 23, 2024
@QuentinBisson QuentinBisson added the needs/refinement Needs refinement in order to be actionable label Feb 6, 2024
@QuentinBisson

Needs to be documented as well:

  • Basic recording rules
  • Basic alerting

@Rotfuks Rotfuks removed this from Roadmap Mar 27, 2024
@Rotfuks Rotfuks removed the needs/refinement Needs refinement in order to be actionable label May 8, 2024
@hervenicol hervenicol self-assigned this Jul 1, 2024
@architectbot architectbot added the team/atlas Team Atlas label Jul 1, 2024
@hervenicol

hervenicol commented Jul 1, 2024

Exploration

Docs:

Tools:

Questions/remarks:

  • we may need to remote-write metrics produced by recording rules to mimir/prometheus
  • alertmanager config to be set up (https://grafana.com/docs/loki/latest/configure/#ruler as alertmanager_client); see the sketch after this list
    • how do we inject this loki config? Do we need an operator?
  • how to load rules?
    • one-shot with cortextool
    • how to automate this? Is there an operator that does it?
    • look for loki-sc-rules pod in loki-backend.
    • has to be done for each tenant
  • multi-tenancy (and tenants aggregation on read path) prevents reading rules from grafana.
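
As a starting point, the ruler section of the Loki config could look roughly like this (untested sketch; service names and ports are placeholders, and how to inject this config is the open question above):

ruler:
  # where the ruler sends firing alerts (alertmanager_client settings would go alongside this)
  alertmanager_url: http://alertmanager-operated.monitoring.svc:9093
  enable_alertmanager_v2: true
  # remote-write metrics produced by recording rules to Mimir
  remote_write:
    enabled: true
    client:
      url: http://mimir-distributor.mimir.svc:8080/api/v1/push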

@hervenicol

  • alertmanager config done ✔
  • remotewrite config done for mimir ✔
  • load rules ongoing ⏲
  • access to rules from grafana broken for now because of tenants aggregation ❗

@hervenicol

How to load Loki rules:

We can use the "alloy-rules" component that currently loads prometheus-rules to Mimir.
It watches for prometheusRules custom resources and can load them to Loki as well.

Caveat:
prometheus-operator checks that the rules use valid promQL syntax, and loki rules, although they are valid prometheusRules custom resources, use logQL syntax, which is slightly different.

So we have to disable the prometheus-operator validation webhook for loki rules.
We have to differentiate loki rules from mimir rules anyway for Alloy.
We could either:

  • use a namespaceSelector, so that rules in the loki namespace are Loki rules
  • use a labelSelector, so we could tag loki rules with a specific label like application.giantswarm.io/prometheusRule/kind: loki

I have tested the namespace selector successfully with this alloy config ($ k get cm alloy-rules -n mimir -oyaml | yq '.data."config.alloy"'):

mimir.rules.kubernetes "local" {
    address = "http://mimir-ruler.mimir.svc:8080/"
    tenant_id = "anonymous"
    rule_namespace_selector {
        match_expression {
          key = "kubernetes.io/metadata.name"
          operator = "NotIn"
          values = ["loki"]
        }
    }
}
loki.rules.kubernetes "local" {
    address = "http://loki-backend.loki.svc:3100"
    tenant_id = "golem"

    rule_namespace_selector {
        match_expression {
          key = "kubernetes.io/metadata.name"
          operator = "In"
          values = ["loki"]
        }
    }
}

And this addition to kube-prometheus-stack user-values:

k get cm -n org-giantswarm  golem-kube-prometheus-stack-user-values -oyaml | yq '.data.values' | yq '.kube-prometheus-stack.prometheusOperator.denyNamespaces'
- loki
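
For comparison, the labelSelector option mentioned above would presumably look something like this in the alloy config (untested sketch; it assumes the rule_selector block of loki.rules.kubernetes and the label we would put on loki rules):

loki.rules.kubernetes "local" {
    address   = "http://loki-backend.loki.svc:3100"
    tenant_id = "golem"

    // only pick up prometheusRules explicitly tagged as loki rules
    rule_selector {
        match_labels = {
            "application.giantswarm.io/prometheus-rule-kind" = "loki"
        }
    }
}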

@QuentinBisson

Have you tried the loki backend sidecar to load rules?

@hervenicol

Have you tried the loki backend sidecar to load rules?

No, because I was quite happy that Alloy could do it:
I like the idea that we have the same management of rules regardless of the ruler.
And also that loki and mimir rules could depend on each other and have similar CI. So in the end they could both be in the prometheus-rules repo; as they are distributed as prometheusRules CRs, that looks fine to me.

@QuentinBisson

Sure. I'm wondering if we could consider options like: https://loki-operator.dev/docs/ruler_support.md/

@QuentinBisson

Actually, I'm even more confused now :D

@hervenicol

We already have a complete pipeline for PrometheusRules that we are used to; why should we look for a different solution?

Loki Operator could be useful for dynamic alertmanager or remotewrite configs, but I don't think we need that.

@QuentinBisson

QuentinBisson commented Jul 10, 2024

So, coming from this comment, I thought we were still somewhat in the investigation phase and not the decision phase, so I'm asking: are there other solutions out there that might fit better than our current flow?

To be honest, I'm not really happy about the validation hack we need to do to be able to use Loki alerting, for a few reasons:

  1. It's hacky at best, we don't know if this will keep working in the future, and it's not per se a PrometheusRule; I'm a fan of good semantics :)
  2. What does it mean for validation in the prometheus-rules repo? Because I would assume pint does not support them, so we will have to create some hack there?
  3. Will that scale when we run tempo (there are talks about adding a ruler with TraceQL)?
  4. We already have that limitation (because alloy does not support it when loading rules), but that means we cannot use tenant federations for now, and we should be okay with that :)

I'm definitely not saying this is not the solution we will end up using, maybe as a temporary solution, maybe it's complete, who knows, I really wanted to know what's out there before moving on to enabling this :)

Now, for a future possible iteration, would it make sense to ask some Grafana people (maybe at community meetings) if they would like to sponsor or partner on some kind of Rules CRD in, say, Alloy, that could be of type LogQL, TraceQL, PromQL and so on? I think that would be a cool topic.

@hervenicol

Now you're saying what you don't like about my proposal and hope to improve with another solution, and I like that :)

To be honest, I'm not really happy about the validation hack we need to do to be able to use Loki alerting, for a few reasons:

  1. It's hacky at best, we don't know if this will keep working in the future, and it's not per se a PrometheusRule; I'm a fan of good semantics :)

I can't see what makes you think it wouldn't work in the future. But that's a point we could work on.
It's not "per se" prometheusRules, but that's how GrafanaLabs decided it should be managed whether we do it via lokiOperator or Alloy.

The rules loader sidecar probably does not rely on these custom resources, that's true.

  2. What does it mean for validation in the prometheus-rules repo? Because I would assume pint does not support them, so we will have to create some hack there?

It obviously means some changes in the prometheus-rules repo.
I was thinking about having some new directories like helm/prometheus-rules/templates/platform/atlas/loki-alerting-rules/ and helm/prometheus-rules/templates/platform/atlas/loki-recording-rules/

And these directories would have a different validation, based on cortextool (or the newly released lokitool), which properly checks loki rules and logql syntax.
That's not as powerful as pint, but that's all we can do for loki rules anyway.
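
For instance, CI could run something along these lines against the rendered rule files (a sketch; the directory paths are hypothetical, and it assumes the --backend=loki option that cortextool's rules subcommands provide):

# lint LogQL rule files with cortextool (the newer lokitool should work the same way)
cortextool rules lint --backend=loki rendered/loki-alerting-rules/*.yaml rendered/loki-recording-rules/*.yaml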

But we could have a separate repo for lokiRules if we have reasons to do that. This is a matter of how we package the rules, I was planning on doing it as a next step.

  3. Will that scale when we run tempo (there are talks about adding a ruler with TraceQL)?

I sure hope that, since GrafanaLabs chose that path for mimir and loki rules, they will be consistent and have a similar feature in Alloy for tempo rules.

  4. We already have that limitation (because alloy does not support it when loading rules), but that means we cannot use tenant federations for now, and we should be okay with that :)

I think that's more related to the ruler than to how we load rules, right?

I'm definitely not saying this is not the solution we will end up using, maybe as a temporary solution, maybe it's complete, who knows, I really wanted to know what's out there before moving on to enabling this :)

I agree with the "it's not real prometheusRules so we shouldn't mix these" point. lokiOperator won't solve it, but I can try loading rules from configMaps into loki with the sidecar loader.

Now, for a future possible iteration, would it make sense to ask some Grafana people (maybe at community meetings) if they would like to sponsor or partner on some kind of Rules CRD in, say, Alloy, that could be of type LogQL, TraceQL, PromQL and so on? I think that would be a cool topic.

That could be a thing for the future, but for now the only thing that would change would be how to validate the expr part of the rule.

@QuentinBisson

Thanks for going through that long one :D

Now you're saying what you don't like about my proposal and hope to improve with another solution, and I like that :)

To be honest, I'm not really happy about the validation hack we need to do to be able to use Loki alerting, for a few reasons:

  1. It's hacky at best, we don't know if this will keep working in the future, and it's not per se a PrometheusRule; I'm a fan of good semantics :)

I can't see what makes you think it wouldn't work in the future. But that's a point we could work on. It's not "per se" prometheusRules, but that's how GrafanaLabs decided it should be managed whether we do it via lokiOperator or Alloy.

The rules loader sidecar probably does not rely on these custom resources, that's true.

On that note, if we are not using it, we probably should disable the sidecar to save on a bit of resources :D
But you're right, I would hope they would add something else and not drop support for it :)

  2. What does it mean for validation in the prometheus-rules repo? Because I would assume pint does not support them, so we will have to create some hack there?

It obviously means some changes in the prometheus-rules repo. I was thinking about having some new directories like helm/prometheus-rules/templates/platform/atlas/loki-alerting-rules/ and helm/prometheus-rules/templates/platform/atlas/loki-recording-rules/

And these directories would have a different validation, based on cortextool (or the newly released lokitool), which properly checks loki rules and logql syntax. That's not as powerful as pint, but that's all we can do for loki rules anyway.

But we could have a separate repo for lokiRules if we have reasons to do that. This is a matter of how we package the rules, I was planning on doing it as a next step.

This is a good idea, but I'm not sure a directory makes it easier. What would you think about using a suffix for rules instead, the same way we do it today with test files?

So we could have something like this.

atlas/
  alerting-rules/
    alert.metrics.yaml
    alertLoki.logs.yaml
    alert.slo.yaml <- eventually

But that's a different topic, because maybe we want to have all alerts for a component in one place. We can discuss such things later :D I trust your expertise will find something.

  3. Will that scale when we run tempo (there are talks about adding a ruler with TraceQL)?

I sure hope that, since GrafanaLabs chose that path for mimir and loki rules, they will be consistent and have a similar feature in Alloy for tempo rules.

You're right, yes.

  4. We already have that limitation (because alloy does not support it when loading rules), but that means we cannot use tenant federations for now, and we should be okay with that :)

I think that's more related to the ruler than to how we load rules, right?

Actually, no: the Mimir ruler supports federated rule groups (https://grafana.com/docs/mimir/latest/references/architecture/components/optional-grafana-mimir-ruler/#federated-rule-groups), which the PrometheusRules solution would not allow us to use, and the possibility to set a tenant on the alert, or sourceTenants, is not there in PrometheusRules. But this is an issue Grafana has and not just us, so this is quite a light argument :)
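
For illustration, a federated rule group for the Mimir ruler looks roughly like this (sketch; the tenant, group and rule names are invented), and there is no field in PrometheusRules today to carry source_tenants:

namespace: example
groups:
  - name: federated-example
    # evaluated against the union of these tenants' data
    source_tenants: [tenant-a, tenant-b]
    rules:
      - record: example:up:sum
        expr: sum(up)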

I'm definitely not saying this is not the solution we will end up using, maybe as a temporary solution, maybe it's complete, who knows, I really wanted to know what's out there before moving on to enabling this :)

I agree with the "it's not real prometheusRules so we shouldn't mix these" point. lokiOperator won't solve it, but I can try loading rules from configMaps into loki with the sidecar loader.

Now, for a future possible iteration, would it make sense to ask some Grafana people (maybe at community meetings) if they would like to sponsor or partner on some kind of Rules CRD in, say, Alloy, that could be of type LogQL, TraceQL, PromQL and so on? I think that would be a cool topic.

That could be a thing for the future, but for now the only thing that would change would be how to validate the expr part of the rule.

Regarding the last point, I would think we could create a new set of CRs (maybe added to alloy with Grafana) that would for sure have a different validation, but also have a tenant field and a supportTenants one. Also, I quite like the distinction in the loki operator between alerting and recording rules, because they don't have the same expectations (I don't need labels on recording rules, for instance), but that is, as you said, something to work on later.

@hervenicol

So, I understand better why you think using prometheusRules does not cover all use cases with Mimir and Loki.
But do you agree that the plan I presented (a labelSelector would be best IMO) is a proper solution for now, until we understand what we want to do with multi-tenancy?

I also feel like loading Configmaps with the sidecar requires local rules storage, and may require some tweaks/hacks for multi-tenancy as well. Meaning, as-is, it does not work and we should disable it to save a few resources.

@QuentinBisson

QuentinBisson commented Jul 11, 2024

Yes I agree that it's the best solution short term and that was a really good exploration on your part :)

@hervenicol

Current status

  • For now, loki rules will only work on Mimir-enabled clusters (i.e. CAPI).
  • It only works for MC logs.
  • We can create prometheusRules with the label application.giantswarm.io/prometheus-rule-kind: loki; they will be picked up as Loki rules (see the sketch after this list).
  • BUT ... prometheus-operator's validation webhook will very probably deny them until olly-bundle is updated (i.e. v29.x).
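
A minimal sketch of what such a rule could look like (the names, labels and LogQL expression are placeholders, not a final alert):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: irsa-operator-loki-rules
  labels:
    # marks this as a Loki rule for Alloy and keeps it out of promQL validation
    application.giantswarm.io/prometheus-rule-kind: loki
spec:
  groups:
    - name: irsa-operator.logs
      rules:
        - alert: IRSAOperatorACMLimitExceeded
          # LogQL, not promQL: count error lines in the operator logs
          expr: 'sum(count_over_time({namespace="giantswarm", container="irsa-operator"} |= "LimitExceededException" [10m])) > 0'
          for: 15m
          labels:
            severity: notify
            team: phoenix  # placeholder
          annotations:
            description: irsa-operator logs show AWS LimitExceededException errors.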

Next steps

We will probably set loki rules in prometheus-rules, but that's still to be defined.

And in order to do it, I'd love to have a first useful rule. @AverageMarcus do you think you can provide a query? Even better if you can set it as a prometheusRule, with all the extra data: team, business hours, inhibitions, duration before paging...

Anyway, this won't be merged until we have a working v29.x release with the proper olly-bundle included.

@AverageMarcus
Member Author

It only works for MC logs

That's kind of a blocker for us to use this right now. 😞

I'll have a chat with the team later and see if there's anything on the stable-testing MCs that we might benefit from. I suspect we could come up with some stuff, but I'm not sure when we'll get back to you. Hope that's ok.

@hervenicol

hervenicol commented Aug 26, 2024

Ideas for Shield:

For all of these we'd first start with visibility, then define what anomalous looks like and decide on alerts:

  1. API server - kubectl execs - this should be rare, especially in prod environments => sum(count_over_time({scrape_job="audit-logs"} |~ "subresource.: *.exec"| json[1m])) by (objectRef_namespace, objectRef_name)
  2. API server - requests for the content of certain secrets (kubeconfigs, important tokens, Shield canaries, etc.)
  3. API server - webhook timeouts. See https://github.com/giantswarm/giantswarm/issues/31057
  4. workloads - we would have temporary alerts for ops issues (timeouts, we have some known errors in logs, etc.)
  5. SSH connections - who is connected, is it at a weird time or from a weird IP?
  6. API server + Kyverno + Cilium(?) - some customers see intermittent webhook timeouts when kubectl-ing resources, but all the pods are healthy. We haven't been able to nail down under what circumstances the timeouts occur, but theories are either that they're working with huge/problematic resources, or that individual nodes have some underlying issue (like a single node is swamped, so the timeouts are all for requests that are sent to that node). We'd need to first see the API server logging timeouts, and then see the logs of the Kyverno pod the request was sent to, along with potentially other logs from that node. But having some visibility of when the timeouts occur might already show a pattern. Related issue
  7. Risky logs. If we will be storing customer logs on MCs, we open up a lot of potential risk due to what could be in them. We could identify things that shouldn't be there (financial info, secrets, etc.) and help customers sanitize them
  8. API server / workloads - RBAC issues. We could identify if a workload is trying to do something it isn't allowed to do in order to catch RBAC bugs, also as a potential security mechanism.
  9. API server - same as above, but for humans. Are human users trying to do things they don't have RBAC for?
  10. API server - suspicious behaviors aside from kubectl exec. My team could come up with quite a few things that would look very weird.

As lots of these queries are heavy, I think it makes sense to generate metrics from recording rules that we can visualize efficiently afterwards.
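
As an example, the kubectl-exec query above could be turned into a Loki recording rule roughly like this (sketch; the group name and metric name are made up), with the resulting metric remote-written to Mimir by the ruler:

groups:
  - name: shield.audit-logs
    interval: 1m
    rules:
      # pre-computed metric we can graph and alert on cheaply
      - record: shield:kubectl_exec:count1m
        expr: 'sum by (objectRef_namespace, objectRef_name) (count_over_time({scrape_job="audit-logs"} |~ "subresource.: *.exec" | json [1m]))'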

@QuentinBisson

I wanted to try out using only giantswarm as a tenant so we could create rules for all giantswarm components, but it turns out Loki becomes really slow even on grizzly (giantswarm/logging-operator#215). I'll try to figure out why it is so slow.

@QuentinBisson

@hervenicol do you think we should refine this ticket to have clearer next steps? Or do we move it to another epic? cc @Rotfuks

@QuentinBisson

I would think the investigation is done here: we like it, it works, but we are limited by the multi-tenancy for now, so we should probably add this into the alertmanager migration epic instead?

@Rotfuks
Contributor

Rotfuks commented Sep 19, 2024

We can open up a specific epic for the alerts based on Events, adding the implementation, documentation and maybe some out-of-the-box alerts for our internal teams? Wdyt?

@QuentinBisson

I'm fine with anything, I just don't think it makes sense to prevent the "making Loki nicer" epic from being closed :D

@Rotfuks
Contributor

Rotfuks commented Sep 19, 2024

This story will be followed up with the epic around log-based alerting: #3678

Investigation therefore closed.

@Rotfuks Rotfuks closed this as completed Sep 19, 2024
@Rotfuks Rotfuks changed the title Support alerts based on Loki logs Investigation: Support alerts based on Loki logs Sep 23, 2024