Investigation: Support alerts based on Loki logs #3178
Comments
Needs to be documented as well: Basic recording rules
Exploration
Docs:
Tools:
Questions/remarks:
How to load Loki rules: we can use the "alloy-rules" component that currently loads prometheus-rules to Mimir. Caveat: Loki rule expressions are LogQL rather than PromQL, so we have to disable the prometheus-operator validation webhook for Loki rules.
I have tested the namespace selector successfully with this alloy config (
And this addition to
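To make the caveat concrete, here is a minimal, hypothetical sketch (not the actual config referenced above) of a PrometheusRule-shaped object carrying a LogQL expression; the prometheus-operator admission webhook parses `expr` as PromQL and would reject it, which is why validation has to be disabled for these objects. Names and labels are illustrative only.

```yaml
# Hypothetical example: a PrometheusRule-shaped object whose expr is LogQL.
# The prometheus-operator validation webhook would reject this expression as
# invalid PromQL, hence the need to disable validation for Loki rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-loki-alerts        # hypothetical name
  namespace: monitoring            # hypothetical namespace
  labels:
    role: loki-rules               # hypothetical label a rule selector could match on
spec:
  groups:
    - name: example-logs
      rules:
        - alert: ExampleLogErrorRate
          # LogQL, not PromQL: counts error lines in the giantswarm namespace
          expr: sum(count_over_time({namespace="giantswarm"} |= "error" [5m])) > 100
          for: 10m
          labels:
            severity: notify
          annotations:
            description: Example log-based alert loaded into the Loki ruler.
```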
Have you tried the loki backend sidecar to load rules?
No, because I was quite happy that Alloy could do it:
Sure. I'm wondering if we could consider options like: https://loki-operator.dev/docs/ruler_support.md/
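For context, the loki-operator linked above manages dedicated Loki rule CRDs instead of PrometheusRules. A rough sketch of what that looks like, with field names as in its AlertingRule CRD and purely illustrative tenant and query values:

```yaml
# Rough sketch of the loki-operator approach: a dedicated AlertingRule CR with
# a LogQL expression and an explicit tenant. Values are illustrative only.
apiVersion: loki.grafana.com/v1
kind: AlertingRule
metadata:
  name: example-alerting-rule
  namespace: giantswarm
spec:
  tenantID: giantswarm
  groups:
    - name: example
      interval: 1m
      rules:
        - alert: ExampleHighErrorRate
          # LogQL: per-second rate of error lines in kube-system
          expr: sum(rate({namespace="kube-system"} |= "error" [5m])) > 1
          for: 10m
          labels:
            severity: notify
```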
Actually, I'm even more confused now :D
We already have a complete pipeline for PrometheusRules that we are used to, why should we look for a different solution? Loki Operator could be useful for dynamic alertmanager or remotewrite configs, but I don't think we need that.
So, coming from this comment, I thought we were still somewhat in the investigation phase and not the decision phase, so I'm asking if there are other solutions out there that might fit better than our current flow. To be honest, I'm not really happy about the validation hack we need to do to be able to use Loki alerting, for a few reasons:
I'm definitely not saying this is not the solution we will end up using (maybe as a temporary solution, maybe it's complete, who knows), I really wanted to know what's out there before moving on to enabling this :) Now, for a future possible iteration, would it make sense to ask some Grafana people (maybe at community meetings) whether they would like to sponsor or partner on some kind of Rules CRD in, say, Alloy, that could be of type LogQL, TraceQL, PromQL and so on? I think that would be a cool topic
Now you're saying what you don't like with my proposal and hope to improve with another solution, and I like that :)
I can't see what makes you think it wouldn't work in the future. But that's a point we could work on. The rules loader sidecar probably does not rely on these custom resources, that's true.
It obviously means some changes in the And these directories would have a different validation, based on But we could have a separate repo for lokiRules if we have reasons to do that. This is a matter of how we package the rules, I was planning on doing it as a next step.
I sure hope that if Grafana Labs chose that path for Mimir and Loki rules, they would be consistent and add a similar feature in Alloy for Tempo rules.
I think that's more related to the ruler than to how we load rules, right?
I agree with the "it's not real prometheusRules so we shouldn't mix these". lokiOperator won't solve it, but I can try loading rules from ConfigMaps into Loki with the sidecar loader.
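A minimal sketch of what that could look like, assuming the Loki chart's rules sidecar is configured to watch ConfigMaps carrying a `loki_rule` label (the exact label and rules folder are chart settings, not verified here); the rule group and query are illustrative only:

```yaml
# Hypothetical ConfigMap the Loki rules sidecar could pick up and write into
# the ruler's local rules directory. Label, names and query are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-loki-rules
  namespace: loki
  labels:
    loki_rule: "true"   # assumed sidecar selector label
data:
  example-rules.yaml: |
    groups:
      - name: example
        rules:
          - alert: ExampleLogErrors
            expr: sum(count_over_time({namespace="kube-system"} |= "error" [10m])) > 500
            for: 15m
            labels:
              severity: notify
```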
That could be a thing for the future, but for now the only thing that would change would be how to validate the
Thanks for going through that long one :D
On that note, if we are not using it, we probably should disable the sidecar to save on a bit of resources :D
This is a good idea but I'm not sure if a directory makes it easier. What would you think about using a suffix for rules instead, the same way we do it today with test files? So we could have something like this.
But that's a different topic, because maybe we want to have all alerts for a component in one place. We can discuss such things later :D I trust your expertise will find something
You're right yes
Actually, no, because the Mimir ruler supports federated rule groups, which the PrometheusRules solution would not allow us to use: https://grafana.com/docs/mimir/latest/references/architecture/components/optional-grafana-mimir-ruler/#federated-rule-groups. Also, the possibility to set a tenant or sourceTenants on the alert is not there in PrometheusRules, but this is an issue Grafana has and not just us, so this is quite a light argument :)
Regarding the last point, I would think we could create a new set of CRs (maybe added to Alloy with Grafana) that would for sure have a different validation, but would also have a tenant field and a sourceTenants one. Also, I quite like the distinction in the loki-operator between alerting and recording rules, because they don't have the same expectations (I don't need labels on recording rules, for instance), but that is, as you said, something to work on later
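For reference, this is roughly what a federated rule group looks like in the Mimir ruler's rule-file format per the doc linked above; the per-group `source_tenants` field is the part that has no equivalent in the PrometheusRule CRD. Group, tenant and metric names are illustrative.

```yaml
# Sketch of a Mimir federated rule group (ruler rule-file format, not a
# Kubernetes object). source_tenants has no PrometheusRule equivalent.
groups:
  - name: example-federated-group
    source_tenants: [teamA, teamB]   # the query is evaluated across these tenants
    rules:
      - record: cluster:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m]))
```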
So, I understand better why you think using prometheusRules does not cover all use cases with Mimir and Loki. I also feel like loading ConfigMaps with the sidecar requires local rules storage, and may require some tweaks/hacks for multi-tenancy as well. Meaning, as-is, it does not work and we should disable it to save a few resources.
Yes, I agree that it's the best solution short term, and that was a really good exploration on your part :)
Current status
Next steps
We will probably set loki rules in prometheus-rules, but that's still to be defined. And in order to do it, I'd love to have a first useful rule. @AverageMarcus do you think you can provide a query? Even better if you can set it as a
Anyway, this won't be merged until we have a working v29.x release with the proper olly-bundle included.
That's kind of a blocker for us to use this right now. 😞 I'll have a chat with the team later and see if there's anything on the stable-testing MCs that we might benefit from. I suspect we could come up with some stuff, but I'm not sure when we'll get back to you. Hope that's ok.
Ideas for Shield: for all of these we'd first start with visibility, then define what anomalous looks like and decide on alerts, but:
As lots of these queries are heavy, I think it makes sense to generate metrics from recording rules that we can visualize efficiently afterwards.
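A minimal sketch of that idea, assuming the Loki ruler evaluates recording rules and remote-writes the resulting series to a metrics backend; the selector, labels and rule name below are made up for illustration:

```yaml
# Hypothetical Loki recording rule: pre-computes a heavy log query into a
# metric series that can be graphed cheaply afterwards. Names are illustrative.
groups:
  - name: shield-visibility
    interval: 1m
    rules:
      - record: namespace:denied_requests:rate5m
        # per-second rate of "denied" log lines, grouped by namespace
        expr: sum by (namespace) (rate({scrape_job="audit"} |= "denied" [5m]))
```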
I wanted to try out using only giantswarm as a tenant so we could create rules for all giantswarm components, but it turns out Loki becomes really slow even on grizzly: giantswarm/logging-operator#215. I'll try to figure out why it is so slow
@hervenicol do you think we should refine this ticket to have clearer next steps? Or do we move it to another epic ? cc @Rotfuks |
I would think the investigation is done here, we like it, it works, but we are limited by the multi-tenancy for now so we should probably add this into the alertmanager migration epic instead? |
We can open up a specific epic for the alerts based on Events, adding the implementation, documentation and maybe some out-of-the-box alerts for our internal teams? Wdyt? |
I'm fine with anything, I just don't think it makes sense to prevent the "making Loki nicer" epic from being closed :D
This story will be followed up with the epic around log-based alerting: #3678. Investigation therefore closed.
Now that we have Loki logs from `kube-system` and `giantswarm` namespaces available in our MCs, it'd be nice to be able to configure alerts based on log output.

For example:
We've recently hit an issue in our testing environment where we've reached an AWS quota limit which now prevents us from creating any new ACM certificates. This quota doesn't have any way to view current usage in AWS (that I've been able to find), so we have no way to monitor our usage towards this limit. The only way we currently know that we've hit it is that our components (irsa-operator) fail to create an ACM certificate and produce an error in the logs. This doesn't bubble up to any metrics, so we currently can't be alerted to this problem happening.
It would be really useful if we could craft an alert that queried for `LimitExceededException` or similar in our operator logs and alerted us to this.
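For illustration, a hypothetical LogQL alert for that case could look like the rule group below; the label selector for irsa-operator logs and the evaluation window are assumptions, not verified against the actual log labels:

```yaml
# Hypothetical alert for the ACM quota example above. The stream selector
# (namespace/container labels) is assumed and may differ in practice.
groups:
  - name: aws-quota-examples
    rules:
      - alert: ACMCertificateLimitExceeded
        # fires if any LimitExceededException lines appear in the last 15 minutes
        expr: sum(count_over_time({namespace="giantswarm", container="irsa-operator"} |= "LimitExceededException" [15m])) > 0
        labels:
          severity: notify
        annotations:
          description: irsa-operator logged LimitExceededException while creating ACM certificates.
```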