[Alerts] Determine how to ignore common errors #625

Open
alangsto opened this issue May 6, 2024 · 8 comments

@alangsto
Member

alangsto commented May 6, 2024

It is possible to ignore errors in New Relic, which 2U does for certain types of errors. See https://2u-internal.atlassian.net/wiki/spaces/AT/pages/16385812/Ignored+and+Expected+Errors+for+LMS+in+New+Relic.

Does Datadog have similar functionality for ignoring errors? I found https://docs.datadoghq.com/logs/error_tracking/excluding_logs/ but am not sure whether it is a 1:1 replacement for what New Relic offers.

An example error we'd like to ignore is:

  • rest_framework.exceptions.Throttled

AC:

  • Create a discovery doc for how we can ignore errors in DD
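
For illustration only, one option the discovery doc could evaluate is dropping the error flag on the client side before spans are sent. The sketch below is not tied to any tracer API: `FakeSpan`, `IgnoreErrorsFilter`, and `IGNORED_ERROR_TYPES` are hypothetical names, and it assumes spans expose an integer `error` flag and a `get_tag("error.type")` lookup. How such a filter would be registered with the tracer, and whether clearing the flag is enough to keep the error out of DD monitors, are open questions.

```python
from dataclasses import dataclass, field

# Hypothetical ignore list; Throttled is the example error from this ticket.
IGNORED_ERROR_TYPES = {"rest_framework.exceptions.Throttled"}


@dataclass
class FakeSpan:
    """Stand-in for a tracer span so the sketch runs on its own (assumption)."""
    name: str
    error: int = 0
    tags: dict = field(default_factory=dict)

    def get_tag(self, key):
        return self.tags.get(key)


class IgnoreErrorsFilter:
    """Clears the error flag on spans whose error.type is in the ignore list."""

    def __init__(self, ignored_types=IGNORED_ERROR_TYPES):
        self.ignored_types = set(ignored_types)

    def process_trace(self, trace):
        # Walk every span in the trace; leave non-ignored errors untouched.
        for span in trace:
            if span.error and span.get_tag("error.type") in self.ignored_types:
                span.error = 0
        return trace


trace = [
    FakeSpan("django.request", error=1,
             tags={"error.type": "rest_framework.exceptions.Throttled"}),
    FakeSpan("django.request", error=1,
             tags={"error.type": "builtins.ValueError"}),
]
filtered = IgnoreErrorsFilter().process_trace(trace)
print([span.error for span in filtered])
```

Only the throttled span loses its error flag; the `ValueError` span still counts as an error.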
@robrap
Contributor

robrap commented May 18, 2024

  1. @alangsto also found https://docs.datadoghq.com/logs/error_tracking/manage_data_collection#add-a-nested-exclusion-filter-to-a-rule.
  2. Is this a ticket we need to look into sooner rather than later so that alert thresholds don't all get messed up, or is DD already not picking up 404, 401, and 403, which were a large part of our ignored errors?

@robrap
Contributor

robrap commented May 20, 2024

@alangsto: Have you found that this is just a non-issue for now? If so, we can update the epic to "Datadog Migration Future" and review one more time when we are done.

UPDATE: I moved the rest of this comment for discovery of other DD error monitor types to a new ticket: #651.

@alangsto
Member Author

@robrap I have not run into this issue yet, but that is based only on my work setting up the Cosmonauts monitors, which are mostly straightforward. Other teams may run into it, but it's difficult to know without investigating every alert condition in New Relic. I did add a small section to https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1008500757/How+to+migrate+from+New+Relic+to+Datadog#Migrating-a-NRQL-based-alert on how to filter by specific messages for a trace analytics monitor.

I have not investigated the two types of monitors you listed. Should we investigate them so we can give teams more information?
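
As a rough sketch of the filtering mentioned above for a trace analytics monitor, the search part of the query can exclude specific error types or messages with negated terms. The helper name `build_error_query`, the service name `lms`, and the facet names `@error.type` and `@error.message` are assumptions here and should be checked against the facets actually available in our DD account.

```python
def build_error_query(service, env, ignored_types=(), ignored_messages=()):
    """Build a span search string that counts errors but skips ignored ones.

    Facet names (@error.type, @error.message) are assumptions, not confirmed.
    """
    terms = [f"env:{env}", f"service:{service}", "status:error"]
    # A leading "-" negates a search term, excluding matching spans.
    terms += [f"-@error.type:{t}" for t in ignored_types]
    terms += [f'-@error.message:"{m}"' for m in ignored_messages]
    return " ".join(terms)


query = build_error_query(
    service="lms",  # hypothetical service name
    env="prod",
    ignored_types=["rest_framework.exceptions.Throttled"],
)
print(query)
```

Keeping the ignore list in one place like this would let several monitors share the same exclusions, which is closer to how the New Relic ignored-errors list worked.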

@robrap
Contributor

robrap commented May 21, 2024

@alangsto: Thank you. I moved this to the Future epic. We'll see if it comes up once the DRF error reporting in DD is fixed in #647.

Also, as noted above, I moved the other discovery into a new ticket which is also under the Future epic, and we'll see when and if anyone is interested in exploring those features.

@dianakhuang
Member

I believe #738 is a duplicate of this ticket.

@robrap
Contributor

robrap commented Aug 6, 2024

@dianakhuang: What do you think of closing this ticket in favor of your new ticket, which at least provides a clear example of an error we wish to ignore? Is there anything else from this ticket you'd like to bring in?

@dianakhuang
Member

I think I would rather keep this ticket and move over the example. This ticket has a lot more info than mine does.

@robrap
Contributor

robrap commented Aug 15, 2024

We need more details about the specific alerts that are triggering based on errors that we wish to ignore, so we have a specific case to fix. For now, marking this as P5, and may close (temporarily) until we have that information.

Projects
Status: Backlog

3 participants