Production - [Alerting] Servicing jobs in R&D queues alert #4691

Closed
dotnet-eng-status bot opened this issue Dec 19, 2024 · 7 comments
Labels
  • Active Alert: Issues from Grafana alerts that are now active
  • Critical
  • Grafana Alert: Issues opened by Grafana
  • Ops - First Responder
  • Production: Tied to the Production environment (as opposed to Staging)

Comments

@dotnet-eng-status

💔 Metric state changed to alerting

One or more servicing jobs were executed in an R&D queue. The First Responder (FR) is expected to investigate why the jobs weren't redirected. The most common reasons are:

  • The job was sent to an on-prem queue, that is, a queue with osx, arm64, or perf in its name.
    • We don't have physical hardware for servicing work, so on-prem queues should be excluded from this effort. To fix the alert, update the query and add the queue name to the third line, where the on-prem queues are listed.
  • The job was sent to a queue that doesn't have a corresponding servicing queue.
    • We need to create the missing queue in the helix machines repo.

Next steps:

  1. Go to https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/historical/backend-status?orgId=1&viewPanel=72
  2. Investigate every job in the table and decide whether we need to update the alert to exclude the job or create a servicing queue for it.
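
The triage rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the alert's actual query or any tooling from the helix machines repo: the names `ON_PREM_MARKERS`, `Action`, `is_on_prem`, and `triage` are hypothetical, matching is assumed to be case-insensitive, and the set of queues that already have a servicing counterpart is assumed to be supplied by the caller. The queue names in the usage note are made up.

```python
from __future__ import annotations

from enum import Enum

# Substrings that mark a queue as on-prem, taken from the alert text.
ON_PREM_MARKERS = ("osx", "arm64", "perf")


class Action(Enum):
    """Possible follow-ups for a servicing job that ran in an R&D queue."""

    EXCLUDE_FROM_ALERT = "add the queue to the on-prem list in the alert query"
    CREATE_SERVICING_QUEUE = "create the missing servicing queue"
    INVESTIGATE = "a servicing queue exists; find out why the job was not redirected"


def is_on_prem(queue_name: str) -> bool:
    """Return True if the name contains one of the on-prem markers.

    Case-insensitive matching is an assumption of this sketch.
    """
    name = queue_name.lower()
    return any(marker in name for marker in ON_PREM_MARKERS)


def triage(queue_name: str, queues_with_servicing: set[str]) -> Action:
    """Pick a follow-up action for one queue seen in the alert table.

    `queues_with_servicing` is a hypothetical, caller-supplied set of
    lowercase R&D queue names that already have a servicing counterpart.
    """
    if is_on_prem(queue_name):
        return Action.EXCLUDE_FROM_ALERT
    if queue_name.lower() not in queues_with_servicing:
        return Action.CREATE_SERVICING_QUEUE
    return Action.INVESTIGATE
```

For example, `triage("ubuntu.2204.arm64.open", set())` yields `Action.EXCLUDE_FROM_ALERT` because the name contains `arm64`, while a name with none of the markers and no known servicing counterpart yields `Action.CREATE_SERVICING_QUEUE`.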

For more context go here

  • ServicingJobs 3

Go to rule

@dotnet/dnceng, @dotnet/prodconsvcs, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-5aa74f27ef6445ce9d3d8d3d382e7e35

dotnet-eng-status (bot) added the Active Alert, Critical, Grafana Alert, Ops - First Responder, and Production labels and removed the Active Alert label on Dec 19, 2024

@dotnet-eng-status
Author

💚 Metric state changed to ok


dotnet-eng-status (bot) added the Inactive Alert label and removed the Inactive Alert label on Dec 22, 2024

@dotnet-eng-status
Author

💔 Metric state changed to alerting

  • ServicingJobs 3

dotnet-eng-status (bot) added the Active Alert and Inactive Alert labels and removed the Active Alert label on Dec 23, 2024

@dotnet-eng-status
Author

💚 Metric state changed to ok


dotnet-eng-status (bot) removed the Inactive Alert label on Dec 28, 2024

@dotnet-eng-status
Author

💔 Metric state changed to alerting

  • ServicingJobs 3

dotnet-eng-status (bot) added the Active Alert and Inactive Alert labels and removed the Active Alert label on Dec 28, 2024

@dotnet-eng-status
Author

💚 Metric state changed to ok


dotnet-eng-status (bot) added the Active Alert label and removed the Inactive Alert label on Dec 29, 2024

@dotnet-eng-status
Author

💔 Metric state changed to alerting

  • ServicingJobs 2

@ilyas1974
Contributor

I think this is a "one off" as we have not updated VS in a while. Once #4609 is merged, if we see this issue again, we can investigate further.
