Production - [Alerting] Servicing jobs in R&D queues alert #4691

dotnet-eng-status · 2024-12-19T02:01:41Z

💔 Metric state changed to alerting

One or more servicing jobs were executed in an R&D queue. The expectation is that FR investigates why the jobs weren't redirected. The most common reasons are:

The job was sent to an on-prem queue. An on-prem queue is one that has osx, arm64, or perf in its name.

We don't have physical hardware for servicing work, so on-prem queues should be excluded from this effort. To fix the alert, we need to update the query and add the queue name to the third line, where the on-prem queues are listed.

The job was sent to a queue that doesn't have a corresponding servicing queue.

We need to create the missing queue in the helix machines repo.

Next steps:

Go to https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/historical/backend-status?orgId=1&viewPanel=72

Investigate every job in the table and decide whether we need to update the alert to exclude the job or create a servicing queue for it.
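The triage described above can be sketched in a few lines of Python. This is only an illustration, not the alert's actual query: the function names, the returned action strings, the case-insensitive matching, and the idea of passing in a set of R&D queue names that already have a servicing counterpart are all assumptions made for the example. Only the three substrings (osx, arm64, perf) come from the alert text.

```python
from __future__ import annotations

# Substrings that mark a queue as on-prem, as listed in the alert text.
ON_PREM_MARKERS = ("osx", "arm64", "perf")


def is_on_prem(queue_name: str) -> bool:
    """Return True if the queue name contains an on-prem marker.

    Matching is case-insensitive here; that is an assumption of this sketch.
    """
    name = queue_name.lower()
    return any(marker in name for marker in ON_PREM_MARKERS)


def triage(queue_name: str, queues_with_servicing: set[str]) -> str:
    """Suggest a follow-up for a servicing job that ran in an R&D queue.

    `queues_with_servicing` is a hypothetical set of lower-cased R&D queue
    names that already have a corresponding servicing queue.
    """
    if is_on_prem(queue_name):
        # No physical hardware for servicing: exclude the queue in the query.
        return "exclude queue in alert query"
    if queue_name.lower() not in queues_with_servicing:
        # No servicing counterpart exists yet: one has to be created.
        return "create servicing queue"
    # A servicing queue exists, so the missed redirection needs a closer look.
    return "investigate redirection"
```

For example, with made-up queue names, `triage("windows.11.arm64.open", set())` suggests excluding the queue from the alert, while `triage("ubuntu.2204.amd64", set())` suggests creating a servicing queue.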

For more context go here

ServicingJobs 3

Go to rule

@dotnet/dnceng, @dotnet/prodconsvcs, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-5aa74f27ef6445ce9d3d8d3d382e7e35

dotnet-eng-status · 2024-12-22T17:03:05Z

💚 Metric state changed to ok


Go to rule

dotnet-eng-status · 2024-12-23T20:02:10Z

💔 Metric state changed to alerting


ServicingJobs 3

Go to rule

dotnet-eng-status · 2024-12-25T03:01:09Z

💚 Metric state changed to ok


Go to rule

dotnet-eng-status · 2024-12-28T14:02:40Z

💔 Metric state changed to alerting


ServicingJobs 3

Go to rule

dotnet-eng-status · 2024-12-29T13:01:54Z

💚 Metric state changed to ok


Go to rule

dotnet-eng-status · 2024-12-29T14:01:52Z

💔 Metric state changed to alerting


ServicingJobs 2

Go to rule

ilyas1974 · 2024-12-31T16:13:18Z

I think this is a "one off" as we have not updated VS in a while. Once #4609 is merged, if we see this issue again, we can investigate further.

dotnet-eng-status bot added Inactive Alert and removed Inactive Alert labels on Dec 22, 2024

dotnet-eng-status bot added Active Alert, Inactive Alert and removed Active Alert labels on Dec 23, 2024

dotnet-eng-status bot removed the Inactive Alert label on Dec 28, 2024

dotnet-eng-status bot added Active Alert, Inactive Alert and removed Active Alert labels on Dec 28, 2024

dotnet-eng-status bot added Active Alert and removed Inactive Alert labels on Dec 29, 2024

ilyas1974 closed this as completed Dec 31, 2024
