
Improve efficiency of handling gocd alerts #370

Closed
1 task
robrap opened this issue Jul 24, 2023 · 6 comments
Assignees
Labels
on-call wontfix This will not be worked on

Comments

@robrap
Contributor

robrap commented Jul 24, 2023

It takes a while to understand what is wrong in GoCD. Following up on an alert requires VPN access, finding the failure log in GoCD, and then knowing how to search for the actual failure, depending on where it failed. It would be great if the alerts had more context.

AC:

Timeboxed effort -- 1 day.

  • Some useful extract of the logs shows up in the Opsgenie alert (so that we can tell if it's a known/unknown issue, etc.)

Questions/Notes:

  • This work would only help in situations where you don't have to go into GoCD to re-run a stage anyway (e.g. self-closing alerts).
  • We want to switch to ArgoCD & Kubernetes relatively soon; are there quick improvements to get more context in alerts, or are there improvements that would carry over?
  • Could we get the error details into the alert so VPN and GoCD login isn’t required?
  • The Runbook has some notes that can be referenced (or added to) for searching to find errors in logs of various stages.
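As a sketch of what "more context in the alert" could look like, assuming the Opsgenie Alerts API (`POST /v2/alerts` authenticated with a `GenieKey` header) is the delivery path; the field choices and truncation limit here are illustrative assumptions, not the team's actual integration:

```python
import json
import urllib.request

OPSGENIE_URL = "https://api.opsgenie.com/v2/alerts"


def build_alert_payload(pipeline, stage, log_extract, max_chars=4000):
    """Build an Opsgenie alert body that carries a log extract in the
    description, so on-call can triage without a VPN or GoCD login.
    max_chars is an assumed truncation budget (Opsgenie limits
    description length)."""
    return {
        "message": f"GoCD failure: {pipeline}/{stage}",
        # Keep the tail of the log, where the failure usually is.
        "description": log_extract[-max_chars:],
        "tags": ["gocd", "on-call"],
    }


def send_alert(payload, api_key):
    """POST the alert to the Opsgenie Alerts API."""
    req = urllib.request.Request(
        OPSGENIE_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"GenieKey {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```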
@dianakhuang dianakhuang self-assigned this Aug 8, 2023
@dianakhuang
Member

From what I understand, the best we can do is pull the GoCD logs from within a Python script and try to extract the failure information. I'm not sure how easy this would be to implement, and testing it would add significant overhead. I don't think this work will fit into the timebox, and I'm not sure it's worth it.
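A minimal sketch of that idea, assuming GoCD's console-log URL scheme (`/go/files/<pipeline>/<counter>/<stage>/<counter>/<job>/cruise-output/console.log`) and bearer-token auth; the endpoint path, auth header, and error patterns are assumptions, not a tested integration:

```python
import re
import urllib.request


def extract_failures(log_text, max_lines=10):
    """Return up to max_lines log lines that look like errors, so the
    alert can show a useful extract instead of nothing. The patterns
    are a guess at common failure markers."""
    pattern = re.compile(r"(error|exception|failed|traceback)", re.IGNORECASE)
    hits = [line for line in log_text.splitlines() if pattern.search(line)]
    return hits[:max_lines]


def fetch_console_log(base_url, pipeline, pipeline_counter,
                      stage, stage_counter, job, token):
    """Fetch a job's console log from the GoCD server.

    The path and auth scheme here are assumptions about the server's
    configuration; adjust to match the actual deployment."""
    url = (f"{base_url}/go/files/{pipeline}/{pipeline_counter}/"
           f"{stage}/{stage_counter}/{job}/cruise-output/console.log")
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```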

@robrap
Contributor Author

robrap commented Aug 15, 2023

@dianakhuang: Is there any way for the script that is failing to gather this info and put it somewhere (environment variable, other) that can be found and used elsewhere in the pipeline?

@dianakhuang
Member

Considering how the GoCD pipelines are configured, and the many and varied scripts they call, I don't think there's any way to centralize this behavior and make it consistent.

@robrap
Contributor Author

robrap commented Aug 15, 2023

Sure. But is there a way to do it at all? Maybe once we can do it, it wouldn't take much to add it as needed, so that we can quickly tell whether we've hit one of the common/known issues. The alternative is what we have now, where every alert requires looking into GoCD, even just to see if it is a known issue.

@robrap
Contributor Author

robrap commented Aug 29, 2023

[inform] I am re-opening this ticket, but it has been moved to the new "Icebox" section of the Arch-BOM project's backlog.

@jristau1984 jristau1984 added the wontfix This will not be worked on label Jul 22, 2024
@jristau1984

Closing this.
