
Improve efficiency of handling gocd alerts #370

Closed
1 task
robrap opened this issue Jul 24, 2023 · 6 comments
Assignees
Labels
on-call wontfix This will not be worked on

Comments

@robrap
Contributor

robrap commented Jul 24, 2023

It takes a while to understand what is wrong in GoCD. Following up on an alert requires VPN access, finding the failure log in GoCD, and then knowing how to search for the actual failure, depending on where it failed. It would be great if the alerts had more context.

AC:

Timeboxed effort -- 1 day.

  • Some useful extract of the logs shows up in the Opsgenie alert (so that we can tell if it's a known/unknown issue, etc.)

Questions/Notes:

  • This work would only help in situations where you don't have to go into GoCD to re-run a stage anyway (e.g. self-closing alerts).
  • We want to switch to ArgoCD & Kubernetes relatively soon; are there quick improvements to get more context in alerts, or are there improvements that would carry over?
  • Could we get the error details into the alert so VPN and GoCD login isn’t required?
  • The Runbook has some notes that can be referenced (or added to) for searching to find errors in logs of various stages.
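As a sketch of what "more context in the alert" could look like, assuming the Opsgenie Alerts API (`POST /v2/alerts` authenticated with a `GenieKey` header) is the delivery path; the field choices and truncation limit here are illustrative assumptions, not the team's actual integration:

```python
import json
import urllib.request

OPSGENIE_URL = "https://api.opsgenie.com/v2/alerts"


def build_alert_payload(pipeline, stage, log_extract, max_chars=4000):
    """Build an Opsgenie alert body that carries a log extract in the
    description, so on-call can triage without a VPN or GoCD login.
    max_chars is an assumed truncation budget (Opsgenie limits
    description length)."""
    return {
        "message": f"GoCD failure: {pipeline}/{stage}",
        # Keep the tail of the log, where the failure usually is.
        "description": log_extract[-max_chars:],
        "tags": ["gocd", "on-call"],
    }


def send_alert(payload, api_key):
    """POST the alert to the Opsgenie Alerts API."""
    req = urllib.request.Request(
        OPSGENIE_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"GenieKey {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```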
@dianakhuang dianakhuang self-assigned this Aug 8, 2023
@dianakhuang
Member

From what I understand, the best we can do is pull the GoCD logs from within a Python script and try to extract the failure information. I'm not sure how easy this would be to implement, and testing it would add significant overhead. I don't think this work will fit into the timebox, and I'm not sure it's worth it.
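A minimal sketch of that idea, assuming GoCD's console-log URL scheme (`/go/files/<pipeline>/<counter>/<stage>/<counter>/<job>/cruise-output/console.log`) and bearer-token auth; the endpoint path, auth header, and error patterns are assumptions, not a tested integration:

```python
import re
import urllib.request


def extract_failures(log_text, max_lines=10):
    """Return up to max_lines log lines that look like errors, so the
    alert can show a useful extract instead of nothing. The patterns
    are a guess at common failure markers."""
    pattern = re.compile(r"(error|exception|failed|traceback)", re.IGNORECASE)
    hits = [line for line in log_text.splitlines() if pattern.search(line)]
    return hits[:max_lines]


def fetch_console_log(base_url, pipeline, pipeline_counter,
                      stage, stage_counter, job, token):
    """Fetch a job's console log from the GoCD server.

    The path and auth scheme here are assumptions about the server's
    configuration; adjust to match the actual deployment."""
    url = (f"{base_url}/go/files/{pipeline}/{pipeline_counter}/"
           f"{stage}/{stage_counter}/{job}/cruise-output/console.log")
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```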

@robrap
Contributor Author

robrap commented Aug 15, 2023

@dianakhuang: Is there any way for the script that is failing to gather this info and put it somewhere (environment variable, other) that can be found and used elsewhere in the pipeline?

@dianakhuang
Member

Considering how the GoCD pipelines are configured, and the many and varied scripts they call, I don't think there's any way to centralize this behavior and make it consistent.

@robrap
Contributor Author

robrap commented Aug 15, 2023

Sure. But is there a way to do it at all? Maybe once we can do it, it wouldn't take much to add it as needed, so that we can quickly tell whether we've hit one of the common/known issues. The alternative is what we have now, where every alert requires looking into GoCD, even just to see if it is a known issue.

@robrap
Contributor Author

robrap commented Aug 29, 2023

[inform] I am re-opening this ticket, but it has been moved to the new "Icebox" section of the Arch-BOM project's backlog.

@jristau1984 jristau1984 added the wontfix This will not be worked on label Jul 22, 2024
@jristau1984

Closing this.
