
On error handler in pipeline #47

Open · skurfuerst wants to merge 4 commits into main

Conversation

skurfuerst
Contributor

This on_error handler is executed if any other step in
the pipeline fails; it can be used, for example, to trigger a
notification on error.

The code can already be reviewed; I'll only add a README example.
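Since the README example is still to come, here is a purely hypothetical sketch of what a pipeline with an on_error handler might look like. The pipelines/tasks/script structure follows prunner's existing config format, but the placement and shape of on_error are assumptions, not the final syntax:

pipelines:
  release:
    tasks:
      build:
        script:
          - make build
      deploy:
        script:
          - ./deploy.sh
        depends_on: [build]
    # Assumed syntax: runs if any other task in the pipeline failed,
    # e.g. to trigger a notification on error.
    on_error:
      script:
        - ./notify-failure.sh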


codecov bot commented Dec 22, 2023

Codecov Report

Attention: 48 lines in your changes are missing coverage. Please review.

Comparison is base (d7fdf27) 71.07% compared to head (f64ad94) 70.42%.

Files Patch % Lines
prunner.go 65.46% 40 Missing and 8 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
- Coverage   71.07%   70.42%   -0.65%     
==========================================
  Files          20       20              
  Lines        2199     2330     +131     
==========================================
+ Hits         1563     1641      +78     
- Misses        517      562      +45     
- Partials      119      127       +8     


@Sebobo
Member

Sebobo commented Jan 16, 2024

We need exactly this feature too. How can we help finish this up and get a release out?

@hlubek
Member

hlubek commented Jan 16, 2024

I'm having a look at it right now 👍

Member

@hlubek left a comment


Hi, I think we really need this feature. So thanks for taking the initiative here.

I think we need to change the implementation a bit, though, so that it fits better into prunner's async handling and the on-error task execution does not cause issues (see my comment about the locked mutex).

// We use a detached taskRunner and scheduler to run the onError task: it runs
// synchronously (we are already in an async goroutine here), cannot introduce
// any cycles, and keeps the code simple.
taskRunner := r.createTaskRunner(j)
Member

@hlubek commented Jan 16, 2024


I think we might have a cancellation issue here, since the scheduler / task runner is detached. This could block the whole thing if the on-error task never finishes.

Locking the pipeline runner mutex (which covers all of prunner) for the whole on-error task execution is not good, because it basically blocks the entire prunner process. It is only okay to write-lock the mutex for data structure updates.

I think we need to base this on the *PipelineJob, where we capture the state/context for each running pipeline job. Maybe it's also enough to run the on-error scheduler in a goroutine and use the WaitGroup of *PipelineRunner to fit it into the "normal" waiting behavior.

What about putting the on error task execution in startJob:

	// Run graph asynchronously
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		lastErr := job.sched.Schedule(graph)
		if lastErr != nil {
			// TODO Schedule the on error task (sync)
		}
		r.JobCompleted(job.ID, lastErr)
	}()

(we need to implement some kind of first task error state here though, since the last error is not really helpful)

Ideally we would put this behaviour in taskctl.Scheduler itself, but that is more generic and has no real knowledge of the output store etc. I'm thinking of some kind of ghost task that is already defined but only appears and runs on the first error in the pipeline. The issue here would be getting the failed task's stdout/stderr into the variables for that task 🤔.
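For illustration, here is a minimal, self-contained sketch of the startJob-based approach suggested above: the graph runs in a goroutine tracked by the runner's WaitGroup, and the on-error task is scheduled synchronously inside that goroutine only when the graph fails. All names (pipelineJob, runGraph, runOnErrorTask) are stand-ins, not the actual prunner API.

package main

import (
	"errors"
	"fmt"
	"sync"
)

// Toy stand-in for prunner's *PipelineJob; the name is illustrative only.
type pipelineJob struct {
	id string
}

// runGraph simulates job.sched.Schedule(graph) and pretends the pipeline failed.
func runGraph(j *pipelineJob) error {
	return errors.New("task build failed")
}

// runOnErrorTask stands in for scheduling the on_error task via a detached
// scheduler/taskRunner; here it just prints the failure.
func runOnErrorTask(j *pipelineJob, pipelineErr error) error {
	fmt.Printf("on_error for job %s: %v\n", j.id, pipelineErr)
	return nil
}

func main() {
	var wg sync.WaitGroup
	j := &pipelineJob{id: "42"}

	// Run the graph asynchronously, tracked by the WaitGroup so that a
	// shutdown also waits for a still-running on_error task.
	wg.Add(1)
	go func() {
		defer wg.Done()
		lastErr := runGraph(j)
		if lastErr != nil {
			// Run the on_error task synchronously inside this goroutine,
			// without holding any global runner mutex.
			_ = runOnErrorTask(j, lastErr)
		}
		// In prunner, r.JobCompleted(job.ID, lastErr) would be called here.
	}()

	wg.Wait()
}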

@Sebobo
Member

Sebobo commented Jan 17, 2024

Thanks @skurfuerst and @hlubek for your work on this.

To fully explain our use case: we have a long list of exit codes for the various failures in the pipeline, including a map that translates the error codes for developers and editors in the UI. The onError task should be able to catch that code and, in our case, write it to the database for persistence, so we can collect the error reasons for the whole pipeline history; it should also do some cleanup.
Right now I can only interpret the job logs, which disappear after a while.
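As a purely hypothetical illustration of that kind of mapping (the codes and messages below are made up, not from the actual project), an on_error task could translate the exit code before persisting it:

package main

import "fmt"

// Hypothetical mapping from task exit codes to messages for devs and editors.
var exitCodeMessages = map[int]string{
	2:  "content export failed",
	17: "upstream API not reachable",
}

// translateExitCode returns a human-readable reason that an on_error task
// could write to the database together with the job ID.
func translateExitCode(code int) string {
	if msg, ok := exitCodeMessages[code]; ok {
		return msg
	}
	return fmt.Sprintf("unknown failure (exit code %d)", code)
}

func main() {
	fmt.Println(translateExitCode(17))
	fmt.Println(translateExitCode(99))
}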

@Sebobo
Member

Sebobo commented Jan 31, 2024

Any update on this? (Sorry for nagging)

@hlubek
Member

hlubek commented Jan 31, 2024

@skurfuerst Did you find some time to look at my notes about the change? I think we could basically go with the implementation and improve it later, but I fear that we would pull in more complexity and concepts that might be hard to change afterwards.

@skurfuerst
Contributor Author

@hlubek thanks for your comments :) Sadly I did not find time to address them yet, but I can try to work on this in about two weeks :) In case you can prototype your thoughts, feel free to do so :)

All the best, Sebastian

@Sebobo
Member

Sebobo commented Feb 23, 2024

Ping :D

@Sebobo
Member

Sebobo commented Apr 10, 2024

Pong ;)
