Fix worker death #668
// Note: Ok, now I have visibility on the stdout stream
// I don't think I want to send this to gpc
// This might be strictly local debug
// child.stdout!.on('data', (data) => {
This is incidental debugging code. I do want to leave it here but commented out for now.
// TODO job id maybe
});

// Explicitly send a reject task error, to ensure the worker closes down
publish(workerEvents.ENGINE_REJECT_TASK, {
This is the actual fix and this is what WASN'T happening before.
It's not enough to tell the worker we've errored. We need to send an event to the parent process to tell it that we're dead.
I'm a little concerned about duplicating error messages here, so there's a bit of gnarly code on the reporting side to handle that.
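As a rough illustration of the idea in this comment (not the code in this PR), the sketch below publishes both a worker error and an explicit reject event, and de-duplicates the error on the reporting side. The event name strings, the `ERROR` event, `handleUncaught`, and `createReporter` are all hypothetical; only `publish` and `workerEvents.ENGINE_REJECT_TASK` appear in the diff.

```typescript
// Hypothetical event names; the real values live in the project's workerEvents module.
const workerEvents = {
  ERROR: 'worker:error',
  ENGINE_REJECT_TASK: 'engine:reject-task',
} as const;

type WorkerEvent = (typeof workerEvents)[keyof typeof workerEvents];
type Payload = { error: { name: string; message: string } };
type Publish = (type: WorkerEvent, payload: Payload) => void;

// Child side (assumed shape): report the error, then explicitly reject the
// task so the parent knows this process is dead and can be recycled.
function handleUncaught(err: Error, publish: Publish): void {
  const payload: Payload = { error: { name: err.name, message: err.message } };
  publish(workerEvents.ERROR, payload);
  publish(workerEvents.ENGINE_REJECT_TASK, payload);
}

// Reporting side (assumed shape): both events carry the same error,
// so log it only the first time it is seen.
function createReporter(log: (message: string) => void): Publish {
  let lastReported: string | undefined;
  return (_type, payload) => {
    const key = `${payload.error.name}: ${payload.error.message}`;
    if (key === lastReported) return; // same error arriving via the second event
    lastReported = key;
    log(key);
  };
}

const logged: string[] = [];
handleUncaught(new Error('boom'), createReporter((m) => { logged.push(m); }));
```

Comparing against only the most recently reported error is a deliberately simple choice for the sketch; a real reporter would more likely key on a run or job id.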
Short Description
This PR fixes an issue where the worker doesn't close down pooled child processes after uncaught exceptions, resulting in the worker refusing to work.
Related issue
Fixes #664
Implementation Details
What's basically happening is:
The engine should eventually time out the runs, but by that point Lightning thinks they're dead and is no longer listening for events. I think the backlog will clear, but very slowly.
Anyway, as a result of the fix, the error is handled gracefully, the pool re-allocates the worker thread, and everyone is happy.
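To make the re-allocation step concrete, here is a minimal sketch of a pool that replaces a worker when its task is rejected. `WorkerPool`, `PooledWorker`, and the method names are hypothetical and do not reflect the engine's actual pool implementation.

```typescript
// Hypothetical stand-in for a pooled child process.
interface PooledWorker {
  id: number;
  alive: boolean;
}

class WorkerPool {
  private nextId = 1;
  private idle: PooledWorker[] = [];

  constructor(capacity: number) {
    for (let i = 0; i < capacity; i++) this.idle.push(this.spawn());
  }

  private spawn(): PooledWorker {
    return { id: this.nextId++, alive: true };
  }

  // Returns undefined when every worker is busy.
  acquire(): PooledWorker | undefined {
    return this.idle.pop();
  }

  // Task finished normally: the same worker can be reused.
  release(worker: PooledWorker): void {
    this.idle.push(worker);
  }

  // Task was rejected: retire the worker and refill its slot with a fresh one.
  reject(worker: PooledWorker): void {
    worker.alive = false;
    this.idle.push(this.spawn());
  }

  get available(): number {
    return this.idle.length;
  }
}

const pool = new WorkerPool(1);
const first = pool.acquire()!;
// If no reject signal ever arrives, the slot stays occupied and the pool has nothing to hand out.
const availableWhileStuck = pool.available;
pool.reject(first);
const second = pool.acquire()!;
```

In this sketch the pool only recovers capacity if `reject` is called, which mirrors why the explicit reject event matters: without it, the slot held by the dead worker is never returned.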
QA Notes
I've added two integration tests, both of which reproduce very similar errors. Both fail on main.
Checklist before requesting a review