Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix worker death #668

Merged
merged 6 commits into from
Apr 17, 2024
Merged

Fix worker death #668

merged 6 commits into from
Apr 17, 2024

Conversation

josephjclark
Copy link
Collaborator

@josephjclark josephjclark commented Apr 16, 2024

Short Description

This PR fixes an issue where the worker doesn't close down pooled child processes after uncaught exceptions, resulting in the worker refusing to work.

Related issue

Fixes #664

Implementation Details

What's basically happening is:

  • The engine is correctly catching uncaught exceptions (and I think process.exits) when they occur
  • And it is correctly sending error events out (these errors may be rubbish but I do think they are sending)
  • BUT the engine doesn't properly resolve the promise for that run, so the engine still thinks the workflow is executing
  • That means the child process pool doesn't allocate new resources for future work, and so chokes up

The engine should eventually be timing out the runs, but by this point lightning thinks they're dead and isn't listening to events. But I think the backlog will very slowly clear.

Anyway, as result of the fix, the error is handled gracefully, the pool re-allocates the worker thread, and everyone is happy.

QA Notes

I've added two integration tests, both of which reproduce very similar errors to main. And they both fail on main.

Checklist before requesting a review

  • I have performed a self-review of my code
  • I have added unit tests
  • Changesets have been added (if there are production code changes)

@josephjclark josephjclark marked this pull request as draft April 16, 2024 14:35
// Note: Ok, now I have visibility on the stdout stream
// I don't think I want to send this to gpc
// This might be strictly local debug
// child.stdout!.on('data', (data) => {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incidental debugging code. I do want to leave it here but commented out for now.

// TODO job id maybe
});

// Explicitly send a reject task error, to ensure the worker closes down
publish(workerEvents.ENGINE_REJECT_TASK, {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the actual fix and this is what WASN'T happening before.

It's not enough to tell the worker we've errored. We need to send an event to the parent process to tell it that we're dead.

I'm a little concerned about duplicating error messages here and so there's a little bit of gnarly code on the reporting side to handle that.

@josephjclark josephjclark marked this pull request as ready for review April 16, 2024 15:09
@josephjclark josephjclark merged commit 94cec66 into main Apr 17, 2024
5 checks passed
@josephjclark josephjclark deleted the fix-worker-death branch April 17, 2024 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Worker Death
1 participant