Fix worker death #668

josephjclark · 2024-04-16T14:35:47Z

Short Description

This PR fixes an issue where the worker doesn't close down pooled child processes after uncaught exceptions, which leaves the worker unable to take on new work.

Related issue

Fixes #664

Implementation Details

What's basically happening is:

The engine correctly catches uncaught exceptions (and, I think, process.exit calls) when they occur
It also correctly sends error events out (the errors themselves may be rubbish, but I do think they are being sent)
BUT the engine doesn't properly resolve the promise for that run, so it still thinks the workflow is executing
That means the child process pool doesn't allocate resources for future work, and so it chokes up

The engine should eventually time out these runs, but by that point Lightning thinks they're dead and is no longer listening for events. Even so, I think the backlog will clear, just very slowly.

Anyway, as a result of the fix, the error is handled gracefully, the pool re-allocates the worker thread, and everyone is happy.
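To make the failure mode concrete, here is a minimal sketch of a pool that only frees a slot when a run is settled. It is a simplified model, not the engine-multi code: `PoolModel`, `PoolEvent`, and the event names are hypothetical. An error event on its own is reported but leaves the slot occupied, while a reject-task event releases it so queued work can start.

```typescript
// Hypothetical, simplified model of a fixed-size pool; not the engine-multi API.
type PoolEvent =
  | { type: 'error'; runId: string; message: string }
  | { type: 'reject-task'; runId: string; message: string }
  | { type: 'resolve-task'; runId: string };

class PoolModel {
  private capacity: number;
  private active = new Set<string>();
  private pending: string[] = [];
  readonly reported: string[] = [];

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Start the run if a slot is free, otherwise queue it
  submit(runId: string): void {
    if (this.active.size < this.capacity) this.active.add(runId);
    else this.pending.push(runId);
  }

  handle(event: PoolEvent): void {
    if (event.type === 'error') {
      // Reporting an error does NOT settle the run, so the slot stays occupied
      this.reported.push(event.message);
      return;
    }
    // resolve-task and reject-task both settle the run and free its slot
    if (!this.active.delete(event.runId)) return;
    const next = this.pending.shift();
    if (next !== undefined) this.active.add(next);
  }

  isRunning(runId: string): boolean {
    return this.active.has(runId);
  }
}

const demo = new PoolModel(1);
demo.submit('run-1');
demo.submit('run-2'); // queued behind run-1
demo.handle({ type: 'error', runId: 'run-1', message: 'uncaught exception' });
console.log(demo.isRunning('run-2')); // false: run-1 still holds the only slot
demo.handle({ type: 'reject-task', runId: 'run-1', message: 'uncaught exception' });
console.log(demo.isRunning('run-2')); // true: the slot was released
```

With only the error event, `run-2` would wait until something else (such as a timeout) settled `run-1`, which matches the choked-up pool described above.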

QA Notes

I've added two integration tests, each reproducing an error very similar to what happens on main. Both tests fail on main.

Checklist before requesting a review

I have performed a self-review of my code
I have added unit tests
Changesets have been added (if there are production code changes)

josephjclark · 2024-04-16T15:05:22Z

packages/engine-multi/src/worker/pool.ts

+      // Note: Ok, now I have visibility on the stdout stream
+      // I don't think I want to send this to gpc
+      // This might be strictly local debug
+      // child.stdout!.on('data', (data) => {


This is incidental debugging code. I want to leave it in, but commented out for now.

josephjclark · 2024-04-16T15:07:15Z

packages/engine-multi/src/worker/thread/helpers.ts

      // TODO job id maybe
    });
+
+    // Explicitly send a reject task error, to ensure the worker closes down
+    publish(workerEvents.ENGINE_REJECT_TASK, {


This is the actual fix and this is what WASN'T happening before.

It's not enough to tell the worker we've errored. We need to send an event to the parent process to tell it that we're dead.

I'm a little concerned about duplicating error messages here, so there's a bit of gnarly code on the reporting side to handle that.
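For illustration only, one way the reporting side could guard against the same error arriving twice (once from the error event, once from the reject-task event) is to remember what has already been sent. This is a sketch under assumptions, not the code in this PR: `DedupingReporter` and `SentError` are hypothetical names.

```typescript
// Hypothetical reporter that drops repeated errors for the same run.
// The PR's actual reporting code is not shown here and may work differently.
type SentError = { runId: string; message: string };

class DedupingReporter {
  private seen = new Set<string>();
  readonly sent: SentError[] = [];

  // Returns true if the error was forwarded, false if it was a duplicate
  report(runId: string, message: string): boolean {
    const key = `${runId}:${message}`;
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.sent.push({ runId, message });
    return true;
  }
}

const reporter = new DedupingReporter();
reporter.report('run-1', 'uncaught exception'); // forwarded
reporter.report('run-1', 'uncaught exception'); // dropped as a duplicate
console.log(reporter.sent.length); // 1
```

Keying on run id plus message keeps genuinely different errors for the same run, at the cost of holding keys in memory until the reporter is discarded.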

josephjclark added 2 commits April 16, 2024 14:10

engine: restore stout logging on inner thread

31af818

engine: on error, ensure that the pool task rejects properly

b213809

josephjclark marked this pull request as draft April 16, 2024 14:35

josephjclark added 2 commits April 16, 2024 16:00

tests: integration test for uncaught exception

0528519

tests: add another test for process.exit

4ad992b

version: [email protected]

2c6a59e

josephjclark marked this pull request as ready for review April 16, 2024 15:09

tests: reorder

781a75d

josephjclark merged commit 94cec66 into main Apr 17, 2024
5 checks passed

josephjclark deleted the fix-worker-death branch April 17, 2024 08:17
