
Add is best model to training servicer #229

Open · wants to merge 19 commits into main
Conversation

@thodkatz (Collaborator) commented Dec 13, 2024

It builds upon #228.

Whenever we have a new best model, the IsBestModel stream will yield a response. The client can use this to perform certain actions, e.g. ilastik can propagateDirty any predictions made with previous models.

- Supported operations: start, resume, pause, shutdown
- The pytorch-3dunet package is used as the framework to create the models
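As a sketch of how a client might consume the stream described above: the response type and function names below are hypothetical (the actual proto messages live in the PR), and a plain iterable stands in for a real gRPC response stream.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class BestModelResponse:
    # Hypothetical shape of the stream response; the real proto
    # message in the PR may differ.
    model_id: int


def watch_best_models(stream: Iterable[BestModelResponse]) -> List[int]:
    """Consume the (simulated) IsBestModel stream and react to each new model.

    With a real gRPC stub this would iterate over the stream returned by
    the servicer; here a plain iterable stands in so the logic is testable.
    """
    seen = []
    for response in stream:
        # e.g. ilastik would call propagateDirty here to invalidate
        # predictions made with the previous model.
        seen.append(response.model_id)
    return seen
```

With a real channel, the loop body is where the client reacts to each newly announced best model.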

I caught an edge case where events are blocked: we have exited the training, but the tasks in the queue would remain unprocessed.
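The fix for that edge case can be sketched as a worker loop that keeps draining the queue until shutdown is requested and no tasks remain. This mirrors the described behavior only in spirit; the actual supervisor code in the PR is more involved, and the names here are illustrative.

```python
import queue
import threading


def drain_on_shutdown(task_queue: "queue.Queue", shutdown: threading.Event):
    """Worker loop sketch: keep consuming tasks until shutdown is set AND
    the queue is empty, so pending tasks are not left unprocessed when
    the training exits.
    """
    processed = []
    while not (shutdown.is_set() and task_queue.empty()):
        try:
            task = task_queue.get(timeout=0.01)
        except queue.Empty:
            continue  # re-check the exit condition instead of blocking forever
        processed.append(task)
        task_queue.task_done()
    return processed
```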
Creating and closing processes and threads can be quite time consuming,
resulting in test timeouts if the tests perform a lot of actions.
Applying monkeypatch to a parent process won't be propagated to a child process if the start method is spawn (macOS) instead of fork (Linux).
- To fix tests on Windows, convert label data to float64
The should-stop callbacks return booleans, so we need to aggregate their return values. Previously the return values weren't taken into account, and the callbacks were returning None.
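The aggregation described above can be sketched as follows; the callback signature is an assumption (the real trainer may pass training state into each callback), but the core idea is that the decision becomes an explicit `any()` over the returned booleans.

```python
from typing import Callable, List

# Hypothetical callback type: each callback votes on whether to stop.
ShouldStopCallback = Callable[[], bool]


def should_stop(callbacks: List[ShouldStopCallback]) -> bool:
    """Aggregate boolean should-stop callbacks: stop if any of them asks to.

    Dropping the return values (as before) made every callback effectively
    return None; aggregating with any() makes the decision explicit.
    """
    return any(cb() for cb in callbacks)
```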
The enum is used for a validation check before triggering one of the operations. Previously I was checking if the queue was alive, but that isn't enough: for example, if you try to resume while already resumed, the queue is operational, but the action shouldn't be valid.
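The state-based validation can be sketched with an enum and a transition table; the state names and table below are assumptions for illustration, not the PR's actual definitions. Note how `resume` is rejected while already running even though the queue would still be alive.

```python
from enum import Enum, auto


class TrainerState(Enum):
    IDLE = auto()
    RUNNING = auto()
    PAUSED = auto()


# Hypothetical transition table: which actions are valid from which state.
VALID_STATES = {
    "start": {TrainerState.IDLE},
    "pause": {TrainerState.RUNNING},
    "resume": {TrainerState.PAUSED},
    "shutdown": {TrainerState.IDLE, TrainerState.RUNNING, TrainerState.PAUSED},
}


def validate(action: str, state: TrainerState) -> None:
    """Reject invalid actions, e.g. resume while already running."""
    if state not in VALID_STATES[action]:
        raise ValueError(f"Cannot {action} while in state {state.name}")
```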
@thodkatz thodkatz force-pushed the add-is-best-model-to-training-servicer branch 3 times, most recently from b60fb5f to 90b4a5e Compare December 19, 2024 11:15
Move NamedInt and Tensor proto to a separate file so the training proto can
use them as well.
- The inference servicer had a procedure to list the available devices.
  This is needed for the training servicer as well, so list devices was
  decoupled to be shared.
If the training is running or paused, the forward pass will retain that
state after completion. But it requires pausing, so we can release memory
and do the forward pass.
@thodkatz thodkatz force-pushed the add-is-best-model-to-training-servicer branch from 90b4a5e to 811ddd3 Compare December 20, 2024 22:04
codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 62.19512% with 372 lines in your changes missing coverage. Please review.

Project coverage is 62.83%. Comparing base (5ea5d3a) to head (14f81af).

Files with missing lines Patch % Lines
tiktorch/trainer.py 42.53% 127 Missing ⚠️
tiktorch/proto/training_pb2_grpc.py 54.78% 52 Missing ⚠️
tiktorch/server/session/backend/supervisor.py 68.86% 52 Missing ⚠️
tiktorch/proto/training_pb2.py 27.02% 27 Missing ⚠️
tiktorch/proto/utils_pb2.py 30.00% 21 Missing ⚠️
tiktorch/proto/inference_pb2.py 20.00% 20 Missing ⚠️
tiktorch/server/session/backend/commands.py 78.40% 19 Missing ⚠️
tiktorch/server/session/backend/base.py 66.66% 17 Missing ⚠️
tiktorch/server/session/process.py 72.34% 13 Missing ⚠️
tiktorch/server/session/rpc_interface.py 71.42% 10 Missing ⚠️
... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #229      +/-   ##
==========================================
- Coverage   64.60%   62.83%   -1.78%     
==========================================
  Files          40       47       +7     
  Lines        2195     2876     +681     
==========================================
+ Hits         1418     1807     +389     
- Misses        777     1069     +292     


Since both the inference and training servicers share the concept of an
id, the training session id was replaced with the model session id used
for inference. This model session protobuf interface was moved to a
separate utils proto file.

Since the PredictRequest is common, it can be leveraged for abstraction.
@thodkatz thodkatz force-pushed the add-is-best-model-to-training-servicer branch from 811ddd3 to b10d4f9 Compare December 21, 2024 10:52
- If the model was initially paused or running, saving after completion
  retains that state, while temporarily pausing to perform the save.
- The export will pause the training if it is not already paused.
Whenever we have a new model, a stream will yield a response. The client
can use this to perform certain actions, e.g. ilastik can propagateDirty
any predictions made with previous models.
The response of the best model stream returns an id. The id is increased
by one each time we have a new best model. A client can identify whether
an action was performed by an outdated model based on the id: if the
current id is greater, then a new best model exists.
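The staleness check described above reduces to a single comparison; the function name and parameters below are illustrative, not the PR's actual API.

```python
def is_outdated(prediction_model_id: int, current_best_id: int) -> bool:
    """A prediction is outdated if it was produced by a model whose id is
    smaller than the current best model id (ids increase by one for each
    new best model)."""
    return current_best_id > prediction_model_id
```

A client would store the id alongside each prediction and re-run (or mark dirty) any prediction for which this returns True.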
@thodkatz thodkatz force-pushed the add-is-best-model-to-training-servicer branch from b10d4f9 to 14f81af Compare December 21, 2024 11:15