
Add training service #225

Open · thodkatz wants to merge 11 commits into main from add-training-servicer

Conversation
Conversation

@thodkatz (Collaborator) commented Dec 9, 2024

An initial implementation of neural network training in tiktorch :P

It requires:

  • The pytorch-3dunet package, which is used as our framework to configure the models.

Supported workflows:

  • Start
  • Pause
  • Resume
  • Shutdown
  • Get training state
  • Recover if training failed

The above are demonstrated with tests. Please check them first :)

For thread synchronization we rely on a thread-safe priority queue (todo: explain the scheme built around the ICommand concept and the error handling).
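A rough sketch of the idea (illustrative only; the names below are not necessarily the ones used in `tiktorch/server/session/backend/commands.py`): callers wrap actions in command objects, push them onto a thread-safe priority queue, and a single worker thread drains the queue, executes each command, and reports failures back through a future.

```python
# Minimal sketch of a command queue with a single worker thread.
import queue
import threading
from abc import ABC, abstractmethod
from concurrent.futures import Future
from dataclasses import dataclass, field


class ICommand(ABC):
    @abstractmethod
    def execute(self) -> None:
        ...


@dataclass(order=True)
class _PrioritizedCommand:
    priority: int
    command: ICommand = field(compare=False)
    result: Future = field(compare=False, default_factory=Future)


class CommandWorker:
    def __init__(self):
        self._queue: "queue.PriorityQueue[_PrioritizedCommand]" = queue.PriorityQueue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, command: ICommand, priority: int = 10) -> Future:
        # The returned future lets the caller wait for completion or inspect errors.
        item = _PrioritizedCommand(priority, command)
        self._queue.put(item)
        return item.result

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                item.command.execute()
                item.result.set_result(None)
            except Exception as exc:
                # Propagate the error back to the caller instead of killing the worker.
                item.result.set_exception(exc)
```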

Manual testing

I have created a setup to test the server with actual data. You can find more here: https://github.com/thodkatz/ilastik-playground

@thodkatz thodkatz force-pushed the add-training-servicer branch from bbf0243 to ace0793 Compare December 9, 2024 16:35
codecov bot commented Dec 9, 2024

Codecov Report

Attention: Patch coverage is 70.46414% with 210 lines in your changes missing coverage. Please review.

Project coverage is 65.52%. Comparing base (3c1477f) to head (d432e25).
Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| tiktorch/proto/training_pb2_grpc.py | 55.10% | 44 Missing ⚠️ |
| tiktorch/trainer.py | 65.48% | 39 Missing ⚠️ |
| tiktorch/server/session/backend/supervisor.py | 74.82% | 37 Missing ⚠️ |
| tiktorch/proto/training_pb2.py | 60.24% | 33 Missing ⚠️ |
| tiktorch/server/session/process.py | 66.66% | 14 Missing ⚠️ |
| tiktorch/server/session/backend/commands.py | 82.66% | 13 Missing ⚠️ |
| tiktorch/server/session/backend/base.py | 71.42% | 12 Missing ⚠️ |
| tiktorch/server/grpc/training_servicer.py | 87.67% | 9 Missing ⚠️ |
| tiktorch/server/session/rpc_interface.py | 70.00% | 9 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #225      +/-   ##
==========================================
+ Coverage   64.60%   65.52%   +0.92%     
==========================================
  Files          40       44       +4     
  Lines        2195     2770     +575     
==========================================
+ Hits         1418     1815     +397     
- Misses        777      955     +178     

☔ View full report in Codecov by Sentry.

@thodkatz thodkatz force-pushed the add-training-servicer branch from ace0793 to cbb2619 Compare December 9, 2024 16:42
@@ -1,7 +1,7 @@
 [pytest]
 python_files = test_*.py
 addopts =
-    --timeout 10
+    --timeout 60
@thodkatz (Collaborator, Author) commented:

The cost of creating threads and processes with the "spawn" start method instead of "fork" is quite significant, which led me to bump the timeout so the tests can pass on macOS and Windows.
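Not part of the PR, but a quick, self-contained way to see the difference between start methods on your own machine (timings are machine dependent; this is just an illustration):

```python
# With "spawn" (the default on macOS and Windows) every child boots a fresh
# Python interpreter and re-imports the target module, which is much slower
# than "fork" and also means in-memory patches from the parent are not inherited.
import multiprocessing as mp
import time


def noop():
    pass


def time_start_method(method: str, n: int = 5) -> float:
    ctx = mp.get_context(method)
    start = time.perf_counter()
    for _ in range(n):
        p = ctx.Process(target=noop)
        p.start()
        p.join()
    return (time.perf_counter() - start) / n


if __name__ == "__main__":
    for method in mp.get_all_start_methods():  # e.g. ["fork", "spawn", "forkserver"] on Linux
        print(method, f"{time_start_method(method):.3f}s per process")
```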

@thodkatz thodkatz force-pushed the add-training-servicer branch 3 times, most recently from e9fc4b0 to f345f69 Compare December 10, 2024 13:30
Comment on lines +187 to +192
def set_max_num_iterations(self, num_iterations: int):
    raise NotImplementedError

def update_dataset(self):
    raise NotImplementedError

A collaborator commented:
why does the biomodel supervisor need those?

@thodkatz (Collaborator, Author) replied:
I was hesitant to remove them, as a reference to what the old design was intended to do, but I wasn't sure what to do with them. Maybe I could move them to the trainer one instead.

@thodkatz thodkatz force-pushed the add-training-servicer branch 3 times, most recently from 750ee88 to d432e25 Compare December 11, 2024 15:23
- Supported operations: start, resume, pause, shutdown
- pytorch-3dunet package is used as the framework to create the models
… failed

I caught an edge case where events are blocked because we have exited the training, so the tasks in the queue would remain unprocessed.
Creating and closing processes and threads can be quite time consuming, resulting in test timeouts if the tests perform a lot of actions.
A monkeypatch applied in a parent process won't propagate to a child process if the start method is spawn (macOS) instead of fork (Linux).
- To fix tests on Windows, convert label data to float64
The should-stop callbacks return booleans, so we need to aggregate their return values. Previously the return values weren't taken into account, and the callbacks returned None.
The enum is used as a validation check before triggering one of the actions. Previously I was checking whether the queue was alive, but that isn't enough: for example, if you try to resume while already resumed, the queue is operational but the action shouldn't be valid.
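A minimal sketch of these last two points, with illustrative names rather than the actual tiktorch API: the boolean should-stop callbacks are aggregated with `any()`, and a state enum rejects actions that are invalid in the current state even though the command queue is still operational.

```python
from enum import Enum, auto
from typing import Callable, List


class TrainerState(Enum):
    IDLE = auto()
    RUNNING = auto()
    PAUSED = auto()
    FINISHED = auto()


class Trainer:
    def __init__(self):
        self.state = TrainerState.IDLE
        self._should_stop_callbacks: List[Callable[[], bool]] = []

    def register_should_stop(self, callback: Callable[[], bool]) -> None:
        self._should_stop_callbacks.append(callback)

    def should_stop(self) -> bool:
        # Aggregate the boolean results: stop as soon as any callback asks to stop.
        return any(cb() for cb in self._should_stop_callbacks)

    def resume(self) -> None:
        # Checking that the queue is alive is not enough; resuming while already
        # running must be rejected based on the current state.
        if self.state is not TrainerState.PAUSED:
            raise RuntimeError(f"Cannot resume from state {self.state.name}")
        self.state = TrainerState.RUNNING
```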
@thodkatz thodkatz force-pushed the add-training-servicer branch 6 times, most recently from b9153b1 to a088627 Compare December 19, 2024 10:00
@thodkatz thodkatz force-pushed the add-training-servicer branch 3 times, most recently from 09fcb85 to 4a5ed85 Compare December 20, 2024 15:37
@thodkatz thodkatz force-pushed the add-training-servicer branch 2 times, most recently from 52fb50f to 752cb59 Compare December 20, 2024 20:14
Move NamedInt and Tensor proto to a separate file so the training proto can
use them as well
- The inference servicer had a procedure to list the available devices.
  This is needed for the training servicer as well, so listing devices is
  decoupled to be shared.
@thodkatz thodkatz force-pushed the add-training-servicer branch from 752cb59 to 27b3923 Compare December 20, 2024 20:48