
Add training service #225

Open · thodkatz wants to merge 11 commits into main from add-training-servicer

Conversation
Conversation

@thodkatz (Collaborator) commented Dec 9, 2024

An initial implementation of neural network training in tiktorch :P

It requires:

  • The pytorch-3dunet package, which is used as our framework to configure the models.

Supported workflows:

  • Start
  • Pause
  • Resume
  • Shutdown
  • Get training state
  • Recover if training failed

The above are demonstrated with tests. Please check them first :)

For thread synchronization we rely on a thread-safe priority queue (todo: explain the scheme built around the ICommand concept and the error handling).
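A rough sketch of the idea (illustrative only; the names below are not necessarily the ones used in `tiktorch/server/session/backend/commands.py`): callers wrap actions in command objects, push them onto a thread-safe priority queue, and a single worker thread drains the queue, executes each command, and reports failures back through a future.

```python
# Minimal sketch of a command queue with a single worker thread.
import queue
import threading
from abc import ABC, abstractmethod
from concurrent.futures import Future
from dataclasses import dataclass, field


class ICommand(ABC):
    @abstractmethod
    def execute(self) -> None:
        ...


@dataclass(order=True)
class _PrioritizedCommand:
    priority: int
    command: ICommand = field(compare=False)
    result: Future = field(compare=False, default_factory=Future)


class CommandWorker:
    def __init__(self):
        self._queue: "queue.PriorityQueue[_PrioritizedCommand]" = queue.PriorityQueue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, command: ICommand, priority: int = 10) -> Future:
        # The returned future lets the caller wait for completion or inspect errors.
        item = _PrioritizedCommand(priority, command)
        self._queue.put(item)
        return item.result

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                item.command.execute()
                item.result.set_result(None)
            except Exception as exc:
                # Propagate the error back to the caller instead of killing the worker.
                item.result.set_exception(exc)
```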

Manual testing

I have created a setup to test the server with actual data. You can find more here: https://github.com/thodkatz/ilastik-playground

@thodkatz thodkatz force-pushed the add-training-servicer branch from bbf0243 to ace0793 Compare December 9, 2024 16:35
codecov bot commented Dec 9, 2024

Codecov Report

Attention: Patch coverage is 70.46414% with 210 lines in your changes missing coverage. Please review.

Project coverage is 65.52%. Comparing base (3c1477f) to head (d432e25).
Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| tiktorch/proto/training_pb2_grpc.py | 55.10% | 44 Missing ⚠️ |
| tiktorch/trainer.py | 65.48% | 39 Missing ⚠️ |
| tiktorch/server/session/backend/supervisor.py | 74.82% | 37 Missing ⚠️ |
| tiktorch/proto/training_pb2.py | 60.24% | 33 Missing ⚠️ |
| tiktorch/server/session/process.py | 66.66% | 14 Missing ⚠️ |
| tiktorch/server/session/backend/commands.py | 82.66% | 13 Missing ⚠️ |
| tiktorch/server/session/backend/base.py | 71.42% | 12 Missing ⚠️ |
| tiktorch/server/grpc/training_servicer.py | 87.67% | 9 Missing ⚠️ |
| tiktorch/server/session/rpc_interface.py | 70.00% | 9 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #225      +/-   ##
==========================================
+ Coverage   64.60%   65.52%   +0.92%     
==========================================
  Files          40       44       +4     
  Lines        2195     2770     +575     
==========================================
+ Hits         1418     1815     +397     
- Misses        777      955     +178     

☔ View full report in Codecov by Sentry.

@thodkatz thodkatz force-pushed the add-training-servicer branch from ace0793 to cbb2619 Compare December 9, 2024 16:42
@@ -1,7 +1,7 @@
 [pytest]
 python_files = test_*.py
 addopts =
-    --timeout 10
+    --timeout 60
@thodkatz (Collaborator, Author) commented:

The cost of creating threads and processes with the "spawn" start method instead of "fork" is quite significant, which led me to bump the timeout so the tests can pass on macOS and Windows.
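Not part of the PR, but a quick, self-contained way to see the difference between start methods on your own machine (timings are machine dependent; this is just an illustration):

```python
# With "spawn" (the default on macOS and Windows) every child boots a fresh
# Python interpreter and re-imports the target module, which is much slower
# than "fork" and also means in-memory patches from the parent are not inherited.
import multiprocessing as mp
import time


def noop():
    pass


def time_start_method(method: str, n: int = 5) -> float:
    ctx = mp.get_context(method)
    start = time.perf_counter()
    for _ in range(n):
        p = ctx.Process(target=noop)
        p.start()
        p.join()
    return (time.perf_counter() - start) / n


if __name__ == "__main__":
    for method in mp.get_all_start_methods():  # e.g. ["fork", "spawn", "forkserver"] on Linux
        print(method, f"{time_start_method(method):.3f}s per process")
```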

@thodkatz thodkatz force-pushed the add-training-servicer branch 3 times, most recently from e9fc4b0 to f345f69 Compare December 10, 2024 13:30
Comment on lines +187 to +192
def set_max_num_iterations(self, num_iterations: int):
    raise NotImplementedError

def update_dataset(self):
    raise NotImplementedError

A collaborator commented:
why does the biomodel supervisor need those?

@thodkatz (Collaborator, Author) replied:
I was hesitant to remove them, as a reference to what the old design was intended to do, but I wasn't sure what to do with them. Maybe I could move them to the trainer one instead.

@thodkatz thodkatz force-pushed the add-training-servicer branch 3 times, most recently from 750ee88 to d432e25 Compare December 11, 2024 15:23
- Supported operations: start, resume, pause, shutdown
- pytorch-3dunet package is used as the framework to create the models
… failed

I caught an edge case where events are blocked because we have exited the training, so the tasks in the queue would remain unprocessed.
Creating and closing processes and threads can be quite time consuming, resulting in test timeouts if the tests perform a lot of actions.
A monkeypatch applied in a parent process won't propagate to a child process if the start method is spawn (macOS) instead of fork (Linux).
- To fix tests on Windows, convert label data to float64
The should-stop callbacks return booleans, so we need to aggregate their return values. Previously the return values weren't taken into account, and the callbacks returned None.
The enum is used as a validation check before triggering one of the actions. Previously I was checking whether the queue was alive, but that isn't enough: for example, if you try to resume while already resumed, the queue is operational but the action shouldn't be valid.
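A minimal sketch of these last two points, with illustrative names rather than the actual tiktorch API: the boolean should-stop callbacks are aggregated with `any()`, and a state enum rejects actions that are invalid in the current state even though the command queue is still operational.

```python
from enum import Enum, auto
from typing import Callable, List


class TrainerState(Enum):
    IDLE = auto()
    RUNNING = auto()
    PAUSED = auto()
    FINISHED = auto()


class Trainer:
    def __init__(self):
        self.state = TrainerState.IDLE
        self._should_stop_callbacks: List[Callable[[], bool]] = []

    def register_should_stop(self, callback: Callable[[], bool]) -> None:
        self._should_stop_callbacks.append(callback)

    def should_stop(self) -> bool:
        # Aggregate the boolean results: stop as soon as any callback asks to stop.
        return any(cb() for cb in self._should_stop_callbacks)

    def resume(self) -> None:
        # Checking that the queue is alive is not enough; resuming while already
        # running must be rejected based on the current state.
        if self.state is not TrainerState.PAUSED:
            raise RuntimeError(f"Cannot resume from state {self.state.name}")
        self.state = TrainerState.RUNNING
```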
@thodkatz thodkatz force-pushed the add-training-servicer branch 6 times, most recently from b9153b1 to a088627 Compare December 19, 2024 10:00
@thodkatz thodkatz force-pushed the add-training-servicer branch 3 times, most recently from 09fcb85 to 4a5ed85 Compare December 20, 2024 15:37
@thodkatz thodkatz force-pushed the add-training-servicer branch 2 times, most recently from 52fb50f to 752cb59 Compare December 20, 2024 20:14
Move NamedInt and Tensor proto to a separate file so the training proto can
use them as well
- The inference servicer had a procedure to list the available devices.
  This is needed for the training servicer as well, so listing devices is
  decoupled to be shared.
@thodkatz thodkatz force-pushed the add-training-servicer branch from 752cb59 to 27b3923 Compare December 20, 2024 20:48