
[Feat] Add support for distributed training #257

Open

tchaton wants to merge 11 commits into master
Conversation

@tchaton commented Sep 7, 2021

Description

This PR adds support for computing the support set on new data when running inference. Right now this is only supported by LightningPrototypicalNetworks, as the classifier needs re-fitting.

This is used for Lightning Flash learn2learn integration: Lightning-Universe/lightning-flash#737
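
For context, "re-fitting the classifier" in a prototypical network amounts to re-computing the class prototypes from the new support data before classifying queries by nearest prototype. The sketch below is illustrative only (it is not the learn2learn or Flash API); fit_prototypes and predict are hypothetical names.

```python
import torch

def fit_prototypes(support_embeddings, support_labels):
    # One prototype per class: the mean embedding of that class's support samples.
    classes = support_labels.unique(sorted=True)
    prototypes = torch.stack(
        [support_embeddings[support_labels == c].mean(dim=0) for c in classes]
    )
    return classes, prototypes

def predict(query_embeddings, classes, prototypes):
    # Assign each query to the class of its nearest prototype.
    distances = torch.cdist(query_embeddings, prototypes)
    return classes[distances.argmin(dim=1)]

# Toy usage: 2 classes, 3 support samples each, 4 queries, 8-dim embeddings.
support = torch.randn(6, 8)
labels = torch.tensor([0, 0, 0, 1, 1, 1])
queries = torch.randn(4, 8)
classes, prototypes = fit_prototypes(support, labels)
print(predict(queries, classes, prototypes))
```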


Contribution Checklist

If your contribution modifies code in the core library (not docs, tests, or examples), please fill the following checklist.

  • My contribution is listed in CHANGELOG.md with attribution.
  • My contribution modifies code in the main library.
  • My modifications are tested.
  • My modifications are documented.

Optional

If you make major changes to the core library, please run make alltests and copy-paste the content of alltests.txt below.

[PASTE HERE]

@tchaton marked this pull request as ready for review September 8, 2021 08:50
learn2learn/utils/lightning.py (review thread, resolved)
learn2learn/utils/lightning.py (review thread, outdated, resolved)
@seba-1511 (Member) left a comment


Thanks a lot for the PR @tchaton, and for the preliminary review @pietrolesci.

Some questions and minor remarks:

  • Do these changes remove the strict requirement for pytorch_lightning==1.0.2? (If yes, this should be changed in lightning_episodic_module.py:16.)
  • List your contributions (crediting yourself) in CHANGELOG.md?

  def train_dataloader(self):
-     return Epochifier(
+     return EpisodicBatcher.epochify(
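
For readers unfamiliar with the pattern discussed here, the sketch below shows one way an "epochified" task sampler can be exposed to PyTorch Lightning: wrap it in a fixed-length iterable so an epoch sees exactly `epoch_length` tasks. This is an illustrative assumption, not the learn2learn implementation; the `taskset.sample()` call and the `Epochified` name are hypothetical. Inheriting from torch.utils.data.IterableDataset is also one possible reading of the "inherit from a data class" idea raised below.

```python
from torch.utils.data import IterableDataset

class Epochified(IterableDataset):
    # Turns an endless task sampler into a fixed-length iterable so a
    # PyTorch Lightning epoch sees exactly `epoch_length` tasks.
    def __init__(self, taskset, epoch_length):
        self.taskset = taskset
        self.epoch_length = epoch_length

    def __iter__(self):
        for _ in range(self.epoch_length):
            yield self.taskset.sample()  # assumes the taskset exposes .sample()

    def __len__(self):
        return self.epoch_length
```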
@seba-1511 (Member) commented:

Was Epochifier inadvertently removed? Or should it be replaced by TaskDataParallel?

Also, if I remember correctly, Epochifier was preventing us from using pytorch_lightning > 1.0.2. I believe the fix was just to inherit from a data class in Lightning. @nightlessbaron, do you remember which class and whether that was enough?

@pietrolesci commented:

@seba-1511, quick answer to your first question: Epochifier is replaced by TaskDataParallel.

@nightlessbaron (Contributor) commented Sep 16, 2021:

@seba-1511, as @pietrolesci mentioned, in the benchmark we replaced Epochifier with TaskDistributedParallel. The downside is that Windows users can't make use of it, as stated in #5358.

As for "I believe the fix was just to inherit from a data class in lightning": I couldn't figure it out, so that is still unresolved.

@tchaton (Author) commented:

Hey @seba-1511, yes it was inadvertently removed.

However, I added a TODO. It should be replaced by TaskDataParallel / TaskDistributedDataParallel once tested on your side.

I will copy those files on the Flash side so as not to block the integration until the next learn2learn release.

Best,
T.C
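
Since the PR is about distributed training, a rough sketch of the underlying idea may help future readers: shard episodic tasks across processes so each rank trains on a disjoint subset. This is only an illustration of the concept, not the TaskDataParallel / TaskDistributedDataParallel classes added in this PR; `ShardedTaskIterable` is a hypothetical name.

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset

class ShardedTaskIterable(IterableDataset):
    def __init__(self, tasks, rank=None, world_size=None):
        # `tasks` is any indexable collection of pre-sampled tasks.
        self.tasks = tasks
        self.rank = dist.get_rank() if rank is None else rank
        self.world_size = dist.get_world_size() if world_size is None else world_size

    def __iter__(self):
        # Round-robin sharding: rank r yields tasks r, r + world_size, r + 2 * world_size, ...
        return iter(self.tasks[self.rank :: self.world_size])

    def __len__(self):
        return len(self.tasks[self.rank :: self.world_size])
```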

learn2learn/utils/lightning.py (review thread, outdated, resolved)
learn2learn/algorithms/lightning/lightning_protonet.py (review thread, outdated, resolved)
@tchaton changed the title from "[Feat] Add support for enabling inference" to "[Feat] Add support for distributed training" on Sep 20, 2021