Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature dataset checkpoint strategy #194

Merged
merged 29 commits into from
Jan 8, 2024
Merged

Feature dataset checkpoint strategy #194

merged 29 commits into from
Jan 8, 2024

Conversation

plaguss
Copy link
Contributor

@plaguss plaguss commented Dec 22, 2023

Description

This PR modifies enable_checkpoint in the Pipeline.generate to allow a more behaviour during the generation (saving while generating).

The following example allows saving the CustomDataset every 100 generations to disk.
By default, it will use DatasetCheckpoint(save_frequency=-1), which replicates the previous behaviour of enable_checkpoints=True.

from distilabel.utils.dataset import DatasetCheckpoint

pipe.generate(
    ...,
    dataset_strategy=DatasetCheckpoint(save_frequency=100)
)

Closes #168 and #193.

Copy link
Member

@gabrielmbmb gabrielmbmb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @plaguss! Much needed feature :) it looks good, I left some comments

src/distilabel/dataset.py Outdated Show resolved Hide resolved
src/distilabel/dataset.py Outdated Show resolved Hide resolved
src/distilabel/dataset.py Outdated Show resolved Hide resolved
src/distilabel/utils/dataset.py Outdated Show resolved Hide resolved
src/distilabel/utils/dataset.py Outdated Show resolved Hide resolved
src/distilabel/utils/dataset.py Outdated Show resolved Hide resolved
src/distilabel/pipeline.py Outdated Show resolved Hide resolved
@plaguss plaguss requested a review from gabrielmbmb December 22, 2023 15:41
@plaguss plaguss self-assigned this Dec 22, 2023
@plaguss
Copy link
Contributor Author

plaguss commented Jan 4, 2024

Added a small reference in the docs, inside a new section of the Pipeline from the concept guides

Copy link
Member

@alvarobartt alvarobartt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM even though I'm still pending to test it myself!

Also a general comment is that I'd like to discuss on the naming with the rest of the team, as IMO we could just have checkpoints_path and checkpoints_freq or something similar, rather than asking the user to instantiate a class, while we may use/re-use the class internally, but to avoid the user for instantiating multiple classes within the same function.

pyproject.toml Show resolved Hide resolved
src/distilabel/dataset.py Outdated Show resolved Hide resolved
src/distilabel/pipeline.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ignacioct ignacioct left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of personal doubts and minor suggestions, but looks really nice.

docs/technical-reference/pipeline.md Outdated Show resolved Hide resolved
src/distilabel/pipeline.py Outdated Show resolved Hide resolved
src/distilabel/pipeline.py Outdated Show resolved Hide resolved
src/distilabel/pipeline.py Outdated Show resolved Hide resolved
@plaguss plaguss merged commit d9ad890 into main Jan 8, 2024
4 checks passed
@plaguss plaguss deleted the feat/dataset-checkpoint branch January 8, 2024 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

could there be a feature to save while generating?
4 participants