Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix logging the number of generated samples for each leaf node #445

Closed

Conversation

bbrowning
Copy link
Contributor

We were logging the length of our generated_data list instead of the number of newly generated samples in each leaf node, causing confusion as this is typically 1, 2, 3, etc instead of 50, 500, 1000, etc that users would expect.

Fixes #227

We were logging the length of our `generated_data` list instead of the
number of newly generated samples in each leaf node, causing confusion
as this is typically 1, 2, 3, etc instead of 50, 500, 1000, etc that
users would expect.

Fixes instructlab#227

Signed-off-by: Ben Browning <[email protected]>
@bbrowning
Copy link
Contributor Author

CI runs are showing this is fixed with this PR - seeing log messages such like this which have the intended output:

2024-12-10T23:40:10.3629160Z INFO 2024-12-10 23:40:10,329 instructlab.sdg.generate_data:429: Generated 149 samples
2024-12-10T23:40:10.3629885Z DEBUG 2024-12-10 23:40:10,329 instructlab.sdg.generate_data:430: Generated data: Dataset({
2024-12-10T23:40:10.3630610Z     features: ['task_description', 'seed_question', 'seed_response', 'output'],
2024-12-10T23:40:10.3631105Z     num_rows: 149
2024-12-10T23:40:10.3631335Z })

@bbrowning bbrowning requested a review from a team December 11, 2024 01:18
Copy link
Member

@jaideepr97 jaideepr97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bbrowning should we also emit the leaf node that this number corresponds to? so users actually know what the leaf node is and how many sampels were generated for it

@bbrowning
Copy link
Contributor Author

Yes, that's a good idea - will adjust that, assuming this PR is not obviated by another in-progress on to split out generate_data.

@aakankshaduggal aakankshaduggal requested a review from a team January 16, 2025 20:08
@mergify mergify bot added the one-approval label Jan 16, 2025
Copy link
Contributor

mergify bot commented Jan 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 16, 2025
@bbrowning
Copy link
Contributor Author

Actually going to close this PR, as it is not obsolete since the splitting out of generate_data into multiple pieces merged. That work implicitly fixed this logging issue as part of the refactoring.

@bbrowning bbrowning closed this Jan 16, 2025
@bbrowning bbrowning deleted the fix-generated-samples-logging branch January 16, 2025 20:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ilab data generate does not specify the correct num of samples generated
3 participants