@qusaiw commented Oct 28, 2025

What does this PR do?

This PR adds support for loading datasets from local disk paths in addition to HuggingFace Hub datasets. This enables users to work with private/local datasets without needing to upload them to the Hub.

Changes Made:

  • Modified get_dataset() function to check if a dataset path exists locally before attempting to load from HuggingFace Hub
  • Added support for local datasets in both single dataset and dataset mixture modes
  • Handled DatasetDict split selection for local datasets, so the configured split is picked from a dataset saved with multiple splits
  • Added logging messages to indicate when loading from local disk vs Hub
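
The check described above can be sketched roughly as follows. This is a minimal illustration, not the actual get_dataset() code from this PR: the helper names is_local_dataset, select_split, and load_local_or_hub are hypothetical, and the sketch assumes that "exists locally" means the name resolves to a directory on disk. Only datasets.load_from_disk and datasets.load_dataset are real library calls.

```python
import os
from typing import Optional


def is_local_dataset(name: str) -> bool:
    """Return True when `name` points at an existing directory on disk.

    Assumption: a directory check is how "exists locally" is decided.
    """
    return os.path.isdir(name)


def select_split(loaded, split: str):
    """Pick one split from a DatasetDict-like mapping.

    DatasetDict subclasses dict, so a dict check covers it. A single
    Dataset has no splits to choose from and is returned unchanged.
    """
    if isinstance(loaded, dict):
        if split not in loaded:
            raise ValueError(
                f"Split {split!r} not found; available: {sorted(loaded)}"
            )
        return loaded[split]
    return loaded


def load_local_or_hub(name: str, config: Optional[str] = None, split: str = "train"):
    """Hypothetical helper: load from disk if the path exists, else from the Hub."""
    # Imported lazily so the two helpers above can be used without `datasets`.
    import datasets

    if is_local_dataset(name):
        # load_from_disk returns a Dataset or a DatasetDict, depending on
        # what was originally written with save_to_disk().
        return select_split(datasets.load_from_disk(name), split)
    return datasets.load_dataset(name, name=config, split=split)
```

One consequence of checking the disk first, at least in this sketch: a local directory whose relative path happens to match a Hub dataset id takes precedence over the Hub dataset.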

Usage Examples:

Single dataset:

# From Hub (existing behavior)
dataset_name: HuggingFaceH4/ultrachat_200k

# From local disk (new): path to a directory written by save_to_disk()
dataset_name: local_sft

Dataset mixture:

dataset_mixture:
  datasets:
    - id: local_sft  # Local dataset
      config: default
      split: train_sft
      columns:
        - messages
      weight: 0.6
    - id: HuggingFaceH4/ultrachat_200k  # Hub dataset
      config: default
      split: train_sft
      weight: 0.4

Benefits:

  • Work with proprietary/private datasets without uploading them to the Hub
  • Faster development iteration with local data
  • Useful for testing before publishing datasets
  • Maintains full backward compatibility

Testing

  • Tested with datasets saved using dataset.save_to_disk()
  • Verified backward compatibility with existing Hub-based configs
  • Tested both single dataset and mixture modes
