Multi-Modality

Pytorch-Dataset

A PyTorch Code Dataset for Cutting-Edge Fine-tuning

Installation

Install the package with pip:

pip install pytorch-dataset

Usage

GitHubRepoDownloader downloads and unzips every repository in a GitHub account:

from pytorch import GitHubRepoDownloader

# Example usage:
downloader = GitHubRepoDownloader(username="lucidrains", download_dir="lucidrains_repositories")
downloader.download_repositories()
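
For readers curious about what a download-and-unzip step like this involves, here is a minimal standard-library sketch. It is not the package's implementation: the function names (list_repositories, archive_url, unzip_archive, download_all) are hypothetical, and it assumes public repositories, unauthenticated access, and at most 100 repositories per account (a single page of API results).

```python
import io
import json
import urllib.request
import zipfile
from pathlib import Path


def list_repositories(username):
    """Return (name, default_branch) pairs for a user's public repositories.

    Assumption: only the first page of results (up to 100 repos) is fetched.
    """
    url = f"https://api.github.com/users/{username}/repos?per_page=100"
    with urllib.request.urlopen(url) as response:
        repos = json.load(response)
    return [(repo["name"], repo["default_branch"]) for repo in repos]


def archive_url(username, repo, branch):
    """Build the URL of the zip archive for one branch of a repository."""
    return f"https://github.com/{username}/{repo}/archive/refs/heads/{branch}.zip"


def unzip_archive(zip_bytes, download_dir):
    """Extract an in-memory zip archive into download_dir and return that path."""
    target = Path(download_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(target)
    return target


def download_all(username, download_dir):
    """Download and unzip every listed repository (hypothetical helper)."""
    for name, branch in list_repositories(username):
        with urllib.request.urlopen(archive_url(username, name, branch)) as response:
            unzip_archive(response.read(), download_dir)
```

Calling download_all("lucidrains", "lucidrains_repositories") would roughly mirror the example above. Unauthenticated GitHub API requests are rate limited, so a real downloader would likely need a token and pagination.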

CodeDatasetBuilder cleans and formats the downloaded code, then pushes the resulting dataset to the Hugging Face Hub:

from pytorch import CodeDatasetBuilder

# Example usage:
code_builder = CodeDatasetBuilder("lucidrains_repositories")

code_builder.save_dataset(
    "lucidrains_python_code_dataset", 
    exclude_files=["setup.py"], exclude_dirs=["tests"]
)

code_builder.push_to_hub("lucidrains_python_code_dataset", organization="kye")
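
The exclude_files and exclude_dirs arguments suggest a filtering pass over the downloaded tree. The sketch below shows one plausible way such filtering could work. It is an illustration only: collect_python_files is a hypothetical helper, and matching on exact file and directory names is an assumption rather than the package's documented behavior.

```python
import os
from pathlib import Path


def collect_python_files(root, exclude_files=(), exclude_dirs=()):
    """Return paths of .py files under root, skipping excluded names.

    Assumption: exclusions match exact file or directory names at any depth.
    """
    collected = []
    for current_dir, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if d not in exclude_dirs)
        for filename in sorted(filenames):
            if filename.endswith(".py") and filename not in exclude_files:
                collected.append(Path(current_dir) / filename)
    return collected
```

With exclude_files=["setup.py"] and exclude_dirs=["tests"], packaging scripts and test suites are left out, so only library source files remain.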

License

MIT