The purpose of this project is to provide a streaming dataloader for NLP which runs independently of the model. The project currently supports training but will also handle inference longer term. The motivation behind this was to solve a few issues:
- Issues with Python Performance Related to Preprocessing Data
- Support for handling Streaming Interfaces
- Quick Startup to Avoid File Downloading and Preprocessing
- Install Rust: `curl https://sh.rustup.rs -sSf | sh`
- Install libraries: `sudo apt-get install libzmq3-dev pkg-config libssl-dev`
- Build the Rust library: `cargo build --release` from the rust folder -- might take a while on the first build
- Run `python3 top_run.py --task mlm --all` from the python folder -- runs Masked LM training using the Pile dataset
These instructions will run BERT pretraining using the Pile dataset and will avoid the dataset preprocessing pain normally associated with pretraining.
The data loader is general but directly supports:
- Wikipedia data from https://dumps.wikimedia.org/
- Data stored in the Arrow format, with streaming support from HuggingFace Datasets
This project is still a work in progress and in a relatively early stage. Basic functionality for training is working, with examples for masking and single- and multi-label sequence classification. Current focus is on greater generalization and ease of use, along with bug fixing. The hooks exist to allow relatively easy addition of new tasks/datasets.
There are currently a few example cases which are directly supported and can be run using:
- `python3 top_run.py --task <task> --all` from the python folder, where task is one of "mlm" (Masked Language Model), "clm" (GPT), "span" (T5), or "multi-label"
This example will run pretraining using a wiki dataset which is streamed from https://dumps.wikimedia.org/other/cirrussearch/current/enwiki-20221021-cirrussearch-content.json.gz. There are a few known issues with this dataset, but it is converging over the initial iterations. Training starts immediately, without the up-front preprocessing that normally makes pretraining so painful.
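As a rough sketch of the streaming idea (not the project's actual data provider), the snippet below reads the gzipped dump directly from the HTTP response and decompresses it on the fly, so nothing has to be downloaded up front. It assumes the reqwest (with the blocking feature) and flate2 crates.

```rust
use std::io::{BufRead, BufReader};

use flate2::read::GzDecoder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Stream the gzipped JSON-lines dump and decompress it as it arrives,
    // instead of downloading the whole file before training starts.
    let url = "https://dumps.wikimedia.org/other/cirrussearch/current/enwiki-20221021-cirrussearch-content.json.gz";
    let response = reqwest::blocking::get(url)?;
    let reader = BufReader::new(GzDecoder::new(response));

    // Print the first few records; a real data provider would parse each JSON
    // line and forward the document to the batcher instead.
    for line in reader.lines().take(4) {
        println!("{}", line?);
    }
    Ok(())
}
```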
- Run `python3 top_run.py --task mlm --all` from the python folder
Configuration and Source Code for this Example can be found
This example will run multi-label classification using an emotions dataset downloaded from HuggingFace. The first step of the process is to download the Arrow dataset, so the first run of this example will have a slight delay while the download completes. Arrow also supports a streaming mode, which is being worked on now.
- Run `python3 top_run.py --task emot --all` from the python folder
Configuration and Source Code for this Example can be found
The goal of this design is to allow easy addition of tasks and datasets with a basic API. The API is in heavy flux right now, but examples of how to add tasks can be seen in the rust/tasks folder. There are four basic services that need to be configured. In general, adding a dataset only requires the following (see the sketch after this list):
- Specifying the format of the input file
- Specifying the processing to convert the input data to the proper format for the input model
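As a rough illustration of those two pieces, the hypothetical sketch below pairs an input-format description with a record-to-tokens conversion. All names here (DatasetConfig, InputFormat, process_record) are invented for illustration and are not the project's actual API, which lives in rust/tasks.

```rust
// Hypothetical sketch only; the real definitions live in rust/tasks and differ in detail.

/// Where the raw data comes from and how it is laid out (assumed names).
struct DatasetConfig {
    source: String,      // URL or local path of the source file
    format: InputFormat, // how the raw records are encoded
}

enum InputFormat {
    JsonLinesGz, // gzipped JSON lines, e.g. a wiki CirrusSearch dump
    Arrow,       // Arrow tables, e.g. HuggingFace Datasets
}

/// Convert one raw record into token ids for the model (assumed signature).
fn process_record(text: &str) -> Vec<u32> {
    // Stand-in for real tokenization; returns byte values as fake token ids.
    text.bytes().map(u32::from).collect()
}

fn main() {
    let config = DatasetConfig {
        source: "data/example-corpus.json.gz".to_string(), // placeholder path
        format: InputFormat::JsonLinesGz,
    };
    let kind = match config.format {
        InputFormat::JsonLinesGz => "gzipped JSON lines",
        InputFormat::Arrow => "Arrow",
    };
    let ids = process_record("a raw record from the stream");
    println!("{} ({kind}): {} token ids", config.source, ids.len());
}
```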
The project splits the training pipeline into four separate services which communicate over Rust channels (a sketch follows the list below). The services are:
- Data Provider : downloads data from the internet or a local file and streams it to the Batcher
- Batcher : preprocesses the data (tokenizes for NLP) and assembles it into batches
- Transport : transports the data to the device performing the training (currently only ZMQ is supported)
- Device : runs the training
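A minimal sketch of that pipeline, assuming nothing beyond the standard library, is shown below: each stage runs on its own thread and passes data to the next stage over a channel. The real services are more involved, and the Transport stage sends batches over ZMQ to the device rather than printing them.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // Data Provider -> Batcher -> Transport, each stage on its own thread,
    // connected by channels (a simplified stand-in for the real services).
    let (raw_tx, raw_rx) = mpsc::channel::<String>();
    let (batch_tx, batch_rx) = mpsc::channel::<Vec<String>>();

    // Data Provider: streams raw records (here, just a few fake documents).
    let provider = thread::spawn(move || {
        for i in 0..8 {
            raw_tx.send(format!("document {i}")).unwrap();
        }
    });

    // Batcher: groups records into fixed-size batches (tokenization would happen here).
    let batcher = thread::spawn(move || {
        let mut batch = Vec::new();
        for record in raw_rx {
            batch.push(record);
            if batch.len() == 4 {
                batch_tx.send(std::mem::take(&mut batch)).unwrap();
            }
        }
    });

    // Transport: hands batches to the training device (the real project sends them over ZMQ).
    for batch in batch_rx {
        println!("sending batch of {} records to the device", batch.len());
    }

    provider.join().unwrap();
    batcher.join().unwrap();
}
```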
The tests download datasets from HuggingFace using Python, which unfortunately takes time during the first run. This may look like a hang, so initial testing might be slow. After the initial downloads, testing is relatively fast.