This folder contains example configuration files for quickly reproducing the processing flow of RedPajama.
The raw data files can be downloaded from the same AWS link as in Redpajama/arXiv.
Once downloaded, use raw_arxiv_to_jsonl.py to convert from the original format to jsonl
that Data-Juicer can handle easily:
python tools/preprocess/raw_arxiv_to_jsonl.py \
--arxiv_src_dir <arxiv_src_dir> \
--target_dir <target_dir> \
--temp_dir <temp_dir> \
--num_proc <num_proc>
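Before moving on, it can be worth spot-checking the converted files. The sketch below is not part of Data-Juicer: `check_jsonl` is a hypothetical helper, and it assumes each line is a JSON object whose text is stored under a `text` key (adjust `text_key` if your output uses a different field).

```python
import json
from pathlib import Path


def check_jsonl(path, text_key="text", max_lines=1000):
    """Return (num_checked, num_bad) for the first `max_lines` lines of a jsonl file."""
    checked = bad = 0
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if checked >= max_lines:
                break
            checked += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Blank or malformed lines are counted as bad.
                bad += 1
                continue
            # A usable sample is a JSON object with a non-empty string under text_key.
            if not (
                isinstance(record, dict)
                and isinstance(record.get(text_key), str)
                and record[text_key].strip()
            ):
                bad += 1
    return checked, bad
```

A result such as `(1000, 0)` for a file in `<target_dir>` suggests the conversion produced well-formed samples; any non-zero second value is a hint to inspect the file before processing.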
After conversion, modify the path configurations in redpajama-arxiv.yaml and execute the following command to reproduce the processing flow of RedPajama:
python tools/process_data.py --config configs/reproduced_redpajama/redpajama-arxiv.yaml
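The path edit can also be scripted. This is a minimal sketch, assuming the config stores its input and output locations in top-level `dataset_path` and `export_path` keys; `set_yaml_paths` is a hypothetical helper, not a Data-Juicer API. It rewrites the two lines with plain string handling, so any trailing comment on those lines is dropped and paths containing a single quote are not supported.

```python
import re


def set_yaml_paths(yaml_text, dataset_path, export_path):
    """Replace the values of the top-level `dataset_path` and `export_path` keys."""
    updates = {"dataset_path": dataset_path, "export_path": export_path}
    out = []
    for line in yaml_text.splitlines():
        # Only match keys that start at column 0, i.e. top-level keys.
        match = re.match(r"^(dataset_path|export_path)\s*:", line)
        if match:
            key = match.group(1)
            line = f"{key}: '{updates[key]}'"
        out.append(line)
    return "\n".join(out) + "\n"
```

Read the YAML file into a string, pass it through `set_yaml_paths`, and write the result back (ideally to a copy) before running `process_data.py`.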
| | num_samples | num_tokens | peak_memory | wall_time |
|---|---|---|---|---|
| RedPajama | 1,724,497 | 30,667,506,934 | 35GB | total: 11h52min |
| Data-Juicer | 2,675,426 | 30,338,153,178 | 21GB | preprocess: 5h21min; read+unify: 25min; remove_header_mapper: 5min; remove_comments_mapper: 3min; remove_bibliography_mapper: 4min; expand_macro_mapper: 5min19s; text_length_filter: 4min; export: 43min; total: 6h53min |
The raw data files can be downloaded from the same HuggingFace datasets as in Redpajama/Books.
Once downloaded, modify the path configurations in redpajama-books.yaml and execute the following command to reproduce the processing flow of RedPajama:
python tools/process_data.py --config configs/reproduced_redpajama/redpajama-books.yaml
| | num_samples | num_tokens | peak_memory | wall_time |
|---|---|---|---|---|
| RedPajama | 205,183 | 25,962,395,123 | 450GB | split_for_dedup: 5min; dedup: 117min; total: 122min |
| Data-Juicer | 207,902 | 26,108,635,683 | 96GB | read+unify: 20min; compute_hash: 78min; dedup: 3min; export: 3min; total: 114min |
The raw data files can be downloaded from Google BigQuery as in Redpajama/Code.
Once downloaded, unzip and delete files whose extensions are not in the following whitelist:
.asm, .bat, .cmd, .c, .h, .cs, .cpp, .hpp, .c++, .h++, .cc, .hh, .C, .H, .cmake, .css, .dockerfile, .f90, .f, .f03, .f08, .f77, .f95, .for, .fpp, .go, .hs, .html, .java, .js, .jl, .lua, .md, .markdown, .php, .php3, .php4, .php5, .phps, .phpt, .pl, .pm, .pod, .perl, .ps1, .psd1, .psm1, .py, .rb, .rs, .sql, .scala, .sh, .bash, .command, .zsh, .ts, .tsx, .tex, .vb, Dockerfile, Makefile, .xml, .rst, .m, .smali
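The cleanup step can be sketched as follows. This is an illustration rather than the script used for the numbers in this README: `is_whitelisted` and `prune` are hypothetical helpers, matching is assumed to be case-sensitive (the list contains both `.c` and `.C`), and `Dockerfile` and `Makefile` are treated as whole file names rather than extensions.

```python
from pathlib import Path

# Extensions from the whitelist above (case-sensitive).
ALLOWED_SUFFIXES = {
    ".asm", ".bat", ".cmd", ".c", ".h", ".cs", ".cpp", ".hpp", ".c++", ".h++",
    ".cc", ".hh", ".C", ".H", ".cmake", ".css", ".dockerfile", ".f90", ".f",
    ".f03", ".f08", ".f77", ".f95", ".for", ".fpp", ".go", ".hs", ".html",
    ".java", ".js", ".jl", ".lua", ".md", ".markdown", ".php", ".php3",
    ".php4", ".php5", ".phps", ".phpt", ".pl", ".pm", ".pod", ".perl", ".ps1",
    ".psd1", ".psm1", ".py", ".rb", ".rs", ".sql", ".scala", ".sh", ".bash",
    ".command", ".zsh", ".ts", ".tsx", ".tex", ".vb", ".xml", ".rst", ".m",
    ".smali",
}
# Entries in the whitelist that are full file names, not extensions.
ALLOWED_NAMES = {"Dockerfile", "Makefile"}


def is_whitelisted(path):
    """True if the file name or its final extension is on the whitelist."""
    p = Path(path)
    return p.name in ALLOWED_NAMES or p.suffix in ALLOWED_SUFFIXES


def prune(root, dry_run=True):
    """Return files under `root` that are not whitelisted; delete them unless dry_run."""
    rejected = [
        p for p in sorted(Path(root).rglob("*"))
        if p.is_file() and not is_whitelisted(p)
    ]
    if not dry_run:
        for p in rejected:
            p.unlink()
    return rejected
```

Calling `prune(root)` with the default `dry_run=True` only lists the files that would be removed, which makes it easy to review the result before deleting anything with `dry_run=False`.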
After preparation, modify the path configurations in redpajama-code.yaml and execute the following command to reproduce the processing flow of RedPajama:
python tools/process_data.py --config configs/reproduced_redpajama/redpajama-code.yaml
| | num_samples | num_tokens | peak_memory | wall_time |
|---|---|---|---|---|
| RedPajama | 73,208,524 | 150,390,270,060 | 212GB | local-dedup: 37h; global-dedup: 1h; merge-dedup: 6h; filter: 17h; total: 61h |
| Data-Juicer | 73,169,889 | 150,310,903,230 | 370GB | preprocess: 5h21min; read+unify: 12h; document_deduplicator: 20h; clean_copyright_mapper: 3h; maximum_line_length_filter: 2.5h; average_line_length_filter: 2h; alphanumeric_filter: 13h; export: 2.5h; total: 59h |
The raw data files can be downloaded from the same Archive link as in Redpajama/Stack_exchange.
Once downloaded, use raw_stackexchange_to_jsonl.py to convert from the original format to jsonl
that Data-Juicer can handle easily:
python tools/preprocess/raw_stackexchange_to_jsonl.py \
--src_dir <src_dir> \
--target_dir <target_dir> \
--topk <topk> \
--num_proc <num_proc>
After conversion, modify the path configurations in redpajama-stackexchange.yaml and execute the following command to reproduce the processing flow of RedPajama:
python tools/process_data.py --config configs/reproduced_redpajama/redpajama-stackexchange.yaml
| | num_samples | num_tokens | peak_memory | wall_time |
|---|---|---|---|---|
| RedPajama | 29,825,086 | 20,502,757,123 | >500GB | filter: 170min; postprocess: 90min; total: 260min |
| Data-Juicer | 29,825,086 | 20,628,082,262 | 100GB | preprocess: 210min; read+unify: 86min; clean_html: 15min; language_id_score_filter: 18min; total: 391min |