reproduced_redpajama

RedPajama Config Files

This folder contains example configuration files for reproducing the RedPajama processing flow with Data-Juicer.

arXiv

The raw data files can be downloaded from the same AWS link as in Redpajama/arXiv.

Once downloaded, use raw_arxiv_to_jsonl.py to convert the raw files into the jsonl format that Data-Juicer reads:

python tools/preprocess/raw_arxiv_to_jsonl.py           \
    --arxiv_src_dir       <arxiv_src_dir>    \
    --target_dir          <target_dir>       \
    --temp_dir            <temp_dir>         \
    --num_proc            <num_proc>
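Before editing the config, it can help to sanity-check the converted files. The following is a minimal sketch, not part of Data-Juicer: `peek_jsonl` is a hypothetical helper that loads the first few records of a JSON Lines file so you can see which fields the conversion produced. It assumes one JSON object per line and makes no assumption about field names.

```python
import json
from itertools import islice
from pathlib import Path


def peek_jsonl(path, n=3):
    """Return the first n records of a JSON Lines file as Python objects."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not raise a decode error.
        non_empty = (line for line in f if line.strip())
        for line in islice(non_empty, n):
            records.append(json.loads(line))
    return records


# Example call (the file name is a placeholder, not a real output name):
# for record in peek_jsonl("<target_dir>/<some_file>.jsonl"):
#     print(sorted(record))  # field names of each record
```

Only the first `n` lines are parsed, so the check stays fast even on very large output files.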

After conversion, modify the path configurations in redpajama-arxiv.yaml and execute the following command to reproduce the processing flow of RedPajama:

python tools/process_data.py --config configs/reproduced_redpajama/redpajama-arxiv.yaml

Comparison

| | num_samples | num_tokens | peak_memory | wall_time |
| --- | --- | --- | --- | --- |
| redpajama | 1,724,497 | 30,667,506,934 | 35GB | total: 11h52min |
| Data-Juicer | 2,675,426 | 30,338,153,178 | 21GB | preprocess: 5h21min; read+unify: 25min; remove_header_mapper: 5min; remove_comments_mapper: 3min; remove_bibliography_mapper: 4min; expand_macro_mapper: 5min19s; text_length_filter: 4min; export: 43min; total: 6h53min |

Books

The raw data files can be downloaded from the same HuggingFace datasets as in Redpajama/Books.

Once downloaded, modify the path configurations in redpajama-books.yaml and execute the following command to reproduce the processing flow of RedPajama:

python tools/process_data.py --config configs/reproduced_redpajama/redpajama-books.yaml

Comparison

| | num_samples | num_tokens | peak_memory | wall_time |
| --- | --- | --- | --- | --- |
| redpajama | 205,183 | 25,962,395,123 | 450GB | split_for_dedup: 5min; dedup: 117min; total: 122min |
| Data-Juicer | 207,902 | 26,108,635,683 | 96GB | read+unify: 20min; compute_hash: 78min; dedup: 3min; export: 3min; total: 114min |

Code

The raw data files can be downloaded from Google BigQuery as in Redpajama/Code.

Once downloaded, unzip the archives and delete every file whose extension (or, for Dockerfile and Makefile, whose file name) is not in the following whitelist:

.asm, .bat, .cmd, .c, .h, .cs, .cpp, .hpp, .c++, .h++, .cc, .hh, .C, .H, .cmake, .css, .dockerfile, .f90, .f, .f03, .f08, .f77, .f95, .for, .fpp, .go, .hs, .html, .java, .js, .jl, .lua, .md, .markdown, .php, .php3, .php4, .php5, .phps, .phpt, .pl, .pm, .pod, .perl, .ps1, .psd1, .psm1, .py, .rb, .rs, .sql, .scala, .sh, .bash, .command, .zsh, .ts, .tsx, .tex, .vb, Dockerfile, Makefile, .xml, .rst, .m, .smali
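The whitelist step can be scripted. The sketch below is one possible way to do it and is not taken from Data-Juicer or RedPajama: `is_whitelisted` and `prune` are hypothetical helpers. It assumes matching is case-sensitive (the list contains both `.c` and `.C`) and that entries without a leading dot, `Dockerfile` and `Makefile`, are matched against the whole file name.

```python
from pathlib import Path

# Extensions from the whitelist above; matching is case-sensitive.
ALLOWED_SUFFIXES = set("""
.asm .bat .cmd .c .h .cs .cpp .hpp .c++ .h++ .cc .hh .C .H .cmake .css
.dockerfile .f90 .f .f03 .f08 .f77 .f95 .for .fpp .go .hs .html .java .js
.jl .lua .md .markdown .php .php3 .php4 .php5 .phps .phpt .pl .pm .pod
.perl .ps1 .psd1 .psm1 .py .rb .rs .sql .scala .sh .bash .command .zsh
.ts .tsx .tex .vb .xml .rst .m .smali
""".split())
# Entries that are whole file names rather than extensions.
ALLOWED_NAMES = {"Dockerfile", "Makefile"}


def is_whitelisted(path):
    """True if the file name or its final extension is on the whitelist."""
    path = Path(path)
    return path.name in ALLOWED_NAMES or path.suffix in ALLOWED_SUFFIXES


def prune(root, dry_run=True):
    """Return the non-whitelisted files under root; delete them unless dry_run."""
    rejected = [p for p in Path(root).rglob("*")
                if p.is_file() and not is_whitelisted(p)]
    if not dry_run:
        for p in rejected:
            p.unlink()
    return rejected
```

Running `prune(root)` with the default `dry_run=True` first lets you review what would be deleted before calling it again with `dry_run=False`.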

After preparation, modify the path configurations in redpajama-code.yaml and execute the following command to reproduce the processing flow of RedPajama:

python tools/process_data.py --config configs/reproduced_redpajama/redpajama-code.yaml

Comparison

| | num_samples | num_tokens | peak_memory | wall_time |
| --- | --- | --- | --- | --- |
| redpajama | 73,208,524 | 150,390,270,060 | 212GB | local-dedup: 37h; global-dedup: 1h; merge-dedup: 6h; filter: 17h; total: 61h |
| Data-Juicer | 73,169,889 | 150,310,903,230 | 370GB | preprocess: 5h21min; read+unify: 12h; document_deduplicator: 20h; clean_copyright_mapper: 3h; maximum_line_length_filter: 2.5h; average_line_length_filter: 2h; alphanumeric_filter: 13h; export: 2.5h; total: 59h |

StackExchange

The raw data files can be downloaded from the same Archive link as in Redpajama/Stack_exchange.

Once downloaded, use raw_stackexchange_to_jsonl.py to convert the raw files into the jsonl format that Data-Juicer reads:

python tools/preprocess/raw_stackexchange_to_jsonl.py           \
    --src_dir       <src_dir>      \
    --target_dir    <target_dir>   \
    --topk          <topk>         \
    --num_proc      <num_proc>

After conversion, modify the path configurations in redpajama-stackexchange.yaml and execute the following command to reproduce the processing flow of RedPajama:

python tools/process_data.py --config configs/reproduced_redpajama/redpajama-stackexchange.yaml

Comparison

| | num_samples | num_tokens | peak_memory | wall_time |
| --- | --- | --- | --- | --- |
| redpajama | 29,825,086 | 20,502,757,123 | >500GB | filter: 170min; postprocess: 90min; total: 260min |
| Data-Juicer | 29,825,086 | 20,628,082,262 | 100GB | preprocess: 210min; read+unify: 86min; clean_html: 15min; language_id_score_filter: 18min; total: 391min |
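The comparison tables report wall times in mixed units (`11h52min`, `2.5h`, `5min19s`). To put the totals on a common scale, a small sketch like the following may help. `to_minutes` and `time_ratio` are hypothetical helpers written for this README's notation only, and the `TOTALS` values are copied from the `total:` entries above.

```python
import re

_UNIT_MINUTES = {"h": 60.0, "min": 1.0, "s": 1.0 / 60.0}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(h|min|s)")


def to_minutes(duration):
    """Convert strings such as '11h52min', '2.5h' or '5min19s' to minutes."""
    tokens = _TOKEN.findall(duration)
    # Reject input that contains anything besides <number><unit> pairs.
    if not tokens or "".join(v + u for v, u in tokens) != duration:
        raise ValueError(f"unrecognized duration: {duration!r}")
    return sum(float(v) * _UNIT_MINUTES[u] for v, u in tokens)


def time_ratio(redpajama_total, data_juicer_total):
    """Data-Juicer total divided by the redpajama total (below 1 means shorter)."""
    return to_minutes(data_juicer_total) / to_minutes(redpajama_total)


# (redpajama total, Data-Juicer total) as reported in the tables above.
TOTALS = {
    "arXiv": ("11h52min", "6h53min"),
    "Books": ("122min", "114min"),
    "Code": ("61h", "59h"),
    "StackExchange": ("260min", "391min"),
}

for name, (redpajama, data_juicer) in TOTALS.items():
    print(f"{name}: {time_ratio(redpajama, data_juicer):.2f}")
```

The ratios only restate the reported totals; the tables do not say which stages each pipeline's total covers, so read them as indicative rather than as a controlled benchmark.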