howard-hou/json2binidx_tool

jsonl to binidx tool

This repository is a tool for preparing datasets for RWKV models: it converts jsonl files into the binidx format.

To speed up tokenization, install the fast RWKV tokenizer written in Rust:

pip install pyrwkv-tokenizer

The multilingual rwkv-6-world models use a new tokenizer vocabulary, rwkv_vocab_v20230424.txt. To preprocess a jsonl file with it, run:

python tools/preprocess_data.py --input ./sample.jsonl --output-prefix ./data/sample --vocab ./rwkv_vocab_v20230424.txt --dataset-impl mmap --tokenizer-type RWKVTokenizer --append-eod

A sample of the jsonl input format (one document per line):

{"text": "This is the first document."}
{"text": "Hello\nWorld"}
{"text": "1+1=2\n1+2=3\n2+2=4"}

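Such a file can be generated by code like this:

import json
# meta, text: one document's metadata and content; out: a file opened for writing with encoding="utf-8"
ss = json.dumps({"meta": meta, "text": text}, ensure_ascii=False)
out.write(ss + "\n")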
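
The snippet above assumes that `meta`, `text`, and `out` already exist. A self-contained sketch of the same idea might look like the following. `write_jsonl` and the shape of its `docs` argument are assumptions made for illustration; they are not taken from this repository.

```python
import json


def write_jsonl(docs, path):
    """Write (meta, text) pairs to `path`, one JSON object per line.

    Hypothetical helper for illustration only; it is not part of this repository.
    """
    with open(path, "w", encoding="utf-8") as out:
        for meta, text in docs:
            # json.dumps escapes newlines inside `text` as \n, so each
            # document stays on a single line; ensure_ascii=False keeps
            # non-ASCII characters readable instead of \uXXXX escapes.
            ss = json.dumps({"meta": meta, "text": text}, ensure_ascii=False)
            out.write(ss + "\n")
```

For example, `write_jsonl([({"source": "example"}, "Hello\nWorld")], "sample.jsonl")` writes a single line, with the newline inside the text escaped.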
