Skip to content

Commit

Permalink
tests : test-tokenizer-0.sh print more info (ggerganov#7402)
Browse files Browse the repository at this point in the history
  • Loading branch information
ggerganov authored and teleprint-me committed May 23, 2024
1 parent f3ddffe commit d17d582
Show file tree
Hide file tree
Showing 2 changed files with 7 additions and 2 deletions.
2 changes: 1 addition & 1 deletion convert-hf-to-gguf-update.py
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,7 @@ class TOKENIZER_TYPE(IntEnum):
{"name": "mpt", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mosaicml/mpt-7b", },
{"name": "starcoder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/bigcode/starcoder2-3b", },
{"name": "gpt-2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/openai-community/gpt2", },
{"name": "stablelm", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b", },
{"name": "stablelm2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b", },
{"name": "refact", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/smallcloudai/Refact-1_6-base", },
{"name": "command-r", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/CohereForAI/c4ai-command-r-v01", },
{"name": "qwen2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/Qwen/Qwen1.5-7B", },
Expand Down
7 changes: 6 additions & 1 deletion tests/test-tokenizer-0.sh
Original file line number Diff line number Diff line change
Expand Up @@ -17,10 +17,15 @@ make -j tests/test-tokenizer-0

printf "Testing %s on %s ...\n" $name $input

set -e

printf "Tokenizing using (py) Python AutoTokenizer ...\n"
python3 ./tests/test-tokenizer-0.py ./models/tokenizers/$name --fname-tok $input > /tmp/test-tokenizer-0-$name-py.log 2>&1
cat /tmp/test-tokenizer-0-$name-py.log | grep "tokenized in"

printf "Tokenizing using (cpp) llama.cpp ...\n"
./tests/test-tokenizer-0 ./models/ggml-vocab-$name.gguf $input > /tmp/test-tokenizer-0-$name-cpp.log 2>&1

cat /tmp/test-tokenizer-0-$name-py.log | grep "tokenized in"
cat /tmp/test-tokenizer-0-$name-cpp.log | grep "tokenized in"

diff $input.tok $input.tokcpp > /dev/null 2>&1
Expand Down

0 comments on commit d17d582

Please sign in to comment.