
Tokenizer BPE fixes #7530

Merged: 29 commits from tokenizer-bpe-fixes into ggerganov:master on Jun 18, 2024
Conversation

jaime-m-p (Collaborator) commented May 25, 2024

Modifications to make the BPE tokenizer match AutoTokenizer (a minimal comparison sketch follows the model list below).

Tested with vocabs from these models:

  • gpt-2
  • llama-bpe
  • falcon (FAIL: test-tokenizer-random: generator_random_unicodes)
  • deepseek-coder
  • deepseek-llm (FAIL: problems with unicode \u2029: "\u2029 \uA3E4")
  • starcoder
  • jina-v2-de, jina-v2-es, jina-v2-code
  • smaug-bpe
  • phi-2
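
A minimal sketch of this comparison, assuming the transformers and llama-cpp-python packages (the model name and GGUF path are placeholders):

    from llama_cpp import Llama
    from transformers import AutoTokenizer

    hf = AutoTokenizer.from_pretrained("gpt2")  # reference tokenizer
    gg = Llama(model_path="./models/ggml-vocab-gpt-2.gguf", vocab_only=True)

    text = "Hello'world \u2029 test"
    expected = hf.encode(text, add_special_tokens=False)
    actual = gg.tokenize(text.encode("utf-8"), add_bos=False, special=False)
    assert actual == expected, (actual, expected)

test-tokenizer-random drives the same comparison over generated inputs such as generator_random_unicodes.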

github-actions bot added the testing and python labels on May 25, 2024
github-actions bot commented May 25, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 553 iterations 🚀

Details:
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8443.31ms p(95)=20166.76ms fails=, finish reason: stop=500 truncated=53
  • Prompt processing (pp): avg=97.93tk/s p(95)=396.92tk/s
  • Token generation (tg): avg=50.86tk/s p(95)=51.39tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=tokenizer-bpe-fixes commit=fef99155cc99b1a54863f921b659c9c84a23e1e4

prompt_tokens_seconds: [time-series chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 553 iterations]
predicted_tokens_seconds: [time-series chart from the same benchmark run]

Details

kv_cache_usage_ratio: [time-series chart from the same benchmark run]
requests_processing: [time-series chart from the same benchmark run]

mofosyne added the Review Complexity: Medium label on May 25, 2024
mofosyne (Collaborator) commented:

If merged, don't forget to check #7402

bartowski1182 (Contributor) commented:

Would we ideally remake any quants from before this change? How impactful is it?

jaime-m-p (Collaborator, author) commented:

> If merged, don't forget to check #7402

smaug-bpe works fine now.

jaime-m-p (Collaborator, author) commented:

> Would we ideally remake any quants from before this change? How impactful is it?

So far, none of these changes modify the GGUF processing.
I'm "hardcoding" some variable flags for now, since GGUF files do not currently contain this information.

Note: this is only true for the BPE tokenizer fixes; to fix the SPM tokenizer I had to modify convert-hf-to-gguf.py: 5b61c04
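
A hedged sketch of what persisting those flags in GGUF metadata could eventually look like with the gguf-py writer; the key names below are illustrative assumptions, not part of this PR:

    from gguf import GGUFWriter

    writer = GGUFWriter("vocab.gguf", arch="llama")
    # hypothetical keys: GGUF files do not carry these flags yet
    writer.add_bool("tokenizer.ggml.add_space_prefix", True)
    writer.add_bool("tokenizer.ggml.ignore_merges", False)
    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.close()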

jaime-m-p added 3 commits May 25, 2024 21:30
Initialize values when loading the model vocab.
For now, only <mask> token, needed for 'jina-v2'.
bartowski1182 (Contributor) commented:

@jaime-m-p would it be easier if I just opened a PR against your fork for Smaug and we get it all in at once? Or would you rather merge this and then get the Smaug changes in?

Review of llama.cpp (outdated), lines 2081 to 2086:
// tokenizer flags
bool tokenizer_add_space_prefix = true;
bool tokenizer_special_add_bos = false;
bool tokenizer_special_add_eos = false;
bool tokenizer_ignore_merges = false;
bool tokenizer_mask_lstrip = false;
teleprint-me (Contributor) commented May 25, 2024:
Okay. Looks good. I'll see if I can target these member variables.

Comment on lines +12282 to +12290
case LLAMA_VOCAB_PRE_TYPE_LLAMA3:
regex_exprs = {
// original regex from tokenizer.json
//"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",

// adapted: https://github.com/ggerganov/llama.cpp/pull/6920#issuecomment-2080233989
"(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
};
break;
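
A quick sanity check that the case-expanded alternation above splits text like the original (?i:...) pattern; a sketch assuming the third-party Python regex package for \p{...} support:

    import regex

    ORIGINAL = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
    ADAPTED  = r"(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"

    sample = "I'LL say it's 12345 tokens...\n  done"
    # exotic case folds (e.g. the long s) can still differ; ASCII suffixes agree
    assert regex.findall(ORIGINAL, sample) == regex.findall(ADAPTED, sample)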
teleprint-me (Contributor) commented May 25, 2024:

This is fine for now because they're still required, but we shouldn't require these in the future; it will be error prone. I'll be embedding these in the pre_tokenizer field instead so we can read them directly.

jaime-m-p (Collaborator, author) commented:

Yes, the variable regex_exprs should be stored and loaded like the other tokenizer 'flags'.

Comment on lines 12356 to 12365
default:
GGML_ASSERT(false);
// default regex for BPE tokenization pre-processing
regex_exprs = {
"[\\p{P}\\$\\+<=>\\^~\\|]+",
"'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
"\\p{N}+",
"[0-9][0-9][0-9]",
};
break;
}
teleprint-me (Contributor) commented:

This isn't correct. The default should be the GPT-2 pre-tokenizer; otherwise, it should use the defined pre-tokenizer.
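
For reference, the original GPT-2 pre-tokenizer pattern from OpenAI's encoder.py, which a default case could fall back to (shown with the Python regex package; splitting behavior only, not the BPE merges):

    import regex

    GPT2_PRETOKENIZE = r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"
    print(regex.findall(GPT2_PRETOKENIZE, "Hello world, it's 2024!"))
    # ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']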

Review of llama.cpp (outdated):
@@ -12686,7 +12731,7 @@ static void tokenizer_st_partition(const llama_vocab & vocab, std::forward_list<

     // if a fragment is text ( not yet processed )
     if (fragment.type == FRAGMENT_BUFFER_VARIANT_TYPE_RAW_TEXT) {
-        auto * raw_text = &(fragment.raw_text);
+        auto & raw_text = fragment.raw_text;
teleprint-me (Contributor) commented:

Is this on the stack?

jaime-m-p (Collaborator, author) commented May 25, 2024:

auto & raw_text = fragment.raw_text;

expands to

const std::string & raw_text = fragment.raw_text;

It's a reference, so it takes only 8 bytes on the stack (like a pointer).

In this case, you can read it as syntactic sugar for a const pointer to a const value:

std::string const * const raw_text = &(fragment.raw_text);

Comment on lines +342 to +345
# "llama-spm", # SPM
# "phi-3", # SPM
# "bert-bge", # WPM
# "jina-v2-en", # WPM
teleprint-me (Contributor) commented:

Yeah, I don't think these are needed. Maybe for WPM, but WPM (BERT) defaults to BPE (GPT). I still don't know enough about it yet.

teleprint-me (Contributor) commented May 25, 2024:

Phi and Deepseek are BPE (GPT-2). I was able to get Phi 1, 1.5, and 2 all working correctly.

jaime-m-p (Collaborator, author) commented:

> @jaime-m-p would it be easier if I just opened a PR against your fork for Smaug and we get it all in at once? Or would you rather merge this and then get the Smaug changes in?

@bartowski1182 I think you can merge now; the issues are not in your added code.
It will work as well (or as badly) as the other BPE models.
Once this PR is merged it should work as expected.

github-actions bot added the documentation, build, script, android, Nvidia GPU, nix, Vulkan, examples, devops, server, ggml, SYCL, Apple Metal, and Kompute labels on Jun 14, 2024
jaime-m-p (Collaborator, author) commented:

@ggerganov
I'm done for now.

The model deepseek-llm only fails with the unicode character \u2029 in some cases, like "\u2029 \uA3E4".
The string ends up with one extra token due to an extra split that AutoTokenizer seems to skip.

The model falcon has similar issues (maybe related); an extra split is performed:

falcon  : [..., 106, 227, 204, 168, 129, 221, 182, 236, ...]
Expected: [..., 106, 227,   2723,   129, 221, 182, 236, ...]

I may return to this later; in the meantime my detokenizer fixes are blocked.

ggerganov (Owner) commented Jun 18, 2024:

Thank you for the fixes; feel free to merge.

Do you have ideas on how to resolve the tokenization problems with added tokens? For example, deepseek-coder currently fails to tokenize ü correctly:

make -j tests && ./tests/test-tokenizer-0 ./models/ggml-vocab-deepseek-coder.gguf
src: 'Führer'
res: 'Führer'
tok: 37 2864 71 6247 
main : failed test:    'Führer'
main : detokenized to: 'Führer' instead of 'Führer'
main : expected tokens:     37 'F',  32009 'ü',     71 'h',   6247 'rer', 
main : got tokens:          37 'F',   2864 'ü',     71 'h',   6247 'rer',

The reason, AFAIK, is that AutoTokenizer treats added tokens (such as 32009 'ü' in this case) with higher priority (some discussion here: #7144).
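
A sketch of the AutoTokenizer behavior being described; the checkpoint id is an assumption, but any deepseek-coder tokenizer with added token 32009 'ü' should reproduce it:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")
    ids = tok.encode("Führer", add_special_tokens=False)
    print(ids)  # per the log above: [37, 32009, 71, 6247]
    # added tokens are matched against the raw text before the BPE merges run,
    # which is why added token 32009 'ü' wins over merge-produced 2864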

jaime-m-p (Collaborator, author) commented:

@ggerganov
I'm planning to open another PR to fix detokenizer issues (including that).

test-tokenizer-0 is not doing the same 'encoding' as convert-hf-to-gguf-update.py.

In test-tokenizer-0.cpp:

    const bool add_special = false;
    for (const auto & test_kv : k_tests) {
        const std::vector<llama_token> res = llama_tokenize(ctx, test_kv.first, add_special);

In convert-hf-to-gguf-update.py:

    res = tokenizer.encode(text, add_special_tokens=False)

AutoTokenizer.encode() parses special tokens by default, so we need to do the same:

llama.cpp/tests/test-tokenizer-0.cpp
- llama_tokenize(ctx, test_kv.first, add_special);
+ llama_tokenize(ctx, test_kv.first, add_special, true);

Tests passed

jaime-m-p merged commit 37bef89 into ggerganov:master on Jun 18, 2024
66 checks passed
@@ -766,9 +772,16 @@ std::vector<std::string> unicode_regex_split(const std::string & text, const std
     bpe_offsets = unicode_regex_split_stl(text_collapsed, regex_expr_collapsed, bpe_offsets);
 } else {
     // no unicode category used, we can use std::wregex directly
-    const std::wstring wtext = unicode_wstring_from_utf8(text);
teleprint-me (Contributor) commented:
Did you intend to remove this?

jaime-m-p (Collaborator, author) commented:

It's still here (line 778):

llama.cpp/unicode.cpp

Lines 777 to 783 in 37bef89

// std::wregex \s does not mach non-ASCII whitespaces, using 0x0B as fallback
std::wstring wtext(cpts.begin(), cpts.end());
for (size_t i = 0; i < wtext.size(); ++i) {
if (wtext[i] > 0x7F && unicode_cpt_flags(wtext[i]).is_whitespace) {
wtext[i] = 0x0B;
}
}

Since unicode_wstring_from_utf8(text) did the same as const auto cpts = unicode_cpts_from_utf8(text) (just copy the unicode codepoints), we now build wtext from cpts and then collapse the unicode whitespaces.
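
The same fallback trick, sketched in Python for illustration (str.isspace approximates unicode_cpt_flags(...).is_whitespace): collapse non-ASCII whitespace to 0x0B so that an ASCII-only \s class, like std::wregex's, still catches it:

    def collapse_unicode_whitespace(text: str) -> str:
        # map any non-ASCII whitespace codepoint to VERTICAL TAB (0x0B)
        return "".join(
            "\x0b" if ord(ch) > 0x7F and ch.isspace() else ch
            for ch in text
        )

    assert collapse_unicode_whitespace("a\u00A0b\u2029c") == "a\x0bb\x0bc"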

teleprint-me (Contributor) commented:

Thanks for the reply. I didn't see that (between the lack of sleep, heat, and a killer headache) 😅.

Comment on lines +779 to 787
for (size_t i = 0; i < wtext.size(); ++i) {
if (wtext[i] > 0x7F && unicode_cpt_flags(wtext[i]).is_whitespace) {
wtext[i] = 0x0B;
}
}

//printf("text: %s\n", text.c_str());
//printf("regex_expr: %s\n", regex_expr.c_str());
bpe_offsets = unicode_regex_split_stl(wtext, wregex_expr, bpe_offsets);
teleprint-me (Contributor) commented:
wtext is still applied here.

"Lo": CODEPOINT_FLAG_LETTER, # Other Letter
"Lt": CODEPOINT_FLAG_LETTER, # Titlecase Letter
"Lu": CODEPOINT_FLAG_LETTER, # Uppercase Letter
"L&": CODEPOINT_FLAG_LETTER, # Cased Letter
teleprint-me (Contributor) commented:
I cannot find any reference to L&. What is this?

jaime-m-p (Collaborator, author) commented:

I found it here: https://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt.

And the description here: https://www.regular-expressions.info/unicode.html.

> \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
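
An illustrative check, using only Python's standard unicodedata module, that \p{L&} is simply the union of the Ll, Lu and Lt categories:

    import unicodedata

    def is_cased_letter(ch: str) -> bool:
        # "L&" / "Cased_Letter" == general category Ll, Lu or Lt
        return unicodedata.category(ch) in ("Ll", "Lu", "Lt")

    assert is_cased_letter("a")           # Ll, lowercase
    assert is_cased_letter("A")           # Lu, uppercase
    assert is_cased_letter("\u01C5")      # Lt, titlecase 'Dž'
    assert not is_cased_letter("\u30FC")  # Lm, a modifier letter, is not cased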

teleprint-me (Contributor) commented Jun 21, 2024:

That's why! UnicodeData.txt doesn't even define it.

12:58:44 | /mnt/valerie/public/gpt
(.venv) git:(main | Δ) λ python -i gen/test.py
done!
>>> categories = set()
>>> for cpt in codepoints:
...     if cpt.general_category:
...             categories.add(cpt.general_category)
... 
>>> categories
{'Nl', 'Pf', 'Pd', 'Co', 'Cc', 'Lu', 'Me', 'Pe', 'Sk', 'Lt', 'Cf', 'Lm', 'Lo', 'Pi', 'Zp', 'Ll', 'No', 'Pc', 'Sm', 'Mn', 'Zl', 'Zs', 'Nd', 'Mc', 'So', 'Ps', 'Po', 'Cs', 'Sc'}
>>> "L&" in categories
False

teleprint-me (Contributor) commented Jun 21, 2024:

L& is a regular expression property that's used as an alias.

> The symbol "L&" indicates characters of General_Category Lu, Ll, or Lt (uppercase, lowercase, or titlecase letter).

> L& as used in these comments is an alias for the derived LC value (cased letter) for the General_Category property, as documented in PropertyValueAliases.txt.

https://www.unicode.org/reports/tr44/#Comments

This seems more applicable to the Perl-compatible regular expression properties:

https://perldoc.perl.org/perluniprops

I don't know; still learning. It's possible I'm missing plenty more.

teleprint-me (Contributor) commented:
Yeah, it will never be included:

    # codepoint category flags
    codepoint_flags[cpt] = UNICODE_CATEGORY_TO_FLAG[categ]

You would need to parse the Specials and LineBreak files to include special codepoint sets.

jaime-m-p (Collaborator, author) commented:

All L& unicode ranges in LineBreak.txt are already present in UnicodeData.txt as Ll, Lu, etc.
Since it is never used, maybe we can just remove L& from the map UNICODE_CATEGORY_TO_FLAG.
