Tokenizer BPE fixes #7530
Conversation
If merged, don't forget to check #7402.
Would we ideally remake any quants from before this change? How impactful is it?
smaug-bpe works fine now.
Until now, none of these changes modify the GGUF processing. Note: this is only true for the BPE tokenizer fixes; for fixing the SPM tokenizer I had to modify the
Initialize values when loading the model vocab.
For now, only <mask> token, needed for 'jina-v2'.
@jaime-m-p would it be easier if I just opened a PR against your fork for Smaug and we get it all in at once? Or would you rather merge this and then get the Smaug changes in?
llama.cpp
Outdated
// tokenizer flags
bool tokenizer_add_space_prefix = true;
bool tokenizer_special_add_bos  = false;
bool tokenizer_special_add_eos  = false;
bool tokenizer_ignore_merges    = false;
bool tokenizer_mask_lstrip      = false;
Okay. Looks good. I'll see if I can target these member variables.
case LLAMA_VOCAB_PRE_TYPE_LLAMA3:
    regex_exprs = {
        // original regex from tokenizer.json
        //"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",

        // adapted: https://github.com/ggerganov/llama.cpp/pull/6920#issuecomment-2080233989
        "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
    };
    break;
This is fine for now because they're still required, but we shouldn't require these in the future; it will be error-prone. I'll be embedding these in the pre_tokenizer field instead so we can just read from them directly.
Yes, the variable regex_exprs should be stored and loaded like the other tokenizer 'flags'.
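A minimal sketch of what that could look like, assuming a hypothetical array-valued GGUF key named tokenizer.ggml.pretokenizer_regexes (the key name and the wrapper function are assumptions; only the gguf_* accessors are the existing ggml metadata API):

#include <string>
#include <vector>

#include "ggml.h" // gguf_* metadata accessors

// Hypothetical sketch: read the pre-tokenizer regexes from GGUF metadata the
// same way the boolean tokenizer flags are read, and fall back to the
// hard-coded per-model switch when the key is absent.
static std::vector<std::string> load_regex_exprs(const struct gguf_context * ctx) {
    std::vector<std::string> regex_exprs;
    const int kid = gguf_find_key(ctx, "tokenizer.ggml.pretokenizer_regexes"); // assumed key name
    if (kid < 0) {
        return regex_exprs; // not present: caller keeps the built-in defaults
    }
    const int n = gguf_get_arr_n(ctx, kid);
    for (int i = 0; i < n; ++i) {
        regex_exprs.push_back(gguf_get_arr_str(ctx, kid, i)); // copy each regex string
    }
    return regex_exprs;
}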
default:
    GGML_ASSERT(false);
    // default regex for BPE tokenization pre-processing
    regex_exprs = {
        "[\\p{P}\\$\\+<=>\\^~\\|]+",
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        "\\p{N}+",
        "[0-9][0-9][0-9]",
    };
    break;
}
This isn't correct. The default should be the GPT-2 pre-tokenizer. Otherwise, it should use the defined pre-tokenizer.
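For reference, a minimal sketch of that suggestion (the wrapper function is illustrative; only the regex string is the GPT-2 pre-tokenizer expression already present in the default list above):

#include <string>
#include <vector>

// Sketch only: fall back to the GPT-2 pre-tokenizer regex when the
// pre-tokenizer type is unknown, instead of an ad-hoc regex list.
static std::vector<std::string> default_gpt2_regex_exprs() {
    return {
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
    };
}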
llama.cpp
Outdated
@@ -12686,7 +12731,7 @@ static void tokenizer_st_partition(const llama_vocab & vocab, std::forward_list<

    // if a fragment is text ( not yet processed )
    if (fragment.type == FRAGMENT_BUFFER_VARIANT_TYPE_RAW_TEXT) {
-       auto * raw_text = &(fragment.raw_text);
+       auto & raw_text = fragment.raw_text;
Is this on the stack?
auto & raw_text = fragment.raw_text;
expands to
const std::string & raw_text = fragment.raw_text;
It's a reference, so it only takes 8 bytes on the stack (like a pointer).
In this case, you can read it as syntactic sugar for a const pointer to a const value:
std::string const * const raw_text = &(fragment.raw_text);
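A tiny self-contained illustration of that point (not from the PR; the struct and the string are made up): the reference and the pointer bind to the same member, and neither copies it.

#include <cassert>
#include <string>

struct fragment_t {
    std::string raw_text;
};

int main() {
    fragment_t fragment{"hello world"};

    const std::string & ref       = fragment.raw_text;    // reference form
    const std::string * const ptr = &(fragment.raw_text); // equivalent pointer form

    assert(&ref == ptr);          // both refer to the same object, no copy is made
    assert(ref == "hello world");
    return 0;
}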
# "llama-spm", # SPM | ||
# "phi-3", # SPM | ||
# "bert-bge", # WPM | ||
# "jina-v2-en", # WPM |
Yeah, I don't think these are needed. Maybe for WPM, but WPM (BERT) defaults to BPE (GPT) anyway. Still don't know enough about it yet.
Phi and Deepseek are BPE (GPT-2). I was able to get Phi 1, 1.5, and 2 all working correctly.
@bartowski1182 I think you can merge now; the issues are not in your added code.
@ggerganov The model The model
Maybe I will return to this later.
Thank you for the fixes - feel free to merge. Do you have ideas how to resolve the tokenization problems with added tokens? For example, with:
make -j tests && ./tests/test-tokenizer-0 ./models/ggml-vocab-deepseek-coder.gguf
The reason AFAIK is that AutoTokenizer treats added tokens (such as
@ggerganov
llama.cpp/tests/test-tokenizer-0.cpp, lines 195 to 198 in 91c188d
llama.cpp/convert-hf-to-gguf-update.py, line 314 in 91c188d

llama.cpp/tests/test-tokenizer-0.cpp:
-    llama_tokenize(ctx, test_kv.first, add_special);
+    llama_tokenize(ctx, test_kv.first, add_special, true);

Tests passed.
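For context, the fourth argument is the parse_special flag of the llama_tokenize helper in common.h; a hedged sketch of opting in (the wrapper name is made up, and the helper signature is assumed to be (ctx, text, add_special, parse_special)):

#include <string>
#include <vector>

#include "common.h" // llama_tokenize std::string helper

// Sketch: with parse_special = true the tokenizer is allowed to match
// added/special tokens embedded in the input text, which is what
// AutoTokenizer does for these vocabularies.
static std::vector<llama_token> tokenize_with_special(
        llama_context * ctx, const std::string & text, bool add_special) {
    return llama_tokenize(ctx, text, add_special, /*parse_special=*/true);
}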
@@ -766,9 +772,16 @@ std::vector<std::string> unicode_regex_split(const std::string & text, const std

        bpe_offsets = unicode_regex_split_stl(text_collapsed, regex_expr_collapsed, bpe_offsets);
    } else {
        // no unicode category used, we can use std::wregex directly
        const std::wstring wtext = unicode_wstring_from_utf8(text);
Did you intend to remove this?
It's still here (line 778):
Lines 777 to 783 in 37bef89
// std::wregex \s does not mach non-ASCII whitespaces, using 0x0B as fallback
std::wstring wtext(cpts.begin(), cpts.end());
for (size_t i = 0; i < wtext.size(); ++i) {
    if (wtext[i] > 0x7F && unicode_cpt_flags(wtext[i]).is_whitespace) {
        wtext[i] = 0x0B;
    }
}
Since unicode_wstring_from_utf8(text) is the same as const auto cpts = unicode_cpts_from_utf8(text), we just copy the unicode codepoints and then collapse the unicode whitespaces.
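A small standalone demo of why that fallback is needed (not from the PR): with the default locale, std::wregex \s typically does not match the non-ASCII no-break space U+00A0, while it does match the 0x0B byte that the code substitutes in.

#include <cstdio>
#include <regex>
#include <string>

int main() {
    const std::wregex ws(L"\\s");

    const std::wstring nbsp(1, wchar_t(0x00A0)); // U+00A0 no-break space (non-ASCII whitespace)
    const std::wstring vtab(1, wchar_t(0x000B)); // 0x0B vertical tab (ASCII whitespace)

    // on typical implementations this prints 0 and then 1
    std::printf("\\s matches U+00A0: %d\n", (int) std::regex_search(nbsp, ws));
    std::printf("\\s matches 0x0B  : %d\n", (int) std::regex_search(vtab, ws));
    return 0;
}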
Thanks for the reply. I didn't see that (between the lack of sleep, the heat, and a killer headache) 😅.
for (size_t i = 0; i < wtext.size(); ++i) {
    if (wtext[i] > 0x7F && unicode_cpt_flags(wtext[i]).is_whitespace) {
        wtext[i] = 0x0B;
    }
}

//printf("text: %s\n", text.c_str());
//printf("regex_expr: %s\n", regex_expr.c_str());
bpe_offsets = unicode_regex_split_stl(wtext, wregex_expr, bpe_offsets);
wtext is still used here.
"Lo": CODEPOINT_FLAG_LETTER, # Other Letter | ||
"Lt": CODEPOINT_FLAG_LETTER, # Titlecase Letter | ||
"Lu": CODEPOINT_FLAG_LETTER, # Uppercase Letter | ||
"L&": CODEPOINT_FLAG_LETTER, # Cased Letter |
I cannot find any reference to L&. What is this?
I found it here: https://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt.
And the description here: https://www.regular-expressions.info/unicode.html:
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
That's why! The UnicodeData.txt file doesn't have or even define this.
12:58:44 | /mnt/valerie/public/gpt
(.venv) git:(main | Δ) λ python -i gen/test.py
done!
>>> categories = set()
>>> for cpt in codepoints:
... if cpt.general_category:
... categories.add(cpt.general_category)
...
>>> categories
{'Nl', 'Pf', 'Pd', 'Co', 'Cc', 'Lu', 'Me', 'Pe', 'Sk', 'Lt', 'Cf', 'Lm', 'Lo', 'Pi', 'Zp', 'Ll', 'No', 'Pc', 'Sm', 'Mn', 'Zl', 'Zs', 'Nd', 'Mc', 'So', 'Ps', 'Po', 'Cs', 'Sc'}
>>> "L&" in categories
False
L& is a regular expression property that's used as an alias.
The symbol "L&" indicates characters of General_Category Lu, Ll, or Lt (uppercase, lowercase, or titlecase letter).
L&, as used in these comments, is an alias for the derived LC value (cased letter) of the General_Category property, as documented in PropertyValueAliases.txt.
https://www.unicode.org/reports/tr44/#Comments
This seems to be more applicable to the Perl-compatible regular expression property.
https://perldoc.perl.org/perluniprops
I don't know. Still learning. Possibly I'm missing plenty more.
Yeah, it will never be included:
# codepoint category flags
codepoint_flags[cpt] = UNICODE_CATEGORY_TO_FLAG[categ]
You would need to parse Special and Line Break to include special codepoint sets.
All L& unicode ranges in LineBreak.txt are already present in UnicodeData.txt as Ll, Lu, ...
Since it is never used, maybe we can just remove L& from the map UNICODE_CATEGORY_TO_FLAG.
Modifications to make BPE tokenizer match AutoTokenizer.
Tested with vocabs from models: