Currently I need to train a SentencePiece model. Since the input files are too large (~80 GB, 500M lines), I set up the program to use a TSV file instead of raw text files as input. Each line of the input file is simply "word\tfrequency". I trained the model with the unigram LM. After training, I found that some short but high-frequency words are tokenized into character sequences. Below is an example:
echo "i am going to the park on monday" | spm_encode --model=sp-lm.model
▁ i ▁ a m ▁going ▁to ▁the ▁park ▁ o n ▁monday
However the frequency is:
i 144250667
am 5376197
on 79402723
park 1233890
The result is somewhat counterintuitive. Did I make a mistake, is this how the algorithm actually behaves, or is it something else? The word-frequency file was generated from an untokenized dataset and has ~60M lines. My training command is
This is a very difficult bug to fix. The frequency information is not used when extracting the initial seed tokens, because frequencies are not available when high-frequency substrings are extracted with the suffix array.
The only workaround for now is to write high-frequency tokens multiple times in the TSV file.
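That workaround can be sketched as a small preprocessing script that repeats each TSV line roughly in proportion to the log of its frequency, so frequent words also appear often in the text the suffix-array seed extraction scans. The log scaling, the repeat cap, and the file names in the usage comment are my assumptions, not anything SentencePiece prescribes; note also that repeating lines verbatim inflates the total frequency mass seen by the later EM step, which you may want to compensate for by dividing the frequency column across the copies.

```python
import math
import sys


def expand_tsv(lines, max_repeats=100):
    """Repeat each "word<TAB>freq" line in proportion to log10(freq).

    This makes high-frequency words show up many times in the file, so
    the suffix-array seed extraction (which ignores the frequency
    column) is more likely to keep them as seed tokens.  The log10
    scaling and the cap are heuristics, not part of SentencePiece.
    """
    out = []
    for line in lines:
        word, freq = line.rstrip("\n").split("\t")
        # At least 1 copy; roughly one copy per order of magnitude of freq.
        repeats = min(max_repeats, max(1, int(math.log10(max(int(freq), 10)))))
        out.extend([f"{word}\t{freq}"] * repeats)
    return out


if __name__ == "__main__":
    # e.g.  python expand_tsv.py < word_freq.tsv > word_freq_expanded.tsv
    sys.stdout.write("\n".join(expand_tsv(sys.stdin)) + "\n")
```

The expanded file can then be passed to spm_train in place of the original TSV; whether one order-of-magnitude-per-copy is the right scaling is something to tune empirically.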
Thank you for the clarification. I also trained a model and set the model_type to BPE. Despite it being slightly slower, the results now appear much more in line with what is typically expected.
The only workaround for now is to write high-frequency tokens multiple times in the TSV file.
@taku910 Why is this the case? You would expect every word of the TSV file to be unique, so it's quite surprising that something special happens when you repeat a word.
This is a very difficult bug to fix. The frequency information is not used when extracting the initial seed tokens, because frequencies are not available when high-frequency substrings are extracted with the suffix array.
Does that mean that the unigram model doesn't use the frequency information in the .tsv <word \tab freq> input file at all?