High frequency token segmented into letter sequence when input is a tsv file #967

Open
TingxunShi opened this issue Jan 30, 2024 · 4 comments

TingxunShi commented Jan 30, 2024

Hi,

I need to train a SentencePiece model. Because the raw input files are too large (~80 GB, 500M lines), I set up the program to use a TSV file as input instead of raw text files. Each line of the input file has the format "word\tfrequency". I trained the model with the unigram LM. After training, I found that some short but high-frequency words are tokenized into character sequences. Below is an example:

echo "i am going to the park on monday" | spm_encode --model=sp-lm.model
▁ i ▁ a m ▁going ▁to ▁the ▁park ▁ o n ▁monday

However, the frequencies of these words in the TSV file are:

i       144250667
am      5376197
on      79402723
park    1233890

The result is somewhat counterintuitive. Did I make a mistake, is this how the algorithm is supposed to behave, or is something else going on? The word frequency file was generated from an untokenized dataset and has ~60M lines. My training command is:

spm_train \
    --input=counts \
    --input_format=tsv \
    --model_prefix=prefix \
    --vocab_size=48000 \
    --character_coverage=0.9999 \
    --num_threads=50 \
    --max_sentence_length=2048 \
    --normalization_rule_name=identity \
    --unk_surface="<unk>" \
    --train_extremely_large_corpus \
    --user_defined_symbols="<foo>,<bar>" \
    --byte_fallback \
    --split_digits
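
To see how many high-frequency words are affected, the check done by hand with spm_encode above can be automated. This is a minimal sketch, assuming the Python `sentencepiece` package is installed and that the trained model is saved as `sp-lm.model`; `find_split_words` and `load_encoder` are hypothetical helper names, not part of SentencePiece.

```python
from typing import Callable, Dict, Iterable, List


def find_split_words(
    encode: Callable[[str], List[str]], words: Iterable[str]
) -> Dict[str, List[str]]:
    """Return the words that the tokenizer splits into more than one piece."""
    split = {}
    for word in words:
        pieces = encode(word)
        if len(pieces) > 1:
            split[word] = pieces
    return split


def load_encoder(model_path: str) -> Callable[[str], List[str]]:
    """Wrap a trained SentencePiece model as a text -> pieces function."""
    # Imported lazily so find_split_words itself has no external dependency.
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=model_path)
    return lambda text: sp.encode(text, out_type=str)


# Example usage (requires the trained model file):
#   encode = load_encoder("sp-lm.model")
#   for word, pieces in find_split_words(encode, ["i", "am", "on", "park"]).items():
#       print(word, pieces)
```

Feeding it the top rows of the frequency file, rather than a hand-picked list, would show whether the problem is limited to very short words.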

taku910 added the bug label Jan 30, 2024

taku910 (Collaborator) commented Jan 30, 2024

This is a very difficult bug to fix. The frequency information is not used when extracting the initial seed tokens, because the frequencies are not available when high-frequency substrings are extracted with the suffix array.

The only workaround for now is to write high-frequency tokens multiple times in the TSV file.
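
One possible reading of this workaround, as a hedged sketch: rewrite the TSV so that frequent words occupy several rows. The thread does not say how many repetitions are needed, so the rule used here (one row per decimal digit of magnitude, capped at `max_repeats`) is purely an assumption, as are the helper names `expand_rows` and `expand_tsv`. Splitting the count across the repeated rows, so that each word's total is unchanged, is likewise a choice made for this sketch and not something the maintainer specified.

```python
import math
from typing import Iterable, Iterator, Tuple


def expand_rows(
    rows: Iterable[Tuple[str, int]], max_repeats: int = 16
) -> Iterator[Tuple[str, int]]:
    """Emit each (word, count) row one or more times, splitting the count."""
    for word, freq in rows:
        # Assumption: the number of repeats grows with log10 of the frequency.
        wanted = int(math.log10(freq)) if freq > 0 else 1
        repeats = max(1, min(max_repeats, wanted))
        # Spread the count over the rows so the total stays the same.
        base, extra = divmod(freq, repeats)
        for k in range(repeats):
            yield word, base + (1 if k < extra else 0)


def expand_tsv(src_path: str, dst_path: str, max_repeats: int = 16) -> None:
    """Read a "word\\tfrequency" file and write the expanded version."""
    with open(src_path, encoding="utf-8") as src, open(
        dst_path, "w", encoding="utf-8"
    ) as dst:
        parsed = (
            (word, int(freq))
            for word, freq in (line.rstrip("\n").split("\t") for line in src if line.strip())
        )
        for word, count in expand_rows(parsed, max_repeats):
            dst.write(f"{word}\t{count}\n")


# Example usage: expand_tsv("counts", "counts.expanded"), then pass
# --input=counts.expanded to spm_train.
```

Whether this is enough to get short words such as "i" or "on" into the seed vocabulary would need to be checked by retraining and re-encoding.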

TingxunShi (Author) commented

Thank you for the clarification. I also trained a model with model_type set to BPE. Although it is slightly slower, the results are now much more in line with what one would expect.

bauwenst commented

> The only workaround now is to write high frequency tokens in multiple times in the TSV file

@taku910 Why is this the case? You would expect every word of the TSV file to be unique, so it's quite surprising that something special happens when you repeat a word.

DmitriiP20 commented Nov 20, 2024

> This is a very difficult bug to fix. This frequency information is not used when extracting the initial seed tokens. This is due to the fact that frequencies are not available to extract high frequency substrings in the Suffix Array.

Does that mean that the Unigram model doesn't use the frequency information from the .tsv input file (word<TAB>frequency) at all?
