Currently I need to train a SentencePiece model. Since the input files are too large (~80 GB, 500M lines), I set up the program to use a TSV file instead of raw text files as input. Each line of the input file is simply "word\tfrequency". I trained the model with the unigram LM. After training, I found that some short but high-frequency words are tokenized into character sequences. Below is an example:
echo "i am going to the park on monday" | spm_encode --model=sp-lm.model
▁ i ▁ a m ▁going ▁to ▁the ▁park ▁ o n ▁monday
However the frequency is:
i 144250667
am 5376197
on 79402723
park 1233890
The result is somewhat counterintuitive. Did I make a mistake, is this how the algorithm actually behaves, or is it something else? The word-frequency file was generated from an untokenized dataset and has ~60M lines. My training command is
This is a very difficult bug to fix. The frequency information is not used when extracting the initial seed tokens, because frequencies are not available when high-frequency substrings are extracted with the suffix array.
The only workaround for now is to write high-frequency tokens multiple times in the TSV file.
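That workaround can be sketched as a small preprocessing script that repeats each TSV line roughly in proportion to the log of its frequency, so frequent words also appear often in the text the suffix-array seed extraction scans. The log scaling, the repeat cap, and the file names in the usage comment are my assumptions, not anything SentencePiece prescribes; note also that repeating lines verbatim inflates the total frequency mass seen by the later EM step, which you may want to compensate for by dividing the frequency column across the copies.

```python
import math
import sys


def expand_tsv(lines, max_repeats=100):
    """Repeat each "word<TAB>freq" line in proportion to log10(freq).

    This makes high-frequency words show up many times in the file, so
    the suffix-array seed extraction (which ignores the frequency
    column) is more likely to keep them as seed tokens.  The log10
    scaling and the cap are heuristics, not part of SentencePiece.
    """
    out = []
    for line in lines:
        word, freq = line.rstrip("\n").split("\t")
        # At least 1 copy; roughly one copy per order of magnitude of freq.
        repeats = min(max_repeats, max(1, int(math.log10(max(int(freq), 10)))))
        out.extend([f"{word}\t{freq}"] * repeats)
    return out


if __name__ == "__main__":
    # e.g.  python expand_tsv.py < word_freq.tsv > word_freq_expanded.tsv
    sys.stdout.write("\n".join(expand_tsv(sys.stdin)) + "\n")
```

The expanded file can then be passed to spm_train in place of the original TSV; whether one order-of-magnitude-per-copy is the right scaling is something to tune empirically.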
Thank you for the clarification. I also trained a model and set the model_type to BPE. Despite it being slightly slower, the results now appear much more in line with what is typically expected.
The only workaround for now is to write high-frequency tokens multiple times in the TSV file.
@taku910 Why is this the case? You would expect every word of the TSV file to be unique, so it's quite surprising that something special happens when you repeat a word.
This is a very difficult bug to fix. The frequency information is not used when extracting the initial seed tokens, because frequencies are not available when high-frequency substrings are extracted with the suffix array.
Does that mean that the unigram model doesn't use the frequency information in the .tsv <word \tab freq> input file at all?