- I don't think this script uses any operation that can be accelerated on a GPU. It mostly needs raw Python string processing and dictionary operations to tokenize the text, so using a GPU will not speed it up.
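
  As a rough illustration of why this is CPU-bound: a dictionary-based tokenizer boils down to Python string splits and dict lookups, none of which map onto GPU kernels. The sketch below is hypothetical (the function and variable names are not taken from the actual script):

  ```python
  from collections import Counter

  def tokenize(text, vocab):
      # Plain Python string processing and dict lookups; nothing here runs on a GPU.
      tokens = text.lower().split()
      return [vocab.setdefault(tok, len(vocab)) for tok in tokens]

  def token_stats(texts):
      # Accumulate token frequencies with a Counter, again pure CPU work.
      vocab, counts = {}, Counter()
      for text in texts:
          counts.update(tokenize(text, vocab))
      return vocab, counts

  if __name__ == "__main__":
      vocab, counts = token_stats(["hello world", "hello again"])
      print(len(vocab), counts.most_common(2))
  ```

  Profiling a loop like this typically shows all of the time spent in Python-level string and dictionary work, which is why moving data to the GPU has no effect.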
- It seems that this script outputs tokenized statistics for whole episodes, while `data_stats` gives statistics per turn. Is this correct?

  Getting the statistics is horribly slow and, as far as I can tell, it does not use CUDA. If my assumptions are correct, why is that? Right now my best estimate is that I need more than 48 hours to generate statistics for my dataset on the CPU. Changing the code to `self.opt['no_cuda'] = False` does not change this behavior.
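
  Since the bottleneck is Python-level string processing rather than anything CUDA could run, one possible workaround (not something the script does out of the box; the episode format and helper below are assumed) is to spread the counting across CPU cores with `multiprocessing`. The sketch also illustrates the per-turn versus whole-episode distinction, assuming an episode is simply a list of turn strings:

  ```python
  from multiprocessing import Pool

  def count_tokens(episode):
      # Per-turn whitespace token counts plus the episode total (assumed data layout).
      per_turn = [len(turn.split()) for turn in episode]
      return sum(per_turn), per_turn

  if __name__ == "__main__":
      # Hypothetical episodes standing in for the real dataset.
      episodes = [["hello there", "hi how are you"], ["what time is it", "around noon"]]
      with Pool() as pool:
          results = pool.map(count_tokens, episodes)
      print(results)  # [(6, [2, 4]), (6, [4, 2])]
  ```

  Counting like this usually scales close to linearly with the number of worker processes, which is a much bigger lever here than the `no_cuda` flag.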