added cached tokenizer #35218
base: main
Conversation
🔥 🔥 🔥
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Yup, a large part of the overhead is gone now. I ran a quick check and you save roughly 1 second per batch on the Open ASR Leaderboard for whisper-turbo. I think there may still be some overhead from the tokenizer, but I'm not sure how much exactly. You should recompute the RTFx on the full test set.
Hey! We already have a state for _added_tokens_encoder, which should be filling this! The main issue is just that when performance is required, _added_tokens_encoder should be used instead of added_tokens_encoder! But this is very, very welcome! Let's reduce the overhead!
""" | ||
return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])} |
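For context, here is a minimal sketch of the distinction being pointed out, assuming a tokenizer that keeps a private content-to-id dict in sync; the names are illustrative, not the exact transformers internals:

```python
# Illustrative sketch only, not the actual transformers source.
class SketchTokenizer:
    def __init__(self):
        self._added_tokens_decoder = {}  # id -> AddedToken
        self._added_tokens_encoder = {}  # content -> id, kept in sync when tokens are added

    @property
    def added_tokens_encoder(self):
        # Rebuilt and re-sorted on every access: cheap once, costly in hot loops.
        return {
            k.content: v
            for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])
        }

# In performance-critical paths (e.g. decoding in a loop), reading
# tokenizer._added_tokens_encoder directly skips the rebuild entirely.
```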
Suggested change:
- return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}
+ return self._added_tokens_encoder
This would not work, as you probably need the content to be sorted, which is why we have the non-sorted _added_tokens_encoder. We can actually sort it (define it as an OrderedDict) since we deprecated Python <= 3.9! Also, you could benchmark with both slow and fast tokenizers; I don't know which one you are using!
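A hedged sketch of that idea: keep the private mapping sorted by id at insertion time, so the public property can simply return it. The names below are illustrative; on the Python versions transformers supports, plain dicts preserve insertion order, so an OrderedDict is not strictly required.

```python
# Illustrative sketch: sort once when tokens are added instead of on every property access.
class SketchTokenizer:
    def __init__(self):
        self._added_tokens_decoder = {}  # id -> AddedToken
        self._added_tokens_encoder = {}  # content -> id, maintained in id order

    def _update_added_tokens(self, token_id, added_token):
        self._added_tokens_decoder[token_id] = added_token
        self._added_tokens_encoder[added_token.content] = token_id
        # Re-sort only when the set of added tokens changes.
        self._added_tokens_encoder = dict(
            sorted(self._added_tokens_encoder.items(), key=lambda item: item[1])
        )

    @property
    def added_tokens_encoder(self):
        return self._added_tokens_encoder
```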
What does this PR do?
This PR introduces a caching mechanism for the added_tokens_encoder property in tokenizers to improve performance. Previously, the added_tokens_encoder mapping was recomputed every time the property was accessed, leading to redundant computation during tasks that frequently access it, such as tokenization or decoding.
Motivation and Context
The motivation for this change is to optimize tokenizer performance, especially in workflows that repeatedly access the added_tokens_encoder property. By caching the result, this PR reduces overhead and improves runtime efficiency without altering the existing behavior of the library.
Key changes:
The added_tokens_encoder mapping is now cached on the first access and reused for subsequent calls.
The caching mechanism is implemented in a way that is backward-compatible and avoids unnecessary recomputation (see the sketch below).
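As a rough illustration of the pattern described above (the authoritative version is in this PR's diff; the cache attribute name here is hypothetical):

```python
# Sketch of the caching pattern: compute on first access, reuse afterwards,
# and invalidate whenever the added tokens change.
class CachedEncoderMixin:
    def __init__(self):
        self._added_tokens_decoder = {}          # id -> AddedToken
        self._added_tokens_encoder_cache = None  # hypothetical cache attribute

    @property
    def added_tokens_encoder(self):
        if self._added_tokens_encoder_cache is None:
            self._added_tokens_encoder_cache = {
                k.content: v
                for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])
            }
        return self._added_tokens_encoder_cache

    def _add_tokens(self, new_tokens):
        # Any mutation of the added tokens must drop the cache.
        self._added_tokens_decoder.update(new_tokens)
        self._added_tokens_encoder_cache = None
```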
Some benchmarks
Results are reported per dataset: Composite, AMI, Earnings22, Gigaspeech, LibriSpeech Clean, LibriSpeech Other, SPGISpeech, TEDLIUM, and Voxpopuli.
Benchmark scripts are available here: https://github.com/huggingface/open_asr_leaderboard/tree/main/transformers
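For a quick sanity check of the property access itself (separate from the full ASR runs above), a micro-benchmark along these lines can be used; the checkpoint id is only an example:

```python
# Hedged micro-benchmark sketch: times repeated accesses of added_tokens_encoder.
import timeit

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/whisper-large-v3-turbo")  # example checkpoint

elapsed = timeit.timeit(lambda: tokenizer.added_tokens_encoder, number=10_000)
print(f"10,000 accesses of added_tokens_encoder took {elapsed:.3f}s")
```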
Before submitting
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@ArthurZucker
@Vaibhavs10
This change was suggested by @pzelasko.