
added cached tokenizer #35218


Draft · Deep-unlearning wants to merge 2 commits into main

Conversation


@Deep-unlearning Deep-unlearning commented Dec 11, 2024

What does this PR do?

This PR introduces a caching mechanism for the added_tokens_encoder property in tokenizers to improve performance. Previously, the added_tokens_encoder mapping was recomputed every time the property was accessed, leading to redundant computation during tasks that frequently access it, such as tokenization or decoding.

Motivation and Context
The motivation for this change is to optimize tokenizer performance, especially in workflows that repeatedly access the added_tokens_encoder property. By caching the result, this PR reduces overhead and improves runtime efficiency without altering the existing behavior of the library.

Key changes:

The added_tokens_encoder mapping is now cached on the first access and reused for subsequent calls.
The caching mechanism is implemented in a way that is backward-compatible and avoids unnecessary recomputation (see the sketch below).
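
The diff itself is not reproduced here, but a minimal sketch of the caching pattern described above could look like the following. The class and attribute names are illustrative only, not the ones used in transformers:

```python
# Illustrative sketch only -- not the actual transformers implementation.
# Idea: compute the added-token content -> id mapping once, reuse it on later
# accesses, and invalidate the cache whenever the added tokens change.
class ToyTokenizer:
    def __init__(self):
        # plays the role of _added_tokens_decoder: id -> token content
        self._added_tokens_decoder = {}
        # hypothetical cache attribute
        self._cached_added_tokens_encoder = None

    @property
    def added_tokens_encoder(self):
        if self._cached_added_tokens_encoder is None:
            self._cached_added_tokens_encoder = {
                content: idx
                for idx, content in sorted(self._added_tokens_decoder.items())
            }
        return self._cached_added_tokens_encoder

    def add_token(self, content, idx):
        self._added_tokens_decoder[idx] = content
        self._cached_added_tokens_encoder = None  # invalidate on mutation


tok = ToyTokenizer()
tok.add_token("<extra_0>", 50000)
print(tok.added_tokens_encoder)  # computed on first access
print(tok.added_tokens_encoder)  # served from the cache
```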

Some benchmarks

Composite Results

| Model | Composite WER (%) | RTFx (with cache) | RTFx (without cache) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 7.92 | 278.32 | 202.95 | 36% |
| distil/whisper-distil-large-v3 | 7.52 | 282.46 | 214.42 | 32% |
| distil/whisper-distil-medium.en | 8.76 | 406.96 | 279.73 | 45% |
| openai/whisper-large | 7.94 | 167.43 | 143.76 | 16% |
| openai/whisper-large-v2 | 7.83 | 167.95 | 144.45 | 16% |
| openai/whisper-large-v3 | 7.44 | 169.26 | 145.51 | 16% |
| openai/whisper-large-v3-turbo | 7.83 | 268.72 | 197.98 | 36% |
| openai/whisper-medium.en | 8.09 | 222.49 | 182.13 | 22% |
| openai/whisper-small.en | 8.59 | 359.18 | 268.91 | 34% |
| openai/whisper-base.en | 10.32 | 483.69 | 320.67 | 50% |
| openai/whisper-tiny.en | 12.81 | 532.03 | 348.12 | 53% |

AMI Results

| Model | AMI WER (%) | RTFx (with cache) | RTFx (without cache) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 14.67 | 120.15 | 103.50 | 16% |
| distil/whisper-distil-large-v3 | 15.16 | 119.29 | 104.33 | 14% |
| distil/whisper-distil-medium.en | 16.12 | 189.32 | 152.03 | 25% |
| openai/whisper-large | 16.73 | 82.81 | 76.15 | 9% |
| openai/whisper-large-v2 | 16.74 | 85.65 | 79.49 | 7% |
| openai/whisper-large-v3 | 15.95 | 84.31 | 77.97 | 8% |
| openai/whisper-large-v3-turbo | 16.13 | 116.17 | 98.83 | 18% |
| openai/whisper-medium.en | 16.68 | 78.47 | 76.86 | 2% |
| openai/whisper-small.en | 17.93 | 197.70 | 168.88 | 17% |
| openai/whisper-base.en | 21.13 | 224.91 | 181.10 | 24% |
| openai/whisper-tiny.en | 24.24 | 271.98 | 228.77 | 19% |

Earnings22 Results

| Model | Earnings22 WER (%) | RTFx (with cache) | RTFx (without cache) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 12.19 | 279.17 | 212.11 | 32% |
| distil/whisper-distil-large-v3 | 11.79 | 281.64 | 219.27 | 28% |
| distil/whisper-distil-medium.en | 12.99 | 408.40 | 291.33 | 40% |
| openai/whisper-large | 12.91 | 156.36 | 138.56 | 13% |
| openai/whisper-large-v2 | 12.05 | 173.81 | 151.92 | 14% |
| openai/whisper-large-v3 | 11.29 | 171.74 | 149.66 | 15% |
| openai/whisper-large-v3-turbo | 11.63 | 274.35 | 202.67 | 35% |
| openai/whisper-medium.en | 12.63 | 251.39 | 204.49 | 23% |
| openai/whisper-small.en | 12.97 | 390.44 | 303.05 | 29% |
| openai/whisper-base.en | 15.09 | 554.06 | 370.98 | 49% |
| openai/whisper-tiny.en | 19.12 | 439.19 | 323.27 | 36% |

GigaSpeech Results

| Model | GigaSpeech WER (%) | RTFx (with cache) | RTFx (without cache) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 10.32 | 242.64 | 178.28 | 26% |
| distil/whisper-distil-large-v3 | 10.08 | 245.04 | 185.02 | 32% |
| distil/whisper-distil-medium.en | 11.30 | 351.03 | 242.87 | 45% |
| openai/whisper-large | 10.76 | 137.20 | 118.69 | 16% |
| openai/whisper-large-v2 | 10.67 | 139.24 | 120.05 | 15% |
| openai/whisper-large-v3 | 10.02 | 141.93 | 122.97 | 16% |
| openai/whisper-large-v3-turbo | 10.14 | 229.71 | 168.52 | 36% |
| openai/whisper-medium.en | 11.03 | 177.60 | 151.70 | 17% |
| openai/whisper-small.en | 11.35 | 271.56 | 213.19 | 27% |
| openai/whisper-base.en | 12.83 | 357.94 | 253.20 | 41% |
| openai/whisper-tiny.en | 14.08 | 421.61 | 284.52 | 48% |

LibriSpeech Clean Results

| Model | LibriSpeech Clean WER (%) | RTFx (with cache) | RTFx (without cache) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 2.94 | 286.00 | 205.44 | 39% |
| distil/whisper-distil-large-v3 | 2.54 | 288.02 | 217.52 | 32% |
| distil/whisper-distil-medium.en | 3.69 | 415.82 | 280.95 | 48% |
| openai/whisper-large | 2.73 | 181.37 | 150.35 | 21% |
| openai/whisper-large-v2 | 2.83 | 159.01 | 135.81 | 17% |
| openai/whisper-large-v3 | 2.01 | 179.93 | 151.42 | 19% |
| openai/whisper-large-v3-turbo | 2.10 | 278.29 | 201.89 | 38% |
| openai/whisper-medium.en | 3.02 | 244.38 | 196.85 | 24% |
| openai/whisper-small.en | 3.05 | 408.91 | 280.23 | 46% |
| openai/whisper-base.en | 4.25 | 583.91 | 353.97 | 65% |
| openai/whisper-tiny.en | 5.66 | 639.70 | 376.14 | 70% |

LibriSpeech Other Results

| Model | LibriSpeech Other WER (%) | RTFx (with cache) | RTFx (without cache) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 6.84 | 248.08 | 177.63 | 40% |
| distil/whisper-distil-large-v3 | 5.19 | 259.09 | 199.72 | 30% |
| distil/whisper-distil-medium.en | 8.35 | 349.71 | 236.81 | 48% |
| openai/whisper-large | 5.54 | 164.39 | 138.73 | 18% |
| openai/whisper-large-v2 | 5.14 | 162.81 | 139.05 | 17% |
| openai/whisper-large-v3 | 3.91 | 163.21 | 140.22 | 16% |
| openai/whisper-large-v3-turbo | 4.24 | 257.22 | 188.87 | 36% |
| openai/whisper-medium.en | 5.85 | 222.76 | 181.65 | 23% |
| openai/whisper-small.en | 7.25 | 367.64 | 262.68 | 40% |
| openai/whisper-base.en | 10.35 | 445.31 | 293.26 | 52% |
| openai/whisper-tiny.en | 15.45 | 420.61 | 298.15 | 41% |

SPGISpeech Results

| Model | SPGISpeech WER (%) | RTFx (with cache) | RTFx (without cache) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 3.30 | 331.26 | 232.50 | 42% |
| distil/whisper-distil-large-v3 | 3.27 | 337.55 | 249.00 | 36% |
| distil/whisper-distil-medium.en | 3.83 | 478.64 | 318.96 | 50% |
| openai/whisper-large | 3.20 | 198.02 | 167.48 | 18% |
| openai/whisper-large-v2 | 3.87 | 196.77 | 166.89 | 18% |
| openai/whisper-large-v3 | 2.94 | 197.37 | 166.92 | 18% |
| openai/whisper-large-v3-turbo | 2.97 | 320.11 | 229.57 | 39% |
| openai/whisper-medium.en | 3.33 | 218.35 | 285.07 | 31% |
| openai/whisper-small.en | 3.60 | 427.56 | 307.90 | 39% |
| openai/whisper-base.en | 4.26 | 601.14 | 372.83 | 61% |
| openai/whisper-tiny.en | 5.93 | 648.97 | 398.03 | 63% |

TEDLIUM Results

| Model | TEDLIUM WER (%) | RTFx (with cache) | RTFx (without cache) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 4.87 | 274.60 | 197.85 | 39% |
| distil/whisper-distil-large-v3 | 3.86 | 294.14 | 217.54 | 35% |
| distil/whisper-distil-medium.en | 4.84 | 425.02 | 282.89 | 50% |
| openai/whisper-large | 3.91 | 166.87 | 143.34 | 16% |
| openai/whisper-large-v2 | 3.90 | 166.91 | 143.77 | 16% |
| openai/whisper-large-v3 | 3.86 | 166.75 | 142.18 | 17% |
| openai/whisper-large-v3-turbo | 3.57 | 288.34 | 199.61 | 44% |
| openai/whisper-medium.en | 4.11 | 237.28 | 185.40 | 28% |
| openai/whisper-small.en | 4.07 | 352.07 | 263.51 | 34% |
| openai/whisper-base.en | 4.87 | 507.93 | 336.00 | 51% |
| openai/whisper-tiny.en | 5.97 | 571.50 | 352.79 | 62% |

VoxPopuli Results

| Model | VoxPopuli WER (%) | RTFx (with cache) | RTFx (without cache) | Improvement |
|---|---|---|---|---|
| distil/whisper-distil-large-v2 | 8.24 | 348.26 | 249.25 | 40% |
| distil/whisper-distil-large-v3 | 8.25 | 359.48 | 262.70 | 37% |
| distil/whisper-distil-medium.en | 9.00 | 525.00 | 345.95 | 52% |
| openai/whisper-large | 7.76 | 218.21 | 182.69 | 19% |
| openai/whisper-large-v2 | 7.48 | 219.32 | 182.27 | 20% |
| openai/whisper-large-v3 | 9.54 | 213.33 | 180.51 | 18% |
| openai/whisper-large-v3-turbo | 11.87 | 339.76 | 247.99 | 37% |
| openai/whisper-medium.en | 8.06 | 309.17 | 239.06 | 29% |
| openai/whisper-small.en | 8.50 | 478.84 | 336.49 | 42% |
| openai/whisper-base.en | 9.76 | 681.44 | 418.28 | 63% |
| openai/whisper-tiny.en | 12.00 | 647.46 | 405.49 | 60% |

Benchmark scripts are available here: https://github.com/huggingface/open_asr_leaderboard/tree/main/transformers

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker
@Vaibhavs10
This change was suggested by @pzelasko

Member

@Vaibhavs10 Vaibhavs10 left a comment


🔥 🔥 🔥

@Vaibhavs10 Vaibhavs10 requested a review from eustlb December 11, 2024 18:46
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@pzelasko

Yup, a large part of the overhead is gone now. I ran a quick check and you save roughly 1 second per batch on the Open ASR Leaderboard for whisper-turbo. I think there may still be some overhead from the tokenizer, but I'm not sure how much exactly. You should recompute the RTFx on the full test set.

Collaborator

@ArthurZucker ArthurZucker left a comment


Hey! We already have a state for _added_tokens_encoder, which should be filling this!
The main issue is just that when performance matters, _added_tokens_encoder should be "called" instead of added_tokens_encoder!
But very, very welcome! Let's reduce the overhead!

"""
return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}
Collaborator

Suggested change:

```diff
-return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}
+return self._added_tokens_encoder
```

This would not work, as you probably need the content to be sorted, which is why we have the non-sorted _added_tokens_encoder. We can actually sort it (define it as an OrderedDict) since we deprecated Python <= 3.9!
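
For context, a small illustrative sketch (not from this PR) of what keeping the encoder sorted by token id could look like, stored as a plain dict, which already preserves insertion order on Python >= 3.7 and so behaves like the OrderedDict mentioned above for iteration purposes:

```python
# Illustrative only: build the encoder once, sorted by token id, and keep it as
# a plain dict (dicts preserve insertion order on Python >= 3.7).
added_tokens_decoder = {50001: "<extra_1>", 50000: "<extra_0>"}

added_tokens_encoder = {
    content: idx for idx, content in sorted(added_tokens_decoder.items())
}
print(added_tokens_encoder)  # {'<extra_0>': 50000, '<extra_1>': 50001}
```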

@ArthurZucker ArthurZucker added the Core: Tokenization Internals of the library; Tokenization. label Dec 23, 2024
@ArthurZucker
Collaborator

Also, you could run the benchmark with both the slow and fast tokenizers; I don't know which one you are using!
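
For reference, a rough way to compare the two. This is not part of the PR; the model id, dummy token ids, and iteration count below are arbitrary placeholders:

```python
# Rough comparison of slow vs. fast tokenizer decoding speed.
# The model id, dummy token ids, and iteration count are placeholders.
import time

from transformers import AutoTokenizer

model_id = "openai/whisper-large-v3-turbo"
ids = list(range(50, 250))  # dummy token ids

for use_fast in (False, True):
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=use_fast)
    start = time.perf_counter()
    for _ in range(1000):
        tokenizer.decode(ids, skip_special_tokens=True)
    elapsed = time.perf_counter() - start
    print(f"use_fast={use_fast}: {elapsed:.2f}s for 1000 decode calls")
```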
