
feat!: special tokens encoded by default #512

Merged
merged 3 commits into main from special-tokens on Jan 16, 2025
Conversation

benbrandt
Owner

Special tokens are now also encoded by both the Huggingface and Tiktoken tokenizers. This is closer to the default behavior on the Python side, and ensures that if a model adds tokens at the beginning or end of a sequence, these are accounted for as well.

@benbrandt
Owner Author

@Jeadie I saw you guys forked to do something similar. Since your fork doesn't have issues enabled, I thought I would ask the question here:

Is this feature the main reason you had to fork? And if so, is it because embedding models often need to add extra tokens at the beginning/end, and these weren't accounted for in the chunk size?

@Jeadie
Contributor

Jeadie commented Dec 15, 2024

Yes, some embedding models have hard constraints on the number of tokens they can accept as input, and this limit includes special tokens. We needed to account for this when splitting inputs in preparation for embedding. This PR has those changes in our upstream: spiceai/spiceai#3713

I had a note to upstream our changes, so I'm glad to see you've already done it.
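To illustrate the constraint described above, here is a minimal, stdlib-only sketch. `MockTokenizer` is hypothetical (a stand-in for a real Huggingface or Tiktoken tokenizer, not this crate's API): a model with a hard input limit counts special tokens like `[CLS]`/`[SEP]`, so a splitter that measures chunk size without them can emit chunks that overflow the model.

```python
# Sketch: why special tokens must count toward chunk size.
# MockTokenizer is hypothetical; real splitters wrap Huggingface/Tiktoken.

class MockTokenizer:
    """Whitespace tokenizer that wraps sequences in [CLS] ... [SEP]."""

    def encode(self, text: str, add_special_tokens: bool = True) -> list[str]:
        tokens = text.split()
        if add_special_tokens:
            tokens = ["[CLS]", *tokens, "[SEP]"]
        return tokens


MODEL_LIMIT = 8  # hard input limit of a hypothetical embedding model
tok = MockTokenizer()
chunk = "one two three four five six seven"  # 7 words

# Old behavior: special tokens ignored, so the chunk appears to fit.
assert len(tok.encode(chunk, add_special_tokens=False)) <= MODEL_LIMIT

# What the model actually receives: [CLS] + 7 words + [SEP] = 9 tokens,
# which exceeds the limit even though the measured size looked safe.
assert len(tok.encode(chunk, add_special_tokens=True)) > MODEL_LIMIT
```

Measuring with `add_special_tokens=True` (the new default in this PR) keeps the splitter's size metric aligned with what the model actually sees.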

@benbrandt
Owner Author

@Jeadie thanks for the input, it totally makes sense. I'm trying to figure out whether I need a way to let the user opt out of this behavior for backwards compatibility, since for tiktoken, for example, the recommendation is not to enable special-token encoding in most cases, which might require a wrapper. But I think having it be the default makes sense, since in most cases this is the desired behavior.
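The opt-out wrapper mentioned above could look roughly like this. This is a hedged sketch only: `SizerWrapper`, its parameters, and the stand-in counters are all hypothetical names for illustration, not the crate's actual API.

```python
# Hypothetical wrapper sketch: let callers choose whether special tokens
# count toward chunk size, defaulting to the new behavior in this PR.
from typing import Callable


class SizerWrapper:
    def __init__(
        self,
        count_with_special: Callable[[str], int],
        count_without_special: Callable[[str], int],
        include_special: bool = True,  # True mirrors the PR's new default
    ):
        self._with = count_with_special
        self._without = count_without_special
        self.include_special = include_special

    def size(self, chunk: str) -> int:
        """Measure a chunk with the configured token-counting strategy."""
        counter = self._with if self.include_special else self._without
        return counter(chunk)


# Usage with trivial stand-in counters (+2 models a [CLS]/[SEP] pair):
sizer = SizerWrapper(lambda s: len(s.split()) + 2, lambda s: len(s.split()))
assert sizer.size("a b c") == 5   # default: special tokens counted

sizer.include_special = False     # legacy behavior, opt-out
assert sizer.size("a b c") == 3
```

Keeping the new behavior as the default while exposing a flag like this would preserve backwards compatibility for tiktoken users who don't want special tokens counted.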


codecov bot commented Jan 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.39%. Comparing base (4e0d998) to head (ae239fd).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #512   +/-   ##
=======================================
  Coverage   99.39%   99.39%           
=======================================
  Files          11       11           
  Lines        1981     1984    +3     
=======================================
+ Hits         1969     1972    +3     
  Misses         12       12           

☔ View full report in Codecov by Sentry.

@benbrandt benbrandt merged commit 9da8748 into main Jan 16, 2025
26 checks passed
@benbrandt benbrandt deleted the special-tokens branch January 16, 2025 06:44
@benbrandt
Owner Author

benbrandt commented Jan 16, 2025

@Jeadie this will be out in the v0.21.0 release. Hopefully this means you can switch back to upstream (and get a few other minor optimizations).
Thanks for your patience. Holiday schedules meant this took longer than I'd hoped.
