
feat!: special tokens encoded by default #512

Merged
merged 3 commits into main from special-tokens on Jan 16, 2025
Conversation

benbrandt
Owner

Special tokens are now also encoded by both the Huggingface and Tiktoken tokenizers. This is closer to the default behavior on the Python side, and ensures that if a model adds tokens at the beginning or end of a sequence, these are accounted for as well.

@benbrandt
Owner Author

@Jeadie I saw you guys forked to do something similar. Since your fork doesn't have issues enabled, I thought I would ask the question here:

Is this feature the main reason you had to fork? And if so, is it because embedding models often need to add extra tokens at the beginning/end, and these weren't accounted for in the chunk size?

@Jeadie
Contributor

Jeadie commented Dec 15, 2024

Yes, some embedding models have hard constraints on the number of tokens they can accept as input, and this limit includes special tokens. We needed to account for this when splitting inputs in preparation for embedding. This PR has those changes in our upstream: spiceai/spiceai#3713

I had a note to upstream our changes, so I'm glad to see you've already done it.
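To illustrate the constraint described above, here is a minimal, stdlib-only sketch. `MockTokenizer` is hypothetical (a stand-in for a real Huggingface or Tiktoken tokenizer, not this crate's API): a model with a hard input limit counts special tokens like `[CLS]`/`[SEP]`, so a splitter that measures chunk size without them can emit chunks that overflow the model.

```python
# Sketch: why special tokens must count toward chunk size.
# MockTokenizer is hypothetical; real splitters wrap Huggingface/Tiktoken.

class MockTokenizer:
    """Whitespace tokenizer that wraps sequences in [CLS] ... [SEP]."""

    def encode(self, text: str, add_special_tokens: bool = True) -> list[str]:
        tokens = text.split()
        if add_special_tokens:
            tokens = ["[CLS]", *tokens, "[SEP]"]
        return tokens


MODEL_LIMIT = 8  # hard input limit of a hypothetical embedding model
tok = MockTokenizer()
chunk = "one two three four five six seven"  # 7 words

# Old behavior: special tokens ignored, so the chunk appears to fit.
assert len(tok.encode(chunk, add_special_tokens=False)) <= MODEL_LIMIT

# What the model actually receives: [CLS] + 7 words + [SEP] = 9 tokens,
# which exceeds the limit even though the measured size looked safe.
assert len(tok.encode(chunk, add_special_tokens=True)) > MODEL_LIMIT
```

Measuring with `add_special_tokens=True` (the new default in this PR) keeps the splitter's size metric aligned with what the model actually sees.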

@benbrandt
Owner Author

@Jeadie thanks for the input, it totally makes sense. I'm trying to figure out whether I need a way to let the user opt out of this behavior for backwards compatibility, since for tiktoken, for example, the recommendation is not to enable special-token encoding in most cases, which might require a wrapper. But I think having it be the default makes sense, since in most cases this is the desired behavior.
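The opt-out wrapper mentioned above could look roughly like this. This is a hedged sketch only: `SizerWrapper`, its parameters, and the stand-in counters are all hypothetical names for illustration, not the crate's actual API.

```python
# Hypothetical wrapper sketch: let callers choose whether special tokens
# count toward chunk size, defaulting to the new behavior in this PR.
from typing import Callable


class SizerWrapper:
    def __init__(
        self,
        count_with_special: Callable[[str], int],
        count_without_special: Callable[[str], int],
        include_special: bool = True,  # True mirrors the PR's new default
    ):
        self._with = count_with_special
        self._without = count_without_special
        self.include_special = include_special

    def size(self, chunk: str) -> int:
        """Measure a chunk with the configured token-counting strategy."""
        counter = self._with if self.include_special else self._without
        return counter(chunk)


# Usage with trivial stand-in counters (+2 models a [CLS]/[SEP] pair):
sizer = SizerWrapper(lambda s: len(s.split()) + 2, lambda s: len(s.split()))
assert sizer.size("a b c") == 5   # default: special tokens counted

sizer.include_special = False     # legacy behavior, opt-out
assert sizer.size("a b c") == 3
```

Keeping the new behavior as the default while exposing a flag like this would preserve backwards compatibility for tiktoken users who don't want special tokens counted.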


codecov bot commented Jan 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.39%. Comparing base (4e0d998) to head (ae239fd).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #512   +/-   ##
=======================================
  Coverage   99.39%   99.39%           
=======================================
  Files          11       11           
  Lines        1981     1984    +3     
=======================================
+ Hits         1969     1972    +3     
  Misses         12       12           

☔ View full report in Codecov by Sentry.

@benbrandt benbrandt merged commit 9da8748 into main Jan 16, 2025
26 checks passed
@benbrandt benbrandt deleted the special-tokens branch January 16, 2025 06:44
@benbrandt
Owner Author

benbrandt commented Jan 16, 2025

@Jeadie this will be out in the v0.21.0 release. Hopefully this means you can switch back to upstream (and get a few other minor optimizations).
Thanks for your patience. Holiday schedules meant this took longer than I'd hoped.
