Conversation

benbrandt (Owner)

Special tokens are now also encoded by both the Huggingface and Tiktoken tokenizers. This is closer to the default behavior on the Python side, and ensures that if a model adds tokens at the beginning or end of a sequence, these are accounted for in the chunk size as well.
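For context, the behavioral difference looks something like the following. This is a minimal sketch using the Hugging Face tokenizers crate; bert-base-cased is just an illustrative model id, and from_pretrained requires the crate's http feature.

```rust
use tokenizers::Tokenizer;

fn main() {
    // BERT-style models wrap every sequence in [CLS] ... [SEP].
    let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None).unwrap();

    // The second argument toggles add_special_tokens.
    let without = tokenizer.encode("some chunk of text", false).unwrap();
    let with_special = tokenizer.encode("some chunk of text", true).unwrap();

    // The encoding with special tokens is two tokens longer:
    // [CLS] and [SEP] now count toward the reported chunk size.
    assert_eq!(with_special.get_ids().len(), without.get_ids().len() + 2);
}
```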

benbrandt (Owner, Author)

@Jeadie I saw you guys forked to do something similar. Since your fork doesn't have issues enabled, I thought I would ask the question here:

Is this feature the main reason you had to fork? And if so, is it because embedding models often need to add extra tokens at the beginning/end, and these weren't accounted for in the chunk size?

Jeadie (Contributor) commented Dec 15, 2024

Yes, some embedding models have hard constraints on the number of tokens they can accept as input, and this limit includes special tokens. We needed to account for this when splitting inputs in preparation for embedding. This PR has those changes in our upstream: spiceai/spiceai#3713
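One way to express that constraint is shown below. A sketch only, assuming a BERT-style model and the tokenizers crate; fits_model is a hypothetical helper, not part of any library.

```rust
use tokenizers::Tokenizer;

/// Hypothetical helper: true if `text` will fit the model's hard
/// input limit once special tokens are added.
fn fits_model(tokenizer: &Tokenizer, text: &str, model_max: usize) -> bool {
    // Count with add_special_tokens = true, mirroring what the
    // embedding model actually receives.
    tokenizer
        .encode(text, true)
        .map(|enc| enc.get_ids().len() <= model_max)
        .unwrap_or(false)
}
```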

I had a note to upstream our changes, so I'm glad to see you've already done it.

benbrandt (Owner, Author)

@Jeadie thanks for the input. It totally makes sense. I'm trying to figure out if I need a way to let the user decide this behavior, to stay somewhat backwards compatible, since for Tiktoken the recommendation is not to enable special tokens in most cases... which might require a wrapper. But I think having it be the default makes sense, as in most cases this is the desired behavior.
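Such an opt-out wrapper might look like the following. A sketch only: it assumes text-splitter's ChunkSizer trait, whose exact signature may differ between versions, and the WithoutSpecialTokens name is made up.

```rust
use text_splitter::ChunkSizer;
use tokenizers::Tokenizer;

/// Hypothetical wrapper that counts tokens *without* special tokens,
/// restoring the previous behavior for callers who want it.
struct WithoutSpecialTokens(Tokenizer);

impl ChunkSizer for WithoutSpecialTokens {
    fn size(&self, chunk: &str) -> usize {
        // add_special_tokens = false: ignore [CLS]/[SEP]-style markers.
        self.0
            .encode(chunk, false)
            .map(|enc| enc.get_ids().len())
            .unwrap_or(0)
    }
}
```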

codecov bot commented Jan 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.39%. Comparing base (4e0d998) to head (ae239fd).
Report is 139 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #512   +/-   ##
=======================================
  Coverage   99.39%   99.39%           
=======================================
  Files          11       11           
  Lines        1981     1984    +3     
=======================================
+ Hits         1969     1972    +3     
  Misses         12       12           

☔ View full report in Codecov by Sentry.

benbrandt merged commit 9da8748 into main on Jan 16, 2025 (26 checks passed)
benbrandt deleted the special-tokens branch on Jan 16, 2025 at 06:44
benbrandt (Owner, Author) commented Jan 16, 2025

@Jeadie this will be out in the v0.21.0 release. Hopefully this means you can switch back to upstream (and get a few other minor optimizations).
Thanks for your patience; holiday schedules meant this took longer than I'd hoped.

benbrandt added a commit that referenced this pull request Jan 17, 2025
benbrandt (Owner, Author)

@Jeadie I am reverting this. It seems this change has caused issues in most other use cases.

For your case, since the model needs to add one extra token at the beginning, I would suggest setting your chunk size to one less than the max context window. Then it should work, right?
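Concretely, the workaround might look like this. A sketch assuming text-splitter's ChunkConfig API and an illustrative BERT-style tokenizer; the exact builder methods may differ by version.

```rust
use text_splitter::{ChunkConfig, TextSplitter};
use tokenizers::Tokenizer;

fn main() {
    // Illustrative model id; requires the tokenizers `http` feature.
    let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None).unwrap();

    // The model accepts 512 tokens; reserve 1 for the token it
    // prepends (reserve 2 if it appends one as well).
    let max_context = 512;
    let splitter = TextSplitter::new(ChunkConfig::new(max_context - 1).with_sizer(tokenizer));

    for chunk in splitter.chunks("long document text...") {
        // Each chunk, plus the model's extra token, fits in 512.
        println!("{chunk}");
    }
}
```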

Happy to discuss more if this is not the case (we could also hop on a call).
