feat!: special tokens encoded by default #512
Conversation
@Jeadie I saw you guys forked to do something similar. Since your fork doesn't have issues enabled, I thought I would ask the question here: Is this feature the main reason you had to fork? And if so, is it because embedding models often need to add extra tokens at the beginning/end of a sequence, and these weren't accounted for in the chunk size?
Yes, some embedding models have hard constraints on the number of tokens they can accept as input, and this limit includes special tokens. We needed to account for this when splitting inputs in preparation. This PR has those changes in our upstream: spiceai/spiceai#3713. I had a note to upstream our changes, so I'm glad to see you've already done it.
@Jeadie thanks for the input. It totally makes sense. I'm trying to figure out whether I need a way to let the user opt out of this behavior for backwards compatibility, since for tiktoken the recommendation is not to enable special-token encoding in most cases, which might require a wrapper. But I think having it be the default makes sense, since in most cases this is the desired behavior.
Special tokens are now also encoded by both the Huggingface and Tiktoken tokenizers. This is closer to the default behavior on the Python side, and ensures that if a model adds tokens at the beginning or end of a sequence, these are accounted for in the chunk size as well.
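The motivation above can be sketched in a few lines. This is a hypothetical toy model, not the crate's actual API: it assumes a word-level tokenizer with a fixed two-token overhead (e.g. a BERT-style `[CLS]`/`[SEP]` wrapper) to show why a chunk sized against the model's hard limit overshoots it if special tokens are ignored during counting.

```python
# Toy sketch: why special tokens must count toward an embedding model's
# token limit when chunking. Names and the 2-token overhead are illustrative.

def count_tokens(words, add_special_tokens=True):
    """Toy word-level tokenizer: the model wraps input as [CLS] ... [SEP]."""
    n = len(words)
    return n + 2 if add_special_tokens else n

def max_words_per_chunk(model_limit, add_special_tokens=True):
    """Largest chunk that still fits the model's hard input limit."""
    overhead = 2 if add_special_tokens else 0
    return model_limit - overhead

limit = 512

# Sizing chunks without counting special tokens over-fills by two tokens:
size = max_words_per_chunk(limit, add_special_tokens=False)
assert count_tokens(["w"] * size) == 514  # exceeds the 512-token limit

# Counting them keeps every chunk within the limit:
size = max_words_per_chunk(limit, add_special_tokens=True)
assert count_tokens(["w"] * size) == 512  # fits exactly
```

The same budgeting applies regardless of how many special tokens a given model adds; the fix in this PR is simply to measure chunk sizes with the tokenizer's special tokens included, so the overhead is accounted for automatically.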
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##             main     #512   +/-   ##
=======================================
  Coverage   99.39%   99.39%
=======================================
  Files          11       11
  Lines        1981     1984     +3
=======================================
+ Hits         1969     1972     +3
  Misses         12       12
@Jeadie this will be out in the v0.21.0 release. Hopefully this means you can switch back to upstream (and get a few other minor optimizations).