-
Notifications
You must be signed in to change notification settings - Fork 675
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tokenizers] Add int32 option to encoding #3571
Conversation
Fixes: #3566 |
extensions/tokenizers/src/main/java/ai/djl/huggingface/tokenizers/Encoding.java
Show resolved
Hide resolved
* @return the {@link NDList} | ||
*/ | ||
public static NDList toNDList( | ||
Encoding[] encodings, NDManager manager, boolean withTokenType, boolean int32) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we leverage the instance method toNdList here? It seems like we should be able to loop over all the encondings and call encoding.toNdList(...)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are different, the static one is batched (Encoding[]
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but per encoding they are the same. Why can't we do something like this?
public static NDList toNDList(Encoding[] encodings, NDManager manager, boolean withTokenType, boolean int32) {
NDList list = new NDList();
for (int i = 0; i < encodings.length; i++) {
NDList encoding = encodings[i].toNDList(manager, withTokenType, int32);
list.addAll(encoding);
}
return list;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an optimization for performance (was trying to compete with TEI). It reduced number of memory copies to create a batched NDArray. This way it only invoke JNI 3 times, while the loop one will copy N times and then use stack to batchify them.
I actually not sure if this is necessary.
* @param encodings the {@code Encoding} batch | ||
* @param manager the {@link NDManager} to create the NDList | ||
* @param withTokenType true to include the token type id | ||
* @param int32 true to use int32 datatype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for my understanding, why do we have a boolean for int32? Should these methods accept a dtype argument for more flexibility? Is there an issue with Rust/Candle that prevents us from doing so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the token ids are always integer, it actually should always be int64, but some of the rust implementation currently only accept int32, that's a workaround for candle.
@siddvenk Any more comments? |
lgtm, thanks for the explanations! |
Description
Brief description of what this PR is about