[tokenizers] Add int32 option to encoding #3571

xyang16 · 2024-12-30T20:16:39Z

Description

Brief description of what this PR is about

If this change is a backward incompatible change, why must this change be made?
Interesting edge cases to note here

frankfliu · 2024-12-30T20:20:11Z

Fixes: #3566

extensions/tokenizers/src/main/java/ai/djl/huggingface/tokenizers/Encoding.java

siddvenk · 2024-12-30T20:30:56Z

extensions/tokenizers/src/main/java/ai/djl/huggingface/tokenizers/Encoding.java

+     * @return the {@link NDList}
+     */
+    public static NDList toNDList(
+            Encoding[] encodings, NDManager manager, boolean withTokenType, boolean int32) {


Could we leverage the instance method toNDList here? It seems like we should be able to loop over all the encodings and call encoding.toNDList(...).

They are different: the static one is batched (Encoding[]).

yes, but per encoding they are the same. Why can't we do something like this?

    public static NDList toNDList(
            Encoding[] encodings, NDManager manager, boolean withTokenType, boolean int32) {
        NDList list = new NDList();
        for (int i = 0; i < encodings.length; i++) {
            NDList encoding = encodings[i].toNDList(manager, withTokenType, int32);
            list.addAll(encoding);
        }
        return list;
    }

This is an optimization for performance (I was trying to compete with TEI). It reduces the number of memory copies needed to create a batched NDArray. This way it only invokes JNI 3 times, while the loop version copies N times and then uses stack to batchify them.

I'm actually not sure if this is necessary.
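
To make the trade-off above concrete, here is a minimal sketch in plain Java, without any DJL types. Instead of building one array per encoding and stacking them afterwards, the ids of the whole batch are copied into a single contiguous buffer, which could then cross the JNI boundary in one call per field (presumably ids, attention mask, and token type ids, which would account for the "3 times"). The class and method names are hypothetical, and this is not the code from the PR.

```java
import java.util.Arrays;

/** Plain-Java illustration only; all names here are hypothetical. */
final class BatchFlattenSketch {

    /**
     * Copies a batch of equal-length id rows into one contiguous, row-major buffer,
     * so the whole batch could be handed to native code in a single call.
     */
    static long[] flatten(long[][] batch) {
        if (batch.length == 0) {
            return new long[0];
        }
        int seqLen = batch[0].length;
        long[] flat = new long[batch.length * seqLen];
        for (int i = 0; i < batch.length; i++) {
            if (batch[i].length != seqLen) {
                // A batched array needs a rectangular shape, so ragged rows are rejected.
                throw new IllegalArgumentException(
                        "row " + i + " has length " + batch[i].length + ", expected " + seqLen);
            }
            System.arraycopy(batch[i], 0, flat, i * seqLen, seqLen);
        }
        return flat;
    }

    public static void main(String[] args) {
        long[][] ids = {{101, 7592, 102}, {101, 2088, 102}};
        // The logical shape of the batched array would be (2, 3).
        System.out.println(Arrays.toString(flatten(ids)));
    }
}
```

The per-encoding loop would instead create N small arrays and combine them in a second step, which is the extra copying the comment above refers to.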

siddvenk · 2024-12-30T20:31:50Z

extensions/tokenizers/src/main/java/ai/djl/huggingface/tokenizers/Encoding.java

+     * @param encodings the {@code Encoding} batch
+     * @param manager the {@link NDManager} to create the NDList
+     * @param withTokenType true to include the token type id
+     * @param int32 true to use int32 datatype


for my understanding, why do we have a boolean for int32? Should these methods accept a dtype argument for more flexibility? Is there an issue with Rust/Candle that prevents us from doing so?

The token ids are always integers. They should actually always be int64, but some of the Rust implementations currently only accept int32, so this is a workaround for Candle.
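
As a rough illustration of what an int32 option involves, the sketch below narrows int64 token ids to int32 in plain Java. It assumes ids are held as a long[] and uses Math.toIntExact so that a value outside the int32 range fails loudly instead of wrapping around. The class and method names are hypothetical and are not taken from the PR.

```java
import java.util.Arrays;

/** Plain-Java illustration only; all names here are hypothetical. */
final class Int32IdsSketch {

    /** Narrows 64-bit token ids to 32-bit, failing if any value does not fit. */
    static int[] toInt32(long[] ids) {
        int[] out = new int[ids.length];
        for (int i = 0; i < ids.length; i++) {
            // Math.toIntExact throws ArithmeticException on overflow.
            out[i] = Math.toIntExact(ids[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] ids = {101L, 7592L, 102L};
        System.out.println(Arrays.toString(toInt32(ids)));
    }
}
```

Token ids index into a vocabulary, so in practice they fit comfortably in 32 bits; the overflow check is only a safety net.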

frankfliu · 2025-01-07T17:56:40Z

@siddvenk Any more comments?

siddvenk · 2025-01-13T19:21:53Z

@siddvenk Any more comments?

lgtm, thanks for the explanations!

[tokenizers] Add int32 option to encoding

1453c89

xyang16 requested review from zachgk and a team as code owners December 30, 2024 20:16

frankfliu approved these changes Dec 30, 2024

siddvenk reviewed Dec 30, 2024

siddvenk approved these changes Jan 13, 2025

siddvenk merged commit f0ee0ad into deepjavalibrary:master Jan 13, 2025
5 checks passed
