Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tokenizers] Add int32 option to encoding #3571

Merged
merged 1 commit into from
Jan 13, 2025

Conversation

xyang16
Copy link
Contributor

@xyang16 xyang16 commented Dec 30, 2024

Description

Brief description of what this PR is about

  • If this change is a backward incompatible change, why must this change be made?
  • Interesting edge cases to note here

@xyang16 xyang16 requested review from zachgk and a team as code owners December 30, 2024 20:16
@frankfliu
Copy link
Contributor

Fixes: #3566

* @return the {@link NDList}
*/
public static NDList toNDList(
Encoding[] encodings, NDManager manager, boolean withTokenType, boolean int32) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we leverage the instance method toNdList here? It seems like we should be able to loop over all the encondings and call encoding.toNdList(...)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are different, the static one is batched (Encoding[])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but per encoding they are the same. Why can't we do something like this?

public static NDList toNDList(Encoding[] encodings, NDManager manager, boolean withTokenType, boolean int32) {
    NDList list = new NDList();
    for (int i = 0; i < encodings.length; i++) {
        NDList encoding = encodings[i].toNDList(manager, withTokenType, int32);
        list.addAll(encoding);
    }
    return list;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an optimization for performance (was trying to compete with TEI). It reduced number of memory copies to create a batched NDArray. This way it only invoke JNI 3 times, while the loop one will copy N times and then use stack to batchify them.

I actually not sure if this is necessary.

* @param encodings the {@code Encoding} batch
* @param manager the {@link NDManager} to create the NDList
* @param withTokenType true to include the token type id
* @param int32 true to use int32 datatype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my understanding, why do we have a boolean for int32? Should these methods accept a dtype argument for more flexibility? Is there an issue with Rust/Candle that prevents us from doing so?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the token ids are always integer, it actually should always be int64, but some of the rust implementation currently only accept int32, that's a workaround for candle.

@frankfliu
Copy link
Contributor

@siddvenk Any more comments?

@siddvenk
Copy link
Contributor

@siddvenk Any more comments?

lgtm, thanks for the explanations!

@siddvenk siddvenk merged commit f0ee0ad into deepjavalibrary:master Jan 13, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants