
Add get methods for HuggingFaceTokenizer fields #2956

Closed
ylwu-amzn opened this issue Jan 22, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@ylwu-amzn
Contributor

ylwu-amzn commented Jan 22, 2024

Description

We need to get tokenizer config values such as the model maxLength, but the HuggingFaceTokenizer class currently has no getters for these fields. I suggest adding get methods for the HuggingFaceTokenizer fields.

One minor suggestion: return the builder from the configure method.

    public void configure(Map<String, ?> arguments) {
        for (Map.Entry<String, ?> entry : arguments.entrySet()) {
            options.put(entry.getKey(), entry.getValue().toString());
        }
    }

Change to:

    public Builder configure(Map<String, ?> arguments) {
        for (Map.Entry<String, ?> entry : arguments.entrySet()) {
            options.put(entry.getKey(), entry.getValue().toString());
        }
        return this;
    }
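To illustrate the benefit, here is a minimal self-contained sketch (a hypothetical `Builder` stand-in, not DJL's actual `HuggingFaceTokenizer.Builder`, and `optMaxLength` here is illustrative) of how returning `this` from `configure` enables fluent chaining:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigureDemo {
    // Hypothetical stand-in for a tokenizer builder; the real class has more
    // options, but the chaining pattern is the same.
    static class Builder {
        final Map<String, String> options = new HashMap<>();

        // Returning the builder lets callers chain further calls.
        Builder configure(Map<String, ?> arguments) {
            for (Map.Entry<String, ?> entry : arguments.entrySet()) {
                options.put(entry.getKey(), entry.getValue().toString());
            }
            return this;
        }

        // Illustrative setter in the same fluent style.
        Builder optMaxLength(int maxLength) {
            options.put("maxLength", Integer.toString(maxLength));
            return this;
        }
    }

    public static void main(String[] args) {
        // With a void configure(), this chain would not compile.
        Builder b = new Builder()
                .configure(Map.of("truncation", "true"))
                .optMaxLength(512);
        System.out.println(b.options); // e.g. {maxLength=512, truncation=true}
    }
}
```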

@ylwu-amzn ylwu-amzn added the enhancement New feature or request label Jan 22, 2024
@frankfliu
Contributor

Can you provide more context about this request? In which use case do you need to get the maxLength of a tokenizer?

@ylwu-amzn
Contributor Author

We want to check whether the user's input exceeds the max length; if it does, we will throw our own exception with a readable error message.

@frankfliu
Contributor

How can you check the maxLength before tokenizing?

@ylwu-amzn
Contributor Author

We have our own translator, https://github.com/opensearch-project/ml-commons/blob/main/ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/SentenceTransformerTranslator.java#L28

We want to add some checking in processInput: if the input text has more tokens than the max length, we will throw an exception.

@frankfliu
Contributor

The number of tokens depends purely on the tokenizer; sometimes a single word produces 2 tokens, so the word count is not the same as the token count. My question is how you can know whether an input text exceeds the max token length.

The right way is to add a function, Encoding.hasOverflowTokens(), so you can check whether the output tokens exceed the maxLength. But this approach has a performance hit.
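A rough sketch of the overflow check being described (class shape and names are assumed for illustration; see the PR for the actual API). The idea is that when truncation runs, tokens beyond maxLength land in an "overflowing" encoding, so a non-empty overflow means the input exceeded the limit:

```java
public class OverflowDemo {
    // Hypothetical, simplified Encoding; the real class carries more fields.
    static class Encoding {
        final long[] ids;             // token ids kept after truncation
        final Encoding[] overflowing; // tokens cut off beyond maxLength

        Encoding(long[] ids, Encoding[] overflowing) {
            this.ids = ids;
            this.overflowing = overflowing;
        }

        // True when truncation dropped tokens, i.e. the input was too long.
        boolean hasOverflowTokens() {
            return overflowing != null && overflowing.length > 0;
        }
    }

    // How a translator's processInput might use the check to fail with a
    // readable message instead of silently truncating.
    static void validate(Encoding encoding, int maxLength) {
        if (encoding.hasOverflowTokens()) {
            throw new IllegalArgumentException(
                    "Input exceeds the max token length of " + maxLength);
        }
    }

    public static void main(String[] args) {
        Encoding truncated = new Encoding(
                new long[] {101, 2023, 102},
                new Encoding[] {new Encoding(new long[] {9999}, null)});
        System.out.println(truncated.hasOverflowTokens()); // prints true
    }
}
```

Note the performance caveat above: this check only works after tokenization has already run, which is why it costs an encoding pass.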

@frankfliu
Contributor

I created a PR to return whether the output tokens exceed the max length: #2957

Please take a look and see if this resolves your issue.

@ylwu-amzn
Contributor Author

Cool, thanks @frankfliu

@frankfliu
Contributor

I added this PR: #2958

But be aware that in most cases getMaxLength() will return -1, which means that if truncation kicks in, it falls back to using maxModelLength, and maxModelLength is set by your code (or defaults to 256).

The right way to use maxLength is to set it manually (that's why we didn't have a getter to begin with).
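The fallback just described can be sketched as follows (a plain-Java illustration of the assumed semantics, not DJL code): getMaxLength() returning -1 means "not set in the tokenizer config", so truncation falls back to maxModelLength, which defaults to 256 unless your code sets it:

```java
public class MaxLengthFallback {
    static final int DEFAULT_MAX_MODEL_LENGTH = 256; // assumed default

    // -1 (unset) falls back to maxModelLength; a positive value wins.
    static int effectiveMaxLength(int maxLength, int maxModelLength) {
        return maxLength > 0 ? maxLength : maxModelLength;
    }

    public static void main(String[] args) {
        System.out.println(effectiveMaxLength(-1, DEFAULT_MAX_MODEL_LENGTH));  // prints 256
        System.out.println(effectiveMaxLength(512, DEFAULT_MAX_MODEL_LENGTH)); // prints 512
    }
}
```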

@ylwu-amzn
Contributor Author

> I added this PR: #2958
>
> But be aware that in most cases getMaxLength() will return -1, which means that if truncation kicks in, it falls back to using maxModelLength, and maxModelLength is set by your code (or defaults to 256).
>
> The right way to use maxLength is to set it manually (that's why we didn't have a getter to begin with).

Got it, makes sense. I see the Hugging Face tokenizer supports options; we will try that.
