
Add get methods for HuggingFaceTokenizer fields #2956

Closed
ylwu-amzn opened this issue Jan 22, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@ylwu-amzn
Contributor

ylwu-amzn commented Jan 22, 2024

Description

We need to get tokenizer config values such as the model maxLength, but the HuggingFaceTokenizer class currently has no getters for these fields. I suggest adding get methods for the HuggingFaceTokenizer fields.

One minor suggestion: return the builder from the configure method.

    public void configure(Map<String, ?> arguments) {
        for (Map.Entry<String, ?> entry : arguments.entrySet()) {
            options.put(entry.getKey(), entry.getValue().toString());
        }
    }

Change to:

    public Builder configure(Map<String, ?> arguments) {
        for (Map.Entry<String, ?> entry : arguments.entrySet()) {
            options.put(entry.getKey(), entry.getValue().toString());
        }
        return this;
    }
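To illustrate the benefit, here is a minimal self-contained sketch (a hypothetical `Builder` stand-in, not DJL's actual `HuggingFaceTokenizer.Builder`, and `optMaxLength` here is illustrative) of how returning `this` from `configure` enables fluent chaining:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigureDemo {
    // Hypothetical stand-in for a tokenizer builder; the real class has more
    // options, but the chaining pattern is the same.
    static class Builder {
        final Map<String, String> options = new HashMap<>();

        // Returning the builder lets callers chain further calls.
        Builder configure(Map<String, ?> arguments) {
            for (Map.Entry<String, ?> entry : arguments.entrySet()) {
                options.put(entry.getKey(), entry.getValue().toString());
            }
            return this;
        }

        // Illustrative setter in the same fluent style.
        Builder optMaxLength(int maxLength) {
            options.put("maxLength", Integer.toString(maxLength));
            return this;
        }
    }

    public static void main(String[] args) {
        // With a void configure(), this chain would not compile.
        Builder b = new Builder()
                .configure(Map.of("truncation", "true"))
                .optMaxLength(512);
        System.out.println(b.options); // e.g. {maxLength=512, truncation=true}
    }
}
```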

@ylwu-amzn ylwu-amzn added the enhancement New feature or request label Jan 22, 2024
@frankfliu
Contributor

Can you provide more context about this request? In which use case do you need to get the maxLength of a tokenizer?

@ylwu-amzn
Contributor Author

We want to check whether the user's input exceeds the max length; if it does, we will throw our own exception with a readable error message.

@frankfliu
Contributor

How can you check the maxLength before tokenizing?

@ylwu-amzn
Contributor Author

We have our own translator, https://github.com/opensearch-project/ml-commons/blob/main/ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/SentenceTransformerTranslator.java#L28

We want to add some checking in processInput: if the input text has more tokens than the max length, we will throw an exception.

@frankfliu
Contributor

The number of tokens depends purely on the tokenizer; sometimes a single word produces 2 tokens, so the word count is not the same as the token count. My question is how you can know whether an input text exceeds the max token length.

The right way is to add a function, Encoding.hasOverflowTokens(), so you can check whether the output tokens exceed the maxLength. But this approach has a performance hit.
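A rough sketch of the overflow check being described (class shape and names are assumed for illustration; see the PR for the actual API). The idea is that when truncation runs, tokens beyond maxLength land in an "overflowing" encoding, so a non-empty overflow means the input exceeded the limit:

```java
public class OverflowDemo {
    // Hypothetical, simplified Encoding; the real class carries more fields.
    static class Encoding {
        final long[] ids;             // token ids kept after truncation
        final Encoding[] overflowing; // tokens cut off beyond maxLength

        Encoding(long[] ids, Encoding[] overflowing) {
            this.ids = ids;
            this.overflowing = overflowing;
        }

        // True when truncation dropped tokens, i.e. the input was too long.
        boolean hasOverflowTokens() {
            return overflowing != null && overflowing.length > 0;
        }
    }

    // How a translator's processInput might use the check to fail with a
    // readable message instead of silently truncating.
    static void validate(Encoding encoding, int maxLength) {
        if (encoding.hasOverflowTokens()) {
            throw new IllegalArgumentException(
                    "Input exceeds the max token length of " + maxLength);
        }
    }

    public static void main(String[] args) {
        Encoding truncated = new Encoding(
                new long[] {101, 2023, 102},
                new Encoding[] {new Encoding(new long[] {9999}, null)});
        System.out.println(truncated.hasOverflowTokens()); // prints true
    }
}
```

Note the performance caveat above: this check only works after tokenization has already run, which is why it costs an encoding pass.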

@frankfliu
Contributor

I created a PR to return whether the output tokens exceed the max length: #2957

Please take a look and see if this resolves your issue.

@ylwu-amzn
Contributor Author

Cool, thanks @frankfliu

@frankfliu
Contributor

I added this PR: #2958

But be aware that in most cases getMaxLength() will return -1, which means that if truncation kicks in, it falls back to using maxModelLength, and maxModelLength is set by your code (or defaults to 256).

The right way to use maxLength is to set it manually (that's why we didn't have a getter to begin with).
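The fallback just described can be sketched as follows (a plain-Java illustration of the assumed semantics, not DJL code): getMaxLength() returning -1 means "not set in the tokenizer config", so truncation falls back to maxModelLength, which defaults to 256 unless your code sets it:

```java
public class MaxLengthFallback {
    static final int DEFAULT_MAX_MODEL_LENGTH = 256; // assumed default

    // -1 (unset) falls back to maxModelLength; a positive value wins.
    static int effectiveMaxLength(int maxLength, int maxModelLength) {
        return maxLength > 0 ? maxLength : maxModelLength;
    }

    public static void main(String[] args) {
        System.out.println(effectiveMaxLength(-1, DEFAULT_MAX_MODEL_LENGTH));  // prints 256
        System.out.println(effectiveMaxLength(512, DEFAULT_MAX_MODEL_LENGTH)); // prints 512
    }
}
```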

@ylwu-amzn
Contributor Author

> I added this PR: #2958
>
> But be aware that in most cases getMaxLength() will return -1, which means that if truncation kicks in, it falls back to using maxModelLength, and maxModelLength is set by your code (or defaults to 256).
>
> The right way to use maxLength is to set it manually (that's why we didn't have a getter to begin with).

Got it, makes sense. I see the Hugging Face tokenizer supports options; we will try that.
