Add get methods for HuggingFaceTokenizer fields #2956
Comments
Can you provide more context about this request? In which use case do you need to get the maxLength of a tokenizer?
We want to check whether the user's input exceeds the max length. If it does, we will throw our own exception with a readable error message.
How can you check the maxLength before tokenizing?
We have our own translator (https://github.com/opensearch-project/ml-commons/blob/main/ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/SentenceTransformerTranslator.java#L28). We want to add some checking in
The number of tokens depends purely on the tokenizer. Sometimes a single word produces 2 tokens, so the word count is not the same as the token count. My question is: how can you know whether an input text exceeds the max token length? The right way is to add a function:
I created a PR that returns whether the output tokens exceed the max length: #2957. Please take a look and see if this resolves your issue.
Cool, thanks @frankfliu
I added this PR: #2958. But be aware that in most cases The right way of using
Got it, makes sense. I see the HuggingFace tokenizer supports options; we will try that.
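To illustrate the point made in the thread that word count and token count differ, here is a minimal, self-contained Java sketch. It does not use the DJL API: `toyTokenize` is a hypothetical stand-in for a subword tokenizer (it simply cuts each word into pieces of at most 4 characters), and `tokenizeOrThrow` is a hypothetical helper that checks the limit on the token count after tokenizing and throws a readable error.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenLimitSketch {

    /**
     * Toy stand-in for a subword tokenizer (an assumption, not a real algorithm):
     * splits on whitespace, then cuts each word into pieces of at most 4 characters.
     */
    static List<String> toyTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input yields a single empty string from split
            }
            for (int i = 0; i < word.length(); i += 4) {
                tokens.add(word.substring(i, Math.min(i + 4, word.length())));
            }
        }
        return tokens;
    }

    /** Tokenizes first, then compares the token count (not the word count) to the limit. */
    static List<String> tokenizeOrThrow(String text, int maxLength) {
        List<String> tokens = toyTokenize(text);
        if (tokens.size() > maxLength) {
            throw new IllegalArgumentException(
                    "Input has " + tokens.size() + " tokens, which exceeds the limit of " + maxLength);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "tokenizers split words";
        int words = text.split("\\s+").length;
        int tokens = toyTokenize(text).size();
        // 3 words become 7 tokens with this toy scheme
        System.out.println(words + " words, " + tokens + " tokens");
        try {
            tokenizeOrThrow(text, 3);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A limit of 3 would pass a word-based check for this input but fails the token-based one, which is why the check has to happen on the tokenizer output.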
Description
We need to read tokenizer configuration values such as the model's maxLength, but the HuggingFaceTokenizer class currently has no get methods for them. We suggest adding get methods for the HuggingFaceTokenizer fields.
One minor suggestion: return the builder from the configure method, change to
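The signature proposed in the issue is not reproduced here. As a rough illustration of both suggestions (getters for configuration fields, and a configure method that returns the builder so calls can be chained), here is a self-contained Java sketch. `TokenizerConfigSketch`, its fields, its option keys, and its default values are all hypothetical and are not the DJL implementation.

```java
import java.util.Map;

public class TokenizerConfigSketch {
    private final boolean truncation;
    private final int maxLength;

    private TokenizerConfigSketch(Builder builder) {
        this.truncation = builder.truncation;
        this.maxLength = builder.maxLength;
    }

    /** Getter so callers can read the configured truncation flag. */
    public boolean isTruncation() {
        return truncation;
    }

    /** Getter so callers can read the configured max length. */
    public int getMaxLength() {
        return maxLength;
    }

    public static Builder builder() {
        return new Builder();
    }

    public static final class Builder {
        // Hypothetical defaults, chosen only for this sketch
        private boolean truncation = true;
        private int maxLength = 512;

        public Builder optTruncation(boolean truncation) {
            this.truncation = truncation;
            return this;
        }

        public Builder optMaxLength(int maxLength) {
            this.maxLength = maxLength;
            return this;
        }

        /** Applies string options and returns this builder, which allows chaining. */
        public Builder configure(Map<String, String> options) {
            if (options.containsKey("truncation")) {
                truncation = Boolean.parseBoolean(options.get("truncation"));
            }
            if (options.containsKey("maxLength")) {
                maxLength = Integer.parseInt(options.get("maxLength"));
            }
            return this;
        }

        public TokenizerConfigSketch build() {
            return new TokenizerConfigSketch(this);
        }
    }

    public static void main(String[] args) {
        TokenizerConfigSketch config = TokenizerConfigSketch.builder()
                .configure(Map.of("maxLength", "128"))
                .optTruncation(false)
                .build();
        System.out.println("maxLength=" + config.getMaxLength()
                + ", truncation=" + config.isTruncation());
    }
}
```

Because configure returns the builder rather than void, option maps and explicit setters can be mixed in a single chained expression, with later calls overriding earlier ones.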
References