[RFC] Implement register custom sparse tokenizer from local files #3170
Comments
For options 1 and 2, only function_name "SPARSE_TOKENIZE" will work, right?
Yes, we should also include the function name field. Edited.
This will be challenging for security. Suggest consulting with security experts first.
Option 1 seems better. Also a question: do we need to support network file system registration? Placing files on production machines could be a maintenance overhead.
I think supporting network file systems will bring more security challenges. I know some service providers implement mechanisms to upload files to clusters, e.g. custom packages on AOS. It is a common use case for customizing analyzers from synonym files.
Good to see you creating this RFC! Can we also support pre-trained tokenizers, just like pre-trained models? This would benefit users using either dense or sparse models.
Sorry, I don't get your point. I think our pre-trained tokenizer can currently be used just like other DL models. What else do we need to do?
Wait a bit. Do you mean we can currently use a pre-trained tokenizer without specifying
Agreed, this will bring more security challenges, but from an open-source perspective this might be a valid use case; a configuration can be introduced to control the behavior in open source and AOS separately. This is not high priority; it's fine to implement in a future release.
Yes, we can use the model API to register & deploy the pretrained tokenizer.
From the document, this API can register a model. I think the customer also needs a lightweight method to register a pre-trained tokenizer, which only needs a tokenizer.
I am assuming that this RFC and all three options are about the tokenizer, not about all models.
Do you mean the inner Hugging Face tokenizer instead of the SPARSE_TOKENIZE model? Currently users don't interact with it directly in neural sparse search, so it's out of scope for this RFC. Please feel free to create a dedicated RFC to support it if needed.
Thanks for the clarification between the Hugging Face tokenizer and the SPARSE_TOKENIZE model. As for the third option, the train API does not make sense to me. Creating a tokenizer does not follow the current documentation of the train API: https://opensearch.org/docs/latest/ml-commons-plugin/api/train-predict/train/.
Hey @zhichao-aws, I was doing something similar with mounting a file within the request for image support within ml-commons, see #3152. I am curious to understand how it works with the following
Currently I am at a crossroads deciding whether we should implement the local file path within the image support, so I'm looking forward to seeing how you lead this feature to gain some insight.
Hi @brianf-aws, OpenSearch already has features to read synonyms from local files: https://github.com/opensearch-project/OpenSearch/blob/4213cc27305c37ea71e5b5a5addd17e5383e8029/server/src/main/java/org/opensearch/index/analysis/Analysis.java#L321. For the permission issue, we need to use a relative path or set the CONF_DIR (ref). In my POC for the feature, the files are saved under the
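For context, the synonyms-from-local-file pattern that comment refers to looks roughly like this (the index name and file path are illustrative; the path is resolved relative to the node's config directory):

```
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "local_synonyms": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "local_synonyms"]
        }
      }
    }
  }
}
```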
Background
Neural Sparse is a semantic search method built on the native Lucene inverted index. Documents and queries are encoded into sparse vectors, where each entry represents a token and its corresponding semantic weight.
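For illustration, an encoded piece of text becomes a token-to-weight map like the following (tokens and weights are made up):

```
{
  "neural": 1.48,
  "sparse": 1.31,
  "search": 0.87,
  "engine": 0.42
}
```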
For the neural sparse doc-only mode, we need a sparse encoding model for ingestion and a tokenizer for query. We use the same tokenizer construction as Hugging Face, i.e. the tokenizer is determined by a tokenizer.json config file (example). For the models pre-trained by OpenSearch, we also need an idf.json file for token weights (example). In OpenSearch ml-commons, the tokenizer is wrapped as a SPARSE_TOKENIZE model. This provides a consistent API user experience across doc-only mode and bi-encoder mode. To register a sparse tokenizer, users can choose a pre-trained tokenizer provided by OpenSearch, or build a zip file containing tokenizer.json and idf.json and register it from a URL.
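For reference, the existing register-from-URL flow looks roughly like the following (field values are illustrative; check the model register API docs for the exact required fields):

```
POST /_plugins/_ml/models/_register
{
  "name": "my-sparse-tokenizer",
  "version": "1.0.0",
  "function_name": "SPARSE_TOKENIZE",
  "model_format": "TORCH_SCRIPT",
  "model_content_hash_value": "<sha256 of the zip>",
  "url": "https://example.com/my-sparse-tokenizer.zip"
}
```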
What are we going to do?
Implement a customized sparse tokenizer by reading config files from the local file system. This makes tokenizer registration more flexible. Some service providers have stricter restrictions on customized torchscript files, while the sparse tokenizer doesn't need to interact with torchscript at run time. We need a registration option that explicitly excludes torchscript resources, to achieve more fine-grained security control for different model types. Besides, we also don't need to upload a zipped file, since these config files are much smaller than model weights.
User Experience
Here are different ways to implement the API. The tricky part is that this feature currently only works for tokenizer registration; the new fields do not generalize to other model types.
Option 1: register API + put new fields at model config (preferred)
For this option we implement a new class named SparseTokenizerModelConfig and put these fields in the body of model_config. In this way we don't need to alter the register logic of other model types.
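A possible request shape for this option (the paths and the field names inside model_config are placeholders to illustrate the idea, not a finalized schema; the files are assumed to sit under the node's config directory):

```
POST /_plugins/_ml/models/_register
{
  "name": "my-sparse-tokenizer",
  "version": "1.0.0",
  "function_name": "SPARSE_TOKENIZE",
  "model_config": {
    "tokenizer_file_path": "sparse_tokenizer/tokenizer.json",
    "idf_file_path": "sparse_tokenizer/idf.json"
  }
}
```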
Option 2: register API + put new fields at top-level request body
For this option we put the new fields in the top-level request body of the register model API. The con is that the register request objects of other model types will carry redundant fields.
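For comparison, the same placeholder fields placed at the top level of the register request (again hypothetical, only to show the difference from Option 1):

```
POST /_plugins/_ml/models/_register
{
  "name": "my-sparse-tokenizer",
  "version": "1.0.0",
  "function_name": "SPARSE_TOKENIZE",
  "tokenizer_file_path": "sparse_tokenizer/tokenizer.json",
  "idf_file_path": "sparse_tokenizer/idf.json"
}
```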
Option 3: register the tokenizer from train model API
For this option we can initialize the tokenizer object using the train API. This option requires more code changes and special-case logic.
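A rough sketch of what this could look like (entirely hypothetical; the train API has no such algorithm today, which is part of why this option needs the extra special-case logic):

```
POST /_plugins/_ml/_train/sparse_tokenize
{
  "parameters": {
    "tokenizer_file_path": "sparse_tokenizer/tokenizer.json",
    "idf_file_path": "sparse_tokenizer/idf.json"
  }
}
```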