
[RFC] Implement register custom sparse tokenizer from local files #3170

Open · zhichao-aws opened this issue Oct 25, 2024 · 15 comments

@zhichao-aws (Member) commented Oct 25, 2024

Background

Neural Sparse is a semantic search method built on the native Lucene inverted index. Documents and queries are encoded into sparse vectors, where each entry represents a token and its corresponding semantic weight.
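
For illustration, an encoded sparse vector is just a token-to-weight map; the tokens and weights below are made up:

{
    "hello": 1.38,
    "hi": 0.49,
    "world": 1.52,
    "planet": 0.27
}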

For the neural sparse doc-only mode, we need a sparse encoding neural model for ingestion and a tokenizer for queries. We use a tokenizer construction consistent with Hugging Face, i.e. the tokenizer is determined by a tokenizer.json config file (example). For the models pre-trained by OpenSearch, we also need an idf.json file for the token weights (example).

In OpenSearch ml-commons, the tokenizer is wrapped as a SPARSE_TOKENIZE model. This provides a consistent API experience across doc-only mode and bi-encoder mode. To register a sparse tokenizer, users can choose the pre-trained tokenizer provided by OpenSearch, or build a zip file containing tokenizer.json and idf.json and register it from a URL.
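
For reference, today's register-from-URL flow looks roughly like this; the URL, version, and hash below are placeholders:

POST /_plugins/_ml/models/_register
{
    "name": "my custom tokenizer",
    "version": "1.0.0",
    "function_name": "SPARSE_TOKENIZE",
    "model_format": "TORCH_SCRIPT",
    "model_group_id": "Z1eQf4oB5Vm0Tdw8EIP2",
    "model_content_hash_value": "<sha256 of the zip>",
    "url": "https://example.com/my-sparse-tokenizer.zip"
}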

What are we going to do?

Implement a customized sparse tokenizer that reads its config files from the local file system. This makes tokenizer registration more flexible. Some service providers place stricter restrictions on customized TorchScript files, while the sparse tokenizer doesn't need to interact with TorchScript at run time. We need a registration option that explicitly excludes TorchScript resources, to achieve more fine-grained security control for different model types. Besides, we also avoid uploading a zipped file, since these config files are much smaller than model weights.

User Experience

Here are different ways to implement the API. The tricky part is that this feature currently only applies to tokenizer registration; the new fields do not generalize to other model types.

Option 1: register API + put new fields at model config (preferred)

POST /_plugins/_ml/models/_register
{
    "name": "my custom tokenizer",
    "function_name": "SPARSE_TOKENIZE",
    "model_group_id": "Z1eQf4oB5Vm0Tdw8EIP2",
    "model_config": {
        "tokenizer_config_file": "/path/to/tokenizer.json",
        "idf_file": "/path/to/idf.json"
    }
}

For this option we implement a new class named SparseTokenizerModelConfig and put these fields in the body of model_config. This way we don't need to alter the registration logic of other model types.
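
Like other register calls, this request would presumably return a task that can then be polled for the resulting model_id; the IDs below are placeholders:

{
    "task_id": "aVeif4oB5Vm0Tdw8zYO2",
    "status": "CREATED"
}

GET /_plugins/_ml/tasks/aVeif4oB5Vm0Tdw8zYO2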

Option 2: register API + put new fields at top-level request body

POST /_plugins/_ml/models/_register
{
    "name": "my custom tokenizer",
    "function_name": "SPARSE_TOKENIZE",
    "model_group_id": "Z1eQf4oB5Vm0Tdw8EIP2",
    "tokenizer_config_file": "/path/to/tokenizer.json",
    "idf_file": "/path/to/idf.json"
}

For this option we put the new fields in the top-level request body of the register model API. The con is that we'll have redundant fields in the register request objects of other model types.

Option 3: register the tokenizer from train model API

POST /_plugins/_ml/_train/sparse_tokenize
{
    "parameters": {
        "tokenizer_config_file": "/path/to/tokenizer.json",
        "idf_file": "/path/to/idf.json"
    }
}

For this option we can initialize the tokenizer object using the train API. This option requires more code changes and special-case logic.
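
Since the train API is synchronous for some algorithms, this option could presumably return the created model directly; the response below is hypothetical:

{
    "status": "COMPLETED",
    "model_id": "lje9f4oB5Vm0Tdw8KIP3"
}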

@xinyual (Collaborator) commented Oct 25, 2024

For options 1 and 2, only a function_name of "SPARSE_TOKENIZE" will work, right?

@zhichao-aws (Member, Author)

> For options 1 and 2, only a function_name of "SPARSE_TOKENIZE" will work, right?

Yes, we should also include the function name field. Edited.

@ylwu-amzn (Collaborator)

This will be challenging for security. Suggest consulting with security experts first.

@zane-neo (Collaborator)

Option 1 seems better. Also a question: do we need to support registration from a network file system? Placing files on production machines could be a maintenance overhead.

@zhichao-aws (Member, Author)

> Option 1 seems better. Also a question: do we need to support registration from a network file system? Placing files on production machines could be a maintenance overhead.

I think supporting a network file system will bring more security challenges. I know some service providers implement mechanisms to upload files to clusters, e.g. custom packages on AOS. It is a common use case for customizing analyzers from synonym files.
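
For context, synonym files already follow this pattern in index settings; the index name and file path below are just examples:

PUT /my-index
{
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    "synonyms_path": "analyzers/synonyms.txt"
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_synonyms"]
                }
            }
        }
    }
}

Here synonyms_path is resolved relative to the node's config directory.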

@yuye-aws (Member)

Good to see you creating this RFC! Can we also support pre-trained tokenizers just like pre-trained models? This would benefit users of both dense and sparse models.

@zhichao-aws (Member, Author)

> Can we also support pre-trained tokenizers just like pre-trained models?

Sorry, I don't get your point. I think our pre-trained tokenizer can currently be used just like other DL models. What else do we need to do?

@yuye-aws (Member)

> Sorry, I don't get your point. I think our pre-trained tokenizer can currently be used just like other DL models. What else do we need to do?

Wait a bit. Do you mean we can currently use the pre-trained tokenizer without specifying tokenizer_config_file and idf_file?

@zane-neo (Collaborator)

> I think supporting a network file system will bring more security challenges. I know some service providers implement mechanisms to upload files to clusters, e.g. custom packages on AOS.

Agreed this will bring more security challenges, but from the open-source perspective this might be a valid use case; a configuration setting could be introduced to control the behavior in open source and AOS separately. This is not high priority; it's fine to implement it in a future release.

@zhichao-aws (Member, Author)

> Wait a bit. Do you mean we can currently use the pre-trained tokenizer without specifying tokenizer_config_file and idf_file?

Yes, we can use the model API to register & deploy the pre-trained tokenizer:

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

The tokenizer_config_file and idf_file approach is the new mechanism we want to introduce.

@yuye-aws (Member)

> Yes, we can use the model API to register & deploy the pre-trained tokenizer: [...] The tokenizer_config_file and idf_file approach is the new mechanism we want to introduce.

From the documentation, this API registers a model. I think customers also need a lightweight method to register a pre-trained tokenizer, where only a tokenizer is needed.

> To register a sparse tokenizer, users can choose the pre-trained tokenizer provided by OpenSearch, or build a zip file containing tokenizer.json and idf.json and register it from a URL.

I am assuming that this RFC and all three options are about the tokenizer, not about models in general.

@zhichao-aws (Member, Author)

> I think customers also need a lightweight method to register a pre-trained tokenizer, where only a tokenizer is needed.

Do you mean the inner Hugging Face tokenizer rather than the SPARSE_TOKENIZE model? Currently users don't interact with it directly in neural sparse search, so it's out of scope for this RFC. Please feel free to create a dedicated RFC to support it if needed.

@yuye-aws (Member)

> Do you mean the inner Hugging Face tokenizer rather than the SPARSE_TOKENIZE model? Currently users don't interact with it directly in neural sparse search, so it's out of scope for this RFC.

Thanks for clarifying the distinction between the Hugging Face tokenizer and the SPARSE_TOKENIZE model. As for the third option, the train API does not make sense to me. Creating a tokenizer does not follow the current documentation of the train API: https://opensearch.org/docs/latest/ml-commons-plugin/api/train-predict/train/.

mingshl added the feature label and removed the untriaged label on Nov 5, 2024
mingshl moved this from Untriaged to In Progress in ml-commons projects on Nov 5, 2024
@brianf-aws (Contributor)

Hey @zhichao-aws, I was doing something similar, mounting a file within the request for image support in ml-commons; see #3152. I am curious to understand how this works regarding the following:

  • How do you get around having permission to read a file?
  • Where exactly is the file being stored? (Is this within a cluster? If so, the path will differ from what a user expects.)

Currently I am at a crossroads about whether we should implement a local file path for image support, so I'm looking forward to seeing how you lead this feature to gain some insight.

@zhichao-aws (Member, Author)

> • How do you get around having permission to read a file?
> • Where exactly is the file being stored?

Hi @brianf-aws, OpenSearch already has features to read synonyms from local files: https://github.com/opensearch-project/OpenSearch/blob/4213cc27305c37ea71e5b5a5addd17e5383e8029/server/src/main/java/org/opensearch/index/analysis/Analysis.java#L321.

For the permission issue, we need to use a relative path or set the CONF_DIR (ref).

In my POC for the feature, the files are saved under the config directory of the cluster node.
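
For example, with the Option 1 API a POC request would reference paths relative to the config directory; the directory and file names below are illustrative:

POST /_plugins/_ml/models/_register
{
    "name": "my custom tokenizer",
    "function_name": "SPARSE_TOKENIZE",
    "model_group_id": "Z1eQf4oB5Vm0Tdw8EIP2",
    "model_config": {
        "tokenizer_config_file": "tokenizer_files/tokenizer.json",
        "idf_file": "tokenizer_files/idf.json"
    }
}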
