Pass model and tokenizer arguments when instantiating SentenceTransformer #1103

Closed

aphedges
Contributor

Fixes #794.

The Transformer class already accepts model_args and tokenizer_args in its constructor, which allow overriding the arguments used when instantiating models and tokenizers from the transformers library. However, there is no way to set these from the SentenceTransformer class, which is the one most users instantiate instead. I have added arguments to the SentenceTransformer constructor so that these overrides are easier to use.

My specific motivation in this case is to make disabling the fast tokenizer easier, but it should work for any arguments that users might want to override.
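
For context, here is a minimal sketch of what disabling the fast tokenizer looks like without this change, by building the modules manually and passing tokenizer_args to Transformer. The helper names slow_tokenizer_args and build_model are hypothetical, and the mean pooling layer is an assumption for illustration only; a pretrained model may have been saved with a different pooling configuration or additional modules, which this sketch does not reproduce.

```python
def slow_tokenizer_args() -> dict:
    """Tokenizer overrides that transformers forwards to AutoTokenizer.from_pretrained."""
    # use_fast=False requests the pure-Python tokenizer instead of the Rust-backed one.
    return {"use_fast": False}


def build_model(model_name: str):
    """Assemble a SentenceTransformer from modules so tokenizer_args can be set.

    Hypothetical helper: mean pooling is assumed here and may not match
    the configuration a given pretrained model was saved with.
    """
    # Imported lazily so slow_tokenizer_args() is usable without the library installed.
    from sentence_transformers import SentenceTransformer, models

    # Transformer already exposes model_args and tokenizer_args in its constructor.
    transformer = models.Transformer(model_name, tokenizer_args=slow_tokenizer_args())
    pooling = models.Pooling(
        transformer.get_word_embedding_dimension(), pooling_mode="mean"
    )
    return SentenceTransformer(modules=[transformer, pooling])
```

With this PR, the same override is meant to be passed directly to the SentenceTransformer constructor rather than assembling the modules by hand.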

@nreimers
Member

Thanks for the PR. I will have a look after my vacation (after Aug. 23rd)

@aphedges
Contributor Author

@nreimers, I'm wondering if you have time to look at this PR now.

@nreimers
Member

nreimers commented Sep 1, 2021

@aphedges Sadly not yet, sorry for that. I'm currently working on a major release of new models. Once that is done, I will have a look at the PRs and integrate new functions into SBERT.

@aphedges
Contributor Author

@nreimers, understood! Thanks for the update!

@aphedges force-pushed the 794-sentence-transformer-model-args branch from 412c1c2 to 121fec1 on November 4, 2021 17:23
@aphedges
Contributor Author

@nreimers, sorry to keep bothering you about this, but I was hoping you could take a look at this PR. If there is anything I can do to make it easier to review (e.g. splitting it up into multiple PRs), please let me know.

@nreimers
Member

Sorry that I have not looked into it yet.

Could you write a bit more about the use case? Is it to disable the fast tokenizer when running SentenceTransformers in a multi-process setup? The linked tokenizer issue suggests that this is still a problem, correct?

@aphedges
Contributor Author

It's okay! I know you have other priorities when it comes to this library.

Your recounting of the use case is correct. My specific use case is running SentenceTransformers in a multi-threaded setup (e.g. Flask). As far as I'm aware, the underlying problem will always be present because of differences between the CPython and Rust runtimes. This allows me to disable the fast tokenizer by just passing in a parameter.

I specifically made this fix more general than needed to allow any model or tokenizer arguments to be passed when the model and tokenizer are loaded. There are probably other use cases for this PR that I am unaware of.

@aphedges
Contributor Author

aphedges commented Dec 4, 2021

@nreimers, just pinging this again.

@aphedges force-pushed the 794-sentence-transformer-model-args branch from 121fec1 to 94247f2 on November 15, 2023 21:12
@aphedges
Contributor Author

I have rebased this branch against master to resolve conflicts.

@tomaarsen
Collaborator

Hello!

Apologies for the delay. This has been implemented via #2578 and released in the v3.0 release last week. See the documentation for it here.

  • Tom Aarsen

@tomaarsen closed this on Jun 4, 2024