Pass model and tokenizer arguments when instantiating `SentenceTransformer` #1103

aphedges · 2021-08-10T19:12:22Z

Fixes #794.

The `Transformer` class already has `model_args` and `tokenizer_args` arguments in its constructor to allow overriding when instantiating models and tokenizers from the transformers library. However, there is no way to set these from the `SentenceTransformer` class, which is the one that many users would be instantiating instead. I have added arguments to the `SentenceTransformer` constructor to allow these to be more easily used.

My specific motivation in this case is to make disabling the fast tokenizer easier, but it should work for any arguments that users might want to override.
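As a rough illustration of the forwarding described above, here is a minimal sketch. `TransformerModule` and `SentenceEncoder` are hypothetical stand-ins written for this example, not the library's classes, and the argument values are arbitrary. It also uses `None` defaults rather than mutable dict defaults, in line with the "Replace mutable defaults" commit in this PR.

```python
from typing import Any, Dict, Optional


class TransformerModule:
    """Hypothetical stand-in for the inner module that loads the model and tokenizer."""

    def __init__(
        self,
        model_name: str,
        model_args: Optional[Dict[str, Any]] = None,
        tokenizer_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.model_name = model_name
        # None defaults plus a copy avoid sharing one mutable dict between instances.
        self.model_args = dict(model_args or {})
        self.tokenizer_args = dict(tokenizer_args or {})


class SentenceEncoder:
    """Hypothetical stand-in for the user-facing wrapper class."""

    def __init__(
        self,
        model_name: str,
        model_args: Optional[Dict[str, Any]] = None,
        tokenizer_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        # Forward the overrides unchanged to the inner module.
        self.transformer = TransformerModule(
            model_name, model_args=model_args, tokenizer_args=tokenizer_args
        )


encoder = SentenceEncoder(
    "some-model",  # placeholder name, nothing is downloaded in this sketch
    model_args={"output_hidden_states": True},
    tokenizer_args={"model_max_length": 128},
)
print(encoder.transformer.model_args)
print(encoder.transformer.tokenizer_args)
```

The point of the sketch is only that the outer constructor accepts the two dictionaries and hands them to the inner module, so callers never need to build the inner module themselves.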

nreimers · 2021-08-12T19:26:22Z

Thanks for the PR. I will have a look after my vacation (after Aug. 23rd)

aphedges · 2021-08-31T21:17:47Z

@nreimers, I'm wondering if you have time to look at this PR now.

nreimers · 2021-09-01T06:56:46Z

@aphedges Sadly not yet, sorry for that. I'm currently working on a major release of new models. Once that is done, I will have a look at the PRs and integrate new functions into SBERT.

aphedges · 2021-09-16T03:06:31Z

@nreimers, understood! Thanks for the update!

aphedges · 2021-11-12T19:15:35Z

@nreimers, sorry to keep bothering you about this, but I was hoping you could take a look at this PR. If there is anything I can do to make it easier to review (e.g. splitting it up into multiple PRs), please let me know.

nreimers · 2021-11-13T19:12:07Z

Sorry that I have not looked into it yet.

Could you write a bit more about the use case? Is it to disable the fast tokenizer when running SentenceTransformers in a multi-process setup? Judging by the linked issue in tokenizers, this still appears to be a problem, right?

aphedges · 2021-11-13T19:41:06Z

It's okay! I know you have other priorities when it comes to this library.

Your recounting of the use case is correct. My specific use case is running SentenceTransformers in a multi-threaded setup (e.g. Flask). As far as I'm aware, the underlying problem will always be present because of differences between the CPython and Rust runtimes. This PR allows me to disable the fast tokenizer by just passing in a parameter.

I specifically made this fix more general than needed to allow any model or tokenizer arguments to be passed when the model and tokenizer are loaded. There are probably other use cases for this PR that I am unaware of.
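For the fast-tokenizer case specifically, the override amounts to passing `use_fast=False` through to `AutoTokenizer.from_pretrained` in the transformers library. The sketch below shows that under stated assumptions: the helper names `slow_tokenizer_args` and `load_slow_tokenizer` are made up for this example, and the second helper requires the transformers package and access to the model files, so it is defined here but not called.

```python
from typing import Any, Dict, Optional


def slow_tokenizer_args(tokenizer_args: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a copy of the given tokenizer arguments with the fast tokenizer disabled."""
    merged = dict(tokenizer_args or {})
    merged["use_fast"] = False  # ask for the pure-Python tokenizer instead of the Rust one
    return merged


def load_slow_tokenizer(model_name: str, tokenizer_args: Optional[Dict[str, Any]] = None):
    """Load a slow tokenizer (hypothetical helper; needs transformers and the model files)."""
    # Imported lazily so that slow_tokenizer_args works without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name, **slow_tokenizer_args(tokenizer_args))


# Other tokenizer options are preserved; only use_fast is forced to False.
print(slow_tokenizer_args({"do_lower_case": True}))
```

Copying the dictionary before modifying it keeps the caller's arguments untouched, which matters when the same dictionary is reused to load several models.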

aphedges · 2021-12-04T18:45:21Z

@nreimers, just pinging this again.

aphedges · 2023-11-15T21:13:17Z

I have rebased this branch against master to resolve conflicts.

tomaarsen · 2024-06-04T21:50:53Z

Hello!

Apologies for the delay. This has been implemented via #2578 and released in the v3.0 release last week. See the documentation for it here.

Tom Aarsen

aphedges mentioned this pull request Aug 10, 2021

Experiencing https://github.com/huggingface/tokenizers/issues/537 issue when sentence-transformer is used for generating embeddings #794

Open

aphedges force-pushed the 794-sentence-transformer-model-args branch from 412c1c2 to 121fec1 Compare November 4, 2021 17:23

aphedges added 2 commits November 15, 2023 16:07

Replace mutable defaults in Transformers.__init__

12d8f38

Pass model and tokenizer arguments to Transformers

94247f2

aphedges force-pushed the 794-sentence-transformer-model-args branch from 121fec1 to 94247f2 Compare November 15, 2023 21:12

tomaarsen closed this Jun 4, 2024
