
Experiencing https://github.com/huggingface/tokenizers/issues/537 issue when sentence-transformer is used for generating embeddings #794

Open · MyBruso opened this issue Mar 5, 2021 · 20 comments

MyBruso commented Mar 5, 2021

Hello,
I am using sentence-transformers to generate text embeddings via SentenceTransformer.encode().
This function is invoked from a multi-threaded program.
Sometimes the encode method fails with an 'Already borrowed' exception.
The issue is the same as, or similar to, huggingface/tokenizers#537.
My sentence-transformers version is 0.4.1.2.

Which version of sentence-transformers does not have this issue? Alternatively, what is the recommended way to resolve it?

nreimers (Member) commented Mar 5, 2021

Sentence-Transformers uses the Hugging Face code. Check the linked issue for solutions:

  • Multi-threading usually does not make sense in Python (due to the global interpreter lock). It is better to design a multi-process architecture in Python.
  • If you still stick to multi-threading, ensure that each thread has its own SentenceTransformer object. Sharing the object between threads leads to the linked issue.
  • You can also avoid the Rust-based fast tokenizer from Hugging Face and use the slow Python-based tokenizer instead:

In the model you are using, there is a folder 0_Transformer, which contains a tokenizer_config.json. Add a new entry to that JSON: "use_fast": false

This should load the slower Python tokenizer.
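
For illustration, a minimal sketch of what the edited tokenizer_config.json would contain; any existing entries vary by model and are omitted here:

{
  "use_fast": false
}

(As a later comment in this thread reports, this particular file may be read too late to take effect; see below.)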

MyBruso (Author) commented Mar 5, 2021

Thank you @nreimers, I will try using "use_fast": false in tokenizer_config.json.

The multi-threaded consumers can certainly be looked into; however, I do not have control over them.

In my service, a single model is used for generating embeddings, so I load it once and serve embeddings from that same model. Some of the clients are multi-threaded, and that is where the issue occurs.

aphedges (Contributor) commented Apr 9, 2021

I am in a similar situation to @MyBruso, and I am encountering the same problem.

@nreimers' suggestion of setting "use_fast": false in tokenizer_config.json did not work for me, unfortunately; the code path for the fast tokenizer is still being used. I will try loading a separate SentenceTransformer object in each thread, or see if I can figure out why the suggested config fix isn't working. I'll update here if I find anything useful.

@HariWu1995

@nreimers Excuse me, I have tried use_fast = false as you suggested, but the error still happens.
It is somewhat weird, though: when I train with tensorflow.keras.losses.categorical_crossentropy, it runs without error. The bug only happens when I use my customized loss functions, as below:

import numpy as np
from tensorflow.keras import backend as K

def weighted_cross_entropy(y_true, y_pred, pos_weight=1.618):
  # Binary cross-entropy with extra weight on the positive class,
  # clipped to keep individual losses from exploding
  losses = y_true * -K.log(y_pred) * pos_weight + (1 - y_true) * -K.log(1 - y_pred)
  losses = K.clip(losses, 0.0, 9.7)
  return K.mean(losses)

or

def categorical_focal_loss(n_classes, alpha: float=0.25, gamma: float=2.):
  alpha = np.array([[alpha]*n_classes], dtype=np.float32)
  def categorical_focal_loss_fixed(y_true, y_pred):
    # Clip the prediction value to prevent NaN's and Inf's
    epsilon = K.epsilon()
    y_pred = K.clip(y_pred, epsilon, 1.-epsilon)
    # Calculate Cross Entropy
    cross_entropy = -y_true * K.log(y_pred)
    # Calculate Focal Loss
    loss = alpha * K.pow(1-y_pred, gamma) * cross_entropy
    # Compute mean loss in mini_batch
    return K.mean(K.sum(loss, axis=-1))
  return categorical_focal_loss_fixed

I have used no library other than tensorflow and keras, so I have no idea what's wrong!

@aphedges (Contributor)

I spent some time inspecting the code, and I figured out a solution similar to the one @nreimers suggested that actually works for me: add "tokenizer_args": {"use_fast": false} to sentence_bert_config.json in 0_Transformer. The problem is that transformers decides whether to use the fast or the slow tokenizer before it even reads tokenizer_config.json, so adding the configuration there doesn't help.

For context, I am using sentence-transformers==1.0.3 and transformers==4.4.2. The model was trained with sentence-transformers==0.3.6 and transformers<=3.3.1 (the exact transformers version used is not documented).
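
For illustration, a minimal sketch of the edited sentence_bert_config.json; the max_seq_length entry is an example of an existing key and differs per model:

{
  "max_seq_length": 128,
  "tokenizer_args": {"use_fast": false}
}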

@nreimers (Member)

@aphedges Great, thanks for posting the solution here

@aphedges (Contributor)

You're welcome! I probably wouldn't have figured it out without your initial comment, so thank you.

I hope transformers/tokenizers fixes the underlying bug at some point, but this fix should be sufficient for now.

MyBruso (Author) commented Apr 14, 2021

Thank you for sharing your findings, @aphedges. By the way, I am using the pretrained distilbert-base-nli-stsb-mean-tokens model from sentence-transformers. I will try adding "tokenizer_args": {"use_fast": false} to sentence_bert_config.json.

MyBruso (Author) commented Apr 20, 2021

Thank you @nreimers and @aphedges. I am not experiencing the 'Already borrowed' error anymore.

Since we have touched on the topic of using a 'SentenceTransformer' object from a multi-threaded program, I am asking here: is there any way to improve the response time of the encode() (get embeddings) API when it is used from a multi-threaded server?

  • I have tried using a separate 'SentenceTransformer' object in each thread; however, this worsens the per-request response time, as the model loading time becomes overhead for each request.
  • I have seen a drastic improvement in response time when the server is single-threaded, but that is not an option, as requests start piling up and the overall response time on the client side grows.

@nreimers (Member)

Hi @MyBruso,
Have a look at these two articles:
https://huggingface.co/blog/bert-cpu-scaling-part-1
https://robloxtechblog.com/how-we-scaled-bert-to-serve-1-billion-daily-requests-on-cpus-d99be090db26

Instead of threads you should use processes. Each process should have the model loaded, and for each process you should limit the number of threads it can use; a sketch follows.
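
A minimal sketch of that pattern, assuming CPU inference; the pool size, thread limit, and model name are placeholders to adjust for your hardware:

import multiprocessing as mp

_model = None  # one model instance per worker process

def init_worker():
    # Runs once per worker: cap intra-op threads, then load the model
    global _model
    import torch
    from sentence_transformers import SentenceTransformer
    torch.set_num_threads(1)
    _model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

def encode(sentences):
    return _model.encode(sentences)

if __name__ == "__main__":
    with mp.Pool(processes=4, initializer=init_worker) as pool:
        embeddings = pool.map(encode, [["first request"], ["second request"]])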

@jkhalsa-arabesque

any news?

MyBruso (Author) commented May 21, 2021

> any news?

@jkhalsa-arabesque I assume you are asking whether I managed to use the model in a multi-process environment. The answer is that there is a limitation on my end that prevents using a multi-process server.

@jkhalsa-arabesque

> @jkhalsa-arabesque I assume you are asking whether I managed to use the model in a multi-process environment. The answer is that there is a limitation on my end that prevents using a multi-process server.

I was actually curious whether the underlying issue has been resolved.
With previous versions of transformers and tokenizers I didn't have this problem. When I am using CUDA, it just seg-faults.

MyBruso (Author) commented May 21, 2021

> I was actually curious whether the underlying issue has been resolved.
> With previous versions of transformers and tokenizers I didn't have this problem. When I am using CUDA, it just seg-faults.

If you are referring to the 'Already borrowed' exception, I am not observing it anymore; the solution is in this comment above.

As for whether it has been fixed in the transformers library, I do not have any updates on that.

@donderom

If the pretrained model's name is known, this can also be done programmatically:

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

model_name = "paraphrase-distilroberta-base-v2"
model = SentenceTransformer(model_name)
# Swap in the slow Python tokenizer in place of the Rust-based fast one
model.tokenizer = AutoTokenizer.from_pretrained(f"sentence-transformers/{model_name}", use_fast=False)
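
Note that this relies on SentenceTransformer exposing its tokenizer as a writable attribute; that holds for the versions discussed in this thread, but it is worth re-checking after upgrades.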

@aphedges (Contributor)

Recently, there was a fix in transformers (huggingface/transformers#12550) that greatly reduced, but did not completely eliminate, this error. The PR says that it's impossible to fully prevent the error because of how the Python-Rust integration works. The fix should be in v4.9.0 or later.

The only way to truly prevent the error is to not use the Rust-based tokenizer, but the workarounds posted in this thread are annoying to use. I have just created a PR (#1103) to make disabling the fast tokenizer easier. If my PR is merged, then something as simple as SentenceTransformer("sentence-transformers/stsb-roberta-base-v2", tokenizer_args={"use_fast": False}) would disable it.

@CaptXiong

> Recently, there was a fix in transformers (huggingface/transformers#12550) that greatly reduced, but did not completely eliminate, this error. […]
>
> The only way to truly prevent the error is to not use the Rust-based tokenizer […] I have just created a PR (#1103) to make disabling the fast tokenizer easier.

I am just curious which one is faster: disabling the fast tokenizer, or the fix in transformers that reduces the borrowing problem?

aphedges (Contributor) commented Dec 3, 2021

@CaptXiong, I'm not sure what you are asking. Whether it is faster to use the slow tokenizer (by disabling the fast tokenizer) or to use the fast tokenizer (with the fix that reduces the borrowing problem)? I would hope the fast tokenizer is still much faster, but I have not benchmarked it myself.

CaptXiong commented Dec 6, 2021

@aphedges Thank you! I referenced your code to rewrite the SentenceTransformer class, and the bug didn't come up again.
I have also tested the speed of the fast tokenizer against the slow one; there is not much difference between them for single-sentence encoding.

gsakkis commented Dec 18, 2023

Is there still interest in this? If so, I can post a PR fixing it by storing the tokenizer in thread-local storage.
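
For illustration, a minimal user-land sketch of that idea (not an actual PR): each thread lazily gets its own deep copy of the tokenizer, so the shared Rust object is never borrowed by two threads at once. It assumes model.tokenizer can be reassigned, as in the snippet earlier in this thread, and it trades memory for thread safety:

import copy
import threading

from sentence_transformers import SentenceTransformer

class ThreadLocalTokenizer:
    """Delegates every call to a per-thread deep copy of a template tokenizer."""

    def __init__(self, tokenizer):
        self._template = tokenizer
        self._local = threading.local()

    def _instance(self):
        # Create this thread's private copy on first use
        if not hasattr(self._local, "tokenizer"):
            self._local.tokenizer = copy.deepcopy(self._template)
        return self._local.tokenizer

    def __call__(self, *args, **kwargs):
        return self._instance()(*args, **kwargs)

    def __getattr__(self, name):
        return getattr(self._instance(), name)

model = SentenceTransformer("paraphrase-distilroberta-base-v2")
model.tokenizer = ThreadLocalTokenizer(model.tokenizer)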
