[Fix #30] Fix multithread race condition on global variable _factory #33
base: master
This PR adds a lock to avoid a race condition on the global variable _factory. As soon as the line '_factory = DetectorFactory()' has executed, the other threads skip the check 'if _factory is None' and proceed through 'create' and '_create_detector' until they reach the line 'if not self.langlist'.
The thread that created the global DetectorFactory has not necessarily had time to populate the language profiles, so the other threads raise LangDetectException until the langlist is no longer empty. A non-empty langlist is not safe either: at that point it may contain just one language.
The global variable exists to avoid reloading the profiles again and again (which can take some time), so I'm leaving it global. The lock ensures that other threads only ever use the fully loaded object, never a yet-to-be-populated langlist.
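The diff itself is not reproduced in this conversation, so the following is only a minimal sketch of the idea described above, not the PR's actual code. `FakeFactory`, `load_profiles`, `get_factory` and `run_workers` are hypothetical stand-ins; only `_factory` and `langlist` are names taken from the description. The key point is that the object is fully populated before it is assigned to the global, and the assignment happens under a lock.

```python
import threading
import time


class FakeFactory:
    """Hypothetical stand-in for DetectorFactory (not the library's class)."""

    def __init__(self):
        self.langlist = []

    def load_profiles(self):
        # Simulate slow profile loading, one language at a time.
        for lang in ("en", "fr", "de"):
            time.sleep(0.01)
            self.langlist.append(lang)


_factory = None
_factory_lock = threading.Lock()


def get_factory():
    """Return the shared factory, publishing it only once it is fully loaded."""
    global _factory
    if _factory is None:  # fast path: no locking once initialised
        with _factory_lock:
            if _factory is None:  # re-check: another thread may have won the race
                factory = FakeFactory()
                factory.load_profiles()  # populate before publishing
                _factory = factory
    return _factory


def run_workers(n=10):
    """Start n threads and record the langlist size each one observes."""
    seen = []

    def worker():
        seen.append(len(get_factory().langlist))

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

With this arrangement every worker sees the complete list, because no thread can observe `_factory` between construction and population.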
Thanks for the catch.
I think using threading.local (https://docs.python.org/2/library/threading.html#threading.local) would be more appropriate, since the rest of the Factory code is not thread-safe either.
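For readers unfamiliar with the suggestion, here is a rough sketch of what a `threading.local` approach could look like. It is an illustration under assumptions, not code from the repository: `PerThreadFactory`, `get_thread_factory` and `collect_factories` are hypothetical names. Each thread lazily builds and keeps its own private instance, so no instance is ever shared between threads.

```python
import threading


class PerThreadFactory:
    """Hypothetical stand-in for the real factory (assumption for this sketch)."""

    def __init__(self):
        # Pretend the language profiles are loaded here.
        self.langlist = ["en", "fr", "de"]


_local = threading.local()


def get_thread_factory():
    """Return the calling thread's private factory, creating it on first use."""
    factory = getattr(_local, "factory", None)
    if factory is None:
        factory = PerThreadFactory()
        _local.factory = factory
    return factory


def collect_factories(n=4):
    """Run n threads and return the factory object each one ended up with."""
    factories = []

    def worker():
        factories.append(get_thread_factory())

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return factories
```

The trade-off is that every thread pays the loading cost and holds its own copy of the profiles.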
Hi,

Sorry for the (enormous) delay. I've been very busy lately.

However, the two main functions (

So, if you make a

However, I still think a global lock should be used, this time on both _messages and _factory:

And you can say: 40M is nothing for today's RAM. However, there's a second and more annoying problem: the initial delay to load all the profiles into RAM. As you know, Python has a GIL, and if the programmer's threads are not working around it, he will effectively wait

Using only 10 threads, this time is around 0.5 sec. using the global lock (my initial solution), but more than 5 seconds using

Also, the variable used by each thread is forgotten when the job finishes. So there's no reuse of the profiles loaded in the past if the threads are killed and re-launched: think of a program which, instead of keeping daemon threads running permanently, uses a pool of threads (to speed things up) launched in a callback of a rare event to spare resources. The

So I'm in favor of using

And thanks for the repo. :)

NB: In the screenshots below I'm executing the solution with the global lock just after starting ipython and importing your module, so I'm not reusing the global variable (i.e. I'm making a fair comparison!). Subsequent calls using the global lock make the time drop by more than half (0.19s instead of 0.5s), so my C++ soul is in peace. 😉
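The reuse argument in the comment above can be illustrated with a small load counter. This is a sketch under assumptions, not a benchmark of langdetect: `_load`, `shared_profiles`, `local_profiles`, `run_batch` and `compare` are hypothetical names, and the "profiles" are a dummy list. It only counts how many times the expensive load would run when short-lived threads are launched in several batches.

```python
import threading

load_count = {"global": 0, "local": 0}
_count_lock = threading.Lock()

_shared = None
_shared_lock = threading.Lock()
_local = threading.local()


def _load(kind):
    """Hypothetical stand-in for the expensive profile load; counts each call."""
    with _count_lock:
        load_count[kind] += 1
    return ["en", "fr", "de"]


def shared_profiles():
    """Global variable guarded by a lock: loaded once for the whole process."""
    global _shared
    with _shared_lock:
        if _shared is None:
            _shared = _load("global")
    return _shared


def local_profiles():
    """threading.local storage: loaded once per thread, lost when it exits."""
    if not hasattr(_local, "profiles"):
        _local.profiles = _load("local")
    return _local.profiles


def run_batch(fn, n):
    """Launch n short-lived threads that each call fn once, then wait for them."""
    threads = [threading.Thread(target=fn) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def compare(batches=3, threads_per_batch=10):
    """Run both strategies over several batches and return the load counts."""
    for _ in range(batches):
        run_batch(shared_profiles, threads_per_batch)
        run_batch(local_profiles, threads_per_batch)
    return dict(load_count)
```

On a first call with 3 batches of 10 threads, the locked global is loaded once in total, while the thread-local version is loaded once per thread, because each new thread starts with empty thread-local storage.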
Will this be merged in the near future? In the meantime, maybe this can be of use:
Sadly, it seems the author has abandoned this repo. In that case it would be thoughtful of him to share control over it; otherwise the open PRs and issues will remain unanswered forever. Regrettably, this is a very common problem in open source.
@danilo-augusto Hey, I am still facing the same issue you raised. I've been calling detect() on my Hugging Face datasets through datasets.map() in a multithreaded setup, and I am running into the same problems.
Any update on langdetect?