Thread explosion #50
TL;DR: the last thing I'd expect is that commenting out the code you did would prevent an explosion in the number of used threads. That is, I can understand that my code failed to prevent a thread explosion or failures, given the vagaries of the OpenBLAS / OpenMP implementations. But I can't see how this code, which only reduces the number of used threads, causes an increase in the number of used threads.

Gory details: The whole point of the commented-out lines is to reduce the amount of parallelism used in nested Python processes. That is, we start with the top-level Python process, and when we call a parallel loop, we fork worker sub-processes. Because we are forking, all the (typically very large) arrays are available "for free" (well, except for the cost of a fork, of course). However, anything we compute in the worker processes has to be copied back to the main process (the multi-processing shared memory arrays in Python were very flaky when I tried to use them to avoid this - a saga all of its own). This is clunky, but ended up being the least-bad approach I could find to get performance given the code is written in Python.

Naively, since each forked worker is a brand-new standalone process, if you use parallelism inside it (e.g. OpenMP or whatever), it will try to take over all the parallel cores in the machine. So on a machine with N cores you will end up with N^2 active threads, also known as "having a very bad day". This seems to be what you are seeing. (As a side note, N should be the number of physical rather than logical cores - you get no performance boost from the hyper-threads for dense computation code. In fact, you typically lose performance, because you are increasing cache pressure and memory footprint.)

At any rate, the intent of the code you commented out is to say "if I am in a worker process, restrict the amount of parallelism I am using to only my fair share of the machine". It also gives the worker processes nice names for logging etc., but that's a side benefit.

This is somewhat fragile, because OpenMP really hates the idea of (1) having a process use parallelism in OpenMP; (2) calling fork (while not using OpenMP); (3) using OpenMP again in the child process. The problem is that OpenMP creates its worker thread pool, and when forking, the child process doesn't have these threads, which makes sense. However, because we forked, the in-memory data structures OpenMP uses still think the threads exist, and hilarity ensues. A simple workaround would have been a way to reset OpenMP in the worker process, but OpenMP, in its infinite wisdom, does not provide such a function. Therefore in my own C++ extension code I do not use OpenMP. I manually use C++ threads, and I work around the problem by spinning up my threads anew for each parallel loop (yes, this is less efficient than it could have been, but it seems to work OK in practice).

So much for intent - I'm trying to figure out what is happening in practice in your case. Looking at https://groups.google.com/g/openblas-users/c/W6ehBvPsKTw I see that whether OpenBLAS uses OpenMP or pthreads "depends". Seeing issues like OpenMathLib/OpenBLAS#294 and OpenMathLib/OpenBLAS#240, it seems people are aware of the problem and have some workarounds that work sometimes? Hard to say. To figure things out, one would need to dynamically track the tree of processes and threads that is created, understand who is creating all these threads (OpenMP? OpenBLAS? Someone else?), and understand for what mysterious reason asking for fewer threads (in the commented-out code) actually triggers the creation of more threads. I'd start with ...

Bottom line: I've no idea what is happening in your case. My takeaway from this is that using parallelism in Python is a losing proposition. Perhaps the "just write it in Julia and call it from a sequential Python or R wrapper" approach is the least insane option after all. Sigh.
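To make the "fair share" intent concrete, here is a minimal, hypothetical sketch of the general approach. It is not the actual metacells code; the worker function, pile count, and variable names are invented. Each forked worker caps its own numeric-library parallelism via the standard OMP_NUM_THREADS and OPENBLAS_NUM_THREADS environment variables before doing any heavy work:

```python
# Minimal sketch of the "fair share" idea (not the metacells implementation).
import os
import multiprocessing as mp

def _init_worker(threads_per_worker: int) -> None:
    # Cap the parallelism of the numeric libraries inside this forked worker.
    # These environment variables are only reliably honored if the OpenMP /
    # OpenBLAS runtime has not already created its thread pool in the parent,
    # which is exactly the fragility described above.
    os.environ["OMP_NUM_THREADS"] = str(threads_per_worker)
    os.environ["OPENBLAS_NUM_THREADS"] = str(threads_per_worker)

def _compute_pile(pile_index: int) -> int:
    # Placeholder for the real per-pile computation (hypothetical).
    return pile_index * pile_index

if __name__ == "__main__":
    # os.cpu_count() reports logical cores; ideally use the physical core
    # count (e.g. psutil.cpu_count(logical=False)), as noted above.
    cores = os.cpu_count() or 1
    workers = cores
    threads_per_worker = max(1, cores // workers)
    ctx = mp.get_context("fork")  # fork, so large arrays are inherited "for free"
    with ctx.Pool(workers, initializer=_init_worker,
                  initargs=(threads_per_worker,)) as pool:
        results = pool.map(_compute_pile, range(50))
    print(results[:5])
```

Whether the caps actually take effect after the fork depends on whether the numeric libraries had already initialized their thread pools in the parent process, which is the OpenMP-after-fork problem described above.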
On powerful machines it is easy to run into an error like the following (see also issue #24):
My fix was to rewrite metacells/utilities/parallel.py as below. Importantly, this was actually faster on a beefy HPC cluster node (50 parallel piles, 32 CPUs, 220GB of RAM). Thus, I wanted to ask whether we could at least get a flag so that metacells runs code like the one below.
The changes below are the lines commented out in set_processors_count() and _invocation():
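A flag like the one requested could, in principle, be as simple as an environment-variable guard around the thread-limiting logic. This is only a hypothetical sketch: METACELLS_SKIP_NESTED_LIMIT and maybe_limit_worker_threads are invented names, not existing metacells options or functions.

```python
# Hypothetical sketch of an opt-out flag; METACELLS_SKIP_NESTED_LIMIT is an
# invented name, not an existing metacells setting.
import os

def maybe_limit_worker_threads(threads_per_worker: int) -> None:
    """Cap OpenMP/OpenBLAS threads in a forked worker, unless opted out."""
    if os.environ.get("METACELLS_SKIP_NESTED_LIMIT", "0") == "1":
        # Behave like the commented-out version: leave the numeric libraries'
        # thread counts alone and let them use their own defaults.
        return
    os.environ["OMP_NUM_THREADS"] = str(threads_per_worker)
    os.environ["OPENBLAS_NUM_THREADS"] = str(threads_per_worker)
```

With a guard like this, running with METACELLS_SKIP_NESTED_LIMIT=1 in the environment would reproduce the behavior of the commented-out code without having to patch parallel.py.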