multiprocessing with import_from_path cause ModuleNotFoundError on Mac #1677

Closed
ChangqingW opened this issue Sep 26, 2024 · 1 comment

ChangqingW commented Sep 26, 2024

When a Python script lives outside the current working directory and is loaded via import_from_path(), using multiprocessing inside it fails on macOS. For example, with
path/to/tmp.py:

from multiprocessing import Pool
def f(a):
    print(a)
    #print(sys.path)
    return a ** 2
def f2():
    p = Pool(5)
    rtn = p.map(f, [1,2,3,4,5])
    return rtn
if __name__ == "__main__":
    print(f2())

In R:

tmp <- reticulate::import_from_path('tmp', 'path/to')
tmp$f2()

This raises ModuleNotFoundError in every pool worker, because each worker tries to import the tmp module itself and cannot find it:

Process SpawnPoolWorker-1:
Process SpawnPoolWorker-2:
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/queues.py", line 368, in get
    return _ForkingPickler.loads(res)
ModuleNotFoundError: No module named 'tmp'
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/queues.py", line 368, in get
    return _ForkingPickler.loads(res)
ModuleNotFoundError: No module named 'tmp'
Process SpawnPoolWorker-4:
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/queues.py", line 368, in get
    return _ForkingPickler.loads(res)
ModuleNotFoundError: No module named 'tmp'
Process SpawnPoolWorker-3:
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/queues.py", line 368, in get
    return _ForkingPickler.loads(res)
ModuleNotFoundError: No module named 'tmp'
Process SpawnPoolWorker-5:
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/queues.py", line 368, in get
    return _ForkingPickler.loads(res)
ModuleNotFoundError: No module named 'tmp'
...
(Can't even Ctrl-C out of this...)

However, if you insert the module's directory into sys.path twice at the very top of the script:

import os
import sys
x = sys.path.pop(0) # it's there already, but one copy is not enough...
sys.path.insert(0, x) # append also works
sys.path.insert(0, x) # has to be twice
from multiprocessing import Pool
def f(a):
    print(a)
    return a ** 2
def f2():
    p = Pool(5)
    rtn = p.map(f, [1,2,3,4,5])
    return rtn
if __name__ == "__main__":
    print(f2())

then it works again.
I also tested on Linux (Ubuntu and RedHat) and both are fine. The problem only occurs on macOS.

Session Info:

> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-apple-darwin20
Running under: macOS Ventura 13.6.9

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

time zone: Australia/Melbourne
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] compiler_4.4.1    here_1.0.1        Matrix_1.7-0      rprojroot_2.0.4
 [5] cli_3.6.3         Rcpp_1.0.13       reticulate_1.39.0 grid_4.4.1
 [9] jsonlite_1.8.9    rlang_1.1.4       png_0.1-8         lattice_0.22-6
ChangqingW added a commit to ChangqingW/FLAMES-R that referenced this issue Sep 26, 2024
t-kalinowski (Member) commented

Thank you for the bug report! I can also reproduce this on macOS but not on Linux.

After some investigation, it turns out this issue arises from how multiprocessing.Pool sets up the worker pool. On macOS, the default method is ‘spawn’, while on Linux, it’s ‘fork’.
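To check which start method is in effect on a given machine, you can ask multiprocessing directly. This is a minimal sketch using only the standard library; the helper name `describe_start_method` is made up for illustration.

```python
import multiprocessing as mp


def describe_start_method():
    """Return the start method that a plain Pool() would use in this process."""
    # Note: calling get_start_method() without allow_none=True also fixes the
    # default context for the rest of the process, so call it after any
    # set_start_method() you intend to make.
    return mp.get_start_method()


if __name__ == "__main__":
    print("available:", mp.get_all_start_methods())
    print("in effect:", describe_start_method())
```

If this reports spawn, the workers are fresh interpreters rather than copies of the parent, which is the case discussed here.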

The way import_from_path() works is by temporarily modifying sys.path so that modules in the specified path can be located. Once these modules are found, sys.path is restored to its previous value.
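In pure Python terms, the behaviour described above is roughly equivalent to the following. This is only an illustration of the "modify sys.path, import, restore" pattern and not reticulate's actual implementation; `temporary_sys_path` and `import_from_directory` are hypothetical names.

```python
import importlib
import sys
from contextlib import contextmanager


@contextmanager
def temporary_sys_path(directory):
    """Put `directory` at the front of sys.path, then restore the old list."""
    saved = list(sys.path)
    sys.path.insert(0, directory)
    try:
        yield
    finally:
        # Restore the previous contents in place, so other references to
        # sys.path keep pointing at the same list object.
        sys.path[:] = saved


def import_from_directory(module_name, directory):
    """Hypothetical stand-in for import_from_path(); not reticulate's code."""
    with temporary_sys_path(directory):
        return importlib.import_module(module_name)
```

After `import_from_directory("tmp", "path/to")` returns, the module is cached in `sys.modules`, but `path/to` is no longer on `sys.path`.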

After the initial import_from_path() call, subsequent import calls to the same module work because the module is already loaded and reused from sys.modules["my_module"]. Forked processes can also execute import my_module even if my_module is no longer in sys.path because they inherit the same sys.modules as the parent process. However, newly spawned processes do not inherit this and require my_module to be discoverable in sys.path.
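The difference can be imitated without multiprocessing by starting a brand-new interpreter, which (like a spawned worker) begins with an empty module cache and can only import what it finds on its own sys.path. The sketch below is an analogy, not the multiprocessing machinery itself; `fresh_interpreter_can_import` and the throwaway `my_module.py` file are made up for the demonstration.

```python
import os
import subprocess
import sys
import tempfile


def fresh_interpreter_can_import(module_name, extra_path=None):
    """Try `import module_name` in a new interpreter; return True on success."""
    extra = [extra_path] if extra_path else []
    # The child gets no sys.modules from us. It only sees the directories we
    # explicitly prepend to its sys.path (plus its own defaults).
    code = (
        "import sys\n"
        f"sys.path[:0] = {extra!r}\n"
        f"import {module_name}\n"
    )
    result = subprocess.run([sys.executable, "-c", code], capture_output=True)
    return result.returncode == 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "my_module.py"), "w") as fh:
            fh.write("x = 1\n")
        # Directory not given to the child: the import is expected to fail.
        print(fresh_interpreter_can_import("my_module"))
        # Directory given to the child: the import is expected to succeed.
        print(fresh_interpreter_can_import("my_module", d))
```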

I believe this is a potential bug or feature request for multiprocessing.Pool with the 'spawn' method, rather than an issue with reticulate. In any case, I’m not sure what could be done here, short of patching the installed multiprocessing module or somehow intercepting and modifying all spawned subprocesses.
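One workaround that may be worth trying from the Python module's side is to put the module's own directory back on sys.path right before the pool is created. This assumes that the spawn start method hands the parent's current sys.path to each worker when it starts, which is my understanding of CPython's multiprocessing but has not been tested under reticulate. The helper name `ensure_module_dir_on_sys_path` is hypothetical; `f` and `f2` mirror the reproduction script.

```python
import os
import sys
from multiprocessing import Pool


def ensure_module_dir_on_sys_path(module_file):
    """Make sure the directory containing `module_file` is on sys.path."""
    module_dir = os.path.dirname(os.path.abspath(module_file))
    if module_dir not in sys.path:
        sys.path.insert(0, module_dir)
    return module_dir


def f(a):
    return a ** 2


def f2():
    # Re-add this module's directory just before the workers start, so that
    # a spawned worker can resolve `import tmp` when it unpickles `f`.
    ensure_module_dir_on_sys_path(__file__)
    with Pool(5) as p:
        return p.map(f, [1, 2, 3, 4, 5])
```

The trade-off is that the directory then stays on sys.path for the rest of the session, which is a side effect the temporary approach in import_from_path() avoids.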
