
Replace multiprocessing pool with futures executors #719

Merged: 6 commits into devel, Oct 30, 2023

Conversation

@steinitzu (Collaborator) commented Oct 27, 2023

Resolves: #699

Also added a NullExecutor fallback implementation, which just runs the task in the calling thread and wraps the result in a future. This gives us the same interface in single-threaded mode, so we don't have to check whether a pool exists.
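For illustration, a minimal sketch of what such a fallback could look like; the class body here is an assumption for illustration, not the exact dlt implementation:

from concurrent.futures import Executor, Future
from typing import Any, Callable


class NullExecutor(Executor):
    """Run submitted callables immediately in the calling thread.

    Sketch only: mimics the futures interface so callers can treat
    single-threaded and pooled execution the same way.
    """

    def submit(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Future:
        future: Future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except BaseException as exc:
            future.set_exception(exc)
        return future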

netlify bot commented Oct 27, 2023

Deploy Preview for dlt-hub-docs canceled.

Latest commit: e478f89
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/65401b53eb32330008fcfa2a

Comment on lines 49 to 54
 def test_fail_on_process_worker_started_early() -> None:
     # process pool cannot be started before class instance is created: mapping not exist in worker
-    with Pool(4) as p:
+    with ProcessPoolExecutor(4) as p:
         r = _TestRunnableWorkerMethod(4)
         with pytest.raises(KeyError):
             r._run(p)
-        p.close()
+        p.shutdown(wait=True)
steinitzu (Collaborator, Author):
@rudolfix This test fails (it doesn't raise) and I'm not totally clear on what it's doing. It looks like the executor initializes the process pool lazily on the first task, so maybe this order doesn't matter now?

Collaborator:

if processes are lazily instantiated then yes. just remove the with raises and test if the run was successful.

@rudolfix (Collaborator) left a review comment:

this is neat

@@ -59,7 +85,7 @@ def _run_func() -> bool:
     if pool:
         logger.info("Closing processing pool")
         # terminate pool and do not join
Collaborator:
please remove outdated comments

@@ -59,7 +85,7 @@ def _run_func() -> bool:
     if pool:
         logger.info("Closing processing pool")
         # terminate pool and do not join
-        pool.terminate()
+        pool.shutdown(wait=True)
Collaborator:
I hope that will never lock. The process pool was locking sometimes; impossible to debug.

Collaborator:

I think the old one used a semaphore under the hood in the stdlib, and if the system sigkilled a child process (perhaps due to memory) or a process failed, it could lock. On Kubernetes, where you have limited CPU, os.cpu_count() returns an incorrect value, and over-provisioning processes would stall out the program too.
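As an aside, a common workaround for the os.cpu_count() issue is to derive the worker count from the cgroup CPU quota when one is set. A rough sketch, assuming cgroup v2 (not part of this PR):

import os


def effective_cpu_count() -> int:
    # Respect a cgroup v2 CPU quota (e.g. a Kubernetes CPU limit) when present,
    # otherwise fall back to os.cpu_count().
    try:
        with open("/sys/fs/cgroup/cpu.max") as f:
            quota, period = f.read().split()
        if quota != "max":
            return max(1, int(quota) // int(period))
    except (OSError, ValueError):
        pass
    return os.cpu_count() or 1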

dlt/common/runners/pool_runner.py (outdated review thread, resolved)
@@ -17,9 +18,9 @@ def test_runnable_process_pool(method: str) -> None:
     # 4 tasks
     r = _TestRunnableWorker(4)
     # create 4 workers
-    with Pool(4) as p:
+    with ProcessPoolExecutor(4) as p:
Collaborator:

does this line still work?

multiprocessing.set_start_method(method, force=True)

We need to make sure all start methods work; on Windows there's only spawn. (Also make sure those tests run on Windows on CI.)

steinitzu (Collaborator, Author):
Yes, I updated this in the tests to ProcessPoolExecutor(4, mp_context=multiprocessing.get_context(method)). The global multiprocessing.set_start_method seems to work too, but it's cleaner not to change global settings in tests.
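A self-contained sketch of that pattern; the _square task and the "spawn" choice are just for illustration, only the executor construction mirrors the line quoted above:

import multiprocessing
from concurrent.futures import ProcessPoolExecutor


def _square(x: int) -> int:
    return x * x


def run_with_start_method(method: str) -> list:
    # Pass an explicit mp context instead of calling multiprocessing.set_start_method,
    # so the global start method stays untouched when tests share one process.
    ctx = multiprocessing.get_context(method)
    with ProcessPoolExecutor(4, mp_context=ctx) as pool:
        return list(pool.map(_square, range(4)))


if __name__ == "__main__":
    # "fork" is unavailable on Windows; "spawn" works on all platforms.
    print(run_with_start_method("spawn"))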

steinitzu (Collaborator, Author):

Apparently it's still the case with "spawn" that processes start upfront, so I'm testing both.

@z3z1ma (Collaborator) commented Oct 29, 2023

This is very nicely done 🚀

I wonder if there is a possibility to supply a custom executor. Given that, you could parallelize normalization across nodes using something like Ray, at the expense of network overhead; but given a large enough pool it could pay off.

@steinitzu force-pushed the sthor/futures-execturos branch from add0724 to 7279700 on October 30, 2023 15:34
@steinitzu (Collaborator, Author) replied:

This would be cool and easy to do; anything with the same futures interface should work. The question is how we would pass it. I think an executor/pool argument on normalize, load, and run would be good, and that should supersede config.

@z3z1ma (Collaborator) commented Oct 30, 2023

Indeed; for example, https://docs.dask.org/en/stable/futures.html offers an Executor-compatible interface out of the box that scales out to multiple nodes. Or a bespoke implementation could be supplied.

The primary consideration in taking this to the next level is data locality. Data storage should ideally leverage fsspec, so that things like NormalizeStorage, SchemaStorage, LoadStorage, LoadStorageConfiguration, and NormalizeStorageConfiguration (from dlt.common.storages) all go through fsspec. When that is configured, parallelization across nodes, as well as persistence of pipeline state across nodes, becomes trivial. The parameterization of Executor is still useful even before that, but we should consider the two as deeply synergistic.
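For reference, a sketch of how the Dask futures interface lines up with concurrent.futures; it assumes the optional distributed package is installed, and normalize_chunk is a made-up stand-in, not a dlt function:

from distributed import Client


def normalize_chunk(chunk: list) -> int:
    # Stand-in for a CPU-bound normalization task.
    return sum(x * x for x in chunk)


if __name__ == "__main__":
    client = Client()  # local cluster by default; point it at a scheduler address to scale out
    executor = client.get_executor()  # concurrent.futures.Executor-compatible adapter
    futures = [executor.submit(normalize_chunk, [i, i + 1, i + 2]) for i in range(4)]
    print([f.result() for f in futures])
    client.close()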

@rudolfix marked this pull request as ready for review on October 30, 2023 20:05
@rudolfix force-pushed the sthor/futures-execturos branch from a648e07 to e478f89 on October 30, 2023 21:08
@rudolfix merged commit 7c85bfe into devel on Oct 30, 2023
@rudolfix deleted the sthor/futures-execturos branch on October 30, 2023 21:25
Successfully merging this pull request may close: replace pool executor with thread executor in load stage.
3 participants