Parallelize decorator #965
Conversation
I'm OK with the code refactor. The code duplication you mentioned worries me, we didn't have it before!
Could you also add a `parallelize` method to `DltResource` that wraps the existing gen, so regular resources (i.e. from verified sources) can be converted to parallel ones?
Also, you must test parallel transformers!
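For illustration, a minimal usage sketch of what the requested method would enable (resource name and body are made up, not code from the PR):

```python
import dlt

@dlt.resource
def items():
    # stands in for a regular resource, e.g. one coming from a verified source
    yield from range(10)

# hypothetical conversion of an already-created resource into a parallel one,
# without touching its decorator or generator function
parallel_items = items().parallelize()
```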
dlt/extract/concurrency.py (outdated):

```python
if self.free_slots == 0:
    return None

if self.free_slots < 0:  # TODO: Sanity check during dev, should never happen
```
Use an assert for that!
dlt/extract/pipe.py (outdated):

```python
return ResolvablePipeItem(item, step, pipe, meta)
pipe_item = ResolvablePipeItem(pipe_item, step, pipe, meta)

if isinstance(pipe_item.item, Awaitable) or callable(pipe_item.item):
```
This duplication looks really suspicious. AFAIK we were returning the Awaitable/Callable item to the `next` function and resolving the item there. I do not see a reason for this code to be here. If this is a result of the code refactor, then IMO something is still not right there.
It's an optimization, but I'm not totally happy with how this looks either. I want to give it a little more love.
The basic idea is that we start multiple futures on each `__next__` iteration, instead of just adding one future and doing the whole dance "poll -> iterate all sources -> repeat" each time. It seems to make a big difference.
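A minimal sketch of that idea with a plain worker pool (names and structure are simplified, not taken from the PR):

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List

def fill_free_slots(
    pool: ThreadPoolExecutor,
    pending: List[Callable[[], object]],
    in_flight: List[Future],
    max_workers: int,
) -> None:
    # instead of submitting a single callable per __next__ call and then
    # polling, start as many futures as there are free slots in one pass
    while pending and len(in_flight) < max_workers:
        in_flight.append(pool.submit(pending.pop(0)))
```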
My benchmark with and without this block:

With the block:
Done in 10.724883079528809 seconds
Extracted 5000 items total from 5 resources

Without the block:
Done in 20.511144161224365 seconds
Extracted 5000 items total from 5 resources
dlt/extract/decorators.py (outdated):

```python
@wraps(f)
def _wrap(*args: Any, **kwargs: Any) -> Any:  # TODO: Type correctly
    gen = f(*args, **kwargs)
    if inspect.isfunction(gen):
```
If `gen` is a function, we'll call it when the pipe iterator is constructed. Do not start evaluating `gen` here! Context:
- Pipes are cloned before evaluation, so if `gen` is a generator function you'll be able to evaluate it multiple times.
- Maybe someone will pass the source to a different thread for evaluation. We support parallel pipelines now (and this decorator is made to support that).

Instead, check whether this is a generator function and if not raise a nice exception (see the sketch below). Please look into exceptions.py and use the proper base class (where you can pass the function name and a few other details so the user knows what is going on).
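A sketch of the suggested check; the exception class here is illustrative only, the real one should derive from the base classes in dlt/extract/exceptions.py:

```python
import inspect
from typing import Any, Callable

class ResourceNotParallelizable(Exception):
    # illustrative exception; in dlt this would subclass the proper base class
    # so the resource/function name is reported consistently
    def __init__(self, func_name: str, type_name: str) -> None:
        super().__init__(
            f"Cannot mark `{func_name}` as parallelized: expected a generator "
            f"function, got {type_name}."
        )

def ensure_parallelizable(f: Callable[..., Any]) -> None:
    # do not call f here -- only verify it is a generator function so the pipe
    # can be cloned and evaluated lazily (possibly in another thread)
    if not inspect.isgeneratorfunction(inspect.unwrap(f)):
        raise ResourceNotParallelizable(getattr(f, "__name__", str(f)), type(f).__name__)
```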
dlt/extract/decorators.py (outdated):

```python
gen = f(*args, **kwargs)
if inspect.isfunction(gen):
    gen = gen()
if inspect.isasyncgen(gen):
```
see above.
One more comment on the refactor: please split …
I did not review the PR, but want to point out that returning None from a resource to skip to the next one in round robin mode should still work, for backwards compatibility at least; Alex is using this.
💯 we have a test that makes sure that round robin really works.
Besides all the comments:
- The modification to `add_limit` worries me. Why does it work right now? We are still yielding awaitables from the resource itself, no?
- We need a test of `add_limit` with parallelize (both resource and transformer).
- We also need docs, but let's make this work first.
dlt/extract/concurrency.py (outdated):

```python
if not self.futures:
    return None

if (item := self.resolve_next_future_no_wait()) is not None:
```
Why is this needed? We'll get the same result using the `for ... in as_completed(...)` loop below?
It's to guarantee ordered results from each resource. I.e. if there are 2 or more futures from the same pipe already done when this is called, we get the results in FIFO order, since the `self._futures` dict is in insertion order but `as_completed` yields in undefined order.
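A small illustration of that guarantee, assuming a plain dict of futures (not the actual pool code):

```python
from concurrent.futures import Future
from typing import Any, Dict, Optional

def resolve_next_future_no_wait(futures: Dict[Future, Any]) -> Optional[Future]:
    # dicts preserve insertion order, so the first *done* future found is also
    # the oldest one submitted -- results from a single pipe stay in FIFO order,
    # unlike concurrent.futures.as_completed, which yields in completion order
    for fut in list(futures):
        if fut.done():
            futures.pop(fut)
            return fut
    return None
```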
```python
if parallelized:
    assert len(threads) > 1
    assert execution_order == ["one", "two", "one", "two", "one", "two"]
```
I hope it sleeps long enough so the test is really deterministic :)
Probably better not to count on it. I don't think we need to check the exact order.
```python
# assert len(threads) == 1

@pytest.mark.parametrize("parallelized", [True, False])
def test_parallelized_resource(parallelized: bool) -> None:
    os.environ["EXTRACT__NEXT_ITEM_MODE"] = "fifo"
```
What about round robin? Test it for both, I want to see what the difference is!
Yeah, I can parametrize it. There should not be much difference: it's going round-robin implicitly, but resetting at some points depending on timings.
This should still be the same, assuming the tests covering this are good as well.
@steinitzu one more ask - when this is implemented, could you check if our …
Good call, I wonder if that will work with streaming results. The conn pool is thread safe, but we're streaming chunks on one connection to load a table.
I have a feeling that this PR got more chaotic, but now I think I understand the various edge cases. Please read my comments; there's not so much fixing.
Btw. I realized that generators can be read from multiple threads, not sure where I got the idea that this is not the case :)
```python
result = list(some_source())

assert set(result) == {1, 2, 3, 4, 5, -1, -2, -3, -4, -5}
```
The result does not look parallelized: first the positive data is evaluated and then the negative data. I suspect we do not really run it in a thread.
I'm not checking the order here since it's not really guaranteed, just that all items are there. But I updated all the tests to check thread idents (> 1 thread, and not the main thread).
It would be nice to check that the execution order is roughly interleaved, but that's tricky without making the tests flaky.
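Roughly what that thread-identity check looks like in a test (resource body and assertions here are illustrative, not copied from the PR):

```python
import threading
import dlt

def test_parallel_resource_runs_off_main_thread() -> None:
    thread_ids = set()

    @dlt.resource(parallelized=True)  # flag name as introduced in this PR
    def numbers():
        for i in range(20):
            thread_ids.add(threading.get_ident())
            yield i

    assert sorted(list(numbers())) == list(range(20))
    # items were produced in worker threads, never on the main thread
    assert threading.get_ident() not in thread_ids
```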
dlt/extract/pipe_iterator.py (outdated):

```python
self._futures_pool.submit(pipe_item, block=True)
pipe_item = None
if len(self._futures_pool) >= sources_count:
    # Return here so we're not collecting done futures forever
```
submitting, not collecting?
```python
else:
    self._current_source_index = (self._current_source_index - 1) % sources_count
while True:
    # get next item from the current source
```
Why is this gone?

```python
# if we have checked all sources once and all returned None, then we can sleep a bit
if self._current_source_index == first_evaluated_index:
    sleep(self.futures_poll_interval)
```

We must keep it: if all generators return None we cannot loop forever. Instead of sleeping we can return None though, and let `__next__` handle the sleep part.
I figured the future submitting would return eventually. But yes, the returning-None case also needs handling. I'm returning None from this now and `__next__` does the polling.
Yeah! You do optimized polling when waiting for a future, but if there are no futures there's no sleep and we max out the CPU.
dlt/extract/pipe_iterator.py (outdated):

```python
# do we need new item?
if pipe_item is None:
    # if none then take element from the newest source
    pipe_item = self._get_source_item()
```
Wait,
1. We were taking items from the futures pool without blocking before going for the next item in the sources.
2. If None, we tried to get items from the source.
3. If None, we checked whether the iterator is maybe exhausted and possibly exit.
4. If not, we slept a little bit and went back to 1.

This pattern must remain (a rough sketch follows below). We cannot have any blocking operations anywhere, because we'll start starving our pools.
We added additional complications with generators that return None. We prevent sources from starving futures by exiting when all sources returned None and taking items from the futures pool.
There is of course a problem because we sleep when there's no data anywhere, and this sleep is unconditional. We can improve it a little by waiting on the futures pool instead of sleeping (if there are any elements in it) and making our parallel and async generators trigger a semaphore on which we sleep (but it must be in thread storage, let's not do it).
You can decrease CPU usage by moving done futures to special done dicts and improving the `_next_done_future` method.
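A rough sketch of the non-blocking loop pattern described above (all names here are placeholders, not the PR's actual API):

```python
import time
from typing import Any, Optional

def next_item(pool: Any, sources: Any, poll_interval: float) -> Optional[Any]:
    while True:
        # 1. take a finished item from the futures pool, without blocking
        item = pool.resolve_next_future_no_wait()  # placeholder name
        if item is not None:
            return item
        # 2. otherwise ask the sources for the next item (round robin)
        item = sources.get_next_item()  # placeholder name
        if item is not None:
            return item
        # 3. nothing pending anywhere -> the iterator is exhausted
        if len(pool) == 0 and sources.all_exhausted():
            return None
        # 4. sources returned None and futures are still running -> short sleep
        time.sleep(poll_interval)
```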
btw. we always try to resolve items in the pool before getting more stuff from the sources
So now this is pretty much it. "get source item" -> "wait for a resolved future (with timeout)" -> repeat. Just the order is opposite, which I don't think matters other than we don't waste 10ms polling at the beginning of extract, right?
https://github.com/dlt-hub/dlt/blob/sthor%2Fparallelize-decorator/dlt/extract/pipe_iterator.py#L151-L169
I initially considered doing something with locks, or an async "Event" yielded from the resource, to skip pending sources and only wait exactly as needed. But it got complicated, and I don't think we have too much overhead anyway.
> you can decrease CPU usage by moving done to special done dicts and improving `_next_done_future` method

This is good, will do that.
dlt/extract/concurrency.py (outdated):

```python
# jobs to the pool, we need to change this whole method to be inside a `threading.Lock`
self._wait_for_free_slot()
else:
    return None
```
This is dangerous: now you call submit without checking this. IMO we do not need a submit that doesn't wait, or it should raise an exception.
Done; submit always blocks now.
dlt/extract/pipe_iterator.py (outdated):

```python
else:
    pipe_item = ResolvablePipeItem(pipe_item, step, pipe, meta)

if isinstance(pipe_item.item, Awaitable) or callable(pipe_item.item):
```
There's really no reason to have it; as I mentioned, `__next__` will handle this. There's a small overhead in the main loop for checking the pool for done futures and coming back here just to get a new item, so if you want, submit without blocking, and if that fails return the item so the blocking version in `__next__` handles it.
`if len(self._futures_pool) >= sources_count:` is super arbitrary.
dlt/extract/pipe_iterator.py (outdated):

```python
if pipe_item is None:
    # Block until a future is resolved
    pipe_item = self._futures_pool.resolve_next_future()
```
we should never block when resolving next future.
We could do it as an optimization of the polling sleep, if there's anything in the pool, but with a timeout of `futures_poll_interval`.
This makes sense and I don't see any performance impact. That is, using the `poll_interval` as the timeout when waiting for a done future; doing that now.
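A self-contained sketch of that change, using `concurrent.futures.wait` with the poll interval as the timeout (the submitted work and values are arbitrary):

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

poll_interval = 0.01  # plays the role of futures_poll_interval

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(time.sleep, d): d for d in (0.05, 0.2)}
    while futures:
        # wait at most poll_interval for any future to finish instead of an
        # unconditional sleep; returns early as soon as one completes
        done, _ = wait(futures, timeout=poll_interval, return_when=FIRST_COMPLETED)
        for fut in done:
            futures.pop(fut)
```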
@rudolfix A few more adjustments and tests. The duplicate "submit futures" block in …
Now LGTM!
Check this commit: 6af5c1a WDYT?
Please amend the docs:
- In resource.md please add a chapter "### Declare parallel and async resources" where you just mention that async generators or the parallel flag can be used to allow many resources to be executed at once. Then link to performance.md.
- In performance.md kick out our @dlt.defer example and use parallelize instead. Explain how it works with generators and regular functions. Provide code snippets; mind that code snippets are tested.
- There are two examples in the examples section where defer on transformers should be replaced with parallelize. Remember to change the source of the example in the relevant test!

In chess_production:

```python
# this resource takes data from players and returns profiles
# it uses `defer` decorator to enable parallel run in thread pool.
# defer requires return at the end so we convert yield into return (we return one item anyway)
# you can still have yielding transformers, look for the test named `test_evolve_schema`
@dlt.transformer(data_from=players, write_disposition="replace")
@dlt.defer
```

and in the transformers example:

```python
# a special case where just one item is retrieved in transformer
# a whole transformer may be marked for parallel execution
@dlt.transformer
@dlt.defer
def species(pokemon_details):
```
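For the transformer case, the replacement would look roughly like this (assuming the transformer decorator accepts the same `parallelized` flag; the body is abbreviated and illustrative):

```python
import dlt

# the whole transformer is marked for parallel execution instead of
# being wrapped in @dlt.defer
@dlt.transformer(parallelized=True)
def species(pokemon_details):
    # just one item is retrieved per input; it is produced in the extract thread pool
    yield pokemon_details
```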
Updated the docs. Couple of concerns:
No, it still works well when you want to parallelize a loop.
It runs, but console output gets disabled when two pipelines are gathered at the same time. Something evil must happen there... It does not matter if we run the pools in the extract step, or if we run the extract step at all. If I serialize all the calls by locking in …
Description
Implements #934
Draft for now, may need to test more + add docs
Added a `parallelized` argument to resource; to me that seems like a good interface. Under the hood it's basically the same decorator from the test implementation (maybe it makes sense to use it directly in some cases?), or we could integrate this more deeply into resource.
There was a lot of overhead coming from all the polling in `PipeIterator` when constantly yielding `None`, so I tried to implement more intelligent waiting for futures: basically just block until some future completes when the pool is full, instead of blindly polling and looping over the resources, and that seems to make a big difference.
Plus a pretty big refactor of `PipeIterator`: the futures worker pool is now a separate class.
Benchmark before and after optimizing: https://gist.github.com/steinitzu/671b47cdb7cbf61b2fab46cf1faefd86
The original iterator had a lot of overhead, especially when the number of resources is <= the number of workers.
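A minimal usage sketch of the new argument (the resource body simulates slow I/O and is not part of the PR):

```python
import time
import dlt

@dlt.resource(parallelized=True)
def slow_numbers():
    for i in range(10):
        time.sleep(0.1)  # simulated I/O; items are produced in worker threads
        yield i

# several resources like this in one source are extracted concurrently
print(list(slow_numbers()))
```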