allows to run parallel pipelines in separate threads #813
        context = self._thread_context(spec)
        return spec in context

    def _thread_context(
Can't you use Python's thread-local data to do all this? https://docs.python.org/3/library/threading.html#thread-local-data
Yeah, I know about it, but when you look at the code, there are exceptions to that behavior:
- some types of context are available globally (I use the main thread id for them)
- the executor thread pool gets special treatment: I use the context of the thread that started the pool, not the current thread
So yes, I could use local(), but because of these exceptions I'd need to keep more dictionaries. Or can you force the thread id for local()?
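To make the two exceptions above concrete, here is a minimal sketch of a registry keyed by thread id. All names (`ContextRegistry`, `global_keys`, `owner_thread_id`) are hypothetical and are not taken from the PR; the sketch only illustrates why a plain `threading.local()` is awkward here, since it offers no way to read or write the data of a thread other than the current one.

```python
import threading
from typing import Any, Dict, Optional, Set

# Hypothetical sketch; not the implementation from this PR.
MAIN_THREAD_ID = threading.main_thread().ident


class ContextRegistry:
    """Stores values per thread id, with two exceptions to thread-local behavior."""

    def __init__(self, global_keys: Set[str]) -> None:
        self._global_keys = global_keys
        self._contexts: Dict[Optional[int], Dict[str, Any]] = {}
        self._lock = threading.Lock()

    def _thread_id(self, key: str, owner_thread_id: Optional[int]) -> Optional[int]:
        # Exception 1: some keys are global and always live under the main thread id.
        if key in self._global_keys:
            return MAIN_THREAD_ID
        # Exception 2: a pool worker may ask for the context of the thread that started the pool.
        if owner_thread_id is not None:
            return owner_thread_id
        # Default: behave like thread-local storage.
        return threading.get_ident()

    def set(self, key: str, value: Any, owner_thread_id: Optional[int] = None) -> None:
        with self._lock:
            thread_id = self._thread_id(key, owner_thread_id)
            self._contexts.setdefault(thread_id, {})[key] = value

    def get(self, key: str, owner_thread_id: Optional[int] = None) -> Any:
        with self._lock:
            thread_id = self._thread_id(key, owner_thread_id)
            return self._contexts.get(thread_id, {}).get(key)


registry = ContextRegistry(global_keys={"config"})
registry.set("config", "shared")           # stored under the main thread id
registry.set("pipeline", "main-pipeline")  # stored under the current thread id

seen: Dict[str, Any] = {}


def worker(owner_id: int) -> None:
    seen["config"] = registry.get("config")                            # global key
    seen["own"] = registry.get("pipeline")                             # this thread has none
    seen["owner"] = registry.get("pipeline", owner_thread_id=owner_id)  # borrowed from the owner


thread = threading.Thread(target=worker, args=(threading.get_ident(),))
thread.start()
thread.join()
```

With `threading.local()` the `owner` lookup would need a second, separately maintained dictionary, which is the extra bookkeeping mentioned above.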
class DataWriterMetrics(NamedTuple):
    file_path: str
    items_count: int
    file_size: int
Maybe column count? But that is not really important, tbh.
I plan to add elapsed time (start and stop). Column count is not known at this moment.
Not during extract, but it is known during normalize. You can, however, get the column count from the relevant schema...
Yes, elapsed would be cool too!
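One possible shape for the elapsed-time idea, as a sketch only: the `started_at` and `finished_at` fields and the `elapsed` property are assumptions for illustration and are not part of the diff under review.

```python
from typing import NamedTuple


class DataWriterMetrics(NamedTuple):
    file_path: str
    items_count: int
    file_size: int
    # Assumed additions (not in the reviewed diff): wall-clock timestamps in seconds.
    started_at: float
    finished_at: float

    @property
    def elapsed(self) -> float:
        """Seconds between the start and stop timestamps."""
        return self.finished_at - self.started_at


metrics = DataWriterMetrics("items.jsonl", 10, 2048, started_at=1.0, finished_at=3.5)
```

Keeping start and stop as fields and deriving `elapsed` means the raw timestamps stay available if metrics from several files need to be merged later.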
@@ -249,6 +250,17 @@ The default is to not parallelize normalization and to perform it in the main pr
Normalization is CPU bound and can easily saturate all your cores. Never allow `dlt` to use all cores on your local machine.
:::

:::caution
The default method of spawning a process pool on Linux is **fork**. If you are using threads in your code (or libraries that use threads),
Should we add a link to some further explanation of this in the Python docs, maybe?
Good point! Could you propose a link? Then I'll add it.
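For readers hitting the fork-with-threads problem in their own code, a minimal sketch of requesting the **spawn** start method through the standard library. This shows only the generic `multiprocessing` and `concurrent.futures` APIs; the helper name `make_spawn_pool` is hypothetical, and how `dlt` itself is configured to do this is not covered here.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

# A context object scoped to "spawn"; it does not change the process-wide default.
SPAWN = multiprocessing.get_context("spawn")


def make_spawn_pool(workers: int = 2) -> ProcessPoolExecutor:
    """Create a process pool whose workers are started with spawn instead of fork."""
    return ProcessPoolExecutor(max_workers=workers, mp_context=SPAWN)
```

When a pool like this is used from a script, the code that submits work should sit under `if __name__ == "__main__":`, because spawned workers re-import the main module.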
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "dlt"
-version = "0.4.1a0"
+version = "0.4.1a1"
Should we add a test somewhere to check that the version number is in sync everywhere?
What do you mean? There's just one source of truth for the version, and that is the toml file. Or do you want a test that compares the toml file with the installed package? Or maybe that should be a lint step where we force people to run make dev when versions are not in sync?
I think my main question is: why not use Python's built-in thread locals? Or do you need to be able to access another thread's locals?
Description
See `test_parallel_threads_pipeline` for an example.
`asyncio` to run pipelines in parallel
Contains content of #807
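As a rough illustration of the pattern the description refers to, here is a sketch that uses `asyncio` with a thread pool to run several jobs in parallel. `run_pipeline` is a hypothetical stand-in for creating and running a pipeline; it is not the code of `test_parallel_threads_pipeline` and it calls no `dlt` API.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_pipeline(name: str) -> str:
    # Hypothetical stand-in: a real version would build and run a pipeline here.
    return f"{name} ran in {threading.current_thread().name}"


async def run_all(names: List[str]) -> List[str]:
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=len(names)) as executor:
        # Each blocking call runs in its own worker thread; gather keeps input order.
        futures = [loop.run_in_executor(executor, run_pipeline, name) for name in names]
        return list(await asyncio.gather(*futures))


results = asyncio.run(run_all(["pipeline_1", "pipeline_2"]))
```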