[Bug]: jupysql + DuckDB -- reading json lines fails to parallelise across threads #635
Open
Labels
bug
What happened?
I'm using DuckDB in JupyterLab with jupysql (0.7.4) and running into an odd performance issue: I'm not seeing the parallel loading of JSON lines files that I see when I use DuckDB via the Python API directly. (I've just been eyeballing htop to see whether all my threads light up or only one is used.)
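For reference, a minimal sketch of what a direct Python API baseline could look like. This is not the exact code used for the comparison: the helper names `build_load_sql` and `time_load`, the table name `events`, and the thread count are assumptions for illustration. It relies on DuckDB's `read_ndjson_auto` table function and the `threads` setting.

```python
import time


def build_load_sql(path: str, table: str = "events") -> str:
    """Return a CREATE TABLE statement that loads a JSON lines file.

    The table name is a hypothetical default, not taken from the issue.
    """
    escaped = path.replace("'", "''")  # escape single quotes for the SQL literal
    return (
        f"CREATE OR REPLACE TABLE {table} AS "
        f"SELECT * FROM read_ndjson_auto('{escaped}')"
    )


def time_load(path: str, threads: int = 8) -> float:
    """Load the file through the DuckDB Python API and return elapsed seconds."""
    import duckdb  # imported lazily so build_load_sql works without DuckDB installed

    con = duckdb.connect()  # in-memory database
    con.execute(f"SET threads TO {int(threads)}")  # thread count is an assumption
    start = time.perf_counter()
    con.execute(build_load_sql(path))
    return time.perf_counter() - start
```

Running `time_load("dummy.jsonl")` while watching htop gives a rough picture of how many threads are busy during the load; the same statement can then be run through `%sql` in a notebook for comparison.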
Note that
%config SqlMagic.autopandas = False
was set for the following comparisons.
Weirdly, I worked out that the following variations on the above query do in fact allow parallelisation:
- `limit` clause
- `count` the rows instead
- `partition` over the results
Here's a Python function to generate some dummy JSON lines data for debugging. (with these defaults, writes about 7GB to disk):
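A minimal sketch of such a generator follows. It is not the original function: the name `write_dummy_jsonl`, the field names, and the defaults are assumptions for illustration, and the defaults here are deliberately tiny, so `n_rows` would need to be raised substantially to produce a multi-gigabyte file.

```python
import json
import random
from pathlib import Path


def write_dummy_jsonl(path, n_rows=1000, seed=0):
    """Write n_rows of random records to path, one JSON object per line."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same file
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for i in range(n_rows):
            record = {
                "id": i,
                "value": rng.random(),
                "category": rng.choice(["a", "b", "c"]),
                "payload": "x" * rng.randint(10, 50),  # filler to pad the row size
            }
            f.write(json.dumps(record) + "\n")
    return path
```

Calling `write_dummy_jsonl("dummy.jsonl", n_rows=10_000_000)` produces a file large enough for thread usage to become visible, at the cost of a long write.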
While putting this together I realised that there is actually no parallelisation in DuckDB 0.7.1, and I needed to use a 0.7.2 pre-release to get any parallelisation, even in the Python example. So it could be a little premature to be debugging this.
Also, not sure if this is the right place for this issue, but it's a starting point and can move if needed.
DuckDB Engine Version
0.7.0
DuckDB Version
0.7.2.dev2699
SQLAlchemy Version
2.0.11
Relevant log output
No response