[Bug]: jupysql + DuckDB -- reading json lines fails to parallelise across threads #635

ned2 opened this issue Apr 29, 2023 · 1 comment
ned2 commented Apr 29, 2023

What happened?

I'm using DuckDB in JupyterLab with jupysql (0.7.4) and running into a curious performance issue: I'm not seeing the parallel loading of JSON lines files that I get when I use DuckDB via the Python API directly. (I've just been eyeballing htop to see whether all my threads light up or only one is used.)

Note that %config SqlMagic.autopandas = False was set for the following comparisons.
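For what it's worth, a quick way to rule out a thread-count misconfiguration is to check DuckDB's threads setting from the same jupysql connection (a minimal sketch; it should report the machine's core count rather than 1):

%%sql
-- sanity check: how many threads is DuckDB configured to use?
select current_setting('threads') as threads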

%%sql tmp <<
-- this does *not* parallelise
select 1 from read_ndjson_auto('jsonl_data/*.jsonl')

Whereas running the same query via the Python API *does* parallelise (doing a fetchall to match jupysql automatically pulling the complete result set):

import duckdb

results = duckdb.sql("select 1 from read_ndjson_auto('jsonl_data/*.jsonl')").fetchall()

Weirdly, I worked out that the following variations on the above query do in fact allow parallelisation (the cells below correspond in order):

  • adding a limit clause
  • counting the rows instead
  • partitioning over the results
%%sql tmp <<
-- this *does* parallelise
select 1 from read_ndjson_auto('jsonl_data/*.jsonl') limit 100000000

%%sql tmp <<
-- this *does* parallelise
select count() from read_ndjson_auto('jsonl_data/*.jsonl')

%%sql tmp <<
-- this *does* parallelise
select row_number() OVER (PARTITION BY id) as seqnum from read_ndjson_auto('jsonl_data/*.jsonl')
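One way to dig further would be to compare the query plans of the parallelising and non-parallelising variants; a minimal sketch using DuckDB's EXPLAIN (I haven't confirmed the plans actually differ):

%%sql
-- inspect the plan for the non-parallelising variant
explain select 1 from read_ndjson_auto('jsonl_data/*.jsonl')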

Here's a Python function to generate some dummy JSON lines data for debugging (with these defaults, it writes about 7 GB to disk):

import json
from pathlib import Path
from random import randrange


def make_dummy_jsonl(path, num_files=200, length=1_000_000):
    target_path = Path(path)
    target_path.mkdir(parents=True, exist_ok=True)
    for file_num in range(1, num_files + 1):
        data = [
            {
                "id": row_num,
                "val1": randrange(100),
                "val2": randrange(100),
                "val3": randrange(100),
            }
            for row_num in range(1, length + 1)
        ]
        with open(target_path / f"dummy_{file_num:03}.jsonl", "w") as file:
            file.write("\n".join(json.dumps(row) for row in data))

make_dummy_jsonl("jsonl_data")
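For reference, a rough way to time the direct Python API baseline against the generated data (just a sketch; absolute times will vary by machine):

import time

import duckdb

start = time.perf_counter()
results = duckdb.sql("select 1 from read_ndjson_auto('jsonl_data/*.jsonl')").fetchall()
print(f"Python API: {time.perf_counter() - start:.1f}s for {len(results):,} rows")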

While putting this together I realised that there is actually no parallelisation at all with DuckDB 0.7.1, and I needed to use a 0.7.2 pre-release to get any parallelisation, even in the Python example. So it could be a little premature to be debugging this.

Also, I'm not sure if this is the right place for this issue, but it's a starting point and it can be moved if needed.

DuckDB Engine Version

0.7.0

DuckDB Version

0.7.2.dev2699

SQLAlchemy Version

2.0.11

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@ned2 ned2 added the bug Something isn't working label Apr 29, 2023
@ned2 ned2 changed the title [Bug]: jupysql + DuckDB -- reading json lines with fails to parallelise across threads [Bug]: jupysql + DuckDB -- reading json lines fails to parallelise across threads Apr 29, 2023
ned2 commented Apr 29, 2023

I just took magic_duckdb for a spin (a DuckDB-specific SQL magic for Jupyter) and found that parallelisation works with it. Furthermore, magic_duckdb seems to be generally faster than JupySQL, offering speeds comparable to using DuckDB directly. I realised during all this testing that queries run through JupySQL suffer a noticeable performance hit.
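For anyone wanting to reproduce the comparison, this is roughly what I ran, in separate notebook cells (a sketch; it assumes magic_duckdb's %dql line magic and the jsonl_data directory from above):

%%time
%sql select 1 from read_ndjson_auto('jsonl_data/*.jsonl')

%%time
%dql select 1 from read_ndjson_auto('jsonl_data/*.jsonl')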
