[Bug]: jupysql + DuckDB -- reading json lines fails to parallelise across threads #635

ned2 opened this issue Apr 29, 2023 · 1 comment
ned2 commented Apr 29, 2023

What happened?

I'm using DuckDB in JupyterLab with jupysql (0.7.4) and running into a curious performance issue: I'm not seeing the parallel loading of JSON lines files that I get when I use DuckDB via the Python API directly. (I've just been eyeballing htop to see whether all my threads light up or only one is used.)

Note that %config SqlMagic.autopandas = False was set for the following comparisons.
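For what it's worth, a quick way to rule out a thread-count misconfiguration is to check DuckDB's threads setting from the same jupysql connection (a minimal sketch; it should report the machine's core count rather than 1):

%%sql
-- sanity check: how many threads is DuckDB configured to use?
select current_setting('threads') as threads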

%%sql tmp <<
-- this does *not* parallelise
select 1 from read_ndjson_auto('jsonl_data/*.jsonl')

Whereas running the same query via the Python API *does* parallelise (doing a fetchall to match jupysql automatically pulling the complete result set):

import duckdb

results = duckdb.sql("select 1 from read_ndjson_auto('jsonl_data/*.jsonl')").fetchall()

Weirdly, I worked out that the following variations on the above query do in fact allow parallelisation (the cells below correspond in order):

  • adding a limit clause
  • counting the rows instead
  • partitioning over the results
%%sql tmp <<
-- this *does* parallelise
select 1 from read_ndjson_auto('jsonl_data/*.jsonl') limit 100000000

%%sql tmp <<
-- this *does* parallelise
select count() from read_ndjson_auto('jsonl_data/*.jsonl')

%%sql tmp <<
-- this *does* parallelise
select row_number() OVER (PARTITION BY id) as seqnum from read_ndjson_auto('jsonl_data/*.jsonl')
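One way to dig further would be to compare the query plans of the parallelising and non-parallelising variants; a minimal sketch using DuckDB's EXPLAIN (I haven't confirmed the plans actually differ):

%%sql
-- inspect the plan for the non-parallelising variant
explain select 1 from read_ndjson_auto('jsonl_data/*.jsonl')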

Here's a Python function to generate some dummy JSON lines data for debugging (with these defaults, it writes about 7 GB to disk):

import json
from pathlib import Path
from random import randrange


def make_dummy_jsonl(path, num_files=200, length=1_000_000):
    target_path = Path(path)
    target_path.mkdir(parents=True, exist_ok=True)
    for file_num in range(1, num_files + 1):
        data = [
            {
                "id": row_num,
                "val1": randrange(100),
                "val2": randrange(100),
                "val3": randrange(100),
            }
            for row_num in range(1, length + 1)
        ]
        with open(target_path / f"dummy_{file_num:03}.jsonl", "w") as file:
            file.write("\n".join(json.dumps(row) for row in data))

make_dummy_jsonl("jsonl_data")
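For reference, a rough way to time the direct Python API baseline against the generated data (just a sketch; absolute times will vary by machine):

import time

import duckdb

start = time.perf_counter()
results = duckdb.sql("select 1 from read_ndjson_auto('jsonl_data/*.jsonl')").fetchall()
print(f"Python API: {time.perf_counter() - start:.1f}s for {len(results):,} rows")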

While putting this together I realised that there is actually no parallelisation at all with DuckDB 0.7.1, and I needed to use a 0.7.2 pre-release to get any parallelisation, even in the Python example. So it could be a little premature to be debugging this.

Also, I'm not sure if this is the right place for this issue, but it's a starting point and it can be moved if needed.

DuckDB Engine Version

0.7.0

DuckDB Version

0.7.2.dev2699

SQLAlchemy Version

2.0.11

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@ned2 ned2 added the bug Something isn't working label Apr 29, 2023
@ned2 ned2 changed the title [Bug]: jupysql + DuckDB -- reading json lines with fails to parallelise across threads [Bug]: jupysql + DuckDB -- reading json lines fails to parallelise across threads Apr 29, 2023
ned2 commented Apr 29, 2023

I just took magic_duckdb for a spin (a DuckDB-specific SQL magic for Jupyter) and found that parallelisation works with it. Furthermore, magic_duckdb seems to be generally faster than JupySQL, offering speeds comparable to using DuckDB directly. I realised during all this testing that queries run through JupySQL suffer a noticeable performance hit.
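For anyone wanting to reproduce the comparison, this is roughly what I ran, in separate notebook cells (a sketch; it assumes magic_duckdb's %dql line magic and the jsonl_data directory from above):

%%time
%sql select 1 from read_ndjson_auto('jsonl_data/*.jsonl')

%%time
%dql select 1 from read_ndjson_auto('jsonl_data/*.jsonl')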
