feat: no longer load full table into ram in write by using concurrent write #2289

aersam · 2024-03-15T14:25:50Z

Description

This is a follow-up to #2265.

It additionally uses streams/channels to write concurrently, at the cost of more memory consumption. The default keeps only one record batch in RAM, so the concurrency is opt-in.

I tested this with a local file and it went from 700s to 200s if I work with 10 concurrent streams. Of course memory consumption goes up, but given that we currently load the whole table in RAM, it's OK :)

This adds a dependency on async-channel, as I need a multi-consumer channel.
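
For readers who want the shape of the idea without reading the Rust code, here is a rough Python sketch of a bounded multi-consumer channel feeding several writers. It is not the code in this PR (which uses async-channel in Rust); `write_concurrently`, `fake_write`, `num_writers` and `capacity` are names made up for illustration, and error handling is left out.

```python
import asyncio


async def write_concurrently(batches, write_batch, num_writers=1, capacity=1):
    """Fan batches out to `num_writers` workers through a bounded queue.

    Hypothetical helper: `capacity` bounds how many batches are buffered,
    so memory grows with `capacity + num_writers`, not with the table size.
    """
    queue = asyncio.Queue(maxsize=capacity)
    results = []

    async def worker():
        while True:
            batch = await queue.get()
            if batch is None:  # sentinel: no more batches for this worker
                return
            results.append(await write_batch(batch))

    workers = [asyncio.create_task(worker()) for _ in range(num_writers)]
    for batch in batches:
        await queue.put(batch)  # waits while the queue is full (backpressure)
    for _ in workers:
        await queue.put(None)  # one sentinel per worker
    await asyncio.gather(*workers)
    return results


async def fake_write(batch):
    """Stand-in for a real file write; returns the number of rows 'written'."""
    await asyncio.sleep(0)
    return len(batch)


written = asyncio.run(
    write_concurrently([[1, 2], [3], [4, 5, 6]], fake_write, num_writers=2)
)
print(sorted(written))
```

With `num_writers=1` and `capacity=1` this behaves like the default described above: at most one buffered batch plus the one being written.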

Related Issue(s)

Fixes #2255

ion-elgreco · 2024-03-28T22:11:24Z

@aersam I think we should max out the concurrent streams for Python users.

In most use cases we pass a RecordBatchReader whose record batches are already in memory before the reader is constructed. In that case you won't see any memory difference, and it wouldn't differ from the prior behavior, since the reader was always collected.

I also have one suggestion on the Python side: I think it's better to simplify it and just provide a parameter called parallelize, which is set to True by default. If users want to control the number of concurrent streams, they should set an environment variable, which we can then parse in python_lib. If it's not set and parallelize = True, we take the max possible streams.

aersam · 2024-03-29T07:19:45Z

How about parallelize: bool | int on the Python side? 🙂

ion-elgreco · 2024-03-29T07:27:02Z

@aersam that also works! :)
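
A minimal sketch of how a `bool | int` argument like this could be turned into a stream count. The helper name `resolve_stream_count` and the use of `os.cpu_count()` as the "max possible streams" are assumptions for illustration only; the thread does not say how the maximum is determined.

```python
import os
from typing import Optional, Union


def resolve_stream_count(
    parallel: Union[bool, int], max_streams: Optional[int] = None
) -> int:
    """Map a bool-or-int setting to a number of concurrent streams.

    Hypothetical helper: True means "use the maximum", False means a
    single stream, and a positive int is taken as the explicit count.
    """
    if max_streams is None:
        # Assumption: the CPU count stands in for "max possible streams".
        max_streams = os.cpu_count() or 1
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(parallel, bool):
        return max_streams if parallel else 1
    if parallel < 1:
        raise ValueError("parallel must be a positive integer")
    return parallel


print(resolve_stream_count(True, max_streams=8))
print(resolve_stream_count(False))
print(resolve_stream_count(4))
```

The `isinstance(parallel, bool)` check has to come first; otherwise `True` would be treated as the integer 1.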

aersam · 2024-04-15T08:44:55Z

I finally had the time to update this branch with the new parallel parameter in python. Hope it's looking good now!

ion-elgreco · 2024-04-15T09:03:08Z

@aersam btw, do you have any profiling numbers on the speed-ups/memory trade-offs when parallel is True? It would be nice to share those in the release notes later on.

aersam · 2024-04-15T10:45:02Z

I only did some manual tests on my own data, but I could probably write a benchmark in Python, using duckdb or polars as the source. Would it make sense to add this to the code somehow?

ion-elgreco · 2024-04-15T14:37:19Z

@aersam here you could add it, and even maybe reuse some of the benchmarks there: https://github.com/delta-io/delta-rs/tree/main/crates/benchmarks

aersam · 2024-04-17T19:15:07Z

I did some very basic benchmarking, but the results were not what I had hoped for :) While RAM consumption is significantly lower, the speed is not good enough yet. I think the channel may need to be bigger; I'll do some more testing.

I did my test quick and dirty using python, I can share the code if you want. Basically it's this:

import duckdb
from deltalake.writer import write_deltalake
from uuid import uuid4

# Get your 42.parquet here:
# https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-the-big-data-age.html
with duckdb.connect() as con:
    # Stream 300M rows out of DuckDB as Arrow record batches
    con.execute("select b, random() as a from read_parquet('42.parquet') limit 300000000")
    reader = con.fetch_record_batch()
    # Write to a fresh table directory using the Rust engine
    write_deltalake(f"_test/{uuid4()}", reader, schema=reader.schema, mode="overwrite", engine="rust")
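
To turn a script like this into comparable numbers, one option is a small wrapper that records wall-clock time and peak resident memory around the write call. This is only a sketch: `measure` is a hypothetical helper, it relies on the Unix-only `resource` module, and `ru_maxrss` is reported in KiB on Linux but in bytes on macOS.

```python
import resource
import time


def measure(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds, peak_rss).

    peak_rss is the process-wide high-water mark, so each configuration
    should be measured in a fresh process to keep the numbers comparable.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak_rss


# Stand-in workload; in a real benchmark this would be the write call.
total, seconds, peak = measure(sum, range(1_000_000))
print(f"result={total} elapsed={seconds:.3f}s peak_rss={peak}")
```

Peak RSS is used here instead of `tracemalloc` because the latter only tracks allocations made through Python's allocator, not memory held by native code.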

aersam · 2024-04-19T05:31:59Z

Pretty sure the non-async write causes issues. But object_store 0.10 will change a lot there, so maybe better to wait for that

ion-elgreco · 2024-04-19T06:42:02Z

> Pretty sure the non-async write causes issues. But object_store 0.10 will change a lot there, so maybe better to wait for that

Yes, let's see how effective these changes are with the new upload trait.

ion-elgreco · 2024-06-11T20:52:15Z

@aersam fyi, ObjectStore just got bumped in the repo to 0.10

ion-elgreco · 2024-07-24T17:02:34Z

@aersam hey, do you think you have time to resolve the merge conflicts?

aersam · 2024-07-24T17:08:56Z

> @aersam hey, do you think you have time to resolve the merge conflicts?

I'm sorry, I had a bit of a shift in priorities, so it will take some time. Especially since there were quite a few changes in the writer, as far as I can see.

ion-elgreco · 2024-07-24T17:22:30Z

> > @aersam hey, do you think you have time to resolve the merge conflicts?
>
> I'm sorry, I had a bit of a shift in priorities, so it will take some time. Especially since there were quite a few changes in the writer, as far as I can see.

No worries! Just ping me once it's ready for another review round.

aersam and others added 25 commits March 7, 2024 21:15

close to compiling

565f43d

still learning :)

3a52bb7

some compile errors

30a5463

another bug fix

cde4207

clippy feedback

6743373

test compilation

577442b

wip on tests

4b276a7

Merge branch 'main' of https://github.com/aersam/delta-rs into write-iter2

9d022cb

Merge branch 'main' of https://github.com/aersam/delta-rs into write-iter2

d1352fa

cleanup

d4d82ce

wip on fixes

385c935

more fixes

023df09

more fixes

0397a0c

fmt

c83f947

adjust test

f131eb1

use into()

a3d5585

we need GIL, no?

965968c

clippy, your so right

83d398f

revert 965968c and 965968c

98bf7ec

Merge branch 'main' into write-iter2

44cd5b9

Merge branch 'main' of https://github.com/aersam/delta-rs into write-iter2

5ae3599

fmt

28eba65

Merge branch 'write-iter2' of https://github.com/aersam/delta-rs into write-iter2

c66762a

Merge branch 'main' of https://github.com/aersam/delta-rs into write-iter2

6e742a9

use tasks for writing

cf375b9

aersam requested review from MrPowers, wjones127, fvaleye, roeap and ion-elgreco as code owners March 15, 2024 14:25

test fixews

4655742

aersam mentioned this pull request Apr 10, 2024

Schema evolution mergeSchema support #1386

Closed

aersam added 4 commits April 15, 2024 10:16

Merge branch 'main' of https://github.com/aersam/delta-rs into write-iter3

704d86b

fmt

2b135f9

use parallel as arg

e5f12fb

ruff

6242179

aersam added 2 commits April 15, 2024 11:15

parallel

68ea7ed

remove fancy union syntax

0bbf3b5

ion-elgreco mentioned this pull request Nov 3, 2024

rust engine consume a lot of memory compared to pyarrow #2968

Open

rtyler self-assigned this Nov 3, 2024

rtyler marked this pull request as draft November 3, 2024 02:04

rtyler added this to the Rust v1.0.0 milestone Nov 3, 2024

sh-rp mentioned this pull request Nov 6, 2024

[Don't Merge] Setting to control delta job count for each delta write dlt-hub/dlt#2031

Closed

Gilbert09 mentioned this pull request Nov 6, 2024

fix(data-warehouse): Monkey patch DLT to reduce mem consumption PostHog/posthog#26040

Merged

rtyler modified the milestones: Rust v1.0.0, v0.23 Dec 1, 2024
