
Pyarrow direct loading #679

Merged: 30 commits merged into devel from sthor/pyarrow-load on Oct 16, 2023

Conversation

steinitzu (Collaborator)

Continuation of #662


rudolfix (Collaborator)

@sh-rp please take a look
@steinitzu it looks good, the code structure is OK. What we still need:

  • a pipeline test where we load Arrow tables and pandas frames to all destinations that support parquet (a minimal sketch follows below)
  • documentation: let's add it to "Verified Sources" under Arrow Table / Pandas
  • I will still do the review
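
A minimal sketch of what such a pipeline test could exercise, assuming the duckdb destination and hypothetical table names; only `dlt.pipeline` and `pipeline.run` from the public API are used here:

    import dlt
    import pandas as pd
    import pyarrow as pa

    # the same rows as a pandas frame and as an Arrow table
    df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
    arrow_table = pa.Table.from_pandas(df)

    # duckdb is used for illustration; the test would parametrize over all
    # destinations that support the parquet loader file format
    pipeline = dlt.pipeline(pipeline_name="arrow_load_test", destination="duckdb", dataset_name="arrow_data")
    pipeline.run(arrow_table, table_name="items_arrow")
    pipeline.run(df, table_name="items_pandas")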

rudolfix (Collaborator) left a comment

this is good!

  • we need the tests I mentioned in the review
  • please also test what happens when we add incremental or any other filter/map function
  • I think it is trivial to modify ItemTransform to deconstruct a Table or pandas frame into rows and make incremental work. Of course this destroys the table. Maybe there's a way to "rescue" filter functions, e.g. by filtering the Arrow table with a Python lambda (a rough sketch follows below).
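
A rough sketch of the "filter by a python lambda" idea from the last point, assuming the user-supplied predicate takes a row dict; the rows are materialized once to build a boolean mask, but the table itself stays intact:

    import pyarrow as pa

    def filter_table(tbl: pa.Table, predicate) -> pa.Table:
        # evaluate the Python predicate per row to build a boolean mask,
        # then filter the Arrow table without deconstructing it into rows
        mask = pa.array([predicate(row) for row in tbl.to_pylist()])
        return tbl.filter(mask)

    tbl = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
    filtered = filter_table(tbl, lambda row: row["id"] > 1)  # keeps rows 2 and 3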

Review comments on:
  dlt/common/storages/normalize_storage.py
  dlt/extract/extract.py
  dlt/normalize/items_normalizers.py
  dlt/normalize/normalize.py
sh-rp (Collaborator) commented Oct 10, 2023

Nice PR! One thing that came to my mind: with the way it is implemented now we lose the ability to create variant columns. If we try to load an Arrow table that has a different datatype on a column, it will now fail when updating the schema in the extract phase instead of creating a variant. It seems we could solve this with rename_columns in Arrow (a rough sketch follows below), but that would mean having some kind of variant-detection step in the extract phase, or alternatively pushing the schema update downstream into the normalize phase (which might be nice anyway to keep it similar to the way it works now). Also, should we give users the option to extract complex fields in parquet into subtables?

PS: Ok, if we push this to the normalizer, then we lose the ability to just copy those files to the load stage..
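
A rough sketch of the rename_columns idea, assuming a hypothetical variant naming scheme and a caller that already knows the expected type; the detection and the suffix are illustrative only, not the dlt convention:

    import pyarrow as pa

    def divert_to_variant(tbl: pa.Table, column: str, expected_type: pa.DataType) -> pa.Table:
        # if the incoming column type differs from the schema, rename it so it
        # would land in a separate (hypothetical) variant column instead of failing
        if tbl.schema.field(column).type == expected_type:
            return tbl
        new_names = [
            f"{name}__v_{tbl.schema.field(name).type}" if name == column else name
            for name in tbl.column_names
        ]
        return tbl.rename_columns(new_names)

    tbl = pa.table({"id": pa.array(["1", "2"], type=pa.string())})
    tbl = divert_to_variant(tbl, "id", pa.int64())  # column becomes "id__v_string"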

steinitzu (Collaborator, Author)

> this is good!
>
> * we need tests I mentioned in review
> * please also test what happens if we add incremental or any other filter/map function
> * I think it is trivial to modify `ItemTransform` to deconstruct Table or Panda into rows and make incremental working. this of course will destroy the table. maybe there's a way to "rescue" filter functions? maybe you can filter arrow table by a python lambda

I think we could do incremental efficiently with min/max functions on the arrow table directly, but custom aggregate functions look complicated in arrow.
Maybe with pandas we can make it work generally (hoping the arrow -> pandas -> arrow conversion is not destructive).
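
A minimal sketch of the min/max idea, assuming an ascending cursor column; pyarrow.compute does the comparison and the aggregation directly on the Arrow data, with no conversion to rows:

    import pyarrow as pa
    import pyarrow.compute as pc

    def incremental_filter(tbl: pa.Table, cursor_column: str, last_value):
        # keep only rows newer than the last seen cursor value ...
        if last_value is not None:
            tbl = tbl.filter(pc.greater(tbl[cursor_column], last_value))
        # ... and compute the new high watermark on the Arrow column itself
        new_last_value = pc.max(tbl[cursor_column]).as_py() if tbl.num_rows else last_value
        return tbl, new_last_value

    tbl = pa.table({"updated_at": [1, 2, 3, 4]})
    tbl, last = incremental_filter(tbl, "updated_at", last_value=2)  # 2 rows kept, last == 4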

rudolfix (Collaborator)

@sh-rp good points! But in most cases people want Arrow tables to have a strict schema (and typically the data is already well defined).

I like the idea of extracting the structs into separate tables. We could do that in a parquet normalizer, but of course the tables would need to be rewritten - most probably using duckdb as the engine. Btw, that would be FAST compared to our standard normalizer.
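
A rough sketch of how that could look with duckdb as the engine, assuming a parquet file with a struct column named `info` (both names are hypothetical); whether and where this runs is exactly the open question above:

    import duckdb

    con = duckdb.connect()
    # expand the struct fields into their own columns, keeping the parent key,
    # so they could be written out as a child table ...
    child = con.sql("SELECT id, UNNEST(info) FROM read_parquet('items.parquet')").arrow()
    # ... and rewrite the parent table without the struct column
    parent = con.sql("SELECT * EXCLUDE (info) FROM read_parquet('items.parquet')").arrow()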

rudolfix (Collaborator) left a comment

added some comments on incremental

Review comments on:
  dlt/extract/incremental/__init__.py
  dlt/extract/incremental/transform.py

# Filter out all rows which have cursor value equal to last value
# and unique id exists in state
tbl = tbl.append_column("_dlt_index", pa.array(range(tbl.num_rows)))
rudolfix (Collaborator):

I think you can create the index after you filter the rows in the next line

unique_values = [(i, uq_val) for i, uq_val in unique_values if uq_val not in incremental_state['unique_hashes']]
keep_idx = pa.array(i for i, _ in unique_values)
# Filter the table
tbl = tbl.filter(pa.compute.is_in(tbl["_dlt_index"], keep_idx))
rudolfix (Collaborator):

  1. this will keep only the records with unique values and remove all the records that are "newer" than the unique ones, right? because keep_idx is only built from unique values. maybe you should use is_not_in? (a rough sketch follows below)
  2. also, adding the index could be deferred to the moment we actually need to remove something below (and maybe the primary key could be used for that if present)
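
A rough sketch of point 1, reusing `tbl`, `unique_values` and `incremental_state` from the snippet above; pyarrow.compute has no is_not_in, so the is_in mask is inverted instead:

    import pyarrow as pa
    import pyarrow.compute as pc

    # collect the indices of rows whose unique hash was already seen in state ...
    remove_idx = pa.array(
        [i for i, uq_val in unique_values if uq_val in incremental_state["unique_hashes"]],
        type=pa.int64(),
    )
    # ... and keep every other row by inverting the membership mask
    tbl = tbl.filter(pc.invert(pc.is_in(tbl["_dlt_index"], value_set=remove_idx)))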

rudolfix (Collaborator) left a comment

looks good! I assume this is still WIP (some tests and docs mostly).
Also, you detect arrow tables in lists, but IMO when you write to tables in the extractor you assume that items are always actual objects, not lists. If I'm not right you can ignore this.

Also, in the arrow incremental IMO you delete too many records, but that will come out in the tests.

amazing work!

Review comments on:
  dlt/extract/incremental/__init__.py
  dlt/normalize/normalize.py
steinitzu (Collaborator, Author)

> looks good! I assume that is still WIP (some tests and docs mostly) also you detect arrow tables in lists but IMO when you write to tables in extractor you assume that items are always actual objects not lists. if I'm not right you can ignore it.
>
> also in arrow incremental IMO you delete too many records but that will come out in the tests
>
> amazing work!

Yep, tests and docs are coming; this needs more thorough testing. I'm hoping I can parametrize the current incremental tests easily to run with all formats, that would be best.

> also you detect arrow tables in lists but IMO when you write to tables in extractor you assume that items are always actual objects not lists

You might be right. This needs tests with lists as well.
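
A rough sketch of what that parametrization could look like, with a hypothetical helper that re-emits the same test rows in each item format:

    import pytest
    import pandas as pd
    import pyarrow as pa

    def to_item_format(rows, item_format):
        # hypothetical helper: same rows as python objects, a pandas frame or an Arrow table
        if item_format == "object":
            return rows
        if item_format == "pandas":
            return pd.DataFrame(rows)
        return pa.Table.from_pylist(rows)

    @pytest.mark.parametrize("item_format", ["object", "pandas", "arrow"])
    def test_incremental_any_format(item_format):
        items = to_item_format([{"id": 1, "updated_at": 1}, {"id": 2, "updated_at": 2}], item_format)
        # ... run the existing incremental assertions against `items`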

rudolfix previously approved these changes Oct 15, 2023
rudolfix (Collaborator) left a comment

LGTM! thx for incremental tests!
this needs to be fixed though

 Redshift cannot load TIME columns from parquet files. Switch to direct INSERT file format or convert `datetime.time` objects in your data to `str` or `datetime.datetime`

Also Athena has the same problem
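
A minimal sketch of the suggested workaround, assuming the offending column is an Arrow time column named `t`; it is rewritten as ISO strings before the data reaches the pipeline:

    import datetime
    import pyarrow as pa

    tbl = pa.table({"t": pa.array([datetime.time(12, 30), datetime.time(8, 15)])})

    # Redshift (and Athena) cannot load TIME from parquet, so replace the column
    # with its ISO-formatted string representation before loading
    idx = tbl.column_names.index("t")
    as_str = pa.array([v.isoformat() if v is not None else None for v in tbl["t"].to_pylist()])
    tbl = tbl.set_column(idx, "t", as_str)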

rudolfix marked this pull request as ready for review on October 16, 2023, 15:40
steinitzu (Collaborator, Author)

> LGTM! thx for incremental tests! this needs to be fixed though
>
>  Redshift cannot load TIME columns from parquet files. Switch to direct INSERT file format or convert `datetime.time` objects in your data to `str` or `datetime.datetime`

Should work now! Had a typo in this check b9f7aaf

rudolfix (Collaborator)

> LGTM! thx for incremental tests! this needs to be fixed though
>
>  Redshift cannot load TIME columns from parquet files. Switch to direct INSERT file format or convert `datetime.time` objects in your data to `str` or `datetime.datetime`
>
> Should work now! Had a typo in this check b9f7aaf

@steinitzu now there are some problems with types, i.e. binary values are returned as hex strings on Redshift, which is also the case in other tests...

rudolfix (Collaborator) left a comment

so good!

rudolfix merged commit d3db284 into devel on Oct 16, 2023
35 of 39 checks passed
rudolfix deleted the sthor/pyarrow-load branch on October 16, 2023, 21:59