Pyarrow direct loading #679
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
@sh-rp please take a look
this is good!
- we need the tests I mentioned in the review
- please also test what happens if we add incremental or any other filter/map function
- I think it is trivial to modify ItemTransform to deconstruct a Table or Pandas frame into rows and make incremental work. This will of course destroy the table. Maybe there's a way to "rescue" filter functions? Maybe you can filter an arrow table by a python lambda (see the sketch below).
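For the last point, a minimal sketch of filtering an arrow table by a python lambda, building a boolean mask from materialized rows (the column names and the predicate are just illustrations, not part of this PR):

```python
import pyarrow as pa

def filter_table_by_lambda(tbl: pa.Table, predicate) -> pa.Table:
    # Evaluate the predicate per row (as a dict) and build a boolean mask.
    # This materializes rows in Python, so it gives up the columnar speed advantage.
    mask = pa.array([predicate(row) for row in tbl.to_pylist()])
    return tbl.filter(mask)

tbl = pa.table({"id": [1, 2, 3], "value": [10, 25, 5]})
filtered = filter_table_by_lambda(tbl, lambda row: row["value"] > 7)
```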
Nice PR! The one thing that came to my mind is that, with the way it is implemented now, we are losing the ability to create variant columns. This means that if we try to load an arrow table that has a different datatype on a column, it will now fail when updating the schema in the extract phase instead of creating a variant. It seems we could solve this with rename_columns in arrow, but this would mean having some kind of variant detection step in the extract phase, or alternatively pushing the schema updating downstream into the normalizer phase (which might be nice anyway, to keep it similar to the way it is now). Also, should we give the users the option to extract complex fields in parquet into subtables? PS: Ok, if we push this to the normalizer, then we lose the ability to just copy those files to the load stage...
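A rough sketch of what the rename_columns idea could look like; the variant naming convention and the type check here are assumptions for illustration, not how dlt actually derives variant column names:

```python
import pyarrow as pa

def rename_to_variant(tbl: pa.Table, column: str, expected: pa.DataType) -> pa.Table:
    # If the incoming arrow type differs from the expected schema type, rename the
    # column to a variant-style name so it lands in a separate column downstream.
    actual = tbl.schema.field(column).type
    if actual != expected:
        new_names = [
            f"{name}__v_{actual}" if name == column else name
            for name in tbl.column_names
        ]
        return tbl.rename_columns(new_names)
    return tbl

tbl = pa.table({"amount": ["1.5", "2.0"]})
tbl = rename_to_variant(tbl, "amount", pa.int64())
```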
Force-pushed from f5874f0 to 7d0f648
I think we could do incremental efficiently with min/max functions on the arrow table directly. But custom aggregate functions look complicated in arrow.
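For the min/max part, pyarrow's compute module should cover it without a custom aggregate; a sketch, assuming the cursor column is called `updated_at`:

```python
import pyarrow as pa
import pyarrow.compute as pc

tbl = pa.table({"updated_at": [3, 7, 1], "value": ["a", "b", "c"]})

# min_max returns a struct scalar with "min" and "max" fields, computed natively,
# so no Python-level row iteration is needed to advance the cursor.
bounds = pc.min_max(tbl["updated_at"])
new_last_value = bounds["max"].as_py()
```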
@sh-rp good points! But in most cases people want Arrow tables to have a strict schema (and typically the data is already well defined). I like the idea to extract the structs into separate tables. We could do that in a parquet normalizer, but OFC the tables would need to be rewritten - most probably using duckdb as the engine. Btw. that would be FAST - compared to our standard normalizer.
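A rough illustration of the duckdb idea; the file name, table names and the `address` struct column are hypothetical, this is not the current normalizer behaviour:

```python
import duckdb

con = duckdb.connect()

# Root table without the struct column.
con.sql("""
    CREATE TABLE customer AS
    SELECT id, name FROM read_parquet('data.parquet')
""")

# Struct fields pulled out into a subtable, keyed by the parent id.
con.sql("""
    CREATE TABLE customer__address AS
    SELECT id, address.city AS city, address.zip AS zip
    FROM read_parquet('data.parquet')
""")
```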
added some comments on incremental
dlt/extract/incremental/transform.py
# Filter out all rows which have cursor value equal to last value
# and unique id exists in state
tbl = tbl.append_column("_dlt_index", pa.array(range(tbl.num_rows)))
I think you can create the index after you filter the rows in the next line
dlt/extract/incremental/transform.py
unique_values = [(i, uq_val) for i, uq_val in unique_values if uq_val not in incremental_state['unique_hashes']]
keep_idx = pa.array(i for i, _ in unique_values)
# Filter the table
tbl = tbl.filter(pa.compute.is_in(tbl["_dlt_index"], keep_idx))
- this will keep only the records with unique values and remove all the records that are "newer" than the unique ones, right? Because keep_idx only covers the unique values. Maybe you should use is_not_in instead? (see the sketch below)
- also, adding the index could be deferred to the moment we actually need to remove something below (and maybe the primary key could be used for that, if present)
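A sketch of the drop-instead-of-keep variant; since I'm not sure pyarrow ships an `is_not_in` kernel, this uses `invert` over `is_in`, with the `_dlt_index` column following the snippet above (the function name and parameters are made up):

```python
import pyarrow as pa
import pyarrow.compute as pc

def drop_seen_rows(tbl: pa.Table, unique_values, known_hashes) -> pa.Table:
    # unique_values: (row_index, unique_hash) pairs for rows at the last cursor value.
    # known_hashes: hashes already stored in the incremental state.
    remove_idx = [i for i, uq_val in unique_values if uq_val in known_hashes]
    if not remove_idx:
        return tbl
    # Keep everything whose index is NOT in remove_idx, including rows newer than the cursor.
    mask = pc.invert(pc.is_in(tbl["_dlt_index"], value_set=pa.array(remove_idx)))
    return tbl.filter(mask)
```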
looks good! I assume this is still WIP (some tests and docs mostly)
also, you detect arrow tables in lists, but IMO when you write to tables in the extractor you assume that items are always actual objects, not lists. If I'm not right you can ignore it.
also, in the arrow incremental IMO you delete too many records, but that will come out in the tests
amazing work!
Yep, tests and docs coming; this needs more thorough testing. I'm hoping I can parametrize the current incremental tests easily to run with all formats, that would be best.
You might be right. Needs tests with lists as well.
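One possible shape for that parametrization (the fixture data and format names here are hypothetical, not the actual test code):

```python
import pandas as pd
import pyarrow as pa
import pytest

@pytest.mark.parametrize("item_format", ["object", "pandas", "arrow"])
def test_incremental_all_formats(item_format):
    rows = [{"id": 1, "updated_at": 1}, {"id": 2, "updated_at": 2}]
    if item_format == "pandas":
        data = pd.DataFrame(rows)
    elif item_format == "arrow":
        data = pa.Table.from_pylist(rows)
    else:
        data = rows
    # feed `data` through the existing incremental assertions here
```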
Force-pushed from 5a620b0 to 6bc0c0a
Force-pushed from 78dd7dc to 3662756
Force-pushed from 9f685ab to 33ce792
LGTM! thx for the incremental tests!
this needs to be fixed though:
Redshift cannot load TIME columns from parquet files. Switch to direct INSERT file format or convert `datetime.time` objects in your data to `str` or `datetime.datetime`
Also, Athena has the same problem.
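For anyone hitting this, a sketch of the `datetime.time` -> `str` workaround on the Python side before the data reaches the pipeline (the column name is made up):

```python
from datetime import time

rows = [{"id": 1, "start": time(9, 30)}, {"id": 2, "start": time(17, 0)}]

# Redshift/Athena cannot load TIME columns from parquet, so serialize them as ISO strings.
for row in rows:
    if isinstance(row["start"], time):
        row["start"] = row["start"].isoformat()
```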
Should work now! Had a typo in this check b9f7aaf
@steinitzu now there are some problems with types, i.e. binary values are returned as hex strings on Redshift, which is the case in other tests...
so good!
Continuation of #662