-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add load_id to arrow tables in extract step instead of normalize #1449
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
|
dlt/extract/extractors.py
Outdated
# Inject the parts of normalize configuration that are used here | ||
@with_config( | ||
spec=ItemsNormalizerConfiguration, sections=(known_sections.NORMALIZE, "parquet_normalizer") | ||
) | ||
def __init__(self, *args: Any, add_dlt_load_id: bool = False, **kwargs: Any) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was my solution to support the old normalize config.
I think it makes sense to keep it and have it consistent with object normalizer config, rather than move to extract
section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm idea is very good.. but maybe you could decorate a method in this class. not init? so you call it just to retrieve config,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I think we miss some tests. There's for sure test that checks if load_id is added. However now we need test two things:
- adding load_id in extract + sensitivity to your config "hack" - just pipeline.extract() arrow table and see if extracted parquet has load_id
- load some json data and request parquet in nromalize stage - this will create parquet file in normalizer and we test if load id is added there
536f0f2
to
706dce5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@steinitzu thx for the tests!
still a few issues with normalizing column names. pls check
Updated with normalized identifiers :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@steinitzu looks good! but new test is not passing. IMO some column mismatch?
@rudolfix this is fixed by b1be9c9 |
…o write-load-id-in-extract
this is really weird. I added one more tests and the columns are added at the end. current version is btw. better, update_table is done at the end to modify actual schema |
Description
_dlt_load_id
to arrow tables in extract step before writing file to disk.Should be backwards compatible, aside from the edge case where you upgrade dlt in between extract and normalize
Will add/check tests
Related Issues
Step 1 of #1317
Additional Context