Apply hints for nested tables #2165

Open: wants to merge 12 commits into base: devel


steinitzu
Collaborator

Description

Draft of nested table hints implementation:

apply_hints(path=['a', 'b', 'c'], columns=...)

This works so far, but there are some bugs to fix and tests still to be written.

Related Issues

Additional Context


netlify bot commented Dec 19, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit e51bd25
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/679a7aff51c87d000948a3af

zilto added 2 commits January 28, 2025 18:52
Adding this type annotation fixed 69 failing tests. The missing Optional
broke how the dlt.common.validation.validate_dict().validate_prop()
functions parse the RESTAPIConfig object.
@zilto zilto marked this pull request as ready for review January 29, 2025 01:42
Collaborator

@rudolfix rudolfix left a comment


Please see my suggestion on how to deal with the naming convention. The docs requirements are in the ticket.

full_path = (root_table_name,) + path
table = instance.compute_table_schema(item, meta)
if not table.get("name"):
    table["name"] = "__".join(full_path)  # TODO: naming convention
Collaborator


compute_table_chain must take a NamingConvention instance, which has a method to join a path, so that we do not need to hardcode the "__".

Overall, it is a weakness of dlt that it relies on such a separator and stores only normalized names in the schema. We lose a little bit of lineage information, but right now we can't really avoid that without a big rewrite.
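
A minimal sketch of the suggestion above, assuming a naming-convention object that exposes a path-joining method. SketchNaming and nested_table_name are hypothetical stand-ins written for illustration, not dlt's NamingConvention class or its actual signatures. The only point is that the separator lives in one place rather than at the call site.

```python
from typing import Sequence, Tuple


class SketchNaming:
    """Hypothetical stand-in for a naming convention (not dlt's class)."""

    PATH_SEPARATOR = "__"

    def make_path(self, *identifiers: str) -> str:
        # Join non-empty identifiers with the convention's separator.
        return self.PATH_SEPARATOR.join(i for i in identifiers if i)

    def break_path(self, path: str) -> Sequence[str]:
        # Inverse of make_path, valid only when no identifier contains the separator.
        return [p for p in path.split(self.PATH_SEPARATOR) if p]


def nested_table_name(
    naming: SketchNaming, root_table_name: str, path: Tuple[str, ...]
) -> str:
    # The caller never mentions "__"; the convention decides how to join.
    return naming.make_path(root_table_name, *path)


name = nested_table_name(SketchNaming(), "nested_data", ("outer1", "innerfoo"))
```

With this shape, swapping the convention changes every generated nested table name consistently, which is what the hardcoded join cannot do.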

@zilto zilto force-pushed the define-hints-nested-tables branch from bef7a3f to e51bd25 Compare January 29, 2025 19:01
@zilto
Collaborator

zilto commented Jan 29, 2025

The current implementation adds the tables to the schema (as tested), but it doesn't affect how the data is loaded.

For example, the hints will appear in

pipeline.default_schema.tables.keys()
# ignoring the dlt tables
# 'nested_data', 'override_child_outer1', 'override_child_outer1_innerfoo','nested_data__outer1', 'nested_data__outer1__innerfoo'

whereas the normalizer row counts show no ingested data for the override tables:

pipeline.last_trace.last_normalize_info.row_counts
# 'nested_data': 2, 'nested_data__outer1': 2, 'nested_data__outer1__innerfoo': 2
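
The mismatch can be made explicit by diffing the two collections. This is a small sketch that copies the table names quoted above into literals instead of reading them from a live pipeline; the variable names are illustrative only.

```python
# Table names copied from the two outputs quoted above (dlt system tables ignored).
schema_tables = {
    "nested_data",
    "override_child_outer1",
    "override_child_outer1_innerfoo",
    "nested_data__outer1",
    "nested_data__outer1__innerfoo",
}
row_counts = {
    "nested_data": 2,
    "nested_data__outer1": 2,
    "nested_data__outer1__innerfoo": 2,
}

# Tables present in the schema that received no rows during normalize.
tables_without_rows = sorted(schema_tables - set(row_counts))
```

Here tables_without_rows holds exactly the two override tables, which is the symptom described: the hints are registered in the schema, but the data still lands in the default nested tables.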

I believe changes need to be made to Extractor._write_to_dynamic_table() and _get_dynamic_table_name() to push data to the right table. (Extractor._write_to_static_table() should rely on the explicitly provided table name).

The extractor would need to hold some mapping, but it could be more appropriate to move the logic to dlt.common.normalizers.json.helpers or to a Schema method?
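
As a rough illustration of the kind of mapping meant here, the sketch below resolves a nested path to a table name, preferring a user-supplied override and otherwise falling back to the joined default. resolve_table_name, the overrides dictionary, and the separator argument are all hypothetical; this is not how Extractor or the normalizer currently work.

```python
from typing import Dict, Tuple

Path = Tuple[str, ...]


def resolve_table_name(
    root: str, path: Path, overrides: Dict[Path, str], separator: str = "__"
) -> str:
    # Prefer a name the user attached to this nested path.
    if path in overrides:
        return overrides[path]
    # Otherwise fall back to the default: root and path joined by the separator.
    return separator.join((root,) + path)


# Hypothetical overrides keyed by the path below the root table.
overrides: Dict[Path, str] = {
    ("outer1",): "override_child_outer1",
    ("outer1", "innerfoo"): "override_child_outer1_innerfoo",
}

custom = resolve_table_name("nested_data", ("outer1", "innerfoo"), overrides)
default = resolve_table_name("nested_data", ("outer2",), overrides)
```

Wherever such a lookup lives, both the component that writes rows and the component that registers hints would need to consult it; otherwise the two diverge as observed above.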

@rudolfix
Collaborator

The relational normalizer follows its own logic for creating nested tables and column names, and that logic derives only from the data. There is no mechanism to rename them, except for the root table name, which the user must set.

dlt is data first, not schema first. This is counterintuitive if you choose to start your work with the schema rather than the data.

I assume that in the example you are giving, you used a custom table name for the nested table. If this is not the case, ping me on Slack; maybe there's a bug somewhere.

In the ticket above, there's a note:

You still may allow users to specify table_name on the nested hint. If you do so, you'll need to modify the normalizer so it maps paths to those names. IMO this is for another ticket and bigger overhaul of the schema
prevent following to be set on nested table:
parent_table_name: TTableHintTemplate[str] = None,
incremental: TIncrementalConfig = None,

So I'd say we block setting the table name on nested hints (blocking the parent name and incremental also makes sense).
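
One way such a block could look, as a sketch only: a check that rejects the three hints when they are set on a nested table. check_nested_hints and BLOCKED_NESTED_HINTS are hypothetical names, and the plain dictionary stands in for whatever structure the real hints use.

```python
from typing import Any, Dict

# Hints that would be rejected on nested tables, per the comment above.
BLOCKED_NESTED_HINTS = ("table_name", "parent_table_name", "incremental")


def check_nested_hints(hints: Dict[str, Any]) -> None:
    # Collect every blocked hint that was actually given a value.
    blocked = [key for key in BLOCKED_NESTED_HINTS if hints.get(key) is not None]
    if blocked:
        raise ValueError(
            "not allowed on nested table hints: " + ", ".join(blocked)
        )


# Passes: only column hints are set on the nested table.
check_nested_hints({"columns": {"id": {"data_type": "bigint"}}})
```

Raising early, at the point where hints are applied, keeps the normalizer untouched and leaves custom nested table names for a later ticket.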

Successfully merging this pull request may close these issues.

Simplify schema modification of child tables
3 participants