
sql_database.with_resources(*) reflects unnecessary table schemas #2121

Open
jeff-skoldberg-gmds opened this issue Dec 4, 2024 · 3 comments
Assignees
Labels
question Further information is requested

Comments

@jeff-skoldberg-gmds

dlt version

1.4.0

Describe the problem

Context is in this Slack thread.

Summary

This code performs schema reflection for every table and column in the given database schema, even if only one table is passed in:

source = sql_database(
    credentials=self.connection_string,
    schema=schema,
    reflection_level="full",
).with_resources(*table_names)

However, this code does exactly what a user would expect: it only reflects the schema for the necessary tables and columns:

source = sql_database(
    table_names=table_names,
    credentials=self.connection_string,
    schema=schema,
    reflection_level="full",
)

bug or feature?

There is debate in the Slack thread about whether this is a bug. But I would ask these questions:

  1. Would a user ever want dlt to do schema reflections for tables & columns they are not using?
  2. Can a user leverage the extra column schema reflections? And even if they can, what would they do with it?
  3. Is there value to reflecting the schema for unused columns? If yes, what value?

If the answers to these questions are all negative, I feel we have to categorize this as a bug.

Personally, I can't imagine a single user would want dlt to reflect the schema for random columns they are not using.

impact to my code

Using the table_names= kwarg limits me with certain features: since I am no longer defining specific resources, just table names, I have to apply hints after the source is created instead of defining them up front in config along with the hints. (Maybe I'm doing something wrong, but that has been my experience as a non-expert here...)

Expected behavior

I would expect that only tables and columns requested by the user undergo schema reflection.

Steps to reproduce

Run this:

source = sql_database(
    credentials=self.connection_string,
    schema=schema,
    reflection_level="full",
).with_resources(*table_names)

I think this is the same behavior for all sql_database types, but I am using DB2 if that matters at all.

Operating system

Windows

Runtime environment

Local

Python version

3.11

dlt data source

sql_database

dlt destination

DuckDB, Snowflake

Other deployment details

No response

Additional information

The workaround is to pass a list of tables and then do something like this:

# `tbl` is one entry from a per-table config list; `table_resource` is the
# matching resource on the created source.
if tbl.get("incremental_column"):
    initial_value = (
        datetime.strptime(tbl["initial_value"], "%Y-%m-%d")
        if tbl.get("initial_value")
        else None
    )

    logger.info(
        f"Applying hints for table: {tbl['name']}, "
        f"cursor column: {tbl['incremental_column']}, "
        f"initial value: {initial_value}, "
        f"write disposition: {tbl['write_disposition']}, "
        f"primary key: {tbl['primary_key']}"
    )
    table_resource.apply_hints(
        incremental=dlt.sources.incremental(
            cursor_path=tbl["incremental_column"],
            initial_value=initial_value,
            lag=172800,  # 2 days to account for late-arriving data
        ),
        write_disposition="merge",
        primary_key=tbl["primary_key"],
    )
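
The config-to-hints translation in the workaround above can be sketched as a small self-contained helper (plain Python, standard library only; the dict shape mirrors the snippet above, and the nested incremental dict stands in for what would go into dlt.sources.incremental in real code):

```python
from datetime import datetime


def build_hint_kwargs(tbl: dict) -> dict:
    """Translate a per-table config entry into hint arguments.

    The "incremental" value is returned as a plain dict here; in real
    code those keys would be passed to dlt.sources.incremental().
    """
    initial_value = (
        datetime.strptime(tbl["initial_value"], "%Y-%m-%d")
        if tbl.get("initial_value")
        else None
    )
    return {
        "incremental": {
            "cursor_path": tbl["incremental_column"],
            "initial_value": initial_value,
            "lag": 172800,  # 2 days, to account for late-arriving data
        },
        "write_disposition": tbl.get("write_disposition", "merge"),
        "primary_key": tbl["primary_key"],
    }


hints = build_hint_kwargs(
    {
        "name": "orders",
        "incremental_column": "updated_at",
        "initial_value": "2024-01-01",
        "write_disposition": "merge",
        "primary_key": "order_id",
    }
)
```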
@rudolfix rudolfix moved this from Todo to In Progress in dlt core library Dec 9, 2024
@rudolfix
Collaborator

@jeff-skoldberg-gmds I do not understand what kind of problem you have with passing a list of tables upfront. I assume you know them? The behavior you describe comes from the way Python executes code: it first creates the sql_database instance and then calls with_resources on it. That's why the code in sql_database is not aware of what happens later.

How can we make your code simpler? If you know all your table names in advance, you can pass them to sql_database with defer_table_reflect set to True. In that case reflection happens only when a particular table is selected, so you can call with_resources several times on the same instance to get copies, and tables that you never extract won't be reflected at all.
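
A minimal sketch of the defer_table_reflect pattern (assuming a reachable database; the connection string, schema, and table names are placeholders):

```python
import dlt
from dlt.sources.sql_database import sql_database

# With deferred reflection, sql_database reflects nothing up front;
# each table is reflected only when its resource is actually selected.
source = sql_database(
    credentials="ibm_db_sa://user:pass@host:50000/MYDB",  # placeholder
    schema="MYSCHEMA",
    reflection_level="full",
    defer_table_reflect=True,
)

# Only "orders" and "customers" are reflected; other tables never are.
pipeline = dlt.pipeline(destination="duckdb", dataset_name="raw")
pipeline.run(source.with_resources("orders", "customers"))
```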

The other option is passing your own metadata with all tables already defined. You mention that you do not want to reflect the columns that you do not use, so if you know your schema in advance you could just create the metadata yourself.
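
The "bring your own metadata" option could look roughly like this (a sketch, assuming sql_database accepts a SQLAlchemy MetaData object via its metadata parameter, as described above; the connection string, table, and column names are made up):

```python
import sqlalchemy as sa
from dlt.sources.sql_database import sql_database

# Define only the tables and columns you actually want; nothing is reflected.
metadata = sa.MetaData(schema="MYSCHEMA")
sa.Table(
    "orders",
    metadata,
    sa.Column("order_id", sa.Integer, primary_key=True),
    sa.Column("updated_at", sa.DateTime),
    sa.Column("amount", sa.Numeric(12, 2)),
)

source = sql_database(
    credentials="ibm_db_sa://user:pass@host:50000/MYDB",  # placeholder
    metadata=metadata,
)
```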

Another option that comes to mind is adding a setting that fully skips reflection; in that case we would infer everything from the data, and you would have to provide a custom SELECT statement to extract.

@rudolfix rudolfix self-assigned this Dec 10, 2024
@rudolfix rudolfix added the question Further information is requested label Dec 10, 2024
@jeff-skoldberg-gmds
Author

jeff-skoldberg-gmds commented Dec 11, 2024

Hi @rudolfix , thanks for your questions here. Let me see if I can organize my thoughts.

why do I not want to pass a list of tables

I like the idea of defining resources up front in config, including which column is the cursor column, the write disposition, and so on.
I haven't tried it, but I assumed you can't pass a list of dlt resources to the table_names argument. Is that wrong?
If I go with a list of tables, I think I have to apply these hints AFTER the source is created, and all of those hints are then applied as code instead of config.
Maybe I'm thinking about this all wrong. I will admit I'm not really sure of the best approach here.

Passing a list of tables is working for me, but I would like the flexibility to pass resources.

defer_table_reflect plus with_resources

Sounds very promising. I will have to test to confirm it solves this, but I trust it does.
This sounds like the fix.

skipping reflection, passing a SQL query

This honestly sounds like a dreamy bonus feature. I have posted a lot of challenges in Slack about schema reflection. Pyarrow seems to very much hate my source data.
Passing a SQL query could also provide other hacks for filtering the data (start dates, custom look-backs, etc.).
Also I could eliminate some views in the upstream database where they did not provide a good cursor column.
Do you want me to create a feature request for this?

closing thoughts

While I see that I have good work-arounds for the unnecessary reflection (deferred reflection), and I understand logically why the code does what it does, this unnecessary reflection is behavior no one would ever rely on. I guess if it is simple to avoid with defer_table_reflect, it is not worth fixing. But man, it just seems like something you don't want in your product.

@rudolfix
Collaborator

rudolfix commented Dec 15, 2024

Just a quick question regarding:

> why do I not want to pass a list of tables

Did you try using sql_table directly? It is a function that creates a dlt resource corresponding to a single table. You are free to pass them to the pipeline directly, or return them from your custom dlt.source function; this is a common pattern when you control which tables to extract via some custom config file.
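
The sql_table pattern described above could look roughly like this (a sketch; the config list, connection string, schema, and column names are placeholders):

```python
import dlt
from dlt.sources.sql_database import sql_table

# Per-table config, e.g. loaded from a YAML/TOML file.
tables = [
    {"name": "orders", "incremental_column": "updated_at"},
    {"name": "customers", "incremental_column": "modified"},
]

# One resource per configured table; only these tables are reflected.
resources = [
    sql_table(
        credentials="ibm_db_sa://user:pass@host:50000/MYDB",  # placeholder
        schema="MYSCHEMA",
        table=tbl["name"],
        incremental=dlt.sources.incremental(tbl["incremental_column"]),
    )
    for tbl in tables
]

pipeline = dlt.pipeline(destination="duckdb", dataset_name="raw")
pipeline.run(resources)
```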
