Add capability to create sql resources from queries #1997

Closed
erik-meko opened this issue Oct 28, 2024 · 5 comments
Assignees
rudolfix
Labels
question Further information is requested

Comments

@erik-meko

Feature description

Create a sql resource in dlt.sources.sql_database that allows creating models/tables from sql. It could be named sql_query and work like sql_table, except that instead of creating a resource from a single table, it can take multiple database objects as input, preferably defined by a sql statement.

This is useful when the source database is not under the consumer's control, so it is not possible to create a static view.

Are you a dlt user?

I'd consider using dlt, but it's lacking a feature I need.

Use case

I need to load tables incrementally where the cursor is in a second table, i.e. I would like to do:

SELECT s.*, m.updatedtimestamp
FROM supporting_table AS s
JOIN main_table AS m ON s.main_table_id = m.id
WHERE m.updatedtimestamp >= :_last_updated

Proposed solution

Add a new "resource type" to sql_database that allows using more ephemeral entities such as select statements (or temp tables).
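
A minimal sketch of what calling such a resource could look like, assuming a hypothetical sql_query function that mirrors the existing sql_table signature (the function name and its parameters are illustrative only and do not exist in dlt at the time of writing):

import dlt

# Hypothetical API: sql_query does not exist in dlt.sources.sql_database; this only
# illustrates the requested feature, modeled on the existing sql_table resource.
from dlt.sources.sql_database import sql_query  # hypothetical import

joined = sql_query(
    credentials="mssql+pyodbc://user:password@host/db",  # any SQLAlchemy URL
    query="""
        SELECT s.*, m.updatedtimestamp
        FROM supporting_table AS s
        JOIN main_table AS m ON s.main_table_id = m.id
        WHERE m.updatedtimestamp >= :_last_updated
    """,
    # the :_last_updated bind parameter would be filled from the incremental cursor
    incremental=dlt.sources.incremental("updatedtimestamp"),
)

pipeline = dlt.pipeline(
    pipeline_name="joined_load",
    destination="duckdb",
    dataset_name="supporting_data",
)
pipeline.run(joined, table_name="supporting_with_cursor", write_disposition="merge")

The key difference from sql_table is that column types would have to be inferred from the result set rather than reflected from a single table, which is the reflection problem discussed in the comments below.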

Related issues

No response

@rudolfix
Collaborator

rudolfix commented Nov 4, 2024

@erik-meko
we are slowly working on such a thing... the biggest problem is proper column reflection, which is not available for sql queries via SQLAlchemy (correct me if I'm wrong!). right now you have 3 options:

  1. create a VIEW with your join and extract it
  2. use query_adapter_callback to replace the generated query with your own. I think you'll be able to add new columns; their data types will be inferred from the data. maybe that is enough in your case (see the sketch after the example below)
  3. use pure dlt to run sql queries, i.e.:
import dlt
import sqlalchemy as sa


def load_sql_data() -> None:
    """Load data from a SQL database with SQLAlchemy; for more complex examples use our sql_database source."""

    # Use any SQL database supported by SQLAlchemy, below we use a public
    # MySQL instance to get data.
    # NOTE: you'll need to install pymysql with `pip install pymysql`
    # NOTE: loading data from public mysql instance may take several seconds
    engine = sa.create_engine("mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam")

    with engine.connect() as conn:
        # Select genome table, stream data in batches of 100 elements
        query = "SELECT * FROM genome LIMIT 1000"
        rows = conn.execution_options(yield_per=100).exec_driver_sql(query)

        pipeline = dlt.pipeline(
            pipeline_name="from_database",
            destination="duckdb",
            dataset_name="genome_data",
        )

        # Convert the rows into dictionaries on the fly with a map function
        load_info = pipeline.run(map(lambda row: dict(row._mapping), rows), table_name="genome")

    print(load_info)  # noqa: T201

tell me if any of those solves your problem. we'll keep this ticket in the pipeline anyway because the options above are less convenient than a plain sql query.
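
A rough sketch of option 2, assuming the two-argument query_adapter_callback signature (the generated SELECT plus the reflected table); the join columns are taken from the use case above, and the exact callback signature should be verified against your dlt version:

import sqlalchemy as sa
from dlt.sources.sql_database import sql_table

def add_cursor_join(query, table):
    # Assumption: the callback receives the SELECT that dlt generated for the
    # reflected supporting_table and must return the SELECT to run instead.
    main = sa.table(
        "main_table",
        sa.column("id"),
        sa.column("updatedtimestamp"),
    )
    return sa.select(table, main.c.updatedtimestamp).select_from(
        table.join(main, table.c.main_table_id == main.c.id)
    )

# the joined column is not reflected, so its data type is inferred from the data;
# filtering on m.updatedtimestamp for incremental loads would still have to be
# added inside the callback, which is the gap this issue is about
supporting = sql_table(
    table="supporting_table",
    query_adapter_callback=add_cursor_join,
)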

@rudolfix rudolfix self-assigned this Nov 4, 2024
@rudolfix rudolfix added the question Further information is requested label Nov 4, 2024
@erik-meko
Author

erik-meko commented Nov 11, 2024

I am not allowed to modify the source, and unfortunately I wasn't successful in my attempts to apply a query_adapter_callback (I failed to understand how to modify the schema). Alternative 3 looks simple enough; however, I dropped the PoC after trying to implement another pipeline in self-hosted Airflow with Azure Key Vault as the secrets backend.

During the run, an API pipeline apparently looks for a lot of Airflow variables. Each check triggered a failing remote call to the Azure Key Vault and produced ~25 lines of stack trace in the log, totalling many hundreds of lines of stack trace in the Airflow log.

Troubleshooting that would have been a nightmare, and since this part wasn't well documented, I shut down the PoC.

@rudolfix
Collaborator

sorry to hear that dlt didn't work for you. the Airflow config provider can be disabled in config.toml or via an environment variable. there's not much we can do to stop Airflow from logging this: we already use the Variable.get method with a default value to get secrets, and it seems Airflow decides to log it anyway.
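
For completeness, a sketch of disabling that provider before the pipeline resolves its configuration; the providers section and the enable_airflow_secrets key name are assumptions to verify against the dlt configuration docs for your version:

import os

# Assumed key name: disables dlt's Airflow variables/secrets config provider so no
# Variable.get lookups (and the resulting Azure Key Vault calls) are attempted.
os.environ["PROVIDERS__ENABLE_AIRFLOW_SECRETS"] = "false"
# equivalent config.toml entry (assumed):
# [providers]
# enable_airflow_secrets = false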

@github-project-automation github-project-automation bot moved this from In Progress to Done in dlt core library Nov 18, 2024
@rudolfix rudolfix reopened this Nov 19, 2024
@rudolfix
Collaborator

actually free text sql support is coming here #2070

@rudolfix
Collaborator

rudolfix commented Dec 9, 2024

fixed in 1.4.1

@rudolfix rudolfix closed this as completed Dec 9, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in dlt core library Dec 9, 2024