Add capability to create sql resources from queries #1997

Closed
erik-meko opened this issue Oct 28, 2024 · 5 comments
Assignees
rudolfix
Labels
question Further information is requested

Comments

@erik-meko

Feature description

Create a sql resource in dlt.sources.sql_database that allows creating models/tables from sql. It could be named sql_query and work like sql_table, except that instead of creating a resource from a single table, it can take multiple database objects as input, preferably defined by a sql statement.

This is useful when the source database is not under the consumer's control, so it is not possible to create a static view.

Are you a dlt user?

I'd consider using dlt, but it's lacking a feature I need.

Use case

I need to load tables incrementally where the cursor is in a second table, i.e. I would like to do:

SELECT s.*, m.updatedtimestamp
FROM supporting_table AS s
JOIN main_table AS m ON s.main_table_id = m.id
WHERE m.updatedtimestamp >= :_last_updated

Proposed solution

Add a new "resource type" to sql_database that allows using more ephemeral entities such as select statements (or temp tables).
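
A minimal sketch of what calling such a resource could look like, assuming a hypothetical sql_query function that mirrors the existing sql_table signature (the function name and its parameters are illustrative only and do not exist in dlt at the time of writing):

import dlt

# Hypothetical API: sql_query does not exist in dlt.sources.sql_database; this only
# illustrates the requested feature, modeled on the existing sql_table resource.
from dlt.sources.sql_database import sql_query  # hypothetical import

joined = sql_query(
    credentials="mssql+pyodbc://user:password@host/db",  # any SQLAlchemy URL
    query="""
        SELECT s.*, m.updatedtimestamp
        FROM supporting_table AS s
        JOIN main_table AS m ON s.main_table_id = m.id
        WHERE m.updatedtimestamp >= :_last_updated
    """,
    # the :_last_updated bind parameter would be filled from the incremental cursor
    incremental=dlt.sources.incremental("updatedtimestamp"),
)

pipeline = dlt.pipeline(
    pipeline_name="joined_load",
    destination="duckdb",
    dataset_name="supporting_data",
)
pipeline.run(joined, table_name="supporting_with_cursor", write_disposition="merge")

The key difference from sql_table is that column types would have to be inferred from the result set rather than reflected from a single table, which is the reflection problem discussed in the comments below.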

Related issues

No response

@rudolfix
Collaborator

rudolfix commented Nov 4, 2024

@erik-meko
we are slowly working on such a thing... the biggest problem is proper column reflection, which is not available for sql queries via SQLAlchemy (correct me if I'm wrong!). right now you have 3 options:

  1. create a VIEW with your join and extract it
  2. use query_adapter_callback to replace the generated query with your own. I think you'll be able to add new columns; their data types will be inferred from the data. maybe that is enough in your case (see the sketch after the example below)
  3. use pure dlt to run sql queries, i.e.:
import dlt
import sqlalchemy as sa


def load_sql_data() -> None:
    """Load data from a SQL database with SQLAlchemy; for more complex examples use our sql_database source."""

    # Use any SQL database supported by SQLAlchemy, below we use a public
    # MySQL instance to get data.
    # NOTE: you'll need to install pymysql with `pip install pymysql`
    # NOTE: loading data from public mysql instance may take several seconds
    engine = sa.create_engine("mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam")

    with engine.connect() as conn:
        # Select genome table, stream data in batches of 100 elements
        query = "SELECT * FROM genome LIMIT 1000"
        rows = conn.execution_options(yield_per=100).exec_driver_sql(query)

        pipeline = dlt.pipeline(
            pipeline_name="from_database",
            destination="duckdb",
            dataset_name="genome_data",
        )

        # Convert the rows into dictionaries on the fly with a map function
        load_info = pipeline.run(map(lambda row: dict(row._mapping), rows), table_name="genome")

    print(load_info)  # noqa: T201

tell me if any of those solves your problem. we'll keep this ticket in the pipeline anyway because the options above are less convenient than a plain sql query.
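
A rough sketch of option 2, assuming the two-argument query_adapter_callback signature (the generated SELECT plus the reflected table); the join columns are taken from the use case above, and the exact callback signature should be verified against your dlt version:

import sqlalchemy as sa
from dlt.sources.sql_database import sql_table

def add_cursor_join(query, table):
    # Assumption: the callback receives the SELECT that dlt generated for the
    # reflected supporting_table and must return the SELECT to run instead.
    main = sa.table(
        "main_table",
        sa.column("id"),
        sa.column("updatedtimestamp"),
    )
    return sa.select(table, main.c.updatedtimestamp).select_from(
        table.join(main, table.c.main_table_id == main.c.id)
    )

# the joined column is not reflected, so its data type is inferred from the data;
# filtering on m.updatedtimestamp for incremental loads would still have to be
# added inside the callback, which is the gap this issue is about
supporting = sql_table(
    table="supporting_table",
    query_adapter_callback=add_cursor_join,
)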

@rudolfix rudolfix self-assigned this Nov 4, 2024
@rudolfix rudolfix added the question Further information is requested label Nov 4, 2024
@erik-meko
Author

erik-meko commented Nov 11, 2024

I am not allowed to modify the source, and unfortunately I wasn't successful in my attempts to apply a query_adapter_callback (I failed to understand how to modify the schema). Alternative 3 looks simple enough; however, I dropped the PoC after trying to implement another pipeline in self-hosted Airflow with Azure Key Vault as the secrets backend.

During the run, an API pipeline apparently looks for a lot of Airflow variables. Each check triggered a failing remote call to the Azure Key Vault and produced ~25 lines of stack trace in the log, totalling many hundreds of lines of stack trace in the Airflow log.

Troubleshooting that would have been a nightmare, and since this part wasn't well documented, I shut down the PoC.

@rudolfix
Collaborator

sorry to hear that dlt didn't work for you. the Airflow config provider can be disabled in config.toml or via an environment variable. there's not much we can do to stop Airflow from logging this: we already use the Variable.get method with a default value to get secrets, and it seems Airflow decides to log it anyway.
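
For completeness, a sketch of disabling that provider before the pipeline resolves its configuration; the providers section and the enable_airflow_secrets key name are assumptions to verify against the dlt configuration docs for your version:

import os

# Assumed key name: disables dlt's Airflow variables/secrets config provider so no
# Variable.get lookups (and the resulting Azure Key Vault calls) are attempted.
os.environ["PROVIDERS__ENABLE_AIRFLOW_SECRETS"] = "false"
# equivalent config.toml entry (assumed):
# [providers]
# enable_airflow_secrets = false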

@github-project-automation github-project-automation bot moved this from In Progress to Done in dlt core library Nov 18, 2024
@rudolfix rudolfix reopened this Nov 19, 2024
@rudolfix
Collaborator

actually free text sql support is coming here #2070

@rudolfix
Collaborator

rudolfix commented Dec 9, 2024

fixed in 1.4.1

@rudolfix rudolfix closed this as completed Dec 9, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in dlt core library Dec 9, 2024