Snowflake: remote file not found when using parallel-isolated decomposition in airflow #2146

Open
julesmga opened this issue Dec 13, 2024 · 2 comments
Labels: question (Further information is requested)

@julesmga

dlt version

1.4.1

Describe the problem

When using the parallel-isolated decomposition in Airflow with Snowflake as the destination and an external GCS stage, there seems to be a race condition somewhere: tasks sometimes fail when trying to load the state file because it has already been deleted (or was possibly never created in the first place).

snowflake.connector.errors.ProgrammingError: 091016 (22000): 01b9001e-0203-c615-0003-0f22001f45b2: Remote file 'gcs://composer-prod/data/dlt/datahub/_dlt_pipeline_state/1734084368.7914042.5748b20734.jsonl' was not found. If you are running a copy command, please make sure files are not deleted when they are being loaded or files are not being loaded into two different tables concurrently with auto purge option.

It does not happen when using the serialize decomposition.

Expected behavior

The state file should not be deleted before it has been uploaded to Snowflake.

Steps to reproduce

Set up Snowflake as the destination with an external stage and try to load a few resources with the parallel-isolated decomposition strategy (a sketch of such a DAG is shown below).
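
For illustration, a minimal sketch of the kind of DAG that triggers this, using the standard dlt Airflow helper; the source, names, and schedule here are placeholders rather than the actual DAG code (which is linked in the comments below).

import dlt
from airflow.decorators import dag
from dlt.common import pendulum
from dlt.helpers.airflow_helper import PipelineTasksGroup


@dlt.source
def demo_source():
    # two trivial resources so parallel-isolated produces more than one task
    @dlt.resource
    def items_a():
        yield [{"id": 1}]

    @dlt.resource
    def items_b():
        yield [{"id": 2}]

    return items_a, items_b


@dag(schedule="@daily", start_date=pendulum.datetime(2024, 12, 1), catchup=False)
def datahub_load():
    # helper defaults: run in a temporary working folder and wipe it afterwards
    tasks = PipelineTasksGroup("datahub", use_data_folder=False, wipe_local_data=True)

    pipeline = dlt.pipeline(
        pipeline_name="datahub",
        dataset_name="datahub",
        destination="snowflake",
        staging="filesystem",  # external GCS stage; bucket_url comes from config/env
    )

    # parallel-isolated: each resource becomes a separate task running its own pipeline
    tasks.add_run(
        pipeline,
        demo_source(),
        decompose="parallel-isolated",
        trigger_rule="all_done",
        retries=0,
        provide_context=True,
    )


datahub_load()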

Operating system

Linux

Runtime environment

Google Cloud Composer

Python version

3.11

dlt data source

SQLAlchemy (Snowflake)

dlt destination

Snowflake

Other deployment details

No response

Additional information

No response

@rudolfix (Collaborator)

@julesmga the path to the state file looks weird to me. When using parallel-isolated, each resource gets its own task that is executed under a separate pipeline name. Also, the working folder should be random. It does not look like that in your case:

snowflake.connector.errors.ProgrammingError: 091016 (22000): 01b9001e-0203-c615-0003-0f22001f45b2: Remote file 'gcs://composer-prod/data/dlt/datahub/_dlt_pipeline_state/1734084368.7914042.5748b20734.jsonl' was not found. If you are running a copy command, please make sure files are not deleted when they are being loaded or files are not being loaded into two different tables concurrently with auto purge option.

Could you paste your DAG? Are you forcing the working folder with an environment variable, maybe? Do you see a correct DAG in Airflow (each resource in a separate task)?
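
For reference, the working folder is controlled on the helper itself; a minimal sketch of the relevant knobs (parameter names as in recent dlt versions; worth double-checking against the installed one):

from airflow import DAG
from dlt.common import pendulum
from dlt.helpers.airflow_helper import PipelineTasksGroup

with DAG("probe_working_folder", start_date=pendulum.datetime(2024, 12, 1), schedule=None):
    tasks = PipelineTasksGroup(
        "datahub",
        use_data_folder=False,   # False: a fresh temporary working folder per run (the "random" folder mentioned above)
        local_data_folder=None,  # a fixed path here would make runs share the same folder
        wipe_local_data=True,    # remove local load packages when the run finishes
    )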

@rudolfix rudolfix self-assigned this Dec 15, 2024
@rudolfix rudolfix added the question Further information is requested label Dec 15, 2024
@rudolfix rudolfix moved this from Todo to In Progress in dlt core library Dec 15, 2024
@julesmga (Author) commented Dec 16, 2024

Thanks for the quick reply. Here are my env variables:

        "DESTINATION__STAGE_NAME"  = "public.composer"
        "DESTINATION__FILESYSTEM__BUCKET_URL" = "gs://composer-prod/data/dlt"
        "RUNTIME__LOG_LEVEL" = "INFO"
        "DATA_WRITER__BUFFER_MAX_ITEMS" = "1000"
        "DATA_WRITER__FILE_MAX_ITEMS" = "1000000"
        "LOAD__DELETE_COMPLETED_JOBS" = "true"
        "LOAD__TRUNCATE_STAGING_DATASET" = "true"
        "EXTRACT__WORKERS" = "4"
        "LOAD__WORKERS" = "4"

DAGs are split into a common file and a separate DAG per source; tasks are correctly split by resource in Airflow in both the serialize and parallel-isolated scenarios.

common.py: https://paste.xinu.at/YoUwAjoJntv6LtoH/
datahub.py: https://paste.xinu.at/Bu1sj2xCPi4s6RAT/

I'm using filesystem="staging"; could that be the cause of the issue?
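
If filesystem="staging" refers to the staging="filesystem" argument of dlt.pipeline, a rough sketch of the setup implied by the environment variables above (not the actual code from the pastes) would be:

import dlt

# DESTINATION__FILESYSTEM__BUCKET_URL and DESTINATION__STAGE_NAME above point dlt at
# gs://composer-prod/data/dlt and the public.composer external stage; dlt picks both
# up from the environment at runtime.
pipeline = dlt.pipeline(
    pipeline_name="datahub",
    dataset_name="datahub",
    destination="snowflake",
    staging="filesystem",  # write load files to the GCS bucket, then COPY via the external stage
)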
