dlt version
1.4.1
Describe the problem
When using the parallel-isolated decomposition in Airflow with Snowflake as the destination and an external GCS stage, there seems to be a race condition somewhere: tasks sometimes fail when trying to load the state file because it has already been deleted (or was possibly never created in the first place).
snowflake.connector.errors.ProgrammingError: 091016 (22000): 01b9001e-0203-c615-0003-0f22001f45b2: Remote file 'gcs://composer-prod/data/dlt/datahub/_dlt_pipeline_state/1734084368.7914042.5748b20734.jsonl' was not found. If you are running a copy command, please make sure files are not deleted when they are being loaded or files are not being loaded into two different tables concurrently with auto purge option.
It does not happen when using the serialize decomposition.
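For context, a minimal sketch of the destination setup assumed here (bucket URL, stage name, and pipeline/dataset names are placeholders, not the actual configuration): Snowflake with filesystem staging on GCS plus a named external stage, which is what makes COPY read gcs:// files like the one in the error.

```python
import os

import dlt

# Placeholder values; normally these live in config.toml / secrets.toml,
# shown here as dlt-style environment variables.
os.environ["DESTINATION__FILESYSTEM__BUCKET_URL"] = "gs://composer-prod/data/dlt"
os.environ["DESTINATION__SNOWFLAKE__STAGE_NAME"] = "PUBLIC.GCS_DLT_STAGE"

pipeline = dlt.pipeline(
    pipeline_name="datahub",   # placeholder
    dataset_name="datahub",    # placeholder; the dataset name shows up in the staged file path
    destination="snowflake",
    staging="filesystem",      # load packages are written to GCS, then COPY'd from the external stage
)
```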
Expected behavior
State file should not be deleted before it is uploaded to Snowflake.
Steps to reproduce
Set up Snowflake as the destination with an external stage and try to load a few resources with the parallel-isolated decomposition strategy.
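A minimal sketch of a DAG that reproduces the setup, assuming the documented dlt Airflow helper is used; the source, table names, and DAG/pipeline names are placeholders, and credentials are expected to come from the Composer environment.

```python
import pendulum
import dlt
from airflow.decorators import dag
from dlt.helpers.airflow_helper import PipelineTasksGroup
from dlt.sources.sql_database import sql_database


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def sql_to_snowflake():
    # Local working data is wiped after each task; no Composer data folder is used.
    tasks = PipelineTasksGroup(
        "sql_to_snowflake", use_data_folder=False, wipe_local_data=True
    )

    pipeline = dlt.pipeline(
        pipeline_name="sql_to_snowflake",  # placeholder
        dataset_name="datahub",            # placeholder
        destination="snowflake",
        staging="filesystem",              # external GCS stage, as sketched above
    )
    source = sql_database(table_names=["table_a", "table_b"])  # placeholder tables

    # "parallel-isolated" runs each resource as a separate task in its own pipeline;
    # with "serialize" the resources run sequentially and the failure does not occur.
    tasks.add_run(
        pipeline,
        source,
        decompose="parallel-isolated",
        trigger_rule="all_done",
        retries=0,
    )


sql_to_snowflake()
```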
Operating system
Linux
Runtime environment
Google Cloud Composer
Python version
3.11
dlt data source
SQLAlchemy (Snowflake)
dlt destination
Snowflake
Other deployment details
No response
Additional information
No response
@julesmga The path to the state file looks weird to me. When using parallel-isolated, each resource gets its own task that is executed under a separate pipeline name. The working folder should also be random. It does not look like that in your case:
snowflake.connector.errors.ProgrammingError: 091016 (22000): 01b9001e-0203-c615-0003-0f22001f45b2: Remote file 'gcs://composer-prod/data/dlt/datahub/_dlt_pipeline_state/1734084368.7914042.5748b20734.jsonl' was not found. If you are running a copy command, please make sure files are not deleted when they are being loaded or files are not being loaded into two different tables concurrently with auto purge option.
Could you paste your DAG? Are you perhaps forcing the working folder with an environment variable? Do you see a correct DAG in Airflow (each resource in a separate task)?
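To illustrate the question (hypothetical, not taken from the issue): a forced working folder, whether set through pipelines_dir or the DLT_DATA_DIR environment variable, would be shared by all concurrently running tasks instead of each task getting its own temporary folder.

```python
import dlt

# Hypothetical example of a forced, shared working folder (the kind of setup
# being asked about, not a recommendation).
pipeline = dlt.pipeline(
    pipeline_name="sql_to_snowflake",            # placeholder
    destination="snowflake",
    pipelines_dir="/home/airflow/gcs/data/dlt",  # hypothetical fixed path shared by every task
)

# Setting DLT_DATA_DIR in the Composer environment has a similar effect: it
# overrides dlt's default working directory for every task.
```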
DAGs are split into a common file plus a separate DAG per source; tasks are correctly split by resource in Airflow in both the serialize and parallel-isolated scenarios.