feat(etl_run): CLIN-3728 Extract all sequencing IDs from enriched_clinical #478
📌 Summary
Retrieve all sequencing IDs that share the same analysis ID as the given sequencing IDs from the `enriched_clinical` Delta table directly in Airflow, using the `pandas` and `deltalake` libraries in a `PythonVirtualenvOperator`.
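As a rough illustration of the lookup described in the summary, here is a minimal sketch. It is not the code from this PR: the function names, the `table_uri` and `storage_options` parameters, and the `analysis_id` / `sequencing_id` column names are assumptions made for the example.

```python
from typing import List, Optional

import pandas as pd


def expand_sequencing_ids(df: pd.DataFrame, sequencing_ids: List[str]) -> List[str]:
    """Return every sequencing ID that shares an analysis ID with the given ones.

    Assumes `df` has `analysis_id` and `sequencing_id` columns (hypothetical names).
    """
    # Analysis IDs that the requested sequencing IDs belong to.
    matches = df["sequencing_id"].isin(sequencing_ids)
    analysis_ids = df.loc[matches, "analysis_id"].unique()
    # All sequencing IDs attached to those analyses, de-duplicated and sorted.
    related = df.loc[df["analysis_id"].isin(analysis_ids), "sequencing_id"]
    return sorted(related.unique().tolist())


def load_enriched_clinical(table_uri: str, storage_options: Optional[dict] = None) -> pd.DataFrame:
    """Read only the two needed columns of the Delta table into pandas."""
    # Imported lazily so the helper above works without deltalake installed.
    from deltalake import DeltaTable

    table = DeltaTable(table_uri, storage_options=storage_options)
    return table.to_pandas(columns=["analysis_id", "sequencing_id"])
```

When a callable like this runs inside a virtualenv-based Airflow operator, its imports generally need to live inside the callable body, since the function is executed in a separate interpreter.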
🛠️ Changes
- Added `get_all_sequencing_ids` task to `etl_run.py`.
- Added `datasets.py` file to create a common place to store our dataset locations.
- Added `deltalake===0.24.0` to `requirements.txt`.
🧪 Local Test Results
Here is an output example using QA data locally:
💡 Note: I removed this logging from the task but I could add it back if you think it could be useful.
🔗 Related Jira issues
https://ferlab-crsj.atlassian.net/browse/CLIN-3728