feat(etl_run): CLIN-3728 Extract all sequencing IDs from enriched_clinical #478

laurabegin · 2025-02-07T21:32:07Z

feat(etl_run): CLIN-3728 Extract all sequencing IDs from enriched_clinical

📌 Summary

Retrieve, directly in Airflow, all sequencing IDs that share an analysis ID with the given sequencing IDs. The lookup reads the enriched_clinical Delta table with the pandas and deltalake libraries inside a PythonVirtualenvOperator.
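For reviewers, here is a minimal sketch of the two-step lookup described above. It is not the code in etl_run.py. The column names come from the local test output below; the function names, the `table_uri` parameter and the toy IDs are hypothetical, and the real task may differ in how it reads and filters the table.

```python
import pandas as pd


def expand_sequencing_ids(df: pd.DataFrame, sequencing_ids: set[str]) -> set[str]:
    """Return every sequencing ID that shares an analysis ID with the given ones."""
    # Step 1: collect the analysis IDs attached to the requested sequencing IDs.
    requested = df["service_request_id"].isin(sequencing_ids)
    analysis_ids = set(df.loc[requested, "analysis_service_request_id"])
    # Step 2: collect all sequencing IDs attached to those analysis IDs.
    related = df["analysis_service_request_id"].isin(analysis_ids)
    return set(df.loc[related, "service_request_id"])


def load_id_columns(table_uri: str) -> pd.DataFrame:
    """Read only the two ID columns from a Delta table (hypothetical helper)."""
    # Imported lazily so the lookup above can be tried without deltalake installed.
    from deltalake import DeltaTable

    columns = ["service_request_id", "analysis_service_request_id"]
    return DeltaTable(table_uri).to_pandas(columns=columns).drop_duplicates()


# Toy data with made-up IDs, standing in for the enriched_clinical table.
toy = pd.DataFrame(
    {
        "service_request_id": ["S1", "S2", "S3", "S4", "S5", "S6"],
        "analysis_service_request_id": ["A1", "A1", "A1", "A2", "A2", "A3"],
    }
)

result = expand_sequencing_ids(toy, {"S1", "S5"})
print(sorted(result))  # ['S1', 'S2', 'S3', 'S4', 'S5']
```

In this toy run, S1 and S5 resolve to analyses A1 and A2, so the result also picks up S2, S3 and S4, while S6 (analysis A3) is left out.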

🛠️ Changes

✨ Added get_all_sequencing_ids task to etl_run.py.
✨ Added a datasets.py file as a common place to store our dataset locations.
✨ Added deltalake===0.24.0 to requirements.txt.

🧪 Local Test Results

Here is example output from a local run against QA data:

[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - distinct_sequencing_ids: {'1261530', '1261482'}
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - filtered_df:     service_request_id analysis_service_request_id
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 0              1261482                     1261485
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 1              1261496                     1261485
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 2              1261509                     1261485
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 3              1261519                     1261522
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 4              1261530                     1261522
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - ..                 ...                         ...
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 136            1262935                     1262907
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 137            1265128                     1265131
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 138            1265143                     1265146
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 139            1288063                     1288061
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 140            1516099                     1516098
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - [141 rows x 2 columns]
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - analysis_ids: {'1261522', '1261485'}
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - all_sequencing_ids: {'1261482', '1261509', '1261519', '1261540', '1261530', '1261496'}
[2025-02-07, 16:10:06 EST] {python.py:237} INFO - Done. Returned value was: ('1261519', '1261540', '1261482', '1261496', '1261509', '1261530')

💡 Note: I removed this logging from the task, but I can add it back if you think it would be useful.

🔗 Related Jira issues

https://ferlab-crsj.atlassian.net/browse/CLIN-3728

feat(etl_run): CLIN-3728 Extract all sequencing IDs from enriched_cli…

fd0f9f2

laurabegin requested review from jecos, creativeyann17 and LysianeBouchard February 7, 2025 21:32

jecos approved these changes Feb 7, 2025
