feat(etl_run): CLIN-3728 Extract all sequencing IDs from enriched_clinical #478

Open: 1 commit to be merged into master

laurabegin
Member

📌 Summary

Retrieve every sequencing ID that shares an analysis ID with the given sequencing IDs. The lookup reads the enriched_clinical Delta table directly in Airflow, using the pandas and deltalake libraries inside a PythonVirtualenvOperator.
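
The two-step lookup described in this summary can be sketched with pandas alone. This is a minimal illustration, not the code in this PR: the function name `expand_sequencing_ids` is hypothetical, the column names are taken from the log output in the local test results, and the sample rows are the ones visible in that log.

```python
import pandas as pd


def expand_sequencing_ids(df: pd.DataFrame, sequencing_ids: set[str]) -> set[str]:
    """Return every sequencing ID sharing an analysis ID with `sequencing_ids`.

    Hypothetical helper; column names are assumed from the PR's log output.
    """
    # Step 1: analysis IDs attached to the sequencing IDs we were given.
    analysis_ids = set(
        df.loc[df["service_request_id"].isin(sequencing_ids), "analysis_service_request_id"]
    )
    # Step 2: every sequencing ID attached to one of those analysis IDs.
    return set(
        df.loc[df["analysis_service_request_id"].isin(analysis_ids), "service_request_id"]
    )


# Rows copied from the visible part of the local test log (IDs kept as strings).
sample = pd.DataFrame(
    {
        "service_request_id": ["1261482", "1261496", "1261509", "1261519", "1261530", "1262935"],
        "analysis_service_request_id": ["1261485", "1261485", "1261485", "1261522", "1261522", "1262907"],
    }
)

result = expand_sequencing_ids(sample, {"1261530", "1261482"})
print(sorted(result))
```

With these rows, the unrelated ID 1262935 is left out because its analysis ID (1262907) does not match either input.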

🛠️ Changes

  • ✨ Added get_all_sequencing_ids task to etl_run.py.
  • ✨ Added datasets.py as a common place to store our dataset locations.
  • ✨ Added deltalake===0.24.0 to requirements.txt.
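
For reviewers unfamiliar with virtualenv tasks, here is a rough sketch of what a callable of this kind could look like. It is not the task added in etl_run.py: the signature, the `table_uri` parameter, and the sorted-list return value are assumptions made for illustration, and only the column names come from the log output in the local test results.

```python
def get_all_sequencing_ids(table_uri: str, sequencing_ids: list) -> list:
    """Hypothetical callable meant to run inside a PythonVirtualenvOperator."""
    # Imports stay inside the function: the callable executes in a separate
    # virtualenv, so it cannot rely on modules imported at DAG-file level.
    from deltalake import DeltaTable

    id_col = "service_request_id"
    analysis_col = "analysis_service_request_id"

    # Read only the two columns needed for the lookup (pandas must also be
    # installed in the virtualenv for to_pandas to work).
    df = DeltaTable(table_uri).to_pandas(columns=[id_col, analysis_col])

    # Analysis IDs of the given sequencing IDs, then all IDs sharing them.
    analysis_ids = df.loc[df[id_col].isin(sequencing_ids), analysis_col].unique()
    all_ids = df.loc[df[analysis_col].isin(analysis_ids), id_col].unique()
    return sorted(all_ids.tolist())
```

A function like this would typically be passed as `python_callable` to the operator, with `deltalake` and `pandas` listed in its `requirements` argument and the inputs supplied through `op_kwargs`. The table location would presumably come from datasets.py, whose contents are not shown here.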

🧪 Local Test Results

Example output from a local run against QA data:

[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - distinct_sequencing_ids: {'1261530', '1261482'}
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - filtered_df:     service_request_id analysis_service_request_id
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 0              1261482                     1261485
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 1              1261496                     1261485
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 2              1261509                     1261485
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 3              1261519                     1261522
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 4              1261530                     1261522
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - ..                 ...                         ...
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 136            1262935                     1262907
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 137            1265128                     1265131
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 138            1265143                     1265146
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 139            1288063                     1288061
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 140            1516099                     1516098
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - 
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - [141 rows x 2 columns]
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - analysis_ids: {'1261522', '1261485'}
[2025-02-07, 16:10:06 EST] {process_utils.py:191} INFO - all_sequencing_ids: {'1261482', '1261509', '1261519', '1261540', '1261530', '1261496'}
[2025-02-07, 16:10:06 EST] {python.py:237} INFO - Done. Returned value was: ('1261519', '1261540', '1261482', '1261496', '1261509', '1261530')

💡 Note: I removed this logging from the task, but I can add it back if you think it would be useful.

🔗 Related Jira issues

https://ferlab-crsj.atlassian.net/browse/CLIN-3728
