Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update individuals with sample qc #4663

Open
wants to merge 9 commits into
base: dev
Choose a base branch
from
Open

update individuals with sample qc #4663

wants to merge 9 commits into from

Conversation

jklugherz
Copy link
Contributor

@jklugherz jklugherz commented Feb 20, 2025

much of the logic is based on this function, but I tried to make fewer db queries

@jklugherz jklugherz requested a review from hanars February 20, 2025 18:12
i.sample.sample_id: i for i in Individual.objects.filter(
family__project__in=projects,
sample__sample_id__in=sample_qc_map.keys(),
).select_related('sample')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

select_related is generally an inefficient way to query things (although there are a lot of these floating around in old code still) so its generally a sign that what we are querying from the database is not efficiently written.
Also, while its not obvious in this part of the code, for hail data the Sample sample_id is always identical to the corresponding Individual's individual_id
Also, we are currently not using it but match_and_update_search_samples returns updated_samples which will be a queryset of all the sample models that were included in this round of loading. We should be able to safely assume that the sample QC data will only be included for those samples, so this can be rewritten as follows

sample_id_individual_map = {
    i.individual_id: i for i in Individual.objects.filter(sample__in=updated_samples)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was considering using updated_samples! glad to hear I was on the right track, thanks for the suggestion.

unknown_pop_filter_flags = set()

for sample_id, individual in sample_individual_map.items():
record = sample_qc_map[sample_id]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be more stable if you iterate through sample_qc_map and get the corresponding individual model from sample_id_individual_map, rather than doing it in this order

for sample_id, individual in sample_individual_map.items():
record = sample_qc_map[sample_id]
filter_flags = {}
for flag in json.loads(record['filter_flags']):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming this is essentially copy-paste from existing qc code- let me know if you made any substantive changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no substantive changes here !

updated_individuals.append(individual)

if updated_individuals:
Individual.objects.bulk_update(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its better to use the Individual.bulk_update_models helper as it includes audit logging by default

if unknown_filter_flags:
message = 'The following filter flags have no known corresponding value and were not saved: {}'.format(
', '.join(unknown_filter_flags))
logger.warning(message)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warnings are useful when something is triggered in the UI and the person who does so will be shown the warnings, but they are essentially useless in a job that runs automatically in the background.
I think we should hard fail updating QC if there are unexpected values, as we now have more control over how these are generated and can choose ot to include things we don't care about in our metadata

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree here, this was a copy paste without critical thinking. There's actually no way we'd have erroneous filter flags, and for the hail qc_metrics_filters, which are set by hail's sample qc, we can trust that the output has the same flags every time, so I'm leaning towards just taking the flag validation out of the new seqr sample qc flow.

@jklugherz jklugherz changed the title first pass update individuals with sample qc update individuals with sample qc Feb 21, 2025
@jklugherz jklugherz marked this pull request as ready for review February 21, 2025 19:37
@jklugherz jklugherz requested a review from hanars February 21, 2025 19:37
@@ -242,6 +249,10 @@ def _load_new_samples(cls, metadata_path, genome_version, dataset_type, run_vers
except Exception as e:
logger.error(f'Error reporting loading failure for {run_version}: {e}')

# Update sample qc
if 'sample_qc' in metadata:
cls.update_individuals_sample_qc(sample_type, updated_samples, metadata['sample_qc'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should wrap this in a try/except block that logs an error because if something goes wrong somehow we would still want the rest of the job to proceed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants