-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update individuals with sample qc #4663
base: dev
Are you sure you want to change the base?
Conversation
i.sample.sample_id: i for i in Individual.objects.filter( | ||
family__project__in=projects, | ||
sample__sample_id__in=sample_qc_map.keys(), | ||
).select_related('sample') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
select_related
is generally an inefficient way to query things (although there are a lot of these floating around in old code still) so its generally a sign that what we are querying from the database is not efficiently written.
Also, while its not obvious in this part of the code, for hail data the Sample sample_id
is always identical to the corresponding Individual's individual_id
Also, we are currently not using it but match_and_update_search_samples
returns updated_samples
which will be a queryset of all the sample models that were included in this round of loading. We should be able to safely assume that the sample QC data will only be included for those samples, so this can be rewritten as follows
sample_id_individual_map = {
i.individual_id: i for i in Individual.objects.filter(sample__in=updated_samples)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was considering using updated_samples
! glad to hear I was on the right track, thanks for the suggestion.
unknown_pop_filter_flags = set() | ||
|
||
for sample_id, individual in sample_individual_map.items(): | ||
record = sample_qc_map[sample_id] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be more stable if you iterate through sample_qc_map
and get the corresponding individual model from sample_id_individual_map
, rather than doing it in this order
for sample_id, individual in sample_individual_map.items(): | ||
record = sample_qc_map[sample_id] | ||
filter_flags = {} | ||
for flag in json.loads(record['filter_flags']): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming this is essentially copy-paste from existing qc code- let me know if you made any substantive changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no substantive changes here !
updated_individuals.append(individual) | ||
|
||
if updated_individuals: | ||
Individual.objects.bulk_update( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its better to use the Individual.bulk_update_models
helper as it includes audit logging by default
if unknown_filter_flags: | ||
message = 'The following filter flags have no known corresponding value and were not saved: {}'.format( | ||
', '.join(unknown_filter_flags)) | ||
logger.warning(message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warnings are useful when something is triggered in the UI and the person who does so will be shown the warnings, but they are essentially useless in a job that runs automatically in the background.
I think we should hard fail updating QC if there are unexpected values, as we now have more control over how these are generated and can choose ot to include things we don't care about in our metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree here, this was a copy paste without critical thinking. There's actually no way we'd have erroneous filter flags, and for the hail qc_metrics_filters, which are set by hail's sample qc, we can trust that the output has the same flags every time, so I'm leaning towards just taking the flag validation out of the new seqr sample qc flow.
@@ -242,6 +249,10 @@ def _load_new_samples(cls, metadata_path, genome_version, dataset_type, run_vers | |||
except Exception as e: | |||
logger.error(f'Error reporting loading failure for {run_version}: {e}') | |||
|
|||
# Update sample qc | |||
if 'sample_qc' in metadata: | |||
cls.update_individuals_sample_qc(sample_type, updated_samples, metadata['sample_qc']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should wrap this in a try/except block that logs an error because if something goes wrong somehow we would still want the rest of the job to proceed
much of the logic is based on this function, but I tried to make fewer db queries