-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hail backend - SV WGS #3574
Hail backend - SV WGS #3574
Conversation
…institute/seqr into hail-backend-sv
…l-backend-gcnv
hail_search/hail_search_query.py
Outdated
'response_key': 'transcripts', | ||
'empty_array': True, | ||
'format_value': lambda value: value.rename({k: _to_camel_case(k) for k in value.keys()}), | ||
'format_values': lambda values: values.group_by(lambda t: t.geneId), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to name this function something other than format_values
. Having format_value
and format_values
both be present but have different behavior is a bit surprising.
'protein_consequence': lambda r: [hl.min(r.sorted_gene_consequences.map(lambda g: g.major_consequence_id))], | ||
'size': lambda r: [hl.if_else( | ||
r.start_locus.contig == r.end_locus.contig, r.start_locus.position - r.end_locus.position, -50, | ||
)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe:
SORTS = {
'protein_consequence': ...
...
**BaseHailTableQuery.SORTS
}
return [gene_ranks.get(r.selected_transcript.gene_id)] + super()._gene_rank_sort(r, gene_ranks) | ||
|
||
|
||
class SvHailTableQuery(BaseHailTableQuery): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could these subclasses be in separate files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They could be. I'll make a follow up PR to move everything, to keep the diff more manageable on this and the gcnv PR
hail_search/hail_search_query.py
Outdated
(ht.alleles == [ref, alt]) | ||
for chrom, pos, ref, alt in variant_ids | ||
] | ||
variant_id_q = variant_id_qs[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks like another hl.any
😜
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦
}, | ||
), | ||
# For insertions, end_locus represents the svSourceDetail, otherwise represents the endChrom | ||
'endChrom': lambda r: hl.or_missing(r.sv_type_id != insertion_type_id, get_end_chrom(r)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had pretty good luck with just defining a hail expression as a variable and re-using:
end_chrom = get_end_chrom(r)
...
'endChrom': lambda r: hl.or_missing(r.sv_type_id != insertion_type_id, end_chrom),
'svSourceDetail': lambda r: hl.or_missing(r.sv_type_id == insertion_type_id, hl.or_missing(hl.is_defined(end_chrom), hl.struct(chrom=end_chrom)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problem is r
is not defined above where you do end_chrom = get_end_chrom(r)
, its only available in the lambda
NESTED_GENOTYPE_FIELDS = {'concordance': ['new_call', 'prev_call', 'prev_num_alt']} | ||
GENOTYPE_QUERY_FIELDS = {'gq_sv': 'GQ', 'gq': None} | ||
|
||
TRANSCRIPTS_FIELD = 'sorted_gene_consequences' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We used to use sorted_transcript_consequences
for both SNPs and SVs. I'm unsure why we use gene
instead of transcript
for SVs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the v3 pipeline has a different data model
elif not self._load_table_kwargs.get('_intervals'): | ||
ht = self._prefilter_entries_table(ht, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SV doesn't implement interval filtering. Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the SNP entry tables are keyed by locus so we can filter them by interval at the time we read them in. The SV tables are keyed by variant ID and the entry tables do not have a locus so there is no way to filter there. Therefore, for SVs the interval filtering happens after joining with the annotation table: https://github.com/broadinstitute/seqr/pull/3574/files#diff-43e8b0dcfc307385b78fcb0683e55b20ddfb4c0f85c1794aa39b33e0a23322e8R1093
POPULATIONS = { | ||
'sv_callset': {'hemi': None, 'sort': 'callset_af'}, | ||
'gnomad_svs': {'id': 'ID', 'ac': None, 'an': None, 'hom': None, 'hemi': None, 'het': None, 'sort': 'gnomad'}, | ||
} | ||
POPULATION_FIELDS = {'sv_callset': 'gt_stats'} | ||
PREDICTION_FIELDS_CONFIG = { | ||
'strvctvre': PredictionPath('strvctvre', 'score'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't recognize that the SVs have so few populations and predictions.
} | ||
|
||
SORTS = { | ||
'protein_consequence': lambda r: [hl.min(r.sorted_gene_consequences.map(lambda g: g.major_consequence_id))], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use r[TRANSCRIPTS_FIELD]
for r.sorted_gene_consequences
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its not in scope here so it would have to be r[SvHailTableQuery. TRANSCRIPTS_FIELD]
which is less readable. The purpose of that varianble is to support methods in the base class, not so much to use it every possible location
hail_search/hail_search_query.py
Outdated
return super()._get_allowed_consequences_annotations(annotations, annotation_filters) | ||
|
||
def _get_consequence_filter(self, allowed_consequence_ids, annotation_exprs): | ||
return self._ht.sorted_gene_consequences.any( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
Getting this up early for review, still needs unit tests