Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hail backend - SV WGS #3574

Merged
merged 31 commits into from
Sep 5, 2023
Merged

Hail backend - SV WGS #3574

merged 31 commits into from
Sep 5, 2023

Conversation

hanars
Copy link
Collaborator

@hanars hanars commented Aug 22, 2023

Getting this up early for review, still needs unit tests

'response_key': 'transcripts',
'empty_array': True,
'format_value': lambda value: value.rename({k: _to_camel_case(k) for k in value.keys()}),
'format_values': lambda values: values.group_by(lambda t: t.geneId),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to name this function something other than format_values. Having format_value and format_values both be present but have different behavior is a bit surprising.

'protein_consequence': lambda r: [hl.min(r.sorted_gene_consequences.map(lambda g: g.major_consequence_id))],
'size': lambda r: [hl.if_else(
r.start_locus.contig == r.end_locus.contig, r.start_locus.position - r.end_locus.position, -50,
)],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe:

SORTS = {
   'protein_consequence': ...
   ...
   **BaseHailTableQuery.SORTS
}

return [gene_ranks.get(r.selected_transcript.gene_id)] + super()._gene_rank_sort(r, gene_ranks)


class SvHailTableQuery(BaseHailTableQuery):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could these subclasses be in separate files?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They could be. I'll make a follow up PR to move everything, to keep the diff more manageable on this and the gcnv PR

(ht.alleles == [ref, alt])
for chrom, pos, ref, alt in variant_ids
]
variant_id_q = variant_id_qs[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like another hl.any 😜

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦

},
),
# For insertions, end_locus represents the svSourceDetail, otherwise represents the endChrom
'endChrom': lambda r: hl.or_missing(r.sv_type_id != insertion_type_id, get_end_chrom(r)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had pretty good luck with just defining a hail expression as a variable and re-using:

end_chrom = get_end_chrom(r)
...
'endChrom': lambda r: hl.or_missing(r.sv_type_id != insertion_type_id, end_chrom),
'svSourceDetail': lambda r: hl.or_missing(r.sv_type_id == insertion_type_id, hl.or_missing(hl.is_defined(end_chrom), hl.struct(chrom=end_chrom)))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem is r is not defined above where you do end_chrom = get_end_chrom(r), its only available in the lambda

NESTED_GENOTYPE_FIELDS = {'concordance': ['new_call', 'prev_call', 'prev_num_alt']}
GENOTYPE_QUERY_FIELDS = {'gq_sv': 'GQ', 'gq': None}

TRANSCRIPTS_FIELD = 'sorted_gene_consequences'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to use sorted_transcript_consequences for both SNPs and SVs. I'm unsure why we use gene instead of transcript for SVs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the v3 pipeline has a different data model

Comment on lines +310 to 311
elif not self._load_table_kwargs.get('_intervals'):
ht = self._prefilter_entries_table(ht, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SV doesn't implement interval filtering. Why?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the SNP entry tables are keyed by locus so we can filter them by interval at the time we read them in. The SV tables are keyed by variant ID and the entry tables do not have a locus so there is no way to filter there. Therefore, for SVs the interval filtering happens after joining with the annotation table: https://github.com/broadinstitute/seqr/pull/3574/files#diff-43e8b0dcfc307385b78fcb0683e55b20ddfb4c0f85c1794aa39b33e0a23322e8R1093

Comment on lines +1067 to +1073
POPULATIONS = {
'sv_callset': {'hemi': None, 'sort': 'callset_af'},
'gnomad_svs': {'id': 'ID', 'ac': None, 'an': None, 'hom': None, 'hemi': None, 'het': None, 'sort': 'gnomad'},
}
POPULATION_FIELDS = {'sv_callset': 'gt_stats'}
PREDICTION_FIELDS_CONFIG = {
'strvctvre': PredictionPath('strvctvre', 'score'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't recognize that the SVs have so few populations and predictions.

}

SORTS = {
'protein_consequence': lambda r: [hl.min(r.sorted_gene_consequences.map(lambda g: g.major_consequence_id))],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use r[TRANSCRIPTS_FIELD] for r.sorted_gene_consequences?

Copy link
Collaborator Author

@hanars hanars Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not in scope here so it would have to be r[SvHailTableQuery. TRANSCRIPTS_FIELD] which is less readable. The purpose of that varianble is to support methods in the base class, not so much to use it every possible location

return super()._get_allowed_consequences_annotations(annotations, annotation_filters)

def _get_consequence_filter(self, allowed_consequence_ids, annotation_exprs):
return self._ht.sorted_gene_consequences.any(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@hanars hanars merged commit cef2cbf into dev Sep 5, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants