Enhancements to make star alleles more explainable #37
Hi @davidyuyuan, Thanks for this information! It sounds like there's a lot of useful feedback in here, I'll try to respond to each component below and have some clarification questions on some of it.
We provide similar information in the debug folder outputs under a file called
We have kept this in a separate file since it's really only for power users who want to dig deep into HLA and identify novel alleles. Are your users interested in having easier access to that information?
This is not too surprising. It's difficult to communicate a complex system like CYP2D6 into a simplified format beyond the final star-allele calls. You mention that it's difficult to interpret the SVG, but if they only care about the final diplotype call, they shouldn't need that. Are you able to clarify what information they're looking for beyond the diplotype call itself? That may help me understand what the gap is. Just to note some of the things that are available currently:
Ah, so for the other genes at least, it sounds like you weren't able to find the variant annotation information. We are capturing this information in our output JSON file, e.g.:
Are you suggesting that providing similar variant-level information for CYP2D6 may be helpful? Matt
Hi Matt, Thank you for the prompt response and detailed explanations. Thank you for pointing out the The user requirement on CYP2D6 is the same as on HLA genes. It would be nice to have some kind of straightforward explanation of why pb-StarPhase concluded that the diplotype in the For other genes, the JSON file specified by In general, users would like to have some clear information in a format that they can read to feel comfortable with the star alleles in the
Are you able to elaborate on what the "easy" information is here? I look at the hifihla TSV and that's fairly detailed / complex information that I generally wouldn't expect a user to want to see (hence why we've put it separate).
There's possibly some things we could do here, it's just unclear to me which key pieces of information they're looking for. For example, they may want information on distinguishing *4 from *10 or maybe they're trying to understand a *4x2 (duplication) instead of *4. The former could possibly be resolved by just emphasizing the variants discovered in each consensus (I don't think we explicitly report that currently), the latter requires a deeper understanding of how the haplotyping works (I suspect most users don't want to go down that rabbit hole).
If I'm understanding this correctly, then there may be some confusion on the role of StarPhase for this part. The non-HLA, non-CYP2D6 genes all implicitly trust the provided VCF file (BAM is not relevant here). In other words, all the variants used to "decide" the star alleles are provided by the user already, StarPhase is not calling variants or deciding their correctness either. Thus, the "VCF before the output-calls" is exactly what a user provides. The output JSON reports all the identified variants which can be matched to the DB definitions.
I'll have to think about that part, the database TSV files are pretty overwhelming to look through by eye and a TSV from StarPhase would probably look pretty similar. I'm not sure there's a great solution here, but open to suggestions on what this might look like. Matt
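To make the "matched to the DB definitions" idea above concrete, here is a minimal sketch of matching a set of user-provided variants against star-allele definitions. This is not StarPhase's actual algorithm or data model: the allele names, the variant IDs (`varA`, `varB`), the `ALLELE_DEFINITIONS` table, and the `matching_alleles` helper are all hypothetical and exist only for illustration.

```python
# Hypothetical allele definitions: star allele -> set of defining variant IDs.
# These names and variants are invented; a real database is far larger.
ALLELE_DEFINITIONS = {
    "*1": frozenset(),                 # reference allele, no defining variants
    "*2": frozenset({"varA"}),
    "*3": frozenset({"varA", "varB"}),
}


def matching_alleles(observed, definitions):
    """Return alleles whose defining variants are all in `observed`.

    Results are ordered from most to fewest defining variants, so the
    most specific candidate comes first.
    """
    observed = set(observed)
    hits = [
        name
        for name, required in definitions.items()
        if required <= observed  # subset test: every defining variant was seen
    ]
    return sorted(hits, key=lambda name: len(definitions[name]), reverse=True)


# Variants taken as-is from the user's VCF for one haplotype (assumed input).
candidates = matching_alleles({"varA", "varB"}, ALLELE_DEFINITIONS)
```

The point of the sketch is only that the variants are taken on trust from the input; nothing here re-calls or re-validates them.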
Hi Matt, Here is a file that an earlier version of pb-StarPhase produced when CYP2D6 was analyzed, but that the latest version no longer produces. It seemed to have useful information to explain why *2/*17 was called. Can this level of detail be made available for CYP2D6 and other genes when BAMs are used as input? If TSV format is doable, it would be more readable than JSON for people.
I checked several *.deepvariant.phased.vcf and *.pbsv.vcf files produced by the targeted enrichment pipeline. I think that I now understand what you meant when you said "implicitly trust". Please ignore my previous comment related to annotated VCFs as supporting evidence for star allele assignments. By the way, I wasn't asking for a database TSV file. The database itself in JSON is perfectly fine. Overall, these are just some rough ideas or suggestions, not explicit demands or solutions. Thank you for looking into this.
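As a rough illustration of the TSV idea, the sketch below flattens a JSON report into one row per variant using only the Python standard library. The JSON layout shown (`gene_details`, `diplotype`, `variants`, `matched_allele`, and the gene name `GENE_X`) is an assumption made up for this example and may not match the real StarPhase output schema.

```python
import csv
import io
import json

# Hypothetical report layout; field names are assumptions, not the real schema.
EXAMPLE_JSON = """
{
  "gene_details": {
    "GENE_X": {
      "diplotype": "*1/*2",
      "variants": [
        {"id": "varA", "haplotype": 2, "matched_allele": "*2"}
      ]
    }
  }
}
"""

COLUMNS = ["gene", "diplotype", "variant_id", "haplotype", "matched_allele"]


def variants_to_tsv(report):
    """Flatten a report dict into TSV text, one row per reported variant."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for gene, details in sorted(report["gene_details"].items()):
        for variant in details.get("variants", []):
            writer.writerow([
                gene,
                details.get("diplotype", ""),
                variant.get("id", ""),
                variant.get("haplotype", ""),
                variant.get("matched_allele", ""),
            ])
    return out.getvalue()


tsv_text = variants_to_tsv(json.loads(EXAMPLE_JSON))
```

A table like this could be pasted into a PGx report next to the diplotype call, so that a reader can see which variants support each allele without opening the JSON.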
Ah, I see. That file is actually from pangu, which was a tool we had for a little bit but retired in favor of StarPhase. But extrapolating a bit, it sounds like the missing information here is that we are not reporting the CYP2D6 variants from the consensus sequences. We're clearly computing this information already to determine the star allele, but it's not part of the output. I'll have to think about how big of a lift that component is, but it's certainly doable.
Yes, thank you for the suggestions! I'm just asking lots of questions so I can understand whether there's a gap in the tool/outputs or a gap in our documentation. :)
HiFiHLA provides a hifihla_summary.tsv file that explains why a certain diplotype was called. I was unable to find similar information when I used pb-StarPhase to analyze HLA genes.
Similarly, I was unable to find an explanation when I used pb-StarPhase to analyze CYP2D6. I did find cyp2d6_link_graph.svg and included it in the final report of PGx recommendations. The feedback from my users was that it was a bit difficult to relate the SVG back to the star allele diplotypes. I just want to say that pb-StarPhase is doing a better job than Pangu, which provided no explanation. If there are ways to make the explanation more comprehensible to genetic counselors, it would be a nice enhancement.
As to the genes other than CYP2D6 and the HLA genes, I was unable to find explanations either. If pb-StarPhase were implemented in a way that normalizes, annotates, phases, and QCs VCFs before making star allele assignments, those VCFs would be a good, solid explanation of the diplotypes. If such final VCFs can be made available, I would be happy to surface them as part of the explanations in the final PGx report.