Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancements to make star alleles more explainable #37

Open
davidyuyuan opened this issue Jan 28, 2025 · 5 comments
Open

Enhancements to make star alleles more explainable #37

davidyuyuan opened this issue Jan 28, 2025 · 5 comments
Labels
enhancement New feature or request

Comments

@davidyuyuan
Copy link

The HiFiHLA provides a hifihla_summary.tsv file to provide an explaination why centain diplotype was called. I was unable to find similar information when I use pb-Starphase to analyze HLA genes.

queryId	qLen	nMatches	gType	gPctId	gPctCov	gEdit	cdnaType	exCovered	exEdit	coverage	errRate	Type	
HLA-A_0	3502	1	HLA-A*03:01:01:01	100.0	100.0	0	HLA-A*03:01:01	1,2,3,4,5,6,7,8	0	50	0.00150	HLA-A*03:01:01:01	

Similarly, I was unable to find an explaination I use pb-Starphase to analyze CYP2D6. I did find cyp2d6_link_graph.svg and included it in the final report of PGx recommendations. The feedback from my users was that it was a bit difficult to relate the SVG back to the star allele diplotypes. I just want to say that pb-Starphase is doing a better job than Pangu, which provided no explaination. If there are ways to make the explaination more comprehensible to genetic counselors, it would be a nice enhancement.

As to the genes other than CYP2D6 and HLA genes, I was unable to find explainations, either. If pb-Starphase was implemented in a way to normalize, annotate, phase and QC VCFs before making star allele assignments, it would be a good solid explaination of the diplotypes. If such final VCFs can be made available, I would be happy to surface them as part of the explainations in the final PGx report.

@holtjma
Copy link
Collaborator

holtjma commented Jan 28, 2025

Hi @davidyuyuan,

Thanks for this information! It sounds like there's a lot of useful feedback in here, I'll try to respond to each component below and have some clarification questions on some of it.

The HiFiHLA provides a hifihla_summary.tsv file to provide an explaination why centain diplotype was called. I was unable to find similar information when I use pb-Starphase to analyze HLA genes.

We provide similar information in the debug folder outputs under a file called hla_debug.json. Inside that file, you can see a list of the comparison of each consensus sequence to the database. For example, this is one that I've truncated and annotated slightly:

{
  "read_mapping_stats": {
    "HLA-A": {
      "consensus1": {
        "best_match_id": "HLA:HLA00089", // database ID for the best match
        "best_match_star": "30:01:01:01", // more user friendly *-allele for best match
        "mapping_stats": {
          ...
          "30:01:01:01": { // corresponding entry of the best match
            "cdna_mapping": {
              "query_len": 1098, // length of database sequence
              "target_len": 1535, // length of generated cDNA sequence with flanking DNA (for alignment)
              "match_len": 1098,
              "nm": 0, // number of mismatches
              "query_unmapped": 0, // number of unaligned bases from DB, these are counted as errors also in the comparison
              "target_unmapped": 437,
              "cigar": "1098=", // cigar of the cDNA is an exact match
              "md": "1098"
            },
            "dna_mapping": {
              "query_len": 3503,
              "target_len": 3818,
              "match_len": 3503,
              "nm": 0,
              "query_unmapped": 0,
              "target_unmapped": 315,
              "cigar": "3503=", // cigar of the DNA is also an exact match; this allele is exactly defined in the database
              "md": "3503"
            }
          },
...

We have kept this is a separate file since it's really only for power users who want to dig deep into HLA and identify novel alleles. Are your users interested in having easier access to that information?

Similarly, I was unable to find an explaination I use pb-Starphase to analyze CYP2D6. I did find cyp2d6_link_graph.svg and included it in the final report of PGx recommendations. The feedback from my users was that it was a bit difficult to relate the SVG back to the star allele diplotypes. I just want to say that pb-Starphase is doing a better job than Pangu, which provided no explaination. If there are ways to make the explaination more comprehensible to genetic counselors, it would be a nice enhancement.

This is not too surprising. It's difficult to communicate a complex system like CYP2D6 into a simplified format beyond the final star-allele calls. You mention that it's difficult to interpret the SVG, but if they only care about the final diplotype call, they shouldn't need that. Are you able to clarify what information they're looking for beyond the diplotype call itself? That may help me understand what the gap is. Just to note some of the things that are available currently:

  1. We have the full consensus sequences for each CYP2D6 component in a FASTA file
  2. We have a debug BAM that has the read sequences realigned to the final diplotype sequences, which can be helpful to understand when copy number changes or hybrids are present.

As to the genes other than CYP2D6 and HLA genes, I was unable to find explainations, either. If pb-Starphase was implemented in a way to normalize, annotate, phase and QC VCFs before making star allele assignments, it would be a good solid explaination of the diplotypes. If such final VCFs can be made available, I would be happy to surface them as part of the explainations in the final PGx report.

Ah, so for the other genes at least, it sounds like you weren't able to find the variant annotation information. We are capturing this information in our output JSON file, e.g.:

...
"CYP2B6": {
      "diplotypes": [
        {
          "hap1": "*5",
          "hap2": "*6",
          "diplotype": "*5/*6" // the diplotype
        }
      ],
      "simple_diplotypes": null,
      "variant_details": [ // each non-reference variant shows up in this list
        {
          "cpic_variant_id": 1102587, // various identifiers
          "cpic_name": "25505C>A; 25505C>T",
          "dbsnp": "rs3211371",
          "normalized_variant": { // normalized variant, i.e. left-shifted
            "chrom": "chr19",
            "position": 41016809,
            "reference": "C",
            "alternate": "T"
          },
          "normalized_genotype": { // normalized genotype from the VCF, including phase information where relevant
            "genotype": "1|0",
            "phase_set": 40790485
          }
        },
        ...
        }
      ],
      "mapping_details": null,
      "multi_mapping_details": null
    },
...

Are you suggesting that providing similar variant-level information for CYP2D6 may be helpful?

Matt

@davidyuyuan
Copy link
Author

Hi Matt,

Thank you for the prompt response and detailed explainations.

Thank you for pointing out the hla_debug.json for HLA genes. I agree with you that this file needs to stay separate to provide comprehansive information for power users, for example, to identify novel alleles. My users are more interested in easy access to information to support the conclusion in the pharmcat.tsv. As I mentioned in my first comment, Something like hifihla_summary.tsv by HiFiHLA is what they are looking for.

The user requirement on CYP2D6 is the same as on HLA genes. It would be nice to have some kind of straight forward explaination why pb-Startphase concluded that the diplotype in the pharmcat.tsv was this instead of that. If it is hard or impossible to achieve this on this crazy CYP2D6 gene, the link graph in SVG provides useful information for users to feel comfortable with the result in the pharmcat.tsv.

For other genes, the json file specified by --output-calls does not seem having information to support why the SNPs were called out of a BAM or a VCF. I was hoping to have SNP rs... to explain star alleles in the pharmcat.tsv. Perhaps, some information to support why these SNP rs... were correct is also needed. Something as simple as a VCF before the output-calls json was generated would be good enough.

The general, users would like to have some clear inforation in the format that they can read to feel comfortable with the star alleles in the pharmcat.tsv. Such additional information would be very useful when they were asked why the star alleles in the pharmcat.tsv files were correct. Json format is a bit overwhelming. They can read TSV files as spreadsheets. If scripts have to be developed for better integration, they would be less sensitive to the potential changes in TSV files than in Json files in the future.

@holtjma
Copy link
Collaborator

holtjma commented Jan 28, 2025

Thank you for pointing out the hla_debug.json for HLA genes. I agree with you that this file needs to stay separate to provide comprehansive information for power users, for example, to identify novel alleles. My users are more interested in easy access to information to support the conclusion in the pharmcat.tsv. As I mentioned in my first comment, Something like hifihla_summary.tsv by HiFiHLA is what they are looking for.

Are you able to elaborate on what the "easy" information is here? I look at the hifihla TSV and that's fairly detailed / complex information that I generally wouldn't expect a user to want to see (hence why we've put it separate).

The user requirement on CYP2D6 is the same as on HLA genes. It would be nice to have some kind of straight forward explaination why pb-Startphase concluded that the diplotype in the pharmcat.tsv was this instead of that. If it is hard or impossible to achieve this on this crazy CYP2D6 gene, the link graph in SVG provides useful information for users to feel comfortable with the result in the pharmcat.tsv.

There's possibly some things we could do here, it's just unclear to me which key pieces of information they're looking for. For example, they may want information on distinguishing *4 from *10 or maybe they're trying to understand a *4x2 (duplication) instead of *4. The former could possibly be resolved by just emphasizing the variants discovered in each consensus (I don't think we explicitly report that currently), the latter requires a deeper understanding of how the haplotyping works (I suspect most users don't want to go down that rabbit hole).

For other genes, the json file specified by --output-calls does not seem having information to support why the SNPs were called out of a BAM or a VCF. I was hoping to have SNP rs... to explain star alleles in the pharmcat.tsv. Perhaps, some information to support why these SNP rs... were correct is also needed. Something as simple as a VCF before the output-calls json was generated would be good enough.

If I'm understanding this correctly, then there may be some confusion on the role of StarPhase for this part. The non-HLA, non-CYP2D6 genes all implicitly trust the provided VCF file (BAM is not relevant here). In other words, all the variants used to "decide" the star alleles are provided by the user already, StarPhase is not calling variants or deciding their correctness either. Thus, the "VCF before the output-calls" is exactly what a user provides. The output JSON reports all the identified variants which can be matched to the DB definitions.

Json format is a bit overwhelming. They can read TSV files as spreadsheets.

I'll have to think about that part, the database TSV files are pretty overwhelming to look through by eye and a TSV from StarPhase would probably look pretty similar. I'm not sure there's a great solution here, but open to suggestions on what this might look like.

Matt

@holtjma holtjma added the enhancement New feature or request label Jan 28, 2025
@davidyuyuan
Copy link
Author

Hi Matt,
I had a call with Dr. Fu to make sure that I understood her requirements better earlier today. We agree with you that the hifihla_summary.tsv has too much information.

Here is a file in an earlier version of pb-Starphase but not in the latest one any more when CYP2D6 was analyzed. It seemed having useful information to explain why *2/*17 was called. Can this level of details be made available for CYP2D6 and other genes when BAMs are used as input? If TSV format is doable, it would be more readable than Json for people.

[
    {
        "input": "/home/jovyan/work/pangu/10-0033.deepvariant.haplotagged.bam",
        "diplotype": "CYP2D6 *2/*17",
        "copynumber": 2,
        "haplotypes": [
            {
                "call": "*2",
                "alleles": [
                    {
                        "call": "*2",
                        "num_reads": 732,
                        "meanCover": 379.49,
                        "maxCover": 732,
                        "minCover": 42,
                        "rsIDs": [
                            "rs16947",
                            "rs1135840"
                        ]
                    }
                ]
            },
            {
                "call": "*17",
                "alleles": [
                    {
                        "call": "*17",
                        "num_reads": 2527,
                        "meanCover": 1189.81,
                        "maxCover": 1421,
                        "minCover": 560,
                        "rsIDs": [
                            "rs1135840",
                            "rs16947",
                            "rs28371706"
                        ]
                    }
                ]
            }
        ],
        "warnings": [
            "Unlabeled Clipped Reads\nm84187_241017_185034_s2/222430697/ccs\nm84187_241017_185034_s2/231278485/ccs\nm84187_241017_185034_s2/85792465/ccs\nm84187_241017_185034_s2/42731195/ccs\nm84187_241017_185034_s2/204870507/ccs\nm84187_241017_185034_s2/146539430/ccs\nm84187_241017_185034_s2/126223411/ccs\nm84187_241017_185034_s2/205458969/ccs\nm84187_241017_185034_s2/214502466/ccs\nm84187_241017_185034_s2/36374329/ccs\nm84187_241017_185034_s2/56492814/ccs\nm84187_241017_185034_s2/118491725/ccs\nm84187_241017_185034_s2/219415087/ccs\nm84187_241017_185034_s2/111215665/ccs\nm84187_241017_185034_s2/131270198/ccs\nm84187_241017_185034_s2/194904571/ccs\nm84187_241017_185034_s2/134415972/ccs\nm84187_241017_185034_s2/26611948/ccs\nm84187_241017_185034_s2/228591314/ccs\nm84187_241017_185034_s2/2954762/ccs\nm84187_241017_185034_s2/151262680/ccs\nm84187_241017_185034_s2/140908264/ccs\nm84187_241017_185034_s2/106894898/ccs\nm84187_241017_185034_s2/107483596/ccs\nm84187_241017_185034_s2/98962640/ccs\nm84187_241017_185034_s2/216274086/ccs\nm84187_241017_185034_s2/67895566/ccs\nm84187_241017_185034_s2/133171705/ccs\nm84187_241017_185034_s2/122552559/ccs\nm84187_241017_185034_s2/234227685/ccs\nm84187_241017_185034_s2/254544472/ccs\nm84187_241017_185034_s2/65606643/ccs\nm84187_241017_185034_s2/33165700/ccs\nm84187_241017_185034_s2/254084987/ccs\nm84187_241017_185034_s2/100797692/ccs\nm84187_241017_185034_s2/163714873/ccs"
        ]
    }
]

I checked several *.deepvariant.phased.vcf and *.pbsv.vcf files by the targeted enrichment pipeline. I think that I now understand what you menat when you said "implicitly trust". Please ignore my previous comment related to annotated VCFs as supporting evidences for star allele assignments.

By the way, I wasn't asking for a database TSV file. The database itself in Json is perfectly fine.

Overall, these are just some rough ideas or suggestions, not explicit demands or solutions. Thank you for looking into this.

@holtjma
Copy link
Collaborator

holtjma commented Jan 29, 2025

Ah, I see. That file is actually from pangu, which was a tool we had for a little bit but retired in favor of StarPhase. But extrapolating a bit, it sounds like the missing information here is that we are not reporting the CYP2D6 variants from the consensus sequences. We're clearly computing this information already to determine the star allele, but it's not part of output. I'll have to think about how big of a lift that component is, but its certainly doable.

Overall, these are just some rough ideas or suggestions, not explicit demands or solutions. Thank you for looking into this.

Yes, thank you for the suggestions! I'm just asking lots of questions so I can understand whether there's a gap in the tool/outputs or a gap in our documentation. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants