I run predictions for the HLA-A, HLA-B, and HLA-C genes using genotypes of 7,103 individuals derived from three different array platforms: GSA v3, Oncoarray, and Omni2.5Exome. These genotypes include a 500 kb flanking region and are mapped to the hg19 assembly. I used the corresponding pre-fit classifier (or the closest available choice, v2, in the case of GSA). To add an extra level of certainty to the resulting HLA predictions, I run every genotype through every other platform's pre-fit classifier to check whether the best results come from the matching platform. I also used the four race-specific pre-fit classifiers mentioned in the original paper. Below is a representative example of the R code I used for each combination of platform and gene:

```r
library(HIBAG)

# Choose the pre-fit classifier we will use
model_name <- "European-HLA4-hg19.RData"

# Load the model list from the specified file
mlst <- get(load(model_name))

# Load HLA-A genotyping data
geno <- hlaBED2Geno("hla_A.bed", "hla_A.fam", "hla_A.bim")

# Load the HLA-A pre-fit classifier into memory
model <- hlaModelFromObj(mlst$A)

# Run the prediction
hla_a <- hlaPredict(model, geno, cl=8)  # use 8 threads for parallel computation
```

Here I present a table showing the mean of the highest probability scores for every pre-fit classifier (grouped into four categories, "4C") across every analysis platform (array_platform):
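The cross-platform sweep described above can be sketched roughly as follows. This is only a sketch: the classifier file names are placeholders for whatever pre-fit models you downloaded, and `prob`/`matching` are the columns HIBAG returns in the prediction's `value` data frame:

```r
library(HIBAG)

# Placeholder classifier file names -- substitute the models from your own setup
model_files <- c(
    GSA       = "GSA-HLA4-hg19.RData",
    Oncoarray = "Oncoarray-HLA4-hg19.RData",
    Omni      = "Omni25Exome-HLA4-hg19.RData")

# Genotypes for one platform, loaded once
geno <- hlaBED2Geno("hla_A.bed", "hla_A.fam", "hla_A.bim")

# Predict HLA-A with every platform's classifier and summarize the mean
# best posterior probability and the mean matching proportion
results <- lapply(model_files, function(fn) {
    mlst  <- get(load(fn))
    model <- hlaModelFromObj(mlst$A)
    pred  <- hlaPredict(model, geno, cl=8)
    c(mean_best_prob = mean(pred$value$prob),
      mean_matching  = mean(pred$value$matching))
})
do.call(rbind, results)  # one row per classifier
```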
Then, the same approach, but for the matching scores: As you can see, on average, the best probability scores match their corresponding array platforms, which aligns with expectations. However, the matching-score results are quite intriguing to me and prompt the following questions:
Q: Do these matching scores make any sense? Are these matching scores within the normal range for this kind of data? What does "matching score" really mean?

Q: If a probability score is better with a different platform than the corresponding one, which HLA prediction should be prioritized?

Q: Considering the ethnic background of a particular individual (e.g., African or Hispanic), how should this influence our choice? Suppose we obtain better probability results with the paper's race-specific pre-fit classifiers than with the array-specific multi-ethnic ones. How should this impact our decision-making process?

In your data, it is suggested to aggregate the prediction results from the three array platforms (GSA, Omni2.5, Oncoarray), using …
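On the aggregation point, one option (a sketch, not necessarily the exact workflow the truncated sentence was about to describe) is HIBAG's `hlaPredMerge()`, which combines predictions from multiple classifiers by averaging their posterior probabilities. Each input must be produced with `type="response+prob"` so the probability table is retained; the object names below are hypothetical:

```r
library(HIBAG)

# pred_gsa, pred_onco, pred_omni: hlaPredict() results for the same samples,
# each run with type="response+prob" so the posterior probability table is kept
# (hypothetical object names)
merged <- hlaPredMerge(pred_gsa, pred_onco, pred_omni)

# merged$value holds the consensus best-guess alleles and their
# averaged posterior probabilities
head(merged$value)
```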
Q: Do these matching scores make any sense? Are these matching scores within the normal range for this kind of data? What does "matching score" really mean?
A: Yes, the matching scores make sense here. Internally, a missing genotype is counted as a match against any pair of SNP alleles, so you will see higher matching scores when the SNP overlap between the array-specific model and the tested platform is lower.
"Matching" is a measure of how well the observed SNP profile matches the haplotypes observed in the training set (so missing SNP genotypes "always" match any pair of SNP alleles with higher probabilities). The matching proportion is not directly related to confidence…
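This overlap effect can be checked directly. A minimal sketch, assuming `model` is a loaded `hlaAttrBagClass` classifier and `geno` is the tested platform's `hlaSNPGenoClass` object, using HIBAG's `hlaSNPID()`:

```r
library(HIBAG)

# Fraction of the classifier's SNPs actually typed on the tested array;
# the lower this is, the more genotypes are missing at prediction time
# and the more inflated the matching score tends to be
model_snps <- hlaSNPID(model)
geno_snps  <- hlaSNPID(geno)
overlap <- length(intersect(model_snps, geno_snps)) / length(model_snps)
overlap
```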