
Unable to reproduce Precision-Recall plot Supp. Fig. 5 #5

Open
rcedgar opened this issue Oct 16, 2024 · 2 comments

rcedgar commented Oct 16, 2024

[Screenshot: foldseek_prec_recall_screenshot]

I have tried reproducing Supp. Fig. 5 using the scripts in this repository and also my own code, but I get quite different results. For example, as shown in the figure above, I find TM-align is better than DALI on Superfamily over the entire range, while your figure shows TM-align to be substantially worse (Fig. S5 on the left, my results on the right). My plot on the right was generated using your data and scripts as follows.

Hits downloaded from https://wwwuser.gwdguser.de/~compbiol/foldseek/scop.benchmark.result.tar.gz

sort -rgk3 ../alns/TMalign.txt > TMalign.sorted.txt
sort -rgk3 ../alns/dali.txt > dali.sorted.txt
bench.fdr.noselfhit.awk TMalign.sorted.txt scop_lookup.fix.tsv <(cat TMalign.sorted.txt) > TM-align.rocx
bench.fdr.noselfhit.awk dali.sorted.txt scop_lookup.fix.tsv <(cat dali.sorted.txt) > dali.rocx

Plot column PREC_SFAM (y axis) vs. RECALL_SFAM (x axis).
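For reference, this is roughly how I extract those two columns from a .rocx file before plotting (my own sketch, not code from this repository; the column names are assumed to match the header printed by bench.fdr.noselfhit.awk):

```python
def read_rocx_columns(path, xcol="RECALL_SFAM", ycol="PREC_SFAM"):
    """Return (recall, precision) lists from a whitespace-delimited .rocx file.

    The first line is assumed to be the header; any extra trailing data
    columns beyond the named headers are ignored.
    """
    with open(path) as fh:
        rows = [line.split() for line in fh if line.strip()]
    header, data = rows[0], rows[1:]
    xi, yi = header.index(xcol), header.index(ycol)
    xs = [float(r[xi]) for r in data]
    ys = [float(r[yi]) for r in data]
    return xs, ys

# Plotting, e.g. with matplotlib:
# import matplotlib.pyplot as plt
# for name in ("TM-align", "dali"):
#     x, y = read_rocx_columns(name + ".rocx")
#     plt.plot(x, y, label=name)
# plt.xlabel("RECALL_SFAM"); plt.ylabel("PREC_SFAM"); plt.legend(); plt.show()
```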

As you can see, the plot for DALI looks right but TM-align looks very wrong.

Any help in resolving this discrepancy will be much appreciated.

Also, the calculation of precision and recall in bench.fdr.noselfhit.awk appears to apply corrections compared to the standard formulas, but I don't understand how it works. Can you clarify? In particular, what is the variable norm doing? Thanks!


rcedgar commented Oct 20, 2024

After looking more closely at the source code, as best I can tell, the calculations of both precision and recall are incorrect.

Line 5 in bench.fdr.noselfhit.awk outputs 6 column headings:

                    1           2           3            4             5             6
      print "PREC_FAM","PREC_SFAM","PREC_FOLD","RECALL_FAM","RECALL_SFAM","RECALL_FOLD";

Line 37 in bench.fdr.noselfhit.awk outputs 7 columns:

                                     1                     2                     3                 4                  5                  6       7
NR %1000 == 0{ print tp_fam/(tp_fam+fp), tp_sfam/(tp_sfam+fp), tp_fold/(tp_fold+fp), tp_fam / queries, tp_sfam / queries, tp_fold / queries, tp_fam; }

Matching column headings to expressions in line 37:

1. PREC_FAM     = tp_fam/(tp_fam+fp)
2. PREC_SFAM    = tp_sfam/(tp_sfam+fp)
3. PREC_FOLD    = tp_fold/(tp_fold+fp)

4. RECALL_FAM   = tp_fam / queries # queries = constant = 3566
5. RECALL_SFAM  = tp_sfam / queries
6. RECALL_FOLD  = tp_fold / queries

Definitions of precision and recall are (see e.g. https://scikit-learn.org/1.5/auto_examples/model_selection/plot_precision_recall.html):

# TP = number of true positive hits above the threshold
# FP = number of false positive hits above the threshold
# FN = number of false negatives at the threshold

Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
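For concreteness, this is the standard (unweighted) precision/recall computation over a ranked hit list that I would expect (my own illustration, not the repository's code):

```python
def precision_recall_curve(labels, n_positives):
    """Standard precision/recall at each rank.

    labels: iterable of True/False, one per hit, sorted best score first
            (True = true positive, False = false positive).
    n_positives: total number of positives in the gold standard (TP + FN).
    Returns a list of (precision, recall) points, one per rank.
    """
    tp = fp = 0
    points = []
    for is_tp in labels:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp), tp / n_positives))
    return points

# e.g. precision_recall_curve([True, True, False, True], n_positives=5)
# gives (1.0, 0.2), (1.0, 0.4), (~0.667, 0.4), (0.75, 0.6)
```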

The calculation of precision appears to be wrong for family and superfamily because fp is always computed at the fold level (see line 38: it is incremented when id2fold[$1] != id2fold[$2]).

The calculation of recall appears to be wrong because the divisor should be (TP+FN) but is constant (always 3566).


martin-steinegger commented Oct 20, 2024

Thanks for flagging this. We will have a look at it, but since the postdoc left the lab it might take some time.
Regarding queries: we compute a weighted ROC, meaning each query can contribute at most 1 in total if it recalls all possible TPs (each TP counts 1/foldSize, which is the norm factor). Under this weighting, the number of queries equals TP + FN. Precision is computed as expected, Precision = TP / (TP + FP), but with the same weighting applied.
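To illustrate the weighting described above (a minimal sketch of my reading of it, not the repository's exact code): each true-positive hit adds 1/possible_tps(query), so a query that recalls all of its possible TPs contributes exactly 1, and the maximum total weighted TP equals the number of queries.

```python
def weighted_recall(hits, possible_tps):
    """Weighted recall where each query contributes at most 1.

    hits: list of (query, is_tp) pairs for hits above the score threshold.
    possible_tps: dict mapping each query to its number of possible TPs
                  (the 'norm' / foldSize factor).
    """
    tp = 0.0
    for query, is_tp in hits:
        if is_tp:
            tp += 1.0 / possible_tps[query]  # each TP is down-weighted by norm
    # With this weighting, the number of queries plays the role of TP + FN.
    return tp / len(possible_tps)

# e.g. q1 recalls all 4 of its TPs (contributes 1.0), q2 recalls 1 of 2
# (contributes 0.5): weighted recall over 2 queries is 1.5 / 2 = 0.75
```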

As for the TM-align results, we tried various methods to sort the hits. A reviewer recommended using the average of qTM and tTM scores, which indeed worked best. It’s possible that the uploaded file does not reflect this averaging.
