Jaccard similarity differences when using information content #746

souzadevinicius · 2024-05-01T11:56:19Z

While running a semantic similarity experiment, I noticed that the Jaccard similarity scores of certain records differed when I used the --information-content-file option. I am unsure of the reason for this behavior and have documented the experiment below in case anyone would like to reproduce it. If anyone can explain these differences, I would appreciate it.

The first run did not use an information content file:

runoak -i semsimian:sqlite:phenio.db similarity -p i \
--set1-file hp_terms.txt \
--set2-file mp_terms.txt \
--min-jaccard-similarity 0.4 \
-O csv \
-o semsim_without_ic_file.tsv

Next, I used the same parameters, adding only the --information-content-file option:

runoak -i semsimian:sqlite:phenio.db similarity -p i \
--set1-file hp_terms.txt \
--set2-file mp_terms.txt \
--min-jaccard-similarity 0.4 \
--information-content-file  phenio_monarch_hp_mp_ic.tsv \
-O csv \
-o semsim_with_ic_file.tsv

The information content files for the HP and MP terms were generated separately and then merged into a final file:

runoak -i phenio.db -g gene_phenotype.9606.tsv -G hpoa_g2p information-content -p i i^HP: -o phenio_monarch_hp_ic.tsv

runoak -i phenio.db -g gene_phenotype.10090.tsv -G hpoa_g2p information-content -p i i^MP: -o phenio_mp_ic.tsv
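The merge step is not shown in the post. A minimal sketch of one way to do it is below. It assumes each IC file is tab-separated, with the term ID in the first column and the IC value in the second. The `merge_ic_files` helper and its `has_header` flag are hypothetical names introduced here, not part of OAK.

```python
import csv


def merge_ic_files(paths, out_path, has_header=False):
    """Merge two-column (term ID, IC) TSV files into a single TSV file.

    Assumption: column 0 is the term ID and column 1 is the IC value.
    Set has_header=True if each input file starts with a header row.
    Returns the number of distinct term IDs written.
    """
    merged = {}
    header = None
    for path in paths:
        with open(path, newline="") as handle:
            reader = csv.reader(handle, delimiter="\t")
            if has_header:
                first = next(reader, None)
                header = header or first
            for row in reader:
                if not row:
                    continue  # skip blank lines
                term_id, ic = row[0], row[1]
                # Refuse to silently pick one value if the files disagree.
                if term_id in merged and merged[term_id] != ic:
                    raise ValueError(f"conflicting IC values for {term_id}")
                merged[term_id] = ic
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        if header:
            writer.writerow(header)
        writer.writerows(merged.items())
    return len(merged)
```

With the file names used above, the call would look like `merge_ic_files(["phenio_monarch_hp_ic.tsv", "phenio_mp_ic.tsv"], "phenio_monarch_hp_mp_ic.tsv")`. Raising on conflicting values makes it visible if a term ends up in both files with different IC scores, which would be worth ruling out when debugging this issue.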

Here is some exploratory analysis comparing the Jaccard similarity values:

property	semsim_without_ic	semsim_with_ic
count	1,485,387.00	1,522,836.00
mean	0.44	0.44
std	0.03	0.03
min	0.40	0.40
25%	0.41	0.41
50%	0.43	0.43
75%	0.46	0.46
max	0.70	0.70

Although the percentiles are the same, 38,798 records differ in their Jaccard similarity values. To identify the most extreme differences, I selected the top 10 records. Of these, five showed an increase in the Jaccard score when an external IC file was passed during calculation, and five showed a decrease.

subject_id	object_id	jaccard_similarity_without_ic	jaccard_similarity_with_ic	difference
HP:0025477	MP:0013304	0.416667	0.481481	15.56%
HP:0025477	MP:0012070	0.416667	0.481481	15.56%
HP:0025477	MP:0030485	0.416667	0.481481	15.56%
HP:0025477	MP:0031348	0.416667	0.481481	15.56%
HP:0025477	MP:0005422	0.416667	0.481481	15.56%
HP:0002514	MP:0000783	0.465116	0.425532	-9.30%
HP:0005671	MP:0000783	0.454545	0.416667	-9.09%
HP:0007045	MP:0000783	0.454545	0.416667	-9.09%
HP:0002514	MP:0000787	0.5	0.458333	-9.09%
HP:0005849	MP:0000783	0.454545	0.416667	-9.09%
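For anyone reproducing the comparison, here is a rough sketch of how the two result files could be joined and ranked. It assumes both files have `subject_id`, `object_id` and `jaccard_similarity` columns and are comma-delimited (the runs used `-O csv`); adjust these if your output differs. The percentages in the table appear consistent with dividing the signed difference by the smaller of the two scores, for example (0.425532 - 0.465116) / 0.425532 ≈ -9.30%. That is an inference from the numbers, not something stated in the post. All function names are hypothetical.

```python
import csv


def load_scores(path, delimiter=","):
    """Map (subject_id, object_id) -> jaccard_similarity for one result file.

    Assumption: the file has a header row with these three column names.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        return {
            (row["subject_id"], row["object_id"]): float(row["jaccard_similarity"])
            for row in reader
        }


def relative_difference(without_ic, with_ic):
    """Signed difference divided by the smaller of the two scores.

    Both scores must be positive; here they are at least 0.4 because of
    the --min-jaccard-similarity threshold.
    """
    return (with_ic - without_ic) / min(without_ic, with_ic)


def compare_scores(without_ic, with_ic, tolerance=1e-6):
    """Return (changed, only_without, only_with).

    changed is a list of (relative_difference, pair) for pairs present in
    both runs whose scores differ by more than tolerance, sorted by
    absolute relative difference, largest first. The other two values are
    the sets of pairs that appear in only one of the runs.
    """
    shared = without_ic.keys() & with_ic.keys()
    changed = [
        (relative_difference(without_ic[pair], with_ic[pair]), pair)
        for pair in shared
        if abs(without_ic[pair] - with_ic[pair]) > tolerance
    ]
    changed.sort(key=lambda item: abs(item[0]), reverse=True)
    only_without = without_ic.keys() - with_ic.keys()
    only_with = with_ic.keys() - without_ic.keys()
    return changed, only_without, only_with
```

Reporting the pairs that occur in only one run separately may help explain why the two record counts differ, since a pair can cross the 0.4 threshold in one run but not the other.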

matentzn · 2024-05-01T12:05:01Z

Very nice ticket, subscribing with interest to the thread.

caufieldjh · 2024-05-01T18:22:56Z

Certainly strange and unexpected.
Is the behavior reproducible with a smaller set of terms?
Or rather, does it happen when you use the basic OAK semsim implementation rather than semsimian?
I ask because I didn't think the semsimian implementation did anything with the information-content-file input; the semsim interface will cache the provided values here, in ontology-access-kit/src/oaklib/interfaces/semsim_interface.py (lines 224 to 228 in aef85c6):

    if self.cached_information_content_map is not None:
        for curie in curies:
            if curie in self.cached_information_content_map:
                yield curie, self.cached_information_content_map[curie]
        return
