I ran a semantic similarity experiment and noticed that the Jaccard similarity scores of certain records differed when I used the `--information-content-file` option. I am unsure what causes this, so I have documented the experiment details in case anyone would like to reproduce it. I would appreciate it if anyone can explain these differences.
The first run did not use any information content file:
Here is some exploratory analysis of the Jaccard similarity comparisons:
| property | semsim_without_ic | semsim_with_ic |
| --- | --- | --- |
| count | 1,485,387.00 | 1,522,836.00 |
| mean | 0.44 | 0.44 |
| std | 0.03 | 0.03 |
| min | 0.40 | 0.40 |
| 25% | 0.41 | 0.41 |
| 50% | 0.43 | 0.43 |
| 75% | 0.46 | 0.46 |
| max | 0.70 | 0.70 |
Although the summary statistics match to two decimal places, 38,798 records differ in their Jaccard similarity values. To identify the most extreme differences, I selected the top 10 records: five showed an increase in the Jaccard score when an external IC file was passed during calculation, and five showed a decrease.
| subject_id | object_id | jaccard_similarity_without_ic | jaccard_similarity_with_ic | difference |
| --- | --- | --- | --- | --- |
| HP:0025477 | MP:0013304 | 0.416667 | 0.481481 | 15.56% |
| HP:0025477 | MP:0012070 | 0.416667 | 0.481481 | 15.56% |
| HP:0025477 | MP:0030485 | 0.416667 | 0.481481 | 15.56% |
| HP:0025477 | MP:0031348 | 0.416667 | 0.481481 | 15.56% |
| HP:0025477 | MP:0005422 | 0.416667 | 0.481481 | 15.56% |
| HP:0002514 | MP:0000783 | 0.465116 | 0.425532 | -9.30% |
| HP:0005671 | MP:0000783 | 0.454545 | 0.416667 | -9.09% |
| HP:0007045 | MP:0000783 | 0.454545 | 0.416667 | -9.09% |
| HP:0002514 | MP:0000787 | 0.5 | 0.458333 | -9.09% |
| HP:0005849 | MP:0000783 | 0.454545 | 0.416667 | -9.09% |
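For anyone re-deriving the `difference` column: the post does not say how it was computed, but the reported percentages are consistent with taking the signed change relative to the smaller of the two scores. The sketch below checks that reading against the distinct score pairs from the table. The function name `relative_change_pct` is made up for illustration, and the formula is an inference from the numbers, not something confirmed by the author.

```python
# Distinct score pairs copied from the table above:
# (subject_id, object_id, without_ic, with_ic, reported difference in %)
ROWS = [
    ("HP:0025477", "MP:0013304", 0.416667, 0.481481, 15.56),
    ("HP:0002514", "MP:0000783", 0.465116, 0.425532, -9.30),
    ("HP:0005671", "MP:0000783", 0.454545, 0.416667, -9.09),
    ("HP:0002514", "MP:0000787", 0.5, 0.458333, -9.09),
]


def relative_change_pct(without_ic: float, with_ic: float) -> float:
    """Signed change as a percentage of the smaller score.

    Assumption: this denominator is inferred from the reported values.
    """
    return 100.0 * (with_ic - without_ic) / min(without_ic, with_ic)


for subject, obj, without_ic, with_ic, reported in ROWS:
    computed = round(relative_change_pct(without_ic, with_ic), 2)
    print(f"{subject} {obj}: computed {computed:+.2f}%, reported {reported:+.2f}%")
```

Dividing by the score without IC instead would give a different value for the decreasing rows, so it is worth stating the convention when sharing a reproduction.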
Certainly strange and unexpected.
Is the behavior reproducible with a smaller set of terms?
Or rather, does it happen when you use the basic OAK semsim implementation rather than semsimian?
I ask because I didn't think the semsimian implementation did anything with the information-content-file input; the semsim interface will cache the provided values here (
Next, I used the same parameters, adding only the `--information-content-file` option:
The HP and MP terms' information content files were generated separately and merged into a final file.
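When reproducing this on a smaller set of terms, it may help to keep the textbook definition in view. The sketch below assumes the Jaccard score is the size of the intersection of two terms' ancestor sets divided by the size of their union; whether OAK and semsimian compute it exactly this way should be checked against their code. All IDs are hypothetical placeholders, not real HP or MP terms. Under this definition information content is not an input, so a changed score would point to changed ancestor sets.

```python
def jaccard(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b|, or 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical placeholder IDs standing in for ancestor sets.
shared = {f"SHARED:{i}" for i in range(20)}
extra_small = {f"EXTRA:{i}" for i in range(23)}
extra_large = {f"EXTRA:{i}" for i in range(27)}

# Same intersection (20 terms), unions of 43 and 47 terms.
score_small = jaccard(shared, shared | extra_small)  # 20 / 43
score_large = jaccard(shared, shared | extra_large)  # 20 / 47

print(round(score_small, 6), round(score_large, 6))
```

The set sizes were chosen so the two results round to 0.465116 and 0.425532, the values in the HP:0002514 / MP:0000783 row. That is arithmetic only and says nothing about the real ancestor sets, but comparing the actual ancestor sets from both runs for one such pair would show whether the union really differs.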