Support v2-beta2 kb.phenoscape API #235

Closed
5 of 9 tasks
johnbradley opened this issue Aug 25, 2021 · 40 comments
@johnbradley
Contributor

johnbradley commented Aug 25, 2021

Add support for the v2-beta2 API.

Problems

  • Problem 1: /term/all_descendants parts not working
  • Problem 2: /similarity/corpus_size always returns 0
  • Problem 3: /similarity/frequency API change
  • Problem 4: /similarity/matrix API failing
  • Problem 5: /similarity/frequency 500 Error
  • Problem 6: /similarity/matrix returning fewer results for "basihyal bone" phenotypes
  • Problem 7: /similarity/matrix returning IRIs that have no labels
  • Problem 8: Unable to determine term categories for IRIs returned by /similarity/matrix
  • Problem 9: Resnik similarity zero for some IRI in a matrix returned by /similarity/matrix
@johnbradley
Contributor Author

johnbradley commented Aug 25, 2021

Problem 1: /term/all_descendants parts not working

The descendants/ancestors test started failing at the following line when using the v2-beta2 API:

expect_equal(is_descendant("paired fin", c("pelvic fin", "pelvic fin ray"),
                           includeRels = "part_of"),
             c(TRUE, TRUE))

The problem is that the following line returns FALSE instead of TRUE.

is_descendant("paired fin", c("pelvic fin ray"), includeRels = "part_of")

The paired fin has IRI of http://purl.obolibrary.org/obo/UBERON_0002534.
The pelvic fin ray has an IRI of http://purl.obolibrary.org/obo/UBERON_4300117.

The production and v2-beta APIs return 500+ records for finding term descendants of http://purl.obolibrary.org/obo/UBERON_0002534 with parts=true:

curl -X GET "https://kb.phenoscape.org/api/v2-beta/term/all_descendants?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0002534&parts=true" -H "accept: application/json"

The v2-beta2 API returns 28 records for finding term descendants of http://purl.obolibrary.org/obo/UBERON_0002534 with parts=true:

curl -X GET "https://kb.phenoscape.org/api/v2-beta2/term/all_descendants?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0002534&parts=true" -H "accept: application/json"

If you have jq installed you can approximate the number of records returned like so:

curl -X GET "https://kb.phenoscape.org/api/v2-beta2/term/all_descendants?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0002534&parts=true" -H "accept: application/json" | jq . | grep -c "@id"

Perhaps the parts argument is always treated as false in v2-beta2?

@johnbradley
Contributor Author

johnbradley commented Aug 26, 2021

Problem 2: /similarity/corpus_size always returns 0

The corpus_size() function returns 0 for taxa and genes.

> corpus_size("taxa")
[1] 0

This is expected to be at least 100.
The Swagger UI seems to have been updated to use v2-beta2, so you can reproduce this issue here:
https://kb.phenoscape.org/apidocs/#/Semantic%20similarity/get_similarity_corpus_size

This can be reproduced from the command line:

$ curl -X GET "https://kb.phenoscape.org/api/v2-beta2/similarity/corpus_size?corpus_graph=http%3A%2F%2Fkb.phenoscape.org%2Fsim%2Ftaxa" -H "accept: application/json"
{"total":0}

@johnbradley
Contributor Author

Problem 3: /similarity/frequency API change

The term_freqs() function fails with a 400 error now:

> phens <- get_phenotypes(entity = "basihyal bone")
> term_freqs(phens$id, as = "phenotype", corpus = "taxa")
 Error in get_csv_data(pkb_api("/similarity/frequency"), query = query,  : 
  (400) Bad Request: Request is missing required form field 'path' 

We are currently passing terms and corpus_graph ("http://kb.phenoscape.org/sim/taxa" or "http://kb.phenoscape.org/sim/genes").
The updated /similarity/frequency API only supports terms and path.

path: SPARQL property path composed of full IRIS. This is used to connect the data resource to count (RDF graph world) to the ontology world. E.g. /

I think the 'E.g. /' is a rendering problem and should be <http://purl.org/phenoscape/vocab.owl#exhibits_state>/<http://purl.org/phenoscape/vocab.owl#describes_phenotype> based on the raw swagger yaml.

I am not sure how to express the corpus (taxa or genes) in the path parameter.

@johnbradley
Contributor Author

Problem 4: /similarity/matrix API failing

The subsumer_matrix() function fails with a 500 error now:

> subsumer_matrix(c("http://purl.obolibrary.org/obo/UBERON_0000981"))
 Error in get_csv_data(pkb_api("/similarity/matrix"), query = queryseq,  : 
  (500) Internal Server Error: There was an internal server error. 

The /similarity/matrix endpoint has a new path parameter, but it looks to be optional. The path parameter description from the raw swagger.yaml:

description: SPARQL property path composed of full IRIS. This is used to connect the data resource to count (RDF graph world) to the ontology world. E.g. <http://purl.org/has_state>/<http://purl.org/describes_phenotype>

I tried hard coding the path parameter with the example value above. The API returned a mostly empty response:

> subsumer_matrix(c("http://purl.obolibrary.org/obo/UBERON_0000981"))
[1] UBERON_0000981
<0 rows> (or 0-length row.names)

@johnbradley
Contributor Author

@balhoff Please see the above problems I encountered with v2-beta2.

In addition, I noticed something strange in Swagger.
Some endpoints that support GET and POST have different parameters.
The GET /similarity/frequency endpoint has parameters terms and path.
The POST /similarity/frequency endpoint has parameters terms and corpus_graph.
I expected both to have the same parameters.

@balhoff
Member

balhoff commented Sep 2, 2021

Fix for problem 1: phenoscape/phenoscape-kb-services#472

@balhoff
Member

balhoff commented Sep 2, 2021

Fix for swagger issue: phenoscape/phenoscape-kb-services#473

@balhoff
Member

balhoff commented Sep 3, 2021

Re: problem 2 and problem 3 — parameters have changed for all services related to similarity corpora. Instead of using an IRI to name one, you provide a SPARQL property path for which the subjects are items in the corpus (e.g. taxa), and the objects are the annotations (e.g. phenotype classes). You can also (optionally) provide a specifier_property and specifier_value which can help select the corpus items when you don't want everything that could be a subject of the provided path.

For the previous "taxa" corpus, use:

  • path: <http://purl.org/phenoscape/vocab.owl#has_phenotypic_profile>/<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
  • specifier_property: http://www.w3.org/2000/01/rdf-schema#isDefinedBy
  • specifier_value: http://purl.obolibrary.org/obo/vto.owl

For the previous "genes" corpus, use:

  • path: <http://purl.org/phenoscape/vocab.owl#has_phenotypic_profile>/<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
  • specifier_property: http://www.w3.org/1999/02/22-rdf-syntax-ns#type
  • specifier_value: http://purl.org/phenoscape/vocab.owl#AnnotatedGene

@balhoff
Member

balhoff commented Sep 3, 2021

Problem 4 should be fixed by phenoscape/phenoscape-kb-services#476 (which has been deployed). For the subsumer matrix, you may want to consider allowing a choice of relations to traverse (new feature).

@hlapp
Member

hlapp commented Sep 3, 2021

@balhoff can you explain (or link to the documentation that explains) what the specifier parameters are for? I don't recall these from our biweekly discussion, but I may have missed it. Do these essentially act as filters for the initial subject of the property chain? (The parameter name seems rather confusing - can you say where that's coming from?)

@balhoff
Member

balhoff commented Sep 3, 2021

Do these essentially act as filters for the initial subject of the property chain?

That's exactly right. I invented these today, so feedback on the name is entirely welcome! I realized that some additional specification was needed to target the corpus items. What do you think?

@hlapp
Member

hlapp commented Sep 3, 2021

Also (but perhaps the documentation explains this?) for a path of <http://purl.org/phenoscape/vocab.owl#has_phenotypic_profile>/<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> the second component seems meaningless. Doesn't it just mean that the object is an instance (which being the object of the property already implies) asserted to be of some type? I.e., is this weeding out phenotypic profiles for which a type is not asserted, and if so, why would there be such profiles?

@hlapp
Member

hlapp commented Sep 3, 2021

If they're in essence a subject filter, then maybe just call them such? I.e., subject_filter_property and subject_filter_value?

@balhoff
Member

balhoff commented Sep 3, 2021

The type predicate is needed to connect to the phenotypes; it's just how things are structured in the triplestore. There's an intermediate "profile" node, which is just a shadow of the taxon node, in between these two predicates. Where should this documentation live? Phenoscape wiki, or phenoscape-kb-services repo? I think we need some general topic docs outside of the swagger docs.

@hlapp
Member

hlapp commented Sep 3, 2021

Why not the Swagger docs? Isn't that where someone would go to find it? Of course, if it's lengthy, you could put it on the wiki (but I would use the phenoscape-kb-services repo wiki), and then link to it from the Swagger docs.

@hlapp
Member

hlapp commented Sep 3, 2021

The type predicate is needed to connect to the phenotypes; it's just how things are structured in the triplestore. There's an intermediate "profile" node, which is just a shadow of the taxon node, in between these two predicates.

So when you say to connect to the phenotypes what you mean is connect to the phenotype class(es) because that, not the instance(s), is what we're interested in and where the semantics are codified.

@balhoff
Member

balhoff commented Sep 7, 2021

@hlapp I updated the parameter names as you suggested: phenoscape/phenoscape-kb-services#481

@johnbradley
Contributor Author

johnbradley commented Sep 8, 2021

Problem 5: /similarity/frequency 500 Error

I updated the term_freqs() function to include the SPARQL property paths (path, specifier_property, and specifier_value) from #235 (comment) above. The test above in problem 3 no longer fails. However, a test that passes 189 term IRIs now fails with a 500 error:

Error (test-semsim.R:140:3): profile similarity with Resnik
Error: (500) Internal Server Error: There was an internal server error.
Backtrace:
 1. rphenoscape::term_freqs(...) test-semsim.R:140:2
 2. rphenoscape::get_csv_data(...) /Users/jpb67/Documents/work/rphenoscape/R/term-weights.R:89:4

The above code uses the POST /similarity/frequency endpoint.

To reproduce in R I do the following:

phens <- get_phenotypes("maxilla", taxon = "Cyprinidae")
subs.mat <- subsumer_matrix(phens$id, .colnames = "label", .labels = phens$label,
                            preserveOrder = TRUE)
freqs <- term_freqs(rownames(subs.mat), as = "phenotype", corpus = "taxa")

If I reduce the number of term IRIs to 185, the API doesn't fail but does take 2m10s.
For comparison running the same code against the v2-beta API finishes in 15s.


In testing this out I noticed some data differences between the v2-beta and v2-beta2 API results:

                       v2-beta   v2-beta2
phenotypes found       12        66
subsumer matrix names  896       189

The subsumer matrix is created by calling the /similarity/matrix API endpoint.
The v2-beta version of the /similarity/frequency API endpoint can handle 896 term IRIs.


v2-beta2 IRIs: iris.txt

@balhoff
Member

balhoff commented Sep 8, 2021

@johnbradley could you paste the list of terms here?

@johnbradley
Contributor Author

@balhoff I updated my comment to include a link to a text file of IRIs.

@johnbradley
Contributor Author

johnbradley commented Sep 9, 2021

Problem 6: /similarity/matrix returning fewer results for "basihyal bone" phenotypes

Failing test

A test started failing when switching to v2-beta2 API:

Error (test-pk.R:176:3): labels for pre-generated post-comps
Error: cannot take a sample larger than the population when 'replace = FALSE'
Backtrace:
 1. base::sample(rownames(subsumer_matrix(phen)), size = 30) test-pk.R:176:2

The R code to reproduce the problem is:

  phen <- sample(get_phenotypes("basihyal bone")$id, size = 1)
  subs <- sample(rownames(subsumer_matrix(phen)), size = 30)

Explanation of the test

The test fetches phenotype IRIs from /phenotype/query for the "basihyal bone" entity and chooses a random IRI from the results.
The IRI is sent to the /similarity/matrix API, which returns fewer IRIs than previously expected.
The test then tries to sample 30 random IRIs from the returned IRIs, which typically fails because fewer than 30 IRIs are returned.
In v2-beta I see over 300 results returned for all the examples I checked.

curl example

Example of receiving only 24 IRIs back from /similarity/matrix for a "basihyal bone" phenotype IRI:

curl -X GET "https://kb.phenoscape.org/api/v2-beta2/similarity/matrix?terms=%5B%22http%3A%2F%2Fpurl.org%2Fphenoscape%2Fexpression%3Fvalue%3D%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000051%253E%2Bsome%2B%250A%2B%2B%2B%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FPATO_0000117%253E%250A%2B%2B%2B%2B%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FRO_0000052%253E%2Bsome%2B%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0011618%253E%2529%2529%22%5D" -H "accept: text/csv" | wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5109  100  5109    0     0  33834      0 --:--:-- --:--:-- --:--:-- 33834
      24

If you switch to the v2-beta API in the above curl command, 385 items are returned.

The phenotype IRI in question:

http://purl.org/phenoscape/expression?value=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000051%3E+some+%0A++++%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0000117%3E%0A+++++and+%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0000052%3E+some+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0011618%3E%29%29

Question

Is the difference in the number of returned IRIs expected?
If so, I could reduce the sample size.
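
A minimal sketch of one way to make the sampling step robust to a smaller result set, assuming the test only needs "up to 30" subsumers. The helper `sample_up_to` and the stand-in IRIs are hypothetical and not part of rphenoscape.

```r
# Hypothetical helper: sample at most `size` elements, never more than exist,
# so sample() cannot fail with "cannot take a sample larger than the population".
sample_up_to <- function(x, size) {
  # Sampling indices avoids sample()'s special case for a single numeric value.
  x[sample.int(length(x), size = min(size, length(x)))]
}

# Stand-in for the 24 row names returned by subsumer_matrix() in v2-beta2.
iris_returned <- paste0("http://example.org/term/", 1:24)
subs <- sample_up_to(iris_returned, 30)
length(subs)  # 24
```

In the test, the same guard could be written inline as `size = min(30, nrow(...))`.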

@johnbradley
Contributor Author

Problem 7: /similarity/matrix returning IRIs that have no labels

Failing test

A test started failing when switching to v2-beta2 API:

Failure (test-pk.R:190:3): labels for pre-generated post-comps
sum(is.na(subs.l$label)) is not less than 1. Difference: 15

The R test code:

subs <- sample(rownames(subsumer_matrix(c("femur"))), 30)
subs.l <- get_term_label(subs, preserveOrder = TRUE)
# Unfortunately, there are some regular ontologies for which the database
# does not consistently have labels. Filter those out.
ontFilter <- Reduce(
  function(v1, v2) v1 | startsWith(subs.l$id, v2),
  paste0("http://purl.obolibrary.org/obo/", c("CARO"), "_"),
  init = rep(FALSE, times = length(subs.l$id))
)
subs.l <- subs.l[! (is.na(subs.l$label) & ontFilter),]
testthat::expect_lte(sum(is.na(subs.l$label)), 1)

A quick way to see the data in R is:

get_term_label(rownames(subsumer_matrix("http://purl.obolibrary.org/obo/UBERON_0000981")))$label

Explanation of the test

The femur IRI is sent to the /similarity/matrix API and from the results 30 IRIs are sampled.
The code fetches labels for the 30 IRIs.
Any CARO IRIs with no labels are removed.
Then the code checks that all remaining IRIs do not have NA for their labels.

Comparing v2-beta results vs v2-beta2 results

I ran the quick R example above using v2-beta and v2-beta2.
In v2-beta, 103 IRIs are returned from /similarity/matrix and 102 have labels (the only NA is a CARO IRI, which the test excludes).
In v2-beta2, 136 IRIs are returned from /similarity/matrix and 59 have labels.
Apart from a single CARO IRI, the IRIs that do not have labels start with: http://purl.org/phenoscape/term/relation/

Example v2-beta2 IRI that has no label:

http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0004288 

This IRI is not valid for v2-beta.

Some "has part ..." labels show up in v2-beta but did not show up in v2-beta2.

Question

Should the http://purl.org/phenoscape/term/relation/ IRIs have labels? If not I can filter them out like we do the CARO IRIs.
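
If filtering turns out to be the right answer, a sketch of what that could look like is below. The helper name `is_relation_iri` is hypothetical, and the two IRIs are taken from the examples in this comment.

```r
# Prefix shared by the unlabeled IRIs seen in the v2-beta2 results.
relation_prefix <- "http://purl.org/phenoscape/term/relation/"

# Hypothetical helper: TRUE for IRIs that start with the relation prefix.
is_relation_iri <- function(iris) startsWith(iris, relation_prefix)

iris <- c(
  "http://purl.obolibrary.org/obo/UBERON_0000981",
  paste0(
    relation_prefix,
    "http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/",
    "http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0004288"
  )
)

# Keep only the IRIs that are expected to have labels.
iris[!is_relation_iri(iris)]
```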

@johnbradley
Contributor Author

Problem 8: Unable to determine term categories for IRIs returned by /similarity/matrix

Failing test

A test started failing when switching to v2-beta2 API:

Failure (test-freqs.R:65:3): success rate for entity subsumer terms
mean(is.na(tt.types)) is not strictly less than 0.1. Difference: 0.441

The R test code:

tt <- sapply(c("fin ray", "dorsal fin", "caudal fin"), get_term_iri, as = "anatomy")
subs.mat <- subsumer_matrix(tt)
tt.types <- term_category(rownames(subs.mat))
# less than 10% of the terms should be indeterminate
testthat::expect_lt(mean(is.na(tt.types)), .1)

A quick way to see the IRIs in R is:

> subs.mat <- head(subsumer_matrix(c("http://purl.obolibrary.org/obo/UBERON_4400005","http://purl.obolibrary.org/obo/UBERON_0003097","http://purl.obolibrary.org/obo/UBERON_4000164")))
> subs.mat$tc <- term_category(rownames(subs.mat))
> subs.mat[c("tc")]
                                                                                                                                                          tc
http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0010000   <NA>
http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000002      <NA>
http://purl.obolibrary.org/obo/CARO_0010000                                                                                                           entity
http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0000468   <NA>
http://purl.obolibrary.org/obo/UBERON_0000061                                                                                                         entity
http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCARO_0000003     <NA>

In the above example scroll to the right to see the term_category(tc) for each IRI.

Explanation of the test

The test looks up IRIs for "fin ray", "dorsal fin", and "caudal fin".
It then passes these IRIs to /similarity/matrix.
The code then tries to determine the term category for each IRI returned by /similarity/matrix.
The term category is determined by looking at the results of /term/all_ancestors and /term/classification for each IRI.
The code expects more than 90% of the IRIs to have a term category.

The IRIs for which we can't determine a term category have the http://purl.org/phenoscape/term/relation/ prefix.
Example IRI:

http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCARO_0000000

curl Example

Fetch ancestors for a relation IRI:

curl -X GET "https://kb.phenoscape.org/api/v2-beta2/term/all_ancestors?iri=http%3A%2F%2Fpurl.org%2Fphenoscape%2Fterm%2Frelation%2Fhttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%2Fhttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FCARO_0000000&parts=false" -H "accept: application/json"
{"results":[]}

Fetch term classification for a relation IRI:

curl -X GET "https://kb.phenoscape.org/api/v2-beta2/term/classification?iri=http%3A%2F%2Fpurl.org%2Fphenoscape%2Fterm%2Frelation%2Fhttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%2Fhttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FCARO_0000000" -H "accept: application/json" | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   367  100   367    0     0   2352      0 --:--:-- --:--:-- --:--:--  2352
{
  "label": "http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCARO_0000000",
  "subClassOf": [],
  "equivalentTo": [],
  "superClassOf": [],
  "@id": "http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCARO_0000000"
}

Question

Should these ../term/relation IRIs return data for /term/classification and/or /term/all_ancestors?
If not, is there a way to determine a term category ("entity", "quality", "phenotype", or "taxon") for these IRIs?

@johnbradley
Contributor Author

johnbradley commented Sep 9, 2021

Problem 9: Resnik similarity zero for some IRI in a matrix returned by /similarity/matrix

Failing test

A test started failing when switching to v2-beta2 API:

Failure (test-semsim.R:74:3): Resnik similarity
all(sm.ic > 0) is not TRUE

The R test code:

phens <- get_phenotypes("basihyal bone", taxon = "Cyprinidae")
subs.mat <- subsumer_matrix(phens$id, .colnames = "label", .labels = phens$label,
                            preserveOrder = TRUE)
s <- unique(c(sample(1:nrow(subs.mat), size = 10),
              match(phens$id, rownames(subs.mat))))
subs1 <- rownames(subs.mat)[s]
subs.mat1 <- subs.mat[s,]
rownames(subs.mat1) <- subs1
sm.ic <- resnik_similarity(subs.mat1,
                           wt_args = list(as = "phenotype", corpus = "taxa"))
testthat::expect_equal(dim(sm.ic), c(nrow(phens), nrow(phens)))
testthat::expect_true(all(sm.ic > 0))

So sm.ic should only have positive values but that is no longer the case in v2-beta2:
[Screenshot of sm.ic, taken 2021-09-09]

The IRI for anatomical projection and (part_of some (posterior margin and (part_of some basihyal bone))) absent above in sm.ic is:

http://purl.org/phenoscape/expression?value=%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000051%3E+some+%0A++++%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0002000%3E%0A+++++and+%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0000052%3E+some+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0000468%3E%29%0A+++++and+%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0002503%3E+value+%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fsubclassof%3Fvalue%3D%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0004529%253E%250A%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%250A%2B%2B%2B%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBSPO_0000672%253E%250A%2B%2B%2B%2B%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0011618%253E%2529%2529%2529%23e1332d8d-9c88-4a4d-b2c4-04a424b481cd%3E%29%29%29%0A+and+%28%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fvocab.owl%23phenotype_of%3E+some+%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fsubclassof%3Fvalue%3D%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0004529%253E%250A%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%250A%2B%2B%2B%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBSPO_0000672%253E%250A%2B%2B%2B%2B%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0011618%253E%2529%2529%2529%23e1332d8d-9c88-4a4d-b2c4-04a424b481cd%3E%29

Explanation of the test

The test uses /phenotype/query with "basihyal bone" and taxon = "Cyprinidae" to create a list of IRIs.
Some of the IRIs returned by /phenotype/query include "absent" in the label.
A sample of these IRIs is sent to the /similarity/matrix API endpoint.
The test then calculates Resnik similarity in R for the returned matrix.
This is done by calculating term frequencies: the row names of the subsumer matrix (IRIs) are passed to the /similarity/frequency endpoint, and the integer values returned from the endpoint are divided by the corpus size.
The values are then passed through -log() and some additional math.
Some cells of the resulting matrix contain Resnik similarity values of 0, which causes the test to fail.

Example IRI that has 0 Resnik similarity

For an IRI that had 0 Resnik similarity, we received 797 for the "frequency score (subsumed items)" returned by /similarity/frequency:

$ curl -X GET "https://kb.phenoscape.org/api/v2-beta2/similarity/frequency?terms=%5B%22http%3A%2F%2Fpurl.org%2Fphenoscape%2Fexpression%3Fvalue%3D%253Chttp%253A%252F%252Fpurl.org%252Fphenoscape%252Fvocab.owl%2523implies_presence_of%253E%2Bsome%2B%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0001015%253E%22%5D&path=%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fvocab.owl%23has_phenotypic_profile%3E%2F%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23type%3E&subject_filter_property=http%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23isDefinedBy&subject_filter_value=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fvto.owl" -H "accept: text/csv"
http://purl.org/phenoscape/expression?value=%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fvocab.owl%23implies_presence_of%3E+some+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0001015%3E,797

797 is the same as the size of the taxa corpus. Since we take log(797/797), we end up with zero for the term weight.
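
The arithmetic can be sketched as follows. `term_ic` is a hypothetical stand-in for the weighting step, not the rphenoscape implementation, and the count of 100 is an invented comparison value.

```r
# Hypothetical stand-in for the term weight: information content computed
# from a raw frequency count and the corpus size.
term_ic <- function(freq, corpus_size) -log(freq / corpus_size)

term_ic(797, 797)  # 0: the term subsumes the whole corpus, so it carries no information
term_ic(100, 797)  # positive: a term annotated to fewer corpus items is informative
```

Any subsumer whose count equals the corpus size therefore contributes a weight of zero, whatever its position in the graph.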

Question

It seems like the problematic IRIs have a label ending in "... absent". These IRIs are coming from /phenotype/query. Should these "absent" IRIs be filtered out at some point?

@hlapp
Member

hlapp commented Sep 10, 2021

Re: problem 9, this (Resnik similarity score of zero) can only really come about if two terms do not have any common subsumers in the matrix.

This could be because of an error in the /similarity/matrix endpoint (in that it doesn't return some subsumers that it nonetheless should). It could also be an effect of a correction for unexpected zero or NA term frequencies being obtained for some subsumers:

rphenoscape/R/semsim.R

Lines 249 to 259 in 4855e6c

# Terms with frequency zero should not occur in the subsumer matrix, so
# if there are any, they either shouldn't have been a subsumer, or they
# didn't yield a count. Either way, remove them from the computation.
rowsToRemove <- is.na(wt) | wt == 0
if (any(rowsToRemove)) {
wt <- wt[! rowsToRemove]
subsumer_mat <- subsumer_mat[! rowsToRemove,]
}
# we assume we got frequencies, turn into IC
wt <- -log(wt, base = base)
}

For some terms, this may inadvertently remove the only common subsumer(s) that there are.

It seems more likely that there are some common subsumers that are being returned from the matrix endpoint, but then erroneously receive no count or a count of zero in the frequencies endpoint.

@johnbradley
Contributor Author

johnbradley commented Sep 17, 2021

@hlapp Re: problem 9: I removed the logic that removes rows and the problem persisted. It looks like I missed a pretty big part of what happens in problem 9 (fetching frequencies from /similarity/frequency), so I'm going to update my comment above with better details.

@balhoff
Member

balhoff commented Sep 17, 2021

For problem 5—I made a PR to perform many queries instead of one big one: phenoscape/phenoscape-kb-services#489

@balhoff
Member

balhoff commented Sep 17, 2021

@johnbradley for problem 6, the reduced number of subsumers in the matrix for a phenotype is expected. You will get more if you add more arguments to the relations parameter. For the phenotype IRI you mentioned, if you add the relation http://purl.org/phenoscape/vocab.owl#phenotype_of_reflexive_part_of, you get 83 subsumers.

Note to myself—object properties for different situations is one of the topics that needs documentation.

@hlapp
Member

hlapp commented Sep 17, 2021

object properties for different situations is one of the topics that needs documentation.

Yes. For example, when would I, and when would I not, want to add http://purl.org/phenoscape/vocab.owl#phenotype_of_reflexive_part_of? On the surface, it would seem that I should not add it if I wanted phenotypes only of true parts, rather than of things or any of their parts. But without more visibility into the data model, that's practically impossible to verify.

@balhoff
Member

balhoff commented Sep 22, 2021

For problem 7, the http://purl.org/phenoscape/term/relation/ IRIs do not have labels. However, we could handle these specially in the label and term info queries if we think that's a good idea. Those terms are built from two components, each of which typically has a label.
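
To illustrate the two components, here is a sketch that splits one of the relation IRIs reported above. It assumes the layout visible in those examples: the prefix followed by two URL-encoded IRIs separated by "/". The helper `split_relation_iri` is hypothetical and is not an existing KB or rphenoscape function.

```r
relation_prefix <- "http://purl.org/phenoscape/term/relation/"

# Hypothetical helper: recover the two component IRIs of a relation IRI,
# assuming "<prefix><encoded property IRI>/<encoded class IRI>".
split_relation_iri <- function(iri) {
  stopifnot(startsWith(iri, relation_prefix))
  rest <- substring(iri, nchar(relation_prefix) + 1)
  # The encoded components contain no literal "/", so one split separates them.
  parts <- strsplit(rest, "/", fixed = TRUE)[[1]]
  vapply(parts, utils::URLdecode, character(1), USE.NAMES = FALSE)
}

split_relation_iri(paste0(
  relation_prefix,
  "http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/",
  "http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0004288"
))
```

A client could look up a label for each component and combine the two, if the service does not end up doing so itself.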

@balhoff
Member

balhoff commented Sep 22, 2021

For problem 6: I will send an example relation list to @johnbradley which most closely mimics the previous results.

@hlapp
Member

hlapp commented Sep 22, 2021

@johnbradley and @balhoff just to clarify from our discussion: The Resnik similarity between two terms is zero if (a) they have no subsumers in common (in a graph with a root shared by all terms this should never happen), or if (b) the subsumer(s) that they do have in common either are the root term(s) or have the same frequency as the root term (i.e., for which the frequency is equal to the corpus size).

Note that, unlike Jaccard, Resnik cannot distinguish between a root term and a term descending from the root term that has nonetheless the same frequency as the root term. (This means for example that if the only change we made to a graph is adding a line of subsumer terms to a term that currently is the root term in a graph, then Resnik similarities for any pair of terms would be unchanged. Jaccard similarities would change, however, because now we've added terms into the union and intersection sets of subsumers for any pair of terms.)

Hence, if there isn't a bug with frequency calculations, one question is whether it is "correct" (however we define this) that for anatomical projection and (part_of some (posterior margin and (part_of some basihyal bone))) its presence and absence phenotypes should only have common subsumer(s) whose frequency is equal to the corpus size.
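
A toy illustration of case (b) and of the contrast with Jaccard. The matrix, the frequencies, and the functions `toy_resnik` and `toy_jaccard` are all made up for this sketch and are not the rphenoscape implementations.

```r
# Toy subsumer matrix: rows are subsumers, columns are terms (1 = subsumes).
subs <- matrix(
  c(1, 1, 1,   # root: subsumes every term
    1, 1, 0,   # A: subsumes t1 and t2
    0, 0, 1),  # B: subsumes t3 only
  nrow = 3, byrow = TRUE,
  dimnames = list(c("root", "A", "B"), c("t1", "t2", "t3"))
)

# Assumed frequencies as a fraction of the corpus; the root covers all of it.
freq <- c(root = 1, A = 0.25, B = 0.5)
ic <- -log(freq)

# Resnik: information content of the most informative common subsumer.
toy_resnik <- function(a, b) {
  common <- subs[, a] == 1 & subs[, b] == 1
  if (!any(common)) return(0)
  max(ic[common])
}

# Jaccard: shared subsumers divided by all subsumers of either term.
toy_jaccard <- function(a, b) {
  sum(subs[, a] & subs[, b]) / sum(subs[, a] | subs[, b])
}

toy_resnik("t1", "t2")   # -log(0.25), contributed by A
toy_resnik("t1", "t3")   # 0: the only common subsumer is the root
toy_jaccard("t1", "t3")  # 1/3: non-zero although Resnik is zero
```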

@johnbradley
Contributor Author

Problem 9: The test currently samples 10 rows from the subsumer matrix (subs.mat):

s <- unique(c(sample(1:nrow(subs.mat), size = 10),
              match(phens$id, rownames(subs.mat))))
subs1 <- rownames(subs.mat)[s]
subs.mat1 <- subs.mat[s,]

Could the test be removing the only common subsumer for some terms?

Using the idea from #239 I checked jaccard similarity on the sampled subsumer matrix:

> min(jaccard_similarity(subs.mat1))
[1] 0
> min(jaccard_similarity(subs.mat))
[1] 0.02272727
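
The effect can be reproduced on a made-up matrix: once the only shared row is left out of the sample, the overlap between two terms drops to zero. `toy_jaccard` is a hypothetical stand-in, not rphenoscape's jaccard_similarity().

```r
# Made-up subsumer matrix: rows are subsumers, columns are terms.
full <- matrix(
  c(1, 1,   # root: the only subsumer shared by t1 and t2
    1, 0,   # A: subsumes t1 only
    0, 1),  # B: subsumes t2 only
  nrow = 3, byrow = TRUE,
  dimnames = list(c("root", "A", "B"), c("t1", "t2"))
)

# Stand-in Jaccard: shared subsumers divided by all subsumers of either term.
toy_jaccard <- function(m, a, b) sum(m[, a] & m[, b]) / sum(m[, a] | m[, b])

toy_jaccard(full, "t1", "t2")     # 1/3

# A row sample that happens to miss "root" removes the only common subsumer.
sampled <- full[c("A", "B"), , drop = FALSE]
toy_jaccard(sampled, "t1", "t2")  # 0
```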

@hlapp
Member

hlapp commented Sep 27, 2021

@johnbradley good catch, and it seems your check shows this to indeed be a (the?) problem. The subsampling is there because originally obtaining the frequencies took more time than seemed tolerable. If you disable the subsampling, does the runtime become prohibitive for a test suite?

You can disable the subsampling simply by reassigning subs.mat1 and commenting out as follows:

# subs1 <- rownames(subs.mat)[s] 
# subs.mat1 <- subs.mat[s,]
subs.mat1 <- subs.mat 
# rownames(subs.mat1) <- subs1

@johnbradley
Contributor Author

Problem 9: Even after removing subsampling, the test is still failing.
So I simplified the test to create a Resnik similarity grid for two phenotypes.

phenotype1: 'anatomical projection and (part_of some (posterior margin and (part_of some basihyal bone))) absent
 http://purl.org/phenoscape/expression?value=%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000051%3E+some+%0A++++%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0002000%3E%0A+++++and+%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0000052%3E+some+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0000468%3E%29%0A+++++and+%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0002503%3E+value+%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fsubclassof%3Fvalue%3D%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0004529%253E%250A%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%250A%2B%2B%2B%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBSPO_0000672%253E%250A%2B%2B%2B%2B%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0011618%253E%2529%2529%2529%23e1332d8d-9c88-4a4d-b2c4-04a424b481cd%3E%29%29%29%0A+and+%28%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fvocab.owl%23phenotype_of%3E+some+%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fsubclassof%3Fvalue%3D%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0004529%253E%250A%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%250A%2B%2B%2B%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBSPO_0000672%253E%250A%2B%2B%2B%2B%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0011618%253E%2529%2529%2529%23e1332d8d-9c88-4a4d-b2c4-04a424b481cd%3E%29

phenotype2: 'anterior margin and (part_of some basihyal bone) straight
 http://purl.org/phenoscape/expression?value=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000051%3E+some+%0A++++%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0002180%3E%0A+++++and+%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0000052%3E+some+%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fsubclassof%3Fvalue%3D%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBSPO_0000671%253E%250A%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0011618%253E%2529%23d91f9091-f506-4348-8caf-6760c015fbaa%3E%29%29

The code then produced the following grid:

Resnic Similarity matrix:
              [...absent] [...straight]
[...absent]   2.299398    0.000000
[...straight] 0.000000    2.600428

The above matrix is created by combining the subsumer matrix with the frequency values.

Below is the subsumer matrix with an additional nlog_term_freq column:

                                                             ...ome basihyal bone))) absent ...ome basihyal bone) straight nlog_term_freq
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0000464>                              0                              1 0.03164011
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0007844>                              0                              1 0.06703762
http://purl.org/phenoscape/ex...06-4348-8caf-6760c015fbaa>))                              0                              1 2.60042833
http://purl.org/phenoscape/ex...f506-4348-8caf-6760c015fbaa>                              0                              1 2.60042833
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0002513>                              0                              1 0.07863668
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0013702>                              0                              1 0.01047872
http://phenoscape.org/not/htt...9c88-4a4d-b2c4-04a424b481cd>                              1                              0 1.86006564
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0001630>                              0                              1 0.00000000
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0001555>                              0                              1 0.18545498
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0005884>                              0                              1 0.47170604
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0008895>                              0                              1 0.20597664
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0011615>                              0                              1 1.53973049
http://purl.org/phenoscape/ex...ibrary.org/obo/BSPO_0000006>                              0                              1 0.30376314
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0013701>                              0                              1 0.01047872
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0000468>                              1                              1 0.00000000
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0002418>                              0                              1 0.05388566
http://purl.org/phenoscape/ex...c88-4a4d-b2c4-04a424b481cd>)                              1                              0 2.29939833
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0002100>                              0                              1 0.01159660
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0000033>                              0                              1 0.18796778
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0011618>                              0                              1 1.64618582
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0004111>                              0                              1 0.03340196
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0011153>                              0                              1 0.62730047
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0011614>                              0                              1 1.27820903
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0001015>                              0                              1 0.00000000
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0010323>                              0                              1 0.19218836
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0000153>                              0                              1 0.07603220
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0001474>                              0                              1 0.03458051

In the table above, the only subsumer with 1 for both phenotypes is the 15th subsumer (ending in UBERON_0000468>). This term also has a -log(term_freq()) of 0. When calculating the Resnik similarity we multiply these three numbers together.

Details about 15th subsumer IRI:

row 15 subsumer IRI:http://purl.org/phenoscape/expression?value=%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fvocab.owl%23implies_presence_of%3E+some+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0000468%3E
row 15 decoded subsumer IRI:http://purl.org/phenoscape/expression?value=<http://purl.org/phenoscape/vocab.owl#implies_presence_of>+some+<http://purl.obolibrary.org/obo/UBERON_0000468>
Label for UBERON_0000468: "multicellular organism"
> term_freqs("http://purl.org/phenoscape/expression?value=%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fvocab.owl%23implies_presence_of%3E+some+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0000468%3E", as = "phenotype", corpus = "taxa")
[1] 1

So these two phenotypes have a common subsumer of "implies_presence_of multicellular organism", but this subsumer has a term frequency of 1, which the negative log converts to 0.
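The calculation can be sketched on a four-row excerpt of the table above. The 0/1 entries and nlog_term_freq values are copied from the table; the short row labels and the helper `resnik_grid` are made up for illustration and are not the package code. The assumption is that each cell is the maximum over subsumers of the product of the two phenotype columns and the -log frequency.

```r
# Four rows taken from the subsumer table above (row labels are shorthand).
subs <- matrix(c(1, 1,    # implies_presence_of some UBERON_0000468
                 1, 0,    # subsumer specific to the "absent" phenotype
                 0, 1,    # subsumer specific to the "straight" phenotype
                 0, 1),   # subsumer involving UBERON_0011618
               ncol = 2, byrow = TRUE,
               dimnames = list(c("presence_UBERON_0000468", "absent_specific",
                                 "straight_specific", "UBERON_0011618"),
                               c("absent", "straight")))
nlog_term_freq <- c(0, 2.29939833, 2.60042833, 1.64618582)

# Hypothetical helper: for each pair of phenotypes, multiply the two
# columns with the -log frequencies and keep the largest product.
resnik_grid <- function(m, ic) {
  n <- ncol(m)
  out <- matrix(0, n, n, dimnames = list(colnames(m), colnames(m)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      out[i, j] <- max(m[, i] * m[, j] * ic)
    }
  }
  out
}

grid <- resnik_grid(subs, nlog_term_freq)
round(grid, 6)
```

With these rows the diagonal comes out as 2.299398 and 2.600428 and the off-diagonal as 0, matching the grid above: the only row with 1 in both columns carries a -log frequency of 0.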

@hlapp
Member

hlapp commented Oct 1, 2021

It seems this shows that the cause of zero Resnik similarity is in the database. It's certainly expected that the frequency of "implies_presence_of some 'multicellular organism'" would be equal to the corpus size for corpus "taxa".

There are nevertheless two things that are surprising, but they're both presumably due to the database content and how it's generated. One is why anatomical projection and (part_of some (posterior margin and (part_of some basihyal bone))) absent implies (i.e., is a subclass of) "implies_presence_of some 'multicellular organism'". The other is why the two phenotypes don't have closer subsumers, for example "phenotype_of some 'anatomical projection'" and/or "phenotype_of some (part_of some 'basihyal bone')". @balhoff?

@balhoff
Member

balhoff commented Oct 1, 2021

@johnbradley which relations are you requesting for Problem 9? I think this may be the cause of missing common subsumers. Also I think I neglected to send you a suggested list to use, is that right?

@johnbradley
Contributor Author

which relations are you requesting for Problem 9?

When creating the subsumer matrix using the /similarity/matrix endpoint, we only specify the terms array, so we are using the default values for relations and path. I don't recall a suggested list.

You did give me some defaults for path, subject_filter_property, and subject_filter_value that are being used by the /similarity/frequency and /similarity/corpus_size endpoints.

@johnbradley
Contributor Author

Problem 5 is no longer occurring. Fixed by #235 (comment)

johnbradley added a commit that referenced this issue Jul 8, 2022
Removes sampling that occasionally caused failures implementing
fix suggested here: #235 (comment)

Fixes #246
johnbradley added a commit that referenced this issue Jul 11, 2022
Removes sampling that occasionally caused failures implementing
fix suggested here: #235 (comment)

Fixes #246
@hlapp hlapp added this to the v0.3.0 release milestone Jul 21, 2022
@johnbradley
Contributor Author

The KB v2-beta2 API was replaced by a new version (currently https://dev.phenoscape.org/api/v2-beta).
I tested the above issues with the current baseline-v0.3.0 branch, and the above tests all pass.

Note the issue mentioned here #235 (comment) was fixed by 10b2206.

Closing this issue since we aren't using the v2-beta2 KB API and the items found in this issue have been resolved.
