diff --git a/.github/workflows/main.yaml b/.github/workflows/main.yaml
index c84d568d3..0e1d8f676 100644
--- a/.github/workflows/main.yaml
+++ b/.github/workflows/main.yaml
@@ -80,7 +80,9 @@ jobs:
       #----------------------------------------------
       - name: Install dependencies
         # if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
-        run: poetry install --no-interaction --no-root
+        run: |
+          poetry add setuptools@latest
+          poetry install --no-interaction --no-root
       #----------------------------------------------
       # install your root project, if required
diff --git a/docs/guide/index.rst b/docs/guide/index.rst
index 3db8bdb28..b5f90e712 100644
--- a/docs/guide/index.rst
+++ b/docs/guide/index.rst
@@ -21,5 +21,6 @@ For a quick introduction, see the :ref:`tutorial`
    obsoletion
    relationships-and-graphs
    associations
+   similarity
    learning-more
diff --git a/docs/guide/relationships-and-graphs.rst b/docs/guide/relationships-and-graphs.rst
index c80146f9a..37e4f59f3 100644
--- a/docs/guide/relationships-and-graphs.rst
+++ b/docs/guide/relationships-and-graphs.rst
@@ -442,7 +442,3 @@ and in principle it *may* be the case that this is not asserted
 However, OAK is designed for working with *released* versions of ontologies,
 which should be *pre-classified*. This means that all edges that are both
 :term:`Direct` and :term:`Entailed` should also be :term:`Asserted`.
-
-
-
-
diff --git a/docs/guide/similarity.rst b/docs/guide/similarity.rst
new file mode 100644
index 000000000..cf50d35ce
--- /dev/null
+++ b/docs/guide/similarity.rst
@@ -0,0 +1,224 @@
+.. _similarity:
+
+Computing Similarity between Entities
+=====================================
+
+Background
+----------
+
+A common use case for ontologies is to compute :term:`Similarity` between :term:`Entities <Entity>`,
+or between :term:`Terms <Term>`, based on properties of the ontology. For example,
+when comparing two terms, we may want to score terms that are closer together in the ontology
+:term:`Graph` as more similar. Other measures of similarity may be based on the frequency of usage
+of terms, or on vector :term:`Embeddings <Embedding>` of terms.
+
+If :term:`Entities <Entity>` are annotated/associated with terms, we can also compute similarity
+between these based on the aggregate set of terms they are annotated with. The canonical use
+case for OAK here is :term:`Gene` associations (e.g. to phenotypes, GO terms, anatomy or cell
+types for expression, etc.), although any entities can be compared if ontology associations exist
+(e.g. comparing two people based on shared music genres or favorite foods).
+
+OAK provides various methods for computing similarity of terms and entities. These are currently
+focused on "classic" semantic similarity measures, such as Jaccard similarity and information-content
+based measures, but this may be extended in the future to support newer embedding-based methods.
+
+Like all aspects of OAK, there is a separation between :term:`Interface` and :term:`Implementation`.
+The :ref:`semantic similarity interface ` defines the API for computing similarity
+and related metrics. In theory there can be many implementations of this interface; in practice there
+is a default implementation and the semsimian implementation. See later in this document for details.
+
+Concepts
+--------
+
+Jaccard Similarity
+^^^^^^^^^^^^^^^^^^
+
+Jaccard similarity (`Wikipedia <https://en.wikipedia.org/wiki/Jaccard_index>`_) is a measure of
+similarity between two sets:
+
+.. math::
+
+    J(A,B) = \frac{|A \cap B|}{|A \cup B|}
+
+where :math:`A` and :math:`B` are sets.
+
+When applied to pairs of ontology terms, the sets are typically the :term:`Reflexive` :term:`Ancestors`
+of the terms of interest. Like all OAK graph operations, the choice of :term:`Predicates` used is
+important.
+
+- Phenotype ontologies typically compute this using :term:`SubClassOf` links only
+- Ontologies such as GO, Uberon, and ENVO typically use :term:`SubClassOf` and :term:`PartOf` links
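+
+As an illustrative sketch, term-term Jaccard similarity can be computed from reflexive ancestor
+sets via the Python API. The selector and GO IDs below are placeholders, and the exact
+``ancestors`` signature is an assumption to verify against the OboGraphInterface in your OAK
+version; ``setwise_jaccard_similarity`` is the helper in ``oaklib.utilities.semsim.similarity_utils``:
+
+.. code-block:: python
+
+    from oaklib import get_adapter
+    from oaklib.datamodels.vocabulary import IS_A, PART_OF
+    from oaklib.utilities.semsim.similarity_utils import setwise_jaccard_similarity
+
+    adapter = get_adapter("sqlite:obo:go")  # placeholder selector
+
+    # reflexive ancestor sets over is-a and part-of, per the guidance above
+    nucleus_ancs = set(adapter.ancestors("GO:0005634", predicates=[IS_A, PART_OF], reflexive=True))
+    envelope_ancs = set(adapter.ancestors("GO:0005635", predicates=[IS_A, PART_OF], reflexive=True))
+
+    print(setwise_jaccard_similarity(nucleus_ancs, envelope_ancs))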
+
+Information Content
+^^^^^^^^^^^^^^^^^^^
+
+Information Content (IC) is a measure of the specificity of a term. It is typically computed as
+the log of the inverse of the frequency of the term in a corpus:
+
+.. math::
+
+    IC(t) = -\log(P(t))
+
+where :math:`P(t)` is the probability of the term in the corpus. There are two ways to compute this:
+
+- Using the ontology as the corpus
+- Using :term:`Associations <Association>` between entities and terms as the corpus
+
+In both cases the ontology graph is used.
+
+When the ontology is used as the corpus, the probability of a term is computed as the number of
+descendants of the term (reflexive) divided by the total number of terms in the ontology:
+
+.. math::
+
+    P(t) = \frac{|\mathrm{Desc}^*(t)|}{|T|}
+
+When associations are used, the probability of a term is computed as the number of entities
+associated with the term (directly or indirectly) divided by the total number of entities:
+
+.. math::
+
+    P(t) = \frac{|E(t)|}{|E|}
+
+where :math:`E(t)` is the set of entities associated with :math:`t` and :math:`E` is the set of all entities.
+
+To build a table of IC values for all terms in HPO, using the ontology as the corpus:
+
+.. code-block:: bash
+
+    runoak -i sqlite:obo:hp information-content -p i i^HP:
+
+To include associations, use the ``--use-associations`` flag, in addition to
+specifying the associations themselves:
+
+.. code-block:: bash
+
+    runoak -g phenotype.hpoa -G hpoa -i sqlite:obo:hp information-content -p i --use-associations i^HP:
+
+IC can be used to score the :term:`Most Recent Common Ancestor` (MRCA) of two terms; this is also
+known as Resnik similarity.
+
+Aggregate measures for comparing entities
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Computing similarity between terms (classes) is relatively easy. But in general users want to compare
+*entities* such as genes. Here we broadly have two approaches:
+
+- Knowledge Graph (KG): treat the entities as nodes in the graph and use the standard ontology methods above
+- Aggregate statistics
+
+With the KG approach the entities typically become leaf nodes in the graph, and we compare using standard
+graph methods, where both association predicates (e.g. has-phenotype) and ontology predicates
+(e.g. is-a and part-of) are used.
+
+The more common approach is to use aggregate statistics, which better account for incomplete annotation
+and annotation bias. Here we have two entities to compare, :math:`e_1` and :math:`e_2`, where each entity
+is the subject of association edges pointing to terms (the set of terms for an entity is sometimes called
+its *profile*). The goal is to compare the profiles using aggregate statistics, which can be computed
+over any individual pairwise term statistic.
+
+The simplest approach is to take the maximum pairwise similarity between terms in the two profiles;
+for the IC of the MRCA this would be:
+
+.. math::
+
+    \max_{t_1 \in e_1, t_2 \in e_2} IC(MRCA(t_1, t_2))
+
+However, this doesn't take into account all the other associations for each entity;
+a minimal sketch of this max-pairwise aggregate is shown below.
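+
+The sketch assumes a precomputed IC table (e.g. built with the ``information-content`` command
+above); the selector, IC file path, and HP term IDs are placeholders, and the
+``most_recent_common_ancestors`` call signature is an assumption to verify against your OAK version:
+
+.. code-block:: python
+
+    from itertools import product
+
+    from oaklib import get_adapter
+    from oaklib.datamodels.vocabulary import IS_A
+    from oaklib.utilities.semsim.similarity_utils import load_information_content_map
+
+    adapter = get_adapter("sqlite:obo:hp")  # placeholder selector
+    ic_map = load_information_content_map("hp.ic.tsv")  # placeholder IC table
+
+    def max_pairwise_resnik(profile1, profile2):
+        """Max of IC(MRCA(t1, t2)) over all term pairs drawn from the two profiles."""
+        best = 0.0
+        for t1, t2 in product(profile1, profile2):
+            # phenotype ontologies typically use is-a links only (see above)
+            for mrca in adapter.most_recent_common_ancestors(t1, t2, predicates=[IS_A]):
+                best = max(best, ic_map.get(mrca, 0.0))
+        return best
+
+    print(max_pairwise_resnik(["HP:0001250", "HP:0001263"], ["HP:0001249"]))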
+
+A more common aggregate statistic is the Best Match Average (BMA), which is the average of the maximum
+pairwise similarity between each term in the first profile and the second profile:
+
+.. math::
+
+    BMA(e_1, e_2) = \frac{1}{|e_1|} \sum_{t_1 \in e_1} \max_{t_2 \in e_2} IC(MRCA(t_1, t_2))
+
+Currently this is the default used in OAK.
+
+To compare two profiles in OAK you can use the ``termset-similarity`` command, passing in each
+profile, separated by ``@``:
+
+.. code-block:: bash
+
+    runoak -i sqlite:obo:mp termset-similarity \
+      MP:0010771 MP:0002169 MP:0005391 MP:0005389 MP:0005367 @\
+      MP:0010771 MP:0002169 MP:0005391 MP:0005389 MP:0005367
+
+(note that like all OAK commands, any expression can be substituted for the arguments, e.g.
+``.idfile`` to load the IDs from a file)
+
+The result conforms to `TermSetPairwiseSimilarity `_.
+
+This assumes that you have already looked up each entity's profile, e.g. via an association table.
+
+Note that in the above, the IC scores are calculated using only the ontology as the corpus.
+You can pass in a pre-generated IC table (e.g. if you computed this using a particular association
+database) using the ``--information-content-file`` option.
+
+Vector-based approaches
+^^^^^^^^^^^^^^^^^^^^^^^
+
+The methods above are formulated in terms of *sets* of ontology terms.
+
+A term can also be conceived of as a *vector*. The simplest representation is a one-hot vector for
+each term, with a bit set for every ancestor of that term. Entities can likewise be represented as
+vectors over their profile and all ancestors of that profile.
+
+With a vector representation, vector-based measures such as cosine similarity can be used.
+These are typically faster to compute, and libraries such as numpy can be used to
+efficiently compute all-by-all similarities.
+
+One-hot encodings are typically long if the ontology is large (one element per term). More recent
+methods make use of *reduced-dimensionality vectors*. These might be computed from the graph
+(either the pure ontology graph, or the KG formed by combining associations and the ontology graph),
+or from textual descriptions of the terms using text embedding models.
+
+Currently OAK does not support these reduced-dimensionality vectors; for now you can use libraries
+such as:
+
+- `GRAPE <https://github.com/AnacletoLAB/grape>`_ for KG embedding and ML
+- `CurateGPT <https://github.com/monarch-initiative/curategpt>`_ for operations using text embeddings
+  over ontologies.
+
+Implementations
+---------------
+
+Default Implementation
+^^^^^^^^^^^^^^^^^^^^^^
+
+Currently the default implementation in OAK is in pure Python, and may be slow for
+large ontologies or association databases.
+
+Ubergraph
+^^^^^^^^^
+
+Ubergraph has ICs pre-computed, using multiple ontologies as a corpus.
+
+Semsimian
+^^^^^^^^^
+
+`Semsimian <https://github.com/monarch-initiative/semsimian>`_ is a Rust implementation of
+semantic similarity. OAK can use it as an implementation, *wrapping* an existing adapter.
+
+To wrap an existing adapter using semsimian, prefix the selector with ``semsimian:``.
+
+For example:
+
+.. code-block:: bash
+
+    runoak -i semsimian:sqlite:obo:hp similarity .all @ .all -O csv
+
+Note that semsimian is under active development and performance characteristics may change.
+
+Data Model
+~~~~~~~~~~
+
+See the `Similarity data model `_ for details of the data model.
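+
+Python usage
+------------
+
+The commands above can also be used programmatically. The following is a minimal sketch that loads
+a pre-generated IC table and runs an all-by-all comparison via the semantic similarity interface;
+the selector and term lists are placeholders (the IC file shown is a test fixture from the OAK
+repository):
+
+.. code-block:: python
+
+    from oaklib import get_adapter
+    from oaklib.datamodels.vocabulary import IS_A, PART_OF
+    from oaklib.utilities.semsim.similarity_utils import load_information_content_map
+
+    adapter = get_adapter("sqlite:obo:go")  # placeholder selector
+    # reuse a pre-generated IC table (TSV with id and information_content columns)
+    adapter.cached_information_content_map = load_information_content_map(
+        "tests/input/go-nucleus.ic.tsv"
+    )
+
+    set1 = ["GO:0005634", "GO:0005635"]  # nucleus, nuclear envelope
+    set2 = ["GO:0005773"]  # vacuole
+    for sim in adapter.all_by_all_pairwise_similarity(set1, set2, predicates=[IS_A, PART_OF]):
+        print(sim.subject_id, sim.object_id, sim.ancestor_id, sim.jaccard_similarity)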
+ + +Further reading +--------------- + +- Lord et al, Semantic similarity measures as tools for exploring the gene ontology https://pubmed.ncbi.nlm.nih.gov/12603061/ +- Koehler et al, Clinical diagnostics in human genetics with semantic similarity searches in ontologies, https://pubmed.ncbi.nlm.nih.gov/19800049/ \ No newline at end of file diff --git a/poetry.lock b/poetry.lock index bd74daedb..b7deef7fc 100644 --- a/poetry.lock +++ b/poetry.lock @@ -6282,4 +6282,4 @@ seaborn = [] [metadata] lock-version = "2.0" python-versions = ">=3.9,<4.0.0" -content-hash = "c9da408969e09fefeb0c97dd611ab30cdb67ebe12209c1b22c4e817a9cb177bb" +content-hash = "72c1185913e07dc8fa640bf531813cec9b1a49d826c5a2641b9f4632d9a64818" diff --git a/pyproject.toml b/pyproject.toml index d67f16c7f..c4439567e 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -8,7 +8,7 @@ readme = "README.md" [tool.poetry.dependencies] python = ">=3.9,<4.0.0" -curies = ">=0.5.5" +curies = ">=0.6.6" pronto = ">=2.5.0" SPARQLWrapper = "*" SQLAlchemy = ">=1.4.32" diff --git a/src/oaklib/cli.py b/src/oaklib/cli.py index 8e8345e0d..70a200062 100644 --- a/src/oaklib/cli.py +++ b/src/oaklib/cli.py @@ -111,6 +111,7 @@ from oaklib.interfaces.merge_interface import MergeInterface from oaklib.interfaces.metadata_interface import MetadataInterface from oaklib.interfaces.obograph_interface import GraphTraversalMethod, OboGraphInterface +from oaklib.interfaces.ontology_generator_interface import OntologyGenerationInterface from oaklib.interfaces.owl_interface import AxiomFilter, OwlInterface from oaklib.interfaces.patcher_interface import PatcherInterface from oaklib.interfaces.search_interface import SearchInterface @@ -183,6 +184,7 @@ shortest_paths, trim_graph, ) +from oaklib.utilities.semsim.similarity_utils import load_information_content_map from oaklib.utilities.subsets.slimmer_utils import ( filter_redundant, roll_up_to_named_subset, @@ -2690,6 +2692,10 @@ def similarity_pair(terms, predicates, autolabel: bool, output: TextIO, output_t show_default=True, help="Score used for summarization", ) +@click.option( + "--information-content-file", + help="File containing information content for each term", +) @autolabel_option @output_type_option @click.argument("terms", nargs=-1) @@ -2702,6 +2708,7 @@ def similarity( min_jaccard_similarity: Optional[float], min_ancestor_information_content: Optional[float], main_score_field, + information_content_file, output_type, output, ): @@ -2765,46 +2772,47 @@ def similarity( logging.info(f"file={writer.file} {type(writer.output)}") if main_score_field and isinstance(writer, HeatmapWriter): writer.value_field = main_score_field - if isinstance(impl, SemanticSimilarityInterface): - set1it = None - set2it = None - if not (set1_file or set2_file): - terms = list(terms) - ix = terms.index("@") - logging.info(f"Splitting terms {terms} on {ix}") - set1it = query_terms_iterator(terms[0:ix], impl) - set2it = query_terms_iterator(terms[ix + 1 :], impl) - else: - if set1_file: - logging.info(f"Getting set1 from {set1_file}") - with open(set1_file) as file: - set1it = list(curies_from_file(file)) - else: - set1it = query_terms_iterator(terms, impl) - if set2_file: - logging.info(f"Getting set2 from {set2_file}") - with open(set2_file) as file: - set2it = list(curies_from_file(file)) - else: - set2it = query_terms_iterator(terms, impl) - actual_predicates = _process_predicates_arg(predicates) - for sim in impl.all_by_all_pairwise_similarity( - set1it, - set2it, - predicates=actual_predicates, - 
min_jaccard_similarity=min_jaccard_similarity, - min_ancestor_information_content=min_ancestor_information_content, - ): - if autolabel: - # TODO: this can be made more efficient - sim.subject_label = impl.label(sim.subject_id) - sim.object_label = impl.label(sim.object_id) - sim.ancestor_label = impl.label(sim.ancestor_id) - writer.emit(sim) - writer.finish() - writer.file.close() - else: + if not isinstance(impl, SemanticSimilarityInterface): raise NotImplementedError(f"Cannot execute this using {impl} of type {type(impl)}") + if information_content_file: + impl.cached_information_content_map = load_information_content_map(information_content_file) + set1it = None + set2it = None + if not (set1_file or set2_file): + terms = list(terms) + ix = terms.index("@") + logging.info(f"Splitting terms {terms} on {ix}") + set1it = query_terms_iterator(terms[0:ix], impl) + set2it = query_terms_iterator(terms[ix + 1 :], impl) + else: + if set1_file: + logging.info(f"Getting set1 from {set1_file}") + with open(set1_file) as file: + set1it = list(curies_from_file(file)) + else: + set1it = query_terms_iterator(terms, impl) + if set2_file: + logging.info(f"Getting set2 from {set2_file}") + with open(set2_file) as file: + set2it = list(curies_from_file(file)) + else: + set2it = query_terms_iterator(terms, impl) + actual_predicates = _process_predicates_arg(predicates) + for sim in impl.all_by_all_pairwise_similarity( + set1it, + set2it, + predicates=actual_predicates, + min_jaccard_similarity=min_jaccard_similarity, + min_ancestor_information_content=min_ancestor_information_content, + ): + if autolabel: + # TODO: this can be made more efficient + sim.subject_label = impl.label(sim.subject_id) + sim.object_label = impl.label(sim.object_id) + sim.ancestor_label = impl.label(sim.ancestor_id) + writer.emit(sim) + writer.finish() + writer.file.close() @main.command() @@ -2812,12 +2820,17 @@ def similarity( @output_option @output_type_option @autolabel_option +@click.option( + "--information-content-file", + help="File containing information content for each term", +) @click.argument("terms", nargs=-1) def termset_similarity( terms, predicates, autolabel, output_type, + information_content_file, output: TextIO, ): """ @@ -2842,6 +2855,8 @@ def termset_similarity( writer.output = output if not isinstance(impl, SemanticSimilarityInterface): raise NotImplementedError(f"Cannot execute this using {impl} of type {type(impl)}") + if information_content_file: + impl.cached_information_content_map = load_information_content_map(information_content_file) terms = list(terms) ix = terms.index("@") set1 = list(query_terms_iterator(terms[0:ix], impl)) @@ -2856,6 +2871,66 @@ def termset_similarity( writer.finish() +@main.command() +@click.argument("terms", nargs=-1) +@output_option +@output_type_option +@predicates_option +@click.option( + "--use-associations/--no-use-associations", + default=False, + show_default=True, + help="Use associations to calculate IC", +) +def information_content( + terms, + predicates, + output: TextIO, + output_type: str, + use_associations: bool, +): + """ + Show information content for term or list of terms + + Example: + + runoak -i cl.db information-content -p i .all + + Like all OAK commands that operate over graphs, the graph traversal is controlled + by the `--predicates` option. 
In the above case, the frequency of each term is equal to + the number of reflexive is-a descendants of the term divided by total number of terms + + By default, the ontology is used as the corpus for computing term frequency. + + You can use an association file as the corpus: + + runoak -g hpoa.tsv -G hpoa -i hp.db information-content -p i --use-associations .all + """ + impl = settings.impl + writer = _get_writer(output_type, impl, StreamingCsvWriter) + writer.file = output + if not isinstance(impl, SemanticSimilarityInterface): + raise NotImplementedError(f"Cannot execute this with {type(impl)}") + if len(terms) == 0: + raise ValueError("You must specify a list of terms. Use '.all' for all terms") + actual_predicates = _process_predicates_arg(predicates) + n = 0 + logging.info("Fetching ICs...") + for curie_it in chunk(query_terms_iterator(terms, impl)): + logging.info("** Next chunk:") + n += 1 + for curie, ic in impl.information_content_scores( + curie_it, + object_closure_predicates=actual_predicates, + use_associations=use_associations, + ): + obj = dict(id=curie, information_content=ic) + writer.emit(obj) + if n == 0: + raise ValueError(f"No results for input: {terms}") + writer.finish() + + @main.command() @click.argument("terms", nargs=-1) @output_option @@ -6081,6 +6156,80 @@ def generate_synonyms(terms, rules_file, apply_patch, patch, patch_format, outpu _apply_changes(impl, change_list) +@main.command() +@click.argument("terms", nargs=-1) +@click.option( + "--style-hints", + help="Description of style for definitions", +) +@click.option( + "--apply-patch/--no-apply-patch", + default=False, + show_default=True, + help="Apply KGCL syntax.", +) +@click.option( + "--patch", + type=click.File(mode="w"), + default=sys.stdout, + help="Path to where patch file will be written.", +) +@click.option( + "--patch-format", + help="Output syntax for patches.", +) +@output_option +@output_type_option +def generate_definitions(terms, apply_patch, patch, patch_format, output, output_type, **kwargs): + """ + Generate definitions for a term or terms. + + Currently this only works with the llm extension. + + Example: + + runoak -i llm:sqlite:obo:foodon generate-definitions FOODON:03315258 + + The --style-hints option can be used to provide hints to the definition generator. + + Example: + + runoak -i llm:sqlite:obo:foodon generate-definitions FOODON:03315258 \ + --style-hints "Write the definition in the style of a pretentious food critic" + + Generates: + + "The pancake, a humble delight in the realm of breakfast fare, + presents itself as a delectable disc of gastronomic delight..." 
+ + """ + impl = settings.impl + if apply_patch: + writer = _get_writer(patch_format, impl, StreamingKGCLWriter, kgcl) + writer.output = patch + else: + writer = _get_writer(output_type, impl, StreamingKGCLWriter, kgcl) + writer.output = output + if not isinstance(impl, OntologyGenerationInterface): + raise NotImplementedError + all_terms = query_terms_iterator(terms, impl) + curie_defns = impl.generate_definitions(list(all_terms), **kwargs) + change_list = [] + for curie, defn in curie_defns: + change = kgcl.NewTextDefinition( + id="kgcl_change_id_" + str(curie), + about_node=curie, + new_value=defn.val, + ) + change_list.append(change) + writer.emit(change) + writer.finish() + if apply_patch and len(change_list) > 0: + if output: + impl.resource.slug = output + _apply_changes(impl, change_list) + + @main.command() @click.argument("terms", nargs=-1) @click.option( diff --git a/src/oaklib/converters/logical_definition_flattener.py b/src/oaklib/converters/logical_definition_flattener.py index f8e23358c..412f68af8 100644 --- a/src/oaklib/converters/logical_definition_flattener.py +++ b/src/oaklib/converters/logical_definition_flattener.py @@ -61,8 +61,7 @@ def convert( return obj def _curie(self, iri: str) -> CURIE: - curie = self.curie_converter.compress(iri) - return curie if curie else iri + return self.curie_converter.compress(iri, passthrough=True) @lru_cache def _property_label(self, predicate: CURIE) -> str: diff --git a/src/oaklib/converters/obo_graph_to_cx_converter.py b/src/oaklib/converters/obo_graph_to_cx_converter.py index 7ea830963..59fd9df59 100644 --- a/src/oaklib/converters/obo_graph_to_cx_converter.py +++ b/src/oaklib/converters/obo_graph_to_cx_converter.py @@ -71,8 +71,4 @@ def convert(self, source: Union[GraphDocument, Graph], target: Dict = None, **kw def _id(self, uri: CURIE) -> CURIE: if not self.curie_converter: return uri - curie = self.curie_converter.compress(uri) - if curie is None: - return uri - else: - return curie + return self.curie_converter.compress(uri, passthrough=True) diff --git a/src/oaklib/converters/obo_graph_to_fhir_converter.py b/src/oaklib/converters/obo_graph_to_fhir_converter.py index 323602bdb..da2bf03f9 100644 --- a/src/oaklib/converters/obo_graph_to_fhir_converter.py +++ b/src/oaklib/converters/obo_graph_to_fhir_converter.py @@ -192,11 +192,7 @@ def code(self, uri: CURIE) -> str: """ if not self.curie_converter: return uri - curie = self.curie_converter.compress(uri) - if curie is None: - return uri - else: - return curie + return self.curie_converter.compress(uri, passthrough=True) def _convert_graph( self, diff --git a/src/oaklib/converters/obo_graph_to_obo_format_converter.py b/src/oaklib/converters/obo_graph_to_obo_format_converter.py index 672820741..a539dd544 100644 --- a/src/oaklib/converters/obo_graph_to_obo_format_converter.py +++ b/src/oaklib/converters/obo_graph_to_obo_format_converter.py @@ -99,17 +99,15 @@ def convert(self, source: GraphDocument, target: OboDocument = None, **kwargs) - if target is None: target = OboDocument() for g in source.graphs: + logging.info(f"Converting graph {g.id}, nodes: {len(g.nodes)}, edges: {len(g.edges)}") self._convert_graph(g, target=target) + logging.info(f"Converted {len(target.stanzas)} stanzas") return target def _id(self, uri_or_curie: CURIE) -> CURIE: if not self.curie_converter: return uri_or_curie - curie = self.curie_converter.compress(uri_or_curie) - if curie is None: - return uri_or_curie - else: - return curie + return self.curie_converter.compress(uri_or_curie, passthrough=True) def 
_predicate_id(self, uri_or_curie: CURIE, target: OboDocument) -> CURIE: curie = self._id(uri_or_curie) diff --git a/src/oaklib/converters/obo_graph_to_rdf_owl_converter.py b/src/oaklib/converters/obo_graph_to_rdf_owl_converter.py index 7ca453f2a..767781d30 100644 --- a/src/oaklib/converters/obo_graph_to_rdf_owl_converter.py +++ b/src/oaklib/converters/obo_graph_to_rdf_owl_converter.py @@ -159,7 +159,5 @@ def _add_reified( def _uri_ref(self, curie: CURIE) -> rdflib.URIRef: if self.curie_converter is None: self.curie_converter = BasicOntologyInterface().converter - uri = self.curie_converter.expand(curie) - if uri is None: - uri = curie + uri = self.curie_converter.expand(curie, passthrough=True) return rdflib.URIRef(uri) diff --git a/src/oaklib/implementations/llm_implementation.py b/src/oaklib/implementations/llm_implementation.py index c8cb0bc3b..75b281e13 100644 --- a/src/oaklib/implementations/llm_implementation.py +++ b/src/oaklib/implementations/llm_implementation.py @@ -2,11 +2,14 @@ import json import logging from dataclasses import dataclass -from typing import TYPE_CHECKING, Iterator, List +from typing import TYPE_CHECKING, Iterator, List, Tuple +from oaklib.datamodels.obograph import DefinitionPropertyValue from oaklib.datamodels.text_annotator import TextAnnotation, TextAnnotationConfiguration -from oaklib.interfaces import TextAnnotatorInterface +from oaklib.interfaces import OboGraphInterface, TextAnnotatorInterface +from oaklib.interfaces.ontology_generator_interface import OntologyGenerationInterface from oaklib.interfaces.text_annotator_interface import TEXT +from oaklib.types import CURIE if TYPE_CHECKING: import llm @@ -17,10 +20,10 @@ @dataclass -class LLMImplementation(TextAnnotatorInterface): +class LLMImplementation(TextAnnotatorInterface, OntologyGenerationInterface): """Perform named entity normalization on LLM.""" - grounder: TextAnnotatorInterface = None + wrapped_adapter: TextAnnotatorInterface = None """A wrapped annotator used to ground NEs. 
""" @@ -46,10 +49,14 @@ def __post_init__(self): logging.info(f"LLM implementation will use grounder: {slug}") from oaklib import get_adapter - self.grounder = get_adapter(slug) + self.wrapped_adapter = get_adapter(slug) if self.model_id is not None: self.model = llm.get_model(self.model_id) + def entities(self, **kwargs) -> Iterator[CURIE]: + """Return all entities in the ontology.""" + yield from self.wrapped_adapter.entities(**kwargs) + def annotate_text( self, text: TEXT, configuration: TextAnnotationConfiguration = None ) -> Iterator[TextAnnotation]: @@ -60,10 +67,22 @@ def annotate_text( raise NotImplementedError("LLM does not support whole-text matching") else: logging.info("Delegating directly to grounder, bypassing LLM") - yield from self.grounder.annotate_text(text, configuration) + yield from self.wrapped_adapter.annotate_text(text, configuration) else: yield from self._llm_annotate(text, configuration) + def get_model(self): + model = self.model + if not self.model: + # model_id = self.configuration.model or self.model_id + model_id = self.model_id + if not model_id: + model_id = self.default_model_id + import llm + + model = llm.get_model(model_id) + return model + def _llm_annotate( self, text: str, @@ -89,13 +108,15 @@ def _llm_annotate( term = term_obj["term"] category = term_obj["category"] ann = TextAnnotation(subject_label=term, object_categories=[category]) - matches = list(self.grounder.annotate_text(term, grounder_configuration)) + matches = list(self.wrapped_adapter.annotate_text(term, grounder_configuration)) if not matches: aliases = self._suggest_aliases( term, model, configuration.categories, configuration ) for alias in aliases: - matches = list(self.grounder.annotate_text(alias, grounder_configuration)) + matches = list( + self.wrapped_adapter.annotate_text(alias, grounder_configuration) + ) if matches: break logging.info(f"Aliases={aliases}; matches={matches}") @@ -157,3 +178,22 @@ def _suggest_aliases( response = model.prompt(term, system=prompt).text() logging.info(f"LLM aliases[{term}] => {response}") return [x.strip() for x in response.split(";")] + + def generate_definitions( + self, curies: List[CURIE], style_hints="", **kwargs + ) -> Iterator[Tuple[CURIE, DefinitionPropertyValue]]: + """Suggest definitions for the given curies.""" + wrapped_adapter = self.wrapped_adapter + model = self.get_model() + if not isinstance(wrapped_adapter, OboGraphInterface): + raise NotImplementedError("LLM can only suggest definitions for OBO graphs") + if style_hints is None: + style_hints = "" + for curie in curies: + node = wrapped_adapter.node(curie) + info = f"id: {curie}\n" + info += f"label: {node.lbl}\n" + system_prompt = "Provide a textual definition for the given term." 
+ system_prompt += style_hints + response = model.prompt(info, system=system_prompt).text() + yield curie, DefinitionPropertyValue(val=response) diff --git a/src/oaklib/implementations/pronto/pronto_implementation.py b/src/oaklib/implementations/pronto/pronto_implementation.py index dc68630ab..f4da44b7d 100644 --- a/src/oaklib/implementations/pronto/pronto_implementation.py +++ b/src/oaklib/implementations/pronto/pronto_implementation.py @@ -752,7 +752,7 @@ def synonym_property_values( def logical_definitions( self, - subjects: Optional[Iterable[CURIE]], + subjects: Optional[Iterable[CURIE]] = None, predicates: Iterable[PRED_CURIE] = None, objects: Iterable[CURIE] = None, **kwargs, diff --git a/src/oaklib/implementations/simpleobo/simple_obo_implementation.py b/src/oaklib/implementations/simpleobo/simple_obo_implementation.py index 24e6fbcfa..61b48f461 100644 --- a/src/oaklib/implementations/simpleobo/simple_obo_implementation.py +++ b/src/oaklib/implementations/simpleobo/simple_obo_implementation.py @@ -732,6 +732,8 @@ def logical_definitions( objects: Iterable[CURIE] = None, **kwargs, ) -> Iterable[LogicalDefinitionAxiom]: + if subjects is None: + subjects = self.entities() for s in subjects: t = self._stanza(s, strict=False) if not t: diff --git a/src/oaklib/implementations/sqldb/sql_implementation.py b/src/oaklib/implementations/sqldb/sql_implementation.py index a4223cb38..1838a2aa0 100644 --- a/src/oaklib/implementations/sqldb/sql_implementation.py +++ b/src/oaklib/implementations/sqldb/sql_implementation.py @@ -127,6 +127,7 @@ DEFINITION, LANGUAGE_TAG, METADATA_MAP, + METADATA_STATEMENT, PRED_CURIE, PREFIX_MAP, RELATIONSHIP, @@ -312,6 +313,7 @@ class SqlImplementation( max_items_for_in_clause: int = field(default_factory=lambda: 1000) can_store_associations: bool = False + """True if the underlying sqlite database has term_association populated.""" def __post_init__(self): if self.engine is None: @@ -670,6 +672,33 @@ def entity_metadata_map(self, curie: CURIE, include_all_triples=False) -> METADA self.add_missing_property_values(curie, m) return dict(m) + def entities_metadata_statements( + self, + curies: Iterable[CURIE], + predicates: Optional[List[PRED_CURIE]] = None, + include_nested_metadata=False, + **kwargs, + ) -> Iterator[METADATA_STATEMENT]: + q = self.session.query(Statements) + if not include_nested_metadata: + subquery = self.session.query(RdfTypeStatement.subject).filter( + RdfTypeStatement.object == "owl:AnnotationProperty" + ) + annotation_properties = {row.subject for row in subquery} + annotation_properties = annotation_properties.union(STANDARD_ANNOTATION_PROPERTIES) + q = q.filter(Statements.predicate.in_(tuple(annotation_properties))) + q = q.filter(Statements.subject.in_(curies)) + if predicates is not None: + q = q.filter(Statements.predicate.in_(predicates)) + for row in q: + if row.value is not None: + v = _python_value(row.value, row.datatype) + elif row.object is not None: + v = row.object + else: + v = None + yield row.subject, row.predicate, v, row.datatype, {} + def ontologies(self) -> Iterable[CURIE]: for row in self.session.query(OntologyNode): yield row.id @@ -1166,6 +1195,7 @@ def associations(self, *args, **kwargs) -> Iterator[Association]: logging.info("Using base method") yield from super().associations(*args, **kwargs) return + logging.info("Using SQL queries") q = self._associations_query(*args, **kwargs) for row_it in chunk(q): assocs = [] @@ -1226,7 +1256,7 @@ def _associations_query( q = q.filter(TermAssociation.object.in_(subquery)) else: q 
= q.filter(TermAssociation.object.in_(tuple(objects))) - logging.info(f"Association query: {q}") + logging.info(f"_associations_query: {q}") return q def add_associations( @@ -1236,6 +1266,7 @@ def add_associations( **kwargs, ) -> bool: if not self.can_store_associations: + logging.info("Using base method to store associations") return super().add_associations(associations, normalizers=normalizers) for a in associations: if normalizers: @@ -2063,9 +2094,24 @@ def information_content_scores( predicates: List[PRED_CURIE] = None, object_closure_predicates: List[PRED_CURIE] = None, use_associations: bool = None, + **kwargs, ) -> Iterator[Tuple[CURIE, float]]: curies = list(curies) + if self.cached_information_content_map: + yield from super().information_content_scores(curies) + return if use_associations: + logging.info("Using associations to calculate IC") + if not self.can_store_associations: + yield from super().information_content_scores( + curies, + predicates=predicates, + object_closure_predicates=object_closure_predicates, + use_associations=use_associations, + **kwargs, + ) + return + # raise ValueError("Cannot use associations, not stored") q = self.session.query(EntailedEdge.object, func.count(TermAssociation.subject)) q = q.filter(EntailedEdge.subject == TermAssociation.object) if curies is not None: @@ -2075,6 +2121,7 @@ def information_content_scores( if object_closure_predicates: q = q.filter(EntailedEdge.predicate.in_(object_closure_predicates)) q = q.group_by(EntailedEdge.object) + logging.info(f"QUERY: {q}") num_nodes = ( self.session.query(TermAssociation).distinct(TermAssociation.subject).count() ) diff --git a/src/oaklib/interfaces/association_provider_interface.py b/src/oaklib/interfaces/association_provider_interface.py index a500de354..f5dc44e67 100644 --- a/src/oaklib/interfaces/association_provider_interface.py +++ b/src/oaklib/interfaces/association_provider_interface.py @@ -211,7 +211,7 @@ def _inject_subject_labels(self, association_iterator: Iterable[Association]): association.subject_label = label_map[association.subject] yield from associations - def associations_subjects(self, **kwargs) -> Iterator[CURIE]: + def associations_subjects(self, *args, **kwargs) -> Iterator[CURIE]: """ Yields all distinct subjects. 
@@ -231,7 +231,7 @@ def associations_subjects(self, **kwargs) -> Iterator[CURIE]: """ # individual implementations should override this to be more efficient yielded = set() - for a in self.associations(**kwargs): + for a in self.associations(*args, **kwargs): s = a.subject if s in yielded: continue diff --git a/src/oaklib/interfaces/basic_ontology_interface.py b/src/oaklib/interfaces/basic_ontology_interface.py index cc96a1e5e..1f45a82e5 100644 --- a/src/oaklib/interfaces/basic_ontology_interface.py +++ b/src/oaklib/interfaces/basic_ontology_interface.py @@ -265,9 +265,7 @@ def uri_to_curie( :param use_uri_fallback: if cannot be contracted, use the URI as a CURIE proxy [default: True] :return: contracted URI, or original URI if no contraction possible """ - rv = self.converter.compress(uri) - if use_uri_fallback: - strict = False + rv = self.converter.compress(uri, passthrough=use_uri_fallback) if rv is None and strict: prefix_map_text = "\n".join( f" {prefix} -> {uri_prefix}" @@ -277,8 +275,6 @@ def uri_to_curie( f"{self.__class__.__name__}.prefix_map() does not support compressing {uri}.\n" f"This ontology interface contains {len(self.prefix_map()):,} prefixes:\n{prefix_map_text}" ) - if rv is None and use_uri_fallback: - return uri return rv @property @@ -1421,7 +1417,11 @@ def entity_metadata_map(self, curie: CURIE) -> METADATA_MAP: raise NotImplementedError def entities_metadata_statements( - self, curies: Iterable[CURIE], predicates: Optional[List[PRED_CURIE]] = None + self, + curies: Iterable[CURIE], + predicates: Optional[List[PRED_CURIE]] = None, + include_nested_metadata=False, + **kwargs, ) -> Iterator[METADATA_STATEMENT]: """ Retrieve metadata statements (entity annotations) for a collection of entities. diff --git a/src/oaklib/interfaces/ontology_generator_interface.py b/src/oaklib/interfaces/ontology_generator_interface.py new file mode 100644 index 000000000..bf786a79b --- /dev/null +++ b/src/oaklib/interfaces/ontology_generator_interface.py @@ -0,0 +1,29 @@ +from abc import ABC +from typing import Iterator, List, Tuple + +from oaklib.datamodels.obograph import DefinitionPropertyValue +from oaklib.interfaces.basic_ontology_interface import BasicOntologyInterface +from oaklib.types import CURIE + +__all__ = [ + "OntologyGenerationInterface", +] + + +class OntologyGenerationInterface(BasicOntologyInterface, ABC): + """ + Interface for generation of ontologies and ontology terms. + """ + + def generate_definitions( + self, curies: List[CURIE], style_hints="", **kwargs + ) -> Iterator[Tuple[CURIE, DefinitionPropertyValue]]: + """ + Generate text definitions for a list of curies. 
+ + :param curies: + :param style_hints: + :param kwargs: + :return: + """ + raise NotImplementedError() diff --git a/src/oaklib/interfaces/semsim_interface.py b/src/oaklib/interfaces/semsim_interface.py index c6b0862c6..df0f6e408 100644 --- a/src/oaklib/interfaces/semsim_interface.py +++ b/src/oaklib/interfaces/semsim_interface.py @@ -3,7 +3,7 @@ import statistics from abc import ABC from collections import defaultdict -from typing import Iterable, Iterator, List, Optional, Tuple +from typing import Dict, Iterable, Iterator, List, Optional, Tuple import networkx as nx @@ -27,6 +27,9 @@ class SemanticSimilarityInterface(BasicOntologyInterface, ABC): collections of terms """ + cached_information_content_map: Dict[CURIE, float] = None + """Mapping from term to information content""" + def most_recent_common_ancestors( self, subject: CURIE, @@ -194,6 +197,8 @@ def information_content_scores( predicates: List[PRED_CURIE] = None, object_closure_predicates: List[PRED_CURIE] = None, use_associations: bool = None, + term_to_entities_map: Dict[CURIE, List[CURIE]] = None, + **kwargs, ) -> Iterator[Tuple[CURIE, float]]: """ Yields entity-score pairs for a given collection of entities. @@ -208,12 +213,44 @@ def information_content_scores( Pr(t) = freq(t)/|items| :param curies: + :param predicates: :param object_closure_predicates: + :param use_associations: + :param term_to_entities_map: + :param kwargs: :return: """ curies = list(curies) + if self.cached_information_content_map is not None: + for curie in curies: + if curie in self.cached_information_content_map: + yield curie, self.cached_information_content_map[curie] + return if use_associations: - raise NotImplementedError + from oaklib.interfaces.association_provider_interface import ( + AssociationProviderInterface, + ) + + if not isinstance(self, AssociationProviderInterface): + raise ValueError( + f"unable to retrieve associations from this interface, type {type(self)}" + ) + all_entities = set() + for a in self.associations(): + all_entities.add(a.subject) + num_entities = len(all_entities) + logging.info(f"num_entities={num_entities}") + for curie in curies: + entities = list( + self.associations_subjects( + objects=[curie], + predicates=predicates, + object_closure_predicates=object_closure_predicates, + ) + ) + if entities: + yield curie, -math.log(len(entities) / num_entities) + return all_entities = list(self.entities()) num_entities = len(all_entities) if not isinstance(self, OboGraphInterface): diff --git a/src/oaklib/utilities/associations/association_index.py b/src/oaklib/utilities/associations/association_index.py index b9ec0da04..074d3d1e3 100644 --- a/src/oaklib/utilities/associations/association_index.py +++ b/src/oaklib/utilities/associations/association_index.py @@ -88,7 +88,7 @@ def lookup( q = q.filter(TermAssociation.predicate.in_(tuple(predicates))) if objects: q = q.filter(TermAssociation.object.in_(tuple(objects))) - logging.info(f"Association query: {q}") + logging.info(f"Association index lookup: {q}") for row in q: tup = (row.subject, row.predicate, row.object) yield from self._associations_by_spo[tup] diff --git a/src/oaklib/utilities/semsim/similarity_utils.py b/src/oaklib/utilities/semsim/similarity_utils.py index e71bffbe0..64970fc0d 100644 --- a/src/oaklib/utilities/semsim/similarity_utils.py +++ b/src/oaklib/utilities/semsim/similarity_utils.py @@ -1,6 +1,9 @@ +import csv from dataclasses import dataclass from typing import Dict, Iterable, List, Set, Tuple, Union +from oaklib.types import CURIE + @dataclass 
class ListPair: @@ -55,3 +58,27 @@ def setwise_jaccard_similarity(set1: LIST_OR_SET, set2: LIST_OR_SET) -> float: set1 = set(set1) set2 = set(set2) return len(set1.intersection(set2)) / len(set1.union(set2)) + + +def load_information_content_map(path: str) -> Dict[CURIE, float]: + """ + Load information content map from file. + + Columns: id, information_content + + Example: + + >>> from oaklib.utilities.semsim.similarity_utils import load_information_content_map + >>> icmap = load_information_content_map("tests/input/go-nucleus.ic.tsv") + >>> ic = icmap["GO:0016310"] + >>> print(f"{ic:.3f}") + 4.273 + + IC tables can be computed using the :ref:`SemanticSimilarityInterface.information_content_scores` method. + + :param path: + :return: mapping from IDs to information content + """ + with open(path) as f: + dr = csv.DictReader(f, delimiter="\t") + return {row["id"]: float(row["information_content"]) for row in dr} diff --git a/tests/input/go-nucleus.ic.tsv b/tests/input/go-nucleus.ic.tsv new file mode 100644 index 000000000..68630073b --- /dev/null +++ b/tests/input/go-nucleus.ic.tsv @@ -0,0 +1,174 @@ +id information_content +BFO:0000002 1.2531189369687112 +BFO:0000003 1.7150230412855292 +BFO:0000004 1.6880559936852597 +BFO:0000015 1.742503777707636 +BFO:0000017 4.273018494406416 +BFO:0000020 3.1950159824051427 +BFO:0000023 4.442943495848728 +BFO:0000040 1.770518153877233 +CARO:0000000 2.742503777707636 +CARO:0000003 3.442943495848729 +CARO:0000006 3.355480654598389 +CARO:0001010 7.442943495848729 +CARO:0030000 2.6880559936852597 +CHEBI:10545 3.1950159824051427 +CHEBI:23367 3.53605290024021 +CHEBI:24431 2.9193815397917158 +CHEBI:24432 4.857980995127573 +CHEBI:24433 4.635588573791124 +CHEBI:25585 4.635588573791124 +CHEBI:26020 5.857980995127572 +CHEBI:26079 5.442943495848728 +CHEBI:26082 4.442943495848728 +CHEBI:28659 4.857980995127573 +CHEBI:33233 3.050626073069968 +CHEBI:33241 4.857980995127573 +CHEBI:33250 3.273018494406416 +CHEBI:33252 3.1950159824051427 +CHEBI:33253 3.121015400961366 +CHEBI:33284 5.857980995127572 +CHEBI:33300 4.442943495848728 +CHEBI:33302 4.273018494406416 +CHEBI:33318 3.8579809951275723 +CHEBI:33560 4.121015400961366 +CHEBI:33579 3.983511877211431 +CHEBI:33675 4.121015400961366 +CHEBI:33937 6.442943495848729 +CHEBI:36338 3.121015400961366 +CHEBI:36339 3.050626073069968 +CHEBI:36340 2.8579809951275723 +CHEBI:36342 2.584962500721156 +CHEBI:36343 2.9193815397917158 +CHEBI:36344 2.983511877211431 +CHEBI:36347 3.050626073069968 +CHEBI:36357 4.442943495848728 +CHEBI:36359 5.121015400961366 +CHEBI:36360 4.857980995127573 +CHEBI:37577 4.635588573791124 +CHEBI:50906 4.442943495848728 +CHEBI:52211 5.121015400961366 +CHEBI:78295 5.442943495848728 +CL:0000000 3.53605290024021 +GO:0003674 2.983511877211431 +GO:0003824 3.273018494406416 +GO:0004857 6.442943495848729 +GO:0005575 2.9193815397917158 +GO:0005622 3.742503777707636 +GO:0005634 5.857980995127572 +GO:0005635 6.442943495848729 +GO:0005737 4.857980995127573 +GO:0005773 6.442943495848729 +GO:0005886 5.857980995127572 +GO:0005938 6.442943495848729 +GO:0006793 3.355480654598389 +GO:0006796 3.742503777707636 +GO:0008047 6.442943495848729 +GO:0008150 1.9510903995190536 +GO:0008152 2.7990873060740036 +GO:0009892 4.857980995127573 +GO:0009893 4.635588573791124 +GO:0009987 2.7990873060740036 +GO:0010562 5.121015400961366 +GO:0010563 5.442943495848728 +GO:0012505 5.857980995127572 +GO:0016020 4.121015400961366 +GO:0016301 4.635588573791124 +GO:0016310 4.273018494406416 +GO:0016740 3.8579809951275723 +GO:0016772 4.442943495848728 
+GO:0019207 5.857980995127572 +GO:0019209 7.442943495848729 +GO:0019210 7.442943495848729 +GO:0019220 4.121015400961366 +GO:0019222 3.1950159824051427 +GO:0031090 6.442943495848729 +GO:0031323 3.442943495848729 +GO:0031324 5.121015400961366 +GO:0031325 4.857980995127573 +GO:0031965 7.442943495848729 +GO:0031967 5.857980995127572 +GO:0031975 5.442943495848728 +GO:0033673 7.442943495848729 +GO:0033674 6.442943495848729 +GO:0042325 4.635588573791124 +GO:0042326 6.442943495848729 +GO:0042327 5.857980995127572 +GO:0043085 5.442943495848728 +GO:0043086 5.121015400961366 +GO:0043226 3.983511877211431 +GO:0043227 4.273018494406416 +GO:0043229 4.442943495848728 +GO:0043231 4.857980995127573 +GO:0043549 5.442943495848728 +GO:0044092 4.857980995127573 +GO:0044093 5.121015400961366 +GO:0044237 3.050626073069968 +GO:0045936 5.857980995127572 +GO:0045937 5.442943495848728 +GO:0048518 4.273018494406416 +GO:0048519 4.442943495848728 +GO:0048522 4.635588573791124 +GO:0048523 4.857980995127573 +GO:0050789 2.7990873060740036 +GO:0050790 3.8579809951275723 +GO:0050794 3.1950159824051427 +GO:0051174 3.742503777707636 +GO:0051338 4.635588573791124 +GO:0051347 5.857980995127572 +GO:0051348 6.442943495848729 +GO:0065007 2.2334901302197787 +GO:0065009 3.53605290024021 +GO:0071944 4.857980995127573 +GO:0098590 7.442943495848729 +GO:0099568 6.442943495848729 +GO:0099738 7.442943495848729 +GO:0110165 2.983511877211431 +IAO:0000078 7.442943495848729 +IAO:0000225 7.442943495848729 +IAO:0000409 7.442943495848729 +NCBITaxon:1 1.82823365173352 +NCBITaxon:10239 7.442943495848729 +NCBITaxon:1117 7.442943495848729 +NCBITaxon:131567 1.9193815397917156 +NCBITaxon:142796 5.121015400961366 +NCBITaxon:1783272 5.857980995127572 +NCBITaxon:1798711 6.442943495848729 +NCBITaxon:2 5.442943495848728 +NCBITaxon:2058185 6.442943495848729 +NCBITaxon:2058949 5.857980995127572 +NCBITaxon:2157 7.442943495848729 +NCBITaxon:2605435 4.857980995127573 +NCBITaxon:2611352 6.442943495848729 +NCBITaxon:2759 3.1950159824051427 +NCBITaxon:33083 5.442943495848728 +NCBITaxon:33090 7.442943495848729 +NCBITaxon:33154 6.442943495848729 +NCBITaxon:33682 7.442943495848729 +NCBITaxon:4751 7.442943495848729 +NCBITaxon:554915 4.635588573791124 +NCBITaxon:5782 7.442943495848729 +NCBITaxon:Union_0000000 4.635588573791124 +NCBITaxon:Union_0000002 5.857980995127572 +NCBITaxon:Union_0000004 4.857980995127573 +NCBITaxon:Union_0000006 4.121015400961366 +NCBITaxon:Union_0000007 4.121015400961366 +NCBITaxon:Union_0000008 5.857980995127572 +NCBITaxon:Union_0000020 4.857980995127573 +NCBITaxon:Union_0000021 3.635588573791124 +NCBITaxon:Union_0000022 5.857980995127572 +NCBITaxon:Union_0000023 4.273018494406416 +NCBITaxon:Union_0000024 3.983511877211431 +NCBITaxon:Union_0000025 3.050626073069968 +NCBITaxon:Union_0000030 1.8579809951275723 +OBI:0100026 1.799087306074004 +PATO:0000001 4.273018494406416 +PATO:0001241 4.442943495848728 +PATO:0001396 4.635588573791124 +PATO:0001404 4.857980995127573 +PATO:0001405 7.442943495848729 +PATO:0001406 7.442943495848729 +PATO:0001407 7.442943495848729 +PATO:0001908 6.442943495848729 +PATO:0002505 5.442943495848728 +oio:Subset 7.442943495848729 diff --git a/tests/test_implementations/__init__.py b/tests/test_implementations/__init__.py index ecc0da4f3..269eb093f 100644 --- a/tests/test_implementations/__init__.py +++ b/tests/test_implementations/__init__.py @@ -112,6 +112,7 @@ INTRACELLULAR, INTRACELLULAR_ORGANELLE, MAMMALIA, + MEMBRANE, NUCLEAR_ENVELOPE, NUCLEAR_MEMBRANE, NUCLEUS, @@ -1593,11 +1594,6 @@ def test_store_associations(self, 
oi: AssociationProviderInterface): test.assertEqual(1, count_map[NUCLEAR_MEMBRANE]) test.assertEqual(2, count_map[NUCLEUS]) test.assertEqual(2, count_map[IMBO]) - if isinstance(oi, SemanticSimilarityInterface): - try: - self.test_information_content_scores(oi, use_associations=True) - except NotImplementedError: - logging.info(f"Not yet implemented for {type(oi)}") def test_class_enrichment(self, oi: ClassEnrichmentCalculationInterface): """ @@ -1738,12 +1734,43 @@ def test_information_content_scores( (NUCLEUS, IMBO), (IMBO, CELLULAR_COMPONENT), ] + if use_associations: + if not isinstance(oi, AssociationProviderInterface): + raise ValueError("Semantic similarity interface does not support associations") + assoc_pairs = [ + (GENE1, VACUOLE), + (GENE2, NUCLEAR_ENVELOPE), + (GENE3, NUCLEAR_ENVELOPE), + ] + oi.add_associations( + Association(subject=subject, object=object) for subject, object in assoc_pairs + ) m = {} for curie, score in oi.information_content_scores( terms, object_closure_predicates=[IS_A, PART_OF], use_associations=use_associations ): m[curie] = score - # universal root node always has zero information + test.assertGreater(len(m), 0) + if use_associations: + test.assertCountEqual( + [ + OWL_THING, + VACUOLE, + IMBO, + NUCLEAR_ENVELOPE, + NUCLEUS, + CELLULAR_COMPONENT, + PHOTORECEPTOR_OUTER_SEGMENT, + ], + m.keys(), + ) + test.assertEqual(m[CELLULAR_COMPONENT], 0.0, "all genes are under cell component") + test.assertEqual(m[PHOTORECEPTOR_OUTER_SEGMENT], 0.0, "not in graph") + test.assertGreater(m[VACUOLE], 1.0) + test.assertLess(m[NUCLEAR_ENVELOPE], 1.0) + # universal root node always has zero information content + for k, v in m.items(): + print(f"{k} IC= {v}") test.assertEqual(m[OWL_THING], 0.0) for child, parent in posets: if use_associations: @@ -1854,3 +1881,128 @@ def test_annotate_text(self, oi: TextAnnotatorInterface): test.assertEqual(object_label, ann.object_label) test.assertEqual(subject_start, ann.subject_start) test.assertEqual(subject_end, ann.subject_end) + + def test_entities_metadata_statements(self, oi: BasicOntologyInterface): + test = self.test + + cases = [ + ( + [MEMBRANE], + [OIO_CREATION_DATE], + [("GO:0016020", "oio:creation_date", "2014-03-06T11:37:54Z", "xsd:string", {})], + ), + ( + [MEMBRANE], + None, + [ + ( + "GO:0016020", + "IAO:0000115", + "A lipid bilayer along with all the proteins and protein complexes embedded in it an attached to it.", # noqa:E501 + "xsd:string", + {}, + ), + ("GO:0016020", "oio:creation_date", "2014-03-06T11:37:54Z", "xsd:string", {}), + ("GO:0016020", "oio:hasAlternativeId", "GO:0098589", "xsd:string", {}), + ("GO:0016020", "oio:hasAlternativeId", "GO:0098805", "xsd:string", {}), + ( + "GO:0016020", + "oio:hasDbXref", + "Wikipedia:Biological_membrane", + "xsd:string", + {}, + ), + ("GO:0016020", "oio:hasNarrowSynonym", "membrane region", "xsd:string", {}), + ("GO:0016020", "oio:hasNarrowSynonym", "region of membrane", "xsd:string", {}), + ("GO:0016020", "oio:hasNarrowSynonym", "whole membrane", "xsd:string", {}), + ("GO:0016020", "oio:hasOBONamespace", "cellular_component", "xsd:string", {}), + ("GO:0016020", "oio:id", "GO:0016020", "xsd:string", {}), + ("GO:0016020", "oio:inSubset", "obo:go#goslim_yeast", None, {}), + ("GO:0016020", "oio:inSubset", "obo:go#goslim_plant", None, {}), + ("GO:0016020", "oio:inSubset", "obo:go#goslim_pir", None, {}), + ("GO:0016020", "oio:inSubset", "obo:go#goslim_metagenomics", None, {}), + ("GO:0016020", "oio:inSubset", "obo:go#goslim_flybase_ribbon", None, {}), + ("GO:0016020", 
"oio:inSubset", "obo:go#goslim_chembl", None, {}), + ("GO:0016020", "oio:inSubset", "obo:go#goslim_candida", None, {}), + ("GO:0016020", "oio:inSubset", "obo:go#goslim_aspergillus", None, {}), + ("GO:0016020", "rdfs:label", "membrane", "xsd:string", {}), + ], + ), + ( + [MEMBRANE], + ["oio:inSubset"], + [ + ("GO:0016020", "oio:inSubset", "obo:go#goslim_yeast", None, {}), + ("GO:0016020", "oio:inSubset", "obo:go#goslim_plant", None, {}), + ("GO:0016020", "oio:inSubset", "obo:go#goslim_pir", None, {}), + ("GO:0016020", "oio:inSubset", "obo:go#goslim_metagenomics", None, {}), + ("GO:0016020", "oio:inSubset", "obo:go#goslim_flybase_ribbon", None, {}), + ("GO:0016020", "oio:inSubset", "obo:go#goslim_chembl", None, {}), + ("GO:0016020", "oio:inSubset", "obo:go#goslim_candida", None, {}), + ("GO:0016020", "oio:inSubset", "obo:go#goslim_aspergillus", None, {}), + ], + ), + ( + [NUCLEUS], + None, + [ + ( + "GO:0005634", + "IAO:0000115", + "A membrane-bounded organelle of eukaryotic cells in which chromosomes are housed and replicated. In most cells, the nucleus contains all of the cell's chromosomes except the organellar chromosomes, and is the site of RNA synthesis and processing. In some species, or in specialized cell types, RNA metabolism or DNA replication may be absent.", # noqa:E501 + "xsd:string", + {}, + ), + ( + "GO:0005634", + "oio:hasDbXref", + "NIF_Subcellular:sao1702920020", + "xsd:string", + {}, + ), + ("GO:0005634", "oio:hasDbXref", "Wikipedia:Cell_nucleus", "xsd:string", {}), + ("GO:0005634", "oio:hasExactSynonym", "cell nucleus", "xsd:string", {}), + ("GO:0005634", "oio:hasNarrowSynonym", "horsetail nucleus", "xsd:string", {}), + ("GO:0005634", "oio:hasOBONamespace", "cellular_component", "xsd:string", {}), + ("GO:0005634", "oio:id", "GO:0005634", "xsd:string", {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_yeast", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_plant", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_pir", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_mouse", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_metagenomics", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_generic", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_flybase_ribbon", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_drosophila", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_chembl", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_candida", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_aspergillus", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_agr", None, {}), + ("GO:0005634", "rdfs:label", "nucleus", "xsd:string", {}), + ], + ), + ( + [NUCLEUS], + ["oio:inSubset"], + [ + ("GO:0005634", "oio:inSubset", "obo:go#goslim_yeast", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_plant", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_pir", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_mouse", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_metagenomics", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_generic", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_flybase_ribbon", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_drosophila", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_chembl", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_candida", None, {}), + ("GO:0005634", "oio:inSubset", "obo:go#goslim_aspergillus", None, {}), + ("GO:0005634", "oio:inSubset", 
"obo:go#goslim_agr", None, {}), + ], + ), + ] + for case in cases: + curies, predicates, expected_result = case + results = list(oi.entities_metadata_statements(curies=curies, predicates=predicates)) + test.assertCountEqual(results, expected_result) + test.assertCountEqual(results[0], expected_result[0]) + for idx, result in enumerate(results): + test.assertEqual(result, expected_result[idx]) diff --git a/tests/test_implementations/test_sqldb.py b/tests/test_implementations/test_sqldb.py index ae5f7962f..42feb32a9 100644 --- a/tests/test_implementations/test_sqldb.py +++ b/tests/test_implementations/test_sqldb.py @@ -877,6 +877,7 @@ def test_summary_statistics(self): def test_information_content_scores(self): self.compliance_tester.test_information_content_scores(self.oi, False) + self.compliance_tester.test_information_content_scores(self.oi, True) def test_common_ancestors(self): self.compliance_tester.test_common_ancestors(self.oi) @@ -898,3 +899,6 @@ def test_transitive_object_properties(self): def test_simple_subproperty_of_chains(self): self.compliance_tester.test_simple_subproperty_of_chains(self.oi) + + def test_entities_metadata_statements(self): + self.compliance_tester.test_entities_metadata_statements(self.oi)