February 11, 2021
Resources:
- Associated GitHub Release: https://github.com/callahantiff/PheKnowLator/releases/tag/v2.0.0
- PyPI Release: https://pypi.org/manage/project/pkt-kg/release/2.0.1/
- DockerHub Build: https://hub.docker.com/repository/docker/callahantiff/pheknowlator
KG Benchmark Builds can also be obtained from Zenodo:
- Class Builds
  - Standard Relations
  - Inverse Relations
- Instance Builds
  - Standard Relations
  - Inverse Relations
🗂 For additional information on the KG file types, please see the following Wiki page.
Logs: `pkt_builder_phases12_log.log`
We provide several different types of output, each of which is described briefly below. Please note that in order to create the logic (`XXXX_OWL_LogicOnly.nt`) and annotation (`XXXX_OWL_AnnotationsOnly.nt`) subsets of each graph and be able to combine them (`XXXX_OWL.nt`), we have added a namespace to all `BNode` or anonymous nodes. More specifically, there are two kinds of `pkt` namespaces you will find within these files:
- `https://github.com/callahantiff/PheKnowLator/pkt/`. This namespace is used for all non-ontology data defined as `owl:Class` and `owl:NamedIndividual` objects that are added in order to integrate non-ontological entities (see here for more information).
- `https://github.com/callahantiff/PheKnowLator/pkt/bnode/`. This namespace is used for all existing `BNode` or anonymous nodes and is applied to these entities prior to subsetting an input graph.
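To make the distinction concrete, the sketch below counts the subjects that fall under each namespace in a downloaded build. It assumes the build is an N-Triples file readable with rdflib; the filename is illustrative.

```python
import rdflib

# parse one of the combined build files (illustrative filename)
graph = rdflib.Graph().parse('PheKnowLator_full_instance_inverseRelations_OWL.nt', format='nt')

# the two pkt namespaces described above
pkt_ns = 'https://github.com/callahantiff/PheKnowLator/pkt/'
pkt_bnode_ns = 'https://github.com/callahantiff/PheKnowLator/pkt/bnode/'

# collect subjects that fall under each namespace
bnode_ns_subjects = {s for s in graph.subjects() if str(s).startswith(pkt_bnode_ns)}
pkt_only_subjects = {s for s in graph.subjects()
                     if str(s).startswith(pkt_ns) and not str(s).startswith(pkt_bnode_ns)}
print(len(pkt_only_subjects), len(bnode_ns_subjects))
```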
To remove the second type of namespacing from `BNode`s that are part of the original ontologies used in each build, you can run the code shown below:
```python
from pkt.utils import removes_namespace_from_bnodes

# org_graph is an rdflib Graph parsed from a downloaded build file
# remove the pkt bnode namespaces from its anonymous nodes
updated_graph = removes_namespace_from_bnodes(org_graph)
```
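For context, an end-to-end version of the snippet above might look like the following sketch, assuming the downloaded build is an N-Triples file parsed with rdflib (the filenames are illustrative):

```python
import rdflib
from pkt.utils import removes_namespace_from_bnodes

# parse a downloaded build file into an rdflib Graph (illustrative filename)
org_graph = rdflib.Graph().parse('PheKnowLator_full_class_inverseRelations_OWL.nt', format='nt')

# strip the pkt bnode namespace so the original anonymous nodes are restored
updated_graph = removes_namespace_from_bnodes(org_graph)

# write the updated graph back out as N-Triples (illustrative filename)
updated_graph.serialize(destination='PheKnowLator_updated.nt', format='nt')
```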
Please also note that for all builds prior to `v3.0.2`, there are 2,008 nodes in the `NodeLabels.txt` files that contain foreign characters. While there is now code in place to prevent this error from happening in the future, there is also a solution to account for the prior builds. The `bad_node_patch.json` file contains a dictionary where the outer keys are the `entity_uri` and the outer values are another dictionary whose inner keys are `label` and `description/definition` and whose inner values are the updated strings without foreign characters. An example of this dictionary is shown below:
```python
key = '<http://purl.obolibrary.org/obo/UBERON_0000468>'
print(bad_node_patch[key])
>>> {'label': 'multicellular organism', 'description/definition': 'Anatomical structure that is an individual member of a species and consists of more than one cell.'}
```
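The example above assumes the patch dictionary has already been loaded into `bad_node_patch`; a minimal sketch of doing so, assuming `bad_node_patch.json` has been downloaded locally (the path is illustrative), is:

```python
import json

# load the patch dictionary from the downloaded file
with open('bad_node_patch.json') as f:
    bad_node_patch = json.load(f)
```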
The code to identify the nodes with erroneous foreign characters is shown below:
```python
import re
import pandas as pd

# path to the downloaded `NodeLabels.txt` file
input_file = 'NodeLabels.txt'

# load the data as a Pandas DataFrame
nodedf = pd.read_csv(input_file, sep='\t', header=0)

# flag labels containing foreign (CJK) characters and filter the DataFrame so it only contains these rows
nodedf['bad'] = nodedf['label'].apply(lambda x: re.search("[\u4e00-\u9FFF]", x) if not pd.isna(x) else None)
nodedf_bad_nodes = nodedf[~pd.isna(nodedf['bad'])].drop_duplicates()
```
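Finally, one way to use `bad_node_patch` together with the filtered DataFrame above is sketched below. Note that the `entity_uri` and `description/definition` column names are assumptions, as is the idea that the identifiers are stored with the surrounding angle brackets used in the patch keys; check the header and contents of your `NodeLabels.txt` file before running.

```python
# apply the corrected strings to the rows flagged above
# ('entity_uri' and 'description/definition' column names are assumptions)
for idx, row in nodedf_bad_nodes.iterrows():
    patch = bad_node_patch.get(row['entity_uri'])
    if patch is not None:
        nodedf.loc[idx, 'label'] = patch['label']
        nodedf.loc[idx, 'description/definition'] = patch['description/definition']

# drop the helper column and write the repaired file back out (illustrative filename)
nodedf.drop(columns=['bad']).to_csv('NodeLabels_patched.txt', sep='\t', index=False)
```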