Mining neighbors, paths, and path patterns from a knowledge graph and a set of seed nodes
A detailed description of the motivation and the algorithms of kgpm is available in the related article.
When citing kgpm, please use the following reference:
Pierre Monnin, Emmanuel Bresso, Miguel Couceiro, Malika Smaïl-Tabbone, Amedeo Napoli, and Adrien Coulet. "Tackling scalability issues in mining path patterns from knowledge graphs: a preliminary study". In: 1st international conference "Algebras, graphs and ordered sets" (ALGOS 2020). Ed. by Miguel Couceiro, Pierre Monnin, and Amedeo Napoli. Nancy, France, Aug. 2020. url: https://arxiv.org/pdf/2007.08821.pdf.
@inproceedings{Monnin2020kgpm,
author = {Monnin, Pierre and Bresso, Emmanuel and Couceiro, Miguel and Sma{\"i}l-Tabbone, Malika and Napoli, Amedeo and Coulet, Adrien},
title = {{Tackling scalability issues in mining path patterns from knowledge graphs: a preliminary study}},
editor = {Miguel Couceiro and Pierre Monnin and Amedeo Napoli},
booktitle = {{1st international conference ``Algebras, graphs and ordered sets'' (ALGOS 2020)}},
address = {Nancy, France},
year = {2020},
month = Aug,
url = {https://arxiv.org/pdf/2007.08821.pdf},
}
Python script to query a knowledge graph and perform its canonicalization. The script outputs:
- Files representing the canonical knowledge graph (in `rdf_to_canonical_index`, `canonical_to_rdf_index`, `canonical_graph_adjacency`, `canonical_graph_inv_adjacency`, `rdf_nodes_cache_manager.csv`, `predicates_cache_manager.csv`)
- Statistics about the knowledge graph before and after canonicalization (in `graphs_statistics.md`)
Parameters:
- `--configuration`: path of the JSON configuration file
- `--max-rows`: max number of rows the SPARQL endpoint can return
- `--output`: base directory for output files
- `--self-signed-ssl`: enable self-signed SSL certificates
- `--debug`: print debug statements
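
The `--max-rows` parameter exists because SPARQL endpoints often cap the number of rows returned per query. A common way to cope with such a cap is to page through the results with `LIMIT` and `OFFSET`. The sketch below illustrates that general technique only; the function name `paginate_query` and the `total_rows` argument are hypothetical, and kgpm may retrieve results differently.

```python
from typing import Iterator


def paginate_query(base_query: str, max_rows: int, total_rows: int) -> Iterator[str]:
    """Yield copies of base_query windowed with LIMIT/OFFSET.

    total_rows is assumed to be known beforehand (e.g., from a COUNT query).
    Note: without an ORDER BY clause, SPARQL does not guarantee that pages
    are returned in a stable order.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    for offset in range(0, total_rows, max_rows):
        yield f"{base_query} LIMIT {max_rows} OFFSET {offset}"


# Hypothetical usage: 2500 triples fetched from an endpoint capped at 1000 rows.
queries = list(paginate_query("SELECT ?s ?p ?o WHERE { ?s ?p ?o }", 1000, 2500))
for query in queries:
    print(query)
```

Each generated query stays within the row limit of the endpoint, so the full result set can be rebuilt by concatenating the pages.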
Python script to mine neighbors, paths, and path patterns from a canonical knowledge graph and a set of seed nodes.
Parameters:
- `--configuration`: path of the JSON configuration file
- `--graph`: base directory for the input graph files
- `--dataset-csv`: CSV file with the seed node URIs (column 0) and class labels (column 1)
- `--dataset-name`: name of the data set
- `--output`: base directory for output files (statistics, a SciPy matrix of nodes x features, a column names file, and a NumPy vector of class labels)
- `-d`: maximum degree to allow expansion (disabled with `d = -1`)
- `--lmin`: minimum support for features
- `--lmax`: maximum support for features
- `--kmin`: minimum k to test (i.e., number of traversed edges, size of paths and path patterns)
- `--kmax`: maximum k to test
- `--tmin`: minimum t to test (i.e., level for generalization in class hierarchies); `t = -1` disables type generalization, `t = 0` only allows generalization with `owl:Thing`
- `--tmax`: maximum t to test
- `--undirected`: whether only out arcs (`false`) or all arcs (`true`) are traversed
- `--meaningful`: additional biomedical filtering strategies:
  - `p`: only select features containing a pathway
  - `g`: only select features containing a gene or a GO class
  - `m`: only select features containing a MeSH class
  - `pg`: disjunction of `p` and `g`
  - `pgm`: disjunction of `p`, `g`, and `m`
  - `all`: test all previous filters (thus, 5 outputs)
  - `no_check`: disable the additional filtering
- `--debug`: print debug statements
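
To make the roles of k, `-d`, and `--undirected` more concrete, here is a minimal sketch of bounded path enumeration on a toy graph. It is not the kgpm implementation: the graph encoding, the function names, and the interpretation of `-d` (a node with more than `d` arcs is not expanded) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Toy encoding (assumption): node -> list of (predicate, neighbor) out arcs.
Graph = Dict[str, List[Tuple[str, str]]]


def invert(graph: Graph) -> Graph:
    """Build the inverse adjacency, marking inverse predicates with '^-1'."""
    inverse: Graph = {}
    for source, arcs in graph.items():
        for predicate, target in arcs:
            inverse.setdefault(target, []).append((predicate + "^-1", source))
    return inverse


def enumerate_paths(graph: Graph, seed: str, k_max: int,
                    d: int = -1, undirected: bool = False) -> List[tuple]:
    """Return all cycle-free paths of 1 to k_max edges starting at seed.

    A path is a tuple (node, predicate, node, predicate, node, ...).
    """
    inverse = invert(graph) if undirected else {}
    paths: List[tuple] = []
    frontier: List[tuple] = [(seed,)]
    for _ in range(k_max):
        next_frontier = []
        for path in frontier:
            arcs = graph.get(path[-1], []) + inverse.get(path[-1], [])
            if d != -1 and len(arcs) > d:
                continue  # degree above the limit: do not expand this node
            for predicate, neighbor in arcs:
                if neighbor in path[::2]:
                    continue  # neighbor already on the path: skip cycles
                next_frontier.append(path + (predicate, neighbor))
        paths.extend(next_frontier)
        frontier = next_frontier
    return paths


# Hypothetical toy graph with two drugs sharing a target gene.
toy_graph = {
    "drugA": [("targets", "gene1")],
    "gene1": [("involvedIn", "pathwayX")],
    "drugB": [("targets", "gene1")],
}
directed_paths = enumerate_paths(toy_graph, "drugA", k_max=2)
undirected_paths = enumerate_paths(toy_graph, "drugA", k_max=2, undirected=True)
limited_paths = enumerate_paths(toy_graph, "drugA", k_max=2, d=1, undirected=True)
```

In this toy graph, traversing all arcs lets the path from `drugA` reach `drugB` through their shared gene, while a degree limit of 1 stops the expansion at `gene1`, which has three arcs once inverse arcs are counted.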
Python script to compute the statistics about the subgraph accessible from a set of seed nodes in a canonical knowledge graph. It outputs a markdown file containing the number of neighbors and types reachable from the seed nodes.
Parameters:
- `--configuration`: path of the JSON configuration file
- `--graph`: base directory for the input graph files
- `--dataset-csv`: CSV file with the seed node URIs (column 0) and class labels (column 1)
- `--dataset-name`: name of the data set
- `--output`: base directory for output files (Markdown files)
- `-d`: maximum degree to allow expansion (disabled with `d = -1`)
- `--undirected`: whether only out arcs (`false`) or all arcs (`true`) are traversed
- `--detailed`: enable detailed statistics, i.e., number of neighbors and types accessible w.r.t. k and t until full neighborhood is reached. By default, only the max numbers of reachable neighbors and types in the full neighborhood are output (k and t are not given).
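
The layout expected by `--dataset-csv` (seed node URIs in column 0, class labels in column 1) can be illustrated with a short reader. This is a sketch under assumptions: a comma delimiter and no header row are assumed, the function name `read_seed_nodes` is hypothetical, and the URIs and labels are made-up placeholders.

```python
import csv
import io
from typing import Iterable, List, Tuple


def read_seed_nodes(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split CSV rows into seed node URIs (column 0) and class labels (column 1)."""
    uris: List[str] = []
    labels: List[str] = []
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or incomplete rows
        uris.append(row[0].strip())
        labels.append(row[1].strip())
    return uris, labels


# Placeholder data set; a real file would be opened with open(path, newline="").
example_csv = io.StringIO(
    "http://example.org/drug/1,positive\n"
    "http://example.org/drug/2,negative\n"
)
seed_uris, seed_labels = read_seed_nodes(example_csv)
print(seed_uris, seed_labels)
```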
An example of a JSON configuration file is given in configuration.json.example. Keys are:
- server-address: address of the SPARQL endpoint to query
- url-json-conf-attribute: URL attribute to use to get JSON results
- url-json-conf-value: value of the url-json-conf-attribute to get JSON results
- url-default-graph-attribute: URL attribute to use to define the default graph
- url-default-graph-value: value of url-default-graph-attribute to define the default graph
- url-query-attribute: URL attribute to use to define the query
- timeout: timeout value for HTTP requests
- username: username to use if HTTP authentication is required (empty otherwise)
- password: password to use if HTTP authentication is required (empty otherwise)
- path_predicates_blacklist: blacklist of URIs or prefixes of predicates not to traverse
- types_blacklist: blacklist of URIs or prefixes of types not to use in path generalization
- types_expansion_blacklist: blacklist of URIs or prefixes of types whose instances cannot be traversed
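
As an illustration of the keys listed above, the following sketch builds a configuration dictionary and serializes it to JSON. All values are placeholders, and the value types (a number for `timeout`, lists for the blacklists) are assumptions; `configuration.json.example` remains the authoritative reference.

```python
import json

# Placeholder values only (assumptions); adapt them to the SPARQL endpoint in use.
configuration = {
    "server-address": "http://localhost:8890/sparql",
    "url-json-conf-attribute": "format",
    "url-json-conf-value": "application/sparql-results+json",
    "url-default-graph-attribute": "default-graph-uri",
    "url-default-graph-value": "http://example.org/graph",
    "url-query-attribute": "query",
    "timeout": 600,
    "username": "",  # empty when no HTTP authentication is required
    "password": "",
    "path_predicates_blacklist": ["http://www.w3.org/2002/07/owl#sameAs"],
    "types_blacklist": ["http://www.w3.org/2002/07/owl#"],
    "types_expansion_blacklist": [],
}

# Serialize to JSON text; this text could then be saved as the configuration file.
configuration_text = json.dumps(configuration, indent=2)
print(configuration_text)
```

The resulting file would be passed to the scripts through `--configuration`.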
Dependencies:
- tqdm
- numpy
- bitarray
- scipy