stephaniesimms edited this page Aug 29, 2018 · 8 revisions

maDMPs ontology

As part of the machine-actionable DMPs (maDMPs) project we are loading data from various systems (see the lib/services/ directory for a list) into a graph database. The purpose of this exercise is to explore relationships between data from these disparate systems and expose opportunities to connect and share information.

Technology and terminology

The maDMPs graph model is currently hosted on an instance of Neo4j. We use the Cypher query language (quick reference card) and Neo4j's terminology to describe the system.

Graph diagram

Nodes

Nodes are logical objects within the graph. Each node has a label, which identifies the type of object it represents, and a collection of properties.

Every node has a unique identifier stored in its uuid property. This value is assigned when the node is created. Neo4j also assigns its own internal node identifier, but that identifier is not guaranteed to be persistent.
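As a rough illustration of the label, properties, and uuid described above, here is a minimal in-memory sketch in Python. It does not talk to Neo4j; the make_node helper, the dictionary layout, the sample property values, and the use of a random (version 4) UUID are all assumptions made for the example, not the project's actual loader code.

```python
import uuid


def make_node(label, **properties):
    """Build an in-memory stand-in for a graph node (hypothetical helper).

    The uuid is assigned once, at creation time, and is written last so
    that a caller-supplied 'uuid' keyword cannot override it.
    """
    return {
        "label": label,
        "properties": {**properties, "uuid": str(uuid.uuid4())},
    }


# Sample values are made up for illustration.
project = make_node("Project", title="Example project")
person = make_node("Person", name="Jane Doe")
```

Because the uuid lives in the node's own properties, it can be kept stable across reloads, which is not something Neo4j's internal identifier promises.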

Node Labels

  • Project: Represents an academic research project
  • Person: Represents any person involved with a research project (e.g. PIs, Program officer, etc.)
  • Org: Represents institutions/organizations (e.g. Universities, Funders, etc.)
  • Award: Represents a monetary award provided to a project (e.g. a Grant)
  • Stage: Represents a logical phase of a project (e.g. Expedition, Sail, etc.)
  • Dataset: Represents a collection of data produced by the project (e.g. statistics, samples, etc.)
  • Document: Represents a document associated with a project (e.g. article, DMP, etc.)
  • Marker: Represents a topic that the project deals with (e.g. a specific genome sequence). This node was found within the Geome system's data and needs further thought to determine whether it should become a "Type" or whether it does indeed warrant its own label.

Type and Identifier Labels

We decided to lift Types and Identifiers out of other nodes' properties and make them separate nodes. We hope this speeds up access to a given node, although the graph is currently small enough that any performance difference is negligible.

  • Type: Represents a category that can be used to quickly search through the graph using facets (e.g. a search for 'EAGER' can return both Projects and Awards, and 'Data Management Plan' will return all Documents of that type). Types are defined as we load data into the graph. Types also include URLs to controlled vocabularies.

  • Identifier: Represents values that are unique to the node they identify. (e.g. unique local system ids, URL landing pages, DOIs and ARKs, email addresses, etc.)
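The faceted search that separate Type nodes enable can be sketched in plain Python. This is only a toy model under stated assumptions: the type_index mapping stands in for Type nodes and their connections, and the tag and search_by_type helpers and all sample titles are hypothetical.

```python
from collections import defaultdict

# Maps a Type value to the nodes connected to that Type node.
# This dictionary is an assumed stand-in for the graph, not its real schema.
type_index = defaultdict(list)


def tag(node, type_value):
    """Connect a node to a Type (hypothetical helper)."""
    type_index[type_value].append(node)


def search_by_type(type_value):
    """Return every node connected to the given Type, whatever its label."""
    return list(type_index.get(type_value, []))


# Sample data is made up for illustration.
tag({"label": "Project", "title": "Project A"}, "EAGER")
tag({"label": "Award", "title": "Award 1"}, "EAGER")
tag({"label": "Document", "title": "Plan for Project A"}, "Data Management Plan")

eager_labels = sorted(node["label"] for node in search_by_type("EAGER"))
```

A single lookup on the Type value reaches nodes with different labels, which is the behaviour the 'EAGER' example above describes.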

CONTRIBUTED_TO relationship

The CONTRIBUTED_TO relationship includes a property called roles. This property describes how a Person contributed to the Project or Dataset (e.g. PI, Co-PI, Program manager/officer, Data Curation Librarian, etc.).
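A small Python sketch of how a roles property might accumulate on a CONTRIBUTED_TO relationship as data is loaded. It assumes that roles is a list and that one relationship exists per (person, target) pair; the contributions store, the add_contribution helper, and the placeholder identifiers are hypothetical.

```python
# Keyed by (person uuid, target uuid); an assumed in-memory store, not Neo4j.
contributions = {}


def add_contribution(person_uuid, target_uuid, role):
    """Record a role on the CONTRIBUTED_TO relationship (hypothetical helper).

    The relationship is created on first use; a role that is already
    present is not added a second time.
    """
    rel = contributions.setdefault(
        (person_uuid, target_uuid),
        {"type": "CONTRIBUTED_TO", "roles": []},
    )
    if role not in rel["roles"]:
        rel["roles"].append(role)
    return rel


# Placeholder identifiers, made up for illustration.
add_contribution("person-1", "project-a", "PI")
add_contribution("person-1", "project-a", "Data Curation Librarian")
rel = add_contribution("person-1", "project-a", "PI")  # duplicate is ignored
```

Keeping the roles on the relationship, rather than on the Person, lets the same person hold different roles on different Projects or Datasets.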

Preservation of source

Each node and relationship has a sources property. This array lists every system that has identified the node or relationship. This information can be useful in several situations:

  • Future APIs: An API could be built that allows an external system to use its own internal identifiers as an entry point into the graph. For example, the query
      MATCH (i:Identifier {value: 'https://www.my-system.org/project/12345'})-[r:IDENTIFIES]-(p:Project)-[]-(u:Person)
      WHERE ANY(source IN r.sources WHERE source = 'my_system')
      RETURN p, u
    allows 'my_system' to find a Project based on its own internal URL landing page for the project. Once the node is located, any other nodes could be returned to 'my_system' regardless of the source of the information.
  • Data Validation: When multiple sources identify a node or a relationship between nodes, there is greater confidence that the information is valid. For example, if both the NSF Awards API and BCO-DMO's system assert that Dr. John Doe was the PI for Project A, then we can consider this information more reliable than if it came from a single source.
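The data validation idea above can be sketched as a simple count of distinct sources. This is an illustrative Python model only: the record_assertion and is_corroborated helpers, the source names 'nsf_awards_api' and 'bco_dmo', the node names, and the threshold of two sources are all assumptions, not part of the maDMPs code.

```python
# Maps (start, relationship type, end) to the list of asserting systems.
# An assumed in-memory stand-in for the sources array on a relationship.
assertions = {}


def record_assertion(start, rel_type, end, source):
    """Add a source to a relationship's sources list, without duplicates."""
    sources = assertions.setdefault((start, rel_type, end), [])
    if source not in sources:
        sources.append(source)
    return sources


def is_corroborated(start, rel_type, end, minimum=2):
    """True when at least `minimum` distinct systems assert the relationship."""
    return len(assertions.get((start, rel_type, end), [])) >= minimum


# Both systems assert the same contribution; a third fact has one source.
record_assertion("john_doe", "CONTRIBUTED_TO", "project_a", "nsf_awards_api")
record_assertion("john_doe", "CONTRIBUTED_TO", "project_a", "bco_dmo")
record_assertion("john_doe", "CONTRIBUTED_TO", "project_b", "bco_dmo")

confirmed = is_corroborated("john_doe", "CONTRIBUTED_TO", "project_a")
unconfirmed = is_corroborated("john_doe", "CONTRIBUTED_TO", "project_b")
```

Counting sources is a deliberately crude measure; it treats every system as equally trustworthy, which a real validation step might not want to assume.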