
05 Preparing Your Own Data

Kit Siu edited this page Aug 8, 2023 · 14 revisions

You can package your own data as an ingestion package that others can load into RACK-in-a-box (RiB). This page walks you through the following:

  • what to include in an ingestion package
  • how dateTime is formatted
  • how to populate dataInsertedBy to capture how an item was collected for insertion into RACK

Ingestion Package

The first step in creating an ingestion package is understanding what one is. An ingestion package is simply a zip file containing RACK data that can be loaded into a RACK-in-a-box instance. A well-formed ingestion package includes the data model (.owl files), instance data (.csv files), and a manifest file (manifest.yaml). When the package is properly created, everything required to load it into RACK is described in the manifest file.

Example Zip File Contents:

Example.zip
  |
  ├─manifest.yaml
  |
  ├─OwlModels
  |  ├─NewOnt.owl
  │  └─import.yaml
  |
  ├─nodegroups
  |  ├─QueryNodegroup.json
  │  └─store_data.csv
  |
  └─InstanceData
     ├─REQUIREMENT1.csv
     ├─REQUIREMENT2.csv
     └─import.yaml
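
A minimal sketch of zipping a prepared folder into a package with the layout above. The function name `build_ingestion_package` is hypothetical, and the check for a top-level manifest.yaml is an assumption based on the example layout, not a documented RACK requirement. The output zip is assumed to live outside the source folder.

```python
import zipfile
from pathlib import Path


def build_ingestion_package(source_dir, zip_path):
    """Zip every file under source_dir, keeping paths relative to it.

    Hypothetical helper: assumes manifest.yaml sits at the top level,
    as in the example layout, and that zip_path is outside source_dir.
    """
    root = Path(source_dir)
    if not (root / "manifest.yaml").is_file():
        raise FileNotFoundError("expected manifest.yaml at the package root")
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store POSIX-style relative names, e.g. InstanceData/import.yaml
                arcname = path.relative_to(root).as_posix()
                archive.write(path, arcname)
                names.append(arcname)
    return names
```

Calling `build_ingestion_package("Example", "Example.zip")` returns the list of archived paths, which can be compared against the files the manifest refers to before sharing the package.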

Anatomy of an Ingestion Package

An ingestion package typically consists of the following:

  1. Manifest File
  2. Data Model
  3. Nodegroups (Optional)
  4. Instance Data

Manifest

The manifest for an ingestion package is a simple file that identifies the relevant data model files, the instance data, and (optionally) the nodegroup files. It also specifies the model graphs and data graphs, and it can reference another manifest file. All file paths are resolved relative to the location of the manifest.

Example manifest.yaml contents:

name: 'Turnstile ingestion'

footprint:
  model-graphs:
    - http://rack001/model
  data-graphs:
    - http://rack001/do-178c

steps:
  - manifest: rack.yaml
  - model: ../GE-Ontology/OwlModels/import.yaml
  - data: ../RACK-Ontology/ontology/DO-178C/import.yaml

Another manifest, of the kind that a manifest step can reference:

name: 'RACK ontology'
description: 'Base ontology for assurance case curation'

footprint:
  model-graphs:
    - http://rack001/model

steps:
  - model:      ../RACK-Ontology/OwlModels/import.yaml
  - nodegroups: ../nodegroups/queries
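
To illustrate the rule that paths resolve relative to the manifest's own location, here is a small sketch. It is not the RACK CLI's implementation. The function name `resolve_steps` and the manifest location `manifests/turnstile.yaml` are hypothetical, and the steps are assumed to be already parsed into Python dictionaries.

```python
import posixpath


def resolve_steps(manifest_path, steps):
    """Return (step kind, path) pairs resolved against the manifest's folder."""
    base = posixpath.dirname(manifest_path)
    resolved = []
    for step in steps:
        for kind, target in step.items():
            # Join with the manifest's folder, then collapse any ".." segments.
            resolved.append((kind, posixpath.normpath(posixpath.join(base, target))))
    return resolved


# Steps as they appear in the first manifest above, already parsed.
steps = [
    {"manifest": "rack.yaml"},
    {"model": "../GE-Ontology/OwlModels/import.yaml"},
]
for kind, path in resolve_steps("manifests/turnstile.yaml", steps):
    print(kind, path)
```

With the hypothetical location above, `rack.yaml` resolves next to the manifest and the `../` path resolves to a sibling folder of `manifests`.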

Data Model

Data model files are provided in OWL format. Generating the .owl files is outside the scope of this article; they are typically created with another tool such as RITE or SADL. The import.yaml file lists the data model files to be loaded into RACK.

Example.zip/OwlModels/import.yaml contents:

# This file is intended to be used using the rack.py script found
# in RACK-Ontology/scripts/
#
# Script documentation is available in RACK-Ontology/scripts/README.md
files:
- NewOnt.owl

This says that a single OWL file, NewOnt.owl, is loaded into RACK.

Nodegroups (optional)

Nodegroups are provided in JSON format. Generating them is outside the scope of this article; the JSON files are typically created with SemTK. The store_data.csv file defines which nodegroups are loaded into RACK, along with the descriptive data to be included in SemTK's nodegroup store.

Example.zip/nodegroups/store_data.csv contents:

ID, comments, creator, jsonFile
Query for requirements, Nodegroup to query for requirements from the Ingestion Package, JackBlack, QueryNodegroup.json
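
Before zipping, it can help to confirm that every jsonFile listed in store_data.csv is actually present. The sketch below is a hypothetical pre-flight check, not part of RACK or SemTK. It assumes the JSON files sit in the same folder as store_data.csv, as in the example layout.

```python
import csv
from pathlib import Path


def missing_nodegroup_files(store_csv):
    """Return the jsonFile entries that do not exist next to store_data.csv."""
    store_path = Path(store_csv)
    missing = []
    with store_path.open(newline="") as handle:
        # skipinitialspace tolerates the "ID, comments, ..." spacing shown above.
        for row in csv.DictReader(handle, skipinitialspace=True):
            json_name = (row.get("jsonFile") or "").strip()
            if json_name and not (store_path.parent / json_name).is_file():
                missing.append(json_name)
    return missing
```

An empty list means every listed nodegroup file was found.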

Instance Data

Instance data is the final part of an ingestion package and is likely to contain the bulk of the data being loaded into RACK. Normally instance data takes the form of CSV files. The examples in this article use files generated by the Scraping Tool Kit, but any source of CSV can be used. Keep in mind that load order may matter, so take care that import.yaml loads the data in the correct order.

Example.zip/InstanceData/import.yaml contents:

data-graph: "http://rack001/data"
ingestion-steps:
#Phase1: Instance type declarations only
- {class: "http://arcos.rack/REQUIREMENTS#REQUIREMENT", csv: "REQUIREMENT1.csv"}

#Phase2: Only properties and relationships
- {class: "http://arcos.rack/REQUIREMENTS#REQUIREMENT", csv: "REQUIREMENT2.csv"}

More details on this format are available with the documentation of the RACK CLI, but the short explanation of this file is as follows:

data-graph: "http://rack001/data" -> Load this data into the specified graph

ingestion-steps: -> Load the data in the following order:

- {class: "http://arcos.rack/REQUIREMENTS#REQUIREMENT", csv: "REQUIREMENT1.csv"} -> Load REQUIREMENT1.csv into the REQUIREMENT class

- {class: "http://arcos.rack/REQUIREMENTS#REQUIREMENT", csv: "REQUIREMENT2.csv"} -> Then load REQUIREMENT2.csv into the REQUIREMENT class

A couple of items worth noting:

CSV paths may be given relative to the location of the yaml file, although the best practice is to create a separate ingestion step if another file location is needed.

This ingestion uses a two-phase approach: first all the instance type declarations are loaded; then all the properties and relationships are loaded. The two phases minimize order dependencies. Without them, an ENTITY object can end up shadowing the intended object. As an example, if you were adding two REQUIREMENTs, Sw-R-1 and Sys-R-1, and Sw-R-1 "satisfies" Sys-R-1, then a CSV describing this could look like:

REQUIREMENT2.csv:

identifier, satisfies_identifier
Sw-R-1, Sys-R-1
Sys-R-1,

If this is loaded into RACK, one of three outcomes occurs, depending on how the nodegroup is constructed:

  • A failed-lookup error, when the lookup of satisfies_identifier is "error if missing".
  • The expected two REQUIREMENTs being created, when satisfies_identifier is "create if missing" and the satisfies node is typed as a REQUIREMENT.
  • Three items being created in RACK: two REQUIREMENTs (Sw-R-1 and Sys-R-1) as well as an ENTITY (Sys-R-1), when satisfies_identifier is "create if missing" and the satisfies node is typed as an ENTITY.

This is because when the lookup for Sys-R-1 is performed during the ingestion of Sw-R-1, Sys-R-1 does not exist yet; it is only created by the next row. With a two-phase ingestion, all the items are first created from a CSV that contains only the identifiers:

REQUIREMENT1.csv:

identifier
Sw-R-1
Sys-R-1

REQUIREMENT2.csv is then ingested correctly regardless of how the nodegroup is constructed: the lookup of Sys-R-1 while ingesting Sw-R-1 finds the intended REQUIREMENT, because it was created when REQUIREMENT1.csv was ingested.
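
The two-phase idea can be sketched in a few lines of Python. This is an illustration only, not how the Scraping Tool Kit or the RACK CLI produces these files. The helper names are hypothetical: `identifier_only_csv` derives a phase-1 file from a full CSV, and `forward_references` lists the rows whose lookup target is only created by a later row, which is the order dependency described above.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text; skipinitialspace tolerates 'a, b' style separators."""
    return list(csv.DictReader(io.StringIO(csv_text), skipinitialspace=True))


def identifier_only_csv(csv_text, id_column="identifier"):
    """Build the phase-1 CSV: one column holding each distinct identifier."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([id_column])
    seen = set()
    for row in read_rows(csv_text):
        ident = (row[id_column] or "").strip()
        if ident and ident not in seen:
            seen.add(ident)
            writer.writerow([ident])
    return out.getvalue()


def forward_references(csv_text, id_column="identifier",
                       lookup_column="satisfies_identifier"):
    """List (row, target) pairs whose target is not created by an earlier row."""
    seen, pending = set(), []
    for row in read_rows(csv_text):
        ident = (row[id_column] or "").strip()
        target = (row.get(lookup_column) or "").strip()
        if target and target not in seen:
            pending.append((ident, target))
        seen.add(ident)
    return pending


requirement2 = "identifier, satisfies_identifier\nSw-R-1, Sys-R-1\nSys-R-1,\n"
print(identifier_only_csv(requirement2))
print(forward_references(requirement2))
```

For the REQUIREMENT2.csv contents shown earlier, the first call reproduces the identifier-only file and the second reports that Sw-R-1 refers to Sys-R-1 before Sys-R-1 has been created.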

How dateTime is formatted

To provide dateTime information in a CSV data file, use a value like "Thu Mar 23 03:03:16 PDT 2017" (this is the value used for the column generatedAtTime_SoftwareUnitTestResult in the data file SoftwareUnitTestResult.csv in the Turnstile-Ontology). When this value, which is in a local dateTime format, is ingested by SemTK, it is converted to an ISO 8601 timestamp with a UTC offset, which appears as "2017-03-23T03:03:16-07:00" when an appropriate query is run.
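
The relationship between the two strings can be reproduced with a short Python sketch. This is not SemTK's conversion code. The abbreviation table `TZ_OFFSET_HOURS` is a hypothetical stand-in, needed because Python's `strptime` does not reliably parse arbitrary zone abbreviations such as PDT.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup table: extend it with the abbreviations your data uses.
TZ_OFFSET_HOURS = {"PST": -8, "PDT": -7, "EST": -5, "EDT": -4, "UTC": 0}


def local_to_iso(value):
    """Convert 'Thu Mar 23 03:03:16 PDT 2017' into an ISO 8601 string."""
    parts = value.split()
    tz_abbrev = parts[4]
    # Parse everything except the zone abbreviation, then attach the offset.
    naive = datetime.strptime(" ".join(parts[:4] + parts[5:]),
                              "%a %b %d %H:%M:%S %Y")
    offset = timezone(timedelta(hours=TZ_OFFSET_HOURS[tz_abbrev]))
    return naive.replace(tzinfo=offset).isoformat()


print(local_to_iso("Thu Mar 23 03:03:16 PDT 2017"))  # 2017-03-23T03:03:16-07:00
```

The wall-clock time is unchanged; only the zone abbreviation is replaced by its numeric offset.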

Additional resources: https://github.com/ge-semtk/semtk/wiki/Ingestion-Type-Handling

Populating dataInsertedBy

dataInsertedBy is a property on all ENTITYs, ACTIVITYs, and AGENTs (THINGs within the ontology). It is intended to capture how the item was collected for insertion into RACK. This differs from the more standard relationships to ACTIVITYs in that it is not related to the creation of the data, but rather to the collection of the data.

For example, when ENTITYs are extracted from a PDF, the extraction process should be captured by the dataInsertedBy ACTIVITY, while the creation of the originating ENTITYs would be captured by wasGeneratedBy or some sub-property of wasGeneratedBy.

The best practice for RACK is that all items should have a dataInsertedBy relationship to an ACTIVITY. This ACTIVITY should have at a minimum:

  1. identifier
  2. endedAtTime
  3. wasAssociatedWith
  4. description

The AGENT that is the target of wasAssociatedWith should have at a minimum:

  1. identifier
  2. description
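
The two minimum-field lists can be turned into a simple pre-ingestion check. The sketch below is hypothetical: it assumes each ACTIVITY or AGENT has already been read into a dictionary keyed by property name, which may not match the column names in your CSV files. The sample values are invented for illustration.

```python
# Minimum properties taken from the two lists above.
ACTIVITY_MINIMUM = ("identifier", "endedAtTime", "wasAssociatedWith", "description")
AGENT_MINIMUM = ("identifier", "description")


def missing_fields(record, required):
    """Return the required property names that are absent or blank."""
    missing = []
    for name in required:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# Invented example records, not taken from any RACK data set.
activity = {
    "identifier": "ReqExtraction-1",
    "endedAtTime": "Thu Mar 23 03:03:16 PDT 2017",
    "wasAssociatedWith": "ExtractionTool",
    "description": "",
}
print(missing_fields(activity, ACTIVITY_MINIMUM))  # ['description']
```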

A single THING may have multiple dataInsertedBy relationships if the item was identified in multiple sources. For example, a REQUIREMENT may be identified while extracting from a Requirements Specification, and it may also be found as trace data when TESTs are extracted from a Test Procedure. In this case the REQUIREMENT should have a dataInsertedBy relationship to both ACTIVITYs (extraction from the Requirements Specification and extraction from the Test Procedure).

Searching for dataInsertedBy information

Because this dataInsertedBy information is so useful, a predefined nodegroup, query Get DataInsertedBy From Guid, is preloaded into RACK; it lets you find this information for a given guid. To run the query in SemTK, load the nodegroup and select run; a dialog will let you select the guid for which you want the dataInsertedBy information. This nodegroup can also be run programmatically, just like any other runtime-constrained nodegroup.

An important thing to note, especially when running this query programmatically, is that, as described above, a single THING can and often will have multiple dataInsertedBy ACTIVITYs. Any handling of the returned data should therefore account for multiple rows.
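
One way to handle the multiple rows is to group them by guid before further processing. The sketch below assumes the query results have already been fetched into a list of dictionaries. The column names `guid` and `activity` are hypothetical and should be replaced with the columns the nodegroup actually returns.

```python
from collections import defaultdict


def activities_by_guid(rows, guid_column="guid", activity_column="activity"):
    """Collect every dataInsertedBy ACTIVITY returned for each guid."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[guid_column]].append(row[activity_column])
    return dict(grouped)


# Invented rows: one THING found by two separate extraction ACTIVITYs.
rows = [
    {"guid": "Sw-R-1", "activity": "RequirementsSpecExtraction"},
    {"guid": "Sw-R-1", "activity": "TestProcedureExtraction"},
]
print(activities_by_guid(rows))
```

Code that expects exactly one ACTIVITY per guid would silently drop the second extraction here, which is why the grouping returns a list for each guid.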