
05 Preparing Your Own Data


You can create your own data in an ingestion package that can be shared with others to load into RiB. This page will walk you through the following:

  • what to include in an ingestion package
  • how dateTime is formatted
  • how to populate dataInsertedBy to capture how an item was collected for insertion into RACK

Ingestion Package

The first step in creating an ingestion package is understanding what an ingestion package is. An ingestion package is simply a zip file containing RACK data that can be loaded into a RACK-in-a-box instance. A well-formed ingestion package should have no dependencies outside the base RACK Ontology and should include a RACK CLI ingestion script. When properly created, loading an ingestion package into RACK should require nothing more than unzipping the file and running the load script.
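
For example, assuming the RACK CLI virtual environment is already active, loading the example package described below could be as simple as the following (the -d Example destination directory is just for illustration):

unzip Example.zip -d Example
cd Example
bash Load-IngestionExample.sh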

Example Zip File Contents:

Example.zip
  |
  ├─Load-IngestionExample.sh
  |
  ├─OwlModels
  |  ├─NewOnt.owl
  │  └─import.yaml
  |
  ├─nodegroups
  |  ├─IngestNewOnt.json
  │  └─store_data.csv
  |
  └─InstanceData
     ├─NewOnt1.csv
     ├─REQUIREMENT1.csv
     ├─NewOnt2.csv
     ├─REQUIREMENT2.csv
     └─import.yaml

Anatomy of an Ingestion Package

An Ingestion Package will typically consist of the following:

  1. An Ingestion Script
  2. Ontology Updates (Optional)
  3. Additional Nodegroups (Optional)
  4. Instance Data

Ingestion Script

The ingestion script for an ingestion package is a simple shell script that invokes the RACK CLI to ingest any Ontology, Nodegroups, or Instance Data into RACK. Typically it is best to first ingest any Ontology updates, followed by any Nodegroups, and lastly the Instance Data.

Example.zip/Load-IngestionExample.sh contents:

#!/bin/bash
# Copyright (c) 2020, General Electric Company and Galois, Inc.
set -eu
BASEDIR=$(dirname "$0")
echo "$BASEDIR"
if ! command -v rack > /dev/null
then
    cat <<-END
        ERROR: rack cli tool not found in PATH

        Installation instructions are available at
        https://github.com/ge-high-assurance/RACK/wiki/RACK-CLI#install-dependencies
        or locally in README.md

        If you've already installed RACK CLI, please activate your virtual environment

        macOS/Linux: source venv/bin/activate
        Windows:     venv\\Scripts\\activate.bat
        PowerShell:  venv\\Scripts\\Activate.ps1
    END
    exit 1
fi

rack model import $BASEDIR/OwlModels/import.yaml

rack nodegroups import $BASEDIR/nodegroups

rack data import --clear $BASEDIR/InstanceData/import.yaml

Those familiar with shell scripting will likely need no explanation, but for others a short description is as follows:

  1. Find the folder in which this script is located and store it in the variable BASEDIR
  2. Check whether the RACK CLI is available. If not, print a message about activating the CLI and quit the script.
  3. Use the RACK CLI to load the ontology specified by an import.yaml file located in the OwlModels subdirectory of BASEDIR
  4. Use the RACK CLI to load the nodegroups specified by a store_data.csv file located in the nodegroups subdirectory of BASEDIR
  5. Use the RACK CLI to clear existing data and then load the instance data specified by an import.yaml file located in the InstanceData subdirectory of BASEDIR
    • Note: at present, the recommended order of operation is to first clear existing data in RACK, then ingest an entire set of instance data. In the future, we will support the ability to update and "fit" new data into RACK.

This order of loading into RACK is typically important--the loading of Instance Data may be dependent on Additional Nodegroups, and those nodegroups may be dependent on the Ontology Updates. While this may not always be the case, it is best to use this ingestion order to mitigate any risk of unknown dependencies.

Additionally, multiple lines for each RACK CLI step can be included. If multiple lines are used, care should be taken with the ingestion of Instance Data to ensure that the lines are loaded in the correct order. Typically data should be ingested from a higher level to a lower level (i.e. System Requirements are ingested before Software Requirements, and Software Requirements are ingested before Source or Testing Data).
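
For example, a load script that ingests several sets of instance data in that order might contain lines like the following (a minimal sketch; the import yaml file names are hypothetical, and only the first import clears existing data):

rack data import --clear $BASEDIR/InstanceData/import_SystemRequirements.yaml
rack data import $BASEDIR/InstanceData/import_SoftwareRequirements.yaml
rack data import $BASEDIR/InstanceData/import_TestData.yaml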

Ontology Updates

Ontology Updates are provided in OWL format. Generation of the OWL files is outside the scope of this article, but typically they are created through the use of another tool such as SADL. One or more OWL models can be housed in the directory, and the import.yaml file defines what is to be loaded into RACK at ingestion time.

Example.zip/OwlModels/import.yaml contents:

# This file is intended to be used using the rack.py script found
# in RACK-Ontology/scripts/
#
# Script documentation is available in RACK-Ontology/scripts/README.md
files:
- NewOnt.owl

This simply says that the single OWL file is to be loaded into RACK.
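
If a package contained more than one OWL model, each file would simply be listed under files; for example (AnotherOnt.owl is a hypothetical second model):

files:
- NewOnt.owl
- AnotherOnt.owl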

Additional Nodegroups

Additional Nodegroups are provided in JSON format; their generation is outside the scope of this article, but typically the JSON files are created by using SemTK. One or more nodegroups can be present in the directory, and the store_data.csv defines which nodegroups are to be loaded into RACK, as well as the description data to be included in SemTK's Nodegroup store.

Example.zip/nodegroups/store_data.csv contents:

ID, comments, creator, jsonFile
IngestNewOnt, Nodegroup for ingesting NewOnt class from the Ingestion Package Example, Example, IngestNewOnt.json
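
A package that provides more than one nodegroup adds one row per nodegroup; for example (the second row here is purely illustrative):

ID, comments, creator, jsonFile
IngestNewOnt, Nodegroup for ingesting NewOnt class from the Ingestion Package Example, Example, IngestNewOnt.json
IngestNewOntRelations, Nodegroup for ingesting NewOnt relationships from the Ingestion Package Example, Example, IngestNewOntRelations.json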

Instance Data

Instance Data is the final part of an ingestion package and is likely to contain the bulk of the data being loaded into RACK. Normally Instance Data is in the form of CSV files. The examples in this article use files generated by the Scraping Tool Kit; however, any source of CSV can be used. Just be aware that load order may be important, and care should be taken with the import.yaml file to load the data in the correct order.

Example.zip/InstanceData/import.yaml contents:

data-graph: "http://rack001/data"
ingestion-steps:
#Phase1: Instance type declarations only
- {nodegroup: "IngestNewOnt", csv: "NewOnt1.csv"}
- {nodegroup: "ingest_REQUIREMENT", csv: "REQUIREMENT1.csv"}

#Phase2: Only properties and relationships
- {nodegroup: "IngestNewOnt", csv: "NewOnt2.csv"}
- {nodegroup: "ingest_REQUIREMENT", csv: "REQUIREMENT2.csv"}

More details on this format are available in the RACK CLI documentation, but the short explanation of this file is as follows:

data-graph: "http://rack001/data" -> Load this data into the specified graph

ingestion-steps: -> Load the data in the following order:

- {nodegroup: "IngestNewOnt", csv: "NewOnt1.csv"} -> Load NewOnt1.csv using the Nodegroup with an ID of IngestNewOnt

- {nodegroup: "ingest_REQUIREMENT", csv: "REQUIREMENT1.csv"} -> Load REQUIREMENT1.csv using the Nodegroup with an ID of ingest_REQUIREMENT

- {nodegroup: "IngestNewOnt", csv: "NewOnt2.csv"} -> Load NewOnt2.csv using the Nodegroup with an ID of IngestNewOnt

- {nodegroup: "ingest_REQUIREMENT", csv: "REQUIREMENT2.csv"} -> Load REQUIREMENT2.csv using the Nodegroup with an ID of ingest_REQUIREMENT

A few items are worth noting:

CSV paths may be relative to the location of the yaml file, although best practice is to create a separate ingestion step in the ingestion script if another file location is needed.

NewOnt and IngestNewOnt use the Ontology and Nodegroups that are added by this ingestion package. If those steps are not performed prior to loading the instance data, an error will occur.

This ingestion uses a two-phase approach: first all the instance type declarations are loaded; then all the properties and relationships are loaded. This is done to minimize order dependencies. If this two-phase approach is not used, a situation can occur where ENTITY objects shadow the intended objects. As an example, if you were adding two REQUIREMENTs, Sw-R-1 and Sys-R-1, and Sw-R-1 "satisfies" Sys-R-1, then a CSV that describes this could look like:

REQUIREMENT2.csv:

identifier, satisfies_identifier
Sw-R-1, Sys-R-1
Sys-R-1,

If this is loaded into RACK, one of three outcomes occurs depending on how the nodegroup is constructed:

  • An error from a failed lookup, when the lookup of satisfies_identifier is "error if missing".
  • The expected two REQUIREMENTs being created, when satisfies_identifier is "create if missing" and the satisfies node is typed as a REQUIREMENT.
  • Three items being created in RACK: two REQUIREMENTs (Sw-R-1 and Sys-R-1) as well as an ENTITY (Sys-R-1), when satisfies_identifier is "create if missing" and the satisfies node is typed as an ENTITY.

This is because when the lookup for Sys-R-1 is performed during the ingestion of Sw-R-1, Sys-R-1 does not exist yet, since it is created by the next row. By doing a two-phase ingestion, all the items are first created with a CSV that contains only the identifier:

REQUIREMENT1.csv:

identifier
Sw-R-1
Sys-R-1

Then REQUIREMENT2.csv will be ingested correctly regardless of how the nodegroup is constructed, as the lookup of Sys-R-1 while ingesting Sw-R-1 will find the intended REQUIREMENT, since it was created as part of ingesting REQUIREMENT1.csv.
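
The NewOnt CSV files in the example package follow this same two-phase pattern. Their exact columns depend on the IngestNewOnt nodegroup, but as a sketch they could look like the following (the description column and identifier values are hypothetical):

NewOnt1.csv:

identifier
NewOnt-1
NewOnt-2

NewOnt2.csv:

identifier, description
NewOnt-1, First instance of the NewOnt class
NewOnt-2, Second instance of the NewOnt class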

Advanced Ingestion Techniques

Since ingestion packages are ultimately data plus a shell script containing instructions for loading that data into RACK, shell scripting capabilities can be used to make the data more dynamic and adaptable to the host environment. The examples below are not the limits of what can be accomplished within ingestion packages; they simply serve as illustrations.

Variable Expansion

The first example uses shell script to update a file with a local file reference. Within the CSV files in an ingestion package, a variable expansion tag {{URLBASE}} can be defined. The ingestion shell script then replaces this tag with a URL base determined at ingestion time, prior to loading the data into RACK. This allows a FILE entityUrl to be defined with a URL that points to a file included within the ingestion package, so that if someone follows the URL it will take them to the specific file that was included in the ingestion package.

Notes: This assumes that the ingestion package is unzipped to a local hard drive and not removed following the ingestion of the data. Furthermore, while this example makes accommodations for execution on Linux or Windows systems, the approach demonstrated below is limited to the situation where the machine hosting the ingestion package is the same one following the URL. Modifications could be made to adapt it to a shared drive or file server, but that would be situation dependent. This example also assumes that you will not move the files after ingestion and then try to re-ingest from the new location, as the script modifies the CSV files in place (i.e. no expansion variables remain). To re-ingest with the files in a new location, replace the modified CSV files with the original versions from the zipped ingestion package (i.e. the ones with the expansion variables). Modifications could be made to address this, but that is beyond the scope of this example.

Example CSV File:

identifier, entityUrl,
fileName,{{URLBASE}}folderFromIngestionPackage/fileName.txt,

This example CSV file uses a tag within the data to identify where the expansion variable should be populated. The script below is a section of shell script that needs to be added to the standard ingestion script. It should be included before the ingestion of the CSV files, but after the definition of the $BASEDIR variable.

Example Shell Script:

# Determine the URL base from the shell's $OSTYPE
if [[ "$OSTYPE" == "cygwin" || "$OSTYPE" == "msys" ]]; then
    # Windows-style path with forward slashes (e.g. C:/...) via cygpath
    URLBASE="file://$(cygpath -m "$BASEDIR")"
else
    URLBASE="file://$BASEDIR"
fi

echo "Updating CSV files with URL Base ..."
# Replace the {{URLBASE}} tag in every CSV file under $BASEDIR
find "$BASEDIR" -name "*.csv" -exec sed -i -e "s|{{URLBASE}}|$URLBASE|g" {} +

The behavior that is being added is:

  1. Determine the URL base from the shell's $OSTYPE
    • If the script is being run in a cygwin or msys shell, the path should use Windows formatting ("c:/"). This is provided by the cygpath utility; the -m option causes the path to use forward slashes, resulting in a valid URL.
    • Otherwise just use the file path of $BASEDIR, which is the location of the ingestion script; this uses unix-like file paths ("/home/username").
  2. Find all the CSV files in the package directory ($BASEDIR) using the find utility, and for each file use sed to replace {{URLBASE}} with the $URLBASE determined in the first step.

Note: the BASH shell (#!/bin/bash) should be used, as $OSTYPE is not available in the original Bourne shell commonly associated with simple shell scripts (#!/bin/sh).

After the execution of these shell commands the CSV files will no longer contain the expansion variable {{URLBASE}}, and the resulting value is OS dependent:

Example Resulting CSV File (in a windows environment):

identifier, entityUrl,
fileName,file://C:/UnzippedLocation/folderFromIngestionPackage/fileName.txt,

Example Resulting CSV File (in a unix-like environment):

identifier, entityUrl,
fileName,file:///home/username/UnzippedLocation/folderFromIngestionPackage/fileName.txt,

Now that the CSV files have been updated and the expansion variable replaced, regular data ingestion can continue as described above.

How dateTime is formatted

Here is how to provide dateTime information in a CSV data file. Use a value like "Thu Mar 23 03:03:16 PDT 2017" in the CSV file (this is the value used for the column generatedAtTime_SoftwareUnitTestResult in the data file SoftwareUnitTestResult.csv in the Turnstile-Ontology). When this dateTime value, which is in local dateTime format, is ingested in SemTK, it is converted to ISO 8601 format with a timezone offset and is shown as "2017-03-23T03:03:16-07:00" when an appropriate query is run.
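
As a sketch, a row of SoftwareUnitTestResult.csv providing this value could look like the following (the identifier column name and value are illustrative assumptions):

identifier, generatedAtTime_SoftwareUnitTestResult
SwUnitTestResult-1, Thu Mar 23 03:03:16 PDT 2017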

Additional resources: https://github.com/ge-semtk/semtk/wiki/Ingestion-Type-Handling

Populating dataInsertedBy

dataInsertedBy is a property on all ENTITYs, ACTIVITYs, and AGENTs (THINGs within the ontology). It is intended to capture how the item was collected for insertion into RACK. This differs from the more standard relationships to ACTIVITYs, as it is not related to the creation of the data but rather to the collecting of the data.

As an example, when ENTITYs are extracted from a PDF, the extraction process should be captured by the dataInsertedBy ACTIVITY, while the creation of the original ENTITYs would be captured by wasGeneratedBy or some sub-property of wasGeneratedBy.

The best practice for RACK is that all items should have a dataInsertedBy relationship to an ACTIVITY. This ACTIVITY should have at a minimum:

  1. identifier
  2. endedAtTime
  3. wasAssociatedWith
  4. description

The AGENT that is the target of wasAssociatedWith should have at a minimum (see the example CSVs after this list):

  1. identifier
  2. description
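
As a sketch, CSV files describing such an ACTIVITY and AGENT could look like the following (the file names, column names, and values are hypothetical and depend on the ingestion nodegroups used):

INGESTION_ACTIVITY.csv:

identifier, endedAtTime, wasAssociatedWith_identifier, description
RequirementExtraction-2022-03-01, Tue Mar 01 12:00:00 EST 2022, ScrapingToolKit, Extraction of REQUIREMENTs from the Requirements Specification

INGESTION_AGENT.csv:

identifier, description
ScrapingToolKit, Script used to extract data for insertion into RACK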

A single THING may have multiple dataInsertedBy relationships if the item was identified in multiple sources. For example, a REQUIREMENT may be identified as part of extracting from a Requirements Specification, and it may also be found as trace data in the extraction of TESTs from a test procedure. In this case the requirement should have a dataInsertedBy relationship to both ACTIVITYs (extraction from the Requirements Specification and extraction from the Test Procedure).

Searching for dataInsertedBy information

Given the usefulness of this dataInsertedBy information, a predefined nodegroup, query Get DataInsertedBy From Guid, is preloaded into RACK; it allows one to find this information for a given guid. To run the query in SemTK, simply load the nodegroup and select run; a dialog will be presented that allows you to select the guid for which you wish to get the 'dataInsertedBy' information. This nodegroup can also be run programmatically, just as any other runtime-constrained nodegroup can be.

An important thing to note, especially when running this query programmatically, is that, as described above, a single THING can and often will have multiple dataInsertedBy ACTIVITYs returned, so any handling of the resulting data should account for multiple rows.