Skip to content

ontodev/valve.rs

Repository files navigation

ontodev_valve/valve.rs

Valve - A lightweight validation engine written in rust.

Table of contents

ToC Generated using markdown-toc

Design and concepts

Valve reads in the contents of user-defined data tables and represents them using a spreadsheet-like structure incorporating annotated cells that may subsequently be edited and saved. The annotations are of two kinds:

  1. Data validation messages indicating that the value of a particular cell violates one or more of a number of user-defined data validation rules.
  2. Data update messages indicating that the value of a particular cell has changed as well as the details of the change.

After reading and optionally validating the initial data, the validated information is loaded into a database that can then be queried directly using your favourite SQL client (e.g., sqlite3 or psql) as in the examples below. The currently supported databases are SQLite and PostgreSQL. Valve further provides an Application-programmer interface (API) that can be used to incorporate Valve's data validation and manipulation functionality and to visualize the validated data within your own rust project. For an example of a rust project that uses Valve, see ontodev/nanobot.rs. Finally, Valve also provides a Command line interface (CLI) for some of its more important functions like load, save, and guess.

The Valve database

Data tables and views

Below is an example of one of the views Valve provides into a data table in which three of the five rows contain invalid data, and one row has been edited since it was initially loaded. The output is from a psql session connected to a PostgreSQL database called 'valve_postgres'.

valve_postgres=> select * from table11_text_view
row_number row_order message history rank name employee_id
1 1000 Eastern division head Denise Laviolette denise_l
2 2000 Southwest division head Eddie Cahill eddie_c
3 3000 [{"column":"employee_id","value":"denise_l","level":"error","rule":"key:primary","message":"Values of employee_id must be unique"}] [[{"column":"rank","level":"update","message":"Value changed from 'intern' to 'Chief executive officer'","old_value":"intern","value":"Chief executive officer"}]] Chief executive officer Denise Laurier denise_l
4 4000 [{"column":"employee_id","value":"eddie_c","level":"error","rule":"key:primary","message":"Values of employee_id must be unique"}] Development team lead Edward Coulombe eddie_c
5 5000 [{"column":"employee_id","value":"eddie_c","level":"error","rule":"key:primary","message":"Values of employee_id must be unique"}] Test team lead Ed Crantivera eddie_c
(5 rows)

Here, row_number is a fixed identifier assigned to each row at the time of creation, while row_order, which in principle might change multiple times throughout the lifetime of a given row, is used to manipulate the logical order of rows in the table. The message column, in this example, conveys that the value 'denise_l' of the column "employee_id" in row 3, and the value 'eddie_c' of the same column in rows 4 and 5, are duplicates of already existing values of that column, which happens to be a primary key, and are therefore invalid. We can also see, from the history column, that there has been one change, in which one of the values in row 3, namely the value of the column "rank", has been changed from 'intern' to 'Chief executive officer'. Finally, the names of the columns to the right of the history column correspond to the column names of the source table and will therefore vary from table to table. Normally these column names are specified in the header of a '.tsv' file from which the source data is read, though see below for alternate input data formats and associated table options. In any case the data contained in the columns to the right of the history column will exactly match the contents of the source table unless the data has been edited since it was initially loaded.

For the example below we will assume that a file named artists.tsv exists on your hard disk in your current working directory with the following contents:

name type number_of_members health_insurance_provider health_insurance_id health_insurance_id_suffix
The Jimi Hendrix Experience band 3 Pittsfield Medical 9834564422
Cream band 3 Blue Cross 9388883311 XX54
Jennifer Lopez solo Medi-Assist MA67886666881
Janice Joplin solo Pittsfield Medical FFFHYZDFJ432
Chrissie Hynde solo Blue Cross 4422393877
Van Halen band 5 Blue Cross 9476587117 BBDC
Van Morrison solo Medi-Assist MA67920004571
Paul McCartney solo Medi-Assisr MA60768763987
The Band band five Blue Cross 0831133887
Bob Dylan solo Pittsfield Medical FFF GYU ZKJ 954
Van Halen band Blue Cr. 9476587117

We will also assume that a file named providers.tsv exists on your hard disk in your current working directory with the following contents:

name address
Blue Cross 123 Fake Street, Fake Town, USA, 55123
Medi-Assist 933 Phoney Boulevard, Accra, Ghana, GA008
Pittsfield Medical 510 North Street, Pittsfield, MA, 01201

In order for Valve to read this table it must first be configured to do so. This is done using a number of special data tables called configuration tables, usually represented using further '.tsv' files, that contain information about:

  1. Where to find the data tables and other general properties of each managed data table (i.e., the table table).
  2. The datatypes represented in the various data tables (i.e., the datatype table).
  3. Information about the columns of each data table: their associated datatypes, data dependencies between the values in one table column and some other one, and any other constraints or restrictions on individual columns (i.e., the column table).
  4. Rules constraining the joint values of two different cells in a single given row of data (i.e., the rule table).

For our example we will assume that Valve's configuration tables contain the following entries:

  • Table table
table path description type options
artists artists.tsv
providers providers.tsv
  • Column table
table column label nulltype default datatype structure description
artists name trimmed_line primary
artists type trimmed_line
artists number_of_members empty integer
artists health_insurance_provider trimmed_line from(providers.name)
artists health_insurance_id nonspace
artists health_insurance_id_suffix empty word
providers name trimmed_line primary
providers address text
  • Datatype table
datatype parent condition description sql_type
integer nonspace match(/-?\d+/) INTEGER
nonspace trimmed_line exclude(/\s/)
trimmed_line line match(/\S([^\n]\S)/)
line text exclude(/\n/)
word nonspace exclude(/\W/)
empty text equals('')
text TEXT
  • Rule table
table when_column when_condition then_column then_condition level description
artists type equals(band) number_of_members not null error a band must specify the number of members
artists health_insurance_provider equals('Blue Cross') health_insurance_id_suffix not null error a health insurance id suffix is required for Blue Cross members
artists health_insurance_provider equals('Pittsfield Medical') health_insurance_id word error a Pittsfield Medical ID must consist in a single word

For the meanings of all of the columns in the configuration tables above, see the section on configuration. In the rest of this section we'll refrain from explaining the meaning of a particular configuration table column unless and until it becomes relevant to our example. What is relevant at this point is only that each configuration table is also a data table whose contents are themselves subject to validation by Valve. In other words Valve will not necessarily fail to run if there are errors in its configuration (as long as those errors aren't critical) and it can moreover help to identify what those errors are. The upshot is that almost everything I mention below regarding artists also applies to the special configuration tables table, column, datatype and rule unless otherwise noted.

Once it has been read in by Valve from its source file, a given logical table will be represented, in the database, by between one and two database tables and by as many as two database views. In our example, the source data, contained in the file 'artists.tsv', represents (according to the table table configuration) a normal data table with the default options set, which means that all four database tables and views will be created. These are:

  1. artists: The database table, having the same name as the logical table, that normally contains the bulk of the table's data.
  2. artists_conflict (only when the conflict option is set): Valve aims to represent user data whether or not that data is valid. When it is not valid, however, this presents an obstacle when it comes to representing it in a relational database like SQLite or PostgreSQL. For instance, if a table contains a primary key column, then data rows containing values for the column that already exist in the table should be marked as invalid, but still somehow represented by Valve. However the implied database constraint on the database table will prevent us from inserting duplicate values to a primary key column. A similar issue arises for unique and foreign key constraints. To get around the database limitations that are due to database constraints such as primary, unique, or foreign keys, Valve constructs a _conflict table that is identical to the normal version of the table, but that does not include those keys. In our example, any rows containing duplicate values of the primary key column will be inserted to the database table called artists_conflict, while the other rows will be inserted to artists.
  3. artists_view (only when the conflict option is set): When Valve validates a row from the logical table, artists, the validation messages it generates are not stored in the database table called artists but instead in a special internal database table called message. Similarly, when Valve adds, updates, or deletes a row from artists, a record of the change is not stored in artists, but in a special internal database table called history. Because it can be convenient for information about the validity of a particular cell to be presented side by side with the value of the cell itself, Valve constructs a table called artists_view which combines information from artists and artists_conflict with the message and history tables.
  4. artists_text_view (only when the conflict option is set): In addition to the restrictions associated with primary, unique, and foreign key constraints, the database will also not allow us to represent values that are invalid due to an incorrect SQL type, in particular when one attempts to insert a non-numeric string value to a numeric column of a database table. To get around this limitation, Valve constructs a database view called artists_text_view. Unlike artists_view, which is defined such that the datatype of each column in artists_view exactly matches the datatype of the corresponding datatype in artists, in artists_text_view all of the columns are cast to TEXT so that no SQL datatype errors can occur when representing the data.

After loading the data from artists.tsv into the database, these four tables and views will be found to have the following contents:

valve_postgres=> select * from artists;
row_number row_order name type number_of_members health_insurance_provider health_insurance_id health_insurance_id_suffix
1 1000 The Jimi Hendrix Experience band 3 Pittsfield Medical 9834564422
2 2000 Cream band 3 Blue Cross 9388883311 XX54
3 3000 Jennifer Lopez solo Medi-Assist MA67886666881
4 4000 Janice Joplin solo Pittsfield Medical FFFHYZDFJ432
6 6000 Van Halen band 5 Blue Cross 9476587117 BBDC
7 7000 Van Morrison solo Medi-Assist MA67920004571
(6 rows)

valve_postgres=> select * from artists_conflict
row_number row_order name type number_of_members health_insurance_provider health_insurance_id health_insurance_id_suffix
5 5000 Chrissie Hynde solo Blue Cross 4422393877
8 8000 Paul McCartney solo Medi-Assisr MA60768763987
9 9000 The Band band Blue Cross 0831133887
10 10000 Bob Dylan solo Pittsfield Medical FFF GYU ZKJ 954
11 11000 Van Halen band 5 Pittsfield Med. 9476587117
(5 rows)

valve_postgres=> select * from artists_view order by row_number;
row_number row_order name type number_of_members health_insurance_provider health_insurance_id health_insurance_id_suffix message history
1 1000 The Jimi Hendrix Experience band 3 Pittsfield Medical 9834564422
2 2000 Cream band 3 Blue Cross 9388883311 XX54
3 3000 Jennifer Lopez solo Medi-Assist MA67886666881
4 4000 Janice Joplin solo Pittsfield Medical FFFHYZDFJ432
5 5000 Chrissie Hynde solo Blue Cross 4422393877 [{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"}]
6 6000 Van Halen band 5 Blue Cross 9476587117 BBDC
7 7000 Van Morrison solo Medi-Assist MA67920004571
8 8000 Paul McCartney solo Medi-Assisr MA60768763987 [{"column":"health_insurance_provider","value":"Medi-Assisr","level":"error","rule":"key:foreign","message":"Value 'Medi-Assisr' of column health_insurance_provider is not in providers.name"}]
9 9000 The Band band Blue Cross 0831133887 [{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"},{"column":"number_of_members","value":"five","level":"error","rule":"datatype:integer","message":"number_of_members should be a positive or negative integer"}]
10 10000 Bob Dylan solo Pittsfield Medical FFF GYU ZKJ 954 [{"column":"health_insurance_id","value":"FFF GYU ZKJ 954 ","level":"error","rule":"datatype:nonspace","message":"health_insurance_id should be text without whitespace"},{"column":"health_insurance_provider","value":"Pittsfield Medical","level":"error","rule":"rule:health_insurance_provider-2","message":"a Pittsfield Medical health insurance id must be a single word"}]
11 11000 Van Halen band 5 Pittsfield Med. 9476587117 [{"column":"health_insurance_provider","value":"Pittsfield Med.","level":"error","rule":"key:foreign","message":"Value 'Pittsfield Med.' of column health_insurance_provider is not in providers.name"},{"column":"name","value":"Van Halen","level":"error","rule":"key:primary","message":"Values of name must be unique"}]
(11 rows)

valve_postgres=> select * from artists_text_view order by row_number;
row_number row_order name type number_of_members health_insurance_provider health_insurance_id health_insurance_id_suffix message history
1 1000 The Jimi Hendrix Experience band 3 Pittsfield Medical 9834564422
2 2000 Cream band 3 Blue Cross 9388883311 XX54
3 3000 Jennifer Lopez solo Medi-Assist MA67886666881
4 4000 Janice Joplin solo Pittsfield Medical FFFHYZDFJ432
5 5000 Chrissie Hynde solo Blue Cross 4422393877 [{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"}]
6 6000 Van Halen band 5 Blue Cross 9476587117 BBDC
7 7000 Van Morrison solo Medi-Assist MA67920004571
8 8000 Paul McCartney solo Medi-Assisr MA60768763987 [{"column":"health_insurance_provider","value":"Medi-Assisr","level":"error","rule":"key:foreign","message":"Value 'Medi-Assisr' of column health_insurance_provider is not in providers.name"}]
9 9000 The Band band five Blue Cross 0831133887 [{"column":"health_insurance_provider","value":"Blue Cross","level":"error","rule":"rule:health_insurance_provider-1","message":"a health insurance id suffix must be specified for Blue Cross members"},{"column":"number_of_members","value":"five","level":"error","rule":"datatype:integer","message":"number_of_members should be a positive or negative integer"}]
10 10000 Bob Dylan solo Pittsfield Medical FFF GYU ZKJ 954 [{"column":"health_insurance_id","value":"FFF GYU ZKJ 954 ","level":"error","rule":"datatype:nonspace","message":"health_insurance_id should be text without whitespace"},{"column":"health_insurance_provider","value":"Pittsfield Medical","level":"error","rule":"rule:health_insurance_provider-2","message":"a Pittsfield Medical health insurance id must be a single word"}]
11 11000 Van Halen band 5 Pittsfield Med. 9476587117 [{"column":"health_insurance_provider","value":"Pittsfield Med.","level":"error","rule":"key:foreign","message":"Value 'Pittsfield Med.' of column health_insurance_provider is not in providers.name"},{"column":"name","value":"Van Halen","level":"error","rule":"key:primary","message":"Values of name must be unique"}]
(11 rows)

The message and history tables

In the previous section we mentioned that artists_view and artists_text_view are defined in terms of a query that draws on data from artists and artists_conflict as well as from the special internal tables called message and history. The information in these tables can also be queried directly. After loading the data from artists.tsv into the database, one will find that the message table contains the following contents:

valve_postgres=> select * from message where "table" = 'artists';
message_id table row column value level rule message
62 artists 5 health_insurance_provider Blue Cross error rule:health_insurance_provider-1 a health insurance id suffix must be specified for Blue Cross members
63 artists 8 health_insurance_provider Medi-Assisr error key:foreign Value 'Medi-Assisr' of column health_insurance_provider is not in providers.name
64 artists 9 number_of_members five error datatype:integer number_of_members should be a positive or negative integer
65 artists 9 health_insurance_provider Blue Cross error rule:health_insurance_provider-1 a health insurance id suffix must be specified for Blue Cross members
66 artists 10 health_insurance_provider Pittsfield Medical error rule:health_insurance_provider-2 a Pittsfield Medical health insurance id must be a single word
67 artists 10 health_insurance_id FFF GYU ZKJ 954 error datatype:nonspace health_insurance_id should be text without whitespace
68 artists 11 name Van Halen error key:primary Values of name must be unique
69 artists 11 health_insurance_provider Pittsfield Med. error key:foreign Value 'Pittsfield Med.' of column health_insurance_provider is not in providers.name
(8 rows)

Here,

  • message_id is a unique identifier for this message assigned when it is created.
  • table is the table that this message is associated with.
  • row is the row_number of the table that this message is associated with.
  • column is the column in the row of the table that this message is associated with.
  • value is what the message is about.
  • level is the severity of the message.
  • rule identifies the rule that, when applied to the given value, resulted in the message.
  • message is the text of the message.

The history table will be empty immediately after the initial loading of the database.

valve_postgres=> select * from history where "table" = 'artists';
history_id table row from to summary user undone_by timestamp
(0 rows)

Here,

  • history_id is a unique identifier for this history record that is assigned when it is created.
  • table is the table that this record is associated with.
  • row is the row number of the table that this record is associated with.
  • from is a rich JSON representation of the row as it was before the change (see the section on data validation).
  • to is a rich JSON representation of the row after the change (see the section on data validation).
  • user is the name of the user that initiated the change, or who undid or redid the change (if applicable).
  • undone_by is the name of the user that undid the change (if applicable).
  • timestamp records the time of the change, or the time that it was undone or redone (if applicable).
  • summary is a JSON representation of a summary of the change in the form of an array of records corresponding to each column that has changed. e.g.,
        [
            {
                "column":"name",
                "level":"update",
                "message":"Value changed from 'Prince' to 'The artist formerly known as Prince'",
                "old_value":"Prince",
                "value":"The artist formerly known as Prince"
            },
            {
                "column":"health_insurance_provider",
                "level":"update",
                "message":"Value changed from 'Blue Cross' to 'Medi-Assist'",
                "old_value":"Blue Cross",
                "value":"Medi-Assist"
            },
            {
                "column":"health_insurance_id",
                "level":"update",
                "message":"Value changed from '9933991133' to 'MA0001122344'",
                "old_value":"9933991133",
                "value":"MA0001122344"
            },
            {
                "column":"health_insurance_id_suffix",
                "level":"update",
                "message":"Value changed from 'GG77' to ''",
                "old_value":"GG77",
                "value":""
            }
        ]

Note that the summary column of the history table is where the information in the view columns artists_view.history and artists_text_view.history is taken from, which is why these view columns are always in the form of an array of arrays, i.e., an array of summary records for the given row.

Data validation and editing

Representing validated data

For the purposes of this section we will define a cell as a single value, that can either be valid or invalid, of one of the columns of a given data table. We will further define a row of data, from a given data table, to be an array of correlated cells, such that for every column in the table there is one and only one cell in the row corresponding to it. Once a given row of data has been inserted into the database and (optionally) validated, Valve represents it using a ValveRow struct (see the API reference) containing the following information:

  1. The row's fixed, unique, identifier, or row_number.
  2. A map from the names of the columns contained in the row, to the ValveCell struct associated with each, where the latter represents the results of running Valve's validation engine (see the validation process) on the cell associated with that column.

In particular, each ValveCell contains the following information:

  1. The value of the cell.
  2. Whether the value should be interpreted as a null value of the given column, and if so, the column's nulltype.
  3. Whether or not the value is valid given the column's column configuration.
  4. A list of ValveCellMessage structs representing the validation messages associated with this cell.

Each ValveCellMessage, in turn, contains the following information:

  1. The level, or severity, of the message (e.g., "error", "warn", "info").
  2. An alphanumeric identifier for the rule violation described by the message (see the section on rule violation identifiers).
  3. The text of the message.

The following is a textual representation of an example of a ValveRow:

ValveRow {
    row_number: Some(
        11,
    ),
    contents: {
        "id": ValveCell {
            nulltype: None,
            value: String("BFO:0000027"),
            valid: true,
            messages: [],
        },
        "name": ValveCell {
            nulltype: None,
            value: String("Mike"),
            valid: true,
            messages: [],
        },
        "location": ValveCell {
            nulltype: None,
            value: String("baree"),
            valid: false,
            messages: [
                ValveCellMessage {
                    level: "error",
                    rule: "key:foreign",
                    message: "Value 'baree' of column location is not in cities.city_name",
                },
            ],
        },
        "preferred_seafood": ValveCell {
            nulltype: Some("empty"),
            value: String(""),
            valid: true,
            messages: [],
        },
    },
}
Rule violation identifiers

Valve uses the following to identify the rule that has been violated by a given cell value in the ValveCellMessage associated with the violation:

  • key:foreign: The column that the given value belongs to has a from() structure (see the column table) that references some column, F, in another table, but the given value is not in F.
  • key:primary: The column that the given value belongs to has a primary structure, and the given value already exists in the column.
  • key:unique: The column that the given value belongs to has a unique structure, and the given value already exists in the column.
  • option:unrecognized (table table only): The list of options specified in the options column of the table table contains an unrecognized option.
  • option:redundant: (table table only): The list of options specified in the options column of the table table contains an option that is already implied by one of the other options.
  • option:reserved: (table table only): The list of options specified in the options column of the table table contains an option keyword that is reserved for internal use.
  • option:overrides: (table table only): The list of options specified in the options column of the table table contains an option that overrides one of the other options.
  • tree:foreign: The column that the given value belongs to has a tree() structure that references some other column, T, of the same table; but the given value is not in T.
  • datatype:DATATYPE: The column that the given value belongs to has the datatype, DATATYPE, but applying DATATYPE's associated condition to the given value results in a failure.
  • rule:COLUMN-N: The given value of COLUMN causes the Nth rule in the rule table whose when_column is COLUMN to be violated.

The validation process

Validating a row of data

For a given row of data in a given data table, Valve's validation process consists in a series of checks that are schematically represented in the flowchart below and subsequently explained in more detail.

flowchart TD
    node1["1. Determine the nulltype (if any) of each cell in the row"]
    node1 -- "Then, for each cell:" --> node2
    node2["2. Validate rules"]
    node2 --> modal1
    modal1{"Does the cell have a nulltype?"}
    modal1 -- No --> node3
    modal1 -- Yes, skip further validation for this cell --> modal3
    node3["3. Validate datatype"]
    node3 --> modal2
    modal2{"Does the cell value contain a SQL type error?"}
    modal2 -- No --> node4
    modal2 -- Yes --> modal3
    node4["4. Validate foreign constraints"]
    node4 --> node5
    node5["5. Validate primary and unique Constraints"]
    node5 --> modal3
    modal3{"Have we iterated over all of the cells in the row?"}
    modal3 -- Yes, then over the table as a whole: --> node6
    modal3 -- No, go on to the next cell --> node2
    node6["6. Validate tree-foreign keys"]
Loading
Determining the nulltype of a cell

The validation process begins by determining, for each cell in the row, whether the value of that cell matches the nulltype (if any) of its associated column, as defined in the column table. In particular, if the value of the cell matches the nulltype of its associated column, then the nulltype field of the ValveCell struct used to represent the cell will be set to indicate that the value is a null value of that type. Otherwise the nulltype field will remain unset, indicating that the value is not a null value. For instance, suppose that the cell value is '' (i.e., the empty string), and that the nulltype for its associated column, as defined in the column table, is empty. Since empty's associated condition, as defined in the datatype table is equals(''), applying it to the cell value will result in a match, and Valve will set the nulltype field for the ValveCell representing this particular cell to empty. In the case where the value of the cell does not match the condition associated with the datatype, empty, (i.e., when the cell value is something other than an empty string), the validation process will leave the nulltype field of the ValveCell unset.

Validating rules

This step of the validation process determines whether any of the rules in the rule table that are applicable to a cell have been violated. A rule in the rule table is applicable to a cell when the cell's associated column is the same as the when_column associated with the rule. Note that since the rules in the rule table correspond to if-then conditionals, such that the antecedent and consequent of a given conditional refer (in general) to two distinct columns, a rule violation may indicate that there is a problem with the value of either or both. Whenever a rule violation occurs, a ValveCellMessage struct is added to the list of messages associated with the cell, identifying the particular violation that occurred (see the section on rule violation IDs) and its associated level and description as found in the rule table.

Validating datatypes

This step of the validation process determines whether a cell's value violates the datatype condition, as defined in the datatype table, for the datatype associated with the cell's column in the column table. When a datatype violation occurs, a ValveCellMessage struct is added to the list of messages associated with the cell, identifying the particular datatype violation that occurred (see the section on rule violation IDs). The text of the message is taken from the description of the datatype that has been violated (see the datatype table).

Validating foreign constraints

This step in the validation process verifies, for a given cell, that if the cell's associated column, C, has been configured with a structure of the form from(T, F) (see the column-table), where T is a foreign table and F is a column in T, then the cell's value (or values if C's datatype is a list datatype) is (are) among the values of F. Note that if the foreign table has a _conflict version (see the table table), then this function will distinguish between (a) the case in which a given value is not found in either the foreign table or its associated conflict table, and (b) the case in which a given value is found only in the conflict table. When a foreign constraint violation occurs, a ValveCellMessage struct is added to the list of messages associated with the cell with the identifier key:foreign (see also the section on rule violation IDs). The text of the message will be of the form: Value 'V' of column C is not in T.F, whenever V is neither found in the normal version of the table nor (if applicable) its conflict version, and it will be of the form Value 'V' of column C exists only in T_conflict.F whenever V exists in T_conflict but not in T.

Validating primary and unique constraints

This step in the validation process verifies, for a given cell, that if the cell's associated column has been configured with either a primary or a unique constraint (see the column-table), then the cell's value is not among the values of the column that have already been inserted into the table, neither in the normal version of the table nor in the conflict version (in the case where the conflict option has been set). When a primary or unique constraint violation occurs, a ValveCellMessage struct is added to the list of messages associated with the cell with the appropriate rule violation ID. The text of the message is of the form Values of COLUMN must be unique.

Validating tree-foreign keys

When a column, column2 has a structure tree(column1) defined on it in the column table, then all non-null values of column2 must exist in column1. This step in the validation process function verifies, for a given table, whether any of the values of any of the cells in any of the rows violate any of the table's tree() conditions. When a primary or unique constraint violation occurs, a ValveCellMessage struct is added to the list of messages associated with the cell with the appropriate rule violation ID. The text of the message is of the form Value 'VAL' of column COLUMN2 is not in COLUMN1.

Batch validation

The algorithm described in the previous section is applicable to a single row of data. When initially loading the many rows of a data table from its source into the database, however, a number of optimizations may be used to speed up the validation process. In particular, some of the steps for validating a row's cells from the previous section do not need to refer to any data external to a given row, and may therefore be performed in parallel. These are steps 1–3, i.e., the nulltype, rules, and datatype validation steps. Only the remaining steps: foreign and unique/primary constraint validation, need be performed row by row.

Editing the data

Once a data table has been loaded into the database and validated, it may be desirable to further edit the data, for instance to resolve any outstanding validation messages, or for some other reason. The possible operations that can be performed on the data are insert_row(), update_row(), delete_row(), and move_row(). In addition the function validate_row() is used to generate validation information about the row during the editing process.

Note that insert_row(), update_row(), and validate_row() require that a row be specified in the following "simple" form:

{
    "column_1": value1,
    "column_2": value2,
    ...
},

The output of these three functions is a ValveRow (see representing validated data).

After editing the data, a record of the change will be inserted to the history table. The ValveRow representing the row as it was before the change (in the case of an update, delete, or move operation), and the ValveRow representing it as it is after the change (in the case of an update, move, or insert), will be used as the values of the from and to columns, respectively. Note that to represent a ValveRow in the database, "rich" JSON format is used:

{
    "column_1": {
        "valid": <true|false>,
        "messages": [{"level": level, "rule": rule, "message": message}, ...],
        "value": value1
    },
    "column_2": {
        "valid": <true|false>,
        "messages": [{"level": level, "rule": rule, "message": message}, ...],
        "value": value2
    },
    ...
},

Valve also provides the undo() and redo() functions, to undo the last insert, update, or delete operation, and to redo the last undo, respectively. For more information on the data manipulation and browsing operations provided by Valve see the section on the Valve API.

Saving data tables

To save a data table or tables to a '.tsv' file, one uses the API functions, save_all_tables(), save_tables(), or save_table(). Note that unless the save option has been set, it is not possible to overwrite the source file from which the table was originally loaded. One must save the table to an alternate location. For more information on saving data tables Valve see the section on the Valve API.

Installation and configuration

Prerequisites

  • Building Valve from source requires installing a rust development environment.
  • Valve also requires database management software to be installed. The currently supported databases are SQLite and PostgreSQL. In the latter case the database server must be up and running and ready to accept connections. Note that although the database may be queried directly by using your favourite SQL client (e.g., sqlite3 or psql), one advantage to viewing and manipulating the data through Valve is that syntax and other differences between the two database management systems become transparent to the user.

Installation

Binary installation

Option 1: Build and install the binary for the latest release using crates.io by running:

cargo install ontodev_valve

Option 2: Download the appropriate binary for your system from the release page and copy it to your system's executable path.

Source installation

It is also possible to build the binary yourself from the source code. This is required if you would like to work with a version of Valve that is newer than the latest release. Begin by using git clone to clone the source code repository into a local folder:

git clone [email protected]:ontodev/valve.rs.git

Next, use cargo to build the ontodev_valve binary (this may take awhile to complete and will produce a lot of output):

cargo build --release

After the build has completed successfully, the ontodev_valve binary may be found (relative to the repository's root directory) in the target/release/ subdirectory. You should copy or move it from here to your user or system-wide bin/ directory, or to some other convenient location. For example,

cp target/release/ontodev_valve ~/bin/

Finally, run

ontodev_valve --help

to verify that Valve was installed correctly.

Configuration

Valve is configured primarily using a number of special configuration tables that can be represented as '.tsv' files. The most important of these is the table called 'table', also known as the table table. A table table configuration is required to use Valve. A column table and datatype table configuration are required as well. Optionally, the user may also specify a rule table configuration.

The table table is alone among the configuration tables in that it cannot be given an arbitrary name but must always be given the name 'table'. This is not the case for the column, datatype and rules tables. Although it is recommended to use the names 'column, 'datatype', and 'rule', respectively, alternate names may be chosen for these tables as explained below.

The table table

One normally configures the table table using a '.tsv' file called 'table.tsv', although any filename could be used in principle, and in fact it is also possible to read the table table directly from the database, as long as the database already contains a table called 'table' with the right setup.

Below is an example of a table table:

table path description type options
table schema/table.tsv table
column schema/column.tsv column
datatype schema/datatype.tsv datatype
rule schema/rule.tsv rule
user_table1 schema/user/user_table1.tsv
user_readonly_table1 schema/user/user_readonly_table1.tsv no-edit no-save no-conflict
user_view1 schema/user/user_view1.sql db_view
user_view2 schema/user/user_view2.sh db_view
user_view3 db_view

Note that in the first row above the table being described is the table table itself. In general the columns of the table table have the following significance:

  • table: the name of the table.
  • path: where to find information about the contents of the table (see below). Note that the path of a special configuration tables must be a '.tsv' file.
  • description: An optional description of the contents and/or the purpose of the table.
  • type: Valve recognizes four special configuration table types that can be specified using the type column of the table table. These are the table, column, datatype, and rule table types. Data tables (e.g., the 'user_*' tables in the above example) should not explicitly specify a type, and in general if a type other than the ones just mentioned is specified, Valve will exit with an "Unrecognized table type" error.
  • options (optional column): Allows the user to specify a number of further options for the table (see below).
Further information on path

The path column indicates where the data for a given table may be found. It can be (a) a '.tsv' file, (b) a '.sql' file, (c) some other executable file, or (d) it may be empty. In each case it will have the following consequences for the possible values of type and and for the possible options that may be used with the table.

  1. If path ends in '.tsv':

    • Its associated type may be any one of table, column, datatype, rule, or it may be empty.
    • The db_view option is not allowed. If set to true, Valve will fail with a "'.tsv' files are not supported for views" error.
  2. If path does not end in '.tsv':

    • Its associated type must be empty. Note that this implies that Valve's configuration tables must be specified using '.tsv' files.
    • The edit option is not allowed. If set to true, Valve will fail with an "Editable tables require a path that ends in '.tsv'" error.
  3. If path either ends in '.sql', or represents a generic executable:

    • If the db_view option is set, Valve assumes that the '.sql' file or generic executable indicated contains the instructions necessary to create the view and that the view displays the appropriate data.
    • If the db_view option is not set, Valve will take care of creating the table but it will expect that the '.sql' file or generic executable contains the statements necessary to load the data.
    • If path ends in '.sql', Valve reads in the statements contained in the '.sql' file and then executes them against the database.
    • If path represents a generic executable, then it is executed as a shell process. Note that the executable script or binary must accept two arguments indicating the location of the database and the name of the table that the path corresponds to.
  4. If path is empty:

    • If the db_view option is set, Valve assumes that the view already exists in the database.
    • If the db_view option is not set, Valve will take care of creating the table but it will not attempt to load it.

The above is conveniently summarized in the following table:

path possible types possible options created by loaded by†
Ends in '.tsv' any valid type, or empty db_view not allowed Valve Valve
Ends in '.sql' must be empty edit not allowed db_view: the '.sql' file, db_table: Valve Valve
Generic executable must be empty edit not allowed db_view: the executable, db_table: Valve Valve
empty must be empty edit not allowed No one (assumed to already exist) No one (assumed to be already loaded)

† Note that loading is only applicable when the db_view option has not been set.

Further information on options

If no options are specified, the options db_table, truncate, load, save, edit, validate_on_load, and conflict will all be set to true by default. The complete list of allowable options, and their meanings, are given below:

  • db_table: The table will be represented in the database by a regular table. Note that if the edit option has also been set, then path must be non-empty and it can be one of (a) a file ending (case insensitively) with '.tsv' explicitly specifying the contents of the table, (b) a file ending (case insensitively) with '.sql', or (c) an executable file. In the case of (a), Valve will take care of validating and loading the data from the '.tsv' file. In cases (b) and (c), Valve uses the '.sql' file or the executable to load the data, and performs no validation other than to verify that this has been done without error. If path is empty, then Valve expects that the table has already been created and loaded by the user and will fail otherwise. Note that the db_table option is not compatible with the db_view option.
  • db_view: The table will be represented in the database by a view. A view is allowed to have an empty path, in which case valve simply verifies that the view exists but does not create it. When path is non-empty then it must either end (case-insensitively) in '.sql' or be an executable file. Note that the path for a view must never end (case-insensitively) in '.tsv'. When path is not-empty, valve relies on the '.sql' file or the executable to create the view, otherwise it expects the view to have already been created by the user. Regardless of the value of path, Valve performs no validation of the data other than to verify that the view exists in the database. Note that the db_view option is not compatible with any of db_table, truncate, load, conflict, save, edit, or validate_on_load.
  • load: The table may be loaded with data. Note that this option is not compatible with db_view.
  • truncate: The table should be truncated before it is loaded. Note that this option is not compatible with db_view.
  • conflict: When reading the configuration for a table called 'T', then if 'T' has the conflict option set, Valve should create, in addition to 'T', a database table called 'T_conflict'. The purpose of this table is to store invalid data that, due to the violation of a database constraint, cannot be stored in 'T' (see Design and concepts). Note that the conflict option is not compatible with db_view.
  • internal: This option is reserved for internal use and is not allowed to be specified by the user.
  • validate_on_load: When set, Valve will validate a table's rows before loading them to the database. Note that this option is not compatible with db_view.
  • edit: When set, this option indicates that it is allowed to edit the table after it has been initially loaded. Note that this option is not compatible with db_view.
  • save: When set, this indicates that overwriting the table's source file, as indicated by path (which must be a '.tsv' file) is allowed. If the save option is not set, the table may nevertheless be saved as long as it is to a different path. Note that the save option is not compatible with db_view.
  • no-conflict: Sets the conflict option (which is set to true by default unless db_view is true) to false.
  • no-validate_on_load: Sets the validate_on_load option (which is set to true by default unless db_view is true) to false.
  • no-edit: Sets the edit option (which is set to true by default unless db_view is true) to false.
  • no-save: Sets the save option (which is set to true by default unless db_view is true) to false.
Commonly used path and option combinations

Here are some examples of commonly used table table configurations:

table path description type options
plain src/data/plain.tsv A database table with default options set, backed by a '.tsv' file
appendonly src/data/appendonly.tsv A database table which is appended to but never truncated no-truncate
readonly1 src/data/readonly1.tsv A read-only table that is backed by a '.tsv' file no-edit no-save
readonly2 src/data/readonly2.sh A read-only table whose contents are assumed to be valid, backed by a generic executable no-edit no-conflict no-validate-on-load
view1 src/data/view1.sql A view backed by a '.sql' file db_view
table src/table.tsv the table table table
column src/column.tsv the column table column
datatype src/datatype.tsv the datatype table datatype
rule src/rule.tsv the rule table rule

The column table

In addition to a table table, Valve also requires a column table. The column table configuration is normally stored in a file called 'column.tsv', though in principle any filename may be used as long as the type field corresponding to the filename is set to 'column' in the table table. The column table contains one row for every column of every configured table. This includes both special configuration tables (such as the column table itself) and user-defined tables.

Below is an example column table, with the special configuration tables omitted:

table column label nulltype default datatype structure description
table1 column1 Column 1 value1
table1 column2 Column 2 empty integer from(table2.column2)
table2 column1 Column 1 trimmed_line primary
table2 column2 Column 2 integer unique
table3 column1 Column 1 word primary
table3 column2 Column 2 empty word tree(column1)

The columns of the column table have the following significance:

  • table: The name of the table to which the column belongs
  • column: The name of the column
  • label: If not empty, then instead of column, use label as the header for the column when saving the table.
  • nulltype: The datatype, defined in the datatype table, used to represent a null value in the column. For instance the datatype 'empty' in the example above is defined to match the string '' (see the example from the section on the datatype table). If a column has a nulltype, then values of the column that match the nulltype's associated datatype are considered to be valid values for the column. If a column has no nulltype, this means that a null value (by default, an empty string) for the column is considered to be an invalid value.
  • default (optional column): The default value to use for the column when inserting a row of data to the database. Note that in the database this implies that a DEFAULT constraint will be declared for the column.
  • datatype: The column's datatype, which must be one of the valid datatypes defined in the the datatype table
  • structure: Valve recognises the following four structural constraints on columns:
    • primary: The column is the primary key for the table to which it belongs; values must therefore be unique. Note that in the database this implies that a PRIMARY KEY constraint will be declared for the column.
    • unique: The column's values must be unique. Note that in the database this implies that a UNIQUE constraint will be declared for the column.
    • from(foreign_table.foreign_column): All non-null values of the column must exist in the column foreign_column of the table foreign_table. Note that in the database this implies that a FOREIGN KEY constraint will be declared for the column, unless the column's datatype is a list datatype (see the datatype table), and it also implies that a UNIQUE constraint will be declared for foreign_table.foreign_column, unless a unique structure has already been declared for that column in the column table.
    • tree(column_name): All non-null values of the column must exist in the column column_name of the same table
  • description: A description of the contents and/or the purpose of the column.

The datatype table

In addition to the table table and the column table, Valve also requires the user to configure a datatype table. The datatype table configuration is normally stored in a file called 'datatype.tsv', though in principle any filename may be used as long as the type field corresponding to the filename is set to 'datatype' in the table table. The datatype table stores the definitions of the datatypes referred to in the datatype column of the column table.

Below is a subset of the rows of an example datatype table:

datatype parent condition description sql_type HTML type format
text any text TEXT textarea
integer text match(/-?\d+/) a positive or negative decimal digit, or 0 INTEGER %i
word text exclude(/\W/) a single word: letters, numbers, underscore %s
empty text equals('') the empty string NULL
custom1 text match(/\S+:\S+/) two nonspace character sequences separated by ':'
custom2 text in(alice, bob, cindy) either 'alice', 'bob', or 'cindy'
custom3 text list(word, ' ') a list of words separated by spaces
custom4 text search(/\d+/) a string containing a sequence of digits

The columns of the datatype table have the following significance:

  • datatype: The name of the datatype
  • parent: The more generic datatype, if any, that this datatype is a special case of. When the value of a given column violates its datatype condition, Valve will move up the datatype hierarchy to determine whether its ancestors' datatype conditions have also been violated, and if so, Valve will add validation error messages corresponding to these further violations to the error messages it assigns to that value in that column.
  • condition: The logical condition used to validate whether a given data value conforms to the datatype (see condition types below)
  • description: A description of the datatype and/or its purpose.
  • sql_type: The SQL type to use for columns that have the given datatype in the database. If empty, the SQL type of the nearest ancestor for which a SQL type has been defined will be used.
  • HTML type (optional column): The HTML type corresponding to the datatype.
  • format (optional column): The sprintf-style format string to apply to values of the datatype when saving them.
Condition types
  • match(/REGEX/): Violated if a given value does not match REGEX.
  • exclude(/REGEX/): Violated if a given value contains an instance of REGEX.
  • search(/REGEX/): Violated if a given value does not contain an instance of REGEX.
  • equals(VAL): Violated if a given value is not equal to VAL.
  • in(VAL1, ...): Violated if a given value is not one of the values in the list: VAL1, ...
  • list(ITEM_DATATYPE, SEPARATOR): Violated if a given value is not in the form of a sequence of items, each of datatype ITEM_DATATYPE, separated by the string SEPARATOR. Otherwise the condition is violated if any of the items in the given list fail to conform to ITEM_DATATYPE.
Required datatypes

Valve requires that the following datatypes be defined:

  • text, empty, line, trimmed_line, nonspace, word

The recommended datatype configurations for these four datatypes are the following (note that the HTML type and format columns are optional):

datatype parent condition description sql_type HTML type format
text TEXT textarea
empty text equals('') NULL
line text exclude(/\n/) input
trimmed_line line match(/\S([^\n]\S)/)
nonspace trimmed_line exclude(/\s/)
word nonspace exclude(/\W/) %s

The rule table

In addition to the table table, the column table, and the datatype table, it is also possible (but optional) to configure a table of type 'rule', or a rule table. When it is configured, the rule table configuration is normally stored in a file called 'rule.tsv', though in principle any filename may be used as long as the type field corresponding to the filename is set to 'rule' in the table table.

The rule table is used to define a number of rules of the following form:

        when CONDITION_ON_COLUMN_1 is satisfied then CONDITION_ON_COLUMN_2 must also be satisfied         (1)

where CONDITION_ON_COLUMN_1 and CONDITION_ON_COLUMN_2 are defined with respect to the same table.

Below is an example rule table:

table when column when condition then column then condition level description
table1 foo null bar not null error
table2 foo negative_int bar positive_int error

The columns of the rule table have the following significance:

  • table: The name of the table to which the rule is applicable.
  • when column: The column that the when condition will be checked against.
  • when condition: The condition to apply to values of when column. This can either be one of the recognized condition types, or it can be the name of a datatype in which case the condition corresponding to the datatype will be used.
  • then column: The column that the then condition will be checked against.
  • then condition: The condition to apply to values of then column whenever the when condition has been satisfied for when column. This can either be one of the recognized condition types, or it can be the name of a datatype in which case the condition corresponding to the datatype will be used.
  • level: The severity of the violation
  • description: A description of the rule and/or its purpose.

Using guess

In some cases it is useful to be able to try and guess what the table table and column table configuration should be, using information about the current state of the Valve instance, for a given data table not currently managed by Valve. To do this one may use Valve's command line interface to run the guess subcommand as follows:

ontodev_valve guess [OPTIONS] SOURCE DATABASE TABLE_TSV

where:

  • SOURCE is the location of the '.tsv' representing the table table.
  • DATABASE is the path to a PostgreSQL or SQLite database.
  • TABLE_TSV is the '.tsv' file representing the data table whose column configuration is to be guessed.

For the list of possible options, and for general information on Valve's command line interface, see command line usage.

Below is an example of using guess:

$ ontodev_valve guess test/guess_test_data/table.tsv build/valve_guess.db test/guess_test_data/ontology/table2.tsv 

The following row will be inserted to "table":
table                                     path                                      type                                      description                               
table2                                    test/guess_test_data/ontology/table2.tsv                                                                                      

The following row will be inserted to "column":
table               column              label               nulltype            datatype            structure           description         
table2              zork                                                        integer             primary                                 
table2              zindy                                                       IRI                 unique                                  
table2              xyzzy                                                       suffix              unique                                  
table2              foo                                                         prefix                                                      
table2              bar                                                         IRI                 unique                                  

Do you want to write this updated configuration to the database? [y/N] y

Command line usage

The basic syntax when calling Valve on the command line is:

ontodev_valve [OPTIONS] <SUBCOMMAND <SUBCOMMAND POSITIONAL PARAMETERS>>

To view the list of possible subcommands and global options, run:

ontodev_valve --help

To get help on a particular subcommand, and on the positional parameters and options that are specific to it, run:

ontodev_valve SUBCOMMAND --help

Logging

By default Valve only logs error messages. To also enable warning and information messages, set the environment variable RUST_LOG to the minimum logging level desired for ontodev_valve: debug, info, warn, or error. For instance:

export RUST_LOG="ontodev_valve=info"

For further information see the Rust Cookbook.

API

The API (application-programmer interface) reference documentation for the latest release of Valve can normally be found on docs.rs. In case you would like to generate the API documentation for a newer version of Valve, run:

cargo doc

in the the root folder of your local copy of the valve.rs source code repository (see Installation and configuration), and then open the file

<valve.rs root folder>/target/doc/ontodev_valve/index.html

in your favourite browser. For a higher level See also the Design and concepts section.

Troubleshooting

TODO.

Python bindings

See valve.py

License: BSD-3-Clause