v0.2.0 revisions #50

Merged
merged 72 commits into from
Jan 24, 2024
e9065b9
`data_dictionary` to `fields`
mbkranz Nov 27, 2023
f2fbcde
del `repo_link` but see issues #28
mbkranz Nov 27, 2023
68ab362
`module` to `section` as addressed in #40
mbkranz Nov 27, 2023
201e6ab
Del additional properties to enable validation of property names
mbkranz Nov 27, 2023
0a2e930
update section title
mbkranz Nov 30, 2023
b158581
updated examples with section prop
mbkranz Dec 1, 2023
533a8ca
add gitignore
mbkranz Dec 1, 2023
b3a6e7f
del `univarStats` (#49)
mbkranz Dec 1, 2023
065b63e
Add `version` prop in vlmd root json schema (#47)
mbkranz Dec 1, 2023
d0a5af4
encodings to enumLabels and ordered to enumOrdered #44
mbkranz Dec 4, 2023
3d084a3
Added field standard mapping object with examples from issue #39
mbkranz Dec 4, 2023
ebdc8d3
Added proposed standardsMappings for root and fields with examples fr…
mbkranz Dec 4, 2023
1802cba
Update vlmd version to 0.2.0
mbkranz Dec 4, 2023
6f5c0c0
Updated flattened object within array naming convention
mbkranz Dec 6, 2023
bbcab92
Fix: vlmd description not showing for object of arrays
mbkranz Dec 8, 2023
1dd06b2
improve human readable definitions
mbkranz Jan 3, 2024
bceffa7
additional dictionary yaml definition updates
mbkranz Jan 3, 2024
5ca3248
paired back num examples
mbkranz Jan 3, 2024
b6b2a60
add schema version
mbkranz Jan 3, 2024
00f6f62
minor formatting
mbkranz Jan 3, 2024
f96bc32
build with new dictionary updates
mbkranz Jan 3, 2024
f8a2759
make example in description valid (ie a string instead of int)
mbkranz Jan 3, 2024
e840e62
build based on update
mbkranz Jan 3, 2024
df7d96a
init test for gh admonitions
mbkranz Jan 4, 2024
6259272
Update vlmd README.md
mbkranz Jan 4, 2024
3d2c7f1
remove Csv and json spec suffix given established translation pattern
mbkranz Jan 4, 2024
e47531c
WIP documentation on conventions and rules
mbkranz Jan 4, 2024
f535851
additional json to csv property name example
mbkranz Jan 5, 2024
c397ba2
slight formatting updates to templates
mbkranz Jan 5, 2024
9fdac53
WIP init function for json to csv translation with new convention-bas…
mbkranz Jan 5, 2024
4c6b012
fix: missing type for enum
mbkranz Jan 5, 2024
045dfee
add enumLabel and enumOrder official pattern examples
mbkranz Jan 5, 2024
4e05314
add boolean to list of scalars in README
mbkranz Jan 5, 2024
72999c2
finish implementation of csv translation rules
mbkranz Jan 5, 2024
b281c32
add schemaVersion/cascading properties to rules and build for csv spec
mbkranz Jan 5, 2024
1944153
minor README updates
mbkranz Jan 5, 2024
59e66a6
recommendation on vlmd document file naming
mbkranz Jan 5, 2024
eaa9176
new line for version in vlmd schema markdown template
mbkranz Jan 5, 2024
9f1a2b4
Run build with updates
mbkranz Jan 5, 2024
0cd6c80
Note on schemaVersion cascading in csv files
mbkranz Jan 5, 2024
15c3bde
Update build
mbkranz Jan 5, 2024
f26f552
Add examples to individual standardsMappings props
mbkranz Jan 5, 2024
dfee8cf
Update build
mbkranz Jan 5, 2024
0ab4677
Definitions of standardsMappings instrument properties
mbkranz Jan 5, 2024
bc05b82
Update build
mbkranz Jan 5, 2024
0fbe8f1
Update README rules
mbkranz Jan 6, 2024
3d5155b
Added schemaVersion to fields (and made schemaVersion a defnition)
mbkranz Jan 11, 2024
bf7543f
Added contraints.required
mbkranz Jan 11, 2024
d863b17
Update build
mbkranz Jan 11, 2024
9fd743d
Update annotations
mbkranz Jan 16, 2024
6edf2f6
Update build
mbkranz Jan 16, 2024
5d107b1
Update build with new annotations
mbkranz Jan 16, 2024
5d420fb
del relatedConcepts (to avoid confusion with standardsMappings for no…
mbkranz Jan 16, 2024
a2581ce
update build
mbkranz Jan 16, 2024
9a76010
minor formatting changes
mbkranz Jan 17, 2024
ca23120
Update build
mbkranz Jan 17, 2024
15e597e
added tbl and field lvl 'extra' properties
mbkranz Jan 22, 2024
c28f05f
update build
mbkranz Jan 22, 2024
73b865f
fix: patternNames is AND not OR
mbkranz Jan 22, 2024
d79b87c
remove trueValue type constraint
mbkranz Jan 22, 2024
943f8c1
Update build
mbkranz Jan 22, 2024
6345a72
fix
mbkranz Jan 22, 2024
cad5bec
update build
mbkranz Jan 22, 2024
46418da
add custom property to dd level
mbkranz Jan 23, 2024
d14cd73
Update build
mbkranz Jan 23, 2024
877dfbc
Update README
mbkranz Jan 23, 2024
2481437
add underscore to designate definitions are more for internal process
mbkranz Jan 23, 2024
7d5fa21
Update build
mbkranz Jan 23, 2024
3686b3f
Added back an updated relatedConcepts property
mbkranz Jan 23, 2024
f9dc401
Update build
mbkranz Jan 23, 2024
7630f34
minor updates to README
mbkranz Jan 23, 2024
908517f
Update build
mbkranz Jan 23, 2024
9 changes: 9 additions & 0 deletions .gitignore
@@ -0,0 +1,9 @@

#hidden libs and cache dirs
.vscode
.pytest_cache
*/pytest_cache/
*__pycache__

# word docs
*.docx
2 changes: 1 addition & 1 deletion VERSIONS.json
@@ -1,4 +1,4 @@
{
"slmd":"1.0.0",
"vlmd":"0.1.0"
"vlmd":"0.2.0"
}
204 changes: 192 additions & 12 deletions variable-level-metadata-schema/README.md
@@ -1,31 +1,211 @@
# Variable level metadata

This metadata directory contains the specifications for variable level metadata submissions to the
HEAL platform in addition to variable level metadata templates in CSV format and the associated code
converting this template to its validated json format.
This metadata directory contains the specifications for variable level metadata documents in the HEAL data ecosystem.

## Schemas

## Workflow
❗ Look here for schema specifications.

The `schemas/dictionary` directory contains a comprehensive json schema with fields for
### json data dictionary format specification
1. `schemas/jsonschema/data-dictionary.json`: The "json" json data dictionary schema (ie json template schema)
- Intended to specify the data dictionary representation of json objects available in the HEAL platform metadata-service.
- See here for the markdown rendered version --> [`docs/md-rendered-schemas/jsonschema-jsontemplate-data-dictionary.md`](docs/md-rendered-schemas/jsonschema-jsontemplate-data-dictionary.md)

### csv field format specifications
- See here for the markdown rendered version --> [`docs/md-rendered-schemas/jsonschema-csvtemplate-fields.md`](docs/md-rendered-schemas/jsonschema-csvtemplate-fields.md)


2. `schemas/frictionless/fields.json`: Table Schema (previously known as "frictionless") standard specification
- This json file is intended to represent csv data dictionary documents following the [Table Schema specification](https://specs.frictionlessdata.io/table-schema/).
- The csv version is intended to make data dictionary creation and discovery available in a more familiar, human readable format.
- It describes how data dictionary field values are represented in a csv file, and is used both to document data dictionary csv files and to validate input.
3. `schemas/jsontemplate/fields.json`: The "csv" json schema (ie csv template schema)
- :warning: The "csv" json schema is intended to be an intermediate specification used for documentation and in translation workflows to the json schema template. Fully specifying a tabular file (for example, missing value specification) is out of scope here; see the Table Schema representation in (2).

## Document flow chart

```mermaid

%%{init: {"flowchart": {"defaultRenderer": "elk","htmlLabels": false}} }%%

flowchart TD

subgraph dictionary[Dictionary YAML files]

defs["schemas/dictionary/definitions.yaml"]
fields["schemas/dictionary/fields.yaml"]
dd["schemas/dictionary/data-dictionary.yaml"]
end

subgraph Schema specifications

jsonspec["schema/jsontemplate/data-dictionary.json"]
csvspec["schema/jsontemplate/csvtemplate/fields.json"]
csvtblspec["schema/frictionless/csvtemplate/fields.json"]
end

subgraph "Rendered schema documentation \n(html also available)"

csvmd["/docs/\nmd-rendered-schemas/\njsonschema-csvtemplate-fields.md"]
jsonmd["/docs/\nmd-rendered-schemas/\njsonschema-jsontemplate-data-dictionary.md"]

end

defs --> fields --> dd
defs --> dd

fields --> csvspec --> csvtblspec
dd --> jsonspec

csvspec --> csvmd
jsonspec --> jsonmd

```

## Directories

- `docs`:
See the rendered human readable schemas
in a markdown format and an interactive html format.
- `schemas/jsonschema`: `data_dictionary.json` contains the final and full specification.
- `schemas/frictionless/csvtemplate`: contains schemas following the frictionless schema specifications. `fields.json` contains the frictionless Table Schema descriptor that validates a tabular heal templated csv data dictionary. See [here](https://specs.frictionlessdata.io/table-schema/) for the specification. **NOTE: the `csvtemplate` is an intermediate format meant to be converted into the final `jsontemplate` format.**
- `schemas/dictionary`: the yaml files used to generate json schemas with build.py. Fields with `jsonSpec` and `csvSpec` keys to indicate which property to extract in the `build.py` script.
- `schemas/jsonschema`: contains the final and full specification for schemas following json schema.
- `schemas/frictionless`: contains schemas following the frictionless table schema specifications. See [here](https://specs.frictionlessdata.io/table-schema/) for the specification.
- `schemas/dictionary`: the yaml files used to generate json schemas and documentation with build.py.
- `templates`: empty templates in csv spreadsheet format and JSON format.
- `examples`: the ~~(filled out)~~ templates in csv spreadsheet format and JSON format.
TO BE ADDED: for now, see https://github.com/norc-heal/healdata-utils/tree/main/tests/data/valid/output
- `build.py`: This script compiles the yaml files and generates associated jsonschemas and frictionless schemas in addition to the human rendered schemas
- `examples`: examples of filled out templates in csv spreadsheet format and JSON format.
- `build.py`: This script compiles the yaml files and generates associated schemas in addition to the human rendered schema
documentation.

## Contributing

To contribute to the variable level metadata, please modify the `dictionary/*.yaml` files directly. For example, if you want to add/modify an example, description, etc for either the JSON or CSV spec, then do so here.
To contribute to the variable level metadata specification (and annotations/examples/documentation), please modify the `dictionary/*.yaml` files directly.

1. Update the dictionary/*.yaml files
2. Run `build.py` script
3. Check output is correct (see above)
4. When satisfied, push to github and ensure it passes validation (ie commit has ✔️ and not ❌)

❗ Please read the below conventions and principles before contributing and review the existing `dictionary` directory.


## Conventions, principles, and rules for annotations and csv <> json translation

### Annotation/documentation properties
1. `description`: SHOULD be written in markdown syntax without any headers, because headers are applied in the templates.

2. `additionalDescription`: SHOULD be added if there are additional documentation "footer" details. When the documentation is rendered, these are appended to the end of the rendered markdown document.

### `type` conversion rules
Because csv field values can only be scalars, with records separated by a new line and individual field values separated by a comma delimiter, the following rules and restrictions are applied to allow json to csv specification translation.

1. type `object`
- converted to type `string` with pattern of `^(?:.*?=.*?(?:\||$))+$` to indicate a stringified object with an equal sign (`=`) connecting each key-value pair and a pipe (`|`) delimiter separating the key-value pairs.
2. type `array`
- if type `object` in `items`: flattened to the children property or properties
- if type is a scalar (`string`,`integer`,`number`) in `items`,
translated to type `string` with pattern `^(?:[^|]+\||[^|]*)(?:[^|]*\|)*[^|]*$` to indicate a string containing a pipe delimiter (i.e., a stringified array with a pipe delimiter)
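
The two patterns above can be exercised with a short Python sketch. The patterns are copied from the rules; the helper names (`stringify_object`, `parse_object`, `stringify_array`, `parse_array`) are hypothetical and are not taken from `build.py` or any HEAL tooling.

```python
import re

# Patterns copied verbatim from the `type` conversion rules.
OBJECT_PATTERN = re.compile(r"^(?:.*?=.*?(?:\||$))+$")
ARRAY_PATTERN = re.compile(r"^(?:[^|]+\||[^|]*)(?:[^|]*\|)*[^|]*$")


def stringify_object(obj):
    """Join key-value pairs as key=value, separated by pipes."""
    return "|".join(f"{key}={value}" for key, value in obj.items())


def parse_object(text):
    """Split a stringified object back into a dict of strings."""
    if not OBJECT_PATTERN.match(text):
        raise ValueError(f"not a stringified object: {text!r}")
    # Split on the first "=" only, so values may themselves contain "=".
    return dict(pair.split("=", 1) for pair in text.split("|") if pair)


def stringify_array(items):
    """Join scalar items with a pipe delimiter."""
    return "|".join(str(item) for item in items)


def parse_array(text):
    """Split a pipe-delimited string back into a list of strings."""
    return text.split("|")
```

Note that parsing returns strings only; restoring `integer` or `number` values would need the type from the json spec, which this sketch does not attempt.
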
### `property` name conversion rules
To facilitate the mapping of json spec property names to csv property names, the `property` names that result from flattening should correspond to their [jsonpath](https://datatracker.ietf.org/doc/id/draft-goessner-dispatch-jsonpath-00.html) representation, where:

1. type `object`

The json spec type object property below:
```json

"constraints": {
"type": "object",
"properties": {
"maxLength": {
"type": "integer"}
}
}
```

translates to the csv stringified type object:

```json

"constraints.maxLength":{"type":"integer"}

```
2. type `array`

The json spec type array property below:

```json
{ "..more props..":"...",
"standardsMappings": {
"type": "array",
"items": {
"type": "object",
"properties": {
"instrument": {
"type": "object",
"properties": {
"url": {
"type": "string",
"format": "uri"
},
"..more props..":"..."}
},
"..more props..":"..."}
}}}

```
translates to the csv stringified type array property:

```json
{ "..more props..":"...",
"standardsMappings[0].instrument.url": {
"type": "string",
"format": "uri"
}
}
```
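
As a rough illustration of the naming rule, the sketch below flattens nested json schema properties into jsonpath-style names. It is not the `build.py` implementation: the function name `flatten_properties` is hypothetical, it assumes only objects that declare `properties` are flattened, and it always emits index `[0]` for arrays of objects, as in the example above.

```python
def flatten_properties(properties, prefix=""):
    """Flatten nested schema properties into jsonpath-style names."""
    flat = {}
    for name, schema in properties.items():
        path = f"{prefix}{name}"
        items = schema.get("items", {})
        if schema.get("type") == "object" and "properties" in schema:
            # Objects contribute a dotted prefix to their children.
            flat.update(flatten_properties(schema["properties"], f"{path}."))
        elif schema.get("type") == "array" and items.get("type") == "object":
            # Arrays of objects contribute an indexed, dotted prefix.
            flat.update(
                flatten_properties(items.get("properties", {}), f"{path}[0].")
            )
        else:
            # Scalars (and arrays of scalars) keep their schema unchanged.
            flat[path] = schema
    return flat


example = {
    "constraints": {
        "type": "object",
        "properties": {"maxLength": {"type": "integer"}},
    },
    "standardsMappings": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "instrument": {
                    "type": "object",
                    "properties": {"url": {"type": "string", "format": "uri"}},
                }
            },
        },
    },
}

flattened = flatten_properties(example)
```

Here `flattened` has the keys `constraints.maxLength` and `standardsMappings[0].instrument.url`, matching the two translations shown above.
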

### Complex `type` restrictions

1. Currently, no complex types (`anyOf`,`oneOf`) are supported and the `type` MUST be specified. This is to ensure coverage for all csv to json translation use cases.
- Each json specification schema property type must be a scalar (e.g., `boolean`,`string`,`integer`,`number`), an `array`, or an `object`
- Each csv specification schema property type must be a scalar (e.g., `boolean`,`string`,`integer`,`number`)

### csv to json and json to csv translations

There are two rules for conversion from json to csv (or csv to json) specs:

1. __csv spec field-level property and json spec root-level property match__: This rule applies when a property is specified at the root level of the json spec schema AND the same property is specified at the field level of the json spec schema.
- csv to json: If a property has the same value at the field level for ALL records (exactly one unique value and no missing values), then this unique value is moved to the root level of the data dictionary when translated to the json spec version.
- json to csv: All root level properties are moved to the individual fields, BUT field level values that already exist take precedence.

More concretely, this provides a way to specify root level properties within vlmd csv documents for a few use cases but can generalize to other future additional property matches:

1. specifying the schema version that represents the vlmd document (`schemaVersion`)
2. specifying other data dictionary level properties such as `standardsMappings[0].instrument`
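
The csv to json direction could be sketched as follows. This is a minimal illustration, assuming csv records have already been read into dicts; the function name `promote_uniform_properties` and the default `cascading` tuple are hypothetical, and empty strings are treated as missing values.

```python
def promote_uniform_properties(fields, cascading=("schemaVersion",)):
    """Move field-level values shared by ALL records to the root level."""
    root = {}
    remaining = [dict(field) for field in fields]  # leave the input untouched
    for name in cascading:
        values = [field.get(name) for field in fields]
        complete = all(value not in (None, "") for value in values)
        # Promote only when every record has a value and all values agree.
        if fields and complete and all(value == values[0] for value in values):
            root[name] = values[0]
            for field in remaining:
                field.pop(name)
    root["fields"] = remaining
    return root


rows = [
    {"name": "age", "schemaVersion": "0.2.0"},
    {"name": "sex", "schemaVersion": "0.2.0"},
]
document = promote_uniform_properties(rows)
```

If even one record lacks the property, nothing is promoted and the field-level values stay where they are.
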

### root ("data dictionary level") and field property cascading pattern
Akin to the above json to csv, more generally:

All root level properties will be applied to individual fields IF this same field level property is not specified (i.e., field-level takes precedence). This strategy can be seen in the [data package standard (but with missingValues)](https://specs.frictionlessdata.io/patterns/#missing-values-per-field)
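
A minimal sketch of this cascading pattern, assuming the json document is a dict with a `fields` list. The function name `cascade_root_properties` and the default `cascading` tuple are hypothetical, not part of the repository.

```python
def cascade_root_properties(dictionary, cascading=("schemaVersion",)):
    """Apply root-level values to each field unless the field sets its own."""
    root_values = {
        name: dictionary[name] for name in cascading if name in dictionary
    }
    # Field values are unpacked last, so they override root-level values.
    return [{**root_values, **field} for field in dictionary.get("fields", [])]


data_dictionary = {
    "schemaVersion": "0.2.0",
    "fields": [
        {"name": "age"},
        {"name": "sex", "schemaVersion": "0.1.0"},
    ],
}
records = cascade_root_properties(data_dictionary)
```

The second field keeps its own `schemaVersion`, which illustrates field-level precedence; the differing version is contrived for the example.
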


### csv and json vlmd document file naming

It is recommended that the json and csv translations of a vlmd document share the same stem name,
with the corresponding `.csv` and `.json` extensions (eg `my-heal-dd.csv` and `my-heal-dd.json`)

## Additional table-level (root) and field-level properties

Some table-level or field-level properties in other standards (or custom properties in specific use cases) do not map onto
a core HEAL property. To allow these properties to be included, we list these property names under `propertyNames`.

❗ For study or use case specific names, it is recommended to put the property under a `custom` namespace (e.g., `"custom":{"myvarname"}`). Additional properties listed here are reserved for well established standards and/or property names used in practice.

☝️ At the root level, [`propertyNames`](https://json-schema.org/draft-07/json-schema-validation#rfc.section.6.5.8) was used to:

1. allow inclusion and minimal validation of these extra properties (ie only the existence of the property names) without making any assumptions about the corresponding property types.
2. provide a clear distinction between "core" properties and "extra" properties.

One consideration, however, is that `propertyNames` was introduced in json schema draft-06, so validators that only support earlier drafts will ignore it.
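
To make the `propertyNames` behaviour concrete, the sketch below emulates an enum-style `propertyNames` check with the standard library only. The schema fragment and its property names (`title`, `custom`, `someExtraName`) are hypothetical and do not reproduce the actual vlmd schema; a real workflow would use a json schema validator instead.

```python
# Hypothetical schema fragment; the property names are illustrative only.
ROOT_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "custom": {"type": "object"},
    },
    # propertyNames applies to every key, so core names are listed as well.
    "propertyNames": {"enum": ["title", "custom", "someExtraName"]},
}


def invalid_property_names(instance, schema):
    """Return the keys rejected by an enum-style propertyNames subschema."""
    allowed = schema.get("propertyNames", {}).get("enum")
    if allowed is None:
        return []  # no restriction on names
    return [name for name in instance if name not in allowed]
```

Only the name of `someExtraName` is checked; its value is unconstrained because no entry for it exists under `properties`.
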

## Considerations
