From 6ffe40cadcaaa78b4e28caa43eafd57142ac9db1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mirko=20M=C3=A4licke?=
Date: Mon, 11 Dec 2023 15:44:59 +0100
Subject: [PATCH 1/4] remove unnecessary sections

---
 docs/input.md | 72 +++++++++++++++++---------------------------------
 1 file changed, 24 insertions(+), 48 deletions(-)

diff --git a/docs/input.md b/docs/input.md
index cca9e97..d02937b 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -46,7 +46,7 @@ if the parameterization of a tool can be applied to other data. That means, the

 From a practical perspective, if you build a tool around these tool specifications,
 the tool name and content of the sections `parameters` and `data` of `/in/input.json`
-can be used to create checksums and therefor help to establish reproducible workflows.
+can be used to create checksums and therefore help to establish reproducible workflows.

 ## Parameters: File specification

@@ -147,7 +147,9 @@ Note, that default parameters are only parsed if they are not set as `optional=t
 ## Data: File specification

 All input `Data` is described in a data block in the `/src/tool.yml` file.
-All sets of input data are collected as the **optional** `tools..data` block:
+All sets of input data are collected as the **optional** `tools..data` block.
+The simples declaration of input data is to list all available data files in a
+single, top-level list:

 ```yaml
 tools:
@@ -155,59 +157,36 @@ tools:
     parameters:
       [...]
     data:
-      foo_data:
-        [...]
+      - foo_data
+      - foo_data2
 ```

-Refer to the section below to learn about mandatory and optional fields for `Data`.
-
-
-### Fields
-
-The following section defines all mandatory and optional fields of a `Data` entity.
-
-#### `load`
-
-This is the only **mandatory** field for an entity of `Data`.
-Boolean field which defaults to `true`. If set to `load=false`, the file is not parsed by the
-library used for parsing input.
In this case, file paths are passed as ordinary strings and
-the parsing library will not attempt to load the file.
-
-There are a number of file formats, which are loaded by default:
-
-
-| file extension | Python | R | Matlab | NodeJS |
-| ---------------|--------|-----|---------|----------|
-| .dat | `numpy.array` | `vector` | `matrix` | `number[][]` |
-| .csv | `pandas.DataFrame` | `data.frame` | `matrix` | `number[][]` |
-
-
-Note that setting `load=false` can be helpful when developing tools that require to load the
-data in a different way than it is provided by the parsing libraries.
-
-#### `extension`
-
-By default, the file format is derived from the file extension given in the path to the data
-in `input.json`. Via the `extension` field, it is possible to override the file format of input
-data. This way, it can be ensured that the library used for parsing the input always loads the
-file in the respective datastructure to the tool. If the file format / extension is not
-supported by the parsing library, file paths are passed just as strings, the parsing library
-will not attempt to load the file (see above for supported formats).
+If any of the dataset sources requires a more detailed configuration, objects
+can be specifies as well:

 ```yaml
 tools:
   foobar:
     parameters:
-      ...
+      [...]
     data:
       foo_data:
-        load: true
-        extension: .csv
+        description: Our first dataset with foo properties
+      foo_data2:
+        description: Our second dataset with foo2 properties
 ```

+Refer to the section below to learn about the fields for `Data`.
+
+
+### Fields
+
+The following section defines all fields of a `Data` entity.
+
+
 #### `description`

-The `description` is a multiline comment to describe the input data.
+The `description` is a single- or multiline comment to describe the input data.
 For the `description` Markdown is allowed, although tool-frameworks are not required to parse it.
 Descriptions are optional and can be omitted.

@@ -248,12 +227,9 @@ tools:
         description: An optional array of floats
     data:
       foo_csv_data:
-        load: true
-        extension: .csv
         description: |
-          The parsing library will try to load the data like .csv files,
-          regardless of the file extension.
+          This is a CSV file that should contain valid input. We do currently
+          not specify, what that exactly means.
       foo_nc_data:
-        load: false
-        description: netCDF data that is not loaded by the parsing library.
+        description: CF-netCDF 1.8 conform climate model output.
 ```
From 01deee899d6aec58fac6ee040e27e1eca8273e58 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mirko=20M=C3=A4licke?=
Date: Tue, 12 Dec 2023 10:13:12 +0100
Subject: [PATCH 2/4] Add optional data fields

---
 docs/input.md | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/docs/input.md b/docs/input.md
index d02937b..aa5d30a 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -188,7 +188,8 @@ The following section defines all fields of a `Data` entity.

 The `description` is a single- or multiline comment to describe the input data.
 For the `description` Markdown is allowed, although tool-frameworks are not required to parse it.
-Descriptions are optional and can be omitted.
+Descriptions are optional and can be omitted, but it is highly recommended to
+add descriptions to all required data inputs.

 A multiline comment in YAML can be specified like:

@@ -198,6 +199,37 @@ description: |
   This is the second line
 ```

+#### `example`
+
+The `example` field is optional and can be used to reference a sample dataset
+for the given input, **within** the container. Data examples are a prime source
+for your users to understand how inputs should look like and be formatted.
+
+```yaml
+example: /samples/input_name.csv
+```
+
+#### `quality`
+
+The `quality` field is an optional field, that contains various sub-fields.
+These text-based fields can be used to specify data quality requirements for the
+input data.
+The quality field can contain one or more of the child fields, but cannot be empty.
+
+```yaml
+quality:
+  completeness: Describes the expectations of present variables and measurements.
+  accuracy: |
+    Describes if the tool has expectations of minimum required accurary.
+    This can involve measurement accuracy, but also expected scaling.
+  validity: |
+    Describes which format requirements the tool has, to recognize the passed in
+    files as valid data inputs.
+```
+
+There are additional dimensions to data quality, consistency, timeliness and
+uniqueness. These dimensions do not apply here as general catgories.
+

 ## Example

From 36a9f399e06c11710db39ee1b34da20cc300cbf0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mirko=20M=C3=A4licke?=
Date: Tue, 12 Dec 2023 13:38:35 +0100
Subject: [PATCH 3/4] some fixes

---
 docs/input.md | 31 ++++++++-----------------------
 1 file changed, 8 insertions(+), 23 deletions(-)

diff --git a/docs/input.md b/docs/input.md
index aa5d30a..58623d4 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -148,7 +148,7 @@ Note, that default parameters are only parsed if they are not set as `optional=t

 All input `Data` is described in a data block in the `/src/tool.yml` file.
 All sets of input data are collected as the **optional** `tools..data` block.
-The simples declaration of input data is to list all available data files in a
+The simplest declaration of input data is to list all available data files in a
 single, top-level list:

 ```yaml
@@ -206,30 +206,15 @@ for the given input, **within** the container. Data examples are a prime source
 for your users to understand how inputs should look like and be formatted.

 ```yaml
-example: /samples/input_name.csv
+example: /in/input_name.csv
 ```

-#### `quality`
-
-The `quality` field is an optional field, that contains various sub-fields.
-These text-based fields can be used to specify data quality requirements for the
-input data.
-The quality field can contain one or more of the child fields, but cannot be empty.
-
-```yaml
-quality:
-  completeness: Describes the expectations of present variables and measurements.
-  accuracy: |
-    Describes if the tool has expectations of minimum required accurary.
-    This can involve measurement accuracy, but also expected scaling.
-  validity: |
-    Describes which format requirements the tool has, to recognize the passed in
-    files as valid data inputs.
-```
-
-There are additional dimensions to data quality, consistency, timeliness and
-uniqueness. These dimensions do not apply here as general catgories.
-
+It is considered good practice to add example data and example parameterizaitons
+to the `/in/` folder. At inspection time, when a client application reads the
+`tool.yml`, this client can also access the examples in the `/in/` folder.
+At runtime, as the client application mounts data and parameterizations into the
+container at `/in/`, the examples are non-existent in the container and cannot
+accidentally pollute the runtime container.

 ## Example

From 062070861b54ba3813de00eada3fa1717dbe6000 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mirko=20M=C3=A4licke?=
Date: Tue, 12 Dec 2023 13:55:57 +0100
Subject: [PATCH 4/4] add the extension field

---
 docs/input.md | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/docs/input.md b/docs/input.md
index 58623d4..df416c9 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -216,6 +216,26 @@ At runtime, as the client application mounts data and parameterizations into the
 container at `/in/`, the examples are non-existent in the container and cannot
 accidentally pollute the runtime container.

+
+#### `extension`
+
+The `extension` field is optional and can be used to limit the permitted file
+extensions for a data input. Allowed is a single string input or a list of strings.
+By convention, the point `.` should be included into the `extension` as well.
+
+```yaml
+extension: .csv
+```
+
+```yaml
+extension:
+  - .dat
+  - .txt
+  - .DAT
+  - .TXT
+```
+
+
 ## Example

 ```yaml