diff --git a/docs/input.md b/docs/input.md
index cca9e97..df416c9 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -46,7 +46,7 @@ if the parameterization of a tool can be applied to other data. That means, the
 From a practical perspective, if you build a tool around these tool specifications, the
 tool name and content of the sections `parameters` and `data` of `/in/input.json`
-can be used to create checksums and therefor help to establish reproducible workflows.
+can be used to create checksums and therefore help to establish reproducible workflows.
 
 
 ## Parameters: File specification
@@ -147,7 +147,9 @@ Note, that default parameters are only parsed if they are not set as `optional=t
 ## Data: File specification
 
 All input `Data` is described in a data block in the `/src/tool.yml` file.
-All sets of input data are collected as the **optional** `tools..data` block:
+All sets of input data are collected as the **optional** `tools..data` block.
+The simplest declaration of input data is to list all available data files in a
+single, top-level list:
 
 ```yaml
 tools:
@@ -155,68 +157,82 @@ tools:
     parameters:
       [...]
     data:
-      foo_data:
-        [...]
+      - foo_data
+      - foo_data2
 ```
 
-Refer to the section below to learn about mandatory and optional fields for `Data`.
+If any of the dataset sources requires a more detailed configuration, objects
+can be specified as well:
 
+```yaml
+tools:
+  foobar:
+    parameters:
+      [...]
+    data:
+      foo_data:
+        description: Our first dataset with foo properties
+      foo_data2:
+        description: Our second dataset with foo2 properties
+```
 
-### Fields
+Refer to the section below to learn about the fields for `Data`.
 
-The following section defines all mandatory and optional fields of a `Data` entity.
 
 
-#### `load`
+### Fields
 
-This is the only **mandatory** field for an entity of `Data`.
-Boolean field which defaults to `true`. If set to `load=false`, the file is not parsed by the
-library used for parsing input. In this case, file paths are passed as ordinary strings and
-the parsing library will not attempt to load the file.
+The following section defines all fields of a `Data` entity.
 
-There are a number of file formats, which are loaded by default:
+#### `description`
 
-| file extension | Python | R | Matlab | NodeJS |
-| ---------------|--------|-----|---------|----------|
-| .dat | `numpy.array` | `vector` | `matrix` | `number[][]` |
-| .csv | `pandas.DataFrame` | `data.frame` | `matrix` | `number[][]` |
+The `description` is a single- or multiline comment to describe the input data.
+For the `description` Markdown is allowed, although tool-frameworks are not required to parse it.
+Descriptions are optional and can be omitted, but it is highly recommended to
+add descriptions to all required data inputs.
+A multiline comment in YAML can be specified like:
 
-Note that setting `load=false` can be helpful when developing tools that require to load the
-data in a different way than it is provided by the parsing libraries.
+```yaml
+description: |
+  This is the first line
+  This is the second line
+```
 
 
-#### `extension`
+#### `example`
 
-By default, the file format is derived from the file extension given in the path to the data
-in `input.json`. Via the `extension` field, it is possible to override the file format of input
-data. This way, it can be ensured that the library used for parsing the input always loads the
-file in the respective datastructure to the tool. If the file format / extension is not
-supported by the parsing library, file paths are passed just as strings, the parsing library
-will not attempt to load the file (see above for supported formats).
+The `example` field is optional and can be used to reference a sample dataset
+for the given input, **within** the container. Data examples are a prime source
+for your users to understand how inputs should look like and be formatted.
 
 ```yaml
-tools:
-  foobar:
-    parameters:
-      ...
-    data:
-      foo_data:
-        load: true
-        extension: .csv
+example: /in/input_name.csv
 ```
 
 
-#### `description`
+It is considered good practice to add example data and example parameterizations
+to the `/in/` folder. At inspection time, when a client application reads the
+`tool.yml`, this client can also access the examples in the `/in/` folder.
+At runtime, as the client application mounts data and parameterizations into the
+container at `/in/`, the examples are non-existent in the container and cannot
+accidentally pollute the runtime container.
 
-The `description` is a multiline comment to describe the input data.
-For the `description` Markdown is allowed, although tool-frameworks are not required to parse it.
-Descriptions are optional and can be omitted.
-A multiline comment in YAML can be specified like:
+#### `extension`
+
+The `extension` field is optional and can be used to limit the permitted file
+extensions for a data input. Allowed is a single string input or a list of strings.
+By convention, the point `.` should be included into the `extension` as well.
 
 ```yaml
-description: |
-  This is the first line
-  This is the second line
+extension: .csv
+```
+
+```yaml
+extension:
+  - .dat
+  - .txt
+  - .DAT
+  - .TXT
 ```
 
 
@@ -248,12 +264,9 @@ tools:
         description: An optional array of floats
     data:
       foo_csv_data:
-        load: true
-        extension: .csv
         description: |
-          The parsing library will try to load the data like .csv files,
-          regardless of the file extension.
+          This is a CSV file that should contain valid input. We do currently
+          not specify, what that exactly means.
       foo_nc_data:
-        load: false
-        description: netCDF data that is not loaded by the parsing library.
+        description: CF-netCDF 1.8 conform climate model output.
 ```
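
A minimal sketch of how a client might consume the two `data` declaration styles this change documents (a plain list of names, or a mapping of names to field objects) and apply the `extension` field. It assumes the `tool.yml` content has already been parsed into Python objects. The helper names `normalize_data_block` and `extension_allowed` are hypothetical and not part of any existing parsing library. Case-sensitive suffix matching is also an assumption, suggested only by the example listing both `.dat` and `.DAT`.

```python
from pathlib import PurePosixPath


def normalize_data_block(data):
    """Return {dataset_name: fields} for either declaration style.

    Hypothetical helper: accepts the list form (names only) or the
    mapping form (name -> field object) of the optional `data` block.
    """
    if data is None:
        # The `data` block is optional, so a missing block means no inputs.
        return {}
    if isinstance(data, list):
        # List form: every dataset gets an empty set of fields.
        return {name: {} for name in data}
    if isinstance(data, dict):
        # Mapping form: copy the fields, tolerating empty (None) entries.
        return {name: dict(fields or {}) for name, fields in data.items()}
    raise TypeError(f"unsupported data block type: {type(data).__name__}")


def extension_allowed(path, extension=None):
    """Check a file path against an `extension` field (string or list).

    Assumption: matching is case-sensitive and the leading dot is part
    of the configured extension, following the documented convention.
    """
    if extension is None:
        # No restriction configured for this input.
        return True
    allowed = [extension] if isinstance(extension, str) else list(extension)
    return PurePosixPath(path).suffix in allowed


if __name__ == "__main__":
    simple = normalize_data_block(["foo_data", "foo_data2"])
    detailed = normalize_data_block(
        {"foo_data": {"description": "Our first dataset", "extension": ".csv"}}
    )
    print(simple)
    print(extension_allowed("/in/foo.csv", detailed["foo_data"].get("extension")))
```

Normalizing both styles to one mapping keeps later checks, such as the extension test, independent of how the tool author chose to declare the inputs.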