-
Notifications
You must be signed in to change notification settings - Fork 4
HXL schemas
David Megginson edited this page May 29, 2018
·
17 revisions
The Validation page validates a HXL dataset against a simple, spreadsheet-style HXL schema. This article describes the schema format.
The schema is itself a HXL dataset, using the following hashtags:
Schema tag | Required | Description | Example |
---|---|---|---|
#valid_tag | yes | A tag pattern for the hashtag being described, including the "#" character. | #sector |
#valid_required | no | Without the +min or +max attributes, a truthy value (like "1") means simply that the value is required. | 1 |
#valid_required+min | no | The minimum number of times a non-empty value for the tag must appear in each row of the dataset. Defaults to no minimum. | 1 |
#valid_required+max | no | The maximum number of times a non-empty value for the tag may appear in each row of the dataset. Defaults to no maximum. | 5 |
#valid_unique | no | Require individual values in all matching columns to be unique throughout the document | true |
#valid_unique+key | no | Define a comma-separated list of tag patterns that determines whether two rows match, and report any duplicate rows using a compound key made up of matching values from the row. | #org,#adm1+code,#sector |
#valid_correlation | no | Define a comma-separated list of tag patterns that should always have values that should always have the same values for any given value of #valid_tag. See the note on correlation for more information. | #adm1,#adm2 (for #adm3) |
#valid_datatype | no | The type of data expected in the column under the HXL tag. Currently-allowed values are "text", "number", "url", "email", and "phone" ("date" coming soon). Defaults to no type checking. | number |
#valid_datatype+consistent | no | Test for consistent datatypes in a column. Test first for dates (only if tagged #date), then for numbers, then for strings. Ignores empty values. Recommend using this test with a status of "info" or "warning", since it is purely heuristic. | true |
#valid_value+whitespace | no | Reports an error for any irregular whitespace (allows no leading or trailing space, and only single internal space characters) when set to a truthy value. | true |
#valid_value+min | no | The minimum value allowed when #valid_datatype is "number". Defaults to no minimum value. Ignored for non-numeric datatypes. | 100 |
#valid_value+max | no | The maximum value allowed when #valid_datatype is "number". Defaults to no maximum value. Ignored for non-numeric datatypes. | 10000 |
#valid_value+regex | no | A regular expression pattern that the value must match. | ^([0-9])(,[0-9])*$ |
#valid_value+outliers | no | If truthy, report numeric outliers (ignores non-numeric values). Calculates the median and standard deviation (excluding the highest and lowest values), then reports any value with a distance of more than three standard deviations from the median. Recommend using this test with a status of "info" or "warning", since it is purely heuristic. The test will be skipped for any column whose values have a coefficient of variation > 1.0. | True |
#valid_value+spelling | no | If truthy, look for deviant spellings in the column(s) specified. Makes no assumption about language in use, etc, but just detects rare spellings that are similar to more-common spellings. Sses #valid_value+case if specified. Recommend using this test with a status of "info" or "warning", since it is purely heuristic. The test will be skipped for any column whose spelling frequencies have a coefficient of variation > 1.0. | true |
#valid_value+list | no | A list of allowed values, separated by "|". | female|male |
#valid_value+case | no | A truthy value like "1" if matches for patterns and enumerations should be case-insensitive. | 0 |
#valid_value+url | no | The URL of a HXL dataset containing allowed values (possibly thousands of them). Use together with #valid_value+target_tag for the hashtag of the column containing the values. (Note: if the URL points to an unavailable resource, or to one that can't be parsed as HXL, there will be only a single error mentioning that.) | http://example.org/codes/p-codes.hxl |
#valid_value+target_tag | no | When used together with #valid_value+url, a tag pattern for the column containing the allowed values in the external HXL dataset. | #adm1+code |
#valid_severity | no | The severity of the error, for user feedback. Allowed values are "info", "warning", or "error" (the default). | warning |
#description | no | A human-readable description of the error, to provide user feedback. | It is a good idea to include at least one #sector column in a 3W. |
The generic core HXL schema is available on HDX at https://data.humdata.org/dataset/hxl-core-schemas
Here is a simple sample schema in spreadsheet and JSON formats.
#valid_tag | #valid_severity | #valid_required +min | #valid_required +max | #valid_value +list | #description |
---|---|---|---|---|---|
#org | error | 1 | You must provide the name of the organisation doing the work. | ||
#sector | error | 1 | 1 | WASH | Health | Education | CCCM | Protection | You must provide the primary cluster for the activity |
#subsector | info | 1 | Adding a subsector enables better aid coordination. | ||
#country | error | 1 | 1 | Guinea | Liberia | Sierra Leone | You must specify the country where the work is taking place. |
#adm1 | warning | 1 | We strongly encourage specifying the administrative subdivision as well as the country. |
[
{
"#valid_tag":"#org",
"#valid_severity":"error",
"#valid_required+min":1,
"#description":"You must provide the name of the organisation doing the work."
},
{
"#valid_tag":"#sector",
"#valid_severity":"error",
"#valid_required+min":1,
"#valid_required+max":1,
"#valid_value+list":"WASH | Health | Education | CCCM | Protection",
"#description":"You must provide the primary cluster for the activity"
},
{
"#valid_tag":"#subsector",
"#valid_severity":"info",
"#valid_required+min":1,
"#description":"Adding a subsector enables better aid coordination."
},
{
"#valid_tag":"#country",
"#valid_severity":"error",
"#valid_required+min":1,
"#valid_required+max":1,
"#valid_value+list":"Guinea | Liberia | Sierra Leone"
"#description":"You must specify the country where the work is taking place."
},
{
"#valid_tag":"#adm1",
"#valid_severity":"warning",
"#valid_required+min":1,
"#description":"We strongly encourage specifying the administrative subdivision as well as the country."
}
]
Learn more about the HXL standard at http://hxlstandard.org