Skip to content

HXL schemas

David Megginson edited this page May 29, 2018 · 17 revisions

The Validation page validates a HXL dataset against a simple, spreadsheet-style HXL schema. This article describes the schema format.

Schema hashtags

The schema is itself a HXL dataset, using the following hashtags:

Schema tag Required Description Example
#valid_tag yes A tag pattern for the hashtag being described, including the "#" character. #sector
#valid_required no Without the +min or +max attributes, a truthy value (like "1") means simply that the value is required. 1
#valid_required+min no The minimum number of times a non-empty value for the tag must appear in each row of the dataset. Defaults to no minimum. 1
#valid_required+max no The maximum number of times a non-empty value for the tag may appear in each row of the dataset. Defaults to no maximum. 5
#valid_unique no Require individual values in all matching columns to be unique throughout the document true
#valid_unique+key no Define a comma-separated list of tag patterns that determines whether two rows match, and report any duplicate rows using a compound key made up of matching values from the row. #org,#adm1+code,#sector
#valid_correlation no Define a comma-separated list of tag patterns that should always have values that should always have the same values for any given value of #valid_tag. See the note on correlation for more information. #adm1,#adm2 (for #adm3)
#valid_datatype no The type of data expected in the column under the HXL tag. Currently-allowed values are "text", "number", "url", "email", and "phone" ("date" coming soon). Defaults to no type checking. number
#valid_datatype+consistent no Test for consistent datatypes in a column. Test first for dates (only if tagged #date), then for numbers, then for strings. Ignores empty values. Recommend using this test with a status of "info" or "warning", since it is purely heuristic. true
#valid_value+whitespace no Reports an error for any irregular whitespace (allows no leading or trailing space, and only single internal space characters) when set to a truthy value. true
#valid_value+min no The minimum value allowed when #valid_datatype is "number". Defaults to no minimum value. Ignored for non-numeric datatypes. 100
#valid_value+max no The maximum value allowed when #valid_datatype is "number". Defaults to no maximum value. Ignored for non-numeric datatypes. 10000
#valid_value+regex no A regular expression pattern that the value must match. ^([0-9])(,[0-9])*$
#valid_value+outliers no If truthy, report numeric outliers (ignores non-numeric values). Calculates the median and standard deviation (excluding the highest and lowest values), then reports any value with a distance of more than three standard deviations from the median. Recommend using this test with a status of "info" or "warning", since it is purely heuristic. The test will be skipped for any column whose values have a coefficient of variation > 1.0. True
#valid_value+spelling no If truthy, look for deviant spellings in the column(s) specified. Makes no assumption about language in use, etc, but just detects rare spellings that are similar to more-common spellings. Sses #valid_value+case if specified. Recommend using this test with a status of "info" or "warning", since it is purely heuristic. The test will be skipped for any column whose spelling frequencies have a coefficient of variation > 1.0. true
#valid_value+list no A list of allowed values, separated by "|". female|male
#valid_value+case no A truthy value like "1" if matches for patterns and enumerations should be case-insensitive. 0
#valid_value+url no The URL of a HXL dataset containing allowed values (possibly thousands of them). Use together with #valid_value+target_tag for the hashtag of the column containing the values. (Note: if the URL points to an unavailable resource, or to one that can't be parsed as HXL, there will be only a single error mentioning that.) http://example.org/codes/p-codes.hxl
#valid_value+target_tag no When used together with #valid_value+url, a tag pattern for the column containing the allowed values in the external HXL dataset. #adm1+code
#valid_severity no The severity of the error, for user feedback. Allowed values are "info", "warning", or "error" (the default). warning
#description no A human-readable description of the error, to provide user feedback. It is a good idea to include at least one #sector column in a 3W.

Sample schema

The generic core HXL schema is available on HDX at https://data.humdata.org/dataset/hxl-core-schemas

Here is a simple sample schema in spreadsheet and JSON formats.

Spreadsheet-flavour schema

#valid_tag #valid_severity #valid_required +min #valid_required +max #valid_value +list #description
#org error 1 You must provide the name of the organisation doing the work.
#sector error 1 1 WASH | Health | Education | CCCM | Protection You must provide the primary cluster for the activity
#subsector info 1 Adding a subsector enables better aid coordination.
#country error 1 1 Guinea | Liberia | Sierra Leone You must specify the country where the work is taking place.
#adm1 warning 1 We strongly encourage specifying the administrative subdivision as well as the country.

JSON-flavour schema

[
    {
        "#valid_tag":"#org",
        "#valid_severity":"error",
        "#valid_required+min":1,
        "#description":"You must provide the name of the organisation doing the work."
    },
    {
        "#valid_tag":"#sector",
        "#valid_severity":"error",
        "#valid_required+min":1,
        "#valid_required+max":1,
        "#valid_value+list":"WASH | Health | Education | CCCM | Protection",
        "#description":"You must provide the primary cluster for the activity"
    },
    {
        "#valid_tag":"#subsector",
        "#valid_severity":"info",
        "#valid_required+min":1,
        "#description":"Adding a subsector enables better aid coordination."
    },
    {
        "#valid_tag":"#country",
        "#valid_severity":"error",
        "#valid_required+min":1,
        "#valid_required+max":1,
        "#valid_value+list":"Guinea | Liberia | Sierra Leone"
        "#description":"You must specify the country where the work is taking place."
    },
    {
        "#valid_tag":"#adm1",
        "#valid_severity":"warning",
        "#valid_required+min":1,
        "#description":"We strongly encourage specifying the administrative subdivision as well as the country."
    }
]
Clone this wiki locally