Community schemas: allowing less fields than are defined in the schema #636
Replies: 17 comments 1 reply
-
Hi @peterdesmet, it's a very good issue. It needs to be discussed at the specs level, as it's not only a dimension requirement but also an addressing one: the mapping between file fields and schema fields is tied to their order. At the software level it's much easier; you just need to provide:

extract('path/to/table.csv', schema=schema, sync_schema=True)
validate('path/to/table.csv', schema=schema, sync_schema=True)

https://colab.research.google.com/drive/1is_PcpzFl42aWI2B2tHaBGj3jxsKZ_eZ#scrollTo=myOn0wYQcIHY
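The sync_schema behaviour quoted above can be sketched in plain Python. This is a hypothetical illustration of the concept, not frictionless-py's actual implementation: the schema's field list is subset and reordered to match the labels actually found in the file's header.

```python
# Hypothetical sketch of what a "sync schema" option does conceptually:
# subset and reorder the schema's fields to match the header found in
# the data file. Not frictionless-py's real implementation.

def sync_schema(schema, header):
    """Return a new schema whose fields match `header` by name and order."""
    by_name = {field["name"]: field for field in schema["fields"]}
    synced_fields = [by_name[label] for label in header if label in by_name]
    return {**schema, "fields": synced_fields}

schema = {
    "fields": [
        {"name": "id", "type": "string", "constraints": {"required": True}},
        {"name": "title", "type": "string"},
        {"name": "year", "type": "integer"},
    ]
}

# The file has fewer columns than the schema, and in a different order.
header = ["year", "id"]
synced = sync_schema(schema, header)
print([f["name"] for f in synced["fields"]])  # ['year', 'id']
```

Note that fields absent from the header (here, title) simply drop out of the synced schema, which is exactly the behaviour questioned later in this thread.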
-
Fantastic! So the implementation allows having fewer fields (1) and fields in a different order (2) than are provided in the schema. Question: is this also available in the CLI? The fewer-fields case (1) is addressed in the specs.
-
Yes, it's available in the CLI. I'm not sure why the spec uses that wording.
-
But given the need (and the already-supported implementation), the issue at hand is to change the spec accordingly.
-
I agree. Would you like to PR?
-
Yes, please see frictionlessdata/datapackage#707
-
Commented by @roll in frictionlessdata/datapackage#707 (comment):
-
I don't understand this. E.g. a camera trap data package for a specific project has CSV files. These can be described literally (no reference) in the package descriptor.
-
I would use something like this just because a Schema makes sense (and can exist) even when you don't have a file.
-
I mean you can write an abstract Table Schema. It's not tied to any file. That's why it can be a community/shared/etc. schema 😃 One schema can describe N data files. By contrast, Data Package and Data Resource are tied to exact files. A Table Schema is pure metadata.
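The point about one abstract schema describing N data files can be made concrete with a small sketch (the field names and file headers below are invented for illustration):

```python
# One abstract Table Schema (pure metadata, shown here as a plain dict)
# is not tied to any file: any number of files can declare conformance.
shared_schema = {
    "fields": [
        {"name": "deployment_id", "type": "string"},
        {"name": "latitude", "type": "number"},
        {"name": "longitude", "type": "number"},
    ]
}

def conforms(header, schema):
    """Naive positional check: header labels equal schema field names."""
    return header == [f["name"] for f in schema["fields"]]

# Two different data files, one shared schema.
file_a = ["deployment_id", "latitude", "longitude"]
file_b = ["deployment_id", "latitude", "longitude"]
print(conforms(file_a, shared_schema), conforms(file_b, shared_schema))  # True True
```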
-
There is a problem: this will not work in any existing software unless you provide special flags at the software level. To make it possible, we would need a whole new feature in the specs that allows software to look up fields by name, as an explicit analogue of the current order-based matching.
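The order-dependence described here can be shown in a short sketch contrasting the spec's positional matching with the name-based lookup being discussed (field names are invented):

```python
# Hypothetical sketch contrasting positional matching with a name-based
# lookup. Field names are invented for illustration.
schema_order = ["id", "name", "age"]   # field order declared in the schema
file_header = ["name", "id", "age"]    # column order found in the file

# Positional matching pairs columns and fields by index, so a reordered
# file produces mismatches even though every field is present.
positional = list(zip(file_header, schema_order))
mismatches = [pair for pair in positional if pair[0] != pair[1]]
print(mismatches)  # [('name', 'id'), ('id', 'name')]

# Name-based lookup resolves each column to its schema field by label,
# so column order no longer matters.
name_to_index = {name: i for i, name in enumerate(schema_order)}
resolved = {label: name_to_index[label] for label in file_header}
print(resolved)  # {'name': 1, 'id': 0, 'age': 2}
```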
-
Regarding
-
Regarding the Oct 1 message from @roll:
I find that adding sync_schema makes it impossible to trap header errors, for example non-matching-header. In the validation report, the header is changed to include possibly incorrect incoming header strings, and constraints initially in the referenced schema (e.g. 'required') are dropped; type becomes 'any'. I encountered this while trying to move from goodtables to frictionless to take advantage of its xlsx support. My situation is that incoming data can have fewer fields than the schema describes, but a few of the fields are required. Data creators shouldn't have to care about the order of fields, and in goodtables (2.4.1) this is not an issue.
-
Hi @kgeographer, can you please post a simple example of the header and the field names in your case? I'm trying to figure out: if labels in the data don't match the field names, how can we link them for validation?
-
Yes, my case arises in performing validation of data contributions for World Historical Gazetteer, using the LP-TSV variation of our contribution format. I've attached a sample data file, a sample schema, and a little Python script for a demo.

One problem is that with sync_schema=True, the file is validated even if a header for a required field is missing, or if the column is missing altogether; I guess because the schema is modified on the fly to simply match what arrives.

The second problem is that if sync_schema is not used, and an incoming file's header is not in the same order as the schema, and/or it is missing some non-required fields, then it generates many errors, because columns are evaluated against the schema by index/position, not by header string. Thanks
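One way to keep the best of both behaviours, as a hypothetical workaround rather than any frictionless-py API, is to check required fields against the incoming header yourself before validating with schema syncing enabled (the field names below are invented):

```python
# Hypothetical pre-validation check: detect required schema fields whose
# column is missing from the file header, which a "sync schema" option
# would otherwise silently drop.

def missing_required(schema, header):
    """Return names of required schema fields absent from the file header."""
    return [
        f["name"]
        for f in schema["fields"]
        if f.get("constraints", {}).get("required") and f["name"] not in header
    ]

schema = {
    "fields": [
        {"name": "id", "type": "string", "constraints": {"required": True}},
        {"name": "title", "type": "string", "constraints": {"required": True}},
        {"name": "ccodes", "type": "string"},  # optional field
    ]
}

header = ["id", "ccodes"]  # the 'title' column is missing entirely
print(missing_required(schema, header))  # ['title']
```

If this check returns a non-empty list, the file can be rejected up front; otherwise validation can proceed with the synced schema for the optional fields.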
-
Thanks @kgeographer, I've created a feature request: frictionlessdata/frictionless-py#546
-
The Data Resource spec allows pointing to an externally hosted Resource Schema (schema: url-or-path). This is great, because it allows communities to, for example, develop and host an agreed-upon set of Table Schemas (example), and data producers within that community to indicate that they are following the shared schema.

A very common use, however, is that data producers will have fewer fields in their CSV files than are defined in the schema (e.g. skipping some non-required fields that are not applicable). According to the Table Schema specs, that should be possible (key word is SHOULD) as long as the fields are in the correct order. Is this a correct assumption?

As in:

Should be valid against (an externally hosted):

Validation in e.g. frictionless-py seems to require the exact same number of fields, which is a serious drawback for externally hosted schemas, so I wanted to understand first what the intention of the spec is.
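The assumption in the opening post (fewer fields are fine, as long as they keep the schema's order) can be expressed as a small check. This is a sketch with invented field names, not part of any spec or library:

```python
# Hypothetical check for the spec's SHOULD clause: the file header may
# skip fields, but the labels it does have must follow the schema order.

def is_ordered_subset(header, schema_field_names):
    """True if header labels appear in schema order (fields may be skipped)."""
    remaining = iter(schema_field_names)
    # `label in remaining` consumes the iterator up to the match, so each
    # label must occur after the previous one in the schema's order.
    return all(label in remaining for label in header)

schema_field_names = ["id", "title", "year", "ccodes"]

print(is_ordered_subset(["id", "year"], schema_field_names))  # True
print(is_ordered_subset(["year", "id"], schema_field_names))  # False
```

Under this reading, a file with fewer columns still validates, while a reordered file does not, which matches the order-based matching the spec currently implies.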