Community schemas: allowing less fields than are defined in the schema #636
Replies: 17 comments 1 reply
-
Hi @peterdesmet, it's a very good issue. It needs to be discussed at the specs level, as it's not only a dimension requirement but also an addressing one: the mapping between file fields and schema fields is tied to their order. At the software level it's much easier; you just need to provide:

extract('path/to/table.csv', schema=schema, sync_schema=True)
validate('path/to/table.csv', schema=schema, sync_schema=True)

https://colab.research.google.com/drive/1is_PcpzFl42aWI2B2tHaBGj3jxsKZ_eZ#scrollTo=myOn0wYQcIHY
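The sync_schema behaviour quoted above can be sketched in plain Python. This is a hypothetical illustration of the concept, not frictionless-py's actual implementation: the schema's field list is subset and reordered to match the labels actually found in the file's header.

```python
# Hypothetical sketch of what a "sync schema" option does conceptually:
# subset and reorder the schema's fields to match the header found in
# the data file. Not frictionless-py's real implementation.

def sync_schema(schema, header):
    """Return a new schema whose fields match `header` by name and order."""
    by_name = {field["name"]: field for field in schema["fields"]}
    synced_fields = [by_name[label] for label in header if label in by_name]
    return {**schema, "fields": synced_fields}

schema = {
    "fields": [
        {"name": "id", "type": "string", "constraints": {"required": True}},
        {"name": "title", "type": "string"},
        {"name": "year", "type": "integer"},
    ]
}

# The file has fewer columns than the schema, and in a different order.
header = ["year", "id"]
synced = sync_schema(schema, header)
print([f["name"] for f in synced["fields"]])  # ['year', 'id']
```

Note that fields absent from the header (here, title) simply drop out of the synced schema, which is exactly the behaviour questioned later in this thread.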
-
Fantastic! So the implementation allows having fewer fields (1) and fields in a different order (2) than are provided in the schema. Question: is this also available in the CLI? The fewer-fields case (1) is addressed in the specs.
-
Yes, it's available in the CLI. I'm not sure why the spec uses that wording.
-
But given the need (and the already-supported implementation), the issue at hand is to change the spec accordingly.
-
I agree. Would you like to PR?
-
Yes, please see frictionlessdata/datapackage#707
-
Commented by @roll in frictionlessdata/datapackage#707 (comment):
-
I don't understand this. E.g. a camera trap data package for a specific project has CSV files. These can be described literally (no reference) in the package descriptor.
-
I would use something like this just because a Schema makes sense (and can exist) even when you don't have a file.
-
I mean you can write an abstract Table Schema. It's not tied to any file. That's why it can be a community/shared/etc. schema 😃 One schema can describe N data files. By contrast, Data Package and Data Resource are tied to exact files. A Table Schema is pure metadata.
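The point about one abstract schema describing N data files can be made concrete with a small sketch (the field names and file headers below are invented for illustration):

```python
# One abstract Table Schema (pure metadata, shown here as a plain dict)
# is not tied to any file: any number of files can declare conformance.
shared_schema = {
    "fields": [
        {"name": "deployment_id", "type": "string"},
        {"name": "latitude", "type": "number"},
        {"name": "longitude", "type": "number"},
    ]
}

def conforms(header, schema):
    """Naive positional check: header labels equal schema field names."""
    return header == [f["name"] for f in schema["fields"]]

# Two different data files, one shared schema.
file_a = ["deployment_id", "latitude", "longitude"]
file_b = ["deployment_id", "latitude", "longitude"]
print(conforms(file_a, shared_schema), conforms(file_b, shared_schema))  # True True
```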
-
There is a problem: this will not work in any existing software unless you provide special flags at the software level. To make it possible, we would need a whole new feature in the specs that allows software to look up fields by name, as an explicit analogue of the current order-based matching.
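The order-dependence described here can be shown in a short sketch contrasting the spec's positional matching with the name-based lookup being discussed (field names are invented):

```python
# Hypothetical sketch contrasting positional matching with a name-based
# lookup. Field names are invented for illustration.
schema_order = ["id", "name", "age"]   # field order declared in the schema
file_header = ["name", "id", "age"]    # column order found in the file

# Positional matching pairs columns and fields by index, so a reordered
# file produces mismatches even though every field is present.
positional = list(zip(file_header, schema_order))
mismatches = [pair for pair in positional if pair[0] != pair[1]]
print(mismatches)  # [('name', 'id'), ('id', 'name')]

# Name-based lookup resolves each column to its schema field by label,
# so column order no longer matters.
name_to_index = {name: i for i, name in enumerate(schema_order)}
resolved = {label: name_to_index[label] for label in file_header}
print(resolved)  # {'name': 1, 'id': 0, 'age': 2}
```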
-
Regarding
-
Regarding the Oct 1 message from @roll:
I find that adding sync_schema makes it impossible to trap header errors, for example non-matching-header. In the validation report, the header is changed to include possibly incorrect incoming header strings, and constraints initially in the referenced schema (e.g. 'required') are dropped; type becomes 'any'. I encountered this while trying to move from goodtables to frictionless to take advantage of its xlsx support. My situation is that incoming data can have fewer fields than the schema describes, but a few of the fields are required. Data creators shouldn't have to care about the order of fields, and in goodtables (2.4.1) this is not an issue.
-
Hi @kgeographer, can you please post a simple example of the header and the field names in your case? I'm trying to figure out: if labels in the data don't match the field names, how can we link them for validation?
-
Yes, my case arises in performing validation of data contributions for World Historical Gazetteer, using the LP-TSV variation of our contribution format. I've attached a sample data file, a sample schema, and a little Python script for a demo.

One problem is that with sync_schema=True, the file is validated even if a header for a required field is missing, or if the column is missing altogether; I guess because the schema is modified on the fly to simply match what arrives.

The second problem is that if sync_schema is not used, and an incoming file's header is not in the same order as the schema, and/or it is missing some non-required fields, then it generates many errors, because columns are evaluated against the schema by index/position, not by header string. Thanks
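One way to keep the best of both behaviours, as a hypothetical workaround rather than any frictionless-py API, is to check required fields against the incoming header yourself before validating with schema syncing enabled (the field names below are invented):

```python
# Hypothetical pre-validation check: detect required schema fields whose
# column is missing from the file header, which a "sync schema" option
# would otherwise silently drop.

def missing_required(schema, header):
    """Return names of required schema fields absent from the file header."""
    return [
        f["name"]
        for f in schema["fields"]
        if f.get("constraints", {}).get("required") and f["name"] not in header
    ]

schema = {
    "fields": [
        {"name": "id", "type": "string", "constraints": {"required": True}},
        {"name": "title", "type": "string", "constraints": {"required": True}},
        {"name": "ccodes", "type": "string"},  # optional field
    ]
}

header = ["id", "ccodes"]  # the 'title' column is missing entirely
print(missing_required(schema, header))  # ['title']
```

If this check returns a non-empty list, the file can be rejected up front; otherwise validation can proceed with the synced schema for the optional fields.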
-
Thanks @kgeographer, I've created a feature request: frictionlessdata/frictionless-py#546
-
The Data Resource spec allows pointing to an externally hosted Resource Schema (schema: url-or-path). This is great, because it allows communities to, for example, develop and host an agreed-upon set of Table Schemas (example), and data producers within that community to indicate that they are following the shared schema.

A very common use, however, is that data producers will have fewer fields in their CSV files than are defined in the schema (e.g. skipping some non-required fields that are not applicable). According to the Table Schema specs, that should be possible (key word is SHOULD) as long as the fields are in the correct order. Is this a correct assumption?

As in:

Should be valid against (an externally hosted):

Validation in e.g. frictionless-py seems to require the exact same number of fields, which is a serious drawback for externally hosted schemas, so I wanted to understand first what the intention of the spec is.
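The assumption in the opening post (fewer fields are fine, as long as they keep the schema's order) can be expressed as a small check. This is a sketch with invented field names, not part of any spec or library:

```python
# Hypothetical check for the spec's SHOULD clause: the file header may
# skip fields, but the labels it does have must follow the schema order.

def is_ordered_subset(header, schema_field_names):
    """True if header labels appear in schema order (fields may be skipped)."""
    remaining = iter(schema_field_names)
    # `label in remaining` consumes the iterator up to the match, so each
    # label must occur after the previous one in the schema's order.
    return all(label in remaining for label in header)

schema_field_names = ["id", "title", "year", "ccodes"]

print(is_ordered_subset(["id", "year"], schema_field_names))  # True
print(is_ordered_subset(["year", "id"], schema_field_names))  # False
```

Under this reading, a file with fewer columns still validates, while a reordered file does not, which matches the order-based matching the spec currently implies.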