Replies: 5 comments
-
@iSnow good catch on the ambiguity of implementation here. Schema inference from JSON data (without header order info) will lead to ambiguity, at least on header order, depending on how the parser sorts the keys. One would hope that parsers somehow respect the written order (at least for the first row), but there is no guarantee of that AFAIK (any experience you have with that in your language of choice would be useful, BTW). IMO order does matter, so the example you gave of two Table Schemas differing only in field order is a real concern. /cc @roll
-
@rufuspollock and @roll I don't think it is wise to rely on an implementation detail that is more restrictive than the spec, no matter how much more convenient that would be. The Java JSON.org parser seems to respect JSON property order, but it is getting swapped out anyway. It actually gets even more interesting, as I just found out with real-world JSON data, with JSON like the following:
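(An illustrative stand-in with hypothetical field names; the null-valued `middle_name` is simply left out of the first record.)

```json
[
  {"first_name": "Ada", "last_name": "Lovelace"},
  {"first_name": "Alan", "middle_name": "Mathison", "last_name": "Turing"}
]
```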
Still matches this Schema:
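(Again an illustrative stand-in: the Schema declares a field that the first record above omits.)

```json
{
  "fields": [
    {"name": "first_name", "type": "string"},
    {"name": "middle_name", "type": "string"},
    {"name": "last_name", "type": "string"}
  ]
}
```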
Because typically, null-valued properties are simply omitted in JSON serialization. For the time being, in the Java port I have switched JSON data validation against a Schema to a more permissive model that disregards property order and allows missing properties in the data. It still disallows additional properties not declared in the Schema, and for CSV data the tighter rules stay in place.

Ideally, the Data Resource spec should be tightened to disallow "naked" JSON-array based Resources and always require a Schema definition. Schema inference from JSON-array based Resources should maybe be dropped completely: it is very hard to infer a Schema from real-world JSON array data and be sure you have a full set of headers. For small JSON data you can just iterate through every array entry and take the union of the property keys you find, but at the very least a warning and a description of the problem should go into both the user docs and the implementation docs, because on really huge files that full scan will have a noticeable performance impact. It also does not solve the problem that your derived Schema will have the fields in a random order if your JSON library isn't order-preserving.

The brute-force inference of a Schema for JSON-array data works as follows:
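A minimal sketch of that brute-force pass, assuming org.json types (this is an illustration, not the actual tableschema-java code):

```java
import java.util.LinkedHashSet;
import java.util.Set;
import org.json.JSONArray;
import org.json.JSONObject;

public class JsonArrayHeaderInference {

    /**
     * Union of all property keys across every row of a JSON array of objects.
     * A LinkedHashSet keeps first-seen order, but that order still depends on
     * how the underlying parser ordered each object's keys, and keys whose
     * null values were omitted during serialization can never be recovered.
     */
    static Set<String> inferHeaders(JSONArray rows) {
        Set<String> headers = new LinkedHashSet<>();
        for (int i = 0; i < rows.length(); i++) {
            JSONObject row = rows.getJSONObject(i);
            headers.addAll(row.keySet());
        }
        return headers;
    }
}
```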
This is different from inference of a Schema for CSV data:
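A rough sketch of the contrast (hypothetical helper; naive comma splitting without quote handling, so not how a real implementation would parse CSV):

```java
import java.util.Arrays;
import java.util.List;

public class CsvHeaderInference {

    /**
     * For CSV, the header row is already the complete, ordered list of field
     * names, so no scan of the data rows is needed. A real implementation
     * would use a proper CSV parser to handle quoting and embedded commas.
     */
    static List<String> inferHeaders(String headerLine) {
        return Arrays.asList(headerLine.split(","));
    }
}
```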
-
We use Table Schema to define requirements on CSV content as well, and currently find that fields defined as object or array are rather weakly specified. The Table Schema spec says only that either of these should be a well-formed JSON object or array (respectively). I see in the discussion here that these fields' schemas could be inferred from examples, although doing so leads to questionable results. My question linked to this issue, and potentially a Table Schema spec feature request, would be to simply allow object and array fields to have a JSON Schema defined as part of their format definition. In fact, if a JSON Schema could be defined like this, it would make the field types object, array, geojson, and geopoint (with "object" format) quite similar, given that all these cases could be covered by defining a schema. Any thoughts would be appreciated (if considered worthy, I would be happy to create a separate issue for this).
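A rough sketch of what such a field definition could look like (the jsonSchema property name is purely hypothetical and not part of the current spec):

```json
{
  "fields": [
    {
      "name": "address",
      "type": "object",
      "jsonSchema": {
        "type": "object",
        "required": ["street", "city"],
        "properties": {
          "street": {"type": "string"},
          "city": {"type": "string"}
        }
      }
    }
  ]
}
```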
-
@costas80 I think your proposal makes a lot of sense. I'm not sure I would get rid of all those other types, but simply allowing a JSON Schema as an additional option for object and array fields seems reasonable.
-
Thanks for the feedback @rufuspollock. See issue frictionlessdata/datapackage#711.
-
Unclear about Schema inference from JSON data
In implementation.md it is written that implementations should support inferring a Schema from supplied data: "infer a Table Schema descriptor from a supplied sample of data". This makes it sound as if, no matter whether the data is CSV or an inline array of JSON objects, it should be possible to infer a Schema.
However, the JSON spec says "An object is an unordered set of name/value pairs.", which means that the following two data samples are equivalent:
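For illustration (the records and field names are hypothetical):

```json
[
  {"id": 1, "name": "apple", "price": 0.5},
  {"id": 2, "name": "banana", "price": 0.3}
]
```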
and
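(the same records, with the keys serialized in a different order):

```json
[
  {"price": 0.5, "id": 1, "name": "apple"},
  {"name": "banana", "price": 0.3, "id": 2}
]
```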
It is easy to see that since the ordering of properties is not guaranteed, it is not possible to infer a Schema with a guaranteed field order from JSON arrays containing JSON objects.
If I understand the Python implementation right (not a Python guy, so I may well be missing something), it is confused on this:
https://github.com/frictionlessdata/tableschema-py/blob/master/tableschema/infer.py says:
"source (any): source as path, url or inline data"
whereas https://github.com/frictionlessdata/tableschema-py/blob/master/tableschema/cli.py#L48 states "data must be CSV".
I don't know how an implementation should react to an attempt to infer a Schema from a JSON array containing JSON objects. Raise an exception? Just return the right fields in any old order?
Unclear about Schema application to JSON data
While the formal aspects of a Schema can be validated against the JSON Schema spec, I couldn't find much on whether the order of fields in a Schema should count when it is applied to CSV data, i.e. are the following two Schemas considered the same:
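For example (hypothetical fields):

```json
{
  "fields": [
    {"name": "id", "type": "integer"},
    {"name": "name", "type": "string"}
  ]
}
```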
and
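(the same fields, in reverse order):

```json
{
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "id", "type": "integer"}
  ]
}
```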
No matter what rules apply for CSV data, it is not possible to enforce order when applying a Schema to JSON arrays containing JSON objects. Therefore I guess validation of a data sample against a Schema should leave property order out of consideration. Am I right about this?