Replies: 5 comments
-
@iSnow good catch on the ambiguity of implementation here. Schema inference from JSON data (without header order info) will lead to ambiguity, at least on header order, depending on how the parser sorts the keys. One would hope that parsers somehow respect the written order (at least for the first row), but there is no guarantee of that AFAIK (any experience you have with that in your language of choice would be useful, BTW). IMO order does matter, so the example you gave of two Table Schemas differing only in field order is a real concern. /cc @roll
-
@rufuspollock and @roll I don't think it is wise to rely on an implementation detail that is more restrictive than the spec, no matter how much more convenient that would be. The Java JSON.org parser seems to respect JSON property order, but it is getting swapped out anyway. It actually gets even more interesting, as I just found out with real-world JSON data, with JSON like the following:
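(An illustrative stand-in with hypothetical field names; the null-valued `middle_name` is simply left out of the first record.)

```json
[
  {"first_name": "Ada", "last_name": "Lovelace"},
  {"first_name": "Alan", "middle_name": "Mathison", "last_name": "Turing"}
]
```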
Still matches this Schema:
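(Again an illustrative stand-in: the Schema declares a field that the first record above omits.)

```json
{
  "fields": [
    {"name": "first_name", "type": "string"},
    {"name": "middle_name", "type": "string"},
    {"name": "last_name", "type": "string"}
  ]
}
```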
Because typically, null-valued properties are simply omitted in JSON serialization. For the time being, in the Java port I have switched JSON data validation against a Schema to a more permissive model that disregards property order and allows missing properties in the data. It still disallows additional properties not declared in the Schema, and for CSV data the tighter rules stay in place.

Ideally, the Data Resource spec should be tightened to disallow "naked" JSON-array based Resources and always require a Schema definition. Schema inference from JSON-array based Resources should maybe be dropped completely: it is very hard to infer a Schema from real-world JSON array data and be sure you have a full set of headers. For small JSON data you can just iterate through every array entry and take the union of the property keys you find, but at the very least a warning and a description of the problem should go into both the user docs and the implementation docs, because on really huge files that full scan will have a noticeable performance impact. It also does not solve the problem that your derived Schema will have the fields in a random order if your JSON library isn't order-preserving.

The brute-force inference of a Schema for JSON-array data works as follows:
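A minimal sketch of that brute-force pass, assuming org.json types (this is an illustration, not the actual tableschema-java code):

```java
import java.util.LinkedHashSet;
import java.util.Set;
import org.json.JSONArray;
import org.json.JSONObject;

public class JsonArrayHeaderInference {

    /**
     * Union of all property keys across every row of a JSON array of objects.
     * A LinkedHashSet keeps first-seen order, but that order still depends on
     * how the underlying parser ordered each object's keys, and keys whose
     * null values were omitted during serialization can never be recovered.
     */
    static Set<String> inferHeaders(JSONArray rows) {
        Set<String> headers = new LinkedHashSet<>();
        for (int i = 0; i < rows.length(); i++) {
            JSONObject row = rows.getJSONObject(i);
            headers.addAll(row.keySet());
        }
        return headers;
    }
}
```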
This is different from inference of a Schema for CSV data:
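A rough sketch of the contrast (hypothetical helper; naive comma splitting without quote handling, so not how a real implementation would parse CSV):

```java
import java.util.Arrays;
import java.util.List;

public class CsvHeaderInference {

    /**
     * For CSV, the header row is already the complete, ordered list of field
     * names, so no scan of the data rows is needed. A real implementation
     * would use a proper CSV parser to handle quoting and embedded commas.
     */
    static List<String> inferHeaders(String headerLine) {
        return Arrays.asList(headerLine.split(","));
    }
}
```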
-
We use Table Schema to define requirements on CSV content as well, and currently find that fields defined as object or array are rather weakly specified. The Table Schema spec says only that either of these should be a well-formed JSON object or array (respectively). I see in the discussion here that these fields' schemas could be inferred from examples, although doing so leads to questionable results. My question linked to this issue, and potentially a Table Schema spec feature request, would be to simply allow object and array fields to have a JSON Schema defined as part of their format definition. In fact, if a JSON Schema could be defined like this, it would make the field types object, array, geojson, and geopoint (with "object" format) quite similar, given that all these cases could be covered by defining a schema. Any thoughts would be appreciated (if considered worthy, I would be happy to create a separate issue for this).
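A rough sketch of what such a field definition could look like (the jsonSchema property name is purely hypothetical and not part of the current spec):

```json
{
  "fields": [
    {
      "name": "address",
      "type": "object",
      "jsonSchema": {
        "type": "object",
        "required": ["street", "city"],
        "properties": {
          "street": {"type": "string"},
          "city": {"type": "string"}
        }
      }
    }
  ]
}
```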
-
@costas80 I think your proposal makes a lot of sense. I'm not sure I would get rid of all those other types, but simply allowing a JSON Schema as an additional option for object and array fields seems reasonable.
-
Thanks for the feedback @rufuspollock. See issue frictionlessdata/datapackage#711.
-
Unclear about Schema inference from JSON data
In implementation.md it is written that implementations should support inferring a Schema from supplied data: "infer a Table Schema descriptor from a supplied sample of data". This makes it sound as if, no matter whether the data is CSV or an inline array of JSON objects, it should be possible to infer a Schema.
However, the JSON spec says "An object is an unordered set of name/value pairs.", which means that the following two data samples are equivalent:
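For illustration (the records and field names are hypothetical):

```json
[
  {"id": 1, "name": "apple", "price": 0.5},
  {"id": 2, "name": "banana", "price": 0.3}
]
```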
and
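(the same records, with the keys serialized in a different order):

```json
[
  {"price": 0.5, "id": 1, "name": "apple"},
  {"name": "banana", "price": 0.3, "id": 2}
]
```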
It is easy to see that since the ordering of properties is not guaranteed, it is not possible to infer a Schema with a guaranteed field order from JSON arrays containing JSON objects.
If I understand the Python implementation right (not a Python guy, so I may well be missing something), it is confused on this:
https://github.com/frictionlessdata/tableschema-py/blob/master/tableschema/infer.py says:
"source (any): source as path, url or inline data"
whereas https://github.com/frictionlessdata/tableschema-py/blob/master/tableschema/cli.py#L48 states "data must be CSV".
I don't know how an implementation should react to an attempt to infer a Schema from a JSON array containing JSON objects. Raise an exception? Just return the right fields in any old order?
Unclear about Schema application to JSON data
While the formal aspects of a Schema can be validated against the JSON Schema spec, I couldn't find much on whether the order of fields in a Schema should count when it is applied to CSV data, i.e. are the following two Schemas considered the same:
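For example (hypothetical fields):

```json
{
  "fields": [
    {"name": "id", "type": "integer"},
    {"name": "name", "type": "string"}
  ]
}
```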
and
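(the same fields, in reverse order):

```json
{
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "id", "type": "integer"}
  ]
}
```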
No matter what rules apply for CSV data, it is not possible to enforce order when applying a Schema to JSON arrays containing JSON objects. Therefore I guess validation of a data sample against a Schema should leave property order out of consideration. Am I right about this?