Validation schema and online editor #67

simontaurus · 2024-04-23T06:02:38Z

We are currently implementing the RO-Crate / ELN file format for OpenSemanticLab.

Along the way we will create a validation schema using OO-LD.

As discussed with @SteffenBrinckmann, this issue is meant to share the work early and provide a first preview (code):
playground example

SteffenBrinckmann · 2024-04-23T07:28:04Z

@simontaurus I like the idea: immediately show the content of the .eln file in a github.io and allow the user to inspect it.

SteffenBrinckmann · 2024-10-07T13:02:53Z

Hey,
I want to come back to the schema again, as it can help us clarify whether something is compliant. The ro-crate-metadata.json is a JSON file, so we should be able to create a schema for it. I have no experience in writing schemas, but I volunteer to create a GitHub Action that verifies all .eln files once the schema exists.
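
A minimal sketch of what such a check could run, assuming an .eln file is a zip archive that contains ro-crate-metadata.json in its root or one folder below. The names `load_metadata`, `check_all` and `minimal_checks` are hypothetical; `minimal_checks` is only a placeholder for a real schema validation, which does not exist yet.

```python
import json
import zipfile
from pathlib import Path
from typing import Callable, Dict, List

METADATA_NAME = "ro-crate-metadata.json"


def load_metadata(eln_path: Path) -> dict:
    """Read ro-crate-metadata.json from an .eln (zip) archive."""
    with zipfile.ZipFile(eln_path) as archive:
        # Assumption: the file sits in the archive root or inside a folder.
        candidates = [
            name for name in archive.namelist()
            if name == METADATA_NAME or name.endswith("/" + METADATA_NAME)
        ]
        if not candidates:
            raise FileNotFoundError(f"{METADATA_NAME} not found in {eln_path}")
        # Prefer the shallowest match if there are several.
        name = min(candidates, key=lambda n: n.count("/"))
        with archive.open(name) as handle:
            return json.load(handle)


def minimal_checks(metadata: dict) -> List[str]:
    """Placeholder validator: only checks the two top-level keys."""
    return [
        f"missing key {key!r}"
        for key in ("@context", "@graph")
        if key not in metadata
    ]


def check_all(folder: Path,
              validate: Callable[[dict], List[str]]) -> Dict[str, List[str]]:
    """Validate every .eln below `folder`; map each path to its problems."""
    report: Dict[str, List[str]] = {}
    for eln_path in sorted(folder.rglob("*.eln")):
        try:
            report[str(eln_path)] = validate(load_metadata(eln_path))
        except (zipfile.BadZipFile, FileNotFoundError,
                json.JSONDecodeError) as err:
            report[str(eln_path)] = [f"unreadable: {err}"]
    return report
```

A CI job could call `check_all(Path("examples"), minimal_checks)` (the folder name is an assumption) and fail if any list in the report is non-empty.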

@FlorianRhiem, @nicobrandt, @NicolasCARPi : can you help; have you already created one?

This part has no GUI / preview aspect.

NicolasCARPi · 2024-10-07T13:49:35Z

TBH I'm surprised RO-Crate doesn't provide a schema. Seems it would be most helpful to everyone. This seems to discuss it: ResearchObject/ro-crate#33

Maybe the JSON-LD nature of it makes it hard to create... Given that all the nodes can accept a wide range of properties, which can themselves have values of different types and subtypes, I wonder if it's even possible to create a schema that will validate all .eln files. It will necessarily be restrictive/incomplete, and we will need to adjust it often, based on our use.

Here is one generated from an eLabFTW metadata file, as a starting point:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "@context": {
      "type": "string",
      "format": "uri"
    },
    "@graph": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "@id": {
            "type": "string"
          },
          "@type": {
            "type": "string"
          },
          "about": {
            "type": "object",
            "properties": {
              "@id": {
                "type": "string"
              }
            }
          },
          "conformsTo": {
            "type": "object",
            "properties": {
              "@id": {
                "type": "string",
                "format": "uri"
              }
            }
          },
          "dateCreated": {
            "type": "string",
            "format": "date-time"
          },
          "sdPublisher": {
            "type": "object",
            "properties": {
              "@id": {
                "type": "string"
              }
            }
          },
          "version": {
            "type": "string"
          },
          "author": {
            "type": "object",
            "properties": {
              "@id": {
                "type": "string"
              }
            }
          },
          "dateModified": {
            "type": "string",
            "format": "date-time"
          },
          "name": {
            "type": "string"
          },
          "encodingFormat": {
            "type": "string"
          },
          "url": {
            "type": "string",
            "format": "uri"
          },
          "genre": {
            "type": "string"
          },
          "creativeWorkStatus": {
            "type": "string"
          },
          "identifier": {
            "type": "string"
          },
          "keywords": {
            "type": "string"
          },
          "step": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "@type": {
                  "type": "string"
                },
                "position": {
                  "type": "integer"
                },
                "creativeWorkStatus": {
                  "type": "string"
                },
                "itemListElement": {
                  "type": "array",
                  "items": {
                    "type": "object",
                    "properties": {
                      "@type": {
                        "type": "string"
                      },
                      "text": {
                        "type": "string"
                      }
                    }
                  }
                }
              }
            }
          },
          "hasPart": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "@id": {
                  "type": "string"
                }
              }
            }
          },
          "comment": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "@id": {
                  "type": "string"
                },
                "@type": {
                  "type": "string"
                },
                "dateCreated": {
                  "type": "string",
                  "format": "date-time"
                },
                "text": {
                  "type": "string"
                },
                "author": {
                  "type": "object",
                  "properties": {
                    "@id": {
                      "type": "string"
                    }
                  }
                }
              }
            }
          }
        },
        "required": ["@id", "@type"]
      }
    }
  },
  "required": ["@context", "@graph"]
}

FlorianRhiem · 2024-10-07T14:05:25Z

> @FlorianRhiem, @nicobrandt, @NicolasCARPi : can you help; have you already created one?

I haven't created one. Instead, I'm using a custom parser (after using the built-in json module to parse the data as JSON in the first place). This way I can directly split things up in ways that are useful for SampleDB, create warnings or error messages based on that, etc. It might be better to have a unified parser, but doing it this way got things rolling quickly.
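
To illustrate the general pattern (this is not SampleDB's actual code), here is a rough sketch of a hand-written parser that first loads the JSON and then collects warnings and errors while indexing the nodes. `ParseResult` and `parse_metadata` are hypothetical names, and the specific checks are assumptions chosen for the example.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ParseResult:
    nodes: Dict[str, dict] = field(default_factory=dict)
    warnings: List[str] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)


def parse_metadata(text: str) -> ParseResult:
    """Parse ro-crate-metadata.json text and index its nodes by '@id'."""
    result = ParseResult()
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        result.errors.append(f"not valid JSON: {err}")
        return result

    graph = data.get("@graph") if isinstance(data, dict) else None
    if not isinstance(graph, list):
        result.errors.append("'@graph' is missing or not a list")
        return result

    for index, node in enumerate(graph):
        # Nodes without a usable identifier cannot be referenced, so skip them.
        if not isinstance(node, dict) or not isinstance(node.get("@id"), str):
            result.errors.append(f"node {index} has no string '@id'")
            continue
        if "@type" not in node:
            result.warnings.append(f"node {node['@id']!r} has no '@type'")
        if node["@id"] in result.nodes:
            result.warnings.append(f"duplicate '@id' {node['@id']!r}")
        result.nodes[node["@id"]] = node
    return result
```

An importer can then decide per application whether warnings are shown to the user or whether any error aborts the import.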

simontaurus · 2024-10-13T14:30:15Z

The main issue is that RO-Crate in principle specifies an RDF graph, serialized as flattened and compacted JSON-LD (@graph with a list of nodes). This limits any syntactic validation (such as JSON Schema) compared to semantic RDF "schemas" like SHACL (which, vice versa, have their own limits in adoption and tool availability).

However, looking more closely, the RO-Crate spec is not purely semantic but also syntactic, e.g.:

@type: MUST be Dataset
datePublished: MUST be a string in ISO 8601 date format and SHOULD be specified to at least the precision of a day, MAY be a timestamp down to the millisecond.

Purely semantic would mean, e.g., "the triple <./.> <rdf:type> <http://schema.org/Dataset> is present".
This leads to the interpretation (and to implementations, including https://github.com/TheELNConsortium/TheELNFileFormat/blob/master/tests/test_01_params_metadata_json.py#L53) that the named keys (e.g. @type) must be present, but in fact these two JSON-LD documents are semantically equivalent:

{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@type": "Dataset"
}

{
  "@context": [
    "https://w3id.org/ro/crate/1.1/context",
    {
      "type": "@type"
    }
  ],
  "type": "Dataset"
}
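
The equivalence of the two documents can be sketched in a few lines, assuming we only look at aliases that an inline @context maps directly to @id or @type and only at top-level keys. The helper names are hypothetical, and remote contexts are not fetched; a real implementation would rely on the expansion algorithm of a JSON-LD processor instead.

```python
from typing import Any, Dict

JSONLD_KEYWORDS = {"@id", "@type"}


def keyword_aliases(context: Any) -> Dict[str, str]:
    """Collect terms that an inline @context maps directly to a keyword."""
    aliases: Dict[str, str] = {}
    entries = context if isinstance(context, list) else [context]
    for entry in entries:
        # Remote contexts (plain strings) are deliberately not resolved here.
        if isinstance(entry, dict):
            for term, target in entry.items():
                if isinstance(target, str) and target in JSONLD_KEYWORDS:
                    aliases[term] = target
    return aliases


def unalias(document: dict) -> dict:
    """Rename aliased top-level keys back to their JSON-LD keywords."""
    aliases = keyword_aliases(document.get("@context"))
    return {aliases.get(key, key): value for key, value in document.items()}


plain = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@type": "Dataset",
}
aliased = {
    "@context": [
        "https://w3id.org/ro/crate/1.1/context",
        {"type": "@type"},
    ],
    "type": "Dataset",
}

# A test that only looks for the literal key "@type" accepts `plain` but
# rejects `aliased`, although both carry the same type after unaliasing.
assert unalias(plain)["@type"] == unalias(aliased)["@type"] == "Dataset"
```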

In addition, RO-Crate inherits the very lax handling of data types, cardinality, range and requiredness of properties, further complicating validation and consumer implementations.
Therefore we suggest creating, at least for the ELN format, syntactic (but semantically annotated) schemas (see OO-LD) for a normalized tree shape of the ro-crate-metadata.json. Luckily, JSON-LD can help us here by simply applying the framing algorithm, in combination with some aliases (@id => id, @type => type) in order to get valid variable names, e.g. in Python.
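
As a rough illustration of the target shape, the following sketch turns a flattened @graph into a tree and applies the two aliases. It is a hand-rolled stand-in, not the JSON-LD framing algorithm itself: it ignores contexts and type-based matching, and `to_tree` is a hypothetical name. In practice a JSON-LD processor would do this step.

```python
from typing import Any, Dict, FrozenSet

ALIASES = {"@id": "id", "@type": "type"}


def to_tree(graph: list, root_id: str) -> dict:
    """Embed referenced nodes below the root node and alias @id / @type."""
    index: Dict[str, dict] = {
        node["@id"]: node for node in graph if "@id" in node
    }

    def embed(value: Any, seen: FrozenSet[str]) -> Any:
        if isinstance(value, list):
            return [embed(item, seen) for item in value]
        if isinstance(value, dict):
            node_id = value.get("@id")
            # A bare {"@id": ...} reference is replaced by the referenced
            # node, unless the target is unknown or already on the current
            # path (which would otherwise recurse forever).
            if (set(value) == {"@id"} and node_id in index
                    and node_id not in seen):
                value = index[node_id]
                seen = seen | {node_id}
            return {
                ALIASES.get(key, key): embed(item, seen)
                for key, item in value.items()
            }
        return value

    return embed({"@id": root_id}, frozenset())


graph = [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "a.csv"}]},
    {"@id": "a.csv", "@type": "File", "name": "a"},
]
tree = to_tree(graph, "ro-crate-metadata.json")
```

Here `tree["about"]["hasPart"][0]` is the full File node with the keys `id`, `type` and `name`, which is the kind of nested structure a tree-shaped JSON Schema can describe.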

Early ELN-Fileformat Schema Draft

{
  "@context": [
    "https://w3id.org/ro/crate/1.1/context",
    {
      "id": "@id",
      "type": "@type"
    }
  ],
  "title": "RO-Crate",
  "type": "object",
  "version": "0.1.1",
  "required": [
    "id",
    "type",
    "conformsTo",
    "sdPublisher"
  ],
  "definitions": {
    "Thing": {
      "type": "object",
      "required": [
        "type"
      ],
      "properties": {
        "type": {
          "type": "string",
          "default": "Thing"
        },
        "id": {
          "type": "string"
        },
        "name": {
          "type": "string"
        },
        "description": {
          "type": "string"
        }
      }
    },
    "Organization": {
      "allOf": [
        {
          "$ref": "#/definitions/Thing"
        }
      ],
      "type": "object",
      "required": [
        "type"
      ],
      "properties": {
        "type": {
          "type": "string",
          "default": "Organization"
        },
        "url": {
          "type": "string",
          "format": "url"
        },
        "areaServed": {
          "type": "string"
        },
        "slogan": {
          "type": "string"
        },
        "logo": {
          "type": "string",
          "format": "url",
          "links": [
            {
              "href": "{{self}}",
              "type": "img/png"
            }
          ]
        },
        "parentOrganization": {
          "$ref": "#/definitions/Organization"
        }
      },
      "options": {
        "display_required_only": true
      }
    },
    "Person": {
      "allOf": [
        {
          "$ref": "#/definitions/Thing"
        }
      ],
      "title": "Person",
      "type": "object",
      "properties": {
        "email": {
          "type": "string",
          "format": "email"
        },
        "familyName": {
          "type": "string"
        },
        "givenName": {
          "type": "string"
        }
      }
    },
    "CreativeWork": {
      "allOf": [
        {
          "$ref": "#/definitions/Thing"
        }
      ],
      "title": "CreativeWork",
      "type": "object",
      "properties": {
        "dateCreated": {
          "type": "string",
          "format": "datetime-local",
          "options": {
            "flatpickr": {}
          }
        },
        "dateModified": {
          "type": "string",
          "format": "datetime-local",
          "options": {
            "flatpickr": {}
          }
        },
        "keywords": {
          "type": "array",
          "items": {
            "type": "string"
          }
        }
      }
    },
    "File": {
      "allOf": [
        {
          "$ref": "#/definitions/CreativeWork"
        }
      ],
      "title": "File",
      "type": "object",
      "id": "file",
      "required": [
        "type",
        "id"
      ],
      "properties": {
        "type": {
          "type": "string",
          "enum": [
            "File"
          ]
        },
        "_id": {
          "type": "string"
        },
        "encodingFormat": {
          "type": "string"
        },
        "id": {
          "type": "string",
          "format": "url",
          "options": {
            "upload": {
              "upload_handler": "testUploadHandler"
            }
          },
          "links": [
            {
              "href": "dummy",
              "rel": "view / download"
            }
          ]
        }
      },
      "not": { "required": [ "hasPart" ] },
      "options": {
        "display_required_only": true
      }
    },
    "Dataset": {
      "allOf": [
        {
          "$ref": "#/definitions/CreativeWork"
        }
      ],
      "title": "Dataset",
      "type": "object",
      "required": [
        "type",
        "id"
      ],
      "properties": {
        "type": {
          "type": "string",
          "enum": [
            "Dataset"
          ]
        },
        "about": {
          "$ref": "#/definitions/Thing"
        },
        "hasPart": {
          "type": "array",
          "format": "tabs",
          "items": {
            "discriminator": {
              "propertyName": "type",
              "mapping": {
                "File": "#/definitions/File",
                "Dataset": "#/definitions/Dataset"
              }
            },
            "oneOf": [
              {
                "$ref": "#/definitions/File"
              },
              {
                "$ref": "#/definitions/Dataset"
              }
            ]
          }
        }
      },
      "options": {
        "display_required_only": true
      }
    }
  },
  "properties": {
    "id": {
      "type": "string",
      "enum": [
        "ro-crate-metadata.json"
      ]
    },
    "type": {
      "type": "string",
      "enum": [
        "CreativeWork"
      ]
    },
    "conformsTo": {
      "type": "string",
      "enum": [
        "https://w3id.org/ro/crate/1.1"
      ]
    },
    "version": {
      "type": "string"
    },
    "sdPublisher": {
      "$ref": "#/definitions/Organization"
    },
    "about": {
      "$ref": "#/definitions/Dataset",
      "properties": {
        "id": {
          "enum": [
            "./"
          ]
        }
      },
      "default": {
        "hasPart": []
      }
    }
  },
  "options": {
    "_display_required_only": true
  }
}

This would give us three outcomes:

1. programming-language-agnostic validation by using JSON-LD normalization + JSON Schema
2. code generation to automate parsing and serialization
3. UI generation for generic viewers / editors

As a demo for 1. and 3.,
ELN Fileformat Playground - PASTA.eln
reveals that Steffen has nested a Dataset in a File, which could at least be discussed.

For 2., throwing the same schema at OO-LD Python playground gives us pydantic dataclasses both for validation and implementation.

Generated Dataclasses

from __future__ import annotations

from enum import Enum
from typing import List, Literal, Optional, Union

from pydantic import BaseModel, EmailStr, Field


class Id(Enum):
    ro_crate_metadata_json = "ro-crate-metadata.json"


class Type(Enum):
    CreativeWork = "CreativeWork"


class ConformsTo(Enum):
    https___w3id_org_ro_crate_1_1 = "https://w3id.org/ro/crate/1.1"


class Thing(BaseModel):
    type: str
    id: Optional[str] = None
    name: Optional[str] = None
    description: Optional[str] = None


class Organization(Thing):
    type: str
    url: Optional[str] = None
    areaServed: Optional[str] = None
    slogan: Optional[str] = None
    logo: Optional[str] = Field(None, links=[{"href": "{{self}}", "type": "img/png"}])
    parentOrganization: Optional[Organization] = None


class Person(Thing):
    email: Optional[EmailStr] = None
    familyName: Optional[str] = None
    givenName: Optional[str] = None


class CreativeWork(Thing):
    dateCreated: Optional[str] = Field(None, options={"flatpickr": {}})
    dateModified: Optional[str] = Field(None, options={"flatpickr": {}})
    keywords: Optional[List[str]] = None


class Type1(Enum):
    File = "File"


class File(CreativeWork):
    type: Literal["File"]
    field_id: Optional[str] = Field(None, alias="_id")
    encodingFormat: Optional[str] = None
    id: str = Field(
        ...,
        links=[{"href": "dummy", "rel": "view / download"}],
        options={"upload": {"upload_handler": "testUploadHandler"}},
    )


class Type2(Enum):
    Dataset = "Dataset"


class ROCrate(BaseModel):
    id: Id
    type: Type
    conformsTo: ConformsTo
    version: Optional[str] = None
    sdPublisher: Organization
    about: Optional[Dataset] = Field(
        default_factory=lambda: Dataset.parse_obj({"hasPart": []})
    )


class Dataset(CreativeWork):
    type: Literal["Dataset"]
    about: Optional[Thing] = None
    hasPart: Optional[List[Union[File, Dataset]]] = Field(None, discriminator="type")
    id: str

While this approach never forbids additional properties, we can easily define in a machine-readable way which properties we expect to be used and how we expect them to be used. We can also define subclasses of Dataset or File with a strict metadata definition in order to allow automated processing, e.g. a CsvFile with CSVW-conformant column annotations.
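
To make the CsvFile idea a bit more concrete, here is a purely hypothetical sketch of what a stricter check for such a subclass could look like on the tree-shaped data. The terms `tableSchema`, `columns`, `name` and `datatype` are taken from the CSVW vocabulary, but how they would be attached to a File node in the ELN format is an assumption, as is the function name `check_csv_file`.

```python
from typing import List


def check_csv_file(node: dict) -> List[str]:
    """Return problems found in a hypothetical, strictly defined CsvFile node."""
    problems: List[str] = []
    if node.get("type") != "File":
        problems.append("type must be 'File'")
    if node.get("encodingFormat") != "text/csv":
        problems.append("encodingFormat must be 'text/csv'")

    schema = node.get("tableSchema")
    columns = schema.get("columns") if isinstance(schema, dict) else None
    if not isinstance(columns, list) or not columns:
        problems.append("tableSchema.columns must be a non-empty list")
        return problems

    # Every column needs at least a name and a datatype; any additional
    # properties on the node or its columns are deliberately not rejected.
    for position, column in enumerate(columns):
        for key in ("name", "datatype"):
            if not isinstance(column, dict) or key not in column:
                problems.append(f"column {position} lacks {key!r}")
    return problems
```

In the schema itself the same constraints would be expressed as a definition that extends File via allOf, so that validation, generated code and UI all pick them up.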

What do you think?

NicolasCARPi mentioned this issue Nov 3, 2024

Validation of examples for ELN-file format with RO-Crate validator fails #88

Open
