Validation schema and online editor #67

Open
simontaurus opened this issue Apr 23, 2024 · 5 comments

Comments

@simontaurus
Contributor

simontaurus commented Apr 23, 2024

We are currently implementing the RO-Crate / ELN file format for OpenSemanticLab.

Along the way we will create a validation schema using OO-LD.

As discussed with @SteffenBrinckmann, this issue is meant to share the work early and provide a first preview (code):
playground example

@SteffenBrinckmann
Collaborator

@simontaurus I like the idea: immediately show the content of the .eln file in a github.io and allow the user to inspect it.

@SteffenBrinckmann
Collaborator

Hey,
I want to come back to the schema again, as it can help us clarify whether something complies. The ro-crate-metadata.json is a JSON file, so we should be able to create a schema for it. I have no experience in writing schemas, but I am volunteering to create a GitHub action that verifies all .eln files once the schema exists.

@FlorianRhiem, @nicobrandt, @NicolasCARPi : can you help; have you already created one?

There is no gui / preview aspect to this.
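The verification step proposed above could start from something like the following sketch. It assumes that an .eln file is a ZIP archive with ro-crate-metadata.json at the top level or inside a single root folder, and it treats the actual schema check as a placeholder callable (`validate`, a hypothetical name) because the schema does not exist yet. The names `load_metadata`, `check_archives` and `has_graph` are illustrative only.

```python
import io
import json
import zipfile

METADATA_NAME = "ro-crate-metadata.json"


def load_metadata(eln_file):
    """Return the parsed metadata from an .eln archive (path or file object).

    Assumption: the archive is a ZIP file holding ro-crate-metadata.json
    either at the top level or inside a single root folder.
    """
    with zipfile.ZipFile(eln_file) as archive:
        candidates = [
            name for name in archive.namelist()
            if name.split("/")[-1] == METADATA_NAME and name.count("/") <= 1
        ]
        if len(candidates) != 1:
            raise ValueError(f"expected exactly one {METADATA_NAME}, found {candidates}")
        with archive.open(candidates[0]) as handle:
            return json.load(handle)


def check_archives(eln_files, validate):
    """Run validate(metadata) on each archive; return {name: [errors]} for failures.

    validate is a placeholder for the real schema check and must return a
    list of error messages (empty when the metadata is acceptable).
    """
    failures = {}
    for name, eln_file in eln_files.items():
        try:
            errors = validate(load_metadata(eln_file))
        except (ValueError, zipfile.BadZipFile) as exc:
            errors = [str(exc)]
        if errors:
            failures[name] = errors
    return failures


# Usage with an in-memory archive standing in for a real .eln file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "example/ro-crate-metadata.json",
        json.dumps({"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": []}),
    )


def has_graph(metadata):
    """Trivial stand-in for a schema check."""
    return [] if "@graph" in metadata else ["missing @graph"]


failures = check_archives({"example.eln": buffer}, has_graph)
```

In a CI job, a non-empty `failures` mapping would be printed and turned into a non-zero exit code; swapping `has_graph` for a real JSON Schema validator is the part that depends on the schema discussed in this issue.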

@NicolasCARPi
Contributor

TBH I'm surprised RO-Crate doesn't provide a schema. Seems it would be most helpful to everyone. This seems to discuss it: ResearchObject/ro-crate#33

Maybe the JSON-LD nature of it makes it hard to create... Given that all the nodes can accept a wide range of properties, which can themselves have values of different types and subtypes, I wonder if it's even possible to create a schema that will validate all eln. It will necessarily be restrictive/incomplete, and we will need to adjust it often, based on our use.

Here is one generated from eLabFTW metadata, as a starting point:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "@context": {
      "type": "string",
      "format": "uri"
    },
    "@graph": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "@id": {
            "type": "string"
          },
          "@type": {
            "type": "string"
          },
          "about": {
            "type": "object",
            "properties": {
              "@id": {
                "type": "string"
              }
            }
          },
          "conformsTo": {
            "type": "object",
            "properties": {
              "@id": {
                "type": "string",
                "format": "uri"
              }
            }
          },
          "dateCreated": {
            "type": "string",
            "format": "date-time"
          },
          "sdPublisher": {
            "type": "object",
            "properties": {
              "@id": {
                "type": "string"
              }
            }
          },
          "version": {
            "type": "string"
          },
          "author": {
            "type": "object",
            "properties": {
              "@id": {
                "type": "string"
              }
            }
          },
          "dateModified": {
            "type": "string",
            "format": "date-time"
          },
          "name": {
            "type": "string"
          },
          "encodingFormat": {
            "type": "string"
          },
          "url": {
            "type": "string",
            "format": "uri"
          },
          "genre": {
            "type": "string"
          },
          "creativeWorkStatus": {
            "type": "string"
          },
          "identifier": {
            "type": "string"
          },
          "keywords": {
            "type": "string"
          },
          "step": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "@type": {
                  "type": "string"
                },
                "position": {
                  "type": "integer"
                },
                "creativeWorkStatus": {
                  "type": "string"
                },
                "itemListElement": {
                  "type": "array",
                  "items": {
                    "type": "object",
                    "properties": {
                      "@type": {
                        "type": "string"
                      },
                      "text": {
                        "type": "string"
                      }
                    }
                  }
                }
              }
            }
          },
          "hasPart": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "@id": {
                  "type": "string"
                }
              }
            }
          },
          "comment": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "@id": {
                  "type": "string"
                },
                "@type": {
                  "type": "string"
                },
                "dateCreated": {
                  "type": "string",
                  "format": "date-time"
                },
                "text": {
                  "type": "string"
                },
                "author": {
                  "type": "object",
                  "properties": {
                    "@id": {
                      "type": "string"
                    }
                  }
                }
              }
            }
          }
        },
        "required": ["@id", "@type"]
      }
    }
  },
  "required": ["@context", "@graph"]
}

@FlorianRhiem
Contributor

> @FlorianRhiem, @nicobrandt, @NicolasCARPi : can you help; have you already created one?

I haven't created one, rather I'm using a custom implemented parser (after using the built-in json module for parsing the data as json in the first place). This way I can directly split things up in ways useful for SampleDB, create warnings or error messages based upon that, etc. It might be better to have a unified parser, but doing it this way got things rolling quickly.
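As an illustration of the kind of hand-written parser described above (not the actual SampleDB code), a minimal sketch could index the `@graph` nodes by `@id`, locate the root dataset through the metadata descriptor, and collect warnings instead of failing on the first problem. The function name `parse_crate` and the warning texts are made up for this example.

```python
import json


def parse_crate(text):
    """Parse RO-Crate metadata; return (root_dataset, nodes_by_id, warnings)."""
    warnings = []
    data = json.loads(text)

    # Index all nodes of the flattened graph by their @id.
    nodes = {}
    for node in data.get("@graph", []):
        node_id = node.get("@id")
        if node_id is None:
            warnings.append("node without @id skipped")
            continue
        nodes[node_id] = node

    # The metadata descriptor points at the root dataset via "about".
    descriptor = nodes.get("ro-crate-metadata.json")
    if descriptor is None:
        warnings.append("metadata descriptor 'ro-crate-metadata.json' not found")
        return None, nodes, warnings

    about = descriptor.get("about")
    root_id = about.get("@id") if isinstance(about, dict) else None
    root = nodes.get(root_id)
    if root is None:
        warnings.append(f"root dataset {root_id!r} not found")
    return root, nodes, warnings
```

Each consumer can then split the nodes up in whatever way suits its own data model, which is the flexibility mentioned above; the cost is that every implementation repeats this logic.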

@simontaurus
Contributor Author

simontaurus commented Oct 13, 2024

The main issue is that RO-Crate specifies, in principle, an RDF graph serialized as flattened and compacted JSON-LD (@graph with a list of nodes). This limits any syntactic validation (like JSON Schema) in comparison to semantic RDF "schemas" like SHACL (which, vice versa, have their own limits in adoption and tool availability).

However, looking more closely, the RO-Crate spec is not purely semantic but also syntactic:
e.g.

@type: MUST be Dataset
datePublished: MUST be a string in ISO 8601 date format and SHOULD be specified to at least the precision of a day, MAY be a timestamp down to the millisecond.

Purely semantic would mean, e.g. "the triple <./.> <rdf:type> <http://schema.org/Dataset> is present".
This leads to the interpretation (and to implementations, also in https://github.com/TheELNConsortium/TheELNFileFormat/blob/master/tests/test_01_params_metadata_json.py#L53) that the named keys (e.g. @type) must be present, but in fact these two JSON-LD documents are semantically equivalent:

{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@type": "Dataset"
}
{
  "@context": [
    "https://w3id.org/ro/crate/1.1/context",
    {
      "type": "@type"
    }
  ],
  "type": "Dataset"
}
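To make the difference concrete, here is a small sketch showing that a plain key lookup accepts only the first document, while a lookup that honours keyword aliases accepts both. It handles only aliases declared inline in `@context` and is not a JSON-LD processor; aliases defined in a remote context are ignored, and proper expansion with a JSON-LD library would be the robust route. The helper names are invented for this example.

```python
def keyword_aliases(context):
    """Collect alias -> keyword mappings from inline context objects only."""
    entries = context if isinstance(context, list) else [context]
    aliases = {}
    for entry in entries:
        if isinstance(entry, dict):
            for term, target in entry.items():
                # A term mapped to a string starting with "@" aliases a keyword.
                if isinstance(target, str) and target.startswith("@"):
                    aliases[term] = target
    return aliases


def get_keyword(node, keyword):
    """Return the value of a JSON-LD keyword, also looking under its aliases."""
    if keyword in node:
        return node[keyword]
    for alias, target in keyword_aliases(node.get("@context", [])).items():
        if target == keyword and alias in node:
            return node[alias]
    return None


doc_a = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@type": "Dataset",
}
doc_b = {
    "@context": [
        "https://w3id.org/ro/crate/1.1/context",
        {"type": "@type"},
    ],
    "type": "Dataset",
}

naive = ["@type" in doc_a, "@type" in doc_b]
resolved = [get_keyword(doc_a, "@type"), get_keyword(doc_b, "@type")]
```

Here `naive` differs between the two documents, while `resolved` is the same for both, which is the behaviour a semantically aware check would need.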

In addition, RO-Crate inherits the very lax handling of data types, cardinality, range and required-ness of properties, further complicating validation and consumer implementations.
Therefore we would suggest creating, at least for the ELN format, syntactic (but semantically annotated) schemas (see OO-LD) for a normalized tree shape of the ro-crate-metadata.json. Luckily, JSON-LD can help us here: simply apply the framing algorithm in combination with some aliases (@id => id, @type => type) in order to get valid variable names, e.g. in Python.
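The normalization described above can be approximated with a few lines of plain Python. This is a simplified stand-in for JSON-LD framing, not the framing algorithm itself: it embeds nodes referenced by a bare `{"@id": ...}` object, stops at references that would create a cycle, and renames `@id` and `@type` to `id` and `type`. A real implementation would rather call a JSON-LD library's framing function; `to_tree` is a name made up for this sketch.

```python
def to_tree(metadata, root_id="ro-crate-metadata.json"):
    """Turn a flattened @graph into a nested tree rooted at root_id."""
    nodes = {node["@id"]: node for node in metadata["@graph"]}
    aliases = {"@id": "id", "@type": "type"}

    def embed(value, seen):
        if isinstance(value, list):
            return [embed(item, seen) for item in value]
        if isinstance(value, dict):
            ref = value.get("@id")
            # A bare reference is replaced by the node it points to,
            # unless that node is already on the current path (cycle).
            if set(value) == {"@id"} and ref in nodes and ref not in seen:
                value = nodes[ref]
                seen = seen | {ref}
            return {aliases.get(key, key): embed(item, seen) for key, item in value.items()}
        return value

    return embed(nodes[root_id], frozenset({root_id}))


flat = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "ro-crate-metadata.json", "@type": "CreativeWork", "about": {"@id": "./"}},
        {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "data.csv"}]},
        {"@id": "data.csv", "@type": "File", "isPartOf": {"@id": "./"}},
    ],
}
tree = to_tree(flat)
```

The resulting `tree` has the metadata descriptor at the top, the root dataset nested under `about`, and the file nested under `hasPart`, which is the shape a tree-oriented JSON Schema can describe.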

Early ELN-Fileformat Schema Draft
{
  "@context": [
    "https://w3id.org/ro/crate/1.1/context",
    {
      "id": "@id",
      "type": "@type"
    }
  ],
  "title": "RO-Crate",
  "type": "object",
  "version": "0.1.1",
  "required": [
    "id",
    "type",
    "conformsTo",
    "sdPublisher"
  ],
  "definitions": {
    "Thing": {
      "type": "object",
      "required": [
        "type"
      ],
      "properties": {
        "type": {
          "type": "string",
          "default": "Thing"
        },
        "id": {
          "type": "string"
        },
        "name": {
          "type": "string"
        },
        "description": {
          "type": "string"
        }
      }
    },
    "Organization": {
      "allOf": [
        {
          "$ref": "#/definitions/Thing"
        }
      ],
      "type": "object",
      "required": [
        "type"
      ],
      "properties": {
        "type": {
          "type": "string",
          "default": "Organization"
        },
        "url": {
          "type": "string",
          "format": "url"
        },
        "areaServed": {
          "type": "string"
        },
        "slogan": {
          "type": "string"
        },
        "logo": {
          "type": "string",
          "format": "url",
          "links": [
            {
              "href": "{{self}}",
              "type": "img/png"
            }
          ]
        },
        "parentOrganization": {
          "$ref": "#/definitions/Organization"
        }
      },
      "options": {
        "display_required_only": true
      }
    },
    "Person": {
      "allOf": [
        {
          "$ref": "#/definitions/Thing"
        }
      ],
      "title": "Person",
      "type": "object",
      "properties": {
        "email": {
          "type": "string",
          "format": "email"
        },
        "familyName": {
          "type": "string"
        },
        "givenName": {
          "type": "string"
        }
      }
    },
    "CreativeWork": {
      "allOf": [
        {
          "$ref": "#/definitions/Thing"
        }
      ],
      "title": "CreativeWork",
      "type": "object",
      "properties": {
        "dateCreated": {
          "type": "string",
          "format": "datetime-local",
          "options": {
            "flatpickr": {}
          }
        },
        "dateModified": {
          "type": "string",
          "format": "datetime-local",
          "options": {
            "flatpickr": {}
          }
        },
        "keywords": {
          "type": "array",
          "items": {
            "type": "string"
          }
        }
      }
    },
    "File": {
      "allOf": [
        {
          "$ref": "#/definitions/CreativeWork"
        }
      ],
      "title": "File",
      "type": "object",
      "id": "file",
      "required": [
        "type",
        "id"
      ],
      "properties": {
        "type": {
          "type": "string",
          "enum": [
            "File"
          ]
        },
        "_id": {
          "type": "string"
        },
        "encodingFormat": {
          "type": "string"
        },
        "id": {
          "type": "string",
          "format": "url",
          "options": {
            "upload": {
              "upload_handler": "testUploadHandler"
            }
          },
          "links": [
            {
              "href": "dummy",
              "rel": "view / download"
            }
          ]
        }
      },
      "not": { "required": [ "hasPart" ] },
      "options": {
        "display_required_only": true
      }
    },
    "Dataset": {
      "allOf": [
        {
          "$ref": "#/definitions/CreativeWork"
        }
      ],
      "title": "Dataset",
      "type": "object",
      "required": [
        "type",
        "id"
      ],
      "properties": {
        "type": {
          "type": "string",
          "enum": [
            "Dataset"
          ]
        },
        "about": {
          "$ref": "#/definitions/Thing"
        },
        "hasPart": {
          "type": "array",
          "format": "tabs",
          "items": {
            "discriminator": {
              "propertyName": "type",
              "mapping": {
                "File": "#/definitions/File",
                "Dataset": "#/definitions/Dataset"
              }
            },
            "oneOf": [
              {
                "$ref": "#/definitions/File"
              },
              {
                "$ref": "#/definitions/Dataset"
              }
            ]
          }
        }
      },
      "options": {
        "display_required_only": true
      }
    }
  },
  "properties": {
    "id": {
      "type": "string",
      "enum": [
        "ro-crate-metadata.json"
      ]
    },
    "type": {
      "type": "string",
      "enum": [
        "CreativeWork"
      ]
    },
    "conformsTo": {
      "type": "string",
      "enum": [
        "https://w3id.org/ro/crate/1.1"
      ]
    },
    "version": {
      "type": "string"
    },
    "sdPublisher": {
      "$ref": "#/definitions/Organization"
    },
    "about": {
      "$ref": "#/definitions/Dataset",
      "properties": {
        "id": {
          "enum": [
            "./"
          ]
        }
      },
      "default": {
        "hasPart": []
      }
    }
  },
  "options": {
    "_display_required_only": true
  }
}

This would give us three outcomes:

  1. programming language agnostic validation by using JSON-LD normalization + JSON-SCHEMA
  2. code generation to automate the parsing and serialization
  3. UI generation for generic viewers / editors

As a demo for 1. and 3.,
ELN Fileformat Playground - PASTA.eln
reveals that Steffen has nested a Dataset in a File, which could at least be discussed.


For 2., throwing the same schema at OO-LD Python playground gives us pydantic dataclasses both for validation and implementation.

Generated Dataclasses
from __future__ import annotations

from enum import Enum
from typing import List, Literal, Optional, Union

from pydantic import BaseModel, EmailStr, Field


class Id(Enum):
    ro_crate_metadata_json = "ro-crate-metadata.json"


class Type(Enum):
    CreativeWork = "CreativeWork"


class ConformsTo(Enum):
    https___w3id_org_ro_crate_1_1 = "https://w3id.org/ro/crate/1.1"


class Thing(BaseModel):
    type: str
    id: Optional[str] = None
    name: Optional[str] = None
    description: Optional[str] = None


class Organization(Thing):
    type: str
    url: Optional[str] = None
    areaServed: Optional[str] = None
    slogan: Optional[str] = None
    logo: Optional[str] = Field(None, links=[{"href": "{{self}}", "type": "img/png"}])
    parentOrganization: Optional[Organization] = None


class Person(Thing):
    email: Optional[EmailStr] = None
    familyName: Optional[str] = None
    givenName: Optional[str] = None


class CreativeWork(Thing):
    dateCreated: Optional[str] = Field(None, options={"flatpickr": {}})
    dateModified: Optional[str] = Field(None, options={"flatpickr": {}})
    keywords: Optional[List[str]] = None


class Type1(Enum):
    File = "File"


class File(CreativeWork):
    type: Literal["File"]
    field_id: Optional[str] = Field(None, alias="_id")
    encodingFormat: Optional[str] = None
    id: str = Field(
        ...,
        links=[{"href": "dummy", "rel": "view / download"}],
        options={"upload": {"upload_handler": "testUploadHandler"}},
    )


class Type2(Enum):
    Dataset = "Dataset"


class ROCrate(BaseModel):
    id: Id
    type: Type
    conformsTo: ConformsTo
    version: Optional[str] = None
    sdPublisher: Organization
    about: Optional[Dataset] = Field(
        default_factory=lambda: Dataset.parse_obj({"hasPart": []})
    )


class Dataset(CreativeWork):
    type: Literal["Dataset"]
    about: Optional[Thing] = None
    hasPart: Optional[List[Union[File, Dataset]]] = Field(None, discriminator="type")
    id: str

While this approach never forbids additional properties, we can easily define in a machine-readable way which properties we expect to be used and how we expect them to be used. We can also define subclasses of Dataset or File with a strict definition of metadata to allow automated processing, e.g. CsvFile with CSVW-conformant column annotations.

What do you think?
