Integration of different data formats #24

Open
fretchen opened this issue Mar 22, 2024 · 8 comments

Comments

@fretchen
Collaborator

A number of different possible data formats exist, and ideally we should find a way to streamline them. One issue with the xlsx format was raised in #21 by @goergen95:

The provisions of the ToR are a burden to technically adept partners. For partners with elaborate GIS systems there are better formats/procedures for the exchange of geospatial information. I would be against KfW making it mandatory for them to deliver their data in sub-optimal and proprietary data formats.

We have started to look into this in #17 and we have first tools for conversion in #18. However, it is not yet clear how we can put all of these ideas together.

@fretchen
Collaborator Author

Also has to go into #10

@fretchen
Collaborator Author

We now have the specifications for the template fixed within the technical notes. To make progress on this we could take these notes and translate them into a JSON Schema. The advantages of JSON Schema over the markdown format:
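To make the proposal concrete, here is a minimal, hypothetical sketch of what a fragment of the template could look like as a JSON Schema, expressed as a Python dictionary and checked with the jsonschema library. The field names (project_id, location_name) and constraints are placeholders for illustration, not the actual specification from the technical notes.

```python
# Hypothetical schema fragment; field names and constraints are
# illustrative assumptions, not the project's real specification.
from jsonschema import validate, ValidationError

schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["project_id", "location_name"],
    "properties": {
        "project_id": {"type": "string", "pattern": "^[A-Z0-9-]+$"},
        "location_name": {"type": "string", "minLength": 1},
    },
}

# A conforming record validates silently ...
record = {"project_id": "KFW-001", "location_name": "Sample site"}
validate(instance=record, schema=schema)

# ... while a non-conforming record raises a ValidationError.
try:
    validate(instance={"project_id": "kfw 001"}, schema=schema)
except ValidationError as err:
    print(err.message)
```

Because the schema is plain data, the same file could serve both as human-readable documentation and as the technical basis for automated validation.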

Alternatives

I currently do not see any good alternatives. Possible options would be:

xlsx

  • Hard to read for machines.
  • Hard to implement complex data structures.
  • Not really an open standard.
  • Not broadly used as a reference.
  • More apt as a technical implementation than as a reference.

markdown

Not machine readable, and hence unable to serve as a technical basis for validation.

uml

  • A fairly abstract format that does not allow for validation.
  • It therefore sits more at the documentation level and allows for less precise implementations.

direct technical implementations

Keeping them compatible requires a common reference/language. This should be digestible across technical implementations. Hence, JSON Schema.

I would propose this as a first step to see where we can go with it. Any comments, @Jo-Schie, @Maja4Dev or @goergen95, before I start simple first attempts in this direction?

@Jo-Schie
Collaborator

Hi Fred. I think this is an awesome idea. From what you listed above, I also think that JSON or even GeoJSON could be a good approach. I will also try to ask the people from the Geonode project for their opinion, and will report back as soon as I have an answer.

@goergen95
Collaborator

Hi @fretchen, I totally agree with your summary here. Please note that translating the specification to JSON Schema can only be a first step. To make something useful with it, I think we would also need to provide some tooling for conversion and validation. As a first step, conversion could go e.g. from Excel/CSV -> JSON/GeoJSON, which could then be validated. There is also the recent fiboa project, which provides prior art for designing an extensible and modularized data specification, including geospatial information, based on JSON Schema.
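The conversion path sketched above (CSV -> GeoJSON -> validation) could look roughly like the following. The column names, the inline sample data, and the tiny schema are assumptions for illustration only, not the project's actual template.

```python
# Hedged sketch of a CSV -> GeoJSON conversion followed by validation.
# Column names ("name", "lon", "lat") and the schema are assumptions.
import csv
import io

from jsonschema import validate

csv_text = "name,lon,lat\nSite A,9.18,48.78\nSite B,13.40,52.52\n"

# Convert each CSV row into a GeoJSON Point feature.
features = []
for row in csv.DictReader(io.StringIO(csv_text)):
    features.append({
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [float(row["lon"]), float(row["lat"])],
        },
        "properties": {"name": row["name"]},
    })

collection = {"type": "FeatureCollection", "features": features}

# Minimal schema for the converted output (illustrative, not the real spec).
schema = {
    "type": "object",
    "required": ["type", "features"],
    "properties": {
        "type": {"const": "FeatureCollection"},
        "features": {"type": "array", "minItems": 1},
    },
}
validate(instance=collection, schema=schema)
print(len(collection["features"]))
```

In practice the validation step would of course use the full specification schema rather than this stub, and the CSV would come from a partner-supplied file instead of an inline string.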

@goergen95
Collaborator

Also, I don't think the title of this issue applies if we are now targeting JSON Schema. JSON Schema specifies what data should look like; it is not data itself. The specification is not the same as an implementation. And, as I said in other comments, I do not think we want to force third parties into concrete implementations. Instead, we should aim to offer a specification, tooling for some conversions, and, most importantly, validation.

@fretchen
Collaborator Author

Sounds good to me and I have nothing to add. When I find time to create a JSON Schema, I will open a separate issue/PR so that we can work our way through the todo list. The way I see the todo list right now:

  • Get started with a machine-readable reference (i.e. the JSON Schema for the moment).
  • Use the reference for validation of certain file formats (xls or csv).
  • Make this validation part of automatic testing!
  • Use the reference for validation of converters, which are always tested as part of PRs.

@fretchen
Collaborator Author

@goergen95 made the following argument for YAML in #76:

As a human, I find JSON hard to write, and most of the time I make a lot of mistakes with the brackets etc. when I try. Since we should aim for everyone to easily participate in the maintenance of the specification, we could think about moving it to YAML and creating the JSON Schema automatically from there?

I personally support the idea in general. However, there are a few technical points to be considered, which might actually create quite some work:

Automatic conversion from YAML to JSON Schema: I personally have not easily found a tool that does this directly.

Introduction of yet another file format: From what I have seen, YAML cannot completely make JSON Schemas obsolete, so we will typically have them in our tool-chain whether YAML is in the picture or not. We really have to see whether the ease of working with YAML outweighs the need for extra conversion tools.

If my understanding is right, we might consider these points after we have decided to merge #76.

@goergen95
Collaborator

In Python, both file formats are internally represented as dictionaries, so you can easily use the JSON Schema vocabulary in YAML and validate with jsonschema. That means you could maintain the schema within this repo in YAML, convert incoming data to a dictionary, and validate on the fly with jsonschema without ever writing the schema to JSON.
However, one might want to serialize to JSON via GitHub Actions to distribute the schema as JSON, which can be achieved via json.dump().

Here is a small example (the incoming data does not have to be in YAML, but it needs to be converted to a dictionary):

import json
import urllib.request as req

import yaml
from jsonschema.validators import Draft202012Validator

schema_yaml = "schema.yaml"
data_yaml = "data.yaml"

# Fetch the schema and an example data file, both written in YAML.
req.urlretrieve("https://raw.githubusercontent.com/mapme-initiative/mapme.pipelines/main/inst/schema.yaml", schema_yaml)
req.urlretrieve("https://raw.githubusercontent.com/mapme-initiative/mapme.pipelines/main/inst/config-example.yaml", data_yaml)

# Load both files into plain Python dictionaries.
with open(schema_yaml, "r") as yaml_file:
    schema = yaml.safe_load(yaml_file)

with open(data_yaml, "r") as yaml_file:
    data = yaml.safe_load(yaml_file)

# Validate the data against the schema; raises ValidationError on failure.
validator = Draft202012Validator(schema)
validator.validate(data)

# Optionally serialize the schema to JSON for distribution.
with open("schema.json", "w") as json_file:
    json.dump(schema, json_file, indent=4)
