SHACL Validation #767

Merged
merged 36 commits into master from 755-kg-validation
Nov 29, 2019

Conversation

Panaetius
Member

Closes #755 Closes #547

Adds SHACL schema for all KG types.

@Panaetius Panaetius requested a review from a team as a code owner October 23, 2019 15:03
@rokroskar
Member

I think this also closes #548?

@rokroskar
Member

If I understand this correctly, I should get a validated graph when I run e.g. renku log --strict --format nt, right? I seem to be getting datasets without the schema:name property without any complaints from validation.

@Panaetius
Member Author

If I understand this correctly, I should get a validated graph when I run e.g. renku log --strict --format nt, right? I seem to be getting datasets without the schema:name property without any complaints from validation.

Well, it's WIP for a reason 😉 I actually haven't tested it at all; I pushed to save work before the weekend. I'm not sure this works without #782.

The way it's supposed to work is that you get an exception if the output doesn't validate.
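That behavior can be sketched as follows. This is a minimal sketch, not renku's actual API: it only assumes a validator with a pyshacl-style (conforms, report) return value, and the names ValidationError and validate_strict are illustrative.

```python
# Minimal sketch of the intended --strict behavior: run validation and
# raise instead of silently returning a report. The validator is passed
# in as a callable returning (conforms, report), mirroring pyshacl's shape.

class ValidationError(Exception):
    """Raised when the generated graph does not conform to the shapes."""


def validate_strict(validate_fn, graph, shapes):
    """Validate the graph and raise on non-conformance."""
    conforms, report = validate_fn(graph, shapes)
    if not conforms:
        raise ValidationError(report)
    return graph
```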

@rokroskar
Member

oh right I forgot about #782. Indeed, the name property does appear with that fix.

@Panaetius
Member Author

I think this also closes #548?

I think #782 closes that one.

@Panaetius Panaetius changed the title [WIP] SHACL Validation SHACL Validation Nov 8, 2019
@rokroskar
Member

Out of WIP, huzzah!

Contributor

@jsam jsam left a comment


Looks great - thanks for the good work!

I'm just wondering: didn't we remove the dependency on click exceptions from the core? Maybe we should create custom exceptions for those raises that don't strictly depend on the core.

@rokroskar
Member

There are still a few issues. For imported datasets there are problems with types. For example, because Zenodo datasets use the schema.org context, their dates have the type schema:Date, which makes our validation fail (try importing a dataset from Zenodo and running log --strict). I can fix this by modifying the dataset shape like so:

{
    "path": "schema:datePublished",
    "or": [
        {
            "nodeKind": "sh:Literal",
            "datatype": {"@id": "schema:Date"}
        },
        {
            "nodeKind": "sh:Literal",
            "datatype": {"@id": "xsd:string"}
        }
    ],
    "maxCount": 1
},

but I'm not sure that's the best way (allowing two different types).
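For illustration, the sh:or alternative above accepts a literal whose datatype is either of the two IRIs; semantically it behaves like a membership test. This is a pure-Python sketch of the constraint's meaning, not a SHACL engine:

```python
# What the sh:or constraint above permits, expressed as a plain membership
# check over full datatype IRIs. Illustrative only; a real SHACL engine
# evaluates this against literals in the data graph.

ALLOWED_DATATYPES = {
    "http://schema.org/Date",
    "http://www.w3.org/2001/XMLSchema#string",
}


def date_published_conforms(datatype_iri):
    """True if a schema:datePublished literal with this datatype would pass."""
    return datatype_iri in ALLOWED_DATATYPES
```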

Another problem is that the translations in the translate method of the JSON-LD class don't seem to be applied. I noticed this because SHACL complains about a Project being foaf:Project and not schema:Project (the translation in the definition of the Project class is supposed to take care of this). This makes it essentially impossible to migrate old projects without rewriting history: if I have a project with the type foaf:Project and I change its metadata file to schema:Project, the graph generation walks the commit history back to the original metadata and still makes the project a foaf:Project, which causes the SHACL validation to fail.

@Panaetius
Member Author

There are still a few issues. For imported datasets there are problems with types. For example, because Zenodo datasets use the schema.org context, their dates have the type schema:Date, which makes our validation fail (try importing a dataset from Zenodo and running log --strict). I can fix this by modifying the dataset shape like so:

{
    "path": "schema:datePublished",
    "or": [
        {
            "nodeKind": "sh:Literal",
            "datatype": {"@id": "schema:Date"}
        },
        {
            "nodeKind": "sh:Literal",
            "datatype": {"@id": "xsd:string"}
        }
    ],
    "maxCount": 1
},

but I'm not sure that's the best way (allowing two different types).

I think ideally we'd migrate it on import to what we want?

Another problem is that the translations in the translate method of the JSON-LD class don't seem to be applied. I noticed this because SHACL complains about a Project being foaf:Project and not schema:Project (the translation in the definition of the Project class is supposed to take care of this). This makes it essentially impossible to migrate old projects without rewriting history: if I have a project with the type foaf:Project and I change its metadata file to schema:Project, the graph generation walks the commit history back to the original metadata and still makes the project a foaf:Project, which causes the SHACL validation to fail.

Right now there is an explicit check that nothing of type foaf:Project exists in the graph. There's not really any need for this check; I added it partly as a PoC and partly as a way to detect old metadata. If this check is removed, something of type foaf:Project just wouldn't get validated. But this might be a bigger issue: the current version of the schema is used to validate everything, while walking the commit history yields the old structure. Ideally, the old graph should be migrated on the fly when walking the commit history, as it makes no sense to use current code in an up-to-date renku repository to parse data from a point in time when that version of the code was not valid.
I don't know about .translate not being called; to my knowledge I didn't touch that part. It might be that SHACL validation just highlights a bug that silently succeeded in the past?
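The on-the-fly migration idea above could be sketched as rewriting legacy type IRIs in each historical JSON-LD node before validating it with the current shapes. Only the foaf:Project to schema:Project mapping comes from this discussion; the helper itself is hypothetical, not renku code:

```python
# Recursively rewrite legacy @type values in a JSON-LD document so that
# history-derived nodes validate against the current shapes. Hypothetical
# sketch; only the foaf:Project -> schema:Project mapping is from the PR.

LEGACY_TYPES = {"foaf:Project": "schema:Project"}


def migrate_types(node):
    """Replace legacy @type values in-place, descending into nested nodes."""
    if isinstance(node, dict):
        if "@type" in node:
            types = node["@type"]
            if isinstance(types, str):
                node["@type"] = LEGACY_TYPES.get(types, types)
            else:
                node["@type"] = [LEGACY_TYPES.get(t, t) for t in types]
        for value in node.values():
            migrate_types(value)
    elif isinstance(node, list):
        for item in node:
            migrate_types(item)
    return node
```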

@rokroskar
Member

I think ideally we'd migrate it on import to what we want?

You mean on import of the dataset? From schema:Date to xsd:string? Though ideally we would probably use schema:Date and not string...

Re: foaf:Project - I think the check is fine. The "on-the-fly" migration is exactly what translate is supposed to do, but maybe something goes wrong there with the compaction.

@Panaetius
Member Author

Panaetius commented Nov 18, 2019

I think ideally we'd migrate it on import to what we want?

You mean on import of the dataset? From schema:Date to xsd:string? Though ideally we would probably use schema:Date and not string...

I mean, ideally it'd be xsd:dateTime (http://schema.org/Date looks quite complex for a simple publishedDate), but otherwise yes. I think right now there are explicit checks somewhere that dates should be strings, as conversion is done in code, so maybe strings for now. See also #764
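If the stored dates were migrated to proper dateTime literals first, the shape could then pin the datatype down to a single alternative. A sketch following the shape fragment earlier in the thread, assuming migrated data:

```json
{
    "path": "schema:datePublished",
    "nodeKind": "sh:Literal",
    "datatype": {"@id": "xsd:dateTime"},
    "maxCount": 1
}
```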

Re: foaf:Project - I think the check is fine. The "on-the-fly" migration is exactly what translate is supposed to do, but maybe something goes wrong there with the compaction.

I looked at the code quickly and don't see a reason why it shouldn't work; this looks like an issue that might take a bit more time to fix.

@rokroskar
Member

OK, so I checked whether the translation works as it should, and it does. However, if the metadata on disk says foaf:Project, the triple also has foaf:Project, although the generated project instance has type schema:Project. I think this means that somewhere a Project is being instantiated directly by reading the YAML instead of going through the from_yaml method. I haven't been able to track this down. The good news is that if I change it on disk it's fine (I don't know what I did before that it didn't work).

@Panaetius
Member Author

An easy fix would be to change the SHACL check to a SPARQL query so that it only flags foaf:Project if schema:Project is not present. Though I think this is a problem that should be looked into and fixed properly.

@rokroskar
Member

I've looked into this a bit further and it's definitely not a problem of this PR. The issue seems to be that since translate is only applied on instantiation of the object, it doesn't work for nested objects. So when Project is a sub-entity of DatasetFile, the translation doesn't get applied. A solution to this is to move the "translation" into the context of Project so that it is applied in all nested contexts. However, this then leads to a problem in the case where there is a single file: for some reason files is no longer a list. If I add "@container": "@list" to the context (as suggested here) it doesn't do what I expected, and files is still just a single value. The transformation can be seen on the JSON-LD playground here.
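Until the compaction issue is resolved, one workaround is a defensive normalization on load that re-wraps a collapsed single value. This is a hypothetical helper, not renku code:

```python
# JSON-LD compaction can collapse a one-element "files" list into a single
# object. This hypothetical helper re-wraps such values so downstream code
# can always iterate, sidestepping the "@container": "@list" issue above.

def as_list(value):
    """Wrap a scalar in a list; pass lists through; treat None as empty."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def normalize_files(compacted):
    """Ensure the compacted document's "files" entry is always a list."""
    compacted["files"] = as_list(compacted.get("files"))
    return compacted
```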

@rokroskar
Member

This LGTM otherwise - should we address the issue with nested contexts/compaction/translation separately?

rokroskar
rokroskar previously approved these changes Nov 27, 2019
Member

@rokroskar rokroskar left a comment


Thanks!!

rokroskar
rokroskar previously approved these changes Nov 29, 2019
@Panaetius
Member Author

Panaetius commented Nov 29, 2019

Previously, based_on was a dict on loading, and the unit tests just compared YAML directly when comparing local to remote repositories, instead of loading the metadata into code and comparing there. This meant no transformations would get applied, and the tests would fail if the remote repo was not up to date with the current state of renku-python. Now based_on is a proper DatasetFile object and the tests compare Dataset/DatasetFile objects.

Old dataset files had invalid ids (not starting with file://); that is now adjusted on import (the scheme is prepended if it's not there). Ideally, ids would get re-generated on the fly if they're invalid, but DatasetFiles in a remote repository don't have client and commit set, so we can't generate ids, and that seems like a relatively heavy fix to do as part of this PR (it's already big enough...). See test_import_from_renku_project() remote._id for an issue where this manifested.
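The scheme adjustment described above amounts to something like this (illustrative sketch; the real fix lives in the dataset import code, and the helper name is made up):

```python
# Prepend the file:// scheme to old DatasetFile ids that lack any URI
# scheme, leaving well-formed ids untouched. Illustrative helper only.
from urllib.parse import urlparse


def fix_dataset_file_id(id_):
    """Prepend file:// if the id has no URI scheme."""
    if not urlparse(id_).scheme:
        return "file://" + id_
    return id_
```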

@rokroskar
Member

I agree -- the id scheme for file entities in general needs an overhaul, see SwissDataScienceCenter/renku#696

@Panaetius Panaetius merged commit 3d1c53e into master Nov 29, 2019
Panaetius added a commit that referenced this pull request Dec 3, 2019
* Adds SHACL validation to renku doctor
* Adds renku log tests
* Adds --strict option to renku log for shape validation
jsam added a commit that referenced this pull request Dec 17, 2019
* SHACL Validation (#767)

* Adds SHACL validation to renku doctor
* Adds renku log tests
* Adds --strict option to renku log for shape validation

* Fix JSON-LD translation and related issues (#846)

* Add default value converter and remove dataset property from DatasetFile

* Removes type from Person context, fixes missing context in jsonld translation of project, fixes url encoding of dataset names

* feat: added renku service with cache and datasets

* fix: addressed review points

* chore: added travis encrypted oauth token

* fix: cleanup of user vs user_id

* chore: method naming convention update

* feat: set custom commit message

* chore: remove defensive scoping of git options

* chore: added IndexError exception sentinel

* chore: added test for renku clone with custom config
@rokroskar rokroskar deleted the 755-kg-validation branch December 18, 2019 10:18
Successfully merging this pull request may close these issues:

Contract tests suite verifying json-ld output of renku log command
Add unit tests for jsonld marshalling