Add validation check to make sure all ontology_purls resolve #96
Comments
Currently handled by this section of the Makefile:
However, http://purl.obolibrary.org/obo/mp.owl redirects to an FTP site, resulting in:
Maybe this lib will fix it; need to investigate.
This might be a problem for some ontologies (one I know of is DINTO) that store their ontology as a compressed file because it is too big for GitHub. However, I do think that PURLs should resolve to parsable OWL (or OBO) files, and if they are too big for GitHub, another repository solution needs to be found.
owl/zip is one option (owlcs/owlapi#375). Using GitHub release management is incredibly convenient but not always appropriate, especially for ontologies that are really databases translated to OWL. For ncbitaxon (raw metadata here: https://raw.githubusercontent.com/OBOFoundry/OBOFoundry.github.io/master/ontology/ncbitaxon.md) we have the following lines in the yaml:
This tells the OBO central build to grab the latest build from here: and place it in the default fall-through for PURLs on servers in Berkeley. We can't commit to doing this for massive databases-as-ontologies, but it is doable for mid-size ontologies like ncbitaxon.
What is the status of this?
The code @cmungall points to checks all
Should this be moved to the OBO-Dashboard repo?
@jamesaoverton What about having a separate GitHub Action that runs once per month to check that all product URLs are healthy, and simply opens an issue if it finds more than 0 PURLs that didn't resolve?
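The suggested monthly action could be sketched roughly as follows. The function names, the User-Agent string, and the issue format are assumptions for illustration; no such script exists in the registry:

```python
# Hypothetical monthly PURL health check; names and headers are assumptions.
import urllib.request
import urllib.error

def check_purl(url, timeout=30):
    """HEAD the URL, following redirects; return (url, ok, detail)."""
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "obo-purl-check/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return url, resp.status < 400, "HTTP %d" % resp.status
    except urllib.error.HTTPError as e:
        return url, False, "HTTP %d" % e.code
    except Exception as e:  # timeouts, DNS failures, unsupported schemes...
        return url, False, repr(e)

def issue_body(failures):
    """Markdown body for the auto-opened issue; None when everything resolved."""
    if not failures:
        return None
    lines = ["The following product URLs failed to resolve:", ""]
    lines += ["- %s (%s)" % (url, detail) for url, _ok, detail in failures]
    return "\n".join(lines)
```

A scheduled workflow would run this over every `ontology_purl` in the registry yaml and open an issue only when `issue_body` returns something.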
@matentzn If this was easy, we would have done this years ago. I think it's surprisingly difficult to get good results without a bunch of false positives that must be manually reviewed.

The easy thing is to request the URL and follow redirects. You don't really want to download the full ontology (or whatever format) files, but not every server supports HTTP HEAD requests properly, and some files are (were?) served from FTP rather than HTTP. You can't necessarily trust the HTTP response codes either -- the PURL system fall-through to http://ontologies.berkeleybop.org/ is happy to return 200 and some weird XML for non-existent files.

So in the end, you either need to handle a bunch of special cases or just bite the bullet and download all the files. That will take a long time and a lot of data transfer. Since the products include more than just RDF/OWL files, if you want to check that the files are intact and valid then you'll have to handle an open-ended list of file types. In any case, you'll have transient network failures, so you'll need to decide on a retry strategy.

It might not be so hard to get to 90%, but I expect the last 10% to be a bunch of work.
Ah yeah, forgot about that berkeleybop issue. I was thinking of literally checking for HTTP codes. OK, in this case, let's leave this open for some future volunteer to deal with. I think checking the main .owl with the dashboard is the most important part, and that issue is solved. This seems lower priority to me.
Every `ontology_purl` in the yaml should resolve, and to something that has valid content, e.g. parseable OWL. Note: this should probably not be a Travis check, as it may be prone to false positives; we need to have no false positives in Travis or it will confuse the pull request workflow.
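One cheap proxy for "resolves to valid content" is checking that the downloaded bytes are well-formed XML rooted at `rdf:RDF`. This is a sketch under that assumption; a real validation step would hand the file to an actual OWL parser (e.g. the OWL API or ROBOT) instead:

```python
# Hypothetical lightweight validity check; a proxy, not a full OWL parse.
import xml.etree.ElementTree as ET

RDF_ROOT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"

def is_parseable_rdfxml(data):
    """Cheap proxy for 'parseable OWL': well-formed XML with an rdf:RDF root.
    Truncated downloads and HTML error pages both fail this test."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return False
    return root.tag == RDF_ROOT
```

Because this never needs to build the full ontology in memory, it stays fast enough to run over the whole registry, at the cost of missing deeper OWL-level errors.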