Add validation check to make sure all ontology_purls resolve #96

Open
cmungall opened this issue Sep 4, 2015 · 9 comments
Labels
ontology metadata Issues related to ontology metadata pipeline Issues related to development or revision of global Foundry pipelines quality checks Issues related to global quality control (i.e., not specific to a particular ontology)

@cmungall
Contributor

cmungall commented Sep 4, 2015

Every ontology_purl in the YAML should resolve, and to something that has valid content, e.g. parseable OWL.

Note: this should probably not be a Travis check, as it may be prone to false positives. We need zero false positives in Travis, or it will confuse the pull request workflow.

@cmungall cmungall added ontology metadata Issues related to ontology metadata pipeline Issues related to development or revision of global Foundry pipelines labels Sep 4, 2015
@cmungall cmungall added this to the ICBO 2016 milestone Sep 4, 2015
@cmungall
Contributor Author

cmungall commented Sep 4, 2015

Currently handled by this section of Makefile:

# Note this should *not* be run as part of general travis jobs, it is expensive
# and may be prone to false positives as it is inherently network-based
#
# TODO: Other non-travis CI job. Nightly?
# TODO: Integrate this with some kind of OCLC query check
#
# See: https://github.com/OBOFoundry/OBOFoundry.github.io/issues/18
valid-purl-report.txt: registry/ontologies.yml
    ./util/processor.py -i $< check-urls > $@.tmp && mv $@.tmp $@

However, http://purl.obolibrary.org/obo/mp.owl redirects to an FTP site, resulting in:

requests.exceptions.InvalidSchema: No connection adapters were found for 'ftp://ftp.informatics.jax.org/pub/reports/mp.owl'

Maybe this library will fix it; this needs investigation:
https://pypi.python.org/pypi/requests-ftp/0.1.3
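
A possible alternative that avoids an extra adapter, as a minimal sketch only: Python's standard `urllib.request` ships an FTP handler, and as far as I can tell its redirect handler accepts `ftp` targets, so it may be able to follow the mp.owl redirect on its own. `fetch_prefix` is a hypothetical helper name and is not part of `util/processor.py`; it reads only the first bytes rather than the whole file.

```python
import urllib.request


def fetch_prefix(url, nbytes=1024, timeout=30):
    """Open url (http, https, ftp or file) and return (final_url, first_bytes).

    Hypothetical helper: only the first nbytes are read, so a large
    ontology is not downloaded in full.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # geturl() reports where we ended up after any redirects
        return resp.geturl(), resp.read(nbytes)
```

Something like `fetch_prefix("http://purl.obolibrary.org/obo/mp.owl")` would then return the final URL and the first kilobyte, assuming the FTP server is reachable from where the check runs.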

@ramonawalls
Contributor

This might be a problem for some ontologies (one I know of is DINTO) that store their ontology as a compressed file because it is too big for GitHub. However, I do indeed think that PURLs should resolve to parsable OWL (or OBO) files, and if they are too big for GitHub, another repository solution needs to be found.

@cmungall
Contributor Author

cmungall commented Sep 5, 2015

OWL/ZIP is one option (owlcs/owlapi#375),
or splitting the ontology (ontodev/robot#39).

Using GitHub release management is incredibly convenient but not always appropriate, especially for ontologies that are really databases translated to OWL.

For ncbitaxon (raw metadata here https://raw.githubusercontent.com/OBOFoundry/OBOFoundry.github.io/master/ontology/ncbitaxon.md ) we have the following lines in the yaml:

build:
  source_url: http://build.berkeleybop.org/job/build-ncbitaxon/lastSuccessfulBuild/artifact/*zip*/archive.zip
  path: archive/src/ontology
  method: archive

This tells the OBO central build to grab the latest build from here:
http://build.berkeleybop.org/job/build-ncbitaxon/

and place it in the default fall-through for PURLs on servers in Berkeley.

We can't commit to doing this for massive databases-as-ontologies, but it is doable for mid-size ones like ncbitaxon.

@selewis selewis added the quality checks Issues related to global quality control (i.e., not specific to a particular ontology) label Oct 5, 2016
@nlharris
Contributor

What is the status of this?

@jamesaoverton
Member

The code @cmungall points to checks all products for each ontology in the registry. The dashboard checks only the main OWL product. So this issue is only partially resolved.

@nlharris
Contributor

Should this be moved to the OBO-Dashboard repo?

@matentzn
Contributor

@jamesaoverton What about having a separate GitHub action that runs once per month to check that all product URLs are healthy, and simply opens an issue if it finds any PURLs that don't resolve?

@jamesaoverton
Member

@matentzn If this was easy, we would have done this years ago. I think it's surprisingly difficult to get good results without a bunch of false positives that must be manually reviewed.

The easy thing is to request the URL and follow redirects. You don't really want to download the full ontology (or whatever format) files, but not every server supports HTTP HEAD requests properly, and some files are (were?) served from FTP rather than HTTP. You can't necessarily trust the HTTP response codes either -- the PURL system fall-through to http://ontologies.berkeleybop.org/ is happy to return 200 and some weird XML for non-existent files. So in the end, you either need to handle a bunch of special cases or just bite the bullet and download all the files. That will take a long time and a lot of data transfer. Since the products include more than just RDF/OWL files, if you want to check that the files are intact and valid then you'll have to handle an open-ended list of file types.
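
To illustrate the point about not trusting response codes, here is a rough sketch that classifies the first bytes of a response body instead of looking at the status. The function name `sniff_format` and the marker strings are assumptions chosen for illustration; they are heuristics, not a real parse, and they cover only a few of the product types in the registry.

```python
# Byte markers that usually appear near the top of each serialization.
# These are heuristic assumptions, not an exhaustive or authoritative list.
TEXT_MARKERS = {
    "rdfxml": b"<rdf:RDF",        # RDF/XML root element
    "owlxml": b"<Ontology",       # OWL/XML root element
    "obo": b"format-version:",    # OBO format header tag
    "ofn": b"Ontology(",          # OWL functional syntax
}


def sniff_format(head):
    """Guess the format from the first bytes of a body, or return None."""
    if head.startswith(b"\x1f\x8b"):
        return "gzip"             # gzip magic number
    if head.startswith(b"PK\x03\x04"):
        return "zip"              # zip local file header
    for name, marker in TEXT_MARKERS.items():
        if marker in head:
            return name
    return None                   # e.g. an XML error page served with a 200
```

A 200 response whose body is some unrelated XML document would come back as `None` here, which is the case a status-code check misses. Compressed products would still need to be unpacked before the text markers could be applied.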

In any case, you'll have transient network failures, so you'll need to decide on a retry strategy.
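
One simple retry strategy, sketched under assumptions: a fixed number of attempts with exponential backoff. `with_retries` and its parameters are hypothetical names, and treating every `OSError` as transient is a simplification that the caller would likely want to narrow.

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0,
                 retry_on=(OSError,), sleep=time.sleep):
    """Call fn() up to `attempts` times, doubling the delay after each failure.

    Exceptions listed in retry_on are treated as transient; the last
    failure is re-raised so the caller can report the URL as broken.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` argument is injectable so the policy can be exercised without actually waiting. Whether three attempts is enough to separate a flaky network from a dead URL is a judgment call.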

It might not be so hard to get to 90%, but I expect the last 10% to be a bunch of work.

@matentzn
Contributor

Ah yeah, forgot about that berkeleybop issue. I was thinking of literally checking for HTTP codes. OK, in this case, let's leave this open for some future volunteer to deal with. I think checking the main .owl with the dashboard is the most important, and that issue is solved. This one seems lower priority to me.
