Add unit tests and continuous integration #114
Comments
Great suggestion!
👍 Lots of datasets in this organization are not quite valid for one reason or another. It would be good to get some validation in place. Going even further: https://github.com/frictionlessdata/ex-continuous-data-integration
The "problem" here is that each repository is responsible for testing whether its own data is valid. You have to visit each repository to see whether it is valid. If the validator code or the datapackage spec changes, you have to run CI in EACH repository, which may take quite a long time. I think we should be able to download a lot of datapackages locally (for example, all of https://github.com/datasets/registry), so a cache mechanism is very important (https://github.com/frictionlessdata/datapackage-py/issues/72), and then run validation against the cached datapackages. We would only have to use CI in ONE repository, which would be responsible for testing whether the datapackages are valid (or not).
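A minimal sketch of the cache idea described above, assuming descriptors are plain JSON files fetched over HTTP. The names `fetch_json` and `cached_descriptor` are hypothetical and are not part of datapackage-py; the on-disk layout (one file per URL, named by the SHA-256 of the URL) is likewise only an assumption for illustration.

```python
import hashlib
import json
import urllib.request
from pathlib import Path


def fetch_json(url):
    """Download and parse a JSON document (default fetcher, needs network)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def cached_descriptor(url, cache_dir, fetch=fetch_json):
    """Return the descriptor at `url`, reusing a local copy when one exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that it maps to a safe, unique file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / (key + ".json")
    if path.exists():
        # Cache hit: no network access needed.
        return json.loads(path.read_text(encoding="utf-8"))
    # Cache miss: fetch once, then store the result for later runs.
    descriptor = fetch(url)
    path.write_text(json.dumps(descriptor), encoding="utf-8")
    return descriptor
```

A single CI job in one repository could then loop over the registry URLs, call `cached_descriptor` for each, and hand the result to the validator, so that a change to the validator or the spec only requires re-running that one job.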
@danfowler I have ideas / plans as to how to do this. However, I want to do this as part of the systematic infrastructure upgrade we are planning here ;-)
Some (quick and dirty) code that might help:

from requests import Session
from unittest import TestCase
import re
import datapackage

# Matches a GitHub repository URL and captures the owner and repository name.
pattern = re.compile(r"https://github\.com/(.*)/(.*)")


def fix_url(url, pattern):
    """Turn a GitHub repository URL into the raw URL of its datapackage.json."""
    m = re.search(pattern, url)
    if m is not None:
        owner, repository = m.groups()
        return "https://raw.githubusercontent.com/%s/%s/master/datapackage.json" % (owner, repository)
    else:
        return url


class TestDatasets(TestCase):
    def setUp(self):
        self.session = Session()

    def test_datasets(self):
        # Validate the registry itself first.
        url_registry = "https://github.com/datasets/registry"
        url_registry = fix_url(url_registry, pattern)
        dp_registry = datapackage.DataPackage(url_registry)
        print(url_registry)
        dp_registry.validate()
        # Then validate every datapackage listed in the registry.
        for resource in dp_registry.resources:
            for data in resource.data:
                url = data["url"]
                url = fix_url(url, pattern)
                # Load the listed datapackage, not the registry again.
                dp = datapackage.DataPackage(url)
                print(url)
                dp.validate()

that can be run using
but there are 2 issues:
@femtotrader that's amazing - thanks!
Hello,
Once datasets/registry is a DataPackage, it would be a good idea to ensure that every URL is available, i.e. that a request to it returns HTTP status code 200.
Such a test could be done using Python and requests (see some sample code in #112),
but a more rigorous approach (perhaps as a second step) would be to ensure that they are "valid" DataPackages.
That would avoid adding bad DataPackage URLs to this repository.
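For illustration, the availability check could look roughly like the sketch below. It is not the sample code from #112: it uses the standard library's `urllib` rather than requests so that it has no dependencies, and the function names `http_status` and `unavailable_urls` are hypothetical. Sending a `HEAD` request is also an assumption; some servers only answer `GET` properly.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for `url`, or None if the request fails."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404.
        return error.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and so on.
        return None


def unavailable_urls(urls, get_status=http_status):
    """Return (url, status) pairs for every URL that does not answer 200."""
    failures = []
    for url in urls:
        status = get_status(url)
        if status != 200:
            failures.append((url, status))
    return failures
```

A test would then simply assert that `unavailable_urls(...)` returns an empty list for the URLs taken from the registry. The `get_status` parameter exists only so that the check can be exercised without network access.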
Kind regards