Add unit tests and continuous integration #114

Open
femtotrader opened this issue Sep 19, 2015 · 6 comments

Comments

@femtotrader

Hello,

Once datasets/registry is a DataPackage, it would be a good idea to check that every URL in it is reachable and that a request to it returns HTTP status code 200.

Such a test could be written in Python with requests (see the sample code in #112).

A more rigorous approach (perhaps as a second step) would be to check that the URLs point to "valid" DataPackages.

That would keep bad DataPackage URLs from being added to this repository.
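
A minimal sketch of the availability check described above, assuming the registry URLs have already been collected into a list. The names `http_status` and `find_unavailable` are hypothetical, and the status lookup is passed in as a function so the logic can be tried without network access.

```python
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for `url`, following redirects."""
    import requests  # imported lazily so find_unavailable can be used offline

    response = requests.get(url, timeout=timeout, allow_redirects=True, stream=True)
    response.close()  # only the status line is needed, not the body
    return response.status_code


def find_unavailable(
    urls: Iterable[str],
    get_status: Callable[[str], int] = http_status,
) -> Dict[str, object]:
    """Map every URL that does not answer with HTTP 200 to its status or error."""
    failures: Dict[str, object] = {}
    for url in urls:
        try:
            status = get_status(url)
        except Exception as exc:  # connection errors, timeouts, malformed URLs
            failures[url] = str(exc)
            continue
        if status != 200:
            failures[url] = status
    return failures
```

In a CI job, an empty result from `find_unavailable(urls)` would mean every URL answered with 200; anything else could be printed and used to fail the build.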

Kind regards

@rufuspollock
Member

Great suggestion!

@danfowler

👍

Lots of datasets in this organization are not quite valid for one reason or another. It would be good to get some validation in place.
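
As a rough illustration of what a first validation pass might look like, here is a structural check on a descriptor that has already been parsed. It assumes a descriptor needs a non-empty `resources` list and that each resource points at its data through `path`, `url` or `data`. The function name `descriptor_problems` is hypothetical, and this is not a substitute for a real validator.

```python
from typing import List


def descriptor_problems(descriptor: dict) -> List[str]:
    """Return a list of structural problems found in a parsed datapackage.json."""
    resources = descriptor.get("resources")
    if not isinstance(resources, list) or not resources:
        return ["'resources' must be a non-empty list"]

    problems = []
    for index, resource in enumerate(resources):
        if not isinstance(resource, dict):
            problems.append("resource %d is not an object" % index)
        # Assumption: a resource must locate its data via one of these keys.
        elif not any(key in resource for key in ("path", "url", "data")):
            problems.append("resource %d has no 'path', 'url' or 'data'" % index)
    return problems
```

An empty list means the descriptor passed this shallow check; it says nothing about whether the data files themselves match their schema.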

Going even further: https://github.com/frictionlessdata/ex-continuous-data-integration

@femtotrader
Author

femtotrader commented Aug 29, 2016

The "problem" here is that each repository is responsible for testing whether its own data is valid.

You have to visit each repository to see whether it is valid.

If the validator code or the datapackage spec changes, you have to run CI in EACH repository, which may take quite a long time.

I think we should be able to download many datapackages locally (for example, all of https://github.com/datasets/registry), so a cache mechanism is very important (see https://github.com/frictionlessdata/datapackage-py/issues/72), and then run validation against the cached datapackages.

We would then only need CI in ONE repository, which would be responsible for testing whether the datapackages are valid.
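
A rough sketch of the kind of cache this would need, assuming descriptors are stored as JSON files keyed by a hash of their URL. `cache_path`, `load_descriptor` and the `download` callable are hypothetical names for illustration; this is not the mechanism discussed in the datapackage-py issue.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_path(url: str, cache_dir: Path) -> Path:
    """Return the file in which the descriptor for `url` is cached."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / (digest + ".json")


def load_descriptor(url: str, cache_dir: Path, download: Callable[[str], str]) -> dict:
    """Return the parsed descriptor for `url`, downloading it only on a cache miss.

    `download` takes a URL and returns the descriptor as JSON text.
    """
    path = cache_path(url, cache_dir)
    if not path.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_text(download(url), encoding="utf-8")
    return json.loads(path.read_text(encoding="utf-8"))
```

With something like this, the single CI repository could fill the cache once and re-run validation against the cached descriptors whenever the validator or the spec changes. Cache invalidation (for example, by age) is left out of the sketch.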

@rufuspollock
Member

@danfowler I have ideas / plans as to how to do this. However, I want to do this as part of the systematic infrastructure upgrade we are planning here ;-)

@femtotrader
Author

femtotrader commented Aug 29, 2016

Some (quick and dirty) code that might help.

import re
from unittest import TestCase

import datapackage
from requests import Session

# Matches a GitHub repository URL and captures the owner and repository name.
pattern = re.compile(r"https://github\.com/([^/]+)/([^/]+)")


def fix_url(url, pattern):
    """Map a GitHub repository URL to the raw URL of its datapackage.json."""
    m = re.search(pattern, url)
    if m is not None:
        owner, repository = m.groups()
        return "https://raw.githubusercontent.com/%s/%s/master/datapackage.json" % (owner, repository)
    else:
        # Not a GitHub repository URL: leave it unchanged.
        return url


class TestDatasets(TestCase):
    def setUp(self):
        self.session = Session()

    def test_datasets(self):
        # Load and validate the registry's own descriptor first.
        url_registry = "https://github.com/datasets/registry"
        url_registry = fix_url(url_registry, pattern)
        dp_registry = datapackage.DataPackage(url_registry)
        print(url_registry)
        dp_registry.validate()

        # Then validate every datapackage listed in the registry.
        for resource in dp_registry.resources:
            for data in resource.data:
                url = data["url"]
                url = fix_url(url, pattern)
                dp = datapackage.DataPackage(url)
                print(url)
                dp.validate()

that can be run using

$ nosetests -s -v tests/test_dp.py

but there are 2 issues:

@rufuspollock
Member

@femtotrader that's amazing - thanks!
