Merge pull request #17 from cal-itp/feat-14-version1
Feat 14 version1
chriscauley authored Feb 17, 2022
2 parents 23436b2 + 8a317d6 commit 9358880
Showing 12 changed files with 180 additions and 126 deletions.
48 changes: 48 additions & 0 deletions .github/workflows/main.yml
@@ -0,0 +1,48 @@
name: CI

on:
  push:
  release:
    types: [ published ]

jobs:
  checks:
    name: "Run Tests"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Set up Pre-commit
        uses: pre-commit/[email protected]

  release:
    name: "Release to PyPI"
    runs-on: ubuntu-latest
    needs: checks
    if: "github.event_name == 'release' && startsWith(github.event.release.tag_name, 'v')"
    steps:
      - uses: actions/checkout@v2
      - name: "Set up Python"
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: "Build package"
        run: |
          python setup.py build sdist
      - name: "TEST Upload to PyPI"
        uses: pypa/gh-action-pypi-publish@release/v1
        if: github.event.release.prerelease
        with:
          user: __token__
          password: ${{ secrets.PYPI_TEST_API_TOKEN }}
          repository_url: https://test.pypi.org/legacy/

      - name: "Upload to PyPI"
        uses: pypa/gh-action-pypi-publish@release/v1
        if: "!github.event.release.prerelease"
        with:
          user: __token__
          password: ${{ secrets.PYPI_API_TOKEN }}
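In plain terms: the workflow runs the pre-commit checks on every push, and the release job only fires for published GitHub releases whose tag starts with `v`. Prereleases are uploaded to TestPyPI using the `PYPI_TEST_API_TOKEN` secret, while full releases go to PyPI using `PYPI_API_TOKEN`, so a release can be rehearsed end to end before publishing for real.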
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -5,6 +5,7 @@ repos:
      - id: flake8
        types:
          - python
        args: ["--max-line-length=88"]
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
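The new `--max-line-length=88` argument raises flake8's default 79-character limit to 88, which is Black's default line length.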
63 changes: 33 additions & 30 deletions README.md
@@ -1,51 +1,54 @@
# Feed Checker
# GTFS Aggregator Checker

This repo verifies that a given list of feeds is listed in feed aggregators.
Currently it checks transit.land and transitfeeds.com.

## Installation

## Requirements
```
pip install gtfs-aggregator-checker
```

* `.env` - Acquire an [api key from transitland][1] and save it to a `.env` file
like `TRANSITLAND_API_KEY=SECRET`. Alternatively you can prefix commands with
the api key like `TRANSITLAND_API_KEY=SECRET python feed_checker.py [...]`.
## Configure

* `agencies.yml` - This file can have any structure as the feed checker just
looks for any urls (strings starting with `'http://'`), but the intended usage
is a [Cal-ITP agencies.yml file][2]. (to run the program without an
`agencies.yml` file, see the "Options" section below)
The following env variables can be set in a `.env` file, set in the environment,
or passed inline like `TRANSITLAND_API_KEY=SECRET python -m gtfs_aggregator_checker`.

## Getting Started
* `TRANSITLAND_API_KEY` An [api key from transitland][1].

To install requirements and check urls, run the following. The first time you
run this it will take a while since the cache is empty.
* `GTFS_CACHE_DIR` Folder to save cached files to. Defaults to
`~/.cache/gtfs-aggregator-checker`

``` bash
pip install -r requirements.txt
python feed_checker.py
```
## Getting Started

The final line of stdout will tell how many urls were in `agencies.yml` and how
many of those were matched in a feed. Above that it will list the domains for
each url (in alphabetical order) as well as group paths based on whether the
path was matched (in both `agencies.yml` and aggregator), missing (in
`agencies.yml` but not the aggregator) or unused (in the aggregator but not in
`agencies.yml`). An ideal outcome would mean the missing column is empty for
all domains.
## CLI Usage

`python -m gtfs_aggregator_checker [YAML_FILE] [OPTIONS]`

## CLI Usage
`python -m gtfs_aggregator_checker` or `python -m gtfs_aggregator_checker
/path/to/yml` will search a [Cal-ITP agencies.yml file][2] for any urls and see
if they are present in any of the feed aggregators. Alternatively you can use a
`--csv-file` or `--url` instead of an `agencies.yml` file.

`python feed_checker.py` or `python feed_checker.py /path/to/yml` will search a
[Cal-ITP agencies.yml file][2] for any urls and see if they are present in any
of the feed aggregators.
The final line of stdout will tell how many urls were in `agencies.yml` and how
many of those were matched in a feed.

### Options
* `python feed_checker.py --help` print the help
* `--csv-file agencies.csv` load a csv instead of a Cal-ITP agencies yaml file (one url per line)
* `--url http://example.com` Check a single url instead of a Cal-ITP agencies yaml file
* `--verbose` Print a table of all results (organized by domain)
* `python -m gtfs_aggregator_checker --help` print the help
* `--csv-file agencies.csv` load a csv instead of a Cal-ITP agencies yaml file
(one url per line)
* `--url http://example.com` Check a single url instead of a Cal-ITP agencies
yaml file
* `--output /path/to/file.json` Save the results as a json file

[1]: https://www.transit.land/documentation/index#signing-up-for-an-api-key
[2]: https://github.com/cal-itp/data-infra/blob/main/airflow/data/agencies.yml

## Development

Clone this repo and `pip install -e /path/to/feed-checker` to develop locally.

By default, downloaded files (raw html files, api requests) will be saved to
`~/.cache/gtfs-aggregator-checker`. This greatly reduces the time required to
run the script. Delete this folder to reset the cache.
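To make the configuration section above concrete, a minimal `.env` file might look like the following sketch; the key is a placeholder, and `GTFS_CACHE_DIR` can be omitted to use the default cache location:

```
TRANSITLAND_API_KEY=your-transitland-api-key
GTFS_CACHE_DIR=/tmp/gtfs-aggregator-cache
```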
78 changes: 0 additions & 78 deletions cache.py

This file was deleted.

20 changes: 5 additions & 15 deletions feed_checker.py → gtfs_aggregator_checker/__init__.py
@@ -1,15 +1,15 @@
from collections import OrderedDict
import json
import typer
import urllib.error
import urllib.parse
import urllib.request
import yaml

from transitland import get_transitland_urls
from transitfeeds import get_transitfeeds_urls
from .transitland import get_transitland_urls
from .transitfeeds import get_transitfeeds_urls


__version__ = "1.0.0"
SECRET_PARAMS = ["api_key", "token", "apiKey", "key"]


@@ -26,13 +26,7 @@ def clean_url(url):
    return urllib.parse.urlunparse(url)


def main(
    yml_file=typer.Argument("agencies.yml", help="A yml file containing urls"),
    csv_file=typer.Option(None, help="A csv file (one url per line)"),
    url=typer.Option(None, help="URL to check instead of a file"),
    output=typer.Option(None, help="Path to a file to save output to."),
    verbose: bool = typer.Option(False, help="Print a result table to stdout"),
):
def check_feeds(yml_file=None, csv_file=None, url=None, output=None):
    results = {}

    if url:
@@ -96,7 +90,7 @@ def main(
if "present" not in statuses:
missing.append(url)

if missing and verbose:
if missing:
print(f"Unable to find {len(missing)}/{len(results)} urls:")
for url in missing:
print(url)
@@ -108,7 +102,3 @@ def main(
        with open(output, "w") as f:
            f.write(json.dumps(results, indent=4))
        print(f"Results saved to {output}")


if __name__ == "__main__":
    typer.run(main)
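With `main` reduced to a thin CLI wrapper, the checker can also be driven from Python. A minimal sketch, assuming the package is installed and `TRANSITLAND_API_KEY` is set in the environment (the keyword arguments mirror the `check_feeds` signature above; the url is a placeholder):

```python
from gtfs_aggregator_checker import check_feeds

# Check a single url and save the results as json, equivalent to
# `python -m gtfs_aggregator_checker --url http://example.com --output results.json`.
check_feeds(url="http://example.com", output="results.json")
```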
15 changes: 15 additions & 0 deletions gtfs_aggregator_checker/__main__.py
@@ -0,0 +1,15 @@
import typer

from . import check_feeds


def main(
    yml_file=typer.Argument("agencies.yml", help="A yml file containing urls"),
    csv_file=typer.Option(None, help="A csv file (one url per line)"),
    url=typer.Option(None, help="URL to check instead of a file"),
    output=typer.Option(None, help="Path to a file to save output to."),
):
    check_feeds(yml_file=yml_file, csv_file=csv_file, url=url, output=output)


typer.run(main)
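Because `typer.run(main)` sits at module level, `python -m gtfs_aggregator_checker` invokes the CLI as soon as `__main__.py` is executed; no `if __name__ == "__main__":` guard is needed, since this module is normally only run as the package entry point.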
44 changes: 44 additions & 0 deletions gtfs_aggregator_checker/cache.py
@@ -0,0 +1,44 @@
import os
from pathlib import Path
import urllib.error
import urllib.request

from .utils import url_split


def get_cache_dir():
    if "GTFS_CACHE_DIR" in os.environ:
        path = Path(os.environ["GTFS_CACHE_DIR"])
    else:
        path = Path.home() / ".cache/gtfs-aggregator-checker"
    path.mkdir(exist_ok=True, parents=True)
    return path


def get_cached(key, func, directory=None):
    if not directory:
        directory = get_cache_dir()
    path = directory / key
    if not path.exists():
        content = func()
        with open(path, "w") as f:
            f.write(content)
    with open(path, "r") as f:
        return f.read()


def curl_cached(url, key=None):
    domain, path = url_split(url)
    if key is None:
        key = path.replace("/", "__")
    if len(key) > 255:
        key = key[:255]  # max filename length is 255

    def get():
        req = urllib.request.Request(url)
        r = urllib.request.urlopen(req)
        return r.read().decode()

    path = get_cache_dir() / domain
    path.mkdir(exist_ok=True, parents=True)
    return get_cached(key, get, directory=path)
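A quick sketch of how the new cache module behaves (the url is a placeholder): the first call downloads the page and writes it beneath the cache directory, keyed by domain and path, and any repeat call reads the saved file back instead of hitting the network.

```python
from gtfs_aggregator_checker.cache import curl_cached

# First call fetches the url and caches the body on disk;
# the second call returns the cached copy without a network request.
body = curl_cached("http://example.com/gtfs/feeds")
body_again = curl_cached("http://example.com/gtfs/feeds")
assert body == body_again
```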
config.py → gtfs_aggregator_checker/config.py (file renamed without changes)
2 changes: 1 addition & 1 deletion transitfeeds.py → gtfs_aggregator_checker/transitfeeds.py
@@ -1,7 +1,7 @@
from bs4 import BeautifulSoup
from urllib.error import HTTPError

from cache import curl_cached
from .cache import curl_cached

LOCATION = "67-california-usa"
ROOT = "https://transitfeeds.com"
4 changes: 2 additions & 2 deletions transitland.py → gtfs_aggregator_checker/transitland.py
@@ -1,7 +1,7 @@
import json

from config import env
from cache import curl_cached
from .config import env
from .cache import curl_cached

API_KEY = env["TRANSITLAND_API_KEY"]
BASE_URL = f"https://transit.land/api/v2/rest/feeds?apikey={API_KEY}"
utils.py → gtfs_aggregator_checker/utils.py (file renamed without changes)
31 changes: 31 additions & 0 deletions setup.py
@@ -0,0 +1,31 @@
#!/usr/bin/env python

import re
from setuptools import setup, find_namespace_packages

_version_re = re.compile(r"__version__\s+=\s+(.*)")

with open("gtfs_aggregator_checker/__init__.py", "r") as f:
    version = _version_re.search(f.read()).group(1).strip("'\"")

with open("README.md", "r") as f:
    long_description = f.read()

setup(
    name="gtfs_aggregator_checker",
    version=version,
    packages=find_namespace_packages(),
    install_requires=[
        "beautifulsoup4",
        "python-dotenv",
        "PyYAML",
        "requests",
        "typer",
    ],
    description="Tool for checking if transit urls are on aggregator websites",
    long_description=long_description,
    long_description_content_type="text/markdown",
    author="",
    author_email="",
    url="https://github.com/cal-itp/gtfs-aggregator-checker",
)
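The version-sniffing regex at the top of `setup.py` is worth a quick illustration: it extracts the version from the package source without importing it. The sample line below matches the `__version__ = "1.0.0"` assignment in `__init__.py`:

```python
import re

_version_re = re.compile(r"__version__\s+=\s+(.*)")

sample = '__version__ = "1.0.0"'
version = _version_re.search(sample).group(1).strip("'\"")
print(version)  # prints: 1.0.0
```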
