15 gtfs validation pipeline #20

Merged
merged 98 commits into from
Aug 16, 2023
Commits
7b138f1
chore: Add ipykernel to deps
Jun 21, 2023
6b29978
chore: Add reqs for scraping GTFS lookup table
Jun 28, 2023
7eaffe9
feat: Func to ingest GTFS route_type table from urls
Jun 28, 2023
5bf382f
refactor: Dynamic column name lookup
Jun 28, 2023
ac0ae02
refactor: Reduce num of variables
Jun 28, 2023
2a4a433
feat: Defensive checking & refactoring for complexity hook
Jun 28, 2023
f7c2844
docs: Add flag guidance to test module
Jun 29, 2023
e8f5607
docs: Typos
Jun 29, 2023
5acc6e4
chore: Test fixture for route_type lookup
Jun 29, 2023
a29ea96
chore: Adjust pickle ignore rules for single named exception
Jun 29, 2023
bcfd84c
chore: Setup runinteg pytest flag
Jun 29, 2023
f91f508
test: Integration test check scraped html table is stable
Jun 29, 2023
ac0c179
test: Defensive checks raise as expected
Jun 29, 2023
ca73900
chore: Add test flag, Sergio suggested fix
Jun 29, 2023
3652ae7
Merge branch 'dev' into 15-gtfs-validation-pipeline
Jun 29, 2023
ddfe587
refactor: Helper function easier to mock requests
Jun 29, 2023
62d2925
chore: Add mocking dependencies
Jun 29, 2023
fbf32d8
test: Check expected lookup is returned
Jun 29, 2023
a63a51c
refactor: Use class to handle mocked return values
Jun 29, 2023
c50c266
fix: Catch the final gtfs description as not succeeded by br tag
Jun 29, 2023
c168573
fix: Update response value for new func logic
Jun 29, 2023
ab205bf
test: Table scraped correctly with extended schema format
Jun 29, 2023
40186dc
fix: Update test fixture lookup with corrected format
Jun 29, 2023
c20a94a
chore: Delete placeholder script
Jun 29, 2023
2ed8ec3
chore: Add toml to deps
Jul 3, 2023
75f8c77
feat: First iteration of validation class
Jul 3, 2023
953e1b2
style: Consistent use of speechmarks
Jul 3, 2023
26dc30c
feat: Pipeline to check, clean, visualise GTFS
Jul 3, 2023
3d443ec
feat: Helper print statements in pipeline
Jul 3, 2023
1650c90
test: Test class init defensive behaviour
Jul 3, 2023
dfc3b21
feat: Class defence implementation
Jul 3, 2023
1c06fd0
test: Assertions about feed instantiated attribute
Jul 3, 2023
5a56e6f
test: Passes on conversion from UK meters to m
Jul 3, 2023
8555382
test: Check dimensions of validity_df attribute
Jul 3, 2023
37cb97b
feat: Defensive checks for print_alerts method
Jul 3, 2023
ad778bb
test: Defensive behaviour of print_alerts method
Jul 3, 2023
a39fc34
fix: Use hasattr to check for attribute existence
Jul 3, 2023
a987d72
test: print_alert prints single error without truncation
Jul 3, 2023
3568572
test: print_alerts prints multiple alerts without truncation
Jul 3, 2023
e895a95
refactor: Tests use function scope fixture
Jul 3, 2023
e2b031f
fix: Filter pkg_resources warnings
Jul 4, 2023
61f9afa
refactor: Rm unused **kwargs
Jul 4, 2023
7fbcd49
test: Defensive behaviour for viz_stops
Jul 4, 2023
04dd63f
refactor: Helper function checks path-likes
Jul 4, 2023
15c8ef1
viz_stops defense implementation
Jul 4, 2023
fabbbbe
refactor: Unified save statement
Jul 4, 2023
c5fd6c1
test: viz_stops behaviour when mapping points
Jul 4, 2023
e593b3d
test: Fileops on viz_plot when plot convex hull
Jul 4, 2023
9d57d9f
test: Internal helper renders text as required
Jul 4, 2023
42f7180
refactor: Minor text formatting
Jul 4, 2023
7320cee
test: Expected table format for get_route_modes
Jul 4, 2023
3778229
feat: Defensive checks for summarise_weekday
Jul 4, 2023
e773ca4
test: Defensive behaviour for summarise_weekday
Jul 4, 2023
a2278a3
feat: Defensive implementation for summarise_weekday
Jul 4, 2023
6484092
refactor: Move integration test into more appropriate test module
Jul 4, 2023
de72c6b
test: Expensive test for test_summarise_weekday_defence with pytest flag
Jul 4, 2023
fc0dfc6
test: Check mockers called as expected
Jul 5, 2023
102f812
refactor: Mock scrape call when test get_route_modes
Jul 5, 2023
a4ddb9e
feat: Pipeline optionally prints performance profiling
Jul 5, 2023
1e83b01
refactor: Move utility defensive checkers to dedicated module
Jul 6, 2023
10bacf6
feat: Added optional performance testing to pipeline
Jul 6, 2023
dbf76cc
refactor: Move more utilities to defence module
Jul 6, 2023
7fc2cd9
test: Func returns None on pass
Jul 7, 2023
9d9881e
test: bbox_filter behaviour on pass
Jul 7, 2023
184ccde
feat: Func filters GTFS based on bbox
Jul 7, 2023
b0aa256
feat: Class now prints calendar date range
Jul 7, 2023
62b60ab
feat: Defence check list & elements optionally
Jul 7, 2023
f00dabb
feat: Pipeline checks GTFS archive exists & extracts available dates …
Jul 7, 2023
50a8987
refactor: Minor string styling in STOP_HULL_PTH
Jul 7, 2023
703140c
test: Defensive behaviour for utility func
Jul 10, 2023
de17234
refactor: Avoid complexity hook
Jul 10, 2023
85c771f
test: Assert refcatored message calls
Jul 10, 2023
67f1b4e
feat: Handle attribute error
Jul 10, 2023
c2554fd
test: Exception handle when no alerts exist
Jul 10, 2023
691d88d
test: Raises error on missing parent dir when create=False
Jul 10, 2023
b2ac376
test: Simulated condition where shape_id missing from shapes.txt
Jul 10, 2023
5341729
test: viz_stops handles cases with missing stops_id
Jul 10, 2023
212be70
refactor: Add assertion fail messages to all tests
Jul 10, 2023
cae3757
refactor: Add print statements to assertion fails
Jul 10, 2023
f9308cf
chore: Add more assertion fail print statements
Jul 10, 2023
616c3e7
refactor: Test expected columns
Jul 10, 2023
c17668e
chore: Add assertion failure message
Jul 10, 2023
23467cd
chore: Add pyrojroot to deps
Jul 10, 2023
b2fff03
refactor: Run pytest with no runsetup check
Jul 10, 2023
e52a6c4
chore: Try fixing np version
Jul 10, 2023
bb886d5
refactor: Class name syntax
Jul 11, 2023
8b809e8
chore: Merge dev breaking changes & update import statements
Jul 19, 2023
364cd03
Refactored daily summaries (#35)
CBROWN-ONS Aug 3, 2023
7a82780
refactor: Use gtfskit get_dates
Aug 7, 2023
dce4bc9
refactor: Rm shallow wrapper around gtfs_kit.get_dates
Aug 7, 2023
3d04cda
docs: Update mthd return type
Aug 7, 2023
57ee8c7
chore: Bring up to date with dev branch
Aug 7, 2023
4ee328d
validation that checks are not needed for non matching GTFS IDs
CBROWN-ONS Aug 7, 2023
362219d
refactor: Condition check compatible with numpy >=1.25.0
Aug 7, 2023
52b683e
refactor: Test returns min and max columns instead of amin & amax in …
Aug 7, 2023
32be07c
chore: Pin minimum version of numpy to 1.25.0
Aug 7, 2023
a37df94
fix: Typo in reqs
Aug 7, 2023
053fcde
merged with dev; resolved requirements conflict
ethan-moss Aug 16, 2023
5 changes: 3 additions & 2 deletions .gitignore
@@ -2,11 +2,13 @@
 **/*.pbf
 **/*.mapdb
 **/*.mapdb.p
-# moved zip blanket rule above specific exception for test fixture below
+# moved blanket rules above specific exceptions for test fixtures
 *.zip
+*.pkl
 # except test fixtures
 !tests/data/newport-2023-06-13.osm.pbf
 !tests/data/newport-20230613_gtfs.zip
+!tests/data/gtfs/route_lookup.pkl

 ### Project structure ###
 data/*
@@ -36,7 +38,6 @@ outputs/*
 *.html
 *.pdf
 *.csv
-*.pkl
 *.rds
 *.rda
 *.parquet
36 changes: 35 additions & 1 deletion conftest.py
@@ -16,14 +16,32 @@ def pytest_addoption(parser):
         default=False,
         help="run set-up tests",
     )
+    parser.addoption(
+        "--runinteg",
+        action="store_true",
+        default=False,
+        help="run integration tests",
+    )
+    parser.addoption(
+        "--runexpensive",
+        action="store_true",
+        default=False,
+        help="run expensive tests",
+    )


 def pytest_configure(config):
     """Add ini value line."""
     config.addinivalue_line("markers", "setup: mark test to run during setup")
+    config.addinivalue_line(
+        "markers", "runinteg: mark test to run for integration tests"
+    )
+    config.addinivalue_line(
+        "markers", "runexpensive: mark test to run expensive tests"
+    )


-def pytest_collection_modifyitems(config, items):
+def pytest_collection_modifyitems(config, items):  # noqa C901
     """Handle switching based on cli args."""
     if config.getoption("--runsetup"):
         # --runsetup given in cli: do not skip slow tests
@@ -32,3 +50,19 @@ def pytest_collection_modifyitems(config, items):
     for item in items:
         if "setup" in item.keywords:
             item.add_marker(skip_setup)
+
+    if config.getoption("--runinteg"):
+        return
+    skip_runinteg = pytest.mark.skip(reason="need --runinteg option to run")
+    for item in items:
+        if "runinteg" in item.keywords:
+            item.add_marker(skip_runinteg)
+
+    if config.getoption("--runexpensive"):
+        return
+    skip_runexpensive = pytest.mark.skip(
+        reason="need --runexpensive option to run"
+    )
+    for item in items:
+        if "runexpensive" in item.keywords:
+            item.add_marker(skip_runexpensive)
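
Review note: in the hook above, the `--runinteg` branch returns from the function, so when that flag is passed the `--runexpensive` check below it is never reached and tests marked `runexpensive` are not skipped. A minimal sketch of one way to keep the flags independent is below. `OPTIONAL_MARKERS` and `markers_to_skip` are hypothetical names that are not part of this PR; the flag and marker names are the ones registered in the diff.

```python
"""Sketch: skip opt-in test groups independently of one another."""

# Hypothetical mapping of CLI flag -> marker name, using the flags in this PR.
OPTIONAL_MARKERS = {
    "--runsetup": "setup",
    "--runinteg": "runinteg",
    "--runexpensive": "runexpensive",
}


def markers_to_skip(enabled):
    """Return {marker: flag} for every group whose flag was not passed.

    `enabled` maps each flag to the boolean that config.getoption returns.
    """
    return {
        marker: flag
        for flag, marker in OPTIONAL_MARKERS.items()
        if not enabled.get(flag, False)
    }


def pytest_collection_modifyitems(config, items):
    """Add a skip marker per disabled group, with no early return."""
    import pytest  # local import so the helper above is usable on its own

    enabled = {flag: config.getoption(flag) for flag in OPTIONAL_MARKERS}
    for marker, flag in markers_to_skip(enabled).items():
        skip = pytest.mark.skip(reason=f"need {flag} option to run")
        for item in items:
            if marker in item.keywords:
                item.add_marker(skip)
```

With this shape, a test decorated with `@pytest.mark.runexpensive` stays skipped under `pytest --runinteg` and runs only when `--runexpensive` is also passed. It would also keep the function simple enough that the complexity suppression may no longer be needed, though that depends on the hook's configured threshold.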
93 changes: 93 additions & 0 deletions notebooks/gtfs/check_unmatched_id_warnings.py
@@ -0,0 +1,93 @@
"""Validation of invalid IDs whilst joining GTFS sub-tables."""

# %%
# imports
import gtfs_kit as gk
from pyprojroot import here
import pandas as pd
import numpy as np

# %%
# initialise my feed from GTFS test data
feed = gk.read_feed(
    here("tests/data/newport-20230613_gtfs.zip"), dist_units="m"
)
feed.validate()

# %%
# calendar test
feed.calendar = pd.concat(
    [
        feed.calendar,
        pd.DataFrame(
            {
                "service_id": [101],
                "monday": [0],
                "tuesday": [0],
                "wednesday": [0],
                "thursday": [0],
                "friday": [0],
                "saturday": [0],
                "sunday": [0],
                "start_date": ["20200104"],
                "end_date": ["20230301"],
            }
        ),
    ],
    axis=0,
)

feed.validate()

# %%
# trips test
feed.trips = pd.concat(
    [
        feed.trips,
        pd.DataFrame(
            {
                "service_id": [101],
                "route_id": [20304],
                "trip_id": ["VJbedb4cfd0673348e017d42435abbdff3ddacbf89"],
                "trip_headsign": ["Newport"],
                "block_id": [np.nan],
                "shape_id": ["RPSPc4c99ac6aff7e4648cbbef785f88427a48efa80f"],
                "wheelchair_accessible": [0],
                "trip_direction_name": [np.nan],
                "vehicle_journey_code": ["VJ109"],
            }
        ),
    ],
    axis=0,
)

feed.validate()

# %%
# routes test
feed.routes = pd.concat(
    [
        feed.routes,
        pd.DataFrame(
            {
                "service_id": [101],
                "route_id": [20304],
                "agency_id": ["OL5060"],
                "route_short_name": ["X145"],
                "route_long_name": [np.nan],
                "route_type": [200],
            }
        ),
    ],
    axis=0,
)

feed.validate()

# OUTCOME
# It appears that 'errors' are recognised when there is an attempt to validate
# the gtfs data using the pre-built gtfs_kit functions.
# This suggests that if the GTFS data is flawed, it will be identified within
# the pipeline and therefore the user will be made aware. It is also flagged
# as an error which means that 'the GTFS is violated'
# (https://mrcagney.github.io/gtfs_kit_docs/).
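
Review note: the notebook relies on gtfs_kit to surface unmatched IDs. For reference, the underlying idea can be sketched with the standard library alone: collect the `service_id` values defined in calendar.txt (and calendar_dates.txt, if present) and report any `service_id` used in trips.txt that is not among them. The names `_column` and `unmatched_service_ids` are hypothetical, and this illustrates the check in general terms; it is not a description of how gtfs_kit implements its validation.

```python
import csv
import io
import zipfile


def _column(archive, member, column):
    """Return the set of non-empty values in one column of a GTFS table.

    A table that is absent from the archive yields an empty set.
    """
    if member not in archive.namelist():
        return set()
    with archive.open(member) as raw:
        # utf-8-sig tolerates the byte-order mark some GTFS exports include
        text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
        return {row[column] for row in csv.DictReader(text) if row.get(column)}


def unmatched_service_ids(gtfs_zip):
    """Return service_ids used in trips.txt but defined in no calendar table."""
    with zipfile.ZipFile(gtfs_zip) as archive:
        defined = _column(archive, "calendar.txt", "service_id") | _column(
            archive, "calendar_dates.txt", "service_id"
        )
        used = _column(archive, "trips.txt", "service_id")
    return used - defined
```

Values are compared as strings because that is how they appear in the CSV files; a feed that passes the check returns an empty set.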
112 changes: 112 additions & 0 deletions pipeline/gtfs/01-validate-gtfs.py
@@ -0,0 +1,112 @@
"""Run the GTFS validation checks for the toml-specified GTFS file.

1. read feed
2. describe feed
3. validate feed
4. clean feed
5. new - print errors / warnings in full
6. new - visualise convex hull of stops and area
7. visualise stop locations
8. new - modalities available (including extended spec)
9. new - feed stats by is-weekend
"""
import toml
from pyprojroot import here
import time
import subprocess

from transport_performance.gtfs.validation import GtfsInstance
from transport_performance.utils.defence import _is_gtfs_pth

CONFIG = toml.load(here("pipeline/gtfs/config/01-validate-gtfs.toml"))
GTFS_PTH = here(CONFIG["GTFS"]["PATH"])
UNITS = CONFIG["GTFS"]["UNITS"]
GEOM_CRS = CONFIG["GTFS"]["GEOMETRIC_CRS"]
POINT_MAP_PTH = CONFIG["MAPS"]["STOP_COORD_PTH"]
HULL_MAP_PATH = CONFIG["MAPS"]["STOP_HULL_PTH"]
PROFILING = CONFIG["UTILS"]["PROFILING"]
# check GTFS Path exists
_is_gtfs_pth(pth=GTFS_PTH, param_nm="GTFS_PTH", check_existing=True)
# Get the disk usage of the GTFS file.
gtfs_du = (
    subprocess.check_output(["du", "-sh", GTFS_PTH]).split()[0].decode("utf-8")
)
if PROFILING:
    print(f"GTFS at {GTFS_PTH} disk usage: {gtfs_du}")

pre_init = time.perf_counter()
feed = GtfsInstance(gtfs_pth=GTFS_PTH, units=UNITS)
post_init = time.perf_counter()
if PROFILING:
    print(f"Init in {post_init - pre_init:0.4f} seconds")

available_dates = feed.feed.get_dates()
post_dates = time.perf_counter()
if PROFILING:
    print(f"get_dates in {post_dates - post_init:0.4f} seconds")
s = available_dates[0]
f = available_dates[-1]
print(f"{len(available_dates)} dates available between {s} & {f}.")

try:
    # If agency_id is missing, an AttributeError is raised. GTFS spec states
    # This is conditionally required, dependent if more than one agency is
    # operating within the feed. https://gtfs.org/schedule/reference/#agencytxt
    # Cleaning the feed doesn't resolve. Raise issue to investigate.
    print(feed.is_valid())
    post_isvalid = time.perf_counter()
    if PROFILING:
        print(f"is_valid in {post_isvalid - post_dates:0.4f} seconds")
    print(feed.validity_df["type"].value_counts())
    feed.print_alerts()
    post_errors = time.perf_counter()
    feed.print_alerts(alert_type="warning")
    post_warn = time.perf_counter()
    if PROFILING:
        print(f"print_alerts errors: {post_errors - post_isvalid:0.4f} secs")
        print(f"print_alerts warn: {post_warn - post_errors:0.4f} secs")
except AttributeError:
    print("AttributeError. Unable to validate feed.")

pre_clean = time.perf_counter()
feed.clean_feed()
post_clean = time.perf_counter()
if PROFILING:
    print(f"clean_feed in {post_clean - pre_clean:0.4f} seconds")

try:
    print(feed.is_valid())
    print(feed.validity_df["type"].value_counts())
    feed.print_alerts()
    feed.print_alerts(alert_type="warning")
except AttributeError:
    print("AttributeError. Unable to validate feed.")

# visualise gtfs
pre_viz_points = time.perf_counter()
feed.viz_stops(out_pth=POINT_MAP_PTH)
post_viz_points = time.perf_counter()
if PROFILING:
    print(f"viz_points in {post_viz_points - pre_viz_points:0.4f} seconds")
print(f"Map written to {POINT_MAP_PTH}")

pre_viz_hull = time.perf_counter()
feed.viz_stops(out_pth=HULL_MAP_PATH, geoms="hull", geom_crs=GEOM_CRS)
post_viz_hull = time.perf_counter()
if PROFILING:
    print(f"viz_hull in {post_viz_hull - pre_viz_hull:0.4f} seconds")
print(f"Map written to {HULL_MAP_PATH}")

pre_route_modes = time.perf_counter()
print(feed.get_route_modes())
post_route_modes = time.perf_counter()
if PROFILING:
    print(f"route_modes in {post_route_modes - pre_route_modes:0.4f} seconds")

pre_summ_weekday = time.perf_counter()
print(feed.summarise_trips())
print(feed.summarise_routes())
post_summ_weekday = time.perf_counter()
if PROFILING:
    print(f"summ_weekday in {post_summ_weekday - pre_summ_weekday:0.4f} secs")
    print(f"Pipeline execution in {post_summ_weekday - pre_init:0.4f}")
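
Review note: the script repeats the `time.perf_counter()` before/after pattern for every step and shells out to `du`, which is not available on Windows. A hedged sketch of two helpers that could reduce the repetition follows. `timed`, `human_size` and `file_size` are hypothetical names that are not part of this PR, and the size string only approximates `du -sh`, which reports disk blocks used rather than file length.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def timed(label, enabled=True):
    """Print how long the wrapped block took when profiling is enabled."""
    start = time.perf_counter()
    try:
        yield
    finally:
        if enabled:
            elapsed = time.perf_counter() - start
            print(f"{label} in {elapsed:0.4f} seconds")


def human_size(n_bytes):
    """Format a byte count using binary units (B, K, M, G, T)."""
    size = float(n_bytes)
    for unit in ("B", "K", "M", "G"):
        if size < 1024:
            return f"{size:.1f}{unit}"
        size /= 1024
    return f"{size:.1f}T"


def file_size(path):
    """Return the length of a single file as a human-readable string."""
    return human_size(os.path.getsize(path))
```

Assuming `GTFS_PTH` and `PROFILING` as defined in the script, a step could then read `with timed("clean_feed", enabled=PROFILING): feed.clean_feed()`, and the disk usage line could print `file_size(GTFS_PTH)` instead of calling `subprocess`.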
13 changes: 13 additions & 0 deletions pipeline/gtfs/config/01-validate-gtfs.toml
@@ -0,0 +1,13 @@
title = "Config for GTFS Validation Pipeline"

[GTFS]
PATH = "data/external/croppednewport-bus-07-07-2022_gtfs.zip"
UNITS = "m"
GEOMETRIC_CRS = 27700 # used for area calculations only

[MAPS]
STOP_COORD_PTH = "outputs/gtfs/validation/gtfs-stops-locations.html"
STOP_HULL_PTH = "outputs/gtfs/validation/gtfs-stops-convex-hull.html"

[UTILS]
PROFILING = true
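
Review note: the pipeline script indexes into this config directly, so a missing key surfaces as a bare `KeyError` during start-up. A minimal sketch of an up-front check is below. `REQUIRED_KEYS` and `missing_config_keys` are hypothetical names, and the keys listed are simply the ones the script in this PR reads.

```python
# Keys read by pipeline/gtfs/01-validate-gtfs.py, grouped by TOML table.
REQUIRED_KEYS = {
    "GTFS": ("PATH", "UNITS", "GEOMETRIC_CRS"),
    "MAPS": ("STOP_COORD_PTH", "STOP_HULL_PTH"),
    "UTILS": ("PROFILING",),
}


def missing_config_keys(config):
    """Return a sorted list of 'TABLE.KEY' names absent from a parsed config."""
    missing = []
    for table, keys in REQUIRED_KEYS.items():
        section = config.get(table, {})
        missing.extend(f"{table}.{key}" for key in keys if key not in section)
    return sorted(missing)
```

Given the dict returned by `toml.load`, the script could call this once and raise a single error naming everything that is missing.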
7 changes: 7 additions & 0 deletions requirements.txt
@@ -3,6 +3,12 @@ r5py>=0.0.4
 gtfs_kit==5.2.7
 pytest
 coverage
+ipykernel==6.23.1
+pandas
+beautifulsoup4
+requests
+pytest-mock
+toml
 rasterio
 pyprojroot
 matplotlib
@@ -15,5 +21,6 @@ geocube
 mapclassify
 pytest-lazy-fixture
 seaborn
+numpy>=1.25.0 # test suite will fail if user installed lower than this
 rioxarray
 -e .
1 change: 1 addition & 0 deletions src/transport_performance/gtfs/__init__.py
@@ -0,0 +1 @@
"""Helpers for working with & validating GTFS."""