Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

15 gtfs validation pipeline #20

Merged
merged 98 commits into from
Aug 16, 2023
Merged

15 gtfs validation pipeline #20

merged 98 commits into from
Aug 16, 2023

Conversation

r-leyshon
Copy link
Contributor

Description

GTFS validation pipeline. GTFS utility function to filter GTFS feed on bounding box. Utility function internals.

Fixes #15

Motivation and Context

Validate GTFS feeds.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

How Has This Been Tested?

Test coverage at 93% average.
More testing in marked suites:

  • --runinteg (scrapes lookup table)
  • -- runexpensive (gtfs_kit performance warnings. ~30s duration with test fixture.)

Test configuration details:

  • OS: macos
  • Python version: Python 3.9.13
  • Java version: openjdk 11.0.19 2023-04-18 LTS
  • Python management system: pip

Advice for reviewer

Checklist:

  • My code follows the intended structure of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings - warnings handled. Will raise separate issue.
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

Additional comments

r-leyshon added 30 commits June 21, 2023 13:45
@r-leyshon
Copy link
Contributor Author

Some CI notes:

  • I needed to pin the version of numpy for one test to pass
  • --runsetup was causing issue and running tests marked --runexpensive. This needs checking. As it stands, I've adjusted CI suite to run only the standard test suite with no optional flags.
  • Added some more missing deps.

@ethan-moss ethan-moss self-assigned this Jul 19, 2023
@ethan-moss ethan-moss removed this from the prep work milestone Jul 19, 2023
@r-leyshon r-leyshon mentioned this pull request Jul 26, 2023
11 tasks
@r-leyshon r-leyshon marked this pull request as draft July 31, 2023 06:27
* Refactored daily summaries

* Update github actions inline with dev

* docs: Add timestamp format in docstring

* docs: Update docstring with route count explanation

* chore: Better sentence syntax

* Updated breaking change summarise_weekday()->summarise_days()

* Add skelentons for route/trip/pre-processing functions

* Added the ability to generate route and trip summaries

---------

Co-authored-by: r-leyshon <[email protected]>
@r-leyshon r-leyshon marked this pull request as ready for review August 3, 2023 15:58
@r-leyshon
Copy link
Contributor Author

@ethan-moss I just pushed another small diff that removed some defunct source code - a shallow wrapper around gtfs_kit.feed.get_dates(), that never become anything more meaningful than what gtfs_kit implemented. Removing those lines increases coverage of that file and less to review.

@r-leyshon
Copy link
Contributor Author

r-leyshon commented Aug 7, 2023

@ethan-moss turned out the numpy breaking change was a little more detailed than initially expected. This is the way.
Updated defensive logic worked as expected. The order of the summary table columns were affected. Numpy now returns np.min instead of np.amin for the function name, therefore alphabatical sorting returns a different order.
Due to this, I've asked for numpy>=1.25.0 in the reqs, otherwise this will break our CI checks.
Making a backward-compatible src function would involve renaming and resorting columns in attribute dataframes to whatever we want, in spite of the numpy version used. Would be straight forward though I ran out of time to implement today.

Thankfully the thorough test suite catches all these breaking changes and forces you to treat it.

Copy link
Collaborator

@ethan-moss ethan-moss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job in putting this GTFS module together! This is a great addition to the codebase and I took lots of learnings away from reviewing it:

  • The HTML parsing to 'knife and fork' the route types and descriptions is an excellent solution and very well implement - it is a much better method than hard coding values into the codebase. Nicely done!
  • Unit testing strategies are exemplary - examples I will definitely learn from and use as a basis going forward.
  • Great use of mocking in the unit testing too - I've now learnt about this package for the future as well.
  • The defence module is a great idea, looking forward to implementing this across the rest of the codebase - it will be very useful for sure!
  • The implementation of summarise_trips() and summarise_routes() is much more performant than the base implementation - it handled a much larger, London-sized GTFS file with ease!

I performed a review of this PR, considering the whole diff, and using both the test GTFS dataset provided here in the PR and a different GTFS file (a Wales only GTFS dataset, extracted from DfT'S BODS). Overall the gtfs module and additions to utils worked as expected. The review has brought up some issues, 'technical debt', and feature requests to raise, but I'll do it separately outside this PR such that any future work resolving them will have a smaller diff - less to PR, making it easier for us all.

The issues are mainly corner case that won't be present during typical usage. I've got fix recommendation for most of the them, and I'll add these to the issues themselves when raising them - it's just a case of implementing these ones. In the very short term, these issues wouldn't present a problem for typical applications - another reason why merging this PR should be OK. There are some issues related to warnings that need to be looked at in more detail though.

The 'technical debt' and feature requests we can progress without for now, and include when needed - same as the population module.

There is an additional GTFS validation feature that we should develop before progressing the codebase further - primarily because gtfs_kit does not implement this check and it's cropped up in the GTFS data we're using. The feature would be checking for 'fast travel between stops'. This happens when the GTFS timetable is unrealistic/erroneous and the transit vehicle would have to go unrealistically fast to go between stops - example causes could be poor timetabling, human error, and/or the stop has lat/lon info in an incorrect location. This GTFS behaviour could impact our results (may lead to unrealistic transport performance for those trips), so it's best to get a handle on it now. This seems to be a fairly uncommon issue that appears in GTFS data extracts from DfT's Bus Open Data platform, however it is probably common enough that we should look to detect this case as part of the GTFS validation steps - we can at least then understand the impact it may have on any results we generate. We have a working prototype for this feature ready for further development, checks, and implementation. We should be able to add this in as a separate PR soon.

Again, really great work with this PR - as I said above, the module brings in great functionalities and I learnt a lot from reviewing it. Thanks again!

@ethan-moss
Copy link
Collaborator

Additional commit after PR review: a merge conflict crept in after merging in the population PR. This was isolated to the requirements.txt file only, and all other changes were additive and reviewed in PR #44 <-- no impact on this PR.

@ethan-moss ethan-moss merged commit 96f48a9 into dev Aug 16, 2023
4 checks passed
@ethan-moss ethan-moss deleted the 15-gtfs-validation-pipeline branch August 16, 2023 15:31
@ethan-moss ethan-moss restored the 15-gtfs-validation-pipeline branch August 17, 2023 12:23
@ethan-moss ethan-moss deleted the 15-gtfs-validation-pipeline branch August 17, 2023 12:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gtfs validation pipeline
4 participants