Improve clean_feed() when shape_id not available #74

Closed
ethan-moss opened this issue Aug 17, 2023 · 3 comments
@ethan-moss
Collaborator

Description of the Feature to be Added

clean_feed() is a thin wrapper around gtfs_kit's clean() function, which applies four cleaning functions:

  1. clean_ids(): strip whitespace from all string IDs, then replace every remaining whitespace chunk with an underscore.
  2. clean_times(): convert H:MM:SS time strings to HH:MM:SS so that sorting by time works as expected.
  3. clean_route_short_names(): in feed.routes, assign ‘n/a’ to missing route short names and strip whitespace from route short names, then disambiguate each duplicated route short name by appending ‘-’ and its route ID. Note: this is the step that fixes the "Repeated pair (route_short_name, route_long_name)" warning.
  4. drop_zombies(): does the following:
    • Drop stops of location type 0 or NaN with no stop times.
    • Remove undefined parent stations from the parent_station column.
    • Drop trips with no stop times.
    • Drop shapes with no trips.
    • Drop routes with no trips.
    • Drop services with no trips.
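
To make the first two steps concrete, here is a minimal standalone sketch of the string transformations they describe. It is not gtfs_kit's code: the helper names clean_id and clean_time are hypothetical, and it works on single strings rather than on feed tables.

```python
import re


def clean_id(value: str) -> str:
    """Strip outer whitespace, then collapse each inner whitespace run to '_'."""
    return re.sub(r"\s+", "_", value.strip())


def clean_time(value: str) -> str:
    """Pad an H:MM:SS string to HH:MM:SS; leave other strings unchanged."""
    value = value.strip()
    return "0" + value if len(value) == 7 else value


# Unpadded times sort incorrectly as strings because "1" < "9".
raw_times = ["10:00:00", "9:30:00"]
sorted_raw = sorted(raw_times)
sorted_clean = sorted(clean_time(t) for t in raw_times)

print(clean_id("  stop  1 a "))
print(sorted_raw)
print(sorted_clean)
```

The padding is what makes plain string sorting agree with chronological order, which is the motivation given for clean_times() above.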

clean_feed(), and hence clean(), fails if trips.txt has no shape_id column. However, drop_zombies() is the only step that relies on that column; the other three cleaning functions work fine without it:
[Screenshot 2023-08-15 at 10 57 39]

(OPTIONAL) Suggested Implementations

Instead of returning without cleaning when shape_id is not present, clean_feed() could apply only the first three cleaning functions and skip drop_zombies().
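
One possible shape for this, as a rough sketch only: run each cleaning step if the feed.trips columns it needs are present, and report the ones that were skipped. The function name clean_available, the (name, function, required columns) step format and the stub feed below are all hypothetical and not part of the package or of gtfs_kit. The commented wiring assumes the four cleaners are exposed in gtfs_kit.cleaners.

```python
from types import SimpleNamespace


def clean_available(feed, steps):
    """Apply each step whose required feed.trips columns exist.

    `steps` is a list of (name, function, required_columns) tuples
    (a hypothetical format). Returns the feed and the skipped step names.
    """
    skipped = []
    for name, func, required in steps:
        if any(col not in feed.trips.columns for col in required):
            skipped.append(name)
            continue
        feed = func(feed)
    return feed, skipped


# Assumed wiring with gtfs_kit (not executed here):
# import gtfs_kit as gk
# steps = [
#     ("clean_ids", gk.cleaners.clean_ids, []),
#     ("clean_times", gk.cleaners.clean_times, []),
#     ("clean_route_short_names", gk.cleaners.clean_route_short_names, []),
#     ("drop_zombies", gk.cleaners.drop_zombies, ["shape_id"]),
# ]

# Stand-in demo: stub steps that only record that they ran.
calls = []


def make_step(name):
    def step(feed):
        calls.append(name)
        return feed
    return step


names = ["clean_ids", "clean_times", "clean_route_short_names", "drop_zombies"]
steps = [(n, make_step(n), ["shape_id"] if n == "drop_zombies" else []) for n in names]

# A stub feed whose trips table has no shape_id column.
feed = SimpleNamespace(trips=SimpleNamespace(columns=["route_id", "service_id", "trip_id"]))
cleaned, skipped = clean_available(feed, steps)
print(calls, skipped)
```

Returning the skipped step names lets the caller warn the user that drop_zombies() was not applied, rather than silently cleaning less than expected.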

@ethan-moss added the needs triage, technical debt and GTFS labels on Aug 17, 2023
@CBROWN-ONS self-assigned this on Sep 13, 2023
@CBROWN-ONS added this to the sprint 4 End milestone on Sep 13, 2023
@CBROWN-ONS
Collaborator

Note to self:
To replicate the error, delete the shape_id column from feed.trips.

@CBROWN-ONS
Collaborator

UPDATE

This issue will not be completed before the end of sprint 4 because it relies on other PRs that are yet to be merged. It should, however, be completed early in sprint 5.

@CBROWN-ONS modified the milestones: sprint 4 End, sprint 5 end on Oct 9, 2023
@r-leyshon added the wontfix label and removed the needs triage, technical debt and GTFS labels on Aug 26, 2024
@r-leyshon
Contributor

Migrated to datasciencecampus/assess_gtfs#6

@r-leyshon closed this as not planned on Aug 26, 2024