Current siri and gtfs matching problems #16
you can see that the relevant dag
so the scheduled run only covers the last day. To run it for other dates, you can trigger it manually from Airflow by clicking on the "play" icon next to it.
|
Taking a look at one of the late-hour blocks in which there were many misses: 2023-05-08 01:00 (Monday), 23% (45 rides) misses.
Ran the ETL query manually; we can already see 6 rides that should've matched but didn't:
select siri_ride.id, siri_route.operator_ref, siri_route.line_ref, gtfs_route.route_long_name, gtfs_ride.start_time
from gtfs_ride, gtfs_route, siri_route, siri_ride
where
gtfs_route.id = gtfs_ride.gtfs_route_id
and gtfs_route.operator_ref = siri_route.operator_ref
and gtfs_route.line_ref = siri_route.line_ref
and siri_route.id = siri_ride.siri_route_id
and gtfs_route.date between '2023-05-07' and '2023-05-09'
and siri_ride.scheduled_start_time = gtfs_ride.start_time
and scheduled_time_gtfs_ride_id is null
and DATE_TRUNC('hour', scheduled_start_time) = '2023-05-08 01:00:00.000000'
-- if we have updated_duration_minutes it means we updated the duration of the ride
-- so we have all the ride stops data which we must ensure before making these updates
and siri_ride.updated_duration_minutes is not null
Tried to play with the conditions a bit to see if more rides would match -
-- and siri_ride.scheduled_start_time = gtfs_ride.start_time
and siri_ride.scheduled_start_time > gtfs_ride.start_time - '5 minutes'::interval
and siri_ride.scheduled_start_time < gtfs_ride.start_time + '5 minutes'::interval
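To make the relaxed condition above easier to reason about, here is a minimal Python sketch of the same idea: match a SIRI ride to a GTFS ride on the same route when the start times differ by strictly less than 5 minutes. The function name, the tuple layout and the "pick the closest candidate" tie-breaking are assumptions made for illustration; the SQL above only widens the time window and does not say how multiple candidates should be resolved.

```python
from datetime import datetime, timedelta


def match_rides(siri_rides, gtfs_rides, tolerance=timedelta(minutes=5)):
    """Pair each SIRI ride with the closest GTFS ride within +/- tolerance.

    siri_rides: list of (siri_ride_id, route_key, scheduled_start_time)
    gtfs_rides: list of (gtfs_ride_id, route_key, start_time)
    route_key is a hypothetical stand-in for the (operator_ref, line_ref) pair.
    """
    matches = {}
    for siri_id, siri_key, scheduled in siri_rides:
        # Strict inequality mirrors the `>` / `<` comparisons in the SQL.
        candidates = [
            (abs(scheduled - start), gtfs_id)
            for gtfs_id, gtfs_key, start in gtfs_rides
            if gtfs_key == siri_key and abs(scheduled - start) < tolerance
        ]
        if candidates:
            # Assumption: when several GTFS rides qualify, keep the closest one.
            matches[siri_id] = min(candidates)[1]
    return matches


siri = [(1, (3, 70), datetime(2023, 5, 8, 1, 2))]
gtfs = [
    (10, (3, 70), datetime(2023, 5, 8, 1, 0)),
    (11, (3, 70), datetime(2023, 5, 8, 1, 10)),
]
print(match_rides(siri, gtfs))  # {1: 10}
```

A wider window can match more rides but also raises the chance of pairing a ride with the wrong trip on frequent lines, so the tolerance is worth checking against real data before changing the ETL.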
|
@OriHoch - I also re-ran the ETL for 29/4-2/5 and it did find extra matches 🥳 Do you have any idea why this would happen? Maybe some kind of constraint with other ETLs that I'm not thinking of?
|
Sounds like it might be because these ETLs only run for the last 1 day, so for the edge cases they may not have all the data yet. I'm not sure what the implications are, but maybe we can increase it to run for the last 2 days, or more.
|
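As a rough sketch of what "run for the last 2 days" could mean, the helper below lists the dates a run would cover for a given lookback. The function name and the convention that the window ends on the reference date are assumptions for illustration only; the actual DAG may define its window differently.

```python
from datetime import date, timedelta


def dates_to_process(reference_date, lookback_days=2):
    """Return the dates to (re)process, oldest first, ending on reference_date."""
    if lookback_days < 1:
        raise ValueError("lookback_days must be at least 1")
    return [
        reference_date - timedelta(days=offset)
        for offset in range(lookback_days - 1, -1, -1)
    ]


print(dates_to_process(date(2023, 5, 8), lookback_days=2))
# [datetime.date(2023, 5, 7), datetime.date(2023, 5, 8)]
```

With a lookback of 2, rides around midnight get a second chance to match once the following day's data has arrived.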
By looking at the siri_ride table, you can see some cases with journey_gtfs_ride_id (old ETL) or scheduled_time_gtfs_ride_id being null (since 1/5, when the new ETL was deployed).
Problems I found -
siri_ride.id=37226085 - should have matched with the current queries (both the old and the new) but didn't - line 70 at 2023-05-01 04:07:00.000000
The new ETL gets much better matching results! (mostly a ~2-3% miss rate, compared to a much higher one)
Zooming in more, we can see that our main problem in the new ETL is the edge-case hours (23-01)
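One way to reproduce this kind of per-hour breakdown is sketched below in Python. It assumes each ride has been reduced to a (scheduled_start_time, matched) pair, where matched is True when the ride has a GTFS match; the function name and that input shape are hypothetical and not taken from the ETL code.

```python
from collections import defaultdict
from datetime import datetime


def miss_rate_by_hour(rides):
    """Return {hour_of_day: (missed, total, missed_percent)}.

    rides: iterable of (scheduled_start_time, matched) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # hour -> [missed, total]
    for scheduled, matched in rides:
        bucket = counts[scheduled.hour]
        bucket[1] += 1
        if not matched:
            bucket[0] += 1
    return {
        hour: (missed, total, 100.0 * missed / total)
        for hour, (missed, total) in sorted(counts.items())
    }


sample = [
    (datetime(2023, 5, 8, 1, 0), False),
    (datetime(2023, 5, 8, 1, 15), True),
    (datetime(2023, 5, 8, 1, 30), True),
    (datetime(2023, 5, 8, 1, 45), True),
    (datetime(2023, 5, 8, 12, 0), True),
    (datetime(2023, 5, 8, 12, 30), True),
]
print(miss_rate_by_hour(sample))
# {1: (1, 4, 25.0), 12: (0, 2, 0.0)}
```

Grouping by hour of day rather than by full timestamp makes it easy to see whether misses keep clustering around midnight across several dates.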
Follow-up tasks: