Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Current siri and gtfs matching problems #16

Open
ShayAdler opened this issue May 8, 2023 · 4 comments
Open

Current siri and gtfs matching problems #16

ShayAdler opened this issue May 8, 2023 · 4 comments

Comments

@ShayAdler
Copy link
Contributor

By looking at siri_ride table, you can see some cases with journey_gtfs_ride_id (old etl) or scheduled_time_gtfs_ride_id (being null, since 1/5 when the new etl was deployed)

Problems I found -

  • By looking at a specific case -
    siri_ride.id=37226085 - should have matched with the current queries (both the old and the new) but didn't - line 70 at 2023-05-01 04:07:00.000000
-- route_gtfs_ride_id update query
select *
    from gtfs_ride, gtfs_route, siri_route, siri_ride
    where
    gtfs_route.id = gtfs_ride.gtfs_route_id
    and gtfs_route.operator_ref = siri_route.operator_ref
    and gtfs_route.line_ref = siri_route.line_ref
    and siri_route.id = siri_ride.siri_route_id
    and gtfs_route.date = '2023-05-01'
      and siri_ride.id = 37226085
    and siri_ride.scheduled_start_time > gtfs_ride.start_time - '5 minutes'::interval
    and siri_ride.scheduled_start_time < gtfs_ride.start_time + '5 minutes'::interval

-- scheduled_time_gtfs_ride_id update time
    select *
    from gtfs_ride, gtfs_route, siri_route, siri_ride
    where
    gtfs_route.id = gtfs_ride.gtfs_route_id
    and gtfs_route.operator_ref = siri_route.operator_ref
    and gtfs_route.line_ref = siri_route.line_ref
    and siri_route.id = siri_ride.siri_route_id
    and gtfs_route.date between '2023-05-01' and '2023-05-02'
    and siri_ride.scheduled_start_time = gtfs_ride.start_time
    and siri_ride.id = 37226085
    -- if we have updated_duration_minutes it means we updated the duration of the ride
    -- so we have all the ride stops data which we must ensure before making these updates
    and siri_ride.updated_duration_minutes is not null
  • By looking at the aggregative data -
    The new ETL gets much better matching results! (mostly ~2-3 mistake percentage compared to a lot) -
SELECT DATE_TRUNC('day', scheduled_start_time),
         COUNT(*) FILTER (WHERE journey_gtfs_ride_id IS NULL) AS null_count,
  COUNT(*) FILTER (WHERE journey_gtfs_ride_id IS NOT NULL) AS not_null_count,
  COUNT(*) FILTER (WHERE journey_gtfs_ride_id IS NULL) * 100.0 / COUNT(*) AS null_percentage,

  COUNT(*) FILTER (WHERE scheduled_time_gtfs_ride_id IS NULL) AS new_null_count,
  COUNT(*) FILTER (WHERE scheduled_time_gtfs_ride_id IS NOT NULL) AS new_not_null_count,
COUNT(*) FILTER (WHERE scheduled_time_gtfs_ride_id IS NULL) * 100.0 / COUNT(*) AS new_null_percentage
FROM siri_ride
WHERE DATE_TRUNC('day', scheduled_start_time) > '2023-04-20'
GROUP BY DATE_TRUNC('day', scheduled_start_time)

image

Zooming in more, we can see that our main problem in the new ETL is the edge cases hours (23-01)

SELECT DATE_TRUNC('hour', scheduled_start_time),

  COUNT(*) FILTER (WHERE scheduled_time_gtfs_ride_id IS NULL) AS new_null_count,
  COUNT(*) FILTER (WHERE scheduled_time_gtfs_ride_id IS NOT NULL) AS new_not_null_count,
COUNT(*) FILTER (WHERE scheduled_time_gtfs_ride_id IS NULL) * 100.0 / COUNT(*) AS new_null_percentage
FROM siri_ride
WHERE DATE_TRUNC('day', scheduled_start_time) > '2023-05-01'
GROUP BY DATE_TRUNC('hour', scheduled_start_time)

image

Follow-up tasks:

  1. Re-run the ETL on the specific case I mentioned and see it if works and updates as expected - @OriHoch can you help me with that? 🙏
  2. Validate the edge cases hours - @NoamGaash and I will take a look
  3. And another random question - why do we have siri data from future dates? (up to 7/6)
@ShayAdler ShayAdler changed the title Current matching siri and gtfs matching problems Current siri and gtfs matching problems May 8, 2023
@OriHoch
Copy link
Contributor

OriHoch commented May 10, 2023

  1. the dag definition shows how the DAG is configured - https://github.com/hasadna/open-bus-stride-etl/blob/main/open_bus_stride_etl/siri/dags.yaml

you can see that the relevant dag stride-etl-siri-update-rides-gtfs is configured with kwargs:

          min_date: {}
          max_date: {}
          num_days: {default: "1"}

so the scheduled run only runs on the last day. To run on other dates, you can trigger it manually from Airflow by clicking on the "play" icon next to it -
image
in the following screen you can paste a configuration json like this specifying a range of dates to run on:

{
"min_date": "2020-03-04",
"max_date": "2020-03-06"
}
  1. regarding future dates - I assume this is coming from the source siri xml data

@ShayAdler
Copy link
Contributor Author

Taking a look at one of the late hours blocks in which there where many misses: 2023-05-08 01:00 (Monday), 23% (45 rides) misses

Ran the ETL query manually, we can already see 6 rides that should've matched but didn't:

select siri_ride.id, siri_route.operator_ref, siri_route.line_ref, gtfs_route.route_long_name, gtfs_ride.start_time
    from gtfs_ride, gtfs_route, siri_route, siri_ride
    where
    gtfs_route.id = gtfs_ride.gtfs_route_id
    and gtfs_route.operator_ref = siri_route.operator_ref
    and gtfs_route.line_ref = siri_route.line_ref
    and siri_route.id = siri_ride.siri_route_id
    and gtfs_route.date between '2023-05-07' and '2023-05-09'
    and siri_ride.scheduled_start_time = gtfs_ride.start_time
    and scheduled_time_gtfs_ride_id is null and DATE_TRUNC('hour', scheduled_start_time) = '2023-05-08 01:00:00.000000'
    -- if we have updated_duration_minutes it means we updated the duration of the ride
    -- so we have all the ride stops data which we must ensure before making these updates
    and siri_ride.updated_duration_minutes is not null
image

Tried to play with the conditions a bit to see if more rides will match -

  • removing siri_ride.updated_duration_minutes is not null condition didn't help
  • adding more tolerance in the match to the scheduled_start_time got 10 more matches, in which the scheduled_start_time contains miliseconds and is not equal exactly equal as we've seen till now (need to deeper investigate this, I'll open a seperate issue 😅 )
--     and siri_ride.scheduled_start_time = gtfs_ride.start_time
      and siri_ride.scheduled_start_time > gtfs_ride.start_time - '5 minutes'::interval
    and siri_ride.scheduled_start_time < gtfs_ride.start_time + '5 minutes'::interval
image
  • Taking a look at the remaining 29 rides
    • 10 out of them are rides of a specific line that actually weren't planned (operator ref - 50, line ref - 15135, first planned ride is at 4 am, these are between 3-4)
    • 2 more are also unplanned
    • you got the point

@ShayAdler
Copy link
Contributor Author

ShayAdler commented May 15, 2023

@OriHoch - I also re-ran the etl between the 29/4-2/5 and it did found extra matches 🥳
It both matched the specific ride I found missing (siri_ride.id=37226085)
And also decreased the total missing percentage in those days (compared to my screenshot from last week):
image

Do you have any idea why would this happen? Maybe some kind of constraint with other ETLs I don't think of?

@OriHoch
Copy link
Contributor

OriHoch commented May 16, 2023

sounds like it might be due to these ETLs only running for last 1 day, maybe for the edge cases they don't have all data yet. I'm not sure what are the implications, but maybe we can increase it to run for last 2 days.. or more

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants