generated from jtr13/quarto-edav-template
-
Notifications
You must be signed in to change notification settings - Fork 0
/
results.qmd
587 lines (459 loc) · 26.9 KB
/
results.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
# Exploring the Subway
This projects aims at exploring the different data around the New York subway system that is run by the MTA. The datasets here have been used from the nyc open data repository. Here, we explore the various trends with the data in this respect.
```{r, echo=FALSE}
#| warning: false
# Load libraries
library(tidyverse)
library(dplyr)
library(lubridate)
library(stringr)
library(GGally)
# Load datasets
subway_train_delays <- read_csv("data/MTA_Subway_Trains_Delayed__Beginning_2020_20241118.csv", show_col_types = FALSE)
mta_service_alerts <- read_csv("data/MTA_Service_Alerts__Beginning_April_2020_20241118.csv", show_col_types = FALSE)
subway_wait_assessment <- read_csv("data/MTA_Subway_Wait_Assessment__Beginning_2020_20241119.csv", show_col_types = FALSE)
mta_ridership <- read_csv("data/MTA_Subway_Hourly_Ridership_Data_Modified_v2.csv", show_col_types = FALSE)
all_ridership <- read.csv('data/MTA_Daily_Ridership_Data__Beginning_2020_20241120.csv')
```
## Why the subway?
First, it is important to visit why we choose the subway for this project. The subway is the most used mode of public transport in the city.
```{r, echo=FALSE}
# Boxplot: Distribution of total daily ridership for Subways, Buses, LIRR, and Metro-North
all_ridership_long <- all_ridership |>
select(Date,
`Subways..Total.Estimated.Ridership`,
`Buses..Total.Estimated.Ridership`,
`LIRR..Total.Estimated.Ridership`,
`Metro.North..Total.Estimated.Ridership`) |>
pivot_longer(cols = -Date, names_to = "Transport_Mode", values_to = "Ridership")
ggplot(all_ridership_long, aes(x = Transport_Mode, y = Ridership, fill = Transport_Mode)) +
geom_boxplot() +
labs(title = "Daily Ridership by Transport Mode",
x="",
y = "Daily Ridership (in millions)") +
scale_y_continuous(
labels = function(x) x / 1e6 # Divide values by 1,000,000
) +
theme(axis.text.x = element_blank(),
axis.ticks = element_blank())
```
Here, we see clearly that the median daily ridership for the subway is higher than even the peak ridership for any other mode of transport in the city. This speaks volumes about how important the subway system is, to the city. We see everyday the number of people on the subway, but this graph tells us how much more the number really is.
This makes so that the subway system is essentially a lifeline for the city itself, and needs to be as effective as possible to ensure the best possible effects for the citizens of the city (New Yorkers - us).
## Delays in the subway
Let's take a look at the delays in the subway system in the city.
### When are the delays occuring?
Here's a graph faceted over the delays over the weekdays and weekends in the NYC subway system.
```{r fig.width=10, echo=FALSE}
#| warning: false
subway_groups <- data.frame(
line = c("1", "2", "3", "4", "5", "6", "7", "A", "C", "E", "B", "D", "F", "M", "G", "J", "Z", "JZ", "L", "N", "Q", "R", "W", "S 42nd", "S Fkln", "S Rock"),
group = c(
"IRT - Broadway–Seventh Avenue Line", # 1, 2, 3
"IRT - Broadway–Seventh Avenue Line",
"IRT - Broadway–Seventh Avenue Line",
"IRT - Lexington Avenue Line", # 4, 5, 6
"IRT - Lexington Avenue Line",
"IRT - Lexington Avenue Line",
"IRT - Flushing Line", # 7
"IND - Eighth Avenue Line", # A, C, E
"IND - Eighth Avenue Line",
"IND - Eighth Avenue Line",
"IND - Sixth Avenue Line", # B, D, F, M
"IND - Sixth Avenue Line",
"IND - Sixth Avenue Line",
"IND - Sixth Avenue Line",
"IND - Crosstown Line", # G
"BMT - Nassau Street Line", # J, Z
"BMT - Nassau Street Line",
"BMT - Nassau Street Line",
"BMT - Canarsie Line", # L
"BMT - Broadway Line", # N, Q, R, W
"BMT - Broadway Line",
"BMT - Broadway Line",
"BMT - Broadway Line",
"Shuttles", # All S lines
"Shuttles",
"Shuttles"
)
)
delays_category <- subway_train_delays[,!names(subway_train_delays) %in% c("subcategory")]
pdf<- delays_category |> group_by(line, reporting_category, month, day_type) |> summarise(delays = sum(delays))
pdf_with_groups <- pdf |>
left_join(subway_groups, by = "line")
total_delays_per_group <- pdf_with_groups |>
group_by(group, month) |>
summarize(total_delays = sum(delays, na.rm = TRUE), .groups = "drop")
# custom colors based on NYC Subway colors
subway_group_colors <- c(
"IRT - Broadway–Seventh Avenue Line" = "#EE352E",# Red
"IRT - Lexington Avenue Line" = "#00933C", # Green
"IRT - Flushing Line" = "#B933AD", # Purple
"IND - Eighth Avenue Line" = "#0039A6", # Blue
"IND - Sixth Avenue Line" = "#FF6319", # Orange
"IND - Crosstown Line" = "#6CBE45", # Light Green
"BMT - Nassau Street Line" = "#996633", # Brown
"BMT - Canarsie Line" = "#A7A9AC", # Gray
"BMT - Broadway Line" = "#FCCC0A", # Yellow
"Shuttles" = "#808183" # Dark Slate Gray
)
custom_legend_labels <- c(
"IRT - Broadway–Seventh Avenue Line" = "1, 2, 3",
"IRT - Lexington Avenue Line" = "4, 5, 6",
"IRT - Flushing Line" = "Flushing (7)",
"IND - Eighth Avenue Line" = "A, C, E",
"IND - Sixth Avenue Line" = "B, D, F, M",
"IND - Crosstown Line" = "Crosstown (G)",
"BMT - Nassau Street Line" = "J, Z",
"BMT - Canarsie Line" = "L",
"BMT - Broadway Line" = "N, Q, R, W",
"Shuttles" = "S lines"
)
pdf_with_groups <- pdf_with_groups |>
mutate(day_label = ifelse(day_type == 1, "Weekday", "Weekend"))
total_delays_per_group_day <- pdf_with_groups |>
group_by(group, month, day_label) |>
summarize(total_delays = sum(delays, na.rm = TRUE), .groups = "drop")
ggplot(total_delays_per_group_day, aes(x = month, y = total_delays, color = group, group = group)) +
geom_line(linewidth = 1) +
scale_color_manual(values = subway_group_colors, labels = custom_legend_labels) +
labs(
title = "Total Delays Per Month by Subway Group (Weekday vs Weekend)",
x = "Month",
y = "Total Delays",
color = "Subway Group (Lines)"
) +
facet_wrap(~ day_label, ncol = 1, scales = "free_y") + # Facet by Weekday/Weekend
theme_minimal() +
theme(
text = element_text(size = 14),
legend.position = "right",
legend.title = element_text(face = "bold"),
strip.text = element_text(face = "bold", size = 12)
)
```
We notice that there are more delays in the weekdays than on weekends, and we acknowledge the fact that this could be due to the larger number of "days" in the weekdays (5) than in the weekends (2). However, the more interesting pattern to notice here is the sudden spike in the number of delays on weekdays around 2021. This is a timeline around the pandemic, and the full functioning of the subway had not returned. However, the lack of the same pattern on the weekends means there is more to it than just the pandemic, or some extra ripple effect due to the pandemic.
On further digging, we found that ([a lack of workers](https://www.nytimes.com/2021/07/13/nyregion/nyc-subway-delays-worker-shortage.html)) is potentially a big reason why this is occuring. This ties in with the timeline, as well as, is a valid reason why this occured.
### Why are the delays occuring?
It is also important to correlate the same with the reasons mentioned in the dataset, which we perform below.
```{r fig.width=10, echo=FALSE}
total_delays_per_group_category <- pdf_with_groups |>
group_by(group, month, reporting_category) |>
summarize(total_delays = sum(delays, na.rm = TRUE), .groups = "drop")
ggplot(total_delays_per_group_category, aes(x = month, y = total_delays, color = group, group = group)) +
geom_line(linewidth = 0.8) +
scale_color_manual(values = subway_group_colors, labels = custom_legend_labels) +
labs(
title = "Total Delays Per Month by Subway Group (Faceted by Delay Category)",
x = "Month",
y = "Total Delays",
color = "Subway Group (Lines)"
) +
facet_wrap(~ reporting_category, ncol = 2, scales = "free_y") + # Facet by Delay Category
theme_minimal() +
theme(
text = element_text(size = 14),
legend.position = "right",
legend.title = element_text(face = "bold"),
strip.text = element_text(face = "bold", size = 12)
)
```
Here, we clearly see that crew availability was a problem pretty much through the year 2021, however this faceting tells another story about the delays that occurred after 2021 - notice the graphs on Operating Conditions. This indicates that the overall operating conditions of the subway only deteriorated over time on an overall scale. However, there are two more important things to note :
1. The A-C-E line had an initial increase in the total number of delays due to operating conditions and then a subsequent decrease. Similarly, the 4-5-6 lines have had an increase too. This also corresponds the the uptick in the planned ROW work column, where we see that there are more delays in the A-C-E lines to deter these problems.
2. Secondly we see a large number of delays on the 4-5-6 lines due to police and medical reasons, and this can probably be attributed to the ([higher crime rates](https://jknylaw.com/blog/whats-the-safest-borough-in-nyc/)) in Brooklyn and Bronx, both of which boroughs are connected by these lines.
The above graph shows a raw number of the number of delays due to each category, but it is interesting to see a more nuanced approach to the same number.
```{r fig.width=10, echo=FALSE}
# Normalize the total delays by dividing weekdays by 5 and weekends by 2
delays_by_group_category_day_normalized <- pdf_with_groups |>
group_by(group, reporting_category, day_label) |>
summarize(total_delays = sum(delays, na.rm = TRUE), .groups = "drop") |>
mutate(
normalized_delays = ifelse(day_label == "Weekday", total_delays / 5, total_delays / 2)
)
ggplot(delays_by_group_category_day_normalized, aes(x = reporting_category, y = normalized_delays, fill = group)) +
geom_bar(stat = "identity", position = "stack") + # Use 'stack' for normalized counts
scale_fill_manual(values = subway_group_colors, labels = custom_legend_labels) +
labs(
title = "Normalized Total Delays by Subway Group for Each Delay Category (Weekday vs Weekend)",
x = "Delay Category",
y = "Normalized Delays",
fill = "Subway Group (Lines)"
) +
facet_wrap(~ day_label, ncol = 2, scales = "fixed") + # Side-by-side facets with fixed Y-axis scales
theme_minimal() +
theme(
text = element_text(size = 10),
legend.position = "right",
legend.title = element_text(face = "bold"),
strip.text = element_text(face = "bold", size = 12),
axis.text.x = element_text(angle = 45, hjust = 1) # Rotate x-axis labels for readability
)
```
Here, we ensure that we normalize across the number of days in weekdays and weekends but dividing the number of delays on weekdays by 5 and weekends by 2. A standard belief is that the total number of delays on weekends is higher, but clearly, there have been more infrastructure related delays on weekdays, over the last 4 years. This is concerning, since we see in the previous graph that the number of infrasture related delays has been more or less constant over the last 3 years, with 2020 being the all exempt pandemic year.
The other major observation is that the number of delays due to external factors is an extremely small percentage of the total number of delays. This is another major concern, since apart from Police and Medical related issues, all other categories are known to be preventable, or are otherwise under the control of the MTA.
### How many trains are really delayed?
Here we look at the percentage of trains per month that are delayed.
```{r, echo=FALSE}
#| warning: false
mta_service_alerts_filtered <-
mta_service_alerts |>
filter(Agency == "NYCT Subway") |>
mutate(Date = as.Date(as.POSIXlt(Date, format="%m/%d/%Y %I:%M:%S %p", tz="America/New_York")))
subway_train_delays_filtered <-
subway_train_delays |>
filter(month >= '2020-07-01')
mta_service_alerts_filtered_v2 <-
mta_service_alerts_filtered |>
filter(Date >= '2020-07-01')
subway_wait_assessment_filtered <-
subway_wait_assessment |>
filter(month >= '2020-07-01')
monthly_delays <- subway_train_delays_filtered |>
group_by(month, day_type) |>
summarize(delays = sum(delays)) |>
ungroup()
monthly_wait_assessment <- subway_wait_assessment_filtered |>
group_by(month, day_type) |>
summarize(tp_passed = sum(num_timepoints_passing_wait_assessment, na.rm = TRUE),
tp_sched = sum(num_sched_timepoints, na.rm = TRUE)) |>
mutate(`wait assessment` = tp_passed/tp_sched) |>
ungroup()
monthly_service_alerts <- mta_service_alerts_filtered_v2 |>
mutate(month = make_date(year(Date), month(Date), 1),
day_type = ifelse(wday(Date) %in% c(1, 7), 2, 1)) |>
group_by(month, day_type) |>
summarize(alert_count = n()) |>
ungroup()
mta_ridership_new <-
mta_ridership |>
select(station_complex, year, month, day_type, ridership)
monthly_mta_ridership <- mta_ridership_new |>
mutate(month = make_date(year, month, 1)) |>
group_by(month, day_type) |>
summarize(ridership = sum(ridership)) |>
ungroup()
df_list <- list(monthly_delays, monthly_service_alerts, monthly_wait_assessment, monthly_mta_ridership)
joined_df <- df_list |>
reduce(full_join, by=c('month', 'day_type'))
ggplot(joined_df, aes(x = month, y = as.factor(day_type), fill = 1 - `wait assessment`)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
scale_y_discrete(labels = c("1" = "Weekday", "2" = "Weekend")) +
labs(title = "Heatmap of 1 - Wait Assessment by Month and Day Type",
x = "Month",
y = "Day Type",
fill = "1 - Wait Assessment")
```
Almost every month, there is a good percentage of trains that are delayed. This is not a good sign for the subway's reliability. Especially when we notice that the weekday delays are much more frequent than the weekend delays.
Overall, the total number of late trains should be as close to zero as possible, but we do see some months in 2021 which had close to 35% trains delayed. This, especially on weekdays, causes huge problems to all commuters whose primary mode of transport is the subway (which as we saw in the first section, is a majority of New Yorkers).
This also looks like a one off trend for this year, where only the months in 2021 had a large nummber of delays, which could be attributed, as earlier to the article linked previously.
### Are we getting alerts on time?
```{r echo=FALSE}
#| warning: false
library(ggplot2)
library(dplyr)
library(lubridate)
nyc_subway_lines <- c("A", "B", "C", "D", "E", "F", "G", "J", "Z", "L", "M",
"N", "Q", "R", "W", "1", "2", "3", "4", "5", "6", "7", "S")
linewise_service_alerts <- mta_service_alerts |> separate_rows(Affected, sep = "\\|") |>
filter(Affected %in% nyc_subway_lines)
linewise_service_alerts <- linewise_service_alerts |>
mutate(Date = as.Date(Date, format = "%m/%d/%Y %I:%M:%S %p"))
linewise_service_alerts <- linewise_service_alerts |>
left_join(subway_groups, by = c("Affected" = "line"))
linewise_service_alerts <- linewise_service_alerts |>
mutate(Date = as.Date(Date, format = "%m/%d/%Y %I:%M:%S %p"))
linewise_service_alerts <- linewise_service_alerts |>
mutate(DayType = if_else(wday(Date) %in% c(1, 7), "Weekend", "Weekday"))
filtered_alerts <- linewise_service_alerts |>
filter(Agency == "NYCT Subway") |>
mutate(Date = as.Date(Date))
daily_alerts <- filtered_alerts |>
group_by(Date, DayType) |>
summarise(alert_count = n(), .groups = 'drop') # Count alerts by date and day type
daily_alerts <- daily_alerts |>
mutate(
adjusted_alert_count = ifelse(DayType == "Weekday", alert_count / 5, alert_count / 2)
)
ggplot(daily_alerts, aes(x = Date, y = adjusted_alert_count, color = DayType, group = DayType)) +
geom_point(size = 1, alpha = 0.6) +
geom_smooth(se = FALSE, method = "loess", span = 0.3, size = 1.2) +
scale_color_manual(values = c("Weekday" = "#1f77b4", "Weekend" = "#ff7f0e")) +
labs(
title = "Smoothed Normalised Daily Subway Alerts by Day Type",
x = "Date",
y = "Normalised Number of Alerts",
color = "Day Type"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom"
)
```
Clearly, we are getting more alerts on the weekends than on weekdays.
This is opposite to the number of actual delays. This is a major problem, since typically, we expect the number of delays and number of service alerts to follow a similar pattern.
As part of this project, we also revisited the delays definition to ensure that even with an alert, a delay is still being counted in the total number of delays.
#### Side note : Alert data quality and distribution
During our experimentation, we came across many patterns that did not show any real patterns. For example, the alert distribution per line, or the alert distribution by group. However, this graph shows a fairly interesting characteristic of the alerts.
```{r, echo=FALSE}
normalized_data <- linewise_service_alerts |>
group_by(`Status Label`, DayType) |>
summarise(Count = n(), .groups = "drop") |>
mutate(NormalizedCount = if_else(DayType == "Weekday", Count / 5, Count / 2))
status_summary <- normalized_data |>
group_by(`Status Label`) |>
summarise(TotalNormalizedCount = sum(NormalizedCount), .groups = "drop") |>
arrange(desc(TotalNormalizedCount)) |>
slice_head(n = 10)
top_status_labels <- status_summary$`Status Label`
filtered_data <- normalized_data |>
filter(`Status Label` %in% top_status_labels) |>
mutate(`Status Label` = factor(`Status Label`, levels = top_status_labels))
ggplot(filtered_data, aes(x = `Status Label`, y = NormalizedCount, fill = DayType)) +
geom_bar(stat = "identity", position = "stack") +
labs(
title = "Top 10 Status Labels by Normalized Number of Service Alerts",
x = "Status Label",
y = "Normalized Number of Alerts",
fill = "Day Type"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5, size = 14)
)
```
In this graph, the interesting insights are mainly around the distribution of alerts across weekday / weekend. We see that in every category, the average number of alerts per weekday is the same as the average number of alerts per weekend. This shows that there are no specific alerts (apart from weekday and weekend service). This is interesting because there are different patterns for both delays and alerts for every category, in the delays as we saw before, but not in the alerts. The overall number is more or less equal.
Another interesting insight is around the fact that there are a small number of alerts for "weekday service" on weekends and vice versa. This could be an issue with the data quality, or maybe there is a underlying reason why this occurs. We could contact the data source for an explanation on this. But Nonetheless, the presence of these rows is pretty interesting.
## Wait times
Let us take a look at the percentage of delays in the subway system over the same time period.
```{r fig.width=12, echo=FALSE}
mta_service_alerts_filtered <-
mta_service_alerts |>
filter(Agency == "NYCT Subway") |>
mutate(Date = as.Date(as.POSIXlt(Date, format="%m/%d/%Y %I:%M:%S %p", tz="America/New_York")))
mta_ridership_new <-
mta_ridership |>
select(station_complex, year, month, day_type, ridership)
subway_wait_assessment_with_groups <- subway_wait_assessment |>
left_join(subway_groups, by = "line")
wait_assessment_groupwise <- subway_wait_assessment_with_groups |>
group_by(group, month) |>
summarize(tp_passed = sum(num_timepoints_passing_wait_assessment, na.rm = TRUE),
tp_sched = sum(num_sched_timepoints, na.rm = TRUE),
.groups = "drop") |>
mutate(`wait assessment` = (1 - tp_passed/tp_sched)*100) |>
ungroup() |>
drop_na(group)
ggplot(wait_assessment_groupwise, aes(x = month, y = `wait assessment`, color = group, group = group)) +
geom_line(size = 1) +
scale_x_date(limits = c(as.Date("2020-01-01"), as.Date("2024-10-01")),
date_breaks = "2 months", date_labels = "%b %Y") +
scale_color_manual(values = subway_group_colors, labels = custom_legend_labels) +
labs(
title = "Monthly Wait Assessment by Subway Group",
x = "Month",
y = "% timepoints with delays",
color = "Subway Group (Lines)"
) +
theme(
text = element_text(size = 14),
legend.position = "right",
legend.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1)
)
```
Here, the first spike we notice is during the Covid 19 pandemic in 2020. But the most interesting pattern is the fact that the S line had a steady amount of delays throughout the 3 years, but suddenly in April 2024, we see a sudden spike in the wait times. No other line has such a drastic change in delays anywhere else. ([This](https://www.cbsnews.com/newyork/news/new-york-city-subway-shuttle-cellphone-service-wi-fi/#:~:text=The%20line%2C%20which%20has%20been,browse%20the%20internet%20without%20interruption.)) is the only article specific to the S train in 2024. This can be a reason why this pattern exists.
## Total Ridership analysis
Here, the ridership of particular lines is hard to gauge since the nyc subway has multiple crossover points, and it is hard to note which line a particular person got onto from the subway.
So, we see the total number of riders from every subway station.
```{r fig.width=8, echo=FALSE}
#| warning: false
library(ggplot2)
library(dplyr)
summarized_data <- mta_ridership |>
group_by(station_complex, day_type) |>
summarize(total_ridership = sum(ridership, na.rm = TRUE)) |>
mutate(adjusted_ridership = ifelse(day_type == 1, total_ridership / 5, total_ridership / 2)) |>
ungroup()
top_stations <- summarized_data |>
group_by(station_complex) |>
summarize(total_ridership_all = sum(total_ridership, na.rm = TRUE)) |>
arrange(desc(total_ridership_all)) |>
slice_head(n = 20) |>
select(station_complex)
plot_data <- summarized_data |>
filter(station_complex %in% top_stations$station_complex)
ggplot(plot_data, aes(x = adjusted_ridership, y = reorder(station_complex, total_ridership), color = as.factor(day_type))) +
geom_point(size = 3) +
labs(
title = "MTA Ridership Cleveland Dot Plot (Top 20 Stations)",
x = "Normalised Ridership",
y = "Station Complex",
color = "Day Type"
) +
scale_color_manual(
values = c("2" = "cornflowerblue", "1" = "orange"), # Assign colors to weekday and weekend
labels = c("Weekday", "Weekend") # Change legend labels
) +
scale_x_continuous(
expand = c(0.1, 0.1), # Add padding to both sides of the x-axis (10%)
limits = c(0, max(plot_data$adjusted_ridership) * 1.2) # Extend x-axis to 20% beyond max value
) +
theme_minimal()
```
Here, we see that there is a higher number of riders every weekday than every weekend. Most of these are stations that we expect to be here.
Times Square being the most populous station is quite expected, and since the grand central station connects the west side and east side, we can explain it being up there, along with all the tourists who would want to visit the station.
There is also a small mislabelling in the dataset where it notes 14th street (1,2,3) and (A,C,E) Penn station for the same pairs of station are separate. This can probably be attributed to the fact that these stations have seperate entrances multiple streets apart that are closer to the respective lines, however there is a link between these lines that is navigable without exits. This should ideally put the stations in the same bucket, but the MTA data collection team has not done this.
### Ridership - Going one step deeper
Do all stations get a similar influx of passengers through the year? Are there stations that get a higher influx on certain weekdays?
```{r fig.width=10, echo=FALSE}
stations <- unique(top_n(mta_ridership_new, 150, ridership)$station_complex)
scale_val_2 <- 1000000
mta_stationwise_ridership_wd <- mta_ridership_new |>
mutate(month = make_date(year, month, 1)) |>
filter(day_type == 1, station_complex %in% stations)
# for lengend
station_order <- mta_stationwise_ridership_wd |>
group_by(station_complex) |>
summarize(total_ridership = sum(ridership, na.rm = TRUE)) |>
arrange(desc(total_ridership)) |>
pull(station_complex) # Extract the station names in sorted order
mta_stationwise_ridership_wd <- mta_stationwise_ridership_wd |>
mutate(station_complex = factor(station_complex, levels = station_order))
mta_stationwise_ridership_wd |>
ggplot(aes(x = month, y = ridership / scale_val_2, color = station_complex)) +
geom_line(size = 1) +
scale_x_date(
limits = c(as.Date("2020-07-01"), as.Date("2024-10-01")),
date_breaks = "2 months",
date_labels = "%b %Y"
) +
scale_color_viridis_d(option = "C", guide = guide_legend(reverse = FALSE)) +
labs(
x = "Month",
y = "Ridership (in millions)",
color = "Station"
) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.title = element_text(face = "bold", size = 10),
legend.text = element_text(size = 9),
legend.key.width = unit(1, "cm"),
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 12),
panel.grid.major = element_line(color = "gray90"),
panel.grid.minor = element_blank()
) +
labs(
title = "MTA Ridership Trends for Top 5 Stations",
subtitle = "Weekday ridership over time (2020–2024)",
x = "Month",
y = "Ridership (in millions)",
color = "Station"
)
```
Here, we notice that although the graphs for each of the stations is not exactly the same, it follows a very similar pattern. The existence of this patterns means that there are no line wise / area wise separate trends on the subway system on weekdays , all the stations get a higher influx of passengers at similar times in the year, and similarly the troughs in this trend also correspond to one another. Thus we say that we have similar patterns throughout.
However, we see that the overall ridership trends upwards, with time, since the pandemic. We see that there has been an increase in the total number of passengers overall. We can take a look at this along with the weekend plot in the interactive plot.
In a way, this is a sign of good subway design, since the stations that we see on this list follow more or less a similar order in terms of their size as well.