results.qmd

# Exploring the Subway

This projects aims at exploring the different data around the New York subway system that is run by the MTA. The datasets here have been used from the nyc open data repository. Here, we explore the various trends with the data in this respect. 

```{r, echo=FALSE}
#| warning: false

# Load libraries
library(tidyverse)
library(dplyr)
library(lubridate)
library(stringr)
library(GGally)

# Load datasets
subway_train_delays <- read_csv("data/MTA_Subway_Trains_Delayed__Beginning_2020_20241118.csv", show_col_types = FALSE)
mta_service_alerts <- read_csv("data/MTA_Service_Alerts__Beginning_April_2020_20241118.csv", show_col_types = FALSE)
subway_wait_assessment <- read_csv("data/MTA_Subway_Wait_Assessment__Beginning_2020_20241119.csv", show_col_types = FALSE)
mta_ridership <- read_csv("data/MTA_Subway_Hourly_Ridership_Data_Modified_v2.csv", show_col_types = FALSE)
all_ridership <- read.csv('data/MTA_Daily_Ridership_Data__Beginning_2020_20241120.csv')


```


## Why the subway?

First, it is important to visit why we choose the subway for this project. The subway is the most used mode of public transport in the city.  

```{r, echo=FALSE}
# Boxplot: Distribution of total daily ridership for Subways, Buses, LIRR, and Metro-North
all_ridership_long <- all_ridership |>
  select(Date, 
         `Subways..Total.Estimated.Ridership`, 
         `Buses..Total.Estimated.Ridership`, 
         `LIRR..Total.Estimated.Ridership`, 
         `Metro.North..Total.Estimated.Ridership`) |>
  pivot_longer(cols = -Date, names_to = "Transport_Mode", values_to = "Ridership")

ggplot(all_ridership_long, aes(x = Transport_Mode, y = Ridership, fill = Transport_Mode)) +
  geom_boxplot() +
  labs(title = "Daily Ridership by Transport Mode", 
       x="",
       y = "Daily Ridership (in millions)") +
  scale_y_continuous(
    labels = function(x) x / 1e6  # Divide values by 1,000,000 
  ) +
  theme(axis.text.x = element_blank(),
       axis.ticks = element_blank()) 
  
```
Here, we see clearly that the median daily ridership for the subway is higher than even the peak ridership for any other mode of transport in the city. This speaks volumes about how important the subway system is, to the city. We see everyday the number of people on the subway, but this graph tells us how much more the number really is. 

This makes so that the subway system is essentially a lifeline for the city itself, and needs to be as effective as possible to ensure the best possible effects for the citizens of the city (New Yorkers - us). 

## Delays in the subway

Let's take a look at the delays in the subway system in the city.

### When are the delays occuring? 

Here's a graph faceted over the delays over the weekdays and weekends in the NYC subway system.

```{r fig.width=10, echo=FALSE}
#| warning: false

subway_groups <- data.frame(
  line = c("1", "2", "3", "4", "5", "6", "7", "A", "C", "E", "B", "D", "F", "M", "G", "J", "Z", "JZ", "L", "N", "Q", "R", "W", "S 42nd", "S Fkln", "S Rock"),
  group = c(
    "IRT - Broadway–Seventh Avenue Line",  # 1, 2, 3
    "IRT - Broadway–Seventh Avenue Line", 
    "IRT - Broadway–Seventh Avenue Line", 
    "IRT - Lexington Avenue Line",         # 4, 5, 6
    "IRT - Lexington Avenue Line", 
    "IRT - Lexington Avenue Line", 
    "IRT - Flushing Line",                 # 7
    "IND - Eighth Avenue Line",            # A, C, E
    "IND - Eighth Avenue Line", 
    "IND - Eighth Avenue Line", 
    "IND - Sixth Avenue Line",             # B, D, F, M
    "IND - Sixth Avenue Line", 
    "IND - Sixth Avenue Line", 
    "IND - Sixth Avenue Line", 
    "IND - Crosstown Line",                # G
    "BMT - Nassau Street Line",            # J, Z
    "BMT - Nassau Street Line", 
    "BMT - Nassau Street Line",
    "BMT - Canarsie Line",                 # L
    "BMT - Broadway Line",                 # N, Q, R, W
    "BMT - Broadway Line", 
    "BMT - Broadway Line", 
    "BMT - Broadway Line",
    "Shuttles",                            # All S lines
    "Shuttles",
    "Shuttles"
  )
)

delays_category <- subway_train_delays[,!names(subway_train_delays) %in% c("subcategory")]
pdf<- delays_category |> group_by(line, reporting_category, month, day_type) |> summarise(delays = sum(delays))


pdf_with_groups <- pdf |>
  left_join(subway_groups, by = "line")

total_delays_per_group <- pdf_with_groups |>
  group_by(group, month) |>
  summarize(total_delays = sum(delays, na.rm = TRUE), .groups = "drop")

# custom colors based on NYC Subway colors
subway_group_colors <- c(
  "IRT - Broadway–Seventh Avenue Line" = "#EE352E",# Red
  "IRT - Lexington Avenue Line" = "#00933C",       # Green
  "IRT - Flushing Line" = "#B933AD",               # Purple
  "IND - Eighth Avenue Line" = "#0039A6",          # Blue
  "IND - Sixth Avenue Line" = "#FF6319",           # Orange
  "IND - Crosstown Line" = "#6CBE45",              # Light Green
  "BMT - Nassau Street Line" = "#996633",          # Brown
  "BMT - Canarsie Line" = "#A7A9AC",               # Gray
  "BMT - Broadway Line" = "#FCCC0A",               # Yellow
  "Shuttles" = "#808183"                           # Dark Slate Gray
)


custom_legend_labels <- c(
  "IRT - Broadway–Seventh Avenue Line" = "1, 2, 3",
  "IRT - Lexington Avenue Line" = "4, 5, 6",
  "IRT - Flushing Line" = "Flushing (7)",
  "IND - Eighth Avenue Line" = "A, C, E",
  "IND - Sixth Avenue Line" = "B, D, F, M",
  "IND - Crosstown Line" = "Crosstown (G)",
  "BMT - Nassau Street Line" = "J, Z",
  "BMT - Canarsie Line" = "L",
  "BMT - Broadway Line" = "N, Q, R, W",
  "Shuttles" = "S lines"
)


pdf_with_groups <- pdf_with_groups |>
  mutate(day_label = ifelse(day_type == 1, "Weekday", "Weekend"))

total_delays_per_group_day <- pdf_with_groups |>
  group_by(group, month, day_label) |>
  summarize(total_delays = sum(delays, na.rm = TRUE), .groups = "drop")

ggplot(total_delays_per_group_day, aes(x = month, y = total_delays, color = group, group = group)) +
  geom_line(linewidth = 1) +
  
  scale_color_manual(values = subway_group_colors, labels = custom_legend_labels) +
  labs(
    title = "Total Delays Per Month by Subway Group (Weekday vs Weekend)",
    x = "Month",
    y = "Total Delays",
    color = "Subway Group (Lines)"
  ) +
  facet_wrap(~ day_label, ncol = 1, scales = "free_y") + # Facet by Weekday/Weekend
  theme_minimal() +
  theme(
    text = element_text(size = 14),
    legend.position = "right",
    legend.title = element_text(face = "bold"),
    strip.text = element_text(face = "bold", size = 12)
  )

``` 


We notice that there are more delays in the weekdays than on weekends, and we acknowledge the fact that this could be due to the larger number of "days" in the weekdays (5) than in the weekends (2). However, the more interesting pattern to notice here is the sudden spike in the number of delays on weekdays around 2021. This is a timeline around the pandemic, and the full functioning of the subway had not returned. However, the lack of the same pattern on the weekends means there is more to it than just the pandemic, or some extra ripple effect due to the pandemic.

On further digging, we found that ([a lack of workers](https://www.nytimes.com/2021/07/13/nyregion/nyc-subway-delays-worker-shortage.html)) is potentially a big reason why this is occuring. This ties in with the timeline, as well as, is a valid reason why this occured.

### Why are the delays occuring?

It is also important to correlate the same with the reasons mentioned in the dataset, which we perform below. 

```{r fig.width=10, echo=FALSE}
total_delays_per_group_category <- pdf_with_groups |>
  group_by(group, month, reporting_category) |>
  summarize(total_delays = sum(delays, na.rm = TRUE), .groups = "drop")

ggplot(total_delays_per_group_category, aes(x = month, y = total_delays, color = group, group = group)) +
  geom_line(linewidth = 0.8) +
  
  scale_color_manual(values = subway_group_colors, labels = custom_legend_labels) +
  labs(
    title = "Total Delays Per Month by Subway Group (Faceted by Delay Category)",
    x = "Month",
    y = "Total Delays",
    color = "Subway Group (Lines)"
  ) +
  facet_wrap(~ reporting_category, ncol = 2, scales = "free_y") + # Facet by Delay Category
  theme_minimal() +
  theme(
    text = element_text(size = 14),
    legend.position = "right",
    legend.title = element_text(face = "bold"),
    strip.text = element_text(face = "bold", size = 12)
  )

```

Here, we clearly see that crew availability was a problem pretty much through the year 2021, however this faceting tells another story about the delays that occurred after 2021 - notice the graphs on Operating Conditions. This indicates that the overall operating conditions of the subway only deteriorated over time on an overall scale. However, there are two more important things to note : 

1. The A-C-E line had an initial increase in the total number of delays due to operating conditions and then a subsequent decrease. Similarly, the 4-5-6 lines have had an increase too. This also corresponds the the uptick in the planned ROW work column, where we see that there are more delays in the A-C-E lines to deter these problems. 

2. Secondly we see a large number of delays on the 4-5-6 lines due to police and medical reasons, and this can probably be attributed to the ([higher crime rates](https://jknylaw.com/blog/whats-the-safest-borough-in-nyc/)) in Brooklyn and Bronx, both of which boroughs are connected by these lines. 


The above graph shows a raw number of the number of delays due to each category, but it is interesting to see a more nuanced approach to the same number.

```{r fig.width=10, echo=FALSE}
# Normalize the total delays by dividing weekdays by 5 and weekends by 2
delays_by_group_category_day_normalized <- pdf_with_groups |>
  group_by(group, reporting_category, day_label) |>
  summarize(total_delays = sum(delays, na.rm = TRUE), .groups = "drop") |>
  mutate(
    normalized_delays = ifelse(day_label == "Weekday", total_delays / 5, total_delays / 2)
  )

ggplot(delays_by_group_category_day_normalized, aes(x = reporting_category, y = normalized_delays, fill = group)) +
  geom_bar(stat = "identity", position = "stack") +  # Use 'stack' for normalized counts
  scale_fill_manual(values = subway_group_colors, labels = custom_legend_labels) +
  labs(
    title = "Normalized Total Delays by Subway Group for Each Delay Category (Weekday vs Weekend)",
    x = "Delay Category",
    y = "Normalized Delays",
    fill = "Subway Group (Lines)"
  ) +
  facet_wrap(~ day_label, ncol = 2, scales = "fixed") + # Side-by-side facets with fixed Y-axis scales
  theme_minimal() +
  theme(
    text = element_text(size = 10),
    legend.position = "right",
    legend.title = element_text(face = "bold"),
    strip.text = element_text(face = "bold", size = 12),
    axis.text.x = element_text(angle = 45, hjust = 1) # Rotate x-axis labels for readability
  )

```

Here, we ensure that we normalize across the number of days in weekdays and weekends but dividing the number of delays on weekdays by 5 and weekends by 2. A standard belief is that the total number of delays on weekends is higher, but clearly, there have been more infrastructure related delays on weekdays, over the last 4 years. This is concerning, since we see in the previous graph that the number of infrasture related delays has been more or less constant over the last 3 years, with 2020 being the all exempt pandemic year. 

The other major observation is that the number of delays due to external factors is an extremely small percentage of the total number of delays. This is another major concern, since apart from Police and Medical related issues, all other categories are known to be preventable, or are otherwise under the control of the MTA.


### How many trains are really delayed?

Here we look at the percentage of trains per month that are delayed. 


```{r, echo=FALSE}
#| warning: false
mta_service_alerts_filtered <- 
  mta_service_alerts |> 
  filter(Agency == "NYCT Subway") |> 
  mutate(Date = as.Date(as.POSIXlt(Date, format="%m/%d/%Y %I:%M:%S %p", tz="America/New_York")))

subway_train_delays_filtered <- 
  subway_train_delays |> 
  filter(month >= '2020-07-01')
mta_service_alerts_filtered_v2 <- 
  mta_service_alerts_filtered |> 
  filter(Date >= '2020-07-01')
subway_wait_assessment_filtered <-
  subway_wait_assessment |>
  filter(month >= '2020-07-01')

monthly_delays <- subway_train_delays_filtered |> 
  group_by(month, day_type) |>
  summarize(delays = sum(delays)) |>
  ungroup()

monthly_wait_assessment <- subway_wait_assessment_filtered |>
  group_by(month, day_type) |>
  summarize(tp_passed = sum(num_timepoints_passing_wait_assessment, na.rm = TRUE),
            tp_sched = sum(num_sched_timepoints, na.rm = TRUE)) |>
  mutate(`wait assessment` = tp_passed/tp_sched) |>
  ungroup()

monthly_service_alerts <- mta_service_alerts_filtered_v2 |> 
  mutate(month = make_date(year(Date), month(Date), 1), 
         day_type = ifelse(wday(Date) %in% c(1, 7), 2, 1)) |>
  group_by(month, day_type) |>
  summarize(alert_count = n()) |>
  ungroup()

mta_ridership_new <- 
  mta_ridership |>
  select(station_complex, year, month, day_type, ridership)

monthly_mta_ridership <- mta_ridership_new |>
  mutate(month = make_date(year, month, 1)) |>
  group_by(month, day_type) |>
  summarize(ridership = sum(ridership)) |>
  ungroup()

df_list <- list(monthly_delays, monthly_service_alerts, monthly_wait_assessment, monthly_mta_ridership)
joined_df <- df_list |> 
  reduce(full_join, by=c('month', 'day_type'))


ggplot(joined_df, aes(x = month, y = as.factor(day_type), fill = 1 - `wait assessment`)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  scale_y_discrete(labels = c("1" = "Weekday", "2" = "Weekend")) +
  labs(title = "Heatmap of 1 - Wait Assessment by Month and Day Type",
       x = "Month",
       y = "Day Type",
       fill = "1 - Wait Assessment")

```
Almost every month, there is a good percentage of trains that are delayed. This is not a good sign for the subway's reliability. Especially when we notice that the weekday delays are much more frequent than the weekend delays. 

Overall, the total number of late trains should be as close to zero as possible, but we do see some months in 2021 which had close to 35% trains delayed. This, especially on weekdays, causes huge problems to all commuters whose primary mode of transport is the subway (which as we saw in the first section, is a majority of New Yorkers).

This also looks like a one off trend for this year, where only the months in 2021 had a large nummber of delays, which could be attributed, as earlier to the article linked previously. 

### Are we getting alerts on time? 
```{r echo=FALSE}
#| warning: false
library(ggplot2)
library(dplyr)
library(lubridate)

nyc_subway_lines <- c("A", "B", "C", "D", "E", "F", "G", "J", "Z", "L", "M", 
                      "N", "Q", "R", "W", "1", "2", "3", "4", "5", "6", "7", "S")

linewise_service_alerts <- mta_service_alerts |> separate_rows(Affected, sep = "\\|") |> 
  filter(Affected %in% nyc_subway_lines)
linewise_service_alerts <- linewise_service_alerts |>
  mutate(Date = as.Date(Date, format = "%m/%d/%Y %I:%M:%S %p"))
linewise_service_alerts <- linewise_service_alerts |>
  left_join(subway_groups, by = c("Affected" = "line"))

linewise_service_alerts <- linewise_service_alerts |>
  mutate(Date = as.Date(Date, format = "%m/%d/%Y %I:%M:%S %p"))

linewise_service_alerts <- linewise_service_alerts |>
  mutate(DayType = if_else(wday(Date) %in% c(1, 7), "Weekend", "Weekday"))

filtered_alerts <- linewise_service_alerts |>
  filter(Agency == "NYCT Subway") |>
  mutate(Date = as.Date(Date))  

daily_alerts <- filtered_alerts |>
  group_by(Date, DayType) |>
  summarise(alert_count = n(), .groups = 'drop')  # Count alerts by date and day type

daily_alerts <- daily_alerts |>
  mutate(
    adjusted_alert_count = ifelse(DayType == "Weekday", alert_count / 5, alert_count / 2)
  )

ggplot(daily_alerts, aes(x = Date, y = adjusted_alert_count, color = DayType, group = DayType)) +
  geom_point(size = 1, alpha = 0.6) +  
  geom_smooth(se = FALSE, method = "loess", span = 0.3, size = 1.2) +  
  scale_color_manual(values = c("Weekday" = "#1f77b4", "Weekend" = "#ff7f0e")) +
  labs(
    title = "Smoothed Normalised Daily Subway Alerts by Day Type",
    x = "Date",
    y = "Normalised Number of Alerts",
    color = "Day Type"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  )

```
Clearly, we are getting more alerts on the weekends than on weekdays. 
This is opposite to the number of actual delays. This is a major problem, since typically, we expect the number of delays and number of service alerts to follow a similar pattern. 

As part of this project, we also revisited the delays definition to ensure that even with an alert, a delay is still being counted in the total number of delays. 


#### Side note : Alert data quality and distribution

During our experimentation, we came across many patterns that did not show any real patterns. For example, the alert distribution per line, or the alert distribution by group. However, this graph shows a fairly interesting characteristic of the alerts. 


```{r, echo=FALSE}

normalized_data <- linewise_service_alerts |>
  group_by(`Status Label`, DayType) |>
  summarise(Count = n(), .groups = "drop") |>
  mutate(NormalizedCount = if_else(DayType == "Weekday", Count / 5, Count / 2))

status_summary <- normalized_data |>
  group_by(`Status Label`) |>
  summarise(TotalNormalizedCount = sum(NormalizedCount), .groups = "drop") |>
  arrange(desc(TotalNormalizedCount)) |>
  slice_head(n = 10)

top_status_labels <- status_summary$`Status Label`

filtered_data <- normalized_data |>
  filter(`Status Label` %in% top_status_labels) |>
  mutate(`Status Label` = factor(`Status Label`, levels = top_status_labels))

ggplot(filtered_data, aes(x = `Status Label`, y = NormalizedCount, fill = DayType)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(
    title = "Top 10 Status Labels by Normalized Number of Service Alerts",
    x = "Status Label",
    y = "Normalized Number of Alerts",
    fill = "Day Type"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 14)
  )

```

In this graph, the interesting insights are mainly around the distribution of alerts across weekday / weekend. We see that in every category, the average number of alerts per weekday is the same as the average number of alerts per weekend. This shows that there are no specific alerts (apart from weekday and weekend service). This is interesting because there are different patterns for both delays and alerts for every category, in the delays as we saw before, but not in the alerts. The overall number is more or less equal. 

Another interesting insight is around the fact that there are a small number of alerts for "weekday service" on weekends and vice versa. This could be an issue with the data quality, or maybe there is a underlying reason why this occurs. We could contact the data source for an explanation on this. But Nonetheless, the presence of these rows is pretty interesting. 


## Wait times

Let us take a look at the percentage of delays in the subway system over the same time period. 

```{r fig.width=12, echo=FALSE}
mta_service_alerts_filtered <- 
  mta_service_alerts |> 
  filter(Agency == "NYCT Subway") |> 
  mutate(Date = as.Date(as.POSIXlt(Date, format="%m/%d/%Y %I:%M:%S %p", tz="America/New_York")))

mta_ridership_new <- 
  mta_ridership |>
  select(station_complex, year, month, day_type, ridership)

subway_wait_assessment_with_groups <- subway_wait_assessment |>
  left_join(subway_groups, by = "line")

wait_assessment_groupwise <- subway_wait_assessment_with_groups |>
  group_by(group, month) |>
  summarize(tp_passed = sum(num_timepoints_passing_wait_assessment, na.rm = TRUE),
            tp_sched = sum(num_sched_timepoints, na.rm = TRUE),
            .groups = "drop") |>
  mutate(`wait assessment` = (1 - tp_passed/tp_sched)*100) |>
  ungroup() |>
  drop_na(group)

ggplot(wait_assessment_groupwise, aes(x = month, y = `wait assessment`, color = group, group = group)) +
  geom_line(size = 1) +
  scale_x_date(limits = c(as.Date("2020-01-01"), as.Date("2024-10-01")), 
               date_breaks = "2 months", date_labels = "%b %Y") + 
  scale_color_manual(values = subway_group_colors, labels = custom_legend_labels) +
  labs(
    title = "Monthly Wait Assessment by Subway Group",
    x = "Month",
    y = "% timepoints with delays",
    color = "Subway Group (Lines)"
  ) +
  theme(
    text = element_text(size = 14),
    legend.position = "right",
    legend.title = element_text(face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
```
Here, the first spike we notice is during the Covid 19 pandemic in 2020. But the most interesting pattern is the fact that the S line had a steady amount of delays throughout the 3 years, but suddenly in April 2024, we see a sudden spike in the wait times. No other line has such a drastic change in delays anywhere else. ([This](https://www.cbsnews.com/newyork/news/new-york-city-subway-shuttle-cellphone-service-wi-fi/#:~:text=The%20line%2C%20which%20has%20been,browse%20the%20internet%20without%20interruption.)) is the only article specific to the S train in 2024. This can be a reason why this pattern exists. 

## Total Ridership analysis

Here, the ridership of particular lines is hard to gauge since the nyc subway has multiple crossover points, and it is hard to note which line a particular person got onto from the subway. 

So, we see the total number of riders from every subway station.

```{r fig.width=8, echo=FALSE}
#| warning: false
library(ggplot2)
library(dplyr)

summarized_data <- mta_ridership |>
  group_by(station_complex, day_type) |>
  summarize(total_ridership = sum(ridership, na.rm = TRUE)) |>
  mutate(adjusted_ridership = ifelse(day_type == 1, total_ridership / 5, total_ridership / 2)) |>
  ungroup()

top_stations <- summarized_data |>
  group_by(station_complex) |>
  summarize(total_ridership_all = sum(total_ridership, na.rm = TRUE)) |>
  arrange(desc(total_ridership_all)) |>
  slice_head(n = 20) |>
  select(station_complex)

plot_data <- summarized_data |>
  filter(station_complex %in% top_stations$station_complex)

ggplot(plot_data, aes(x = adjusted_ridership, y = reorder(station_complex, total_ridership), color = as.factor(day_type))) +
  geom_point(size = 3) +
  labs(
    title = "MTA Ridership Cleveland Dot Plot (Top 20 Stations)",
    x = "Normalised Ridership",
    y = "Station Complex",
    color = "Day Type"
  ) + 
  scale_color_manual(
    values = c("2" = "cornflowerblue", "1" = "orange"), # Assign colors to weekday and weekend
    labels = c("Weekday", "Weekend")       # Change legend labels
  ) +
  scale_x_continuous(
    expand = c(0.1, 0.1),  # Add padding to both sides of the x-axis (10%)
    limits = c(0, max(plot_data$adjusted_ridership) * 1.2)  # Extend x-axis to 20% beyond max value
  ) +
  theme_minimal()


```

Here, we see that there is a higher number of riders every weekday than every weekend. Most of these are stations that we expect to be here.

Times Square being the most populous station is quite expected, and since the grand central station connects the west side and east side, we can explain it being up there, along with all the tourists who would want to visit the station. 

There is also a small mislabelling in the dataset where it notes 14th street (1,2,3) and (A,C,E) Penn station for the same pairs of station are separate. This can probably be attributed to the fact that these stations have seperate entrances multiple streets apart that are closer to the respective lines, however there is a link between these lines that is navigable without exits. This should ideally put the stations in the same bucket, but the MTA data collection team has not done this. 

### Ridership - Going one step deeper

Do all stations get a similar influx of passengers through the year? Are there stations that get a higher influx on certain weekdays?


```{r fig.width=10, echo=FALSE}
stations <- unique(top_n(mta_ridership_new, 150, ridership)$station_complex)
scale_val_2 <- 1000000
mta_stationwise_ridership_wd <- mta_ridership_new |>
  mutate(month = make_date(year, month, 1)) |>
  filter(day_type == 1, station_complex %in% stations)

# for lengend
station_order <- mta_stationwise_ridership_wd |>
  group_by(station_complex) |>
  summarize(total_ridership = sum(ridership, na.rm = TRUE)) |>
  arrange(desc(total_ridership)) |>
  pull(station_complex)  # Extract the station names in sorted order

mta_stationwise_ridership_wd <- mta_stationwise_ridership_wd |>
  mutate(station_complex = factor(station_complex, levels = station_order))

mta_stationwise_ridership_wd |>
  ggplot(aes(x = month, y = ridership / scale_val_2, color = station_complex)) +
  geom_line(size = 1) +
  scale_x_date(
    limits = c(as.Date("2020-07-01"), as.Date("2024-10-01")), 
    date_breaks = "2 months", 
    date_labels = "%b %Y"
  ) +
  scale_color_viridis_d(option = "C", guide = guide_legend(reverse = FALSE)) +
  labs(
    x = "Month",
    y = "Ridership (in millions)",
    color = "Station"
  ) +
  theme_minimal(base_size = 12) + 
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.title = element_text(face = "bold", size = 10),
    legend.text = element_text(size = 9),
    legend.key.width = unit(1, "cm"), 
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 12),
    panel.grid.major = element_line(color = "gray90"),  
    panel.grid.minor = element_blank()
  ) +
  labs(
    title = "MTA Ridership Trends for Top 5 Stations",
    subtitle = "Weekday ridership over time (2020–2024)",
    x = "Month",
    y = "Ridership (in millions)",
    color = "Station"
  )

```

Here, we notice that although the graphs for each of the stations is not exactly the same, it follows a very similar pattern. The existence of this patterns means that there are no line wise / area wise separate trends on the subway system on weekdays , all the stations get a higher influx of passengers at similar times in the year, and similarly the troughs in this trend also correspond to one another. Thus we say that we have similar patterns throughout.

However, we see that the overall ridership trends upwards, with time, since the pandemic. We see that there has been an increase in the total number of passengers overall. We can take a look at this along with the weekend plot in the interactive plot.

In a way, this is a sign of good subway design, since the stations that we see on this list follow more or less a similar order in terms of their size as well.