Spatial Analysis of Taxi GPS Data (2019)

Overview

Global Positioning System (GPS) data is a valuable source of information in transportation, urban planning, and logistics. In the Philippines, several transport companies and organizations use GPS to optimize their operational policies and improve revenue and resource use. In government, GPS has been pivotal in improving key services, particularly public transportation.

Project Summary

This data was collected by LTFRB through its mobile big data partners in the telco sector. It covers March 2018 and 2019, across all 24 hours of the day. The objective of this project is to understand and analyze the behavior of commuters who use taxis as a mode of transportation. The project also recommends policies that can help improve the commuter experience.

A. Contents

Temporal Coverage
- Using Python, I analyze the coverage of the data, such as logs per taxi, and the corresponding trends, such as daily average ridership.
Stay Point Identification
- Using Python, I conducted a spatial grouping. For a GPS point to belong to a group, the taxi must stay for a significant amount of time without moving beyond a specified distance.
Spatial Clustering
- After identifying the stay points, I conducted spatial clustering using DBSCAN and hierarchical clustering. From these two machine learning algorithms, I derived two clustering results that inform the recommendations.
Recommendation
- In this part, I recommend practical mobility management strategies for shifting commuters from taxis to public transportation.

B. Data Structure

I. Temporal Coverage

Fig 1. GPS logs per User

Fig 2. Coverage and Logs per Day

The daily GPS logs in this data are recorded at an average interval of 2 minutes, so a single log does not correspond to an individual ride. The user logs show that the number of records differs per user, because the logging interval is not the same for every user.
Using spatial filtering, I restricted the coverage to Metro Manila, with allowances for its adjacent provinces: Bulacan in the north, Rizal in the east, and Cavite in the south. Logs that reach as far north as Clark or as far south as Laguna are clipped.
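The clipping step can be sketched as a simple bounding-box filter. This is a minimal illustration and not the notebook's actual code: the `BOUNDS` values, the `within_bounds` and `clip_logs` names, and the `lat`/`lon` keys are all assumptions chosen for the example, and the box limits are only rough figures around Metro Manila.

```python
# Hypothetical bounding box around Metro Manila and nearby provinces.
# These limits are illustrative assumptions, not the project's values.
BOUNDS = {"min_lat": 14.2, "max_lat": 14.9, "min_lon": 120.8, "max_lon": 121.3}


def within_bounds(lat, lon, bounds=BOUNDS):
    """Return True if the point lies inside the bounding box."""
    return (bounds["min_lat"] <= lat <= bounds["max_lat"]
            and bounds["min_lon"] <= lon <= bounds["max_lon"])


def clip_logs(logs, bounds=BOUNDS):
    """Keep only the logs (dicts with 'lat' and 'lon') inside the box."""
    return [log for log in logs if within_bounds(log["lat"], log["lon"], bounds)]


logs = [
    {"taxi": "A", "lat": 14.5995, "lon": 120.9842},  # approximate: Manila
    {"taxi": "B", "lat": 15.1860, "lon": 120.5600},  # approximate: Clark area
]
kept = clip_logs(logs)
```

A rectangular box is the simplest choice; clipping against an administrative boundary polygon would be more precise but needs a geometry library.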
To make the data easier to interpret, I used its date features and grouped the logs by day and hour. This aggregation compensates for the imprecision of the 2-minute logging interval.
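One way to group logs by day and hour is sketched below using only the standard library. It assumes each log is a `(taxi_id, timestamp)` pair and uses the number of distinct taxis seen in an hour as a stand-in for ridership. Both the input shape and that proxy are assumptions for illustration; the source does not state how rides were counted.

```python
from collections import defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def average_by_weekday_hour(logs):
    """Average number of distinct taxis per (weekday, hour) across dates."""
    # Distinct taxis seen in each (calendar date, hour) slot
    seen = defaultdict(set)
    for taxi_id, timestamp in logs:
        seen[(timestamp.date(), timestamp.hour)].add(taxi_id)

    # Collect the per-date counts under their (weekday, hour) key
    counts = defaultdict(list)
    for (day, hour), taxis in seen.items():
        counts[(WEEKDAYS[day.weekday()], hour)].append(len(taxis))

    return {key: sum(values) / len(values) for key, values in counts.items()}


logs = [
    ("A", datetime(2018, 3, 5, 9, 0)),    # Monday
    ("A", datetime(2018, 3, 5, 9, 2)),    # same taxi, same hour: counted once
    ("B", datetime(2018, 3, 5, 9, 10)),
    ("A", datetime(2018, 3, 12, 9, 30)),  # the following Monday
]
averages = average_by_weekday_hour(logs)
```

Counting each taxi once per hour means that repeated logs from the same vehicle do not inflate the total, which is the point of aggregating above the 2-minute logging interval.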

Fig 3. Daily Average Ridership

Daily Average Rides:
- The daily average rides follow a trend with 3 peaks: 9 AM, 1 PM, and 7 PM.
- There are 3 off-peaks: 11 AM, 5 PM, and 11 PM.
- At 9 AM, Monday recorded the highest ridership, with an average of 701 rides.
- At 1 PM, Tuesday is the highest, with an average of 618 rides.
- At 7 PM, Wednesday recorded an average of 662 rides.
- On a monthly basis, dividing this ridership by 4 gives:
  - 175 cars for every Monday morning
  - 155 cars for every Tuesday afternoon
  - 166 cars for every Wednesday evening
- Assuming every ride carries 1 passenger, it takes 175 cars to transport the Monday morning commuters from origin to destination.
- With ride sharing at 4 passengers per car, it takes approximately 44 cars on Monday morning (175 / 4 = 43.75, rounded up).
- With a bus or shuttle service at 60 passengers per vehicle, it takes 3 buses to transport the 175 passengers.
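The arithmetic behind these figures can be checked with a short script. The function names are illustrative. The peak values come from the list above, and the vehicle counts are rounded up because a fraction of a vehicle still requires a whole one (175 / 4 = 43.75, so 44 cars).

```python
import math


def round_half_up(value):
    """Round to the nearest integer, with .5 rounding up."""
    return int(math.floor(value + 0.5))


def vehicles_needed(passengers, capacity):
    """Smallest number of vehicles that can carry all passengers."""
    return math.ceil(passengers / capacity)


peak_rides = {"Monday 9 AM": 701, "Tuesday 1 PM": 618, "Wednesday 7 PM": 662}

# Divide each peak by 4, as in the list above
per_day = {label: round_half_up(rides / 4) for label, rides in peak_rides.items()}

monday = per_day["Monday 9 AM"]
fleet = {
    "taxi (1 passenger)": vehicles_needed(monday, 1),
    "ride share (4 passengers)": vehicles_needed(monday, 4),
    "bus (60 passengers)": vehicles_needed(monday, 60),
}
```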

II. Stay Point Identification

A stay point is a location identified from multiple GPS logs based on specific criteria. The GPS logs within this location are averaged to determine its latitude and longitude.

A. Criteria

Define a radius for the basis of the stay point.
Feed the GPS data into a loop, and test each point against the following conditions:
- time_stayed ≥ minimum_time_to_stay
- distance_changed ≤ threshold_distance
Starting from the initial GPS point, measure the distance to subsequent GPS logs. If the distance exceeds the threshold_distance, exit the loop.
Compute the stay point’s latitude and longitude by averaging the GPS logs that meet the criteria.
Calculate the cumulative time and cumulative distance for the stay point.
Proceed to the next unprocessed point and repeat the loop.

B. Pseudo-code

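import math

import pandas as pd  # assumption: the dataframe in to_dataframe is a pandas DataFrame


class StayPointIdentification:
    def __init__(self, data, cutoff_distance, minimum_time):
        # Assumption: data is a time-ordered list of (timestamp_in_seconds, latitude,
        # longitude) tuples for one taxi; cutoff_distance is in km, minimum_time in seconds
        self.gps_data = data
        self.cutoff_distance = cutoff_distance
        self.minimum_time = minimum_time
        self.staypoints = self.identify_staypoints()
        self.staypoints_df = self.to_dataframe()

    def centroid(self, latitude, longitude):
        # Compute centroid (mean) of given latitude and longitude lists
        return sum(latitude) / len(latitude), sum(longitude) / len(longitude)

    def radian(self, point):
        # Convert point from degrees to radians
        return float(point) * math.pi / 180.0

    def haversine_distance(self, lat1, lon1, lat2, lon2):
        # Calculate Haversine distance in km between two GPS points
        radius = 6371  # Earth radius in km
        phi1, phi2 = self.radian(lat1), self.radian(lat2)
        delta_phi = phi2 - phi1
        delta_lambda = self.radian(lon2) - self.radian(lon1)

        # Haversine formula
        a = math.sin(delta_phi / 2)**2 + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2)**2
        c = 2 * math.asin(math.sqrt(a))
        return radius * c

    def identify_staypoints(self):
        # Identify staypoints based on cutoff distance and minimum time
        staypoints = []
        i = 0
        while i < len(self.gps_data):
            start_time, start_lat, start_lon = self.gps_data[i]
            # Extend the window while points stay within cutoff_distance of the first point
            j = i + 1
            while j < len(self.gps_data):
                _, lat, lon = self.gps_data[j]
                if self.haversine_distance(start_lat, start_lon, lat, lon) > self.cutoff_distance:
                    break
                j += 1
            window = self.gps_data[i:j]
            time_stayed = window[-1][0] - start_time
            if time_stayed >= self.minimum_time:
                latitude, longitude = self.centroid([p[1] for p in window], [p[2] for p in window])
                staypoints.append({"latitude": latitude, "longitude": longitude,
                                   "arrival": start_time, "time_stayed": time_stayed})
                i = j  # continue with the first point outside the window
            else:
                i += 1
        return staypoints

    def to_dataframe(self):
        # Convert staypoints to a dataframe, filtering by minimum time
        columns = ["latitude", "longitude", "arrival", "time_stayed"]
        dataframe = pd.DataFrame(self.staypoints, columns=columns)
        return dataframe[dataframe["time_stayed"] >= self.minimum_time]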

Fig 4. Result of Stay Point Identification

III. Spatial Clustering

a. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Fig 5. DBSCAN Spatial Results

Fig 6. DBSCAN Cumulative Time vs Cumulative Count
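For readers unfamiliar with DBSCAN, the following is a minimal pure-Python sketch of the algorithm applied to latitude/longitude points with a great-circle distance. It is not the notebook's implementation, which may well rely on a library. The function names, the `eps_km` and `min_samples` values, and the sample coordinates are all assumptions for illustration.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))


def dbscan(points, eps_km, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbors(i):
        # Brute-force neighborhood query; includes the point itself
        return [j for j, q in enumerate(points)
                if haversine_km(points[i], q) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = noise  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # core point: expand the cluster
                queue.extend(reachable)
    return labels


stay_points = [
    (14.5995, 120.9842), (14.6000, 120.9845), (14.5990, 120.9840),  # group 1
    (14.6760, 121.0437), (14.6765, 121.0440), (14.6755, 121.0435),  # group 2
    (14.4000, 121.3000),                                            # isolated
]
labels = dbscan(stay_points, eps_km=0.2, min_samples=3)
```

The brute-force neighborhood query is quadratic in the number of points, so this sketch only suits small inputs; a spatial index would be needed for a full month of stay points.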

b. Hierarchical Clustering

Fig 7. Dendrogram

Fig 8. Hierarchical Spatial Results

Fig 9. Hierarchical Cumulative Time vs Cumulative Count

c. Insights

High Cumulative Count, High Cumulative Time: This combination may indicate areas of high traffic density and prolonged dwell time, where a lot of activity takes place. These areas may be urban centers, shopping districts, or entertainment venues.
Low Cumulative Count, Low Cumulative Time: This combination may indicate areas of low traffic activity, where there is little movement. These areas may be remote or less populated regions.
Low Cumulative Count, Varying Cumulative Time: This combination suggests that the area is not heavily trafficked, but that some events or activities draw people to the area for varying amounts of time.
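The combinations above can be read as a simple threshold rule over a cluster's cumulative count and cumulative time. The sketch below is one possible reading, not the project's method: the thresholds, the labels, and the function name are hypothetical, and the high-count, low-time case is left unlabeled because the text does not describe it.

```python
def describe_cluster(count, time, count_threshold, time_threshold):
    """Label a cluster by its cumulative count and cumulative time.

    The thresholds are hypothetical cut-offs chosen by the analyst.
    """
    high_count = count >= count_threshold
    high_time = time >= time_threshold
    if high_count and high_time:
        return "high count, high time"
    if not high_count and not high_time:
        return "low count, low time"
    if not high_count:
        # Low count with a long dwell time: the "varying time" case
        return "low count, varying time"
    return "not described"  # high count, low time is not covered in the text


# Made-up clusters as (cumulative count, cumulative time in minutes)
clusters = {"c1": (120, 900), "c2": (8, 30), "c3": (10, 600), "c4": (150, 20)}
summary = {
    name: describe_cluster(count, time, count_threshold=50, time_threshold=300)
    for name, (count, time) in clusters.items()
}
```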

IV. Recommendations

High-Demand Areas: Areas with high demand for taxis can be targeted for investment in public transit, such as a point-to-point bus system. The stay points within Antipolo, Taguig, and QC can be redesigned to host this bus system and connect it to the main transport network, such as the MRT and the EDSA Busway. In this way, commuters will be encouraged to use public transport that minimizes both the transportation cost and the time spent changing modes.
Low-Demand Areas: Low demand for taxis is accompanied by short travel times. This indicates that the origin-to-destination distance is short and can be covered using other modes of transport, such as cycling or walking. These areas can be targeted for green space infrastructure, such as exclusive pedestrian and bicycle lanes. This encourages commuters to switch from taxis to active transport.
Varying-Demand Areas: These areas show varying demand and travel time, and their stay points are present in both residential and business areas. This might indicate that demand depends on the specific situation of the commuter. For low travel times, the approach for low-demand areas can be adopted. For areas with high travel times, carpooling or ride sharing can be implemented. This reduces the road space used and saves costs for users.
Average-Demand Areas: The stay points that fall in this category are present in business areas, malls, schools, and local communities. Efforts in this category should focus on green space infrastructure, such as inclusive waiting areas for children, senior citizens, and PWDs.
