Searching by Location

Index

Introduction
Our approach
How we store locations
Vacancies location
How do we get the coordinates for the search location
Location Polygons
- Purpose
- Mappings
Getting the geographical coordinates
- Google Geocoding API costs

Introduction

Jobseekers searching for vacancies within a relative distance from a location is a core feature in Teaching Vacancies.

At the time this was written, around 70% of our vacancy searches include location in their filters.

Our approach

Once a job search location is submitted, the service:

- Gets the area or coordinates for the searched location.
- Keeps only the vacancies that pass the rest of the search filters and are located within the given radius of that location.
- Computes the exact distance between each vacancy and the given location, so that the search results can be ordered by distance.
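The steps above can be sketched in a few lines of Python. This is only an illustration of the filter-then-sort idea, not how the service runs it (the real work happens in PostGIS queries). The vacancy dictionaries, the `search_by_location` helper and the spherical haversine formula are all assumptions made for the example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, treating the Earth as a sphere


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def search_by_location(vacancies, origin, radius_miles):
    """Return the vacancies within radius_miles of origin, nearest first."""
    origin_lat, origin_lon = origin
    results = []
    for vacancy in vacancies:  # assumed to have already passed the other filters
        distance = haversine_miles(origin_lat, origin_lon,
                                   vacancy["lat"], vacancy["lon"])
        if distance <= radius_miles:
            results.append({**vacancy, "distance_miles": distance})
    return sorted(results, key=lambda v: v["distance_miles"])


# Hypothetical vacancies, each located at its organization's coordinates.
vacancies = [
    {"title": "Maths teacher", "lat": 53.4808, "lon": -2.2426},    # Manchester
    {"title": "Science teacher", "lat": 51.4543, "lon": -0.9781},  # Reading
    {"title": "English teacher", "lat": 51.5074, "lon": -0.1278},  # London
]
nearby = search_by_location(vacancies, origin=(51.5074, -0.1278), radius_miles=50)
print([v["title"] for v in nearby])
```

With a 50 mile radius around central London, the Manchester vacancy is dropped and the remaining two come back nearest first.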

How we store locations

Our PostgreSQL database instance has the PostGIS extension. This extension allows our database to store and operate with geospatial data.

The data type in the database is geometry with the geographic: true flag and SRID 4326.

What does this mean?

We store the data in a geographic coordinate system (GCS) rather than a projected coordinate system (PCS). The coordinates are stored as latitude and longitude on the Earth's surface, following the WGS84 standard.

Using the geometry type with geographic: true allows us to leverage a wide range of PostGIS geospatial functions, such as ST_Distance, ST_Within, ST_Intersects, and ST_Buffer.

Meaning of `SRID 4326`

SRID: Spatial Reference System Identifier.
4326: Represents spatial data using longitude and latitude coordinates on the Earth's surface as defined in the WGS84 standard.

The result of this is that the spatial data models the Earth as an ellipsoid, and calculations are performed over that ellipsoidal surface.

The data type we use from PostGIS is Geometry with a Geographic coordinate system, which allows us to use projections and transformations.

Why not just Geometry

When using a geographic coordinate system, distance calculations are performed on the curved surface of the Earth. PostGIS functions like ST_Distance and ST_Buffer will take into account the Earth's curvature, providing more accurate results for large distances compared to a planar (flat) coordinate system.
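A small worked example may help show why curvature matters when coordinates are stored as latitude and longitude. The sketch below assumes a spherical Earth with a mean radius of 6371 km (an approximation, not the WGS84 ellipsoid) and computes how much ground one degree of longitude covers at different latitudes. The function names are made up for the illustration.

```python
import math

MEAN_EARTH_RADIUS_KM = 6371.0  # spherical approximation of the Earth


def km_per_degree_of_latitude():
    """Ground distance covered by one degree of latitude (constant on a sphere)."""
    return math.radians(1) * MEAN_EARTH_RADIUS_KM


def km_per_degree_of_longitude(latitude_deg):
    """Ground distance covered by one degree of longitude at a given latitude."""
    return km_per_degree_of_latitude() * math.cos(math.radians(latitude_deg))


at_equator = km_per_degree_of_longitude(0)   # roughly 111 km
in_england = km_per_degree_of_longitude(52)  # roughly 68 km
print(round(at_equator, 1), round(in_england, 1))
```

Treating raw longitude/latitude degrees as a flat grid would therefore overstate east-west distances in England by more than half. A planar approach only becomes reasonable once the data is held in a projected coordinate system.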

There is no paper trail on why geographic data was chosen over geometric data when the feature was built, so we assume it was for precision.

Given that the longest distance our service searches over is "within 200 miles of a location in England", using the Earth's ellipsoid (meant for precision over long intercontinental or multi-country distances) does not seem to be a requirement for our service. It also has major performance costs:

Performance

Operating on geographic data over the Earth's ellipsoid is computationally very expensive compared to using planar geometric data.

We have mitigated the performance hit with a few measures:

Spatial indexing

We use PostGIS spatial indexing for any geospatial data used in our queries. Stored areas and points must have spatial indexes when used in any querying.

Reducing Location Polygon areas complexity

We experienced major performance issues because some of the location polygons coming from ONS have tens of thousands of points (e.g., the Cornwall polygon consisted of 125k points).

Doing an ST_Buffer (expanding the polygon for radius searches) over polygons with that level of complexity is very taxing on the database CPU and memory.

Simplifying the polygons with some tolerance greatly reduces the number of points while keeping the polygon shape quite accurately. For example, the Cornwall polygon with 2.5k points instead of 125k is almost identical to the original while being far cheaper to operate on.

To achieve that, we use ST_SimplifyPreserveTopology over the ONS imported polygons prior to storing them in our database.
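To give an intuition for what simplifying with a tolerance means, here is a minimal sketch of the plain Douglas-Peucker line simplification in Python. It is a stand-in chosen for illustration, not what ST_SimplifyPreserveTopology does internally: in particular, it makes no attempt to preserve topology, it works on planar coordinates, and its recursion would need reworking for polygons with many thousands of points. The `simplify` function and the sample path are hypothetical.

```python
import math


def _distance_to_line(point, start, end):
    """Perpendicular distance from point to the line through start and end."""
    (px, py), (sx, sy), (ex, ey) = point, start, end
    dx, dy = ex - sx, ey - sy
    if dx == 0 and dy == 0:
        return math.hypot(px - sx, py - sy)
    return abs(dy * px - dx * py + ex * sy - ey * sx) / math.hypot(dx, dy)


def simplify(points, tolerance):
    """Drop points that lie within tolerance of the simplified line."""
    if len(points) < 3:
        return list(points)
    start, end = points[0], points[-1]
    # Find the point that deviates most from the straight start-end line.
    index, max_distance = 0, 0.0
    for i in range(1, len(points) - 1):
        distance = _distance_to_line(points[i], start, end)
        if distance > max_distance:
            index, max_distance = i, distance
    if max_distance <= tolerance:
        return [start, end]  # everything in between is close enough to drop
    # Otherwise keep that point and simplify each half separately.
    left = simplify(points[: index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right


path = [(0, 0), (1, 0.05), (2, 0), (2.05, 1), (2, 2)]
simplified = simplify(path, tolerance=0.1)
print(simplified)
```

The small wobbles along each edge fall within the tolerance and are removed, while the corner, which defines the shape, is kept.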

Precomputing data when possible

Precomputing the centroid of each location polygon and saving it in the database improves the performance of our search-by-distance queries. Instead of calculating the center of each location polygon at query time to measure its distance from each vacancy, the query uses the stored centroid.
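As an illustration of the precompute-once idea, the sketch below calculates a polygon centroid with the standard shoelace-based formula and stores it next to the polygon at import time. It assumes simple planar polygons and is not the PostGIS calculation the service relies on. The `polygon_centroid` helper and the `stored` structure are hypothetical.

```python
def polygon_centroid(ring):
    """Centroid of a simple planar polygon.

    ring is a list of (x, y) vertices without the first point repeated at the end.
    """
    doubled_area = 0.0  # twice the signed area of the polygon
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        cross = x0 * y1 - x1 * y0
        doubled_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if doubled_area == 0:
        raise ValueError("degenerate polygon with zero area")
    return (cx / (3 * doubled_area), cy / (3 * doubled_area))


# Hypothetical import step: compute each centroid once and keep it with the polygon,
# so that later distance queries can read it instead of recomputing it.
imported = {"rectangle": [(0, 0), (4, 0), (4, 2), (0, 2)]}
stored = {
    name: {"ring": ring, "centroid": polygon_centroid(ring)}
    for name, ring in imported.items()
}
print(stored["rectangle"]["centroid"])
```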

Tweaking queries to improve performance

The use_spheroid parameter set to false improves the performance of operations by using a faster spherical calculation instead of a more computationally expensive ellipsoidal calculation.

For example: ST_Distance(gg1, gg2, false)
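For a rough sense of how much precision the spherical calculation gives up, the sketch below derives the WGS84 flattening, that is, how much the ellipsoid departs from a perfect sphere. The two WGS84 constants are the standard published values. The conclusion drawn from them is only an order-of-magnitude estimate and has not been measured against PostGIS.

```python
WGS84_SEMI_MAJOR_M = 6378137.0            # equatorial radius in meters
WGS84_INVERSE_FLATTENING = 298.257223563  # defining constant of the ellipsoid

flattening = 1 / WGS84_INVERSE_FLATTENING
semi_minor_m = WGS84_SEMI_MAJOR_M * (1 - flattening)  # polar radius in meters

# The equatorial and polar radii differ by about a third of a percent.
relative_difference_percent = flattening * 100

# Rough scale of the discrepancy over the longest search radius (200 miles).
rough_discrepancy_miles = 200 * flattening

print(round(semi_minor_m, 1),
      round(relative_difference_percent, 3),
      round(rough_discrepancy_miles, 2))
```

A discrepancy on the order of a fraction of a mile over a 200 mile search is unlikely to matter for a jobseeker, which is consistent with trading it for faster queries.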

Increasing our database instance resources

We originally had a basic Azure PostgreSQL Flexible server: GP_Standard_D2ds with 2 vCPUs and 8GB of RAM.

The memory was more than enough, the average CPU usage was low, and the connection limit was well above what we were using. However, the database CPU kept hitting daily 100% usage spikes during location search queries, causing some very slow queries that triggered AKS pod restarts on our servers.

We upgraded to a GP_Standard_D4ds, which provides 4 vCPUs and 16GB of RAM.

The increase in CPU resources resolved the performance issues with the location search SQL queries. While they're still computationally expensive, the database instance has more than enough resources to handle those queries swiftly without choking or causing a bottleneck.

Vacancies location

Each vacancy has location coordinates stored in its geolocation database field.

This field contains the geographic coordinates of the organization associated with the vacancy.

How do we get the coordinates for the search location

Once a location is provided for a search, there are three possibilities:

- The location is considered a nationwide location (e.g., England, UK). We ignore the location: a nationwide location filter is irrelevant when all our vacancies are restricted to England.
- We have a polygon stored for the given location. We create a buffered, expanded area based on the provided search location radius.
- No polygon is found for the given location. We get the geographical coordinates for the given location.
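The three possibilities can be read as a simple decision chain. The sketch below is a hypothetical illustration of that flow only. The `NATIONWIDE_TERMS` set, the `resolve_location` function, the returned dictionaries and the way terms are normalized are all assumptions and do not reflect the actual implementation.

```python
NATIONWIDE_TERMS = {"england", "uk", "united kingdom"}  # hypothetical list


def resolve_location(term, radius_miles, polygons, geocode):
    """Decide how a search location should be applied to the vacancy query.

    polygons maps a normalized location name to its stored polygon.
    geocode is a callable returning (latitude, longitude) for a search term.
    """
    key = term.strip().lower()
    if key in NATIONWIDE_TERMS:
        # All vacancies are in England, so the filter adds nothing.
        return {"strategy": "ignore"}
    if key in polygons:
        # Expand the stored area by the search radius.
        return {"strategy": "buffered_polygon",
                "polygon": polygons[key],
                "buffer_miles": radius_miles}
    # No stored area: fall back to a single point plus the radius.
    return {"strategy": "point",
            "coordinates": geocode(term),
            "radius_miles": radius_miles}


# Hypothetical data for the example.
polygons = {"essex": [(0, 0), (1, 0), (1, 1), (0, 1)]}
fake_geocode = lambda term: (51.5074, -0.1278)

print(resolve_location("England", 10, polygons, fake_geocode)["strategy"])
print(resolve_location("Essex", 10, polygons, fake_geocode)["strategy"])
print(resolve_location("SW1A 1AA", 10, polygons, fake_geocode)["strategy"])
```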

Location Polygons

The location polygons are geographical areas we store in our database.

There is an ImportPolygonDataJob running weekly that imports polygons for:

Counties
Cities
Regions

It also creates composite polygons combining the above.

All of these are imported by querying the ONS (Office for National Statistics) ArcGIS endpoints for the area data of each polygon.

Purpose

There is a difference between searching for vacancies within a particular distance from a jobseeker's home (that would match some particular coordinates point) and vacancies within a particular distance "from Essex".

How do we measure the distance between a whole region and a particular vacancy location?

Taking the center of the region would be wrong: if we search for "10 miles from Essex", anything more than 10 miles from the center point of Essex would be filtered out.

What we would expect is to find anything within 10 miles of the outer borders of Essex.

Having an area stored for "Essex" that contains its borders allows us to combine ST_Buffer and ST_DWithin: first we expand the area borders to cover the provided search radius, and then we check whether each vacancy location is contained within that expanded area.
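The difference between measuring from the center and measuring from the borders can be shown with a toy planar example. The sketch below uses a made-up 40 by 40 square as the region, with units standing in for miles, and compares the two distances for a vacancy just outside its eastern border. The helper functions are hypothetical and work on flat coordinates, unlike the PostGIS functions named above.

```python
import math


def _edges(ring):
    """Pairs of consecutive vertices, closing the ring back to its first point."""
    return zip(ring, ring[1:] + ring[:1])


def _distance_to_segment(p, a, b):
    """Shortest planar distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp to the segment's end points
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def _contains(ring, p):
    """Ray-casting test: is point p inside the polygon?"""
    px, py = p
    inside = False
    for (ax, ay), (bx, by) in _edges(ring):
        if (ay > py) != (by > py):
            x_cross = ax + (py - ay) * (bx - ax) / (by - ay)
            if px < x_cross:
                inside = not inside
    return inside


def distance_to_area(ring, p):
    """Zero inside the area, otherwise the distance to its nearest border."""
    if _contains(ring, p):
        return 0.0
    return min(_distance_to_segment(p, a, b) for a, b in _edges(ring))


region = [(0, 0), (40, 0), (40, 40), (0, 40)]  # made-up region
center = (20, 20)
vacancy = (45, 20)                             # 5 units east of the border

from_border = distance_to_area(region, vacancy)
from_center = math.hypot(vacancy[0] - center[0], vacancy[1] - center[1])
print(from_border, from_center)
```

For a radius of 10, the border-based check includes the vacancy (5 units away) while the center-based check would wrongly exclude it (25 units away).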

Mappings

Within the mapping files we define the subset of cities, regions and counties we store polygon areas for. This is used by the location data setup.

The mapped locations file helps to match common location search terms with an appropriate location polygon if any.

Warning: There is no paper trail on why or how this approach was taken. The current team's knowledge and understanding of this area is very limited and based on tracing the implementation and configuration files, which are quite complex.

Getting the geographical coordinates

As many location search terms don't match a polygon name, the service falls back to obtaining the location coordinates.

We use the Geocoding class to retrieve and cache the location coordinates. The default source for retrieving the location coordinates is the Google Geocoding API.

Information about the API usage/costs should be accessible from the Google Cloud panel using your DfE email account.

Google Geocoding API costs

The Google Geocoding API is one of our higher costs in Google Cloud. The coordinates for a given location in the UK are stable data that is unlikely to change, so to reduce the cost as much as possible we cache the API responses for a long period (currently set at 180 days). This means fewer API hits to renew existing cached results.

Please be aware of the impact on the API costs if deciding to modify the Geocoding cache duration period.
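To illustrate why the cache duration drives the API cost, here is a minimal time-to-live cache sketch in Python. It is not the Geocoding class: the `CachedGeocoder` name, its in-memory dictionary, the key normalization and the injected `lookup` and `clock` callables are all assumptions made for the example. Only the 180 day duration comes from the text above.

```python
import time

CACHE_TTL_SECONDS = 180 * 24 * 60 * 60  # 180 days, as described above


class CachedGeocoder:
    """Wraps a paid lookup so repeated searches reuse a cached result."""

    def __init__(self, lookup, ttl_seconds=CACHE_TTL_SECONDS, clock=time.time):
        self._lookup = lookup  # stands in for the paid API call
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}       # normalized term -> (expiry time, coordinates)
        self.api_calls = 0     # counts billable lookups

    def coordinates_for(self, location):
        key = location.strip().lower()
        now = self._clock()
        cached = self._cache.get(key)
        if cached is not None and now < cached[0]:
            return cached[1]   # still fresh: no API hit
        self.api_calls += 1
        coordinates = self._lookup(location)
        self._cache[key] = (now + self._ttl, coordinates)
        return coordinates


# A fake clock makes the expiry behaviour visible without waiting 180 days.
current_time = [0.0]
geocoder = CachedGeocoder(lambda term: (51.5074, -0.1278),
                          clock=lambda: current_time[0])
geocoder.coordinates_for("London")
geocoder.coordinates_for(" london ")
calls_while_fresh = geocoder.api_calls    # one billable lookup so far

current_time[0] = CACHE_TTL_SECONDS + 1   # jump past the cache duration
geocoder.coordinates_for("London")
calls_after_expiry = geocoder.api_calls   # the expired entry costs a second lookup
print(calls_while_fresh, calls_after_expiry)
```

Shortening the duration makes entries expire sooner, so the same popular search terms are paid for more often.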
