Vector Data and Spark

Tim Tisler edited this page Nov 24, 2015 · 3 revisions

*** Completed! ***

Handling Vector Data in Spark

Overview

The following class diagram shows the design for handling vector data within map algebra as a result of the transition to running map algebra within Apache Spark. At the core of vector data processing within map algebra is a map op named VectorMapOp. This is the base class to be extended by any map op that produces vector data. It is also used by any map op that makes use of vector data as an input.

The VectorDataMapOp extends the VectorMapOp because it produces (really just exposes) vector data within map algebra. The VectorDataMapOp will use the existing VectorDataProvider functionality to access data from a variety of possible sources (whether they be files in HDFS, GeoWave or in-memory).

Map ops that require vector data as input, such as the rasterize vector map op, will obtain a VectorRDD and VectorMetadata from a VectorMapOp in order to gain access to the actual vector data for processing.
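The relationship described above can be sketched in a few lines. This is only an illustration in Python, not the MrGeo implementation: the method names `vector_rdd`, `vector_metadata` and `read_features`, the `InMemoryProvider` class and the `count_features` consumer are all hypothetical, and a plain list stands in for a Spark RDD.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class VectorMetadata:
    # Hypothetical metadata: only the attribute names found on the features.
    attributes: list


class VectorMapOp(ABC):
    """Base for any map op that produces vector data."""

    @abstractmethod
    def vector_rdd(self):
        """Return the features (a plain list stands in for a Spark RDD)."""

    @abstractmethod
    def vector_metadata(self):
        """Return metadata describing the features."""


class VectorDataMapOp(VectorMapOp):
    """Exposes vector data read through a provider-like object."""

    def __init__(self, provider):
        self._provider = provider  # assumed to offer read_features()

    def vector_rdd(self):
        return self._provider.read_features()

    def vector_metadata(self):
        features = self._provider.read_features()
        names = sorted({key for feature in features for key in feature})
        return VectorMetadata(attributes=names)


class InMemoryProvider:
    """Hypothetical provider that serves features held in memory."""

    def __init__(self, features):
        self._features = list(features)

    def read_features(self):
        return list(self._features)


def count_features(op: VectorMapOp) -> int:
    """A consumer that depends only on the VectorMapOp interface."""
    return len(op.vector_rdd())
```

Because the consumer sees only the base class, it does not need to know whether the features came from HDFS, GeoWave or memory.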

Diagram

VectorMapOp Design

Map Algebra Syntax

In some cases, the map algebra syntax used to specify a vector data source will need to change in order to facilitate uniform handling of vector data. This also makes it easy to support new sources of vector data in the future. The following sections describe the changes.

[delimited:...]

In earlier versions of MrGeo, there was a map algebra function named InlineCsv. It allowed vector data and a schema to be specified directly in the map algebra, which avoided having to write the data to a file in HDFS and then reference that file from map algebra.

The same capability will be retained, but it will no longer be done through a map op. Instead, the functionality will be exposed through a new vector data provider. The difference is that in map algebra the vector data will be included in a data source specification rather than passed to a map algebra function. The benefit of this approach is that the vector data is available to any code through the vector data provider interface, and no special conditions or processing are required. Callers simply use the vector data provider interface as they would for any other source of vector data.

An example of the old syntax is:

src = InlineCsv("GEOMETRY", "'POINT(142.0 -18.0)'");
CostDistance(src, 10, myFriction, 20000.0);

And an example of the new syntax is:

src = [delimited:schema="GEOMETRY";value="POINT(142.0 -18.0)"];
CostDistance(src, 10, myFriction, 20000.0);

The syntax will allow for multiple attributes as well as string, numeric, latitude, longitude and geometry fields.
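To make the shape of the new data source specification concrete, here is a minimal Python sketch that splits the example above into its prefix and its settings. It is not MrGeo's parser: the function name `parse_data_source` is hypothetical, and the sketch assumes settings are written as `key="value"` pairs with no escaped quotes, which is all the example shows. How multi-attribute schemas are written is not covered here.

```python
import re

# Matches one key="value" setting; assumes values contain no double quotes.
_SETTING = re.compile(r'(\w+)="([^"]*)"')


def parse_data_source(spec):
    """Split '[prefix:key="v";key="v"]' into (prefix, {key: value})."""
    text = spec.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("data source must be enclosed in [ ]")
    prefix, sep, settings = text[1:-1].partition(":")
    if not sep:
        raise ValueError("expected a 'prefix:' before the settings")
    return prefix, dict(_SETTING.findall(settings))


prefix, settings = parse_data_source(
    '[delimited:schema="GEOMETRY";value="POINT(142.0 -18.0)"]'
)
print(prefix)    # delimited
print(settings)  # {'schema': 'GEOMETRY', 'value': 'POINT(142.0 -18.0)'}
```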

[geowave:...]

No changes are required. The GeoWave data provider will remain unchanged since it already makes use of the vector data provider interface. This represents a model for changing the other vector data sources.

[abc.csv], [abc.tsv] and [abc.shp]

The map algebra syntax for referencing a file-based source of vector data will not change. However, under the covers, this data will be accessed through the vector data provider interface so that vector data can be accessed throughout the system in the same way, regardless of its actual source.
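One way to picture this uniform handling is a single lookup that decides which kind of provider serves a given data source. The Python sketch below is an assumption about how such a lookup could work, not a description of MrGeo's code: `provider_kind` and `KNOWN_PREFIXES` are hypothetical names, and only the two prefixes mentioned on this page are recognized.

```python
# Prefixes named on this page; anything else is treated as a file reference.
KNOWN_PREFIXES = ("delimited", "geowave")


def provider_kind(source):
    """Return the kind of provider that would serve a data source."""
    name = source.strip()
    if name.startswith("[") and name.endswith("]"):
        name = name[1:-1]
    for prefix in KNOWN_PREFIXES:
        if name.startswith(prefix + ":"):
            return prefix
    return "file"


for source in ("[abc.csv]", "[abc.shp]", "[geowave:...]"):
    print(source, "->", provider_kind(source))
```

Supporting a new source of vector data would then amount to registering one more prefix, with no change to the callers.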
