Data Provider Overview

Dave Johnson edited this page May 15, 2015 · 3 revisions

Introduction

The data provider architecture is an abstraction layer that lets MrGeo core code read and write data without knowing the specifics of the underlying data storage. With this approach, images can be stored in HDFS, Accumulo or any other suitable technology without changing MrGeo code. Supporting a new storage technology only requires writing a plugin that implements the data provider abstraction layer. MrGeo can also use several data providers together rather than being tied to a single one.
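
The plugin idea can be illustrated with a minimal sketch. Everything in it is hypothetical: the interface, the class names and the prefix-based matching are assumptions made for illustration and are not MrGeo's actual provider API. It only shows how core code could ask a registry for a provider without knowing which storage technology answers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical plugin contract; MrGeo's real provider interfaces are richer.
interface ImageStoreProvider {
    // True if this plugin recognizes the given resource name.
    boolean canOpen(String resourceName);

    // Short label for the backing storage technology.
    String backendName();
}

// Illustrative plugin that claims resources by name prefix (an assumption).
final class PrefixProvider implements ImageStoreProvider {
    private final String prefix;
    private final String backend;

    PrefixProvider(String prefix, String backend) {
        this.prefix = prefix;
        this.backend = backend;
    }

    @Override
    public boolean canOpen(String resourceName) {
        return resourceName.startsWith(prefix);
    }

    @Override
    public String backendName() {
        return backend;
    }
}

// Core code talks to the registry and never to a concrete storage class.
final class ProviderRegistry {
    private final List<ImageStoreProvider> providers = new ArrayList<>();

    void register(ImageStoreProvider provider) {
        providers.add(provider);
    }

    // Returns the first registered provider that accepts the resource, if any.
    Optional<ImageStoreProvider> find(String resourceName) {
        for (ImageStoreProvider provider : providers) {
            if (provider.canOpen(resourceName)) {
                return Optional.of(provider);
            }
        }
        return Optional.empty();
    }
}
```

Registering one PrefixProvider per backend and calling find with a resource name returns whichever plugin accepts it, which is how several providers could coexist in one installation under these assumptions.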

The architecture defines the following types of data provider:

  • MrsImage data provider
  • Vector data provider
  • AdHoc data provider
  • IngestImage data provider

MrsImage Data Provider

This provider abstracts access to MrsImages, including creating Hadoop InputFormat and OutputFormat objects for map/reduce jobs, reading/writing image metadata, and reading/writing image tiles. Each specific provider fully determines how MrsImages are stored, so it can take full advantage of the Hadoop InputFormat and OutputFormat capabilities offered by the technology it abstracts.

It also allows each provider to optimize the performance and footprint of image storage however it makes sense for the underlying technology.

Vector Data Provider

This provider abstracts access to vector data, including constructing InputFormat and OutputFormat objects for map/reduce jobs and reading/writing vector data. All vector data is converted to the MrGeo Geometry class so that MrGeo core code works with vector data consistently, regardless of the underlying data provider being used. Each type of provider can take full advantage of the Hadoop InputFormats and OutputFormats available to it for optimal performance and ease of coding.

While the interface contains hooks for writing vector data and constructing a Hadoop OutputFormat, that functionality is not currently used in MrGeo nor implemented in the existing vector data providers.

Gotcha: Some code within MrGeo that makes use of vector data has not been converted to the data provider architecture at this time. One notable example is shapefile access. For map algebra functions that do use the new vector provider (like RasterizeVectorMapOp), the code first attempts to get a vector data provider for the source, and if unsuccessful, it then falls back to the old code for accessing the vector data.
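
The fallback described above follows a common "try the provider, else use the legacy path" pattern. The sketch below is only an illustration of that pattern: VectorAccess, the lookup function and the legacy reader are hypothetical names, not MrGeo classes, and the real RasterizeVectorMapOp code is not reproduced here.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper showing provider-first access with a legacy fallback.
final class VectorAccess {
    private VectorAccess() {
    }

    // Asks the provider lookup for a handle to the source. If no provider
    // claims the source, the legacy reader is used instead.
    static <T> T open(String source,
                      Function<String, Optional<T>> providerLookup,
                      Function<String, T> legacyReader) {
        Optional<T> fromProvider = providerLookup.apply(source);
        if (fromProvider.isPresent()) {
            return fromProvider.get();
        }
        return legacyReader.apply(source);
    }
}
```

The legacy reader is only invoked when the lookup returns an empty Optional, so sources already covered by a vector data provider never touch the old code path.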

AdHoc Data Provider

This data provider is used by the MrGeo core code to persist "ad hoc" data - in other words data that is not a MrsImage or vector. An AdHoc resource is a collection of persisted sub-resources that can be dynamically added (optionally with a specific name), and those resources can be in any format. A typical usage is in a map/reduce job where the driver creates an AdHoc provider and stores its name into the map/reduce job configuration. Mappers and/or reducers can instantiate the AdHoc provider using that name and either add resources to it or use resources already stored in it.

An example use case is how MrGeo stores statistics during a map/reduce raster operation. Each reducer tracks statistics for the image tiles it processes, and the driver aggregates the results after the job completes. To do this, the driver creates an ad hoc data provider and stores its name in the map/reduce job configuration. Each reducer reads that name from the job configuration and uses it to instantiate the ad hoc provider. The reducer then adds an anonymous resource to the provider and writes its statistics to that resource in JSON format. When the job completes, the driver iterates over the JSON resources that the reducers added, aggregates the statistics for all of the tiles, and stores the results in the MrsPyramid metadata file.
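
The statistics flow can be sketched in a few lines. This is a simplified stand-in under stated assumptions: the store is an in-memory map rather than a real ad hoc provider, statistics are kept as plain Java objects instead of JSON, and the names TileStats and InMemoryAdHocStore are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-reducer statistics (MrGeo writes JSON; objects are used here).
final class TileStats {
    final double min;
    final double max;
    final double sum;
    final long count;

    TileStats(double min, double max, double sum, long count) {
        this.min = min;
        this.max = max;
        this.sum = sum;
        this.count = count;
    }

    double mean() {
        return count == 0 ? Double.NaN : sum / count;
    }

    // Driver-side step: combine the partial results written by each reducer.
    static TileStats merge(List<TileStats> parts) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        long count = 0;
        for (TileStats part : parts) {
            min = Math.min(min, part.min);
            max = Math.max(max, part.max);
            sum += part.sum;
            count += part.count;
        }
        return new TileStats(min, max, sum, count);
    }
}

// In-memory stand-in for an ad hoc provider: a named collection of resources.
final class InMemoryAdHocStore {
    private static final Map<String, List<TileStats>> STORES = new HashMap<>();

    private InMemoryAdHocStore() {
    }

    // Reducer-side step: add an anonymous resource under the shared name.
    static synchronized void add(String providerName, TileStats stats) {
        STORES.computeIfAbsent(providerName, k -> new ArrayList<>()).add(stats);
    }

    // Driver-side step: list every resource added under that name.
    static synchronized List<TileStats> resources(String providerName) {
        return new ArrayList<>(STORES.getOrDefault(providerName, new ArrayList<>()));
    }
}
```

The only thing shared between driver and reducers is the provider name, which mirrors the role of the name stored in the job configuration.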

IngestImage Data Provider

This data provider is not expected to be used outside of the image ingest functionality of MrGeo. It abstracts access to source imagery during the MrGeo image ingest processing. This includes the creation of an InputStream for a source image, getting a Hadoop InputFormat for reading tiled image data, getting a Hadoop InputFormat for map/reducing over raw source imagery, and writing tiled raster data.

To understand this provider, you need a basic understanding of ingest processing, which differs depending on where the source imagery resides. If the source is distributed (i.e. not stored in the local file system) and can be effectively map/reduced, MrGeo gets a Hadoop InputFormat from the IngestImage data provider for that input. That InputFormat reads the source image, cuts and resamples it into MrGeo tiles, and emits TileIdWritable keys and RasterWritable values. This data feeds a map/reduce job that mosaics the tiles (when there are multiple inputs for a given MrGeo tile), outputs the final image for each tile, and computes image statistics.

If the source is in the local file system, MrGeo first opens the source and, using local processing, cuts MrGeo tiles from that source, resampling as needed, and saves the results into a Hadoop SequenceFile in HDFS. That SequenceFile is then used as input to the same map/reduce job described above for mosaicing and computing image statistics.
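
The choice between the two ingest paths can be pictured as a small decision function. The sketch below is an assumption-laden illustration: it decides purely from the URI scheme, treating a missing scheme or "file" as local, whereas the real decision also depends on whether the source can be effectively map/reduced. IngestPath and IngestPlanner are hypothetical names.

```java
import java.net.URI;

// The two ingest strategies described above (names are hypothetical).
enum IngestPath {
    // Distributed source: the provider's InputFormat feeds the job directly.
    MAP_REDUCE_DIRECT,
    // Local source: tiles are cut locally and staged in a SequenceFile first.
    LOCAL_CUT_THEN_SEQUENCE_FILE
}

final class IngestPlanner {
    private IngestPlanner() {
    }

    // Assumption: a missing scheme or "file" means the local file system.
    static IngestPath choose(URI source) {
        String scheme = source.getScheme();
        boolean local = scheme == null || scheme.equalsIgnoreCase("file");
        return local ? IngestPath.LOCAL_CUT_THEN_SEQUENCE_FILE
                     : IngestPath.MAP_REDUCE_DIRECT;
    }
}
```

Either path ends in the same mosaicing and statistics job, so the planner only determines how the tiled input for that job is produced.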
