Accumulo as Primary Data Provider

Andrew Levine edited this page Jan 21, 2015 · 10 revisions

This page is meant to be a guide to the Accumulo Data Provider interface for MrGeo.

The first step in working with Accumulo is to make it the primary data store for MrGeo. This document describes how to do that.

Confirm that the $MRGEO_HOME environment variable is set.

If you already had MrGeo working with HDFS as the primary location of images, the variable may be set already. If not, make sure it is set on the system used to launch jobs. For this example, say the config files for MrGeo are in /opt/mrgeo. Then, assuming Linux with bash as the shell, add the following to the $HOME/.bash_profile file:

  export MRGEO_HOME=/opt/mrgeo

Save the file and source it:

  #> source ~/.bash_profile

Then check to see that the variable got set:

  #> echo $MRGEO_HOME

The result should be "/opt/mrgeo".

Next, edit the /opt/mrgeo/conf/mrgeo.conf file. Find:

  datasource = hdfs

Change the line so it reads:

  datasource = accumulo

This sets Accumulo as the primary data provider for MrGeo.
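The same edit can be scripted. The following is a minimal sketch, not part of MrGeo: the function name set_mrgeo_datasource is hypothetical, and it assumes mrgeo.conf contains a single "datasource = ..." line as shown above.

```shell
# Hypothetical helper (not shipped with MrGeo): rewrite the datasource line.
# Assumes the file holds one "datasource = ..." line, as in the example above.
set_mrgeo_datasource() {
  conf_file=$1   # path to mrgeo.conf
  provider=$2    # for example: accumulo or hdfs
  # Keep a backup copy, then rewrite only the value on the datasource line.
  cp "$conf_file" "$conf_file.bak" || return 1
  sed "s/^\([[:space:]]*datasource[[:space:]]*=[[:space:]]*\).*/\1$provider/" \
    "$conf_file.bak" > "$conf_file"
}
```

It could be called as set_mrgeo_datasource "$MRGEO_HOME/conf/mrgeo.conf" accumulo. The original file is left next to it with a .bak suffix.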

MrGeo Accumulo Configuration File

Next, MrGeo needs an Accumulo configuration file. Create the file $MRGEO_HOME/conf/mrgeo-accumulo.conf and add the following to it. The Accumulo connection information is mandatory.

  accumulo.user = root
  accumulo.password = secret
  accumulo.instance = accumulo
  accumulo.zookeepers = localhost:2181
  accumulo.viz = A|B
  accumulo.auths = A,B,C,D,E,F,G,U
  accumulo.queryauths = U
  accumulo.root.auths = U
  accumulo.default.write.viz = null
  accumulo.default.read.auths = U
  accumulo.bulkthreshold = 500
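Since the connection information is mandatory, a quick check before launching jobs can catch a missing entry. This is a sketch only: check_accumulo_conf is a hypothetical helper, and treating exactly the user, password, instance, and zookeepers keys as "the connection information" is an assumption based on the sample above.

```shell
# Hypothetical check (not part of MrGeo): report missing connection keys.
# Assumption: these four keys from the sample are the mandatory set.
check_accumulo_conf() {
  conf_file=$1   # path to mrgeo-accumulo.conf
  missing=0
  for key in accumulo.user accumulo.password accumulo.instance accumulo.zookeepers; do
    # Escape the dots so each key is matched literally.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if ! grep -q "^[[:space:]]*${pattern}[[:space:]]*=" "$conf_file"; then
      echo "missing: $key"
      missing=1
    fi
  done
  # Exit status 0 when every key is present, 1 otherwise.
  return $missing
}
```

It could be run as check_accumulo_conf "$MRGEO_HOME/conf/mrgeo-accumulo.conf". It only checks that the keys are present, not that the values can reach a running Accumulo instance.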

The Accumulo Data Provider can do bulk ingest on jobs. The number of output tiles is a factor in how the data provider pushes data into Accumulo. The "accumulo.bulkthreshold" value sets the threshold that decides between bulk ingest and a plain push with a batch writer. Choose this value based on the size of the cloud in use. In experimentation and development, small clouds were in use and 500 was a good number. Your mileage may vary.
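The threshold decision can be pictured with a small sketch. This is not MrGeo's code: choose_ingest_mode is a hypothetical name, and both the direction of the comparison (more tiles than the threshold means bulk ingest) and its strictness are assumptions.

```shell
# Hypothetical illustration of the threshold decision; not MrGeo's code.
# Assumption: bulk ingest is chosen when the tile count exceeds the threshold.
choose_ingest_mode() {
  tile_count=$1
  threshold=${2:-500}   # 500 mirrors accumulo.bulkthreshold in the sample config
  if [ "$tile_count" -gt "$threshold" ]; then
    echo "bulk"
  else
    echo "batchwriter"
  fi
}
```

Under these assumptions, a job producing 10000 tiles would go through bulk ingest, while a job producing 12 tiles would be written with a batch writer.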

When using the Accumulo Data Provider, ensure that the target table exists before creating data. If you are ingesting data and want the name to show up in GetCapabilities as "kathmandu", make sure a table with that name exists. This is a manual process. The Accumulo APIs make it possible to create a table from a program, but this has not been implemented. If it is requested, the change can be made to the provider.
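One way to do the manual step is through the Accumulo shell and its createtable command. The wrapper below is a sketch: create_mrgeo_table and the ACCUMULO_CMD override are hypothetical, and the default user and password simply mirror the sample configuration.

```shell
# Hypothetical wrapper around the Accumulo shell's "createtable" command.
# ACCUMULO_CMD can be overridden (for example with "echo") for a dry run.
create_mrgeo_table() {
  table=$1                 # table name, e.g. kathmandu
  user=${2:-root}          # defaults mirror the sample mrgeo-accumulo.conf
  password=${3:-secret}
  # -e runs a single shell command and exits.
  ${ACCUMULO_CMD:-accumulo} shell -u "$user" -p "$password" -e "createtable $table"
}
```

For example, create_mrgeo_table kathmandu would create the table used in the example above. A password passed on the command line is visible to other users on the machine, so on a shared system it may be better to omit -p and let the shell prompt for it.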