PyMrGeo

Tim Tisler edited this page Dec 3, 2015 · 4 revisions

Proposal

Implement Python bindings for MrGeo. Allow a user to easily and seamlessly create Python scripts to interface with MrGeo running on a cluster. The bindings should discover map algebra commands (MapOps) automatically, so a MrGeo developer doesn't need to maintain both the Scala/Java code and the Python code in concert. The bindings should be simple enough that a user comfortable with simple Python can run MrGeo commands "with a line or two of code."

Background

Python is one of the main languages of data scientists (R is the other). A Python interface should therefore help the MrGeo community grow. Exposing map algebra through a full general-purpose language also lets scripts become much more complex without MrGeo having to implement and maintain its own scripting code. The Apache Spark project has taken a similar approach with great success.

Solution

There are many ways to interface Python with Java code running on the JVM. (A quick reminder that Scala code is ultimately compiled into JVM bytecode, so the two are compatible.) Three of the most promising are:

Jython, JCC, and Py4J

Jython

Jython is an implementation of the Python runtime written in pure Java. (The "typical" Python interpreter, CPython, is written in C.) After investigation, it was determined that Jython is better suited to running Python code from Java than to calling Java code from Python. Setting up and running Jython is also more complicated than with typical Python.

JCC

JCC is a C++ code generator, part of the PyLucene project, that produces the wrapper code needed to call Java classes from CPython through JNI. This solution is tempting because the JVM is embedded directly in the Python process, which cuts out the need to run and connect to a separate JVM process. Ultimately, though, it proved difficult to interface with a Spark cluster.

Py4J

Py4J (Python for Java) is an interface between Python and a running JVM. It inspects and manipulates objects in the JVM using Java reflection. For a Python user, the result is a simple, very Pythonic way of interacting with Java.

Py4J is the interface Spark uses for its Python bindings.
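
The name-based forwarding idea behind this kind of bridge can be sketched in plain Python. This is only an illustration of the pattern, not Py4J's implementation: `FakeJvmObject` and `RemoteProxy` are hypothetical names, and the "JVM side" is an ordinary Python object rather than a real JVM reached over a socket.

```python
class FakeJvmObject:
    """Hypothetical stand-in for an object that would live in the JVM."""

    def __init__(self, name):
        self.name = name

    def describe(self, units):
        return "{} in {}".format(self.name, units)


class RemoteProxy:
    """Forwards unknown method calls to a target, looked up by name at call time."""

    def __init__(self, target):
        self._target = target
        self.calls = []  # record of forwarded calls, for illustration only

    def __getattr__(self, method_name):
        # __getattr__ is only invoked for attributes not found the normal way,
        # so _target and calls are still resolved directly.
        def invoke(*args):
            self.calls.append((method_name, args))
            # Name-based lookup, loosely analogous to reflection on the Java side.
            return getattr(self._target, method_name)(*args)

        return invoke


proxy = RemoteProxy(FakeJvmObject("slope"))
print(proxy.describe("degrees"))
```

With Py4J itself, the corresponding entry point is `JavaGateway` from `py4j.java_gateway`, whose `jvm` attribute exposes Java packages as Python attributes; it needs a gateway server running in the JVM, which is why the sketch above uses a stand-in.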

Prototype

After investigating the three potential solutions and creating small tests of each, it was determined that Py4J would be the best route. A larger prototype was then created to test the ideas further.

On the MrGeo Java side, the setup couldn't be simpler. A new MrGeo command was added that opens the Py4J connection.

On the Python side, using the Spark code as a template, a Python class MrGeo was created; it is the entry point for running MrGeo commands. A RasterMapOp class was also created to hold the methods that interact with MrGeo rasters. (A VectorMapOp class will follow eventually, but it was not part of the prototype.)

During initialization, the MrGeo class discovers all available MapOps and adds them as methods to RasterMapOp.
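
A minimal sketch of how the discovery step described above could attach methods to a class at runtime. Everything here is an assumption made for illustration: `discover_mapops` returns a hard-coded list where the prototype would query the JVM, and the generated methods only record the call instead of forwarding it to MrGeo.

```python
class RasterMapOp:
    """Python-side raster handle; MapOp methods are attached at runtime."""

    def __init__(self, name, history=()):
        self.name = name
        self.history = tuple(history)


def discover_mapops():
    # Hypothetical stand-in: the real list would be discovered from the JVM.
    return ["slope", "aspect"]


def make_method(op_name):
    # A factory function gives each generated method its own op_name,
    # avoiding the late-binding pitfall of closures created in a loop.
    def method(self, *args):
        # A real binding would forward the call to MrGeo; this only records it.
        return RasterMapOp(
            "{}({})".format(op_name, self.name),
            self.history + ((op_name, args),),
        )

    method.__name__ = op_name
    return method


def register_mapops(cls, op_names):
    for op_name in op_names:
        setattr(cls, op_name, make_method(op_name))


register_mapops(RasterMapOp, discover_mapops())

ones = RasterMapOp("all-ones")
print(ones.slope().name)
```

Because the methods are generated from the discovered names, a MapOp added on the Scala/Java side would show up in Python without any hand-written wrapper.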

Thus, using MrGeo from Python is quite simple:

# Create an instance of MrGeo
mrgeo = MrGeo()

# Optional configuration: here, use debug mode (local, single-threaded)
mrgeo.useDebug()

# Start MrGeo
mrgeo.start()

# Load raster "all-ones"
ones = mrgeo.load_resource("all-ones")

# Run the slope calculation
slope = ones.slope()

# Load another raster - "all-hundreds"
hundreds = mrgeo.load_resource("all-hundreds")

# Run aspect
aspect = hundreds.aspect()

# Save the two results
slope.save("slope-test")
aspect.save("aspect-test")

# Stop MrGeo and do any cleanup
mrgeo.stop()

That's it! The prototype worked with these four MapOps (load, slope, aspect, save) in debug mode (local, single thread), local mode (local, multiple threads), and YARN mode (run on a pseudo-distributed local cluster).

NOTE: The prototype did not test everything we'll need to do, just enough to prove the approach is feasible.

Timeline

The basic approach is to design the pieces missing from the prototype, implement them, and finally test at scale.

| Task | Time |
| --- | --- |
| Design method to define overloaded MapOp methods in Python, e.g. slope(), slope("rad") | 1d |
| Design, prototype, and simple test operator overloading ("+", "-", "*", etc.) | 3d |
| Implement MapOp registration changes in core MrGeo | 7d |
| Clean up load_resource in Python | 2d |
| Implement a bulk load_resources to load a number of files into an array | 2d |
| Create an export MapOp to export MrGeo rasters from map algebra | 2d |
| Implement VectorMapOp in Python | 4d |
| Scale testing | 5d |
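
The first two design tasks in the timeline could take roughly the following shape in Python. This is a sketch under stated assumptions, not the planned design: the `Raster` class is a hypothetical stand-in that only builds an expression string, and the default unit `"deg"` is assumed purely for illustration.

```python
class Raster:
    """Hypothetical raster handle that only builds a map algebra expression string."""

    def __init__(self, expr):
        self.expr = expr

    def slope(self, units="deg"):
        # A default argument lets one method serve both slope() and slope("rad").
        # The default value "deg" is an assumption made for this sketch.
        return Raster("slope({}, {})".format(self.expr, units))

    def _binary(self, symbol, other):
        # Accept either another Raster or a plain constant on the right-hand side.
        right = other.expr if isinstance(other, Raster) else repr(other)
        return Raster("({} {} {})".format(self.expr, symbol, right))

    def __add__(self, other):
        return self._binary("+", other)

    def __sub__(self, other):
        return self._binary("-", other)

    def __mul__(self, other):
        return self._binary("*", other)


a = Raster("a")
b = Raster("b")
print(((a + b) * 2).slope("rad").expr)
```

Supporting a constant on the left-hand side (for example `2 * a`) would additionally need the reflected methods such as `__rmul__`, which are left out here for brevity.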