MrsPyramid V3 Proposal
This is a proposal to change the MrGeo image storage format in order to improve performance and simplify the code.
Currently, the Java Raster class is "baked in" to both the storage of image tiles and the actual image processing (getting and setting pixel values and creating new tiles). To read and write tiles from a data store (HDFS, S3 or Accumulo), we have to translate between a set of stored bytes and a Raster object.
The sequence when reading tiles is:
Read bytes --> Create Raster --> Copy bytes to Raster --> Use Raster for processing
And the sequence for writing tiles is:
Processed Raster --> Get pixel values in byte array --> Write bytes
Both reading and writing spend time translating between the Raster and the byte array. The tile's pixel data is also held in memory twice, once inside the Raster and once in the byte array, and the Java garbage collector has to reclaim both copies when they are no longer needed. This happens for every tile, and at 30m resolution there could be approximately 8.4 million tiles worldwide.
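To make the cost concrete, here is a minimal sketch of the two sequences above using the standard java.awt.image classes. The class and method names (RasterRoundTrip, toRaster, toBytes) are hypothetical, and the tile layout (one band of 32-bit integers, row-major, big-endian) is an assumption chosen for illustration. It is not MrGeo's actual serialization code.

```java
import java.awt.image.DataBuffer;
import java.awt.image.Raster;
import java.awt.image.WritableRaster;
import java.nio.ByteBuffer;

// Illustrative only: hypothetical helpers, not MrGeo's real read/write code.
final class RasterRoundTrip {

    // Read path: stored bytes -> create Raster -> copy bytes into Raster.
    static WritableRaster toRaster(byte[] stored, int width, int height) {
        // Allocates a second buffer the same size as the stored pixel data.
        WritableRaster raster =
            Raster.createBandedRaster(DataBuffer.TYPE_INT, width, height, 1, null);
        ByteBuffer in = ByteBuffer.wrap(stored);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                raster.setSample(x, y, 0, in.getInt()); // per-pixel translation
            }
        }
        return raster;
    }

    // Write path: processed Raster -> pixel values in a new byte array.
    static byte[] toBytes(Raster raster) {
        int width = raster.getWidth();
        int height = raster.getHeight();
        ByteBuffer out = ByteBuffer.allocate(width * height * Integer.BYTES);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                out.putInt(raster.getSample(x, y, 0)); // per-pixel translation
            }
        }
        return out.array();
    }
}
```

In this sketch each direction touches every pixel once and allocates a buffer as large as the tile, which is the overhead the proposal aims to remove.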
Most of the tile processing in MrGeo requires getting pixel values from one or more input tiles and setting pixel values in an output tile. There is no inherent requirement that we use the Raster class to accomplish that.
We propose creating a new MrGeoRaster class to replace the use of Raster. It will contain a reference to the byte array that is read from and written to the data store. The methods for getting and setting pixel values will directly modify the byte array. This approach will:
- save the processing time required for translating between stored bytes and Raster
- prevent allocating memory twice for tile data when reading/writing tiles
- save time by reducing the amount of memory to be garbage collected
- limit changes to existing code by mimicking a subset of the Raster API
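A rough sketch of what such a byte-backed class could look like follows. Everything here is an assumption for illustration: the class name ByteBackedRaster, the header-less layout (32-bit floats, band-sequential, row-major, big-endian), and the choice of methods. The accessors borrow the names getSampleFloat and setSample from the Java Raster API to show how call sites could stay largely unchanged. The real MrGeoRaster may differ.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; the actual MrGeoRaster layout and API may differ.
final class ByteBackedRaster {
    private final byte[] data;     // the same array read from / written to the data store
    private final ByteBuffer view; // typed view over data; wraps, does not copy
    private final int width;
    private final int height;
    private final int bands;

    ByteBackedRaster(byte[] data, int width, int height, int bands) {
        if (data.length != width * height * bands * Float.BYTES) {
            throw new IllegalArgumentException("byte array does not match tile dimensions");
        }
        this.data = data;
        this.view = ByteBuffer.wrap(data);
        this.width = width;
        this.height = height;
        this.bands = bands;
    }

    // Creates a zero-filled tile, analogous to creating a new writable raster.
    static ByteBackedRaster createEmpty(int width, int height, int bands) {
        return new ByteBackedRaster(new byte[width * height * bands * Float.BYTES],
                                    width, height, bands);
    }

    int getWidth() { return width; }
    int getHeight() { return height; }
    int getNumBands() { return bands; }

    // Reads a pixel straight out of the stored bytes.
    float getSampleFloat(int x, int y, int b) {
        return view.getFloat(offset(x, y, b));
    }

    // Writes a pixel straight into the stored bytes.
    void setSample(int x, int y, int b, float value) {
        view.putFloat(offset(x, y, b), value);
    }

    // The array to hand back to the data store; no conversion step needed.
    byte[] bytes() { return data; }

    private int offset(int x, int y, int b) {
        if (x < 0 || x >= width || y < 0 || y >= height || b < 0 || b >= bands) {
            throw new IndexOutOfBoundsException("pixel (" + x + ", " + y + ", band " + b + ")");
        }
        return ((b * height + y) * width + x) * Float.BYTES;
    }
}
```

Because reads and writes go through a view over the stored array, there is only one copy of the pixel data for the garbage collector to reclaim.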
To test the expected performance improvement, we made targeted changes to MrGeo that mimic the new approach along a narrow path of execution in our slope command. The results follow: the first set is for a small area, and the second set is for a very small ("tiny") image.
mrgeo mapalgebra -o slopetest -e"slope([/mrgeo/images/santiago-aster])" -l
Existing code:
Elapsed time: 00:02:09.572 (129572)
Time Serializing (toWritable): 00:00:07.301 (5.63%) calls: 5796 avg time/call: 1ms
Time Deserializing (toRaster): 00:00:08.498 (6.56%) calls: 6913 avg time/call: 1ms
Combined time : 00:00:15.799 (12.19%) calls: 12709 avg time/call: 1ms
Test code:
Elapsed time: 00:01:37.885 (97885)
Time Serializing (toWritable): 00:00:00.000 (0.00%) calls: 5796 avg time/call: 0ms
Time Deserializing (toMrGeoRaster): 00:00:00.500 (0.01%) calls: 6913 avg time/call: 0ms
Combined time : 00:00:00.500 (0.01%) calls: 12709 avg time/call: 0ms
Result: the test code ran in 75.5% of the original elapsed time, or 1.32x faster.
For the tiny image:
Existing code:
Elapsed time: 00:00:08.650
Time Serializing (toWritable): 00:00:00.261 (3.02%) calls: 96 avg time/call: 2ms
Time Deserializing (toRaster): 00:00:00.319 (3.69%) calls: 130 avg time/call: 2ms
Combined time : 00:00:00.580 (6.71%) calls: 226 avg time/call: 2ms
Test code:
Elapsed time: 00:00:06.774
Time Serializing (toWritable): 00:00:00.000 (0.00%) calls: 96 avg time/call: 0ms
Time Deserializing (toMrGeoRaster): 00:00:00.300 (0.04%) calls: 130 avg time/call: 0ms
Combined time : 00:00:00.300 (0.04%) calls: 226 avg time/call: 0ms
Result: the test code ran in 78.3% of the original elapsed time, or 1.28x faster.
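The two summary figures for each run can be derived from the elapsed times reported above: the percentage is the new elapsed time divided by the old, and the speedup is the inverse ratio. A small check, with the class and method names being hypothetical helpers:

```java
import java.util.Locale;

// Recomputes the summary figures from the elapsed times in the logs above.
final class SpeedupCheck {

    // Fraction of the original elapsed time that the new code needs.
    static double fractionOfOriginal(long beforeMs, long afterMs) {
        return (double) afterMs / beforeMs;
    }

    // How many times faster the new code is.
    static double speedup(long beforeMs, long afterMs) {
        return (double) beforeMs / afterMs;
    }

    static String summarize(long beforeMs, long afterMs) {
        return String.format(Locale.ROOT, "%.1f%% of original time, %.2fx faster",
            100.0 * fractionOfOriginal(beforeMs, afterMs), speedup(beforeMs, afterMs));
    }

    public static void main(String[] args) {
        System.out.println(summarize(129_572, 97_885)); // santiago-aster run
        System.out.println(summarize(8_650, 6_774));    // tiny image run
    }
}
```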
A new MrGeoRaster class will keep a reference to the same byte array that will be used to serialize the tile data. This class will expose a subset of the Raster API in order to minimize the impact within the MrGeo code - in many cases, we can just change a variable type and recompile.
The input and output format code will remain the same because the value type is RasterWritable which handles an array of bytes. The difference will be in how those bytes are handled during processing, but that will not affect the actual input and output formats.
We will need to write some new static methods in RasterWritable to convert between RasterWritable and MrGeoRaster, but these will be very fast because they do not require any byte-for-byte translation.
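The sketch below shows why such a conversion can be cheap: both objects hold a reference to the same array, so converting in either direction is a constant-time wrap. TileWritable and TileView are hypothetical stand-ins for RasterWritable and MrGeoRaster, and the single-band float layout is assumed. None of this is MrGeo's real API.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for RasterWritable: just carries the stored bytes.
final class TileWritable {
    private final byte[] bytes;

    TileWritable(byte[] bytes) { this.bytes = bytes; }

    byte[] getBytes() { return bytes; }

    // Constant time: hands the stored array to the view without copying.
    static TileView toTileView(TileWritable writable, int width, int height) {
        return new TileView(writable.getBytes(), width, height);
    }

    // Constant time: the view's array becomes the writable's payload.
    static TileWritable fromTileView(TileView view) {
        return new TileWritable(view.bytes());
    }
}

// Hypothetical stand-in for MrGeoRaster: one band of 32-bit floats, row-major.
final class TileView {
    private final byte[] data;
    private final ByteBuffer buffer;
    private final int width;
    private final int height;

    TileView(byte[] data, int width, int height) {
        if (data.length != width * height * Float.BYTES) {
            throw new IllegalArgumentException("byte array does not match tile dimensions");
        }
        this.data = data;
        this.buffer = ByteBuffer.wrap(data); // shares data, no copy
        this.width = width;
        this.height = height;
    }

    float get(int x, int y) {
        return buffer.getFloat(index(x, y));
    }

    void set(int x, int y, float value) {
        buffer.putFloat(index(x, y), value);
    }

    byte[] bytes() { return data; }

    private int index(int x, int y) {
        if (x < 0 || x >= width || y < 0 || y >= height) {
            throw new IndexOutOfBoundsException("pixel (" + x + ", " + y + ")");
        }
        return (y * width + x) * Float.BYTES;
    }
}
```

A pixel written through the view is immediately visible in the writable's bytes, so a processed tile can be written out without a separate serialization pass.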
We would like to retain the ability to read existing MrGeo images. It will just be slower to work with those images because we will have to translate to the new format while reading. However, we will deprecate this capability and remove it in a later release.
We want to keep methods in place for converting from RasterWritable to Raster. That allows processing code that currently works with Raster to continue to work while we progressively convert the processing code to use MrGeoRaster.
Most of our operations will require a simple change to switch to using MrGeoRaster, which will be quick. However, there is some processing that will require more work, such as cases where we have to write an image file (like WMS, WCS and export).
Input images for tests should be updated to the new format, and new tests will need to be added for verifying the reading of the current image format.