Use Apache Arrow to Standardize Data and Create an Interface Between Python and Scala #649

jbouffard · 2018-04-16T17:38:35Z

In order to send data back and forth between Python and Scala, we use Google's Protocol Buffers to serialize/deserialize data as it goes from one language to another. However, this causes a lot of overhead, which slows down performance because each piece of data must be converted to its respective representation in the target language. As a result, certain features available only in Python (scikit-learn, matplotlib, OpenCV-Python, etc.) cannot be fully taken advantage of, as the performance hit needed to make the data usable is too great. In addition, certain TiledRasterLayer methods aren't as performant as they could be because they have to serialize/deserialize the data in order to perform their operation.

One way around this would be to use Apache Arrow to provide an interface between Python and Scala. The main draw of this platform is that it provides a language-agnostic way of storing data in a columnar memory format. This allows Python and Scala to access the same data in memory without copying it, which would eliminate the need for serialization/deserialization.
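
To make the difference concrete, here is a minimal sketch of the two access patterns using only the Python standard library. It is not Arrow and not GeoPySpark code: `pickle` stands in for the Protocol Buffers round trip, `array` plus `memoryview` stand in for a shared columnar buffer, and the `cells` values are made up for illustration.

```python
import pickle
from array import array

# Hypothetical tile values stored as one flat, typed column of float64 cells.
cells = array("d", [1.0, 2.0, 3.0, 4.0])

# Serialization round trip (pickle is only a stand-in for Protocol Buffers):
# every value is encoded, then decoded into a brand-new object.
copied = pickle.loads(pickle.dumps(cells))
copied[0] = -1.0          # changes the copy only; cells[0] is untouched

# Zero-copy access: a memoryview exposes the same underlying buffer,
# so no values are converted or duplicated.
view = memoryview(cells)
view[1] = 42.0            # writes through to the original buffer

print(cells[0], copied[0])  # the original and the copy have diverged
print(cells[1])             # the write through the view is visible here
```

The first pattern pays a conversion cost proportional to the data size on every crossing, which is the overhead described above. The second is the kind of access a shared columnar layout is meant to enable, though how much of it carries over between a Python process and the JVM is exactly what would need testing.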

Currently, Apache Arrow is in a state where we can begin testing it in GeoPySpark, and it's something that we should do. However, we probably won't be seeing the full benefits of Arrow until this issue is resolved. Even with that issue, though, I feel that we should still see increases in performance.

jbouffard · 2018-04-19T14:59:28Z

Here's a gist that I'm using as a scratch-pad to figure out how to incorporate PyArrow into GPS: https://gist.github.com/jbouffard/1bafad67c3cd13325c6f89accd5bf1b6

jbouffard added enhancement discussion labels Apr 16, 2018
