Use Apache Arrow to Standardize Data and Create an Interface Between Python and Scala #649

Open
jbouffard opened this issue Apr 16, 2018 · 1 comment

@jbouffard
Collaborator

To send data back and forth between Python and Scala, we use Google's Protocol Buffers to serialize/deserialize data as it goes from one language to another. However, this causes a lot of overhead, which slows down performance because each piece of data has to be converted to its respective representation in the target language. As a result, certain features available only in Python (scikit-learn, matplotlib, OpenCV-Python, etc.) cannot be fully taken advantage of, as the performance hit needed to make the data usable is too great. In addition, certain TiledRasterLayer methods aren't as performant as they could be because they have to serialize/deserialize the data in order to perform their operation.

One way around this would be to use Apache Arrow to provide an interface between Python and Scala. The main draw of this platform is that it provides a language-agnostic way of storing data in a columnar in-memory format. This allows Python and Scala to access the same data in memory without copying it, which would eliminate the need for serialization/deserialization.
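
To make the difference concrete, here is a minimal sketch of the two approaches using only the Python standard library. It is not Arrow's API and not how GeoPySpark moves data today: `array` and `memoryview` are stand-ins for a contiguous typed buffer, `struct` is a stand-in for a serialization step, and the names `cells`, `round_trip_by_copy`, and `share_by_view` are made up for illustration.

```python
import struct
from array import array

# Hypothetical tile band: float64 cell values stored in one contiguous buffer.
cells = array("d", [1.0, 2.0, 3.0, 4.0])


def round_trip_by_copy(values):
    """Encode every value to bytes, then decode into a brand-new array.

    This stands in for a serialize/deserialize hop: the data is copied
    once into the payload and once more into the result.
    """
    fmt = f"<{len(values)}d"
    payload = struct.pack(fmt, *values)
    return array("d", struct.unpack(fmt, payload))


def share_by_view(values):
    """Expose the existing buffer as a typed view without copying it."""
    return memoryview(values)


copied = round_trip_by_copy(cells)
view = share_by_view(cells)

# A write to the original buffer is visible through the view,
# but the copy still holds the old value.
cells[0] = 42.0
print(copied[0], view[0])
```

The view costs the same no matter how large the buffer is, while the round trip grows with the number of cells. Arrow's columnar format aims for the view-style behavior across language boundaries rather than within a single process.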

Currently, Apache Arrow is in a state where we can begin testing it in GeoPySpark, and it's something that we should do. However, we probably won't be seeing the full benefits of Arrow until this issue is resolved. Even with that issue, though, I feel that we should still see increases in performance.

@jbouffard
Collaborator Author

Here's a gist that I'm using as a scratch-pad to figure out how to incorporate PyArrow into GPS: https://gist.github.com/jbouffard/1bafad67c3cd13325c6f89accd5bf1b6
