You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In order to send data back and forth between Python and Scala, we use Google's Protocol Buffers to serialize/deserialize data as it goes from language to another. However, this causes a lot of overheard which slows down performance due to each piece of data being converted to its respective representation in the target language. This makes it so that certain features available only in Python (scikit-learn, matplotlib, OpenCV-Python, etc.) cannot be fully taken advantage of as the performance hit needed to make the data usable is too great. In addition, certain TiledRasterLayer methods aren't as performant as they could be due to their requirement of serializing/deserializing the data in order to perform their operation.
One way around this would be to use Apache Arrow to provide an interface between Python and Scala. The main draw of this platform is that it provides a language agnostic way of storing data in a columnar memory format. This allows for Python and Scala to access the same data in memory without doing any copy-reads which will eliminate the need for serialization/deserialization.
Currently, Apache Arrow is state where we can begin testing it in GeoPySpark, and it's something that we should do. However, we probably won't be seeing the full benefits of Arrow until this issue is resolved. Even with that issue, though, I feel that we should still increases in performance.
The text was updated successfully, but these errors were encountered:
In order to send data back and forth between Python and Scala, we use Google's Protocol Buffers to serialize/deserialize data as it goes from language to another. However, this causes a lot of overheard which slows down performance due to each piece of data being converted to its respective representation in the target language. This makes it so that certain features available only in Python (scikit-learn, matplotlib, OpenCV-Python, etc.) cannot be fully taken advantage of as the performance hit needed to make the data usable is too great. In addition, certain
TiledRasterLayer
methods aren't as performant as they could be due to their requirement of serializing/deserializing the data in order to perform their operation.One way around this would be to use Apache Arrow to provide an interface between Python and Scala. The main draw of this platform is that it provides a language agnostic way of storing data in a columnar memory format. This allows for Python and Scala to access the same data in memory without doing any copy-reads which will eliminate the need for serialization/deserialization.
Currently, Apache Arrow is state where we can begin testing it in GeoPySpark, and it's something that we should do. However, we probably won't be seeing the full benefits of Arrow until this issue is resolved. Even with that issue, though, I feel that we should still increases in performance.
The text was updated successfully, but these errors were encountered: