Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: GeoMesa Support and GeoPySpark Refactoring #650

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

aheyne
Copy link
Contributor

@aheyne aheyne commented Apr 17, 2018

This is a WIP PR to drive comments and discussion about the proposed merge of geomesa_pyspark module of GeoMesa into GeoPySpark and the subsequent promotion of GeoPySpark from just GeoTrellis/VectorPipe Python bindings to a project that brings general Geo* support to Python/PySpark.

Major thing of note and discussion points.

  • YARN distribution of resources
    • Part of the functionality that has been ported over from geomesa_pyspark is the ability to package up runtime resources and ship them out when the spark context is started as a YARN job. This allows for jobs to be run on a cluster without requiring GeoPySpark to be installed on the worker nodes.
    • More testing needs to be done around this to ensure it works for GeoTrellis and VectorPipe.
  • GeoMesa bindings
    • Verified able to query out of GeoMesa on HBase.
    • The geomesa-*-spark-runtime jar is required for this to work. Integrating this into the install-jar workflow might help end user experience.
  • Discuss aligning VectorPipe PB / Shapely Geometries / SimpleFeatures
    • Allow for workflows such as pulling Polygons from GeoMesa to use in masking operation of GeoTrellis Tiles or creating VectorTiles from vectors pulled from GeoMesa. This is something that would probably need to be done on the Spark level and wired though Python.
  • Kryo Serialization Change
    • Changed to using GeoMesa's Kryo registration for simple features.
    • Need to verify this still works with VectorPipe.
    • Does this break backwards compatibility? Does that matter?

A beta release of this is available here. The GeoMesa pre-release that includes the complementary code is here with my working branch here. The changes in GeoMesa are to bind the pyUDT function in AbstractGeometryUDT to geopyspark.GeometryUDT. This fixes the Non issue we've been seeing.

I've included this in the README but for demonstration and reference purposes I'll include here a code sample of how to pull features from GeoMesa. This uses the YARN packaging mentioned above.

import geopyspark as gps
from pyspark import SparkContext
from pyspark.sql import SQLContext

conf = gps.geopyspark_conf(appName="test", master="yarn")
sc = SparkContext(conf=conf)
spark = SQLContext(sc)

params = { "hbase.catalog": "catalog" }
feature = "gdelt"
df = spark.read\
    .format("geomesa")\
    .options(**params)\
    .option("geomesa.feature", feature)\
    .load()

df.createOrReplaceTempView("gdelt")

spark.sql("select * from gdelt where st_contains(st_makeBBOX(0.0, 0.0, 90.0, 90.0), geom) limit 10").show()
+---------+--------------------+-------------------+--------------------+
|eventCode|          actor1Name|                dtg|                geom|
+---------+--------------------+-------------------+--------------------+
|      042|              TAIWAN|2017-01-01 00:00:00|POINT (6.73333 0....|
|      043|             VATICAN|2017-01-01 00:00:00|POINT (6.73333 0....|
|      161|SAO TOME AND PRIN...|2017-01-01 00:00:00|POINT (6.73333 0....|
|      042|              TAIWAN|2017-01-01 00:00:00|POINT (6.73333 0....|
|      043|              POLICE|2017-01-01 00:00:00|POINT (5.4851 5.4...|
|      160|CONSTITUTIONAL COURT|2017-01-01 00:00:00|POINT (8.78333 3.75)|
|      173|              PRISON|2017-01-01 00:00:00|POINT (8.78333 3.75)|
|      173|   EQUATORIAL GUINEA|2017-01-01 00:00:00|POINT (8.78333 3.75)|
|      042|            CAMEROON|2017-01-01 00:00:00|POINT (9.241 4.1527)|
|      051|             NIGERIA|2017-01-01 00:00:00|POINT (6.08333 4.75)|
+---------+--------------------+-------------------+--------------------+

aheyne added 21 commits March 20, 2018 12:27
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
Signed-off-by: Austin Heyne <[email protected]>
@jbouffard
Copy link
Collaborator

@aheyne Is this ready for review? It's marked as WIP, so I wasn't sure if there's anything else you'd like to add to this PR.

@aheyne
Copy link
Contributor Author

aheyne commented May 21, 2018

@jbouffard I'm planning on squashing and putting up a better PR if we decide to move forward on this. Interest is unclear to me and there are some outstanding questions; mainly the four discussion points above. There is also the more fundamental discrepancy in API level we're utilizing in the two projects (GeoPySpark being more RDD focused and GeoMesa_PySpark being Dataframe focused).

At the very least I think this is a good starting point to get a Python environment with both GeoMesa and GeoTrellis playing nice together. If we're okay with that and want to review/merge it so we can continue this line of development that'd be awesome.

@jbouffard
Copy link
Collaborator

@aheyne Ah, I see. At least on our end, I know there's definite interest in seeing GeoMesa_PySpark integrated into GPS. I've requested some time in these next few weeks to focus on this integration. I think that would be a good time to discuss your 4 points and the API discrepancies.

I'm okay with having this PR be the initial starting point for this integration. We could just mark it as experimental until a later point. What are your opinions @echeipesh @jpolchlo?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants