grid_tessellateexplode throws "OutOfMemoryError: Java heap space" with geometries that have SRID 28992 #531

Open
rickamsterdam opened this issue Feb 14, 2024 · 0 comments


Describe the bug
When I try to generate the index struct using grid_tessellateexplode with a geometry that uses SRID 28992, it throws an OutOfMemoryError. When I first ST_Transform the geometry to SRID 4326, it works.
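
For anyone hitting the same error, the workaround described above could be sketched roughly as follows. This is only a sketch: `tessellate_in_wgs84`, the `geom_4326` column name, and the `geom_col` parameter are hypothetical names made up for illustration. It assumes a cluster where Mosaic is installed and enabled, and that the geometry struct already carries its SRID (28992 here).

```python
def tessellate_in_wgs84(df, resolution, geom_col="geometry"):
    """Reproject to EPSG:4326 first, then tessellate.

    Hypothetical helper for illustration only; it is not part of Mosaic.
    """
    # Imported lazily so this definition loads even without a Spark/Mosaic runtime.
    import mosaic as mos
    from pyspark.sql.functions import col, lit

    return (
        df
        # Transform from the SRID stored on the geometry (28992) to WGS84 degrees.
        .withColumn("geom_4326", mos.st_transform(col(geom_col), lit(4326)))
        # Serialise the reprojected geometry to WKB, as in the original snippet.
        .withColumn("bs_wkb", mos.st_aswkb(col("geom_4326")))
        .withColumn(
            "mosaic_index",
            mos.grid_tessellateexplode(col("bs_wkb"), lit(resolution)),
        )
    )
```

It would be called as `tessellate_in_wgs84(df, optimal_resolution)`, where `df` is the table read in the snippet below.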

Also, why do I need to pass WKB to mos.grid_tessellateexplode? Why doesn't it accept a geometry struct?

Code:

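import mosaic as mos
from pyspark.sql.functions import col, lit

# optimal_resolution is defined elsewhere in the notebook (not shown in this report)
bs_indexed = (
    spark.read.table(f"...beschermde_stads_dorpsgezichten")
    # serialise the geometry struct to WKB before tessellating
    .withColumn("bs_wkb", mos.st_aswkb("geometry"))
    .withColumn("mosaic_index", mos.grid_tessellateexplode(col("bs_wkb"), lit(optimal_resolution)))
    .select("naam", "bs_wkb", "geometry")
)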

Geometry:

{"type_id": 5, "srid": 28992, "boundary": [[[116771.70410447767, 487949.6659262598], [116844.83756018449, 488222.5999227307], [116876.56025659517, 488344.9735636049], [116880.24992223541, 488381.8864943112], [116877.05914440888, 488488.84105297644], [117197.2581056123, 488495.7238242906], [117234.70637364508, 488495.77083400957], [117596.74725183658, 488399.8804343442], [117429.51060969828, 487772.3687708253], [116771.70410447767, 487949.6659262598], [116771.70410447767, 487949.6659262598]]], "holes": [[]]}
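
A possible explanation, offered as an assumption rather than a confirmed diagnosis: H3 works with latitude/longitude in degrees, while the coordinates above are EPSG:28992 values in metres. If those values reach the H3 polyfill untransformed, they fall far outside any valid degree range. The small check below illustrates this using the first four vertices of the ring; `looks_like_lon_lat` and `bounding_box` are hypothetical helper names, not Mosaic functions.

```python
# First four vertices of the outer ring from the geometry in this issue
# (EPSG:28992, units are metres).
ring = [
    (116771.70410447767, 487949.6659262598),
    (116844.83756018449, 488222.5999227307),
    (116876.56025659517, 488344.9735636049),
    (116880.24992223541, 488381.8864943112),
]


def looks_like_lon_lat(points):
    """Return True if every (x, y) pair fits inside the WGS84 degree ranges."""
    return all(-180.0 <= x <= 180.0 and -90.0 <= y <= 90.0 for x, y in points)


def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


print(looks_like_lon_lat(ring))  # False: these are not degrees
print(bounding_box(ring))
```

A check like this could run before tessellation to catch geometries that still need reprojecting.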

To Reproduce

  1. Load a GeoJSON file that contains geometries with SRID 28992
  2. Call grid_tessellateexplode on the geometries

Expected behavior
grid_tessellateexplode handles geometries with SRID 28992.

Additional context

  • mosaic: version 0.4.0
  • Databricks Runtime Version: 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12)
  • worker type: Standard_DS3_v2

Stack trace:

OutOfMemoryError: Java heap space
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.4 in stage 8.0 (TID 19) (10.68.24.73 executor 8): java.lang.OutOfMemoryError: Java heap space
	at com.uber.h3core.H3Core.polyfill(H3Core.java:689)
	at com.databricks.labs.mosaic.core.index.H3IndexSystem$.$anonfun$polyfill$1(H3IndexSystem.scala:127)
	at com.databricks.labs.mosaic.core.index.H3IndexSystem$.$anonfun$polyfill$1$adapted(H3IndexSystem.scala:123)
	at com.databricks.labs.mosaic.core.index.H3IndexSystem$$$Lambda$2599/2011913190.apply(Unknown Source)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.TraversableLike$$Lambda$97/33779587.apply(Unknown Source)
	at scala.collection.immutable.Range.foreach(Range.scala:158)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at com.databricks.labs.mosaic.core.index.H3IndexSystem$.polyfill(H3IndexSystem.scala:123)
	at com.databricks.labs.mosaic.core.Mosaic$.mosaicFill(Mosaic.scala:80)
	at com.databricks.labs.mosaic.core.Mosaic$.getChips(Mosaic.scala:33)
	at com.databricks.labs.mosaic.expressions.index.MosaicExplode.eval(MosaicExplode.scala:76)
	at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$4(GenerateExec.scala:99)
	at org.apache.spark.sql.execution.GenerateExec$$Lambda$2475/1703867331.apply(Unknown Source)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:224)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$5(UnsafeRowBatchUtils.scala:88)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$$$Lambda$2565/696032892.apply$mcV$sp(Unknown Source)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$3(UnsafeRowBatchUtils.scala:88)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$$$Lambda$2564/907227846.apply$mcV$sp(Unknown Source)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$1(UnsafeRowBatchUtils.scala:68)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3588)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3519)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3506)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3506)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1516)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1516)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1516)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3835)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3747)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3735)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$runJob$1(DAGScheduler.scala:1240)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1228)
	at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2959)
	at org.apache.spark.sql.execution.collect.Collector.$anonfun$runSparkJobs$1(Collector.scala:338)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
	at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:282)
	at org.apache.spark.sql.execution.collect.Collector.$anonfun$collect$1(Collector.scala:366)
	at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
	at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:363)
	at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:117)
	at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:124)
	at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:126)
	at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:114)
	at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:94)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$computeResult$1(ResultCacheManager.scala:553)
	at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.collectResult$1(ResultCacheManager.scala:545)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.computeResult(ResultCacheManager.scala:565)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:426)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:419)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:313)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeCollectResult$1(SparkPlan.scala:519)
	at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
	at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:516)
	at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3628)
	at org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3619)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$3(Dataset.scala:4544)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:945)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4542)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:282)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:510)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:209)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1113)
	at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:152)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:459)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4542)
	at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3618)
	at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:267)
	at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:101)
	at com.databricks.backend.daemon.driver.PythonDriverLocalBase.generateTableResult(PythonDriverLocalBase.scala:773)
	at com.databricks.backend.daemon.driver.JupyterDriverLocal.computeListResultsItem(JupyterDriverLocal.scala:1105)
	at com.databricks.backend.daemon.driver.JupyterDriverLocal$JupyterEntryPoint.addCustomDisplayData(JupyterDriverLocal.scala:261)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.OutOfMemoryError: Java heap space
	at com.uber.h3core.H3Core.polyfill(H3Core.java:689)
	at com.databricks.labs.mosaic.core.index.H3IndexSystem$.$anonfun$polyfill$1(H3IndexSystem.scala:127)
	at com.databricks.labs.mosaic.core.index.H3IndexSystem$.$anonfun$polyfill$1$adapted(H3IndexSystem.scala:123)
	at com.databricks.labs.mosaic.core.index.H3IndexSystem$$$Lambda$2599/2011913190.apply(Unknown Source)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.TraversableLike$$Lambda$97/33779587.apply(Unknown Source)
	at scala.collection.immutable.Range.foreach(Range.scala:158)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at com.databricks.labs.mosaic.core.index.H3IndexSystem$.polyfill(H3IndexSystem.scala:123)
	at com.databricks.labs.mosaic.core.Mosaic$.mosaicFill(Mosaic.scala:80)
	at com.databricks.labs.mosaic.core.Mosaic$.getChips(Mosaic.scala:33)
	at com.databricks.labs.mosaic.expressions.index.MosaicExplode.eval(MosaicExplode.scala:76)
	at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$4(GenerateExec.scala:99)
	at org.apache.spark.sql.execution.GenerateExec$$Lambda$2475/1703867331.apply(Unknown Source)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:224)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$5(UnsafeRowBatchUtils.scala:88)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$$$Lambda$2565/696032892.apply$mcV$sp(Unknown Source)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$3(UnsafeRowBatchUtils.scala:88)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$$$Lambda$2564/907227846.apply$mcV$sp(Unknown Source)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$1(UnsafeRowBatchUtils.scala:68)