
feat(pyspark): add official support and ci testing with spark connect #10187

Merged: 21 commits merged into ibis-project:main on Sep 23, 2024

Conversation

@cpcloud cpcloud (Member) commented Sep 20, 2024

Description of changes

This PR adds testing for using the pyspark Ibis backend with Spark Connect.

This is done by running a Spark Connect instance as a Docker Compose service, similar to our other client-server backends.

The primary bit of functionality that isn't tested is UDFs (which means JSON unwrapping is also not tested, because that's implemented as a UDF). UDFs effectively require a clone of the Python environment on the server, and that seems out of scope for initial Spark Connect support.

WIP for now.

I wanted to get some feedback on the testing approach, which is basically to set up fixtures differently depending on the value of the SPARK_REMOTE environment variable.
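For concreteness, here is a minimal sketch of what that could look like. Only the SPARK_REMOTE / IS_SPARK_REMOTE names come from this PR; the fixture name, scope, and connection URL are illustrative assumptions, not the actual conftest code.

```python
# Hypothetical conftest-style fixture: pick a local session or a Spark Connect
# session depending on whether SPARK_REMOTE is set.
import os

import pytest
from pyspark.sql import SparkSession

IS_SPARK_REMOTE = bool(os.environ.get("SPARK_REMOTE"))


@pytest.fixture(scope="session")
def spark_session():
    if IS_SPARK_REMOTE:
        # e.g. SPARK_REMOTE="sc://localhost:15002", pointing at the
        # docker-compose Spark Connect service
        session = SparkSession.builder.remote(os.environ["SPARK_REMOTE"]).getOrCreate()
    else:
        session = SparkSession.builder.master("local[*]").getOrCreate()
    yield session
    session.stop()
```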

Issues closed

@cpcloud cpcloud added this to the 10.0 milestone Sep 20, 2024
@cpcloud cpcloud added the feature (Features or general enhancements) label Sep 20, 2024
@github-actions github-actions bot added the tests (Issues or PRs related to tests), ci (Continuous Integration issues or PRs), and pyspark (The Apache PySpark backend) labels Sep 20, 2024
cpcloud (Member, Author) commented:

I deleted this file, because all of the tests are redundant with other array tests that we have in the main backend test suite.

@@ -116,7 +119,22 @@ def test_alias_after_select(t, df):


def test_interval_columns_invalid(con):
msg = r"DayTimeIntervalType\(0, 1\) couldn't be converted to Interval"
df_interval_invalid = con._session.createDataFrame(
cpcloud (Member, Author) commented:

I moved the setup here, because this is the only place this table is used.

@@ -761,7 +765,6 @@ def _create_cached_table(self, name, expr):
def _drop_cached_table(self, name):
self._session.catalog.dropTempView(name)
t = self._cached_dataframes.pop(name)
assert t.is_cached
cpcloud (Member, Author) commented:

I'm not sure why this assert fails with Spark Connect. Identical objects (identical according to id()) report different values for this property over time. That said, I haven't looked into how this property is implemented yet.

["pyspark"],
condition=IS_SPARK_REMOTE,
raises=AssertionError,
reason="somehow, transformed results are different types",
cpcloud (Member, Author) commented:

I believe these are all the result of Spark Connect unconditionally using Arrow for transport, which is great long term, but not compatible with a bunch of our existing array tests that expect Python Nones and not numpy.nans.
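For illustration (not part of the PR), the kind of mismatch being described can be seen with pyarrow alone: a nullable float column round-tripped through Arrow comes back from pandas with nan where a plain Python list would carry None.

```python
# Illustrative only: nulls survive as None in Arrow's Python view, but become
# numpy.nan once the data is materialized as a pandas float column.
import pyarrow as pa

arr = pa.array([1.0, None, 3.0])
print(arr.to_pylist())           # [1.0, None, 3.0]
print(arr.to_pandas().tolist())  # [1.0, nan, 3.0]
```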

["pyspark"],
condition=IS_SPARK_REMOTE,
raises=PySparkConnectGrpcException,
reason="arrow conversion breaks",
cpcloud (Member, Author) commented:

Non-duration intervals in Arrow are extremely hard to use; I'm surprised only these two test cases are failing here.

@cpcloud cpcloud marked this pull request as ready for review September 20, 2024 19:00
@gforsyth gforsyth (Member) left a comment

I'm of two minds on this one. We shouldn't add another backend for what is effectively just a change in the connection method, and the test suite isn't really designed around changes in connection method this way.

On the other hand, I'm not hugely thrilled about needing to add condition kwargs to some of our xfail markers. My worry there is specifically having to pull pyspark into a separate mark definition because we need a condition to xfail only one mode of pyspark failures, which seems cluttered and annoying. But that hasn't happened here, and if anything I would expect the two connection modes to tend toward feature parity, not away from it.
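For readers skimming the thread, the pattern under discussion is roughly the following. The marker name and test body here are assumptions for illustration; only the keyword arguments mirror the diff fragments quoted above.

```python
# Sketch of a condition-gated xfail-style marker, assuming an ibis-style
# custom "notyet" mark; IS_SPARK_REMOTE mirrors the PR's fixtures.
import os

import pytest

IS_SPARK_REMOTE = bool(os.environ.get("SPARK_REMOTE"))


@pytest.mark.notyet(
    ["pyspark"],
    condition=IS_SPARK_REMOTE,  # only expected to fail when running against Spark Connect
    raises=AssertionError,
    reason="somehow, transformed results are different types",
)
def test_array_transform_roundtrip(con):  # hypothetical test name
    ...
```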

I think this can go in -- it is certainly running the test suite against both connection methods. If I think of something better, we can always revisit and refactor.

Thanks for slogging through all of this!

@cpcloud cpcloud force-pushed the spark-connect branch 3 times, most recently from 260d1b1 to b02fb34, on September 23, 2024 09:56
@cpcloud cpcloud (Member, Author) commented Sep 23, 2024

Lucky for us there's a deadlock happening during an invocation of _finalize_memtable 🙄

Traceback (most recent call last):
  File "/nix/store/h3i0acpmr8mrjx07519xxmidv8mpax4y-python3-3.12.5/lib/python3.12/weakref.py", line 666, in _exitfunc
    f()
  File "/nix/store/h3i0acpmr8mrjx07519xxmidv8mpax4y-python3-3.12.5/lib/python3.12/weakref.py", line 590, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cloud/src/ibis/ibis/backends/__init__.py", line 1095, in _finalize_in_memory_table
    self._finalize_memtable(name)
  File "/home/cloud/src/ibis/ibis/backends/pyspark/__init__.py", line 459, in _finalize_memtable
    self._session.catalog.dropTempView(name)
  File "/nix/store/0gr6r5154zyrww5cjj4kybd50vsqp8qn-python3-3.12.5-env/lib/python3.12/site-packages/pyspark/sql/connect/catalog.py", line 266, in dropTempView
    pdf = self._execute_and_fetch(plan.DropTempView(view_name=viewName))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/0gr6r5154zyrww5cjj4kybd50vsqp8qn-python3-3.12.5-env/lib/python3.12/site-packages/pyspark/sql/connect/catalog.py", line 49, in _execute_and_fetch
    pdf = DataFrame.withPlan(catalog, session=self._sparkSession).toPandas()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/0gr6r5154zyrww5cjj4kybd50vsqp8qn-python3-3.12.5-env/lib/python3.12/site-packages/pyspark/sql/connect/dataframe.py", line 1663, in toPandas
    return self._session.client.to_pandas(query)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/0gr6r5154zyrww5cjj4kybd50vsqp8qn-python3-3.12.5-env/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py", line 869, in to_pandas
    (self_destruct_conf,) = self.get_config_with_defaults(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/0gr6r5154zyrww5cjj4kybd50vsqp8qn-python3-3.12.5-env/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py", line 1349, in get_config_with_defaults
    configs = dict(self.config(op).pairs)
                   ^^^^^^^^^^^^^^^
  File "/nix/store/0gr6r5154zyrww5cjj4kybd50vsqp8qn-python3-3.12.5-env/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py", line 1370, in config
    resp = self._stub.Config(req, metadata=self._builder.metadata())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/0gr6r5154zyrww5cjj4kybd50vsqp8qn-python3-3.12.5-env/lib/python3.12/site-packages/grpc/_channel.py", line 1178, in __call__
    ) = self._blocking(
        ^^^^^^^^^^^^^^^
  File "/nix/store/0gr6r5154zyrww5cjj4kybd50vsqp8qn-python3-3.12.5-env/lib/python3.12/site-packages/grpc/_channel.py", line 1146, in _blocking
    call = self._channel.segregated_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 547, in grpc._cython.cygrpc.Channel.segregated_call
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 403, in grpc._cython.cygrpc._segregated_call
  File "/nix/store/h3i0acpmr8mrjx07519xxmidv8mpax4y-python3-3.12.5/lib/python3.12/threading.py", line 300, in __enter__
    return self._lock.__enter__()

@cpcloud cpcloud (Member, Author) commented Sep 23, 2024

@gforsyth I know you already approved, but I wanted to get your thoughts/questions on having to remove memtable finalization for PySpark, to avoid what is apparently a deadlock when trying to invoke dropTempView during memtable finalization.

@gforsyth gforsyth (Member) commented:

Hmm, this seems like a variation on our usual "don't design to the lowest common denominator", but again, having an entirely separate backend just for a different connection method is gross.

In the interest of not kneecapping non-spark-connect-spark, do we want to add in a hacky check that we're running on Spark Connect and use that to set something on the backend instance? And then define our finalizer accordingly?

@cpcloud cpcloud (Member, Author) commented Sep 23, 2024

Oh, I guess we can check the SparkSession type 😬
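Something like the following is presumably what's meant. This is a minimal sketch, not the PR's actual implementation; only dropTempView and _finalize_memtable appear in the diff and traceback above, and the helper name is made up.

```python
# Hypothetical sketch: gate the temp-view cleanup on whether the session is a
# Spark Connect session, since dropTempView during finalization appears to
# deadlock there (see the traceback above).
def _is_spark_connect(session) -> bool:
    try:
        from pyspark.sql.connect.session import SparkSession as ConnectSession
    except ImportError:  # pyspark builds without Spark Connect support
        return False
    return isinstance(session, ConnectSession)


def _finalize_memtable(self, name: str) -> None:
    if _is_spark_connect(self._session):
        # no-op: the temp views get cleaned up on process termination anyway
        return
    self._session.catalog.dropTempView(name)
```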

@cpcloud cpcloud (Member, Author) commented Sep 23, 2024

Worth a shot if it works, until we can get some user feedback on how people are deploying it.

@cpcloud cpcloud (Member, Author) commented Sep 23, 2024

I guess one argument (slightly) in favor of no-op is that these views all get cleaned up on process termination.

@gforsyth gforsyth (Member) commented:

> I guess one argument (slightly) in favor of no-op is that these views all get cleaned up on process termination.

Yeah, I don't know whether this is more an issue of "purity" (we should clean up when we can) vs. people having long-running Spark sessions that they want to keep "clean".

@gforsyth gforsyth merged commit abb5593 into ibis-project:main Sep 23, 2024
78 checks passed
@cpcloud cpcloud deleted the spark-connect branch September 23, 2024 21:08
ncclementi pushed a commit to ncclementi/ibis that referenced this pull request Sep 24, 2024