feat(pyspark): add official support and ci testing with spark connect #10187
Conversation
I deleted this file, because all of the tests are redundant with other array tests that we have in the main backend test suite.
@@ -116,7 +119,22 @@ def test_alias_after_select(t, df):


def test_interval_columns_invalid(con):
    msg = r"DayTimeIntervalType\(0, 1\) couldn't be converted to Interval"
    df_interval_invalid = con._session.createDataFrame(
I moved the setup here, because this is the only place this table is used.
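For context, a minimal sketch of the kind of setup being discussed, assuming a plain `SparkSession` named `spark` and a made-up column name (the PR's actual code is truncated in the diff above): `DayTimeIntervalType(0, 1)` is Spark's "day to hour" interval, which the backend cannot convert to an Ibis `Interval`.

```python
from pyspark.sql import SparkSession

# Hypothetical setup, not the PR's exact code: a one-column DataFrame whose
# type is DayTimeIntervalType(0, 1), i.e. "interval day to hour".
spark = SparkSession.builder.getOrCreate()
df_interval_invalid = spark.sql("SELECT INTERVAL '1 2' DAY TO HOUR AS interval_col")
df_interval_invalid.printSchema()
# root
#  |-- interval_col: interval day to hour (nullable = false)
```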
Force-pushed from c39f7b2 to 48ff420
@@ -761,7 +765,6 @@ def _create_cached_table(self, name, expr):
    def _drop_cached_table(self, name):
        self._session.catalog.dropTempView(name)
        t = self._cached_dataframes.pop(name)
        assert t.is_cached
I'm not sure why this `assert` fails with spark connect. Identical objects (computed using `id`) have different values of this property across time. That said, I haven't looked into how this property is implemented yet.
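A minimal repro sketch of the behavior described, assuming a Spark Connect server reachable via `SPARK_REMOTE` (this is not code from the PR):

```python
import os

from pyspark.sql import SparkSession

# Assumes SPARK_REMOTE points at a running Spark Connect server,
# e.g. "sc://localhost:15002".
spark = SparkSession.builder.remote(os.environ["SPARK_REMOTE"]).getOrCreate()

df = spark.range(10)
df.cache()

# Locally this property is a stable True after cache(); over Spark Connect the
# same object (same id) has been observed to report different values across
# reads, which is why the `assert t.is_cached` above was dropped.
print(id(df), df.is_cached)
print(id(df), df.is_cached)
```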
["pyspark"], | ||
condition=IS_SPARK_REMOTE, | ||
raises=AssertionError, | ||
reason="somehow, transformed results are different types", |
I believe these are all the result of Spark Connect unconditionally using Arrow for transport, which is great long term, but not compatible with a bunch of our existing array tests that want Python `None`s and not `numpy.nan`s.
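A toy illustration of the mismatch (not from the test suite): element-wise comparison fails because `numpy.nan` equals neither `None` nor itself.

```python
import numpy as np

# What the array tests expect from the classic collection path...
expected = [1.0, None, 3.0]
# ...versus what an Arrow-based transport tends to hand back for a nullable
# float column.
got = [1.0, np.nan, 3.0]

assert expected != got   # None is not nan
assert np.nan != np.nan  # and nan doesn't even compare equal to itself
```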
["pyspark"], | ||
condition=IS_SPARK_REMOTE, | ||
raises=PySparkConnectGrpcException, | ||
reason="arrow conversion breaks", |
non-duration intervals in arrow are extremely hard to use, i'm surprised only these two test cases are failing here.
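A rough illustration, not from the PR, of why they're painful: Arrow's non-duration interval is the three-component `month_day_nano` type, which has no clean mapping to `datetime.timedelta` or pandas dtypes.

```python
import pyarrow as pa

# A calendar interval carries three independent components (months, days,
# nanoseconds); neither pandas nor datetime.timedelta can represent the month
# part, so round-tripping these values is awkward.
val = pa.MonthDayNano([1, 15, 3_000_000_000])  # 1 month, 15 days, 3 seconds
scalar = pa.scalar(val, type=pa.month_day_nano_interval())
print(scalar.as_py())  # MonthDayNano(months=1, days=15, nanoseconds=3000000000)
```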
I'm of two minds on this one. We shouldn't add another backend for what is effectively just a change in the connection method, and the test suite isn't really designed around changes in connection method this way.
On the other hand, I'm not hugely thrilled with needing to add `condition` kwargs to some of our xfail markers. My worry there is specifically needing to pull `pyspark` into a separate mark definition because we need to add `condition` to only xfail one mode of pyspark failures, which seems cluttered and annoying. But also, it hasn't happened here and if anything, I would expect that the two connection modes would tend towards feature parity and not away from it.
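For illustration, here is what a condition-gated marker looks like in plain pytest terms; the suite's real markers are ibis-specific, and the test name below is a placeholder.

```python
import os

import pytest

IS_SPARK_REMOTE = bool(os.environ.get("SPARK_REMOTE"))


# Only the Spark Connect mode of the pyspark backend is expected to fail;
# the classic in-process mode still runs the test normally.
@pytest.mark.xfail(
    condition=IS_SPARK_REMOTE,
    raises=AssertionError,
    reason="somehow, transformed results are different types",
)
def test_array_transform_results():
    ...
```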
I think this can go in -- it is certainly running the test suite against both connection methods. If I think of something better, we can always revisit and refactor.
Thanks for slogging through all of this!
Force-pushed from 260d1b1 to b02fb34
Lucky for us there's a deadlock happening during an invocation of […]
Force-pushed from b02fb34 to ce64ba6
@gforsyth I know you already approved, but wanted to get your thoughts/questions on having to remove memtable finalization for PySpark, to avoid what apparently is a deadlock when trying to invoke […]
Hmm, this seems like a variation on our usual "don't design to the lowest common denominator", but again, having an entirely separate backend just for a different connection method is gross. In the interest of not kneecapping non-spark-connect-spark, do we want to add in a hacky check that we're running on Spark Connect and use that to set something on the backend instance? And then define our finalizer accordingly?
Oh, I guess we can check the […]
Worth a shot if it works, until we can get some user feedback on how people are deploying it.
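One hypothetical shape for such a check (the attribute actually being referred to above is truncated, so this is a guess at the approach rather than the PR's code):

```python
def _is_spark_connect(session) -> bool:
    # The Spark Connect SparkSession lives in its own module, which only
    # imports cleanly when the connect extras are installed.
    try:
        from pyspark.sql.connect.session import SparkSession as ConnectSession
    except ImportError:
        return False
    return isinstance(session, ConnectSession)
```

The backend could then register the temp-view-dropping finalizer only for classic sessions and make it a no-op under Spark Connect.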
I guess one argument (slightly) in favor of no-op is that these views all get cleaned up on process termination.
Yeah, I don't know if this is more an issue of "purity" in that we should clean up when we can vs. do people have long-running spark sessions that they want to keep "clean".
Force-pushed from fd03718 to 0f506ae
Description of changes
This PR adds testing for using the pyspark Ibis backend with spark-connect.
This is done by running a Spark Connect instance as a docker compose service, similar to our other client-server model backends.
The primary bit of functionality that isn't tested is UDFs (which means JSON unwrapping is also not tested, because that's implemented as a UDF).
These effectively require a clone of the Python environment on the server, and that seems out of scope for initial support of spark connect.
WIP for now.
Wanted to get some feedback on the testing approach, which is basically to set up fixtures differently depending on the value of the `SPARK_REMOTE` environment variable.
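A sketch of that fixture pattern, with assumed names (`spark_session`, the local-mode fallback) that may not match the PR's conftest exactly:

```python
import os

import pytest
from pyspark.sql import SparkSession

IS_SPARK_REMOTE = bool(os.environ.get("SPARK_REMOTE"))


@pytest.fixture(scope="session")
def spark_session():
    if IS_SPARK_REMOTE:
        # e.g. SPARK_REMOTE="sc://localhost:15002", pointing at the
        # spark-connect docker compose service
        return SparkSession.builder.remote(os.environ["SPARK_REMOTE"]).getOrCreate()
    return SparkSession.builder.master("local[*]").getOrCreate()
```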
Issues closed