forked from apache/spark
[pull] master from apache:master #441
Merged
Conversation
…supporting both columnar and row output

### What changes were proposed in this pull request?
This patch fixes a corner case in `ApplyColumnarRulesAndInsertTransitions`. When a plan is required to output rows and the node supports both columnar and row output, the rule currently adds a redundant `ColumnarToRowExec` to its upstream.

### Why are the changes needed?
To avoid adding a redundant `ColumnarToRowExec`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #50239 from viirya/fix_columnar.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
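The corner case can be illustrated with a toy model. The `Node` and `to_rows` names below are invented for illustration and are not Spark's actual API; the point is only the decision rule: wrap a child in a columnar-to-row transition only when the child cannot emit rows itself.

```python
# Illustrative toy model of the transition-insertion decision; names and
# structure are simplified assumptions, not Spark's implementation.

class Node:
    """A tiny stand-in for a SparkPlan node."""
    def __init__(self, name, supports_columnar, supports_rows=True, children=None):
        self.name = name
        self.supports_columnar = supports_columnar
        self.supports_rows = supports_rows
        self.children = children or []

def to_rows(node):
    """Make `node` produce row output, inserting a transition only when needed."""
    if node.supports_rows:
        # The corner case the patch addresses: a node supporting BOTH formats
        # can emit rows directly, so no ColumnarToRow wrapper is required.
        return node
    # Columnar-only node: wrap it in a (toy) ColumnarToRow transition.
    return Node("ColumnarToRow", supports_columnar=False, children=[node])
```

A node that supports both formats is returned unchanged, while a columnar-only node gets wrapped, which mirrors the redundancy the patch removes.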
### What changes were proposed in this pull request?
This PR proposes to reformat error classes by:

```python
from pyspark.errors.exceptions import _write_self; _write_self()
```

### Why are the changes needed?
To remove the diff from auto-formatted output.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing CI

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50241 from HyukjinKwon/minor-format.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
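The general idea behind such a self-rewriting generator, a file that is re-serialized canonically so hand edits never diverge from the generated form, can be sketched in plain Python. The helper below is hypothetical and is not PySpark's `_write_self` implementation:

```python
import json

def rewrite_json_canonically(text):
    """Re-serialize JSON with sorted keys and fixed indentation so that
    regenerating the file always produces a byte-identical result.

    Hypothetical sketch of the canonical-rewrite idea, not PySpark's code."""
    data = json.loads(text)
    return json.dumps(data, indent=2, sort_keys=True) + "\n"
```

Running the rewrite twice yields the same bytes, which is exactly the property that makes the "diff from auto formatted output" disappear.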
…s UDFs (pyarrow 17.0.0, pandas 2.2.3, Python 3.11)

### What changes were proposed in this pull request?
This PR updates the chart generated at [SPARK-25666](https://issues.apache.org/jira/browse/SPARK-25666).

### Why are the changes needed?
To track the changes in type coercion of PySpark <> PyArrow <> pandas.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Use this code to generate the chart:

```python
from pyspark.sql.types import *
from pyspark.sql.functions import pandas_udf

columns = [
    ('none', 'object(NoneType)'),
    ('bool', 'bool'),
    ('int8', 'int8'),
    ('int16', 'int16'),
    ('int32', 'int32'),
    ('int64', 'int64'),
    ('uint8', 'uint8'),
    ('uint16', 'uint16'),
    ('uint32', 'uint32'),
    ('uint64', 'uint64'),
    ('float16', 'float16'),
    ('float32', 'float32'),
    ('float64', 'float64'),
    ('date', 'datetime64[ns]'),
    ('tz_aware_dates', 'datetime64[ns, US/Eastern]'),
    ('string', 'object(string)'),
    ('decimal', 'object(Decimal)'),
    ('array', 'object(array[int32])'),
    ('float128', 'float128'),
    ('complex64', 'complex64'),
    ('complex128', 'complex128'),
    ('category', 'category'),
    ('tdeltas', 'timedelta64[ns]'),
]

def create_dataframe():
    import pandas as pd
    import numpy as np
    import decimal
    pdf = pd.DataFrame({
        'none': [None, None],
        'bool': [True, False],
        'int8': np.arange(1, 3).astype('int8'),
        'int16': np.arange(1, 3).astype('int16'),
        'int32': np.arange(1, 3).astype('int32'),
        'int64': np.arange(1, 3).astype('int64'),
        'uint8': np.arange(1, 3).astype('uint8'),
        'uint16': np.arange(1, 3).astype('uint16'),
        'uint32': np.arange(1, 3).astype('uint32'),
        'uint64': np.arange(1, 3).astype('uint64'),
        'float16': np.arange(1, 3).astype('float16'),
        'float32': np.arange(1, 3).astype('float32'),
        'float64': np.arange(1, 3).astype('float64'),
        'float128': np.arange(1, 3).astype('float128'),
        'complex64': np.arange(1, 3).astype('complex64'),
        'complex128': np.arange(1, 3).astype('complex128'),
        'string': list('ab'),
        'array': pd.Series([np.array([1, 2, 3], dtype=np.int32),
                            np.array([1, 2, 3], dtype=np.int32)]),
        'decimal': pd.Series([decimal.Decimal('1'), decimal.Decimal('2')]),
        'date': pd.date_range('19700101', periods=2).values,
        'category': pd.Series(list("AB")).astype('category')})
    pdf['tdeltas'] = [pdf.date.diff()[1], pdf.date.diff()[0]]
    pdf['tz_aware_dates'] = pd.date_range('19700101', periods=2, tz='US/Eastern')
    return pdf

types = [
    BooleanType(),
    ByteType(),
    ShortType(),
    IntegerType(),
    LongType(),
    FloatType(),
    DoubleType(),
    DateType(),
    TimestampType(),
    StringType(),
    DecimalType(10, 0),
    ArrayType(IntegerType()),
    MapType(StringType(), IntegerType()),
    StructType([StructField("_1", IntegerType())]),
    BinaryType(),
]

df = spark.range(2).repartition(1)
results = []
count = 0
total = len(types) * len(columns)
values = []
spark.sparkContext.setLogLevel("FATAL")
for t in types:
    result = []
    for column, pandas_t in columns:
        v = create_dataframe()[column][0]
        values.append(v)
        try:
            row = df.select(
                pandas_udf(lambda _: create_dataframe()[column].iloc[:1], t)(df.id)).first()
            ret_str = repr(row[0])
        except Exception:
            ret_str = "X"
        result.append(ret_str)
        progress = "SQL Type: [%s]\n  Pandas Value(Type): [%s(%s)]\n  Result Python Value: [%s]" % (
            t.simpleString(), v, pandas_t, ret_str)
        count += 1
        print("%s/%s:\n  %s" % (count, total, progress))
    results.append([t.simpleString()] + list(map(str, result)))

schema = ["SQL Type \\ Pandas Value(Type)"] + list(map(
    lambda values_column: "%s(%s)" % (values_column[0], values_column[1][1]),
    zip(values, columns)))
strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False)
print("\n".join(map(lambda line: "    # %s  # noqa" % line, strings.strip().split("\n"))))
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50242 from HyukjinKwon/SPARK-51476.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…s.json

### What changes were proposed in this pull request?
This PR proposes to generate `error-conditions.json` with a newline at the end.

### Why are the changes needed?
To make the style consistent with other Python files.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests in CI.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50254 from HyukjinKwon/minor-format-gen.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
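The change amounts to the generator emitting a trailing newline. A minimal sketch of the idea, using a hypothetical helper rather than the actual PySpark generator:

```python
def ensure_trailing_newline(text):
    """Append a final newline if the text does not already end with one.

    Illustrative sketch; POSIX tooling (diff, cat, linters) expects text
    files to end with a newline, which is why generated files should too."""
    return text if text.endswith("\n") else text + "\n"
```

The function is idempotent, so applying it on every regeneration never accumulates blank lines.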
### What changes were proposed in this pull request?
In the PR, I propose to re-use the method `dateTimeUtilsCls()` instead of `org.apache.spark.sql.catalyst.util.DateTimeUtils` in the `Cast` expression.

### Why are the changes needed?
1. To make code consistent: in some places we use `dateTimeUtilsCls()`, but in others `org.apache.spark.sql.catalyst.util.DateTimeUtils`.
2. To improve code maintenance: `DateTimeUtils` can be moved to another package w/o changing the code generation in `Cast`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By existing GAs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50251 from MaxGekk/use-dateTimeUtilsCls.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
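The maintenance argument is the classic "single source of truth" pattern: generated code embeds utility classes by fully-qualified name, so funneling that name through one helper means a package move touches a single place. A Python stand-in for the Scala pattern (every name below except the quoted class is an assumption for illustration):

```python
# Hypothetical sketch of the codegen pattern, not Spark's actual code.
# Generated source references utilities by fully-qualified class name;
# routing that name through one helper keeps it out of every template.

DATE_TIME_UTILS_CLASS = "org.apache.spark.sql.catalyst.util.DateTimeUtils"

def date_time_utils_cls():
    """Single source of truth for the class name embedded in generated code."""
    return DATE_TIME_UTILS_CLASS

def gen_cast_string_to_date(input_expr):
    """Emit a (hypothetical) generated-code fragment for a string-to-date cast."""
    return "%s.stringToDate(%s)" % (date_time_utils_cls(), input_expr)
```

If `DateTimeUtils` ever moves, only `DATE_TIME_UTILS_CLASS` changes; no generated-code template is touched.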
### What changes were proposed in this pull request?
In the PR, I propose to support `CAST` from `STRING` to `TIME(n)`. The format of input strings should match:

```
[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]
```

### Why are the changes needed?
Before the changes, such a cast, although allowed by the SQL standard, failed with the error:

```scala
scala> Seq("17:18:19").toDS.select($"value".cast(TimeType())).show()
org.apache.spark.sql.catalyst.ExtendedAnalysisException: [DATATYPE_MISMATCH.CAST_WITHOUT_SUGGESTION] Cannot resolve "CAST(value AS TIME(6))" due to data type mismatch: cannot cast "STRING" to "TIME(6)". SQLSTATE: 42K09;
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the cast above works as expected:

```scala
scala> Seq("17:18:19").toDS.select($"value".cast(TimeType())).show()
+--------+
|   value|
+--------+
|17:18:19|
+--------+
```

### How was this patch tested?
By running the new tests:

```
$ build/sbt "test:testOnly *CastWithAnsiOffSuite"
$ build/sbt "test:testOnly *CastWithAnsiOnSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50236 from MaxGekk/string-to-time.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
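For illustration only, the accepted pattern can be modeled with a small parser. This is a Python sketch of the stated `[h]h:[m]m:[s]s` format with up to six fractional digits, not Spark's actual cast implementation:

```python
import re

# One or two digits for hour/minute/second, optional fraction up to
# microsecond precision -- mirroring the format quoted in the PR.
TIME_PATTERN = re.compile(r"^(\d{1,2}):(\d{1,2}):(\d{1,2})(?:\.(\d{1,6}))?$")

def parse_time_micros(s):
    """Parse '[h]h:[m]m:[s]s[.us]' into microseconds since midnight.

    Returns None on malformed input, analogous to a failed cast.
    Hypothetical sketch, not the Spark implementation."""
    m = TIME_PATTERN.match(s.strip())
    if m is None:
        return None
    hours, minutes, seconds = int(m.group(1)), int(m.group(2)), int(m.group(3))
    if hours > 23 or minutes > 59 or seconds > 59:
        return None
    # Right-pad the fraction so '.5' means 500000 microseconds.
    fraction = (m.group(4) or "").ljust(6, "0")
    return ((hours * 60 + minutes) * 60 + seconds) * 1_000_000 + int(fraction)
```

Under this model, `"17:18:19"` parses successfully while out-of-range or mis-delimited strings are rejected, matching the behavior change the PR describes.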
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.1)