
[pull] master from apache:master #441

Merged: 6 commits into huangxiaopingRD:master on Mar 12, 2025

Conversation

@pull pull bot commented Mar 12, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)

Can you help keep this open source service alive? 💖 Please sponsor : )

viirya and others added 6 commits March 12, 2025 06:51
…supporting both columnar and row output

### What changes were proposed in this pull request?

This patch fixes a corner case in `ApplyColumnarRulesAndInsertTransitions`. When a plan is required to output rows and the node supports both columnar and row output, the rule currently adds a redundant `ColumnarToRowExec` to its upstream.
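For illustration, here is a minimal sketch of the before/after decision, assuming simplified flags analogous to `SparkPlan`'s `supportsColumnar` and `supportsRowBased`; the types and helper names are stand-ins, not Spark's actual rule:

```scala
// Minimal sketch, not Spark's implementation. `Plan`, `ColumnarToRow`, and
// the insertTransition* helpers are illustrative stand-ins.
trait Plan {
  def supportsColumnar: Boolean
  def supportsRowBased: Boolean
}
case class ColumnarToRow(child: Plan) extends Plan {
  val supportsColumnar = false
  val supportsRowBased = true
}

// Before the fix: a transition was inserted whenever the child could produce
// columnar output, even if it could already produce rows directly.
def insertTransitionBefore(child: Plan): Plan =
  if (child.supportsColumnar) ColumnarToRow(child) else child

// After the fix: the transition is needed only when the child cannot
// produce rows itself.
def insertTransitionAfter(child: Plan): Plan =
  if (child.supportsColumnar && !child.supportsRowBased) ColumnarToRow(child)
  else child
```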

### Why are the changes needed?

This fix avoids inserting a redundant `ColumnarToRowExec`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #50239 from viirya/fix_columnar.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
### What changes were proposed in this pull request?

This PR proposes to reformat the error classes by running:

```python
from pyspark.errors.exceptions import _write_self; _write_self()
```

### Why are the changes needed?

To remove the diff from the auto-formatted output.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing CI

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50241 from HyukjinKwon/minor-format.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…s UDFs (pyarrow 17.0.0, pandas 2.2.3, Python 3.11)

### What changes were proposed in this pull request?

This PR updates the chart generated at [SPARK-25666](https://issues.apache.org/jira/browse/SPARK-25666).

### Why are the changes needed?

To track changes in type coercion across PySpark <> PyArrow <> pandas.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Use this code to generate the chart:

```python
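# Note: this script assumes a running PySpark shell (e.g. bin/pyspark), where
# `spark` (a SparkSession) is predefined.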
from pyspark.sql.types import *
from pyspark.sql.functions import pandas_udf

columns = [
    ('none', 'object(NoneType)'),
    ('bool', 'bool'),
    ('int8', 'int8'),
    ('int16', 'int16'),
    ('int32', 'int32'),
    ('int64', 'int64'),
    ('uint8', 'uint8'),
    ('uint16', 'uint16'),
    ('uint32', 'uint32'),
    ('uint64', 'uint64'),
    ('float16', 'float16'),
    ('float32', 'float32'),
    ('float64', 'float64'),
    ('date', 'datetime64[ns]'),
    ('tz_aware_dates', 'datetime64[ns, US/Eastern]'),
    ('string', 'object(string)'),
    ('decimal', 'object(Decimal)'),
    ('array', 'object(array[int32])'),
    ('float128', 'float128'),
    ('complex64', 'complex64'),
    ('complex128', 'complex128'),
    ('category', 'category'),
    ('tdeltas', 'timedelta64[ns]'),
]
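# Each (column, dtype-label) pair above maps a column of the pandas DataFrame
# built below to the label shown in the chart header.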

def create_dataframe():
    import pandas as pd
    import numpy as np
    import decimal
    pdf = pd.DataFrame({
        'none': [None, None],
        'bool': [True, False],
        'int8': np.arange(1, 3).astype('int8'),
        'int16': np.arange(1, 3).astype('int16'),
        'int32': np.arange(1, 3).astype('int32'),
        'int64': np.arange(1, 3).astype('int64'),
        'uint8': np.arange(1, 3).astype('uint8'),
        'uint16': np.arange(1, 3).astype('uint16'),
        'uint32': np.arange(1, 3).astype('uint32'),
        'uint64': np.arange(1, 3).astype('uint64'),
        'float16': np.arange(1, 3).astype('float16'),
        'float32': np.arange(1, 3).astype('float32'),
        'float64': np.arange(1, 3).astype('float64'),
        'float128': np.arange(1, 3).astype('float128'),
        'complex64': np.arange(1, 3).astype('complex64'),
        'complex128': np.arange(1, 3).astype('complex128'),
        'string': list('ab'),
        'array': pd.Series([np.array([1, 2, 3], dtype=np.int32), np.array([1, 2, 3], dtype=np.int32)]),
        'decimal': pd.Series([decimal.Decimal('1'), decimal.Decimal('2')]),
        'date': pd.date_range('19700101', periods=2).values,
        'category': pd.Series(list("AB")).astype('category')})
    pdf['tdeltas'] = [pdf.date.diff()[1], pdf.date.diff()[0]]
    pdf['tz_aware_dates'] = pd.date_range('19700101', periods=2, tz='US/Eastern')
    return pdf

types = [
    BooleanType(),
    ByteType(),
    ShortType(),
    IntegerType(),
    LongType(),
    FloatType(),
    DoubleType(),
    DateType(),
    TimestampType(),
    StringType(),
    DecimalType(10, 0),
    ArrayType(IntegerType()),
    MapType(StringType(), IntegerType()),
    StructType([StructField("_1", IntegerType())]),
    BinaryType(),
]

df = spark.range(2).repartition(1)
results = []
count = 0
total = len(types) * len(columns)
values = []
spark.sparkContext.setLogLevel("FATAL")
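# For every (SQL type, pandas column) pair, run a pandas UDF that returns the
# pandas column under that SQL return type; "X" marks combinations that fail.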
for t in types:
    result = []
    for column, pandas_t in columns:
        v = create_dataframe()[column][0]
        values.append(v)
        try:
            row = df.select(pandas_udf(lambda _: create_dataframe()[column].iloc[:1], t)(df.id)).first()
            ret_str = repr(row[0])
        except Exception:
            ret_str = "X"
        result.append(ret_str)
        progress = "SQL Type: [%s]\n  Pandas Value(Type): [%s(%s)]\n  Result Python Value: [%s]" % (
            t.simpleString(), v, pandas_t, ret_str)
        count += 1
        print("%s/%s:\n  %s" % (count, total, progress))
    results.append([t.simpleString()] + list(map(str, result)))

schema = ["SQL Type \\ Pandas Value(Type)"] + list(map(lambda values_column: "%s(%s)" % (values_column[0], values_column[1][1]), zip(values, columns)))
strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False)
print("\n".join(map(lambda line: "    # %s  # noqa" % line, strings.strip().split("\n"))))
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50242 from HyukjinKwon/SPARK-51476.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…s.json

### What changes were proposed in this pull request?

This PR proposes to generate with a newline at the end of `error-conditions.json`.

### Why are the changes needed?

To make the style the same as in the other Python files.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests in CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50254 from HyukjinKwon/minor-format-gen.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
In the PR, I propose to re-use the method `dateTimeUtilsCls()` instead of `org.apache.spark.sql.catalyst.util.DateTimeUtils` in the `Cast` expression.

### Why are the changes needed?
1. To make code consistent: in some places we use `dateTimeUtilsCls()`, but in others `org.apache.spark.sql.catalyst.util.DateTimeUtils`.
2. To improve code maintenance: `DateTimeUtils` can be moved to another package without changing the code generation in `Cast`, as sketched below.
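
A rough sketch of the pattern (assuming `DateTimeUtils` is a Scala object; not necessarily the exact Spark code):

```scala
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// Derive the fully qualified class name once instead of hard-coding it in
// generated Java code, so the object can move packages without breaking
// codegen. Scala objects carry a trailing '$' in their JVM class name.
def dateTimeUtilsCls(): String = DateTimeUtils.getClass.getName.stripSuffix("$")

// Generated code would then reference, e.g.:
//   s"${dateTimeUtilsCls()}.stringToTimestamp(...)"
// rather than the literal "org.apache.spark.sql.catalyst.util.DateTimeUtils".
```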

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By existing GAs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50251 from MaxGekk/use-dateTimeUtilsCls.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
### What changes were proposed in this pull request?
In the PR, I propose to support `CAST` from `STRING` to `TIME(n)`. Input strings must match the format:
```
[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]
```
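
For illustration only, a minimal sketch of a matcher for this shape (not Spark's actual parser, which also validates field ranges):

```scala
// Minimal sketch, not Spark's parser: accepts [h]h:[m]m:[s]s with an
// optional fractional part of 1..6 digits (milli/microseconds). It does not
// range-check hours, minutes, or seconds.
val timePattern = """(\d{1,2}):(\d{1,2}):(\d{1,2})(\.\d{1,6})?""".r

def looksLikeTime(s: String): Boolean =
  timePattern.pattern.matcher(s).matches()

// looksLikeTime("17:18:19")      // true
// looksLikeTime("7:8:9.123456")  // true
// looksLikeTime("17:18")         // false
```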

### Why are the changes needed?
Before the changes, such a cast, although allowed by the SQL standard, failed with the error:
```scala
scala> Seq("17:18:19").toDS.select($"value".cast(TimeType())).show()
org.apache.spark.sql.catalyst.ExtendedAnalysisException: [DATATYPE_MISMATCH.CAST_WITHOUT_SUGGESTION] Cannot resolve "CAST(value AS TIME(6))" due to data type mismatch: cannot cast "STRING" to "TIME(6)". SQLSTATE: 42K09;
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the cast above works as expected:
```scala
scala> Seq("17:18:19").toDS.select($"value".cast(TimeType())).show()
+--------+
|   value|
+--------+
|17:18:19|
+--------+
```

### How was this patch tested?
By running the new tests:
```
$ build/sbt "test:testOnly *CastWithAnsiOffSuite"
$ build/sbt "test:testOnly *CastWithAnsiOnSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50236 from MaxGekk/string-to-time.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
@pull pull bot added the ⤵️ pull label Mar 12, 2025
@pull pull bot merged commit baf4b59 into huangxiaopingRD:master Mar 12, 2025