
[pull] master from apache:master #441

Merged: 6 commits into huangxiaopingRD:master on Mar 12, 2025

Conversation

@pull pull bot commented Mar 12, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)

Can you help keep this open source service alive? 💖 Please sponsor : )

viirya and others added 6 commits March 12, 2025 06:51
…supporting both columnar and row output

### What changes were proposed in this pull request?

This patch fixes a corner case in `ApplyColumnarRulesAndInsertTransitions`. When a plan is required to output rows and the node supports both columnar and row output, the rule currently adds a redundant `ColumnarToRowExec` to its upstream.
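For illustration, here is a minimal sketch of the before/after decision, assuming simplified flags analogous to `SparkPlan`'s `supportsColumnar` and `supportsRowBased`; the types and helper names are stand-ins, not Spark's actual rule:

```scala
// Minimal sketch, not Spark's implementation. `Plan`, `ColumnarToRow`, and
// the insertTransition* helpers are illustrative stand-ins.
trait Plan {
  def supportsColumnar: Boolean
  def supportsRowBased: Boolean
}
case class ColumnarToRow(child: Plan) extends Plan {
  val supportsColumnar = false
  val supportsRowBased = true
}

// Before the fix: a transition was inserted whenever the child could produce
// columnar output, even if it could already produce rows directly.
def insertTransitionBefore(child: Plan): Plan =
  if (child.supportsColumnar) ColumnarToRow(child) else child

// After the fix: the transition is needed only when the child cannot
// produce rows itself.
def insertTransitionAfter(child: Plan): Plan =
  if (child.supportsColumnar && !child.supportsRowBased) ColumnarToRow(child)
  else child
```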

### Why are the changes needed?

This fix avoids inserting a redundant `ColumnarToRowExec`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #50239 from viirya/fix_columnar.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
### What changes were proposed in this pull request?

This PR proposes to reformat the error classes by running:

```python
from pyspark.errors.exceptions import _write_self; _write_self()
```

### Why are the changes needed?

To remove the diff from the auto-formatted output.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing CI

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50241 from HyukjinKwon/minor-format.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…s UDFs (pyarrow 17.0.0, pandas 2.2.3, Python 3.11)

### What changes were proposed in this pull request?

This PR updates the chart generated at [SPARK-25666](https://issues.apache.org/jira/browse/SPARK-25666).

### Why are the changes needed?

To track changes in type coercion across PySpark <> PyArrow <> pandas.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Use this code to generate the chart:

```python
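# Note: this script assumes a running PySpark shell (e.g. bin/pyspark), where
# `spark` (a SparkSession) is predefined.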
from pyspark.sql.types import *
from pyspark.sql.functions import pandas_udf

columns = [
    ('none', 'object(NoneType)'),
    ('bool', 'bool'),
    ('int8', 'int8'),
    ('int16', 'int16'),
    ('int32', 'int32'),
    ('int64', 'int64'),
    ('uint8', 'uint8'),
    ('uint16', 'uint16'),
    ('uint32', 'uint32'),
    ('uint64', 'uint64'),
    ('float16', 'float16'),
    ('float32', 'float32'),
    ('float64', 'float64'),
    ('date', 'datetime64[ns]'),
    ('tz_aware_dates', 'datetime64[ns, US/Eastern]'),
    ('string', 'object(string)'),
    ('decimal', 'object(Decimal)'),
    ('array', 'object(array[int32])'),
    ('float128', 'float128'),
    ('complex64', 'complex64'),
    ('complex128', 'complex128'),
    ('category', 'category'),
    ('tdeltas', 'timedelta64[ns]'),
]
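# Each (column, dtype-label) pair above maps a column of the pandas DataFrame
# built below to the label shown in the chart header.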

def create_dataframe():
    import pandas as pd
    import numpy as np
    import decimal
    pdf = pd.DataFrame({
        'none': [None, None],
        'bool': [True, False],
        'int8': np.arange(1, 3).astype('int8'),
        'int16': np.arange(1, 3).astype('int16'),
        'int32': np.arange(1, 3).astype('int32'),
        'int64': np.arange(1, 3).astype('int64'),
        'uint8': np.arange(1, 3).astype('uint8'),
        'uint16': np.arange(1, 3).astype('uint16'),
        'uint32': np.arange(1, 3).astype('uint32'),
        'uint64': np.arange(1, 3).astype('uint64'),
        'float16': np.arange(1, 3).astype('float16'),
        'float32': np.arange(1, 3).astype('float32'),
        'float64': np.arange(1, 3).astype('float64'),
        'float128': np.arange(1, 3).astype('float128'),
        'complex64': np.arange(1, 3).astype('complex64'),
        'complex128': np.arange(1, 3).astype('complex128'),
        'string': list('ab'),
        'array': pd.Series([np.array([1, 2, 3], dtype=np.int32), np.array([1, 2, 3], dtype=np.int32)]),
        'decimal': pd.Series([decimal.Decimal('1'), decimal.Decimal('2')]),
        'date': pd.date_range('19700101', periods=2).values,
        'category': pd.Series(list("AB")).astype('category')})
    pdf['tdeltas'] = [pdf.date.diff()[1], pdf.date.diff()[0]]
    pdf['tz_aware_dates'] = pd.date_range('19700101', periods=2, tz='US/Eastern')
    return pdf

types = [
    BooleanType(),
    ByteType(),
    ShortType(),
    IntegerType(),
    LongType(),
    FloatType(),
    DoubleType(),
    DateType(),
    TimestampType(),
    StringType(),
    DecimalType(10, 0),
    ArrayType(IntegerType()),
    MapType(StringType(), IntegerType()),
    StructType([StructField("_1", IntegerType())]),
    BinaryType(),
]

df = spark.range(2).repartition(1)
results = []
count = 0
total = len(types) * len(columns)
values = []
spark.sparkContext.setLogLevel("FATAL")
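# For every (SQL type, pandas column) pair, run a pandas UDF that returns the
# pandas column under that SQL return type; "X" marks combinations that fail.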
for t in types:
    result = []
    for column, pandas_t in columns:
        v = create_dataframe()[column][0]
        values.append(v)
        try:
            row = df.select(pandas_udf(lambda _: create_dataframe()[column].iloc[:1], t)(df.id)).first()
            ret_str = repr(row[0])
        except Exception:
            ret_str = "X"
        result.append(ret_str)
        progress = "SQL Type: [%s]\n  Pandas Value(Type): [%s(%s)]\n  Result Python Value: [%s]" % (
            t.simpleString(), v, pandas_t, ret_str)
        count += 1
        print("%s/%s:\n  %s" % (count, total, progress))
    results.append([t.simpleString()] + list(map(str, result)))

schema = ["SQL Type \\ Pandas Value(Type)"] + list(map(lambda values_column: "%s(%s)" % (values_column[0], values_column[1][1]), zip(values, columns)))
strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False)
print("\n".join(map(lambda line: "    # %s  # noqa" % line, strings.strip().split("\n"))))
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50242 from HyukjinKwon/SPARK-51476.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…s.json

### What changes were proposed in this pull request?

This PR proposes to generate with a newline at the end of `error-conditions.json`.

### Why are the changes needed?

To make the style the same as in the other Python files.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests in CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50254 from HyukjinKwon/minor-format-gen.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
In the PR, I propose to re-use the method `dateTimeUtilsCls()` instead of `org.apache.spark.sql.catalyst.util.DateTimeUtils` in the `Cast` expression.

### Why are the changes needed?
1. To make code consistent: in some places we use `dateTimeUtilsCls()`, but in others `org.apache.spark.sql.catalyst.util.DateTimeUtils`.
2. To improve code maintenance: `DateTimeUtils` can be moved to another package without changing the code generation in `Cast`, as sketched below.
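
A rough sketch of the pattern (assuming `DateTimeUtils` is a Scala object; not necessarily the exact Spark code):

```scala
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// Derive the fully qualified class name once instead of hard-coding it in
// generated Java code, so the object can move packages without breaking
// codegen. Scala objects carry a trailing '$' in their JVM class name.
def dateTimeUtilsCls(): String = DateTimeUtils.getClass.getName.stripSuffix("$")

// Generated code would then reference, e.g.:
//   s"${dateTimeUtilsCls()}.stringToTimestamp(...)"
// rather than the literal "org.apache.spark.sql.catalyst.util.DateTimeUtils".
```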

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By existing GAs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50251 from MaxGekk/use-dateTimeUtilsCls.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
### What changes were proposed in this pull request?
In the PR, I propose to support `CAST` from `STRING` to `TIME(n)`. Input strings must match the format:
```
[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]
```
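
For illustration only, a minimal sketch of a matcher for this shape (not Spark's actual parser, which also validates field ranges):

```scala
// Minimal sketch, not Spark's parser: accepts [h]h:[m]m:[s]s with an
// optional fractional part of 1..6 digits (milli/microseconds). It does not
// range-check hours, minutes, or seconds.
val timePattern = """(\d{1,2}):(\d{1,2}):(\d{1,2})(\.\d{1,6})?""".r

def looksLikeTime(s: String): Boolean =
  timePattern.pattern.matcher(s).matches()

// looksLikeTime("17:18:19")      // true
// looksLikeTime("7:8:9.123456")  // true
// looksLikeTime("17:18")         // false
```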

### Why are the changes needed?
Before the changes, such a cast, although allowed by the SQL standard, failed with the error:
```scala
scala> Seq("17:18:19").toDS.select($"value".cast(TimeType())).show()
org.apache.spark.sql.catalyst.ExtendedAnalysisException: [DATATYPE_MISMATCH.CAST_WITHOUT_SUGGESTION] Cannot resolve "CAST(value AS TIME(6))" due to data type mismatch: cannot cast "STRING" to "TIME(6)". SQLSTATE: 42K09;
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the cast above works as expected:
```scala
scala> Seq("17:18:19").toDS.select($"value".cast(TimeType())).show()
+--------+
|   value|
+--------+
|17:18:19|
+--------+
```

### How was this patch tested?
By running the new tests:
```
$ build/sbt "test:testOnly *CastWithAnsiOffSuite"
$ build/sbt "test:testOnly *CastWithAnsiOnSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50236 from MaxGekk/string-to-time.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
@pull pull bot added the ⤵️ pull label Mar 12, 2025
@pull pull bot merged commit baf4b59 into huangxiaopingRD:master Mar 12, 2025