[SPARK-52811][PYTHON] Optimize ArrowTableToRowsConversion.convert to improve its performance #51508

Closed
wants to merge 3 commits into from

@ueshin (Member) commented Jul 16, 2025

What changes were proposed in this pull request?

Optimizes ArrowTableToRowsConversion.convert to improve its performance, similar to #51482.

  • Calculate fields in advance
  • Apply value conversions once per column, when columnar_data is created
  • Build the rows with a list comprehension to avoid expensive list.append calls
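
As a rough illustration of the last two points, here is a minimal sketch of the pattern. It is not the PySpark code: `convert_slow`, `convert_fast`, and the plain-list "columns" are hypothetical stand-ins for the Arrow table and the per-field converters.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

# Hypothetical type: a per-column converter, or None when no conversion is needed.
Converter = Optional[Callable[[Any], Any]]


def convert_slow(
    columns: Sequence[List[Any]], converters: Sequence[Converter]
) -> List[Tuple[Any, ...]]:
    """Row-at-a-time: converter check per cell, list.append per value and per row."""
    rows = []
    num_rows = len(columns[0]) if columns else 0
    for i in range(num_rows):
        values = []
        for col, conv in zip(columns, converters):
            v = col[i]
            values.append(conv(v) if conv is not None else v)
        rows.append(tuple(values))
    return rows


def convert_fast(
    columns: Sequence[List[Any]], converters: Sequence[Converter]
) -> List[Tuple[Any, ...]]:
    """Column-at-a-time: convert each column once, then zip the columns into rows."""
    columnar_data = [
        col if conv is None else [conv(v) for v in col]
        for col, conv in zip(columns, converters)
    ]
    # A single comprehension replaces the nested loops and append calls.
    return [tuple(vals) for vals in zip(*columnar_data)]
```

Both functions return the same rows; the second one decides whether a column needs conversion once per column rather than once per cell.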

Why are the changes needed?

ArrowTableToRowsConversion.convert has several sources of performance overhead.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The existing tests, and manual benchmarks.

def profile(f, *args, _n=10, **kwargs):
    import cProfile
    import pstats
    import gc
    st = None
    for _ in range(5):
        f(*args, **kwargs)
    for _ in range(_n):
        gc.collect()
        with cProfile.Profile() as pr:
            ret = f(*args, **kwargs)
        if st is None:
            st = pstats.Stats(pr)
        else:
            st.add(pstats.Stats(pr))
    st.sort_stats("time", "cumulative").print_stats()
    return ret

from pyspark.sql.conversion import ArrowTableToRowsConversion, LocalDataToArrowConversion
from pyspark.sql.types import IntegerType, StringType, StructType

data = [
    (i if i % 1000 else None, str(i), i)
    for i in range(1000000)
]
schema = (
    StructType()
    .add("i", IntegerType(), nullable=True)
    .add("s", StringType(), nullable=True)
    .add("ii", IntegerType(), nullable=False)
)

def to_arrow():
    return LocalDataToArrowConversion.convert(data, schema, use_large_var_types=False)

def from_arrow(tbl):
    return ArrowTableToRowsConversion.convert(tbl, schema)

tbl = to_arrow()
profile(from_arrow, tbl)
  • before
100983380 function calls in 24.509 seconds
  • after
70655910 function calls in 16.947 seconds
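
To put these totals in perspective: `profile` is called with its default `_n=10` and merges the stats of every run, so each figure is the sum over ten profiled calls. It also includes cProfile's own overhead. The short calculation below only does arithmetic on the reported numbers; the variable names are illustrative.

```python
# Totals reported above, accumulated over the ten profiled runs.
before_calls, before_secs = 100_983_380, 24.509
after_calls, after_secs = 70_655_910, 16.947
runs = 10  # default _n in profile()

speedup = before_secs / after_secs
time_saved_pct = (before_secs - after_secs) / before_secs * 100
calls_saved_pct = (before_calls - after_calls) / before_calls * 100

print(f"per run: {before_secs / runs:.2f}s -> {after_secs / runs:.2f}s")
print(f"speedup: {speedup:.2f}x, time saved: {time_saved_pct:.1f}%")
print(f"function calls saved: {calls_saved_pct:.1f}%")
```

That works out to roughly a 1.45x speedup, with about 31% less profiled time and about 30% fewer function calls.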

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon (Member)

Merged to master.
