[SPARK-53791][CORE][SQL] Make the rename operations multi-threaded. #52507

Xtpacz · 2025-10-02T15:12:41Z

What changes were proposed in this pull request?

Updated two classes so that their rename operations run in multiple threads:
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveDirCommand.scala
Added a configuration item:
spark.files.rename.numThreads, defined in core/src/main/scala/org/apache/spark/internal/config/package.scala

Why are the changes needed?

For example, during an INSERT OVERWRITE DIRECTORY operation, each file rename triggers a separate RPC request, so the rename step becomes slow when a job produces many files.
Running the renames in multiple threads instead of serially can reduce job execution time.
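To illustrate the idea, here is a minimal sketch of fanning a batch of renames out over a fixed-size thread pool. It is not the code in this PR: it uses java.nio.file.Files.move on the local filesystem as a stand-in for Hadoop's FileSystem.rename, and the names ParallelRenameSketch and renameAll are hypothetical.

```scala
import java.nio.file.{Files, Path}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelRenameSketch {
  /** Hypothetical helper: renames every (src, dst) pair using up to numThreads threads. */
  def renameAll(pairs: Seq[(Path, Path)], numThreads: Int): Unit = {
    require(numThreads > 0, "numThreads must be positive")
    val pool = Executors.newFixedThreadPool(numThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val renames = pairs.map { case (src, dst) =>
        Future {
          // Assumption: the destination directory may not exist yet.
          Option(dst.getParent).foreach(Files.createDirectories(_))
          // Local-filesystem stand-in for a remote rename RPC.
          Files.move(src, dst)
          ()
        }
      }
      // Block until every rename has finished; a failed rename is rethrown here.
      Await.result(Future.sequence(renames), Duration.Inf)
    } finally {
      pool.shutdown()
    }
  }
}
```

With serial renames the total time grows with the number of files multiplied by the per-rename latency. With a pool, up to numThreads renames are in flight at once, which helps most when each rename is dominated by round-trip latency.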

Does this PR introduce any user-facing change?

Yes. A new Spark configuration item has been added: spark.files.rename.numThreads.

How was this patch tested?

Verified that the affected tests pass.

Was this patch authored or co-authored using generative AI tooling?

No.


Xtpacz · 2025-10-09T02:37:50Z

cc @huaxingao @cloud-fan Could you please review this PR? This is my first PR in the Spark community. Thank you :)

cloud-fan · 2025-10-09T04:29:11Z

core/src/main/scala/org/apache/spark/internal/config/package.scala

    .booleanConf
    .createWithDefault(false)

+  private[spark] val FILES_RENAME_NUM_THREADS = ConfigBuilder("spark.files.rename.numThreads")


can this be a dynamic session config in SQLConf?

Xtpacz · 2025-10-09

> can this be a dynamic session config in SQLConf?

My understanding is that you are suggesting a session-scoped SQLConf (e.g., spark.sql.files.rename.numThreads) that the SQL paths read first, with the global key (spark.files.rename.numThreads) retained as a fallback, since core (e.g., HadoopMapReduceCommitProtocol) cannot depend on SQL. Please let me know if I’ve misunderstood. Thank you!
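The lookup order described in this reply can be sketched as follows. This is only an illustration under stated assumptions: plain Maps stand in for the session SQLConf and the global SparkConf, the object RenameThreadsConf and its resolve method are hypothetical, and the default value is passed in by the caller because the discussion does not state one.

```scala
object RenameThreadsConf {
  // Session-scoped key proposed in the discussion (not yet part of Spark).
  val SessionKey = "spark.sql.files.rename.numThreads"
  // Global key added by this PR.
  val GlobalKey = "spark.files.rename.numThreads"

  /** Hypothetical resolver: session value first, then global value, then the default. */
  def resolve(
      sessionConf: Map[String, String],
      globalConf: Map[String, String],
      default: Int): Int = {
    sessionConf.get(SessionKey)
      .orElse(globalConf.get(GlobalKey))
      .map(_.trim.toInt)
      .getOrElse(default)
  }
}
```

Core code paths would only ever consult the global key, so they would not need to depend on the SQL module.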

cloud-fan · 2025-10-09T04:29:58Z

core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala

+        case Some(sc) => sc.conf.get(FILES_RENAME_NUM_THREADS)
+        case None => FILES_RENAME_NUM_THREADS.defaultValue.get
+      }
+      val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, "file-rename")


shall we have a global long-standing thread pool to do this work?
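For context, one possible shape of such a long-lived pool is sketched below. It is an assumption-laden illustration rather than a proposal for the actual patch: the object SharedRenamePool and the fixed PoolSize are hypothetical, and plain java.util.concurrent is used in place of Spark's ThreadUtils so the snippet stands alone.

```scala
import java.util.concurrent.{ExecutorService, Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger

object SharedRenamePool {
  // Hypothetical fixed size; a real implementation would read it from configuration.
  private val PoolSize = 8
  private val counter = new AtomicInteger(0)

  // Daemon threads so that an idle pool never keeps the JVM alive.
  private val daemonFactory = new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, s"file-rename-${counter.getAndIncrement()}")
      t.setDaemon(true)
      t
    }
  }

  // Created once on first use and then shared by every commit in this JVM.
  lazy val pool: ExecutorService = Executors.newFixedThreadPool(PoolSize, daemonFactory)
}
```

A shared pool avoids creating and tearing down threads on every commit. The trade-off is that its size is fixed when it is first created, so a per-session thread count could not resize it without extra logic.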

github-actions bot added SQL CORE labels Oct 2, 2025

Xtpacz changed the title ~~[SPARK-issuesNo][CORE][SQL] Make the rename operations multi-threaded.~~ [SPARK-53791][CORE][SQL] Make the rename operations multi-threaded. Oct 2, 2025

[SPARK][CORE][SQL] Support multi-threaded rename for Spark Two-Phase Commit

14614f9

Xtpacz force-pushed the multi-threaded-rename branch from 1fd74df to 14614f9 Compare October 2, 2025 15:35

fix

fa0d1b7

cloud-fan reviewed Oct 9, 2025
