@prantogg
Contributor

This PR introduces the following changes:

  1. Running a command like the following automatically creates 10 files for each large table, generated in parallel:
    spatialbench-cli -s 100 --parts=10
  2. When the --parts option is specified, each table is written to output_dir/table/table.part.extension, for example:
    /sf1-parquet/trip/trip.1.parquet
    /sf1-parquet/trip/trip.2.parquet
    
  3. If the target output file already exists, it is not regenerated. This lets you kill a running spatialbench-cli instance and rerun the same command to pick up where it left off
  4. Atomic file generation: spatialbench-cli writes each file under a temporary ".inprogress" name and renames it once the write completes, so an interrupted run does not leave a partial file at the target path
  5. Refactors Zone table generation code for better readability
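
Items 2 to 4 can be illustrated with a minimal Python sketch. It is not the Rust code from this PR: the helper names `part_path` and `write_part_atomically` are hypothetical, and the temporary name is assumed to be the final file name with ".inprogress" appended, which the description does not spell out.

```python
import os
from pathlib import Path


def part_path(output_dir: Path, table: str, part: int, extension: str) -> Path:
    """Build output_dir/table/table.part.extension (layout from item 2)."""
    return output_dir / table / f"{table}.{part}.{extension}"


def write_part_atomically(target: Path, payload: bytes) -> bool:
    """Write payload to target; return False if target already existed."""
    if target.exists():
        # Item 3: a finished file is never regenerated.
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: temporary name = final name + ".inprogress".
    tmp = target.with_name(target.name + ".inprogress")
    with open(tmp, "wb") as f:
        f.write(payload)
    # Item 4: the rename publishes the complete file in one step.
    os.replace(tmp, target)
    return True
```

Because the final name only appears after the rename, the existence check in item 3 can treat any file at the target path as complete; a leftover ".inprogress" file from a killed run is simply overwritten on the next attempt.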

@prantogg added the documentation, enhancement, and affects datagen behavior labels on Oct 26, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces support for generating multiple partitioned files with a single --parts command, along with atomic file generation and improved Zone table code organization.

Key Changes:

  • Automatic generation of multiple partitioned files when --parts is specified without --part
  • Files written to output_dir/table/table.part.extension structure when partitioned
  • Atomic file generation using temporary .inprogress files with rename on completion
  • Skip regeneration if target file already exists
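
The first bullet can be sketched as follows. This is a hedged Python illustration, not the runner added in this PR: `parts_to_generate` and `generate_all` are hypothetical names, part numbers are assumed to be 1-based (matching the trip.1.parquet example), and passing --part is assumed to select a single part.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def parts_to_generate(parts: int, part: Optional[int] = None) -> List[int]:
    """Return the part numbers one invocation should produce."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    if part is None:
        # --parts without --part: fan out to every part (assumed 1-based).
        return list(range(1, parts + 1))
    if not 1 <= part <= parts:
        raise ValueError("part must be between 1 and parts")
    # Assumption: --part restricts the run to that single part.
    return [part]


def generate_all(
    parts: int,
    part: Optional[int],
    generate_one: Callable[[int], str],
    max_workers: int = 4,
) -> List[str]:
    """Run generate_one for each selected part on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, not completion order.
        return list(pool.map(generate_one, parts_to_generate(parts, part)))
```

For example, `generate_all(3, None, lambda p: f"trip.{p}.parquet")` would call the worker once per part and collect the three results in part order.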

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tools/generate_data.py | Simplified by removing staging directory logic; now delegates partitioning to spatialbench-cli |
| spatialbench-cli/tests/cli_integration.rs | Added comprehensive tests for no-overwrite behavior and multi-part generation |
| spatialbench-cli/src/zone_df.rs | Removed monolithic zone generation file |
| spatialbench-cli/src/zone/writer.rs | New module handling atomic parquet file writing with skip-if-exists logic |
| spatialbench-cli/src/zone/transform.rs | New module for zone data SQL transformations |
| spatialbench-cli/src/zone/stats.rs | New module for zone table statistics and row group calculations |
| spatialbench-cli/src/zone/partition.rs | New module implementing partition strategy and batch slicing |
| spatialbench-cli/src/zone/mod.rs | New module orchestrating single and multi-part zone generation |
| spatialbench-cli/src/zone/main.rs | New entry point for zone generation with format handling |
| spatialbench-cli/src/zone/datasource.rs | New module for DataFusion context setup and remote data loading |
| spatialbench-cli/src/zone/config.rs | New module for zone generation configuration |
| spatialbench-cli/src/runner.rs | New parallel execution engine with thread pooling for generation plans |
| spatialbench-cli/src/plan.rs | Enhanced with partitioning support and chunk count tracking |
| spatialbench-cli/src/output_plan.rs | New module for output file planning and directory management |
| spatialbench-cli/src/main.rs | Refactored to use new runner architecture and output planning |
| README.md | Updated with simplified partitioned output example |

@prantogg requested a review from jiayuasu on October 26, 2025 08:09
@jiayuasu
Member

jiayuasu commented Oct 27, 2025

LGTM. In another PR, can you add the detailed usage here? https://sedona.apache.org/spatialbench/datasets-generators/ The README is too long now.

@jiayuasu merged commit 59dff99 into main on Oct 27, 2025
9 checks passed