Feat: Support creating multiple files with single --parts command #57
# Conflicts: # tpchgen-arrow/Cargo.toml
Pull Request Overview
This PR introduces support for generating multiple partitioned files with a single --parts command, along with atomic file generation and improved Zone table code organization.
Key Changes:
- Automatic generation of multiple partitioned files when `--parts` is specified without `--part`
- Files written to `output_dir/table/table.part.extension` structure when partitioned
- Atomic file generation using temporary `.inprogress` files with rename on completion
- Skip regeneration if target file already exists
Reviewed Changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tools/generate_data.py | Simplified by removing staging directory logic, now delegates partitioning to spatialbench-cli |
| spatialbench-cli/tests/cli_integration.rs | Added comprehensive tests for no-overwrite behavior and multi-part generation |
| spatialbench-cli/src/zone_df.rs | Removed monolithic zone generation file |
| spatialbench-cli/src/zone/writer.rs | New module handling atomic parquet file writing with skip-if-exists logic |
| spatialbench-cli/src/zone/transform.rs | New module for zone data SQL transformations |
| spatialbench-cli/src/zone/stats.rs | New module for zone table statistics and row group calculations |
| spatialbench-cli/src/zone/partition.rs | New module implementing partition strategy and batch slicing |
| spatialbench-cli/src/zone/mod.rs | New module orchestrating single and multi-part zone generation |
| spatialbench-cli/src/zone/main.rs | New entry point for zone generation with format handling |
| spatialbench-cli/src/zone/datasource.rs | New module for DataFusion context setup and remote data loading |
| spatialbench-cli/src/zone/config.rs | New module for zone generation configuration |
| spatialbench-cli/src/runner.rs | New parallel execution engine with thread pooling for generation plans |
| spatialbench-cli/src/plan.rs | Enhanced with partitioning support and chunk count tracking |
| spatialbench-cli/src/output_plan.rs | New module for output file planning and directory management |
| spatialbench-cli/src/main.rs | Refactored to use new runner architecture and output planning |
| README.md | Updated with simplified partitioned output example |
Co-authored-by: Copilot <[email protected]>
LGTM. In another PR, can you add the detailed usage here? https://sedona.apache.org/spatialbench/datasets-generators/ The README is too long now.
This PR introduces the following changes:
- When the `--parts` option is specified, each table is written out to `output_dir/table/table.part.extension`
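
As a rough illustration of the `output_dir/table/table.part.extension` layout, the sketch below builds one path per part. The function `part_paths` is hypothetical and not taken from this PR. It also assumes parts are numbered from 1, which the description does not state.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: lists the output paths for a partitioned table,
/// following the `output_dir/table/table.part.extension` pattern.
/// Assumption: part numbers run from 1 to `parts` inclusive.
pub fn part_paths(output_dir: &Path, table: &str, parts: u32, extension: &str) -> Vec<PathBuf> {
    (1..=parts)
        .map(|part| {
            output_dir
                .join(table)
                .join(format!("{table}.{part}.{extension}"))
        })
        .collect()
}
```

For example, calling `part_paths(Path::new("out"), "t", 2, "parquet")` under these assumptions yields `out/t/t.1.parquet` and `out/t/t.2.parquet`.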