Add ds.write_dataset kwarg overrides to write_deltalake function #1683

Closed
someaveragepunter opened this issue Sep 30, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@someaveragepunter

Description

Allow callers to pass ds.write_dataset keyword-argument overrides into the write_deltalake function.

ds.write_dataset(
    data,
    base_dir="/",
    basename_template=f"{current_version + 1}-{uuid.uuid4()}-{{i}}.parquet",
    format="parquet",
    partitioning=partitioning,
    # It will not accept a schema if using a RBR
    schema=schema if not isinstance(data, RecordBatchReader) else None,
    file_visitor=visitor,
    existing_data_behavior="overwrite_or_ignore",
    file_options=file_options,
    max_open_files=max_open_files,
    max_rows_per_file=max_rows_per_file,
    min_rows_per_group=min_rows_per_group,
    max_rows_per_group=max_rows_per_group,
    filesystem=filesystem,
    max_partitions=max_partitions,
)

Use Case
My specific use case is overriding the basename_template parameter. When I hit S3 with thousands of concurrent tasks, reading a Parquet file directly, without going through the transaction log to find it, yields performance benefits. Explicitly naming the Parquet file lets me determine the filename statically and deterministically, as opposed to querying for it at runtime.
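To make the idea concrete, here is a minimal sketch of how a reader could compute the file location up front from a fixed template. The helper `expected_file_key`, the bucket and table names, and the task-id naming scheme are all hypothetical. It assumes the writer replaces the `{i}` token with an incrementing integer starting at 0 (as the pyarrow `write_dataset` documentation describes) and that the write produces a single, unpartitioned file.

```python
import posixpath


def expected_file_key(table_root: str, basename_template: str, index: int = 0) -> str:
    """Return the object key expected for a given template (hypothetical helper).

    Assumes the writer substitutes an incrementing integer for the "{i}" token
    and that the data fits in one unpartitioned file, so index 0 is the file.
    """
    if "{i}" not in basename_template:
        raise ValueError("basename_template must contain the '{i}' token")
    return posixpath.join(table_root, basename_template.replace("{i}", str(index)))


# Hypothetical naming scheme: one file per task, keyed by a task id.
template = "task-000123-{i}.parquet"
key = expected_file_key("my-bucket/my-table", template)
print(key)
```

With a UUID in the template, as in the current call, the key cannot be derived this way and has to be looked up after the write.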

Furthermore, this would future-proof write_deltalake by exposing any additional kwargs or enhancements added to the pyarrow datasets API.
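One possible shape for the proposal is sketched below. It is not the deltalake implementation: `merge_write_dataset_kwargs` and its `overrides` argument are hypothetical names, and only a few of the defaults from the call above are reproduced. The point is that caller-supplied entries win over the built-in defaults, while anything not overridden keeps its current value.

```python
import uuid
from typing import Any, Dict, Optional


def merge_write_dataset_kwargs(
    current_version: int,
    overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Combine default ds.write_dataset kwargs with caller overrides (hypothetical)."""
    defaults: Dict[str, Any] = {
        # Same default template as the existing call: version, UUID, file index.
        "basename_template": f"{current_version + 1}-{uuid.uuid4()}-{{i}}.parquet",
        "format": "parquet",
        "existing_data_behavior": "overwrite_or_ignore",
    }
    # Later entries win, so caller overrides replace the defaults key by key.
    return {**defaults, **(overrides or {})}


default_kwargs = merge_write_dataset_kwargs(current_version=4)
custom_kwargs = merge_write_dataset_kwargs(
    current_version=4,
    overrides={"basename_template": "task-000123-{i}.parquet"},
)
print(custom_kwargs["basename_template"])
```

The merged dictionary would then be unpacked into the `ds.write_dataset(...)` call. Note that overriding the template removes the UUID, so avoiding file-name collisions becomes the caller's responsibility.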

I'm happy to propose the change and submit a PR if this is an acceptable enhancement.

@someaveragepunter added the enhancement label on Sep 30, 2023
@ion-elgreco
Collaborator

@someaveragepunter can you elaborate a bit more on the use case? We use UUIDs so that there aren't any file collisions.

@ion-elgreco closed this as not planned on Dec 7, 2024

2 participants