Reference tables have too many Parquet files #14

Open
MrPowers opened this issue Oct 10, 2022 · 3 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@MrPowers
Contributor

Here are the contents of reference_table_1:

out/tables/generated/reference_table_1/
├── delta
│   ├── _delta_log
│   │   ├── 00000000000000000000.json
│   │   └── 00000000000000000001.json
│   ├── part-00000-3411d322-e527-4412-a307-1b7200ee5969-c000.snappy.parquet
│   ├── part-00000-4507d90c-b1b4-421e-a610-64f809bb0f5e-c000.snappy.parquet
│   ├── part-00003-38d8b9cf-6198-40cb-9d0e-3193665f616f-c000.snappy.parquet
│   ├── part-00003-c0f0d2bd-1223-432e-b282-c21e5e84e933-c000.snappy.parquet
│   ├── part-00006-1d2ce96b-e98a-4aca-9f9b-39dd87ed95d2-c000.snappy.parquet
│   ├── part-00006-22499926-c359-49ae-99f0-aa2e2427837f-c000.snappy.parquet
│   ├── part-00009-55b75f6c-f47e-4720-971d-b3d6549d2e22-c000.snappy.parquet
│   ├── part-00009-b10616ab-ec46-45d1-a15f-7cfba3747266-c000.snappy.parquet
│   ├── part-00014-861044c3-1d0f-4d9c-ac60-53d02edeec9f-c000.snappy.parquet
│   └── part-00019-b68afb45-87bb-49f2-95e9-45774d154db4-c000.snappy.parquet
├── parquet
│   └── table_content.parquet
└── table-metadata.json

Ideally, this table would only have two Parquet files. We should be able to clean this up with a repartition(1) before writing.
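To check the file count before and after a fix, something like the sketch below could help. It assumes the directory layout shown above; the helper name `count_delta_data_files` is made up for illustration and is not part of this repo. It skips `_delta_log`, since that directory can hold checkpoint Parquet files that are not table data.

```python
from pathlib import Path


def count_delta_data_files(table_dir: str) -> int:
    """Count Parquet data files under <table_dir>/delta, ignoring _delta_log."""
    delta_dir = Path(table_dir) / "delta"
    return sum(
        1
        for p in delta_dir.rglob("*.parquet")
        # Skip anything inside the transaction log directory.
        if p.is_file() and "_delta_log" not in p.relative_to(delta_dir).parts
    )


if __name__ == "__main__":
    # Path taken from the listing above; adjust to your checkout.
    print(count_delta_data_files("out/tables/generated/reference_table_1"))
```

This counts files on disk, not files referenced by the current table version, which matches what the listing above shows.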

MrPowers added the enhancement and good first issue labels on Oct 10, 2022
@edmondop
Contributor

@MrPowers
Contributor Author

@edmondo1984 - Took a look at the function you pointed to:

def _write_delta(write_plan: WritePlan, path: str) -> None:
    for (write_mode, entry) in write_plan.entries:
        entry.write.partitionBy(
            write_plan.table.partition_keys,
        ).format(
            'delta',
        ).mode(
            write_mode,
        ).save(
            path,
        )

Perhaps calling entry.repartition(1) before .write will solve this?

@edmondop
Contributor

It should! Why don't you try?
