Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

table compaction #3043

Closed
sebvey opened this issue Dec 2, 2024 · 3 comments
Closed

table compaction #3043

sebvey opened this issue Dec 2, 2024 · 3 comments
Labels
binding/rust Issues for the Rust crate bug Something isn't working mre-needed Whether an MRE needs to be provided
Milestone

Comments

@sebvey
Copy link

sebvey commented Dec 2, 2024

Environment

Delta-rs version: 0.22.2

Binding: python 0.22.2

Environment: python 3.13.0

  • OS: macOS 13.7.1 (22H221) / Kernel Version: Darwin 22.6.0
  • Other:

Bug

What happened:
When compacting a delta table of about 273 files (380MB) partitioned on a field 'year_month' (3 partitions), the listing of the table files seems invalid:

  • new parquet files are correctly added (2 files by partition -> 6 files)
  • previous files seem correctly marked as 'remove' (273 'remove' in the commit log of the 'OPTIMIZE' operation
  • dt.files() lists 205 files, I don't think it's expected
  • `dt.get_add_actions() also lists 205 files, I'm quite sure it's not what is expected
  • when vacuum (with proper params) is done on the table, it seems to rely on the listed files and keep 205 files

Am I missing something ?

Log file of the 'OPTIMIZE' commit:
00000000000000000274.json

Path column of the get_add_actions():
get_add_actions.json

What you expected to happen:

  • dt.files() should list 6 files ?
  • `dt.get_add_actions() should list 6 files
  • vacuum should only left 6 files untouched

How to reproduce it:
I made a repo with the code used for the test. Use the branch deltars-issue-sample: [email protected]:sebvey/delta-optim.git

I made the README.md as clear as possible.

@sebvey sebvey added the bug Something isn't working label Dec 2, 2024
@rtyler rtyler added the binding/rust Issues for the Rust crate label Dec 3, 2024
@rtyler rtyler added this to the v0.23 milestone Dec 3, 2024
@ion-elgreco
Copy link
Collaborator

@sebvey please create a simple MRE in python, that I can run as is? Pointing to a branch is not very helpful..

@ion-elgreco ion-elgreco added the mre-needed Whether an MRE needs to be provided label Dec 7, 2024
@sebvey
Copy link
Author

sebvey commented Dec 9, 2024

Here is the MRE you asked for.
Apparentky, v0.22.3 solved the problem (it can be reproduced with v0.22.2 though).
@ion-elgreco , it seems that you can close this issue !

from datetime import date, timedelta
from pathlib import Path
import shutil
import duckdb
import pyarrow as pa
from deltalake import Schema, Field, DeltaTable, WriterProperties, write_deltalake
from deltalake.schema import PrimitiveType

DT_PATH = Path("tmp/delta_table")
LOCATIONS = ["Lyon","Paris","Marseille"]
START_DATE, END_DATE = date(2024,1,1), date(2024,4,1)

schema = Schema([
    Field("yearmonth",PrimitiveType("string")),
    Field("datetime",PrimitiveType("timestamp_ntz")),
    Field("location", PrimitiveType("string")),
    Field("value",PrimitiveType("double")),
])

#  TABLE INIT
if DT_PATH.is_dir():
    shutil.rmtree(DT_PATH)

dt = DeltaTable.create(
    table_uri=str(DT_PATH),
    schema=schema,
    partition_by=["yearmonth"],
)

# FEEDING THE TABLE
def batch(day: date, location: str) -> pa.Table:
    """Produces a DataFrame with 86400 records for given day and location:
    - one 'query' per day -> ~30 files per month.
    - data partitioned by yearmonth
    """

    query = f"""
    with dts as (
        SELECT
            unnest(
                generate_series(
                    date '{day}',
                    date '{day}' + interval 1 day - interval 1 second,
                    interval 1 second
                )
            ) AS datetime
    )

    select
        strftime(datetime,'%Y%m') as 'yearmonth',
        datetime,
        '{location}' as 'location',
        random() as 'value'
    from dts
    """

    with duckdb.connect() as con:
        return con.sql(query).to_arrow_table()

days = (
    START_DATE + timedelta(days=i)
    for i in range((END_DATE - START_DATE).days)
)

for day in days:
    for location in LOCATIONS:
        write_deltalake(
            dt,
            batch(day, location),
            mode="append",
            writer_properties=WriterProperties(compression="ZSTD")
        )

paths_len = len(dt.get_add_actions(flatten=True)["path"])
files_len = len(dt.files())
print(f"{paths_len=}") # -> 273
print(f"{files_len=}") # -> 273

dt.optimize.compact()
compacted_paths_len = len(dt.get_add_actions(flatten=True)["path"])
compacted_files_len = len(dt.files())
print(f"{compacted_paths_len=}") # v0.22.2 -> 205 / v0.22.3 -> 6
print(f"{compacted_files_len=}") # v0.22.2 -> 205 / v0.22.3 -> 6


dt.vacuum(
    retention_hours=0,
    enforce_retention_duration=False,
    dry_run=False
)
vacuumed_paths_len = len(dt.get_add_actions(flatten=True)["path"])
vacuumed_files_len = len(dt.files())
print(f"{vacuumed_paths_len=}") # v0.22.2 -> 205 / v0.22.3 -> 6
print(f"{vacuumed_files_len=}") # v0.22.2 -> 205 / v0.22.3 -> 6

@ion-elgreco
Copy link
Collaborator

@sebvey ok good! Then our fix in 0.22.3 resolved that issue as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate bug Something isn't working mre-needed Whether an MRE needs to be provided
Projects
None yet
Development

No branches or pull requests

3 participants