table compaction #3043
Labels: binding/rust (Issues for the Rust crate), bug (Something isn't working), mre-needed (Whether an MRE needs to be provided)
Comments
@sebvey could you please create a simple MRE in Python that I can run as is? Pointing to a branch is not very helpful.
Here is the MRE you asked for.

from datetime import date, timedelta
from pathlib import Path
import shutil

import duckdb
import pyarrow as pa
from deltalake import Schema, Field, DeltaTable, WriterProperties, write_deltalake
from deltalake.schema import PrimitiveType

DT_PATH = Path("tmp/delta_table")
LOCATIONS = ["Lyon", "Paris", "Marseille"]
START_DATE, END_DATE = date(2024, 1, 1), date(2024, 4, 1)

schema = Schema([
    Field("yearmonth", PrimitiveType("string")),
    Field("datetime", PrimitiveType("timestamp_ntz")),
    Field("location", PrimitiveType("string")),
    Field("value", PrimitiveType("double")),
])

# TABLE INIT
if DT_PATH.is_dir():
    shutil.rmtree(DT_PATH)

dt = DeltaTable.create(
    table_uri=str(DT_PATH),
    schema=schema,
    partition_by=["yearmonth"],
)

# FEEDING THE TABLE
def batch(day: date, location: str) -> pa.Table:
    """Produces a pyarrow Table with 86400 records for the given day and location:
    - one 'query' per day -> ~30 files per month.
    - data partitioned by yearmonth
    """
    query = f"""
        with dts as (
            SELECT
                unnest(
                    generate_series(
                        date '{day}',
                        date '{day}' + interval 1 day - interval 1 second,
                        interval 1 second
                    )
                ) AS datetime
        )
        select
            strftime(datetime, '%Y%m') as 'yearmonth',
            datetime,
            '{location}' as 'location',
            random() as 'value'
        from dts
    """
    with duckdb.connect() as con:
        return con.sql(query).to_arrow_table()

days = (
    START_DATE + timedelta(days=i)
    for i in range((END_DATE - START_DATE).days)
)

for day in days:
    for location in LOCATIONS:
        write_deltalake(
            dt,
            batch(day, location),
            mode="append",
            writer_properties=WriterProperties(compression="ZSTD"),
        )

paths_len = len(dt.get_add_actions(flatten=True)["path"])
files_len = len(dt.files())
print(f"{paths_len=}")  # -> 273
print(f"{files_len=}")  # -> 273

dt.optimize.compact()

compacted_paths_len = len(dt.get_add_actions(flatten=True)["path"])
compacted_files_len = len(dt.files())
print(f"{compacted_paths_len=}")  # v0.22.2 -> 205 / v0.22.3 -> 6
print(f"{compacted_files_len=}")  # v0.22.2 -> 205 / v0.22.3 -> 6

dt.vacuum(
    retention_hours=0,
    enforce_retention_duration=False,
    dry_run=False,
)

vacuumed_paths_len = len(dt.get_add_actions(flatten=True)["path"])
vacuumed_files_len = len(dt.files())
print(f"{vacuumed_paths_len=}")  # v0.22.2 -> 205 / v0.22.3 -> 6
print(f"{vacuumed_files_len=}")  # v0.22.2 -> 205 / v0.22.3 -> 6
@sebvey ok good! Then our fix in 0.22.3 resolved that issue as well.
Environment
Delta-rs version: 0.22.2
Binding: python 0.22.2
Environment: python 3.13.0
Bug
What happened:
When compacting a delta table of about 273 files (380 MB) partitioned on a 'year_month' field (3 partitions), the listing of the table files seems invalid: dt.files() lists 205 files, which I don't think is expected. Am I missing something?
Log file of the 'OPTIMIZE' commit:
00000000000000000274.json
Path column of the get_add_actions():
get_add_actions.json
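A small sketch (not part of the original report, assuming the local tmp/delta_table layout from the MRE earlier in this thread) for counting the add/remove actions recorded in that OPTIMIZE commit, which helps tell whether the unexpected file count comes from the commit itself or from how the log is replayed:

import json

adds = removes = 0
# Commit file name taken from the attachment above; path assumes the MRE's local table.
with open("tmp/delta_table/_delta_log/00000000000000000274.json") as f:
    for line in f:
        action = json.loads(line)
        adds += "add" in action
        removes += "remove" in action
print(adds, removes)  # compaction should add a few large files and remove the small originals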
What you expected to happen:
dt.files() should list 6 files?
How to reproduce it:
I made a repo with the code used for the test. Use the branch deltars-issue-sample: [email protected]:sebvey/delta-optim.git. I made the README.md as clear as possible.