
Dropping columnstore table doesn't delete its files #22

Open
zizouqidashen opened this issue Nov 11, 2024 · 15 comments

@zizouqidashen

What happens?

When I drop a columnstore table, its data files still exist and the disk space is not released. How can I release the file space when dropping a columnstore table?

To Reproduce

create table sometable ..... using columnstore

insert into sometable .....

drop table sometable
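
For concreteness, a minimal runnable version of these steps (the table and column names are hypothetical; the originals were elided):

```sql
-- Hypothetical repro: names/columns invented for illustration.
CREATE TABLE sometable (id int, val text) USING columnstore;
INSERT INTO sometable SELECT g, 'row ' || g FROM generate_series(1, 100000) g;
DROP TABLE sometable;
-- After the drop, the table is gone from the catalog, but its
-- columnstore data files remain on disk (the bug reported here).
```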

OS:

linux

pg_mooncake Version:

latest docker image

Postgres Version:

latest docker image

Are you using pg_mooncake Docker, Neon, or the extension standalone?

pg_mooncake Docker Image

@zizouqidashen zizouqidashen added the bug Something isn't working label Nov 11, 2024
@dpxcc dpxcc changed the title from "howe to drop column store table" to "Dropping columnstore table doesn't delete its files" Nov 12, 2024
@dpxcc
Contributor

dpxcc commented Nov 12, 2024

Thanks for the bug report!
We are aware of this issue and are actively working on it.
The challenge is that Postgres uses a process model, which makes refcounting its files harder.

@dpxcc dpxcc added this to the 0.1.0 milestone Nov 13, 2024
@dpxcc dpxcc modified the milestones: 0.1.0, 0.2.0 Jan 5, 2025
@dpxcc dpxcc modified the milestones: 0.2.0, tbd Jan 15, 2025
@YuweiXiao
Contributor

YuweiXiao commented Feb 8, 2025

Hey, any updates on this? Any design doc or implementation details for resolving the issue? Besides dropped tables, once we have optimize (or the current workaround update <table> where 1=1), the old-version data files should also be taken care of.

@dpxcc
Contributor

dpxcc commented Feb 8, 2025

It's on our shortlist and planned for support soon.

Yes, both DDL (e.g. DROP TABLE, TRUNCATE, ALTER TABLE ... DROP COLUMN) and DML (e.g. UPDATE, DELETE) can render data files unused, requiring cleanup.

The challenge is that we can't delete these data files immediately upon executing those commands because:

  1. The commands may run within a transaction that could later be rolled back (see the sketch below)
  2. Other concurrent transactions might still need access to these data files, even if they were deleted in the current transaction
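
A minimal SQL illustration of point 1, using a hypothetical table name: the drop itself is transactional, so the underlying files must survive a rollback.

```sql
-- DROP TABLE only changes catalog state until the transaction commits,
-- so the columnstore files cannot be unlinked when the command runs.
BEGIN;
DROP TABLE sometable;
ROLLBACK;                         -- the drop is undone
SELECT count(*) FROM sometable;   -- must still see all the data files
```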

Thus, data files can only be deleted once they have no remaining references. In a thread-based database, reference counting is straightforward, but in Postgres, which is process-based, each connection runs in its own process and may crash without cleaning up its references.

Postgres faces the same issue with its heap table tuples. We're considering a similar approach by integrating the cleanup into Postgres' VACUUM command.

Contributions are welcome if you're interested!

@YuweiXiao
Contributor

> Thus, data files can only be deleted once they have no remaining references. In a thread-based database, reference counting is straightforward, but in Postgres, which is process-based, each connection runs in its own process and may crash without cleaning up its references. [...]

If we track xmin for each Parquet file (e.g., stored in heap table metadata), deletion could be based on the oldest xmin (and the set of active transactions).

Would love to contribute if no one has taken this on yet :)

@dpxcc
Contributor

dpxcc commented Feb 9, 2025

Yes, something like that.

  1. We currently only support copy-on-write, so all rows within a Parquet file share the same lifetime. We only need to store xmax for each Parquet file in the mooncake.data_files metadata table (this still holds after Support deletion vectors #30).
  2. When VACUUM runs, we can delete Parquet files that are invisible to any active transaction (xmax < oldest_active_txn); see the sketch below.
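
A sketch of that visibility check in SQL, assuming a hypothetical xmax column (type xid8, 0 while the file is live) on mooncake.data_files; the real schema may differ:

```sql
-- Hypothetical sketch: find Parquet files no active snapshot can see.
-- Requires PostgreSQL 13+ for xid8 / pg_current_snapshot().
SELECT file_name
FROM   mooncake.data_files
WHERE  xmax <> '0'::xid8
  AND  xmax < pg_snapshot_xmin(pg_current_snapshot());
-- pg_snapshot_xmin() is the oldest transaction this snapshot still
-- considers in-progress; a production GC would instead take the
-- oldest xmin across all backends before unlinking files.
```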

Sounds great! Appreciate the help!

@YuweiXiao
Contributor

YuweiXiao commented Feb 16, 2025

Hey @dpxcc, I'd like to discuss the design with you before diving into the code. Thanks in advance for your time and insights!

I am planning to create a new heap table, dead_data_files, to track deleted files. Whenever a file is deleted from data_files, it will be inserted into dead_data_files. The xmax (i.e., the transaction ID of the deletion) of the tuple in data_files should match the xmin of the corresponding tuple in dead_data_files. Since the heap table internally tracks xmin, the schema of dead_data_files can simply be (file_name text, ts timestamp). ts could be used during vacuum as a time threshold for physical data file deletion.
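
A sketch of the proposed tracking in SQL, with hypothetical names and columns (the real data_files schema surely has more fields):

```sql
-- Hypothetical metadata table for files awaiting physical deletion.
CREATE TABLE mooncake.dead_data_files (
    file_name text        NOT NULL,
    ts        timestamptz NOT NULL DEFAULT now()
);

-- Moving a file in one statement keeps the xmax of the data_files
-- tuple and the xmin of the dead_data_files tuple in the same
-- transaction, as described above.
WITH dead AS (
    DELETE FROM mooncake.data_files
    WHERE file_name = 'some_file.parquet'
    RETURNING file_name
)
INSERT INTO mooncake.dead_data_files (file_name)
SELECT file_name FROM dead;
```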

Alternative design options:

  1. No new metadata table: instead of creating a new table, we could add new columns to data_files to track deleted files. However, this would complicate the original table and introduce filtering overhead on the read path.
  2. Use xmax in the tuple header: we could rely on xmax in data_files to vacuum deleted files (e.g., do a dirty read). However, since the autovacuum process is managed by PG, there is a risk that tuples are vacuumed before we recycle the physical data files. Ensuring correctness without modifying the upstream PG codebase would be non-trivial.

The code will be organized under a vacuum module, hooked in via the TAM. For dropped tables, we may need a dedicated bgworker to handle them, as they won't be scheduled once dropped. Additionally, when a table is dropped, we will mark its oid in mooncake.tables with a negative value. This allows the bgworker to identify which files in data_files belong to the dropped table.

@dpxcc
Contributor

dpxcc commented Feb 17, 2025

Thanks for working on this! These are very good points.

I agree that using a new heap table to track dead data files is the better approach here.

For dropped tables, I'd prefer not to add a bgworker since it would require autoloading the extension, which we've been avoiding so far - it's hard to get pg_mooncake autoloaded on hosted PG services like Neon. Instead, I'd rather add a new command or, better, hook into VACUUM (via DuckdbUtilityHook_Cpp) to clean up data files for dropped tables. Additionally, when a table is dropped, I think it's cleaner to simply move its data files from mooncake.data_files to mooncake.dead_data_files, rather than marking its oid in mooncake.tables with a negative value.

Also, there are more data files that need to be deleted: we also need to clean up data files created within an aborted transaction or, even worse, within a crashed session.

@YuweiXiao
Contributor

YuweiXiao commented Feb 18, 2025

> For dropped tables, I'd prefer not to add a bgworker since it would require autoloading the extension, which we've been avoiding so far - it's hard to get pg_mooncake autoloaded on hosted PG services like Neon. Instead, I'd rather add a new command or, better, hook into VACUUM (via DuckdbUtilityHook_Cpp) to clean up data files for dropped tables. [...]

Yeah, avoiding autoloading is reasonable. What if we hook the cleanup in at the very end of other tables' autovacuum? Automating this would significantly improve the user experience. Additionally, we could provide a UDF for manual control, since running vacuum <dropped_table> would fail: PG would complain that the table does not exist.
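
A sketch of such a UDF, with hypothetical names (it only reclaims the metadata; the extension code would be responsible for unlinking the returned files from storage):

```sql
-- Hypothetical manual-control UDF; names and schema are illustrative.
CREATE FUNCTION mooncake.cleanup_dropped_tables(older_than interval DEFAULT '1 hour')
RETURNS SETOF text
LANGUAGE sql AS $$
    -- Reclaim metadata for dead files past the ts threshold and
    -- return their names so the caller can unlink them.
    DELETE FROM mooncake.dead_data_files
    WHERE ts < now() - older_than
    RETURNING file_name;
$$;
```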

> Also, there are more data files that need to be deleted: we also need to clean up data files created within an aborted transaction or, even worse, within a crashed session.

YES! Files produced by aborted transactions must indeed be addressed. A straightforward solution is to perform a full listing of the object store and cross-check it with the entries in data_files. This is heavy, though :(

A more PG-like approach would be to track in-flight data files within PG, e.g. in a heap table or in pgstat. Not sure if pgstat is extensible enough for this purpose (P.S. in the Neon environment there is an issue with pg_stat persistence: neondatabase/neon#6560). With the heap-table approach, new file insertions would be recorded in both data_files and shadow_data_files. Autovacuum would be disabled for shadow_data_files, ensuring tuples with an aborted xid (xmin) and xmax = 0 are properly cleaned up.

Let's create a new issue for handling aborted data files. We can clarify that design while I implement the "dropped table" one.

@dpxcc
Contributor

dpxcc commented Feb 19, 2025

I believe you can run VACUUM without specifying a table name, which should allow us to clean up dropped tables in this case: https://www.postgresql.org/docs/current/sql-vacuum.html

I agree that GC of untracked data files is better handled in a follow-up task.
I haven't thought about it in depth, but I have another idea, similar to your heap_table approach, that avoids disabling autovacuum for mooncake.dead_data_files:

  • Whenever we need to create a new data file, first run a separate transaction to add it to mooncake.dead_data_files and commit
  • Once the data file is fully written, move it from mooncake.dead_data_files to mooncake.data_files within the current transaction

This way, even if PostgreSQL crashes, orphaned data files remain tracked in mooncake.dead_data_files.
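
A sketch of that protocol in SQL, with hypothetical file names and schemas (the real flow would run from the extension's write path):

```sql
-- Step 1: in a separate, immediately-committed transaction, record the
-- file before writing it, so it stays tracked even if the session crashes.
BEGIN;
INSERT INTO mooncake.dead_data_files (file_name)
VALUES ('new_file.parquet');
COMMIT;

-- ... write the Parquet file to storage ...

-- Step 2: once the file is fully written, promote it inside the user's
-- transaction. On commit it becomes live; on abort or crash it stays in
-- dead_data_files, where VACUUM will eventually unlink it.
BEGIN;
DELETE FROM mooncake.dead_data_files WHERE file_name = 'new_file.parquet';
INSERT INTO mooncake.data_files (file_name) VALUES ('new_file.parquet');
COMMIT;
```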

@YuweiXiao
Contributor

VACUUM without a table name leads to processing all relations, including catalog heap tables. Ultimately, it invokes the vacuum routine of each relation. Reviewing the code, it might not be possible to determine whether the user specified a table name in the command. Or is it acceptable to run GC for dropped tables at the end of each table's vacuum?

> I haven't thought about it in depth, but I have another idea, similar to your heap_table approach, that avoids disabling autovacuum for mooncake.dead_data_files: [...]

Looks good to me. The separate transaction may require starting a new backend (e.g., via PQconnectdb).

@dpxcc
Contributor

dpxcc commented Feb 19, 2025

I suppose VACUUM without a table name will have VacuumStmt::rels = NIL?
https://github.com/postgres/postgres/blob/302cf15759233e654512979286ce1a5c3b36625f/src/include/nodes/parsenodes.h#L3917

It just sounds weird that when the user explicitly asks to vacuum one table, we also end up vacuuming other, dropped tables.
I'm wondering how PG vacuums dropped tables; won't it have the same issue?

@YuweiXiao
Contributor

VACUUM will construct the relation list if none is specified: https://github.com/postgres/postgres/blob/302cf15759233e654512979286ce1a5c3b36625f/src/backend/commands/vacuum.c#L563

When dropping a heap table, the storage is deleted synchronously (after commit): https://github.com/postgres/postgres/blob/8a695d7998be67445b9cd8e67faa684d4e87a40d/src/backend/catalog/storage.c#L657
In the event of a crash between the commit and the physical delete, I guess the delete operation is guaranteed to be completed during recovery.

> It just sounds weird that when the user explicitly asks to vacuum one table, we also end up vacuuming other, dropped tables.

Agreed. Ideally, users should not be aware of the vacuum process for dropped tables at all.

@dpxcc
Contributor

dpxcc commented Feb 19, 2025

No, the utility hook gets called before vacuum, at which point the relation list is just NIL if no table is specified:

```
Thread 1 "postgres" hit Breakpoint 3, DuckdbUtilityHook_Cpp (pstmt=0xaaaacf051378, query_string=0xaaaacf050910 "VACUUM;", read_only_tree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, query_env=0x0, dest=0xaaaacf051738, qc=0xfffffd1214e8) at ../../src/pgduckdb/pgduckdb_ddl.cpp:139
139         VacuumStmt *stmt = castNode(VacuumStmt, pstmt->utilityStmt);
-exec p *stmt
$2 = {type = T_VacuumStmt, options = 0x0, rels = 0x0, is_vacuumcmd = true}

Thread 1 "postgres" hit Breakpoint 6, vacuum (relations=0x0, params=0xfffffd120c18, bstrategy=0xaaaacf190380, vac_context=0xaaaacf190280, isTopLevel=true) at vacuum.c:489
489     stmttype = (params->options & VACOPT_VACUUM) ? "VACUUM" : "ANALYZE";
-exec bt
#0  vacuum (relations=0x0, params=0xfffffd120c18, bstrategy=0xaaaacf190380, vac_context=0xaaaacf190280, isTopLevel=true) at vacuum.c:489
#1  0x0000aaaacca6b3a0 in ExecVacuum (pstate=0xaaaacf16b110, vacstmt=0xaaaacf0512c8, isTopLevel=true) at vacuum.c:449
#2  0x0000aaaaccd76610 in standard_ProcessUtility (pstmt=0xaaaacf051378, queryString=0xaaaacf050910 "VACUUM;", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0xaaaacf051738, qc=0xfffffd1214e8) at utility.c:859
#3  0x0000ffffa3ad9fc8 in MooncakeHandleDDL (pstmt=0xaaaacf051378, query_string=0xaaaacf050910 "VACUUM;", read_only_tree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, query_env=0x0, dest=0xaaaacf051738, qc=0xfffffd1214e8) at ../../src/pgduckdb/pgduckdb_ddl.cpp:130
#4  0x0000ffffa3ada430 in DuckdbUtilityHook_Cpp (pstmt=0xaaaacf051378, query_string=0xaaaacf050910 "VACUUM;", read_only_tree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, query_env=0x0, dest=0xaaaacf051738, qc=0xfffffd1214e8) at ../../src/pgduckdb/pgduckdb_ddl.cpp:187
...
```

For dropping a heap table at transaction commit: what if there are concurrent transactions reading the dropped table?

@YuweiXiao
Contributor

> No, the utility hook gets called before vacuum, at which point the relation list is just NIL if no table is specified.

Great! I'll proceed with this approach.

> For dropping a heap table at transaction commit: what if there are concurrent transactions reading the dropped table?

DROP acquires the highest lock mode (AccessExclusiveLock) on the target relation, which blocks any concurrent transactions. Though the physical delete occurs after commit, the lock is still held during that time.
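
An illustration of the locking behavior with two sessions (hypothetical table name):

```sql
-- Session 1: takes AccessExclusiveLock; it is held through commit,
-- while the physical delete of the storage is performed.
BEGIN;
DROP TABLE sometable;

-- Session 2: blocks here until session 1 finishes, so no reader can
-- observe the table while its storage is being unlinked.
SELECT count(*) FROM sometable;

-- Session 1:
COMMIT;
```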

@dpxcc
Contributor

dpxcc commented Feb 20, 2025

I see, that makes sense.
This is great! So it's safe to delete data files for dropped tables in DuckdbXactCallback_Cpp.
