Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using repack to prune old data #279

Open
fazalmajid opened this issue Jul 6, 2021 · 3 comments
Open

Using repack to prune old data #279

fazalmajid opened this issue Jul 6, 2021 · 3 comments

Comments

@fazalmajid
Copy link

fazalmajid commented Jul 6, 2021

At my company we use pg_repack primarily as a tool to delete obsolete historical data or for compliance reasons (e.g. GDPR). Running repack kills two birds with one stone: remove bloat but also remove old data where a long-running DELETE is impractical due to the maintenance window it would entail due to full table locks.

Our current implementation requires no C level changes to the extension, just patching the PL/PGSQL code, using the migration below. Essentially we add a table repack.filters that has a WHERE clause that specified what rows to keep, and optionall a repack.joins table with JOIN clauses if we need information from other tables to make the decision. If there is no matching row in those tables, the existing pg_repack behavior (copy everything) holds.

I think this is a fairly valuable feature and use case for pg_repack, and if there is interest, I'd be happy to package it as a PR for inclusion upstream.

CREATE TABLE IF NOT EXISTS repack.filters (
  tablename TEXT PRIMARY KEY,
  clause    TEXT
);

CREATE TABLE IF NOT EXISTS repack.joins (
  tablename TEXT PRIMARY KEY,
  joins     TEXT NOT NULL
);

DO $$
DECLARE
  _v TEXT;
  _vn TEXT;
BEGIN
  UPDATE pg_proc
  SET prosrc = REPLACE(prosrc, 'quote_ident', '''T.''||quote_ident')
  WHERE proname='get_columns_for_create_as'
    AND pronamespace=(SELECT oid FROM pg_namespace WHERE nspname='repack');
    
  SELECT definition INTO _v
  FROM pg_views WHERE schemaname='repack' AND viewname='tables';
  IF _v ~ 'repack.filters' THEN
    RAISE EXCEPTION 'repack filters patch already applied';
  END IF;
  EXECUTE FORMAT('CREATE OR REPLACE VIEW repack.tables AS %s', _v);
  _vn := REPLACE(_v, 'LEFT JOIN pg_class t ON ((r.reltoastrelid = t.oid))',
                 'LEFT JOIN pg_class T ON R.reltoastrelid = T.oid
     LEFT JOIN repack.filters F ON R.relname=F.tablename
     LEFT JOIN repack.joins J ON R.relname=J.tablename');
  IF _vn = _v THEN
    RAISE EXCEPTION 'repack.table view definition changed, cannot patch 1';
  END IF;
  _v := _vn;
  _vn := REPLACE(_vn,
    ' AS copy_data',
    ' || '' T'' || COALESCE(J.joins, '' '') || COALESCE('' WHERE '' || F.clause || '' '', '''') AS copy_data');
  IF _vn = _v THEN
    RAISE EXCEPTION 'repack.table view definition changed, cannot patch 2';
  END IF;
  _v := _vn;
  _vn := REPLACE(_vn,
    ' AS create_table_2',
    ' || '' T'' AS create_table_2');
  IF _vn = _v THEN
    RAISE EXCEPTION 'repack.table view definition changed, cannot patch 3';
  END IF;
  EXECUTE FORMAT('CREATE OR REPLACE VIEW repack.tables AS %s', _vn);
END;   
$$;

So for instance if you have the table definitions:

CREATE TABLE contract(
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL,
  expiration DATE NULL
);
CREATE TABLE employee(
  id SERIAL PRIMARY KEY
  name TEXT NOT NULL
  contract INT NOT NULL REFERENCES contract(id)
);

And you want to repack out employees with expired contracts, you would:

INSERT INTO repack.filters (tablename, clause)
VALUES ('employee', 'COALESCE(contract.expiration, CURRENT_DATE) >= CURRENT_DATE');
INSERT INTO repack.joins (tablename, joins)
VALUES ('employee', 'JOIN contract ON employee.contract=contract.id');
@scottcc
Copy link

scottcc commented Mar 19, 2024

👍

@Ten0
Copy link

Ten0 commented Sep 12, 2024

I'm doing something similar here:
Ten0@27ddafb

I have used that to reorder through joins, but not to filter out lines.

I'm wondering about whether I can use this to filter out lines (for a similar reason as yours).
I assume this would cause issues if there were foreign key constraints (they wouldn't be checked), but I would expect that if there aren't it shouldn't.
Is that correct, or would swapping the tables after not inserting as many lines as there originally were cause issues?

@fazalmajid
Copy link
Author

Like all DBA-level operations, it should be used carefully, of course, and if you have foreign keys, you would need to repack/purge the tables in topological sort order. That said, we've used this in production with fairly large tables (100+GB) with no issues. You will typically need to quiesce traffic to the DB at the very beginning and very end so the DDL locks used to swap the tables can be taken.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants