Run migrations and vacuum on staging and prod postgres via CLI #17220

Open
devopsmatt opened this issue Jan 29, 2025 · 11 comments


Target location for SOP documentation for running raw queries

devopsmatt added the reportstream, DevSecOps, and Injection labels Jan 29, 2025
emvaldes commented Feb 3, 2025

There is a lot to consider before executing this process: ad-hoc events can negatively impact the system's structure, performance, and availability of services.

Running VACUUM ANALYZE on all tables in a safe fashion in PostgreSQL is crucial for maintaining database performance while minimizing impact on running queries. Here’s a structured approach:


1. Understand What This Process Does

  • VACUUM: Reclaims space from dead tuples.
  • ANALYZE: Updates the table's statistics for the query planner.

Safe Approach Considerations

  • Avoid locks on active tables: Use VACUUM (ANALYZE, VERBOSE) or VACUUM ANALYZE, but not VACUUM FULL unless absolutely necessary, since FULL takes an ACCESS EXCLUSIVE lock and rewrites the table (see the example after this list).
  • Run during low traffic hours: To minimize performance impact.
  • Monitor long-running transactions: Avoid vacuuming if long transactions are active.
  • Use autovacuum tuning: Instead of manual vacuuming, tweak autovacuum parameters for better performance.
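
For example, a lock-friendly per-table run looks like the following (the table name is purely hypothetical):

-- Non-blocking form; VERBOSE reports per-table details.
-- "public.report_file" is a hypothetical table name.
VACUUM (ANALYZE, VERBOSE) public.report_file;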

2. Run VACUUM ANALYZE on All Tables

Option 1: Simple Command (Safe for Small Databases)

VACUUM ANALYZE;

This applies to all tables but may be expensive on large databases.


Option 2: Run on Each Table Iteratively (Safer for Large Databases)

You can run VACUUM ANALYZE on all tables one by one. Note that VACUUM cannot run inside a transaction block, so wrapping it in a DO $$ ... $$ loop or a function will fail; instead, generate one statement per table and execute each with psql's \gexec meta-command:

SELECT format(
           'VACUUM ANALYZE %I.%I',
           schemaname,
           tablename
       )
FROM pg_tables
WHERE schemaname
NOT IN ('pg_catalog', 'information_schema')
\gexec
  • Excludes system tables (pg_catalog, information_schema).
  • Runs each generated statement in its own transaction, iterating safely over each table.
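
If even sequential per-table vacuums put noticeable pressure on the system, a throttled shell loop can space the work out. This is a minimal sketch, assuming your_database is a placeholder and that a fixed one-second pause between tables is acceptable:

psql -d your_database -At -c "
SELECT format('VACUUM ANALYZE %I.%I', schemaname, tablename)
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema');
" | while read -r stmt; do
    psql -d your_database -c "$stmt"
    sleep 1   # pause between tables to limit sustained I/O pressure
done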

Option 3: Use Parallel Execution (For Faster Execution)

If you have a large number of tables, parallel execution can speed things up.

Using GNU Parallel in Bash

psql -d your_database \
     -At \
     -c "
SELECT format('VACUUM ANALYZE %I.%I;', schemaname, tablename)
FROM pg_tables
WHERE schemaname
NOT IN ('pg_catalog', 'information_schema');
" \
| parallel -j 4 psql -d your_database -c {}
  • Runs VACUUM ANALYZE in parallel on multiple tables (-j 4 runs 4 at a time).
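
Alternatively, the stock PostgreSQL client tools can do this without GNU parallel: vacuumdb has built-in parallelism. A minimal sketch, assuming a PostgreSQL 9.5+ client and that your_database is a placeholder:

# --analyze updates planner statistics after vacuuming;
# --jobs=4 vacuums up to 4 tables concurrently.
vacuumdb --analyze --jobs=4 --dbname=your_database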

3. Automate Regular Vacuuming with a Cron Job

For automated, safe execution, you can set up a cron job on a database server:

0 3 * * * psql -d your_database -c "VACUUM ANALYZE;"
  • Runs every day at 3 AM (adjust based on traffic).
  • Ensures regular maintenance without manual intervention.
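
If the job's output should be retained for review, a variant that appends to a log (the log path is an assumption; adjust to your host's conventions):

0 3 * * * psql -d your_database -c "VACUUM ANALYZE;" >> /var/log/pg_vacuum.log 2>&1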

4. Check for Long-Running Transactions (Before Running Vacuum)

Long transactions can prevent VACUUM from cleaning up dead tuples. Run:

SELECT pid, 
       age(clock_timestamp(), query_start), 
       state, 
       query
FROM pg_stat_activity
WHERE state = 'active' 
AND query NOT LIKE 'VACUUM%';
  • If any query has been running for a long time, consider delaying vacuum.
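
To focus only on genuinely old transactions, filter by transaction age; the one-hour cutoff below is an arbitrary example, and pg_backend_pid() excludes the session running the check:

SELECT pid,
       age(clock_timestamp(), xact_start) AS xact_age,
       state,
       query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND xact_start < clock_timestamp() - interval '1 hour'
  AND pid <> pg_backend_pid();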

5. Monitor Vacuum Progress and Logs

After running VACUUM ANALYZE, check for effectiveness:

SELECT relname, 
       last_vacuum, 
       last_autovacuum, 
       last_analyze, 
       last_autoanalyze
FROM pg_stat_user_tables
ORDER BY last_vacuum DESC;
  • Ensures tables are properly maintained.
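
While a vacuum is actually running, live progress can be watched in pg_stat_progress_vacuum (available since PostgreSQL 9.6):

SELECT pid,
       relid::regclass AS table_name,
       phase,
       heap_blks_scanned,
       heap_blks_total
FROM pg_stat_progress_vacuum;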

6. Optimize Autovacuum Settings

Instead of manual vacuuming, fine-tune autovacuum in postgresql.conf:

autovacuum_vacuum_threshold = 50
autovacuum_analyze_threshold = 50
autovacuum_vacuum_cost_limit = 2000
autovacuum_vacuum_cost_delay = 10ms
  • Adjust based on workload.
    Note: This is a native process but is not supported in the existing systems. Once the migration is completed, we can assign this responsibility to the system itself, which can handle it as part of its native services.

7. Final Recommendations

  • For small databases: VACUUM ANALYZE;
  • For large databases: Iterate over tables with a DO $$ block.
  • For large table counts: Use parallel execution with GNU parallel.
  • For automation: Use cron jobs or optimize autovacuum.
  • For safety: Check for long-running transactions before vacuuming.

emvaldes added this to the done milestone Feb 3, 2025
emvaldes commented Feb 3, 2025

I have created a separate GitHub Workflow to handle these operations on a scheduled basis, but I will track that development in a separate ticket.

emvaldes commented Feb 5, 2025

Based on the current scope and the instructions provided to me, this operation will be requested of the CDC DBA engineers, who will perform it on all databases (across all environments).
For that, a separate ticket will be created in the CDC ticketing system. I will update this ticket to reflect the status of the requests once they are completed.
For now, I recommend closing this issue, as we have provided all the documentation that is needed and performing these changes is now delegated to another team.

emvaldes commented Feb 5, 2025

I will shortly post the ticket numbers associated with these operations.

emvaldes commented Feb 5, 2025

Production request at the CDC Service Desk:
https://servicedesk.cdc.gov/esc?id=ticket&table=sc_req_item&sys_id=8b5524ed87275e90bec0cbb6cebb353e&view=sp


Server Name or Alias: pdhprod-pgsql.postgres.database.azure.com
Database Name: prime_data_hub
Enterprise Catalog Number (ESC #): 3205
Associated Application: PRIME ReportStream

Instructions:
The development team has requested that a VACUUM ANALYZE be performed on this database.

Implementation time: 02/07/2025 13:48:25

emvaldes commented Feb 5, 2025

Staging request at the CDC Service Desk:
https://servicedesk.cdc.gov/esc?id=ticket&table=sc_req_item&sys_id=d007ec2587675e90bec0cbb6cebb358c&view=sp


Server Name or Alias: pdhstaging-pgsql.postgres.database.azure.com
Database Name: prime_data_hub
Enterprise Catalog Number (ESC #): 3205
Associated Application: PRIME ReportStream

Instructions:
The development team has requested that a VACUUM ANALYZE be performed on this database.

Implementation time: 02/07/2025 13:48:25

emvaldes commented Feb 5, 2025

Test request at the CDC Service Desk:
https://servicedesk.cdc.gov/esc?id=ticket&table=sc_req_item&sys_id=5f59a46187a75e90bec0cbb6cebb354d&view=sp


Server Name or Alias: pdhtest-pgsql.postgres.database.azure.com
Database Name: prime_data_hub
Enterprise Catalog Number (ESC #): 3205
Associated Application: PRIME ReportStream

Instructions:
The development team has requested that a VACUUM ANALYZE be performed on this database.

Implementation time: 02/07/2025 13:48:25

emvaldes commented Feb 7, 2025

I have learned from the CDC DBA team that Vacuum Analyze (contrary to popular belief) is a process that is already enabled at the system level in the current setup.
There is no need to trigger any additional operations, or to script or automate anything, because these arbitrary operations will only cause more problems for both performance and system availability.

Whenever these operations are performed, there is a significant amount of table locking and CPU spiking that affects the system. Vacuum Analyze is currently set up with percentage thresholds that trigger it whenever those conditions are met.
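
For reference, autovacuum vacuums a table once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples, and autoanalyze fires analogously off the analyze parameters; the per-table overrides below (an example the DBA team shared, against an unrelated cycle2425.student table) adjust exactly those knobs.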

--ALTER TABLE cycle2425.student SET (autovacuum_vacuum_scale_factor = 0.05);
--ALTER TABLE cycle2425.student SET (autovacuum_vacuum_threshold = 2000000);
ALTER TABLE cycle2425.student SET (autovacuum_analyze_scale_factor = 0.20);
ALTER TABLE cycle2425.student SET (autovacuum_analyze_threshold = 20000000);
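
To verify which tables carry such per-table overrides, a generic catalog check (not specific to any schema):

SELECT relname,
       reloptions
FROM pg_class
WHERE reloptions IS NOT NULL;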

emvaldes commented Feb 7, 2025

We have these queries that can be run to determine the status of these operations:
Note: It would be important that someone review these.

last autovacuum:

select schemaname, 
       relname, 
       autovacuum_count, 
       last_autovacuum,  
       last_autoanalyze, 
       last_vacuum, 
       last_analyze 
from pg_stat_user_tables
order by autovacuum_count desc,
         last_autovacuum;

last auto and manual times and counts:

select schemaname, 
       relname, 
       autovacuum_count, 
       autoanalyze_count, 
       last_autovacuum,  
       last_autoanalyze, 
       last_vacuum, 
       last_analyze 
from pg_stat_user_tables
order by autoanalyze_count desc,
         last_autovacuum;

last auto times and counts:

select schemaname, 
       relname, 
       autovacuum_count, 
       autoanalyze_count, 
       last_autovacuum,  
       last_autoanalyze 
from pg_stat_user_tables
order by autoanalyze_count desc,
         last_autovacuum;

all info autovacuum:

select * 
from pg_stat_user_tables 
order by autovacuum_count desc;
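
One additional status check worth running alongside these is the dead-tuple ratio, which shows whether autovacuum is keeping up; what counts as a "healthy" ratio varies by workload, so treat this as a sketch:

select relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1), 2) as dead_pct
from pg_stat_user_tables
order by dead_pct desc;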

emvaldes commented Feb 7, 2025

This is the current understanding we have.

Be mindful that the DBA team is still in the process of performing a proper assessment of the database systems and their configurations. Certain adjustments will need to be made, but in time they will be able to give us a final verdict as to:

  • What is working within acceptable parameters and in compliance.
  • What needs to be changed/updated (modified), with the associated level of urgency.
  • The existing databases have Vacuum Analyze enabled/active; it does not run on a scheduled basis but is triggered based on threshold parameters. There are no tangible recommendations to enact any changes at this point. This is critical to understand so we do not overwhelm a system that is in an unknown management stage.
  • The proper action plan and course of action to proceed with this mandatory migration (upgrade).

Unless the development team has objections, I propose cancelling these requests to enact scheduled and manual Vacuum Analyze operations. Nevertheless, someone else with more knowledge should make that decision; I am exclusively stating facts.

emvaldes commented Feb 7, 2025

As a last note, and based on the DBA team's preliminary recommendations, it is possible to override these system-wide configurations on a table-by-table basis.

Note: It's also recommended that we stay away from "automated" auto-vacuuming operations.

emvaldes closed this as completed Feb 7, 2025
emvaldes reopened this Feb 7, 2025