Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete unwanted audits #9839

Merged
merged 1 commit into from
Nov 7, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 24 additions & 0 deletions app/services/data_migrations/delete_work_history_audits.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
module DataMigrations
class DeleteWorkHistoryAudits
TIMESTAMP = 20240917154444
MANUAL_RUN = false
BATCH_SIZE = 5000

def change
relation.in_batches(of: BATCH_SIZE) do |batch|
DeleteAuditsWorker.perform_async(batch.ids)
end
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will put here my opinion but feel free to choose to go forward or not with my views. It is just me being over cautious.

When you deleted locally the million of records how long did it take?
While it was deleted did you try to use local apply and see if the DB wasn't busy?

My main concern is deleting the 3 million at one go.

I'd recommend to use Batch Delivery to stagger for over a hour (or more?). Process deletions gradually in time would be a safer option.

The reason is deleting millions of records at one go (the workers processing all at the same time) can still result in locking issues on the audited_audit table. This could block other queries that need to access these records while the deletion is in progress.

Also another thing: how much DB disk space it took while the deletion was happening?
PostgreSQL uses MVCC (Multi-Version Concurrency Control) - this mean when a row is deleted, it isn't immediately removed from disk. Instead, it's marked as "dead" and made invisible to future transactions. The actual space is reclaimed later through a process called vacuuming.
So disk space could sky rocket if we delete million of rows.
And similar to table bloat, indexes will accumulate dead entries, which also need to be cleaned up by vacuuming.
If we have enough DB disk space, we shouldn't worried but would be nice to know how much DB disk usage percentage increases to avoid surprises.

Last thing: probably is in your radar, but let's run this large delete operations during off-peak times to minimize any impact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did use the batch delivery and it looked like it did a count(to spread them out I think) on all the records that we want to target, 3 million and it loaded all columns of the audit table. The PR only loads the id and it doesn't do the count

So I thought these extra steps were not necessary.

Don't know how much DB disk space we have, I'll try to find out

@dcyoung-dev what do you think? Should we spread the workers a bit?

end

def relation
Audited::Audit
.where('created_at >= ? and created_at <= ?', DateTime.new(2024, 9, 3, 11), DateTime.new(2024, 9, 3, 20))
CatalinVoineag marked this conversation as resolved.
Show resolved Hide resolved
.where(user_type: nil, user_id: nil)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth putting the EXPLAIN for this query in the PR description, or here.

Copy link
Contributor Author

@CatalinVoineag CatalinVoineag Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The explain looks decent. Using bitmap index scans.

.where(action: :create)
.where(username: '(Automated process)')
.where(associated_type: 'ApplicationChoice')
.where(auditable_type: %w[ApplicationExperience ApplicationWorkHistoryBreak])
.order(id: :asc)
end
end
end
8 changes: 8 additions & 0 deletions app/workers/delete_audits_worker.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
class DeleteAuditsWorker
include Sidekiq::Worker
sidekiq_options queue: :low_priority

def perform(audit_ids)
Audited::Audit.where(id: audit_ids).delete_all
end
end
1 change: 1 addition & 0 deletions lib/tasks/data.rake
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ DATA_MIGRATION_SERVICES = [
'DataMigrations::BackfillWithdrawalReasons',
'DataMigrations::RemoveTeacherDegreeApprenticeshipFeatureFlag',
'DataMigrations::RevertApplicationChoicesUpdatedAt',
'DataMigrations::DeleteWorkHistoryAudits',
'DataMigrations::BackfillApplicationChoicesWithWorkExperiences',
'DataMigrations::MarkUnsubmittedApplicationsWithoutEnglishProficiencyAsElfIncomplete',
'DataMigrations::BackfillEnglishProficiencyRecordsForCarriedOverApplications',
Expand Down
6 changes: 3 additions & 3 deletions spec/factories/audit.rb
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@
created_at { Time.zone.now }

transient do
application_experience { build_stubbed(:application_experience) }
application_experience { build_stubbed(:application_work_experience) }
application_choice { build_stubbed(:application_choice, :awaiting_provider_decision) }
changes { {} }
end
Expand All @@ -18,7 +18,7 @@
audit.auditable_type = 'ApplicationExperience'
audit.auditable_id = evaluator.application_experience.id
audit.associated = evaluator.application_choice
audit.user_type = evaluator.user.class.to_s
audit.user_type = evaluator.user.class.to_s unless evaluator.username == '(Automated process)'
audit.audited_changes = evaluator.changes
end
end
Expand All @@ -40,7 +40,7 @@
audit.auditable_type = 'ApplicationWorkHistoryBreak'
audit.auditable_id = evaluator.application_work_history_break.id
audit.associated = evaluator.application_choice
audit.user_type = evaluator.user.class.to_s
audit.user_type = evaluator.user.class.to_s unless evaluator.username == '(Automated process)'
CatalinVoineag marked this conversation as resolved.
Show resolved Hide resolved
audit.audited_changes = evaluator.changes
end
end
Expand Down
101 changes: 101 additions & 0 deletions spec/services/data_migrations/delete_work_history_audits_spec.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
require 'rails_helper'

RSpec.describe DataMigrations::DeleteWorkHistoryAudits do
describe '#change' do
it 'enques jobs to DeleteAuditsWorker' do
create(
:application_work_history_break_audit,
user_type: nil,
user_id: nil,
created_at: DateTime.new(2024, 9, 3, 12),
action: 'create',
username: '(Automated process)',
)

expect { described_class.new.change }.to change(
DeleteAuditsWorker.jobs, :size
).by(1)
end
end

describe '#relation' do
it 'returns the audits that need deleting' do
work_break = create(
:application_work_history_break_audit,
action: 'create',
user_type: nil,
user_id: nil,
username: '(Automated process)',
created_at: DateTime.new(2024, 9, 3, 12),
)
work_break_2 = create(
:application_work_history_break_audit,
action: 'create',
user_type: nil,
user_id: nil,
username: '(Automated process)',
created_at: DateTime.new(2024, 9, 3, 18),
)
work_experience = create(
:application_experience_audit,
action: 'create',
user_type: nil,
user_id: nil,
username: '(Automated process)',
created_at: DateTime.new(2024, 9, 3, 18),
)
work_break_with_provider_username = create(
:application_work_history_break_audit,
action: 'create',
username: 'Provider',
)
work_break_with_update_action = create(
:application_work_history_break_audit,
action: 'update',
username: '(Automated process)',
user_type: nil,
user_id: nil,
)
work_break_outside_time_range = create(
:application_work_history_break_audit,
action: 'create',
username: '(Automated process)',
user_type: nil,
user_id: nil,
created_at: DateTime.new(2024, 9, 4, 18),
)
work_break_with_provider_user_type = create(
:application_work_history_break_audit,
action: 'create',
username: '(Automated process)',
created_at: DateTime.new(2024, 9, 3, 18),
user_type: 'Provider',
)
work_experience_outside_time_range = create(
:application_experience_audit,
action: 'create',
user_type: nil,
user_id: nil,
created_at: DateTime.new(2024, 9, 4, 18),
)
application_form_audit = create(
:application_form_audit,
action: 'create',
user_type: nil,
user_id: nil,
username: '(Automated process)',
created_at: DateTime.new(2024, 9, 3, 12),
)

expect(described_class.new.relation).to include(work_break, work_break_2, work_experience)
expect(described_class.new.relation).not_to include(
work_break_with_provider_username,
work_break_with_update_action,
work_break_outside_time_range,
work_break_with_provider_user_type,
work_experience_outside_time_range,
application_form_audit,
)
end
end
end
17 changes: 17 additions & 0 deletions spec/workers/delete_audits_worker_spec.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
require 'rails_helper'

RSpec.describe DeleteAuditsWorker do
describe '#perform' do
it 'deletes audits by id' do
audit_1 = create(:application_work_history_break_audit)
audit_2 = create(:application_work_history_break_audit)
audit_3 = create(:application_work_history_break_audit)
audit_ids = [audit_1.id, audit_2.id]

described_class.new.perform(audit_ids)

expect(Audited::Audit.where(id: audit_ids)).not_to exist
expect(Audited::Audit.where(id: audit_3.id)).to exist
end
end
end
Loading