-
Notifications
You must be signed in to change notification settings - Fork 26
Undoing an import
If an import fails, changes may need to be reversed, and in particular, data may need to be deleted (depending on how far the import process got).
Note that this document is likely to get out of date, and so should be treated as a guide
There are several places where changes are made:
- Data files on the filesystem
-
log.json
on the filesystem - BigQuery
- Storage
- Data tables in Postgres
-
ImportLog
andTaskLog
tables in Postgres
This document describes the changes that each task (or class of task) makes, and the steps required to reverse them
For any task that has been run and has been assumed to run successfully, there will be a TaskLog
object. If the task is to be run again, the TaskLog
object should either be deleted or have its status
field changed to "failed"
.
Fetchers will download new data files to /home/hello/openprescribing-data/data/{task_name}/{yyyy_mm}/
on largeweb2
.
If an incorrect data file has been downloaded, the file should be deleted from the filesystem. Additionally, it may have been uploaded to Storage, in which case it should be deleted from https://console.cloud.google.com/storage/browser/ebmdatalab/hscic?project=ebmdatalab.
For converters and importers, remove the relevant references in log.json
.
Instructions for reversing reversible tasks are below. Other tasks (eg import_hscic_practices
) may not be reversible, since the data they create does not have a corresponding month.
This task:
- creates a couple of temporary BQ tables
- appends prescribing information to the
hscic.prescribing
table in BQ - downloads a converted CSV file for importing into relevant partition of the
frontend_prescribing
table in Postgres byimport_hscic_prescribing
To reverse:
- on the filesystem:
- delete
/home/hello/openprescribing-data/data/prescribing/{yyyy_mm}/{timestamp}_Detailed_Prescribing_Information_formatted.CSV
- delete
- in BigQuery:
- delete
tmp_eu.raw_prescribing_data_{datestamp}
- delete
tmp_eu.formatted_prescribing_{datestamp}
- run
DELETE FROM hscic.prescribing WHERE month = TIMESTAMP('{yyyy}-{mm}-01')
- delete
- in Storage
- delete
gs://ebmdatalab/hscic/prescribing/{yyyy_mm}/{timestamp}_Detailed_Prescribing_Information.csv_formatted-*.csv.gz
- delete
This task:
- imports the converted prescribing data into a partition of
frontend_prescription
in Postgres - marks as current any section/paragraph/chemical/product that is not currently marked as current
- this is not going to be easy to reverse, so don't try
To reverse:
- in Postgres:
DROP TABLE IF EXISTS frontend_prescription_{yyyy_mm}
ImportLog.objects.delete(category='prescribing', current_at='{yyyy}-{mm}-01'
This task:
- backs up the
hscic.prescribing
table in BQ to Storage
To reverse:
- in Storage
- delete
gs://ebmdatalab/prescribing/prescribing_backups/{yyyy_mm}-*.csv.gz
- delete
This task:
- updates
hscic.practices
,hscic.presentation
,hscic.practice_statistics
,hscic.ccgs
tables in BQ fromPractice
,Presentation
,PracticeStatistics
,PCT
models in Postgres - updates
hscic.bnf
table in BQ from latest CSV inhscic/bnf_codes
in Storage
It should not be necessary to reverse this, as when it is re-run (with correct data in Postgres) the incorrect data will be overwritten (but make sure to mark the corresponding ImportLog
as failed).
This task:
- creates various
MeasureValue
andMeasureGlobal
objects
To reverse:
- in Postgres:
- run
MeasureValue.objects.filter(month='{yyyy}-{mm}-01').delete()
- run
MeasureGlobal.objects.filter(month='{yyyy}-{mm}-01').delete()
- run
This task:
- creates a temporary BQ table
- creates
PPUSavings
objects
To reverse:
- in BigQuery:
- delete
hscic.prescribing_with_merged_codes_{yyyy}_{mm}
- delete
- in Postgres:
- run
PPUSaving.objects.filter(date='{yyyy}-{mm}-01').delete()
- run
ImportLog.objects.delete(category='ppu', current_at='{yyyy}-{mm}-01'
- run
This task:
- exports the results of queries from BQ to Storage
- downloads and imports the data from Storage into Postgres
To reverse:
- in Postgres:
- run
delete from vw__ccgstatistics where date = '{yyyy}-{mm}-01';
- run
delete from vw__chemical_summary_by_ccg where processing_date = '{yyyy}-{mm}-01';
- run
delete from vw__chemical_summary_by_practice where processing_date = '{yyyy}-{mm}-01';
- run
delete from vw__practice_summary where processing_date = '{yyyy}-{mm}-01';
- run
delete from vw__presentation_summary where processing_date = '{yyyy}-{mm}-01';
- run
delete from vw__presentation_summary_by_ccg where processing_date = '{yyyy}-{mm}-01';
- run
There's no need to delete the files in Storage as they'll be deleted when the command next runs