Skip to content

Undoing an import

Peter Inglesby edited this page Jan 10, 2018 · 2 revisions

Undoing an import

If an import fails, changes may need to be reversed, and in particular, data may need to be deleted (depending on how far the import process got).

Note that this document is likely to get out of date, and so should be treated as a guide

There are several places where changes are made:

  • Data files on the filesystem
  • log.json on the filesystem
  • BigQuery
  • Storage
  • Data tables in Postgres
  • ImportLog and TaskLog tables in Postgres

This document describes the changes that each task (or class of task) makes, and the steps required to reverse them

General

For any task that has been run and has been assumed to run successfully, there will be a TaskLog object. If the task is to be run again, the TaskLog object should either be deleted or have its status field changed to "failed".

Fetchers

Fetchers will download new data files to /home/hello/openprescribing-data/data/{task_name}/{yyyy_mm}/ on largeweb2.

If an incorrect data file has been downloaded, the file should be deleted from the filesystem. Additionally, it may have been uploaded to Storage, in which case it should be deleted from https://console.cloud.google.com/storage/browser/ebmdatalab/hscic?project=ebmdatalab.

Non-fetchers

For converters and importers, remove the relevant references in log.json.

convert_hscic_prescribing

This task:

  • creates a couple of temporary BQ tables
  • appends prescribing information to the hscic.prescribing table in BQ
  • downloads a converted CSV file for importing into relevant partition of the frontend_prescribing table in Postgres by import_hscic_prescribing

To reverse:

  • on the filesystem:
    • delete /home/hello/openprescribing-data/data/prescribing/{yyyy_mm}/{timestamp}_Detailed_Prescribing_Information_formatted.CSV
  • in BigQuery:
    • delete tmp_eu.raw_prescribing_data_{datestamp}
    • delete tmp_eu.formatted_prescribing_{datestamp}
    • run DELETE FROM hscic.prescribing WHERE month = TIMESTAMP('{yyyy}-{mm}-01')
  • in Storage
    • delete gs://ebmdatalab/hscic/prescribing/{yyyy_mm}/{timestamp}_Detailed_Prescribing_Information.csv_formatted-*.csv.gz

import_hscic_prescribing

This task:

  • imports the converted prescribing data into a partition of frontend_prescription in Postgres
  • marks as current any section/paragraph/chemical/product that is not currently marked as current
    • this is not going to be easy to reverse, so don't try

To reverse:

  • in Postgres:
    • DROP TABLE IF EXISTS frontend_prescription_{yyyy_mm}
    • ImportLog.objects.delete(category='prescribing', current_at='{yyyy}-{mm}-01'

backup_prescribing

This task:

  • backs up the hscic.prescribing table in BQ to Storage

To reverse:

  • in Storage
    • delete gs://ebmdatalab/prescribing/prescribing_backups/{yyyy_mm}-*.csv.gz

upload_to_bigquery

This task:

  • updates hscic.practices, hscic.presentation, hscic.practice_statistics, hscic.ccgs tables in BQ from Practice, Presentation, PracticeStatistics, PCT models in Postgres
  • updates hscic.bnf table in BQ from latest CSV in hscic/bnf_codes in Storage

It should not be necessary to reverse this, as when it is re-run (with correct data in Postgres) the incorrect data will be overwritten (but make sure to mark the corresponding ImportLog as failed).

import_measures

This task:

  • creates various MeasureValue and MeasureGlobal objects

To reverse:

  • in Postgres:
    • run MeasureValue.objects.filter(month='{yyyy}-{mm}-01').delete()
    • run MeasureGlobal.objects.filter(month='{yyyy}-{mm}-01').delete()

generate_ppu

This task:

  • creates a temporary BQ table
  • creates PPUSavings objects

To reverse:

  • in BigQuery:
    • delete hscic.prescribing_with_merged_codes_{yyyy}_{mm}
  • in Postgres:
    • run PPUSaving.objects.filter(date='{yyyy}-{mm}-01').delete()
    • run ImportLog.objects.delete(category='ppu', current_at='{yyyy}-{mm}-01'

refresh_views

This task:

  • exports the results of queries from BQ to Storage
  • downloads and imports the data from Storage into Postgres

To reverse:

  • in Postgres:
    • run delete from vw__ccgstatistics where date = '{yyyy}-{mm}-01';
    • run delete from vw__chemical_summary_by_ccg where processing_date = '{yyyy}-{mm}-01';
    • run delete from vw__chemical_summary_by_practice where processing_date = '{yyyy}-{mm}-01';
    • run delete from vw__practice_summary where processing_date = '{yyyy}-{mm}-01';
    • run delete from vw__presentation_summary where processing_date = '{yyyy}-{mm}-01';
    • run delete from vw__presentation_summary_by_ccg where processing_date = '{yyyy}-{mm}-01';

There's no need to delete the files in Storage as they'll be deleted when the command next runs