
Data storage querying experiments


Here I compare performance of three ways of storing prescribing data and running measures queries over them:

  1. Store data in Postgres but fully normalised, so that instead of repeatedly storing BNF code, practice code and date strings we just store these as integer pointers to other tables. Additionally, store data on disk in (bnf code, practice, date) order so that when querying over small subsets of the BNF, as we do in our measures, the data we need is all clustered together. Finally, we declare (bnf code, practice, date) as the compound primary key so we have exactly one index on the table that lets us do the queries we need (code here).

  2. Exactly the same as above, but in SQLite (code here).

  3. Store data in SQLite, but rather than storing each (presentation, practice, month) as an individual row we store the prescribing for each presentation as a matrix with 14,000 rows (one for each practice) and 60 columns (one for each month). Or, more precisely, for each presentation we store 4 such matrices: one for each of items, quantity, actual_cost and net_cost. We rely on other people's clever libraries to handle getting the matrices in and out of SQLite for us (code here).
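To make option 3 concrete, here is a minimal hand-rolled sketch of the idea using sqlite3 and numpy. The real code (linked above) relies on existing libraries for the serialisation; the table name, column names and helper functions below are invented purely for illustration.

```python
import sqlite3
import numpy as np

# Hypothetical schema: one row per presentation, with each
# (practice x month) matrix serialised into a BLOB column.
conn = sqlite3.connect("matrix.sqlite")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS presentation_matrix (
        bnf_code TEXT PRIMARY KEY,
        items BLOB,        -- 14,000 practices x 60 months
        quantity BLOB,
        actual_cost BLOB,
        net_cost BLOB
    )
    """
)


def store(bnf_code, items, quantity, actual_cost, net_cost):
    """Serialise each matrix and write one row for the presentation."""
    conn.execute(
        "INSERT OR REPLACE INTO presentation_matrix VALUES (?, ?, ?, ?, ?)",
        (
            bnf_code,
            items.astype(np.float64).tobytes(),
            quantity.astype(np.float64).tobytes(),
            actual_cost.astype(np.float64).tobytes(),
            net_cost.astype(np.float64).tobytes(),
        ),
    )
    conn.commit()


def sum_items(bnf_prefix, n_practices=14000, n_months=60):
    """Sum the 'items' matrices for every presentation under a BNF prefix."""
    total = np.zeros((n_practices, n_months))
    rows = conn.execute(
        "SELECT items FROM presentation_matrix WHERE bnf_code LIKE ?",
        (bnf_prefix + "%",),
    )
    for (blob,) in rows:
        total += np.frombuffer(blob).reshape(n_practices, n_months)
    return total
```

A measure query then becomes "fetch a handful of blobs and add some matrices together", which is why this layout is so compact and so fast to aggregate.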

After importing 5 years' worth of monthly data, the on-disk sizes were:

| Type | Size |
| --- | --- |
| Postgres | 52GB |
| SQLite | 22GB |
| SQLite+Matrix | 3.9GB |

To compare querying performance I then took all our existing measure queries and attempted to translate them into a form that could be run against the new data structure. This wasn't always possible, but it was in enough cases to give us a decent picture of overall performance.

In order to run BNF code queries against the normalised data we need to involve both the prescribing and presentation tables. I found that by far the fastest method (for both Postgres and SQLite) was to use a subquery. For example:

```sql
SELECT practice_id, date_id, SUM(quantity)
FROM prescribing
WHERE presentation_id IN (
  SELECT id FROM presentation WHERE bnf_code LIKE '201010%'
)
GROUP BY practice_id, date_id;
```

Any attempt to join the tables resulted in queries so slow that I gave up waiting for them.

The subselect works fine for the most part. However, if we want to do something like multiply each quantity by the corresponding adq_per_quantity, then we will need an actual join across the prescribing and presentation tables. It may be that there's something clever that I've missed, or it may be that this is a showstopper for that approach.
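To make that concrete, an ADQ-weighted numerator would need a query of roughly the shape below. This is a hypothetical sketch against the normalised schema, assuming adq_per_quantity lives on the presentation table; the filename is invented. It is exactly this kind of join that performed badly in my tests.

```python
import sqlite3

conn = sqlite3.connect("prescribing.sqlite")  # hypothetical database file

# Each row's quantity must be scaled by the presentation's adq_per_quantity,
# which forces a real join rather than the IN-subquery used above.
rows = conn.execute(
    """
    SELECT p.practice_id,
           p.date_id,
           SUM(p.quantity * pr.adq_per_quantity)
    FROM prescribing p
    JOIN presentation pr ON pr.id = p.presentation_id
    WHERE pr.bnf_code LIKE '201010%'
    GROUP BY p.practice_id, p.date_id
    """
).fetchall()
```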

The timings for the measure queries are in the sheet below. For each numerator and denominator query it gives the time needed to fetch all the data into Python. The actual calculating of the ratios and the sorting to determine deciles can be done pretty rapidly in numpy (hundreds of ms at worst) and is in any case a fixed cost.

https://docs.google.com/spreadsheets/d/1uswd-FO1_8phROrjE8nHzV18BzwK8stKq0JcxrM46zw/edit?usp=sharing
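For reference, the ratio and decile step mentioned above amounts to roughly the following. This is a sketch only; the random arrays are stand-ins for real numerator and denominator data.

```python
import numpy as np

# Illustrative stand-ins: one value per (practice, month) cell,
# 14,000 practices x 60 months, as in the matrices described above.
numerators = np.random.rand(14000, 60)
denominators = np.random.rand(14000, 60)

# Ratio per practice per month; cells with a zero denominator become NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    ratios = np.where(denominators > 0, numerators / denominators, np.nan)

# Deciles of the ratio across practices, for each month.
deciles = np.nanpercentile(ratios, np.arange(10, 100, 10), axis=0)
```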

My headline summary of these results would be:

  1. Postgres is too slow to power a live user endpoint (plenty of queries take over 45s, quite a few over 90s).
  2. SQLite is surprisingly fast. A good chunk of the queries complete in under a second and the majority take single-digit seconds. There are a couple of outliers at 12 and 15 seconds, but even that isn't too bad. At the extreme, summing over all items for the entire BNF takes about 680s.
  3. SQLite+Matrix is very fast. The worst case is the silver dressings measure, which takes 2.5 seconds because the denominator sums over prescribing of all dressings, which is quite large. At the extreme, summing over all items for the entire BNF takes about 30s.

Based on this, my feeling is that the future of data storage is going to involve SQLite in some fashion. I think this has a bunch of other advantages too in terms of operational simplicity and the ability to import new data offline, sanity check it in staging and then instantaneously push it to production.

Much as I feel attached to the performance of option 3, option 2 has the big advantage of just using vanilla SQL and may be fast enough for our purposes. The big question mark is whether we can work out how to handle queries like those for ADQ-based measures, which need to pull in additional data from the presentation table.