Address SNOMED-related inefficiencies in TPP backend #794

sebbacon · 2022-05-09T11:38:54Z

The standard TPP table rebuild is weekly. It is atomic from the point of view of end users, as it happens against temporary tables that are then swapped into the product namespaces using a "rename"-type operation. Under (historically) normal load, this process took 2-3 days, typically completing at midnight on a Sunday.

At the beginning of OpenSAFELY, we only queried CTV3 codelists. Since then, we've added the capability to query SNOMED, which was implemented in the backend database as a duplicate CodedEvent table. However, due to space constraints, this table doesn't fit in the atomic "build-then-rename" process. It is therefore rebuilt in-place in a separate process, meaning end users get errors (or block the rebuild) when querying SNOMED codes. This happens immediately after then main database rebuild, and historically took 10 hours.

As the number of running jobs has increased, and the proportion of those using SNOMED has also increased, database load has meant the main build now finishes on a Tuesday, and the SNOMED build takes 2 days. This presents problems for end users, and for TPP, as contention increases and causes issues (for example, the rebuild is now risking overlapping with the start of the next rebuild).

Possible solutions

We have now implemented a "maintenance mode" that allows opensafely to cancel running cohortextractor processes in response to a signal from TPP. This is pending TPP confirming the database signal they will send to indicate that downtime is requested. When complete, this should allow the rebuild to happen faster, and give user-friendly feedback to researchers that it's happening
- TPP to confirm and implement database signal, and expected schedule
- OpenSAFELY copilots to communicate the maintenance window to end users, so they understand the implications
- OpenSAFELY tech team to add a clear banner during maintenance window with explanatory text
As we're getting a compounding effect from these long-running queries, we think it's worth collectively investigating if there are a small number of queries with pathological performance that we could investigate working around, in the short term.
- TPP to ensure SQL Monitor is running and report back
- OpenSAFELY team to look at logs to generate hypotheses about query types that may be causing problems
- OpenSAFELY to provide the generated SQL that we collectively identify as problematic and workshop improving it with TPP
Ideally TPP would move the SNOMED table over to the atomic build process, but due to resource constraints this may mean dropping CTV3 support. Meeting needed to workshop options?
Consider dropping build frequency

sebbacon mentioned this issue May 9, 2022

Identify and classify the most expensive variable extractions #795

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Address SNOMED-related inefficiencies in TPP backend #794

Address SNOMED-related inefficiencies in TPP backend #794

sebbacon commented May 9, 2022 •

edited

Loading

Address SNOMED-related inefficiencies in TPP backend #794

Address SNOMED-related inefficiencies in TPP backend #794

Comments

sebbacon commented May 9, 2022 • edited Loading

Possible solutions

sebbacon commented May 9, 2022 •

edited

Loading