Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address SNOMED-related inefficiencies in TPP backend #794

Open
7 tasks
sebbacon opened this issue May 9, 2022 · 0 comments
Open
7 tasks

Address SNOMED-related inefficiencies in TPP backend #794

sebbacon opened this issue May 9, 2022 · 0 comments

Comments

@sebbacon
Copy link
Contributor

sebbacon commented May 9, 2022

The standard TPP table rebuild is weekly. It is atomic from the point of view of end users, as it happens against temporary tables that are then swapped into the product namespaces using a "rename"-type operation. Under (historically) normal load, this process took 2-3 days, typically completing at midnight on a Sunday.

At the beginning of OpenSAFELY, we only queried CTV3 codelists. Since then, we've added the capability to query SNOMED, which was implemented in the backend database as a duplicate CodedEvent table. However, due to space constraints, this table doesn't fit in the atomic "build-then-rename" process. It is therefore rebuilt in-place in a separate process, meaning end users get errors (or block the rebuild) when querying SNOMED codes. This happens immediately after then main database rebuild, and historically took 10 hours.

As the number of running jobs has increased, and the proportion of those using SNOMED has also increased, database load has meant the main build now finishes on a Tuesday, and the SNOMED build takes 2 days. This presents problems for end users, and for TPP, as contention increases and causes issues (for example, the rebuild is now risking overlapping with the start of the next rebuild).

Possible solutions

  • We have now implemented a "maintenance mode" that allows opensafely to cancel running cohortextractor processes in response to a signal from TPP. This is pending TPP confirming the database signal they will send to indicate that downtime is requested. When complete, this should allow the rebuild to happen faster, and give user-friendly feedback to researchers that it's happening

    • TPP to confirm and implement database signal, and expected schedule
    • OpenSAFELY copilots to communicate the maintenance window to end users, so they understand the implications
    • OpenSAFELY tech team to add a clear banner during maintenance window with explanatory text
  • As we're getting a compounding effect from these long-running queries, we think it's worth collectively investigating if there are a small number of queries with pathological performance that we could investigate working around, in the short term.

    • TPP to ensure SQL Monitor is running and report back
    • OpenSAFELY team to look at logs to generate hypotheses about query types that may be causing problems
    • OpenSAFELY to provide the generated SQL that we collectively identify as problematic and workshop improving it with TPP
  • Ideally TPP would move the SNOMED table over to the atomic build process, but due to resource constraints this may mean dropping CTV3 support. Meeting needed to workshop options?

  • Consider dropping build frequency

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant