
Identify and classify the most expensive variable extractions #795

Closed
sebbacon opened this issue May 9, 2022 · 14 comments

Comments

@sebbacon
Contributor

sebbacon commented May 9, 2022

Per #794, we are starting to hit resource contention issues, which may be related to increased general load; increased use of CodedEvent_SNOMED table; or something else.

Use the extract-stats command (or similar) to analyse one or two weeks of log files and characterise runtime properties at the variable level. For example, perhaps the most_recent_bmi function is particularly expensive in terms of "wall" time, CPU, memory, or frequency of use.

The goal is to see if there is low-hanging fruit for customising a small amount of SQL for maximum overall impact.
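
To make the intent concrete, here's a minimal sketch of the kind of per-variable summary this is asking for. It assumes structured (JSON-lines) logs with "description" and "execution_time" fields; those field names are guesses, not the actual schema used by cohort-extractor or extract-stats.

```python
# Minimal sketch: sum execution time per logged query description across log files.
# Field names ("description", "execution_time") are assumptions about the log schema.
import json
from collections import defaultdict
from pathlib import Path


def summarise_by_description(log_dir):
    """Total execution time and count per logged description, slowest first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for log_file in Path(log_dir).glob("**/*.log"):
        for line in log_file.read_text().splitlines():
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip any non-JSON lines
            desc = entry.get("description")
            secs = entry.get("execution_time")
            if desc and secs is not None:
                totals[desc] += float(secs)
                counts[desc] += 1
    return sorted(
        ((desc, totals[desc], counts[desc]) for desc in totals),
        key=lambda row: row[1],
        reverse=True,
    )


if __name__ == "__main__":
    for desc, total_secs, n in summarise_by_description("logs/")[:20]:
        print(f"{total_secs:10.1f}s  ({n:4d} entries)  {desc}")
```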

@bloodearnest
Member

We can apparently get SQLServer to include the execution plan for every query:
https://docs.microsoft.com/en-us/sql/t-sql/statements/set-statistics-profile-transact-sql?view=sql-server-ver15

@bloodearnest
Member

bloodearnest commented May 9, 2022

Regarding the above, it's worth noting that TIME and IO do not require elevated privileges (i.e. we could add them now), but PROFILE requires us to have the SHOWPLAN permission.
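
For reference, a rough sketch of what switching TIME and IO on could look like from Python via pyodbc. The connection string and query are placeholders, and the exact shape of the returned messages may differ; pyodbc surfaces informational server messages on cursor.messages.

```python
import pyodbc

# Placeholder connection string and query; not the real cohort-extractor setup.
conn = pyodbc.connect("DSN=example;UID=user;PWD=secret")
cursor = conn.cursor()

# TIME and IO don't need elevated privileges. SET STATISTICS PROFILE ON would
# also return the execution plan, but that needs the SHOWPLAN permission.
cursor.execute("SET STATISTICS TIME ON")
cursor.execute("SET STATISTICS IO ON")

cursor.execute("SELECT TOP 10 * FROM CodedEvent_SNOMED")  # example query only
rows = cursor.fetchall()

# The timing/IO statistics arrive as informational messages rather than rows.
for message in cursor.messages:
    print(message)
```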

@rebkwok
Contributor

rebkwok commented May 16, 2022

I haven't been able to make a lot of progress on this, partly because I spent quite a bit of time trying to work out a good way to look at the logs on the server.

Since we want to look across jobs/workspaces by date, I think it's better to use the collated logs in /srv/high_privacy/logs rather than the workspace's metadata folder, which is what opensafely-cli works on.

I've made a private repo (the notebook branch is my current working branch):
https://github.com/opensafely-core/stats-logs-notebooks/tree/notebook

It just contains a few files:

  • extract_stats.py: a script similar to the one in opensafely-cli, but modified a bit to work with the format of the files in the log folder
  • stats-logs.ipynb: a notebook that extracts all the job logs for a given month range. Currently it just finds the timing logs (all of them, and separately those that record SQL execution) and any logs that record memory (from measures dataframes), then displays the 100 largest by execution time (for timing logs) and by memory (for memory logs) (a rough sketch of this kind of ranking follows the list)
  • run_notebook.sh: a script to run the jupyter docker image with this notebook (it runs nbconvert and outputs an HTML file). It takes a start and end month in YYYY-MM format, but I'm not sure it's currently feasible to run it on more than one month; just running on 2022-05 entails extracting logs for >2000 jobs and takes about a minute.
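
As referenced above, a rough sketch of the "100 largest" ranking, assuming the log entries have already been parsed into dicts (the "execution_time"/"memory" key names are assumptions):

```python
import pandas as pd


def largest_entries(entries, column, n=100):
    """Return the n largest parsed log entries by the given numeric column."""
    df = pd.DataFrame(entries)
    return df.dropna(subset=[column]).nlargest(n, column)


# Hypothetical usage, with assumed field names:
# entries = [{"job_id": "hh2ptxvalhnoucop", "description": "Query for bmi", "execution_time": 12.3}]
# slowest_100 = largest_entries(entries, "execution_time")
# largest_memory_100 = largest_entries(entries, "memory")
```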

My plan (once I'd got the notebook to do something more useful) was to figure out how to use the release process to release the output HTML file to job-server. Currently I'm just using scp to move it from the VM so I can look at it; the latest one is at /home/rebkwok/stats-logs-notebooks/output/stats_logs.html (on the VM).

We time the overall execution of a generate_cohort or generate_measures action and the overall time for each index date, as well as individual SQL execution. I separated out the SQL execution from the rest, because obviously the overall timing is the longest. I'm not sure that's all that useful yet either, as the final batched SELECTs for writing the output files are the slowest things, so we just get rows and rows of SELECT TOP 32000 FROM ...

@iaindillingham iaindillingham self-assigned this May 17, 2022
@iaindillingham
Member

From this Slack thread:

> @Iain, FYI "quick and dirty" is OK for immediate purposes. We just want to generate some hypotheses about poorly-performing variables, etc.

The script currently extracts timing logs and memory logs. This is my first pass, based on the most recent run of the script.

Timing logs

There are four groups of jobs in the timing logs:

Of the 18 longest running jobs, 16 are generate_cohort jobs. Let's take a look at the longest running jobs.

| Job | Variables | with_these_clinical_events | categorised_as | Freq. |
| --- | --- | --- | --- | --- |
| hh2ptxvalhnoucop | ~20 | 5 | 2 | Monthly |
| cmur7rv3yexlq4pi | ~50 | 28 | 14 | Monthly |

Of the SQL timing logs (i.e. the subset of timing logs that record SQL execution), all relate to the final batched SELECTs, as Becky says.

The script doesn't (yet) parse codelist logs. However, my hypothesis is that the longest running jobs contain with_these_clinical_events/categorised_as variables, and these variables reference codelists.

Memory logs

There are three groups of jobs in the memory logs, by measure ID. All are for measures tables. Most relate to two jobs: ifrqal5bi2o2cdw2 (~2.2GB/log) and zru7cw2tqjrwvuxa (~1.8GB/log).

Both have a large number of measures: either programmatically generated (112) or manually generated (54).
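
A small sketch of how those memory logs could be summarised per job and measure (again assuming "job_id", "measure_id" and "memory" fields exist in the parsed entries):

```python
import pandas as pd


def memory_summary(memory_entries):
    """Peak, mean and count of memory readings per (job, measure) pair."""
    df = pd.DataFrame(memory_entries)
    return (
        df.groupby(["job_id", "measure_id"])["memory"]
        .agg(["max", "mean", "count"])
        .sort_values("max", ascending=False)
    )
```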

@sebbacon
Contributor Author

> my hypothesis is that the longest running jobs contain with_these_clinical_events/categorised_as variables, and these variables reference codelists.

Is this hypothesis based on eyeballing, or something else?

I think the SQL timing logs are definitely the ones worth a deep dive. Is there enough information in the logs to do this?

@iaindillingham
Member

Eyeballing. We don't log the variables - although we do log the SQL queries, which can be matched to the variables. I'll deep dive into the SQL timing logs today; but as Becky says, the longest running queries are those that write the output files.

@sebbacon
Contributor Author

It seems plausible (or even probable?) that output file writing, while taking the longest, is not generating the most server load, as it's going to be limited by IO rather than CPU. My hypothesis is that the slowest things after the output file writing are going to be the queries of interest.

@rebkwok
Contributor

rebkwok commented May 19, 2022

It should be quite easy to exclude the output-writing SQL logs and look at the next slowest ones. From the actual log files I've looked at, I have a suspicion that ethnicity is one of the slowest. I think that, as well as being a categorised_as query, it usually also involves a categorised codelist, which could be a contributing factor too.
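
Something like the following would drop the output-writing batches before ranking (the entry keys and the exact shape of the logged SQL text are assumptions):

```python
def exclude_output_batches(sql_timing_entries):
    """Drop the batched 'SELECT TOP 32000 ...' result fetches, keep variable queries."""
    return [
        entry
        for entry in sql_timing_entries
        if not entry.get("sql", "").lstrip().upper().startswith("SELECT TOP 32000")
    ]


# next_slowest = sorted(
#     exclude_output_batches(entries), key=lambda e: e["execution_time"], reverse=True
# )[:50]
```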

@sebbacon
Contributor Author

To add to the list of anecdotal candidates, I think BMI and smoking status might also be baaad

@iaindillingham
Member

Right, sorry, this took longer than anticipated. Initially, that was because of issues with my Windows VM and VPN/RDP; today, that was because it took me some time to understand what was being logged. I should have grepped sql and inspected a couple 🤦🏻‍♂️

Filtering the SQL logs for only those that start with "Query for" shows that the longest running are related to ethnicity. However, with one exception, they execute in under a second. Becky tells me that the same operation can have multiple entries in the log, so I may have mistakenly filtered out the useful logs. I'll keep looking.
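
One way around the multiple-entries-per-operation problem might be to collapse entries to one duration per timing id before filtering, roughly like this (the "timing_id", "description" and "execution_time" field names are guesses at the log schema):

```python
from collections import defaultdict


def duration_per_operation(entries):
    """Keep the longest duration per timing id, then filter to variable queries."""
    longest = defaultdict(float)
    descriptions = {}
    for entry in entries:
        tid = entry.get("timing_id")
        if tid is None:
            continue
        longest[tid] = max(longest[tid], float(entry.get("execution_time", 0)))
        descriptions.setdefault(tid, entry.get("description", ""))
    return sorted(
        (
            (descriptions[tid], secs)
            for tid, secs in longest.items()
            if descriptions[tid].startswith("Query for")
        ),
        key=lambda pair: pair[1],
        reverse=True,
    )
```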

@inglesp inglesp assigned rebkwok and unassigned iaindillingham Jun 5, 2022
@rebkwok
Contributor

rebkwok commented Jun 29, 2022

Update on the status of this ticket:
My latest branch on the (increasingly poorly-named) stats-logs-notebooks repo is called scripts:
https://github.com/opensafely-core/stats-logs-notebooks/tree/scripts

That branch has a README which describes the scripts it contains and how to run them. The ./run_extract_stats.sh script is the most recent one; it parses log files on the server and downloads extracted files, including some CSVs with the 1000 largest timing/memory logs by different measurements.

@rebkwok rebkwok removed their assignment Jun 30, 2022
@rebkwok
Contributor

rebkwok commented Jun 30, 2022

Next steps (still a bit vague):
Run ./run_extract_stats.sh on the latest month's logs - this will download:

  1. a json file with all logs from all jobs run in that month
  2. a CSV with the 1000 slowest timing logs
  3. a CSV with the 1000 slowest SQL timing logs (timing of the execute command in cohort-extractor) for variable queries (cohort-extractor labels those with a description that starts with "Query for..")
  4. a CSV with the 1000 slowest SQL timing logs as recorded by sql server (note there are some issues with how the timing IDs are assigned to sql server logs - see comment. I think this mostly affects the batched results fetching, which is done in a generator)
  5. a CSV with the 1000 largest memory logs (for logs that record memory - generally for measures dataframes)

Look for the variables that are slowest? Looking at the slowest timing logs may tell us something: do particular queries (e.g. age, BMI) appear more often across jobs? For the slowest variable queries, is there something notable about the job? In the overall logs JSON there should be a log of job metadata, and also logs that record the total number of variables, table joins, etc.
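
A sketch of that cross-job comparison, assuming the downloaded CSVs have "job_id", "description" and "execution_time" columns (the column names are a guess):

```python
import pandas as pd


def slow_variables_across_jobs(csv_path, threshold_secs=60):
    """Which variable queries exceed the threshold, and in how many distinct jobs?"""
    df = pd.read_csv(csv_path)
    df = df[df["description"].str.startswith("Query for", na=False)].copy()
    # e.g. "Query for most_recent_bmi" -> "most_recent_bmi"
    df["variable"] = df["description"].str.replace("Query for ", "", regex=False).str.strip()
    slow = df[df["execution_time"] >= threshold_secs]
    return (
        slow.groupby("variable")
        .agg(jobs=("job_id", "nunique"), total_secs=("execution_time", "sum"))
        .sort_values("jobs", ascending=False)
    )
```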

@inglesp
Contributor

inglesp commented Aug 2, 2022

After discussion with, among others, @sebbacon and @lucyb, we're going to close this for now. The pipeline team are working on getting metrics from job-runner into Honeycomb. Once this is done, we'll want to use this mechanism to send stats about each query to Honeycomb. This will help us to see, for a particular job, which queries were slowest, and might also allow us to compare queries between jobs.
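
Purely as an illustration of the shape this might take (not the pipeline team's actual design), each variable query could be wrapped in an OpenTelemetry span so Honeycomb can compare them within and across jobs:

```python
import time

from opentelemetry import trace

tracer = trace.get_tracer("cohortextractor.queries")  # tracer name is illustrative


def timed_query(cursor, description, sql):
    """Run one query inside a span carrying its description and elapsed time."""
    with tracer.start_as_current_span("variable_query") as span:
        span.set_attribute("query.description", description)  # e.g. "Query for bmi"
        start = time.monotonic()
        cursor.execute(sql)
        rows = cursor.fetchall()
        span.set_attribute("query.elapsed_secs", time.monotonic() - start)
        span.set_attribute("query.row_count", len(rows))
    return rows
```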

@inglesp inglesp closed this as completed Aug 2, 2022