
Identify and classify the most expensive variable extractions #795

Closed
sebbacon opened this issue May 9, 2022 · 14 comments

Comments

@sebbacon
Contributor

sebbacon commented May 9, 2022

Per #794, we are starting to hit resource contention issues, which may be related to increased general load; increased use of CodedEvent_SNOMED table; or something else.

Use the extract-stats command (or similar) to analyse one or two weeks of log files and characterise runtime properties at the variable level. For example, perhaps the most_recent_bmi function is particularly expensive in terms of "wall" time, CPU, memory, or frequency of use.

The goal is to see if there is low-hanging fruit for customising a small amount of SQL for maximum overall impact.
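
To make the intent concrete, here's a minimal sketch of the kind of per-variable summary this is asking for. It assumes structured (JSON-lines) logs with "description" and "execution_time" fields; those field names are guesses, not the actual schema used by cohort-extractor or extract-stats.

```python
# Minimal sketch: sum execution time per logged query description across log files.
# Field names ("description", "execution_time") are assumptions about the log schema.
import json
from collections import defaultdict
from pathlib import Path


def summarise_by_description(log_dir):
    """Total execution time and count per logged description, slowest first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for log_file in Path(log_dir).glob("**/*.log"):
        for line in log_file.read_text().splitlines():
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip any non-JSON lines
            desc = entry.get("description")
            secs = entry.get("execution_time")
            if desc and secs is not None:
                totals[desc] += float(secs)
                counts[desc] += 1
    return sorted(
        ((desc, totals[desc], counts[desc]) for desc in totals),
        key=lambda row: row[1],
        reverse=True,
    )


if __name__ == "__main__":
    for desc, total_secs, n in summarise_by_description("logs/")[:20]:
        print(f"{total_secs:10.1f}s  ({n:4d} entries)  {desc}")
```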

@bloodearnest
Member

We can apparently get SQLServer to include the execution plan for every query:
https://docs.microsoft.com/en-us/sql/t-sql/statements/set-statistics-profile-transact-sql?view=sql-server-ver15

@bloodearnest
Member

bloodearnest commented May 9, 2022

Regarding the above, it's worth noting that TIME and IO do not require elevated privileges (i.e. we could add them now), but PROFILE requires us to have the SHOWPLAN permission.
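
For reference, a rough sketch of what switching TIME and IO on could look like from Python via pyodbc. The connection string and query are placeholders, and the exact shape of the returned messages may differ; pyodbc surfaces informational server messages on cursor.messages.

```python
import pyodbc

# Placeholder connection string and query; not the real cohort-extractor setup.
conn = pyodbc.connect("DSN=example;UID=user;PWD=secret")
cursor = conn.cursor()

# TIME and IO don't need elevated privileges. SET STATISTICS PROFILE ON would
# also return the execution plan, but that needs the SHOWPLAN permission.
cursor.execute("SET STATISTICS TIME ON")
cursor.execute("SET STATISTICS IO ON")

cursor.execute("SELECT TOP 10 * FROM CodedEvent_SNOMED")  # example query only
rows = cursor.fetchall()

# The timing/IO statistics arrive as informational messages rather than rows.
for message in cursor.messages:
    print(message)
```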

@rebkwok
Contributor

rebkwok commented May 16, 2022

I haven't been able to make a lot of progress on this, partly because I spent quite a bit of time trying to work out a good way to look at the logs on the server.

Since we want to look across jobs/workspaces by date, I think it's better to use the collated logs in /srv/high_privacy/logs rather than the workspace's metadata folder, which is what opensafely-cli works on.

I've made a private repo (the notebook branch is my current working branch):
https://github.com/opensafely-core/stats-logs-notebooks/tree/notebook

It just contains a few files:

  • extract_stats.py: a script similar to the one in opensafely-cli, but modified a bit to work with the format of the files in the log folder
  • stats-logs.ipynb: a notebook that extracts all the job logs for a given month range. Currently it just finds the timing logs (all of them, and separately those that record SQL execution) and any logs that record memory (from measures dataframes), then displays the 100 largest by execution time (for timing logs) and by memory (for memory logs) (a rough sketch of this kind of ranking follows the list)
  • run_notebook.sh: a script to run the jupyter docker image with this notebook (it runs nbconvert and outputs an HTML file). It takes a start and end month in YYYY-MM format, but I'm not sure it's currently feasible to run it on more than one month; just running on 2022-05 entails extracting logs for >2000 jobs and takes about a minute.
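
As referenced above, a rough sketch of the "100 largest" ranking, assuming the log entries have already been parsed into dicts (the "execution_time"/"memory" key names are assumptions):

```python
import pandas as pd


def largest_entries(entries, column, n=100):
    """Return the n largest parsed log entries by the given numeric column."""
    df = pd.DataFrame(entries)
    return df.dropna(subset=[column]).nlargest(n, column)


# Hypothetical usage, with assumed field names:
# entries = [{"job_id": "hh2ptxvalhnoucop", "description": "Query for bmi", "execution_time": 12.3}]
# slowest_100 = largest_entries(entries, "execution_time")
# largest_memory_100 = largest_entries(entries, "memory")
```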

My plan (once I'd got the notebook to do something more useful) was to figure out how to use the release process to release the output HTML file to job-server. Currently I'm just using scp to move it from the VM so I can look at it; the latest one is at /home/rebkwok/stats-logs-notebooks/output/stats_logs.html (on the VM).

We time the overall execution of a generate_cohort or generate_measures action and the overall time for each index date, as well as individual SQL execution. I separated out the SQL execution from the rest, because obviously the overall timing is the longest. I'm not sure that's all that useful yet either, as the final batched SELECTs for writing the output files are the slowest things, so we just get rows and rows of SELECT TOP 32000 FROM ...

@iaindillingham iaindillingham self-assigned this May 17, 2022
@iaindillingham
Member

From this Slack thread:

> @Iain, FYI "quick and dirty" is OK for immediate purposes. We just want to generate some hypotheses about poorly-performing variables, etc.

The script currently extracts timing logs and memory logs. This is my first pass, based on the most recent run of the script.

Timing logs

There are four groups of jobs in the timing logs:

Of the 18 longest running jobs, 16 are generate_cohort jobs. Let's take a look at the longest running jobs.

| Job | Variables | with_these_clinical_events | categorised_as | Freq. |
| --- | --- | --- | --- | --- |
| hh2ptxvalhnoucop | ~20 | 5 | 2 | Monthly |
| cmur7rv3yexlq4pi | ~50 | 28 | 14 | Monthly |

Of the SQL timing logs (i.e. the subset of timing logs that record SQL execution), all relate to the final batched SELECTs, as Becky says.

The script doesn't (yet) parse codelist logs. However, my hypothesis is that the longest running jobs contain with_these_clinical_events/categorised_as variables, and these variables reference codelists.

Memory logs

There are three groups of jobs in the memory logs, by measure ID. All are for measures tables. Most relate to two jobs: ifrqal5bi2o2cdw2 (~2.2GB/log) and zru7cw2tqjrwvuxa (~1.8GB/log).

Both have a large number of measures: either programmatically generated (112) or manually generated (54).
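
A small sketch of how those memory logs could be summarised per job and measure (again assuming "job_id", "measure_id" and "memory" fields exist in the parsed entries):

```python
import pandas as pd


def memory_summary(memory_entries):
    """Peak, mean and count of memory readings per (job, measure) pair."""
    df = pd.DataFrame(memory_entries)
    return (
        df.groupby(["job_id", "measure_id"])["memory"]
        .agg(["max", "mean", "count"])
        .sort_values("max", ascending=False)
    )
```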

@sebbacon
Contributor Author

> my hypothesis is that the longest running jobs contain with_these_clinical_events/categorised_as variables, and these variables reference codelists.

Is this hypothesis based on eyeballing, or something else?

I think the SQL timing logs are definitely the ones worth a deep dive. Is there enough information in the logs to do this?

@iaindillingham
Member

Eyeballing. We don't log the variables - although we do log the SQL queries, which can be matched to the variables. I'll deep dive into the SQL timing logs today; but as Becky says, the longest running queries are those that write the output files.

@sebbacon
Contributor Author

It seems plausible (or even probable?) that output file writing, while taking the longest, is not generating the most server load, as it's going to be limited by IO rather than CPU. My hypothesis is that the slowest things after the output file writing are going to be the queries of interest.

@rebkwok
Contributor

rebkwok commented May 19, 2022

It should be quite easy to exclude the output-writing SQL logs and look at the next slowest ones. From the actual log files I've looked at, I have a suspicion that ethnicity is one of the slowest. I think that, as well as being a categorised_as query, it usually also involves a categorised codelist, which could be a contributing factor too.
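
Something like the following would drop the output-writing batches before ranking (the entry keys and the exact shape of the logged SQL text are assumptions):

```python
def exclude_output_batches(sql_timing_entries):
    """Drop the batched 'SELECT TOP 32000 ...' result fetches, keep variable queries."""
    return [
        entry
        for entry in sql_timing_entries
        if not entry.get("sql", "").lstrip().upper().startswith("SELECT TOP 32000")
    ]


# next_slowest = sorted(
#     exclude_output_batches(entries), key=lambda e: e["execution_time"], reverse=True
# )[:50]
```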

@sebbacon
Contributor Author

To add to the list of anecdotal candidates, I think BMI and smoking status might also be baaad

@iaindillingham
Member

Right, sorry, this took longer than anticipated. Initially, that was because of issues with my Windows VM and VPN/RDP; today, that was because it took me some time to understand what was being logged. I should have grepped sql and inspected a couple 🤦🏻‍♂️

Filtering the SQL logs for only those that start with "Query for" shows that the longest running are related to ethnicity. However, with one exception, they execute in under a second. Becky tells me that the same operation can have multiple entries in the log, so I may have mistakenly filtered out the useful logs. I'll keep looking.
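
One way around the multiple-entries-per-operation problem might be to collapse entries to one duration per timing id before filtering, roughly like this (the "timing_id", "description" and "execution_time" field names are guesses at the log schema):

```python
from collections import defaultdict


def duration_per_operation(entries):
    """Keep the longest duration per timing id, then filter to variable queries."""
    longest = defaultdict(float)
    descriptions = {}
    for entry in entries:
        tid = entry.get("timing_id")
        if tid is None:
            continue
        longest[tid] = max(longest[tid], float(entry.get("execution_time", 0)))
        descriptions.setdefault(tid, entry.get("description", ""))
    return sorted(
        (
            (descriptions[tid], secs)
            for tid, secs in longest.items()
            if descriptions[tid].startswith("Query for")
        ),
        key=lambda pair: pair[1],
        reverse=True,
    )
```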

@inglesp inglesp assigned rebkwok and unassigned iaindillingham Jun 5, 2022
@rebkwok
Contributor

rebkwok commented Jun 29, 2022

Update on the status of this ticket:
My latest branch on the (increasingly poorly-named) stats-logs-notebooks repo is called scripts:
https://github.com/opensafely-core/stats-logs-notebooks/tree/scripts

That branch has a README which describes the scripts it contains and how to run them. The ./run_extract_stats.sh script is the most recent one; it parses log files on the server and downloads extracted files, including some CSVs with the 1000 largest timing/memory logs by different measurements.

@rebkwok rebkwok removed their assignment Jun 30, 2022
@rebkwok
Contributor

rebkwok commented Jun 30, 2022

Next steps (still a bit vague):
Run ./run_extract_stats.sh on the latest month's logs - this will download:

  1. a json file with all logs from all jobs run in that month
  2. a CSV with the 1000 slowest timing logs
  3. a CSV with the 1000 slowest SQL timing logs (timing of the execute command in cohort-extractor) for variable queries (cohort-extractor labels those with a description that starts with "Query for..")
  4. a CSV with the 1000 slowest SQL timing logs as recorded by sql server (note there are some issues with how the timing IDs are assigned to sql server logs - see comment. I think this mostly affects the batched results fetching, which is done in a generator)
  5. a CSV with the 1000 largest memory logs (for logs that record memory - generally for measures dataframes)

Look for the variables that are slowest? Looking at the slowest timing logs may tell us something: do particular queries (e.g. age, BMI) appear more often across jobs? For the slowest variable queries, is there something notable about the job? In the overall logs JSON there should be a log of job metadata, and also logs that record the total number of variables, table joins, etc.
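
A sketch of that cross-job comparison, assuming the downloaded CSVs have "job_id", "description" and "execution_time" columns (the column names are a guess):

```python
import pandas as pd


def slow_variables_across_jobs(csv_path, threshold_secs=60):
    """Which variable queries exceed the threshold, and in how many distinct jobs?"""
    df = pd.read_csv(csv_path)
    df = df[df["description"].str.startswith("Query for", na=False)].copy()
    # e.g. "Query for most_recent_bmi" -> "most_recent_bmi"
    df["variable"] = df["description"].str.replace("Query for ", "", regex=False).str.strip()
    slow = df[df["execution_time"] >= threshold_secs]
    return (
        slow.groupby("variable")
        .agg(jobs=("job_id", "nunique"), total_secs=("execution_time", "sum"))
        .sort_values("jobs", ascending=False)
    )
```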

@inglesp
Contributor

inglesp commented Aug 2, 2022

After discussion with, among others, @sebbacon and @lucyb, we're going to close this for now. The pipeline team are working on getting metrics from job-runner into Honeycomb. Once this is done, we'll want to use this mechanism to send stats about each query to Honeycomb. This will help us to see, for a particular job, which queries were slowest, and might also allow us to compare queries between jobs.
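
Purely as an illustration of the shape this might take (not the pipeline team's actual design), each variable query could be wrapped in an OpenTelemetry span so Honeycomb can compare them within and across jobs:

```python
import time

from opentelemetry import trace

tracer = trace.get_tracer("cohortextractor.queries")  # tracer name is illustrative


def timed_query(cursor, description, sql):
    """Run one query inside a span carrying its description and elapsed time."""
    with tracer.start_as_current_span("variable_query") as span:
        span.set_attribute("query.description", description)  # e.g. "Query for bmi"
        start = time.monotonic()
        cursor.execute(sql)
        rows = cursor.fetchall()
        span.set_attribute("query.elapsed_secs", time.monotonic() - start)
        span.set_attribute("query.row_count", len(rows))
    return rows
```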

@inglesp inglesp closed this as completed Aug 2, 2022