Identify and classify the most expensive variable extractions #795
We can apparently get SQL Server to include the execution plan for every query.
If we don't want that, we could perhaps get the lighter timings instead: https://docs.microsoft.com/en-us/sql/t-sql/statements/set-statistics-time-transact-sql?view=sql-server-ver15
Regarding the above, it's worth noting that TIME and IO do not require elevated privileges (i.e. we could add them now), but PROFILE requires us to have SHOWPLAN permissions.
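For reference, here is a minimal sketch of how those statistics could be read from Python. This is not the project's actual client code: the pyodbc driver, the DSN, and the example table are all assumptions. SQL Server returns the statistics as informational messages alongside the result set.

```python
# Minimal sketch (assumptions: pyodbc driver, placeholder DSN and table).
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=example;DATABASE=example;Trusted_Connection=yes"
)
cursor = conn.cursor()

# TIME and IO need no elevated privileges; PROFILE would need SHOWPLAN.
cursor.execute("SET STATISTICS TIME ON")
cursor.execute("SET STATISTICS IO ON")

cursor.execute("SELECT COUNT(*) FROM CodedEvent")  # any query of interest
cursor.fetchone()

# SQL Server returns the timings/IO counts as informational messages,
# which pyodbc surfaces on cursor.messages once results are consumed.
for _state, message in cursor.messages:
    print(message)
```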
I haven't been able to make a lot of progress on this, partly because I spent quite a bit of time trying to work out a good way to look at the logs on the server. Since we want to look across jobs/workspaces by date, I think it's better to use the collated logs in […]. I've made a private repo; it just contains a few files:
My plan (once I'd got the notebook to do something more useful) was to figure out how to use the release process to release the output HTML file to job-server. Currently I'm just using […]. We time the overall execution of a […].
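To make "we time the overall execution" concrete, a timing wrapper of roughly this shape could produce such logs. This is a sketch only: the function names and log fields here are hypothetical, not cohort-extractor's own.

```python
# Hypothetical sketch of a timing wrapper that emits structured timing
# logs; the field names are illustrative, not cohort-extractor's own.
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("timing")

@contextmanager
def timed(description, **extra):
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        logger.info(json.dumps({"timing": elapsed, "description": description, **extra}))

# Usage: wrap a whole extraction, or an individual SQL query.
# with timed("execute_query", sql=sql_text):
#     cursor.execute(sql_text)
```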
From this Slack thread:
The script currently extracts timing logs and memory logs. This is my first pass, based on the most recent run of the script.

**Timing logs**

There are four groups of jobs in the timing logs: […]

Of the 18 longest running jobs, 16 are […].

Of the SQL timing logs (i.e. the subset of timing logs that record SQL execution), all relate to the final batched […]. The script doesn't (yet) parse codelist logs. However, my hypothesis is that the longest running jobs contain […].

**Memory logs**

There are three groups of jobs in the memory logs, by measure ID. All are for measures tables. Most relate to two jobs: ifrqal5bi2o2cdw2 (~2.2GB/log) and zru7cw2tqjrwvuxa (~1.8GB/log). Both have a large number of measures: either programmatically generated (112) or manually generated (54).
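To illustrate the kind of grouping described above, here is a sketch assuming the memory logs have been flattened to JSON lines; the path and the job_id/memory_bytes field names are assumptions.

```python
# Hypothetical sketch: group memory logs by job.
# The JSON-lines input and the field names are assumptions.
import json

import pandas as pd

records = []
with open("memory_logs.jsonl") as f:  # placeholder path
    for line in f:
        records.append(json.loads(line))

df = pd.DataFrame(records)

# Mean memory per log, grouped by job; ~2.2GB/log and ~1.8GB/log
# would surface the two heavy jobs mentioned above.
by_job = df.groupby("job_id")["memory_bytes"].mean().sort_values(ascending=False)
print(by_job.head(10))
```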
Is this hypothesis based on eyeballing, or something else? I think the SQL timing logs are definitely the ones worth a deep dive. Is there enough information in the logs to do this?
Eyeballing. We don't log the variables, although we do log the SQL queries, which can be matched to the variables. I'll deep dive into the SQL timing logs today; but as Becky says, the longest running queries are those that write the output files.
It seems plausible (or even probable?) that output file writing, while taking the longest, is not generating the most server load, as it's going to be limited by IO rather than CPU. My hypothesis is that the slowest things after the output file writing are going to be the queries of interest.
It should be quite easy to exclude the output writing SQL logs and look at the next slowest ones. From the actual log files I've looked at, I have a suspicion that ethnicity is one of the slowest. I think as well as being a […]
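A sketch of that exclusion step, assuming timing-log entries carry the executed SQL; the "starts with INSERT" test is a stand-in heuristic, not necessarily how output-writing statements are actually identified.

```python
# Hypothetical sketch: drop output-writing statements, rank the rest.
# The entry structure and the INSERT heuristic are assumptions.
import json

def is_output_write(sql):
    # Stand-in heuristic; the real output-writing statements may differ.
    return sql.lstrip().upper().startswith("INSERT")

with open("sql_timing_logs.jsonl") as f:  # placeholder path
    entries = [json.loads(line) for line in f]

queries = [e for e in entries if not is_output_write(e["sql"])]
slowest = sorted(queries, key=lambda e: e["timing"], reverse=True)

for entry in slowest[:10]:
    print(f"{entry['timing']:>8.1f}s  {entry['sql'][:80]}")
```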
To add to the list of anecdotal candidates, I think BMI and smoking status might also be baaad.
Right, sorry, this took longer than anticipated. Initially that was because of issues with my Windows VM and VPN/RDP; today it was because it took me some time to understand what was being logged (I should have grepped […] first). Filtering the SQL logs for only those that start […]
Update on the status of this ticket: […]. That branch has a README which describes the scripts it contains and how to run them. The […]
Next steps (still a bit vague):
Look for the variables that are slowest? Looking at the slowest timing logs may tell us something: do particular queries (e.g. age, BMI) appear more often across jobs? For the slowest variable queries, is there something about the job? In the overall logs JSON there should be a log of job metadata, and also logs that record the total number of variables, table joins, etc.
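One way to ask the "appear more often across jobs" question is to fingerprint queries by normalising whitespace and numeric literals and then count the fingerprints. A sketch follows; the field names and normalisation rules are assumptions.

```python
# Hypothetical sketch: count how often each query shape appears across jobs.
# Field names and the normalisation rules are assumptions.
import json
import re
from collections import Counter

def fingerprint(sql):
    """Collapse whitespace and numeric literals so similar queries match."""
    sql = re.sub(r"\s+", " ", sql.strip().upper())
    return re.sub(r"\b\d+\b", "?", sql)

with open("sql_timing_logs.jsonl") as f:  # placeholder path
    entries = [json.loads(line) for line in f]

counts = Counter()
for entry in entries:
    counts[fingerprint(entry["sql"])] += 1

for shape, n in counts.most_common(5):
    print(f"{n:>4}  {shape[:80]}")
```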
After discussion with, among others, @sebbacon and @lucyb, we're going to close this for now. The pipeline team are working on getting metrics from job-runner into Honeycomb. Once this is done, we'll want to use this mechanism to send stats about each query to Honeycomb. This will help us to see, for a particular job, which queries were slowest, and might also allow us to compare queries between jobs.
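For a sense of what per-query stats sent to Honeycomb could look like, here is a sketch using the libhoney-py client directly. The write key, dataset name, and fields are assumptions; the actual integration would go through job-runner's metrics pipeline rather than this direct call.

```python
# Hypothetical sketch: send per-query stats to Honeycomb with libhoney.
# The write key, dataset, and field names are assumptions.
import libhoney

libhoney.init(writekey="PLACEHOLDER", dataset="cohort-extractor-queries")

def report_query(job_id, sql, elapsed_seconds):
    event = libhoney.new_event()
    event.add({
        "job_id": job_id,
        "sql": sql[:500],  # truncate long statements
        "duration_s": elapsed_seconds,
    })
    event.send()

# report_query("ifrqal5bi2o2cdw2", "SELECT ...", 42.0)
libhoney.close()  # flush pending events before exit
```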
Per #794, we are starting to hit resource contention issues, which may be related to increased general load; increased use of the `CodedEvent_SNOMED` table; or something else.

Use the `extract-stats` command (or similar) to analyse 1 or 2 weeks of log files, to characterise runtime properties at the variable level. For example, perhaps the `most_recent_bmi` function is particularly expensive in terms of "wall" time, or in terms of CPU or memory, or frequency of its use.

The goal is to see if there is low-hanging fruit for customising a small amount of SQL for maximum overall impact.