
Load testing OpenPedcan-api server #27

Closed
logstar opened this issue Sep 14, 2021 · 28 comments

@logstar
Contributor

logstar commented Sep 14, 2021

After #21 is resolved, load test the OpenPedcan-api server with multiple concurrent requests to see whether the server can handle the estimated load on the production server.

@chris-s-friedman said he would like to see certain benchmarks in #20 (review). The comment is quoted below.

I'd take a look at response times and how long it takes for R to formulate responses. Obviously, the test endpoints are quicker than the endpoints that return information about "real" data, but I'd be curious about how long the /plot endpoints take to build plots, compared to each other and compared to the /json endpoints. Plotting - particularly with ggplot - isn't the fastest operation. So, as I mention in one of my comments, I'd consider wrapping the plot endpoint functions in promises::future_promise(), so that the server doesn't get overloaded with plot requests. Obviously - take a look at the data first to see if this is actually an issue. Looking at the response times from curl-test-endpoints, it looks like responses take about 2 - 3 seconds for the tpm/gene-disease-gtex/plot endpoint, so I may consider starting there.

@chris-s-friedman Could you follow up with @blackdenc on how to implement the load test specifically to get the benchmarks you need?
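
For example, per-endpoint response times could be collected with a small timing script along these lines (a hypothetical sketch; the base URL appears later in this thread, and the endpoint paths, parameter names, and placeholder IDs are assumptions, not taken from the API docs):

```python
# Hypothetical one-off timing script; endpoint paths, parameter names, and
# the placeholder IDs are assumptions based on this thread.
import time

import requests

BASE_URL = "https://openpedcan-api-dev.d3b.io"
PARAMS = {"ensemblId": "ENSG_ID_GOES_HERE", "efoId": "EFO_ID_GOES_HERE"}

for label, path in [("json", "/tpm/gene-disease-gtex/json"),
                    ("plot", "/tpm/gene-disease-gtex/plot")]:
    start = time.monotonic()
    resp = requests.get(BASE_URL + path, params=PARAMS, timeout=60)
    elapsed = time.monotonic() - start
    print(f"{label:4s}  status={resp.status_code}  elapsed={elapsed:.2f}s")
```

Running it a few times per endpoint would give a rough per-request baseline to compare against the load-test results below.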

@logstar
Contributor Author

logstar commented Sep 15, 2021

@blackdenc - The API HTTP server with database backend has been deployed on dev host. Could you work on load testing the dev server to get some benchmarks on handling concurrent and sequential requests?

cc @taylordm @chinwallaa @afarrel

@blackdenc
Contributor

blackdenc commented Sep 16, 2021

I ran a test using the https://locust.io/ framework by picking a random ensemblId and a random efoId, then running a GET request against the API. I split requests 50/50 between the JSON endpoint and the plot endpoint, simulating 5 concurrent users for about 5 minutes. The commit that the environment was using at the time was ec9b9a27653cba45a557f0956661fa933514f274.
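
For reference, a minimal sketch of this kind of locustfile (this is not the actual gist linked below; endpoint paths, parameter names, and ID file names are assumptions based on this thread):

```python
# Illustrative locustfile sketch -- not the actual test code in the gist.
import random

from locust import HttpUser, between, task

# One ID per line; the EFO/ENSG ID files are discussed later in this thread.
ENSG_IDS = open("test_ensg_id.txt").read().splitlines()
EFO_IDS = open("test_efo_id.txt").read().splitlines()


class OpenPedCanUser(HttpUser):
    # The base URL can be set here via `host = "..."` or passed to locust
    # with --host on the command line.
    wait_time = between(1, 2)

    def _random_params(self):
        return {"ensemblId": random.choice(ENSG_IDS),
                "efoId": random.choice(EFO_IDS)}

    # Equal task weights give the 50/50 split between JSON and plot requests.
    @task(1)
    def get_json(self):
        self.client.get("/tpm/gene-disease-gtex/json",
                        params=self._random_params())

    @task(1)
    def get_plot(self):
        self.client.get("/tpm/gene-disease-gtex/plot",
                        params=self._random_params())
```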

It looks like the failure rate is a little under 50% at 2-4 requests per second, with all failures returning 500 Server Error: Internal Server Error.

I re-ran this with 1, 3 and 5 tasks, and the numbers stayed about the same across each configuration. I've attached the code used in a secret gist and the reports for each test in a zip file.

https://gist.github.com/blackdenc/2b3766617d3c16b0c6241db9f8af1a58

OpenTargets_load_test_reports.zip

@blackdenc
Contributor

blackdenc commented Sep 16, 2021

Note: Throughout the testing, the memory and CPU utilization on each task did not exceed 25%/3GB.

[Screenshot: ECS task CPU and memory utilization during the load test]

@chinwallaa

Thanks, Chris. Quick question: was the load test (1, 3, 5 tasks) run with a load balancer (horizontal scaling), or was this testing with #horizontal-scaling-containers=1?

@blackdenc
Contributor

@chinwallaa this was with 1, 3, and 5 tasks behind the load balancer, routing traffic between each.

@logstar
Contributor Author

logstar commented Sep 16, 2021

@blackdenc Thank you for the updates.

Could you use the attached EFO ID file? The previous file has some EFO IDs that were removed in an update, and the removed EFO IDs will cause 500 errors. The attached file is the updated one.

It is expected that all failed responses return a 500 internal error, because other error codes are not implemented for the API server yet, and implementing them is low priority.

Is the Locust test running on the DEV server itself, or on another machine sending requests to the DEV server? I cannot see a base URL in the Locust script.

Was Fargate scaling enabled for this load testing? If not, could you set 2-4 always-on servers, plus some other scaling rules?

test_efo_id.txt

@blackdenc
Contributor

@logstar confirmed, the tests were running against https://openpedcan-api-dev.d3b.io. I'll update the EFO list and re-run the tests.

@logstar
Contributor Author

logstar commented Sep 16, 2021

@logstar confirmed, the tests were running against https://openpedcan-api-dev.d3b.io. I'll update the EFO list and re-run the tests.

Is it running on your laptop? A 2 MB/s home network may not have the bandwidth to get the responses in time.

So is the number of tasks the number of API HTTP server instances available? If so, the runtime results are expected, because the plumber HTTP server handles requests sequentially.

However, the failure results are unexpected. Could you also share the API HTTP server-side log? It would help me understand why concurrent requests randomly fail.

@blackdenc
Contributor

blackdenc commented Sep 16, 2021

@logstar updating the EFO ID list reduced the error rate to 35% for one instance. I'll get the report and logs and post them here.

@blackdenc
Contributor

Working with @logstar to update the EFO list. In the meantime, here is the last load test report.

opentargets_api_loadtest-1_count_updated_EFO.html.zip

@logstar
Contributor Author

logstar commented Sep 16, 2021

Chris shared the following error, which occurs frequently. This is expected: the error occurs when an EFO ID has fewer than min_n_per_sample_group samples.

<simpleError in get_gene_tpm_tbl(ensg_id = ensemblId, efo_id = efoId, min_n_per_sample_group = 3): nrow(disease_long_tpm_tbl) > 0 is not TRUE>	

Attached is the updated EFO ID list with >= min_n_per_sample_group samples.

test_efo_id.txt

@blackdenc - Could you test the other two endpoints on https://openpedcan-api-dev.d3b.io/__docs__/ as well?

@blackdenc
Contributor

Updated the gist with new code and EFO IDs, and running test with one task. Will upload results once done.

@logstar
Contributor Author

logstar commented Sep 16, 2021

Updated the gist with new code and EFO IDs, and running test with one task. Will upload results once done.

The /tpm/gene-all-cancer/ endpoints do not need EFO ID parameter, but the server should omit the parameter. We'll see.

@logstar
Contributor Author

logstar commented Sep 16, 2021

@blackdenc - Just noticed that the requests will include headers. The first line of the EFO and ENSG ID files is a header. Could you add xxx_list = xxx_list[1:] before randomly selecting IDs for the requests?

https://gist.github.com/blackdenc/2b3766617d3c16b0c6241db9f8af1a58#file-locustfile-py-L4-L12
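
In other words, something like this near the top of the locustfile (a sketch; xxx_list stands for the EFO/ENSG list variables, whose actual names in the gist may differ):

```python
# Sketch of the requested fix: drop the header line before sampling IDs.
efo_list = open("test_efo_id.txt").read().splitlines()
ensg_list = open("test_ensg_id.txt").read().splitlines()

efo_list = efo_list[1:]    # first line is a column header, not an EFO ID
ensg_list = ensg_list[1:]  # first line is a column header, not an ENSG ID
```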

@blackdenc
Contributor

The /tpm/gene-all-cancer/ endpoints do not need EFO ID parameter, but the server should omit the parameter. We'll see.

It did omit the parameter, so no issues there.

@blackdenc - Just noticed that the requests will include headers.

The updated file in the gist should have the headers removed, which is how the test was run.

The 5-minute test at one task also just completed with zero errors! Based on that result, I'm comfortable saying that the earlier test failures were due to the test data. I've attached the results below, and will run a sustained-use test with 5 concurrent users over lunch to get an idea of how long responses will take under continuous use.

opentargets_api_loadtest-1_count_updated_EFO_and_endpoints.html.zip

@logstar
Contributor Author

logstar commented Sep 16, 2021

The /tpm/gene-all-cancer/ endpoints do not need EFO ID parameter, but the server should omit the parameter. We'll see.

It did omit the parameter, so no issues there.

@blackdenc - Just noticed that the requests will include headers.

The updated file in the gist should have the headers removed, which is how the test was run.

The 5-minute test at one task also just completed with zero errors! Based on that result, I'm comfortable saying that the earlier test failures were due to the test data. I've attached the results below, and will run a sustained-use test with 5 concurrent users over lunch to get an idea of how long responses will take under continuous use.

opentargets_api_loadtest-1_count_updated_EFO_and_endpoints.html.zip

Thank you for the updates.

I have created #31 to separate the scaling rules from this issue.

@logstar
Contributor Author

logstar commented Sep 16, 2021

@blackdenc - To add more context to the real use cases:

  • Every load of an FNL-team PedOT website page will send 1 JSON request and 2 PNG requests at once, with the same ENSG and EFO ID query parameters.
  • When two users load two PedOT website pages at once, there will be 6 requests in total sent to the API server.

@blackdenc
Contributor

OK, then I've been running the test at much lower load than I should have. I'll adjust the parameters and re-run.

@blackdenc
Contributor

Based on what I'm seeing in the load test, I think we should run with a default setting of 3 desired tasks. Using more than 2 concurrent users with one task seems to crash it within a few minutes while the affected container restarts. 2 or 3 tasks will allow the affected container(s) to recover without any interruption in service. To that end, the following line should be added to the Jenkinsfile:

desired_count = "3"

@blackdenc
Contributor

I manually changed the desired tasks to 3, then ran the test below for 15 minutes. The API was able to sustain 15 concurrent users sending 1-3 requests per second with 0 failures. This is the equivalent of 5 users sending 3 API calls each for the same length of time.

Based on this, I'll submit a change to #28.

opentargets_api_loadtest_3-tasks_15-users_15-minutes.html.zip

@logstar
Contributor Author

logstar commented Sep 16, 2021

Based on what I'm seeing in the load test, I think we should run with a default setting of 3 desired tasks. Using more than 2 concurrent users with one task seems to crash it within a few minutes while the affected container restarts. 2 or 3 tasks will allow the affected container(s) to recover without any interruption in service. To that end, the following line should be added to the Jenkinsfile:

desired_count = "3"

Thank you.

I agree that two concurrent users can be considered normal server load, and that we should have desired_count = "3".

In addition, @taylordm suggested the maximum server load case could be 5 concurrent users, and the last response should be completed within 10 seconds.

@blackdenc Could you test the maximum load criteria as well?

Regarding changes to #28, could we wait until the scaling rules and CPU/mem allocations are settled and change them all at once? After the deployments are done, we could do another round of load testing.

cc @chinwallaa @afarrel

@blackdenc
Contributor

Added change to desired count in 25f66f4 (#28). Will start testing potential maximum load now.

@blackdenc
Contributor

Regarding changes to #28, could we wait until the scaling rules and CPU/mem allocations are settled and change them all at once? After the deployments are done, we could do another round of load testing.

I think we are ok to push the change to the count now, then we can take another look at it when we do the CPU/mem allocation later. I don't know when the deadline for this is, but this way it's stable under stress.

@logstar
Contributor Author

logstar commented Sep 16, 2021

Regarding changes to #28, could we wait until the scaling rules and CPU/mem allocations are settled and change them all at once? After the deployments are done, we could do another round of load testing.

I think we are ok to push the change to the count now, then we can take another look at it when we do the CPU/mem allocation later. I don't know when the deadline for this is, but this way it's stable under stress.

I agree. The proposed deadline for having a stable QA deployment that can handle the normal and max server load is before next Wednesday. Let me know if you need more time; we just need to communicate it back to the FNL team.

cc @taylordm @chinwallaa

@logstar
Contributor Author

logstar commented Sep 16, 2021

@blackdenc - FYI, the plot API endpoints will have one more required parameter by today or tomorrow. More details are in https://opentargetspediatrics.slack.com/archives/C021Z53SK98/p1631808129073000?thread_ts=1631705868.070100&cid=C021Z53SK98.

@logstar
Contributor Author

logstar commented Sep 16, 2021

@blackdenc - The additional required parameter yAxisScale, implemented in 239d5ab, is now deployed on the DEV server.

The parameter takes a value of linear or log10. In real use cases, every PedOT web page load will send 1 JSON request, 1 linear PNG request, and 1 log10 PNG request.

The log10 value may be changed to log2 at a later point.
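
For the load test, one page load could be modeled as a single Locust task that sends all three requests, for example (a sketch, not the actual gist; endpoint paths and parameter names are assumptions based on this thread):

```python
# Sketch of a Locust task modeling one PedOT page load after the yAxisScale
# change: 1 JSON request, 1 linear plot request, and 1 log10 plot request.
# A real browser fires the three requests at once; they are sequential here
# for simplicity.
import random

from locust import HttpUser, between, task

ENSG_IDS = open("test_ensg_id.txt").read().splitlines()[1:]  # skip header
EFO_IDS = open("test_efo_id.txt").read().splitlines()[1:]    # skip header


class PedOTPageLoadUser(HttpUser):
    host = "https://openpedcan-api-dev.d3b.io"
    wait_time = between(1, 2)

    @task
    def load_page(self):
        params = {"ensemblId": random.choice(ENSG_IDS),
                  "efoId": random.choice(EFO_IDS)}
        self.client.get("/tpm/gene-disease-gtex/json", params=params)
        self.client.get("/tpm/gene-disease-gtex/plot",
                        params={**params, "yAxisScale": "linear"})
        self.client.get("/tpm/gene-disease-gtex/plot",
                        params={**params, "yAxisScale": "log10"})
```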

@blackdenc
Contributor

Tests show that we can support at most 6 concurrent users with 3 tasks, so to account for additional load, I increased the minimum number of tasks to 5. Because of how R allocates memory, ECS Service Type 1 will not be able to autoscale without significant module re-work, so this increase is the recommended temporary solution. Closing this issue as resolved.

@logstar
Contributor Author

logstar commented Sep 28, 2021

Tests show that we can support at most 6 concurrent users with 3 tasks, so to account for additional load, I increased the minimum number of tasks to 5. Because of how R allocates memory, ECS Service Type 1 will not be able to autoscale without significant module re-work, so this increase is the recommended temporary solution. Closing this issue as resolved.

@blackdenc - Could you summarize the load testing results in the CHOP analytics Slack channel? In particular, how long does it take to handle 5 concurrent users, i.e. 10 plot requests and 5 table requests at once? I think we expect 5 concurrent users to be handled within 10 seconds.

cc @chinwallaa @taylordm @afarrel
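
For reference, the 10-second criterion could also be checked outside Locust with a small script that fires the 15 requests 5 simultaneous page loads would generate and times the last response (a hypothetical sketch; endpoint paths, parameter names, and the placeholder IDs are assumptions based on this thread):

```python
# Hypothetical check of the 10-second max-load criterion: 5 simultaneous
# page loads -> 5 JSON + 10 plot requests, all expected to finish in <= 10 s.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "https://openpedcan-api-dev.d3b.io"


def page_load_requests(ensg_id, efo_id):
    """The three requests one PedOT page load sends at once."""
    params = {"ensemblId": ensg_id, "efoId": efo_id}
    return [
        ("/tpm/gene-disease-gtex/json", params),
        ("/tpm/gene-disease-gtex/plot", {**params, "yAxisScale": "linear"}),
        ("/tpm/gene-disease-gtex/plot", {**params, "yAxisScale": "log10"}),
    ]


def send(path_and_params):
    path, params = path_and_params
    return requests.get(BASE_URL + path, params=params, timeout=60).status_code


# 5 concurrent users, one page load each; fill in real ENSG/EFO IDs here.
pending = []
for _ in range(5):
    pending.extend(page_load_requests("ENSG_ID_GOES_HERE", "EFO_ID_GOES_HERE"))

start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(pending)) as pool:
    statuses = list(pool.map(send, pending))
elapsed = time.monotonic() - start

print("status codes:", statuses)
print(f"last response completed after {elapsed:.1f}s (target: <= 10s)")
```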
