Load testing OpenPedcan-api server #27
@blackdenc - The API HTTP server with database backend has been deployed on the dev host. Could you work on load testing the dev server to get some benchmarks on handling concurrent and sequential requests?
I ran a test using the https://locust.io/ framework by picking a random ensemblId and a random efoId, then running a GET request against the API. I split requests 50/50 between the two endpoints. It looks like the failure rate is a little under 50% at 2-4 requests per second, with the errors all returning 500. I re-ran this with 1, 3, and 5 tasks, and the numbers stayed about the same across each configuration. I've attached the code used in a secret gist and the reports for each test in a zip file. https://gist.github.com/blackdenc/2b3766617d3c16b0c6241db9f8af1a58
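For reference, the request-generation approach described above can be sketched with the standard library alone: load the ID lists, pick a random (ensemblId, efoId) pair, and build the GET URL. The base URL is the dev host mentioned later in this thread; the endpoint path and file layout are illustrative assumptions, not the API's documented routes.

```python
import random
from urllib.parse import urlencode

# Dev host from this thread; endpoint paths below are placeholders.
BASE_URL = "https://openpedcan-api-dev.d3b.io"

def load_ids(path):
    """Read one ID per line, skipping the header row on the first line."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    return lines[1:]  # first line is a column header

def random_request_url(ensembl_ids, efo_ids, endpoint):
    """Build a GET URL for a random (ensemblId, efoId) pair."""
    params = {
        "ensemblId": random.choice(ensembl_ids),
        "efoId": random.choice(efo_ids),
    }
    return f"{BASE_URL}{endpoint}?{urlencode(params)}"
```

In a Locust task, a function like `random_request_url` would feed `self.client.get(...)`, with the 50/50 endpoint split handled by equal task weights.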
Thanks Chris. Quick question: was the load test (1, 3, 5 tasks) run with a load balancer (horizontal scaling), or was this testing with #horizontal-scaling-containers=1?
@chinwallaa this was with 1, 3, and 5 tasks behind the load balancer, which routed traffic between them.
@blackdenc Thank you for the updates. Could you use the attached EFO ID file? The previous file has some EFO IDs that were removed in an update, and the removed EFO IDs will cause 500 errors. The attached file is the updated one. It is expected that all failed responses return 500 internal server errors, because other error codes are not implemented for the API server yet, and that implementation is also low priority. Was Fargate scaling enabled for this load testing? If not, could you set 2-4 always-on servers, plus some other scaling rules?
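The stale-ID failure mode described above is easy to guard against in the test harness: drop any ID from the old list that no longer appears in the updated file. A minimal sketch (function name and sample IDs are illustrative):

```python
def prune_stale_ids(old_ids, valid_ids):
    """Keep only IDs still present in the updated EFO list, preserving order.

    IDs removed in an EFO update would otherwise trigger 500 responses
    from the API, inflating the load test's failure rate.
    """
    valid = set(valid_ids)
    return [efo_id for efo_id in old_ids if efo_id in valid]
```

Running the pruned list keeps the failure count attributable to the server rather than to stale test data.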
@logstar confirmed, the tests were running against https://openpedcan-api-dev.d3b.io. I'll update the EFO list and re-run the tests. |
Is it running on your laptop? The 2 MB/s home network may not have the bandwidth to fetch the responses in time. So the number of tasks is the number of API HTTP server instances available? If so, the runtime results are expected, because the plumber HTTP server handles requests sequentially. However, the failure results are unexpected. Could you also share the API HTTP server-side log? That will help me understand why concurrent requests randomly fail.
@logstar updating the EFO ID list reduced the error rate to 35% for one instance. I'll get the report and logs and post them here. |
Working with @logstar to update the EFO list. In the meantime, here is the last load test report.
Chris shared the following error, which occurs frequently. This is expected: the error occurs when an EFO ID has fewer records than a minimum threshold.
Attached is the updated EFO ID list, filtered to IDs at or above that minimum. @blackdenc - Could you also test the other two endpoints on https://openpedcan-api-dev.d3b.io/__docs__/ as well?
Updated the gist with new code and EFO IDs, and running test with one task. Will upload results once done. |
@blackdenc - Just noticed that the requests will include headers: the first line of the EFO and ENSG ID files is a header row. Could you update https://gist.github.com/blackdenc/2b3766617d3c16b0c6241db9f8af1a58#file-locustfile-py-L4-L12 to skip it?
It did omit the parameter, so no issues there.
The updated file in the gist should have the headers removed, which is how the test was run. The 5-minute test at one task also just completed with zero errors! Based on that result, I'm comfortable saying the earlier test failures were due to the test data. I've attached the results below, and will run a sustained-use test with 5 concurrent users over lunch to get an idea of response times under continuous use. opentargets_api_loadtest-1_count_updated_EFO_and_endpoints.html.zip
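For a sustained-use run like this, Locust's HTML report already includes percentiles, but the same summary can be computed from raw response times with the standard library. A sketch, with the returned field names being my own labels:

```python
import statistics

def latency_summary(latencies_ms):
    """Median, 95th percentile, and max of observed response times (ms)."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": cuts[94],  # 95th percentile
        "max_ms": max(latencies_ms),
    }
```

The p95 and max figures are the useful ones for a "last response within N seconds" criterion, since the median hides queueing at the tail.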
Thank you for the updates. I have created #31 to separate the scaling rules from this issue. |
@blackdenc - To add more context to the real use cases:
OK, then I've been running the test at much lower capacity than I should have. I'll adjust the parameters and re-run.
Based on what I'm seeing in the load test, I think we should run with a default setting of 3 desired tasks. Using more than 2 concurrent users with one task seems to crash it within a few minutes while the affected container restarts. 2 or 3 tasks will allow the affected container(s) to recover without any interruption in service. To that end, the desired-count setting should be added to the Jenkinsfile.
I manually changed the desired tasks to 3, then ran the test below for 15 minutes. The API was able to sustain 15 concurrent users sending 1-3 requests per second with 0 failures. This would be the equivalent of 5 users sending 3 API calls each over the same period. Based on this, I'll submit a change to #28. opentargets_api_loadtest_3-tasks_15-users_15-minutes.html.zip
Thank you. I agree two concurrent users can be considered normal server load. In addition, @taylordm suggested the maximum server load case could be 5 concurrent users, with the last response completed within 10 seconds. @blackdenc Could you also test the maximum load criteria? Regarding changes to #28, could we wait until the scaling rules and CPU/mem allocations are settled and change them all at once? After those deployments are done, we could do another round of load testing.
Added the change to the desired count in the Jenkinsfile.
I think we're OK to push the count change now, and then take another look at it when we do the CPU/mem allocation later. I don't know the deadline for this, but this way it's stable under stress.
I agree. For a stable QA deployment that can handle the normal and max server load, the proposed deadline is before next Wednesday. Let me know if you need more time; we just need to communicate it back to the FNL team.
@blackdenc - FYI, the plot API endpoints will have one more required parameter by today or tomorrow. More details are in https://opentargetspediatrics.slack.com/archives/C021Z53SK98/p1631808129073000?thread_ts=1631705868.070100&cid=C021Z53SK98. |
@blackdenc - The additional required parameter has been added to the plot endpoints. The parameter and its accepted values are documented at https://openpedcan-api-dev.d3b.io/__docs__/.
Tests show that we can support at most 6 concurrent users with 3 tasks, so to account for additional load, I increased the minimum number of tasks to 5. Because of how R allocates memory, ECS Service Type 1 will not be able to autoscale without significant module rework, so this increase is the recommended temporary solution. Closing this issue as resolved.
@blackdenc - Could you summarize the load testing results in the CHOP analytics Slack channel? In particular, how long does it take to handle 5 concurrent users, i.e. 10 plot requests and 5 table requests at once? I think we expect 5 concurrent users to be handled within 10 seconds.
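Since plumber serves requests one at a time per instance (noted earlier in this thread), the 10-second criterion can be sanity-checked with simple arithmetic: with T tasks and R simultaneous requests, worst-case completion is roughly ceil(R/T) rounds times the per-request time. The 2-second per-request figure below is an assumption for illustration, not a measured value:

```python
import math

def worst_case_seconds(n_requests, n_tasks, secs_per_request):
    """Upper bound on completion time when each task handles one request at a time."""
    rounds = math.ceil(n_requests / n_tasks)
    return rounds * secs_per_request

# 5 concurrent users -> 10 plot + 5 table = 15 simultaneous requests.
# With 5 tasks that is 3 sequential rounds; with 3 tasks, 5 rounds.
```

Under the assumed 2 s per request, 5 tasks finish 15 requests in about 6 s and 3 tasks in about 10 s, which is consistent with raising the minimum task count to 5.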
After #21 is resolved, load test the OpenPedcan-api server with multiple concurrent requests, to see whether the server can handle the estimated load on the production server. @chris-s-friedman suggested that he would like to see certain benchmarks at #20 (review). The comment is quoted below.
@chris-s-friedman Could you follow up with @blackdenc on how to implement the load test specifically to get the benchmarks you need?