
Load testing OpenPedcan-api server #27

Closed
logstar opened this issue Sep 14, 2021 · 28 comments

@logstar
Contributor

logstar commented Sep 14, 2021

After #21 is resolved, load test the OpenPedcan-api server with multiple concurrent requests to see whether the server can handle the estimated load on the production server.

@chris-s-friedman said he would like to see certain benchmarks in #20 (review). The comment is quoted below.

I'd take a look at response times and how long it takes for R to formulate responses. Obviously, the test endpoints are quicker than the endpoints that return information about "real" data, but I'd be curious about how long the /plot endpoints take to build plots, compared to each other and compared to the /json endpoints. Plotting - particularly with ggplot - isn't the fastest operation. So, as I mention in one of my comments, I'd consider wrapping the plot endpoint functions in promises::future_promise(), so that the server doesn't get overloaded with plot requests. Obviously - take a look at the data first to see if this is actually an issue. Looking at the response times from curl-test-endpoints, it looks like responses take about 2 - 3 seconds for the tpm/gene-disease-gtex/plot endpoint, so I may consider starting there.

@chris-s-friedman Could you follow up with @blackdenc on how to implement the load test specifically to get the benchmarks you need?
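
For example, per-endpoint response times could be collected with a small timing script along these lines (a hypothetical sketch; the base URL appears later in this thread, and the endpoint paths, parameter names, and placeholder IDs are assumptions, not taken from the API docs):

```python
# Hypothetical one-off timing script; endpoint paths, parameter names, and
# the placeholder IDs are assumptions based on this thread.
import time

import requests

BASE_URL = "https://openpedcan-api-dev.d3b.io"
PARAMS = {"ensemblId": "ENSG_ID_GOES_HERE", "efoId": "EFO_ID_GOES_HERE"}

for label, path in [("json", "/tpm/gene-disease-gtex/json"),
                    ("plot", "/tpm/gene-disease-gtex/plot")]:
    start = time.monotonic()
    resp = requests.get(BASE_URL + path, params=PARAMS, timeout=60)
    elapsed = time.monotonic() - start
    print(f"{label:4s}  status={resp.status_code}  elapsed={elapsed:.2f}s")
```

Running it a few times per endpoint would give a rough per-request baseline to compare against the load-test results below.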

@logstar
Contributor Author

logstar commented Sep 15, 2021

@blackdenc - The API HTTP server with database backend has been deployed on dev host. Could you work on load testing the dev server to get some benchmarks on handling concurrent and sequential requests?

cc @taylordm @chinwallaa @afarrel

@blackdenc
Contributor

blackdenc commented Sep 16, 2021

I ran a test using the https://locust.io/ framework by picking a random ensemblId and a random efoId, then running a GET request against the API. I split requests 50/50 between the JSON endpoint and the plot endpoint, simulating 5 concurrent users for about 5 minutes. The commit that the environment was using at the time was ec9b9a27653cba45a557f0956661fa933514f274.
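
For reference, a minimal sketch of this kind of locustfile (this is not the actual gist linked below; endpoint paths, parameter names, and ID file names are assumptions based on this thread):

```python
# Illustrative locustfile sketch -- not the actual test code in the gist.
import random

from locust import HttpUser, between, task

# One ID per line; the EFO/ENSG ID files are discussed later in this thread.
ENSG_IDS = open("test_ensg_id.txt").read().splitlines()
EFO_IDS = open("test_efo_id.txt").read().splitlines()


class OpenPedCanUser(HttpUser):
    # The base URL can be set here via `host = "..."` or passed to locust
    # with --host on the command line.
    wait_time = between(1, 2)

    def _random_params(self):
        return {"ensemblId": random.choice(ENSG_IDS),
                "efoId": random.choice(EFO_IDS)}

    # Equal task weights give the 50/50 split between JSON and plot requests.
    @task(1)
    def get_json(self):
        self.client.get("/tpm/gene-disease-gtex/json",
                        params=self._random_params())

    @task(1)
    def get_plot(self):
        self.client.get("/tpm/gene-disease-gtex/plot",
                        params=self._random_params())
```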

It looks like the failure rate is a little under 50% at 2-4 requests per second, with all failures returning 500 Server Error: Internal Server Error.

I re-ran this with 1, 3 and 5 tasks, and the numbers stayed about the same across each configuration. I've attached the code used in a secret gist and the reports for each test in a zip file.

https://gist.github.com/blackdenc/2b3766617d3c16b0c6241db9f8af1a58

OpenTargets_load_test_reports.zip

@blackdenc
Contributor

blackdenc commented Sep 16, 2021

Note: Throughout the testing, the memory and CPU utilization on each task did not exceed 25%/3GB.

[Screenshot: ECS task CPU and memory utilization during the load test]

@chinwallaa

Thanks, Chris. Quick question: was the load test (1, 3, 5 tasks) run with a load balancer (horizontal scaling), or was this testing with #horizontal-scaling-containers=1?

@blackdenc
Contributor

@chinwallaa this was with 1, 3, and 5 tasks behind the load balancer, routing traffic between each.

@logstar
Contributor Author

logstar commented Sep 16, 2021

@blackdenc Thank you for the updates.

Could you use the attached EFO ID file? The previous file has some EFO IDs that were removed in an update, and the removed EFO IDs will cause 500 errors. The attached file is the updated one.

It is expected that all failed responses return a 500 internal error, because other error codes are not implemented for the API server yet, and implementing them is low priority.

Is the Locust test running on the DEV server itself, or on another machine sending requests to the DEV server? I cannot see a base URL in the Locust script.

Was Fargate scaling enabled for this load testing? If not, could you set 2-4 always-on servers, plus some other scaling rules?

test_efo_id.txt

@blackdenc
Contributor

@logstar confirmed, the tests were running against https://openpedcan-api-dev.d3b.io. I'll update the EFO list and re-run the tests.

@logstar
Contributor Author

logstar commented Sep 16, 2021

@logstar confirmed, the tests were running against https://openpedcan-api-dev.d3b.io. I'll update the EFO list and re-run the tests.

Is it running on your laptop? A 2 MB/s home network may not have the bandwidth to get the responses in time.

So is the number of tasks the number of API HTTP server instances available? If so, the runtime results are expected, because the plumber HTTP server handles requests sequentially.

However, the failure results are unexpected. Could you also share the API HTTP server-side log? It would help me understand why concurrent requests randomly fail.

@blackdenc
Contributor

blackdenc commented Sep 16, 2021

@logstar updating the EFO ID list reduced the error rate to 35% for one instance. I'll get the report and logs and post them here.

@blackdenc
Contributor

Working with @logstar to update the EFO list. In the meantime, here is the last load test report.

opentargets_api_loadtest-1_count_updated_EFO.html.zip

@logstar
Contributor Author

logstar commented Sep 16, 2021

Chris shared the following error, which occurs frequently. This is expected: the error occurs when an EFO ID has fewer than min_n_per_sample_group samples.

<simpleError in get_gene_tpm_tbl(ensg_id = ensemblId, efo_id = efoId, min_n_per_sample_group = 3): nrow(disease_long_tpm_tbl) > 0 is not TRUE>	

Attached is the updated EFO ID list with >= min_n_per_sample_group samples.

test_efo_id.txt

@blackdenc - Could you test the other two endpoints on https://openpedcan-api-dev.d3b.io/__docs__/ as well?

@blackdenc
Contributor

Updated the gist with new code and EFO IDs, and running test with one task. Will upload results once done.

@logstar
Contributor Author

logstar commented Sep 16, 2021

Updated the gist with new code and EFO IDs, and running test with one task. Will upload results once done.

The /tpm/gene-all-cancer/ endpoints do not need EFO ID parameter, but the server should omit the parameter. We'll see.

@logstar
Contributor Author

logstar commented Sep 16, 2021

@blackdenc - Just noticed that the requests will include headers. The first line of the EFO and ENSG ID files is a header. Could you add xxx_list = xxx_list[1:] before randomly selecting IDs for the requests?

https://gist.github.com/blackdenc/2b3766617d3c16b0c6241db9f8af1a58#file-locustfile-py-L4-L12
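
In other words, something like this near the top of the locustfile (a sketch; xxx_list stands for the EFO/ENSG list variables, whose actual names in the gist may differ):

```python
# Sketch of the requested fix: drop the header line before sampling IDs.
efo_list = open("test_efo_id.txt").read().splitlines()
ensg_list = open("test_ensg_id.txt").read().splitlines()

efo_list = efo_list[1:]    # first line is a column header, not an EFO ID
ensg_list = ensg_list[1:]  # first line is a column header, not an ENSG ID
```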

@blackdenc
Contributor

The /tpm/gene-all-cancer/ endpoints do not need EFO ID parameter, but the server should omit the parameter. We'll see.

It did omit the parameter, so no issues there.

@blackdenc - Just noticed that the requests will include headers.

The updated file in the gist should have the headers removed, which is how the test was run.

The 5-minute test at one task also just completed with zero errors! Based on that result, I'm comfortable saying that the earlier test failures were due to the test data. I've attached the results below, and will run a sustained-use test with 5 concurrent users over lunch to get an idea of how long responses will take under continuous use.

opentargets_api_loadtest-1_count_updated_EFO_and_endpoints.html.zip

@logstar
Contributor Author

logstar commented Sep 16, 2021

The /tpm/gene-all-cancer/ endpoints do not need EFO ID parameter, but the server should omit the parameter. We'll see.

It did omit the parameter, so no issues there.

@blackdenc - Just noticed that the requests will include headers.

The updated file in the gist should have the headers removed, which is how the test was run.

The 5-minute test at one task also just completed with zero errors! Based on that result, I'm comfortable saying that the earlier test failures were due to the test data. I've attached the results below, and will run a sustained-use test with 5 concurrent users over lunch to get an idea of how long responses will take under continuous use.

opentargets_api_loadtest-1_count_updated_EFO_and_endpoints.html.zip

Thank you for the updates.

I have created #31 to separate the scaling rules from this issue.

@logstar
Contributor Author

logstar commented Sep 16, 2021

@blackdenc - To add more context to the real use cases:

  • Every load of an FNL-team PedOT website page will send 1 JSON request and 2 PNG requests at once, with the same ENSG and EFO ID query parameters.
  • When two users load two PedOT website pages at once, there will be 6 requests in total sent to the API server.

@blackdenc
Contributor

OK, then I've been running the test at much lower load than I should have. I'll adjust the parameters and re-run.

@blackdenc
Contributor

Based on what I'm seeing in the load test, I think we should run with a default setting of 3 desired tasks. Using more than 2 concurrent users with one task seems to crash it within a few minutes while the affected container restarts. 2 or 3 tasks will allow the affected container(s) to recover without any interruption in service. To that end, the following line should be added to the Jenkinsfile:

desired_count = "3"

@blackdenc
Contributor

I manually changed the desired tasks to 3, then ran the test below for 15 minutes. The API was able to sustain 15 concurrent users sending 1-3 requests per second with 0 failures. This is the equivalent of 5 users sending 3 API calls each for the same length of time.

Based on this, I'll submit a change to #28.

opentargets_api_loadtest_3-tasks_15-users_15-minutes.html.zip

@logstar
Contributor Author

logstar commented Sep 16, 2021

Based on what I'm seeing in the load test, I think we should run with a default setting of 3 desired tasks. Using more than 2 concurrent users with one task seems to crash it within a few minutes while the affected container restarts. 2 or 3 tasks will allow the affected container(s) to recover without any interruption in service. To that end, the following line should be added to the Jenkinsfile:

desired_count = "3"

Thank you.

I agree that two concurrent users can be considered normal server load, and that we should have desired_count = "3".

In addition, @taylordm suggested the maximum server load case could be 5 concurrent users, and the last response should be completed within 10 seconds.

@blackdenc Could you test the maximum load criteria as well?

Regarding changes to #28, could we wait until the scaling rules and CPU/mem allocations are settled and change them all at once? After the deployments are done, we could do another round of load testing.

cc @chinwallaa @afarrel

@blackdenc
Contributor

Added change to desired count in 25f66f4 (#28). Will start testing potential maximum load now.

@blackdenc
Contributor

Regarding changes to #28, could we wait until the scaling rules and CPU/mem allocations are settled and change them all at once? After the deployments are done, we could do another round of load testing.

I think we are ok to push the change to the count now, then we can take another look at it when we do the CPU/mem allocation later. I don't know when the deadline for this is, but this way it's stable under stress.

@logstar
Contributor Author

logstar commented Sep 16, 2021

Regarding changes to #28, could we wait until the scaling rules and CPU/mem allocations are settled and change them all at once? After the deployments are done, we could do another round of load testing.

I think we are ok to push the change to the count now, then we can take another look at it when we do the CPU/mem allocation later. I don't know when the deadline for this is, but this way it's stable under stress.

I agree. The proposed deadline for having a stable QA deployment that can handle the normal and max server load is before next Wednesday. Let me know if you need more time; we just need to communicate it back to the FNL team.

cc @taylordm @chinwallaa

@logstar
Contributor Author

logstar commented Sep 16, 2021

@blackdenc - FYI, the plot API endpoints will have one more required parameter by today or tomorrow. More details are in https://opentargetspediatrics.slack.com/archives/C021Z53SK98/p1631808129073000?thread_ts=1631705868.070100&cid=C021Z53SK98.

@logstar
Contributor Author

logstar commented Sep 16, 2021

@blackdenc - The additional required parameter yAxisScale, implemented in 239d5ab, is now deployed on the DEV server.

The parameter takes a value of linear or log10. In real use cases, every PedOT web page load will send 1 JSON request, 1 linear PNG request, and 1 log10 PNG request.

The log10 value may be changed to log2 at a later point.
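
For the load test, one page load could be modeled as a single Locust task that sends all three requests, for example (a sketch, not the actual gist; endpoint paths and parameter names are assumptions based on this thread):

```python
# Sketch of a Locust task modeling one PedOT page load after the yAxisScale
# change: 1 JSON request, 1 linear plot request, and 1 log10 plot request.
# A real browser fires the three requests at once; they are sequential here
# for simplicity.
import random

from locust import HttpUser, between, task

ENSG_IDS = open("test_ensg_id.txt").read().splitlines()[1:]  # skip header
EFO_IDS = open("test_efo_id.txt").read().splitlines()[1:]    # skip header


class PedOTPageLoadUser(HttpUser):
    host = "https://openpedcan-api-dev.d3b.io"
    wait_time = between(1, 2)

    @task
    def load_page(self):
        params = {"ensemblId": random.choice(ENSG_IDS),
                  "efoId": random.choice(EFO_IDS)}
        self.client.get("/tpm/gene-disease-gtex/json", params=params)
        self.client.get("/tpm/gene-disease-gtex/plot",
                        params={**params, "yAxisScale": "linear"})
        self.client.get("/tpm/gene-disease-gtex/plot",
                        params={**params, "yAxisScale": "log10"})
```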

@blackdenc
Contributor

Tests show that we can support at most 6 concurrent users with 3 tasks, so to account for additional load, I increased the minimum number of tasks to 5. Because of how R allocates memory, ECS Service Type 1 will not be able to autoscale without significant module re-work, so this increase is the recommended temporary solution. Closing this issue as resolved.

@logstar
Contributor Author

logstar commented Sep 28, 2021

Tests show that we can support at most 6 concurrent users with 3 tasks, so to account for additional load, I increased the minimum number of tasks to 5. Because of how R allocates memory, ECS Service Type 1 will not be able to autoscale without significant module re-work, so this increase is the recommended temporary solution. Closing this issue as resolved.

@blackdenc - Could you summarize the load testing results in the CHOP analytics Slack channel? In particular, how long does it take to handle 5 concurrent users, i.e. 10 plot requests and 5 table requests at once? I think we expect 5 concurrent users to be handled within 10 seconds.

cc @chinwallaa @taylordm @afarrel
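
For reference, the 10-second criterion could also be checked outside Locust with a small script that fires the 15 requests 5 simultaneous page loads would generate and times the last response (a hypothetical sketch; endpoint paths, parameter names, and the placeholder IDs are assumptions based on this thread):

```python
# Hypothetical check of the 10-second max-load criterion: 5 simultaneous
# page loads -> 5 JSON + 10 plot requests, all expected to finish in <= 10 s.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "https://openpedcan-api-dev.d3b.io"


def page_load_requests(ensg_id, efo_id):
    """The three requests one PedOT page load sends at once."""
    params = {"ensemblId": ensg_id, "efoId": efo_id}
    return [
        ("/tpm/gene-disease-gtex/json", params),
        ("/tpm/gene-disease-gtex/plot", {**params, "yAxisScale": "linear"}),
        ("/tpm/gene-disease-gtex/plot", {**params, "yAxisScale": "log10"}),
    ]


def send(path_and_params):
    path, params = path_and_params
    return requests.get(BASE_URL + path, params=params, timeout=60).status_code


# 5 concurrent users, one page load each; fill in real ENSG/EFO IDs here.
pending = []
for _ in range(5):
    pending.extend(page_load_requests("ENSG_ID_GOES_HERE", "EFO_ID_GOES_HERE"))

start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(pending)) as pool:
    statuses = list(pool.map(send, pending))
elapsed = time.monotonic() - start

print("status codes:", statuses)
print(f"last response completed after {elapsed:.1f}s (target: <= 10s)")
```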
