📫Implement /tpm/gene-all-cancer/json and /tpm/gene-all-cancer/plot API endpoints #20

Merged: logstar merged 24 commits into main from logstar/gene-all-cancer on Sep 14, 2021

Contributor

@logstar logstar commented Sep 2, 2021

Pull Request Template

Description

Implemented the HTTP GET methods of /tpm/gene-all-cancer/json and /tpm/gene-all-cancer/plot endpoints. These two endpoints handle HTTP requests for OpenPedCan-analysis pan-cancer_group boxplot and summary table, according to the API specifications in https://nih.box.com/s/5cq2jwi6bhg0mgnowad3e6e4i60hwbnr. The changes will be deployed on the dev server at https://openpedcan-api-dev.d3b.io/__docs__/.

Type of change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Changed

  • Updated data model using OpenPedCan-analysis v9 release data.
  • Changed /tpm/gene-disease-gtex boxplot title from "Disease vs. GTEx tissue bulk gene expression" to "Primary tumor vs GTEx tissue bulk gene expression".
  • Changed "cohort =" to "Dataset =" in boxplot and summary table x-axis labels.
  • Changed "cohort" to "Dataset" in boxplot summary table columns.
  • Increased minimum number of samples required per Disease or GTEx_tissue_subgroup from 1 to 3.
  • Rotated boxplot x-axis labels by 45 degrees.
  • Changed tests/curl_test_endpoints.sh variable API_PORT to LOCAL_API_HOST_PORT.
  • Updated README.md.
  • Changed API version number from "0.1" to "v0.2.0-alpha".

Added

  • Implemented HTTP GET method for /tpm/gene-all-cancer/json API endpoint.
  • Implemented HTTP GET method for /tpm/gene-all-cancer/plot API endpoint.
  • Added API_HOST variable in tests/curl_test_endpoints.sh, in order to test DEV and QA hosts.
  • Added changelog.md.

How Has This Been Tested?

  • Integration test

  • Environment:

Working directory is the git repository root directory, i.e. the directory
that contains the .git directory of the repository.

ubuntu 18
docker 19.03
curl 7.78
sha256sum 8.28 # shasum for Mac OS, but not tested for any version.
R 4.1
R package readr 1.4.0
R package jsonlite 1.7.2
R package lintr 2.0.1
  • Test bash commands:
# working directory is the project directory (the directory that contains .git of this git repo)
#
# git should check out this PR branch
./tests/run_r_lintr.sh

docker build --no-cache -t open-ped-can-api . && docker run --rm -p 8082:80 -e DEBUG=1 open-ped-can-api

./tests/curl_test_endpoints.sh

For more details on test options and resources, see https://github.com/PediatricOpenTargets/OpenPedCan-api/tree/logstar/gene-all-cancer#3-test-run-openpedcan-api-server-locally.
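The per-request lines in the terminal output below (http_code, content_type, time_total) are the kind of report that curl's --write-out option produces. A minimal sketch, assuming the defaults of the local docker run above; the real tests/curl_test_endpoints.sh may be written differently, and build_url and check_endpoint are hypothetical helper names, not functions from the repository:

```shell
# Hypothetical helpers; the real tests/curl_test_endpoints.sh may differ.
# The defaults are assumptions that match the local docker run above.
API_HOST="${API_HOST:-http://localhost}"
LOCAL_API_HOST_PORT="${LOCAL_API_HOST_PORT:-8082}"

# Build a request URL from an endpoint path and a query string.
build_url() {
  printf '%s:%s%s?%s\n' "$API_HOST" "$LOCAL_API_HOST_PORT" "$1" "$2"
}

# Issue a GET request, discard the body, and print status, type, and timing.
check_endpoint() {
  url="$(build_url "$1" "$2")"
  printf 'GET %s\n' "$url"
  curl --silent --output /dev/null \
    --write-out 'http_code: %{http_code}\ncontent_type: %{content_type}\ntime_total: %{time_total} seconds\n' \
    "$url"
}
```

With the server running, a call such as check_endpoint /tpm/gene-all-cancer/json 'ensemblId=ENSG00000213420' would print one block in the format shown below.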

Terminal returns:

$ docker build --no-cache -t open-ped-can-api . && docker run --rm -p 8082:80 -e DEBUG=1 open-ped-can-api

# ... docker build log
Load database from aws_s3...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    85  100    85    0     0   1393      0 --:--:-- --:--:-- --:--:--  1393
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1635M  100 1635M    0     0  34.5M      0  0:00:47  0:00:47 --:--:-- 30.3M


Check database sha256sum...
tpm_data_lists.rds: OK


Done running ./load_db.sh.
...
Successfully built 6d93072884d8
Successfully tagged open-ped-can-api:latest
---------------------------------
 2021-09-02 03:36:24 
 DEBUG =  TRUE 
---------------------------------
---------------------------------
 2021-09-02 03:37:01 
 Primary tumor all-cohorts independent n samples:  1946 
 Primary tumor each-cohort independent n samples:  1961 
 GTEx all n samples:  17382 
 Number of genes:  38939 
---------------------------------
Running plumber API at http://0.0.0.0:80
Running swagger Docs at http://127.0.0.1:80/__docs__/

$ ./tests/curl_test_endpoints.sh 

# ... 20 blank lines to separate from previous commands

GET http://localhost:8082/tpm/gene-disease-gtex/json?ensemblId=ENSG00000213420&efoId=EFO_0000621
http_code: 200
content_type: application/json
time_total: 1.118614 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/plot?ensemblId=ENSG00000213420&efoId=EFO_0000621
http_code: 200
content_type: image/png
time_total: 2.088077 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/json?ensemblId=ENSG00000213420&efoId=Orphanet_178
http_code: 200
content_type: application/json
time_total: 0.905487 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/plot?ensemblId=ENSG00000213420&efoId=Orphanet_178
http_code: 200
content_type: image/png
time_total: 2.021986 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/json?ensemblId=ENSG00000213420&efoId=MONDO_0016718
http_code: 200
content_type: application/json
time_total: 0.924767 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/plot?ensemblId=ENSG00000213420&efoId=MONDO_0016718
http_code: 200
content_type: image/png
time_total: 2.460058 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/json?ensemblId=ENSG00000213420&efoId=MONDO_0016680
http_code: 200
content_type: application/json
time_total: 1.112687 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/plot?ensemblId=ENSG00000213420&efoId=MONDO_0016680
http_code: 200
content_type: image/png
time_total: 2.170083 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/json?ensemblId=ENSG00000213420&efoId=MONDO_0016685
http_code: 200
content_type: application/json
time_total: 0.778383 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/plot?ensemblId=ENSG00000213420&efoId=MONDO_0016685
http_code: 200
content_type: image/png
time_total: 2.142389 seconds


GET http://localhost:8082/tpm/gene-all-cancer/json?ensemblId=ENSG00000213420
http_code: 200
content_type: application/json
time_total: 0.521959 seconds


GET http://localhost:8082/tpm/gene-all-cancer/plot?ensemblId=ENSG00000213420
http_code: 200
content_type: image/png
time_total: 1.401613 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/json?ensemblId=ENSG00000157764&efoId=EFO_0000621
http_code: 200
content_type: application/json
time_total: 1.158260 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/plot?ensemblId=ENSG00000157764&efoId=EFO_0000621
http_code: 200
content_type: image/png
time_total: 2.393872 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/json?ensemblId=ENSG00000157764&efoId=Orphanet_178
http_code: 200
content_type: application/json
time_total: 1.074898 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/plot?ensemblId=ENSG00000157764&efoId=Orphanet_178
http_code: 200
content_type: image/png
time_total: 2.618877 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/json?ensemblId=ENSG00000157764&efoId=MONDO_0016718
http_code: 200
content_type: application/json
time_total: 1.154072 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/plot?ensemblId=ENSG00000157764&efoId=MONDO_0016718
http_code: 200
content_type: image/png
time_total: 2.694904 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/json?ensemblId=ENSG00000157764&efoId=MONDO_0016680
http_code: 200
content_type: application/json
time_total: 1.250192 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/plot?ensemblId=ENSG00000157764&efoId=MONDO_0016680
http_code: 200
content_type: image/png
time_total: 2.591269 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/json?ensemblId=ENSG00000157764&efoId=MONDO_0016685
http_code: 200
content_type: application/json
time_total: 0.776582 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/plot?ensemblId=ENSG00000157764&efoId=MONDO_0016685
http_code: 200
content_type: image/png
time_total: 1.983900 seconds


GET http://localhost:8082/tpm/gene-all-cancer/json?ensemblId=ENSG00000157764
http_code: 200
content_type: application/json
time_total: 0.450952 seconds


GET http://localhost:8082/tpm/gene-all-cancer/plot?ensemblId=ENSG00000157764
http_code: 200
content_type: image/png
time_total: 1.360901 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/json?ensemblId=ENSG00000273032&efoId=EFO_0000621
http_code: 200
content_type: application/json
time_total: 1.741817 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/plot?ensemblId=ENSG00000273032&efoId=EFO_0000621
http_code: 200
content_type: image/png
time_total: 3.424198 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/json?ensemblId=ENSG00000273032&efoId=Orphanet_178
http_code: 200
content_type: application/json
time_total: 1.933024 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/plot?ensemblId=ENSG00000273032&efoId=Orphanet_178
http_code: 200
content_type: image/png
time_total: 2.829782 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/json?ensemblId=ENSG00000273032&efoId=MONDO_0016718
http_code: 200
content_type: application/json
time_total: 2.103507 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/plot?ensemblId=ENSG00000273032&efoId=MONDO_0016718
http_code: 200
content_type: image/png
time_total: 2.716215 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/json?ensemblId=ENSG00000273032&efoId=MONDO_0016680
http_code: 200
content_type: application/json
time_total: 1.663538 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/plot?ensemblId=ENSG00000273032&efoId=MONDO_0016680
http_code: 200
content_type: image/png
time_total: 3.449910 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/json?ensemblId=ENSG00000273032&efoId=MONDO_0016685
http_code: 200
content_type: application/json
time_total: 2.094036 seconds


GET http://localhost:8082/tpm/gene-disease-gtex/plot?ensemblId=ENSG00000273032&efoId=MONDO_0016685
http_code: 200
content_type: image/png
time_total: 2.880863 seconds


GET http://localhost:8082/tpm/gene-all-cancer/json?ensemblId=ENSG00000273032
http_code: 200
content_type: application/json
time_total: 1.627246 seconds


GET http://localhost:8082/tpm/gene-all-cancer/plot?ensemblId=ENSG00000273032
http_code: 200
content_type: image/png
time_total: 1.965179 seconds

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • [NA] New and existing unit tests pass locally with my changes
  • [NA] Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings
  • I have emojis in my PR and commits 😉

Contributor Author

logstar commented Sep 2, 2021

As suggested by @jonkiky in #24, enable Cross-Origin Resource Sharing
(CORS) on the API HTTP server side to allow PedOT queries rendered by
browsers.

The security concerns were ruled out by @jonkiky, @blackdenc, and
@logstar, in #24.

@chris-s-friedman chris-s-friedman left a comment

@jharenza asked me to take a look at this implementation of R for handling an API, so I'm only looking at src/plumber.R.

The main point I want to draw attention to, which I also touch on in an individual comment:

I'd take a look at response times and how long it takes R to formulate responses. Obviously, the test endpoints are quicker than the endpoints that return information about "real" data, but I'd be curious how long the /plot endpoints take to build plots, compared to each other and to the /json endpoints. Plotting, particularly with ggplot, isn't the fastest operation. So, as I mention in one of my comments, I'd consider wrapping the plot endpoint functions in promises::future_promise(), so that the server doesn't get overloaded with plot requests. Obviously, take a look at the data first to see whether this is actually an issue. Judging by the response times from tests/curl_test_endpoints.sh, responses take about 2 to 3 seconds for the tpm/gene-disease-gtex/plot endpoint, so I would consider starting there.
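One way to compare the /plot and /json timings is to average the time_total values from the curl test output per endpoint path. A minimal sketch with awk, assuming the log format shown in this PR's test output (a "GET <url>" line followed later by a "time_total: <seconds> seconds" line); summarize_times is a hypothetical helper, not part of the repository:

```shell
# Hypothetical helper: read curl test output on stdin and print
# "<endpoint path> <request count> <mean seconds>" for each endpoint.
summarize_times() {
  awk '
    $1 == "GET" {
      path = $2
      sub("^https?://[^/]+", "", path)  # drop scheme and host:port
      sub("[?].*$", "", path)           # drop the query string
    }
    $1 == "time_total:" {
      n[path] += 1
      total[path] += $2
    }
    END {
      for (p in n) printf "%s %d %.3f\n", p, n[p], total[p] / n[p]
    }
  ' | sort
}
```

For example, ./tests/curl_test_endpoints.sh | summarize_times would print one line per endpoint. Note that time_total is the whole request as seen by curl, so it includes transfer time as well as the time R spends building the plot.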


gene_tpm_boxplot_tbl <- get_gene_tpm_boxplot_tbl(gene_tpm_tbl)

res_plot <- get_gene_tpm_boxplot(gene_tpm_boxplot_tbl)

print(res_plot)
}



# Testing endpoints ------------------------------------------------------------

We should probably get rid of these testing endpoints soon, depending on what the purpose of having the testing endpoints is.

If the purpose is to test that the endpoint can run R code, then I'd drop the /echo endpoint and rename the /plot endpoint to something that makes it clear that a plot of random data is being returned, such as /random_plot.

If the purpose is to test whether the endpoint can be reached by the client, perhaps something like /status (or /session_info, which returns information about the environment), e.g.:

#* Return information about the R environment of the server
#* 
#* @tag "R system information"
#* @get /session_info
function() {
    sessionInfo()
}

I think this approach has the added benefit of giving the user useful information that

  1. may otherwise be hard to find
  2. may change over time.

Contributor Author

I agree. We will discuss with the users of this API to see whether they still need the testing endpoints.

Would exposing R environment sessionInfo() be a security issue? cc @blackdenc

Contributor Author

@chris-s-friedman I have discussed this with @jonkiky, who is one of the users of this API. @jonkiky suggested keeping the /echo endpoint as a health check that needs minimal resources, and removing the other two testing endpoints, i.e. /plot and /sum.

Implemented in acb0b6f.

src/plumber.R
@@ -80,7 +96,8 @@ function(ensemblId, efoId) {
#* @get /tpm/gene-disease-gtex/plot

Since producing plots can take a while, I'd consider parallelizing the plot endpoints so that plot requests don't eat up all the processes. See the plumber 1.1.0 release notes and the article those notes point to.

Granted, the plot endpoints will take longer to send responses to users because there's more to send in plots, but it's still something to consider.

Contributor Author

@logstar logstar Sep 12, 2021

Could you submit an issue for this suggestion with more implementation details, or submit a PR for implementing this?

I think this feature would be more properly implemented as a standalone PR, because it is not trivial, and we are planning to merge this PR in the next few days. I think implementing this feature would take at least the following steps:

  • Read the documentation of the promises package and the plumber parallel execution model completely. I am confused by the plumber documentation, which claims that requests are handled sequentially at https://www.rplumber.io/articles/execution-model.html#performance-request-processing. Could you specify the plumber parallel execution model in the upcoming issue/PR?
  • Check the source code if the documentation is unclear.
  • Check whether the R functions can actually be wrapped with promises::future_promise and still return the correct results.
  • Implement this feature.
  • Test this feature. Check whether requests are actually executed concurrently.
  • Discuss with the DevOps team how this would affect the previously designed Amazon Fargate/ECS/etc. scaling procedure.
  • Document this feature:
    • How does it work?
    • Why does it work correctly? How can we tell whether it is appropriate and beneficial to wrap a function with promises::future_promise?
    • How does it interact with the deployment level scaling?
    • Other relevant specifications.
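The "check whether requests are actually executed concurrently" step could be approached by timing a batch of simultaneous requests: if N requests that each take about T seconds finish in roughly T rather than N times T, they overlapped. A minimal sketch; time_concurrent is a hypothetical helper, not part of the repository, and any command can stand in for the request:

```shell
# Hypothetical helper: run N copies of a command at the same time and
# print the elapsed wall-clock time in whole seconds.
# Usage: time_concurrent N command [args...]
time_concurrent() {
  n="$1"
  shift
  start="$(date +%s)"
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" >/dev/null 2>&1 &   # start each copy in the background
    i=$((i + 1))
  done
  wait                       # block until every background job exits
  end="$(date +%s)"
  echo "$((end - start))"
}
```

Against a local server this might be run as time_concurrent 4 curl --silent 'http://localhost:8082/tpm/gene-all-cancer/plot?ensemblId=ENSG00000213420' and compared with the time of a single request. An elapsed time several times larger than one request would suggest sequential handling. The resolution is whole seconds, so this is only a coarse check and not a substitute for a proper load test.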

Fair enough - yeah, this may be too big for this PR. I'm wondering if @blackdenc may be able to speak to any load balancing or anything else on the infra side to handle concurrent requests ...

Contributor Author

logstar commented Sep 12, 2021

@chris-s-friedman Thank you for the review.

@jharenza asked me to take a look at this implementation of R for handling an API, so I'm only looking at src/plumber.R.

Could you also review all other files and test-run the data building process and the plumber server? src/plumber.R only defines interfaces, and the interfaces rely completely on all other files to work properly.

I'd take a look at response times and how long it takes R to formulate responses. Obviously, the test endpoints are quicker than the endpoints that return information about "real" data, but I'd be curious how long the /plot endpoints take to build plots, compared to each other and to the /json endpoints. Plotting, particularly with ggplot, isn't the fastest operation. So, as I mention in one of my comments, I'd consider wrapping the plot endpoint functions in promises::future_promise(), so that the server doesn't get overloaded with plot requests. Obviously, take a look at the data first to see whether this is actually an issue. Judging by the response times from tests/curl_test_endpoints.sh, responses take about 2 to 3 seconds for the tpm/gene-disease-gtex/plot endpoint, so I would consider starting there.

@blackdenc, @chinwallaa, and I are planning to evaluate the response time with a load test on the dev server after #21 is resolved, because implementing #21 may change the response time and handling procedure significantly. However, we could start relevant discussions, as #21 will probably be implemented by the end of next week.

Could you discuss with @blackdenc how to implement the load test specifically to get the benchmarks you need, as @blackdenc is going to implement the load test?

I will get back to your specific comments in the threads.

Contributor Author

logstar commented Sep 14, 2021

@chris-s-friedman - Thank you for reviewing.

This PR will be merged soon. Ongoing discussions will be followed up in #25, #26 and #27.

@logstar logstar merged commit b265d7b into main Sep 14, 2021
@logstar logstar deleted the logstar/gene-all-cancer branch September 14, 2021 18:12
@logstar logstar requested review from kelseykeith and removed request for komalsrathi August 30, 2022 13:21
@kelseykeith

Code and Results Review

Everything looks good! No errors that I can find

  • Code is pulling from the correct version of the data.
  • Plot changes (title, x-axis label angle, "Dataset" in x-axis labels) are correct for all test calls
  • Only datasets with 3 or more samples return a result. Ganglioglioma (EFO_0003094) has only one sample and failed as expected, while Choroid plexus carcinoma (MONDO_0016718) and Germinoma (EFO_0000514) with 4 samples did return results.
  • Results from the API call match data in OpenPedCan-analysis tables.
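The sample-count rule checked above (groups with fewer than 3 samples return no result) can be illustrated with a small filter. This is a sketch in awk, not the API's R implementation; the tab-separated two-column input (group, sample ID, one row per sample) and the name groups_with_min_samples are assumptions for illustration:

```shell
# Hypothetical helper: read "group<TAB>sample_id" rows on stdin and print
# the groups that have at least MIN rows (default 3), with their counts.
# Usage: groups_with_min_samples [MIN] < samples.tsv
groups_with_min_samples() {
  awk -F '\t' -v min="${1:-3}" '
    { count[$1] += 1 }
    END {
      for (g in count)
        if (count[g] >= min) printf "%s\t%d\n", g, count[g]
    }
  ' | sort
}
```

A group with a single row, like the one-sample Ganglioglioma case above, is dropped, while a group with 4 rows is kept.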
