
Jeremy Echols edited this page Nov 7, 2018 · 6 revisions

RAIS Performance

Production Monitoring

As of late 2018, we have about a million JP2 images, which take up roughly 4 terabytes and are mounted from external network storage. Our back-end is very modest:

  • We have a single VM with 6 gigs of RAM
  • Our server runs Solr, MariaDB, and Open ONI (a Django application) in addition to RAIS
  • We set up Apache to cache thumbnails to disk in order to reduce load when search results are displayed
  • All non-thumbnail images are served from RAIS
  • We configured our instance of RAIS to cache up to 1000 tiles in case many users converge on a single newspaper
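For reference, the tile cache is a single setting in the RAIS config file. A sketch of the relevant fragment — the key name is assumed from RAIS's example config, so check the example config in the RAIS repository for the authoritative spelling:

```toml
# Hypothetical fragment — key name assumed from RAIS's example config.
# Maximum number of tiles RAIS may hold in its in-memory cache.
TileCacheLen = 1000
```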

On this minimal hardware, RAIS uses roughly 600 megs of RAM, including the tile cache. Prior to adding tile-level caching, RAM usage was generally in the 50-100 meg range, with brief spikes up to 400 megs during peak usage. Despite the hefty cost of JP2 decoding and the volume of incoming requests, RAIS uses roughly 2-4 CPU hours per day, which is roughly on par with the Django stack's CPU usage.

Actual performance monitoring data on a moderate-traffic 24-hour span:

[Figure: performance monitoring statistics for RAIS]

The P99 latency graph tells most of the story: during the entire span, 99% of requests were handled in under one second, and 95% in under 500ms. This is despite some heavy spidering around 5am, which caused a handful of thumbnail requests (about 10) to take 4-6 seconds each.

Load testing: Tiles

Load testing was done on October 22, 2018

Setup:

  • We ran RAIS in a docker container behind an nginx docker container (basically, the demo setup we use in the RAIS repository)
  • We had the stack running on a "t3.medium" EC2 instance (4 gigs of RAM, 2 vCPUs)
  • RAIS had all caching disabled
  • We ran siege from a system in the UO network (i.e., not a local process in the same place RAIS was running)
  • The EC2 server had six JP2s ranging from 400 to 800 megapixels, and all were RGB images
  • The URL list contained 1500 tile requests: 250 for each image
  • No two tiles in the list were the same
  • Each tile was either at maximum zoom or one level above maximum
  • The URL list (with the server name scrubbed since it's a dynamic EC2 URL): urls.txt
  • RAIS never exceeded 700 megs of RAM during the ten-minute load test
    • RAIS doesn't spawn external processes, so this RAM usage includes all image processing
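The actual URL list isn't reproduced here, but a list in the same spirit can be generated with a short script. This is only a sketch: the hostname and image IDs are placeholders (the real list is urls.txt with the EC2 hostname scrubbed), and the 1024-pixel tile grid geometry is an assumption.

```shell
# Sketch: generate a urls.txt of unique IIIF tile requests.
# Hostname and image IDs below are placeholders, not the real test data.
host="http://example.com"
for img in img1 img2 img3; do
  for i in $(seq 0 249); do
    # Walk a 25-wide grid of non-overlapping 1024x1024 regions,
    # so no two requested tiles are the same.
    x=$(( (i % 25) * 1024 ))
    y=$(( (i / 25) * 1024 ))
    echo "$host/iiif/$img/$x,$y,1024,1024/full/0/default.jpg"
  done
done > urls.txt
```

Each line follows the IIIF Image API pattern `{id}/{region}/{size}/{rotation}/{quality}.{format}`, which is what siege's `-f` flag expects: one URL per line.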

Results:

$ siege -c 8 -i -t 10m -b -f urls.txt

** SIEGE 4.0.4
** Preparing 8 concurrent users for battle.
The server is now under siege...
Lifting the server siege...
Transactions:                   5989 hits
Availability:                 100.00 %
Elapsed time:                 599.94 secs
Data transferred:            1080.11 MB
Response time:                  0.80 secs
Transaction rate:               9.98 trans/sec
Throughput:                     1.80 MB/sec
Concurrency:                    7.99
Successful transactions:        5989
Failed transactions:               0
Longest transaction:            1.62
Shortest transaction:           0.18


$ siege -c 8 -i -t 10m -b -f urls.txt
** SIEGE 4.0.4
** Preparing 8 concurrent users for battle.
The server is now under siege...
Lifting the server siege...
Transactions:                   6173 hits
Availability:                 100.00 %
Elapsed time:                 599.03 secs
Data transferred:            1108.83 MB
Response time:                  0.78 secs
Transaction rate:              10.30 trans/sec
Throughput:                     1.85 MB/sec
Concurrency:                    8.00
Successful transactions:        6173
Failed transactions:               0
Longest transaction:            1.59
Shortest transaction:           0.18
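As a sanity check, siege's summary figures from the first run fall straight out of its raw counters — transaction rate is hits over elapsed time, and mean tile size is data transferred over hits:

```shell
# Derive siege's summary figures from its raw counters (first tile run).
awk 'BEGIN {
  hits = 5989; secs = 599.94; mb = 1080.11
  printf "rate: %.2f trans/sec\n", hits / secs       # matches siege: 9.98
  printf "mean tile: %.0f KB\n", mb * 1024 / hits    # ~185 KB per tile
}'
```

Note that 9.98 trans/sec at ~185 KB per tile also lines up with the reported 1.80 MB/sec throughput figure.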

Load testing: Resize

Resizing is typically slower than loading tiles, hence the separate load test.

Load testing was done on November 7, 2018

Setup was largely the same as the tile load test, other than the URLs used:

  • We ran RAIS in a docker container behind an nginx docker container (basically, the demo setup we use in the RAIS repository)
  • We had the stack running on a "t3.medium" EC2 instance (4 gigs of RAM, 2 vCPUs)
  • RAIS had all caching disabled
  • We ran siege from a system in the UO network (i.e., not a local process in the same place RAIS was running)
  • The EC2 server had six JP2s ranging from 400 to 800 megapixels, and all were RGB images
  • The URL list contained 120 resize requests: 20 per image
    • Each URL was for the full image, resized in the range of 390x390 to 409x409 in order to mimic semi-random resize requests
  • The URL list (with the server name scrubbed since it's a dynamic EC2 URL): urls-resize.txt
  • RAIS never exceeded 400 megs of RAM during the ten-minute load test
    • RAIS doesn't spawn external processes, so this RAM usage includes all image processing
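As with the tile test, a list in the same spirit as urls-resize.txt can be sketched with a short script — full-region requests whose target size is jittered across the 390x390 to 409x409 range. Hostname and image IDs are again placeholders:

```shell
# Sketch: generate urls-resize.txt with semi-random full-image resizes.
# Hostname and image IDs below are placeholders, not the real test data.
host="http://example.com"
for img in img1 img2 img3 img4 img5 img6; do
  for i in $(seq 0 19); do
    # 20 sizes per image, stepping through 390..409 so the requests
    # can't be trivially cached as a single size.
    s=$(( 390 + i ))
    echo "$host/iiif/$img/full/!$s,$s/0/default.jpg"
  done
done > urls-resize.txt
```

The `!w,h` size syntax asks for a best-fit scale within the given bounds while preserving aspect ratio, which is why near-identical sizes still force a fresh decode-and-scale on every request.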

Results:

$ siege -c 8 -i -t 10m -b -f ./urls-resize.txt
** SIEGE 4.0.4
** Preparing 8 concurrent users for battle.
The server is now under siege...
Lifting the server siege...
Transactions:                   4199 hits
Availability:                 100.00 %
Elapsed time:                 599.93 secs
Data transferred:             106.59 MB
Response time:                  1.14 secs
Transaction rate:               7.00 trans/sec
Throughput:                     0.18 MB/sec
Concurrency:                    7.99
Successful transactions:        4199
Failed transactions:               0
Longest transaction:            3.32
Shortest transaction:           0.41


$ siege -c 8 -i -t 10m -b -f ./urls-resize.txt
** SIEGE 4.0.4
** Preparing 8 concurrent users for battle.
The server is now under siege...
Lifting the server siege...
Transactions:                   4213 hits
Availability:                 100.00 %
Elapsed time:                 599.35 secs
Data transferred:             106.74 MB
Response time:                  1.14 secs
Transaction rate:               7.03 trans/sec
Throughput:                     0.18 MB/sec
Concurrency:                    7.99
Successful transactions:        4213
Failed transactions:               0
Longest transaction:            2.08
Shortest transaction:           0.40
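One detail worth calling out: the resize responses are far smaller than the tiles, yet take longer to serve, which confirms the cost is JP2 decoding rather than network transfer. Mean payload sizes from the two tests:

```shell
# Mean response sizes: data transferred / transactions, for each test.
awk 'BEGIN {
  printf "mean resize payload: %.0f KB\n", 106.59 * 1024 / 4199   # ~26 KB
  printf "mean tile payload:   %.0f KB\n", 1080.11 * 1024 / 5989  # ~185 KB
}'
```

So resize requests return roughly a seventh of the data but take about 40% longer on average (1.14s vs 0.80s mean response time).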

Using ab

Same setup as the resize testing, except that a single URL is used, so the results are less representative of real-world traffic. The image chosen is 15036x30000 (about 451 megapixels).

No concurrency

$ ab -n 300 'http://XYZZY.us-west-2.compute.amazonaws.com/iiif/s3%3Atestjp2s%2F0-Almeida_J%25C3%25BAnior_.png.jp2/full/!400,400/0/default.jpg'
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking XYZZY.us-west-2.compute.amazonaws.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Finished 300 requests


Server Software:        nginx/1.15.5
Server Hostname:        XYZZY.us-west-2.compute.amazonaws.com
Server Port:            80

Document Path:          /iiif/s3%3Atestjp2s%2F0-Almeida_J%25C3%25BAnior_.png.jp2/full/!400,400/0/default.jpg
Document Length:        12152 bytes

Concurrency Level:      1
Time taken for tests:   56.665 seconds
Complete requests:      300
Failed requests:        0
Total transferred:      3705900 bytes
HTML transferred:       3645600 bytes
Requests per second:    5.29 [#/sec] (mean)
Time per request:       188.883 [ms] (mean)
Time per request:       188.883 [ms] (mean, across all concurrent requests)
Transfer rate:          63.87 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       22   23   1.2     23      33
Processing:   153  165   8.0    163     240
Waiting:      153  165   8.0    163     240
Total:        176  189   8.1    187     265

Percentage of the requests served within a certain time (ms)
  50%    187
  66%    191
  75%    193
  80%    194
  90%    198
  95%    200
  98%    204
  99%    216
 100%    265 (longest request)

8 concurrent requests

To get meaningful numbers with concurrency enabled, we fired off 5000 requests against RAIS; 300 would complete too quickly to produce accurate results.

(This ends up being very similar to the siege approach, except that, again, it uses a single image and a single URL.)

$ ab -n 5000 -c 8 'http://XYZZY.us-west-2.compute.amazonaws.com/iiif/s3%3Atestjp2s%2F0-Almeida_J%25C3%25BAnior_.png.jp2/full/!400,400/0/default.jpg'
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking XYZZY.us-west-2.compute.amazonaws.com (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests


Server Software:        nginx/1.15.5
Server Hostname:        XYZZY.us-west-2.compute.amazonaws.com
Server Port:            80

Document Path:          /iiif/s3%3Atestjp2s%2F0-Almeida_J%25C3%25BAnior_.png.jp2/full/!400,400/0/default.jpg
Document Length:        12152 bytes

Concurrency Level:      8
Time taken for tests:   460.738 seconds
Complete requests:      5000
Failed requests:        0
Total transferred:      61765000 bytes
HTML transferred:       60760000 bytes
Requests per second:    10.85 [#/sec] (mean)
Time per request:       737.181 [ms] (mean)
Time per request:       92.148 [ms] (mean, across all concurrent requests)
Transfer rate:          130.91 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       21   23   1.3     23      45
Processing:   324  713  80.0    717    1187
Waiting:      323  709  80.0    711    1187
Total:        347  737  80.0    740    1211

Percentage of the requests served within a certain time (ms)
  50%    740
  66%    762
  75%    777
  80%    785
  90%    812
  95%    840
  98%    941
  99%   1023
 100%   1211 (longest request)
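Comparing the two ab runs also gives a rough sense of CPU scaling: going from 1 to 8 concurrent clients roughly doubled throughput (5.29 to 10.85 requests per second), which is about what you'd expect on a 2-vCPU instance doing CPU-bound JP2 decoding. A quick check of the ratio:

```shell
# Throughput speedup from concurrency 1 to concurrency 8 (2 vCPUs).
awk 'BEGIN { printf "speedup: %.2fx\n", 10.85 / 5.29 }'   # ~2.05x
```

Past the point where both cores are saturated, extra concurrency mostly adds queueing: mean time per request grows from ~189ms to ~737ms while aggregate throughput stays pinned near 2x the single-client rate.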