async S3 HTTP with botornado #14

Closed
wants to merge 2 commits into from

Conversation

@pda commented Nov 18, 2014

Summary: thumbor runs on tornado, which is async, but thumbor_aws uses boto for its HTTP requests to S3, which are blocking, causing awful performance. This patch uses botornado, a partial async port of boto to tornado, to make the S3 requests non-blocking.

Test material at https://gist.github.com/pda/5dcf54410b696072fcbd


In production we have an ELB load balancer in front of two EC2 instances, each running thumbor + thumbor_aws inside a docker container. We had reports of blank images, and then hints that they were HTTP 503 errors, but it wasn't reproducible. Our logs show that the load balancer frequently marked the instances as unhealthy, meaning they had failed to respond to health checks. The thumbor processes had been up and running for many days, and had no suspicious log entries.

I suspected that something was blocking inside the thumbor event loop, and narrowed it down to boto in thumbor_aws.

Using a controllably slow backend behind thumbor_aws, I was able to demonstrate the single-concurrency of boto by showing slow requests blocking fast requests:

$ ./race.rb http://localhost:8888/unsafe/sleep/{10,1000,100,10000,500}

dispatching http://localhost:8888/unsafe/sleep/10
dispatching http://localhost:8888/unsafe/sleep/1000
dispatching http://localhost:8888/unsafe/sleep/100
dispatching http://localhost:8888/unsafe/sleep/10000
dispatching http://localhost:8888/unsafe/sleep/500
78 ms for http://localhost:8888/unsafe/sleep/10
2030 ms for http://localhost:8888/unsafe/sleep/1000
2168 ms for http://localhost:8888/unsafe/sleep/100
22108 ms for http://localhost:8888/unsafe/sleep/10000
23051 ms for http://localhost:8888/unsafe/sleep/500

(The doubled times in the benchmarks are due to the test server sleeping n milliseconds for both the HEAD and the GET request. Perhaps that HEAD request should be eliminated from thumbor_aws, but that's a separate issue.)

Note that the responses are serialized into the order they arrived, and that the times are roughly cumulative: the final request, which should have taken about 1,000 ms (500 ms each for the HEAD and the GET), took 23,000 ms.
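To illustrate the mechanism (this sketch is not from the patch, and uses the stdlib asyncio loop rather than the tornado IOLoop, purely for illustration): a blocking call holds the event loop so concurrent requests serialize and total time approaches the sum of the delays, while an awaited call yields back to the loop so total time approaches the longest delay.

```python
import asyncio
import time

async def blocking_fetch(delay_ms):
    # time.sleep() holds the event loop for the full delay,
    # just as boto's synchronous HTTP calls do inside tornado.
    time.sleep(delay_ms / 1000)

async def async_fetch(delay_ms):
    # await yields control back to the event loop while waiting,
    # which is what botornado's async HTTP calls allow.
    await asyncio.sleep(delay_ms / 1000)

async def race(fetch, delays_ms):
    # Dispatch all "requests" concurrently and time the whole batch.
    start = time.monotonic()
    await asyncio.gather(*(fetch(d) for d in delays_ms))
    return (time.monotonic() - start) * 1000

delays = [10, 100, 50]
blocking_ms = asyncio.run(race(blocking_fetch, delays))
async_ms = asyncio.run(race(async_fetch, delays))
print(f"blocking: {blocking_ms:.0f} ms (roughly the sum of the delays)")
print(f"async:    {async_ms:.0f} ms (roughly the longest delay)")
```

With the blocking handler the batch takes at least 160 ms (10 + 100 + 50); with the async handler it takes a little over 100 ms, the longest single delay.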

Switching s3_loader.py from boto to botornado fixes that:

$ ./race.rb http://localhost:8888/unsafe/sleep/{10,1000,100,10000,500}

dispatching http://localhost:8888/unsafe/sleep/10
dispatching http://localhost:8888/unsafe/sleep/1000
dispatching http://localhost:8888/unsafe/sleep/100
dispatching http://localhost:8888/unsafe/sleep/10000
dispatching http://localhost:8888/unsafe/sleep/500
89 ms for http://localhost:8888/unsafe/sleep/10
228 ms for http://localhost:8888/unsafe/sleep/100
1024 ms for http://localhost:8888/unsafe/sleep/500
2028 ms for http://localhost:8888/unsafe/sleep/1000
20024 ms for http://localhost:8888/unsafe/sleep/10000

This fix uses botornado to do async HTTP requests to S3 in a way that fits the tornado event loop model, without blocking other requests.

I've only applied it to the loaders module, as that's the only component that we're using. It should probably also be applied to the storages and result_storages modules. Perhaps we can ship this fix first, though?

@willtrking (Owner) commented
Thanks for the patch! My only concern with merging this is backwards compatibility of botornado with boto. Aside from installing botornado, is this a drop-in replacement for the existing plugin, or are there additional installation steps?

A good solution may be to have a config option allowing people to switch between the two.
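(For reference: thumbor selects its loader via the LOADER setting in thumbor.conf, which is itself a Python file, so a switch could simply be two separate loader modules. The async_s3_loader module name below is hypothetical, not something from this PR.)

```python
# thumbor.conf (hypothetical sketch; module names are assumptions)

# Blocking boto-based loader (the existing behaviour):
# LOADER = 'thumbor_aws.loaders.s3_loader'

# Async botornado-based loader (this patch):
LOADER = 'thumbor_aws.loaders.async_s3_loader'
```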

@pda (Author) commented Nov 18, 2014

I don't write much python, so I've barely used boto, and I'd never heard of botornado until I found it trying to fix this. I couldn't comment on backwards compatibility except that “it seems to work” :)

I'm not sure about the config option: I think using a blocking client like boto in an async environment like tornado is a serious bug. I would suggest tagging the current version as v1.0.0 with a big warning in the README about the blocking behaviour, then integrating botornado and releasing it as v2.0.0. That way people can deliberately upgrade, or stay on the blocking version if it's working for them and they're concerned about incompatibility.

@fluke commented Feb 22, 2015

I think this needs to be implemented. Could you merge this in?

@pda (Author) commented Feb 22, 2015

I discovered that this wasn't loading AWS credentials from the environment in the expected way.
In the end, I wrote https://github.com/99designs/thumbor_botornado which has a very simple botornado loader, plus a loader that can sit in front of it to handle non-S3 URLs separately. 99designs.com has since been using it to serve millions of images.

So: I'm not sure about this PR; it might need more work, but I'm not using it anymore.

@fluke commented Feb 23, 2015

@pda Well, as of now I'm simply using the HTTP loader with URL signing, since I was running into issue #19 with multiple buckets. What are the benefits of using the AWS bucket method?

@pda (Author) commented Feb 23, 2015

@kartikluke If the original images are in a publicly readable S3 bucket, then the HTTP loader is perfect. But if the original images aren't publicly readable, then S3 credentials are required… as far as I know the HTTP loader doesn't know how to do that…?
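(As an aside, not part of this PR: private objects can still be served over plain HTTP via pre-signed URLs. The sketch below implements the 2015-era AWS signature version 2 query-string authentication for S3, the scheme boto 2 used, with the stdlib alone. The bucket, key, and credentials are made up.)

```python
import base64
import hashlib
import hmac
import urllib.parse

def presign_s3_get(bucket, key, access_key_id, secret_key, expires_epoch):
    """Build a pre-signed S3 GET URL using AWS signature version 2
    (query-string authentication). Purely illustrative."""
    # StringToSign for a GET with no Content-MD5 or Content-Type headers:
    string_to_sign = f"GET\n\n\n{expires_epoch}\n/{bucket}/{key}"
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    # urlencode percent-escapes the base64 signature for us.
    params = urllib.parse.urlencode({
        "AWSAccessKeyId": access_key_id,
        "Expires": expires_epoch,
        "Signature": signature,
    })
    return f"https://{bucket}.s3.amazonaws.com/{key}?{params}"

# Made-up credentials, purely illustrative:
url = presign_s3_get("my-bucket", "originals/cat.jpg",
                     "AKIDEXAMPLE", "wJalrXUtnFEMI", 1424649000)
```

Anyone holding such a URL can fetch the private object until the Expires timestamp, so a signing front-end plus the plain HTTP loader is one way around needing S3 credentials inside thumbor itself.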

@fluke commented Feb 23, 2015

Ah, that's correct. Sorry, got confused.


dhardy92 added a commit to dhardy92/thumbor_aws that referenced this pull request Mar 24, 2015
* hashed paths in S3 prevent searching keys by prefix, without real advantage (S3 is not a real FS)
* add vows-based storage tests
* add .gitignore for *.pyc
@pda (Author) commented Apr 14, 2015

Closing this; as I mentioned, I wrote https://github.com/99designs/thumbor_botornado, which we're using. Feel free to use the code from this PR or from that repo, though.

@pda closed this Apr 14, 2015