Measuring performance bottleneck #78

atitan · 2019-10-16T08:32:52Z

We use Scout APM to monitor performance.

It seems Falcon and Puma take different approaches to handling requests.

Falcon has a much higher queue time (the yellow part of the chart, i.e. the time before a request starts being processed) and a low processing time. It looks as if requests are held outside the server, waiting to get in.

Puma has a much higher ActiveRecord time (the green part of the chart) and a low queue time.

Both become slow during the benchmark test and end up with similar response times.

So far we have been able to increase Falcon's throughput by running 8 processes on each 4-CPU machine, up from the original 5.

Is there any way to probe this situation and find the bottleneck in Falcon?

ioquatix · 2019-10-17T04:31:20Z

If you perform high-latency blocking operations in the event loop, you will see this kind of behaviour.

That is because the core of the server's event loop is:

connection = accept connection
connection.each_request do |request|
  response = process(request)
  connection.send_response(response)
end

It's not quite that simple but that's generally how it fits together.

If you block in process(request), the reactor cannot receive new requests (e.g. multiplexed requests à la HTTP/2), nor can it accept more connections.
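To make the effect on queue time concrete, here is a minimal simulation using only the Ruby standard library. It is a sketch, not Falcon code: `SimulatedRequest` and `simulate_serial_reactor` are hypothetical names, and it assumes a single reactor whose handler blocks for a fixed time per request.

```ruby
# Hypothetical simulation, not Falcon code: one reactor that can only handle
# requests one at a time because every handler blocks for `service_time`.
SimulatedRequest = Struct.new(:arrival, :queue_time, :finish)

def simulate_serial_reactor(arrivals, service_time)
  free_at = 0.0 # time at which the reactor can start the next request
  arrivals.sort.map do |arrival|
    start = [arrival, free_at].max # wait while the reactor is still busy
    free_at = start + service_time # reactor is blocked until this finishes
    SimulatedRequest.new(arrival, start - arrival, free_at)
  end
end

# Four requests arrive 10 ms apart; each blocks the reactor for 100 ms.
results = simulate_serial_reactor([0.0, 0.01, 0.02, 0.03], 0.1)
results.each do |request|
  puts format("arrived %.2fs  queued %.2fs  finished %.2fs",
              request.arrival, request.queue_time, request.finish)
end
```

In this model the processing time per request stays constant while the queue time grows with every request that arrives behind a blocked one, which is consistent with the high queue time and low processing time reported for Falcon above.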

You need to identify the blocking operation, probably a database query, and then decide whether async-postgres or async-mysql is mature enough to work in your application.
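One way to start narrowing down which requests block is to time each handler. The following is a minimal sketch of a Rack-style middleware that needs no gems; `SlowRequestLogger`, its `threshold` default and its log format are all hypothetical, and wall-clock time only shows that a handler was slow, not which call inside it blocked.

```ruby
# Hypothetical Rack-style middleware: records how long the wrapped app spends
# on each request and reports the ones at or above a threshold.
class SlowRequestLogger
  def initialize(app, threshold: 0.05, output: $stderr)
    @app = app
    @threshold = threshold # seconds; an arbitrary illustrative default
    @output = output
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @app.call(env) # the method still returns the app's response
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    if elapsed >= @threshold
      @output.puts format("slow request: %s %s took %.3fs",
                          env["REQUEST_METHOD"], env["PATH_INFO"], elapsed)
    end
  end
end

# Stand-in app that blocks for 100 ms, as a slow database query might.
app = SlowRequestLogger.new(->(_env) { sleep 0.1; [200, {}, ["ok"]] })
app.call({ "REQUEST_METHOD" => "GET", "PATH_INFO" => "/reports" })
```

In a config.ru this would be enabled with `use SlowRequestLogger`. Endpoints that show up repeatedly are the first candidates to check for blocking database or network calls.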

If you have blocking operations that you simply can't avoid, you can spin up a thread and use Async::IO::Notification for handling reactor <-> thread synchronisation. I can give you some example code.
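As a rough illustration of the idea, here is a standard-library analogue. It does not use the Async::IO::Notification API: a worker thread runs the blocking call and wakes the loop through a pipe, and `run_blocking_in_thread` is a hypothetical helper. The `IO.select` loop stands in for the reactor.

```ruby
# Stdlib-only analogue of the pattern, NOT the Async::IO::Notification API:
# a worker thread runs the blocking call and signals the loop through a pipe.
def run_blocking_in_thread(&blocking)
  reader, writer = IO.pipe
  worker = Thread.new do
    blocking.call        # the blocking operation that cannot be avoided
  ensure
    writer.write("!")    # signal completion, even if the call raised
    writer.close
  end
  [reader, worker]
end

reader, worker = run_blocking_in_thread { sleep 0.05; :query_result }

ticks = 0
# Stand-in for the event loop: it keeps running until the pipe is readable.
until IO.select([reader], nil, nil, 0.01)
  ticks += 1 # a real reactor would be serving other requests here
end
reader.close

puts "worker returned #{worker.value.inspect} after #{ticks} loop ticks"
```

The loop keeps making progress while the worker is busy. This only helps when the blocking call releases the GVL, as `sleep` and most I/O do. A C extension that holds the GVL would still stall every thread in the process.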

atitan · 2019-10-17T14:32:01Z

Would starting Falcon in hybrid mode, so that threads handle the requests, have the same effect?

Also, I'd like to know how the connection pool fits into this.
Does it really help in hybrid or fork mode if requests are blocking the reactor's event loop?

ioquatix · 2019-10-17T19:07:04Z

That is a good question.

Yes, hybrid mode should give you mostly the same performance characteristics as Puma's cluster mode.

However, ideally you would use non-blocking adapters; otherwise there are still cases where you can experience high latency, e.g. when two connections share the same reactor on the same thread.

Process Model

One parent process spawns N child processes, one reactor per child process.

Thread Model

One parent process spawns N threads, one reactor per thread. The threads contend for the GVL.

Hybrid Model

One parent process spawns N processes, and each process starts M threads, one reactor per thread. There is still GVL contention within each process, but more threads means better handling of blocking operations.
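The hybrid layout can be sketched with plain `fork` and `Thread`. This is an illustration only, not how Falcon is implemented: `spawn_hybrid` is a hypothetical helper, each thread merely stands in for one reactor, and it assumes a platform where `fork` is available.

```ruby
# Hypothetical sketch of the hybrid layout: N forked processes, each running
# M threads. Every thread reports "pid:index" in place of starting a reactor.
def spawn_hybrid(processes:, threads:)
  reader, writer = IO.pipe
  pids = Array.new(processes) do
    fork do
      reader.close
      workers = Array.new(threads) do |index|
        Thread.new { writer.write("#{Process.pid}:#{index}\n") }
      end
      workers.each(&:join)
      writer.close
      exit!(0) # skip at_exit handlers inherited from the parent
    end
  end
  writer.close
  lines = reader.read.lines(chomp: true) # EOF once every child has finished
  reader.close
  pids.each { |pid| Process.wait(pid) }
  lines
end

reactors = spawn_hybrid(processes: 2, threads: 3)
process_count = reactors.map { |line| line.split(":").first }.uniq.size
puts "#{reactors.size} reactors across #{process_count} processes"
```

With N processes and M threads there are N × M reactors in total. A blocking call stalls only the reactor on its own thread, while threads within the same process still share one GVL.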

Let me know if you need further clarifications - happy to discuss.
