Alternative Lwt adapter #54
Conversation
The performance of the Lwt adapter is pretty odd, with apparently higher throughput than Async, but also higher latency under significant load. I suspect that's due to some kind of ordering issue. I started to analyze it with some heavy instrumentation, but eventually ran out of time and stopped.
It's also possible that the Lwt adapter just needs a few more point optimizations. For example, see inhabitedtype/faraday#41 for the effect that using Lwt_unix.writev has.
And finally, the Httpaf+Lwt benchmark lives in […]
After pinning and installing everything, I'm getting type errors in […]
Can you tweak this line to use […]
@aantron Didn't notice you might already be working on a benchmark for my version. I didn't have […]. Otherwise, the main differences I see are: […]
@bluddy turns out I had locally amended the last commit from ocsigen/lwt#586. It's (force) pushed to GitHub now. The benchmark should work with it.
@hannesm, @paurkedal, it turns out that the benchmark in this PR is immediately compatible with #53, because the adapters have identical APIs :) In my testing, the adapter in this PR consistently beats #53 by about 5-10% in throughput with this benchmark, and has somewhat better latency under heavy load. Given how close they are, I wouldn't say I'm confident in the difference. I'm also too rusty on these adapters to immediately offer an explanation for why this could be so. Typical results follow. This PR: […]
#53: […]
The wrk command was […]
These results are not comparable with the earlier results from the first post in this PR. I made those other measurements about 5 weeks ago, and have had an OS upgrade and other changes since. Also, I posted an incorrect wrk command line earlier. I made the original measurements with the command line in this post (with 30K requests/sec, not 1K). Moving on to […]
@aantron, are these results using the libev backend?
I found an issue with the echo server in the current version of this PR: […] gives the wrong checksum on the first POST and hangs on subsequent connections.
| `Ok _ ->
  Buffer.get read_buffer ~f:(fun bigstring ~off ~len ->
    Server_connection.read connection bigstring ~off ~len)
  |> ignore;
As @seliopou pointed out to me, Server_connection.read may not consume the full indicated span of the buffer. Same for the client reader.
Ah, I see the ignore applies to the Buffer.get result, which already handles the consumed count.
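(For readers following along, a minimal sketch of a Buffer.get with that behavior, assuming Bigstringaf buffers; the field names are illustrative, since the real module in this PR is a calque of the Async adapter's:)

type t =
  { buffer : Bigstringaf.t
  ; mutable off : int
  ; mutable len : int
  }

let get t ~f =
  (* The callback reports how many bytes it actually consumed; advance
     past only that much, so a partial Server_connection.read leaves the
     remainder for the next call. *)
  let consumed = f t.buffer ~off:t.off ~len:t.len in
  t.off <- t.off + consumed;
  t.len <- t.len - consumed;
  if t.len = 0 then t.off <- 0;
  consumed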
It would be very useful to have more specifics about the setup. Which OCaml version? Is Flambda used? Is the libev backend being used? I tried to reproduce this on my home machine, which is WSL on Windows, and I'm getting very low latency but also extremely low throughput for both the async and lwt versions on 4.06.1. Adding libev didn't seem to help. This could very well be a problem with WSL's performance, but running on a Linux machine, I get similar results. Additionally, on the Linux machine with libev, I get […]
My current theory is that at full capacity, the CPU is bombarded with read requests that dominate the write requests, without consideration for fairness. This causes all incoming packets to be processed before there is time to deal with the outgoing packets, causing excessive latency. This possibility is quite evident in the […]
Using libev. No Flambda, but that's not the issue. You may be running into the system's maximum fd limit. Try ulimit -n […]
@seliopou Using […]
I've been able to replicate the latency issue. From my investigation so far, it appears that in lwt_unix, […] In any case, it's something that should be fixed in lwt, so I think this PR can be integrated once @aantron is happy with it.
By inserting one yield () […]
Still not quite as good as Async, but getting better. It looks like a few packets wait until the very end to send, for some reason.
OK, I redid the test with the single yield () modification above. For the record, it involves turning line 1608 in […]
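(A hypothetical sketch of the shape of that change; this is not the actual lwt_unix code, and handle_chunk is made up. After each completed read, the loop yields, putting itself at the back of Lwt's FIFO run queue so that pending writers get scheduled before the next read:)

open Lwt.Infix

let rec read_loop fd buffer handle_chunk =
  Lwt_unix.read fd buffer 0 (Bytes.length buffer) >>= fun n ->
  if n = 0 then
    Lwt.return_unit (* EOF: stop reading *)
  else
    handle_chunk (Bytes.sub buffer 0 n) >>= fun () ->
    (* The single added yield. *)
    Lwt_main.yield () >>= fun () ->
    read_loop fd buffer handle_chunk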
It turns out that I forgot I was using 4 threads on the same machine for another, fairly intensive purpose. Redoing the test this morning, I got these pretty cool statistics: […]
So we get improved throughput and great latency.
Just to flesh out the picture with regard to async: I guess my machine is pretty fast, and async gets around the same throughput as the last lwt statistics but somewhat higher latency. It wasn't consistent enough to post a report -- different iterations give wildly different latency figures (unlike for lwt, which consistently gives the results listed) -- but the average is around 250ms. My guess is that async's scheduler draws from ready threads at random, while yield () on lwt schedules threads on a FIFO basis, which helps greatly with latency.
Thanks for finding that, @bluddy. I am getting similar results on my machine. Using Async as the reference point, Lwt+yield gets almost 50% less latency (~110ms vs. 200ms median, 380ms vs. 720ms worst), but the throughput is lower (in one test, 30 MB/sec vs. 38 MB/sec). So, it looks like there is some kind of envelope within which we can optimize […] Going to look at the issue found by @paurkedal next.
The issue with the echo server involves dropping most of the data, so in case this also affects the wrk2 benchmark, I would take conclusions about which optimisations work with a grain of salt until it's fixed.
@paurkedal The echo server dropping data was actually deliberate. I was debugging the client and had forgotten. This only affected the echo server, and not the benchmarks. I uploaded a commit that has the echo server restored to normal functioning. The issue in the client is summarized in this comment:

(* Due to a possible bug in http/af, read from the body only once, and
create the response based on the data in that first read.
The bug is (possibly) in the client. Client_connection seems to go into
a read loop despite Client_connection.shutdown being called, due to the
reader being in the Partial state, and the next operation function
unconditionally returning `Read in that case.
One workaround for this is to have the server send a Content-Length
header. To do that, this code has the server simply reply after the
first chunk is read, and use that chunk's length.
The code I would expect to work, without the possible bug, is commented
out below. *)

@seliopou I'm pretty rusty on the internals of the client at this point. As I recall, the symptom was that the httpaf client would not exit after streaming chunked data (curl is fine). Is the above description helpful? If not, I can look into it again and post an issue.
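(To make the workaround concrete, here is a minimal sketch of a handler along those lines, written against the httpaf API as I understand it; treat the exact names as assumptions rather than code from this PR. The server responds after the first chunk is read, using that chunk's length as the Content-Length:)

open Httpaf

let request_handler reqd =
  let request_body = Reqd.request_body reqd in
  Body.schedule_read request_body
    ~on_eof:(fun () ->
      (* Empty body: reply immediately. *)
      Reqd.respond_with_string reqd (Response.create `OK) "")
    ~on_read:(fun bigstring ~off ~len ->
      (* Reply after the first chunk, using its length as the
         Content-Length, so the client is not left in a chunked read
         loop. *)
      let chunk = Bigstringaf.substring bigstring ~off ~len in
      let headers =
        Headers.of_list
          [ "content-length", string_of_int (String.length chunk) ]
      in
      Reqd.respond_with_string reqd (Response.create ~headers `OK) chunk)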
I've updated how the adapter handles exceptions and socket shutdown, based on a bug report from @anmonteiro, and with very helpful assistance from @bluddy :) This also shouldn't affect the benchmarks.
Leaving my benchmark results here as requested by @aantron on the Reason Discord. They were run on my late-2016 MacBook Pro with a slight change to the […]
@anmonteiro I assume you're getting this latency even with the fix mentioned in the thread?
@bluddy are you referring to this? […]
If yes, then yeah, I ran my benchmarks after that fix.
Sorry folks, been busy with work. Is this the one I should be looking to merge, or is #53 still in the running?
They're very similar, but I believe this one has been tested and discussed more, so let's just merge this one. I think we can all learn from this experience to create a work-in-progress PR as soon as work begins, so that multiple people don't spend precious time re-implementing the same thing. The convention I've seen is to put "[WIP]" as the prefix of the PR name, and once work is complete, to edit the PR name to have the "[RDY]" prefix.
@seliopou I think the main difference between the two at this point is whether or not the buffering code is included.
Ping on this... it would be nice to make httpaf the default community choice for webdev, and that requires lwt support.
2 months later, ping @seliopou
Any update on this?
@seliopou if you're having a hard time maintaining this project, I'm sure the folks at ocaml-community would be happy to help if you migrate it. This is too important to just let it linger.
I'm working through my backlog and will be reviewing this shortly. In preparation for that, it would be good to rebase this past the switch to OPAM 2 in #70, as well as to delete any TODOs left in the code that are no longer valid.
@seliopou I'll do this shortly (might be a day or more).
OK, I rebased over […]
httpaf-lwt.opam
Outdated
]
depends: [
  "ocaml" {>= "4.03.0"}
  "angstrom-lwt-unix"
is angstrom-lwt-unix actually being used? I don't see it being referenced in the jbuild file.
Good catch, removed the dep.
"httpaf" | ||
"jbuilder" {build & >= "1.0+beta10"} | ||
"lwt" | ||
] |
This will fail the opam 2 lint check, since they now expect at least one of the synopsis or description fields to be set:

error 57: Synopsis and description must not be both empty
I have opam lint passing, but I'll add a synopsis.
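(For reference, a one-line synopsis along these lines satisfies the lint check; the exact wording here is made up, not necessarily what was committed:)

synopsis: "Lwt support for http/af"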
I started seeing the error after updating to opam 2.0.1 today.
examples/lwt/lwt_echo_server.ml
Outdated
in

(* Due to a possible bug in http/af, read from the body only once, and
   create the response based on the data in that first read.
I think this was fixed in #71.
That comment was removed earlier; you may be reviewing an older version of this PR.
Yeah I see that now... hmm not sure what's going on.
I think I was somehow served a stale version of the PR diff... everything seems to be current now. Sorry about that.
Thanks, sorry it took so long! I'm gonna do a release with everything before this PR was merged, then let this sit for a week and do another release, in case anybody finds some last-minute issues.
Thanks!
Upstream issue inhabitedtype#18 details a case where the http/af parser is unable to parse a message if the read buffer size isn't as big as all the bytes that form the message headers. While it is desirable, and a design goal of the library, to be efficient in parsing and thus avoid DoS attacks, it is more sensible to allow each message header to fit within the read buffer size up to its limit. This diff makes that change, which better accommodates real world use cases for the default read buffer size.
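(For illustration, raising the limit looks something like this, using Httpaf.Config; whether and how the Lwt adapter threads a config value through is an assumption here:)

(* Allow request headers of up to 64 KB to be parsed, instead of
   relying on the default read buffer size. *)
let config =
  { Httpaf.Config.default with read_buffer_size = 0x10000 }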
For comparison with #53. This one is a pretty direct calque of the Async adapter, including the Buffer module. In the hypothetical case this is headed for being merged, I'd still like to address some of the remaining TODOs (all of which are minor), and give a very thorough review to how errors are propagated (since error handling is built into _ Lwt.t). The examples need ocsigen/lwt#586.
In terms of performance, Httpaf+Lwt is significantly faster than Cohttp+Lwt, but slower than Httpaf+Async. I'm not sure which further changes would have to be made to Lwt and/or Httpaf to make Httpaf+Lwt faster. Having Faraday use Lwt_unix.writev gave a big performance boost to Httpaf+Lwt, so I'll post that change in a PR to Faraday shortly (a sketch of what that looks like follows the histograms below). Here are some histograms.
Httpaf+Lwt, with Lwt_unix.writev: […]
Cohttp+Lwt: […]
Httpaf+Async: […]
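(As mentioned above, a sketch of the writev-based flush, based on my reading of Faraday's operation/shift API; treat the details as approximate rather than the exact code in the Faraday PR. All pending iovecs are gathered into a single Lwt_unix.writev call instead of issuing one write per buffer:)

open Lwt.Infix

let rec flush serializer fd =
  match Faraday.operation serializer with
  | `Close ->
    (* The serializer is closed; nothing more to write. *)
    Lwt.return_unit
  | `Yield ->
    (* No output pending right now. A real adapter would wait to be
       woken when more output becomes available. *)
    Lwt.return_unit
  | `Writev iovecs ->
    let io_vectors = Lwt_unix.IO_vectors.create () in
    List.iter
      (fun { Faraday.buffer; off; len } ->
        Lwt_unix.IO_vectors.append_bigarray io_vectors buffer off len)
      iovecs;
    (* One gathering syscall for all pending buffers. *)
    Lwt_unix.writev fd io_vectors >>= fun n ->
    Faraday.shift serializer n;
    flush serializer fd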