Terrible SQS SendMessage performance #1602
Comments
I just did a quick calculation, and with the above reported performance, it seems like there is currently a 25% cost overhead to using the SQS client provided by the AWS .NET SDK; where that cost is for the EC2 CPU-seconds needed by the SQS client to send each million messages, assuming each message is 64kB. |
Hi, just wondering if there’s an update on this? Does AWS recognise there is a problem? If so, does AWS plan on a fix/update? If so, is the fix scheduled? If so, what’s the time horizon? |
Is the SQS .NET client still being maintained by AWS? |
@billpoole-mi You mentioned a message size of 64K. Is that the max size, or is some percentage of messages significantly larger? What I'm wondering is whether your messages are going past 85,000 bytes. The .NET garbage collector automatically puts objects of that size on the large object heap (LOH), and putting a lot of objects on the LOH could cause a lot of extra work for the GC. |
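A minimal sketch of how the threshold mentioned above can be observed, assuming a default runtime configuration (the threshold is configurable). `LohProbe` and `IsFreshAllocationOnLoh` are hypothetical names used only for illustration. It relies on the fact that objects on the large object heap are reported as belonging to the oldest generation straight after allocation.

```csharp
using System;

public static class LohProbe
{
    // Returns true when a freshly allocated byte[] of the given length is
    // reported in the oldest generation, which is how LOH objects show up.
    // Only meaningful for a brand-new allocation: a small array that has
    // survived several collections is also reported in the oldest generation.
    public static bool IsFreshAllocationOnLoh(int arrayLength)
    {
        var buffer = new byte[arrayLength];
        return GC.GetGeneration(buffer) == GC.MaxGeneration;
    }
}
```

Calling `LohProbe.IsFreshAllocationOnLoh` with a length a little below and a little above 85,000 is a quick way to confirm which side of the threshold a given message buffer falls on.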
In our test, we prebuilt a pool of messages that were randomly sized up to 256 kB. We then sent those messages in a loop using SQS. So we had no allocations on our side. The allocations were all in the AWS .NET SQS client. |
Moreover, as stated above, we saw 5 GB of allocations for every 100 MB sent. So even if we were allocating for each new message sent, there were 50x more allocations in the SQS client than what would have been allocated by the sending code. |
I agree the allocations are very concerning. I am still curious what effect the LOH is having. Can you run your test harness with message sizes just a little above 85,000 bytes and then a little below? I'm curious how dramatic the difference will be. Is your test harness shareable? |
I’ll have to go dig the test harness up. The last time I used it was May last year when I first raised this issue. I’ll see if I can find it. Have you tried to reproduce the issue yourself? I.e., what send throughput do you see sending messages? Are you seeing more than 10 MB/s per CPU thread? |
It’s worth noting that although highly frequent LOH allocations are bad for performance, they’re not so bad as to cause a 10 MB/s per CPU thread send rate. Just my 2c, but I think the majority of the problem is unnecessary buffer allocations and copying between buffers. A sender should be able to write into a stream and, at most, the whole message is buffered in memory once, and that buffer should be drawn from a memory pool to prevent LOH compaction overheads. We can write well over 100 MB/s per CPU thread over raw HTTP. I.e., if we do the same test just writing the messages as basic REST posts, we get way over 100 MB/s per CPU thread, even if the messages are all over 85 kB. |
We have noticed this issue has not received attention in 1 year. We will close this issue for now. If you think this is in error, please feel free to comment and reopen the issue. |
Can AWS please provide an update on this? Is it planned to be fixed? |
It's been nearly 4 years since this issue was first opened. Can AWS please provide an update? Note that I suspect the performance problem here extends well beyond the SQS Sorry to be so direct, but the fact that the AWS .NET client library is over 20x slower than the community-built EfficientDynamoDb library and the fact that AWS has allowed this situation to persist this long is a crushing indictment of AWS's support for the .NET ecosystem. Should the .NET community perhaps follow the lead of the EfficientDynamoDb project and start an open source project to replace the AWS-provided .NET SQS client? Does AWS have no interest in providing support for high performance applications built using .NET? Perhaps .NET developers should just use Azure instead? |
@bill-poole Hello, sorry it took us so long to respond, this wasn't prioritized high enough but was recently re-prioritized. I ran some benchmarks using the version you specified (3.3.102.104) and the latest version (3.7.400.22) One thing to note is that as of Nov 9, 2023, AWS SQS migrated away from the AWS Query protocol and to the AWS Json Protocol. Compared to the AWS Json protocol the query protocol creates a lot more extra string allocations. Either way here are the performance benchmark results: code : https://github.com/aws/aws-sdk-net/blob/main/sdk/test/Performance/EC2PerformanceBenchmarks/SQSBenchmarks.cs Latest: 3.7.400.22
Version= 3.3.102.104
As you can see, the latest version allocates about 27.98% of the amount of memory, which is a huge improvement (358% improvement). The latest version of the SDK targets Net8.0, so I'm assuming there are some efficiency gains from that as well, but a majority of the allocation improvements come from SQS migrating away from AWSQuery and to AWSJson. Is there a reason you cannot upgrade to the latest version, or at least to the version where SQS upgraded to AWSJson? Since the version you listed was from 4 years ago, unfortunately there isn't much we can do to improve it, since the protocol behind the service has completely changed. However, if you switch to the latest version, you should see pretty massive improvements in memory allocation. |
Thanks @peterrsongg for providing an update to this issue; it's very much appreciated. I stopped using SQS in my .NET solutions because of this issue, so I'm not in a position to test the new version. That being said, I would start using SQS again if this issue were to be resolved. I use the latest version of .NET in my solutions, so I would have no problem using whatever latest version of the .NET SDK that AWS releases.

It's really great to see AWS benchmarking the .NET SDK and I hope it is going to become part of your regular build/test pipeline so that performance is continuously improved, but also any change that hurts performance is identified and fixed before release.

Back in 2021 when I first raised this issue, we were seeing the .NET SDK able to send 10 MB/s per vCPU. Furthermore, the performance seemed independent of message size - i.e., performance depended only on total volume of data sent. 10 MB/s with 100 kB messages is 100 messages per second, which is 10 ms per message. The benchmark results above have the version of the SDK we tested (3.3.102.104) taking 8.404 ms per message, which is about 16% faster than what we recorded in 2021, which makes sense because we had slightly slower machines in 2021 than today. So on today's hardware, the benchmark says we're getting 11.9 MB/s per vCPU on version 3.3.102.104. However, according to the above benchmark, the latest version of the SDK (3.7.400.22) only increases that throughput to 13.9 MB/s per vCPU (based on 7.184 ms per 100 kB message), which is only a 17% improvement over the 11.9 MB/s per vCPU on version 3.3.102.104.

While that improvement is very much appreciated, I still think much more improvement is needed. The .NET System.Text.Json serializer is able to serialize JSON at a rate of 300 MB/s when using a custom-written I would need to see a performance improvement of about 10x before I'd be in a position to use SQS in my .NET solutions.
Otherwise, a significant portion of my EC2 costs would be spent on vCPUs executing code in the AWS SDK client. That's why I use the awesome EfficientDynamoDb library to access DynamoDB instead of the AWS .NET SDK. If it weren't for the EfficientDynamoDb library, I wouldn't be able to use DynamoDB either. Note that the benchmark results above have the latest version of the SDK (3.7.400.22) still allocating 475 kB for every 100 kB message sent. That is still very high. With use of buffer pooling, it should be possible to reduce the heap allocations to nearly zero. Again, I very much appreciate your response and the progress on this issue and I very much hope there will be further investment in improving the performance of the SQS client (as well as the broader .NET SDK). Please let me know if there's anything I can do to help. |
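The throughput figures quoted in this comment follow from the per-message timings; a minimal sketch of that arithmetic is below. `ThroughputMath` and its method names are hypothetical, and the only inputs are the numbers already stated above (100 kB messages, 8.404 ms and 7.184 ms per message).

```csharp
using System;

public static class ThroughputMath
{
    // kB per millisecond is numerically equal to MB per second,
    // because both the size and the time scale by a factor of 1000.
    public static double MegabytesPerSecond(double messageKilobytes, double millisecondsPerMessage)
        => messageKilobytes / millisecondsPerMessage;

    // Relative gain of the new rate over the old rate, as a percentage.
    public static double PercentImprovement(double oldRate, double newRate)
        => (newRate / oldRate - 1.0) * 100.0;
}
```

With these helpers, `MegabytesPerSecond(100, 8.404)` is roughly 11.9, `MegabytesPerSecond(100, 7.184)` is roughly 13.9, and `PercentImprovement` between the two is roughly 17.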
@bill-poole Thanks for the detailed analysis. I agree that there is much more we can do performance-wise, and I'm not sure if you're aware, but we are working on a new major version of the SDK which gives us the platform to modernize. We've pulled in dependencies such as

With regards to throughput, I think a proper load test is required here. Since here we are just testing 1 operation where garbage collection isn't happening, it's difficult to say what the true MB/s would be. I'd be curious to see what happens when we send a high number of messages and GC starts kicking in. This is something I can test myself.

Anyways, just to see how much better V4 is in its current state, I ran the same benchmark. The allocations are much better: for a 100KB message size we allocate just around 100KB, but the performance is still not drastically better (only 8.2% faster). But I'm optimistic that this will help in throughput because it will decrease the pressure on the GC. Will follow up in a later comment on some load testing numbers. Benchmarking numbers:
|
Yes I was aware of the new version of the SDK and am very enthusiastic about the potential for such performance improvements; however, I'm unclear on the timeframe/priority for this work. As I understand it, such improvements are not in scope for the initial V4 release. It would be great to have much greater clarity on the timeframe/priority for these performance improvements. Note that there are additional libraries that make buffer pooling easier that are not included in the list of new dependencies you mentioned, such as the Will it be possible to bring such libraries into the V4 distribution at a later time (i.e., after its initial GA release)? Or do such decisions need to be made now before initial release?
Sorry, I should have looked at the benchmark code. So, the network and SQS service latency is being included in the benchmarked time per send operation, correct? I agree a proper load test would therefore provide a much more meaningful result. It would allow an apples-to-apples comparison with the result we got in 2021 - i.e., we got our result by doing a load test from a local machine to the SQS service, sending hundreds of messages concurrently. Note that we were in Perth, Western Australia and we were using the ap-southeast-2 (Sydney) region. However, we also tested locally using "localstack" and got the same result. i.e., the latency was absorbed by sending messages concurrently such that the CPU was always busy sending a message while waiting for a response from the SQS service.

However, now that I've looked at the benchmark code, it seems that the

    _messageBody = Utils.CreateMessage(Constants.KiloSize * 10);

The

    private static string CreateStringOfSize(long sizeInBytes)
    {
        //2 bytes are needed for each characterse, since .net strings are UTF-16
        int numCharacters = (int)sizeInBytes / 2;
        StringBuilder stringBuilder = new StringBuilder();
        for (int i = 0; i < numCharacters; i++)
        {
            stringBuilder.Append('A');
        }
        return stringBuilder.ToString();
    }

The above method is correct that .NET strings are UTF-16 encoded, so the above implementation correctly creates a string with the given length in bytes, but SQS messages are encoded as UTF-8, which means that a string of 10,240 bytes of 'A' characters will result in an SQS message that has a payload of 5,120 bytes over the wire. So there is the potential for confusion as to whether the

Note that a simpler and much more performant implementation of the

    private static string CreateStringOfSize(long sizeInBytes)
    {
        // 2 bytes are needed for each character, since .NET strings are UTF-16
        return string.Create(length: (int)sizeInBytes / 2, state: false, (span, state) => span.Fill('A'));
    }

I recognize that this method isn't being used in any hot path anywhere, but I think it's worth making these kinds of changes, if not for performance, then for simplicity. Note there is also a

Also note the performance result we got in 2021 was stated in terms of UTF-8 encoded payload bytes sent, not UTF-16 encoded bytes. So, to get an apples-with-apples comparison, the message size needs to be doubled. So it seems that the benchmark result for the V4 preview is showing 100,428 bytes being allocated for each send operation, but seems to be sending only 5 kB (recognizing that is actually 10 kB of UTF-16 encoded text) for each send operation, assuming my above assertion is correct. Am I correct? Or have I got something wrong?
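The size discrepancy described above can be checked directly. The sketch below is illustrative only: `EncodingSizes` and its members are hypothetical names, and the string is built with the `string(char, int)` constructor rather than either of the implementations quoted above.

```csharp
using System.Text;

public static class EncodingSizes
{
    // Builds a string of 'A' characters whose UTF-16 representation
    // occupies the requested number of bytes (2 bytes per character).
    public static string CreateAsciiString(int utf16SizeInBytes)
        => new string('A', utf16SizeInBytes / 2);

    // Size of the string as held in memory by .NET (UTF-16).
    public static int Utf16ByteCount(string s) => Encoding.Unicode.GetByteCount(s);

    // Size of the same string once encoded as UTF-8.
    public static int Utf8ByteCount(string s) => Encoding.UTF8.GetByteCount(s);
}
```

For `CreateAsciiString(10_240)`, `Utf16ByteCount` gives 10,240 and `Utf8ByteCount` gives 5,120, which matches the halving described in the comment for ASCII-only payloads.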
I'm looking forward to seeing the results! Have you considered adding the ability to mock out the HTTPS transport, which would allow benchmarking the client code in isolation of network latency and SQS service performance? |
If we wanted to bring additional libraries we would need to do that before GA, so
Though the code in
Thanks for the suggestions on the code, I'll look to update the code both to simplify it and create an overloaded method that accepts an additional parameter which doubles the size of the message sent over the wire if the service expects a UTF-8 encoded message. Appreciate you looking at the performance benchmarking code!
It's on our radar and would definitely simplify a lot of our testing code for other services as well, but we just haven't gotten around to it yet. The decision to not mock the HTTPS transport in v3 of the SDK came down to differences in netframework35 vs netframework 45 and netcoreapp31, but that decision was made before my time so I don't know the exact details. Now that we are dropping netframework35 support, I believe we could start improving that area of testing in the SDK. |
I assume therefore a design and possible prototype of reading/writing pooled buffers would be needed prior to selecting which of these libraries is needed/appropriate?
If the SQS client can be configured to use a custom |
Some sort of justification as to why we should include these new dependencies would be presented internally to the team. This could include a prototype and some performance improvement numbers or something like that.
Will keep that in mind when designing a mocked client👍 |
I was just thinking that you'd want to be very sure you chose the correct library/libraries before being locked into them, so I'd imagine a prototype would be needed. I'd be very interested to see and provide feedback on the prototype. |
Adding some more clarity on dependencies. Post GA I believe we can still add new dependencies, but the value has to be significant, not just a minor performance improvement in non-hot spot areas. We would need to do some significant version bump and possibly write a blog post to make sure users of the SDK are not surprised. We do have to support users that are acquiring the SDK outside of NuGet, and dependencies get harder in those cases. |
The HttpClientFactory property can be used to configure the SDK to use a mocked HttpClient for testing. |
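As a rough illustration of what a mocked transport could look like, the sketch below uses only `System.Net.Http`. `CannedResponseHandler` is a hypothetical name, and how an `HttpClient` built on it would be handed to the SDK through the property named above is deliberately not shown, since that wiring is not described in this thread.

```csharp
using System.IO;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stub transport: answers every request locally with a canned
// body, so client-side CPU and allocations can be measured without the network.
public sealed class CannedResponseHandler : HttpMessageHandler
{
    private readonly string _responseBody;
    private int _requestCount;

    public CannedResponseHandler(string responseBody) => _responseBody = responseBody;

    public int RequestCount => _requestCount;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        Interlocked.Increment(ref _requestCount);

        // Drain the request body so the caller still pays the serialization cost.
        if (request.Content != null)
        {
            await request.Content.CopyToAsync(Stream.Null).ConfigureAwait(false);
        }

        return new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent(_responseBody, Encoding.UTF8, "application/json"),
            RequestMessage = request
        };
    }
}
```

An `HttpClient` constructed as `new HttpClient(new CannedResponseHandler("{}"))` never touches a socket, which is the property a benchmark of the client code in isolation would need.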
@bill-poole we're starting the work of switching to STJ marshalling. I put out the first PR here, though it's still a WIP: #3528 if you're interested. |
Thanks @peterrsongg. I've had a bit of a look through the PR. I haven't used T4 templates before, so I found the PR a little hard to navigate, so apologies for whatever I may have misunderstood. The main focus of my comments/observations is to identify where memory/data is being unnecessarily allocated/copied multiple times.

JsonRPCRequestMarshaller.tt writes JSON into an Alternatively, an The JSON buffer is then copied again into a new array, with which the

    request.Content = arrayBufferWriter.WrittenMemory.ToArray();

This allocation/copy can be avoided if the

There are also a few places where values are written to a heap-allocated string, and then that string is written into the JSON buffer. For example, in JsonRPCStructureMarshaller.tt:

    context.Writer.WriteStringValue(StringUtils.FromSpecialDoubleValue(<#=memberProperty#>));
    context.Writer.WriteStringValue(Guid.NewGuid().ToString());
    context.Writer.WriteStringValue(StringUtils.FromMemoryStream(<#=variableName + "." + member.PropertyName#>));

When writing values that have fixed/constrained lengths and do not use/require escaping (e.g., when writing a

When writing a memory stream, try to get direct access to the underlying buffer using |
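One way to avoid the intermediate strings described above is sketched below, assuming `System.Text.Json` is the writer in use and that the stream content is meant to be written as Base64 (an assumption about the SDK helper, not something confirmed in this thread). `DirectJsonWrites`, the property names `Id` and `Blob`, and the use of `MemoryStream.TryGetBuffer` are illustrative choices, not the SDK's code.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Text.Json;

public static class DirectJsonWrites
{
    public static byte[] Write(Guid id, MemoryStream blob)
    {
        var buffer = new ArrayBufferWriter<byte>();
        using (var writer = new Utf8JsonWriter(buffer))
        {
            writer.WriteStartObject();

            // The Guid overload formats straight into the output buffer,
            // so no intermediate string is allocated.
            writer.WriteString("Id", id);

            // Use the stream's backing array when it is exposed; otherwise
            // fall back to a copy of the stream's contents.
            ReadOnlySpan<byte> bytes = blob.TryGetBuffer(out ArraySegment<byte> segment)
                ? segment.AsSpan()
                : blob.ToArray();
            writer.WriteBase64String("Blob", bytes);

            writer.WriteEndObject();
        }

        // Copied out only to keep this sketch's return type simple.
        return buffer.WrittenSpan.ToArray();
    }
}
```

The fallback branch matters because a `MemoryStream` constructed over a caller-supplied array does not expose its buffer by default.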
Thanks for taking a look. Sorry it took a while to get back to you. This is still actively being worked on, and I took your suggestion of using the One note on the default buffer size though. I did some experimentation and I definitely do not want to allocate a 250KB buffer as the initial buffer since most payloads will not be 250KB and that would make us allocate potentially unused memory. Maybe if we had some data on what the average payload size was that would help us, but we don't have that information so I left it at the default. I found that increasing this value actually increased byte allocations most of the time I also thought it was odd that we did a I didn't get a chance to look into the |
This shouldn't matter because allocated buffers are returned to the pool after use. i.e., the number of allocated buffers should only be the number of buffers that are used concurrently (i.e., the number of messages actively being sent over the wire concurrently). The At first, the array pool is empty, so each array requested from the pool is allocated new. However, as arrays are returned to the pool, newly requested arrays are drawn from the pool rather than allocating new arrays. Returning an array to the pool when it is full is a no-op, which leaves the array for the GC to collect. That being said, it doesn't look like this PR is actually returning the buffers to the pool. i.e., I don't think the
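The rent/return cycle described above can be sketched as follows. `PooledBufferDemo` and `WithRentedBuffer` are hypothetical names; the point is only that the buffer goes back to the pool in a `finally` block, so the number of live buffers tracks the number of concurrent operations.

```csharp
using System;
using System.Buffers;

public static class PooledBufferDemo
{
    // Rents a buffer, lets the caller use it, and always returns it to the pool.
    public static int WithRentedBuffer(int minimumLength, Func<byte[], int> use)
    {
        // Rent may hand back a larger array than requested, never a smaller one.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(minimumLength);
        try
        {
            return use(buffer);
        }
        finally
        {
            // Without this call the array is simply left for the GC,
            // and the pool provides no benefit.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

The caller must not keep a reference to the array after `use` returns, since the pool may hand the same array to another renter.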
Any .NET native APIs accepting a byte array should also have an overload accepting a
I think that reading/writing a |
I'm not sure I completely agree that it "doesn't matter", because it would still mean allocating the maximum possible buffer size for every request, and in a highly concurrent application that is sending small payloads this can cause higher memory allocations even if the buffers are being returned to the pool eventually. And within the context of a lambda function where memory usage matters I wouldn't want to allocate such a high memory for every request. However, given that we disagree on this, maybe this is a good opportunity to make this configurable, like we did for the buffer size in the StreamingUtf8JsonReaders. (We added a config option for this).
Hmm, I wrapped the
Yeah, in theory it should be possible and with more time I could probably do it, but also the fact that
Yeah this was a community ask on my GitHub PR, as was the ArrayPoolBufferWriter, but haven't done any perf tests on it. Good to know though! |
Sorry, I missed the
Yes, but only for the first set of requests, after which the buffers will be drawn from the pool, not allocated.
Yes, that's true. Although, it could be worth testing to determine what the additional memory load would actually be. However, in the current PR there is no
Yes, But, that would increase the memory load as stated above because it would increase the number of buffers in the pool beyond the number of threads in the thread pool because the buffers would be held while the buffers are sent over the network. Therefore, another approach would be to copy the contents of the I strongly suggest doing this (if not sending content over the network directly from the

One way to do this would be to update the

    var bufferSize = arrayPoolBufferWriter.WrittenMemory.Length;
    request.Content = new ArraySegment<byte>(array: ArrayPool<byte>.Shared.Rent(bufferSize),
        offset: 0, count: bufferSize);
    arrayPoolBufferWriter.WrittenMemory.Span.CopyTo(request.Content.AsSpan());

You can pass a You would also need to update the

    ArrayPool<byte>.Shared.Return(Content.Array);

And you'd need to invoke the This would eliminate all heap allocations in the SQS client related to the content buffer, and all but a single memory copy from the 256 kB

However, the shape of the Furthermore, it unnecessarily forces a conversion to UTF-16. i.e., if I am building the content of an SQS message with my own Therefore, it would be great if the This could be achieved by adding a This would then eliminate all large buffer allocations, reduce the number of times the content is copied to two (once from the user-supplied buffer to the 256 kB |
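The copy-into-a-right-sized-pooled-buffer idea above could be packaged roughly as below. `PooledContent` is a hypothetical type invented for this sketch, not part of the SDK; it assumes the owner disposes it once the content has been sent.

```csharp
using System;
using System.Buffers;

// Hypothetical holder for request content backed by a pooled array.
public sealed class PooledContent : IDisposable
{
    private byte[]? _rented;

    // Valid only until Dispose is called; the array is shared with the pool afterwards.
    public ArraySegment<byte> Segment { get; }

    public PooledContent(ReadOnlySpan<byte> source)
    {
        // Rent a buffer sized to the actual payload, then copy the payload once.
        byte[] rented = ArrayPool<byte>.Shared.Rent(source.Length);
        source.CopyTo(rented);
        _rented = rented;
        Segment = new ArraySegment<byte>(rented, 0, source.Length);
    }

    public void Dispose()
    {
        byte[]? rented = _rented;
        if (rented == null)
        {
            return; // already returned; never return the same array twice
        }
        _rented = null;
        ArrayPool<byte>.Shared.Return(rented);
    }
}
```

The `ArraySegment<byte>` carries the real payload length, which matters because the rented array is usually longer than the data written into it.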
@bill-poole thank you for your suggestions. give me some time to investigate but I think this is a good starting point |
Can you expand upon what you mean by there is no
great suggestion, I'll look into this.
I think this optimization comes after the request.Content optimization and I cannot guarantee that this work will happen since it would be a customization for SQS, but I hear you. UTF-8 is the encoding of the web, and it is definitely a performance hit that we need to convert from UTF-16 to UTF-8 to send it to SQS |
FYI Like I mentioned in #3468 I think it would be great to switch the content property to ReadOnlyMemory or consistently use the ContentStream property and then wrap ROM into a ReadOnlyMemoryStream. As mentioned by Billie, using the stream as a common abstraction has implications too, but it might be worth exploring to streamline the content access. If you make sure to carry forward either ROM or the stream with the proper conditionals, you can quite likely almost always then call into framework methods that have newer span or ROM based overloads and only copy into byte arrays where it is really required (for example in the .NET Standard path). And even for the .NET Standard path it is sometimes possible to backport slightly more modern versions of the API surface as long as you are willing to go down the "unsafe" path. |
From a library / SDK standpoint it is also always worth considering how much safety you are willing to give up to boost performance. For example, the Azure .NET SDK has several places where the internal buffers are copied before being handed out to the users to make sure they don't run into weird issues. With the newer version of the RabbitMQ .NET client we have decided to expose the buffered ROM directly, but it comes with a huge caveat that has to be indicated in the XML docs. The users of the RabbitMQ client are mostly fairly advanced though, and this was an acceptable tradeoff to make. |
FYI an abstraction that is available in System.Memory.Data is BinaryData, which allows treating various "bridging" cases with one uniform data type. |
Expected Behavior
An application should be able to send significantly more than 10MB/s to SQS per vCPU. Ideally, an application should be able to send well over 100MB/s per vCPU.
Current Behavior
We are finding that sending about 130MB/s of messages to SQS is consuming about 15 vCPUs. We are finding a lot of CPU time being spent by the GC because we are finding that there are over 5GB of allocations for each 100MB of messages sent.
We also find that this issue is proportional to total/aggregate payload size, not number of messages. That is, if we send much less data spread over many more messages, the CPU load is significantly less.
Possible Solution
Simplify/streamline the .NET SQS client so it is performance-optimised. Minimise allocations, reducing GC pressure.
Steps to Reproduce (for bugs)
Just create a simple application that uses the SQS client to concurrently send a large number of large messages, such that over 100MB/s of messages are being sent. It uses about 15 vCPUs on .NET Core 3.1 on Linux. The performance is even worse on Windows.
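A minimal sketch of such a reproduction harness is shown below. `SendLoadTest` is a hypothetical name and `sendAsync` is a placeholder delegate standing in for the real send call, which is not reproduced here; the messages are assumed to be prebuilt so the harness itself allocates little.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

public static class SendLoadTest
{
    // Sends prebuilt messages from several concurrent workers and reports
    // the UTF-8 payload volume and the resulting MB/s.
    public static async Task<(long BytesSent, double MegabytesPerSecond)> RunAsync(
        Func<string, Task> sendAsync, string[] messages, int concurrency, int sendsPerWorker)
    {
        // Measure payload sizes up front so the timed loop only sends.
        int[] sizes = messages.Select(m => Encoding.UTF8.GetByteCount(m)).ToArray();
        long[] bytesPerWorker = new long[concurrency];

        var stopwatch = Stopwatch.StartNew();
        Task[] workers = Enumerable.Range(0, concurrency).Select(async worker =>
        {
            for (int i = 0; i < sendsPerWorker; i++)
            {
                int index = (worker + i) % messages.Length;
                await sendAsync(messages[index]).ConfigureAwait(false);
                bytesPerWorker[worker] += sizes[index]; // each worker owns one slot
            }
        }).ToArray();
        await Task.WhenAll(workers).ConfigureAwait(false);
        stopwatch.Stop();

        long total = bytesPerWorker.Sum();
        double seconds = Math.Max(stopwatch.Elapsed.TotalSeconds, 1e-9);
        return (total, total / 1_000_000.0 / seconds);
    }
}
```

Dividing the reported MB/s by the number of busy vCPUs observed during the run gives the per-vCPU figure discussed in this issue; CPU usage itself has to be read from the operating system and is not measured by this sketch.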
Context
Our application produces a massive amount of data that needs to be sent through SQS, and at 15 vCPUs per 100MB/s, we find that a lot of our compute costs are coming from the .NET SQS client.
Your Environment
.NET Core Info