Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terrible SQS SendMessage performance #1602

Open
bill-poole opened this issue May 6, 2020 · 33 comments
Open

Terrible SQS SendMessage performance #1602

bill-poole opened this issue May 6, 2020 · 33 comments
Labels
feature-request A feature should be added or improved. module/sdk-custom p1 This is a high priority issue queued v4

Comments

@bill-poole
Copy link

Expected Behavior

An application should be able to send significantly more than 10MB/s to SQS per vCPU. Ideally, an application should be able to send well over 100MB/s per vCPU.

Current Behavior

We are finding that sending about 130MB/s of messages to SQS is consuming about 15 vCPUs. We are finding a lot of CPU time being spent by the GC because we are finding that there are over 5GB of allocations for each 100MB of messages sent.

We also find that this issue is proportional to total/aggregate payload size, not number of messages. That is, if we send much less data spread over many more messages, the CPU load is significantly less.

Possible Solution

Simplify/streamline the .NET SQS client so it is performance-optimised. Minimise allocations, reducing GC pressure.

Steps to Reproduce (for bugs)

Just create a simple application that uses the SQS client to concurrently send a large number of large messages, such that over 100MB/s of messages are being sent. It uses about 15 vCPUs on .NET Core 3.1 on Linux. The performance is even worse on Windows.

Context

Our application produces a massive amount of data that needs to be sent through SQS, and at 15 vCPUs per 100MB, we find that a lot of our compute costs are coming from the .NET SQS client.

Your Environment

  • AWSSDK.SQS version used: 3.3.102.104
  • Operating System and version: Ubuntu 18.04
  • Visual Studio version: Visual Studio 2019 (16.5)
  • Targeted .NET platform: .NET Core 3.1

.NET Core Info

  • .NET Core version used for development: .NET Core 3.1
  • .NET Core version installed in the environment where application runs: .NET Core 3.1
@bill-poole
Copy link
Author

bill-poole commented May 6, 2020

I just did a quick calculation, and with the above reported performance, it seems like there is currently a 25% cost overhead to using the SQS client provided by the AWS .NET SDK; where that cost is for the EC2 CPU-seconds needed by the SQS client to send each million messages, assuming each message is 64kB.

@philasmar philasmar added bug This issue is a bug. needs-reproduction This issue needs reproduction. needs-triage This issue or PR still needs to be triaged. module/sdk-custom labels Jul 23, 2020
@NGL321 NGL321 added A and removed needs-triage This issue or PR still needs to be triaged. labels Sep 9, 2020
@hunanniu hunanniu added the queued label Oct 7, 2020
@bill-poole
Copy link
Author

Hi, just wondering if there’s an update on this? Does AWS recognise there is a problem? If so, does AWS plan on a fix/update? If so, is the fix scheduled? If so, what’s the time horizon?

@bill-poole
Copy link
Author

Is the SQS .NET client still being maintained by AWS?

@normj
Copy link
Member

normj commented Jul 7, 2021

@billpoole-mi You mentioned messages size of 64K, is that the max size or are there some percentage that are significant larger? What I'm wondering is if your messages are going past 85,000 bytes. The .NET garbage collector will automatically put objects of that size into the large object heap (LOH) and putting a lot of objects into LOH could cause a lot of extra work for the GC.

@bill-poole
Copy link
Author

In our test, we prebuilt a pool of messages that were randomly sized up to 256 kB. We then sent those messages in a loop using SQS. So we had no allocations on our side. The allocations were all in the AWS .NET SQS client.

@bill-poole
Copy link
Author

Moreover, as stated above, we saw 5 GB of allocations for every 100 MB sent. So even if we were allocating for each new message sent, there was 50x more allocations in the SQS client than what would have been allocated by the sending code.

@normj
Copy link
Member

normj commented Jul 7, 2021

I agree the allocations are very concerning. I am still curious what the affect the LOH is having. Can you run your test harness with messages size just a little above 85,000 bytes and then a little below it. I'm curious how dramatic the difference will be. Is your test harness shareable?

@bill-poole
Copy link
Author

bill-poole commented Jul 7, 2021

I’ll have to go dig the test harness up. The last time I used it was May last year when I first raised this issue. I’ll see if I can find it.

Have you tried to reproduce the issue yourself? ie what send throughput do you see sending messages? Are you seeing more than 10 MB/s per CPU thread?

@bill-poole
Copy link
Author

It’s worth noting that although highly frequent LOH allocations are bad for performance, they’re not so bad as to cause a 10 MB/s per CPU thread send rate.

Just my 2c, but I think the majority of the problem is unnecessary buffer allocations and copying between buffers.

A sender should be able to write into a stream and at most, the whole message is buffered in memory once, and that buffer should be drawn from a memory pool to prevent LOH compaction overheads.

We can write well over 100 MB/s per CPU thread over raw HTTP. ie, if we do the same test just writing the messages as basic REST posts, we get way over 100 MB/s per CPU thread - even if the messages are all over 85 kB.

@ashishdhingra ashishdhingra removed the needs-reproduction This issue needs reproduction. label Aug 17, 2021
@github-actions
Copy link

We have noticed this issue has not received attention in 1 year. We will close this issue for now. If you think this is in error, please feel free to comment and reopen the issue.

@github-actions github-actions bot added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Aug 18, 2022
@bill-poole
Copy link
Author

Can AWS please provide an update on this? Is it planned to be fixed?

@ashovlin ashovlin removed the closing-soon This issue will automatically close in 4 days unless further comments are made. label Aug 18, 2022
@ashishdhingra ashishdhingra added p2 This is a standard priority issue and removed A labels Nov 1, 2022
@bill-poole
Copy link
Author

It's been nearly 4 years since this issue was first opened. Can AWS please provide an update?

Note that I suspect the performance problem here extends well beyond the SQS SendMessage API. For example, I have found similar performance problems with the .NET DynamoDB client library. This is why the EfficientDynamoDb project was created. The EfficientDynamoDb project boasts up to 21x better performance than the AWS-provided client library!!!

Sorry to be so direct, but the fact that the AWS .NET client library is over 20x slower than the community-built EfficientDynamoDb library and the fact that AWS has allowed this situation to persist this long is a crushing indictment of AWS's support for the .NET ecosystem. Should the .NET community perhaps follow the lead of the EfficientDynamoDb project and start an open source project to replace the AWS-provided .NET SQS client?

Does AWS have no interest in providing support for high performance applications built using .NET? Perhaps .NET developers should just use Azure instead?

@bhoradc bhoradc added p1 This is a high priority issue and removed p2 This is a standard priority issue labels Sep 23, 2024
@peterrsongg
Copy link
Contributor

@bill-poole Hello, sorry it took us so long to respond, this wasn't prioritized high enough but was recently re-prioritized. I ran some benchmarks using the version you specified (3.3.102.104) and the latest version (3.7.400.22)

One thing to note is that as of Nov 9, 2023, AWS SQS migrated away from the AWS Query protocol and to the AWS Json Protocol. Compared to the AWS Json protocol the query protocol creates a lot more extra string allocations. Either way here are the performance benchmark results:

code : https://github.com/aws/aws-sdk-net/blob/main/sdk/test/Performance/EC2PerformanceBenchmarks/SQSBenchmarks.cs
Message size (tweaked above code for message size) : 100 KB

Latest: 3.7.400.22

Method Mean Error StdDev P50 P90 P95 Gen0 Gen1 Gen2 Allocated
SQSSendMessageAsync 7.184 ms 0.1558 ms 0.4569 ms 7.159 ms 7.791 ms 7.928 ms 7.8125 7.8125 7.8125 475454 B

Version= 3.3.102.104

Method Mean Error StdDev P50 P90 P95 Gen0 Gen1 Gen2 Allocated
SQSSendMessageAsync 8.404 ms 0.2146 ms 0.6260 ms 8.322 ms 9.215 ms 9.526 ms 375.0000 375.0000 375.0000 1698929 B

As you can see the latest version allocates about 27.98% the amount of memory, which is a huge improvement (358% improvement). The latest version of the SDK targets Net8.0 so I'm assuming there are some efficiency gains from that as well, but a majority of the allocation improvements come from SQS migrating away from AWSQuery and to AWSJson. Is there a reason you cannot upgrade to the latest version or at least to the version where SQS upgraded to AWSJson?

Since the version you listed was from 4 years ago, unfortunately there isn't much we can do to improve it, since the protocol behind the service has completely changed. However, if you switch to a latest version, you should see pretty massive improvements in memory allocation.

@bill-poole
Copy link
Author

bill-poole commented Sep 26, 2024

Thanks @peterrsongg for providing an update to this issue; it's very much appreciated. I stopped using SQS in my .NET solutions because of to this issue, so I'm not in a position to test the new version. That being said, I would start using SQS again if this issue were to be resolved. I use the latest version of .NET in my solutions, so I would have no problem using whatever latest version of the .NET SDK that AWS releases.

It's really great to see AWS benchmarking the .NET SDK and I hope it is going to become part of your regular build/test pipeline so that performance is continuously improved, but also any change that hurts performance is identified and fixed before release.

Back in 2021 when I first raised this issue, we were seeing the .NET SDK able to send 10 MB/s per vCPU. Furthermore, the performance seemed independent of message size - i.e., performance depended only on total volume of data sent. 10 MB/s with 100 kB messages is 100 messages per second, which is 10 ms per message.

The benchmark results above have the version of the SDK we tested (3.3.102.104) taking 8.404 ms per message, which is about 16% faster than what we recorded in 2021, which makes sense because we had slightly slower machines in 2021 than today. So on today's hardware, the benchmark says we're getting 11.9 MB/s per vCPU on version 3.3.102.104.

However, according to the above benchmark, the latest version of the SDK (3.7.400.22) only increases that throughput to 13.9 MB/s per vCPU (based on 7.184 ms per 100 kB message), which is only a 17% improvement over the 11.9 MB/s per vCPU on version 3.3.102.104. While that improvement is very much appreciated, I still think much more improvement is needed.

The .NET System.Text.Json serializer is able to serialize JSON at a rate of 300 MB/s when using a custom-written JsonConverter that writes JSON using a Utf8JsonWriter. I've been able to serialize and send JSON over an HTTPS connection at rates well over 100 MB/s per vCPU. Therefore, I think a 10x improvement in performance is possible.

I would need to see a performance improvement of about 10x before I'd be in a position to use SQS in my .NET solutions. Otherwise, a significant portion of my EC2 costs would be spent on vCPUs executing code in the AWS SDK client. That's why I use the awesome EfficientDynamoDb library to access DynamoDB instead of the AWS .NET SDK. If it weren't for the EfficientDynamoDb library, I wouldn't be able to use DynamoDB either.

Note that the benchmark results above have the latest version of the SDK (3.7.400.22) still allocating 475 kB for every 100 kB message sent. That is still very high. With use of buffer pooling, it should be possible to reduce the heap allocations to nearly zero.

Again, I very much appreciate your response and the progress on this issue and I very much hope there will be further investment in improving the performance of the SQS client (as well as the broader .NET SDK).

Please let me know if there's anything I can do to help.

@peterrsongg
Copy link
Contributor

@bill-poole Thanks for the detailed analysis. I agree that there is much more we can do performance-wise, and i'm not sure if you're aware but we are working on a new major version of the SDK which gives us the platform to modernize. We've pulled in dependencies such as System.Text.Json, System.Memory, and System.Buffers so we can take advantage of spans, and pooling like you are suggesting, which would increase performance. We've actually had plenty of community PRs making small improvements like this, and if you wanted to help you could definitely issue some PRs targeting the v4-development branch. Here is the issue tracking our progress on V4. However, the switch the System.Text.Json is something we want to do internally since it is a large effort.

With regards to throughput, I think a proper load test is required here. Since here we are just testing 1 operation where garbage collection isn't happening, it's difficult to say what the true MB/s would be. I'd be curious to see what happens when we send a high number of messages and GC starts kicking in. This is something I can test myself.

Anyways, just to see how much better V4 is in its current state, I ran the same benchmark. The allocations are much better, for a 100KB message size we allocate just around 100KB, but the performance is still not drastically better (only 8.2% faster). But I'm optimistic that this will help in throughput b/c it will decrease the pressure on the GC. Will follow up in a later comment on some load testing numbers.

Benchmarking numbers:
Version: v4-development (branch)

Method Mean Error StdDev P50 P90 P95 Allocated
SQSSendMessageAsync 6.595 ms 0.2591 ms 0.7433 ms 6.470 ms 7.704 ms 7.945 ms 100428 B

@bill-poole
Copy link
Author

bill-poole commented Sep 29, 2024

we are working on a new major version of the SDK which gives us the platform to modernize. We've pulled in dependencies such as System.Text.Json, System.Memory, and System.Buffers so we can take advantage of spans, and pooling like you are suggesting, which would increase performance.

Yes I was aware of the new version of the SDK and am very enthusiastic about the potential for such performance improvements; however, I'm unclear on the timeframe/priority for this work. As I understand it, such improvements are not in scope for the initial V4 release. It would be great to have much greater clarity on the timeframe/priority for these performance improvements.

Note that there are additional libraries that make buffer pooling easier that are not included in the list of new dependencies you mentioned, such as the Microsoft.Toolkit.HighPerformance library, which provides the ArrayPoolBufferWriter<T> class, which allows high performance writing of data to an auto-resizing contiguous buffer, where the buffers are drawn from the shared array pool (ArrayPool<T>.Shared). Another library in this space is Microsoft.IO.RecyclableMemoryStream, which is used by the EfficientDynamoDb library.

Will it be possible to bring such libraries into the V4 distribution at a later time (i.e., after its initial GA release)? Or do such decisions need to be made now before initial release?

With regards to throughput, I think a proper load test is required here. Since here we are just testing 1 operation where garbage collection isn't happening, it's difficult to say what the true MB/s would be.

Sorry, I should have looked at the benchmark code. So, the network and SQS service latency is being included in the benchmarked time per send operation, correct? I agree a proper load test would therefore provide a much more meaningful result. It would allow an apples-to-apples comparison with the result we got in 2021 - i.e., we got our result by doing a load test from a local machine to the SQS service, sending hundreds of messages concurrently.

Note that we were in Perth, Western Australia and we were using the ap-southeast-2 (Sydney) region. However, we also tested locally using "localstack" and got the same result. i.e., the latency was absorbed by sending messages concurrently such that the CPU was always busy sending a message while waiting for a response from the SQS service.

However, now that I've looked at the benchmark code, it seems that the SQSBenchmarks.SetupForSendMessage method is creating a 10 kB UTF-16 encoded message, which is actually a 5 kB UTF-8 encoded message, which is what is actually sent over the wire to the SQS service, not a 100 kB message:

_messageBody = Utils.CreateMessage(Constants.KiloSize * 10);

The Utils.CreateStringOfSize(long sizeInBytes) method creates the returned string with a loop that invokes the StringBuilder.Append(char) method sizeInBytes / 2 times:

private static string CreateStringOfSize(long sizeInBytes)
{
    //2 bytes are needed for each characterse, since .net strings are UTF-16
    int numCharacters = (int)sizeInBytes / 2;
    StringBuilder stringBuilder = new StringBuilder();
    for (int i = 0; i < numCharacters; i++)
    {
        stringBuilder.Append('A');
    }
    return stringBuilder.ToString();
}

The above method is correct that .NET strings are UTF-16 encoded, so the above implementation correctly creates a string with the given length in bytes, but SQS messages are encoded as UTF-8, which means that a string of 10,240 bytes of 'A' characters will result in an SQS message has a payload of 5,120 bytes over the wire. So there is the potential for confusion as to whether the sizeInBytes parameter means the size of the message payload in memory or on the wire.

Note that a simpler and much more performant implementation of the Utils.CreateStringOfSize(long) method would be:

private static string CreateStringOfSize(long sizeInBytes)
{
    // 2 bytes are needed for each character, since .NET strings are UTF-16
    return string.Create(length: (int)sizeInBytes / 2, state: false, (span, state) => span.Fill('A'));
}

I recognize that this method isn't being used in any hot path anywhere, but I think its worth making these kinds of changes, if not for performance, then for simplicity. Note there is also a StringBuilder.Append(char value, int repeatCount) method, which would have been appropriate to use prior to the availability of Span<T>.Fill(T).

Also note the performance result we got in 2021 was stated in terms of UTF-8 encoded payload bytes sent, not UTF-16 encoded bytes. So, to get an apples-with-apples comparison, the message size needs to be doubled.

So it seems that the benchmark result for the V4 preview is showing 100,428 bytes being allocated for each send operation, but seems to be sending only 5 kB (recognizing that is actually 10 kB of UTF-16 encoded text) for each send operation, assuming my above assertion is correct. Am I correct? Or have I got something wrong?

Will follow up in a later comment on some load testing numbers.

I'm looking forward to seeing the results!

Have you considered adding the ability to mock out the HTTPS transport, which would allow benchmarking the client code in isolation of network latency and SQS service performance?

@peterrsongg
Copy link
Contributor

Will it be possible to bring such libraries into the V4 distribution at a later time (i.e., after its initial GA release)? Or do such decisions need to be made now before initial release?

If we wanted to bring additional libraries we would need to do that before GA, so Microsoft.Toolkit.HighPerformance is something I could bring up to the team along with Microsoft.IO.RecyclableMemoryStream.

However, now that I've looked at the benchmark code, it seems that the SQSBenchmarks.SetupForSendMessage method is creating a 10 kB UTF-16 encoded message, which is actually a 5 kB UTF-8 encoded message, which is what is actually sent over the wire to the SQS service, not a 100 kB message:

Though the code in main looks like that, I ran the code on my own branch where I updated that code to Constants.KiloSize * 100, so it was 100KB message, but that's a good callout that a 100Kb UTF-16 encoded message would actually be a 50Kb UTF-8 encoded message. The details!

Note that a simpler and much more performant implementation of the Utils.CreateStringOfSize(long) method would be:

private static string CreateStringOfSize(long sizeInBytes)
{
    // 2 bytes are needed for each character, since .NET strings are UTF-16
    return string.Create(length: (int)sizeInBytes / 2, state: false, (span, state) => span.Fill('A'));
}

I recognize that this method isn't being used in any hot path anywhere, but I think its worth making these kinds of changes, if not for performance, then for simplicity. Note there is also a StringBuilder.Append(char value, int repeatCount) method, which would have been appropriate to use prior to the availability of Span<T>.Fill(T).

Also note the performance result we got in 2021 was stated in terms of UTF-8 encoded payload bytes sent, not UTF-16 encoded bytes. So, to get an apples-with-apples comparison, the message size needs to be doubled.

So it seems that the benchmark result for the V4 preview is showing 100,428 bytes being allocated for each send operation, but seems to be sending only 5 kB (recognizing that is actually 10 kB of UTF-16 encoded text) for each send operation, assuming my above assertion is correct. Am I correct? Or have I got something wrong?

Thanks for the suggestions on the code, I'll look to update the code both to simplify it and create an overloaded method that accepts an additional parameter which doubles the size of the message sent over the wire if the service expects a UTF-8 encoded message. Appreciate you looking at the performance benchmarking code!

Have you considered adding the ability to mock out the HTTPS transport, which would allow benchmarking the client code in isolation of network latency and SQS service performance?

It's on our radar and would definitely simplify a lot of our testing code for other services as well, but we just haven't gotten around to it yet. The decision to not mock the https transport in v3 of the SDK came down to differences in netframework35 vs netframework 45 and netcoreapp31 but that decision was made before my time so I don't know exactly the details. Now that we are dropping netframework35 support, I believe we could start improving that area of testing in the sdk.

@bill-poole
Copy link
Author

If we wanted to bring additional libraries we would need to do that before GA, so Microsoft.Toolkit.HighPerformance is something I could bring up to the team along with Microsoft.IO.RecyclableMemoryStream.

I assume therefore a design and possible prototype of reading/writing pooled buffers would be needed prior to selecting which of these libraries is needed/appropriate?

It's on our radar and would definitely simplify a lot of our testing code for other services as well, but we just haven't gotten around to it yet.

If the SQS client can be configured to use a custom HttpMessageHandler, then the HttpMessageHandler.SendAsync method can be mocked (because it is declared abstract) to possibly meet this requirement.

@peterrsongg
Copy link
Contributor

I assume therefore a design and possible prototype of reading/writing pooled buffers would be needed prior to selecting which of these libraries is needed/appropriate?

Some sort of justification as to why we should include these new dependencies would be presented internally to the team. This could include a prototype and some performance improvement numbers or something like that.

It's on our radar and would definitely simplify a lot of our testing code for other services as well, but we just haven't gotten around to it yet.

If the SQS client can be configured to use a custom HttpMessageHandler, then the HttpMessageHandler.SendAsync method can be mocked (because it is declared abstract) to possibly meet this requirement.

Will keep that in mind when designing a mocked client👍

@bill-poole
Copy link
Author

Some sort of justification as to why we should include these new dependencies would be presented internally to the team. This could include a prototype and some performance improvement numbers or something like that.

I was just thinking that you'd want to be very sure you chose the correct library/libraries before being locked into them, so I'd imagine a prototype would be needed. I'd be very interested to see and provide feedback on the prototype.

@normj
Copy link
Member

normj commented Sep 30, 2024

Adding some more clarity on dependencies. Post GA I believe we can still add new dependencies but the value has to significant not just a minor performance improvement in non-hot spot areas. We would need to do some significant version bump and possibly write a blog post to make sure user's of the SDK are not surprised. We do have to support users that are acquiring the SDK outside of NuGet and dependencies get harder in those cases.

@normj
Copy link
Member

normj commented Sep 30, 2024

The HttpClientFactory property can be used to configure the SDK to use a mocked HttpClient for testing.

@peterrsongg
Copy link
Contributor

@bill-poole we're starting the work of switching to STJ marshalling. I put out the first PR here, though it's still a WIP: #3528

if you're interested.

@boblodgett boblodgett added v4 and removed bug This issue is a bug. labels Nov 19, 2024
@ashishdhingra ashishdhingra added the feature-request A feature should be added or improved. label Nov 19, 2024
@boblodgett boblodgett removed the p1 This is a high priority issue label Nov 19, 2024
@ashishdhingra ashishdhingra added p1 This is a high priority issue p2 This is a standard priority issue and removed p1 This is a high priority issue labels Nov 19, 2024
@boblodgett boblodgett added p1 This is a high priority issue and removed p2 This is a standard priority issue labels Nov 19, 2024
@bill-poole
Copy link
Author

bill-poole commented Dec 4, 2024

Thanks @peterrsongg. I've had a bit of a look through the PR. I haven't used T4 templates before, so I found the PR a little hard to navigate, so apologies for whatever I may have misunderstood.

The main focus of my comments/observations is to identify where memory/data is being unnecessarily allocated/copied multiple times.

JsonRPCRequestMarshaller.tt writes JSON into an ArrayBufferWriter<byte>, which heap-allocates a buffer of 256 bytes, and increases the buffer size each time the buffer is exceeded, heap-allocating each new buffer and copying the previous buffer into the new buffer. For example, sending a 200 kB SQS message with a 195 kB payload (UTF-8 encoded) might resize the buffer 6 times, allocating 7 buffers (i.e., 256 bytes, 512 bytes, 1 kB, 2 kB, 4 kB, 199 kB, 398 kB), in accordance with the ArrayBufferWriter<T>.CheckAndResizeBuffer method. This would also write about 406.75 kB memory instead of just 200 kB.

Alternatively, an ArrayPoolBufferWriter<T> (from the Microsoft.Toolkit.HighPerformance library) could be used as a drop-in replacement for ArrayBufferWriter<T>, which would mean buffers would be allocated from the ArrayPool<byte>.Shared array pool, thus eliminating these heap allocations. Furthermore, each ArrayPoolBufferWriter<T> instance could be created with an initial capacity of 256 kB (given the maximum SQS message size is 256 kB), which would prevent the memory copying associated with buffer resizing.

The JSON buffer is then copied again into a new array, with which the DefaultRequest.Content property is then set:

request.Content = arrayBufferWriter.WrittenMemory.ToArray();

This allocation/copy can be avoided if the ArrayPoolBufferWriter<byte> can be given directly to the DefaultRequest object. The buffer can then be written directly from the pooled buffer to the network. However, the DefaultRequest object would need to be disposed after it is no longer needed so the buffer is returned to the pool.

There are also a few places where values are written to a heap-allocated string, and then that string is written into the JSON buffer. For example, in JsonRPCStructureMarshaller.tt:

context.Writer.WriteStringValue(StringUtils.FromSpecialDoubleValue(<#=memberProperty#>));

context.Writer.WriteStringValue(Guid.NewGuid().ToString());

context.Writer.WriteStringValue(StringUtils.FromMemoryStream(<#=variableName + "." + member.PropertyName#>));

When writing values that have fixed/constrained lengths and do not use/require escaping (e.g., when writing a Guid or Double), it is better to stack-allocate a Span<byte>, write the UTF-8 encoded string to the stack-allocated buffer, then use context.Writer.WriteRawValue to write the stack allocated buffer into the JSON buffer.

When writing a memory stream, try to get direct access to the underlying buffer using MemoryStream.TryGetBuffer and if successful (which should be most of the time for most users), write using context.Writer.WriteBase64StringValue.

@peterrsongg
Copy link
Contributor

@bill-poole

Thanks for taking a look. Sorry it took a while to get back to you. This is still actively being worked on, and I took your suggestion of using the ArrayPoolBufferWriter. I had to backport the code from the high performance toolkit library because we can't take a dependency on the community library. Here is that PR. I saw a huge improvement in CPU and memory allocations (all of which I noted in the PR description).

One note on the default buffer size though. I did some experimentation and I definitely do not want to allocate a 250KB buffer as the initial buffer since most payloads will not be 250KB and that would make us allocate potentially unused memory. Maybe if we had some data on what the average payload size was that would help us, but we don't have that information so I left it at the default. I found that increasing this value actually increased byte allocations most of the time

I also thought it was odd that we did a .ToArray() and tried to remove that as well, but unforunately there are certain .NET native APIs that require a byte array (ones we specifically use for SigV4 signing) so we must leave that in there. So even if we attach the ArrayPoolBufferWriter to the request, we would still need to eventually call .ToArray() elsewhere to use the byte array for signing.

I didn't get a chance to look into the Context.Writer.WriteRawValue though, but this is certainly something I could look into, but imo the higher priority was switching over to the ArrayPoolBufferWriter which we are now doing😀 I'm also looking into if we can use RecyclableMemoryStream but haven't gotten there yet.

@bill-poole
Copy link
Author

bill-poole commented Dec 14, 2024

I definitely do not want to allocate a 250KB buffer as the initial buffer since most payloads will not be 250KB and that would make us allocate potentially unused memory.

This shouldn't matter because allocated buffers are returned to the pool after use. i.e., the number of allocated buffers should only be the number of buffers that are used concurrently (i.e., the number of messages actively being sent over the wire concurrently). The ArrayPoolBufferWriter<T> uses the ArrayPool<T>.Shared pool, which allocates a maximum 32 buffers per processor (as per Environment.ProcessorCount) per buffer size. e.g., a 64 processor machine would allocate a maximum of 2,048 buffers, which would be 524 MB, although I suspect the maximum of 32 buffers per processor would very rarely be reached, meaning the actual memory usage would typically be significantly less than 524 MB.

At first, the array pool is empty, so each array requested from the pool is allocated new. However, as arrays are returned to the pool, newly requested arrays are drawn from the pool rather than allocating new arrays. Returning an array to the pool when it is full is a no-op, which leaves the array for the GC to collect.

That being said, it doesn't look like this PR is actually returning the buffers to the pool. i.e., I don't think the ArrayPoolBufferWriter<byte> is being disposed after use. i.e., the ArrayPoolBufferWriter<byte> must be disposed (by calling ArrayPoolBufferWriter<byte>.Dispose) after the message has been sent over the network.

Unforunately there are certain .NET native APIs that require a byte array (ones we specifically use for SigV4 signing) so we must leave that in there.

Any .NET native APIs accepting a byte array should also have an overload accepting a ReadOnlySpan<byte>, which you can get from arrayPoolBufferWriter.WrittenMemory.Span without allocation. Note that there is also an AsStream extension method, which you can invoke on an ArrayPoolBufferWriter<byte> to get a Stream over the buffer, which allows the ArrayPoolBufferWriter<byte> to be used with APIs accepting a Stream. However, streams are inherently less efficient to read than a raw ReadOnlySpan<byte>.

I'm also looking into if we can use RecyclableMemoryStream but haven't gotten there yet.

I think that reading/writing a RecyclableMemoryStream is actually about 2x slower than ArrayPoolBufferWriter<byte>, but has the advantage of being implemented as a "sequence" internally, which means new buffers are appended as the buffer size is exceeded, rather than resizing the buffer by drawing a new buffer from the pool and copying from the old buffer to the new buffer. However, that means that the buffer must be read from a ReadOnlySequence<T> or Stream, rather than a ReadOnlySpan<T> (assuming you want to avoid copying the buffer into array). i.e., all APIs using the buffer would need an overload accepting a ReadOnlySequence<T> or Stream.

@peterrsongg
Copy link
Contributor

This shouldn't matter because allocated buffers are returned to the pool after use. i.e., the number of allocated buffers should only be the number of buffers that are used concurrently (i.e., the number of messages actively being sent over the wire concurrently). The ArrayPoolBufferWriter<T> uses the ArrayPool<T>.Shared pool, which allocates a maximum 32 buffers per processor (as per Environment.ProcessorCount) per buffer size. e.g., a 64 processor machine would allocate a maximum of 2,048 buffers, which would be 524 MB, although I suspect the maximum of 32 buffers per processor would very rarely be reached, meaning the actual memory usage would typically be significantly less than 524 MB.

I'm not sure I completely agree that it "doesn't matter", because it would still mean allocating the maximum possible buffer size for every request, and in a highly concurrent application that is sending small payloads this can cause higher memory allocations even if the buffers are being returned to the pool eventually. And within the context of a lambda function where memory usage matters I wouldn't want to allocate such a high memory for every request. However, given that we disagree on this, maybe this is a good opportunity to make this configurable, like we did for the buffer size in the StreamingUtf8JsonReaders. (We added a config option for this).

That being said, it doesn't look like this PR is actually returning the buffers to the pool. i.e., I don't think the ArrayPoolBufferWriter<byte> is being disposed after use. i.e., the ArrayPoolBufferWriter<byte> must be disposed (by calling ArrayPoolBufferWriter<byte>.Dispose) after the message has been sent over the network.

Hmm, I wrapped the ArrayPoolBufferWriter in a using statement, so it should be disposed. However, it is disposed after the marshalling is done and not when the message has been sent over the network b/c we are calling ToArray() and assigning it to request.Context.

Any .NET native APIs accepting a byte array should also have an overload accepting a ReadOnlySpan<byte>, which you can get from arrayPoolBufferWriter.WrittenMemory.Span without allocation. Note that there is also an AsStream extension method, which you can invoke on an ArrayPoolBufferWriter<byte> to get a Stream over the buffer, which allows the ArrayPoolBufferWriter<byte> to be used with APIs accepting a Stream. However, streams are inherently less efficient to read than a raw ReadOnlySpan<byte>.

Yeah, in theory it should be possible and with more time I could probably do it, but also the fact that ReadOnlySpans ref structs make it difficult to work with and required a lot more refactoring than I had initially thought. I initially though, "oh this should be easy", since ReadOnlyMemoryContent exists and I can assign WrittenMemory to the request's ReadOnlyMemoryContent and we should be good, but then ran into the signing issues and working with ref structs in that context made it much more complicated. I haven't completely abandoned it though.

I think that reading/writing a RecyclableMemoryStream is actually about 2x slower than ArrayPoolBufferWriter<byte>, but has the advantage of being implemented as a "sequence" internally, which means new buffers are appended as the buffer size is exceeded, rather than resizing the buffer by drawing a new buffer from the pool and copying from the old buffer to the new buffer. However, that means that the buffer must be read from a ReadOnlySequence<T> or Stream, rather than a ReadOnlySpan<T> (assuming you want to avoid copying the buffer into array). i.e., all APIs using the buffer would need an overload accepting a ReadOnlySequence<T> or Stream.

Yeah this was a community ask on my GitHub PR, as was the ArrayPoolBufferWriter, but haven't done any perf tests on it. Good to know though!

@bill-poole
Copy link
Author

bill-poole commented Dec 18, 2024

I wrapped the ArrayPoolBufferWriter in a using statement, so it should be disposed. However, it is disposed after the marshalling is done and not when the message has been sent over the network b/c we are calling ToArray() and assigning it to request.Context.

Sorry, I missed the using keyword on the declaration of arrayPoolBufferWriter.

it would still mean allocating the maximum possible buffer size for every request

Yes, but only for the first set of requests, after which the buffers will be drawn from the pool, not allocated.

and in a highly concurrent application that is sending small payloads this can cause higher memory allocations even if the buffers are being returned to the pool eventually.

Yes, that's true. Although, it could be worth testing to determine what the additional memory load would actually be.

However, in the current PR there is no await between the creation of the ArrayPoolBufferWriter<byte> and the copying into an array, which would mean the number of buffers would be constrained to the number of threads in the thread pool, which by default is 2x the number of vCPUs (given by Environment.ProcessorCount). So a 64 vCPU machine would allocate 128 buffers, which would only be 32 MB with a buffer size of 256 kB, which is 0.0125% of memory on a compute-optimized instance.

Yeah, in theory it should be possible and with more time I could probably do it, but also the fact that ReadOnlySpans ref structs make it difficult to work with and required a lot more refactoring than I had initially thought. I initially though, "oh this should be easy", since ReadOnlyMemoryContent exists and I can assign WrittenMemory to the request's ReadOnlyMemoryContent and we should be good, but then ran into the signing issues and working with ref structs in that context made it much more complicated. I haven't completely abandoned it though.

Yes, ReadOnlyMemoryContent doesn't provide a way to access the underlying ReadOnlyMemory<byte>, which therefore makes it impossible to access the ReadOnlySpan<byte>. One approach would be to update the DefaultRequest to directly hold a reference to the ArrayPoolBufferWriter<byte>, make DefaultRequest disposable, and dispose it after the buffer has been sent over the network.

But, that would increase the memory load as stated above because it would increase the number of buffers in the pool beyond the number of threads in the thread pool because the buffers would be held while the buffers are sent over the network.

Therefore, another approach would be to copy the contents of the ArrayPoolBufferWriter<byte> into a buffer allocated from the array pool, rather than using the ArrayPoolBufferWriter<byte>.WrittenMemory.ToArray method to copy the buffer into a newly allocated array. This would limit the number of 256 kB buffers concurrently drawn from the pool to 2x Environment.ProcessorCount as stated above, limiting the buffers held while data is sent over the network to buffers close to the exact size of the messages being sent.

I strongly suggest doing this (if not sending content over the network directly from the ArrayPoolBufferWriter<byte>) because customers sending messages over 85 kB would cause the arrays allocated by arrayPoolBufferWriter.WrittenMemory.ToArray() to be allocated on the large object heap (LOH), which would significantly increase GC pressure.

One way to do this would be to update the DefaultRequest.Content property to be of type ArraySegment<byte> (instead of byte[]), which would be assigned as:

var bufferSize = arrayPoolBufferWriter.WrittenMemory.Length;
request.Content = new ArraySegment<byte>(array: ArrayPool<byte>.Shared.Rent(bufferSize), 
    offset: 0, count: bufferSize);
arrayPoolBufferWriter.WrittenMemory.Span.CopyTo(request.Content.AsSpan());

You can pass a ReadOnlySpan<byte> to the APIs you need for SigV4 signing using the AsSpan extension method of the ArraySegment<byte>.

You would also need to update the DefaultRequest class to be disposable, implementing the Dispose method as:

ArrayPool<byte>.Shared.Return(Content.Array);

And you'd need to invoke the Dispose method after the buffer has been sent over the network.

This would eliminate all heap allocations in the SQS client related to the content buffer, and all but a single memory copy from the 256 kB ArrayPoolBufferWriter<byte> to the exact-sized ArraySegment<byte>.

However, the shape of the SendMessageRequest and SendMessageBatchRequestEntry classes forces the user code to heap allocate the content of each message because the message content must be given by a String object, and String objects cannot be reused/recycled because they are immutable. i.e., it prohibits the user code from writing message content to pooled buffers.

Furthermore, it unnecessarily forces a conversion to UTF-16. i.e., if I am building the content of an SQS message with my own Utf8JsonWriter, then I write the content into a UTF-8 buffer (which I can allocate from the array pool), but then I have to encode that into a UTF-16 encoded String, which the SQS client then converts back to UTF-8 when writing to its internal ArrayPoolBufferWriter<byte>.

Therefore, it would be great if the SendMessageRequest and SendMessageBatchRequestEntry classes could be updated to optionally allow specifying UTF-8 content with a ReadOnlyMemory<byte>. This would eliminate the unnecessary String allocation (which would be on the LOH for any message with content over 85 kB UTF-16) and the unnecessary UTF-16 encoding to the String. It would also eliminate the unnecessary transcoding from UTF-16 to UTF-8 when writing to the ArrayPoolBufferWriter<byte>.

This could be achieved by adding a Utf8MessageBody property (of type ReadOnlyMemory<byte>) to the SendMessageRequest and SendMessageBatchRequestEntry classes and allow users to set either the MessageBody property or the Utf8MessageBody property.

This would then eliminate all large buffer allocations, reduce the number of times the content is copied to two (once from the user-supplied buffer to the 256 kB ArrayPoolBufferWriter<byte>, and once from the ArrayPoolBufferWriter<byte> to the exact-sized ArraySegment<byte>) and eliminate unnecessary transcoding from UTF-8 to UTF-16 and back again.

@peterrsongg
Copy link
Contributor

@bill-poole thank you for your suggestions. give me some time to investigate but I think this is a good starting point

@peterrsongg
Copy link
Contributor

However, in the current PR there is no await between the creation of the ArrayPoolBufferWriter<byte> and the copying into an array, which would mean the number of buffers would be constrained to the number of threads in the thread pool, which by default is 2x the number of vCPUs (given by Environment.ProcessorCount). So a 64 vCPU machine would allocate 128 buffers, which would only be 32 MB with a buffer size of 256 kB, which is 0.0125% of memory on a compute-optimized instance.

Can you expand upon what you mean by there is no await? Everything between the creation of the ArrayPoolBufferWriter and the ToArray() call is not awaitable and the ToArray() call isn't either so I'm just trying to figure out what I'm missing here. The number of buffers being constrained to the number of threads in the thread pool is a good callout, but I'd rather make the buffer size configurable than choose 256 for the user. For your use case maybe 256 makes the most sense, but for another person who is sending small payloads it doesn't make sense to allocate 256.

Yes, ReadOnlyMemoryContent doesn't provide a way to access the underlying ReadOnlyMemory<byte>, which therefore makes it impossible to access the ReadOnlySpan<byte>. One approach would be to update the DefaultRequest to directly hold a reference to the ArrayPoolBufferWriter<byte>, make DefaultRequest disposable, and dispose it after the buffer has been sent over the network.

But, that would increase the memory load as stated above because it would increase the number of buffers in the pool beyond the number of threads in the thread pool because the buffers would be held while the buffers are sent over the network.

Therefore, another approach would be to copy the contents of the ArrayPoolBufferWriter<byte> into a buffer allocated from the array pool, rather than using the ArrayPoolBufferWriter<byte>.WrittenMemory.ToArray method to copy the buffer into a newly allocated array. This would limit the number of 256 kB buffers concurrently drawn from the pool to 2x Environment.ProcessorCount as stated above, limiting the buffers held while data is sent over the network to buffers close to the exact size of the messages being sent.

I strongly suggest doing this (if not sending content over the network directly from the ArrayPoolBufferWriter<byte>) because customers sending messages over 85 kB would cause the arrays allocated by arrayPoolBufferWriter.WrittenMemory.ToArray() to be allocated on the large object heap (LOH), which would significantly increase GC pressure.

One way to do this would be to update the DefaultRequest.Content property to be of type ArraySegment<byte> (instead of byte[]), which would be assigned as:

var bufferSize = arrayPoolBufferWriter.WrittenMemory.Length;
request.Content = new ArraySegment<byte>(array: ArrayPool<byte>.Shared.Rent(bufferSize), 
    offset: 0, count: bufferSize);
arrayPoolBufferWriter.WrittenMemory.Span.CopyTo(request.Content.AsSpan());

You can pass a ReadOnlySpan<byte> to the APIs you need for SigV4 signing using the AsSpan extension method of the ArraySegment<byte>.

You would also need to update the DefaultRequest class to be disposable, implementing the Dispose method as:

ArrayPool<byte>.Shared.Return(Content.Array);

And you'd need to invoke the Dispose method after the buffer has been sent over the network.

This would eliminate all heap allocations in the SQS client related to the content buffer, and all but a single memory copy from the 256 kB ArrayPoolBufferWriter<byte> to the exact-sized ArraySegment<byte>.

great suggestion, I'll look into this.

However, the shape of the SendMessageRequest and SendMessageBatchRequestEntry classes forces the user code to heap allocate the content of each message because the message content must be given by a String object, and String objects cannot be reused/recycled because they are immutable. i.e., it prohibits the user code from writing message content to pooled buffers.

Furthermore, it unnecessarily forces a conversion to UTF-16. i.e., if I am building the content of an SQS message with my own Utf8JsonWriter, then I write the content into a UTF-8 buffer (which I can allocate from the array pool), but then I have to encode that into a UTF-16 encoded String, which the SQS client then converts back to UTF-8 when writing to its internal ArrayPoolBufferWriter<byte>.

Therefore, it would be great if the SendMessageRequest and SendMessageBatchRequestEntry classes could be updated to optionally allow specifying UTF-8 content with a ReadOnlyMemory<byte>. This would eliminate the unnecessary String allocation (which would be on the LOH for any message with content over 85 kB UTF-16) and the unnecessary UTF-16 encoding to the String. It would also eliminate the unnecessary transcoding from UTF-16 to UTF-8 when writing to the ArrayPoolBufferWriter<byte>.

This could be achieved by adding a Utf8MessageBody property (of type ReadOnlyMemory<byte>) to the SendMessageRequest and SendMessageBatchRequestEntry classes and allow users to set either the MessageBody property or the Utf8MessageBody property.

This would then eliminate all large buffer allocations, reduce the number of times the content is copied to two (once from the user-supplied buffer to the 256 kB ArrayPoolBufferWriter<byte>, and once from the ArrayPoolBufferWriter<byte> to the exact-sized ArraySegment<byte>) and eliminate unnecessary transcoding from UTF-8 to UTF-16 and back again.

I think this optimization comes after the request.Content optimization and I cannot gaurantee that this work will happen since it would be a customization for SQS, but I hear you. UTF8 is the encoding of the web and it is definitely a performance hit that we need to convert from utf16 to utf8 to send it to sqs

@danielmarbach
Copy link
Contributor

I'm also looking into if we can use RecyclableMemoryStream but haven't gotten there yet.

FYI RecyclableMemoryStream is not much of value when you end up calling ToArray anyway.

Like I mentioned in #3468 I think it would be great to switch the the content property to ReadonlyMemory or consistently use the ContentStream property and then wrap ROM into a ReadOnlyMemoryStream. As mentioned by Billie using the stream as a common abstraction has implications too though but it might be worth exploring to stream line the content access.

If you make sure to carry forward either ROM or the stream with the proper conditionals you can quite likely almost always call then into framework methods that have never span or ROM based overloads and only copy into byte arrays where it is really required (for example in the .NET Standard path).

And even for the .NET Standard path it is sometimes possible to backport slightly more modern versions of the API surface as long as you are willing to go down the "unsafe" path.

@danielmarbach
Copy link
Contributor

From a library / SDK standpoint it is also always worth considering how much you are willing to give up the safety to boost performance. For example the Azure .NET SDK has several places where the internal buffers are copied before handed out to the users to make sure they don't run into weird issues.

With the newer version of RabbitMQ .NET client we have decided to expose exposed the buffered ROM directly but it comes with a huge caveat that has to be indicated on the XML docs

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/main/projects/RabbitMQ.Client/Events/BasicDeliverEventArgs.cs#L64-L81

The users of the rabbitmq client are mostly fairly advanced though, and this was an acceptable tradeoff to make.

@danielmarbach
Copy link
Contributor

FYI an abstraction that is available in System.Memory is BinaryData that allows treating various "bridging" cases with one uniform data type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request A feature should be added or improved. module/sdk-custom p1 This is a high priority issue queued v4
Projects
None yet
Development

No branches or pull requests