Optimize memory usage by avoiding intermediate buffer in message serialization #928

Zerolzj · 2024-04-17T08:35:55Z

This commit replaces the use of an intermediate buffer in the message serialization process with a direct write-to-buffer approach. The original implementation used MustMarshalBinary() which involved an extra memory copy to an intermediate buffer before writing to the final writeBuffer, leading to high memory consumption for large messages. The new WriteTo function writes message data directly to the writeBuffer, significantly reducing memory overhead and CPU time spent on garbage collection.

…alization This commit replaces the use of an intermediate buffer in the message serialization process with a direct write-to-buffer approach. The original implementation used MustMarshalBinary() which involved an extra memory copy to an intermediate buffer before writing to the final writeBuffer, leading to high memory consumption for large messages. The new WriteTo function writes message data directly to the writeBuffer, significantly reducing memory overhead and CPU time spent on garbage collection.

Zerolzj · 2024-04-17T08:41:11Z

Optimization Documentation Summary

Experimental Environment:

System:
- Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Seed Pod Configuration:
- CPU Limit: 1 core
- CPU Request: 1 core
Transfer Setup: 1-to-1 data transmission ; Piece Size 4MiB
Observation Metrics: CPU and memory usage of the uploading Pod (Seeder)

CPU Usage Comparison Table

Component	Pre-Optimization CPU Usage	Post-Optimization CPU Usage
peerRequestDataReader	18.2%	33.5%
messageWriterRunner	20.7%	12.2%
gc (Garbage Collection)	51.4%	37.6%
mcall	7.4%	15.1%

Memory Allocation and Transfer Rate Comparison Table

Metric	Pre-Optimization	Post-Optimization
Memory Allocated	160GB	88GB
Measurement Interval	25 seconds	25 seconds
Peak Transfer Rate	120MB/s	185MB/s

Based on the aforementioned experimental environment and test results, the following conclusions can be drawn regarding the improvements:

The increased CPU usage of peerRequestDataReader not only reflects more efficient data handling capabilities, but also indicates that more CPU resources can now be allocated to reading data as opposed to being consumed by garbage collection processes.
The reduced CPU consumption for messageWriterRunner indicates a minimization of resource overhead during message serialization.
A significant reduction in CPU usage for garbage collection demonstrates enhanced memory allocation and recovery efficiency.
The reduction in memory usage from 160GB to 88GB highlights a more effective use of memory resources, contributing to a lower overall system load.
A substantial improvement in peak transfer rates, from 120MB/s to 185MB/s, underscores a notable enhancement in overall data transfer performance post-optimization.

In summary, the experimental results indicate that the implemented optimizations have succeeded in reducing memory consumption while simultaneously boosting data transmission performance. This represents a significant performance optimization within the BitTorrent client. Future efforts will focus on further optimization to ensure continued enhancement of speed while further reducing resource consumption.

Zerolzj · 2024-04-17T09:03:51Z

Attachment:

Performance Analysis and Upload Throughput Imagery Comparison Table

Metric	Before Optimization	After Optimization
CPU pprof Image
MEM pprof Image
Upload Bps Image

anacrolix · 2024-04-19T05:40:21Z

Could you include a benchmark for peerConnMsgWriter.write?

I suspect that precomputing the message length, and growing the buffer manually aren't necessary. I would expect that the majority of the gains would come just from writing directly into the buffer, which should retain its size and underlying storage everytime the buffers are swapped in the writer routine.

Zerolzj · 2024-04-19T06:58:00Z

Could you include a benchmark for peerConnMsgWriter.write?

I suspect that precomputing the message length, and growing the buffer manually aren't necessary. I would expect that the majority of the gains would come just from writing directly into the buffer, which should retain its size and underlying storage everytime the buffers are swapped in the writer routine.

Thanks for your feedback. I'll update the community when I have some benchmark results to share, and conduct a detailed analysis to assess the benefits of precomputing the message length.

anacrolix · 2024-04-21T03:58:43Z

Thanks!

Zerolzj · 2024-04-22T06:20:37Z

Benchmarking Report

Performed on a macOS system equipped with an Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz and utilizing an amd64 architecture.

Benchmark Results

Below is a tabulated summary showing how each function performs with various piece lengths:

PieceLength	Function	Iterations	Time (ns/op)	Memory (B/op)	Allocations (allocs/op)
1M	BenchmarkWriteToBuffer1M	4323	263913	988	4
	BenchmarkMarshalBinaryWrite1M	1980	598797	2114204	7
4M	BenchmarkWriteToBuffer4M	4534	276714	942	4
	BenchmarkMarshalBinaryWrite4M	529	2208230	8413078	7
8M	BenchmarkWriteToBuffer8M	4130	274931	1033	4
	BenchmarkMarshalBinaryWrite8M	240	4268217	16828753	7

Zerolzj · 2024-04-22T06:25:09Z

Could you include a benchmark for peerConnMsgWriter.write?

I suspect that precomputing the message length, and growing the buffer manually aren't necessary. I would expect that the majority of the gains would come just from writing directly into the buffer, which should retain its size and underlying storage everytime the buffers are swapped in the writer routine.

The new commit has removed the code for manually growing the buffer. The benefit of handling the buffer growth manually appears to be marginal.

anacrolix · 2024-04-24T08:33:18Z

Thanks, I'll check it out soon.

anacrolix · 2024-04-25T04:27:10Z

I made some tweaks to the benchmarks to make them more consistent. In particular, one form was always using the 4M length. I also added the most common size actually in use (16KiB, the default block size).

I think a few types have gone missing in the changes, I'll fix those up too and likely merge shortly.

anacrolix · 2024-04-25T04:31:03Z

The start and stop of the timer actually caused issues with the timing, as the reset is too small to have any effect on the benchmark.

anacrolix · 2024-04-25T05:18:34Z

I reduced the changes a bit to reuse the old payload writing code, since HashRequest was missing.

The MarshalBinary performance is doubled, but the write to buffer stuff is not quite as fast (but still 16x faster than it was).

I'll merge what I have so far, but if you want to make tweaks to reduce the allocations further to get the extra 50% back, I think the code base is in a better position to take those changes, and it will be clearer what is being done to get that.

I should add the allocated space is massively reduced. Sorry if the changes were pretty heavy handed, but I think restructuring the write to buffer, and then doing smaller cleanups in a separate change is less risky.

Zerolzj · 2024-04-26T01:20:00Z

I reduced the changes a bit to reuse the old payload writing code, since HashRequest was missing.

The MarshalBinary performance is doubled, but the write to buffer stuff is not quite as fast (but still 16x faster than it was).

I'll merge what I have so far, but if you want to make tweaks to reduce the allocations further to get the extra 50% back, I think the code base is in a better position to take those changes, and it will be clearer what is being done to get that.

I should add the allocated space is massively reduced. Sorry if the changes were pretty heavy handed, but I think restructuring the write to buffer, and then doing smaller cleanups in a separate change is less risky.

I apologize for the oversight with the HashRequest; it was unintentional and I regret any inconvenience caused. Also, I'm grateful for your thorough work on the updates. Collaborating with you has been incredibly enlightening, and I look forward to doing so again.

luozhengjie.lzj added 2 commits April 22, 2024 11:54

add benchmark for write

1c14f45

benchmark for 1M/4M/8M

0aea51f

Tidy up new benchmarks

3b1eaa9

Maintain older payload write implementation

25b5a4f

anacrolix merged commit 3f5ef0b into anacrolix:master Apr 25, 2024
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize memory usage by avoiding intermediate buffer in message serialization #928

Optimize memory usage by avoiding intermediate buffer in message serialization #928

Zerolzj commented Apr 17, 2024

Zerolzj commented Apr 17, 2024 •

edited

Loading

Zerolzj commented Apr 17, 2024

anacrolix commented Apr 19, 2024

Zerolzj commented Apr 19, 2024

anacrolix commented Apr 21, 2024

Zerolzj commented Apr 22, 2024

Zerolzj commented Apr 22, 2024

anacrolix commented Apr 24, 2024

anacrolix commented Apr 25, 2024

anacrolix commented Apr 25, 2024

anacrolix commented Apr 25, 2024 •

edited

Loading

Zerolzj commented Apr 26, 2024

Optimize memory usage by avoiding intermediate buffer in message serialization #928

Optimize memory usage by avoiding intermediate buffer in message serialization #928

Conversation

Zerolzj commented Apr 17, 2024

Zerolzj commented Apr 17, 2024 • edited Loading

Optimization Documentation Summary

Zerolzj commented Apr 17, 2024

anacrolix commented Apr 19, 2024

Zerolzj commented Apr 19, 2024

anacrolix commented Apr 21, 2024

Zerolzj commented Apr 22, 2024

Benchmarking Report

Benchmark Results

Zerolzj commented Apr 22, 2024

anacrolix commented Apr 24, 2024

anacrolix commented Apr 25, 2024

anacrolix commented Apr 25, 2024

anacrolix commented Apr 25, 2024 • edited Loading

Zerolzj commented Apr 26, 2024

Zerolzj commented Apr 17, 2024 •

edited

Loading

anacrolix commented Apr 25, 2024 •

edited

Loading