
Optimize memory usage by avoiding intermediate buffer in message serialization #928

Merged: 5 commits into anacrolix:master on Apr 25, 2024

Conversation

@Zerolzj (Contributor) commented Apr 17, 2024

This commit replaces the use of an intermediate buffer in the message serialization process with a direct write-to-buffer approach. The original implementation used MustMarshalBinary() which involved an extra memory copy to an intermediate buffer before writing to the final writeBuffer, leading to high memory consumption for large messages. The new WriteTo function writes message data directly to the writeBuffer, significantly reducing memory overhead and CPU time spent on garbage collection.

@Zerolzj (Contributor, Author) commented Apr 17, 2024

Optimization Documentation Summary

Experimental Environment:

- CPU: Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
- Seed Pod Configuration: CPU limit 1 core, CPU request 1 core
- Transfer Setup: 1-to-1 data transmission; piece size 4 MiB
- Observation Metrics: CPU and memory usage of the uploading Pod (seeder)

CPU Usage Comparison Table

| Component | Pre-Optimization CPU Usage | Post-Optimization CPU Usage |
|---|---|---|
| peerRequestDataReader | 18.2% | 33.5% |
| messageWriterRunner | 20.7% | 12.2% |
| gc (garbage collection) | 51.4% | 37.6% |
| mcall | 7.4% | 15.1% |

Memory Allocation and Transfer Rate Comparison Table

| Metric | Pre-Optimization | Post-Optimization |
|---|---|---|
| Memory Allocated | 160GB | 88GB |
| Measurement Interval | 25 seconds | 25 seconds |
| Peak Transfer Rate | 120MB/s | 185MB/s |

Based on the aforementioned experimental environment and test results, the following conclusions can be drawn regarding the improvements:

  1. The increased CPU usage of peerRequestDataReader reflects higher data-handling throughput: CPU time freed from garbage collection can now be spent reading data.

  2. The reduced CPU consumption for messageWriterRunner indicates a minimization of resource overhead during message serialization.

  3. A significant reduction in CPU usage for garbage collection demonstrates enhanced memory allocation and recovery efficiency.

  4. The reduction in memory allocated over the 25-second interval, from 160GB to 88GB, highlights far lower allocation pressure and a lower overall system load.

  5. A substantial improvement in peak transfer rates, from 120MB/s to 185MB/s, underscores a notable enhancement in overall data transfer performance post-optimization.

In summary, the experimental results indicate that the implemented optimizations have succeeded in reducing memory consumption while simultaneously boosting data transmission performance. This represents a significant performance optimization within the BitTorrent client. Future efforts will focus on further optimization to ensure continued enhancement of speed while further reducing resource consumption.

@Zerolzj (Contributor, Author) commented Apr 17, 2024

Attachment:

Performance Analysis and Upload Throughput Imagery Comparison Table

| Metric | Before Optimization | After Optimization |
|---|---|---|
| CPU pprof image | Before_Optimization_CPU_pprof_Image | After_Optimization_CPU_pprof_Image |
| MEM pprof image | Before_Optimization_MEM_pprof_Image | After_Optimization_MEM_pprof_Image |
| Upload Bps image | Before_Optimization_Upload_bps_Image | After_Optimization_Upload_bps_Image |

@anacrolix (Owner) commented:

Could you include a benchmark for peerConnMsgWriter.write?

I suspect that precomputing the message length, and growing the buffer manually, aren't necessary. I would expect that the majority of the gains come just from writing directly into the buffer, which should retain its size and underlying storage every time the buffers are swapped in the writer routine.

@Zerolzj (Contributor, Author) commented Apr 19, 2024

> Could you include a benchmark for peerConnMsgWriter.write?
>
> I suspect that precomputing the message length, and growing the buffer manually, aren't necessary. I would expect that the majority of the gains come just from writing directly into the buffer, which should retain its size and underlying storage every time the buffers are swapped in the writer routine.

Thanks for your feedback. I'll update the community when I have some benchmark results to share, and conduct a detailed analysis to assess the benefits of precomputing the message length.

@anacrolix (Owner) commented:

Thanks!

@Zerolzj (Contributor, Author) commented Apr 22, 2024

Benchmarking Report

Benchmarks were performed on a macOS system with an Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz (amd64).

Benchmark Results

Below is a tabulated summary showing how each function performs with various piece lengths:

| PieceLength | Function | Iterations | Time (ns/op) | Memory (B/op) | Allocations (allocs/op) |
|---|---|---|---|---|---|
| 1M | BenchmarkWriteToBuffer1M | 4323 | 263913 | 988 | 4 |
| 1M | BenchmarkMarshalBinaryWrite1M | 1980 | 598797 | 2114204 | 7 |
| 4M | BenchmarkWriteToBuffer4M | 4534 | 276714 | 942 | 4 |
| 4M | BenchmarkMarshalBinaryWrite4M | 529 | 2208230 | 8413078 | 7 |
| 8M | BenchmarkWriteToBuffer8M | 4130 | 274931 | 1033 | 4 |
| 8M | BenchmarkMarshalBinaryWrite8M | 240 | 4268217 | 16828753 | 7 |

@Zerolzj (Contributor, Author) commented Apr 22, 2024

> Could you include a benchmark for peerConnMsgWriter.write?
>
> I suspect that precomputing the message length, and growing the buffer manually, aren't necessary. I would expect that the majority of the gains come just from writing directly into the buffer, which should retain its size and underlying storage every time the buffers are swapped in the writer routine.

The new commit has removed the code for manually growing the buffer. The benefit of handling the buffer growth manually appears to be marginal.

@anacrolix (Owner) commented:

Thanks, I'll check it out soon.

@anacrolix (Owner) commented:

I made some tweaks to the benchmarks to make them more consistent. In particular, one form was always using the 4M length. I also added the most common size actually in use (16KiB, the default block size).

I think a few types have gone missing in the changes, I'll fix those up too and likely merge shortly.

@anacrolix (Owner) commented:

The start and stop of the timer actually caused issues with the timing, as the reset is too small to have any effect on the benchmark.

@anacrolix (Owner) commented Apr 25, 2024

I reduced the changes a bit to reuse the old payload writing code, since HashRequest was missing.

The MarshalBinary performance is doubled, but the write to buffer stuff is not quite as fast (but still 16x faster than it was).

I'll merge what I have so far, but if you want to make tweaks to reduce the allocations further to get the extra 50% back, I think the code base is in a better position to take those changes, and it will be clearer what is being done to get that.

I should add that the allocated space is massively reduced. Sorry if the changes were pretty heavy-handed, but I think restructuring the write-to-buffer path first, and then doing smaller cleanups in a separate change, is less risky.

@anacrolix anacrolix merged commit 3f5ef0b into anacrolix:master Apr 25, 2024
10 checks passed
@Zerolzj (Contributor, Author) commented Apr 26, 2024

> I reduced the changes a bit to reuse the old payload writing code, since HashRequest was missing.
>
> The MarshalBinary performance is doubled, but the write to buffer stuff is not quite as fast (but still 16x faster than it was).
>
> I'll merge what I have so far, but if you want to make tweaks to reduce the allocations further to get the extra 50% back, I think the code base is in a better position to take those changes, and it will be clearer what is being done to get that.
>
> I should add that the allocated space is massively reduced. Sorry if the changes were pretty heavy-handed, but I think restructuring the write-to-buffer path first, and then doing smaller cleanups in a separate change, is less risky.

I apologize for the oversight with the HashRequest; it was unintentional and I regret any inconvenience caused. Also, I'm grateful for your thorough work on the updates. Collaborating with you has been incredibly enlightening, and I look forward to doing so again.
