Conversation

@arjan-bal arjan-bal (Contributor) commented Oct 22, 2025

This PR removes two buffer copies made while writing data frames to the underlying net.Conn: one within gRPC and one in the framer. Care is taken to avoid extra heap allocations, which could hurt performance for smaller payloads.

A CL that allows using the framer to write frame headers is out for review; as a temporary workaround, this PR duplicates the header-writing code. This PR will be merged only after that CL is merged.
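For context, a minimal sketch of the idea (my own illustration, not the PR's diff): the frame header is assembled on the stack and the payload is handed to the kernel as a single vectored write, straight from the reader's buffers. The peeker interface, writeDataFrame, and the scratch parameter are hypothetical stand-ins for the PR's actual types; only the Peek(n, buf) shape mirrors the reader.Peek call quoted in the review below.

	package transport

	import (
		"encoding/binary"
		"net"
	)

	// peeker abstracts the one method this sketch needs; in the PR, the mem
	// package's reader gains a similar Peek that returns views, not copies.
	type peeker interface {
		Peek(n int, buf [][]byte) ([][]byte, error)
	}

	// writeDataFrame is a hypothetical sketch of the copy-free write path.
	func writeDataFrame(conn net.Conn, streamID uint32, endStream bool, r peeker, dSize int, scratch [][]byte) ([][]byte, error) {
		var hdr [9]byte // HTTP/2 frame header: 24-bit length, type, flags, stream ID
		hdr[0], hdr[1], hdr[2] = byte(dSize>>16), byte(dSize>>8), byte(dSize)
		hdr[3] = 0x0 // DATA frame
		if endStream {
			hdr[4] = 0x1 // END_STREAM flag
		}
		binary.BigEndian.PutUint32(hdr[5:9], streamID)

		// Peek returns views into the reader's buffers without copying,
		// reusing scratch to avoid a per-frame allocation.
		bufs, err := r.Peek(dSize, scratch[:0])
		if err != nil {
			return scratch, err
		}
		iov := append(net.Buffers{hdr[:]}, bufs...)
		_, err = iov.WriteTo(conn) // vectored write; no copy into an intermediate buffer
		clear(bufs)                // drop references so the payload buffers can be GCed
		return bufs[:0], err
	}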

Results

Small payloads

Performance for small payloads improves slightly due to the removal of a deferred statement.

$ go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary \
   -compression=off -maxConcurrentCalls=120 -trace=off \
   -reqSizeBytes=100 -respSizeBytes=100 -networkMode=Local -resultFile="${RUN_NAME}"

$ go run benchmark/benchresult/main.go unary-before unary-after
               Title       Before        After Percentage
            TotalOps      7600878      7653522     0.69%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     10007.07     10000.89    -0.07%
           Allocs/op       146.93       146.91     0.00%
             ReqT/op 101345040.00 102046960.00     0.69%
            RespT/op 101345040.00 102046960.00     0.69%
            50th-Lat    833.724µs    830.041µs    -0.44%
            90th-Lat   1.281969ms   1.275336ms    -0.52%
            99th-Lat   2.403961ms   2.360606ms    -1.80%
             Avg-Lat    946.123µs    939.734µs    -0.68%
           GoVersion     go1.24.8     go1.24.8
         GrpcVersion   1.77.0-dev   1.77.0-dev

Large payloads

Local benchmarks show a ~5-10% regression with 1 MB payloads on my dev machine. Profiles show increased time spent in the copy operation inside the buffered writer. Counterintuitively, copying the gRPC header and message data into a larger buffer increased performance by 4% (compared to master).

To validate this behaviour (an extra copy improving performance), I ran the k8s benchmark with 1 MB payloads and 100 concurrent streams; across multiple runs it showed a ~5% increase in QPS without the copies. Adding a copy back reduced performance.

Load test config file: loadtest.yaml

# 30 core client and server
Before
QPS: 498.284 (16.6095/server core)
Latencies (50/90/95/99/99.9%-ile): 233256/275972/281250/291803/298533 us
Server system time: 93.0164
Server user time:   142.533
Client system time: 97.2688
Client user time:   144.542

After
QPS: 526.776 (17.5592/server core)
Latencies (50/90/95/99/99.9%-ile): 211010/263189/270969/280656/288828 us
Server system time: 96.5959
Server user time:   147.668
Client system time: 101.973
Client user time:   150.234

# 8 core client and server
Before
QPS: 291.049 (36.3811/server core)
Latencies (50/90/95/99/99.9%-ile): 294552/685822/903554/1.48399e+06/1.50757e+06 us
Server system time: 49.0355
Server user time:   87.1783
Client system time: 60.1945
Client user time:   103.633

After
QPS: 334.119 (41.7649/server core)
Latencies (50/90/95/99/99.9%-ile): 279395/518849/706327/1.09273e+06/1.11629e+06 us
Server system time: 69.3136
Server user time:   102.549
Client system time: 80.9804
Client user time:   107.103

RELEASE NOTES:

  • transport: Avoid two buffer copies when writing data.

@arjan-bal arjan-bal added this to the 1.77 Release milestone Oct 22, 2025
@arjan-bal arjan-bal added Type: Performance Performance improvements (CPU, network, memory, etc) Area: Transport Includes HTTP/2 client/server and HTTP server handler transports and advanced transport features. labels Oct 22, 2025
codecov bot commented Oct 22, 2025

Codecov Report

❌ Patch coverage is 71.62162% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.85%. Comparing base (254ab10) to head (bfee28b).
⚠️ Report is 9 commits behind head on master.

Files with missing lines           Patch %   Lines
internal/transport/controlbuf.go   42.85%    5 missing, 7 partials ⚠️
internal/transport/http_util.go    69.56%    2 missing, 5 partials ⚠️
mem/buffer_slice.go                93.33%    1 missing, 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8667      +/-   ##
==========================================
- Coverage   83.20%   82.85%   -0.36%     
==========================================
  Files         417      415       -2     
  Lines       32308    32200     -108     
==========================================
- Hits        26883    26678     -205     
+ Misses       4057     4033      -24     
- Partials     1368     1489     +121     
Files with missing lines           Coverage           Δ
mem/buffer_slice.go                80.14% <93.33%>    (-17.16%) ⬇️
internal/transport/http_util.go    90.51% <69.56%>    (-4.70%) ⬇️
internal/transport/controlbuf.go   71.31% <42.85%>    (-18.94%) ⬇️

... and 31 files with indirect coverage changes


@easwars easwars assigned arjan-bal and unassigned easwars and dfawley Oct 28, 2025
@arjan-bal arjan-bal assigned easwars and unassigned arjan-bal Oct 28, 2025
@easwars easwars assigned dfawley and unassigned easwars Oct 28, 2025
@easwars easwars (Contributor) commented Oct 28, 2025

@dfawley : Again moving to your plate if you feel like having a second look.

@arjan-bal arjan-bal modified the milestones: 1.77 Release, 1.78 Release Oct 30, 2025
	}
	if dSize > 0 {
		var err error
		l.writeBuf, err = reader.Peek(dSize, l.writeBuf)
@dfawley dfawley (Member) commented Oct 30, 2025


It seems like this buffer can only grow and never shrinks.

  1. What happens if a slice holds a pointer to a huge amount of data? I believe it isn't possible to free it, but am not certain. E.g.

     l.writeBuf = [][]byte{nil, nil, nil, nil, nil, nil, make([]byte, 10<<30 /* 10 GiB */)}
     l.writeBuf = l.writeBuf[:0]

  2. What happens if cap(l.writeBuf) grows to a large value and then we never need it to be that large ever again?

I think we need to have some way to scale this buffer back down.
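
As a self-contained illustration of point 1 (my own example, not from the PR): re-slicing to zero length keeps the backing array, and every []byte header stored in it, reachable.

	package main

	import "runtime"

	func main() {
		buf := [][]byte{make([]byte, 1<<20)} // scratch slice pins a 1 MiB payload
		buf = buf[:0]                        // len is 0, but the backing array still
		                                     // points at the payload, so it cannot be GCed
		runtime.GC()                         // the 1 MiB allocation survives this cycle
		clear(buf[:cap(buf)])                // zero the full capacity to drop the reference
		runtime.GC()                         // now the payload is collectable
		_ = buf
	}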

@arjan-bal arjan-bal (Contributor, Author) commented

For point 1, I've updated the code to clear the buffer after calling Write. This releases references to all the slices and allows them to be GCed.

With respect to point 2, I've now set a limit of 64 on the buffer's length. If a buffer is longer than that, it's immediately freed after use instead of being cached.

Background on the 64-element limit: The BufferSlice from the proto codec is 1 element. With a potential gRPC header, the length is almost always 2. While custom codecs might produce larger slices, 64 is a generous limit that covers common cases without caching excessive memory.

This change also mitigates a worst-case memory scenario. Since Peek() filters out empty slices, a 16KB http2 Data frame (the max size) could theoretically be split into 16K (16,384) distinct 1-byte slices. In that case, the memory overhead for the slice headers alone would be 24 bytes * 16,384 ≈ 393KB; with the 64-element limit, the maximum held memory is approx. 1.5KB. Also note that the framer already has a data buffer that grows up to 16KB, and after this change that buffer should no longer be used for Data frames.
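
A minimal sketch of the recycling pattern described above (my own illustration; recycleWriteBuf and maxWriteBufLen are hypothetical names, with 64 matching the limit discussed):

	// maxWriteBufLen caps how many slice headers are cached between writes.
	const maxWriteBufLen = 64

	// recycleWriteBuf clears payload references after a Write so the underlying
	// buffers can be garbage collected, and drops oversized scratch slices
	// instead of caching them.
	func recycleWriteBuf(buf [][]byte) [][]byte {
		clear(buf) // release references to every payload slice
		if cap(buf) > maxWriteBufLen {
			return nil // unusually large; let the GC reclaim it
		}
		return buf[:0] // keep the small backing array for reuse
	}

Under these assumptions, the loopy writer would call l.writeBuf = recycleWriteBuf(l.writeBuf) after each Write.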

@dfawley dfawley assigned arjan-bal and unassigned dfawley Oct 30, 2025
Comment on lines 454 to 456
	if len(d) == 0 {
		continue
	}
@dfawley dfawley (Member) commented

Would this be a bug if it were zero? I would have expected it to be.

If it is, then we should delete this check; Write should already handle a zero-length buffer as a no-op anyway.

@arjan-bal arjan-bal (Contributor, Author) commented

Removed. There should not be any empty buffers in the list, since Peek() filters them out. This was an artifact from the time I spent root-causing unexpected behavior in the local benchmarks with large payloads.

@arjan-bal arjan-bal assigned dfawley and unassigned arjan-bal Oct 30, 2025
Comment on lines 1022 to 1025
	// This must never happen since the reader must have at least dSize
	// bytes.
	clear(l.writeBuf)
	l.writeBuf = nil
@dfawley dfawley (Member) commented

If this is impossible then:

  1. logger.Error seems like a good idea, unless the caller already does that with what we return.
  2. We probably don't need to bother with the clear/nil (and surely don't want to do both)?

@dfawley dfawley assigned arjan-bal and unassigned dfawley Oct 31, 2025
