Contributor

@arjan-bal arjan-bal commented Oct 2, 2025

The pprof profiles for unary RPC benchmarks show significant time spent in runtime.mallocgc and runtime.gcBgMarkWorker, which indicates that gRPC spends a significant share of its CPU cycles allocating memory and garbage collecting.

This change reduces the number of pointer fields in the structs that represent the client and server streams. Storing these fields by value reduces the number of memory allocations and lowers pressure on the garbage collector, since the GC doesn't need to scan non-pointer fields. For structs that were stored as pointers only to ensure their values are not copied, a noCopy struct is embedded instead, which causes go vet to fail if a copy is made. Non-pointer fields are also moved to the end of the struct to improve allocation speed.
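
To illustrate the pattern described above, here is a minimal sketch rather than the code from this change. The types quota, streamWithPointer and streamInline are hypothetical, and the noCopy marker is modeled on the pattern used in the Go standard library: a zero-size type with Lock and Unlock methods, which the copylocks analyzer in go vet treats as "must not be copied".

```go
package main

import "fmt"

// noCopy is a zero-size marker. Because it has Lock and Unlock methods on a
// pointer receiver, the copylocks analyzer in `go vet` reports by-value
// copies of any struct that contains it.
type noCopy struct{}

func (*noCopy) Lock()   {}
func (*noCopy) Unlock() {}

// quota is a hypothetical helper that must not be copied after first use.
type quota struct {
	_         noCopy
	remaining int32
}

// streamWithPointer stores the helper behind a pointer: a second heap object
// and one more pointer for the GC to follow.
type streamWithPointer struct {
	q  *quota
	id uint32
}

// streamInline stores the helper by value: it shares the allocation of the
// enclosing struct and contributes no pointer field.
type streamInline struct {
	q  quota
	id uint32
}

func newStreamWithPointer(id uint32) *streamWithPointer {
	return &streamWithPointer{id: id, q: &quota{remaining: 100}}
}

func newStreamInline(id uint32) *streamInline {
	s := &streamInline{id: id}
	s.q.remaining = 100 // set fields in place; no copy of quota is made
	return s
}

func main() {
	p := newStreamWithPointer(1)
	v := newStreamInline(2)
	fmt.Println(p.q.remaining, v.q.remaining)
	// Writing `bad := v.q` here would copy the noCopy marker, which is the
	// kind of copy `go vet` is expected to flag.
}
```

The inline variant trades pointer-based copy protection for a static check, so it relies on go vet being part of the build or CI pipeline.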

Results

There are improvements in QPS, latency and allocs/op for unary RPCs.

# test command
go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary \
   -compression=off -maxConcurrentCalls=500 -trace=off \
   -reqSizeBytes=100 -respSizeBytes=100 -networkMode=Local -resultFile="${RUN_NAME}" -recvBufferPool=simple

go run benchmark/benchresult/main.go unary-before unary-after       
               Title       Before        After Percentage
            TotalOps      7690250      7991877     3.92%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     10218.14     10084.00    -1.31%
           Allocs/op       164.85       151.85    -7.89%
             ReqT/op 102536666.67 106558360.00     3.92%
            RespT/op 102536666.67 106558360.00     3.92%
            50th-Lat    3.57283ms   3.435143ms    -3.85%
            90th-Lat   5.152403ms   4.979906ms    -3.35%
            99th-Lat   5.985282ms   5.827893ms    -2.63%
             Avg-Lat    3.89872ms   3.750449ms    -3.80%
           GoVersion     go1.24.4     go1.24.4
         GrpcVersion   1.77.0-dev   1.77.0-dev

Resources

  • go/go/performance?polyglot=open-source#application-spends-too-much-on-gc-or-allocations
  • go/go/performance?polyglot=open-source#memory-optimizations

RELEASE NOTES:

  • transport: Reduce pointer usage to lower garbage collection pressure and improve unary RPC performance.

@arjan-bal arjan-bal added the Type: Performance Performance improvements (CPU, network, memory, etc) label Oct 2, 2025
@arjan-bal arjan-bal added this to the 1.77 Release milestone Oct 2, 2025
@arjan-bal arjan-bal changed the title from "transport: Reduce the use of pointer fields in Stream structs" to "transport: Reduce pointer usage in Stream structs" Oct 2, 2025

codecov bot commented Oct 2, 2025

Codecov Report

❌ Patch coverage is 87.17949% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.01%. Comparing base (d0ebcdf) to head (5ac363c).
⚠️ Report is 6 commits behind head on master.

Files with missing lines           Patch %   Lines
internal/transport/transport.go    50.00%    2 Missing ⚠️
rpc_util.go                        0.00%     2 Missing ⚠️
stream.go                          87.50%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8624      +/-   ##
==========================================
- Coverage   82.12%   82.01%   -0.12%     
==========================================
  Files         415      415              
  Lines       40701    40709       +8     
==========================================
- Hits        33425    33386      -39     
- Misses       5895     5937      +42     
- Partials     1381     1386       +5     
Files with missing lines                 Coverage Δ
internal/transport/client_stream.go      100.00% <ø> (ø)
internal/transport/flowcontrol.go        96.39% <100.00%> (-0.10%) ⬇️
internal/transport/handler_server.go     90.84% <100.00%> (+0.03%) ⬆️
internal/transport/http2_client.go       91.93% <100.00%> (-0.30%) ⬇️
internal/transport/http2_server.go       90.86% <100.00%> (-0.01%) ⬇️
internal/transport/server_stream.go      95.31% <ø> (ø)
server.go                                82.00% <100.00%> (+0.54%) ⬆️
stream.go                                81.57% <87.50%> (-0.25%) ⬇️
internal/transport/transport.go          83.87% <50.00%> (-0.87%) ⬇️
rpc_util.go                              82.09% <0.00%> (-0.43%) ⬇️

... and 28 files with indirect coverage changes

@arjan-bal arjan-bal force-pushed the optimize-heap-allocs branch from a44546b to 4ebd663 Compare October 2, 2025 18:26
@arjan-bal arjan-bal force-pushed the optimize-heap-allocs branch from 4ebd663 to 42b1067 Compare October 2, 2025 18:39
@arjan-bal arjan-bal requested review from easwars and dfawley October 2, 2025 18:48
@arjan-bal arjan-bal added the Area: Transport Includes HTTP/2 client/server and HTTP server handler transports and advanced transport features. label Oct 2, 2025
- func newWriteQuota(sz int32, done <-chan struct{}) *writeQuota {
- 	w := &writeQuota{
+ func initWriteQuota(wq *writeQuota, sz int32, done <-chan struct{}) {
+ 	*wq = writeQuota{
Contributor

This syntax does feel a little weird to me. Did you try some of these options to see if they read better (and don't perform worse)?

  • Can we directly set the fields of wq in here instead of setting wq to a completely new instance of writeQuota?
  • Can this initWriteQuota be a method on Stream?
  • Can we make the zero value of writeQuota something that can actually work?

This comment also applies to other types, such as recvBuffer.

Thanks.

Contributor Author

I refactored the function into a method with a pointer receiver and replaced the single struct literal assignment with individual field assignments. I also added a godoc comment to explain that this initialization pattern is used to avoid heap allocations. Individual field assignment may be slightly faster than the struct assignment, as it avoids allocating an intermediate object on the stack before copying it over.

I defined the method on the writeQuota struct itself, rather than on the Stream struct. This keeps the initialization logic co-located with its type and reduces coupling with Stream.
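
For readers following along, the shape described above might look roughly like the sketch below. This is an illustration, not the code from the PR: the field set of writeQuota is simplified, and the stream type, the newStream helper, and the 64 KiB quota are hypothetical.

```go
package main

import "fmt"

// writeQuota is a simplified stand-in for the real type; the fields shown
// here are assumptions made for illustration.
type writeQuota struct {
	ch    chan struct{}   // signals that quota has become available
	done  <-chan struct{} // closed when the stream is finished
	quota int32
}

// init initializes w in place. It is a method rather than a constructor that
// returns *writeQuota so that the value can live inside a parent struct and
// share that struct's allocation instead of needing its own.
func (w *writeQuota) init(sz int32, done <-chan struct{}) {
	// Assign fields one by one instead of `*w = writeQuota{...}`.
	w.ch = make(chan struct{}, 1)
	w.done = done
	w.quota = sz
}

// stream is a hypothetical parent type that holds writeQuota by value.
type stream struct {
	wq writeQuota
	id uint32
}

func newStream(id uint32, done <-chan struct{}) *stream {
	s := &stream{id: id}
	s.wq.init(64*1024, done) // hypothetical initial quota
	return s
}

func main() {
	done := make(chan struct{})
	s := newStream(1, done)
	fmt.Println(s.wq.quota, cap(s.wq.ch))
}
```

Keeping init on writeQuota means the parent only needs to know that the field must be initialized before use, not how.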

@easwars easwars assigned arjan-bal and unassigned easwars Oct 3, 2025
@arjan-bal arjan-bal assigned easwars and dfawley and unassigned dfawley and arjan-bal Oct 6, 2025
Contributor

@easwars easwars left a comment

How do we ensure that we don't have pointer fields inside structs (as much as possible/feasible/required) going forward? At least personally, if a struct field's type has pointer-receiver methods, I store that field as a pointer by default. And if I saw someone else doing the same in a code review, I would not even notice it, because the habit is so ingrained in me.

Some of the things mentioned in the docs you shared were very useful. But I'm wondering how we can make that knowledge more accessible to everyone on the team and ensure we keep these things in mind when writing and reviewing code.

bytesReceived atomic.Bool // indicates whether any bytes have been received on this stream
unprocessed atomic.Bool // set if the server sends a refused stream or GOAWAY including this stream

status *status.Status // the status error received from the server
Contributor

Should we move this to the top as well? The link you sent initially also said that grouping pointer fields at the top of the struct helps.

@easwars easwars assigned arjan-bal and unassigned easwars Oct 6, 2025
Member

@dfawley dfawley left a comment

I don't think there's any way to make this foolproof. We must allow pointers in structs, for obvious reasons. We could possibly add a test that runs a quick local benchmark for a fixed number of iterations, and checks the number & size of allocations afterwards?
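
One possible shape for such a check, sketched with testing.AllocsPerRun from the standard library. Everything else here is hypothetical: the flowState, streamWithPointers and streamInlined types, the checkAllocBudget helper, and the budget of one allocation. A real test would exercise an actual RPC path and would need a budget tuned to it.

```go
package main

import (
	"fmt"
	"testing"
)

// flowState is a hypothetical pointer-free helper type.
type flowState struct {
	limit, used, pending, delta int64
}

// streamWithPointers holds its helpers behind pointers.
type streamWithPointers struct {
	send *flowState
	recv *flowState
}

// streamInlined holds the same helpers by value.
type streamInlined struct {
	send flowState
	recv flowState
}

// sink keeps each result reachable so that it must be heap-allocated.
var sink any

// allocsWithPointers measures average allocations per construction.
func allocsWithPointers() float64 {
	return testing.AllocsPerRun(1000, func() {
		sink = &streamWithPointers{send: &flowState{}, recv: &flowState{}}
	})
}

// allocsInlined measures the same for the by-value layout.
func allocsInlined() float64 {
	return testing.AllocsPerRun(1000, func() {
		sink = &streamInlined{}
	})
}

// checkAllocBudget reports an error when the measured allocations per
// operation exceed the given budget.
func checkAllocBudget(name string, got, budget float64) error {
	if got > budget {
		return fmt.Errorf("%s: %.0f allocs/op exceeds budget of %.0f", name, got, budget)
	}
	return nil
}

func main() {
	p, v := allocsWithPointers(), allocsInlined()
	fmt.Printf("pointer fields: %.0f allocs/op, inlined: %.0f allocs/op\n", p, v)
	if err := checkAllocBudget("inlined stream", v, 1); err != nil {
		fmt.Println("regression:", err)
	}
}
```

A check like this would not stop pointer fields from being added, but it would make the resulting increase in allocations visible in CI.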

@dfawley dfawley removed their assignment Oct 6, 2025