# The Go Performance Optimization Loop: From Benchmarks to Zero Allocations

When optimizing Go code for performance, particularly in hot paths like
cryptographic operations or protocol handling, the journey from identifying
bottlenecks to achieving zero-allocation code follows a well-defined
methodology. This document walks through the complete optimization loop using
Go's built-in tooling, demonstrating how to systematically eliminate allocations
and improve performance.

## Understanding the Performance Baseline

The first step in any optimization effort is establishing a measurable baseline.
Go's benchmark framework provides the foundation for this measurement. When
writing benchmarks for allocation-sensitive code, always include a call to
`b.ReportAllocs()` before `b.ResetTimer()`. This ensures the benchmark reports
both timing and allocation statistics without including setup costs in the
measurements.

Consider a benchmark that exercises a cryptographic write path with the largest
possible message size to stress-test allocations:

```go
func BenchmarkWriteMessage(b *testing.B) {
    // Setup code here...

    b.ReportAllocs() // Essential for tracking allocations
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        // Hot path being measured
    }
}
```

Running the benchmark with `go test -bench=BenchmarkWriteMessage -benchmem
-count=10` provides statistical confidence through multiple runs. The
`-benchmem` flag is redundant if you've called `b.ReportAllocs()`, but it
doesn't hurt to include it explicitly. The output reveals three critical
metrics: nanoseconds per operation, bytes allocated per operation, and the
number of distinct allocations per operation.

## Profiling Memory Allocations

Once you have baseline measurements showing undesirable allocations, the next
phase involves profiling to understand where these allocations originate.
Generate memory profiles during benchmark execution using:

```
go test -bench=BenchmarkWriteMessage -memprofile=mem.prof -cpuprofile=cpu.prof -count=1
```

The resulting profile can be analyzed through several lenses. To see which
functions allocate the most memory by total bytes, use:
`go tool pprof -alloc_space -top mem.prof`.

However, for understanding allocation frequency rather than size,
`go tool pprof -alloc_objects -top mem.prof` often provides more actionable
insights, especially when hunting small but frequent allocations.

Here's what the allocation object analysis might reveal:

```
$ go tool pprof -alloc_objects -top mem.prof | head -20
File: brontide.test
Type: alloc_objects
Time: Aug 30, 2024 at 2:07pm (WEST)
Showing nodes accounting for 39254, 100% of 39272 total
Dropped 32 nodes (cum <= 196)
      flat  flat%   sum%        cum   cum%
     32768 83.44% 83.44%      32768 83.44%  github.com/lightningnetwork/lnd/brontide.(*cipherState).Encrypt
      5461 13.91% 97.34%       5461 13.91%  runtime.acquireSudog
      1025  2.61%   100%       1025  2.61%  runtime.allocm
```

This output immediately shows that `cipherState.Encrypt` is responsible for 83%
of allocations by count, focusing our investigation.

The most powerful profiling technique involves examining allocations at the
source line level. Running `go tool pprof -list 'FunctionName' mem.prof` shows
exactly which lines within a function trigger heap allocations:

```
$ go tool pprof -list 'cipherState.*Encrypt' mem.prof
Total: 8.73MB
ROUTINE ======================== github.com/lightningnetwork/lnd/brontide.(*cipherState).Encrypt
  512.01kB   512.01kB (flat, cum)  5.73% of Total
         .          .    111:func (c *cipherState) Encrypt(associatedData, cipherText, plainText []byte) []byte {
         .          .    112:    defer func() {
         .          .    113:        c.nonce++
         .          .    114:
         .          .    115:        if c.nonce == keyRotationInterval {
         .          .    116:            c.rotateKey()
         .          .    117:        }
         .          .    118:    }()
         .          .    119:
  512.01kB   512.01kB    120:    var nonce [12]byte
         .          .    121:    binary.LittleEndian.PutUint64(nonce[4:], c.nonce)
         .          .    122:
         .          .    123:    return c.cipher.Seal(cipherText, nonce[:], plainText, associatedData)
```

This granular view reveals that line 120, a seemingly innocent stack array
declaration, is allocating 512KB total across all benchmark iterations.

## CPU Profiling for Hot Spots

While memory allocations often dominate optimization efforts, CPU profiling
reveals where computational time is spent. The CPU profile generated alongside
the memory profile provides complementary insights:

```
$ go tool pprof -top cpu.prof | head -15
File: brontide.test
Type: cpu
Time: Aug 30, 2024 at 2:07pm (WEST)
Duration: 1.8s, Total samples = 1.71s (94.40%)
Showing nodes accounting for 1.65s, 96.49% of 1.71s total
      flat  flat%   sum%        cum   cum%
     0.51s 29.82% 29.82%      0.51s 29.82%  vendor/golang.org/x/crypto/chacha20poly1305.(*chacha20poly1305).sealGeneric
     0.28s 16.37% 46.20%      0.28s 16.37%  vendor/golang.org/x/crypto/internal/poly1305.updateGeneric
     0.24s 14.04% 60.23%      0.24s 14.04%  vendor/golang.org/x/crypto/chacha20.(*Cipher).XORKeyStream
     0.19s 11.11% 71.35%      0.19s 11.11%  runtime.memmove
     0.12s  7.02% 78.36%      0.86s 50.29%  github.com/lightningnetwork/lnd/brontide.(*cipherState).Encrypt
```

This profile shows that cryptographic operations dominate CPU usage, which is
expected. However, note the presence of `runtime.memmove` at 11%: in a profile
like this, it often indicates unnecessary copying that could be eliminated
through careful buffer management.

For line-level CPU analysis of a specific function:

```
$ go tool pprof -list 'WriteMessage' cpu.prof
Total: 1.71s
ROUTINE ======================== github.com/lightningnetwork/lnd/brontide.(*Machine).WriteMessage
      10ms      1.21s (flat, cum) 70.76% of Total
         .          .    734:func (b *Machine) WriteMessage(p []byte) error {
         .          .    735:    if len(p) > math.MaxUint16 {
         .          .    736:        return ErrMaxMessageLengthExceeded
         .          .    737:    }
         .          .    738:
         .       10ms    739:    if len(b.nextHeaderSend) > 0 || len(b.nextBodySend) > 0 {
         .          .    740:        return ErrMessageNotFlushed
         .          .    741:    }
         .          .    742:
      10ms       10ms    743:    fullLength := uint16(len(p))
         .          .    744:    var pktLen [2]byte
         .       10ms    745:    binary.BigEndian.PutUint16(pktLen[:], fullLength)
         .          .    746:
         .      580ms    747:    b.nextHeaderSend = b.sendCipher.Encrypt(nil, nil, pktLen[:])
         .      600ms    748:    b.nextBodySend = b.sendCipher.Encrypt(nil, nil, p)
```

This shows that the two `Encrypt` calls consume virtually all the CPU time in
`WriteMessage`, confirming that cryptographic operations are the bottleneck
rather than the message handling logic itself.

## Understanding Escape Analysis

When the profiler indicates that seemingly stack-local variables are being heap
allocated, escape analysis becomes your next investigative tool. The Go
compiler's escape analysis determines whether variables can remain on the stack
or must be moved to the heap. Variables escape to the heap when their lifetime
extends beyond the function that creates them or when the compiler cannot prove
they remain local.

To see the compiler's escape analysis decisions, build with verbose flags:

```
go build -gcflags="-m" ./...
```

For more detailed output including the reasons for escape, use `-m=2`. The
output reveals escape flows, showing exactly why variables move to the heap.
When investigating specific escapes, you can grep for the variable in question:

```
$ go build -gcflags="-m=2" ./... 2>&1 | grep -A2 -B2 "nonce escapes"
./noise.go:183:17: &errors.errorString{...} does not escape
./noise.go:183:17: new(chacha20poly1305.chacha20poly1305) escapes to heap
./noise.go:120:6: nonce escapes to heap:
./noise.go:120:6:   flow: {heap} = &nonce:
./noise.go:120:6:     from nonce (address-of) at ./noise.go:123:40
--
./noise.go:469:21: &keychain.PrivKeyECDH{...} escapes to heap
./noise.go:483:40: []byte{} escapes to heap
./noise.go:138:6: nonce escapes to heap:
./noise.go:138:6:   flow: {heap} = &nonce:
./noise.go:138:6:     from nonce (address-of) at ./noise.go:141:39
```

This output shows the exact flow analysis: the nonce array escapes because its
address is taken when creating a slice (`nonce[:]`) and passed to a function
that the compiler cannot fully analyze.

Common causes include passing pointers to interfaces, storing references in
heap-allocated structures, or passing slices of stack arrays to functions that
might retain them. A particularly instructive example is the seemingly innocent
pattern of passing a stack array to a function:

```go
var nonce [12]byte
binary.LittleEndian.PutUint64(nonce[4:], counter)
return cipher.Seal(ciphertext, nonce[:], plaintext, nil)
```

Here, `nonce[:]` creates a slice backed by the stack array, but if the compiler
cannot prove that `cipher.Seal` won't retain a reference to this slice, the
entire array escapes to the heap.

## The Optimization Strategy

Armed with profiling data and escape analysis insights, the optimization phase
begins. The general strategy for eliminating allocations follows a predictable
pattern: move temporary buffers from function scope to longer-lived structures,
typically as fields in the enclosing type. This transformation changes
allocation from per-operation to per-instance.

For the nonce example above, the optimization involves adding a buffer field to
the containing struct:

```go
type cipherState struct {
    // ... other fields ...
    nonceBuffer [12]byte // Reusable buffer to avoid allocations
}

func (c *cipherState) Encrypt(associatedData, cipherText, plainText []byte) []byte {
    binary.LittleEndian.PutUint64(c.nonceBuffer[4:], c.nonce)
    return c.cipher.Seal(cipherText, c.nonceBuffer[:], plainText, associatedData)
}
```

This pattern extends to any temporary buffer. When dealing with variable-sized
data up to a known maximum, pre-allocate buffers at that maximum size and slice
into them as needed. The key insight is taking a zero-length slice of the
buffer so that appends reuse its capacity instead of allocating (a three-index
slice such as `buffer[0:0:maxSize]` additionally caps the capacity):

```go
// Pre-allocated once: var buffer [maxSize]byte

// Creating a zero-length slice with full capacity for append:
slice := buffer[:0] // length=0, capacity=maxSize
```

## Verification and Iteration

After implementing optimizations, the cycle returns to benchmarking. Run the
same benchmark to measure improvement, but don't stop at the aggregate numbers.
Generate new profiles to verify that specific allocations have been eliminated
and to identify any remaining allocation sites.

The benchstat tool (installable with
`go install golang.org/x/perf/cmd/benchstat@latest`) provides statistical
comparison between runs:

```
go test -bench=BenchmarkWriteMessage -count=10 > old.txt
# Make optimizations
go test -bench=BenchmarkWriteMessage -count=10 > new.txt
benchstat old.txt new.txt
```

This comparison reveals not just whether performance improved, but whether the
improvement is statistically significant. A typical benchstat output after
successful optimization looks like:

```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
                │   old.txt   │              new.txt               │
                │   sec/op    │   sec/op     vs base               │
WriteMessage-16   50.34µ ± 1%   46.48µ ± 0%  -7.68% (p=0.000 n=10)

                │    old.txt     │             new.txt                │
                │      B/op      │    B/op     vs base                │
WriteMessage-16   73788.000 ± 0%   2.000 ± 0%  -100.00% (p=0.000 n=10)

                │  old.txt   │             new.txt                │
                │ allocs/op  │ allocs/op   vs base                │
WriteMessage-16   5.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)
```

The key metrics to examine are:

- The percentage change (the "vs base" column) showing the magnitude of
  improvement
- The p-value (p=0.000) indicating statistical significance: values below 0.05
  suggest real improvements rather than noise
- The variance (the ± percentages) showing consistency across runs

This output confirms both a 7.68% speed improvement and complete elimination of
allocations, with high statistical confidence.

If allocations remain, the cycle continues. Profile again, identify the source,
understand why the allocation occurs through escape analysis, and apply the
appropriate optimization pattern. Each iteration should show measurable progress
toward the goal of zero allocations in the hot path.

## Advanced Techniques

When standard profiling doesn't reveal the allocation source, more advanced
techniques come into play. Memory profiling with different granularities can
help. Instead of looking at total allocated bytes, examine the profile with
`go tool pprof -sample_index=alloc_objects mem.prof` to focus on allocation
count rather than size. This distinction matters when hunting for small,
frequent allocations that might not show up prominently in byte-focused views.

Additional pprof commands that prove invaluable during optimization:

```bash
# Interactive mode for exploring the profile
go tool pprof mem.prof
(pprof) top10        # Show top 10 memory consumers
(pprof) list regexp  # List functions matching regexp
(pprof) web          # Open visual graph in browser

# Launch the web UI, which includes a flame graph view
go tool pprof -http=:8080 mem.prof

# Compare two profiles directly
go tool pprof -base=old.prof new.prof

# Show allocations only from specific packages
go tool pprof -focus=github.com/lightningnetwork/lnd/brontide mem.prof

# Show live (in-use) memory rather than cumulative allocations
go tool pprof -inuse_space mem.prof
```

When dealing with elusive allocations, checking what might be escaping to the
heap can be done more surgically:

```bash
# Check specific function or type for escapes
go build -gcflags="-m" ./... 2>&1 | grep -E "(YourType|yourFunc)"

# See all heap allocations in a package
go build -gcflags="-m" ./... 2>&1 | grep "moved to heap"

# Check which variables are confirmed to stay on the stack
go build -gcflags="-m=2" ./... 2>&1 | grep "does not escape"
```

For particularly elusive allocations, instrumenting the code with runtime memory
statistics can provide real-time feedback:

```go
var m runtime.MemStats
runtime.ReadMemStats(&m)
before := m.TotalAlloc // cumulative bytes allocated; unlike m.Alloc, never decreases
// Operation being measured
runtime.ReadMemStats(&m)
allocated := m.TotalAlloc - before
```

While this approach adds overhead and shouldn't be used in production, it can
help isolate allocations to specific code sections during development.

## The Zero-Allocation Goal

Achieving zero allocations in hot paths represents more than just a performance
optimization. It provides predictable latency, reduces garbage collection
pressure, and improves overall system behavior under load. In systems handling
thousands of operations per second, the difference between five allocations per
operation and zero can mean the difference between smooth operation and periodic
latency spikes during garbage collection.

The journey from initial benchmark to zero-allocation code demonstrates the
power of Go's built-in tooling. By systematically applying the
benchmark-profile-optimize loop, even complex code paths can be transformed into
allocation-free implementations. The key lies not in guessing or premature
optimization, but in measuring, understanding, and methodically addressing each
allocation source.

It's best to focus optimization efforts on true hot paths identified through
production profiling or realistic load testing. The techniques described here
provide the tools to achieve zero-allocation code when it matters, but the
judgment of when to apply them remains a critical engineering decision.