Commit f4736b1
Merge pull request #927 from lightninglabs/docs-lnd
Update lnd documentation
2 parents 4af13ca + 4a205bd commit f4736b1

1 file changed: docs/lnd/benchmark_perf_loop.md (+378 lines, -0)
# The Go Performance Optimization Loop: From Benchmarks to Zero Allocations

When optimizing Go code for performance, particularly in hot paths like
cryptographic operations or protocol handling, the journey from identifying
bottlenecks to achieving zero-allocation code follows a well-defined
methodology. This document walks through the complete optimization loop using
Go's built-in tooling, demonstrating how to systematically eliminate
allocations and improve performance.

## Understanding the Performance Baseline

The first step in any optimization effort is establishing a measurable
baseline. Go's benchmark framework provides the foundation for this
measurement. When writing benchmarks for allocation-sensitive code, always
call `b.ReportAllocs()` before `b.ResetTimer()`. This ensures the benchmark
reports both timing and allocation statistics without including setup costs
in the measurements.

Consider a benchmark that exercises a cryptographic write path with the
largest possible message size to stress-test allocations:

```go
func BenchmarkWriteMessage(b *testing.B) {
	// Setup code here...

	b.ReportAllocs() // Essential for tracking allocations
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		// Hot path being measured
	}
}
```
Running the benchmark with `go test -bench=BenchmarkWriteMessage -benchmem
-count=10` provides statistical confidence through multiple runs. The
`-benchmem` flag is redundant once `b.ReportAllocs()` is called, but it does
no harm to include it explicitly. The output reveals three critical metrics:
nanoseconds per operation (ns/op), bytes allocated per operation (B/op), and
the number of distinct allocations per operation (allocs/op).
## Profiling Memory Allocations

Once you have baseline measurements showing undesirable allocations, the next
phase involves profiling to understand where these allocations originate.
Generate memory and CPU profiles during benchmark execution using:

```
go test -bench=BenchmarkWriteMessage -memprofile=mem.prof -cpuprofile=cpu.prof -count=1
```

The resulting profile can be analyzed through several lenses. To see which
functions allocate the most memory by total bytes, use
`go tool pprof -alloc_space -top mem.prof`. For understanding allocation
frequency rather than size, however, `go tool pprof -alloc_objects -top mem.prof`
often provides more actionable insight, especially when hunting small but
frequent allocations.
Here's what the allocation object analysis might reveal:

```
$ go tool pprof -alloc_objects -top mem.prof | head -20
File: brontide.test
Type: alloc_objects
Time: Aug 30, 2024 at 2:07pm (WEST)
Showing nodes accounting for 39254, 100% of 39272 total
Dropped 32 nodes (cum <= 196)
      flat  flat%   sum%        cum   cum%
     32768 83.44% 83.44%      32768 83.44%  github.com/lightningnetwork/lnd/brontide.(*cipherState).Encrypt
      5461 13.91% 97.34%       5461 13.91%  runtime.acquireSudog
      1025  2.61%   100%       1025  2.61%  runtime.allocm
```

This output immediately shows that `cipherState.Encrypt` is responsible for
over 83% of allocations by count, focusing our investigation.
75+
76+
The most powerful profiling technique involves examining allocations at the
77+
source line level. Running `go tool pprof -list 'FunctionName' mem.prof` shows
78+
exactly which lines within a function trigger heap allocations:
79+
80+
```
81+
$ go tool pprof -list 'cipherState.*Encrypt' mem.prof
82+
Total: 8.73MB
83+
ROUTINE ======================== github.com/lightningnetwork/lnd/brontide.(*cipherState).Encrypt
84+
512.01kB 512.01kB (flat, cum) 5.73% of Total
85+
. . 111:func (c *cipherState) Encrypt(associatedData, cipherText, plainText []byte) []byte {
86+
. . 112: defer func() {
87+
. . 113: c.nonce++
88+
. . 114:
89+
. . 115: if c.nonce == keyRotationInterval {
90+
. . 116: c.rotateKey()
91+
. . 117: }
92+
. . 118: }()
93+
. . 119:
94+
512.01kB 512.01kB 120: var nonce [12]byte
95+
. . 121: binary.LittleEndian.PutUint64(nonce[4:], c.nonce)
96+
. . 122:
97+
. . 123: return c.cipher.Seal(cipherText, nonce[:], plainText, associatedData)
98+
```
99+
100+
This granular view reveals that line 120, a seemingly innocent stack array
101+
declaration, is allocating 512KB total across all benchmark iterations.
102+
103+
## CPU Profiling for Hot Spots
104+
105+
While memory allocations often dominate optimization efforts, CPU profiling
106+
reveals where computational time is spent. The CPU profile generated alongside
107+
the memory profile provides complementary insights:
108+
109+
```
110+
$ go tool pprof -top cpu.prof | head -15
111+
File: brontide.test
112+
Type: cpu
113+
Time: Aug 30, 2024 at 2:07pm (WEST)
114+
Duration: 1.8s, Total samples = 1.71s (94.40%)
115+
Showing nodes accounting for 1.65s, 96.49% of 1.71s total
116+
flat flat% sum% cum cum%
117+
0.51s 29.82% 29.82% 0.51s 29.82% vendor/golang.org/x/crypto/chacha20poly1305.(*chacha20poly1305).sealGeneric
118+
0.28s 16.37% 46.20% 0.28s 16.37% vendor/golang.org/x/crypto/internal/poly1305.updateGeneric
119+
0.24s 14.04% 60.23% 0.24s 14.04% vendor/golang.org/x/crypto/chacha20.(*Cipher).XORKeyStream
120+
0.19s 11.11% 71.35% 0.19s 11.11% runtime.memmove
121+
0.12s 7.02% 78.36% 0.86s 50.29% github.com/lightningnetwork/lnd/brontide.(*cipherState).Encrypt
122+
```
123+
124+
This profile shows that cryptographic operations dominate CPU usage, which is
125+
expected. However, note the presence of `runtime.memmove` at 11% - this often
126+
indicates unnecessary copying that could be eliminated through careful buffer
127+
management.
For line-level CPU analysis of a specific function:

```
$ go tool pprof -list 'WriteMessage' cpu.prof
Total: 1.71s
ROUTINE ======================== github.com/lightningnetwork/lnd/brontide.(*Machine).WriteMessage
      10ms      1.21s (flat, cum) 70.76% of Total
         .          .    734:func (b *Machine) WriteMessage(p []byte) error {
         .          .    735:	if len(p) > math.MaxUint16 {
         .          .    736:		return ErrMaxMessageLengthExceeded
         .          .    737:	}
         .          .    738:
         .       10ms    739:	if len(b.nextHeaderSend) > 0 || len(b.nextBodySend) > 0 {
         .          .    740:		return ErrMessageNotFlushed
         .          .    741:	}
         .          .    742:
      10ms       10ms    743:	fullLength := uint16(len(p))
         .          .    744:	var pktLen [2]byte
         .       10ms    745:	binary.BigEndian.PutUint16(pktLen[:], fullLength)
         .          .    746:
         .      580ms    747:	b.nextHeaderSend = b.sendCipher.Encrypt(nil, nil, pktLen[:])
         .      600ms    748:	b.nextBodySend = b.sendCipher.Encrypt(nil, nil, p)
```

This shows that the two `Encrypt` calls consume virtually all of the CPU time
in `WriteMessage`, confirming that the cryptographic operations, rather than
the message-handling logic itself, are the bottleneck.
## Understanding Escape Analysis

When the profiler indicates that seemingly stack-local variables are being
heap allocated, escape analysis becomes your next investigative tool. The Go
compiler's escape analysis determines whether variables can remain on the
stack or must be moved to the heap. Variables escape to the heap when their
lifetime extends beyond the function that creates them, or when the compiler
cannot prove they remain local.

To see the compiler's escape analysis decisions, build with verbose flags:

```
go build -gcflags="-m" ./...
```

For more detailed output, including the reasons for each escape, use `-m=2`.
The output reveals escape flows, showing exactly why variables move to the
heap. When investigating specific escapes, you can grep for the variable in
question:

```
$ go build -gcflags="-m=2" ./... 2>&1 | grep -A2 -B2 "nonce escapes"
./noise.go:183:17: &errors.errorString{...} does not escape
./noise.go:183:17: new(chacha20poly1305.chacha20poly1305) escapes to heap
./noise.go:120:6: nonce escapes to heap:
./noise.go:120:6:   flow: {heap} = &nonce:
./noise.go:120:6:     from nonce (address-of) at ./noise.go:123:40
--
./noise.go:469:21: &keychain.PrivKeyECDH{...} escapes to heap
./noise.go:483:40: []byte{} escapes to heap
./noise.go:138:6: nonce escapes to heap:
./noise.go:138:6:   flow: {heap} = &nonce:
./noise.go:138:6:     from nonce (address-of) at ./noise.go:141:39
```

This output shows the exact flow analysis: the nonce array escapes because its
address is taken when creating a slice (`nonce[:]`) that is passed to a
function the compiler cannot fully analyze.

Common causes of escapes include passing pointers to interfaces, storing
references in heap-allocated structures, and passing slices of stack arrays
to functions that might retain them. A particularly instructive example is the
seemingly innocent pattern of passing a stack array to a function:

```go
var nonce [12]byte
binary.LittleEndian.PutUint64(nonce[4:], counter)
return cipher.Seal(ciphertext, nonce[:], plaintext, nil)
```

Here, `nonce[:]` creates a slice backed by the stack array, but if the
compiler cannot prove that `cipher.Seal` won't retain a reference to this
slice, the entire array escapes to the heap.
## The Optimization Strategy

Armed with profiling data and escape analysis insights, the optimization
phase begins. The general strategy for eliminating allocations follows a
predictable pattern: move temporary buffers from function scope to
longer-lived structures, typically as fields on the enclosing type. This
transformation changes allocation from per-operation to per-instance.

For the nonce example above, the optimization involves adding a buffer field
to the containing struct:

```go
type cipherState struct {
	// ... other fields ...

	nonceBuffer [12]byte // Reusable buffer to avoid allocations
}

func (c *cipherState) Encrypt(...) []byte {
	binary.LittleEndian.PutUint64(c.nonceBuffer[4:], c.nonce)
	return c.cipher.Seal(ciphertext, c.nonceBuffer[:], plaintext, nil)
}
```

This pattern extends to any temporary buffer. When dealing with variable-sized
data up to a known maximum, pre-allocate buffers at that maximum size and
slice into them as needed. The key insight is using the three-index slice
notation to control capacity separately from length:

```go
// Pre-allocated once: var buffer [maxSize]byte

// Three-index slice: a zero-length slice with capacity capped at maxSize,
// ready for append without reallocation.
slice := buffer[:0:maxSize] // length=0, capacity=maxSize
```
## Verification and Iteration

After implementing optimizations, the cycle returns to benchmarking. Run the
same benchmark to measure the improvement, but don't stop at the aggregate
numbers. Generate new profiles to verify that specific allocations have been
eliminated and to identify any remaining allocation sites.

The benchstat tool, installable with
`go install golang.org/x/perf/cmd/benchstat@latest`, provides statistical
comparison between runs:

```
go test -bench=BenchmarkWriteMessage -count=10 > old.txt
# Make optimizations
go test -bench=BenchmarkWriteMessage -count=10 > new.txt
benchstat old.txt new.txt
```
This comparison reveals not just whether performance improved, but whether the
improvement is statistically significant. A typical benchstat output after a
successful optimization looks like:

```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
                 │    old.txt    │               new.txt                │
                 │    sec/op     │    sec/op     vs base                │
WriteMessage-16     50.34µ ± 1%     46.48µ ± 0%  -7.68% (p=0.000 n=10)

                 │    old.txt    │               new.txt                │
                 │     B/op      │     B/op      vs base                │
WriteMessage-16  73788.000 ± 0%     2.000 ± 0%  -100.00% (p=0.000 n=10)

                 │    old.txt    │               new.txt                │
                 │   allocs/op   │   allocs/op   vs base                │
WriteMessage-16     5.000 ± 0%     0.000 ± 0%   -100.00% (p=0.000 n=10)
```

The key metrics to examine are:

- The percentage change (the "vs base" column), showing the magnitude of the
  improvement.
- The p-value (p=0.000), indicating statistical significance; values below
  0.05 suggest real improvements rather than noise.
- The variance (the ± percentages), showing consistency across runs.

This output confirms both a 7.68% speed improvement and the complete
elimination of allocations, with high statistical confidence.
If allocations remain, the cycle continues. Profile again, identify the
source, understand why the allocation occurs through escape analysis, and
apply the appropriate optimization pattern. Each iteration should show
measurable progress toward the goal of zero allocations in the hot path.
## Advanced Techniques

When standard profiling doesn't reveal the allocation source, more advanced
techniques come into play. Memory profiling at different granularities can
help: instead of looking at total allocated bytes, examine the profile with
`go tool pprof -sample_index=alloc_objects` to focus on allocation count
rather than size. This distinction matters when hunting small, frequent
allocations that might not show up prominently in byte-focused views.

Additional pprof commands that prove invaluable during optimization:

```bash
# Interactive mode for exploring the profile
go tool pprof mem.prof
(pprof) top10         # Show top 10 memory consumers
(pprof) list regexp   # List functions matching regexp
(pprof) web           # Open visual graph in browser

# Serve an interactive web UI with flame graphs
go tool pprof -http=:8080 mem.prof

# Compare two profiles directly
go tool pprof -base=old.prof new.prof

# Show allocations only from specific packages
go tool pprof -focus=github.com/lightningnetwork/lnd/brontide mem.prof

# Inspect live (in-use) memory rather than cumulative allocations
go tool pprof -inuse_space mem.prof
```
When dealing with elusive allocations, checking what might be escaping to the
heap can be done more surgically:

```bash
# Check a specific function or type for escapes
go build -gcflags="-m" ./... 2>&1 | grep -E "(YourType|yourFunc)"

# See all heap allocations in a package
go build -gcflags="-m" ./... 2>&1 | grep "moved to heap"

# Check which variables are confirmed to stay on the stack
go build -gcflags="-m=2" ./... 2>&1 | grep "does not escape"
```
For particularly elusive allocations, instrumenting the code with runtime
memory statistics can provide real-time feedback:

```go
var m runtime.MemStats
runtime.ReadMemStats(&m)
// TotalAlloc is cumulative, so the delta is not skewed by an intervening GC
// (unlike m.Alloc, which can shrink between the two reads).
before := m.TotalAlloc

// Operation being measured.

runtime.ReadMemStats(&m)
allocated := m.TotalAlloc - before
```

While this approach adds overhead and shouldn't be used in production, it can
help isolate allocations to specific code sections during development.
## The Zero-Allocation Goal

Achieving zero allocations in hot paths represents more than just a
performance optimization. It provides predictable latency, reduces garbage
collection pressure, and improves overall system behavior under load. In
systems handling thousands of operations per second, the difference between
five allocations per operation and zero can mean the difference between
smooth operation and periodic latency spikes during garbage collection.

The journey from initial benchmark to zero-allocation code demonstrates the
power of Go's built-in tooling. By systematically applying the
benchmark-profile-optimize loop, even complex code paths can be transformed
into allocation-free implementations. The key lies not in guessing or in
premature optimization, but in measuring, understanding, and methodically
addressing each allocation source.

It's best to focus optimization efforts on true hot paths identified through
production profiling or realistic load testing. The techniques described here
provide the tools to achieve zero-allocation code when it matters, but the
judgment of when to apply them remains a critical engineering decision.