High Performance Go
Tokyo 🌸
4 Dec 2016
Dave Cheney
http://dave.cheney.net/
@davecheney
* License and Materials
This presentation is licensed under the [[https://creativecommons.org/licenses/by-sa/4.0/][Creative Commons Attribution-ShareAlike 4.0 International]] licence.
The materials for this presentation are available on GitHub:
.link https://github.com/davecheney/high-performance-go-workshop
You are encouraged to remix, transform, or build upon the material, provided you give appropriate credit and distribute your contributions under the same license.
If you have suggestions or corrections to this presentation, please raise [[https://github.com/davecheney/high-performance-go-workshop/issues][an issue on the GitHub project]].
* Agenda
This workshop is aimed at development teams who are building production Go applications intended for high scale deployment.
Today we are going to cover five areas:
- What does performance mean, what is possible?
- Benchmarking
- Performance measurement and profiling
- Memory management and GC tuning
- Concurrency
After each section we'll have time for questions.
* One more thing ...
This isn't a lecture, it's a conversation.
If you don't understand something, or think what you're hearing is incorrect, please ask.
* What does performance mean, what is possible?
* What does performance mean?
Before we talk about writing high performance code, we need to talk about the hardware that will execute this code.
What are its properties and how have they changed over time?
As software authors we have benefited from Moore's Law, the doubling of the number of available transistors on a chip every 18 months, for 50 years.
No other industry has experienced a _six_orders_of_magnitude_ improvement in their tools in the space of a lifetime.
But this is all changing.
* The CPU
.image images/cpu.svg _ 600
The number of transistors per CPU die continues to increase.
However, clock speeds have not increased in a decade, and the number of transistors per dollar has started to fall.
* More cores
.image images/Nehalem_Die_Shot_3.jpg _ 600
Over the last decade performance, especially on the server, has been dictated by adding more CPU cores.
CPU core count is dominated by heat dissipation and cost.
* Transistor size reductions have run out of steam
.image images/Mjc5MTM2Nw.png _ 580
Shrinking the size of a transistor, to fit more in the same sized die, is hitting a wall.
CPU dies can be made larger, but that increases latency, and cost per transistor.
* Modern CPUs are optimised for bulk transfers
"Modern processors are a like nitro fueled funny cars, they excel at the quarter mile. Unfortunately modern programming languages are like Monte Carlo, they are full of twists and turns."
.caption David Ungar, OOPSLA (year unknown)
# https://youtu.be/4LG-RtcSYUQ?t=39m59s
Much of the improvement in performance in the last two decades has come from architectural improvements:
- out of order execution (super-scalar)
- speculative execution
- vector (SSE) instructions
Thus, modern CPUs are optimised for bulk transfers and bulk operations. At every level, the setup cost of an operation encourages you to work in bulk.
e.g. memory is not loaded per byte, but in multiples of cache lines; this is why alignment is becoming less of an issue.
* Memory
Physical memory attached to a server has increased geometrically.
.image images/latency.png _ 600
But, in terms of processor cycles lost, physical memory is still as far away as ever.
* Cache rules everything around it
.image images/xeon_e5_v4_hcc_rings.jpg _ 600
Cache rules everything around it, but it is small, and will remain small because the speed of light determines how large a cache can be for a certain latency.
You can have a larger cache, but it will be slower because, in a universe where electricity travels a foot every nanosecond, distance equals latency.
* Network and disk I/O are still expensive
Network and disk I/O are still expensive, so expensive that the Go runtime will schedule something else while those operations are in progress.
.image images/media-20160803.jpg
* The free lunch is over
In 2005 Herb Sutter, the C++ committee leader, wrote an article entitled [[http://www.gotw.ca/publications/concurrency-ddj.htm][_The_free_lunch_is_over_]].
In his article Sutter discussed all the points I covered and asserted that programmers could no longer rely on faster hardware to fix slow programs—or slow programming languages.
Now, a decade later, there is no doubt that Herb Sutter was right. Memory is slow, caches are too small, CPU clock speeds are going backwards, and the simple world of a single threaded CPU is long gone.
* A fast programming language
It's time for the software to come to the party.
As [[https://www.youtube.com/watch?v=aiv1JOfMjm0][Rick Hudson noted at GopherCon]] in 2015, it's time for a programming language that works _with_ the limitations of today's hardware, rather than continuing to ignore the reality in which CPU designers find themselves.
So, for best performance on today's hardware in today's world, you need a programming language which:
- Is compiled, not interpreted.
- Permits efficient code to be written.
- Lets programmers talk about memory effectively; think structs vs Java objects.
- Has a compiler that produces efficient code; the code also has to be small, because cache space is limited.
Obviously the language I'm talking about is the one we're here to discuss: Go.
* Discussion
Discussion:
- Do you agree that computers are no longer getting faster?
- Do you think we need to change languages to take advantage of modern hardware?
Further reading
.link https://www.youtube.com/watch?v=rKnDgT73v8s The Go Programming language (Nov 10, 2009)
.link https://www.youtube.com/watch?v=5kj5ApnhPAE OSCON 2010: Rob Pike, "Public Static Void"
* Benchmarking
* Benchmarking
Before you can begin to tune your application, you need to establish a reliable baseline to measure the impact of your change to know if you're making things better, or worse.
In other words, _"Don't_guess,_measure"_
This section focuses on how to construct useful benchmarks using the Go testing framework, and gives practical tips for avoiding the pitfalls.
Benchmarking is closely related to profiling, which we'll touch on during this section, then cover it in detail in the next.
* Benchmarking ground rules
Before you benchmark, you must have a stable environment to get repeatable results.
- The machine must be idle—don't profile on shared hardware, don't browse the web while waiting for a long benchmark to run.
- Watch out for power saving and thermal scaling.
- Avoid virtual machines and shared cloud hosting; they are too noisy for consistent measurements.
- There is a kernel bug on OS X versions before El Capitan; upgrade or avoid profiling on OS X. This also affects other *BSDs.
If you can afford it, buy dedicated performance test hardware. Rack it, disable all the power management and thermal scaling and never update the software on those machines.
For everyone else, have a before and after sample and run them multiple times to get consistent results.
* Using the testing package for benchmarking
The `testing` package has built in support for writing benchmarks.
.code examples/fib/fib_test.go /STARTFIB OMIT/,/ENDFIB OMIT/
.caption fib.go
.code examples/fib/fib_test.go /STARTBENCH OMIT/,/ENDBENCH OMIT/
.caption fib_test.go
DEMO: `go`test`-bench=.`./examples/fib`
* How benchmarks work
Each benchmark is run `b.N` times until it takes longer than 1 second.
`b.N` starts at 1, if the benchmark completes in under 1 second `b.N` is increased and the benchmark run again.
`b.N` increases in the approximate sequence: 1, 2, 3, 5, 10, 20, 30, 50, 100, ...
% go test -bench=. ./examples/fib
BenchmarkFib-4 30000 46408 ns/op
PASS
ok _/Users/dfc/devel/high-performance-go-workshop/examples/fib 1.910s
_Beware:_ below the μs mark you will start to see the relativistic effects of instruction reordering and code alignment.
- Run benchmarks longer to get more accuracy; `go`test`-benchtime=10s`
- Run benchmarks multiple times; `go`test`-count=10`
_Tip:_ If this is required, codify it in a `Makefile` so everyone is comparing apples to apples.
* Comparing benchmarks
For repeatable results, you should run benchmarks multiple times.
You can do this manually, or use the `-count=` flag.
Determining the performance delta between two sets of benchmarks can be tedious and error prone.
Tools like [[https://godoc.org/rsc.io/benchstat][rsc.io/benchstat]] are useful for comparing results.
% go test -c
% mv fib.test fib.golden
DEMO: Improve `Fib`
% go test -c
% ./fib.golden -test.bench=. -test.count=5 > old.txt
% ./fib.test -test.bench=. -test.count=5 > new.txt
% benchstat old.txt new.txt
DEMO: `benchstat`{old,new}.txt`
* Avoid benchmarking start up costs
Sometimes your benchmark has a once per run setup cost. `b.ResetTimer()` can be used to ignore the time accrued in setup.
.code examples/reset.go /START1 OMIT/,/END1 OMIT/
If you have some expensive setup logic _per_loop_iteration, use `b.StopTimer()` and `b.StartTimer()` to pause the benchmark timer.
.code examples/reset.go /START2 OMIT/,/END2 OMIT/
* Benchmarking allocations
Allocation count and size is strongly correlated with benchmark time.
You can tell the `testing` framework to record the number of allocations made by code under test.
.code examples/benchmark.go
DEMO: `go`test`-run=^$`-bench=.`bufio`
_Note:_ you can also use the `go`test`-benchmem` flag to do the same for _all_ benchmarks.
DEMO: `go`test`-run=^$`-bench=.`-benchmem`bufio`
* Watch out for compiler optimisations
This example comes from [[https://github.com/golang/go/issues/14813#issue-140603392][issue 14813]]. How fast will this function benchmark?
.code examples/popcnt/popcnt_test.go /START OMIT/,/END OMIT/
* What happened?
% go test -bench=. ./examples/popcnt
`popcnt` is a leaf function, so the compiler can inline it.
Because the function is inlined, the compiler can see it has no side effects, so the call is eliminated. This is what the compiler sees:
.code examples/popcnt/popcnt2_test.go /START OMIT/,/END OMIT/
The optimisations that make real code fast, by removing unnecessary computation, are the same ones that remove benchmarks which have no observable side effects.
This is only going to get more common as the Go compiler improves.
DEMO: show how to fix popcnt
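One common fix, sketched below with a simple stand-in `popcnt` implementation, is to store the benchmark's result in a package level variable so the compiler cannot prove the call is dead:

    package popcnt

    import "testing"

    // popcnt is a simple stand-in for the function from issue 14813.
    func popcnt(x uint64) int {
        var n int
        for ; x > 0; x >>= 1 {
            n += int(x & 1)
        }
        return n
    }

    // Result is a package level sink. Writing the benchmark's result to it
    // gives the call an observable side effect, so it cannot be eliminated.
    var Result int

    func BenchmarkPopcnt(b *testing.B) {
        var r int
        for i := 0; i < b.N; i++ {
            r = popcnt(uint64(i))
        }
        Result = r
    }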
* Benchmark mistakes
The `for` loop is crucial to the operation of the benchmark.
Here are two incorrect benchmarks, can you explain what is wrong with them?
.code examples/benchfib/wrong_test.go /START OMIT/,/END OMIT/
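For contrast, here is a minimal sketch of a correctly structured benchmark, assuming a recursive `Fib(n int) int` like the one used earlier; the work done per iteration is fixed and does not depend on `b.N`:

    package benchfib

    import "testing"

    func Fib(n int) int {
        if n < 2 {
            return n
        }
        return Fib(n-1) + Fib(n-2)
    }

    func BenchmarkFib20(b *testing.B) {
        for i := 0; i < b.N; i++ {
            Fib(20) // the same amount of work on every iteration
        }
    }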
* Discussion
Are there any questions?
Perhaps it is time for a break.
* Performance measurement and profiling
* Performance measurement and profiling
In the previous section we studied how to measure the performance of programs from the outside.
In this section we'll use profiling tools built into Go to investigate the operation of the program from the inside.
* pprof
The primary tool we're going to be talking about today is _pprof_.
[[https://github.com/google/pprof][pprof]] descends from the [[https://github.com/gperftools/gperftools][Google Perf Tools]] suite of tools.
`pprof` profiling is built into the Go runtime.
It consists of two parts:
- `runtime/pprof` package built into every Go program
- `go`tool`pprof` for investigating profiles.
pprof supports several types of profiling; we'll discuss three of these today:
- CPU profiling.
- Memory profiling.
- Block (or blocking) profiling.
* CPU profiling
CPU profiling is the most common type of profile, and the most obvious.
When CPU profiling is enabled the runtime will interrupt itself every 10ms and record the stack trace of the currently running goroutines.
Once the profile is complete we can analyse it to determine the hottest code paths.
The more times a function appears in the profile, the more time that code path is taking as a percentage of the total runtime.
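A minimal sketch of turning the CPU profile on by hand with `runtime/pprof` (the output file name is arbitrary):

    package main

    import (
        "log"
        "os"
        "runtime/pprof"
    )

    func main() {
        f, err := os.Create("cpu.pprof")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()

        // ... the work you want to profile ...
    }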
* Memory profiling
Memory profiling records the stack trace when a _heap_ allocation is made.
Stack allocations are assumed to be free and are _not_tracked_ in the memory profile.
Memory profiling, like CPU profiling, is sample based; by default memory profiling samples 1 in every 1000 allocations. This rate can be changed.
Because memory profiling is sample based, and because it tracks _allocations_ not _use_, using memory profiling to determine your application's overall memory usage is difficult.
_Personal_Opinion:_ I do not find memory profiling useful for finding memory leaks. There are better ways to determine how much memory your application is using. We will discuss these later in the presentation.
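If you need a heap profile from a running program without the `testing` framework, a sketch using `runtime/pprof` looks like this (the rate and file name are illustrative):

    package main

    import (
        "log"
        "os"
        "runtime"
        "runtime/pprof"
    )

    func main() {
        // A lower rate samples more allocations; 1 records every allocation
        // (expensive). It must be set before the allocations you care about.
        runtime.MemProfileRate = 1

        // ... run the workload you want to measure ...

        f, err := os.Create("mem.pprof")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        if err := pprof.WriteHeapProfile(f); err != nil {
            log.Fatal(err)
        }
    }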
* Block profiling
Block profiling is quite unique.
A block profile is similar to a CPU profile, but it records the amount of time a goroutine spent waiting for a shared resource.
This can be useful for determining _concurrency_ bottlenecks in your application.
Block profiling can show you when a large number of goroutines _could_ make progress, but were _blocked_. Blocking includes:
- Sending or receiving on an unbuffered channel.
- Sending to a full channel, receiving from an empty one.
- Trying to `Lock` a `sync.Mutex` that is locked by another goroutine.
Block profiling is a very specialised tool; it should not be used until you believe you have eliminated all your CPU and memory usage bottlenecks.
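Block profiling is off by default; here is a sketch of enabling it and writing the profile with `runtime/pprof` (the rate and file name are illustrative):

    package main

    import (
        "log"
        "os"
        "runtime"
        "runtime/pprof"
    )

    func main() {
        // A rate of 1 records every blocking event; larger values sample less.
        runtime.SetBlockProfileRate(1)

        // ... run the concurrent workload you want to measure ...

        f, err := os.Create("block.pprof")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        if err := pprof.Lookup("block").WriteTo(f, 0); err != nil {
            log.Fatal(err)
        }
    }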
* One profile at a time
Profiling is not free.
Profiling has a moderate, but measurable impact on program performance—especially if you increase the memory profile sample rate.
Most tools will not stop you from enabling multiple profiles at once.
If you enable multiple profiles at the same time, they will observe their own interactions and throw off your results.
*Do*not*enable*more*than*one*kind*of*profile*at*a*time.*
* Using pprof
Now that I've talked about what pprof can measure, I will talk about how to use pprof to analyse a profile.
pprof should always be invoked with _two_ arguments.
go tool pprof /path/to/your/binary /path/to/your/profile
The `binary` argument *must* be the binary that produced this profile.
The `profile` argument *must* be the profile generated by this binary.
*Warning*: Because pprof also supports an online mode where it can fetch profiles from a running application over http, the pprof tool can be invoked without the name of your binary ([[https://github.com/golang/go/issues/10863][issue 10863]]):
go tool pprof /tmp/c.pprof
*Do*not*do*this*or*pprof*will*report*your*profile*is*empty.*
* Using pprof (cont.)
This is a sample CPU profile:
% go tool pprof $BINARY /tmp/c.p
Entering interactive mode (type "help" for commands)
(pprof) top
Showing top 15 nodes out of 63 (cum >= 4.85s)
flat flat% sum% cum cum%
21.89s 9.84% 9.84% 128.32s 57.71% net.(*netFD).Read
17.58s 7.91% 17.75% 40.28s 18.11% runtime.exitsyscall
15.79s 7.10% 24.85% 15.79s 7.10% runtime.newdefer
12.96s 5.83% 30.68% 151.41s 68.09% test_frame/connection.(*ServerConn).readBytes
11.27s 5.07% 35.75% 23.35s 10.50% runtime.reentersyscall
10.45s 4.70% 40.45% 82.77s 37.22% syscall.Syscall
9.38s 4.22% 44.67% 9.38s 4.22% runtime.deferproc_m
9.17s 4.12% 48.79% 12.73s 5.72% exitsyscallfast
8.03s 3.61% 52.40% 11.86s 5.33% runtime.casgstatus
7.66s 3.44% 55.85% 7.66s 3.44% runtime.cas
7.59s 3.41% 59.26% 7.59s 3.41% runtime.onM
6.42s 2.89% 62.15% 134.74s 60.60% net.(*conn).Read
6.31s 2.84% 64.98% 6.31s 2.84% runtime.writebarrierptr
6.26s 2.82% 67.80% 32.09s 14.43% runtime.entersyscall
Often this output is hard to understand.
* Using pprof (cont.)
A better way to understand your profile is to visualise it.
% go tool pprof application /tmp/c.p
Entering interactive mode (type "help" for commands)
(pprof) web
Opens a web page with a graphical display of the profile.
.link images/profile.svg
_Note_: visualisation requires graphviz.
I find this method to be superior to the text mode, I strongly recommend you try it.
pprof also supports these modes in a non interactive form with flags like `-svg`, `-pdf`, etc. See `go`tool`pprof`-help` for more details.
.link http://blog.golang.org/profiling-go-programs Further reading: Profiling Go programs
.link https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs Further reading: Debugging performance issues in Go programs
* Using pprof (cont.)
The output of a memory profile can be similarly visualised.
% go build -gcflags='-memprofile=/tmp/m.p'
% go tool pprof --alloc_objects -svg $(go tool -n compile) /tmp/m.p > alloc_objects.svg
% go tool pprof --inuse_objects -svg $(go tool -n compile) /tmp/m.p > inuse_objects.svg
Memory profiles come in two varieties
- Alloc objects reports the call site where each allocation was made
.link images/alloc_objects.svg
- Inuse objects reports the call site where an allocation was made _iff_ it was reachable at the end of the profile
.link images/inuse_objects.svg
DEMO: `examples/inuseallocs`
* Using pprof (cont.)
Here is a visualisation of a block profile:
% go test -run=XXX -bench=ClientServer -blockprofile=/tmp/b.p net/http
% go tool pprof -svg http.test /tmp/b.p > block.svg
.link images/block.svg
* Profiling benchmarks
The `testing` package has built in support for generating CPU, memory, and block profiles.
- `-cpuprofile=$FILE` writes a CPU profile to `$FILE`.
- `-memprofile=$FILE`, writes a memory profile to `$FILE`, `-memprofilerate=N` adjusts the profile rate to `1/N`.
- `-blockprofile=$FILE`, writes a block profile to `$FILE`.
Using any of these flags also preserves the binary.
% go test -run=XXX -bench=. -cpuprofile=c.p bytes
% go tool pprof bytes.test c.p
_Note:_ use `-run=XXX` to disable tests, you only want to profile benchmarks. You can also use `-run=^$` to accomplish the same thing.
* Profiling applications
Profiling `testing` benchmarks is useful for _microbenchmarks_.
We use microbenchmarks inside the standard library to make sure individual packages do not regress, but what if you want to profile a complete application?
The Go runtime's profiling interface is in the `runtime/pprof` package.
`runtime/pprof` is a very low level tool, and for historic reasons the interfaces to the different kinds of profile are not uniform.
A few years ago I wrote a small package, [[https://github.com/pkg/profile][github.com/pkg/profile]], to make it easier to profile an application.
import "github.com/pkg/profile"
func main() {
defer profile.Start().Stop()
...
}
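By default `profile.Start()` records a CPU profile; other kinds of profile can be selected by passing an option. A sketch; check the package documentation for the exact options available in your version:

    // choose exactly one of these per run
    defer profile.Start(profile.MemProfile).Stop()   // heap profile
    defer profile.Start(profile.BlockProfile).Stop() // block profile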
* Exercise
- Generate a profile from a piece of code you know well. If you don't have a code sample, try profiling `godoc`.
- If you were to generate a profile on one machine and inspect it on another, how would you do it?
* Framepointers
Go 1.7 has been released, and along with a new compiler back end for amd64, the compiler now enables frame pointers by default.
The frame pointer is a register that always points to the top of the current stack frame.
Framepointers enable tools like `gdb(1)` and `perf(1)` to understand the Go call stack.
We won't cover these tools in this workshop, but you can read and watch a presentation I gave on seven different ways to profile Go programs.
Further reading:
.link https://talks.godoc.org/github.com/davecheney/presentations/seven.slide Seven ways to profile a Go program (slides)
.link https://www.youtube.com/watch?v=2h_NFBFrciI Video (30 mins)
.link https://www.bigmarker.com/remote-meetup-go/Seven-ways-to-profile-a-Go-program Recording (60 mins)
* go tool trace
In Go 1.5, Dmitry Vyukov added a new kind of profiling to the runtime; [[https://golang.org/doc/go1.5#trace_command][execution trace profiling]].
Gives insight into dynamic execution of a program.
Captures with nanosecond precision:
- goroutine creation/start/end
- goroutine blocking/unblocking
- network blocking
- system calls
- GC events
Execution traces are essentially undocumented ☹️, see [[https://github.com/golang/go/issues/16526][github/go#16526]]
* go tool trace (cont.)
Generating an execution trace
% cd $(go env GOROOT)/test/bench/go1
% go test -bench=HTTPClientServer -trace=/tmp/t.p
Viewing the trace:
% go tool trace /tmp/t.p
2016/08/13 17:01:04 Parsing trace...
2016/08/13 17:01:04 Serializing trace...
2016/08/13 17:01:05 Splitting trace...
2016/08/13 17:01:06 Opening browser
Bonus: [[https://github.com/pkg/profile/releases/tag/v1.2.0][`github.com/pkg/profile`]] supports generating trace profiles.
defer profile.Start(profile.TraceProfile).Stop()
* Exercises
- Create a trace profile of your application using `go`tool`trace` and [[https://github.com/pkg/profile/releases/tag/v1.2.0][`github.com/pkg/profile`]].
* Compiler optimisations
* Compiler optimisations
This section gives a brief background on three important optimisations that the Go compiler performs.
- Escape analysis
- Inlining
- Dead code elimination
These are all handled in the front end of the compiler, while the code is still in its AST form; then the code is passed to the SSA compiler for further optimisation.
* Escape analysis
A compliant Go implementation _could_ store every allocation on the heap, but that would put a lot of pressure on the gc.
However, the stack exists as a cheap place to store local variables; there is no need to garbage collect things on the stack.
In some languages, like C and C++, stack/heap allocation is manual, and a common cause of memory corruption bugs.
In Go, the compiler automatically moves local values to the heap if they live beyond the lifetime of the function call. It is said that the value _escapes_ to the heap.
But the compiler can also do the opposite; it can find things which would be assumed to be allocated on the heap, `new`, `make`, etc, and move them to the stack.
* Escape analysis (example)
Sum adds the ints between 1 and 100 and returns the result.
.code examples/esc/sum.go /START OMIT/,/END OMIT/
.caption examples/esc/sum.go
Because the numbers slice is only referenced inside Sum, the compiler will arrange to store the 100 integers for that slice on the stack, rather than the heap. There is no need to garbage collect `numbers`, it is automatically free'd when `Sum` returns.
* Investigating escape analysis
Prove it!
To print the compiler's escape analysis decisions, use the `-m` flag.
% go build -gcflags=-m examples/esc/sum.go
# command-line-arguments
examples/esc/sum.go:10: Sum make([]int, 100) does not escape
examples/esc/sum.go:25: Sum() escapes to heap
examples/esc/sum.go:25: main ... argument does not escape
Line 10 shows the compiler has correctly deduced that the result of `make([]int,`100)` does not escape to the heap.
We'll come back to line 25 soon.
* Escape analysis (example)
This example is a little contrived.
.code examples/esc/center.go /START OMIT/,/END OMIT/
`NewPoint` creates a new `*Point` value, `p`. We pass `p` to the `Center` function which moves the point to a position in the center of the screen. Finally we print the values of `p.X` and `p.Y`.
* Escape analysis (example)
% go build -gcflags=-m examples/esc/center.go
# command-line-arguments
examples/esc/center.go:12: can inline Center
examples/esc/center.go:19: inlining call to Center
examples/esc/center.go:12: Center p does not escape
examples/esc/center.go:20: p.X escapes to heap
examples/esc/center.go:20: p.Y escapes to heap
examples/esc/center.go:18: NewPoint new(Point) does not escape
examples/esc/center.go:20: NewPoint ... argument does not escape
Even though `p` was allocated with the `new` function, it will not be stored on the heap, because no reference to `p` escapes the `Center` function.
Question: What about line 20, if `p` doesn't escape, what is escaping to the heap?
examples/esc/center.go:20: p.X escapes to heap
examples/esc/center.go:20: p.Y escapes to heap
.link https://github.com/golang/go/issues/7714 Escape analysis is not perfect
* Exercise
- What is happening on line 25? Open up `examples/esc/sum.go` and see.
- Write a benchmark to prove that `Sum` does not allocate
* Inlining
In Go, function calls have a fixed overhead; a stack and preemption check.
Some of this is ameliorated by hardware branch predictors, but it's still a cost in terms of function size and clock cycles.
Inlining is the classical optimisation to avoid these costs.
Inlining only works on _leaf_functions_, a function that does not call another. The justification for this is:
- If your function does a lot of work, then the preamble overhead will be negligible. That's why functions over a certain size (currently measured as a count of instructions) are not inlined, along with functions containing certain operations which prevent inlining altogether (eg. `switch` before Go 1.7).
- Small functions on the other hand pay a fixed overhead for a relatively small amount of useful work performed. These are the functions that inlining targets as they benefit the most.
The other reason is that inlining makes stack traces harder to follow.
* Inlining (example)
.play examples/max/max.go /START OMIT/,/END OMIT/
* Inlining (cont.)
Again we use the `-m` flag to view the compiler's optimisation decisions.
% go build -gcflags=-m examples/max/max.go
# command-line-arguments
examples/max/max.go:4: can inline Max
examples/max/max.go:13: inlining call to Max
Compile `max.go` and see what the optimised version of `F()` became.
DEMO: `go`build`-gcflags="-m`-S"`examples/max/max.go`2>&1`|`less`
* Discussion
- Why did I declare `a` and `b` in `F()` to be constants?
- What happens if they are variables?
- What happens if they are passed into `F()` as parameters?
* Dead code elimination
Why is it important that `a` and `b` are constants?
After inlining, this is what the compiler saw
.play examples/max/max2.go /START OMIT/,/END OMIT/
- The call to `Max` has been inlined.
- If `a`>`b` then there is nothing to do, so the function returns.
- If `a`<`b` then the branch is false and we fall through to `panic`
- But, because `a` and `b` are constants, we know that the branch will never be false, so the compiler can optimise `F()` to a return.
* Dead code elimination (cont.)
Dead code elimination works together with inlining to reduce the amount of code generated by removing loops and branches that are proven unreachable.
You can take advantage of this to implement expensive debugging, and hide it behind
const debug = false
Combined with build tags this can be very useful.
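A minimal sketch of the pattern (the function and names are illustrative):

    package main

    import "log"

    const debug = false

    func process(data []byte) {
        if debug {
            // When debug is false this branch is provably unreachable, so the
            // compiler removes it and the log call costs nothing at run time.
            log.Printf("processing %d bytes", len(data))
        }
        // ... the real work ...
    }

    func main() {
        process([]byte("hello"))
    }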
Further reading:
.link http://dave.cheney.net/2014/09/28/using-build-to-switch-between-debug-and-release Using // +build to switch between debug and release builds
.link http://dave.cheney.net/2013/10/12/how-to-use-conditional-compilation-with-the-go-build-tool How to use conditional compilation with the go build tool
* Compiler flags Exercises
Compiler flags are provided with:
go build -gcflags=$FLAGS
Investigate the operation of the following compiler functions:
- `-S` prints the (Go flavoured) assembly of the _package_ being compiled.
- `-l` controls the behaviour of the inliner; `-l` disables inlining, `-l`-l` increases it (more `-l` 's increases the compiler's appetite for inlining code). Experiment with the difference in compile time, program size, and run time.
- `-m` controls printing of optimisation decision like inlining, escape analysis. `-m`-m` prints more details about what the compiler was thinking.
- `-l`-N` disables all optimisations.
.link http://go-talks.appspot.com/github.com/rakyll/talks/gcinspect/talk.slide#1 Further reading: Codegen Inspection by Jaana Burcu Dogan
Perhaps we shall take a break now.
* Memory management and GC tuning
* Memory management and GC tuning
Go is a garbage collected language. This is a design principle, it will not change.
As a garbage collected language, the performance of Go programs is often determined by their interaction with the garbage collector.
Next to your choice of algorithms, memory consumption is the most important factor that determines the performance and scalability of your application.
This section discusses the operation of the garbage collector, how to measure the memory usage of your program and strategies for lowering memory usage if garbage collector performance is a bottleneck.
* Garbage collector world view
The purpose of a garbage collector is to present the illusion that there is an infinite amount of memory available to the program.
You may disagree with this statement, but this is the base assumption of how garbage collector designers think.
# A stop the world, mark sweep GC is the most efficient in terms of total run time; good for batch processing, simulation, etc.
The Go GC is designed for low latency servers and interactive applications.
The Go GC favors _lower_latency_ over _maximum_throughput_; it moves some of the allocation cost to the mutator to reduce the cost of cleanup later.
* Garbage collector design
The design of the Go GC has changed over the years
- Go 1.0, stop the world mark sweep collector based heavily on tcmalloc.
- Go 1.3, fully precise collector, wouldn't mistake big numbers on the heap for pointers, thus leaking memory.
- Go 1.5, new GC design, focusing on _latency_ over _throughput_.
- Go 1.6, GC improvements, handling larger heaps with lower latency.
- Go 1.7, small GC improvements, mainly refactoring.
- Go 1.8, further work to reduce STW times, now down to the 100 microsecond range.
- Go 1.9, ROC collector is an experiment to extend the idea of escape analysis per goroutine.
* Garbage collector monitoring
A simple way to obtain a general idea of how hard the garbage collector is working is to enable the output of GC logging.
These stats are always collected, but normally suppressed; you can enable their display by setting the `GODEBUG` environment variable.
% env GODEBUG=gctrace=1 godoc -http=:8080
gc 1 @0.017s 8%: 0.021+3.2+0.10+0.15+0.86 ms clock, 0.043+3.2+0+2.2/0.002/0.009+1.7 ms cpu, 5->6->1 MB, 4 MB goal, 4 P
gc 2 @0.026s 12%: 0.11+4.9+0.12+1.6+0.54 ms clock, 0.23+4.9+0+3.0/0.50/0+1.0 ms cpu, 4->6->3 MB, 6 MB goal, 4 P
gc 3 @0.035s 14%: 0.031+3.3+0.76+0.17+0.28 ms clock, 0.093+3.3+0+2.7/0.012/0+0.84 ms cpu, 4->5->3 MB, 3 MB goal, 4 P
gc 4 @0.042s 17%: 0.067+5.1+0.15+0.29+0.95 ms clock, 0.20+5.1+0+3.0/0/0.070+2.8 ms cpu, 4->5->4 MB, 4 MB goal, 4 P
gc 5 @0.051s 21%: 0.029+5.6+0.33+0.62+1.5 ms clock, 0.11+5.6+0+3.3/0.006/0.002+6.0 ms cpu, 5->6->4 MB, 5 MB goal, 4 P
gc 6 @0.061s 23%: 0.080+7.6+0.17+0.22+0.45 ms clock, 0.32+7.6+0+5.4/0.001/0.11+1.8 ms cpu, 6->6->5 MB, 7 MB goal, 4 P
gc 7 @0.071s 25%: 0.59+5.9+0.017+0.15+0.96 ms clock, 2.3+5.9+0+3.8/0.004/0.042+3.8 ms cpu, 6->8->6 MB, 8 MB goal, 4 P
The trace output gives a general measure of GC activity.
DEMO: Show `godoc` with `GODEBUG=gctrace=1` enabled
_Recommendation_: use this env var in production, it has no performance impact.
* Garbage collector monitoring (cont.)
Using `GODEBUG=gctrace=1` is good when you _know_ there is a problem, but for general telemetry on your Go application I recommend the `net/http/pprof` interface.
import _ "net/http/pprof"
Importing the `net/http/pprof` package will register a handler at `/debug/pprof` with various runtime metrics, including:
- A list of all the running goroutines, `/debug/pprof/goroutine?debug=1`.
- A report on the memory allocation statistics, `/debug/pprof/heap?debug=1`.
*Warning*: `net/http/pprof` will register itself with your default `http.ServeMux`.
Be careful as this will be visible if you use `http.ListenAndServe(address,`nil)`.
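One way to limit the exposure, sketched below with illustrative port numbers, is to serve the debug handlers only on a loopback address and give your application handlers their own `http.ServeMux`:

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
    )

    func main() {
        // Debug handlers are only reachable from the local machine.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        // Application handlers live on their own mux on the public port.
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("hello"))
        })
        log.Fatal(http.ListenAndServe(":8080", mux))
    }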
DEMO: `godoc`-http=:8080`, show `/debug/pprof`.
* Garbage collector tuning
The Go runtime provides one environment variable to tune the GC, `GOGC`.
The formula for GOGC is as follows.
goal = reachable * (1 + GOGC/100)
For example, if we currently have a 256MB heap, and `GOGC=100` (the default), when the heap fills up it will grow to
512MB = 256MB * (1 + 100/100)
- Values of `GOGC` greater than 100 cause the heap to grow faster, reducing the pressure on the GC.
- Values of `GOGC` less than 100 cause the heap to grow more slowly, increasing the pressure on the GC.
The default value of 100 is _just_a_guide_. You should choose your own value _after_profiling_your_application_with_production_loads_.
* Reduce allocations
Make sure your APIs allow the caller to reduce the amount of garbage generated.
Consider these two Read methods
func (r *Reader) Read() ([]byte, error)
func (r *Reader) Read(buf []byte) (int, error)
The first Read method takes no arguments and returns some data as a `[]byte`. The second takes a `[]byte` buffer and returns the number of bytes read.
The first Read method will _always_ allocate a buffer, putting pressure on the GC. The second fills the buffer it was given.
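From the caller's side, the second style lets a single buffer be reused across reads. A sketch (the `process` callback is hypothetical):

    package reader

    import "io"

    // consume drains r using one reusable buffer, so the read loop itself
    // performs no per-iteration allocations.
    func consume(r io.Reader, process func([]byte)) error {
        buf := make([]byte, 4096)
        for {
            n, err := r.Read(buf)
            if n > 0 {
                process(buf[:n])
            }
            if err == io.EOF {
                return nil
            }
            if err != nil {
                return err
            }
        }
    }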
_Exercise_: Can you name examples in the std lib which follow this pattern?
* strings and []bytes
In Go `string` values are immutable, `[]byte` are mutable.
Most programs prefer to work with `string`, but most IO is done with `[]byte`.
Avoid `[]byte` to `string` conversions wherever possible; this normally means picking one representation, either a `string` or a `[]byte`, for a value. Often this will be `[]byte` if you read the data from the network or disk.
The [[https://golang.org/pkg/bytes/][`bytes`]] package contains many of the same operations— `Split`, `Compare`, `HasPrefix`, `Trim`, etc—as the [[https://golang.org/pkg/strings/][`strings`]] package.
Under the hood `strings` uses the same assembly primitives as the `bytes` package.
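A small sketch of working on the `[]byte` directly with the `bytes` package rather than converting it to a `string` first (the input is illustrative):

    package main

    import (
        "bytes"
        "fmt"
    )

    func main() {
        line := []byte("GET /index.html HTTP/1.1") // e.g. read from the network

        // No string conversion (and therefore no copy) is needed to inspect it.
        if bytes.HasPrefix(line, []byte("GET ")) {
            fields := bytes.Fields(line)
            fmt.Printf("path: %s\n", fields[1])
        }
    }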
* Using []byte as a map key
It is very common to use a `string` as a map key, but often you have a `[]byte`.
The compiler implements a specific optimisation for this case
var m map[string]string
v, ok := m[string(bytes)]
This will avoid the conversion of the byte slice to a string for the map lookup. This is very specific; it won't work if you do something like
key := string(bytes)
val, ok := m[key]
* Avoid string concatenation
Go strings are immutable. Concatenating two strings generates a third. Which of the following is fastest?
.code examples/concat/concat_test.go /START1 OMIT/,/END1 OMIT/
.code examples/concat/concat_test.go /START2 OMIT/,/END2 OMIT/
.code examples/concat/concat_test.go /START3 OMIT/,/END3 OMIT/
.code examples/concat/concat_test.go /START4 OMIT/,/END4 OMIT/
DEMO: `go`test`-bench=.`./examples/concat`
* Preallocate slices if the length is known
Append is convenient, but wasteful.
Slices grow by doubling up to 1024 elements, then by approximately 25% after that. What is the capacity of `b` after we append one more item to it?
.play examples/grow.go /START OMIT/,/END OMIT/
If you use the append pattern you could be copying a lot of data and creating a lot of garbage.
* Preallocate slices if the length is known (cont.)
If you know the length of the slice beforehand, then pre-allocate the target to avoid copying and to make sure the target is exactly the right size.
_Before:_
var s []string
for _, v := range fn() {
s = append(s, v)
}
return s
_After:_
vals := fn()
s := make([]string, len(vals))
for i, v := range vals {
s[i] = v
}
return s
* Using sync.Pool
The `sync` package comes with a `sync.Pool` type which is used to reuse common objects.
`sync.Pool` has no fixed size or maximum capacity. You add to it and take from it until a GC happens, then it is emptied unconditionally.
.code examples/pool.go /START OMIT/,/END OMIT/
*Warning*: `sync.Pool` is not a cache. It can and will be emptied _at_any_time_.
Do not place important items in a `sync.Pool`, they will be discarded.
_Personal_opinion_: `sync.Pool` is hard to use safely. Don't use `sync.Pool`.
* Exercises
- Using `godoc` (or another program) observe the results of changing `GOGC` using `GODEBUG=gctrace=1`.
- Benchmark map lookups using the `string(bytes)` key conversion.
- Benchmark allocs from different concat strategies.
* Concurrency
* Concurrency
Go's signature feature is its lightweight concurrency model.
While cheap, these features are not free, and their overuse often leads to unexpected performance problems.
This final section concludes with a set of dos and don'ts for efficient use of Go's concurrency primitives.
* Goroutines
The key feature of Go that makes it a great fit for modern hardware is goroutines.
Goroutines are so easy to use, and so cheap to create, you could think of them as _almost_ free.
The Go runtime has been written for programs with tens of thousands of goroutines as the norm, hundreds of thousands are not unexpected.
However, each goroutine does consume a minimum amount of memory for the goroutine's stack which is currently at least 2k.
2048 * 1,000,000 goroutines == 2GB of memory, and they haven't done anything yet.
# Maybe this is a lot, maybe it isn't given the other usages of your application
* Know when to stop a goroutine
Goroutines are cheap to start and cheap to run, but they do have a finite cost in terms of memory footprint; you cannot create an infinite number of them.
Every time you use the `go` keyword in your program to launch a goroutine, you must *know* how, and when, that goroutine will exit.
If you don't know the answer, that's a potential memory leak.
In your design, some goroutines may run until the program exits. These goroutines are rare enough to not become an exception to the rule.
*Never*start*a*goroutine*without*knowing*how*it*will*stop*.
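A minimal sketch of one way to do this, using a `done` channel that is closed to signal the goroutine to exit (names and timings are illustrative):

    package main

    import (
        "fmt"
        "time"
    )

    // worker does periodic work until the done channel is closed.
    func worker(done <-chan struct{}) {
        ticker := time.NewTicker(100 * time.Millisecond)
        defer ticker.Stop()
        for {
            select {
            case <-done:
                fmt.Println("worker: shutting down")
                return
            case <-ticker.C:
                // ... do a unit of work ...
            }
        }
    }

    func main() {
        done := make(chan struct{})
        go worker(done)

        time.Sleep(500 * time.Millisecond)
        close(done)                        // closing the channel is visible to every receiver
        time.Sleep(100 * time.Millisecond) // crude wait for the goroutine to exit (sketch only)
    }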
# * Know when to stop a goroutine (cont.)
#
#TODO SHOW HOW TO STOP A GOROUTINE USING A DONE CHANNEL
* Go uses efficient network polling for some requests
The Go runtime handles network IO using an efficient operating system polling mechanism (kqueue, epoll, windows IOCP, etc). Many waiting goroutines will be serviced by a single operating system thread.
However, for local file IO, Go does not implement any IO polling. Each operation on a `*os.File` consumes one operating system thread while in progress.
Heavy use of local file IO can cause your program to spawn hundreds or thousands of threads; possibly more than your operating system allows.
Your disk subsystem does not expect to be able to handle hundreds or thousands of concurrent IO requests.
* io.Reader and io.Writer are not buffered
`io.Reader` and `io.Writer` implementations are not buffered.
This includes `net.Conn` and `os.Stdout`.
Use `bufio.NewReader(r)` and `bufio.NewWriter(w)` to get a buffered reader and writer.
Don't forget to `Flush` or `Close` your buffered writers to flush the buffer to the underlying `Writer`.
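A short sketch; without the final `Flush` the last buffered bytes may never reach the underlying writer:

    package main

    import (
        "bufio"
        "log"
        "os"
    )

    func main() {
        w := bufio.NewWriter(os.Stdout)

        for i := 0; i < 1000; i++ {
            // Small writes are coalesced in the buffer rather than becoming
            // 1000 separate write system calls.
            w.WriteString("hello, world\n")
        }

        if err := w.Flush(); err != nil {
            log.Fatal(err)
        }
    }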
* Watch out for IO multipliers in your application
If you're writing a server process, its primary job is to multiplex clients connected over the network, and data stored in your application.
Most server programs take a request, do some processing, then return a result. This sounds simple, but depending on the result it can let the client consume a large (possibly unbounded) amount of resources on your server. Here are some things to pay attention to:
- The number of IO requests per incoming request; how many IO events does a single client request generate? It might be on average 1, or possibly less than one if many requests are served out of a cache.
- The number of reads required to service a query; is it fixed, N+1, or linear (reading the whole table to generate the last page of results)?
If memory is slow, relatively speaking, then IO is so slow that you should avoid doing it at all costs. Most importantly avoid doing IO in the context of a request—don't make the user wait for your disk subsystem to write to disk, or even read.
* Use streaming IO interfaces
Wherever possible avoid reading data into a `[]byte` and passing it around.
Depending on the request you may end up reading megabytes (or more!) of data into memory. This places huge pressure on the GC, which will increase the average latency of your application.
Instead use `io.Reader` and `io.Writer` to construct processing pipelines to cap the amount of memory in use per request.
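A sketch of the streaming approach in an HTTP handler; `io.Copy` moves data through a small fixed-size buffer, so memory use per request stays constant regardless of the file's size (the path and port are illustrative):

    package main

    import (
        "io"
        "net/http"
        "os"
    )

    func serveData(w http.ResponseWriter, r *http.Request) {
        f, err := os.Open("/var/data/large.bin")
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        defer f.Close()
        io.Copy(w, f) // the whole file is never held in memory at once
    }

    func main() {
        http.HandleFunc("/data", serveData)
        http.ListenAndServe(":8080", nil)
    }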