<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>Basics of Parallel Processing in R, Python, Matlab, and C</title>
<script type="text/javascript">
window.onload = function() {
var imgs = document.getElementsByTagName('img'), i, img;
for (i = 0; i < imgs.length; i++) {
img = imgs[i];
// center an image if it is the only element of its parent
if (img.parentElement.childElementCount === 1)
img.parentElement.style.textAlign = 'center';
}
};
</script>
<!-- Styles for R syntax highlighter -->
<style type="text/css">
pre .operator,
pre .paren {
color: rgb(104, 118, 135)
}
pre .literal {
color: #990073
}
pre .number {
color: #099;
}
pre .comment {
color: #998;
font-style: italic
}
pre .keyword {
color: #900;
font-weight: bold
}
pre .identifier {
color: rgb(0, 0, 0);
}
pre .string {
color: #d14;
}
</style>
<!-- R syntax highlighter -->
<script type="text/javascript">
var hljs=new function(){function m(p){return p.replace(/&/gm,"&").replace(/</gm,"<")}function f(r,q,p){return RegExp(q,"m"+(r.cI?"i":"")+(p?"g":""))}function b(r){for(var p=0;p<r.childNodes.length;p++){var q=r.childNodes[p];if(q.nodeName=="CODE"){return q}if(!(q.nodeType==3&&q.nodeValue.match(/\s+/))){break}}}function h(t,s){var p="";for(var r=0;r<t.childNodes.length;r++){if(t.childNodes[r].nodeType==3){var q=t.childNodes[r].nodeValue;if(s){q=q.replace(/\n/g,"")}p+=q}else{if(t.childNodes[r].nodeName=="BR"){p+="\n"}else{p+=h(t.childNodes[r])}}}if(/MSIE [678]/.test(navigator.userAgent)){p=p.replace(/\r/g,"\n")}return p}function a(s){var r=s.className.split(/\s+/);r=r.concat(s.parentNode.className.split(/\s+/));for(var q=0;q<r.length;q++){var p=r[q].replace(/^language-/,"");if(e[p]){return p}}}function c(q){var p=[];(function(s,t){for(var r=0;r<s.childNodes.length;r++){if(s.childNodes[r].nodeType==3){t+=s.childNodes[r].nodeValue.length}else{if(s.childNodes[r].nodeName=="BR"){t+=1}else{if(s.childNodes[r].nodeType==1){p.push({event:"start",offset:t,node:s.childNodes[r]});t=arguments.callee(s.childNodes[r],t);p.push({event:"stop",offset:t,node:s.childNodes[r]})}}}}return t})(q,0);return p}function k(y,w,x){var q=0;var z="";var s=[];function u(){if(y.length&&w.length){if(y[0].offset!=w[0].offset){return(y[0].offset<w[0].offset)?y:w}else{return w[0].event=="start"?y:w}}else{return y.length?y:w}}function t(D){var A="<"+D.nodeName.toLowerCase();for(var B=0;B<D.attributes.length;B++){var C=D.attributes[B];A+=" "+C.nodeName.toLowerCase();if(C.value!==undefined&&C.value!==false&&C.value!==null){A+='="'+m(C.value)+'"'}}return A+">"}while(y.length||w.length){var v=u().splice(0,1)[0];z+=m(x.substr(q,v.offset-q));q=v.offset;if(v.event=="start"){z+=t(v.node);s.push(v.node)}else{if(v.event=="stop"){var p,r=s.length;do{r--;p=s[r];z+=("</"+p.nodeName.toLowerCase()+">")}while(p!=v.node);s.splice(r,1);while(r<s.length){z+=t(s[r]);r++}}}}return z+m(x.substr(q))}function j(){function q(x,y,v){if(x.compiled){return}var u;var s=[];if(x.k){x.lR=f(y,x.l||hljs.IR,true);for(var w in x.k){if(!x.k.hasOwnProperty(w)){continue}if(x.k[w] instanceof Object){u=x.k[w]}else{u=x.k;w="keyword"}for(var r in u){if(!u.hasOwnProperty(r)){continue}x.k[r]=[w,u[r]];s.push(r)}}}if(!v){if(x.bWK){x.b="\\b("+s.join("|")+")\\s"}x.bR=f(y,x.b?x.b:"\\B|\\b");if(!x.e&&!x.eW){x.e="\\B|\\b"}if(x.e){x.eR=f(y,x.e)}}if(x.i){x.iR=f(y,x.i)}if(x.r===undefined){x.r=1}if(!x.c){x.c=[]}x.compiled=true;for(var t=0;t<x.c.length;t++){if(x.c[t]=="self"){x.c[t]=x}q(x.c[t],y,false)}if(x.starts){q(x.starts,y,false)}}for(var p in e){if(!e.hasOwnProperty(p)){continue}q(e[p].dM,e[p],true)}}function d(B,C){if(!j.called){j();j.called=true}function q(r,M){for(var L=0;L<M.c.length;L++){if((M.c[L].bR.exec(r)||[null])[0]==r){return M.c[L]}}}function v(L,r){if(D[L].e&&D[L].eR.test(r)){return 1}if(D[L].eW){var M=v(L-1,r);return M?M+1:0}return 0}function w(r,L){return L.i&&L.iR.test(r)}function K(N,O){var M=[];for(var L=0;L<N.c.length;L++){M.push(N.c[L].b)}var r=D.length-1;do{if(D[r].e){M.push(D[r].e)}r--}while(D[r+1].eW);if(N.i){M.push(N.i)}return f(O,M.join("|"),true)}function p(M,L){var N=D[D.length-1];if(!N.t){N.t=K(N,E)}N.t.lastIndex=L;var r=N.t.exec(M);return r?[M.substr(L,r.index-L),r[0],false]:[M.substr(L),"",true]}function z(N,r){var L=E.cI?r[0].toLowerCase():r[0];var M=N.k[L];if(M&&M instanceof Array){return M}return false}function F(L,P){L=m(L);if(!P.k){return L}var r="";var O=0;P.lR.lastIndex=0;var M=P.lR.exec(L);while(M){r+=L.substr(O,M.index-O);var 
N=z(P,M);if(N){x+=N[1];r+='<span class="'+N[0]+'">'+M[0]+"</span>"}else{r+=M[0]}O=P.lR.lastIndex;M=P.lR.exec(L)}return r+L.substr(O,L.length-O)}function J(L,M){if(M.sL&&e[M.sL]){var r=d(M.sL,L);x+=r.keyword_count;return r.value}else{return F(L,M)}}function I(M,r){var L=M.cN?'<span class="'+M.cN+'">':"";if(M.rB){y+=L;M.buffer=""}else{if(M.eB){y+=m(r)+L;M.buffer=""}else{y+=L;M.buffer=r}}D.push(M);A+=M.r}function G(N,M,Q){var R=D[D.length-1];if(Q){y+=J(R.buffer+N,R);return false}var P=q(M,R);if(P){y+=J(R.buffer+N,R);I(P,M);return P.rB}var L=v(D.length-1,M);if(L){var O=R.cN?"</span>":"";if(R.rE){y+=J(R.buffer+N,R)+O}else{if(R.eE){y+=J(R.buffer+N,R)+O+m(M)}else{y+=J(R.buffer+N+M,R)+O}}while(L>1){O=D[D.length-2].cN?"</span>":"";y+=O;L--;D.length--}var r=D[D.length-1];D.length--;D[D.length-1].buffer="";if(r.starts){I(r.starts,"")}return R.rE}if(w(M,R)){throw"Illegal"}}var E=e[B];var D=[E.dM];var A=0;var x=0;var y="";try{var s,u=0;E.dM.buffer="";do{s=p(C,u);var t=G(s[0],s[1],s[2]);u+=s[0].length;if(!t){u+=s[1].length}}while(!s[2]);if(D.length>1){throw"Illegal"}return{r:A,keyword_count:x,value:y}}catch(H){if(H=="Illegal"){return{r:0,keyword_count:0,value:m(C)}}else{throw H}}}function g(t){var p={keyword_count:0,r:0,value:m(t)};var r=p;for(var q in e){if(!e.hasOwnProperty(q)){continue}var s=d(q,t);s.language=q;if(s.keyword_count+s.r>r.keyword_count+r.r){r=s}if(s.keyword_count+s.r>p.keyword_count+p.r){r=p;p=s}}if(r.language){p.second_best=r}return p}function i(r,q,p){if(q){r=r.replace(/^((<[^>]+>|\t)+)/gm,function(t,w,v,u){return w.replace(/\t/g,q)})}if(p){r=r.replace(/\n/g,"<br>")}return r}function n(t,w,r){var x=h(t,r);var v=a(t);var y,s;if(v){y=d(v,x)}else{return}var q=c(t);if(q.length){s=document.createElement("pre");s.innerHTML=y.value;y.value=k(q,c(s),x)}y.value=i(y.value,w,r);var u=t.className;if(!u.match("(\\s|^)(language-)?"+v+"(\\s|$)")){u=u?(u+" "+v):v}if(/MSIE [678]/.test(navigator.userAgent)&&t.tagName=="CODE"&&t.parentNode.tagName=="PRE"){s=t.parentNode;var p=document.createElement("div");p.innerHTML="<pre><code>"+y.value+"</code></pre>";t=p.firstChild.firstChild;p.firstChild.cN=s.cN;s.parentNode.replaceChild(p.firstChild,s)}else{t.innerHTML=y.value}t.className=u;t.result={language:v,kw:y.keyword_count,re:y.r};if(y.second_best){t.second_best={language:y.second_best.language,kw:y.second_best.keyword_count,re:y.second_best.r}}}function o(){if(o.called){return}o.called=true;var r=document.getElementsByTagName("pre");for(var p=0;p<r.length;p++){var q=b(r[p]);if(q){n(q,hljs.tabReplace)}}}function l(){if(window.addEventListener){window.addEventListener("DOMContentLoaded",o,false);window.addEventListener("load",o,false)}else{if(window.attachEvent){window.attachEvent("onload",o)}else{window.onload=o}}}var 
e={};this.LANGUAGES=e;this.highlight=d;this.highlightAuto=g;this.fixMarkup=i;this.highlightBlock=n;this.initHighlighting=o;this.initHighlightingOnLoad=l;this.IR="[a-zA-Z][a-zA-Z0-9_]*";this.UIR="[a-zA-Z_][a-zA-Z0-9_]*";this.NR="\\b\\d+(\\.\\d+)?";this.CNR="\\b(0[xX][a-fA-F0-9]+|(\\d+(\\.\\d*)?|\\.\\d+)([eE][-+]?\\d+)?)";this.BNR="\\b(0b[01]+)";this.RSR="!|!=|!==|%|%=|&|&&|&=|\\*|\\*=|\\+|\\+=|,|\\.|-|-=|/|/=|:|;|<|<<|<<=|<=|=|==|===|>|>=|>>|>>=|>>>|>>>=|\\?|\\[|\\{|\\(|\\^|\\^=|\\||\\|=|\\|\\||~";this.ER="(?![\\s\\S])";this.BE={b:"\\\\.",r:0};this.ASM={cN:"string",b:"'",e:"'",i:"\\n",c:[this.BE],r:0};this.QSM={cN:"string",b:'"',e:'"',i:"\\n",c:[this.BE],r:0};this.CLCM={cN:"comment",b:"//",e:"$"};this.CBLCLM={cN:"comment",b:"/\\*",e:"\\*/"};this.HCM={cN:"comment",b:"#",e:"$"};this.NM={cN:"number",b:this.NR,r:0};this.CNM={cN:"number",b:this.CNR,r:0};this.BNM={cN:"number",b:this.BNR,r:0};this.inherit=function(r,s){var p={};for(var q in r){p[q]=r[q]}if(s){for(var q in s){p[q]=s[q]}}return p}}();hljs.LANGUAGES.cpp=function(){var a={keyword:{"false":1,"int":1,"float":1,"while":1,"private":1,"char":1,"catch":1,"export":1,virtual:1,operator:2,sizeof:2,dynamic_cast:2,typedef:2,const_cast:2,"const":1,struct:1,"for":1,static_cast:2,union:1,namespace:1,unsigned:1,"long":1,"throw":1,"volatile":2,"static":1,"protected":1,bool:1,template:1,mutable:1,"if":1,"public":1,friend:2,"do":1,"return":1,"goto":1,auto:1,"void":2,"enum":1,"else":1,"break":1,"new":1,extern:1,using:1,"true":1,"class":1,asm:1,"case":1,typeid:1,"short":1,reinterpret_cast:2,"default":1,"double":1,register:1,explicit:1,signed:1,typename:1,"try":1,"this":1,"switch":1,"continue":1,wchar_t:1,inline:1,"delete":1,alignof:1,char16_t:1,char32_t:1,constexpr:1,decltype:1,noexcept:1,nullptr:1,static_assert:1,thread_local:1,restrict:1,_Bool:1,complex:1},built_in:{std:1,string:1,cin:1,cout:1,cerr:1,clog:1,stringstream:1,istringstream:1,ostringstream:1,auto_ptr:1,deque:1,list:1,queue:1,stack:1,vector:1,map:1,set:1,bitset:1,multiset:1,multimap:1,unordered_set:1,unordered_map:1,unordered_multiset:1,unordered_multimap:1,array:1,shared_ptr:1}};return{dM:{k:a,i:"</",c:[hljs.CLCM,hljs.CBLCLM,hljs.QSM,{cN:"string",b:"'\\\\?.",e:"'",i:"."},{cN:"number",b:"\\b(\\d+(\\.\\d*)?|\\.\\d+)(u|U|l|L|ul|UL|f|F)"},hljs.CNM,{cN:"preprocessor",b:"#",e:"$"},{cN:"stl_container",b:"\\b(deque|list|queue|stack|vector|map|set|bitset|multiset|multimap|unordered_map|unordered_set|unordered_multiset|unordered_multimap|array)\\s*<",e:">",k:a,r:10,c:["self"]}]}}}();hljs.LANGUAGES.r={dM:{c:[hljs.HCM,{cN:"number",b:"\\b0[xX][0-9a-fA-F]+[Li]?\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\b\\d+(?:[eE][+\\-]?\\d*)?L\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\b\\d+\\.(?!\\d)(?:i\\b)?",e:hljs.IMMEDIATE_RE,r:1},{cN:"number",b:"\\b\\d+(?:\\.\\d*)?(?:[eE][+\\-]?\\d*)?i?\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\.\\d+(?:[eE][+\\-]?\\d*)?i?\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"keyword",b:"(?:tryCatch|library|setGeneric|setGroupGeneric)\\b",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\.\\.\\.",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\.\\.\\d+(?![\\w.])",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\b(?:function)",e:hljs.IMMEDIATE_RE,r:2},{cN:"keyword",b:"(?:if|in|break|next|repeat|else|for|return|switch|while|try|stop|warning|require|attach|detach|source|setMethod|setClass)\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"literal",b:"(?:NA|NA_integer_|NA_real_|NA_character_|NA_complex_)\\b",e:hljs.IMMEDIATE_RE,r:10},{cN:"literal",b:"(?:NULL|TRUE|FALSE|T|F|Inf|NaN)\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"ide
ntifier",b:"[a-zA-Z.][a-zA-Z0-9._]*\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"operator",b:"<\\-(?!\\s*\\d)",e:hljs.IMMEDIATE_RE,r:2},{cN:"operator",b:"\\->|<\\-",e:hljs.IMMEDIATE_RE,r:1},{cN:"operator",b:"%%|~",e:hljs.IMMEDIATE_RE},{cN:"operator",b:">=|<=|==|!=|\\|\\||&&|=|\\+|\\-|\\*|/|\\^|>|<|!|&|\\||\\$|:",e:hljs.IMMEDIATE_RE,r:0},{cN:"operator",b:"%",e:"%",i:"\\n",r:1},{cN:"identifier",b:"`",e:"`",r:0},{cN:"string",b:'"',e:'"',c:[hljs.BE],r:0},{cN:"string",b:"'",e:"'",c:[hljs.BE],r:0},{cN:"paren",b:"[[({\\])}]",e:hljs.IMMEDIATE_RE,r:0}]}};
hljs.initHighlightingOnLoad();
</script>
<!-- MathJax scripts -->
<script type="text/javascript" src="https://cdn.bootcss.com/mathjax/2.7.0/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<style type="text/css">
body, td {
font-family: sans-serif;
background-color: white;
font-size: 13px;
}
body {
max-width: 800px;
margin: auto;
padding: 1em;
line-height: 20px;
}
tt, code, pre {
font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
}
h1 {
font-size:2.2em;
}
h2 {
font-size:1.8em;
}
h3 {
font-size:1.4em;
}
h4 {
font-size:1.0em;
}
h5 {
font-size:0.9em;
}
h6 {
font-size:0.8em;
}
a:visited {
color: rgb(50%, 0%, 50%);
}
pre, img {
max-width: 100%;
}
pre {
overflow-x: auto;
}
pre code {
display: block; padding: 0.5em;
}
code {
font-size: 92%;
border: 1px solid #ccc;
}
code[class] {
background-color: #F8F8F8;
}
table, td, th {
border: none;
}
blockquote {
color:#666666;
margin:0;
padding-left: 1em;
border-left: 0.5em #EEE solid;
}
hr {
height: 0px;
border-bottom: none;
border-top-width: thin;
border-top-style: dotted;
border-top-color: #999999;
}
@media print {
* {
background: transparent !important;
color: black !important;
filter:none !important;
-ms-filter: none !important;
}
body {
font-size:12pt;
max-width:100%;
}
a, a:visited {
text-decoration: underline;
}
hr {
visibility: hidden;
page-break-before: always;
}
pre, blockquote {
padding-right: 1em;
page-break-inside: avoid;
}
tr, img {
page-break-inside: avoid;
}
img {
max-width: 100% !important;
}
@page :left {
margin: 15mm 20mm 15mm 10mm;
}
@page :right {
margin: 15mm 10mm 15mm 20mm;
}
p, h2, h3 {
orphans: 3; widows: 3;
}
h2, h3 {
page-break-after: avoid;
}
}
</style>
</head>
<body>
<h1>Basics of Parallel Processing in R, Python, Matlab, and C</h1>
<h2>Threaded linear algebra and parallel for loops in a shared memory (single machine) context</h2>
<p>Chris Paciorek, Department of Statistics, UC Berkeley</p>
<h1>0) This Tutorial</h1>
<p>This tutorial covers basic strategies for using parallel processing in R, Python, Matlab, and C on a single machine in which multiple processors (cores) share memory. </p>
<p>You should be able to replicate much of what is covered here provided you have R, Python, or Matlab installed on your computer, but some of the parallelization approaches may not work on Windows.</p>
<p>This tutorial was developed using a virtual machine developed here at Berkeley, the <a href="http://bce.berkeley.edu">Berkeley Common Environment (BCE)</a>. BCE is a virtual Linux machine - basically a Linux computer that you can run within your own computer, regardless of whether you are using Windows, Mac, or Linux. This provides a common environment so that things behave the same for all of us. You're welcome to try using BCE with this tutorial, but I should note that we haven't updated BCE recently and the VirtualBox software on which BCE runs can be a pain.</p>
<p>This tutorial assumes you have a working knowledge of either R, Python, Matlab, or C. </p>
<p>Materials for this tutorial, including the R markdown file and associated code files that were used to create this document are available on Github at <a href="https://github.com/berkeley-scf/tutorial-parallel-basics">https://github.com/berkeley-scf/tutorial-parallel-basics</a>. You can download the files by doing a git clone from a terminal window on a UNIX-like machine, as follows:</p>
<pre><code class="r">git clone https://github.com/berkeley-scf/tutorial-parallel-basics
</code></pre>
<p>To create this HTML document, simply compile the corresponding R Markdown file in R as follows (the following will work from within BCE after cloning the repository as above).</p>
<pre><code class="r">Rscript -e "library(knitr); knit2html('parallel-basics.Rmd')"
</code></pre>
<p>This tutorial by Christopher Paciorek is licensed under a Creative Commons Attribution 3.0 Unported License.</p>
<h1>1) Types of parallel processing</h1>
<p>There are two basic flavors of parallel processing (leaving aside
GPUs): distributed memory and shared memory. With shared memory, multiple
processors (which I'll call cores) share the same memory. With distributed
memory, you have multiple nodes, each with their own memory. You can
think of each node as a separate computer connected by a fast network. </p>
<h2>1.1) Some useful terminology:</h2>
<ul>
<li><em>cores</em>: We'll use this term to mean the different processing
units available on a single node.</li>
<li><em>nodes</em>: We'll use this term to mean the different computers,
each with their own distinct memory, that make up a cluster or supercomputer.</li>
<li><em>processes</em>: computational tasks executing on a machine; multiple
processes may be executing at once. A given program may start up multiple
processes at once. Ideally we have no more processes than cores on
a node.</li>
<li><em>threads</em>: multiple paths of execution within a single process;
the OS sees the threads as a single process, but one can think of
them as 'lightweight' processes. Ideally when considering the processes
and their threads, we would have no more processes and threads combined
than cores on a node.</li>
<li><em>forking</em>: child processes are spawned that are identical to
the parent, but with different process IDs and their own memory.</li>
<li><em>sockets</em>: some of R's parallel functionality involves creating
new R processes (e.g., starting processes via <em>Rscript</em>) and
communicating with them via a communication technology called sockets.</li>
</ul>
<h2>1.2) Shared memory</h2>
<p>For shared memory parallelism, each core is accessing the same memory
so there is no need to pass information (in the form of messages)
between different machines. But in some programming contexts one needs
to be careful that activity on different cores doesn't mistakenly
overwrite places in memory that are used by other cores.</p>
<p>We'll cover two types of shared memory parallelism approaches:</p>
<ul>
<li>threaded linear algebra </li>
<li>multicore functionality </li>
</ul>
<h3>Threading</h3>
<p>Threads are multiple paths of execution within a single process. If you are monitoring CPU
usage (such as with <em>top</em> in Linux or Mac) and watching a job that is executing threaded code, you'll
see the process using more than 100% of CPU. When this occurs, the
process is using multiple cores, although it appears as a single process
rather than as multiple processes.</p>
<p>Note that this is a different notion than a processor that is hyperthreaded. With hyperthreading a single core appears as two cores to the operating system.</p>
<h2>1.3) Distributed memory</h2>
<p>Parallel programming for distributed memory parallelism requires passing
messages between the different nodes. The standard protocol for doing
this is MPI, of which there are various implementations, including <em>Open MPI</em>.</p>
<p>The R package <em>Rmpi</em> implements MPI in R. The <em>pbd</em> packages for R also implement MPI as well as distributed linear algebra (linear algebra calculations across nodes). In addition, there are various ways to do simple parallelization of multiple computational tasks (across multiple nodes) that use MPI and other tools on the back-end without users needing to understand them.</p>
<p>Python has a package <em>mpi4py</em> that allows use of MPI within Python. And the IPython parallel tools allow one to do simple parallelization of multiple computational tasks across multiple nodes (as do other Python packages as well).</p>
<p>MATLAB has its own system for distributed computation, called the Distributed Computing Server (DCS), requiring additional licensing above the standard MATLAB installation. </p>
<p>This tutorial will not cover distributed memory parallelization; please see our <a href="https://github.com/berkeley-scf/tutorial-parallel-distributed">tutorial on parallel processing in a distributed context</a>.</p>
<h2>1.4) Other types of parallel processing</h2>
<p>We won't cover either of these in this material.</p>
<h3>GPUs</h3>
<p>GPUs (Graphics Processing Units) are processing units originally designed
for rendering graphics on a computer quickly. This is done by having
a large number of simple processing units for massively parallel calculation.
The idea of general purpose GPU (GPGPU) computing is to exploit this
capability for general computation. In spring 2016, I gave a <a href="http://statistics.berkeley.edu/computing/gpu">workshop on using GPUs</a>.</p>
<p>Most researchers don't program for a GPU directly but rather use software (often machine learning software such as Tensorflow or Caffe) that has been programmed to take advantage of a GPU if one is available.</p>
<h3>Spark and Hadoop</h3>
<p>Spark and Hadoop are systems for implementing computations in a distributed
memory environment, using the MapReduce approach. </p>
<h1>2) Threading, particularly for linear algebra</h1>
<h2>2.1) What is the BLAS?</h2>
<p>The BLAS is the library of basic linear algebra operations (written in
Fortran or C). A fast BLAS can greatly speed up linear algebra
relative to the default BLAS on a machine. Some fast BLAS libraries
are </p>
<ul>
<li>Intel's <em>MKL</em>; may be available for educational use for free</li>
<li><em>OpenBLAS</em> (formerly <em>GotoBLAS</em>); open source and free</li>
<li>AMD's <em>ACML</em>; free</li>
<li><em>vecLib</em> for Macs; provided with your Mac</li>
</ul>
<p>In addition to being fast when used on a single core, all of these BLAS libraries are
threaded - if your computer has multiple cores and there are free
resources, your linear algebra will use multiple cores, provided your
program is linked against the threaded BLAS installed on your machine and provided
OMP_NUM_THREADS is not set to one. (Macs make use of
VECLIB_MAXIMUM_THREADS rather than OMP_NUM_THREADS.)</p>
<p>On BCE, both R and (to some degree) Python are linked against OpenBLAS as of BCE-fall-2015. </p>
<h2>2.2) Using threading</h2>
<h3>2.2.1) R</h3>
<p>Threading in R is limited to linear algebra, provided R is linked against a threaded BLAS.</p>
<p>Here's some code that illustrates
the speed of using a threaded BLAS:</p>
<pre><code class="r"># install.packages('RhpcBLASctl')
library(RhpcBLASctl)
x <- matrix(rnorm(5000^2), 5000)
blas_set_num_threads(4)
system.time({
x <- crossprod(x)
U <- chol(x)
})
# user system elapsed
# 14.104 5.403 6.752
blas_set_num_threads(1)
system.time({
x <- crossprod(x)
U <- chol(x)
})
# user system elapsed
# 12.393 0.055 12.344
</code></pre>
<p>Here the elapsed time indicates that using four threads gave us roughly a two-fold (2x) speedup in terms of real time, while the user time indicates that the threaded calculation took a bit more total processing time (summed across all processors) because of the overhead of using multiple threads. </p>
<p>Note that the code also illustrates use of an R package that can control the number of threads from within R.</p>
<h3>2.2.2) Python</h3>
<p>As with R, threading in Python is limited to linear algebra, provided Python is linked against a threaded BLAS. Python has something
called the <a href="https://wiki.python.org/moin/GlobalInterpreterLock">Global Interpreter Lock</a>
that interferes with threading in Python (but not in threaded linear algebra packages called by Python). </p>
<p>Here's some linear algebra in Python that will use threading if <em>numpy</em> is linked against a threaded BLAS, though I don't compare the timing for different numbers of threads here. </p>
<pre><code class="python">import numpy as np
n = 5000
x = np.random.normal(0, 1, size=(n, n))
x = x.T.dot(x)
U = np.linalg.cholesky(x)
</code></pre>
<h3>2.2.3) MATLAB</h3>
<p>Many MATLAB functions are automatically threaded (not just linear
algebra), so you don't need to do anything special in your code to
take advantage of this.
So if you're running MATLAB and monitoring CPU usage (e.g., using
<em>top</em> on Linux or OS X), you may see a process using more than 100% of CPU. However
worker tasks within a <em>parfor</em> (see Section 3) use only a single thread.
MATLAB uses MKL for linear algebra. </p>
<h2>2.3) Fixing the number of threads (cores used)</h2>
<p>In general, threaded code will
detect the number of cores available on a machine and make use of
them. However, you can also explicitly control the number of threads
available to a process. </p>
<h3>2.3.1) R, Python, C, etc.</h3>
<p>For most threaded code (code based on the OpenMP protocol), the number
of threads can be set by setting the OMP_NUM_THREADS environment
variable (VECLIB_MAXIMUM_THREADS on a Mac). E.g., to set it for four
threads in bash:</p>
<p><code>export OMP_NUM_THREADS=4</code></p>
<p>Do this before starting your R or Python session or before running your compiled executable. </p>
<p>Alternatively, you can set OMP_NUM_THREADS as you invoke your job, e.g., here with R:</p>
<p><code>OMP_NUM_THREADS=4 R CMD BATCH --no-save job.R job.out</code></p>
<h3>2.3.2) MATLAB</h3>
<p>MATLAB is an exception. Threading in MATLAB can be controlled
in two ways. From within your MATLAB code you can set the number of
threads, e.g., to four in this case:</p>
<p><code>feature('numThreads', 4)</code></p>
<p>To use only a single thread, you can use 1 instead of 4 above, or
you can start MATLAB with the <em>singleCompThread</em> flag:</p>
<p><code>matlab -singleCompThread ...</code></p>
<h2>2.4) Important warnings about use of threaded BLAS</h2>
<h3>2.4.1) Speed and threaded BLAS</h3>
<p>In many cases, using multiple threads for linear algebra operations
will outperform using a single thread, but there is no guarantee that
this will be the case, in particular for operations with small matrices
and vectors. Testing with OpenBLAS suggests that sometimes a job may
take more time when using multiple threads; this seems to be less
likely with ACML. This presumably occurs because OpenBLAS does not do
a good job of detecting when the overhead of threading outweighs
the gains from distributing the computations. You can compare speeds
by setting OMP_NUM_THREADS to different values. In cases where threaded
linear algebra is slower than unthreaded, you would want to set OMP_NUM_THREADS
to 1. </p>
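<p>For example, here is a quick way to make that comparison from within R, using the <em>RhpcBLASctl</em> package shown earlier (a sketch only; timings will vary by machine and BLAS):</p>
<pre><code class="r">library(RhpcBLASctl)
x <- matrix(rnorm(2000^2), 2000)
for (nt in c(1, 2, 4)) {
  blas_set_num_threads(nt)
  cat(nt, 'thread(s), elapsed:',
      system.time(chol(crossprod(x)))['elapsed'], 'seconds\n')
}
</code></pre>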
<p>More generally, if you are using the parallel tools in Section 3 to
simultaneously carry out many independent calculations (tasks), it is
likely to be more effective to use the cores on your machine to split up the tasks,
one task per core, and to restrict each process to a single thread
rather than relying on the threaded BLAS. </p>
<h3>2.4.2) Conflict between OpenBLAS and some parallel functionality in R</h3>
<p>There are conflicts between forking in R and threaded BLAS that in
some cases have affected:</p>
<ul>
<li><em>foreach</em> (when using the <em>parallel</em> (and <em>multicore</em>) backends),</li>
<li><em>mclapply</em>, and</li>
<li><em>parLapply</em> and <em>parSapply</em> (only if <em>cluster</em> is set up with forking – not the default).</li>
</ul>
<p>The result is that if linear algebra is used within your parallel
code, R hangs. This has affected both openBLAS and ACML in the past,
though it may not affect current versions of these software.</p>
<p>If you find your R session hanging, you can set OMP_NUM_THREADS to 1
before running an R job that does linear algebra, to prevent the BLAS from doing
threaded calculations. Alternatively, you can use MPI as the parallel
backend (via <em>doMPI</em> in place of <em>doParallel</em>).
You may also be able to convert your code to use <em>par{L,S}apply</em>
with the default PSOCK type and avoid <em>foreach</em> entirely.</p>
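<p>Here is a minimal sketch of the first workaround; using <em>RhpcBLASctl</em> from within R is my assumption here - you could equally set OMP_NUM_THREADS in the shell before starting R:</p>
<pre><code class="r">library(parallel)
library(RhpcBLASctl)
blas_set_num_threads(1)  # restrict the BLAS to one thread before forking
res <- mclapply(seq_len(10), function(i) {
  x <- crossprod(matrix(rnorm(500^2), 500))
  sum(diag(chol(x)))     # linear algebra inside the forked workers
}, mc.cores = 4)
</code></pre>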
<h3>2.4.3) Conflict between threaded BLAS and R profiling</h3>
<p>There is also a conflict between threaded BLAS and R profiling, so
if you are using <em>Rprof</em>, you may need to set OMP_NUM_THREADS
to one. This has definitely occurred with OpenBLAS; I'm not sure about
other threaded BLAS libraries.</p>
<p><strong>Caution</strong>: Note that I don't pay any attention to possible
danger in generating random numbers in separate processes in this Section. More on
this issue in Section 4.</p>
<h2>2.5) Using an optimized BLAS on your own machine(s)</h2>
<p>To use an optimized BLAS with R, talk to your systems administrator, see <a href="https://cran.r-project.org/manuals.html">Section A.3 of the R Installation and Administration Manual</a>, or see <a href="http://statistics.berkeley.edu/computing/blas">these instructions to use <em>vecLib</em> BLAS from Apple's Accelerate framework on your own Mac</a>.</p>
<p>It's also possible to use an optimized BLAS with Python's numpy and scipy packages, on either Linux or using the Mac's <em>vecLib</em> BLAS. Details will depend on how you install Python, numpy, and scipy. </p>
<h1>3) Basic parallelized loops/maps/apply</h1>
<p>All of the functionality discussed here applies <em>only</em> if the iterations/loops of your calculations can be done completely separately and do not depend on one another. This scenario is called an <em>embarrassingly parallel</em> computation. So coding up the evolution of a time series or a Markov chain is not possible using these tools. However, bootstrapping, random forests, simulation studies, cross-validation
and many other statistical methods can be handled in this way.</p>
<h2>3.1) Parallel loops and <em>apply</em> functions in R</h2>
<h3>3.1.1) Parallel for loops with <em>foreach</em></h3>
<p>A simple way to exploit parallelism in R is to use the <em>foreach</em> package to do a for loop in parallel.</p>
<p>The <em>foreach</em> package provides a <em>foreach</em> command that
allows you to do this easily. <em>foreach</em> can use a variety of
parallel "back-ends". For our purposes, the main back-end is the <em>parallel</em> package,
which uses shared memory cores. When using <em>parallel</em> as the
back-end, you should see multiple processes (as many as you registered;
ideally each at 100%) when you monitor CPU usage. The multiple processes
are created by forking or using sockets. <em>foreach</em> can also use <em>Rmpi</em> or <em>SNOW</em> to access cores in
a distributed memory setting; please see the tutorial on distributed parallel processing mentioned above.</p>
<p>Here we'll parallelize leave-one-out cross-validation for a random forest model. An iteration involves holding out a data point, fitting the model with all the other data points, and then predicting the held-out point.</p>
<pre><code class="r">library(doParallel) # uses parallel package, a core R package
</code></pre>
<pre><code>## Loading required package: foreach
</code></pre>
<pre><code>## Loading required package: iterators
</code></pre>
<pre><code>## Loading required package: parallel
</code></pre>
<pre><code class="r">source('rf.R') # loads in data and looFit()
</code></pre>
<pre><code>## randomForest 4.6-12
</code></pre>
<pre><code>## Type rfNews() to see new features/changes/bug fixes.
</code></pre>
<pre><code class="r">looFit
</code></pre>
<pre><code>## function (i, Y, X, loadLib = FALSE)
## {
## if (loadLib)
## library(randomForest)
## out <- randomForest(y = Y[-i], x = X[-i, ], xtest = X[i,
## ])
## return(out$test$predicted)
## }
</code></pre>
<pre><code class="r">nCores <- 4 # to set manually
registerDoParallel(nCores)
nSub <- 30 # do only first 30 for illustration
result <- foreach(i = 1:nSub) %dopar% {
cat('Starting ', i, 'th job.\n', sep = '')
output <- looFit(i, Y, X)
cat('Finishing ', i, 'th job.\n', sep = '')
output # this will become part of the out object
}
print(result[1:5])
</code></pre>
<pre><code>## [[1]]
## 1
## 1.524218
##
## [[2]]
## 2
## 0.2154531
##
## [[3]]
## 3
## 1.056725
##
## [[4]]
## 4
## -0.2916539
##
## [[5]]
## 5
## -0.2747813
</code></pre>
<p>(Note that the printed statements from <code>cat</code> are not showing up in the creation of this document but should show if you run the code.)</p>
<p>Note that <em>foreach</em>
also provides functionality for collecting and managing
the results to avoid some of the bookkeeping
you would need to do if writing your own standard for loop.
The result of <em>foreach</em> will generally be a list, unless
we request that the results be combined in a different way via the <code>.combine</code> argument, as in the sketch below.</p>
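<p>For example, to get the results back as a vector rather than a list, we could rewrite the loop above as follows:</p>
<pre><code class="r">result <- foreach(i = 1:nSub, .combine = c) %dopar% {
  looFit(i, Y, X)
}
</code></pre>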
<p>You can debug by running serially using <em>%do%</em> rather than
<em>%dopar%</em>. Note that you may need to load packages within the
<em>foreach</em> construct to ensure a package is available to all of
the calculations.</p>
<p>It is possible to use foreach to parallelize over nested loops. Suppose that the outer loop has too few tasks to parallelize over effectively and you want to parallelize over the inner loop as well. Provided the calculations in each task (defined by the pair of indexes from the two loops) are independent of the other tasks, you can define two foreach loops, with the outer foreach using the <code>%:%</code> operator and the inner foreach using the usual <code>%dopar%</code> operator, as shown below. More details can be found <a href="https://cran.r-project.org/web/packages/foreach/vignettes/nested.pdf">in this vignette</a>. </p>
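<p>Here is a minimal sketch of that nested pattern; the arithmetic is just a stand-in for a real per-task computation:</p>
<pre><code class="r">library(doParallel)
registerDoParallel(4)
out <- foreach(i = 1:3, .combine = rbind) %:%
  foreach(j = 1:4, .combine = c) %dopar% {
    i * j  # stand-in for a computation depending on the (i, j) pair
  }
out  # a 3 x 4 matrix of results
</code></pre>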
<h3>3.1.2) Parallel apply functionality</h3>
<p>The <em>parallel</em> package has the ability to parallelize the various
<em>apply</em> functions (apply, lapply, sapply, etc.). It's a bit hard to find the <a href="http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf">vignette for the parallel package</a>
because parallel is not listed as one of
the contributed packages on CRAN (it gets installed with R by default).</p>
<p>We'll consider parallel lapply and sapply. These rely on having started a cluster using <em>makeCluster</em>, which by default uses the PSOCK mechanism as in the SNOW package - starting new worker processes via <em>Rscript</em>
and communicating with them via a technology called sockets.</p>
<pre><code class="r">library(parallel)
nCores <- 4 # to set manually
cl <- makeCluster(nCores)
nSub <- 30
input <- seq_len(nSub) # same as 1:nSub but more robust
# clusterExport(cl, c('x', 'y')) # if the processes need objects
# from master's workspace (not needed here as no global vars used)
# need to load randomForest package within function
# when using par{L,S}apply
system.time(
res <- parSapply(cl, input, looFit, Y, X, TRUE)
)
</code></pre>
<pre><code>## user system elapsed
## 0.008 0.004 23.350
</code></pre>
<pre><code class="r">system.time(
res2 <- sapply(input, looFit, Y, X)
)
</code></pre>
<pre><code>## user system elapsed
## 84.296 0.036 84.321
</code></pre>
<pre><code class="r">res <- parLapply(cl, input, looFit, Y, X, TRUE)
</code></pre>
<p>Here the minuscule user time is probably because the time spent in the worker processes is not counted at the level of the overall master process that dispatches the workers.</p>
<p>For help with these functions and additional related parallelization functions (including <em>parApply</em>), see <code>help(clusterApply)</code>.</p>
<p><em>mclapply</em> is an alternative that uses forking to start up the worker processes.</p>
<pre><code class="r">system.time(
res <- mclapply(input, looFit, Y, X, mc.cores = nCores)
)
</code></pre>
<pre><code>## user system elapsed
## 84.060 0.160 22.516
</code></pre>
<p>Note that some R packages can directly interact with the parallelization
packages to work with multiple cores. E.g., the <em>boot</em> package
can make use of the <em>parallel</em> package directly. </p>
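<p>For example, <em>boot</em> can parallelize the bootstrap replicates itself via its <code>parallel</code> and <code>ncpus</code> arguments (a sketch with made-up data; use <code>parallel = 'snow'</code> on Windows):</p>
<pre><code class="r">library(boot)
meanFun <- function(d, idx) mean(d[idx])  # statistic to bootstrap
b <- boot(rnorm(1000), meanFun, R = 999, parallel = 'multicore', ncpus = 4)
</code></pre>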
<h3>3.1.3) Loading packages and accessing global variables within your parallel tasks</h3>
<p>Whether you need to explicitly load packages and export global variables from the master process to the parallelized worker processes depends on the details of how you are doing the parallelization.</p>
<p>With <em>foreach</em> with the <em>doParallel</em> backend, parallel <em>apply</em> statements (starting the cluster via <em>makeForkCluster</em>, instead of the usual <em>makeCluster</em>), and <em>mclapply</em>, packages and global variables in the main R process are automatically available to the worker tasks without any work on your part. This is because all of these approaches fork the original R process, thereby creating worker processes with the same state as the original R process. Interestingly, this means that global variables in the forked worker processes are just references to the objects in memory in the original R process. So the additional processes do not use additional memory for those objects (despite what is shown in <em>top</em>) and there is no time involved in making copies. However, if you modify objects in the worker processes then copies are made. </p>
<p>In contrast, with parallel <em>apply</em> statements when starting the cluster using the standard <em>makeCluster</em> (which sets up a so-called <em>PSOCK</em> cluster, starting the R worker processes via <em>Rscript</em>), one needs to load packages within the code that is executed in parallel. In addition one needs to use <em>clusterExport</em> to tell R which objects in the global environment should be available to the worker processes. This involves making as many copies of the objects as there are worker processes, so one can easily exceed the physical memory (RAM) on the machine if one has large objects, and the copying of large objects will take time. </p>
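<p>Here is a sketch of that PSOCK pattern, using the objects from the random forest example above:</p>
<pre><code class="r">library(parallel)
cl <- makeCluster(4)                       # a standard PSOCK cluster
clusterExport(cl, c('X', 'Y', 'looFit'))   # copy the objects to each worker
clusterEvalQ(cl, library(randomForest))    # load the package on each worker
res <- parSapply(cl, seq_len(nSub), function(i) looFit(i, Y, X))
stopCluster(cl)
</code></pre>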
<h2>3.2) Parallel looping in Python</h2>
<p>I'll cover IPython parallel functionality, which allows one to parallelize on a single machine (discussed here) or across multiple machines (see the tutorial on distributed memory parallelization). There are a variety of other approaches one could use, of which I discuss two (the <em>pp</em> and <em>multiprocessing</em> packages) in the Appendix.</p>
<p>First we need to start our worker engines.</p>
<pre><code class="bash">ipcluster start -n 4 &
sleep 45
</code></pre>
<p>Here we'll do the same random forest cross-validation as before. </p>
<pre><code class="python">import numpy as np
np.random.seed(0)
n = 500
p = 50
X = np.random.normal(0, 1, size = (n, p))
Y = X[: , 0] + pow(abs(X[:,1] * X[:,2]), 0.5) + X[:,1] - X[:,2] + np.random.normal(0, 1, n)
def looFit(index, Ylocal, Xlocal):
rf = rfr(n_estimators=100)
fitted = rf.fit(np.delete(Xlocal, index, axis = 0), np.delete(Ylocal, index))
pred = rf.predict(np.array([Xlocal[index, :]]))
return(pred[0])
from ipyparallel import Client
c = Client()
c.ids
dview = c[:]
dview.block = True
dview.apply(lambda : "Hello, World")
lview = c.load_balanced_view()
lview.block = True
dview.execute('from sklearn.ensemble import RandomForestRegressor as rfr')
dview.execute('import numpy as np')
mydict = dict(X = X, Y = Y, looFit = looFit)
dview.push(mydict)
nSub = 50 # for illustration only do a subset
# need a wrapper function because map() only operates on one argument
def wrapper(i):
return(looFit(i, Y, X))
import time
time.time()
pred = lview.map(wrapper, range(nSub))
time.time()
print(pred[0:10])
# import pylab
# import matplotlib.pyplot as plt
# plt.plot(Y, pred, '.')
# pylab.show()
</code></pre>
<p>Finally we stop the worker engines:</p>
<pre><code class="bash">ipcluster stop
</code></pre>
<h2>3.3) Basic shared memory parallel programming in MATLAB</h2>
<p>To run a loop in parallel in MATLAB, you can use the <em>parfor</em>
construction. Before running the parfor you
need to start up a set of workers using <em>parpool</em>. MATLAB will use only one thread
per worker.
Here is some demo code:</p>
<pre><code class="bash">nslots = 4; % to set manually
mypool = parpool(nslots)
% parpool open local nslots # alternative
n = 3000;
nIts = 500;
c = zeros(n, nIts);
parfor i = 1:nIts
c(:,i) = eig(rand(n));
end
delete(mypool)
% delete(gcp) works if you forget to name your pool by assigning the output of parpool to a variable
</code></pre>
<p>MATLAB has a default limit on the number of workers in a pool, but you can modify your MATLAB settings as follows to increase that limit (in this case to allow up to 32 workers). It should work to run the following code once in a MATLAB session, which will modify the settings for future MATLAB sessions.</p>
<pre><code>cl = parcluster('local');
cl.NumWorkers = 32;
saveProfile(cl);
</code></pre>
<p>By default MATLAB uses a single thread (core) per worker. However you can also use multiple threads. Here is how you can set that up.</p>
<pre><code>cl = parcluster('local');
cl.NumThreads = 2; % 2 threads per worker
cl.parpool(4); % 4 workers
</code></pre>
<h2>3.4) Using <em>OpenMP</em> threads for basic shared memory programming in C</h2>
<p>It's straightforward to write threaded code in C and C++ (as well as Fortran) to exploit multiple cores. The basic approach is to use the OpenMP protocol. Here's how one would parallelize a loop in C/C++ using an OpenMP compiler directive. In this case we are parallelizing the outer loop; the iterations of the outer loop are done in parallel, while the iterations of the inner loop are done serially within a thread. As with <em>foreach</em> in R, you only want to do this if the iterations do not depend on each other. The code is available as a C++ program (but the core of the code is just C code) in <em>testOpenMP.cpp</em>.</p>
<pre><code>// see testOpenMP.cpp
#include <iostream>
using namespace std;
// compile with: g++ -fopenmp -L/usr/local/lib
// testOpenMP.cpp -o testOpenMP
int main(){
int nReps = 20;
double x[nReps];
#pragma omp parallel for
for (int i=0; i<nReps; i++){
x[i] = 0.0;
for ( int j=0; j<1000000000; j++){
x[i] = x[i] + 1.0;
}
cout << x[i] << endl;
}
return 0;
}
</code></pre>
<p>We would compile this program as follows</p>
<pre><code>$ g++ -fopenmp testOpenMP.cpp -o testOpenMP
</code></pre>
<p>The main thing to be aware of in using OpenMP is not having different threads overwrite variables used by other threads. In the example above, variables declared within the <code>#pragma</code> directive will be recognized as variables that are private to each thread. In fact, you could declare <code>int i</code> before the compiler directive and things would be fine because OpenMP is smart enough to deal properly with the primary looping variable. But big problems would ensue if you had instead written the following code:</p>
<pre><code>int main(){
int nReps = 20;
int j; // DON'T DO THIS !!!!!!!!!!!!!
double x[nReps];
#pragma omp parallel for
for (int i=0; i<nReps; i++){
x[i] = 0.0;
for (j=0; j<1000000000; j++){
x[i] = x[i] + 1.0;
}
cout << x[i] << endl;
}
return 0;
}
</code></pre>
<p>Note that we do want <em>x</em> declared before the compiler directive because we want all the threads to write to a common <em>x</em> (but, importantly, to different components of <em>x</em>). That's the point!</p>
<p>We can also be explicit about what is shared and what is private to each thread:</p>
<pre><code>int main(){
int nReps = 20;
int i, j;
double x[nReps];
#pragma omp parallel for private(i,j) shared(x, nReps)
for (i=0; i<nReps; i++){
x[i] = 0.0;
for (j=0; j<1000000000; j++){
x[i] = x[i] + 1.0;
}
cout << x[i] << endl;
}
return 0;
}
</code></pre>
<h2>3.5) Calling OpenMP-based C code from R</h2>
<p>The easiest path here is to use the <em>Rcpp</em> package. In this case, you can write your C++ code with OpenMP pragma statements as in the previous subsection. You'll need to make sure that the <em>PKG_CXXFLAGS</em> and <em>PKG_LIBS</em> environment variables are set to include <code>-fopenmp</code> so the compilation is done correctly. More details and examples are linked from <a href="http://stackoverflow.com/questions/22748358/rcpp-with-openmp">this Stack Overflow post</a>.</p>
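<p>Here is a minimal sketch using <code>Rcpp::sourceCpp</code> with Rcpp's built-in <em>openmp</em> plugin, which adds the OpenMP compiler flags for you (this assumes your compiler supports OpenMP):</p>
<pre><code class="r">library(Rcpp)
sourceCpp(code = '
// [[Rcpp::plugins(openmp)]]
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::NumericVector colSumsOMP(Rcpp::NumericMatrix x) {
  int n = x.nrow(), p = x.ncol();
  Rcpp::NumericVector out(p);
  #pragma omp parallel for
  for (int j = 0; j < p; j++) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x(i, j);
    out[j] = s;  // each thread writes to its own element of out
  }
  return out;
}')
colSumsOMP(matrix(rnorm(20), 4, 5))  # compare with colSums()
</code></pre>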
<h1>4) Parallelization strategies</h1>
<p>The following are some basic principles/suggestions for how to parallelize
your computation.</p>
<p>Should I use one machine/node or many machines/nodes?</p>
<ul>
<li>If you can do your computation on the cores of a single node using
shared memory, that will be faster than using the same number of cores
(or even somewhat more cores) across multiple nodes. Similarly, jobs
with a lot of data/high memory requirements that one might think of
as requiring Spark or Hadoop may in some cases be much faster if you can find
a single machine with a lot of memory.</li>
<li>That said, if you would run out of memory on a single node, then you'll
need to use distributed memory.</li>
</ul>
<p>What level or dimension should I parallelize over?</p>
<ul>
<li>If you have nested loops, you generally only want to parallelize at
one level of the code. That said, there may be cases in which it is
helpful to do both. Keep in mind whether your linear algebra is being
threaded. Often you will want to parallelize over a loop and not use
threaded linear algebra.</li>
<li>Often it makes sense to parallelize the outer loop when you have nested
loops.</li>
<li>You generally want to parallelize in such a way that your code is
load-balanced and does not involve too much communication. </li>
</ul>
<p>How do I balance communication overhead with keeping my cores busy?</p>
<ul>
<li>If you have very few tasks, particularly if the tasks take different
amounts of time, often some processors will be idle and your code
poorly load-balanced.</li>
<li>If you have very many tasks and each one takes little time, the communication
overhead of starting and stopping the tasks will reduce efficiency.</li>
</ul>
<p>Should multiple tasks be pre-assigned to a worker process (sometimes called <em>prescheduling</em>), or should tasks
be assigned dynamically as previous tasks finish? </p>
<ul>
<li>Basically if you have many tasks that each take similar time, you
want to preschedule the tasks to reduce communication. If you have few tasks
or tasks with highly variable completion times, you don't want to
preschedule, to improve load-balancing.</li>
<li>For R in particular, some of R's parallel functions allow you to say whether the
tasks should be prescheduled, e.g., via the <em>mc.preschedule</em> argument of <em>mclapply</em> and (in principle) the <em>parLapplyLB</em> function in place of <em>parLapply</em>. However, there appears to be a bug in <em>parLapplyLB</em> such that it doesn't actually act in a load-balanced way. A short example follows this list.</li>
</ul>
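<p>Here is a small sketch of the effect with <em>mclapply</em>: with prescheduling, the two long tasks below tend to land on the same worker (tasks are split across workers by index), while dynamic allocation keeps the workers busier. Exact timings will depend on your machine.</p>
<pre><code class="r">library(parallel)
times <- c(4, 0.1, 0.1, 0.1, 4, 0.1, 0.1, 0.1)  # two long tasks, six short ones
f <- function(t) { Sys.sleep(t); t }
system.time(res1 <- mclapply(times, f, mc.cores = 4, mc.preschedule = TRUE))
system.time(res2 <- mclapply(times, f, mc.cores = 4, mc.preschedule = FALSE))
</code></pre>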
<h1>5) Random number generation (RNG) in parallel</h1>
<p>The key thing when thinking about random numbers in a parallel context
is that you want to avoid having the same 'random' numbers occur on
multiple processes. On a computer, random numbers are not actually
random but are generated as a sequence of pseudo-random numbers designed
to mimic true random numbers. The sequence is finite (but very long)
and eventually repeats itself. When one sets a seed, one is choosing
a position in that sequence to start from. Subsequent random numbers
are based on that subsequence. All random numbers can be generated
from one or more random uniform numbers, so we can just think about
a sequence of values between 0 and 1. </p>
<p>The worst thing that could happen is that one sets things up in such
a way that every process is using the same sequence of random numbers.
This could happen if you mistakenly set the same seed in each process,
e.g., using <em>set.seed(mySeed)</em> in R on every process.</p>
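<p>Here's a quick illustration of the problem (a sketch; run on a non-Windows machine, since it uses forking):</p>
<pre><code class="r">library(parallel)
res <- mclapply(1:4, function(i) { set.seed(1); runif(2) }, mc.cores = 4)
res  # all four 'independent' samples are identical
</code></pre>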
<p>The naive approach is to use a different seed for each process. E.g.,
if your processes are numbered <code>id = 1,2,...,p</code>, with a variable <em>id</em> that is unique
to each process, you could set the seed to be the value of <em>id</em> on each process. This is likely
not to cause problems, but it raises the danger that two (or more) sequences
might overlap. For an algorithm with dependence on the full sequence,
such as an MCMC, this probably won't cause big problems (though you
likely wouldn't know if it did), but for something like simple simulation
studies, some of your 'independent' samples could be exact replicates
of a sample on another process. Given the period length of the default
generators in R, MATLAB and Python, this is actually quite unlikely,
but it is a bit sloppy.</p>
<p>One approach to avoid the problem is to do all your RNG on one process
and distribute the random deviates, but this can be infeasible with
many random numbers.</p>
<p>More generally to avoid this problem, the key is to use an algorithm
that ensures sequences that do not overlap.</p>
<h2>5.1) Ensuring separate sequences in R</h2>
<p>In R, the <em>rlecuyer</em> package deals with this.
The L'Ecuyer algorithm has a period of \(2^{191}\), which it divides
into subsequences of length \(2^{127}\). </p>
<h3>5.1.1) With the parallel package</h3>
<p>Here's how you initialize independent sequences on different processes
when using the <em>parallel</em> package's parallel apply functionality
(illustrated here with <em>parSapply</em>).</p>
<pre><code class="r">library(parallel)
library(rlecuyer)
nSims <- 250
taskFun <- function(i){
val <- runif(1)
return(val)
}
nCores <- 4
RNGkind()
</code></pre>
<pre><code>## [1] "Mersenne-Twister" "Inversion"
</code></pre>
<pre><code class="r">cl <- makeCluster(nCores)
iseed <- 0
clusterSetRNGStream(cl = cl, iseed = iseed)
RNGkind() # clusterSetRNGStream sets RNGkind as L'Ecuyer-CMRG
</code></pre>
<pre><code>## [1] "Mersenne-Twister" "Inversion"
</code></pre>
<pre><code class="r"># but it doesn't show up here on the master
res <- parSapply(cl, 1:nSims, taskFun)
# now redo with same master seed to see results are the same
clusterSetRNGStream(cl = cl, iseed = iseed)
res2 <- parSapply(cl, 1:nSims, taskFun)
identical(res,res2)
</code></pre>
<pre><code>## [1] TRUE
</code></pre>
<pre><code class="r">stopCluster(cl)
</code></pre>
<p>If you want to explicitly move from stream to stream, you can use
<em>nextRNGStream</em>. For example:</p>
<pre><code class="r">RNGkind("L'Ecuyer-CMRG")
seed <- 0
set.seed(seed)
## now start M workers
s <- .Random.seed
for (i in 1:M) {
s <- nextRNGStream(s)
# send s to worker i as .Random.seed