forked from chenzomi12/AISystem
03.srt
1
00:00:00,000 --> 00:00:04,000
Subtitle:PlusV98
2
00:00:05,075 --> 00:00:11,100
Hello everyone, I'm ZOMI, digging and digging away in a great big company
3
00:00:20,000 --> 00:00:26,000
Today we arrive at the third part of the Tensor Core series in the AI chip GPU deep dive: an in-depth analysis
4
00:00:26,000 --> 00:00:31,188
This so-called in-depth analysis is largely ZOMI's personal understanding
5
00:00:31,188 --> 00:00:34,188
Today I mainly want to share three topics with you
6
00:00:34,188 --> 00:00:39,188
First, a review of the Tensor Core and its concrete execution flow
7
00:00:39,188 --> 00:00:43,188
Second, a look at the instruction pipeline
8
00:00:43,188 --> 00:00:46,188
specifically the Tensor Core's Instruction Pipeline
9
00:00:46,188 --> 00:00:50,481
Third, let's look at CUDA Threads together
10
00:00:50,481 --> 00:00:55,289
that is, how CUDA thread execution actually maps onto the hardware
11
00:00:55,289 --> 00:01:00,025
Here I'll be offering some very personal, subjective technical opinions
12
00:01:00,173 --> 00:01:05,173
If you think something is wrong, feel free to call it out and set me straight
13
00:01:06,000 --> 00:01:08,000
Now let's get to the first topic
14
00:01:08,000 --> 00:01:10,000
and look at Tensor Core execution
15
00:01:10,000 --> 00:01:12,525
A 4x4 matrix A
16
00:01:12,525 --> 00:01:14,125
a 4x4 matrix B
17
00:01:14,125 --> 00:01:17,125
plus a 4x4 matrix C
18
00:01:17,125 --> 00:01:22,125
So-called mixed precision means computing in FP16 during the calculation
19
00:01:22,125 --> 00:01:27,125
but storing in FP32 or FP16
20
00:01:27,125 --> 00:01:30,573
In the actual math, though
21
00:01:30,573 --> 00:01:34,573
a row of the matrix is multiplied by a column of the matrix
22
00:01:34,573 --> 00:01:39,573
and then a single element is added on to get the first element of matrix D
23
00:01:39,573 --> 00:01:42,715
Multiply the first row with the first column
24
00:01:42,715 --> 00:01:47,579
and in the full computation the second row is also multiplied with the second column
25
00:01:47,579 --> 00:01:49,904
plus C11 on the diagonal
26
00:01:49,904 --> 00:01:52,656
and after that there are many more operations
27
00:01:52,656 --> 00:01:55,728
that is, every row has to be multiplied with every column
29
00:01:55,728 --> 00:01:58,096
before all the elements can be obtained
30
00:01:58,096 --> 00:02:01,000
so the first row times the first column
31
00:02:01,000 --> 00:02:03,000
the first row times the second column
32
00:02:03,000 --> 00:02:04,000
then the third column, the fourth column
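The row-times-column accumulation described in the cues above can be sketched in plain Python (an illustration of the math only, not NVIDIA code; the function name is made up for this sketch):

```python
# Naive 4x4 D = A*B + C, the element-wise view described above:
# each D[i][j] is a row of A dotted with a column of B, plus C[i][j].
def matmul_add_4x4(A, B, C):
    n = 4
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = C[i][j]                 # accumulator starts from C (FP32 in mixed precision)
            for k in range(n):
                acc += A[i][k] * B[k][j]  # the FP16 multiplies, accumulated at higher precision
            D[i][j] = acc
    return D

I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
C = [[2.0] * 4 for _ in range(4)]
print(matmul_add_4x4(I, I, C)[0][0])  # 1*1 + 2 = 3.0
```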
33
00:02:04,000 --> 00:02:07,871
NVIDIA's V100 actually doesn't compute one row at a time
34
00:02:07,871 --> 00:02:10,871
but one whole matrix at a time
35
00:02:10,871 --> 00:02:13,871
Take a look at the official illustration below
36
00:02:13,871 --> 00:02:15,604
Pascal is the previous-generation architecture
37
00:02:15,604 --> 00:02:16,604
from before Tensor Cores existed
38
00:02:16,604 --> 00:02:18,604
it multiplies one element with one row
39
00:02:18,604 --> 00:02:21,356
executing four multiplies per clock cycle
40
00:02:21,356 --> 00:02:23,120
to produce one column of data
41
00:02:23,120 --> 00:02:25,120
whereas in the V100
42
00:02:25,120 --> 00:02:29,120
the whole matrix A is multiplied with the whole matrix B
43
00:02:29,120 --> 00:02:31,806
to produce a whole matrix of output
44
00:02:31,806 --> 00:02:33,256
Overall
45
00:02:33,256 --> 00:02:34,057
this Tensor Core on the right
46
00:02:34,057 --> 00:02:35,529
within a single clock cycle
47
00:02:35,529 --> 00:02:37,529
can execute 4 x 4 x 4
48
00:02:37,529 --> 00:02:39,904
= 64 FMAs
49
00:02:39,904 --> 00:02:41,600
that is, multiply-add compute operations
50
00:02:41,600 --> 00:02:43,509
Its throughput
51
00:02:43,509 --> 00:02:46,261
is 12x faster than the Pascal architecture on the left
52
00:02:46,261 --> 00:02:48,261
That Pascal-style computation
53
00:02:48,261 --> 00:02:50,261
uses CUDA Cores
54
00:02:50,261 --> 00:02:51,765
while the V100 introduced the Tensor Core
55
00:02:51,765 --> 00:02:54,710
dedicated to accelerating matrix math
56
00:02:54,710 --> 00:02:58,000
Now let's take a look at the V100's Volta architecture
57
00:02:58,000 --> 00:03:00,214
As we just saw, in one clock cycle
58
00:03:00,214 --> 00:03:02,145
it can really only execute 16 FMAs
59
00:03:02,145 --> 00:03:05,000
but inside the V100's Tensor Core
60
00:03:05,000 --> 00:03:06,817
within one clock cycle
61
00:03:06,817 --> 00:03:09,817
it can execute two 4x4x4 FMA operations
62
00:03:09,817 --> 00:03:10,817
Overall
63
00:03:10,817 --> 00:03:12,183
the Tensor Core's compute throughput
64
00:03:12,183 --> 00:03:15,183
is 12x higher than that of the CUDA Cores
65
00:03:15,183 --> 00:03:17,183
Now let's look a bit further
66
00:03:17,183 --> 00:03:19,843
inside one SM there are four Sub-Cores
67
00:03:19,843 --> 00:03:22,019
and each Sub-Core contains two Tensor Cores
68
00:03:22,019 --> 00:03:24,387
a single Tensor Core, within one clock cycle
69
00:03:24,387 --> 00:03:26,387
can execute 64 FMAs
70
00:03:26,387 --> 00:03:28,893
so one SM, in a single clock cycle
71
00:03:28,893 --> 00:03:31,743
can already execute 1024 FFMAs
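The per-cycle figures quoted above multiply out to the 1024 number; here is a quick Python sanity check, using the talk's own figures (two 4x4x4 matrix FMAs per Tensor Core per cycle, two Tensor Cores per Sub-Core, four Sub-Cores per SM) as stated:

```python
# Per-cycle FMA throughput of one V100 SM, using the numbers from the talk.
fma_per_tensor_op = 4 * 4 * 4      # one 4x4x4 matrix FMA = 64 scalar FMAs
ops_per_core_cycle = 2             # each Tensor Core issues two such ops per cycle
tensor_cores_per_subcore = 2
subcores_per_sm = 4

fma_per_sm_cycle = (fma_per_tensor_op * ops_per_core_cycle
                    * tensor_cores_per_subcore * subcores_per_sm)
print(fma_per_sm_cycle)  # 1024
```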
73
00:03:33,000 --> 00:03:35,000
Now we come to the second topic
74
00:03:35,000 --> 00:03:37,569
the Tensor Core instruction pipeline
75
00:03:37,569 --> 00:03:40,421
There are two different symbols
76
00:03:40,421 --> 00:03:41,421
one is a plus sign
77
00:03:41,421 --> 00:03:42,421
one is a multiply sign
78
00:03:42,421 --> 00:03:44,133
To implement, inside the Tensor Core
79
00:03:44,133 --> 00:03:46,885
the simple matrix multiply we just described
80
00:03:46,885 --> 00:03:50,149
multiplying a row of A by a column of matrix B
81
00:03:50,149 --> 00:03:52,453
which corresponds to the formula below
82
00:03:52,453 --> 00:03:54,600
the circuit, then
83
00:03:54,600 --> 00:03:56,275
and mind you, this is only my own speculation
84
00:03:56,275 --> 00:03:58,656
the Tensor Core is not necessarily implemented this way
85
00:03:58,656 --> 00:04:03,072
might multiply A and B and then add
86
00:04:03,072 --> 00:04:06,000
and through this kind of hypothetical hardware circuit
87
00:04:06,000 --> 00:04:08,768
the whole corresponding hardware circuitry is realized
89
00:04:08,768 --> 00:04:10,768
Next let's look further
90
00:04:10,768 --> 00:04:12,768
in practice, in the middle of the process
91
00:04:12,768 --> 00:04:14,768
or during the computation
92
00:04:14,768 --> 00:04:16,128
the green blocks are registers
93
00:04:16,128 --> 00:04:18,688
you can't do without the registers
94
00:04:18,688 --> 00:04:20,688
the vertical ones are 32-bit
95
00:04:20,688 --> 00:04:22,688
and the output is 32-bit as well
96
00:04:22,688 --> 00:04:25,688
but the intermediate computation on matrices A and B
97
00:04:25,688 --> 00:04:27,688
can be 16-bit
98
00:04:27,688 --> 00:04:29,688
and after the multiply-add finishes
99
00:04:29,688 --> 00:04:32,286
there does need to be a small register in the middle
100
00:04:32,286 --> 00:04:34,286
to store the intermediate data
101
00:04:34,286 --> 00:04:36,254
You can see that the registers
102
00:04:36,254 --> 00:04:39,254
sit very close
103
00:04:39,254 --> 00:04:40,822
to the actual compute units
104
00:04:40,822 --> 00:04:41,572
and in this way
105
00:04:41,572 --> 00:04:42,806
we can simply implement
106
00:04:42,806 --> 00:04:46,006
a row of matrix A times a column of matrix B
107
00:04:46,006 --> 00:04:48,677
What we just covered is only one row and one column
108
00:04:48,677 --> 00:04:50,469
In the old CUDA Core
109
00:04:50,469 --> 00:04:52,645
it was one point multiplied with one row
110
00:04:52,645 --> 00:04:54,373
now in the V100
111
00:04:54,373 --> 00:04:57,061
one matrix and another matrix
112
00:04:57,061 --> 00:05:00,061
are multiplied directly to give a new matrix
113
00:05:00,061 --> 00:05:01,789
And what we just described
114
00:05:01,789 --> 00:05:05,789
was only one row times one column
115
00:05:05,789 --> 00:05:07,789
giving one element in the middle
116
00:05:07,789 --> 00:05:09,119
so how do we get the rest?
117
00:05:09,119 --> 00:05:12,564
Suppose the row-of-A times column-of-B we just showed
118
00:05:12,564 --> 00:05:14,062
gives one element
119
00:05:14,062 --> 00:05:15,337
Now let's see
120
00:05:15,337 --> 00:05:16,662
how such simple elements
121
00:05:16,662 --> 00:05:17,662
are put together
122
00:05:17,662 --> 00:05:21,662
Take A0I, A1I, A2I, A3I
123
00:05:21,662 --> 00:05:26,662
and multiply every row of A with every column of B
124
00:05:26,662 --> 00:05:27,662
at that point
125
00:05:27,662 --> 00:05:29,662
we can obtain every element
126
00:05:29,662 --> 00:05:31,662
of the whole matrix
127
00:05:31,662 --> 00:05:32,662
And at that point
128
00:05:32,662 --> 00:05:34,662
the registers for matrix A
129
00:05:34,662 --> 00:05:35,662
should be a whole bank of them
130
00:05:35,662 --> 00:05:37,662
and the registers for matrix B
131
00:05:37,662 --> 00:05:38,662
should be a bank as well
132
00:05:38,662 --> 00:05:41,086
not just a single one like before
133
00:05:42,000 --> 00:05:44,408
ZOMI is not terribly well versed in hardware
134
00:05:44,408 --> 00:05:46,408
I'm just trying, from my own understanding
135
00:05:46,408 --> 00:05:48,408
to give you a simple walkthrough
136
00:05:48,408 --> 00:05:49,398
so if you think anything is wrong
137
00:05:49,398 --> 00:05:51,000
you're always welcome to correct me
138
00:05:51,500 --> 00:05:55,287
The instruction for one element's, that is a scalar's, multiply-add
139
00:05:55,287 --> 00:05:56,380
looks like what's shown below
140
00:05:56,380 --> 00:05:57,212
but in practice
141
00:05:57,212 --> 00:05:58,488
for the multiply (MUL) inside the Tensor Core
142
00:05:58,488 --> 00:06:00,488
only FP16 is used; it's when storing or adding
143
00:06:00,488 --> 00:06:02,008
that FP32 comes in
144
00:06:02,008 --> 00:06:04,303
so that one MUL from before
145
00:06:04,303 --> 00:06:05,303
can be trimmed away
146
00:06:05,303 --> 00:06:07,303
Now, to multiply two elements
147
00:06:07,303 --> 00:06:09,573
two pipelines have to run in parallel
148
00:06:09,573 --> 00:06:10,917
and that is instruction pipelining
149
00:06:10,917 --> 00:06:13,477
Now to implement a row of A times a column of B
150
00:06:13,477 --> 00:06:16,477
we end up with four pipeline lanes
151
00:06:16,477 --> 00:06:19,477
Just computing one single element
152
00:06:19,477 --> 00:06:21,477
already needs four pipelines
153
00:06:21,477 --> 00:06:23,127
Through the green instruction pipeline above
154
00:06:23,127 --> 00:06:24,663
D00 is computed
155
00:06:24,663 --> 00:06:25,879
and through the yellow pipeline
156
00:06:25,879 --> 00:06:27,223
D01 is computed
157
00:06:27,223 --> 00:06:28,503
Next, if we want
158
00:06:28,503 --> 00:06:30,231
to compute all of the elements
159
00:06:30,231 --> 00:06:32,791
a large number of instruction pipelines have to be stitched together
160
00:06:32,791 --> 00:06:34,791
Stitching the four pipelines together
161
00:06:34,791 --> 00:06:37,143
simply gives us one matrix's
162
00:06:37,143 --> 00:06:39,191
results D00 through D03
163
00:06:39,191 --> 00:06:43,000
And now everything else still has to be stitched in
164
00:06:43,000 --> 00:06:44,631
The full instruction pipeline
165
00:06:44,631 --> 00:06:46,935
no longer fits on one screen
166
00:06:46,935 --> 00:06:48,560
Look at the colors here
167
00:06:48,560 --> 00:06:50,330
within any given time slot
168
00:06:50,330 --> 00:06:53,330
the data reads and writes follow a regular pattern
169
00:06:53,330 --> 00:06:56,330
the MUL stage needs the data to be read out
170
00:06:56,330 --> 00:06:57,330
and once it's read out
171
00:06:57,330 --> 00:06:59,330
the ROUND stage below, once the computation finishes
172
00:06:59,330 --> 00:07:00,762
is where the data gets written back
173
00:07:00,762 --> 00:07:02,762
So at any given moment
174
00:07:02,762 --> 00:07:05,237
across the whole pipeline there are four pieces of data
175
00:07:05,237 --> 00:07:07,237
being read from the registers into the compute units
176
00:07:07,237 --> 00:07:08,237
while one piece of data
177
00:07:08,237 --> 00:07:10,237
is stored back into a register
178
00:07:10,237 --> 00:07:12,582
and through this mass of instruction pipelining
179
00:07:12,582 --> 00:07:15,132
the entire Tensor Core computation is realized
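As a toy model of the four-lane pipelining described above (the stage names "MUL" and "ROUND" follow the talk, and the whole function is a speculative sketch, not real hardware behavior):

```python
# Toy model of four parallel FMA lanes producing one output element.
# Each "lane" multiplies one A[k]*B[k] pair; the partial products are then
# accumulated with C, mimicking the read -> MUL -> add -> ROUND/writeback flow.
def fma_pipeline(a_row, b_col, c, lanes=4):
    partials = []
    for k in range(lanes):                   # in hardware these run as parallel lanes
        partials.append(a_row[k] * b_col[k])  # "MUL" stage: one multiply per lane
    acc = c                                  # accumulator seeded with the C element
    for p in partials:                       # "add" stage: fold in each partial
        acc += p
    return acc                               # "ROUND"/writeback stage stores the result

print(fma_pipeline([1, 2, 3, 4], [1, 1, 1, 1], 10))  # 1+2+3+4+10 = 20
```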
180
00:07:16,000 --> 00:07:18,790
This time ZOMI is speaking very slowly
181
00:07:18,790 --> 00:07:20,262
and without so much rambling
182
00:07:20,262 --> 00:07:22,262
Now let's look at the third topic
183
00:07:22,262 --> 00:07:24,262
Tensor Core thread execution
184
00:07:24,262 --> 00:07:26,726
On the overall CUDA software design side
185
00:07:26,726 --> 00:07:29,076
the hope is really to match
186
00:07:29,076 --> 00:07:31,076
NVIDIA's whole layered compute-and-memory structure
187
00:07:31,076 --> 00:07:34,076
Overall, NVIDIA's framing of the Tensor Core
188
00:07:34,076 --> 00:07:36,281
is mainly to provide, through CUDA
189
00:07:36,281 --> 00:07:37,369
a form of generic programming
190
00:07:37,369 --> 00:07:38,591
We can set this so-called generic programming
191
00:07:38,591 --> 00:07:40,393
aside for now
192
00:07:40,393 --> 00:07:41,545
the term they use
193
00:07:41,545 --> 00:07:43,545
is General Programming
194
00:07:43,545 --> 00:07:44,695
In my example later on
195
00:07:44,695 --> 00:07:46,409
there will be a simple
196
00:07:46,409 --> 00:07:47,817
A times B equals C
197
00:07:47,817 --> 00:07:48,817
as a demo
198
00:07:48,817 --> 00:07:49,817
so we no longer have the earlier
199
00:07:49,817 --> 00:07:51,817
D equals A times B plus C
200
00:07:51,817 --> 00:07:52,817
that notion goes away
201
00:07:52,817 --> 00:07:54,817
it's simply a matrix multiply
202
00:07:55,000 --> 00:07:59,000
matrix A multiplied with matrix B
203
00:07:59,000 --> 00:08:01,000
gives matrix C
204
00:08:01,000 --> 00:08:01,600
In practice, though
205
00:08:01,600 --> 00:08:02,600
you can't load such a big matrix
206
00:08:02,600 --> 00:08:04,600
into an actual Tensor Core
207
00:08:04,600 --> 00:08:05,600
because a Tensor Core
208
00:08:05,600 --> 00:08:07,508
can only hold a 4x4
209
00:08:07,508 --> 00:08:08,508
simple computation
211
00:08:08,508 --> 00:08:09,508
So at that point
212
00:08:09,508 --> 00:08:11,650
the matrix is sliced into tiles
213
00:08:11,650 --> 00:08:12,650
and placed into a Thread Block
214
00:08:12,650 --> 00:08:13,650
that is, a thread block
215
00:08:13,650 --> 00:08:14,650
Next
216
00:08:14,650 --> 00:08:16,420
on the software side
217
00:08:16,420 --> 00:08:17,420
a Warp is defined
218
00:08:17,420 --> 00:08:18,420
the concept of a Warp
219
00:08:18,420 --> 00:08:19,708
was already covered earlier
220
00:08:19,708 --> 00:08:21,708
and what the threads ultimately execute on
221
00:08:21,708 --> 00:08:23,708
is the real Tensor Core
222
00:08:23,708 --> 00:08:26,708
Below, let's open up these layers one by one
223
00:08:27,483 --> 00:08:29,593
Setting aside all the CUDA and NVIDIA concepts
224
00:08:29,593 --> 00:08:30,593
look at a matrix multiply
225
00:08:30,593 --> 00:08:32,593
that is, the so-called GEMM
226
00:08:32,593 --> 00:08:33,593
In fact, each step
227
00:08:33,593 --> 00:08:35,593
computes one small matrix block
228
00:08:35,593 --> 00:08:36,593
that is, from matrix A
229
00:08:36,593 --> 00:08:37,593
take out a small block
230
00:08:37,593 --> 00:08:38,593
and from matrix B
231
00:08:38,593 --> 00:08:39,593
take out a small block
232
00:08:39,593 --> 00:08:41,593
and compute a block of matrix C
233
00:08:41,593 --> 00:08:42,593
At that point
234
00:08:42,593 --> 00:08:43,835
when the overall software
235
00:08:43,835 --> 00:08:45,115
is being programmed
236
00:08:45,115 --> 00:08:47,115
it works along each dimension
237
00:08:47,115 --> 00:08:48,115
that is, along
238
00:08:48,115 --> 00:08:49,115
m and k
239
00:08:49,115 --> 00:08:50,982
and n, doing the partitioning
240
00:08:50,982 --> 00:08:51,913
Concretely it is divided into
241
00:08:51,913 --> 00:08:53,363
mTile-by-nTile
242
00:08:53,363 --> 00:08:55,363
independent matrix multiplications
243
00:08:55,363 --> 00:08:57,000
and by accumulating
244
00:08:57,000 --> 00:08:59,000
the tiles along the m, n and k dimensions
245
00:08:59,000 --> 00:09:00,425
the whole matrix multiply is built up
246
00:09:00,425 --> 00:09:01,425
computing the result of the entire
247
00:09:01,425 --> 00:09:02,425
large matrix
248
00:09:02,425 --> 00:09:03,425
So in the programming
249
00:09:03,425 --> 00:09:04,425
each dimension
250
00:09:04,425 --> 00:09:05,425
is split apart in this way
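The m/n/k tiling described above can be sketched in plain Python (tile sizes and the function name are illustrative placeholders, not CUTLASS defaults):

```python
# Tiled GEMM: C = A @ B built up from mTile x nTile output blocks,
# accumulating partial products along the k dimension one kTile at a time.
def tiled_gemm(A, B, m, n, k, mTile=2, nTile=2, kTile=2):
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, mTile):              # walk the m dimension in tiles
        for j0 in range(0, n, nTile):          # walk the n dimension in tiles
            for k0 in range(0, k, kTile):      # accumulate along the k dimension
                for i in range(i0, min(i0 + mTile, m)):
                    for j in range(j0, min(j0 + nTile, n)):
                        for kk in range(k0, min(k0 + kTile, k)):
                            C[i][j] += A[i][kk] * B[kk][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(tiled_gemm(A, B, 2, 2, 2))  # [[19.0, 22.0], [43.0, 50.0]]
```

Each (i0, j0) block of C is independent, which is exactly why the blocks can be handed to separate thread blocks on the GPU.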