
Optimize the performance of seq2seq model on GPU #162

Xreki opened this issue Aug 9, 2019 · 2 comments
Xreki commented Aug 9, 2019

Initial performance

  • Test date: August 8, 2019
  • Tester: @Xreki
Xreki commented Aug 9, 2019

Profiling results

------------------------->     Profiling Report     <-------------------------

Note! This Report merge all thread info into one.
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread

Event                                  Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.      
recurrent_grad                         20          3334.05     3130.058004 (0.938815)  203.995332 (0.061185)   27.1338     260.926     166.703     0.308526    
recurrent                              20          1497.97     1452.955293 (0.969947)  45.018247 (0.030053)    16.0935     110.292     74.8987     0.138619    
elementwise_add_grad                   4795        1377.74     168.491482 (0.122296)   1209.244141 (0.877704)  0.026835    1.0058      0.287328    0.127493    
matmul_grad                            3196        844.682     311.706298 (0.369022)   532.976110 (0.630978)   0.087976    13.3952     0.264294    0.0781651   
matmul                                 3196        526.856     173.302980 (0.328938)   353.553345 (0.671062)   0.036523    8.69278     0.164849    0.0487542   
sum                                    12389       503.771     391.555491 (0.777248)   112.215826 (0.222752)   0.021764    6.88414     0.0406628   0.0466179   
GpuMemcpyAsync(same_gpu):GPU->GPU      18475       334.162     303.616585 (0.908592)   30.544991 (0.091408)    0.013014    0.120487    0.0180872   0.0309226   
elementwise_mul_grad                   6734        256.62      238.810646 (0.930602)   17.808921 (0.069398)    0.027794    7.31021     0.038108    0.023747    
elementwise_mul                        6864        203.549     187.715197 (0.922212)   15.833627 (0.077788)    0.020651    1.87285     0.0296545   0.018836    
elementwise_add                        4795        164.711     153.425376 (0.931485)   11.285194 (0.068515)    0.021858    1.79897     0.0343505   0.015242    
concat                                 3794        164.522     137.125469 (0.833479)   27.396308 (0.166521)    0.029323    0.125735    0.0433637   0.0152245   
rnn_memory_helper_grad                 5244        140.187     140.106174 (0.999421)   0.081113 (0.000579)     0.020427    0.150314    0.0267329   0.0129726   
rnn_memory_helper                      5244        134.466     134.422311 (0.999676)   0.043546 (0.000324)     0.02023     0.131774    0.0256418   0.0124432   
concat_grad                            2340        114.213     102.730713 (0.899469)   11.481954 (0.100531)    0.032481    0.15419     0.0488088   0.010569    
sigmoid_grad                           4362        113.874     107.131844 (0.940796)   6.741797 (0.059204)     0.020312    0.086206    0.0261058   0.0105376   
sigmoid                                4362        108.18      101.646670 (0.939606)   6.533444 (0.060394)     0.019966    0.132063    0.0248006   0.0100108   
elementwise_sub                        2362        79.7406     75.840477 (0.951089)    3.900169 (0.048911)     0.027971    0.095837    0.0337598   0.00737903  
tanh_grad                              2908        76.4149     71.989046 (0.942082)    4.425817 (0.057918)     0.021056    0.090353    0.0262775   0.00707127  
split                                  1454        75.8946     60.162568 (0.792712)    15.732015 (0.207288)    0.043571    0.630367    0.0521971   0.00702312  
elementwise_sub_grad                   2352        71.542      68.192457 (0.953180)    3.349583 (0.046820)     0.022459    0.168436    0.0304175   0.00662035  
tanh                                   2908        67.9909     63.220004 (0.929830)    4.770911 (0.070170)     0.019094    0.128054    0.0233806   0.00629174  
reshape2                               1762        67.5811     67.563221 (0.999736)    0.017841 (0.000264)     0.012027    0.104218    0.0383547   0.00625381  
dropout                                1454        67.5326     52.996526 (0.784755)    14.536027 (0.215245)    0.038525    0.110662    0.046446    0.00624932  
dropout_grad                           1454        50.5008     47.855241 (0.947613)    2.645600 (0.052387)     0.027463    0.624477    0.0347324   0.00467324  
fill_constant                          1496        41.2608     38.647031 (0.936653)    2.613757 (0.063347)     0.019868    0.110909    0.0275807   0.00381819  
GpuMemcpyAsync:CPU->GPU                2684        38.2639     33.139905 (0.866089)    5.123964 (0.133911)     0.008246    0.077719    0.0142563   0.00354086  
softmax                                433         37.6717     32.780017 (0.870150)    4.891664 (0.129850)     0.075404    0.126331    0.0870016   0.00348606  
transpose2_grad                        906         29.4507     26.412712 (0.896845)    3.037988 (0.103155)     0.023196    0.156133    0.0325063   0.00272531  
unsqueeze2                             433         29.4025     29.319538 (0.997179)    0.082947 (0.002821)     0.059813    0.128764    0.0679041   0.00272084  
transpose2                             916         28.7084     25.577694 (0.890949)    3.130678 (0.109051)     0.023465    0.092804    0.031341    0.00265661  
softmax_grad                           433         27.5446     23.813278 (0.864535)    3.731316 (0.135465)     0.05221     0.123997    0.0636134   0.00254892  
squeeze2_grad                          433         27.0135     26.944857 (0.997458)    0.068676 (0.002542)     0.054414    0.133838    0.0623869   0.00249978  
eager_deletion                         870         26.7456     26.736183 (0.999646)    0.009456 (0.000354)     0.002324    1.38676     0.0307421   0.00247499  
squeeze2                               433         25.2393     25.181926 (0.997725)    0.057410 (0.002275)     0.050768    0.135624    0.0582895   0.00233559  
unsqueeze2_grad                        433         25.0274     24.968753 (0.997655)    0.058683 (0.002345)     0.050575    0.119195    0.0578001   0.00231599  
softmax_with_cross_entropy             10          18.02       2.409284 (0.133700)     15.610725 (0.866300)    0.722111    2.15961     1.802       0.00166753  
adam                                   130         14.9652     5.707370 (0.381375)     9.257877 (0.618625)     0.034774    0.354246    0.115117    0.00138485  
reduce_sum                             140         10.8105     8.748800 (0.809285)     2.061724 (0.190715)     0.037435    0.162729    0.077218    0.00100038  
square                                 130         6.64256     3.823993 (0.575681)     2.818564 (0.424319)     0.022074    0.138738    0.0510966   0.000614688 
scale                                  260         6.51412     6.142491 (0.942950)     0.371628 (0.057050)     0.019617    0.263       0.0250543   0.000602803 
lookup_table_grad                      20          5.0277      1.516919 (0.301712)     3.510781 (0.698288)     0.131618    0.334729    0.251385    0.000465253 
softmax_with_cross_entropy_grad        10          5.01725     0.894024 (0.178190)     4.123227 (0.821810)     0.222553    0.583161    0.501725    0.000464286 
sequence_mask                          30          4.32902     4.122014 (0.952182)     0.207004 (0.047818)     0.109838    0.224241    0.144301    0.000400598 
slice_grad                             80          3.75367     3.079189 (0.820314)     0.674482 (0.179686)     0.028986    0.145544    0.0469209   0.000347357 
slice                                  110         3.24341     3.036975 (0.936351)     0.206440 (0.063649)     0.012624    0.076516    0.0294856   0.000300139 
lookup_table                           20          3.21076     0.940230 (0.292837)     2.270531 (0.707163)     0.056535    0.242628    0.160538    0.000297117 
fill_constant_batch_size_like          50          1.41506     1.316025 (0.930012)     0.099038 (0.069988)     0.023255    0.042394    0.0283013   0.000130947 
TensorCopy:GPU->CPU                    30          1.08744     1.044846 (0.960830)     0.042595 (0.039170)     0.03174     0.046723    0.036248    0.000100629 
GpuMemcpySync:GPU->CPU                 30          0.990879    0.899023 (0.907298)     0.091856 (0.092702)     0.029195    0.04386     0.0330293   9.16939e-05 
TensorCopy:CPU->GPU                    30          0.986733    0.953919 (0.966745)     0.032814 (0.033255)     0.027728    0.046333    0.0328911   9.13102e-05 
Fetch                                  10          0.946421    0.866478 (0.915531)     0.079943 (0.084469)     0.079943    0.133364    0.0946421   8.75798e-05 
GpuMemcpySync:CPU->GPU                 30          0.904039    0.824342 (0.911843)     0.079697 (0.088157)     0.024434    0.042919    0.0301346   8.36579e-05 
reduce_sum_grad                        10          0.671832    0.578052 (0.860412)     0.093780 (0.139588)     0.063234    0.076912    0.0671832   6.21699e-05 
reduce_mean                            10          0.660142    0.576509 (0.873311)     0.083633 (0.126689)     0.054697    0.114415    0.0660142   6.10882e-05 
elementwise_max                        10          0.558056    0.495460 (0.887832)     0.062596 (0.112168)     0.047459    0.07431     0.0558056   5.16413e-05 
reshape2_grad                          30          0.51679     0.500414 (0.968312)     0.016376 (0.031688)     0.013623    0.02873     0.0172263   4.78227e-05 
GpuMemcpyAsync:GPU->CPU                10          0.466456    0.409467 (0.877826)     0.056989 (0.122174)     0.039485    0.063719    0.0466456   4.31649e-05 
FastThreadedSSAGraphExecutorPrepare    10          0.453106    0.396860 (0.875866)     0.056246 (0.124134)     0.039105    0.056246    0.0453106   4.19295e-05 
elementwise_div                        10          0.446607    0.392265 (0.878323)     0.054342 (0.121677)     0.041335    0.060296    0.0446607   4.13281e-05 
reduce_mean_grad                       10          0.436627    0.371762 (0.851441)     0.064865 (0.148559)     0.040098    0.056068    0.0436627   4.04045e-05 
sqrt                                   10          0.428603    0.366501 (0.855106)     0.062102 (0.144894)     0.034099    0.062782    0.0428603   3.9662e-05  
Scale LossGrad                         10          0.371738    0.334618 (0.900145)     0.037120 (0.099855)     0.032202    0.048616    0.0371738   3.43999e-05 
shape                                  30          0.36791     0.342755 (0.931627)     0.025155 (0.068373)     0.008387    0.028157    0.0122637   3.40456e-05 
TensorCopy:GPU->GPU                    30          0.06052     0.058630 (0.968771)     0.001890 (0.031229)     0.001675    0.002818    0.00201733  5.60039e-06 
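To make the report above easier to read, here is a small hypothetical helper (plain Python, not part of Paddle) that pulls the parenthesized CPU-time ratio out of a report row. It confirms that `recurrent_grad` spends roughly 94% of its time on the CPU (CPU-bound), while `elementwise_add_grad` is GPU-bound.

```python
# Hypothetical helper: classify profiler report rows as CPU- or GPU-bound.
# The first parenthesized value in a row is the CPU-time ratio, the second
# the GPU-time ratio (see the "CPU Time (Ratio)" / "GPU Time (Ratio)" columns).
import re

def cpu_ratio(row):
    """Return (event_name, cpu_time_ratio) for one report row."""
    name = row.split()[0]
    ratios = re.findall(r"\(([\d.]+)\)", row)  # e.g. ["0.938815", "0.061185"]
    return name, float(ratios[0])

rows = [
    "recurrent_grad  20  3334.05  3130.058004 (0.938815)  203.995332 (0.061185)",
    "elementwise_add_grad  4795  1377.74  168.491482 (0.122296)  1209.244141 (0.877704)",
]
for name, r in map(cpu_ratio, rows):
    print(f"{name}: {'CPU-bound' if r > 0.5 else 'GPU-bound'} ({r:.0%} CPU)")
    # → recurrent_grad: CPU-bound (94% CPU)
    # → elementwise_add_grad: GPU-bound (12% CPU)
```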

Xreki commented Aug 9, 2019

Timeline analysis

  • Step overview:
    • The model appears to use two StaticRNN structures?
    • There is clearly visible GPU idle time

  • At the start of recurrent, Operators must be created and an ExecutorPrepareContext prepared. Each step uses its own step_scope, in which Variables must be created.

  • GPU utilization is low both in the preparation phase and inside the StaticRNN itself.

  • These are all elementwise computations, which could be fused. A general mechanism supporting this kind of fusion is under development.
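A minimal sketch (plain Python, not Paddle code) of the fusion idea above, using the LSTM cell update as an example: the unfused version runs one full pass over the data per op (sigmoid, tanh, mul, add), materializing intermediate arrays, while the fused version computes the whole update in a single pass, touching memory once.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cell_unfused(i, f, g, o, c_prev):
    # Five separate elementwise passes, as launched by separate ops.
    a = [sigmoid(x) for x in i]              # sigmoid op
    b = [sigmoid(x) for x in f]              # sigmoid op
    t = [math.tanh(x) for x in g]            # tanh op
    c = [bi * ci + ai * ti                   # elementwise_mul + elementwise_add
         for bi, ci, ai, ti in zip(b, c_prev, a, t)]
    h = [sigmoid(oi) * math.tanh(ci) for oi, ci in zip(o, c)]
    return c, h

def cell_fused(i, f, g, o, c_prev):
    # One fused pass: same math, one traversal, no intermediate arrays.
    c, h = [], []
    for ii, fi, gi, oi, ci in zip(i, f, g, o, c_prev):
        cn = sigmoid(fi) * ci + sigmoid(ii) * math.tanh(gi)
        c.append(cn)
        h.append(sigmoid(oi) * math.tanh(cn))
    return c, h
```

On a GPU the win is larger than this sketch suggests, since each unfused op also pays a kernel-launch and global-memory round trip.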

  • rnn_memory_helper introduces many GPU <-> GPU memory copies; this should be optimized at the design level.
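To illustrate the design point (a hypothetical analogy, not Paddle internals): if a step only needs to read the previous step's state, handing out a zero-copy view instead of a copied buffer would avoid the memcpy entirely.

```python
buf = bytearray(1024)      # stand-in for a hidden-state tensor buffer
copy = bytes(buf)          # copy-based hand-off: new allocation + memcpy
view = memoryview(buf)     # view-based hand-off: zero-copy reference

buf[0] = 42                # writer updates the state in place
assert view[0] == 42       # the view observes the update...
assert copy[0] == 0        # ...the copy is a stale snapshot
```

The trade-off is lifetime management: a view-based design must guarantee the source scope outlives its readers, which is presumably why the helper copies today.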

  • There is a synchronization point in recurrent_grad that causes a long wait.

  • Gradient aggregation uses many sum_op calls (a PR optimizing this is already in progress).
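A sketch of the aggregation pattern above (plain Python, not Paddle code): accumulating N partial gradients with a chain of binary adds costs N-1 ops plus intermediate tensors, whereas a single multi-input sum does the same reduction in one op, which is the kind of fusion such a PR would aim for.

```python
from functools import reduce

# Four partial gradients for the same parameter (each a small "tensor").
grads = [[0.1 * i, 0.2 * i] for i in range(1, 5)]

# Chain of binary adds: len(grads) - 1 separate "sum ops".
chained = reduce(lambda a, b: [x + y for x, y in zip(a, b)], grads)

# One fused multi-input sum over all partials at once.
fused = [sum(col) for col in zip(*grads)]

assert all(abs(a - b) < 1e-12 for a, b in zip(chained, fused))
```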

@Xreki Xreki changed the title Optimize the performance of seq2seq model Optimize the performance of seq2seq model on GPU Aug 9, 2019