---
layout: default
title: Dynamic Tensor Rematerialization
---
<h1>Dynamic Tensor Rematerialization (DTR)</h1>
<div style="color:#222;font-size:0.8rem">Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, Zachary Tatlock </div>
<div> <a href="https://arxiv.org/abs/2006.09616">paper</a> <a href="resources/DTR.pptx">slides</a> <a href="https://www.youtube.com/watch?v=S9KJ37Sx2XY">video</a> </div>
<br>
<h2>DTR:</h2>
<ul>
<li>Is a very simple tensor-level runtime cache.</li>
<li>Saves GPU memory for DNN training by recomputing results when needed (sketched below).</li>
<li>Saves 3x memory for only 20% extra compute time!</li>
<li>Thanks to its simplicity, it can easily be implemented in different deep learning frameworks.</li>
</ul>
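<p>Conceptually, a rematerializable tensor only needs to remember the operation and inputs that produced it; its buffer can then be freed under memory pressure and recomputed on the next access. A minimal Python sketch (illustrative names, not the paper's actual implementation):</p>
<pre><code># Minimal sketch of a rematerializable tensor: keep enough metadata
# (the producing op and its inputs) to recompute the buffer on demand.
class RematTensor:
    def __init__(self, op, inputs=()):
        self.op = op          # pure function that produced this tensor
        self.inputs = inputs  # parent RematTensors, kept as metadata
        self.value = op(*[t.get() for t in inputs])  # materialized buffer

    def evict(self):
        self.value = None     # free the buffer; the metadata survives

    def get(self):
        if self.value is None:
            # Rematerialize, recursively recomputing evicted parents.
            self.value = self.op(*[t.get() for t in self.inputs])
        return self.value
</code></pre>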
<h2>Technical Dive</h2>
<ol start="0">
<li>Gradient checkpointing is a technique for saving memory during deep neural network training.</li>
<li>Or, more generally, during reverse-mode automatic differentiation.</li>
<li>However, memory planning is NP-complete.</li>
<li>Checkpointing also has to deal with programs with arbitrary control flow.</li>
<li>To cope, previous work imposed various restrictions that sacrifice performance or usability.</li>
<li>Some works model the program as a stack machine with no heap...</li>
<li>And suffer performance degradation when that assumption is broken!</li>
<li>(For example, NNs with highway connections/branching.)</li>
<li>Other works use an ILP solver, which takes a long time to find the optimal memory plan.</li>
<li>And they only apply to programs/frameworks without control flow, posing problems for real-world adoption.</li>
<li>Additionally, gradient checkpointing couples derivative calculation with saving memory by recomputation.</li>
<li>This adds complexity and limits the range of applications.</li>
<li>DTR tackles the problems above by planning memory greedily at runtime instead of in a compiler pass.</li>
<li>This solves the control-flow and stack-machine issues, as we do not model the program in any way!</li>
<li>Yet, thanks to a novel cache eviction policy, we still achieve great performance (see the sketch after this list).</li>
</ol>
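<p>The eviction policy is where the performance comes from. Roughly following the paper's h<sub>DTR</sub> heuristic (this sketch omits the evicted-neighborhood term, and assumes each tensor records its <code>compute_cost</code>, <code>memory</code>, and <code>last_access</code> time), we greedily evict the tensor that is cheapest to recompute, largest, and stalest:</p>
<pre><code>import time

def heuristic(t, now):
    # Low score = cheap to recompute, big, and long unused = good victim.
    staleness = now - t.last_access
    return t.compute_cost / (t.memory * staleness)

def evict_until_fits(resident, bytes_needed, free_bytes):
    # Called on memory pressure: greedily evict the lowest-scoring
    # resident tensor until the pending allocation fits.
    now = time.monotonic()
    while bytes_needed > free_bytes() and resident:
        victim = min(resident, key=lambda t: heuristic(t, now))
        resident.remove(victim)
        victim.evict()  # RematTensor.get() will restore it if accessed again
</code></pre>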
<h2>Adoption:</h2>
<a href="https://github.com/MegEngine/MegEngine/wiki/Reduce-GPU-memory-usage-by-Dynamic-Tensor-Rematerialization">MegEngine</a> <br>
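<p>As a rough illustration (the exact calls below are an assumption about MegEngine's interface; see the linked wiki for the authoritative, version-specific API), enabling DTR there looks approximately like:</p>
<pre><code>import megengine as mge

# Sketch of turning on DTR in MegEngine; consult the wiki above for
# the exact API of your MegEngine version.
mge.dtr.eviction_threshold = "5GB"  # start evicting above this usage
mge.dtr.enable()
# ...then build and train the model as usual; rematerialization is
# transparent to user code.
</code></pre>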