forked from denizyuret/AutoGrad.jl
ChangeLog
2017-03-12 Deniz Yuret
* DONE:
## Fix buggy gradient of getindex for repeated indices.
## fix sum_outgrads as well to do the right thing.
## fix full to use addindex!
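Why repeated indices matter: for y = x[idx] with a duplicated index, the gradient with respect to x has to add up every dy entry that maps to the same position. Plain indexed assignment keeps only the last write. A minimal sketch in Julia 1.x syntax; `scatter_add!` is a hypothetical stand-in for illustration, not the `addindex!` mentioned above.

```julia
# Hypothetical helper (not AutoGrad's addindex!): add dy[k] into dx[idx[k]],
# so positions that appear more than once in idx accumulate all contributions.
function scatter_add!(dx::AbstractVector, dy::AbstractVector, idx::AbstractVector{<:Integer})
    for (k, i) in enumerate(idx)
        dx[i] += dy[k]
    end
    return dx
end

x   = [10.0, 20.0, 30.0]
idx = [1, 1, 3]            # index 1 is repeated
dy  = [1.0, 2.0, 4.0]      # gradient flowing back into y = x[idx]

# Buggy version: indexed assignment overwrites, the last write to position 1 wins.
naive = zeros(length(x))
naive[idx] = dy

# Accumulating version: position 1 receives dy[1] + dy[2].
correct = scatter_add!(zeros(length(x)), dy, idx)
```

Here `naive` ends up as [2.0, 0.0, 4.0] while `correct` is [3.0, 0.0, 4.0].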
* TODO:
# ungetindex problems:
## we could have indices instead of index and use UngetIndex as a more efficient accumulator.
## use full much more rarely and keep things as UngetIndex.
## implement KnetArray version of addindex!
## figure out julia4 problem with Array{CartesianIndex}
# gradcheck with getindex tuples is broken and not easy to fix: tuple->tuple->tuple type struct means everything needs to be replaced when one leaf changes.
# gradcheck with getindex dict is broken.
# gradcheck with ungetindex does not work, fix and move ungetindex tests from src to test.
# repmat does not work.
# oftype(Rec(a), b) should ignore Rec.
# similar(Rec(a), 2, 3) does not work.
# Check the todo items in interfaces.jl.
# Get the tests out of src.
# Look into the pull request and speed benchmark issue.
# Generate reference document using Documenter.jl converting core.jl comments into docstrings.
# Pick test ranges to avoid floating failures (e.g. acoth args close to 1)
# Test and fix Julia v0.6 compatibility: gradients with broadcast.
# test higher order gradients of cat, getindex and other utilities.
# log and unbroadcast fail test with the new toscalar.
# Implement backward operations in profile.jl.
# Also rnn profile may be different from mnist, may be better to compare on rnn: charlm or copyseq. need to get primitives right.
# rnn example? other examples from autograd, knet? can we do a large mt or lm model?
# implement convenience_wrappers (jacobian etc)? utilities (quick_grad_check for mnist)?
# subarrays: can be used to make gradients for concat faster: as long as we don't have overwriting.
# finish abstractarray(hvcat), array, tuple and dict support.
# avoid double normalization in softmax.
# find the problem with runtests when length=2 and higher order dict support, open issues.
# finish all function gradients.
2017-03-08 Deniz Yuret
* DONE:
# Rename OneHot -> UngetIndex.
2017-02-23 Deniz Yuret
* DONE:
# Add convert as a primitive, need to handle Node(::Rec) constructor. Too dangerous, could not find a safe way.
2016-09-03 Deniz Yuret
* DONE:
# improve testing: generate multiple tests compatible with given types.
# fix problem with check_grads(log,0.12099921186616151,[1.7601369948066048,1.0407040634792377])
# fix problem with (:check_grads,lfact,:args,(1.667842864901635,),:exact,(0.3202972807833342,),:numeric,(0.7822992913247839,))
# Email announce new release with profiling results.
# check slowdown in simple functions
# f4ccc0d is faster than ae84389 ?
* f4ccc0d-vs-ae84389: unboxing was the culprit, fixed.
abb2cfe 2016-09-03
1.961046 seconds (1.45 M allocations: 604.340 MB, 1.71% gc time)
4.399855 seconds (2.76 M allocations: 2.182 GB, 2.34% gc time)
a71400c 2016-09-03
2.108507 seconds (1.45 M allocations: 600.952 MB, 1.66% gc time)
4.650728 seconds (2.76 M allocations: 2.177 GB, 2.21% gc time)
ae84389 2016-08-31
2.060078 seconds (1.45 M allocations: 600.952 MB, 1.58% gc time)
4.598383 seconds (2.94 M allocations: 2.183 GB, 2.09% gc time)
f4ccc0d 2016-08-31
2.041250 seconds (1.45 M allocations: 601.044 MB, 1.60% gc time)
4.605514 seconds (3.11 M allocations: 2.191 GB, 2.07% gc time)
2016-09-01 Deniz Yuret
* DONE:
# test dict and tuple access.
2016-08-31 Deniz Yuret
* DONE:
# fix rosenbrock(x) = sum(map((i, j) -> (1 - j)^2 + 100*(i - j^2)^2, x[2:end], x[1:end-1]))
# improve ungetindex (sparse container?)
* Rosenbrock:
# Profile memory overhead, check cost of recording large number of ops
# rosenbrock(x) = sum(map((i, j) -> (1 - j)^2 + 100*(i - j^2)^2, x[2:end], x[1:end-1]))
# grad fails with very large inputs. Debug memory usage. Closures?
# Vectorized version:
# rosenbrock2(x) = sum((1-x[1:end-1]).^2 + 100*(x[2:end]-x[1:end-1].^2).^2)
using AutoGrad
f1(x)=sum(map((i, j) -> (1 - j)^2 + 100*(i - j^2)^2, x[2:end], x[1:end-1]))
f2(x)=sum((1-x[1:end-1]).^2 + 100*(x[2:end]-x[1:end-1].^2).^2)
g1=grad(f1)
g2=grad(f2)
x = [ rand(10^i) for i=1:6 ]
for i=1:6; print("$i "); @time g2(x[i]); end
#for i=1:6; print("$i "); @time g1(x[i]); end
# Is it map? No:
f3(x)=(s=0; for i=2:length(x); s+=(1-x[i-1])^2 + 100*(x[i]-x[i-1]^2)^2; end; s)
g3 = grad(f3)
@time g3(rand(200)) => 58.218724 seconds (143.24 M allocations: 7.839 GB, 3.70% gc time)
@time g3(rand(200)) => 0.020137 seconds (203.65 k allocations: 5.928 MB)
What is going on the first time? Even if we change the length by
1, everything slows down! Memory allocation shoots up. It only runs
fast on the exact length it has seen before!
Changed the way we do sum_outgrads. Much better performance.
Still explodes when much more than 10K entries are used. The reason
is ungetindex. It creates N dense copies of an N element array,
one for each element. Could use sparse arrays? SparseMatrix only
covers 2-D. No gpu implementation. Slows everything down. Still
cannot handle 100K. Giving up for now but will think about
ungetindex improvements.
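The three Rosenbrock formulations in this entry use 2016-era syntax (arithmetic such as `1-x[...]` applied directly to arrays). A sketch of the same functions in Julia 1.x syntax, checked against a hand-derived gradient and central differences instead of AutoGrad's `grad`. `rosenbrock_grad` and `numeric_grad` are helper names made up for this example.

```julia
# map-based, vectorized, and loop-based versions (f1, f2, f3 in the notes).
f1(x) = sum(map((i, j) -> (1 - j)^2 + 100 * (i - j^2)^2, x[2:end], x[1:end-1]))
f2(x) = sum((1 .- x[1:end-1]).^2 .+ 100 .* (x[2:end] .- x[1:end-1].^2).^2)
function f3(x)
    s = zero(eltype(x))
    for i in 2:length(x)
        s += (1 - x[i-1])^2 + 100 * (x[i] - x[i-1]^2)^2
    end
    return s
end

# Hand-derived gradient of f3, used only as a reference.
# Each term touches two neighbouring entries, so contributions are accumulated.
function rosenbrock_grad(x)
    g = zeros(eltype(x), length(x))
    for i in 2:length(x)
        r = x[i] - x[i-1]^2
        g[i-1] += -2 * (1 - x[i-1]) - 400 * x[i-1] * r
        g[i]   += 200 * r
    end
    return g
end

# Central finite differences, one coordinate at a time.
function numeric_grad(f, x; h=1e-6)
    g = similar(x)
    for k in eachindex(x)
        xp = copy(x); xp[k] += h
        xm = copy(x); xm[k] -= h
        g[k] = (f(xp) - f(xm)) / (2h)
    end
    return g
end

x = [0.1, 0.3, 0.5, 0.7, 0.9]
max_err = maximum(abs.(rosenbrock_grad(x) .- numeric_grad(f3, x)))
```

The gradient accumulates into `g[i-1]` and `g[i]` in place, so it needs a single N element buffer instead of one dense array per element.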
2016-08-30 Deniz Yuret
* DONE:
# Profile speed
# Get rid of closures
# Make gradients generic
# Implement broadcast and reduce more efficiently than CUDNN.
# Test higher order derivatives
# Update documentation and comments.
# Problem with getindex. Fix it, then test small version of mnist.
# exporting @primitive by itself doesn't work, it needs recorder, getval etc. to be exported to work.
# remove sum_outgrads by using the first outgrad as the accumulator
## but sum_outgrads is a primitive and we can only do this after we support overwriting operations.
## did this by using a functional accumulator
# optimize for (tuple) loops in core.jl?
# check and eliminate all tight uses of (?:)
* CANCELLED:
# CUBLAS:master already has highlevel.jl implemented with A_mul_B etc. (it is incomplete!) (make sure it doesn't transpose). Update CUDNN as well. Put the rest of the primitives in CUDArt? Does general broadcasting make sense for GPU?
# Check out https://github.com/timholy/FastAnonymous.jl -- don't use them any more.
# work on memory: overwriting: 3arg (overwriting) functions (Ax_mul_Bx, broadcast) or julia v0.5 overwrite syntax or InplaceOps.
# use TypeNode <: Type instead of Node{Type}?
2016-08-29 Deniz Yuret
* Timing:
commit   run   forw  grad
e3e85c9  1.09  1.82  3.53
e6b7b49  1.09  1.82  3.30  # fixed useless copy in broadcast.
         @time 1.802994 seconds (3.67 M allocations: 160.433 MB, 1.86% gc time)
292b854  1.09  1.68  2.99  # fixed recorder
         @time 1.715418 seconds (3.13 M allocations: 127.752 MB, 1.68% gc time)
8edef74  1.10  1.55  2.96  # eliminated gradient closures
         @time 1.553175 seconds (2.59 M allocations: 110.687 MB)
* Operations: commented output from nnet shows unnecessary operations:
sum(((w[3]*max(0,w[1]*x.+w[2]).+w[4])-y).^2)
187:w (:rcall,AutoGrad.merge_tapes,:A187_4,:A187_4,:A187_4)
226:x
153:y
119:w3 (:rcall,getindex,:A119_10_64,:A187_4,3)
581:w1 (:rcall,getindex,:A581_64_784,:A187_4,1)
506:a1=w1*x (:rcall,*,:A506_64_100,:A581_64_784,:A226_784_100)
429:w2 (:rcall,getindex,:A429_64,:A187_4,2)
373:a2=a1+w2 (:rcall,.+,:A373_64_100,:A506_64_100,:A429_64)
285:a3=max(0,a2) (:rcall,max,:A285_64_100,0,:A373_64_100)
104:a4=w3*a3 (:rcall,*,:A104_10_100,:A119_10_64,:A285_64_100)
866:w4 (:rcall,getindex,:A866_10,:A187_4,4)
404:a5=a4+w4 (:rcall,.+,:A404_10_100,:A104_10_100,:A866_10)
551:a6=a5-y (:rcall,-,:A551_10_100,:A404_10_100,:A153_10_100)
271:a7=a6^2 (:rcall,.^,:A271_10_100,:A551_10_100,2)
000:a8=sum(a7) (:rcall,sum,332.66202f0,:A271_10_100)
000.d8=1
000.413:d7=1.+zeros(a7) ***
# Leave sum grad as identity, matmul may have a problem!
- 271.544:d6a=a6^1 *** (:rcall,.^,:A544_10_100,:A551_10_100,1)
# # Fix gradient of .^ to check for .^2 and .^1
? 271.344:d6b=2*d6a (:rcall,.*,:A344_10_100,2,:A544_10_100)
+ 271.939:d5=d6=d7*d6b * (:rcall,.*,:A939_10_100,:A413_10_100,:A344_10_100)
- 404.481:d4=d5*1 *** (:rcall,.*,:A481_10_100,:A939_10_100,1)
- 404.156:g4a=d5*1 *** (:rcall,.*,:A156_10_100,:A939_10_100,1)
# # Fix gradient of .+ to be identity
+ 404.159:g4=sum(g4a,2) (:rcall,sum,:A159_10_1,:A156_10_100,2)
+ 866.945:g=unget(w,g4,4) (:rcall,AutoGrad.ungetindex,:A945_4,:A187_4,:A159_10_1,4)
+ 104.466:g3=d4*a3' (:rcall,A_mul_Bc,:A466_10_64,:A481_10_100,:A285_64_100)
+ 104.446:d3=w3'*d4 (:rcall,Ac_mul_B,:A446_64_100,:A119_10_64,:A481_10_100)
+ 285.853:d2=d3.*(x.==y) (:rcall,.*,:A853_64_100,:A446_64_100,:A43_64_100)
- 373.091:d1=d2.*1 *** (:rcall,.*,:A91_64_100,:A853_64_100,1)
- 373.489:g2a=d2.*1 *** (:rcall,.*,:A489_64_100,:A853_64_100,1)
# # Same .+ problem.
+ 373.048:g2=sum(g2a,2) (:rcall,sum,:A48_64_1,:A489_64_100,2)
+ 429.730:g=unget(w,g2,2) (:rcall,AutoGrad.ungetindex,:A730_4,:A187_4,:A48_64_1,2)
+ 506.636:g1=d1*x' (:rcall,A_mul_Bc,:A636_64_784,:A91_64_100,:A226_784_100)
+ 581.050:g=unget(w,g1,1) (:rcall,AutoGrad.ungetindex,:A50_4,:A187_4,:A636_64_784,1)
+ 119.117:g=unget(w,g3,3) (:rcall,AutoGrad.ungetindex,:A117_4,:A187_4,:A466_10_64,3)
+ 187.993:g=sum (:rcall,AutoGrad.sum_helper,:A993_4,:A945_4,:A730_4,:A50_4,:A117_4)
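The steps marked *** in the trace (`d5*1`, `d2.*1`) are wasted work. For `.+` the gradient is the incoming dy unchanged, and a `sum` along the broadcast dimension is needed only when the argument was smaller than the result, as in `g4=sum(g4a,2)`. A minimal sketch in Julia 1.x syntax; this `unbroadcast` is an illustrative helper written for this note and may differ from any function of the same name in the package.

```julia
# Illustrative helper: reduce dy back to the shape of the argument x of `x .+ y`.
function unbroadcast(x::AbstractArray, dy::AbstractArray)
    # Same shape: the gradient of .+ is the identity, return dy without copying.
    size(x) == size(dy) && return dy
    # Otherwise sum over every dimension along which x was expanded.
    dims = Tuple(d for d in 1:ndims(dy) if size(x, d) == 1 && size(dy, d) > 1)
    return reshape(sum(dy; dims=dims), size(x))
end

a  = ones(3, 4)                          # same shape as the output
b  = [1.0, 2.0, 3.0]                     # bias vector, broadcast over columns
dy = reshape(collect(1.0:12.0), 3, 4)    # gradient of the loss w.r.t. a .+ b

ga = unbroadcast(a, dy)    # identical object to dy
gb = unbroadcast(b, dy)    # row sums, shape (3,)
```

Returning the same array for `ga` saves an allocation, but it is only safe as long as no later step overwrites gradients in place.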
2016-08-25 Deniz Yuret
* DONE:
# Try autograd mnist with own memory allocator for the gpu (to use instead of similar).
# Finish math.jl.
* PLAN:
# exhaust primitives by implementing more models.
# complete primitive implementations, utilities and release optimized cpu version.
# try supporting cuda without overwriting, see the speed.
# if that doesn't work seriously think about overwriting.
2016-08-24 Deniz Yuret
* test/profprim.jl: profiling primitives. See the file for results.
Fixed inefficient cpu Knet.relu: (Not good to use (?:) instead of if!)
profknet.jl:train0(epochs=10,1layer): cpu=2.5444, gpu=1.8276
profgpu.jl:train0(epochs=10,1layer): cpu=2.7664, gpu=3.5490; # not sure why this changed.
profknet.jl:train0(epochs=10,2layer): cpu=4.3184, gpu=2.9201
profgpu.jl:train0(epochs=10,2layer): cpu=5.2096, gpu=7.9503 # why cpu slower than knet? why is gpu slow?
* timing:
profile.jl:train0(epochs=1,2layer): 0.6234 using logsumexp, 0.5543 using xentloss; cannot test gpu with weights(64) without relu or max.
profgpu.jl:train0(epochs=1,2layer): 0.9308 cpu using relu, 0.8403 gpu using relu; why is relu slower than max? because it used (?:) instead of if/else!
profgpu.jl:train0(epochs=10,1layer): cpu=2.7302, gpu=2.9500; is this the slowdown from gpu alloc?
profknet.jl:train0(epochs=10,2layer): cpu=7.7560, gpu=2.9201 # cpu slow!
* DONE:
# support CudaArrays and compare mnist speed with Knet. (without overwriting at first, fix gc)
## need gpu *,.+,.-,.*,max,log,exp,sum,sum1 and whatever is needed for their gradients.
## basically matmul, broadcast, relu and softmax...
# It might be better to profile with primitive functions with and without allocation: test matmul,badd,relu,softmax on cpu/gpu w/wo alloc. compare different methods for relu and softmax.
2016-08-22 Deniz Yuret
* DONE:
# fixtest should give warning for unknown argtypes, not error.
# BUG: keyword args mishandled by @primitive.
## sin{##7516 <: Number}(x::Node{##7516},o...; a=3) = sin(getval(x),o...; a=3)
# write documentation and publish.
2016-08-21 Deniz Yuret
* DONE:
# @primitive should take regular argtypes and generate signatures with Nodes.
# concatenation? cat,hcat,vcat
## need to figure out where each input ends up in final array.
## call signature is complicated, vcat is already defined generically, including numbers. need Node inputs.
2016-08-20 Deniz Yuret
* DONE:
# copying, reshaping.
* q: Node{Type} vs TypeNode <: Type
We chose the first (a single parametric type) whereas the Python
implementation uses the second (many types, I don't think Python
supports parametric types). In the first, all types are children
of Node, in the second we can hang them anywhere we want in the
type hierarchy. Today Emre Yolcu suggested a possible advantage of
the second approach: in cases where one Julia function calls a
lower level Julia function in Base, we could get away with
defining only the lower level function as primitive as long as we
make AbstractFloatNode a subtype of AbstractFloat, and the Julia
functions are written for a supertype of AbstractFloat, for
example. It is not possible to make Node{Float32} a subtype of
anything other than Node and Any. This would make it easier to
cover groups of functions such as (*)->A_mul_B!->gemm_wrapper! or
vcat->cat by only letting us define the lowest level one. We'd
need to cover arrays, matrices, vectors, tuples, dicts, floats.
It would also solve the problem of not being able to define
Node{A<:AbstractArray{T<:Number}}. The current Node type used by
core.jl would have to be a big union instead of a supertype, which
may affect efficiency. On the downside we may have to define many
types and the "lowest level" we depend on may change in the next
Julia version.
Another point about ArrayNode:
Currently we can have f{T<:AbstractArray}(x::Node{T})
This will cover array nodes as well as abstractarray nodes.
If we switch and have AbstractArrayNode <: AbstractArray and
ArrayNode <: Array, any function we define for the first will not
be inherited by the second. If we use ArrayNode <:
AbstractArrayNode, then we lose the main advantage of capturing
function groups defined for Arrays.
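The trade-off above can be seen in a toy example. A sketch in Julia 1.x syntax: `Node` and `FloatNode` here are stripped-down stand-ins defined only for this illustration, and `lowlevel`/`highlevel` are made-up functions playing the role of a Base function pair such as vcat->cat.

```julia
# Design 1: a single parametric wrapper. Node{Float64} is a subtype of Node and Any only.
struct Node{T}
    value::T
end

# Design 2: a node type hung inside the numeric type hierarchy.
struct FloatNode <: AbstractFloat
    value::Float64
end

# A "library" pair: highlevel is written generically for AbstractFloat
# and does its work by calling lowlevel.
lowlevel(x::Float64) = 2x
highlevel(x::AbstractFloat) = lowlevel(lowlevel(x))

# With design 2 only the lowest-level function needs a node method;
# highlevel then works on FloatNode through ordinary dispatch.
lowlevel(x::FloatNode) = FloatNode(lowlevel(x.value))

r = highlevel(FloatNode(3.0))                     # FloatNode with value 12.0
node_is_float  = Node{Float64} <: AbstractFloat   # false
node_supported = applicable(highlevel, Node(3.0)) # false: needs its own method
```

With design 1, `highlevel` would need an explicit `Node` method even though its body never changes, which is the cost the notes describe.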
2016-08-19 Deniz Yuret
* DONE:
# removed zeros_like: should use `nothing` in tuples, cell arrays, and dicts and have sum_outgrads understand this: will save 20%!
# optimize dy.*1 gradients, but be careful not to pass back same array multiple times if we are going to use accumulators. sum_grads already does?
# remove the need for transpose in matmul.
# use more specific type signatures instead of ambiguous Nval.
# get rid of the _name hash: it prevents gc, write a specific dbg printer instead.
# work on efficiency (are closures efficient? nothing instead of zero arrays?). do closures get garbage collected?
# check out JuliaDiff/ReverseDiffSource: cannot handle while loops? http://www.juliadiff.org/
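The zeros_like removal relies on treating `nothing` as an additive zero, so that entries of a tuple (or dict) that received no gradient never need a dense zero array. A minimal sketch of such an accumulator in Julia 1.x syntax; `addgrad` and `sumgrads` are hypothetical names and this is not the actual sum_outgrads code.

```julia
# `nothing` acts as the zero element: adding it changes nothing and allocates nothing.
addgrad(::Nothing, ::Nothing) = nothing
addgrad(a, ::Nothing) = a
addgrad(::Nothing, b) = b
addgrad(a::Number, b::Number) = a + b
addgrad(a::AbstractArray, b::AbstractArray) = a .+ b
# Containers are summed slot by slot, so a slot can stay `nothing` throughout.
addgrad(a::Tuple, b::Tuple) = map(addgrad, a, b)

# Sum any number of gradients; with no inputs the result is `nothing`.
sumgrads(gs...) = foldl(addgrad, gs; init=nothing)

g1 = (ones(2), nothing)          # only the first slot got a gradient
g2 = (nothing, [1.0, 2.0])       # only the second slot got a gradient
g3 = ([1.0, 1.0], [0.5, 0.5])

total = sumgrads(g1, g2, g3)
```

`total` is ([2.0, 2.0], [1.5, 2.5]), and none of the `nothing` slots were ever expanded into a zero array.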
* profiling:
(17) after getting rid of zeros_like: (compare to #15)
0. 1.059346 seconds (776.62 k allocations: 309.133 MB)
1. 1.011051 seconds (774.22 k allocations: 309.096 MB)
2. 0.267993 seconds (60.78 k allocations: 58.240 MB)
3. 0.241966 seconds (29.58 k allocations: 49.606 MB)
4. 0.459659 seconds (454.38 k allocations: 76.330 MB)
6576 ...d/examples/footime.jl; train0; line: 73 g = gradfun(w, x, y)
6571 ...AutoGrad/src/core.jl; gradfun; line: 36 backward_pass(forward_pass(fun, args, kwargs, argnum)...)
437 ...utoGrad/src/core.jl; backward_pass; line: 209 cur_outgrad = sum_outgrads(node.outgrads)
428 ...utoGrad/src/core.jl; sum_outgrads; line: 541 sum_helper(a...)
422 ...utoGrad/src/core.jl; sum_helper; line: 549 sum_helper{T}(a::AbstractArray{T},b::AbstractArray{T},c::AbstractArray{T}...) =
410 ...utoGrad/src/core.jl; backward_pass; line: 213 for (gradfun, parent) in node.parent_grad_ops
142 ./array.jl; next; line: 277
171 tuple.jl; indexed_next; line: 21
2094 ...utoGrad/src/core.jl; backward_pass; line: 215 og = gradfun(cur_outgrad)
171 .../src/collections.jl; anonymous; line: 30 getindex(::D1,y,x,i...) = dy->ungetindex(x,dy,i...) # y=x[i], dy=df/dy
713 ...rc/linalg/matmul.jl; anonymous; line: 2
383 linalg/matmul.jl; A_mul_Bc; line: 187
# There is some inefficiency here with actual transposing rather than calling transpose gemm!
295 operators.jl; A_mul_Bc; line: 164
279 no file; ctranspose; line: 0
250 ...toGrad/src/core.jl; r; line: 127 result = f(argvals...; kwargs...)
249 arraymath.jl; ctranspose; line: 414
239 arraymath.jl; transpose!; line: 323
155 arraymath.jl; transposeblock!; line: 350
889 ...utoGrad/src/util.jl; anonymous; line: 22 gexp = :(dy->dy.*$(_d[i]))
106 no file; .*; line: 0
# Who is calling this? Probably max.
212 no file; .==; line: 0
204 ...utoGrad/src/core.jl; u; line: 408 u(x...; o...)=f(map(getval,x)...; o...)
190 broadcast.jl; .==; line: 363
171 broadcast.jl; broadcast!; line: 246
437 sparse/sparsematrix.jl; .*; line: 1026
3496 ...utoGrad/src/core.jl; forward_pass; line: 72 end_node = fun(args...; kwargs...)
1783 ...examples/footime.jl; loss; line: 39 ypred = predict(w, x)
1142 ...examples/footime.jl; predict; line: 32 x = max(0, w[i]*x .+ w[i+1])
419 no file; *; line: 0
337 ...toGrad/src/core.jl; r; line: 127 result = f(argvals...; kwargs...)
337 linalg/matmul.jl; *; line: 132
242 no file; .+; line: 0
161 no file; getindex; line: 0
319 no file; max; line: 0
235 ...toGrad/src/core.jl; r; line: 127 result = f(argvals...; kwargs...)
234 operators.jl; max; line: 391
601 ...examples/footime.jl; predict; line: 35 return w[i]*x .+ w[i+1]
157 no file; *; line: 0
239 no file; .+; line: 0
205 no file; getindex; line: 0
729 ...examples/footime.jl; loss; line: 40 ynorm = ypred .- log(sum(exp(ypred),1))
225 no file; .-; line: 0
255 no file; exp; line: 0
153 ...utoGrad/src/core.jl; r; line: 127 result = f(argvals...; kwargs...)
149 operators.jl; exp; line: 381
103 no file; log; line: 0
140 no file; sum; line: 0
982 ...examples/footime.jl; loss; line: 41 -sum(ygold .* ynorm) / size(ygold,2)
207 no file; -; line: 0
339 no file; .*; line: 0
120 ...utoGrad/src/core.jl; r; line: 127 result = f(argvals...; kwargs...)
116 ...se/sparsematrix.jl; .*; line: 1026
257 no file; /; line: 0
168 no file; sum; line: 0
(18) after matmul is fixed (no transpose, compare to #17):
0. 1.001226 seconds (769.42 k allocations: 256.563 MB)
1. 0.927784 seconds (767.02 k allocations: 256.527 MB)
2. 0.262618 seconds (60.78 k allocations: 58.240 MB)
3. 0.241753 seconds (29.58 k allocations: 49.606 MB)
4. 0.421250 seconds (450.78 k allocations: 76.220 MB)
5567 ...d/examples/footime.jl; train0; line: 73 g = gradfun(w, x, y)
5557 ...AutoGrad/src/core.jl; gradfun; line: 36 backward_pass(forward_pass(fun, args, kwargs, argnum)...)
321 ...utoGrad/src/core.jl; backward_pass; line: 209 cur_outgrad = sum_outgrads(node.outgrads)
303 ...utoGrad/src/core.jl; sum_outgrads; line: 530 sum_helper(a...)
296 ...utoGrad/src/core.jl; sum_helper; line: 538 sum_helper{T}(a::AbstractArray{T},b::AbstractArray{T},c::AbstractArray{T}...) =
359 ...utoGrad/src/core.jl; backward_pass; line: 213 for (gradfun, parent) in node.parent_grad_ops
108 ./array.jl; next; line: 277
153 tuple.jl; indexed_next; line: 21
1740 ...utoGrad/src/core.jl; backward_pass; line: 215 og = gradfun(cur_outgrad)
181 .../src/collections.jl; anonymous; line: 30 getindex(::D1,y,x,i...) = dy->ungetindex(x,dy,i...) # y=x[i], dy=df/dy
154 no file; ungetindex; line: 0
574 ...rc/linalg/matmul.jl; anonymous; line: 2
440 linalg/matmul.jl; A_mul_Bc; line: 187
440 linalg/matmul.jl; A_mul_Bt; line: 156
433 linalg/matmul.jl; gemm_wrapper!; line: 329
433 linalg/blas.jl; gemm!; line: 633
680 ...utoGrad/src/util.jl; anonymous; line: 29 gexp = :(dy->dy.*$(_d[i]))
193 no file; .==; line: 0
183 ...utoGrad/src/core.jl; u; line: 397 u(x...; o...)=f(map(getval,x)...; o...)
168 broadcast.jl; .==; line: 363
129 broadcast.jl; broadcast!; line: 246
325 sparse/sparsematrix.jl; .*; line: 1026
197 broadcast.jl; broadcast!; line: 246
107 broadcast.jl; _F_; line: 97
102 ...utoGrad/src/util.jl; new_fun; line: 227 return sum(result, d)
2968 ...utoGrad/src/core.jl; forward_pass; line: 72 end_node = fun(args...; kwargs...)
1778 ...examples/footime.jl; loss; line: 39 ypred = predict(w, x)
1177 ...examples/footime.jl; predict; line: 32 x = max(0, w[i]*x .+ w[i+1])
525 no file; *; line: 0
447 ...toGrad/src/core.jl; r; line: 127 result = f(argvals...; kwargs...)
445 linalg/matmul.jl; *; line: 132
432 linalg/matmul.jl; gemm_wrapper!; line: 329
432 linalg/blas.jl; gemm!; line: 633
156 no file; .+; line: 0
188 no file; getindex; line: 0
307 no file; max; line: 0
237 ...toGrad/src/core.jl; r; line: 127 result = f(argvals...; kwargs...)
235 operators.jl; max; line: 391
548 ...examples/footime.jl; predict; line: 35 return w[i]*x .+ w[i+1]
141 no file; *; line: 0
197 no file; .+; line: 0
207 no file; getindex; line: 0
552 ...examples/footime.jl; loss; line: 40 ynorm = ypred .- log(sum(exp(ypred),1))
198 no file; .-; line: 0
156 no file; exp; line: 0
113 no file; sum; line: 0
635 ...examples/footime.jl; loss; line: 41 -sum(ygold .* ynorm) / size(ygold,2)
139 no file; -; line: 0
217 no file; .*; line: 0
157 no file; /; line: 0
116 no file; sum; line: 0
(19) AutoGrad vs Knet timing with/without gpu, 10 epoch, mnist64 on biyofiz-4-1:
AutoGrad CPU:
0. 6.171894 seconds (7.69 M allocations: 2.505 GB) forw+rec-back-update
1. 6.265890 seconds (7.67 M allocations: 2.505 GB) forw+rec-back
2. 1.673144 seconds (607.80 k allocations: 582.394 MB) forw
3. 1.471988 seconds (295.80 k allocations: 496.060 MB) forw (no loss)
4. 3.023412 seconds (4.51 M allocations: 762.204 MB) forw+rec
Knet CPU:
0. 7.661409 seconds (88.90 M allocations: 1.367 GB) forw-back-update
1. 8.407676 seconds (108.31 M allocations: 1.653 GB) forw-back
2. 4.235033 seconds (54.22 M allocations: 845.108 MB) forw
Knet GPU:
0. 2.925609 seconds (4.22 M allocations: 189.742 MB) forw-back-update
1. 2.792329 seconds (4.08 M allocations: 184.254 MB) forw-back
2. 1.417462 seconds (1.96 M allocations: 86.087 MB) forw
2016-08-18 Deniz Yuret
* DONE:
# profile mnist
* profiling:
- Hypotheses:
- closures slow.
- untyped functions slow.
- memory allocation slow.
- two types of memory alloc: user code, grad code.
- for user code, let the user do it, support overwriting ops (careful about non-Node arrays)
- for grad code, keep around a tape after it is used? how about args in closures?
(1) AutoGrad: Testing 1 epoch h=64 mnist with minibatch=100, best out of 3 on ural from fresh start.
0. Regular train (record, update)
1. Forward back (record, no update)
2. Nonrecording, no update, just forward (computes softloss)
3. Just predict, no softloss.
4. forward_pass(loss) (recording version of #2)
0. 13.944595 seconds (3.95 M allocations: 2.299 GB, 18.09% gc time)
1. 13.100620 seconds (3.93 M allocations: 1.844 GB, 15.03% gc time)
2. 4.288226 seconds (64.38 k allocations: 114.196 MB, 0.17% gc time)
3. 4.242340 seconds (33.18 k allocations: 98.294 MB, 0.12% gc time)
(2) Problem: mixing Float64 weights with Float32 data!
(3) AutoGrad: Use Float64 weights and data
0. 6.512184 seconds (3.94 M allocations: 2.299 GB, 38.67% gc time)
1. 5.719212 seconds (3.92 M allocations: 1.843 GB, 34.23% gc time)
2. 0.483972 seconds (60.78 k allocations: 114.032 MB, 1.31% gc time)
3. 0.444158 seconds (29.58 k allocations: 98.129 MB, 1.01% gc time)
(4) AutoGrad: Use Float32 weights and data
0. 4.275396 seconds (3.94 M allocations: 1.237 GB, 33.43% gc time)
1. 3.702224 seconds (3.92 M allocations: 1.008 GB, 32.20% gc time)
2. 0.269955 seconds (60.78 k allocations: 58.239 MB, 1.22% gc time)
3. 0.242950 seconds (29.58 k allocations: 49.606 MB, 1.13% gc time)
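The type mix in (2) is easy to reproduce: `rand(m, n)` defaults to Float64, so weights created that way promote every product with Float32 data to Float64. A sketch in Julia 1.x syntax with layer sizes matching the h=64 mnist setup; the variable names are made up, and the note that a mixed product may also take a slower generic multiplication path is an assumption that depends on the Julia version.

```julia
w64 = rand(64, 784)              # rand defaults to Float64
x32 = rand(Float32, 784, 100)    # data stored as Float32

y_mixed = w64 * x32              # element type is promoted to Float64

# Fix: create (or convert) the weights with the same element type as the data.
w32 = Float32.(w64)
y32 = w32 * x32                  # stays Float32, half the memory per element

mixed_type = eltype(y_mixed)
same_type  = eltype(y32)
```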
(5) Continue with Float32.
(6) AutoGrad: gc_enable(false) while measuring time:
0. 2.760838 seconds (3.94 M allocations: 1.236 GB)
1. 2.489116 seconds (3.92 M allocations: 1.007 GB)
2. 0.266896 seconds (60.78 k allocations: 58.239 MB)
3. 0.240501 seconds (29.58 k allocations: 49.606 MB)
(7) AutoGrad: use axpy! during update:
0. 2.559526 seconds (3.92 M allocations: 1.008 GB)
(8) Compare with Knet: same task.
0. 1.128186 seconds (8.94 M allocations: 141.341 MB, 1.63% gc time) Forw-back-update
1. 1.244718 seconds (10.76 M allocations: 168.832 MB, 1.70% gc time) Forward-back (no update)
2. 0.590746 seconds (5.39 M allocations: 84.441 MB, 1.41% gc time) Forward only
(9) broadcast vs vectorized ops. (testing exp(a) where a=rand(10000,10000))
2.989990 seconds (2 allocations: 762.940 MB): exp(a)
2.920066 seconds (12 allocations: 762.940 MB, 0.22% gc time): broadcast(exp,a)
2.725942 seconds (1 allocation: 32 bytes): broadcast!(exp,b,a)
2.724059 seconds (1 allocation: 32 bytes): broadcast!(exp,a,a)
Not much difference, vectorized ops to be deprecated in favor of broadcast in Julia v0.6.
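For reference, the four variants timed in (9) look like this in Julia 1.x syntax, where the elementwise form is written with dot syntax. A small sketch on a 100x100 array instead of 10000x10000; it only shows that the variants agree and which ones reuse memory, not the timings.

```julia
a = rand(100, 100)
b = similar(a)

y1 = exp.(a)               # dot syntax: allocates a new result array
y2 = broadcast(exp, a)     # same operation spelled out, also allocates
broadcast!(exp, b, a)      # writes into the preallocated b

c = copy(a)
broadcast!(exp, c, c)      # fully in place: output overwrites the input
```

The `broadcast!` forms correspond to the 32-byte allocation lines in the measurements.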
(10) Julia v0.5 is not faster (but maybe it hasn't been compiled optimally).
0. 2.858207 seconds (3.78 M allocations: 1.011 GB)
(11) Profile1
2276 ...AutoGrad/src/core.jl; gradfun; line: 36 backward_pass(forward_pass(fun, args, kwargs, argnum)...)
242 ...AutoGrad/src/core.jl; backward_pass; line: 208 cur_outgrad = sum_outgrads(node.outgrads...)
237 ...utoGrad/src/core.jl; sum_outgrads; line: 570
192 ...utoGrad/src/core.jl; sum_outgrads; line: 570
187 broadcast.jl; broadcast; line: 253
142 broadcast.jl; broadcast!; line: 246
711 ...AutoGrad/src/core.jl; backward_pass; line: 214 og = gradfun(cur_outgrad)
415 .../src/collections.jl; anonymous; line: 14 getindex(::D1,y,x,i...) = dy->ungetindex(x,dy,i...)
412 no file; ungetindex; line: 0
298 ...utoGrad/src/core.jl; r; line: 126
298 .../src/collections.jl; ungetindex; line: 21
297 ...toGrad/src/core.jl; fill_internal; line: 546
295 ...oGrad/src/core.jl; fill_check; line: 549
291 ...oGrad/src/core.jl; fill_internal; line: 546
264 array.jl; fill!; line: 193
150 ...utoGrad/src/util.jl; anonymous; line: 22
953 ...AutoGrad/src/core.jl; forward_pass; line: 72 end_node = fun(args...; kwargs...)
519 .../examples/footime.jl; loss; line: 39 ypred = predict(w, x)
329 ...examples/footime.jl; predict; line: 32 x = max(0, w[i]*x .+ w[i+1])
108 no file; .+; line: 0
186 ...examples/footime.jl; predict; line: 35 return w[i]*x .+ w[i+1]
260 .../examples/footime.jl; loss; line: 40 ynorm = ypred .- log(sum(exp(ypred),1))
173 .../examples/footime.jl; loss; line: 41 -sum(ygold .* ynorm) / size(ygold,2)
(12) Focusing on forward:
2. 0.266896 seconds (60.78 k allocations: 58.239 MB) call loss
4. 0.939486 seconds (1.96 M allocations: 146.032 MB) call forward_pass(loss) (recording)
(13) Taking out all the debug calls (turning them into macros)
0. 1.556437 seconds (1.12 M allocations: 906.322 MB)
2. 0.267631 seconds (60.78 k allocations: 58.239 MB)
4. 0.544930 seconds (587.58 k allocations: 82.464 MB)
At this point recording takes as much time as forward calculation, backward pass takes 4x more.
Forward is as efficient as can be, dominated by array ops, except for excessive allocation.
(14) What is costly in forward_pass? (profiling 20 epochs)
10176 ...AutoGrad/src/core.jl; forward_pass; line: 72 end_node = fun(args...; kwargs...)
7875 .../examples/footime.jl; loss; line: 39 ypred = predict(w, x)
6595 ...examples/footime.jl; predict; line: 32 x = max(0, w[i]*x .+ w[i+1])
810 no file; *; line: 0
3307 no file; .+; line: 0
1636 no file; getindex; line: 0
832 no file; max; line: 0
1231 ...examples/footime.jl; predict; line: 35 return w[i]*x .+ w[i+1]
294 no file; *; line: 0
448 no file; .+; line: 0
484 no file; getindex; line: 0
1391 .../examples/footime.jl; loss; line: 40 ynorm = ypred .- log(sum(exp(ypred),1))
463 no file; .-; line: 0
440 no file; exp; line: 0
227 no file; log; line: 0
255 no file; sum; line: 0
907 .../examples/footime.jl; loss; line: 41 -sum(ygold .* ynorm) / size(ygold,2)
160 no file; -; line: 0
331 no file; .*; line: 0
198 no file; /; line: 0
215 no file; sum; line: 0
Compare with regular call:
2526 ...d/examples/footime.jl; train2; line: 96 z = loss(w, x, y)
1604 ...examples/footime.jl; predict; line: 32 x = max(0, w[i]*x .+ w[i+1])
262 broadcast.jl; .+; line: 315
392 linalg/matmul.jl; *; line: 132
943 operators.jl; max; line: 391
201 ...examples/footime.jl; predict; line: 35 return w[i]*x .+ w[i+1]
152 broadcast.jl; .+; line: 315
46 linalg/matmul.jl; *; line: 132
154 broadcast.jl; .-; line: 321
324 operators.jl; exp; line: 381
36 operators.jl; log; line: 381
52 reducedim.jl; sum; line: 264
132 sparse/sparsematrix.jl; .*; line: 1026
Most expensive lines in recorder:
tapes=Set() # convert to a regular array
for (arg,i) in enumerate(args) # convert to a regular for
Next iteration:
109 ...utoGrad/src/core.jl; r; line: 114 in(tape,tapes) || push!(tapes,tape)
143 ...utoGrad/src/core.jl; r; line: 127 result = f(argvals...; kwargs...)
101 ...utoGrad/src/core.jl; r; line: 147 for (tape, argnum, parent) in ops
83 tuple.jl; indexed_next; line: 21
76 ...utoGrad/src/core.jl; r; line: 149 gradfun = f(Grad{argnum}, result, args...; kwargs...)
Removing in(tape,tapes), handle duplicates in Node().
Next iteration: for loops with tuples; maybe we can optimize the single tape case.
448 no file; .+; line: 0
91 ...utoGrad/src/core.jl; r; line: 111 for (tape, parent_rnode) in arg.tapes
74 operators.jl; indexed_next; line: 437
109 ...utoGrad/src/core.jl; r; line: 127 result = f(argvals...; kwargs...)
63 ...utoGrad/src/core.jl; r; line: 147 for (tape, argnum, parent) in ops
51 tuple.jl; indexed_next; line: 21
(15) latest single epoch times: (compare to #13)
0. 1.380165 seconds (905.62 k allocations: 896.819 MB)
1. 1.344440 seconds (903.22 k allocations: 896.783 MB)
2. 0.267039 seconds (60.78 k allocations: 58.240 MB)
3. 0.240020 seconds (29.58 k allocations: 49.606 MB)
4. 0.434085 seconds (454.38 k allocations: 76.330 MB)
This is better than Knet with recording in forward_pass. Time to
optimize the backward_pass.
(16) Profiling backward_pass with 10 epochs.
10496 .../examples/footime.jl; train0; line: 73 g = gradfun(w, x, y)
10484 ...AutoGrad/src/core.jl; gradfun; line: 36 backward_pass(forward_pass(fun, args, kwargs, argnum)...)
2018 ...utoGrad/src/core.jl; backward_pass; line: 209 cur_outgrad = sum_outgrads(node.outgrads...)
290 ...utoGrad/src/core.jl; backward_pass; line: 213 for (gradfun, parent) in node.parent_grad_ops
4558 ...utoGrad/src/core.jl; backward_pass; line: 215 og = gradfun(cur_outgrad)
3472 ...utoGrad/src/core.jl; forward_pass; line: 72 end_node = fun(args...; kwargs...)
136 .../examples/footime.jl; train0; line: 76 Base.axpy!(-lr, g[i], w[i])
# sum_outgrads spends significant time adding gradients for ops with fanout>1.
# We could add into an accumulator instead of pushing each gradient onto a list.
# We could also reuse the first gradient as the accumulator to avoid an allocation, but we decided that was dangerous:
# the same gradient matrix may be passed back multiple times, e.g. by +?
# Actually, right now we multiply by 1 and never pass back the original dy?
# The main problem is that sum_outgrads is a primitive for higher-order derivatives and we don't support overwriting operations yet.
2018 ...utoGrad/src/core.jl; backward_pass; line: 209 cur_outgrad = sum_outgrads(node.outgrads...)
1996 ...utoGrad/src/core.jl; sum_outgrads; line: 572 sum_outgrads{T}(a::AbstractArray{T},b::AbstractArray{T},c::AbstractArray{T}...) =
1686 ...utoGrad/src/core.jl; sum_outgrads; line: 572 sum_outgrads{T}(a::AbstractArray{T},b::AbstractArray{T},c::AbstractArray{T}...) =
1670 broadcast.jl; broadcast; line: 253
1493 broadcast.jl; broadcast!; line: 246
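A minimal sketch of the accumulator idea from the notes above, written in later Julia (1.x) syntax rather than the 0.4/0.5 syntax in the profile. The name sum_into_fresh is hypothetical and is not AutoGrad's sum_outgrads. It copies the first gradient once and adds the rest in place, so no incoming array is overwritten even when the same dy arrives several times.

```julia
# Hypothetical helper (not AutoGrad's sum_outgrads): sum gradient arrays
# into a freshly allocated accumulator so that no input is mutated.
function sum_into_fresh(g1::AbstractArray, rest::AbstractArray...)
    acc = copy(g1)          # single allocation; g1 itself stays untouched
    for g in rest
        acc .+= g           # in-place add into the accumulator only
    end
    return acc
end

dy = [1.0 2.0; 3.0 4.0]
# The same array passed back three times, as an op like + might do.
total = sum_into_fresh(dy, dy, dy)
```

If dy itself were used as the accumulator here, each in-place add would also change the values still waiting to be added, and the result would come out as 4*dy instead of 3*dy. That is the aliasing hazard the notes call dangerous.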
4558 ...utoGrad/src/core.jl; backward_pass; line: 215 og = gradfun(cur_outgrad)
# This is the main place where we can save a lot of time.
# We seem to be spending a lot of time filling arrays with zeros for ungetindex.
# We should use `nothing` for the untouched entries in tuples, cell arrays, and dicts, and have sum_outgrads understand this.
2514 .../src/collections.jl; anonymous; line: 14 getindex(::D1,y,x,i...) = dy->ungetindex(x,dy,i...) # y=x[i], dy=df/dy
2491 no file; ungetindex; line: 0
2287 ...toGrad/src/core.jl; r; line: 127 result = f(argvals...; kwargs...)
2279 ...src/collections.jl; ungetindex; line: 21 ungetindex(x::AbstractArray, dy, i...) = (dx=zeros_like(x);setindex!(dx,dy,i...);dx)
2264 ...oGrad/src/core.jl; fill_internal; line: 548 fill_internal{T}(x::AbstractArray{T},v,d::ObjectIdDict)=
2244 ...oGrad/src/core.jl; fill_check; line: 551 fill_check(x,v,d::ObjectIdDict)=(haskey(d,x) ? d[x] : d[x]=fill_internal(x,v,d))
2232 ...oGrad/src/core.jl; fill_internal; line: 548 fill_internal{T}(x::AbstractArray{T},v,d::ObjectIdDict)=
2178 array.jl; fill!; line: 193
348 ...rc/linalg/matmul.jl; anonymous; line: 2
1019 ...utoGrad/src/util.jl; anonymous; line: 22 gexp = :(dy->dy.*$(_d[i]))
203 arraymath.jl; .*; line: 125
157 no file; .*; line: 0
113 ...utoGrad/src/core.jl; r; line: 127 result = f(argvals...; kwargs...)
109 ...se/sparsematrix.jl; .*; line: 1026
150 no file; .==; line: 0
139 ...utoGrad/src/core.jl; u; line: 408 u(x...; o...)=f(map(getval,x)...; o...)
126 broadcast.jl; .==; line: 363
473 sparse/sparsematrix.jl; .*; line: 1026
430 broadcast.jl; broadcast!; line: 246
232 ...utoGrad/src/util.jl; new_fun; line: 189 result = gradfun(dy)
222 ...utoGrad/src/util.jl; new_fun; line: 196 return sum(result, d)
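A minimal sketch of the `nothing` idea for tuple-valued parameters, in later Julia (1.x) syntax, where the type of `nothing` is Nothing (it was Void in 2016). The names add_grad, ungetindex_tuple and sum_tuple_grads are hypothetical and are not AutoGrad functions. The gradient of y = x[i] keeps dy in slot i and leaves every other slot as `nothing`, so no zero arrays are filled.

```julia
# Hypothetical sketch: `nothing` stands for an implicit zero gradient.
add_grad(::Nothing, ::Nothing) = nothing   # explicit method avoids ambiguity
add_grad(a, ::Nothing) = a
add_grad(::Nothing, b) = b
add_grad(a, b) = a .+ b                    # allocates a new array, inputs untouched

# Gradient of y = x[i] for a tuple x: only slot i carries dy.
ungetindex_tuple(x::Tuple, dy, i::Int) =
    ntuple(j -> j == i ? dy : nothing, length(x))

# Slot-by-slot sum of two tuple gradients.
sum_tuple_grads(a::Tuple, b::Tuple) = map(add_grad, a, b)

w  = ([1.0, 2.0], [3.0])
g1 = ungetindex_tuple(w, [0.5, 0.5], 1)
g2 = ungetindex_tuple(w, [2.0], 2)
g3 = ungetindex_tuple(w, [1.0, 1.0], 1)
g  = sum_tuple_grads(sum_tuple_grads(g1, g2), g3)
```

The same dispatch pattern would need extra methods for arrays indexed by ranges and for dicts. This sketch covers only the tuple case.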
2016-08-17 Deniz Yuret <[email protected]>
* DONE:
# Julia v0.5 compatibility.
# Julia 0.5 already supports function types, making Fn{} unnecessary: julia> typeof(sin) => Base.#sin
# Julia 0.5 in-place syntax does not yet work for matmul or .+ (unless you write x .= (+).(x,a)).
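For comparison, a short sketch of how the in-place forms look in later Julia releases (1.x), which may differ from what was available when this entry was written. It uses the `x .= (+).(x,a)` spelling quoted above, the fused `x .+= a` form, and `mul!` from the LinearAlgebra standard library for an in-place matrix product. The variable names are illustrative only.

```julia
using LinearAlgebra        # standard library; provides mul!

x = [1.0, 2.0, 3.0]
a = [10.0, 20.0, 30.0]
alias = x                  # second reference to the same array

x .= (+).(x, a)            # the spelling quoted in the entry; writes into x
x .+= a                    # later fused form; also writes into x

A   = [1.0 2.0; 3.0 4.0]
v   = [1.0, 1.0]
out = zeros(2)
mul!(out, A, v)            # in-place matrix-vector product, result stored in out
```

Both broadcast forms update the existing array rather than rebinding x, which is why alias sees the new values.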
2016-08-16 Deniz Yuret <[email protected]>
* DONE:
# fix namespace for runtests.jl.
# solve testing problem with zerograd args (sum, airy)
2016-08-15 Deniz Yuret <[email protected]>
* DONE:
# extend defgrads so it can handle manual definitions as well.
# implement reductions: sum, vecnorm
# implement arraymath functions (transpose etc)
# test more examples (mnist etc.).
# implement zero-one loss.
# write mnist loader.
# implement zerograd for one of the arguments.
2016-08-14 Deniz Yuret <[email protected]>
* DONE:
# reorganize gradients mirroring base.
# handle broadcast.jl.
2016-08-13 Deniz Yuret <[email protected]>
* DONE:
# use Grad or some other name instead of Val.
# split functions based on what type of args they accept.
# implement broadcasting functions (finish unbroadcast).
# implement matrix multiplication.
2016-08-12 Deniz Yuret <[email protected]>
* DONE:
# write gradcheck.
# implement unbroadcast
# implement/test 2arg functions.
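A minimal sketch of what unbroadcast and a gradient check involve, in later Julia (1.x) syntax. The names unbroadcast_sketch and numeric_grad are hypothetical and are not AutoGrad's unbroadcast or gradcheck. The sketch assumes x is an array with no more dimensions than dy. The gradient dy of a broadcast result is summed over every dimension along which x was expanded, and the result is compared with a central-difference estimate.

```julia
# Hypothetical sketch: reduce dy (shape of the broadcast result) to size(x).
function unbroadcast_sketch(x::AbstractArray, dy::AbstractArray)
    size(x) == size(dy) && return dy
    # Dimensions where x had extent 1 but the broadcast result did not.
    dims = Tuple(d for d in 1:ndims(dy) if size(x, d) == 1 && size(dy, d) > 1)
    dx = isempty(dims) ? dy : sum(dy; dims=dims)
    return reshape(dx, size(x))
end

# Central-difference gradient of a scalar-valued f at x.
function numeric_grad(f, x::AbstractArray; delta=1e-6)
    g = zero(x)                       # zeros with the shape of x
    for i in eachindex(x)
        xp = copy(x); xp[i] += delta
        xm = copy(x); xm[i] -= delta
        g[i] = (f(xp) - f(xm)) / (2delta)
    end
    return g
end

A  = [1.0 2.0 3.0; 4.0 5.0 6.0]
b  = [0.5, -0.5]                      # broadcast along the columns of A
dy = [1.0 2.0 3.0; 4.0 5.0 6.0]       # incoming gradient for A .+ b
f(b) = sum(dy .* (A .+ b))            # scalar whose gradient w.r.t. A .+ b is dy

analytic = unbroadcast_sketch(b, dy)  # row sums of dy
numeric  = numeric_grad(f, b)
```

Here b is expanded along the second dimension, so its gradient is the row sums of dy. The numeric estimate should agree up to rounding because f is linear in b.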
2016-08-11 Deniz Yuret <[email protected]>
* DONE:
# @primitive should be type specific.
# sum_outgrads should not overwrite its arguments: e.g. + may have passed the same dy back to multiple places.
# separate tests, examples, and gradients.
# (w,b)=params does not work, implement iterators?