added numpy benchmarks #1

Open
wants to merge 7 commits into
base: master

Conversation

parimarjan

*blackscholes
*numpy ops - unary/binary, inplace/non-inplace
requires setup.py for weldnumpy to be run on the machine (PR #247 on weld)
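
For reference, the four NumPy-side combinations named above (unary/binary, in-place/non-in-place) look roughly like this. It is a minimal sketch in plain NumPy, not the benchmark code from this PR; the array names and sizes are made up for illustration.

```python
import numpy as np

a = np.full(4, 16.0)   # illustrative input, not the benchmark's data
b = np.ones(4)

# Unary, non-in-place: allocates a new result array.
c = np.sqrt(a)

# Unary, in-place: writes the result back into a's existing buffer.
np.sqrt(a, out=a)

# Binary, non-in-place: allocates a new result array.
d = a + b

# Binary, in-place: reuses a's buffer for the result.
a += b
```

The in-place forms avoid allocating an output array, which is presumably why they are benchmarked separately.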

@deepakn94
Contributor

Hey Pari, thanks for doing this!
Can you paste the output of run_benchmarks.py for these two workloads here?

@parimarjan
Author

parimarjan commented Oct 10, 2017

the emitted output files for the benchmarks are the following (numpy_ops is particularly long with a bunch of different combinations being run...):

Benchmark	Scheme	Parameters	Trial 1
blackscholes	Weld	s=100	0.000265121459961	0.000164031982422	4.2641
blackscholes	Python->Weld	s=100	0.000240087509155	0.000303030014038
blackscholes	Numpy	s=100	0.0001
blackscholes	Weld compile time	s=100	2.1442129612	2.10048604012
blackscholes	Weld->Python	s=100	0.000107049942017	5.3882598877e-05
blackscholes	Weld	s=1000	0.000416994094849	0.000325918197632	4.9892
blackscholes	Python->Weld	s=1000	0.000393867492676	0.000184059143066
blackscholes	Numpy	s=1000	0.0003
blackscholes	Weld compile time	s=1000	2.65057206154	2.32023215294
blackscholes	Weld->Python	s=1000	0.000107049942017	5.41210174561e-05

For numpy_ops:
Benchmark	Scheme	Parameters	Trial 1
numpy_ops	Weld  sqrt	s=1000, r=1, op=sqrt	0.2227
numpy_ops	Weld Inplace sqrt	s=1000, r=1, op=sqrt	0.2462
numpy_ops	Python->Weld	s=1000, r=1, op=sqrt	0.000131845474243	4.00543212891e-05
numpy_ops	Weld	s=1000, r=1, op=sqrt	0.000159025192261	0.000163793563843
numpy_ops	Numpy  sqrt	s=1000, r=1, op=sqrt	0.0
numpy_ops	Weld->Python	s=1000, r=1, op=sqrt	9.98973846436e-05	9.3936920166e-05
numpy_ops	Numpy Inplace sqrt	s=1000, r=1, op=sqrt	0.0
numpy_ops	Weld compile time	s=1000, r=1, op=sqrt	0.22155880928	0.244626998901
numpy_ops	Numpy  +	s=1000, r=1, op=+	0.0
numpy_ops	Python->Weld	s=1000, r=1, op=+	0.000144004821777	6.103515625e-05
numpy_ops	Weld	s=1000, r=1, op=+	0.000160932540894	0.000133037567139
numpy_ops	Weld  +	s=1000, r=1, op=+	0.2406
numpy_ops	Numpy Inplace +	s=1000, r=1, op=+	0.0
numpy_ops	Weld->Python	s=1000, r=1, op=+	9.3936920166e-05	8.79764556885e-05
numpy_ops	Weld compile time	s=1000, r=1, op=+	0.239284038544	0.26279592514
numpy_ops	Weld Inplace +	s=1000, r=1, op=+	0.2645
numpy_ops	Weld  sqrt	s=1000, r=10, op=sqrt	0.2416
numpy_ops	Weld Inplace sqrt	s=1000, r=10, op=sqrt	0.2354
numpy_ops	Python->Weld	s=1000, r=10, op=sqrt	0.000121831893921	6.60419464111e-05
numpy_ops	Weld	s=1000, r=10, op=sqrt	0.000233173370361	0.000114917755127
numpy_ops	Numpy  sqrt	s=1000, r=10, op=sqrt	0.0
numpy_ops	Weld->Python	s=1000, r=10, op=sqrt	0.000128984451294	4.79221343994e-05
numpy_ops	Numpy Inplace sqrt	s=1000, r=10, op=sqrt	0.0
numpy_ops	Weld compile time	s=1000, r=10, op=sqrt	0.239832878113	0.234249830246
numpy_ops	Numpy  +	s=1000, r=10, op=+	0.0
numpy_ops	Python->Weld	s=1000, r=10, op=+	0.0001540184021	7.20024108887e-05
numpy_ops	Weld	s=1000, r=10, op=+	0.000237941741943	0.000271081924438
numpy_ops	Weld  +	s=1000, r=10, op=+	0.9601
numpy_ops	Numpy Inplace +	s=1000, r=10, op=+	0.0
numpy_ops	Weld->Python	s=1000, r=10, op=+	9.79900360107e-05	7.79628753662e-05
numpy_ops	Weld compile time	s=1000, r=10, op=+	0.956810951233	0.993608951569
numpy_ops	Weld Inplace +	s=1000, r=10, op=+	0.9983
numpy_ops	Weld  sqrt	s=10000, r=1, op=sqrt	0.2748
numpy_ops	Weld Inplace sqrt	s=10000, r=1, op=sqrt	0.5959
numpy_ops	Python->Weld	s=10000, r=1, op=sqrt	0.000267028808594	3.88622283936e-05
numpy_ops	Weld	s=10000, r=1, op=sqrt	0.000191926956177	0.000656843185425
numpy_ops	Numpy  sqrt	s=10000, r=1, op=sqrt	0.0001
numpy_ops	Weld->Python	s=10000, r=1, op=sqrt	0.000205039978027	0.000185966491699
numpy_ops	Numpy Inplace sqrt	s=10000, r=1, op=sqrt	0.0
numpy_ops	Weld compile time	s=10000, r=1, op=sqrt	0.273250818253	0.592747926712
numpy_ops	Numpy  +	s=10000, r=1, op=+	0.0001
numpy_ops	Python->Weld	s=10000, r=1, op=+	0.000277996063232	9.91821289062e-05
numpy_ops	Weld	s=10000, r=1, op=+	0.000179052352905	0.000180006027222
numpy_ops	Weld  +	s=10000, r=1, op=+	0.6696
numpy_ops	Numpy Inplace +	s=10000, r=1, op=+	0.0001
numpy_ops	Weld->Python	s=10000, r=1, op=+	9.3936920166e-05	7.60555267334e-05
numpy_ops	Weld compile time	s=10000, r=1, op=+	0.668015003204	0.382047176361
numpy_ops	Weld Inplace +	s=10000, r=1, op=+	0.3841
numpy_ops	Weld  sqrt	s=10000, r=10, op=sqrt	0.3748
numpy_ops	Weld Inplace sqrt	s=10000, r=10, op=sqrt	0.2491
numpy_ops	Python->Weld	s=10000, r=10, op=sqrt	0.000208854675293	4.41074371338e-05
numpy_ops	Weld	s=10000, r=10, op=sqrt	0.000444889068604	0.000324010848999
numpy_ops	Numpy  sqrt	s=10000, r=10, op=sqrt	0.0005
numpy_ops	Weld->Python	s=10000, r=10, op=sqrt	0.000112056732178	4.6968460083e-05
numpy_ops	Numpy Inplace sqrt	s=10000, r=10, op=sqrt	0.0005
numpy_ops	Weld compile time	s=10000, r=10, op=sqrt	0.372400999069	0.247756958008
numpy_ops	Numpy  +	s=10000, r=10, op=+	0.0002
numpy_ops	Python->Weld	s=10000, r=10, op=+	0.000147104263306	0.000124931335449
numpy_ops	Weld	s=10000, r=10, op=+	0.00106287002563	0.000756978988647
numpy_ops	Weld  +	s=10000, r=10, op=+	1.355
numpy_ops	Numpy Inplace +	s=10000, r=10, op=+	0.0001
numpy_ops	Weld->Python	s=10000, r=10, op=+	0.0001060962677	0.000135898590088
numpy_ops	Weld compile time	s=10000, r=10, op=+	1.34414100647	1.28362798691
numpy_ops	Weld Inplace +	s=10000, r=10, op=+	1.2878

@deepakn94
Contributor

Can you actually just print the output to stdout? (You should use the -v flag to run_benchmarks.py.)

@parimarjan
Author

parimarjan commented Oct 10, 2017

Ah yeah, that makes sense. Here are the new outputs. I also updated the PR to add inplace ops as a separate flag for numpy_ops.

++++++++++++++++++++++++++++++++++++++
blackscholes
++++++++++++++++++++++++++++++++++++++
s=100
Weld: 1.8099 +/- 2.5949 seconds
Python->Weld: 0.0003 +/- 0.0002 seconds
Numpy: 0.0002 +/- 0.0001 seconds
Weld compile time: 2.7024 +/- 0.5616 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds

s=1000
Weld: 1.6030 +/- 2.3123 seconds
Python->Weld: 0.0002 +/- 0.0001 seconds
Numpy: 0.0001 +/- 0.0001 seconds
Weld compile time: 2.3896 +/- 0.4566 seconds
Weld->Python: 0.0001 +/- 0.0001 seconds



++++++++++++++++++++++++++++++++++++++
numpy_ops
++++++++++++++++++++++++++++++++++++++

i=1, s=1000, r=1, op=sqrt
Weld Inplace sqrt: 0.2593 +/- 0.0477 seconds
Python->Weld: 0.0002 +/- 0.0001 seconds
Weld: 0.0002 +/- 0.0000 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Numpy Inplace sqrt: 0.0000 +/- 0.0000 seconds
Weld compile time: 0.2578 +/- 0.0476 seconds


i=1, s=1000, r=1, op=+
Python->Weld: 0.0003 +/- 0.0001 seconds
Weld: 0.0003 +/- 0.0001 seconds
Numpy Inplace +: 0.0000 +/- 0.0000 seconds
Weld->Python: 0.0002 +/- 0.0001 seconds
Weld compile time: 0.4326 +/- 0.0806 seconds
Weld Inplace +: 0.4349 +/- 0.0809 seconds


i=1, s=1000, r=10, op=sqrt
Weld Inplace sqrt: 0.2925 +/- 0.0811 seconds
Python->Weld: 0.0002 +/- 0.0001 seconds
Weld: 0.0003 +/- 0.0000 seconds
Weld->Python: 0.0002 +/- 0.0001 seconds
Numpy Inplace sqrt: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.2905 +/- 0.0808 seconds


i=1, s=1000, r=10, op=+
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 0.0003 +/- 0.0000 seconds
Numpy Inplace +: 0.0000 +/- 0.0000 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.8477 +/- 0.0387 seconds
Weld Inplace +: 0.8513 +/- 0.0383 seconds


i=1, s=10000, r=1, op=sqrt
Weld Inplace sqrt: 0.2266 +/- 0.0244 seconds
Python->Weld: 0.0001 +/- 0.0000 seconds
Weld: 0.0003 +/- 0.0000 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Numpy Inplace sqrt: 0.0000 +/- 0.0000 seconds
Weld compile time: 0.2253 +/- 0.0244 seconds


i=1, s=10000, r=1, op=+
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 0.0003 +/- 0.0000 seconds
Numpy Inplace +: 0.0000 +/- 0.0000 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.2443 +/- 0.0066 seconds
Weld Inplace +: 0.2460 +/- 0.0068 seconds


i=1, s=10000, r=10, op=sqrt
Weld Inplace sqrt: 0.2711 +/- 0.0842 seconds
Python->Weld: 0.0001 +/- 0.0000 seconds
Weld: 0.0005 +/- 0.0000 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Numpy Inplace sqrt: 0.0003 +/- 0.0000 seconds
Weld compile time: 0.2692 +/- 0.0836 seconds


i=1, s=10000, r=10, op=+
Python->Weld: 0.0002 +/- 0.0001 seconds
Weld: 0.0009 +/- 0.0001 seconds
Numpy Inplace +: 0.0002 +/- 0.0002 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.7663 +/- 0.0522 seconds
Weld Inplace +: 0.7706 +/- 0.0527 seconds


i=0, s=1000, r=1, op=sqrt
Weld  sqrt: 0.2558 +/- 0.0772 seconds
Python->Weld: 0.0001 +/- 0.0000 seconds
Weld: 0.0002 +/- 0.0000 seconds
Numpy  sqrt: 0.0000 +/- 0.0000 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.2543 +/- 0.0769 seconds


i=0, s=1000, r=1, op=+
Numpy  +: 0.0000 +/- 0.0000 seconds
Python->Weld: 0.0002 +/- 0.0001 seconds
Weld: 0.0003 +/- 0.0001 seconds
Weld  +: 0.3177 +/- 0.0710 seconds
Weld->Python: 0.0002 +/- 0.0001 seconds
Weld compile time: 0.3153 +/- 0.0702 seconds


i=0, s=1000, r=10, op=sqrt
Weld  sqrt: 0.3033 +/- 0.0739 seconds
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 0.0002 +/- 0.0000 seconds
Numpy  sqrt: 0.0001 +/- 0.0000 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.3016 +/- 0.0739 seconds


i=0, s=1000, r=10, op=+
Numpy  +: 0.0000 +/- 0.0000 seconds
Python->Weld: 0.0002 +/- 0.0001 seconds
Weld: 0.0003 +/- 0.0000 seconds
Weld  +: 0.8063 +/- 0.0938 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.8026 +/- 0.0937 seconds


i=0, s=10000, r=1, op=sqrt
Weld  sqrt: 0.2131 +/- 0.0110 seconds
Python->Weld: 0.0001 +/- 0.0000 seconds
Weld: 0.0003 +/- 0.0000 seconds
Numpy  sqrt: 0.0001 +/- 0.0000 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.2117 +/- 0.0110 seconds


i=0, s=10000, r=1, op=+
Numpy  +: 0.0001 +/- 0.0000 seconds
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 0.0003 +/- 0.0001 seconds
Weld  +: 0.2715 +/- 0.0358 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.2684 +/- 0.0330 seconds


i=0, s=10000, r=10, op=sqrt
Weld  sqrt: 0.2906 +/- 0.1194 seconds
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 0.0005 +/- 0.0001 seconds
Numpy  sqrt: 0.0004 +/- 0.0001 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.2869 +/- 0.1159 seconds


i=0, s=10000, r=10, op=+
Numpy  +: 0.0002 +/- 0.0000 seconds
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 0.0017 +/- 0.0014 seconds
Weld  +: 0.8482 +/- 0.1501 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.8414 +/- 0.1516 seconds

@deepakn94
Contributor

Can you rename the -s flag to -n? And can you make the sizes a lot bigger? These running times are way too short to be able to tell anything (try something maybe 10,000x larger).
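
To illustrate why tiny inputs are uninformative: at four decimal places, anything under about 50 microseconds prints as 0.0000. The sketch below is a hypothetical timing helper (not the actual run_benchmarks.py implementation) that reports mean and standard deviation in the same "+/-" style as the output above. The helper names, the stand-in workload, and the use of population standard deviation are all assumptions.

```python
import math
import time

def time_trials(fn, trials=5):
    """Run fn() `trials` times; return (mean, stddev) of wall-clock seconds."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    mean = sum(samples) / len(samples)
    # Population standard deviation (an assumption about the report format).
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, math.sqrt(var)

def report(label, fn, trials=5):
    """Format one result line as 'label: mean +/- std seconds'."""
    mean, std = time_trials(fn, trials)
    return "%s: %.4f +/- %.4f seconds" % (label, mean, std)

def workload(n):
    # Stand-in workload: sum of square roots over n elements.
    return sum(math.sqrt(i) for i in range(n))

for n in (1000, 200000):
    print("n=%d" % n)
    print(report("Python sqrt-sum", lambda: workload(n)))
```

With the small size, both the mean and the spread tend to round to (nearly) zero, so nothing can be compared; the larger size gives numbers that are actually distinguishable.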

@deepakn94
Contributor

Can you also post the output after you make the sizes larger?

@parimarjan
Author

parimarjan commented Oct 11, 2017

yup made the changes. The outputs are below (these are on my mac, so not ideal I guess, but mostly the trends seem ~similar to what we had before). Also, for blackscholes, I posted the original version - and a slightly optimized version (where the intermediate arrays are evaluated before the final two arrays are computed) - as that seems to make a huge difference.

++++++++++++++++++++++++++++++++++++++
blackscholes:
++++++++++++++++++++++++++++++++++++++
n=1000000
Weld: 2.7955 +/- 3.2432 seconds
Python->Weld: 0.0004 +/- 0.0003 seconds
Numpy: 0.1216 +/- 0.0233 seconds
Weld compile time: 3.0697 +/- 0.6137 seconds
Weld->Python: 0.0003 +/- 0.0004 seconds


n=10000000
Weld: 10.3450 +/- 6.0419 seconds
Python->Weld: 0.0003 +/- 0.0001 seconds
Numpy: 1.2682 +/- 0.0765 seconds
Weld compile time: 2.9182 +/- 0.4821 seconds
Weld->Python: 0.0024 +/- 0.0052 seconds

This was particularly bad for two reasons, I think:
- Two arrays were being returned, and these shared a lot of previous computations. So if we could intelligently cache the intermediate arrays (or have the user evaluate them), the performance could be improved a lot (as below).
- Common subexpression elimination: a lot of complicated terms are being used multiple times in the same expression. I didn't have any good way to test this, but I guess with it the performance should be roughly on par with numpy.

Blackscholes w/ evaluating (caching?) intermediate common parts:

++++++++++++++++++++++++++++++++++++++
blackscholes
++++++++++++++++++++++++++++++++++++++
n=1000000
Weld: 0.8787 +/- 1.7845 seconds
Python->Weld: 0.0002 +/- 0.0001 seconds
Numpy: 0.1177 +/- 0.0193 seconds
Weld compile time: 0.8794 +/- 0.3438 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds


n=10000000
Weld: 2.0647 +/- 2.8119 seconds
Python->Weld: 0.0002 +/- 0.0001 seconds
Numpy: 1.1578 +/- 0.1053 seconds
Weld compile time: 0.8056 +/- 0.3408 seconds
Weld->Python: 0.0001 +/- 0.0001 seconds


++++++++++++++++++++++++++++++++++++++
numpy_ops
++++++++++++++++++++++++++++++++++++++
i=1, r=1, op=sqrt, n=10000000
Weld Inplace sqrt: 0.3851 +/- 0.0554 seconds
Python->Weld: 0.0002 +/- 0.0001 seconds
Weld: 0.0788 +/- 0.0049 seconds
Weld->Python: 0.0002 +/- 0.0001 seconds
Numpy Inplace sqrt: 0.0250 +/- 0.0012 seconds
Weld compile time: 0.3045 +/- 0.0542 seconds


i=1, r=1, op=sqrt, n=100000000
Weld Inplace sqrt: 1.0449 +/- 0.0666 seconds
Python->Weld: 0.0001 +/- 0.0000 seconds
Weld: 0.7734 +/- 0.0383 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Numpy Inplace sqrt: 0.2529 +/- 0.0113 seconds
Weld compile time: 0.2700 +/- 0.0410 seconds


i=1, r=1, op=+, n=10000000
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 0.0678 +/- 0.0069 seconds
Numpy Inplace +: 0.0145 +/- 0.0004 seconds
Weld->Python: 0.0002 +/- 0.0001 seconds
Weld compile time: 0.2569 +/- 0.0283 seconds
Weld Inplace +: 0.3269 +/- 0.0319 seconds


i=1, r=1, op=+, n=100000000
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 0.5629 +/- 0.0138 seconds
Numpy Inplace +: 0.1719 +/- 0.0343 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.2507 +/- 0.0449 seconds
Weld Inplace +: 0.8150 +/- 0.0560 seconds


i=1, r=10, op=sqrt, n=10000000
Weld Inplace sqrt: 0.4796 +/- 0.0116 seconds
Python->Weld: 0.0001 +/- 0.0000 seconds
Weld: 0.2692 +/- 0.0021 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Numpy Inplace sqrt: 0.2531 +/- 0.0356 seconds
Weld compile time: 0.2090 +/- 0.0106 seconds


i=1, r=10, op=sqrt, n=100000000
Weld Inplace sqrt: 2.9020 +/- 0.0425 seconds
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 2.6873 +/- 0.0210 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Numpy Inplace sqrt: 2.3369 +/- 0.0248 seconds
Weld compile time: 0.2135 +/- 0.0217 seconds


i=1, r=10, op=+, n=10000000
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 1.0328 +/- 0.0327 seconds
Numpy Inplace +: 0.1484 +/- 0.0133 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.6955 +/- 0.0221 seconds
Weld Inplace +: 1.7320 +/- 0.0343 seconds


i=1, r=10, op=+, n=100000000
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 33.8522 +/- 2.2897 seconds
Numpy Inplace +: 1.4450 +/- 0.0465 seconds
Weld->Python: 0.0085 +/- 0.0049 seconds
Weld compile time: 0.7406 +/- 0.0558 seconds
Weld Inplace +: 34.6313 +/- 2.2966 seconds


i=0, r=1, op=sqrt, n=10000000
Weld  sqrt: 0.2421 +/- 0.0205 seconds
Python->Weld: 0.0001 +/- 0.0000 seconds
Weld: 0.0384 +/- 0.0051 seconds
Numpy  sqrt: 0.0562 +/- 0.0008 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.2026 +/- 0.0177 seconds


i=0, r=1, op=sqrt, n=100000000
Weld  sqrt: 0.7954 +/- 0.0350 seconds
Python->Weld: 0.0001 +/- 0.0000 seconds
Weld: 0.6046 +/- 0.0315 seconds
Numpy  sqrt: 0.5772 +/- 0.0075 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.1896 +/- 0.0042 seconds


i=0, r=1, op=+, n=10000000
Numpy  +: 0.0449 +/- 0.0016 seconds
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 0.0399 +/- 0.0043 seconds
Weld  +: 0.2758 +/- 0.0086 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.2345 +/- 0.0066 seconds


i=0, r=1, op=+, n=100000000
Numpy  +: 0.4644 +/- 0.0189 seconds
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 0.5455 +/- 0.0209 seconds
Weld  +: 0.7897 +/- 0.0346 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.2428 +/- 0.0179 seconds


i=0, r=10, op=sqrt, n=10000000
Weld  sqrt: 0.4560 +/- 0.0026 seconds
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 0.2439 +/- 0.0019 seconds
Numpy  sqrt: 0.4513 +/- 0.0134 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.2105 +/- 0.0037 seconds


i=0, r=10, op=sqrt, n=100000000
Weld  sqrt: 3.1278 +/- 0.2569 seconds
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 2.8545 +/- 0.1610 seconds
Numpy  sqrt: 8.4113 +/- 3.9882 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.2716 +/- 0.1003 seconds


i=0, r=10, op=+, n=10000000
Numpy  +: 0.4006 +/- 0.0094 seconds
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 1.0201 +/- 0.0183 seconds
Weld  +: 1.7411 +/- 0.0250 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
Weld compile time: 0.7177 +/- 0.0192 seconds


i=0, r=10, op=+, n=100000000
Numpy  +: 6.3209 +/- 1.1640 seconds
Python->Weld: 0.0002 +/- 0.0000 seconds
Weld: 39.1188 +/- 4.3974 seconds
Weld  +: 39.9701 +/- 4.4380 seconds
Weld->Python: 0.0116 +/- 0.0072 seconds
Weld compile time: 0.8022 +/- 0.1165 seconds

@deepakn94
Contributor

I see, a couple of questions:

  • Do you know why results are pretty high variance:
Weld: 1.8099 +/- 2.5949 seconds
Python->Weld: 0.0003 +/- 0.0002 seconds
Numpy: 0.0002 +/- 0.0001 seconds
Weld compile time: 2.7024 +/- 0.5616 seconds
Weld->Python: 0.0001 +/- 0.0000 seconds
  • "two arrays were being returned, and these shared a lot of previous computation" --> I don't quite understand; what are the two arrays being returned? I guess this is in the Black Scholes workload?
  • Do you know why we're slower for simple stuff like numpy array addition?

@parimarjan
Author

high variance: I think this is probably a side effect of me running these on my mac (I have been running into performance issues with high memory/disk space utilization...) - I will set this up on dawn and get back to you on this.

blackscholes: initially it performs a bunch of computations on four arrays (a,b,c,d). The final results are two arrays, 'call', and 'put', with both these arrays being a function of all the four arrays (a,b,c,d). So if we just do call.evaluate() and put.evaluate() in the end, then the computations on (a,b,c,d) are being repeated (first case above, with results ~10 seconds). If we call a.evaluate(), b.evaluate() etc. before call.evaluate(), put.evaluate() --> then the runtime goes down dramatically (~2 seconds above). This is still slower than numpy, I think because of the second issue below:

The computations on a,b,c,d include many repeated segments. A made-up example:
a = 2*a*c
b = a*a
Now when b.evaluate() is called, it essentially translates to (2*a*c)*(2*a*c). I guess this could potentially be solved with the common subexpression elimination optimization (?).
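
A rough way to see what common subexpression elimination would buy here: the sketch below evaluates a nested-tuple expression with and without a cache keyed on the structure of each subexpression, and counts multiplications. It uses scalars instead of arrays and a made-up tuple encoding; it is not how Weld represents or optimizes expressions.

```python
def evaluate(expr, env, cache=None, counter=None):
    """Evaluate a nested-tuple expression.

    expr is a variable name (str), a number, or ('mul', lhs, rhs).
    If cache is a dict, structurally identical subexpressions are computed once.
    counter is a one-element list that counts multiplications performed.
    """
    if isinstance(expr, str):
        return env[expr]
    if isinstance(expr, (int, float)):
        return expr
    if cache is not None and expr in cache:
        return cache[expr]
    op, lhs, rhs = expr
    assert op == 'mul'
    value = evaluate(lhs, env, cache, counter) * evaluate(rhs, env, cache, counter)
    if counter is not None:
        counter[0] += 1
    if cache is not None:
        cache[expr] = value
    return value

# The made-up example: t = 2*a*c (t plays the role of the reassigned a),
# then b = t*t, which expands to (2*a*c)*(2*a*c).
t = ('mul', ('mul', 2, 'a'), 'c')
b = ('mul', t, t)
env = {'a': 3.0, 'c': 5.0}

naive_count = [0]
cse_count = [0]
naive_value = evaluate(b, env, None, naive_count)
cse_value = evaluate(b, env, {}, cse_count)
print(naive_count[0], cse_count[0])  # prints: 5 3
```

Without the cache the repeated term is computed twice (5 multiplications); with it, once (3). On large arrays each saved multiplication is a full pass over the data.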

numpy binary ops: From what I remember, I think the loop fusion optimization wasn't completely handled for the binary ops (while it was complete for unary ops), which led to binary ops being slower earlier. I'm not sure why the performance for binary ops degrades so fast though in the last cases...maybe it was a combination of this + issues on my mac (?)

@deepakn94
Contributor

Black Scholes: Ah right, I remember now. I guess we talked about having something like [call, put].evaluate() that figures out what computation is shared between call and put and only executes that once. Obviously, this isn't implemented yet, but we should figure out if we do want to do this.
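
As a sketch of what a grouped evaluate could mean: evaluate several outputs against one shared cache so that any node they both depend on runs once. Everything below (the `Lazy` class, `evaluate_all`, the stand-in computations) is hypothetical and only illustrates the idea; it is not the weldnumpy API.

```python
class Lazy:
    """A hypothetical lazy node: a function plus the nodes it depends on."""

    def __init__(self, fn, *deps):
        self.fn = fn
        self.deps = deps
        self.runs = 0  # how many times this node's function was executed

    def evaluate(self, cache=None):
        # Without a caller-supplied cache, each top-level call starts fresh.
        cache = {} if cache is None else cache
        if id(self) not in cache:
            args = [d.evaluate(cache) for d in self.deps]
            self.runs += 1
            cache[id(self)] = self.fn(*args)
        return cache[id(self)]

def evaluate_all(nodes):
    """Evaluate several outputs against one shared cache."""
    cache = {}
    return [n.evaluate(cache) for n in nodes]

def build():
    shared = Lazy(lambda: 10.0)             # stands in for the work on (a, b, c, d)
    call = Lazy(lambda s: s + 1.0, shared)  # placeholder for the real 'call' formula
    put = Lazy(lambda s: s - 1.0, shared)   # placeholder for the real 'put' formula
    return shared, call, put

# Separate evaluate() calls redo the shared work.
shared, call, put = build()
call.evaluate()
put.evaluate()
separate_runs = shared.runs   # 2

# A grouped evaluation does the shared work once.
shared, call, put = build()
results = evaluate_all([call, put])
grouped_runs = shared.runs    # 1
```

This only covers outputs requested together; it does not help if more operations are applied to one output after the other has already been evaluated.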

NumPy operations: Aren't you doing something like b = a + a? What fusion optimization do we need here? Or is the workload more complex than that?

@parimarjan
Author

Yeah, although we discussed the group update (or similar decorator annotations to explicitly mention which arrays the user cares about) in the context of in place operations - so then we could figure out the correct order of operations in order to first evaluate any dependent arrays etc.

It also would help with caching the intermediate arrays in the above example, but I'm not sure if this is the best syntax for it - e.g., if a few more operations need to happen on 'put' after 'call's evaluation, then it won't cover that case.

I guess if we want to do it automatically, one option could be to assume that we can see the complete python file and infer things based on that (?). Dask also seems to have worked on this issue: http://dask.pydata.org/en/latest/caching.html - but it's probably nothing too interesting. They mentioned it's just for "interactive sessions", so I guess for fixed python files, they might be doing some sort of look-ahead in their graph etc. (will need to look further into it...).

Loop fusion in binary ops: So one of the flags there was -r for the number of reps, and the dramatic slowdown in Weld appears at r=10. So the code is essentially: for i in range(r): a = a + b
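
To make the cost difference concrete, here is a pure-Python sketch of that loop done two ways: r separate passes that each build a full intermediate, versus a single pass that applies all r additions per element. Plain lists stand in for arrays, and the function names are made up; this is only meant to show what fusion changes, not how Weld's optimizer works.

```python
def add_unfused(a, b, r):
    """r separate passes; each pass materializes a full intermediate list."""
    passes = 0
    for _ in range(r):
        a = [x + y for x, y in zip(a, b)]
        passes += 1
    return a, passes

def add_fused(a, b, r):
    """One pass: apply all r additions to each element before moving on."""
    out = []
    for x, y in zip(a, b):
        for _ in range(r):
            x = x + y
        out.append(x)
    return out, 1

a = [1.0, 2.0]
b = [0.5, 0.25]
unfused_result, unfused_passes = add_unfused(a, b, 10)
fused_result, fused_passes = add_fused(a, b, 10)
```

Both versions perform the same additions in the same order per element, so the results match; the unfused one just touches memory r times and allocates r intermediates.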

For binary operators, loop fusion has no effect on the code:
00:32:25.748: After loop-fusion pass:
|e0:vec[f64],e1:vec[f64]|
  result(
    for(
      zip(
        result(
          for(
            zip(
              e0:vec[f64],
              e1:vec[f64]
            ),
            appender[f64],
            |b:appender[f64],i:i64,x:{f64,f64}|
              merge(b:appender[f64],(x:{f64,f64}.$0+x:{f64,f64}.$1))
          )
        ),
        e1:vec[f64]
      ),
      appender[f64],
      |b#1:appender[f64],i#1:i64,x#1:{f64,f64}|
        merge(b#1:appender[f64],(x#1:{f64,f64}.$0+x#1:{f64,f64}.$1))
    )
  )

In comparison, for unary operators, after loop fusion - the operations are merged into a single pass.
result(
  for(
    e0:vec[f64],
    appender[f64],
    |b#2:appender[f64],i#2:i64,x:f64|
      merge(b#2:appender[f64],(sqrt((sqrt(x:f64)))))
  )
)
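
In Python terms, the fused unary case above corresponds to composing the per-element functions and mapping once, rather than mapping twice. A minimal sketch follows; the `fuse` and `map_once` helpers are made up for illustration.

```python
import math
from functools import reduce

def fuse(*fns):
    """Compose unary functions left to right into one per-element function."""
    return lambda x: reduce(lambda acc, f: f(acc), fns, x)

def map_once(fn, xs):
    """A single pass over xs, producing one output list."""
    return [fn(x) for x in xs]

xs = [16.0, 81.0, 256.0]

# Two passes, with an intermediate list in between.
two_passes = map_once(math.sqrt, map_once(math.sqrt, xs))

# One pass, equivalent to merge(b, sqrt(sqrt(x))) in the fused IR.
one_pass = map_once(fuse(math.sqrt, math.sqrt), xs)
```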

@deepakn94
Contributor

I see, thanks for the explanations! For the NumPy binary operators, I think Matei's new fusion PR might be able to handle these cases -- I will need to double check though.

Let's discuss the best way to handle the Black Scholes workload at the Weld meeting this week.

@parimarjan
Author

parimarjan commented Oct 18, 2017

Blackscholes update: The timings posted above for blackscholes were actually wrong, although the trend was mostly similar (just worse for Weld). Multiple arrays were being evaluated (call and put in the case without intermediate evaluation, plus a bunch of others with intermediate evaluation), and I was using weldobject.evaluate with verbosity on. weldobject also prints timings under the scheme "Weld", so there were a lot of extra "Weld" timings in the output, and these were being averaged by run_benchmarks.py (which also explains the very high variance you noticed, I guess).

This didn't affect the times we had for the unary/binary ops as those only had a single call to evaluate, and the scheme for the final times was something like "Weld +" etc. I guess the best solution is just to keep the verbosity off - so we'll only get the final time outputs?
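
A small sketch of how this kind of label collision distorts the summary: if inner evaluate() timings and the end-to-end timing are all printed under "Weld" and then averaged together, the mean drops and the spread blows up. The parser and the numbers below are hypothetical, chosen only to show the effect; they are not taken from run_benchmarks.py or from the runs above.

```python
import math
from collections import defaultdict

def summarize(lines):
    """Group 'label: seconds' lines by label; return {label: (mean, std)}."""
    samples = defaultdict(list)
    for line in lines:
        label, _, value = line.rpartition(":")
        samples[label.strip()].append(float(value))
    out = {}
    for label, vals in samples.items():
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        out[label] = (mean, std)
    return out

# Hypothetical log: two inner evaluate() timings and one end-to-end timing,
# all printed under the same "Weld" label.
mixed = ["Weld: 0.001", "Weld: 0.002", "Weld: 6.0"]

# The same run with the inner lines suppressed (verbosity off).
clean = ["Weld: 6.0"]

mixed_mean, mixed_std = summarize(mixed)["Weld"]
clean_mean, clean_std = summarize(clean)["Weld"]
```

In the mixed case the standard deviation ends up larger than the mean, which is the same signature as the "1.8099 +/- 2.5949" style results earlier in the thread.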

Here are the updated total times for blackscholes, with ie = intermediate evaluation:

++++++++++++++++++++++++++++++++++++++
blackscholes
++++++++++++++++++++++++++++++++++++++
ie=0, n=1000000
Weld: 4.6113 +/- 0.0831 seconds
Numpy: 0.1165 +/- 0.0126 seconds

ie=0, n=10000000
Weld: 14.8757 +/- 0.9984 seconds
Numpy: 1.2317 +/- 0.0705 seconds

ie=1, n=1000000
Weld: 3.0657 +/- 0.1211 seconds
Numpy: 0.1196 +/- 0.0046 seconds

ie=1, n=10000000
Weld: 6.7289 +/- 0.1888 seconds
Numpy: 1.2140 +/- 0.0236 seconds

I'm guessing the reasons for the worse performance are still the same:
a) repeated calculations because common subexpression elimination isn't used
b) there are many binary operations like a = b*c*d etc. which aren't getting fused?

@parimarjan
Author

the numpy timings, with evaluate() verbosity set to False -- these aren't different from the ones we had before... I guess it's cleaner like this without compile time etc. though?

++++++++++++++++++++++++++++++++++++++
numpy_ops
++++++++++++++++++++++++++++++++++++++
i=1, r=1, op=sqrt, n=10000000
Numpy Inplace sqrt: 0.0248 +/- 0.0020 seconds
Weld Inplace sqrt: 0.2596 +/- 0.0540 seconds

i=1, r=1, op=sqrt, n=100000000
Numpy Inplace sqrt: 0.2441 +/- 0.0083 seconds
Weld Inplace sqrt: 1.2093 +/- 0.5774 seconds

i=1, r=1, op=+, n=10000000
Numpy Inplace +: 0.0161 +/- 0.0009 seconds
Weld Inplace +: 0.2255 +/- 0.0096 seconds

i=1, r=1, op=+, n=100000000
Numpy Inplace +: 0.1670 +/- 0.0119 seconds
Weld Inplace +: 0.9429 +/- 0.0367 seconds

i=1, r=10, op=sqrt, n=10000000
Numpy Inplace sqrt: 0.2494 +/- 0.0061 seconds
Weld Inplace sqrt: 0.4754 +/- 0.0182 seconds

i=1, r=10, op=sqrt, n=100000000
Numpy Inplace sqrt: 2.4554 +/- 0.0383 seconds
Weld Inplace sqrt: 3.0810 +/- 0.0777 seconds

i=1, r=10, op=+, n=10000000
Numpy Inplace +: 0.1581 +/- 0.0037 seconds
Weld Inplace +: 2.0232 +/- 0.0670 seconds

i=1, r=10, op=+, n=100000000
Numpy Inplace +: 1.6713 +/- 0.0779 seconds
Weld Inplace +: 54.4861 +/- 1.2056 seconds

i=0, r=1, op=sqrt, n=10000000
Weld sqrt: 0.2020 +/- 0.0110 seconds
Numpy sqrt: 0.0770 +/- 0.0159 seconds

i=0, r=1, op=sqrt, n=100000000
Weld sqrt: 0.8903 +/- 0.0312 seconds
Numpy sqrt: 0.6934 +/- 0.0959 seconds

i=0, r=1, op=+, n=10000000
Numpy +: 0.0512 +/- 0.0039 seconds
Weld +: 0.2056 +/- 0.0123 seconds

i=0, r=1, op=+, n=100000000
Numpy +: 0.5596 +/- 0.0207 seconds
Weld +: 0.8656 +/- 0.0432 seconds

i=0, r=10, op=sqrt, n=10000000
Weld sqrt: 0.4182 +/- 0.0067 seconds
Numpy sqrt: 0.4965 +/- 0.0214 seconds

i=0, r=10, op=sqrt, n=100000000
Weld sqrt: 3.2796 +/- 0.1890 seconds
Numpy sqrt: 12.6174 +/- 6.5432 seconds

i=0, r=10, op=+, n=10000000
Numpy +: 0.5226 +/- 0.0767 seconds
Weld +: 2.2294 +/- 0.0380 seconds

i=0, r=10, op=+, n=100000000
Numpy +: 8.2432 +/- 2.8448 seconds
Weld +: 54.0143 +/- 2.9903 seconds

parimarjan and others added 5 commits October 18, 2017 07:06
*bigger sizes for the arrays
*change --s flag to --n
*added --ie flag (intermediate evaluation) for blackscholes
*numpy seems to have some weird optimization when c05 was 0.5?!! Not sure why
but numpy gets +5 seconds here while weld stays the same