added numpy benchmarks #1
base: master
Conversation
* blackscholes
* numpy ops - unary/binary, inplace/non-inplace
Hey Pari, thanks for doing this!
The emitted output files for the benchmarks are the following (numpy_ops is particularly long, with a bunch of different combinations being run...):
Can you actually just print the output to stdout (you should use the …
Ah yeah, that makes sense. Here are the new outputs - I updated the PR to add inplace ops as a separate flag for numpy_ops.
Can you rename the …
Can you also post the output after you make the sizes larger?
Yup, made the changes. The outputs are below (these are on my mac, so not ideal I guess, but mostly the trends seem ~similar to what we had before). Also, for blackscholes, I posted the original version and a slightly optimized version (where the intermediate arrays are evaluated before the final two arrays are computed), as that seems to make a huge difference.
This was particularly bad for two reasons I think:
Blackscholes w/ evaluating (caching?) intermediate common parts:
I see, a couple of questions:
High variance: I think this is probably a side effect of me running these on my mac (I have been running into performance issues with high memory/disk space utilization...) - I will set this up on dawn and get back to you on this.

Blackscholes: initially it performs a bunch of computations on four arrays (a, b, c, d). The final results are two arrays, 'call' and 'put', with both of these being a function of all four arrays (a, b, c, d). So if we just do call.evaluate() and put.evaluate() at the end, then the computations on (a, b, c, d) are repeated (first case above, with results ~10 seconds). If we call a.evaluate(), b.evaluate() etc. before call.evaluate(), put.evaluate(), then the runtime goes down dramatically (~2 seconds above). This is still slower than numpy, I think because of the second issue below: the computations on a, b, c, d include many repeated segments (a made-up example is sketched after this comment).

Numpy binary ops: From what I remember, I think the loop fusion optimization wasn't completely handled for the binary ops (while it was complete for unary ops), which led to binary ops being slower earlier. I'm not sure why the performance for binary ops degrades so fast in the last cases though... maybe it was a combination of this + issues on my mac (?)
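A rough illustration of the two evaluation strategies described above. The array names a, b, c, d come from the comment, but the expressions, helper functions, and plain-NumPy stand-ins are made up; in the actual benchmark these would be lazy weldnumpy arrays whose .evaluate() method forces the pending computation, as described in this thread.

```python
import numpy as np

def build_intermediates(price, strike):
    # Stand-ins for the four lazily-built arrays; with weldnumpy each of
    # these operations would only extend a computation graph, not execute.
    a = np.log(price / strike)
    b = np.sqrt(price) * 0.5
    c = a / b
    d = c - b
    return a, b, c, d

def build_outputs(a, b, c, d):
    # Both outputs depend on all four intermediates and share repeated
    # sub-expressions (the "repeated segments" mentioned above).
    call = a * b + c * d
    put = a * b - c * d
    return call, put

# Slow variant (~10 s above): only the final arrays are forced, so the whole
# graph behind a, b, c, d is executed once for `call` and again for `put`:
#     call.evaluate(); put.evaluate()
#
# Faster variant (~2 s above): force the intermediates first, so their graphs
# run once and the outputs only pay for the last step:
#     for x in (a, b, c, d): x.evaluate()
#     call.evaluate(); put.evaluate()
```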
Black Scholes: Ah right, I remember now. I guess we talked about having something like …

NumPy operations: Aren't you doing something like …
Yeah, although we discussed the group update (or similar decorator annotations to explicitly mention which arrays the user cares about) in the context of in-place operations - so then we could figure out the correct order of operations in order to first evaluate any dependent arrays etc. It would also help with caching the intermediate arrays in the above example, but I'm not sure if this is the best syntax for it - e.g., if a few more operations need to happen on 'put' after 'call's evaluation, then it won't cover that case. I guess if we want to do it automatically, one option could be to assume that we can see the complete python file and infer things based on that (?). Dask also seems to have worked on this issue: http://dask.pydata.org/en/latest/caching.html - but it's probably nothing too interesting. They mentioned it's just for "interactive sessions", so I guess for fixed python files they might be doing some sort of look-ahead in their graph etc. (will need to look further into it...).

Loop fusion in binary ops: So one of the flags there was -r for the number of reps, and the dramatic slowdown in weld appears to be at r=10. The code is essentially `for i in range(r): a = a + b`. For binary operators, loop fusion has no effect on this code; in comparison, for unary operators, after loop fusion the operations are merged into a single pass (a rough sketch of the two cases follows this comment).
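A minimal sketch of the two repetition loops being contrasted here. Plain NumPy arrays and the sizes are illustrative; in the benchmark these are weldnumpy arrays, so each iteration would normally just extend a lazy computation graph.

```python
import numpy as np

r = 10
a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)

# Binary case from the benchmark (`-r` reps of a + b). At the time of this
# discussion, loop fusion didn't handle this pattern, so each of the r
# additions remains a separate full pass over both arrays.
for _ in range(r):
    a = a + b

# Unary case: the r applications of sqrt compose into a single elementwise
# function, so after loop fusion they can be merged into one pass that
# applies sqrt r times per element.
c = np.random.rand(10_000_000)
for _ in range(r):
    c = np.sqrt(c)
```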
I see, thanks for the explanations! For the NumPy binary operators, I think Matei's new fusion PR might be able to handle these cases -- I will need to double check though. Let's discuss the best way to handle the Black Scholes workload at the Weld meeting this week.
Blackscholes update: The timings posted above for blackscholes were actually wrong --- although the trend was mostly similar (just worse for weld). The problem was that since there were multiple arrays being evaluated (call and put in the case without intermediate evaluation, and a bunch of others with intermediate evaluation), and I was using weldobject.evaluate with verbosity on, a lot of timings with scheme "Weld" were being printed (since weldobject also prints those) - and these were being averaged by run-benchmarks (which also explains the very high variance you noticed, I guess). This didn't affect the times we had for the unary/binary ops, as those only had a single call to evaluate, and the scheme for the final times was something like "Weld +" etc. I guess the best solution is just to keep the verbosity off, so we'll only get the final time outputs?

Here are the updated total times for blackscholes, with ie = intermediate evaluation:

++++++++++++++++++++++++++++++++++++++
ie=0, n=10000000
ie=1, n=1000000
ie=1, n=10000000

I'm guessing the reasons for the worse performance are still the same:
The numpy timings, w/ evaluate() verbosity as False -- these aren't different from the ones we had before... I guess it's cleaner like this without compile time etc. though?

++++++++++++++++++++++++++++++++++++++
i=1, r=1, op=sqrt, n=100000000
i=1, r=1, op=+, n=10000000
i=1, r=1, op=+, n=100000000
i=1, r=10, op=sqrt, n=10000000
i=1, r=10, op=sqrt, n=100000000
i=1, r=10, op=+, n=10000000
i=1, r=10, op=+, n=100000000
i=0, r=1, op=sqrt, n=10000000
i=0, r=1, op=sqrt, n=100000000
i=0, r=1, op=+, n=10000000
i=0, r=1, op=+, n=100000000
i=0, r=10, op=sqrt, n=10000000
i=0, r=10, op=sqrt, n=100000000
i=0, r=10, op=+, n=10000000
i=0, r=10, op=+, n=100000000
* bigger sizes for the arrays
* change --s flag to --n
* added --ie flag (intermediate evaluation) for blackscholes
* numpy seems to have some weird optimization when c05 was 0.5?! Not sure why, but numpy gets +5 seconds here while weld stays the same (a quick way to check this is sketched after this list)
* blackscholes
* numpy ops - unary/binary, inplace/non-inplace
Requires the setup.py for weldnumpy to have been run on the machine (PR #247 on weld).
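Regarding the c05 = 0.5 observation in the list above, one quick way to check whether NumPy treats that constant specially might be a micro-benchmark like the following (the comparison constant 0.51, the array size, and the repetition count are illustrative, not taken from the benchmark):

```python
import timeit
import numpy as np

x = np.random.rand(10_000_000)

# If NumPy handles the 0.5 scalar differently, the two timings below
# should differ noticeably; otherwise they should be roughly equal.
print("c05 = 0.5 :", timeit.timeit(lambda: 0.5 * x, number=20))
print("c05 = 0.51:", timeit.timeit(lambda: 0.51 * x, number=20))
```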