Clarify benchmarks? #137
Unfortunately I'm not a Scala guy, so I'd like to ask a few more questions regarding test equality, if you don't mind. Line 144: Was the number of threads set to 1 at runtime, or was the number of CPU threads > 1 during both tests? |
2018-04-02 13:21 GMT+08:00 raver119 <[email protected]>:
> Unfortunately I'm not a Scala guy, so I'd like to ask a few more questions regarding test equality, if you don't mind.
> https://github.com/ThoughtWorksInc/Compute.scala/blob/nvidia-gpu/benchmarks/src/jmh/scala/com/thoughtworks/compute/benchmarks.scala#L144-L181
> Line 144: Was the number of threads set to 1 at runtime, or was the number of CPU threads > 1 during both tests?
CPU threads were > 1 during both tests. I suggest you learn to use JMH,
because it is a very good tool when you are implementing a performance-critical
framework.
> Lines 153 & 168: Does that mean you've included input generation time in the Tanh measurements?
The input generation will be completed in the warm-up iterations. I recommend you read this: http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/ . I hope the article helps you better understand how Scala's lazy val works.
> Line 170: So, you do use "cache" for your library, but are not using workspaces for ND4J? Nice.
The cache method is the equivalent of a lazy val in the ND4J initialization. It just allocates a buffer for the input, not for the rest of the computation.
> Line 174: foldLeft InlineTensor - does it mean that the operation in this case is executed in-place? As in "the input array gets modified and returned back" once the .flatArray() method is called?
Compute.scala does not support any in-place mutable operations. The foldLeft over InlineTensor means merging multiple tanh calls into one kernel program.
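For readers who have not opened benchmarks.scala, here is a minimal sketch of the shape being discussed. It assumes Compute.scala's preconfigured `com.thoughtworks.compute.gpu.Tensor`; the constructor call, the input size, and the `Tensor.tanh` method name are illustrative assumptions, not the benchmark's actual code.

```scala
import com.thoughtworks.compute.gpu._ // assumed preconfigured Tensor backend

final class TanhChainSketch(numberOfIterations: Int) {

  // lazy val: the input is generated on first access, i.e. during the
  // JMH warm-up, so measured iterations reuse the same buffer.
  @transient private lazy val input: Tensor =
    Tensor(Array.fill(32 * 32)(scala.util.Random.nextFloat()).toSeq)

  def run() = {
    // Each foldLeft step only extends an expression tree (an InlineTensor);
    // nothing executes until flatArray forces it, so the whole chain of
    // tanh calls is compiled into a single kernel program.
    val fused = (0 until numberOfIterations).foldLeft(input) { (acc, _) =>
      Tensor.tanh(acc) // assumed method name, for illustration
    }
    fused.flatArray.blockingAwait
  }
}
```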
--
杨博 (Yang Bo)
|
On Apr 1, 2018, at 23:18, 杨博 (Yang Bo) ***@***.***> wrote:
> CPU threads were > 1 during both tests. I suggest you learn to use JMH, because it is a very good tool when you are implementing a performance-critical framework.
I'm sorry, but my question actually had a deeper sense. JMH allows a runtime override of the Threads annotation, and the only thing I asked was whether there was an override when you ran your code or not. I take your answer as no, though. So the next question is: what's the idea of testing CUDA in multiple parallel CPU threads? Was the workload small enough?
P.S. Thanks for the advice.
> The input generation will be completed in the warm-up iterations.
Oh, interesting.
> The cache method is the equivalent of a lazy val in the ND4J initialization. It just allocates a buffer for the input, not for the rest of the computation.
Perfect. See next question below please.
> Compute.scala does not support any in-place mutable operations. The foldLeft over InlineTensor means merging multiple tanh calls into one kernel program.
OK. Let me ask in other words then. Lines 170-178: there's a loop with `numberOfIterations` iterations. How many independent memory buffers are allocated there? 0? 1? 2? `numberOfIterations`, or `numberOfIterations` x 2?
EDIT: I mean off-heap buffers, available to the GPU.
|
2018-04-02 14:49 GMT+08:00 raver119 <[email protected]>:
> OK. Let me ask in other words then. Lines 170-178: there's a loop with `numberOfIterations` iterations. How many independent memory buffers are allocated there? 0? 1? 2? `numberOfIterations`, or `numberOfIterations` x 2? I mean off-heap buffers, available to the GPU.
1 buffer, which stores the result. I know it's not fair to ND4J, but I
really don't know how to make ND4J merge multiple immutable operations into
one.
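To make the allocation question concrete, here is a sketch in Scala against ND4J's public Transforms API of the two loop shapes being compared. The loop is an illustration of the idea, not the benchmark's actual code, and the boolean overload of `Transforms.tanh` is assumed to be the usual copy/in-place switch.

```scala
import org.nd4j.linalg.api.ndarray.INDArray
import org.nd4j.linalg.factory.Nd4j
import org.nd4j.linalg.ops.transforms.Transforms

object AllocationSketch {
  // Immutable style (the style this benchmark exercises, as discussed
  // later in the thread): every Transforms.tanh call returns a freshly
  // allocated INDArray, so the loop allocates a new array per iteration.
  def immutableTanh(input: INDArray, numberOfIterations: Int): INDArray =
    (0 until numberOfIterations).foldLeft(input) { (acc, _) =>
      Transforms.tanh(acc) // new INDArray each call
    }

  // In-place style: the same math reusing one buffer, at the cost of
  // overwriting it on every iteration.
  def inPlaceTanh(input: INDArray, numberOfIterations: Int): INDArray = {
    val acc = input.dup() // keep the caller's input intact
    (0 until numberOfIterations).foreach(_ => Transforms.tanh(acc, false))
    acc
  }
}
```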
--
杨博 (Yang Bo)
|
You are very perceptive to realize that the performance issue in the benchmark is related to memory.
For memory settings, you can check the benchmark source code. JMH forks the JVM when running; the JVM flags are provided by annotations. No annotation means the default JVM configuration.
For workspaces, I have to confess that I am not familiar with workspaces. But the purpose of the Compute.scala library is to let people not care about memory even when they are using arbitrary immutable operations.
I understand that a carefully optimized application written in ND4J is good. But it seems that many users of ND4J are not smart enough to avoid OutOfMemoryError: https://github.com/deeplearning4j/deeplearning4j/issues?utf8=%E2%9C%93&q=OutOfMemoryError
2018-04-02 12:35 GMT+08:00 Adam Gibson <[email protected]>:
> From an initial look, you guys didn't do your benchmarks properly.
> 1. You are missing workspaces in your microbenchmarks, which defeats the purpose of benchmarking nd4j.
> 2. You don't show any of nd4j's native configuration or any of the memory configurations you guys tried.
--
杨博 (Yang Bo)
|
2018-04-02 14:49 GMT+08:00 raver119 <[email protected]>:
> What's the idea of testing CUDA in multiple parallel CPU threads? Was the workload small enough?
Yes, the purpose is to avoid starving the GPU. I did not test it, but I guess Compute.scala will be even faster on larger arrays with fewer threads, because that reduces the overhead of the driver, considering that the NVIDIA OpenCL driver has higher overhead than CUDA.
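For readers unfamiliar with JMH, the threading being discussed comes from an annotation like the one below; the values here are illustrative, not the project's actual settings. Unless overridden on the command line (e.g. with -t), JMH calls the benchmark method concurrently from that many threads.

```scala
import org.openjdk.jmh.annotations._
import org.openjdk.jmh.infra.Blackhole

@State(Scope.Benchmark)
@Threads(4) // run the benchmark method from 4 CPU threads concurrently
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@Fork(1)
class ThreadedGpuBenchmarkSketch {

  @Benchmark
  def tanh(blackhole: Blackhole): Unit = {
    // Placeholder workload; in the real benchmark this is the tensor
    // expression whose evaluation keeps the GPU queue full, which is the
    // "avoid starving the GPU" point made above.
    blackhole.consume(math.tanh(0.5))
  }
}
```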
--
杨博 (Yang Bo)
|
I see. This benchmark isn't comparing apples to apples. Thanks for your time. |
You can see there is a notice about Deeplearning4j in the README.md. We have never criticized the performance of ND4J used in a mutable style. |
2018-04-02 22:27 GMT+08:00 raver119 <[email protected]>:
> I see. This benchmark isn't comparing apples to apples. Thanks for your time.
You are wrong. The benchmark is comparing apples to apples: immutable operations vs immutable operations.
Unfortunately, ND4J does not have the feature of dynamic kernel generation, or JIT in PyTorch's terminology.
For example, consider comparing the performance of Lua runtimes between Lua and LuaJIT. You can say it's not fair because LuaJIT supports JIT, but it's still comparing the same semantics of operations.
I understand ND4J is designed for Deeplearning4j, which only needs mutable in-place operations, so you guys don't have to optimize the performance for other usage.
--
杨博 (Yang Bo)
|
A few messages above you said that 1 array was allocated for the loop. Now you say it's an immutable vs immutable comparison. So which answer is correct? I.e. your code that uses Nd4j does `numberOfIterations` x 2 allocations, because your Transform.tanh() call creates a new INDArray each time, and each INDArray has 2 buffers: 1 on the GPU side, 1 on the host side. With 5 total iterations your test basically benchmarks CUDA allocation performance, not the actual tanh. If you call that an "apples to apples comparison" - okay, that's up to you :) Re Nd4j for Dl4j: Nd4j just mimics numpy, basically. |
2018-04-03 0:41 GMT+08:00 raver119 <[email protected]>:
> Your code that uses Nd4j does `numberOfIterations` x 2 allocations, because your Transform.tanh() call creates a new INDArray each time, and each INDArray has 2 buffers: 1 on the GPU side, 1 on the host side. With 5 total iterations your test basically benchmarks CUDA allocation performance, not the actual tanh.
You are talking about your implementation, not the behavior. You provide a very good explanation of why ND4J's implementation consumes more memory than Compute.scala for the same behavior, which is exactly what the benchmark demonstrated.
--
杨博 (Yang Bo)
|
No, I've just explained that you hadn't understood how to implement things with ND4J efficiently, and made claims like this:
It's not about the ND4J implementation. It's about what YOU implemented. Because obviously the same code could be written without allocating a new INDArray on each iteration. Just a difference of 1 argument :) |
P.S. Don't get me wrong, please. Personally I don't care about your claims, etc. You want to claim you're faster than Nd4j? I'm OK with that. If you want to claim that you're faster than light, I'll be OK with that as well. The only reason I was here is performance feedback. When I hear about Nd4j performance problems, I always try to get to the bottom of the problem and improve whatever is possible to improve. In this particular case I see it's a waste of time for me, due to various reasons: different approaches, bad benchmarking setup, different goals, etc. Thanks for your time. |
2018-04-03 0:49 GMT+08:00 raver119 <[email protected]>:
> It's not about the ND4J implementation. It's about what YOU implemented.
Suppose a data scientist Alice reads a paper and wants to reproduce an algorithm from the paper, say, `a * b + c`. She finds that the ND4J/ND4S version of `a * b + c` consumes four times the memory of the Compute.scala version of `a * b + c`. And the Skymind guys blame Alice because she did not refactor her ND4J version to `a *= b; a += c`. Interesting...
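For readers who have not used ND4J, the refactoring being demanded is roughly one letter per call. A short sketch using standard INDArray methods (the shapes are chosen arbitrarily):

```scala
import org.nd4j.linalg.factory.Nd4j

object MutabilitySketch extends App {
  val a = Nd4j.rand(2, 2)
  val b = Nd4j.rand(2, 2)
  val c = Nd4j.rand(2, 2)

  // Immutable spelling: mul and add each allocate and return a new
  // INDArray, leaving a, b and c untouched.
  val immutable = a.mul(b).add(c)

  // In-place spelling: muli overwrites a with a*b, then addi adds c into
  // the same buffer. No extra allocations, but a loses its original value.
  val inPlace = a.muli(b).addi(c)

  println(immutable)
  println(inPlace)
}
```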
--
杨博 (Yang Bo)
|
Imagine Alice does some reading of the documentation, and instead of:
a.mul(b).addi(c)
does something like:
a.muli(b).addi(c)
That's when it becomes interesting... :) |
OK, today I learned that anyone who reads the ND4J documentation should never use immutable operators.
I am so curious how fast a.muli(b).addi(c) is. It must be far faster than Compute.scala's slow immutable operators.
I guess I should add a new benchmark for that.
--
杨博 (Yang Bo)
|
@raver119 The in-place version of the ND4J operation is indeed super fast. It is 1.44 times faster than ND4J's immutable version on the benchmarked expression. All the tests were run on a Titan X GPU. |
That's already something, thank you. Please tell me, what OS was used, and what CUDA Toolkit version was used? EDIT: And which Titan X generation was used? There were 2 different generations sharing the same X name. Which one did you use? M or P? |
Ubuntu 16.04 and CUDA 8.0 from this docker image: https://github.com/ThoughtWorksInc/scala-cuda/tree/sbt-openjdk8-cuda8.0-opencl-ubuntu16.04 |
What's different on your local branch? I've tried to run your A couple more things:
|
|
The reason why I was using 0.8 is that the CUDA backend of ND4J 0.9.x is broken in sbt, even when compiling from a clean docker image: deeplearning4j/nd4j#2767 |
Can you quickly give me the command you are using to run the program? I cannot access the class from the SBT console.
|
sbt 'benchmarks/Jmh/run Issue137'. The first run of the command may fail due to an sbt-jmh bug, but a retry should work. |
Excuse my ignorance, all I'm getting is a packaged jar and there's no main
class when I try to run it.
`/home/justin/Projects/Compute.scala/benchmarks/target/scala-2.12/benchmarks_2.12-0.3.2-SNAPSHOT-jmh.jar`
|
2018-04-03 5:13 GMT+08:00 Justin Long <[email protected]>:
> What's different on your local branch?
There were different workarounds for different OpenCL bugs from different vendors, but we now detect the vendor at run time and dynamically switch those workarounds.
The only difference now in the nvidia-gpu branch is the library dependency of the ND4J backend, because I don't know how to switch the ND4J backend at runtime.
--
杨博 (Yang Bo)
|
I've also tried with no success: `java -cp
/home/justin/Projects/Compute.scala/benchmarks/target/scala-2.12/benchmarks_2.12-0.3.2-SNAPSHOT-jmh.jar
com.thoughtworks.compute.benchmarks`
|
Imagine you do some reading of documentation, and instead of:
do something like:
That's when it becomes interesting... :) |
Burden of reproducibility falls on you. The command you gave me is
If you can't give me something that's reproducible, that's very suspect. I see that you have since edited your answer to use |
Clarify:
|
Your Tensors have an asynchronous component. Instead of calling In ND4J, we have a simple |
2018-04-03 11:47 GMT+08:00 Justin Long <[email protected]>:
> How can I issue a command to execute all ops in the queue?
You are talking about the implementation of ND4J. Compute.scala does not have a similar assumption such as "each Tensor has an associated command queue", because that assumption is an inefficient design for immutable operations.
--
杨博 (Yang Bo)
|
I'm aware that InlineTensors are lazily evaluated because you compile all of the ops until final evaluation: https://github.com/ThoughtWorksInc/Compute.scala#lazy-evaluation What I'm looking for is an op that triggers evaluation without calling .toString or flatArray.blockingAwait. I want to isolate the execution itself without dumping the contents of the tensor. If I call something such as tensor.nonInline, does that trigger execution? Glancing at the code, it appears there's an asynchronous operation triggered by Do and it's handled on a separate thread. So if I try and call tensor.nonInline, all of the ops will be executed possibly(?) on a separate thread and I can't evaluate execution time. My goal here is to break this into smaller, consumable pieces. |
Add a JMH parameter -p numberOfIterations=1 to benchmark smaller, consumable pieces, though I actually do not understand what **consumable** means.
In a Scala program, referential transparency is more important than implementation details.
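For reference, the -p flag works because the iteration count is exposed as a JMH @Param. A minimal sketch of such a declaration (the field name follows this thread; the default value and class name are illustrative):

```scala
import org.openjdk.jmh.annotations.{Param, Scope, State}

@State(Scope.Benchmark)
class Issue137ParamsSketch {
  // Overridable from the JMH command line, e.g.:
  //   sbt 'benchmarks/Jmh/run Issue137 -p numberOfIterations=1'
  @Param(Array("5"))
  var numberOfIterations: Int = _
}
```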
--
杨博 (Yang Bo)
|
"Consumable pieces" was meant in a programmer's POV kind of way (informal). I'll try to explain a bit better:
So, for example, I do Tell me if I'm wrong, but because your Tensors are lazily evaluated if I simply define an op without invoking |
There is no public method to determine whether the internal computation of a Tensor is finished. There was a doBuffer method; unfortunately I made doBuffer private when introducing the public methods cache and doCache, which are similar to doBuffer but hide the internal state.
If you are benchmarking the behavior from a user's perspective, then you may want to create complex expressions that end in a flatArray or flatBuffer call, and count the total time, which indicates whether the code generated by Compute.scala is good.
If you are benchmarking the implementation, then you are merely benchmarking the OpenCL runtime.
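A sketch of that user-level measurement, reusing the same illustrative names as the earlier sketch in this thread (`Tensor.tanh` and the `Tensor(Seq[Float])` constructor are assumptions): build the whole expression lazily, then force it inside the measured method, so the timing covers kernel generation, execution, and readback.

```scala
import com.thoughtworks.compute.gpu._ // assumed preconfigured Tensor backend
import org.openjdk.jmh.annotations.{Benchmark, Scope, State}
import org.openjdk.jmh.infra.Blackhole

@State(Scope.Benchmark)
class EndToEndSketch {
  // Prepared lazily, outside the measured region (first touch happens in warm-up).
  @transient private lazy val input: Tensor =
    Tensor(Array.fill(32 * 32)(scala.util.Random.nextFloat()).toSeq)

  @Benchmark
  def endToEnd(blackhole: Blackhole): Unit = {
    // Build a non-trivial expression, then force it; the blocking flatArray
    // call is what actually runs the generated kernel, so it must stay
    // inside the benchmark method.
    val expression = (0 until 5).foldLeft(input)((acc, _) => Tensor.tanh(acc))
    blackhole.consume(expression.flatArray.blockingAwait)
  }
}
```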
--
杨博 (Yang Bo)
|
Both |
I'm right now writing an isolated benchmark using the methods you suggested. I added your library to SBT as per:
However, when I try to run this I get:
Did I forget a dependency somewhere? I've been able to run your tests just fine and I have CUDA 9.1 installed on my system, making me think there's an issue with my SBT configuration. |
Please read the Getting started section in the README @crockpotveggies |
Quick follow up. After spending a couple days examining your APIs I'm unsure how to allocate tensors outside of a benchmark loop. While I've attempted to also use the However, I was able to get some ops-only numbers using this code: https://gist.github.com/crockpotveggies/88a5e0f3b067b30065063790102be2fd The results are:
However, if I remove The other issue is that if I move Code:
Result:
I'd be interested in knowing a better way to do this. |
Have you tried to |
Could you please provide the source code so I could fix it? |
Thanks for the tip, I'll try the |
Added a close() method to the cache and still get the same SIGSEGV error.
|
I need the complete source code and your system configuration (OS, OpenCL runtime vendor and version, |
You have complete source code in the link above. Using Ubuntu 16.04, OpenCL runtime vendor is CUDA 9.1. Output of Not sure why it's trying to use cuda-8.0...I only have cuda 9.1 installed. |
@crockpotveggies Your first version allocated caches but never released them, resulting in running out of memory. It seems that my |
Caching is designed for permanent data, for example the weights of a neural network. It's not designed for intermediate variables. |
Hey folks:
First of all, I'm kind of dismayed you guys didn't talk to us about your findings. You guys are making some heavy claims. We are about to do a release this week. We were busy essentially writing our own TF, including import.
From an initial look, you guys didn't do your benchmarks properly.
1. You are missing workspaces in your microbenchmarks, which defeats the purpose of benchmarking nd4j.
2. You don't show any of nd4j's native configuration or any of the memory configurations you guys tried.
I know it's in your guys' interest to have your own framework. That's up to you.
We'll do a blog post to correct a lot of these misconceptions you guys are perpetuating here, but in the meantime, we'll do our best to clarify questions you guys have. We don't mind having competition. It's great to keep us on our toes, but we want to make sure anything representing us is at least somewhat fair (even just handling some of the lower-hanging fruit).