
processing time streamlined #47

Open
nsheff opened this issue Mar 6, 2018 · 13 comments

nsheff commented Mar 6, 2018

If we can load all the libraries in a global area, then what stops us from also doing this with the database caches, so they don't need to be re-loaded in each container?

nsheff added the question label Mar 6, 2018

nsheff commented Mar 8, 2018

@vpnagraj can you comment on this? Would it be possible? It would dramatically speed up compute time if containers could pre-load the databases.


vpnagraj commented Mar 8, 2018

@nsheff well i think this would have serious implications for the memory usage, right?

i just tried loading a couple of region dbs outside of the app, Core hg19 and LOLAroadmap hg19, and they were ~ 1GB and ~ 3GB respectively:

```r
library(LOLA)

# load two region DBs and check their in-memory footprint
Core_hg19 <- loadRegionDB(dbLocation = "reference/Core/hg19")
object.size(Core_hg19)

LOLARoadmap_hg19 <- loadRegionDB(dbLocation = "reference/LOLARoadmap/hg19")
object.size(LOLARoadmap_hg19)
```

so for this to work as you've described we'd need to preload all of the region DBs. if just those two alone are hogging > 4GB of memory, i'm not sure we'll have enough memory in the containers to do the rest of the computing


nsheff commented Mar 8, 2018

yeah, makes sense...

So it's a tradeoff of memory and time. What if we kept a few containers pre-loaded, sacrificing a few gigs of memory, but the users would get the compute done in just a few seconds? The only advantage of doing it the other way is you save a few gigs of memory (when the process isn't being used)... but if multiple people are using it you're consuming that memory in duplicate.

But look, even better; we can scale it with user increase, using only a single representation of data in memory...

You can share memory between R processes. Sweet... this video is really worth watching. We would use this idea for the containers.

That gives us not only a speed advantage but a memory advantage. This opens up all kinds of possibilities.
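For reference, the shared-process idea could be sketched with the svSocket package like this (the port, paths, and object names here are illustrative assumptions, not LOLAweb code):

```r
library(svSocket)

## In a long-lived "data server" R process (run once):
##   library(LOLA)
##   Core_hg19 <- loadRegionDB("reference/Core/hg19")
##   startSocketServer(port = 8888)

## In each client session (e.g., a Shiny worker):
con <- socketConnection(host = "localhost", port = 8888, blocking = FALSE)

## Evaluate in the *server's* workspace; only the result is sent back,
## so the multi-GB regionDB never has to be copied into the client.
n_collections <- evalServer(con, length(Core_hg19$regionGRL))
close(con)
```

Every client talks to the same server session, so there is only one copy of the database in memory no matter how many users are connected.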


nsheff commented Mar 8, 2018

> so for this to work as you've described we'd need to preload all of the regiondbs

Also, why would we need to preload all of them?


nmagee commented Mar 8, 2018

@nsheff That video is interesting (okay yes I got pulled in by your "this video is really worth watching" line).

Redis might be worth considering, as a pool in which we dump all reference datasets -- everything remains in memory, super-fast, is already containerized, and can be quite large. Just another way to share things between containers.


nsheff commented Mar 8, 2018

@nmagee Yeah. Sharing among containers is one thing, and sharing among R processes is an independent thing (which can maybe be solved with rredis or with the svSocket approach above).

nsheff mentioned this issue Apr 2, 2018

nsheff commented Jun 26, 2018

@vpnagraj where are we on the sharing processes/redis idea?


vpnagraj commented Jul 4, 2018

tl;dr ... i think we should skip the socket idea and just use redis

i've been looking into the svSocket package to try and "host" the region DBs in an R process that's separate from Shiny

i am able to successfully use the svSocket::evalServer() method (as demonstrated in the video you shared above) for sharing objects in memory between R sessions

however the performance with even a moderate sized vector (rnorm(1e5)) is much worse than the redis method

i've written up a gist with some code to benchmark the methods

but basically to retrieve rnorm(1e5) from a redis key/value store takes on average about 6 milliseconds ... to pull the object from one R session to another with svSocket takes about 50 seconds (!)
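(The gist itself isn't reproduced here; a benchmark along these lines, assuming a local redis server and an svSocket server already running on port 8888 with `x` defined in its workspace, would look roughly like:)

```r
library(microbenchmark)
library(rredis)
library(svSocket)

x <- rnorm(1e5)

redisConnect()          # assumes redis-server running on localhost:6379
redisSet("x", x)        # rredis serializes the R object into redis

con <- socketConnection(host = "localhost", port = 8888, blocking = FALSE)

microbenchmark(
  redis    = redisGet("x"),
  svsocket = evalServer(con, x),   # assumes `x` exists in the server session
  times    = 10
)
```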

the package manual speaks to this:

> Although intially designed to server GUI clients, the R socket server can also be used to exchange data between separate R processes. The evalServer() function is particularly useful for this. Note, however, that R objects are serialized into a text (i.e., using dump()) format, currently. It means that the transfer of large object is not as efficient as, say Rserver (Rserver exchanges R objects in binary format, but Rserver is not stateful, clients do not share the same global workspace and it does not allow concurrent use of the command prompt).

i haven't dug too deep on the "Rserver" option but i think that might be referring to the Rserve package? or is this implemented in svSocket as an option that i've missed?

either way at this point i'm not convinced that this memory sharing concept is worth the overhead

@nsheff ... 👍 to move forward with redis ?


nsheff commented Jul 4, 2018

Interesting.

I don't suppose it would make sense to do the inverse? (move the smaller query files over and then perform the computation in the svSocket server that already has the big files loaded)?

Or, even better -- don't even read the uploaded file into the client session; just use evalServer() to load the file up in the 'server' session (the persistent one with the big data pre-loaded), then do runLOLA there.
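That could look roughly like the following (illustrative only; the upload path, the `userUniverse` object, and the server-side setup are assumptions, not LOLAweb code):

```r
library(svSocket)

con <- socketConnection(host = "localhost", port = 8888, blocking = FALSE)

## Read the (small) uploaded file directly in the persistent server
## session, where Core_hg19 and userUniverse are already in memory:
evalServer(con, userSets <- LOLA::readBed("uploads/query.bed"))

## Run the enrichment there too; only the (small) result table comes back.
res <- evalServer(con, LOLA::runLOLA(userSets, userUniverse, Core_hg19))
close(con)
```

This way neither the big database nor the intermediate objects ever cross the socket; only the final result does.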

It looks like you're right about the Rserve package... do you think that's not worth looking into?


nsheff commented Jul 5, 2018

One other thought: would this method dramatically reduce memory required by making it so all the little child processes didn't even need to load the database?

In any case, I think it's fine to just move forward with the redis method. I have two thoughts in that regard:

  1. I think this functionality should be put in simpleCache, not in LOLAweb, so it's more universally useful.
  2. is it possible that something like mongoDB could be used for this instead?
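A minimal sketch of the redis route with rredis (object names illustrative):

```r
library(rredis)
library(LOLA)

redisConnect()    # assumes a redis server on localhost:6379

## Cache a loaded regionDB once...
regionDB <- loadRegionDB("reference/Core/hg19")
redisSet("Core_hg19", regionDB)    # rredis serializes R objects

## ...then any other R process or container can pull it back:
regionDB2 <- redisGet("Core_hg19")
```

One caveat: a single redis string value maxes out at 512 MB, so a multi-GB regionDB would have to be split across keys (e.g., per collection) rather than stored as one object.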


nmagee commented Jul 5, 2018 via email


nsheff commented Jul 5, 2018

> Mongo actually has an even smaller limit for keys – 16MB I think, and not 512MB

Is the key size a problem? Or do you mean value size?

But good point about the memory thing.


nmagee commented Jul 5, 2018 via email
