
processing time streamlined #47

Open
nsheff opened this issue Mar 6, 2018 · 13 comments

nsheff commented Mar 6, 2018

If we can load all the libraries in a global area, then what stops us from also doing this with the database caches, so they don't need to be re-loaded in each container?

nsheff added the question label Mar 6, 2018

nsheff commented Mar 8, 2018

@vpnagraj can you comment on this? Would it be possible? It would dramatically speed up compute time if containers could pre-load the databases.


vpnagraj commented Mar 8, 2018

@nsheff well i think this would have serious implications for the memory usage, right?

i just tried loading a couple of region dbs outside of the app, Core hg19 and LOLAroadmap hg19, and they were ~ 1GB and ~ 3GB respectively:

```r
library(LOLA)

# load two region DBs and check their in-memory footprint
Core_hg19 <- loadRegionDB(dbLocation = "reference/Core/hg19")
object.size(Core_hg19)

LOLARoadmap_hg19 <- loadRegionDB(dbLocation = "reference/LOLARoadmap/hg19")
object.size(LOLARoadmap_hg19)
```

so for this to work as you've described we'd need to preload all of the region DBs. if just those two alone are hogging > 4GB of memory, i'm not sure we'll have enough memory in the containers to do the rest of the computing


nsheff commented Mar 8, 2018

yeah, makes sense...

So it's a tradeoff of memory and time. What if we kept a few containers pre-loaded, sacrificing a few gigs of memory, but the users would get the compute done in just a few seconds? The only advantage of doing it the other way is you save a few gigs of memory (when the process isn't being used)... but if multiple people are using it you're consuming that memory in duplicate.

But look, even better; we can scale it with user increase, using only a single representation of data in memory...

You can share memory between R processes. Sweet... this video is really worth watching. We would use this idea for the containers.

That gives us not only a speed advantage but a memory advantage. This opens up all kinds of possibilities.
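For reference, the shared-process idea could be sketched with the svSocket package like this (the port, paths, and object names here are illustrative assumptions, not LOLAweb code):

```r
library(svSocket)

## In a long-lived "data server" R process (run once):
##   library(LOLA)
##   Core_hg19 <- loadRegionDB("reference/Core/hg19")
##   startSocketServer(port = 8888)

## In each client session (e.g., a Shiny worker):
con <- socketConnection(host = "localhost", port = 8888, blocking = FALSE)

## Evaluate in the *server's* workspace; only the result is sent back,
## so the multi-GB regionDB never has to be copied into the client.
n_collections <- evalServer(con, length(Core_hg19$regionGRL))
close(con)
```

Every client talks to the same server session, so there is only one copy of the database in memory no matter how many users are connected.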


nsheff commented Mar 8, 2018

> so for this to work as you've described we'd need to preload all of the regiondbs

Also, why would we need to preload all of them?


nmagee commented Mar 8, 2018

@nsheff That video is interesting (okay yes I got pulled in by your "this video is really worth watching" line).

Redis might be worth considering, as a pool in which we dump all reference datasets -- everything remains in memory, super-fast, is already containerized, and can be quite large. Just another way to share things between containers.


nsheff commented Mar 8, 2018

@nmagee Yeah. Sharing among containers is one thing, and sharing among R processes is an independent thing (which can maybe be solved with rredis or with the svSocket approach above).

nsheff mentioned this issue Apr 2, 2018

nsheff commented Jun 26, 2018

@vpnagraj where are we on the sharing processes/redis idea?


vpnagraj commented Jul 4, 2018

tl;dr ... i think we should skip the socket idea and just use redis

i've been looking into the svSocket package to try and "host" the region DBs in an R process that's separate from Shiny

i am able to successfully use the svSocket::evalServer() method (as demonstrated in the video you shared above) for sharing objects in memory between R sessions

however the performance with even a moderate sized vector (rnorm(1e5)) is much worse than the redis method

i've written up a gist with some code to benchmark the methods

but basically to retrieve rnorm(1e5) from a redis key/value store takes on average about 6 milliseconds ... to pull the object from one R session to another with svSocket takes about 50 seconds (!)
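(The gist itself isn't reproduced here; a benchmark along these lines, assuming a local redis server and an svSocket server already running on port 8888 with `x` defined in its workspace, would look roughly like:)

```r
library(microbenchmark)
library(rredis)
library(svSocket)

x <- rnorm(1e5)

redisConnect()          # assumes redis-server running on localhost:6379
redisSet("x", x)        # rredis serializes the R object into redis

con <- socketConnection(host = "localhost", port = 8888, blocking = FALSE)

microbenchmark(
  redis    = redisGet("x"),
  svsocket = evalServer(con, x),   # assumes `x` exists in the server session
  times    = 10
)
```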

the package manual speaks to this:

> Although intially designed to server GUI clients, the R socket server can also be used to exchange data between separate R processes. The evalServer() function is particularly useful for this. Note, however, that R objects are serialized into a text (i.e., using dump()) format, currently. It means that the transfer of large object is not as efficient as, say Rserver (Rserver exchanges R objects in binary format, but Rserver is not stateful, clients do not share the same global workspace and it does not allow concurrent use of the command prompt).

i haven't dug too deep on the "Rserver" option but i think that might be referring to the Rserve package? or is this implemented in svSocket as an option that i've missed?

either way at this point i'm not convinced that this memory sharing concept is worth the overhead

@nsheff ... 👍 to move forward with redis ?


nsheff commented Jul 4, 2018

Interesting.

I don't suppose it would make sense to do the inverse? (move the smaller query files over and then perform the computation in the svSocket server that already has the big files loaded)?

Or, even better -- don't even read the uploaded file into the client session; just use evalServer() to load the file up in the 'server' session (the persistent one with the big data pre-loaded), then do runLOLA there.
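That could look roughly like the following (illustrative only; the upload path, the `userUniverse` object, and the server-side setup are assumptions, not LOLAweb code):

```r
library(svSocket)

con <- socketConnection(host = "localhost", port = 8888, blocking = FALSE)

## Read the (small) uploaded file directly in the persistent server
## session, where Core_hg19 and userUniverse are already in memory:
evalServer(con, userSets <- LOLA::readBed("uploads/query.bed"))

## Run the enrichment there too; only the (small) result table comes back.
res <- evalServer(con, LOLA::runLOLA(userSets, userUniverse, Core_hg19))
close(con)
```

This way neither the big database nor the intermediate objects ever cross the socket; only the final result does.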

It looks like you're right about the Rserve package... do you think that's not worth looking into?


nsheff commented Jul 5, 2018

One other thought: would this method dramatically reduce memory required by making it so all the little child processes didn't even need to load the database?

In any case, I think it's fine to just move forward with the redis method. I have two thoughts in that regard:

  1. I think this functionality should be put in simpleCache, not in LOLAweb, so it's more universally useful.
  2. is it possible that something like mongoDB could be used for this instead?
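A minimal sketch of the redis route with rredis (object names illustrative):

```r
library(rredis)
library(LOLA)

redisConnect()    # assumes a redis server on localhost:6379

## Cache a loaded regionDB once...
regionDB <- loadRegionDB("reference/Core/hg19")
redisSet("Core_hg19", regionDB)    # rredis serializes R objects

## ...then any other R process or container can pull it back:
regionDB2 <- redisGet("Core_hg19")
```

One caveat: a single redis string value maxes out at 512 MB, so a multi-GB regionDB would have to be split across keys (e.g., per collection) rather than stored as one object.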


nmagee commented Jul 5, 2018 via email


nsheff commented Jul 5, 2018

> Mongo actually has an even smaller limit for keys – 16MB I think, and not 512MB

Is the key size a problem? Or do you mean value size?

But good point about the memory thing.


nmagee commented Jul 5, 2018 via email
