
analyzing large datasets without LSF #9

Open
yxxue opened this issue Dec 1, 2015 · 2 comments

yxxue commented Dec 1, 2015

Hi,
Thanks for sharing the code. We have read your paper; excellent work. We hope to use your methods to analyze our metagenomic datasets, but we have run into some challenges:
We only have 5 metagenomic samples, but each of them is quite big (Illumina HiSeq, ~40 GB). I installed all the packages and ran the test data successfully.
At first I tried to run the analysis following the demo scripts; it seems to work and is still running, but it's really slow: the first step alone, 'create_hash', has taken 3 days.
I would like to use a parallel method like LSF, but our cluster doesn't support it, so we just run the program directly. Could you help me run our large dataset faster and more efficiently without LSF? (I think our cluster has enough CPUs, memory and storage for high-performance computing.)

brian-cleary (Owner) commented

Hi,

Does your cluster support any sort of distributed computing, with Grid Engine or some alternative perhaps? If so, you should be able to just change the job submission scripts to fit your environment.
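
(Purely for illustration, not the project's actual commands: translating an LSF submission line into a Grid Engine one generally looks something like this; the job name, core count, and submitted command below are placeholders.)

```bash
# LSF-style submission (placeholder job name, cores, and command):
bsub -J step_name -n 8 -o step_name.log "python some_step.py"

# Roughly the same thing under (Sun/Univa) Grid Engine; the parallel
# environment name ("smp" here) is site-specific, so check with your admins:
qsub -N step_name -pe smp 8 -o step_name.log -b y "python some_step.py"
```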

On the other hand, if you're only running on single instances, then I would stick with the same code as used in the test data (this is specifically designed to run on one machine). You'll change a few things to account for the difference in size from the test data: (1) make changes in params for hash size, cluster thresh, etc. according to the docs for running large data; (2) increase the number of cores used according to however many you have available on your machine.
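
(A rough sketch only, not a documented invocation: the script names are the ones mentioned later in this thread, and the argument order and values are assumptions meant only to show which knobs to turn; check the test-data and large-dataset docs for the real usage and recommended values.)

```bash
# Illustrative single-machine run. Names/arguments are assumptions, not the
# project's documented interface -- consult the docs for running large data.
THREADS=32          # scale to however many cores the machine has
KMER=33             # k-mer length (assumed value, as in the test run)
HASH_SIZE=31        # hash size parameter, increased for ~40 GB samples (assumed value)
CLUSTER_THRESH=0.8  # clustering threshold (assumed value)

bash HashCount.sh          "$THREADS" "$KMER" "$HASH_SIZE"
bash KmerSVDClustering.sh  "$THREADS" "$HASH_SIZE" "$CLUSTER_THRESH"
bash ReadPartitioning.sh   "$THREADS"
```

The point is simply that the thread count and the size-related parameters are the things to scale up; the large-dataset docs give the actual values to use.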

I hope this helps. Please let me know if you have any more questions!

yxxue commented Dec 3, 2015

Hi,
Thanks for your suggestions. Sorry, we don't have any alternatives, as not many people use the cluster.
I have already finished HashCount.sh and KmerSVDClustering.sh; after create_hash, the remaining steps ran fast, and now I'm running ReadPartitioning.sh.
I found that write_partition_parts.py needs a huge amount of storage: it has been running for 3 days and the tmp folder is already 1 TB. I'm not sure how much storage it will need; how can I estimate it? If it keeps running, I may have to kill the process, because we only have 500 GB of storage left. Or is there some way to reduce the size of the tmp folder?
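
(Not an LSA-specific answer, just a generic way to estimate it: log the size of the tmp folder at intervals and extrapolate the growth rate against the 500 GB that is left. The path below is a placeholder.)

```bash
#!/bin/bash
# Record the tmp folder size once an hour; the difference between two entries
# gives GB/hour, which can be projected against the remaining disk space.
TMP_DIR=./tmp   # placeholder -- point this at the actual tmp folder
while true; do
    size_gb=$(du -s --block-size=1G "$TMP_DIR" | cut -f1)
    printf '%s\t%s GB\n' "$(date +'%F %T')" "$size_gb" >> tmp_growth.log
    sleep 3600
done
```

Two entries a few hours apart give the growth rate; dividing the remaining 500 GB by that rate gives a rough deadline for either freeing space or stopping the job.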
