
analyzing large datasets without LSF #9

Open
yxxue opened this issue Dec 1, 2015 · 2 comments

yxxue commented Dec 1, 2015

Hi,
Thanks for sharing the code. We have read your paper; excellent work. We hope to use your methods to analyze our metagenomic datasets, but we have run into some challenges:
We only have 5 metagenomic samples, but each of them is quite big (Illumina HiSeq, ~40 GB). I installed all the packages and ran the test data successfully.
At first I tried to run the analysis following the demo scripts; it seems to work and is still running, but it's really slow: the first step alone, 'create_hash', has taken 3 days.
I would like to use a parallel method like LSF, but our cluster doesn't support it, so we just run the program directly. Could you help me run our large dataset faster and more efficiently without LSF? (I think our cluster has enough CPUs, memory and storage for high-performance computing.)

brian-cleary (Owner) commented

Hi,

Does your cluster support any sort of distributed computing, with Grid Engine or some alternative perhaps? If so, you should be able to just change the job submission scripts to fit your environment.
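
(Purely for illustration, not the project's actual commands: translating an LSF submission line into a Grid Engine one generally looks something like this; the job name, core count, and submitted command below are placeholders.)

```bash
# LSF-style submission (placeholder job name, cores, and command):
bsub -J step_name -n 8 -o step_name.log "python some_step.py"

# Roughly the same thing under (Sun/Univa) Grid Engine; the parallel
# environment name ("smp" here) is site-specific, so check with your admins:
qsub -N step_name -pe smp 8 -o step_name.log -b y "python some_step.py"
```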

On the other hand, if you're only running on single instances, then I would stick with the same code as used in the test data (this is specifically designed to run on one machine). You'll change a few things to account for the difference in size from the test data: (1) make changes in params for hash size, cluster thresh, etc. according to the docs for running large data; (2) increase the number of cores used according to however many you have available on your machine.
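
(A rough sketch only, not a documented invocation: the script names are the ones mentioned later in this thread, and the argument order and values are assumptions meant only to show which knobs to turn; check the test-data and large-dataset docs for the real usage and recommended values.)

```bash
# Illustrative single-machine run. Names/arguments are assumptions, not the
# project's documented interface -- consult the docs for running large data.
THREADS=32          # scale to however many cores the machine has
KMER=33             # k-mer length (assumed value, as in the test run)
HASH_SIZE=31        # hash size parameter, increased for ~40 GB samples (assumed value)
CLUSTER_THRESH=0.8  # clustering threshold (assumed value)

bash HashCount.sh          "$THREADS" "$KMER" "$HASH_SIZE"
bash KmerSVDClustering.sh  "$THREADS" "$HASH_SIZE" "$CLUSTER_THRESH"
bash ReadPartitioning.sh   "$THREADS"
```

The point is simply that the thread count and the size-related parameters are the things to scale up; the large-dataset docs give the actual values to use.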

I hope this helps. Please let me know if you have any more questions!

yxxue commented Dec 3, 2015

Hi,
Thanks for your suggestions. Sorry, we don't have any alternatives, as not many people use the cluster.
I have already finished HashCount.sh and KmerSVDClustering.sh; after create_hash, the remaining steps ran fast, and now I'm running ReadPartitioning.sh.
I found that write_partition_parts.py needs a huge amount of storage: it has been running for 3 days and the tmp folder is already 1 TB. I'm not sure how much storage it will need; how can I estimate it? If it keeps running, I may have to kill the process, because we only have 500 GB of storage left. Or is there some way to reduce the size of the tmp folder?
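
(Not an LSA-specific answer, just a generic way to estimate it: log the size of the tmp folder at intervals and extrapolate the growth rate against the 500 GB that is left. The path below is a placeholder.)

```bash
#!/bin/bash
# Record the tmp folder size once an hour; the difference between two entries
# gives GB/hour, which can be projected against the remaining disk space.
TMP_DIR=./tmp   # placeholder -- point this at the actual tmp folder
while true; do
    size_gb=$(du -s --block-size=1G "$TMP_DIR" | cut -f1)
    printf '%s\t%s GB\n' "$(date +'%F %T')" "$size_gb" >> tmp_growth.log
    sleep 3600
done
```

Two entries a few hours apart give the growth rate; dividing the remaining 500 GB by that rate gives a rough deadline for either freeing space or stopping the job.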
