
One "untip" job does not finish in time. #291

Open
JohnUrban opened this issue Oct 9, 2024 · 7 comments
Open

One "untip" job does not finish in time. #291

JohnUrban opened this issue Oct 9, 2024 · 7 comments

Comments

@JohnUrban

Hi again,

Thanks as always for the tool and help.

There is a single step 5 "untip" job that will not finish. Verkko has tried for up to 6 days. I just re-launched Verkko, instructing it to try 7 days (168 hours), and I also added more memory to the request in the faint hope that it helps. We will see.

That job will end next Wednesday, but I am not hopeful that it will finish. The problem is that the cluster does not allow requests longer than 7 days, so I am wondering if there is a way to further break this job down into two or more jobs. I tried telling it to request more CPUs, but that was a non-starter: Verkko complained about Rukki paths and quit.

I can give more details of course, but the big-picture question is whether there is a way around the 7-day time limit.

Many thanks,

John

@JohnUrban
Author

Well, it failed as expected.

For some reason, when I tried --utp-run 2 250 168 last week, it was failing to launch the job.

I wanted to ask for more CPUs, but I thought that was what was causing the silent failure.

Today (after the job failed), I tried --utp-run 4 360 168. At first it failed silently; the next time it worked. The apparent culprit was --snakeopts "--unlock": removing it is what allowed the job to launch the second time.

So, we will see if adding more CPUs and more memory helps it finish within 7 days.
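
For reference, a minimal sketch of the relaunch described above, with hypothetical input paths (only the --utp-run values and the removal of --snakeopts "--unlock" come from this thread; everything else is illustrative):

# Sketch only: the assembly directory and read files are placeholders.
# --utp-run <cpus> <mem-gb> <time-hours> raises the resources requested for
# the untip step; --snakeopts "--unlock" is deliberately NOT passed here,
# since it appeared to prevent the job from launching.
verkko -d asm_dir \
  --hifi hifi_reads.fastq.gz \
  --nano ont_reads.fastq.gz \
  --utp-run 4 360 168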

@skoren
Member

skoren commented Oct 16, 2024

What is the input data for the assembly?

@JohnUrban
Author

HiFi, UL-ONT, and Hi-C.

The genome is 300 Gb (600 Gb diploid). There may be an endosymbiont present with a similar genome size.

There is >300X HiFi and ~120X ONT coverage.

@skoren
Member

skoren commented Oct 17, 2024

I think with that much HiFi coverage you might be getting many more noise nodes than expected, overcomplicating the graph (e.g., recurrent errors creating nodes with realistic coverage). I'd suggest either downsampling to around 100x HiFi or increasing --unitig-abundance to something like 5 or even 10 (the default is 2) to see if that solves the issue.
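
As a rough sketch of those two options (the tool choice, file names, and sampling fraction below are illustrative assumptions, not part of this thread):

# Option 1: downsample the HiFi reads to roughly 100x, e.g. with seqkit;
# choose -p so that (current coverage * p) is about 100x.
seqkit sample -p 0.3 -s 11 hifi_reads.fastq.gz -o hifi_reads.downsampled.fastq.gz

# Option 2: keep all reads but raise the unitig abundance cutoff (default 2).
verkko -d asm_dir --hifi hifi_reads.fastq.gz --nano ont_reads.fastq.gz \
  --unitig-abundance 10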

@JohnUrban
Author

Thanks. Yes, this is a slightly unusual situation where two experimental labs generated data from the same sample, and both want their data included in the assembly. The HiFi coverage I quoted is from one lab; there is 500-600X from the two labs combined. I will try --unitig-abundance 10 for the runs with data from both, and I can also try downsampling. Thanks again.

@JohnUrban
Author

Note: at the moment I cannot say whether this issue is resolved, because another problem has come up in which SLURM is not behaving well with Verkko.

sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
Job 44564454 not known (yet) by slurm; will return 'running' by default.
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:head1:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
Job 44564455 not known (yet) by slurm; will return 'running' by default.
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:head1:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

Verkko worked fine on the same cluster in early September, so I am guessing something changed on the cluster side. That is to say, I don't think this is a Verkko issue; I have opened a ticket with the cluster administrators to resolve it.
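
In case it helps anyone seeing the same sacct errors, a few generic SLURM checks (standard SLURM commands, not specific to Verkko; the job ID is the one from the log above):

# Is the slurmctld controller responding?
scontrol ping
# Does slurmdbd (the accounting database behind sacct) answer queries?
sacctmgr show cluster
# Re-query the affected job once the database connection is restored.
sacct -j 44564454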

@JohnUrban
Author

Update on the recent SLURM issue: the SLURM back-end problem is now solved. The administrators made some changes last week, and certain environment variables were not updated to match, which caused the communication/connection errors seen above.

I will get back to you on --unitig-abundance results.
