
One "untip" job does not finish in time. #291

Open
JohnUrban opened this issue Oct 9, 2024 · 7 comments
Open

One "untip" job does not finish in time. #291

JohnUrban opened this issue Oct 9, 2024 · 7 comments

Comments

@JohnUrban

Hi again,

Thanks as always for the tool and help.

There is a single step 5 "untip" job that will not finish. Verkko has tried for up to 6 days. I just re-launched Verkko, instructing it to try 7 days (168 hours), and I also added more memory to the request in the faint hope that it helps. We will see.

That job will end next Wednesday, but I am not hopeful that it will finish. The problem is that the cluster does not allow requests longer than 7 days, so I am wondering if there is a way to further break this job down into two or more jobs. I tried telling it to request more CPUs, but that was a non-starter: Verkko complained about Rukki paths and quit.

I can give more details of course, but the big-picture question is whether there is a way around the 7-day time limit.

Many thanks,

John

@JohnUrban
Author

Well, it failed as expected.

For some reason, when I tried --utp-run 2 250 168 last week, it was failing to launch the job.

I wanted to ask for more CPUs, but I thought that was what was causing the silent failure.

Today (after the job failed), I tried --utp-run 4 360 168. At first it failed silently; the next time it worked. The apparent culprit was --snakeopts "--unlock": removing it is what allowed the job to launch the second time.

So, we will see if adding more CPUs and more memory helps it finish within 7 days.
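
For reference, a minimal sketch of the relaunch described above, with hypothetical input paths (only the --utp-run values and the removal of --snakeopts "--unlock" come from this thread; everything else is illustrative):

# Sketch only: the assembly directory and read files are placeholders.
# --utp-run <cpus> <mem-gb> <time-hours> raises the resources requested for
# the untip step; --snakeopts "--unlock" is deliberately NOT passed here,
# since it appeared to prevent the job from launching.
verkko -d asm_dir \
  --hifi hifi_reads.fastq.gz \
  --nano ont_reads.fastq.gz \
  --utp-run 4 360 168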

@skoren
Member

skoren commented Oct 16, 2024

What is the input data for the assembly?

@JohnUrban
Author

HiFi, UL-ONT, and Hi-C.

The genome is 300 Gb (600 Gb diploid). There may be an endosymbiont present with a similar genome size.

There is >300X HiFi and ~120X ONT coverage.

@skoren
Member

skoren commented Oct 17, 2024

I think with that much HiFi coverage you might be getting many more noise nodes than expected, overcomplicating the graph (e.g., recurrent errors creating nodes with realistic coverage). I'd suggest either downsampling to around 100x HiFi or increasing --unitig-abundance to something like 5 or even 10 (the default is 2) to see if that solves the issue.
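
As a rough sketch of those two options (the tool choice, file names, and sampling fraction below are illustrative assumptions, not part of this thread):

# Option 1: downsample the HiFi reads to roughly 100x, e.g. with seqkit;
# choose -p so that (current coverage * p) is about 100x.
seqkit sample -p 0.3 -s 11 hifi_reads.fastq.gz -o hifi_reads.downsampled.fastq.gz

# Option 2: keep all reads but raise the unitig abundance cutoff (default 2).
verkko -d asm_dir --hifi hifi_reads.fastq.gz --nano ont_reads.fastq.gz \
  --unitig-abundance 10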

@JohnUrban
Author

Thanks. Yes, this is a slightly unusual situation where two experimental labs generated data from the same sample, and both want their data included in the assembly. The HiFi coverage I quoted is from one lab; there is 500-600X from the two labs combined. I will try --unitig-abundance 10 for the runs with data from both, and I can also try downsampling. Thanks again.

@JohnUrban
Author

Note: at the moment I cannot say whether this issue is resolved, because another problem has come up in which SLURM is not behaving well with Verkko.

sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
Job 44564454 not known (yet) by slurm; will return 'running' by default.
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:head1:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
Job 44564455 not known (yet) by slurm; will return 'running' by default.
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:head1:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

Verkko worked fine on the same cluster in early September, so I am guessing something changed on the cluster side. That is to say, I don't think this is a Verkko issue; I have opened a ticket with the cluster administrators to resolve it.
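
In case it helps anyone seeing the same sacct errors, a few generic SLURM checks (standard SLURM commands, not specific to Verkko; the job ID is the one from the log above):

# Is the slurmctld controller responding?
scontrol ping
# Does slurmdbd (the accounting database behind sacct) answer queries?
sacctmgr show cluster
# Re-query the affected job once the database connection is restored.
sacct -j 44564454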

@JohnUrban
Author

Update on the recent SLURM issue: the SLURM back-end problem is now solved. The administrators made some changes last week, and certain environment variables were not updated to match, which caused the communication/connection errors seen above.

I will get back to you on --unitig-abundance results.
