failure to resume from chain file #53

jeremy-baier · 2024-02-23T05:03:14Z

Specifically with parallel tempering, I am getting failures to start sampling (both when resuming and when starting a new job) with the following error message:
  File "/home/baierj/miniconda3/envs/custom_noise/lib/python3.9/site-packages/PTMCMCSampler/PTMCMCSampler.py", line 303, in initialize
    raise Exception(
Exception: Old chain has 21 rows, which is not the initial sample plus a multiple of isave/thin = 100
I am using the most up-to-date master version of PTMCMCSampler installed from git.
Weirdly, I cannot replicate this error consistently. It just happens for some jobs but not for others.

kdolum · 2024-02-23T15:29:18Z

@jeremy-baier, Do you get this error even when you set resume=False or leave it unset? It's hard to understand how this can happen, because the message is printed in a block beginning if self.resume and .... If you can reproduce the problem, could you print the value of self.resume at the beginning of this block? Thanks.

jeremy-baier · 2024-05-28T04:05:09Z

Hi Ken,
I wanted to follow up on this. I still have been having this issue and cannot figure out why. I do not experience this with resume=False.
Can you help me understand exactly what this code block is trying to do? It seems to be making sure that the length of the saved chain file matches the length expected given particular values of isave and thin. But why is this important?
Thanks,
—jeremy

kdolum · 2024-05-28T14:43:48Z

Hi, Jeremy. If you're starting a new run, you should of course either say resume=False or start in an empty directory. Then presumably this won't happen. If you're actually resuming, you should set isave and thin to the same values as the run you are resuming. Then this shouldn't happen either, and if it does, we will have to debug it. One thing that would be useful is to look at the number of rows in the chain files before you resume and see if it corresponds to what it says in the error message.
One reason you might legitimately get this error is that your previous run crashed in the middle of writing out a block in the chain file, and so it is only partly written out. The previous code tried to edit your chain file in this case, but that seemed dangerous to me, so I raise an exception and you can edit the file yourself. But it does not seem to me like this is your problem.
The reason this code is there is that we don't know what settings were used for the previous run that we are resuming. It would be a mess, for example, to change thin before resuming. Then your file would have different samples representing different amounts of the actual MCMC run. So the code checks that the old chain file is consistent with having been run with the same settings.
Is there any possibility that more than one run could be using the same directory by mistake? That would naturally cause unpredictable results.
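
A minimal sketch of the kind of consistency check described above, not the sampler's actual code. The function name rows_consistent is hypothetical, and the values isave=1000 and thin=10 are assumptions chosen only so that isave/thin matches the 100 in the error message.

```python
def rows_consistent(n_rows: int, isave: int, thin: int) -> bool:
    """Return True if n_rows is the initial sample plus whole blocks of isave // thin rows.

    Hypothetical helper for illustration; not part of PTMCMCSampler.
    """
    block = isave // thin  # rows written per checkpoint
    return n_rows >= 1 and (n_rows - 1) % block == 0


# Assumed settings: isave=1000, thin=10, so each checkpoint adds 100 rows.
ok_fresh = rows_consistent(1, isave=1000, thin=10)      # only the initial sample
ok_resumed = rows_consistent(301, isave=1000, thin=10)  # initial sample + 3 blocks
bad_rows = rows_consistent(21, isave=1000, thin=10)     # the count from the error message
changed_thin = rows_consistent(301, isave=1000, thin=4) # same file, different thin

print(ok_fresh, ok_resumed, bad_rows, changed_thin)
```

The last call illustrates the point about changing thin before resuming: a file that was consistent under the old settings may no longer look consistent under the new ones.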

jeremy-baier · 2024-05-28T17:09:26Z

Thanks for the reply, Ken.
I can confirm that I am saving different runs to different directories, so there should not be any issues there. I have not been changing the values of isave and thin, so I don't think that is the cause either. In terms of crashing mid-run, I have just been using parallel tempering on an HPC cluster and using scancel to stop jobs. I am not sure if there is a nicer way to ask the sampler to stop.
After a little bit more digging, I think this might be related to PR #54 (https://github.com/nanograv/PTMCMCSampler/pulls).
I have been using hot chains and writing them out every time (I can confirm that they are being output in the directory). So I am still not sure why resuming is an issue.

kdolum · 2024-05-28T18:36:27Z

OK. I don't think it's #54, because that had to do with not writing hot chains. So let's try to find the bug. Could you start a run, then cancel it as you said, then look at the number of rows in all your chain files (e.g., with "wc -l")? If the number of rows is not one plus a multiple of 100, let me know and we'll try to understand how that occurs. If every file does indeed have the form 100n+1, try resuming and see if it works.
Just to check, you are asking for a total number of samples that is a multiple of 1000, right?
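
For anyone who prefers to do the "wc -l" check from Python, here is a rough sketch. The function names, the file pattern chain*.txt, and the block size of 100 are assumptions for illustration; adjust them to your own output directory and settings. The demo at the bottom runs on throwaway files, not real sampler output.

```python
import glob
import os
import tempfile


def count_rows(path):
    """Count the lines in a text file, like `wc -l`."""
    with open(path) as f:
        return sum(1 for _ in f)


def check_chain_files(directory, pattern="chain*.txt", block=100):
    """Map each matching file name to (rows, whether rows == block * n + 1)."""
    report = {}
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        rows = count_rows(path)
        report[os.path.basename(path)] = (rows, rows >= 1 and (rows - 1) % block == 0)
    return report


# Demo with made-up files: one of the form 100n + 1, one that is not.
with tempfile.TemporaryDirectory() as tmp:
    for name, rows in [("chain_a.txt", 101), ("chain_b.txt", 57)]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write("0.0 0.0\n" * rows)
    result = check_chain_files(tmp)

print(result)
```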

jeremy-baier · 2024-06-05T17:45:29Z

Ok Ken, I think I have tracked down what is going on.
The runs that are crashing are runs where the sampler only has the initial sample written to file. (That is, the sampler has not gotten far enough to checkpoint even once.) So when the sampler tries to resume, it loads the chain file back in as a 1d array rather than a 2d array (since only the initial sample has been written to file). This gets caught in the block you added because the resume length is no longer the number of rows in the chain; it incorrectly gets set to the number of parameters + 4.
(If I then comment out your block, this line breaks instead, at line 475 of PTMCMCSampler/PTMCMCSampler/PTMCMCSampler.py in commit 9811073:

    p0, lnlike0, lnprob0 = self.resumechain[0, :-4], self.resumechain[0, -3], self.resumechain[0, -4]

because the two-index lookup is not valid for an array that was loaded as 1d.)
I think this could be solved by checking the dimensionality of the chain when it gets loaded in. But let me know what you think makes the most sense for a fix.
I am a bit surprised that other people have not run into this issue before. Does this mean that my models are really slow getting started? (I checked, and in some cases it took about 40 minutes to reach the first checkpoint.) Either way, I would be happy to help put in a PR to fix this.
Let me know if my explanation is coherent and sounds right to you!
Thanks,
—jeremy
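
A small standalone reproduction of the behavior described above, using only NumPy rather than the sampler itself. The 17 parameters are an assumption, picked so that the row has 17 + 4 = 21 values, matching the 21 "rows" in the error message; the values themselves are made up.

```python
import io

import numpy as np

n_params = 17  # assumed, so that n_params + 4 == 21
single_row = " ".join(str(float(i)) for i in range(n_params + 4)) + "\n"

# A chain file that holds only the initial sample is a single line of text.
chain = np.loadtxt(io.StringIO(single_row))

# np.loadtxt squeezes the single row to a 1d array, so the apparent
# "number of rows" is really the number of columns.
apparent_rows = len(chain)
print(chain.shape, apparent_rows)

# Two-index lookup, as in the quoted line, fails on a 1d array.
try:
    chain[0, :-4]
    index_failed = False
except IndexError:
    index_failed = True
print(index_failed)
```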

jeremy-baier · 2024-06-05T17:50:39Z

This also explains why I was not able to consistently replicate the error. It was only happening for the jobs that were slow getting started.

kdolum · 2024-06-05T18:00:43Z

Thanks, Jeremy. Good catch! In my opinion Python is too willing to muddle the difference between different shapes of arrays with the same data. I think you can fix this by passing ndmin=2 to np.loadtxt. Please go ahead.
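
A quick sketch of what ndmin=2 changes, again with NumPy alone and made-up values (3 parameters plus 4 extra columns is an assumption for illustration). It is not the patch that was merged, only a demonstration of the np.loadtxt option suggested here.

```python
import io

import numpy as np

# One line of text: 3 hypothetical parameters followed by 4 extra columns.
single_row = "0.1 0.2 0.3 -10.0 -9.5 0.0 1.0\n"

# ndmin=2 keeps the result two-dimensional even when there is only one row.
chain = np.loadtxt(io.StringIO(single_row), ndmin=2)
n_rows = chain.shape[0]
print(chain.shape, n_rows)

# The same two-index pattern as the quoted line now works on a one-row file.
p0, lnlike0, lnprob0 = chain[0, :-4], chain[0, -3], chain[0, -4]
print(p0, lnlike0, lnprob0)
```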

jeremy-baier · 2024-06-05T18:14:01Z

That sounds like a good fix!

kdolum · 2024-06-07T22:58:34Z

Fixed by #55

jeremy-baier mentioned this issue Jun 5, 2024

fixing a chain resume issue #55

Merged

kdolum closed this as completed Jun 7, 2024
