
failure to resume from chain file #53

Closed
jeremy-baier opened this issue Feb 23, 2024 · 10 comments

Comments

@jeremy-baier

Specifically with parallel tempering, I am getting failures to start sampling (both when resuming and when starting a new job), with the following error message:

File "/home/baierj/miniconda3/envs/custom_noise/lib/python3.9/site-packages/PTMCMCSampler/PTMCMCSampler.py", line 303, in initialize
    raise Exception(
Exception: Old chain has 21 rows, which is not the initial sample plus a multiple of isave/thin = 100

I am using the most up-to-date master version of PTMCMCSampler installed from git.
Weirdly, I cannot replicate this error consistently. It just happens for some jobs but not for others.

@kdolum
Collaborator

kdolum commented Feb 23, 2024

@jeremy-baier, do you get this error even when you set resume=False or leave it unset? It's hard to understand how this can happen, because the message is printed in a block beginning if self.resume and .... If you can reproduce the problem, could you print the value of self.resume at the beginning of this block? Thanks.

@jeremy-baier
Author

Hi Ken,
I wanted to follow up on this. I still have been having this issue and cannot figure out why. I do not experience this with resume=False.
Can you help me understand exactly what this code block is trying to do? It seems to be making sure that the saved chain length matches the expected length given particular values of isave and thin. But why is this important?
Thanks,
—jeremy

@kdolum
Collaborator

kdolum commented May 28, 2024

Hi, Jeremy. If you're starting a new run, you should of course either say resume=False or start in an empty directory. Then presumably this won't happen. If you're actually resuming, you should set isave and thin to the same values as the run you are resuming. Then this shouldn't happen either, and if it does, we will have to debug it. One thing that would be useful is to look at the number of rows in the chain files before you resume and see if it corresponds to what it says in the error message.
One reason you might legitimately get this error is that your previous run crashed in the middle of writing out a block in the chain file, and so it is only partly written out. The previous code tried to edit your chain file in this case, but that seemed dangerous to me, so I raise an exception and you can edit the file yourself. But it does not seem to me like this is your problem.
The reason this code is there is that we don't know what settings were used for the previous run that we are resuming. It would be a mess, for example, to change thin before resuming. Then your file would have different samples representing different amounts of the actual MCMC run. So the code checks that the old chain file is consistent with having been run with the same settings.
Is there any possibility that more than one run could be using the same directory by mistake? That would naturally cause unpredictable results.
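The length check described above can be sketched as follows (assumed logic for illustration, not the exact PTMCMCSampler source): with save interval isave and thinning factor thin, each checkpoint appends isave // thin rows on top of the single initial-sample row, so a valid chain always has 1 + n * (isave // thin) rows.

```python
# Sketch of the resume consistency check (assumed logic, not the exact
# PTMCMCSampler source). With the common settings isave=1000, thin=10,
# each checkpoint writes isave // thin = 100 rows after the initial sample.
def chain_length_is_consistent(nrows, isave=1000, thin=10):
    block = isave // thin  # rows appended per checkpoint; 100 here
    return nrows >= 1 and (nrows - 1) % block == 0

print(chain_length_is_consistent(1))    # True: only the initial sample
print(chain_length_is_consistent(201))  # True: initial sample + 2 checkpoints
print(chain_length_is_consistent(21))   # False: the case from the error message
```

A chain of 21 rows, as in the reported error, fails this check because 20 is not a multiple of 100.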

@jeremy-baier
Author

Thanks for the reply, Ken.
I can confirm that I am saving different runs to different directories, so there should not be any issues there. I have not been changing the values of isave and thin, so I don't think that is the case either. In terms of crashing mid-run, I have just been running parallel tempering on an HPC and using scancel to stop jobs. I am not sure if there is a nicer way to ask the sampler to stop.
After a little bit more digging, I think this might be related to PR #54 (https://github.com/nanograv/PTMCMCSampler/pull/54).
I have been using hot chains and writing them every time (I can confirm that they are being output in the directory). So I am still not sure why resuming is an issue.

@kdolum
Collaborator

kdolum commented May 28, 2024

OK. I don't think it's #54, because that had to do with not writing hot chains. So let's try to find the bug. Could you start a run, then cancel it as you said, then look at the number of rows in all your chain files (e.g., with "wc -l")? If the number of rows is not one plus a multiple of 100, let me know and we'll try to understand how that occurs. If every file does indeed have the form 100n+1, try resuming and see if it works.
Just to check: you are asking for a total number of samples that is a multiple of 1000, right?
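The "wc -l" check suggested above could also be scripted; here is a hypothetical helper that counts rows in each chain file and reports any whose length is not of the form 100n + 1 (the chain_*.txt naming pattern is an assumption about the output directory layout):

```python
from pathlib import Path

# Hypothetical helper mirroring the suggested "wc -l" check: count the rows
# in each chain file and report any whose length is not 1 plus a multiple
# of `block` (i.e. the initial sample plus whole checkpoint blocks).
# The chain_*.txt glob is an assumed naming convention, for illustration.
def find_bad_chain_files(outdir, block=100):
    bad = []
    for f in sorted(Path(outdir).glob("chain_*.txt")):
        with open(f) as fh:
            nrows = sum(1 for _ in fh)
        if (nrows - 1) % block != 0:
            bad.append((f.name, nrows))
    return bad
```

Running this on the output directory before resuming would show immediately whether a partially written block is present.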

@jeremy-baier
Author

jeremy-baier commented Jun 5, 2024

Ok Ken, I think I have tracked down what is going on.
The runs that are crashing are runs where the sampler has only the initial sample written to file. (That is, the sampler has not gotten far enough to checkpoint even once.) So when the sampler tries to resume, it loads the chain file back in as a 1-d array rather than a 2-d array, since there is only one row. This gets caught in the block you added, because resumeLength is no longer the number of rows in the chain; it incorrectly gets set to the number of columns, i.e. the number of parameters + 4.
(So then, if I comment out your block, this line breaks:

p0, lnlike0, lnprob0 = self.resumechain[0, :-4], self.resumechain[0, -3], self.resumechain[0, -4]

because the indexing assumes a 2-d array, but a 1-d array was loaded.)
I think this could be solved by checking the dimensionality of the chain when it gets loaded in. But let me know what you think makes the most sense for a fix.
I am a bit surprised that other people have not run into this issue before. Does this mean that my models are really slow to get started? (I checked, and it was about 40 minutes to reach the first checkpoint in some cases.) Either way, I would be happy to help put in a PR to fix this.
Let me know if my explanation is coherent and sounds right to you!
Thanks,
—jeremy
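The failure mode described above can be reproduced in isolation with NumPy alone (illustrative column values; a real chain row holds the sampled parameters plus four extra columns):

```python
import io
import numpy as np

# A chain file that only reached the initial sample contains a single row.
# Illustrative values: three "parameters" plus four extra columns.
one_row_chain = "0.1 0.2 0.3 -12.5 -10.0 0.0 0.0\n"

# np.loadtxt squeezes the single row down to a 1-d array...
chain = np.loadtxt(io.StringIO(one_row_chain))
print(chain.shape)  # (7,)

# ...so len(chain) is the column count (nparams + 4), not the row count,
# which trips the isave/thin length check; and 2-d indexing fails outright:
try:
    p0 = chain[0, :-4]
except IndexError:
    print("chain[0, :-4] raises IndexError on a 1-d array")
```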

@jeremy-baier
Author

This also explains why I was not able to consistently replicate the error. It was only happening for the jobs that were slow getting started.

@kdolum
Collaborator

kdolum commented Jun 5, 2024

Thanks, Jeremy. Good catch! In my opinion, Python is too willing to muddle the difference between arrays of different shapes holding the same data. I think you can fix this by passing ndmin=2 to np.loadtxt. Please go ahead.
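The suggested fix can be demonstrated directly: ndmin=2 makes np.loadtxt keep a single-row file as a 2-d array, so both the row count and the 2-d resume indexing behave (illustrative column values again; the real layout is nparams parameters plus four extra columns):

```python
import io
import numpy as np

# Same single-row "chain file" as before, with illustrative values.
one_row_chain = "0.1 0.2 0.3 -12.5 -10.0 0.0 0.0\n"

# ndmin=2 prevents the single row from being squeezed to 1-d:
chain = np.loadtxt(io.StringIO(one_row_chain), ndmin=2)
print(chain.shape)  # (1, 7): one row, seven columns

# The resume unpacking now works even when only the initial sample exists:
p0, lnlike0, lnprob0 = chain[0, :-4], chain[0, -3], chain[0, -4]
print(p0)  # the three parameter values
```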

@jeremy-baier
Author

That sounds like a good fix!

@kdolum
Collaborator

kdolum commented Jun 7, 2024

Fixed by #55

@kdolum kdolum closed this as completed Jun 7, 2024