NaNs in Cl fix for PolyChord #231
Codecov Report
@@ Coverage Diff @@
## master #231 +/- ##
==========================================
- Coverage 87.88% 87.86% -0.02%
==========================================
Files 92 92
Lines 8335 8347 +12
==========================================
+ Hits 7325 7334 +9
- Misses 1010 1013 +3
|
@JesusTorrado, I think that (a) I've caught all of the relevant portions of the code and (b) that it needs to be checked and acted upon in each of these cases. In principle this could be pushed back into self.provider.check_nan_Cl(Cl) or something similar to avoid code repetition, although having just had a go at that it doesn't actually reduce the number of lines of code. |
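A minimal sketch of what such a shared helper could look like, written as standalone functions rather than as a method on `self.provider`. The names `has_nan_cl` and `loglike_or_logzero` are hypothetical and not part of Cobaya. The `Cl` container is assumed to be a dict mapping spectrum names to sequences of floats, and "logzero" is assumed to be `-inf`:

```python
import math


def has_nan_cl(cl):
    """Return True if any spectrum in the dict `cl` contains a NaN.

    `cl` is assumed to map spectrum names (e.g. "tt") to sequences of floats.
    """
    return any(math.isnan(x) for spectrum in cl.values() for x in spectrum)


def loglike_or_logzero(cl, loglike, logzero=-math.inf):
    """Evaluate `loglike(cl)` unless `cl` contains NaNs.

    A NaN point is treated as unphysical and mapped to `logzero`, so the
    underlying likelihood code never sees NaN inputs.
    """
    if has_nan_cl(cl):
        return logzero
    return loglike(cl)
```

For NumPy arrays, `np.isnan(array).any()` would be the vectorized equivalent of the inner check.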
The Travis CI timed out -- I'm not sure this is related to this update. |
I re-ran the failed one and it was OK. |
Why/from where is it giving NaN at all? Would be better to handle this at a higher level somewhere rather than in each likelihood. |
@williamjameshandley and @lukashergt, thanks a lot! I'll come back to this as soon as I am finished with #222 (1-2 working days). As Antony said, ideally clik would do this. Do you have particular configurations that produce this error, so that we can send them to Karim? But I think that this makes sense as a provisional solution, as opposed to having to tell people to reduce the size of the prior manually before running PolyChord. Have you checked how large the overhead is? (Not significant, I guess.) |
@cmbant -- these NaNs are very unusual. They only occur when sampling across the full prior range, and only ~O(10^-6) of the time. They're therefore quite hard to catch in debug mode, and are almost certainly from extreme corner cases. I can move the Cl checks further back, but in my view the Cl calculation is probably returning the 'right thing', and the correct thing to do is to catch it at an 'unphysical parameter point' level, i.e. at the loglikelihood. |
@JesusTorrado, I'll put some logging statements in my local code to output the parameter values if/when it catches them and get back to you. However, I don't think we would send these to Karim. The problem is that clik is undefined for NaN inputs. Ideally it could return an error if provided NaNs (and this wouldn't require specific parameter values), but equally you could say it's the user's responsibility not to hand nonsense to the likelihood code. The 'issue' fundamentally lies within CAMB for producing NaN Cls, although again, I'm not certain it's wrong to do so for such unreasonable inputs. I think that the correct place to handle this is at the likelihood/modelling level. |
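The kind of logging statement described here could be sketched as follows. The function name `report_nan_point`, the logger name, and the dict layouts for `params` and `cl` are assumptions made for illustration, not code from this PR:

```python
import logging
import math

log = logging.getLogger("nan_cl_debug")  # hypothetical logger name


def report_nan_point(params, cl):
    """Log the parameter point if any Cl spectrum contains a NaN.

    Returns the sorted names of the affected spectra (empty list if none),
    so the caller can decide whether to treat the point as unphysical.
    """
    bad = sorted(
        name
        for name, spectrum in cl.items()
        if any(math.isnan(x) for x in spectrum)
    )
    if bad:
        log.warning("NaN in Cl spectra %s at parameters %s", bad, params)
    return bad
```

Since the NaNs are reported to occur only rarely, writing these records to a file handler would let the offending parameter points be collected over a full run.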
In that case, it sounds like CAMB/CLASS (which one?) should detect a failed computation and not even run the likelihood, I think. |
In this case it was CAMB, but it would likely apply to both. If 'detect a failed computation' is equivalent to 'have calculated NaN Cls', where is the best place for CAMB to check this, and what exception should be thrown? |
Hard to say where it should be tested. One possibility would be to add to the |
For CAMB we could add a Collector "post" for Cl results, and raise an error there if any element of the array is NaN? |
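A rough sketch of that raise-and-catch pattern, independent of the actual Collector machinery. The exception class `NaNClError` and the functions `check_cl_result` and `logposterior` are hypothetical, and the Cl results are assumed to be a dict of sequences of floats:

```python
import math


class NaNClError(ValueError):
    """Hypothetical exception signalling a failed Cl computation."""


def check_cl_result(cl):
    """Post-processing check: raise if any Cl element is NaN."""
    for name, spectrum in cl.items():
        for index, value in enumerate(spectrum):
            if math.isnan(value):
                raise NaNClError(f"NaN in {name} at index {index}")
    return cl


def logposterior(compute_cl, loglike, params, logzero=-math.inf):
    """Run theory then likelihood, skipping the likelihood on failure."""
    try:
        cl = check_cl_result(compute_cl(params))
    except NaNClError:
        # The theory computation failed, so the likelihood is never called.
        return logzero
    return loglike(cl)
```

With this layout the check lives in one place on the theory side, and each likelihood can assume it only ever receives NaN-free spectra.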
Maybe the simplest solution is a check just before the clik call, when
the array containing Cl's and nuisance is constructed. Works for CAMB
and CLASS.
|
Just after this:
|
But this is not clik specific? |
There are actually 2 problems:
Ideally both should be fixed in their respective codes, not their Cobaya interface. Since that's not happening, let's fix both in the respective interfaces. (Even if fixing the first one fixes the second one, avoiding a segfault is worth a small amount of overhead.) Agreed? @williamjameshandley @lukashergt have you noticed this happening with CLASS too? |
Yes, I initially ran into this with Cobaya+PolyChord+CLASS+Planck runs. |
@williamjameshandley, if this fix is required for runs to work, that means it must be reproducible, so you could identify the underlying problematic models? |
Hi @williamjameshandley @lukashergt. Sorry for taking so long to come back to this!
Looking at the changes in this PR, they can be grouped as
About a), happy to include your proposed changes, but I would instead do the test in the
About b), the only possible source of segfaults I can think of would be CAMB's
Regarding the test itself, apparently |
|
I've had issues with CLASS in the past, any issues with CAMB I managed to fix. So I guess that question is for @williamjameshandley...? |
In the initial stages of nested sampling, when it is drawing samples deep in the tails of the posterior distribution, one can occasionally get NaN values for the Cls. When these are passed to the clik code, this can cause segfaults or other undefined behaviour. The safest thing to do is to catch them and return a likelihood of logzero as an 'unphysical' point, consistent with other situations in the code.
Credit to @lukashergt for writing and testing this code. I'm in the process of updating cobaya+polychord, and this will be the first in a sequence of pull requests.