Segmentation fault (11) with PolyChord+Class+lowEE+Omega_k #102
Noted, thanks! Will test very soon. Just to confirm, the point that you are testing is
right?
Yes, that is how I have interpreted the
Thanks, I'll give it a try. If you are unable to reproduce it, it does sound ugly: it may be due to some memory allocation persisting between clik calls. We'll see...
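One way to probe the hypothesis that some allocation persists between clik calls is to evaluate the same point before and after an unrelated one and compare the outputs. The following is a minimal sketch, assuming the likelihood (or the Boltzmann code) can be wrapped as a Python callable `f` that maps a parameter vector to an output vector. `f`, `point`, `other` and both helper functions are hypothetical names, not part of Cobaya or clik.

```python
import math
from typing import Callable, Sequence


def same_output(a: Sequence[float], b: Sequence[float]) -> bool:
    """Element-wise equality that also counts NaN vs NaN as a match."""
    if len(a) != len(b):
        return False
    return all(x == y or (math.isnan(x) and math.isnan(y)) for x, y in zip(a, b))


def depends_on_history(
    f: Callable[[Sequence[float]], Sequence[float]],
    point: Sequence[float],
    other: Sequence[float],
) -> bool:
    """Return True if f(point) changes after an intervening call f(other).

    A pure function always returns False; a True result suggests that
    state (e.g. a buffer) leaks from one call into the next.
    """
    first = f(point)
    f(other)  # unrelated call that might leave state behind
    second = f(point)
    return not same_output(first, second)
```

A single False result does not rule out state leakage that only shows up intermittently, so in practice this check would be repeated many times.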
output from cobaya showing
Wow! I ran the mini-script again and this time I got NaNs! The same script gave different output. When I (obviously) ran it yet again, the NaNs were gone. I ran it a few more times: mostly I get no NaNs, but once in a while I do. No idea why. To recapitulate:
Reduced output with NaNs:
2020-06-09 12:46:57,809 [model] Got input parameters: {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'N_ncdm': 1, 'N_ur': 2.0328, 'output': 'lCl pCl', 'lensing': 'yes', 'l_max_scalars': 29, 'A_planck': 0.9982462593219659}
2020-06-09 12:46:57,809 [classy] Got parameters {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544}
2020-06-09 12:46:57,809 [classy] Computing new state
2020-06-09 12:46:57,809 [classy] Setting parameters: {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'N_ncdm': 1, 'N_ur': 2.0328, 'output': 'lCl pCl', 'lensing': 'yes', 'l_max_scalars': 29}
2020-06-09 12:47:05,386 [planck_2018_lowl.ee] Got parameters {'A_planck': 0.9982462593219659}
2020-06-09 12:47:05,391 [planck_2018_lowl.ee] Computing new state
2020-06-09 12:47:05,391 [planck_2018_lowl.ee] Calling logp now
2020-06-09 12:47:05,391 [planck_2018_lowl.ee] Got cl = {'ee': array([ 0., 0., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan]), 'bb': array([ 0., 0., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan]), 'pp': array([0.00000000e+00, 0.00000000e+00, 1.17011902e-08, 3.56360975e-09,
1.47036184e-09, 7.22453571e-10, 3.98247801e-10, 2.38223791e-10,
1.51464579e-10, 1.00957867e-10, 6.98689585e-11, 4.91361840e-11,
3.60046913e-11, 2.69636667e-11, 2.05746303e-11, 1.59570984e-11,
1.25538757e-11, 1.00018945e-11, 8.05876718e-12, 6.55896153e-12,
5.38708729e-12, 4.46126736e-12, 3.72245994e-12, 3.12746369e-12,
2.64424851e-12, 2.24853374e-12, 1.92279547e-12, 1.65213682e-12,
1.42652861e-12, 1.23737431e-12]), 'ell': array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])}
2020-06-09 12:47:05,393 [planck_2018_lowl.ee] Call clik now with vector = [0. 0. nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
0.99824626]
[lukas-amd3950x:317622] *** Process received signal ***
[lukas-amd3950x:317622] Signal: Segmentation fault (11)
[lukas-amd3950x:317622] Signal code: Address not mapped (1)
[lukas-amd3950x:317622] Failing at address: 0x555aeb4a5ad0
[lukas-amd3950x:317622] [ 0] /usr/lib/libc.so.6(+0x3c3e0)[0x7f03e545a3e0]
[lukas-amd3950x:317622] [ 1] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(simall_lkl+0x18a)[0x7f038213d51e]
[lukas-amd3950x:317622] [ 2] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(lklbs_lkl+0x20c)[0x7f03820ea8c4]
[lukas-amd3950x:317622] [ 3] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(distribution_lkl+0x146)[0x7f03820fdac9]
[lukas-amd3950x:317622] [ 4] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(clik_compute+0x5b)[0x7f03820e941b]
[lukas-amd3950x:317622] [ 5] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/python/site-packages/clik/lkl.cpython-38-x86_64-linux-gnu.so(+0x9737)[0x7f03dcfdd737]
[lukas-amd3950x:317622] [ 6] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/python/site-packages/clik/lkl.cpython-38-x86_64-linux-gnu.so(+0x82fa)[0x7f03dcfdc2fa]
[lukas-amd3950x:317622] [ 7] /usr/lib/libpython3.8.so.1.0(_PyObject_MakeTpCall+0x45c)[0x7f03e570c18c]
[lukas-amd3950x:317622] [ 8] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x5108)[0x7f03e5707a78]
[lukas-amd3950x:317622] [ 9] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0xa22)[0x7f03e5701d72]
[lukas-amd3950x:317622] [10] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x19d)[0x7f03e571387d]
[lukas-amd3950x:317622] [11] /usr/lib/libpython3.8.so.1.0(+0x13e867)[0x7f03e5723867]
[lukas-amd3950x:317622] [12] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x324)[0x7f03e5726ee4]
[lukas-amd3950x:317622] [13] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2435)[0x7f03e5704da5]
[lukas-amd3950x:317622] [14] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x304)[0x7f03e5701654]
[lukas-amd3950x:317622] [15] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x19d)[0x7f03e571387d]
[lukas-amd3950x:317622] [16] /usr/lib/libpython3.8.so.1.0(+0x13e867)[0x7f03e5723867]
[lukas-amd3950x:317622] [17] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x324)[0x7f03e5726ee4]
[lukas-amd3950x:317622] [18] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2435)[0x7f03e5704da5]
[lukas-amd3950x:317622] [19] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x304)[0x7f03e5701654]
[lukas-amd3950x:317622] [20] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x19d)[0x7f03e571387d]
[lukas-amd3950x:317622] [21] /usr/lib/libpython3.8.so.1.0(+0x13e867)[0x7f03e5723867]
[lukas-amd3950x:317622] [22] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x324)[0x7f03e5726ee4]
[lukas-amd3950x:317622] [23] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2435)[0x7f03e5704da5]
[lukas-amd3950x:317622] [24] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x304)[0x7f03e5701654]
[lukas-amd3950x:317622] [25] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x19d)[0x7f03e571387d]
[lukas-amd3950x:317622] [26] /usr/lib/libpython3.8.so.1.0(+0x13e867)[0x7f03e5723867]
[lukas-amd3950x:317622] [27] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x324)[0x7f03e5726ee4]
[lukas-amd3950x:317622] [28] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2435)[0x7f03e5704da5]
[lukas-amd3950x:317622] [29] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0xa22)[0x7f03e5701d72]
[lukas-amd3950x:317622] *** End of error message ***
Segmentation fault (core dumped)
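The log shows the EE and BB spectra arriving as NaN from ell = 2 upwards and being passed straight into clik, immediately before the crash inside `simall_lkl`. A defensive option would be to reject such a point before the C call is made. This is a minimal sketch under that assumption, not Cobaya's actual code: `safe_loglike` is a hypothetical name, and `compute` stands in for whatever function wraps the clik call on the vector of Cls plus nuisance parameters.

```python
import math
from typing import Callable, Sequence


def safe_loglike(
    cl_vector: Sequence[float],
    compute: Callable[[Sequence[float]], float],
) -> float:
    """Return -inf for vectors with NaN/inf entries instead of passing them on.

    `compute` is only called when every entry of `cl_vector` is finite,
    so non-finite spectra never reach the (hypothetical) clik wrapper.
    """
    if not all(math.isfinite(x) for x in cl_vector):
        return -math.inf
    return compute(cl_vector)
```

Returning -inf would, in principle, let the sampler discard the point rather than abort the run. It would not explain why CLASS occasionally returns NaNs for this parameter set.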
I've found that I can prevent (or at least heavily reduce) this non-deterministic behaviour (sometimes producing NaNs and sometimes not) by suppressing all sorts of multi-threading, i.e. compiling CLASS without OpenMP and setting all of … to one. Before that, I got NaNs in about 10% of the cases when looping the script posted above. After compiling without OpenMP and setting those variables to one, I did not manage to get NaNs at all anymore.
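The roughly 10% failure rate quoted above comes from looping the script and counting runs with NaNs. A minimal sketch of such a loop follows. It assumes that `OMP_NUM_THREADS` is one of the variables meant (the exact list is not given in the thread) and that it is set before CLASS, and hence the OpenMP runtime, is loaded. `nan_rate` and `run` are hypothetical names; `run` would wrap one full evaluation returning the spectra.

```python
import math
import os
from typing import Callable, Sequence

# Assumption: OMP_NUM_THREADS is among the variables meant in the text.
# It only takes effect if set before the OpenMP runtime initialises,
# i.e. before importing classy.
os.environ.setdefault("OMP_NUM_THREADS", "1")


def nan_rate(run: Callable[[], Sequence[float]], n_trials: int = 100) -> float:
    """Fraction of n_trials calls to `run` whose output has a non-finite value."""
    failures = 0
    for _ in range(n_trials):
        if any(not math.isfinite(x) for x in run()):
            failures += 1
    return failures / n_trials
```

Comparing `nan_rate` with and without the threading restrictions gives a rough before/after figure. With a 10% rate, a few dozen trials are needed before "no NaNs" becomes meaningful.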
Hi @JesusTorrado,
Sorry to bring this up again, but I never quite managed to fix the remaining problem in #34. The original issue described there was related to a memory leak, which got mostly fixed. However, the segfault described later on persists. I've got a new computing setup and can now test these things locally with a GNU build, and I can say that this is not an Intel issue: I get this problem with both GNU and Intel. With this new setup I was also able to find out which rank/process causes the issue.
This error comes up when running Cobaya with PolyChord, Class, the lowEE likelihood and Omega_k varied. I never got it for flat LCDM, and I never got it when I excluded the lowEE likelihood. Below I'm posting the .yaml file I used, the error message and an overview of my computing setup. I am also attaching the complete output file including debug output.

I have tested the parameter set that seems to have caused the error (from rank 15). Running that parameter set directly with Class causes no errors. I've also tried fixing all of those parameters in the .yaml file (while sampling only a dummy variable) and again did not get any errors. This has me at a loss as to what might be happening.
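Pinning down which rank crashed, and at which point, is easier if each process records the point it is about to evaluate before entering the C code, since a segfault kills the process before any Python-level logging can run. Here is a minimal sketch using the standard-library `faulthandler` module. It assumes that the MPI launcher exports `OMPI_COMM_WORLD_RANK` (Open MPI) or `PMI_RANK` (MPICH-style). `mpi_rank`, `record_point` and the `last_point_rank*.json` file names are hypothetical.

```python
import faulthandler
import json
import os
import sys
from typing import Mapping

# Print the Python-level traceback of all threads to stderr on SIGSEGV.
faulthandler.enable(file=sys.stderr, all_threads=True)


def mpi_rank() -> int:
    """Best-effort rank lookup from launcher environment variables (assumed)."""
    for var in ("OMPI_COMM_WORLD_RANK", "PMI_RANK"):
        if var in os.environ:
            return int(os.environ[var])
    return 0


def record_point(params: Mapping[str, float], directory: str = ".") -> str:
    """Write the point about to be evaluated to a per-rank file.

    The file is flushed and fsync'ed so the record survives if the next
    (C-level) call crashes the process. Returns the file path.
    """
    path = os.path.join(directory, f"last_point_rank{mpi_rank()}.json")
    with open(path, "w") as fh:
        json.dump(dict(params), fh)
        fh.flush()
        os.fsync(fh.fileno())
    return path
```

After a crash, the file for the failing rank holds the last parameter set it was given. As noted above, re-running that point in isolation may still not reproduce the failure if it depends on state from earlier calls.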
Full output with debug: test_EE_omegak_3d_debug_witherr.log
Example .yaml file (reduced to 3 varying parameters):
Error message:
My setup: