sim_objects sometimes gets stuck #296

Open
zonca opened this issue Mar 6, 2025 · 7 comments

@zonca
Member

zonca commented Mar 6, 2025

I am using sim_objects in the PySM3 point source catalog implementation to simulate maps of point sources from a catalog, with a Gaussian beam applied directly to them.

I am hitting a bug where a run for 100k sources that should take a few seconds instead gets stuck while its memory usage keeps growing.

What is disconcerting is that this happens once every 3 or 4 runs.

@amaurea this is blocking for running point source simulations for Simons Observatory, could you please take a look?

The input dataset is in this Google Drive folder: https://drive.google.com/drive/folders/1BtbahZ8rkXswzBn1zJBZP3cdOvxRdTUR?usp=drive_link

Run script on Popeye

When I run the exact same script on a compute node on Popeye, sometimes it finishes in 7 s using less than 3 GB of RAM, and sometimes I have to kill it after 5 minutes (at 14.5 GB of RAM):

(cmb) [azonca@pcn-1-06 catalog (main)]$ bash run_debug_simobjects.sh 
Elapsed time: 0m 7.12s
Memory usage (GB): 2.88
Elapsed time: 5m 0.54s
Memory usage (GB): 14.25
Elapsed time: 0m 7.04s
Memory usage (GB): 2.88
Elapsed time: 0m 7.04s
Memory usage (GB): 2.88
Elapsed time: 5m 0.53s
Memory usage (GB): 14.35
Elapsed time: 0m 7.04s
Memory usage (GB): 2.88
Elapsed time: 5m 0.53s
Memory usage (GB): 14.25
Elapsed time: 5m 0.52s
Memory usage (GB): 14.34
Elapsed time: 0m 7.01s
Memory usage (GB): 2.88
Elapsed time: 5m 0.54s
Memory usage (GB): 14.57

The 2 scripts are here: https://gist.github.com/zonca/7b33648c21235f833aa3315099b65146
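
For readers who want to reproduce the measurement without the gist, here is a minimal sketch of a harness that prints the same two lines per run. It is not the content of the gist: `run_once`, the 300 s timeout, and the stand-in command are hypothetical, and it assumes Linux, where `ru_maxrss` is reported in kilobytes.

```python
import os
import signal
import subprocess
import sys
import threading
import time


def run_once(cmd, timeout_s):
    """Run cmd once and return (elapsed_seconds, peak_rss_gb, killed)."""
    start = time.monotonic()
    proc = subprocess.Popen(cmd)
    killed = threading.Event()

    def kill_child():
        try:
            os.kill(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            return  # the child had already exited
        killed.set()

    timer = threading.Timer(timeout_s, kill_child)
    timer.start()
    try:
        # os.wait4 reports the resource usage of this one child only.
        _, status, usage = os.wait4(proc.pid, 0)
    finally:
        timer.cancel()
    # Record the exit code so the Popen object knows the child was reaped.
    proc.returncode = os.waitstatus_to_exitcode(status)
    elapsed = time.monotonic() - start
    # Assumption: Linux, where ru_maxrss is in kilobytes (macOS uses bytes).
    peak_rss_gb = usage.ru_maxrss * 1024 / 1e9
    return elapsed, peak_rss_gb, killed.is_set()


if __name__ == "__main__":
    # Stand-in command; replace with e.g. [sys.executable, "debug_sim_objects.py"].
    cmd = [sys.executable, "-c", "pass"]
    for _ in range(3):
        elapsed, peak_rss_gb, killed = run_once(cmd, timeout_s=300)
        print(f"Elapsed time: {int(elapsed // 60)}m {elapsed % 60:.2f}s")
        print(f"Memory usage (GB): {peak_rss_gb:.2f}")
```

Measuring each run in its own child process keeps the peak memory of one run from contaminating the next, which matters when only some runs blow up.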

Run on Colab

I tried to reproduce this on Colab here:

https://colab.research.google.com/drive/19AxJROAGYU-PrKbxHFNb9sTl9PtDIE8L?usp=sharing

It seems to always work on the first execution of the notebook, but if it is executed again without restarting the kernel, most of the time it keeps increasing memory until it crashes.

@amaurea
Collaborator

amaurea commented Mar 7, 2025

Hm... I should look at this. In the meantime, how many threads are you using? Can you test with a few different numbers? This is controlled by OMP_NUM_THREADS.

That said, the symptoms you report sound most like a memory leak. If so it should be relatively simple to fix.
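
One way to try several thread counts is to set OMP_NUM_THREADS in the environment of a fresh child process for each run, so the value is in place before the OpenMP runtime starts. A minimal sketch; `run_with_threads` is a hypothetical helper, and the stand-in command only echoes the variable back rather than calling sim_objects.

```python
import os
import subprocess
import sys


def run_with_threads(cmd, num_threads):
    """Run cmd in a child process with OMP_NUM_THREADS set to num_threads."""
    env = dict(os.environ, OMP_NUM_THREADS=str(num_threads))
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


if __name__ == "__main__":
    # Stand-in command; replace with the real script to test it.
    echo_cmd = [
        sys.executable,
        "-c",
        "import os; print(os.environ['OMP_NUM_THREADS'])",
    ]
    for n in (1, 2, 4):
        result = run_with_threads(echo_cmd, n)
        print(f"requested {n} threads, child saw {result.stdout.strip()}")
```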

@zonca
Member Author

zonca commented Mar 7, 2025

Same issue with 1 thread:

export OMP_NUM_THREADS=1
[azonca@pcn-1-15 catalog (main)]$ bash run_debug_simobjects.sh 
Elapsed time: 1m 3.62s
Memory usage (GB): 2.88
Elapsed time: 1m 3.44s
Memory usage (GB): 2.88
Elapsed time: 5m 0.90s
Memory usage (GB): 14.45
Elapsed time: 5m 0.87s
Memory usage (GB): 14.42
Elapsed time: 1m 3.28s
Memory usage (GB): 2.88
Elapsed time: 1m 2.13s
Memory usage (GB): 2.88
Elapsed time: 1m 3.23s
Memory usage (GB): 2.88
Elapsed time: 1m 3.55s
Memory usage (GB): 2.88
Elapsed time: 1m 2.73s
Memory usage (GB): 2.88
Elapsed time: 1m 3.45s
Memory usage (GB): 2.88

@zonca
Member Author

zonca commented Mar 8, 2025

export OMP_NUM_THREADS=2
[azonca@pcn-1-15 catalog (main)]$ bash run_debug_simobjects.sh 
Elapsed time: 0m 36.74s
Memory usage (GB): 2.88
Elapsed time: 5m 0.87s
Memory usage (GB): 14.23
Elapsed time: 0m 36.76s
Memory usage (GB): 2.88
Elapsed time: 0m 36.33s
Memory usage (GB): 2.88
Elapsed time: 0m 36.17s
Memory usage (GB): 2.88
Elapsed time: 0m 36.66s
Memory usage (GB): 2.88
Elapsed time: 5m 0.87s
Memory usage (GB): 14.23
Elapsed time: 0m 36.62s
Memory usage (GB): 2.88
Elapsed time: 5m 0.87s
Memory usage (GB): 14.20
Elapsed time: 0m 36.52s
Memory usage (GB): 2.88
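
To compare thread counts, logs like the ones above can be tallied mechanically. A minimal sketch, assuming the two-line `Elapsed time` / `Memory usage (GB)` format shown in the logs; `parse_runs` and `count_stuck` are hypothetical helpers, and the 10 GB threshold for calling a run stuck is an arbitrary assumption chosen to separate the 2.88 GB runs from the ~14 GB ones.

```python
import re

# One run is an "Elapsed time" line followed by a "Memory usage (GB)" line.
RUN_PATTERN = re.compile(
    r"Elapsed time: (\d+)m ([\d.]+)s\s*\n\s*Memory usage \(GB\): ([\d.]+)"
)


def parse_runs(log_text):
    """Return a list of (elapsed_seconds, memory_gb) tuples, one per run."""
    return [
        (int(minutes) * 60 + float(seconds), float(memory_gb))
        for minutes, seconds, memory_gb in RUN_PATTERN.findall(log_text)
    ]


def count_stuck(runs, memory_threshold_gb=10.0):
    """Count runs whose reported memory exceeds the (assumed) threshold."""
    return sum(1 for _, memory_gb in runs if memory_gb > memory_threshold_gb)


# First three runs of the first Popeye log in this issue.
sample = """\
Elapsed time: 0m 7.12s
Memory usage (GB): 2.88
Elapsed time: 5m 0.54s
Memory usage (GB): 14.25
Elapsed time: 0m 7.04s
Memory usage (GB): 2.88
"""
runs = parse_runs(sample)
print(f"{count_stuck(runs)} of {len(runs)} runs stuck")  # prints "1 of 3 runs stuck"
```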

@cpvargas

cpvargas commented Mar 10, 2025

I tested this on two different clusters and could not reproduce your error. For example, on NERSC, using 30 cores, it takes around 12 seconds and consistently uses 2.88 GB of RAM.

I am using pixell 0.28.0 and numpy 1.26.4. I will try with another environment. It might be an issue with numpy >= 2.0, the latest pixell 0.28.3 version, or another package update.

cvargas@perlmutter:login17:/pscratch/sd/c/cvargas/sim_mss2_test> get_cores 30
salloc --nodes 1 --ntasks 30 --cpus-per-task 2 --qos shared_interactive --time 04:00:00 --constraint cpu
salloc: Pending job allocation 36711770
salloc: job 36711770 queued and waiting for resources
salloc: job 36711770 has been allocated resources
salloc: Granted job allocation 36711770
salloc: Waiting for resource configuration
salloc: Nodes nid200024 are ready for job
cvargas@nid200024:/pscratch/sd/c/cvargas/sim_mss2_test> module load python
(nersc-python) cvargas@nid200024:/pscratch/sd/c/cvargas/sim_mss2_test> conda activate pympi
(pympi) cvargas@nid200024:/pscratch/sd/c/cvargas/sim_mss2_test> ls
debug_sim_objects.py  run_debug_simobjects.sh  sim_objects_inputs.pkl
(pympi) cvargas@nid200024:/pscratch/sd/c/cvargas/sim_mss2_test> sh run_debug_simobjects.sh
Elapsed time: 0m 14.33s
Memory usage (GB): 2.88
Elapsed time: 0m 10.99s
Memory usage (GB): 2.88
Elapsed time: 0m 10.32s
Memory usage (GB): 2.88
Elapsed time: 0m 10.23s
Memory usage (GB): 2.88
Elapsed time: 0m 11.55s
Memory usage (GB): 2.88
Elapsed time: 0m 11.07s
Memory usage (GB): 2.88
Elapsed time: 0m 10.98s
Memory usage (GB): 2.88
Elapsed time: 0m 11.15s
Memory usage (GB): 2.88
Elapsed time: 0m 11.79s
Memory usage (GB): 2.88
Elapsed time: 0m 12.25s
Memory usage (GB): 2.89

@zonca
Member Author

zonca commented Mar 10, 2025

Thanks @cpvargas! I did not think of trying a different NumPy version.

It is possibly happening less often, but it still happens on Popeye:

bash run_debug_simobjects.sh 
Python version: Python 3.12.7
NumPy version: 1.26.4
Pixell version: 0.28.0
Elapsed time: 0m 7.23s
Memory usage (GB): 2.88
Elapsed time: 0m 7.10s
Memory usage (GB): 2.88
Elapsed time: 0m 7.17s
Memory usage (GB): 2.88
Elapsed time: 0m 9.03s
Memory usage (GB): 2.88
Elapsed time: 0m 7.11s
Memory usage (GB): 2.88
Elapsed time: 0m 7.12s
Memory usage (GB): 2.88
Elapsed time: 5m 0.54s
Memory usage (GB): 14.29
Elapsed time: 0m 7.24s
Memory usage (GB): 2.88
Elapsed time: 0m 7.26s
Memory usage (GB): 2.88
Elapsed time: 0m 7.48s
Memory usage (GB): 2.88
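
Since results seem to differ between environments, it helps to print the versions at the top of every log, as above. A minimal sketch of one way to do that from Python using `importlib.metadata`; `report_versions` is a hypothetical helper, not part of the gist, and its output format only approximates the header shown in the log.

```python
import platform
from importlib import metadata


def report_versions(packages=("numpy", "pixell")):
    """Return one line per component with the installed version."""
    lines = [f"Python version: {platform.python_version()}"]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        lines.append(f"{name} version: {version}")
    return lines


print("\n".join(report_versions()))
```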

@zonca
Member Author

zonca commented Mar 10, 2025

I also tried checking out 0.28.0 from GitHub and building it locally; it still shows the problem:

bash run_debug_simobjects.sh
Python version: Python 3.12.7
NumPy version: 1.26.4
Pixell version: 0.28.0
Elapsed time: 0m 7.17s
Memory usage (GB): 2.88
Elapsed time: 0m 7.04s
Memory usage (GB): 2.88
Elapsed time: 0m 7.06s
Memory usage (GB): 2.88
Elapsed time: 5m 0.52s
Memory usage (GB): 14.03
Elapsed time: 0m 7.04s
Memory usage (GB): 2.88
Elapsed time: 0m 7.08s
Memory usage (GB): 2.88
Elapsed time: 0m 7.08s
Memory usage (GB): 2.88
Elapsed time: 0m 7.05s
Memory usage (GB): 2.88
Elapsed time: 0m 7.08s
Memory usage (GB): 2.88
Elapsed time: 0m 7.00s
Memory usage (GB): 2.88

@cpvargas

I've tested in a clean environment (Python 3.10.16, NumPy 2.0.2, pixell 0.28.0) and observed intermittent high memory usage. Out of 20 runs, 2 consumed approximately 70 GB. I ran a second set of 20 runs and observed the same issue in one of those runs. This strongly points to an intermittent memory leak. I'll investigate whether the issue is related to map geometry and source position, and also test whether the problem still occurs with other Python and NumPy versions.
