
Reconstruction run for 256^3 grid #36

Open
modichirag opened this issue Feb 13, 2020 · 8 comments
Assignees: EiffL
Labels: help wanted, Mesh TensorFlow

Comments

@modichirag (Member) commented Feb 13, 2020

5-step PM runs with reconstruction on a 256^3 grid are failing, even on 2 nodes, i.e. 16 GPUs.

This is surprising given that the following configurations are working well enough:

  • 128^3 grid with 5-step PM is reconstructing on 1 GPU.
  • 256^3 grid with the LPT forward model is reconstructing on 1 GPU.
  • 256^3 grid forward model with 5-step PM is running on 1 GPU.

This seems to be a combination of OOM and communication issues. It effectively limits our reconstruction capability on GPUs, since going to a larger number of nodes is prohibitively slow in terms of lowering the graph, and the situation is likely to be worse for 512^3 and larger meshes.

Again, maybe this will not be the case with SIMD placement and/or on TPUs, but it would still be good to understand the issue in the first place.

Attached is the log generated by the run.

meshnbody.o443185.txt

@modichirag added the help wanted and Mesh TensorFlow labels on Feb 13, 2020
@EiffL (Member) commented Feb 13, 2020

Hum, which script exactly is returning this log?

@modichirag (Member, Author)

Right, should have mentioned that earlier.

The examples are in the recon branch, in the flowpm/examples folder. mesh_recon.py and pyramid_recon.py are the models for reconstruction with the single-mesh and pyramid-scheme forward models, respectively.

The corresponding job scripts, depending on the forward model of choice, are:

  • mesh_lptrecon_job.sh
  • mesh_nbodyrecon_job.sh
  • pyramid_lptrecon_job.sh
  • pyramid_nbodyrecon_job.sh

You can set the mesh configuration and the grid size in these job scripts.
Note that since the code saves a lot of diagnostic files and figures, you also need to set "fpath" to a location where you have write permissions; it is currently set to my scratch on Cori, so you will need to change that.

What I am finding is this:
LPT jobs run for all grid sizes up to 256^3, even on a single node; I haven't checked higher yet.
N-body jobs run for 64^3 and 128^3, but not for 256^3.

So, to answer your question:
either mesh_nbodyrecon_job.sh or pyramid_nbodyrecon_job.sh with nc=256 set in the job file should reproduce this error.

@modichirag (Member, Author)

Actually, one reason for this might be that I am reading in the data and/or lowering the entire grid on a single GPU to estimate the power spectrum for diagnostics. That is a huge memory overhead.
To check whether that is causing the failure, I tried running a forward model on the same 256^3 grid on a single GPU, lowering the entire meshes as well as generating the input with the non-parallel version of the code. That seemed to run fine.
But of course it's possible that when combined with the gradients, we are running out of memory.

In that case, well, we will need to resolve the I/O or access CPU memory to estimate power spectra and/or save the grid.
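
For reference, one shape that host-side diagnostic could take, sketched in plain NumPy on a field already fetched from the session (illustrative only, not the current flowpm code; the estimator is naive, the normalisation convention is an assumption, and field_tf / boxsize in the usage comment are placeholders):

```python
# Fetch the field from the session once, then estimate P(k) in plain NumPy so
# no extra grid-sized ops are added to the GPU graph. Naive estimator, with an
# assumed normalisation convention and no CIC deconvolution or shot noise.
import numpy as np

def naive_pk(field, boxsize):
    """Spherically averaged power spectrum of a real 3D overdensity field."""
    nc = field.shape[0]
    kf = 2.0 * np.pi / boxsize                    # fundamental mode
    delta_k = np.fft.fftn(field) / nc**3          # normalised Fourier coefficients
    pk3d = np.abs(delta_k)**2 * boxsize**3        # per-mode power, P(k) = V <|delta_k|^2>
    kx = np.fft.fftfreq(nc, d=1.0 / nc)           # integer wavenumbers along one axis
    kmag = np.sqrt(kx[:, None, None]**2 + kx[None, :, None]**2 + kx[None, None, :]**2)
    edges = np.arange(0.5, nc // 2 + 1)           # shells of width one fundamental mode
    which = np.digitize(kmag.ravel(), edges)
    counts = np.bincount(which, minlength=len(edges) + 1)[1:-1]
    sums = np.bincount(which, weights=pk3d.ravel(), minlength=len(edges) + 1)[1:-1]
    keep = counts > 0
    return np.arange(1, len(edges))[keep] * kf, (sums / np.maximum(counts, 1))[keep]

# field_np = sess.run(field_tf)   # fetched once for diagnostics (placeholder names)
# k, pk = naive_pk(field_np, boxsize=400.)
```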

@EiffL (Member) commented Feb 13, 2020

lol, yeah ok, so that's very likely to be at least part of the problem

@EiffL (Member) commented Feb 13, 2020

One way to check for that is to output, say, the reduce_sum of the volume, not the volume itself. It's dumb because it means you can't see the result, but at least it can check that we don't run into protocol buffer issues.
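
For concreteness, a minimal sketch of this idea on a toy 16^3 field (illustrative only: the real runs would build the field with the flowpm forward model and use a multi-GPU mesh implementation rather than single-device placement):

```python
# Toy Mesh TensorFlow graph: reduce the field to a scalar *inside* the mesh
# graph and export only that scalar, instead of exporting the full volume.
import mesh_tensorflow as mtf
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "debug_mesh")

nc = 16
nx, ny, nz = (mtf.Dimension(name, nc) for name in ("nx", "ny", "nz"))
field = mtf.random_uniform(mesh, mtf.Shape([nx, ny, nz]))  # stand-in for the density field

field_sum = mtf.reduce_sum(field)  # scalar mtf.Tensor

# Single-device placement just so this sketch runs anywhere.
mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
    shape="all:1", layout="nx:all", devices=[""])
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
sum_tf = lowering.export_to_tf_tensor(field_sum)

with tf.Session() as sess:
    print(sess.run(sum_tf))  # only a scalar is fetched; the volume is never exported
```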

@modichirag (Member, Author)

Hmpf. It doesn't seem like that is the case.

Check this file, which is an entirely distributed framework for reconstruction:
https://github.com/modichirag/flowpm/blob/recon/examples/test256recon.py
The corresponding script is test_job.sh in commit modichirag@787dd10.

The data here is essentially a new "linear field"; I did not bother to evolve it all the way since that is not the point.
Other than that, the entire graph is the same, and I am only lowering the chisq and the prior, which are the values after the reduce_sum operation.
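
For reference, a hedged sketch of that pattern, with illustrative names rather than the actual test256recon.py code:

```python
# The field and data stay distributed as mtf.Tensors; only scalar loss terms
# are exported through the lowering.
import mesh_tensorflow as mtf

def scalar_losses(model_field, data_field, sigma=1.0):
    """Return chisq and a placeholder prior as scalar mtf.Tensors.

    model_field, data_field : mtf.Tensor of shape [nc, nc, nc]
    sigma                   : assumed constant noise level (a simplification)
    """
    chisq = mtf.reduce_sum(mtf.square((model_field - data_field) / sigma))
    # The real prior acts on the Fourier modes of the linear field; a simple
    # quadratic term stands in for it here.
    prior = mtf.reduce_sum(mtf.square(model_field))
    return chisq, prior

# Usage sketch, following the same lowering pattern as above:
# chisq, prior = scalar_losses(model_field, data_field)
# lowering = mtf.Lowering(graph, {mesh: mesh_impl})
# chisq_tf = lowering.export_to_tf_tensor(chisq)
# prior_tf = lowering.export_to_tf_tensor(prior)
```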

Again, these are the same errors I was getting for forward evolution of a 1024^3 grid on fewer than 2 nodes, and they did not go away with reduce_sum either.

It would be good if you could at least confirm that you see the same errors, to rule out human error on my part, which is very likely.

@EiffL (Member) commented Feb 14, 2020

hummm ok I see, I'll try to investigate this

@EiffL self-assigned this on Feb 14, 2020
@EiffL (Member) commented May 27, 2021

I think this will be solved with the new backend.
