
Reconstruction run for 256^3 grid #36

Open
modichirag opened this issue Feb 13, 2020 · 8 comments
Assignees: EiffL
Labels: help wanted, Mesh TensorFlow

Comments

@modichirag (Member) commented Feb 13, 2020

5-step PM runs with reconstruction on a 256^3 grid are failing, even on 2 nodes, i.e. 16 GPUs.

This is surprising given that the following configurations are working well enough:

  • 128^3 grid with 5-step PM is reconstructing on 1 GPU.
  • 256^3 grid with the LPT forward model is reconstructing on 1 GPU.
  • 256^3 grid forward model with 5-step PM is running on 1 GPU.

This seems to be a combination of OOM and communication issues. It effectively limits our reconstruction capability on GPUs, since going to a larger number of nodes is prohibitively slow in terms of lowering the graph, and the situation is likely to be worse for 512^3 and larger meshes.

Again, maybe this will not be the case with SIMD placement and/or on TPUs, but it would still be good to understand the issue in the first place.

Attached is the log generated by the run.

meshnbody.o443185.txt

@modichirag added the help wanted and Mesh TensorFlow labels on Feb 13, 2020
@EiffL (Member) commented Feb 13, 2020

Hum, which script exactly is returning this log?

@modichirag (Member, Author)

Right, should have mentioned that earlier.

The examples are in the recon branch, in the flowpm/examples folder. mesh_recon.py and pyramid_recon.py are the models for reconstruction with the single-mesh and pyramid-scheme forward models, respectively.

The corresponding job scripts, depending on the forward model of choice, are:

  • mesh_lptrecon_job.sh
  • mesh_nbodyrecon_job.sh
  • pyramid_lptrecon_job.sh
  • pyramid_nbodyrecon_job.sh

You can set the mesh configuration and the grid size in these job scripts.
Note that since the code saves a lot of diagnostic files and figures, you also need to set "fpath" to a location where you have write permissions; it is currently set to my scratch on Cori, so you will need to change that.

What I am finding is this:
LPT jobs run for all grid sizes up to 256^3, even on a single node; I haven't checked higher yet.
N-body jobs run for 64^3 and 128^3, but not for 256^3.

So, to answer your question:
either mesh_nbodyrecon_job.sh or pyramid_nbodyrecon_job.sh with nc=256 set in the job file should reproduce this error.

@modichirag (Member, Author)

Actually, one reason for this might be that I am reading in the data and/or lowering the entire grid on a single GPU to estimate the power spectrum for diagnostics. That is a huge memory overhead.
To check whether that is causing the failure, I tried running a forward model on the same 256^3 grid on a single GPU, lowering the entire meshes as well as generating the input with the non-parallel version of the code. That seemed to run fine.
But of course it's possible that when combined with the gradients, we are running out of memory.

In that case, well, we will need to resolve the I/O or access CPU memory to estimate power spectra and/or save the grid.
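
For reference, one shape that host-side diagnostic could take, sketched in plain NumPy on a field already fetched from the session (illustrative only, not the current flowpm code; the estimator is naive, the normalisation convention is an assumption, and field_tf / boxsize in the usage comment are placeholders):

```python
# Fetch the field from the session once, then estimate P(k) in plain NumPy so
# no extra grid-sized ops are added to the GPU graph. Naive estimator, with an
# assumed normalisation convention and no CIC deconvolution or shot noise.
import numpy as np

def naive_pk(field, boxsize):
    """Spherically averaged power spectrum of a real 3D overdensity field."""
    nc = field.shape[0]
    kf = 2.0 * np.pi / boxsize                    # fundamental mode
    delta_k = np.fft.fftn(field) / nc**3          # normalised Fourier coefficients
    pk3d = np.abs(delta_k)**2 * boxsize**3        # per-mode power, P(k) = V <|delta_k|^2>
    kx = np.fft.fftfreq(nc, d=1.0 / nc)           # integer wavenumbers along one axis
    kmag = np.sqrt(kx[:, None, None]**2 + kx[None, :, None]**2 + kx[None, None, :]**2)
    edges = np.arange(0.5, nc // 2 + 1)           # shells of width one fundamental mode
    which = np.digitize(kmag.ravel(), edges)
    counts = np.bincount(which, minlength=len(edges) + 1)[1:-1]
    sums = np.bincount(which, weights=pk3d.ravel(), minlength=len(edges) + 1)[1:-1]
    keep = counts > 0
    return np.arange(1, len(edges))[keep] * kf, (sums / np.maximum(counts, 1))[keep]

# field_np = sess.run(field_tf)   # fetched once for diagnostics (placeholder names)
# k, pk = naive_pk(field_np, boxsize=400.)
```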

@EiffL (Member) commented Feb 13, 2020

lol, yeah ok, so that's very likely to be at least part of the problem

@EiffL (Member) commented Feb 13, 2020

One way to check for that is to output, say, the reduce_sum of the volume, not the volume itself. It's dumb because it means you can't see the result, but at least it can check that we don't run into protocol buffer issues.
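
For concreteness, a minimal sketch of this idea on a toy 16^3 field (illustrative only: the real runs would build the field with the flowpm forward model and use a multi-GPU mesh implementation rather than single-device placement):

```python
# Toy Mesh TensorFlow graph: reduce the field to a scalar *inside* the mesh
# graph and export only that scalar, instead of exporting the full volume.
import mesh_tensorflow as mtf
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "debug_mesh")

nc = 16
nx, ny, nz = (mtf.Dimension(name, nc) for name in ("nx", "ny", "nz"))
field = mtf.random_uniform(mesh, mtf.Shape([nx, ny, nz]))  # stand-in for the density field

field_sum = mtf.reduce_sum(field)  # scalar mtf.Tensor

# Single-device placement just so this sketch runs anywhere.
mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
    shape="all:1", layout="nx:all", devices=[""])
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
sum_tf = lowering.export_to_tf_tensor(field_sum)

with tf.Session() as sess:
    print(sess.run(sum_tf))  # only a scalar is fetched; the volume is never exported
```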

@modichirag (Member, Author)

Hmpf. It doesn't seem like that is the case.

Check this file, which is an entirely distributed framework for reconstruction:
https://github.com/modichirag/flowpm/blob/recon/examples/test256recon.py
The corresponding script is test_job.sh in commit modichirag@787dd10.

The data here is essentially a new "linear field"; I did not bother to evolve it all the way since that is not the point.
Other than that, the entire graph is the same, and I am only lowering the chisq and the prior, which are the values after the reduce_sum operation.
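
For reference, a hedged sketch of that pattern, with illustrative names rather than the actual test256recon.py code:

```python
# The field and data stay distributed as mtf.Tensors; only scalar loss terms
# are exported through the lowering.
import mesh_tensorflow as mtf

def scalar_losses(model_field, data_field, sigma=1.0):
    """Return chisq and a placeholder prior as scalar mtf.Tensors.

    model_field, data_field : mtf.Tensor of shape [nc, nc, nc]
    sigma                   : assumed constant noise level (a simplification)
    """
    chisq = mtf.reduce_sum(mtf.square((model_field - data_field) / sigma))
    # The real prior acts on the Fourier modes of the linear field; a simple
    # quadratic term stands in for it here.
    prior = mtf.reduce_sum(mtf.square(model_field))
    return chisq, prior

# Usage sketch, following the same lowering pattern as above:
# chisq, prior = scalar_losses(model_field, data_field)
# lowering = mtf.Lowering(graph, {mesh: mesh_impl})
# chisq_tf = lowering.export_to_tf_tensor(chisq)
# prior_tf = lowering.export_to_tf_tensor(prior)
```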

Again, these are the same errors I was getting for forward evolution of a 1024^3 grid on fewer than 2 nodes, and they did not go away with reduce_sum either.

It would be good if you could at least confirm that you see the same errors, to rule out human error on my part, which is very likely.

@EiffL (Member) commented Feb 14, 2020

hummm ok I see, I'll try to investigate this

@EiffL self-assigned this on Feb 14, 2020
@EiffL (Member) commented May 27, 2021

I think this will be solved with the new backend.
