Reconstruction run for 256^3 grid #36
Comments
hum, which script exactly is returning this log?
Right, I should have mentioned that earlier. The examples are in the recon branch, in the flowpm/examples folder. The corresponding scripts, depending on the forward model of choice, are:
You can set the mesh configuration and the grid size in the job scripts there. What I am finding is this - So, to answer your question -
Actually, one reason for this might be that I am reading in the data and/or lowering the entire grid on a single GPU to estimate the power spectrum for diagnostics. That is a huge memory overhead. In that case, we will need to resolve I/O or access CPU memory to estimate power spectra and/or save the grid.
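A minimal sketch of what that host-side diagnostic could look like, assuming the grid can be pulled back to CPU memory as a NumPy array. The function name `power_spectrum` and the flat radial-binning scheme are illustrative only, not flowpm's actual implementation, and the normalization is left raw:

```python
import numpy as np

def power_spectrum(field, n_bins=16):
    """Isotropic (shell-averaged) power spectrum of a cubic 3-D field.

    Runs entirely on the CPU, so the GPU never has to hold the full
    grid just for diagnostics. Returns bin-center wavenumbers (in
    integer grid units) and the mean |delta_k|^2 per shell.
    """
    n = field.shape[0]
    delta_k = np.fft.fftn(field) / field.size
    power = np.abs(delta_k) ** 2

    # Radial wavenumber magnitude for every FFT mode, in grid units.
    freqs = np.fft.fftfreq(n) * n
    kx, ky, kz = np.meshgrid(freqs, freqs, freqs, indexing="ij")
    k_mag = np.sqrt(kx ** 2 + ky ** 2 + kz ** 2)

    # Bin |delta_k|^2 in spherical shells, skipping the k=0 mean mode.
    bins = np.linspace(0.5, k_mag.max(), n_bins + 1)
    which = np.digitize(k_mag.ravel(), bins)
    flat_power = power.ravel()
    pk = np.array([
        flat_power[which == i].mean() if np.any(which == i) else 0.0
        for i in range(1, n_bins + 1)
    ])
    k_centers = 0.5 * (bins[:-1] + bins[1:])
    return k_centers, pk
```

With this, only the binned `(k, pk)` arrays (a few dozen floats) ever need to leave the node, rather than the full 256^3 volume.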
lol, yeah ok, so that's very likely to be at least part of the problem
One way to check for that is to output something like the reduce_sum of the volume, not the volume itself. It's dumb because it means you can't see the result, but at least it can check whether we run into protocol buffer issues.
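A tiny pure-NumPy stand-in for that debugging pattern; in the actual mesh run the same idea would be expressed by fetching a reduced scalar instead of the full tensor. The helper name `volume_checksums` is hypothetical:

```python
import numpy as np

def volume_checksums(volume):
    """Scalar summaries of a grid that are cheap to serialize.

    Fetching these instead of the full volume avoids shipping the
    whole grid through the runtime (large messages can hit protocol
    buffer size limits), while still detecting NaNs, blow-ups, or an
    all-zero result.
    """
    v = np.asarray(volume)
    return {
        "sum": float(v.sum()),
        "min": float(v.min()),
        "max": float(v.max()),
        "nan_count": int(np.isnan(v).sum()),
    }
```

The trade-off is exactly as stated above: you lose the actual field, but you keep a cheap signal that the graph executed and produced finite values.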
Hmpf. Doesn't seem like that is the case. Check this file, which is an entirely distributed framework for reconstruction. The data here is essentially a new "linear field"; I did not bother to evolve it all the way since that is not the point. Again, these were the same errors I was getting for forward evolution of a 1024^3 grid on < 2 nodes, and they did not go away with reduce_sum either. It would be good if you could at least confirm that you see the same errors, to check against human error on my part, which is very likely.
hummm ok I see, I'll try to investigate this
I think this will be solved with the new backend
5-step PM runs with reconstruction are failing, even on 2 nodes, i.e. 16 GPUs.
This is surprising given that the following configurations are working well enough -
This seems to be a combination of an OOM and a communication issue. This effectively limits our reconstruction capability on GPUs, since going to a larger number of nodes will be prohibitively slow in terms of lowering the graph, and the situation is likely to be worse for 512^3 and higher meshes.
Again, maybe it will not be so with SIMD placement and/or on TPUs, but it would still be good to understand this issue in the first place.
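For context on the OOM side, a back-of-the-envelope sketch of the raw grid footprint. This assumes float32 and counts only the stored field; real runs hold several copies (FFT workspace, gradients for reconstruction, halo exchanges), so these are lower bounds:

```python
def grid_bytes(n, dtype_bytes=4, n_copies=1):
    # Raw storage for one n^3 grid of dtype_bytes-wide elements.
    return n ** 3 * dtype_bytes * n_copies

MIB = 1024 ** 2

# 256^3 float32: a single copy is only 64 MiB, which suggests a
# failure at this size is about graph lowering / communication,
# not raw storage.
assert grid_bytes(256) // MIB == 64

# 512^3 is 512 MiB per copy and 1024^3 is 4 GiB per copy, so at
# those sizes even a handful of temporaries exhausts a single GPU.
assert grid_bytes(512) // MIB == 512
assert grid_bytes(1024) // MIB == 4096
```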
Attached is the log generated by the run.
meshnbody.o443185.txt