Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

equal p(z) bins for all sources in mock_data #404

Closed
cristobal-sifon opened this issue Jun 13, 2021 · 9 comments · Fixed by #570
Closed

equal p(z) bins for all sources in mock_data #404

cristobal-sifon opened this issue Jun 13, 2021 · 9 comments · Fixed by #570
Assignees
Labels
enhancement Tackle this someday but not immediately

Comments

@cristobal-sifon
Copy link
Collaborator

The current implementation with photoz_sigma_unscaled generates a pdf that is sampled differently for every galaxy. From _compute_photoz_pdfs:

for row in galaxy_catalog:
    pdf_range = row['pzsigma']*10.
    zmin, zmax = row['z']-pdf_range, row['z']+pdf_range
    zbins = np.arange(zmin, zmax, 0.03)

so the result is a column like:

               pzbins [121]              
-----------------------------------------
 -0.9189157769563053 .. 2.681084223043698
-0.8755947879144116 .. 2.7244052120855917
 -0.9013367845355461 .. 2.698663215464457
 -1.0683426683023802 .. 2.531657331697623
 -1.0260346754041412 .. 2.573965324595862
-0.8371474311387707 .. 2.7628525688612324
 -0.7837649451484183 .. 2.816235054851585
 -1.0040270867734482 .. 2.595972913226555
  -1.1684000482540333 .. 2.43159995174597
-0.7198019908543756 .. 2.8801980091456274
                                      ...
-1.3521789249221339 .. 2.2478210750778693
-0.9578447603266889 .. 2.6421552396733143
 -0.9215149025501451 .. 2.678485097449858
-0.9071426552874368 .. 2.6928573447125665
-1.0662076783207386 .. 2.5337923216792646
-1.2636985872843276 .. 2.3363014127156756
 -1.136257420102231 .. 2.4637425798977723
 -0.7774128187340363 .. 2.822587181265967
-1.0730126785251919 .. 2.5269873214748113
-0.9649808022321504 .. 2.6350191977678525
-1.3844228023509195 .. 2.2155771976490835

This is both very inconvenient, because the user then has to interpolate if they want to combine the pdfs, and a big waste of space.

I propose to change this for a single binning scheme that depends on (zsrc_min,zsrc_max) and a new optional argument zsrc_bin_width or similar, so that the binning can be stored as a separate attribute of GalaxyCluster, e.g., zsrc_pdf_centers or just zsrc_bins. See related suggestion in #401.

@johannct
Copy link

johannct commented Jun 13, 2021

If I understand well, the incriminated lines are

for row in galaxy_catalog:
and subsequent. Certainly one binning per source galaxy in this context is overkill. Another possible way to proceed is to avoid the for loop and get the min and max of the galaxy_catalog['z'] itself, and then extend these zmin and zmax by pdf_range, and proceed. The resulting binning will be unique and applicable to all the galaxies in the catalog (in general I think that this for loop should disappear, but that will demand a bit of careful broadcasting).

@cristobal-sifon
Copy link
Collaborator Author

Yes, we can replace that for loop by the following (which I haven't tested):

zbins = np.arange(zsrc_min, zsrc_max+zsrc_bin_width, zsrc_bin_width)
pzpdf_grid = np.exp(-(zbins[:,None]-galaxy_catalog['z'])/(2*pzsigma**2) / ((2*np.pi)**0.5*pzsigma)

or maybe the transpose of that. (And by the way, pzsigma does not need to be a column in galaxy_catalog.) We do need to be careful that this does not start to produce memory issues if the arrays are too large. So perhaps it could be done within a loop but in chunks of say 100 or 1000

@m-aguena
Copy link
Collaborator

As I mentioned here, this part of the code is not really being used for now.
@combet do you remeber who added that? I am not sure if there is a specific reason to have different binnings for each galaxy or if it was a naive implementation.

@johannct
Copy link

git blame say Matthew Kirby 2 years ago, for the for loop, and then some refactoring 8 months ago by Céline, Alex, and a couple others. There are many ways to improve the code here, but there is not much point in doing so before it is really used. And I do not think that "really using it" should wait for qp to enter the game. Quite the contrary actually IMHO.....

@combet
Copy link
Collaborator

combet commented Jun 17, 2021

I don't recall why we implemented it this way. I guess one question is what will happen in real life? Will photoz codes provide pdf on a single binning for all galaxies or not (it's not obvious to me). If this is the standard then I completely agree that this should be changed to something like @cristobal-sifon proposes.

@johannct
Copy link

I think that the code here should not care. You should be able to mock pdfs, ingest them in some TBD form by something like qp, and make sure that the code downstream does not depend on the way the pdf is represented. This is the whole point of qp. So I do not think that it matters what real life will be : here we need to abstract away from it. The only real question to my mind is how it is provided, which is a different question : e.g. inside an object catalog, or separately with some id to join to an object catalog?

@cristobal-sifon cristobal-sifon self-assigned this Jun 18, 2021
cristobal-sifon added a commit that referenced this issue Jun 18, 2021
@m-aguena
Copy link
Collaborator

m-aguena commented Jun 25, 2021

Given this week's tag up discussion, we decided to have a more abstract approach for storing the PDF in galcat. The idea is to have the PDF column as an object type and add a pdf attribute to the galaxy catalog to describe the use of this column. So we can have something like:

  • pdf_description={'type': 'shared_bins', 'info': bins} -> pdf_column: array of size=n bins
  • pdf_description={'type': 'unique_bins'} -> pdf_column: tuple with (pdf bins, pdf vals)
  • pdf_description={'type': 'quantiles', 'info': (0.68, 0.95, 0.98)} -> pdf_column: tuple with quantiles (68%, 95%, 99%)

right now we can start implementing this first case with simple modification of what is in branch issue/404/galcat_pdf_improvement

@eacharles
Copy link

FWIW, this is really pretty heavily overlapping with functionality in qp and in particular the construction of the qp.Ensemble, which represents a set of pdfs.
I.e.,

qp.Ensemble(qp.interp, data=dict(xvals=yvals, yvals=yvals)) # where xvals is an array of nPoints, yVals is an array of (nPoint, nPdfs)
qp.Ensemble(qp.quant, data=dict(quants=quants, locs=locs)) # where quants are the quants (i.e., 0.68, 0.95, 0.98) and locs are the corresponding locations

It would be good to chat a bit about this.

@marina-ricci
Copy link
Collaborator

This is linked to issue #401 (interfacing 'clmm and qp)

@marina-ricci marina-ricci added the enhancement Tackle this someday but not immediately label May 9, 2022
@m-aguena m-aguena linked a pull request Feb 9, 2023 that will close this issue
@m-aguena m-aguena removed a link to a pull request Feb 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Tackle this someday but not immediately
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants