Create working script for Kullback-Leibler divergence #8
I have tested the qp version of KL divergence with the stacked N(z). For the individual metrics, we could compare the PIT values to a uniform distribution, just as we do with the KS/CvM/AD metrics. Do we have to worry about the PIT ~ 0.000 and PIT ~ 1.000 values for KLD?
Here is an example (again with 2 Gaussians) showing that the KL divergence, computed between a qp.PDF object created from the samples of PIT values and the uniform distribution, is quite high when the low and high PIT values are not trimmed, very similar to Anderson-Darling.
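For concreteness, a minimal standalone sketch of that comparison (it does not use qp, and the trimming thresholds of 0.01 and 0.99 are illustrative, not values anyone has agreed on):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kld_vs_uniform(pit, grid):
    """Discrete KL divergence between a KDE of the PIT samples and U(0, 1)."""
    p = gaussian_kde(pit)(grid)   # comparison distribution built from the samples
    p /= np.trapz(p, grid)        # renormalize on the evaluation grid
    q = np.ones_like(grid)        # uniform reference density on [0, 1]
    dx = grid[1] - grid[0]
    mask = (p > 0) & (q > 0)
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

rng = np.random.default_rng(42)
# mock PIT values with excess mass piled up at 0 and 1, as seen for outliers
pit = np.concatenate([rng.uniform(0.0, 1.0, 90000),
                      np.zeros(5000), np.ones(5000)])
grid = np.linspace(0.001, 0.999, 1000)

print("untrimmed:", kld_vs_uniform(pit, grid))
print("trimmed:  ", kld_vs_uniform(pit[(pit > 0.01) & (pit < 0.99)], grid))
```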
The KLD will not have a problem if the reference and comparison distributions never take zero probability and both integrate to unity over the integration range. However, it is most sensitive to deviations in the "tails," meaning wherever the probability of the comparison distribution is low (i.e. not necessarily at the edges of the integration range).
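As a sanity check on those two preconditions, assuming the reference and comparison PDFs have already been evaluated on a common grid (the function name here is hypothetical, not part of qp):

```python
import numpy as np

def check_kld_inputs(p, q, grid, tol=1e-3):
    """Verify both densities are strictly positive and integrate to ~1
    over the chosen integration range."""
    ok_positive = np.all(p > 0) and np.all(q > 0)
    ok_norm = (abs(np.trapz(p, grid) - 1.0) < tol and
               abs(np.trapz(q, grid) - 1.0) < tol)
    return ok_positive and ok_norm
```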
That's not what I'm seeing in the tests with both the mock and real data. There, if I use limits that go well outside the coverage of the PDF I get a fairly stable value, but if I cut within, yet near, where the PDF goes to zero I get very different values, and I'd worry about the stability of the measure given the large change in the statistic for small changes in the values of the limits. Here are examples from the notebooks that I was working on yesterday (apologies for the messiness, I was working in a noisy car dealership while my airbag clock spring was replaced).

So, my main question is: what limits should we use so that we get comparable values that emphasize differences between the codes and are not overly sensitive to the particular limit values we choose? From these notebooks, it seems like the answer is to use limits that span a wider range than the coverage of the PDF array.
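A small harness for fiddling with the limits along these lines (a sketch only, not the notebook code; the two-Gaussian parameters and the limit choices are illustrative):

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)
# mock "true" catalog: samples from a 2-Gaussian mixture
samples = np.concatenate([rng.normal(0.5, 0.1, 50000),
                          rng.normal(1.5, 0.3, 50000)])
kde = gaussian_kde(samples)  # comparison distribution reconstructed from samples

def truth(z):
    """Analytic 2-Gaussian mixture used to generate the samples."""
    return 0.5 * norm.pdf(z, 0.5, 0.1) + 0.5 * norm.pdf(z, 1.5, 0.3)

def kld(lo, hi, npts=1000):
    """Discrete KLD between the KDE and the truth over the limits (lo, hi)."""
    z = np.linspace(lo, hi, npts)
    dx = z[1] - z[0]
    p, q = kde(z), truth(z)
    mask = (p > 0) & (q > 0)  # skip points where either density underflows
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

for lo, hi in [(-5.0, 15.0), (0.0, 8.0), (0.1, 2.5), (0.2, 2.3)]:
    print(f"limits ({lo}, {hi}): KLD = {kld(lo, hi):.4f}")
```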
Hi Sam, I'm wondering why you cut off one of the distributions at z > 2 when the N(z) has a hard cut at 2? We know a code should fail that test. Effectively, with BPZ we should be using a prior that has zero probability at z > 2. However, I think we'd also expect that a sum of p(z)'s should fail to match N(z) anyway, for all the reasons Alex has shown, so I'm not sure how meaningful that test would be?
I was mainly responding to what Alex said above: that the test should be fine as long as we don't explore the regions where p(z)=0.0. When actually running the code, things seem fine in that regime, returning a stable answer in the 2 Gaussian mock data whether you integrate 0.0 to 8.0 or -5.0 to 15.0. This was counter to what I thought after reading Alex's comment, so I wanted to mention it.

On the other hand, when you start cutting at values at or near the extent of the "true" data, the value of the statistic varies rapidly. I put more numerical examples in the notebooks that I linked; it did look that way to me, and it is easy to check if someone else wants to run the notebook and fiddle with the limits themselves. Because the value changes rapidly right where we were talking about doing cuts, I was worried about how whatever cut we decide on would affect the values for the absolute metrics and, more importantly, the relative metric of comparing KLD values between the different codes, since these tails will be quite different code to code. If we choose to cut at 0.016 or 0.015, or 1.99 or 2.0 on the high-z end, how does that change the reported performance of the photo-z codes when we write up the paper? This is very similar to what I was seeing with the Anderson-Darling test. I could easily be missing something, since I've been working on this in fits and starts, so if someone has an obvious answer of "we should use these limiting values and here's why", then I'd be happy to hear it. I've never used KLD or AD in any work that I've done, so I have no practical experience with where to set the limits near the tails, which the metrics are very sensitive to.

For the other part of the question: BPZ uses a parameterized form for the prior, so there is no hard cutoff at z=2.0. Admittedly, I did set the grid for BPZ to test up to z=2.1 when I ran the code, so it does check slightly beyond the highest redshift object. I can re-run this and set it to not check above z=2.0 if we want to do that. Note that the lowest actual redshift in the sample is at z=0.016, not the 0.010 cut that we check for, I assume because of small numbers/area in the sample.

But, you may also notice that, because I used qp to construct the N(z) distribution for the spec-z objects and qp fits the "true" summed N(z) for the spec-z's as a Gaussian mixture model, the N(z) distribution also has non-zero probability above z=2.0, where the last Gaussian extends beyond the final data points. We could force the PDF to zero by resampling from using='samples' to using='gridded' and truncating the grid at z=2.0. If we think that this is a good idea, then we'll have to modify the implementation of how we check against the distribution for KLD. Note that this is not a problem for KS and CvM, as they are set up to evaluate the CDF only at the values of the spec-z's (it is related to the vmin and vmax cutoffs in the AD calculation, though).

In short: do we have a good idea of what values to use for the limits in the KLD test (and vmin/vmax in AD)? And are there any concerns that the metrics are sensitive to our choices of these parameters, particularly when we do relative comparisons between the codes?
For K-L in particular, we know the reference (true) distribution only runs to z=2, so the distributions compared to it should match that, or we know everything will fail just for that reason (e.g. we won't be able to get fair comparisons between training-based and template-based methods if one knows about the z=2 cutoff and the other does not). It's trivial to apply that effective prior in BPZ; one would just set the probability to 0 at z>2 and renormalize to ensure you have a properly-normalized PDF. Not checking at z>2 should also work, assuming it normalizes things for you. I would try to enforce z=2 cuts throughout to keep things under control.

I'd be fine discarding A-D with an explanation in the text that it behaves unstably due to the high weight given to values near the limits of the CDF, if that's what we're seeing. I do think we need to make sure (in this case again) that we are effectively comparing the two distributions within the same redshift limits, though; if one distribution is limited and the other is not, we'll surely derive a mismatch.
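A minimal sketch of that effective prior, assuming a gridded p(z) (the function name is hypothetical):

```python
import numpy as np

def apply_redshift_cutoff(zgrid, pz, zmax=2.0):
    """Set p(z) = 0 for z > zmax and renormalize over the remaining range."""
    pz = np.where(zgrid > zmax, 0.0, pz)
    return pz / np.trapz(pz, zgrid)
```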
Since KL is doing an integration, perhaps the sort of interval size you want for the integration (for which there are other rules of thumb) should guide the bandwidth?
Best,
Jeff
On Oct 5, 2017, at 6:32 PM, Sam Schmidt wrote:
In addition to the limits for the KL divergence, there is one other important factor: the bandwidth chosen by qp. For the implementations of KS, CvM, and AD, we compare a set of samples (the spec-z values or the PIT values) to a qp.PDF distribution or the uniform distribution. But the implementation of qp.utils.calculate_kl_divergence compares two qp.PDF objects, not samples to a distribution. The bandwidth is for the KDE used to fit a distribution to the N spec-z values. Currently, qp is set to use "Scott's rule", which for one-dimensional data sets the bandwidth factor to (number of data points)^(-0.2). For 100k objects this is 0.1; for 10^6 objects it is ~0.063. While this looks reasonable for BPZ, for example, it does not look like a good smoothing choice for a subset of Ibrahim's GPz data. I include figures for a set of 111k galaxies stacked for BPZ and ~100k from GPz (not the same data set, but qualitatively you'll see the point). Since KLD just evaluates the PDFs of the two distributions on a grid, the effect of the choice of bandwidth on the metric seems quite obvious.
So, my question: what should we do to choose the bandwidth for each code for the stacked N(z) KLD calculation?
[ibrahim10percent]<https://user-images.githubusercontent.com/11219330/31255556-4dffdd5a-a9e2-11e7-8ed8-ffced165f36a.jpg>
[bpznz]<https://user-images.githubusercontent.com/11219330/31255557-4e0b0e50-a9e2-11e7-942f-fdc33602ab4f.jpg>
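For reference, the Scott's-rule factors quoted above: for one-dimensional data the factor is n**(-1/5), and scipy.stats.gaussian_kde applies this rule by default (the factor scales the sample standard deviation to give the actual kernel width):

```python
import numpy as np

for n in (1e5, 1e6):
    print(f"n = {int(n):>7d}: Scott factor = {n ** (-0.2):.3f}")
# n =  100000: Scott factor = 0.100
# n = 1000000: Scott factor = 0.063
```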
As you can see here, in qp the KLD is not computed as an integral but in the sum form, simply evaluating sum(P_i * log(P_i/Q_i)), i.e. the discrete form of the KL divergence, so the integral is not involved. Honestly, I think Scott's rule is fairly good for most of our data (though I have only really looked at EAZY, BPZ, and GPz); Ibrahim's data does have galaxies with very small individual sigmas, which leads to more small-scale structure (some of which may accurately reflect the true distribution). So, our "rule of thumb" maybe should take some characteristic of the p(z) dataset into account, e.g. maybe we use something like the mean or median sigma of the individual p(z)'s?
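A direct transcription of that sum form (a hypothetical helper, not qp.utils.calculate_kl_divergence itself), where P and Q are the two PDFs evaluated on a common grid:

```python
import numpy as np

def discrete_kld(P, Q):
    """sum(P_i * log(P_i / Q_i)), masking points where either density is zero."""
    mask = (P > 0) & (Q > 0)
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```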
I don't think that really matters to my argument -- effectively the sum must basically be equivalent to an integral computed with a spacing comparable to the bandwidth. It would be interesting to understand why Ibrahim gets such small sigmas, though; presumably they are unrealistic...
True, there are two factors: the dx in the integral and the bandwidth used to construct the qp.PDF object; we need appropriate values for both parameters to get a good statistic.
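To make both factors explicit, here is a sketch of the dx-weighted form, which approximates the integral and should converge once the grid spacing is well below the smoothing scale of the two distributions (the example densities are arbitrary):

```python
import numpy as np
from scipy.stats import norm

def kld_integral(P, Q, zgrid):
    """Riemann-sum approximation of the integral form of the KLD."""
    dx = zgrid[1] - zgrid[0]
    mask = (P > 0) & (Q > 0)
    return np.sum(P[mask] * np.log(P[mask] / Q[mask])) * dx

for npts in (50, 200, 1000, 5000):
    z = np.linspace(-3.0, 5.0, npts)
    print(npts, kld_integral(norm.pdf(z, 0, 1), norm.pdf(z, 0.5, 1.2), z))
```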
Sorry for being late to this thread. I'm making another issue here for the matter of choosing an appropriate smoothing scale for turning samples into something we can evaluate, since it's distinct from this issue's goal of just implementing the KLD in a script, and it could affect pretty much every metric of p_i(z) or sum_i[p_i(z)] ~ n(z) (as opposed to p(CDF_i(z)) evaluated on a regular grid in probability space, like KS, CvM, and AD). (I also made it an issue for
Yes, I agree, Alex; thanks for creating the separate issue. Based on the tests that I did with two Gaussian samples, I think the actual implementation of the KLD looks correct for both the p(z) and N(z) stack, so I think that we can close this issue and just focus on the limits.
Has this script been finalized?
Has it been uploaded to the PZDC1paper repository?