Create working script for Kullback-Leibler divergence #8
I have tested the qp version of KL divergence with the stacked N(z). For the individual metrics, we could compare the PIT values to a uniform distribution, just as we do with the KS/CvM/AD metrics. Do we have to worry about the PIT ~ 0.000 and PIT ~ 1.000 values for KLD?
Here is an example (again with 2 Gaussians) showing that the KL divergence, computed between a qp.PDF object created from the samples of PIT values and the uniform distribution, is quite high when the low and high PIT values are not trimmed, very similar to Anderson-Darling.
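For concreteness, a minimal standalone sketch of that comparison (it does not use qp, and the trimming thresholds of 0.01 and 0.99 are illustrative, not values anyone has agreed on):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kld_vs_uniform(pit, grid):
    """Discrete KL divergence between a KDE of the PIT samples and U(0, 1)."""
    p = gaussian_kde(pit)(grid)   # comparison distribution built from the samples
    p /= np.trapz(p, grid)        # renormalize on the evaluation grid
    q = np.ones_like(grid)        # uniform reference density on [0, 1]
    dx = grid[1] - grid[0]
    mask = (p > 0) & (q > 0)
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

rng = np.random.default_rng(42)
# mock PIT values with excess mass piled up at 0 and 1, as seen for outliers
pit = np.concatenate([rng.uniform(0.0, 1.0, 90000),
                      np.zeros(5000), np.ones(5000)])
grid = np.linspace(0.001, 0.999, 1000)

print("untrimmed:", kld_vs_uniform(pit, grid))
print("trimmed:  ", kld_vs_uniform(pit[(pit > 0.01) & (pit < 0.99)], grid))
```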
The KLD will not have a problem if the reference and comparison distributions never take zero probability and both integrate to unity over the integration range. However, it is most sensitive to deviations in the "tails," meaning wherever the probability of the comparison distribution is low (i.e. not necessarily at the edges of the integration range).
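As a sanity check on those two preconditions, assuming the reference and comparison PDFs have already been evaluated on a common grid (the function name here is hypothetical, not part of qp):

```python
import numpy as np

def check_kld_inputs(p, q, grid, tol=1e-3):
    """Verify both densities are strictly positive and integrate to ~1
    over the chosen integration range."""
    ok_positive = np.all(p > 0) and np.all(q > 0)
    ok_norm = (abs(np.trapz(p, grid) - 1.0) < tol and
               abs(np.trapz(q, grid) - 1.0) < tol)
    return ok_positive and ok_norm
```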
That's not what I'm seeing in the tests with both the mock and real data. There, if I use limits that go well outside the coverage of the PDF I get a fairly stable value, but if I cut within, yet near, where the PDF goes to zero I get very different values, and I'd worry about the stability of the measure given the large change in the statistic for small changes in the values of the limits. Here are examples from the notebooks that I was working on yesterday (apologies for the messiness, I was working in a noisy car dealership while my airbag clock spring was replaced).

So, my main question is: what limits should we use so that we get comparable values that emphasize differences between the codes and are not overly sensitive to the particular limit values we choose? From these notebooks, it seems like the answer is to use limits that span a wider range than the coverage of the PDF array.
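A small harness for fiddling with the limits along these lines (a sketch only, not the notebook code; the two-Gaussian parameters and the limit choices are illustrative):

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)
# mock "true" catalog: samples from a 2-Gaussian mixture
samples = np.concatenate([rng.normal(0.5, 0.1, 50000),
                          rng.normal(1.5, 0.3, 50000)])
kde = gaussian_kde(samples)  # comparison distribution reconstructed from samples

def truth(z):
    """Analytic 2-Gaussian mixture used to generate the samples."""
    return 0.5 * norm.pdf(z, 0.5, 0.1) + 0.5 * norm.pdf(z, 1.5, 0.3)

def kld(lo, hi, npts=1000):
    """Discrete KLD between the KDE and the truth over the limits (lo, hi)."""
    z = np.linspace(lo, hi, npts)
    dx = z[1] - z[0]
    p, q = kde(z), truth(z)
    mask = (p > 0) & (q > 0)  # skip points where either density underflows
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

for lo, hi in [(-5.0, 15.0), (0.0, 8.0), (0.1, 2.5), (0.2, 2.3)]:
    print(f"limits ({lo}, {hi}): KLD = {kld(lo, hi):.4f}")
```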
Hi Sam, I'm wondering why you cut off one of the distributions at z > 2 when the N(z) has a hard cut at 2? We know a code should fail that test. Effectively, with BPZ we should be using a prior that has zero probability at z > 2. However, I think we'd also expect that a sum of p(z)'s should fail to match N(z) anyway, for all the reasons Alex has shown, so I'm not sure how meaningful that test would be?
I was mainly responding to what Alex said above: that the test should be fine as long as we don't explore the regions where p(z)=0.0. When actually running the code, things seem fine in that regime, returning a stable answer in the 2 Gaussian mock data whether you integrate 0.0 to 8.0 or -5.0 to 15.0. This was counter to what I thought after reading Alex's comment, so I wanted to mention it.

On the other hand, when you start cutting at values at or near the extent of the "true" data, the value of the statistic varies rapidly. I put more numerical examples in the notebooks that I linked; it did look that way to me, and it is easy to check if someone else wants to run the notebook and fiddle with the limits themselves. Because the value changes rapidly right where we were talking about doing cuts, I was worried about how whatever cut we decide on would affect the values for the absolute metrics and, more importantly, the relative metric of comparing KLD values between the different codes, since these tails will be quite different code to code. If we choose to cut at 0.016 or 0.015, or 1.99 or 2.0 on the high-z end, how does that change the reported performance of the photo-z codes when we write up the paper? This is very similar to what I was seeing with the Anderson-Darling test. I could easily be missing something, since I've been working on this in fits and starts, so if someone has an obvious answer of "we should use these limiting values and here's why", then I'd be happy to hear it. I've never used KLD or AD in any work that I've done, so I have no practical experience with where to set the limits near the tails, which the metrics are very sensitive to.

For the other part of the question: BPZ uses a parameterized form for the prior, so there is no hard cutoff at z=2.0. Admittedly, I did set the grid for BPZ to test up to z=2.1 when I ran the code, so it does check slightly beyond the highest redshift object. I can re-run this and set it to not check above z=2.0 if we want to do that. Note that the lowest actual redshift in the sample is at z=0.016, not the 0.010 cut that we check for, I assume because of small numbers/area in the sample.

But, you may also notice that, because I used qp to construct the N(z) distribution for the spec-z objects and qp fits the "true" summed N(z) for the spec-z's as a Gaussian mixture model, the N(z) distribution also has non-zero probability above z=2.0, where the last Gaussian extends beyond the final data points. We could force the PDF to zero by resampling from using='samples' to using='gridded' and truncating the grid at z=2.0. If we think that this is a good idea, then we'll have to modify the implementation of how we check against the distribution for KLD. Note that this is not a problem for KS and CvM, as they are set up to evaluate the CDF only at the values of the spec-z's (it is related to the vmin and vmax cutoffs in the AD calculation, though).

In short: do we have a good idea of what values to use for the limits in the KLD test (and vmin/vmax in AD)? And are there any concerns that the metrics are sensitive to our choices of these parameters, particularly when we do relative comparisons between the codes?
For K-L in particular, we know the reference (true) distribution only runs to z=2, so the distributions compared to it should match that, or we know everything will fail just for that reason (e.g. we won't be able to get fair comparisons between training-based and template-based methods if one knows about the z=2 cutoff and the other does not). It's trivial to apply that effective prior in BPZ; one would just set the probability to 0 at z>2 and renormalize to ensure you have a properly-normalized PDF. Not checking at z>2 should also work, assuming it normalizes things for you. I would try to enforce z=2 cuts throughout to keep things under control.

I'd be fine discarding A-D with an explanation in the text that it behaves unstably due to the high weight given to values near the limits of the CDF, if that's what we're seeing. I do think we need to make sure (in this case again) that we are effectively comparing the two distributions within the same redshift limits, though; if one distribution is limited and the other is not, we'll surely derive a mismatch.
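A minimal sketch of that effective prior, assuming a gridded p(z) (the function name is hypothetical):

```python
import numpy as np

def apply_redshift_cutoff(zgrid, pz, zmax=2.0):
    """Set p(z) = 0 for z > zmax and renormalize over the remaining range."""
    pz = np.where(zgrid > zmax, 0.0, pz)
    return pz / np.trapz(pz, zgrid)
```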
Since KL is doing an integration, perhaps the sort of interval size you want for the integration (for which there are other rules of thumb) should guide the bandwidth?
Best,
Jeff
On Oct 5, 2017, at 6:32 PM, Sam Schmidt wrote:
In addition to the limits for the KL divergence, there is one other important factor: the bandwidth chosen by qp. For the implementations of KS, CvM, and AD, we compare a set of samples (the spec-z values or the PIT values) to a qp.PDF distribution or the uniform distribution. But the implementation of qp.utils.calculate_kl_divergence compares two qp.PDF objects, not samples to a distribution. The bandwidth is for the KDE used to fit a distribution to the N spec-z values. Currently, qp is set to use "Scott's rule", which for one-dimensional data sets the bandwidth factor to (number of data points)^(-0.2). For 100k objects this is 0.1; for 10^6 objects it is ~0.063. While this looks reasonable for BPZ, for example, it does not look like a good smoothing choice for a subset of Ibrahim's GPz data. I include figures for a set of 111k galaxies stacked for BPZ and ~100k from GPz (not the same data set, but qualitatively you'll see the point). Since KLD just evaluates the PDFs of the two distributions on a grid, the effect of the choice of bandwidth on the metric seems quite obvious.
So, my question: what should we do to choose the bandwidth for each code for the stacked N(z) KLD calculation?
[ibrahim10percent]<https://user-images.githubusercontent.com/11219330/31255556-4dffdd5a-a9e2-11e7-8ed8-ffced165f36a.jpg>
[bpznz]<https://user-images.githubusercontent.com/11219330/31255557-4e0b0e50-a9e2-11e7-942f-fdc33602ab4f.jpg>
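For reference, the Scott's-rule factors quoted above: for one-dimensional data the factor is n**(-1/5), and scipy.stats.gaussian_kde applies this rule by default (the factor scales the sample standard deviation to give the actual kernel width):

```python
import numpy as np

for n in (1e5, 1e6):
    print(f"n = {int(n):>7d}: Scott factor = {n ** (-0.2):.3f}")
# n =  100000: Scott factor = 0.100
# n = 1000000: Scott factor = 0.063
```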
As you can see here, in qp the KLD is not computed as an integral but in the sum form, simply evaluating sum(P_i * log(P_i/Q_i)), i.e. the discrete form of the KL divergence, so the integral is not involved. Honestly, I think Scott's rule is fairly good for most of our data (though I have only really looked at EAZY, BPZ, and GPz); Ibrahim's data does have galaxies with very small individual sigmas, which leads to more small-scale structure (some of which may accurately reflect the true distribution). So, our "rule of thumb" maybe should take some characteristic of the p(z) dataset into account, e.g. maybe we use something like the mean or median sigma of the individual p(z)'s?
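A direct transcription of that sum form (a hypothetical helper, not qp.utils.calculate_kl_divergence itself), where P and Q are the two PDFs evaluated on a common grid:

```python
import numpy as np

def discrete_kld(P, Q):
    """sum(P_i * log(P_i / Q_i)), masking points where either density is zero."""
    mask = (P > 0) & (Q > 0)
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```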
I don't think that really matters to my argument -- effectively the sum must basically be equivalent to an integral computed with a spacing comparable to the bandwidth. It would be interesting to understand why Ibrahim gets such small sigmas, though; presumably they are unrealistic...
True, there are two factors: the dx in the integral and the bandwidth used to construct the qp.PDF object; we need appropriate values for both parameters to get a good statistic.
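To make both factors explicit, here is a sketch of the dx-weighted form, which approximates the integral and should converge once the grid spacing is well below the smoothing scale of the two distributions (the example densities are arbitrary):

```python
import numpy as np
from scipy.stats import norm

def kld_integral(P, Q, zgrid):
    """Riemann-sum approximation of the integral form of the KLD."""
    dx = zgrid[1] - zgrid[0]
    mask = (P > 0) & (Q > 0)
    return np.sum(P[mask] * np.log(P[mask] / Q[mask])) * dx

for npts in (50, 200, 1000, 5000):
    z = np.linspace(-3.0, 5.0, npts)
    print(npts, kld_integral(norm.pdf(z, 0, 1), norm.pdf(z, 0.5, 1.2), z))
```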
Sorry for being late to this thread. I'm making another issue here for the matter of choosing an appropriate smoothing scale for turning samples into something we can evaluate, since it's distinct from this issue's goal of just implementing the KLD in a script, and it could affect pretty much every metric of p_i(z) or sum_i[p_i(z)] ~ n(z) (as opposed to p(CDF_i(z)) evaluated on a regular grid in probability space, like KS, CvM, and AD). (I also made it an issue for
Yes, I agree, Alex; thanks for creating the separate issue. Based on the tests that I did with two Gaussian samples, I think the actual implementation of the KLD looks correct for both the p(z) and N(z) stack, so I think that we can close this issue and just focus on the limits.
Has this script been finalized?
Has it been uploaded to the PZDC1paper repository?