possible bug in td_quantile_of #4
I'm less convinced that this one is a bug; I think this interpolation is pretty sane. Let's work through a small example. For shorthand, a centroid is a (mean, count) pair. Say our merged centroids are:

The result of this definitely needs to be between .7 and .9. In the current code we'll do:

If we merged all of the central centroids as you suggest, we'd get:
Somewhat in the same area: I think the logic you pointed to is buggy. If the query is exactly the left-most or right-most point and there is only one centroid at that point, then it might be reasonable to return 0 or 1. It all sort of depends on definitions, I suppose.
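As a rough sketch of the interpolation under discussion (the names `centroid_t` and `quantile_of_between` are hypothetical, not this repository's actual code):

```c
#include <stdio.h>

/* hypothetical centroid: an average value plus how much weight it holds */
typedef struct {
    double mean;
    double count;
} centroid_t;

/*
 * Linearly interpolate the quantile of `val`, which falls strictly
 * between the means of `left` and `right`. `before` is the total count
 * of everything preceding `left`; `total` is the digest's total count.
 * Half of each neighbor's count is assumed to lie on the far side of
 * its mean, which is where the count / 2 terms come from.
 */
static double quantile_of_between(centroid_t left, centroid_t right,
                                  double before, double total, double val)
{
    double wl = before + left.count / 2.0;            /* weight at left.mean */
    double dw = left.count / 2.0 + right.count / 2.0; /* weight between the means */
    double frac = (val - left.mean) / (right.mean - left.mean);
    return (wl + frac * dw) / total;
}

int main(void)
{
    centroid_t left = {10.0, 2.0}, right = {20.0, 2.0};
    /* with 6 points before `left` out of 10 total, the means sit at
       quantiles .7 and .9, so any value between them must interpolate
       to a quantile between .7 and .9 (here: 0.800) */
    printf("%.3f\n", quantile_of_between(left, right, 6.0, 10.0, 15.0));
    return 0;
}
```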
Again, I do appreciate the interest. The logic to merge centroids of the same value is missing in my Go implementation and I'll add it. Thanks for bringing it up. I did go ahead and hook up the testing suite from the Go library to this one.
What seems strange to me is that this depends on the order of the centroids, and I don't think there's any guarantee they will all have the same count value. So maybe you'll find them like this:
or maybe
and this will affect the result. IIRC the paper does say the ordering should be deterministic even for centroids with the same mean.
That makes sense. I now see that what I was trying to do with the summing wouldn't make sense for the left centroid in quantile_of, and the current "just pick whichever one" behavior doesn't work either. Having multiple centroids of the same value is a generally meaningful occurrence in a tdigest: it implies that the data really does refer to the same data point. We could likely contrive a dataset that ends up with many centroids of the same mean value that didn't correspond to a frequent actual value, but I suspect that doesn't happen naturally. It would be a shame to throw that information away.
I think what makes the example fail is that all the centroids have count 1, which is the one case where you know exactly where the inputs were. So in that case it does not make much sense to subtract 1/2 the count (even with the first formula), because that's based on the assumption that half the points are on the left and half on the right. But when there's just one point, it sits at the mean itself. So maybe it shouldn't be subtracted, and we should only use the other centroid for the approximation? IMHO that's essentially what Fig. 4 on page 15 in the paper describes, except that we have multiple such centroids with the same mean, so the step will just be larger.
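A minimal sketch of that "don't split a singleton" idea, assuming integer counts; the helper name is hypothetical and not from either repository:

```c
/*
 * Effective weight assumed to sit on one side of a centroid's mean.
 * For count > 1, half the points are assumed to lie on each side; a
 * singleton's lone point is at the mean itself, so nothing is split.
 * Hypothetical helper for illustration only.
 */
static double one_sided_weight(double count)
{
    return (count <= 1.0) ? 0.0 : count / 2.0;
}
```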
FWIW I do agree throwing away the extra information about centroids with count=1 would be a shame (it's likely somewhat rare, though).
That's another great point. If we know we have only one point, then our interpolation function could take advantage of that. This assumption will only work if you know you're adding data with integer counts. One implementation had such an optimization but ripped it out to deal with non-integer count values. If you know you're dealing with just integers, something like that seems reasonable. I do think just taking the minimum count of the centroids with the same value on either side is going to go a long way.
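As a rough illustration of the minimum-count idea (hypothetical names, with a centroid struct like the one sketched earlier):

```c
typedef struct {
    double mean;
    double count;
} centroid_t;

/*
 * Hypothetical helper: given sorted centroids and an index i, return
 * the smallest count among the run of centroids sharing cs[i].mean.
 * The proposal above is to use this as the interpolation weight at a
 * boundary instead of summing the whole run.
 */
static double min_count_at_mean(const centroid_t *cs, int n, int i)
{
    double m = cs[i].mean;
    double min = cs[i].count;

    /* scan the run of equal-mean neighbors in both directions */
    for (int j = i - 1; j >= 0 && cs[j].mean == m; j--)
        if (cs[j].count < min)
            min = cs[j].count;
    for (int j = i + 1; j < n && cs[j].mean == m; j++)
        if (cs[j].count < min)
            min = cs[j].count;

    return min;
}
```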
Not sure - it does not seem particularly principled. For example, assume you have
Using the minimum count (i.e. 10) essentially means you'll ignore the other centroid, representing a much larger chunk of data. That seems strange. Also, imagine you have two digests
that essentially contain the same amount of information (at least it seems to me). Why would you use 10 for the first one, but 20 for the second one? So I'm not quite convinced using the minimum (or maximum) is better than just aggregating all the centroids as I proposed earlier, with the extra handling of centroids with count=1. Of course, if the count is not stored as an integer, that may not be possible :-(
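For contrast with the minimum-count helper above, the aggregation alternative would sum the whole run of equal-mean centroids (reusing the hypothetical `centroid_t` from the earlier sketch):

```c
/*
 * Hypothetical helper: total count of the run of centroids sharing
 * cs[i].mean, i.e. treat the run as one big centroid. This is the
 * "aggregate all of them" approach proposed in the discussion.
 */
static double total_count_at_mean(const centroid_t *cs, int n, int i)
{
    double m = cs[i].mean;
    double sum = cs[i].count;

    for (int j = i - 1; j >= 0 && cs[j].mean == m; j--)
        sum += cs[j].count;
    for (int j = i + 1; j < n && cs[j].mean == m; j++)
        sum += cs[j].count;

    return sum;
}
```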
An argument I was trying to make above is that having two centroids with the same value is rare and carries information. Try picking some value, then try to come up with a stream of data that does not contain that value (or at least does not commonly contain it) but still produces multiple centroids whose mean is exactly that value. I think it's hard. Using the minimum says that we believe all other instances of a centroid which shares a mean can be thought of as definitely less than the query point.
I don't think having multiple centroids with the same mean carries information, unless the count is 1.
I'm not sure it's all that rare, TBH. Say we're dealing with response times, rounded to milliseconds, and most requests are handled within 1 second. Chances are, most requests are handled in a fairly narrow interval, say 50 - 150 ms, so the vast majority of the data will contain those ~100 distinct durations. I think it's quite likely we'll get multiple centroids with the same mean in this case (particularly for larger digests). Sure, this example may be a bit artificial, but it does not seem all that far-fetched ...
I don't think it's a far-fetched or artificial example at all, but I do think it would be sound and well served by the behavior I'm suggesting. First of all, imagine you do this uniformly. Think about the likelihood that you'll ever merge two centroids to the same exact floating-point mean if the contained points didn't all correspond to the same input value. If they do correspond to the same exact point, then taking the min is principled. I encourage you to actually generate data for the examples you're thinking of. It's really hard to create multiple centroids at some exact value where the constituent inputs which comprise those centroids aren't that value. If the data in a centroid does indeed correspond to input data that were all the same point, then the minimum of the counts of the centroids which share that value is a good choice. Here's a tdigest from a uniform distribution over 96e6 to 104e6, rounded to integers.
What are the quantiles for 96.01ms (and then 103.99)? Ideally the first will be at least .02691 = (1+5+107+2578)/100000; we know from the raw data that the true quantile of 96.01 must be at least that. Total count = 100000.
Now imagine if we use the minimum one:
Ah, I see. I agree it seems hard to get multiple centroids with the same mean unless all values in those centroids were equal to the mean. I'm still not convinced using the minimum count is the right solution, though. What I think should be done instead is using 0 as the count (because everything is at the centre, so there's nothing on the right). If both the prev/next centroids are like this, it's essentially a line with slope 0. It does not make much difference in your examples, because the minimum counts are 1.0 in both cases. It might be more significant in the middle of the digest.
I'm okay with the idea of using 0, though it introduces special cases to the code. If you know all of your counts are integer values, then using 1 seems pretty reasonable.
Hi, where did this end up? This seems to be the only impl linked from Dunning that has quantile_of, which seems pretty useful...
I haven't really poked at this code since I wrote it. https://github.com/tvondra/tdigest is almost certainly better if you're looking for a C implementation. I think that this particular issue isn't a huge deal in the grand scheme of things.
I guess maybe his repo won't be easy to use in another project because of its Postgres ties. Should have looked first.
PRs welcome if you want to get your hands dirty.
@tvondra's impl looks the same for quantile_of: https://github.com/tvondra/tdigest/blob/master/tdigest.c#L649C21-L649C21
While looking at `td_quantile_of` I think I've found another bug. My knowledge of Go is pretty limited, so I'm not sure if it's present in the other repository, and unlike the previous issue I don't have a reproducer demonstrating it yet.

When computing the quantile, the code does this:
which makes perfect sense - there may be multiple centroids with the same mean, and we need to account for all of them (and then take 1/2 of the count). So far so good.
But when the value is not exactly equal to the mean of a centroid, the code ignores this possibility. It simply does this:
and entirely ignores the possibility that there might be multiple centroids with the same mean, both for `nr` and `nl`. So the weights used for the linear approximation will be somewhat bogus, not quite representing values from all the relevant centroids. I think the code should do the same lookup as in the `val == n->mean` branch, for both `nr` and `nl`.
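A hedged sketch of what that suggested fix might look like; the names (`centroid_t`, `sum_run`) are hypothetical and this is not the repository's actual code:

```c
#include <stddef.h>

typedef struct {
    double mean;
    double count;
} centroid_t;

/*
 * Fold the whole run of centroids sharing cs[i].mean into a single
 * (mean, count) pair, mirroring what the val == n->mean branch already
 * does. The idea is to call this for both neighbors (nl and nr) before
 * interpolating, so the weights reflect every relevant centroid.
 */
static void sum_run(const centroid_t *cs, size_t n, size_t i,
                    double *mean, double *count)
{
    *mean = cs[i].mean;
    *count = cs[i].count;

    for (size_t j = i; j > 0 && cs[j - 1].mean == *mean; j--)
        *count += cs[j - 1].count;  /* absorb equal-mean centroids on the left */
    for (size_t j = i + 1; j < n && cs[j].mean == *mean; j++)
        *count += cs[j].count;      /* ... and on the right */
}
```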