-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Problem with accuracy calculation? #3
Comments
It seems to be an original intention of the authors (link): The question is: do authors report top-2 accuracy for all other methods? I found no code for them in this repository, but I may have missed something. And it seems there is nothing about it in the paper. Because if you report top-1 accuracy for one set of methods and top-2 accuracy for another set of methods, it is not fair at all. However, the paper is still very insightful and well-written, even if the numbers are incorrect. (Edited): You can also compare the results of other methods on popular datasets with other papers. Here is AGNews, for instance. Those results are similar to the paper's results and report top-1 accuracy, not top-2. |
So, is this method useful? |
I cannot speak for the benchmarks run in the original manuscript, but me reimplementation which reports numbers for a weighted kNN classifier and also regression (naive kNN regressor) for molecular string representations works well given the simplicity of the approach (see https://github.com/daenuprobst/molzip). So the method is useful in that setting, but there is more to do and things to try. |
Thank you @kts for opening the issue & Sorry for the late reply! When the flag
I want to elaborate the point 2. Also, I would love to add "decrement" strategy to the repo and let people choose the strategy freely if you'd love to. |
Thanks for the reply. Sorry if this added extra stress to your thesis day! Now, I understand that this is the desired behavior rather than a bug. Why do I call it top-2 accuracy? In k=2 case, you find the two closest training points to your test point and look at their labels. Let's call the labels C for correct and W for wrong, depending on if it equals the test point's label or not. There are 4 possibilities,
So, it's equivalent to being correct if the correct label is one or both of the top 2. (I did try to mention in the blog that it's not computing top-k for k>2, because it's only looking at those TIED for highest count, not all k). The problem is with the 1-1 ties. You need some method to break the tie, and in a "fair" classifier you cannot look at the test label. I think if you use You're right, I too have never seen "top-k accuracy" used in the context of kNN, so maybe the wording of "top-2" isn't very helpful and I shouldn't use that. It just seemed a concise way to describe the phenomenon described above. (Yes, it's a bit different here because we're talking about I'm not sure I followed everything you wrote, but maybe you agree with this and the main issue is just making that more clear in the paper. That's possible, but I'd say the main point is: this method uses the test label as part of its decision process which is not the standard classification setting and can't be fairly compared to others that don't. |
Hi all, thanks @bazingagin for the code and make me think about gzip, such an intriguing way of measuring similarity! About taking the correct label if predictions are tied.
This sound fair, but I'm not 100% sure if really fair, since likelihood of ties may be greater for some method than for other. Do you know how often the tie happens for different method? I guess if it's about equal than it may be fair. Maybe it's good to recalculate all method and baselines with the random selection by @kts and getting some mean over different seeds. |
Hi @kts, thanks for your reply. No worries, I think it's my responsibility to clarify things : ) |
Hi @flipz357, Thanks for your comments. I agree with you that different methods will have different likelihood on tie-situation. So yes, I will report the accuracy with random selection. |
Okay, you're free to use the wording that you see fit and think readers will understand. I didn't write the blog post or this issue to make you change the paper, etc, but for others trying to recreate the results. But I do maintain that using this approach (either looking at the test label or running multiple times with randomness and taking the best) makes the comparisons with other baselines unfair - i.e. it's wrong to say "this beats BERT in these tests" - unless you get those results with a classifier that doesn't look at the answer. A probably bigger problem with at least some of the datasets someone pointed out here #13 Anyways, it's still an interesting approach and paper. Cheers. |
@bazingagin I was really excited about this paper, but I carefully checked the code and here I have to agree with @kts that the test labels are indeed used in the decision making process. The "maximum accuracy" you referred to is essentially computed with two steps for the WC (wrong-correct) tie cases: (1) choose the C label as the prediction using the ground-truth label (there is no other guaranteed way to choose a C label over a W label); (2) calculate the accuracy between the predictions and the ground truth labels. So in a fair setting, there should never be a "maximum accuracy" (it is unfair itself unless other methods are also reporting the exactly same maximum accuracy); there should be one and only one accuracy between the predictions (each data sample should only have one final prediction) and the test labels. The process of obtaining the predictions should be independent of the test labels, for example, via majority voting or random selection. Hope this helps. I think it is still a good paper after the fix and result update. |
hey @kts @eric-xw I understand that you think using the max accuracy of knn for comparison is unfair. What I was trying to say is that random guess has its lowerbound and upperbound. And theoretically, the upperbound is achievable (yes, the probability is small if say we have 10 tied instances and we only have (1/2)^10 probability of achieving the highest), but still achievable in theory. But I guess we don't need to seek common ground on that as the discussion of unfair or fair comparison/evaluation can go on and on. For example, as https://twitter.com/drjwrae/status/1679988179179536385?s=21 points out, comparing GPT&knn with gzip&knn is a more fair comparison as we can directly compare the estimated compressed length (negative log-likelihood) of GPT with a compressed length of gzip which can better indicate whether pretrained models are worse in capturing regularity on OOD (it is, according to his experiment in random case). I think these are all great suggestions and should be taken into consideration for the future work (like using weighted knn might be even better). But here, I only want to make sure I disclose every detail in the experiment - people are aware the reported result uses max accuracy, assuming every guess is correct when tied. It's their judgment whether they want to check the maximum accuracy in order to see the full potential of this similarity measurement or they want to use random for a practical classifier. Thanks for the constructive discussion and I will add random results anyway. |
Hey, thanks for the nice paper. @kts @eric-xw focus on the end to end task and experimental setup, and challenge the fact that the accuracy numbers would not be reproducible in a real setup, because the label would not be available. The comparison becomes unfair to other models in the table 3 that don't have this label info to exploit.
I think I'd be more helpful if you'd provide:
|
imo it may be even unfair within the approaches who can exploit label info, since the approaches may have different likelihood of tie (and approach with more ties have advantage). But yes, if this is the case it may be seen from your suggestion to report
|
Okay, last comment on this :) Why report the max rather than the mean? A "fair" classifier can certainly use randomness and you may want to run it multiple times to get a better estimate of it's accuracy - but, for that use the mean over the trials. I can define a classifier that ignores the data and randomly picks a class. It's has a max possible accuracy of 100%. |
Because reporting mean induces more stochasticity and/or magnitude more of experiments. As you see, we have few-shot experiments across all the settings. Each shot experiment requires 5 times repeated experiments to draw different training samples. Using random and reporting mean means either inducing more stochasticity, which may make the result hard to reproduce, or repeating at least 5*5 times experiments for each shot setting.
That's not the case here as all datasets have more than 2 classes so even randomly picking top 2 nearest neighbors when tied still requires a reasonable similarity measurement. |
thanks @cyrilou242, what you said makes sense. I will also include minimum accuracy and k=1 case. |
thanks @bazingagin! |
I'm not an expert in this topic, so I have no idea about the standard way to measure accuracies. But, can we use distance information to break ties? To be specific, we calculate the sum of distances of the k-nearest points, and when tie occurs, we choose the class whose sum of distances is the smallest. |
@twjang Yes, but I guess an issue is that gzip will make more ties since the distance function is coarse grained integer (afaik), while for sentence embedding it is fine grained float. |
@twjang Yes, this is reasonable strategy. For @flipz357 The actual distances used are floats ( Line 6 in 3c21994
|
@cyrilou242 here are results with both lowerboud and upperbound. @kts I also include the random (only run once though) for reference. @flipz357 As you mentioned it may be unfair to other non-parametric methods, here are the results with random strategy using W2V and SentBERT for comparison. |
Thanks @bazingagin ! Btw, actually you don't need to run random, the expected accurcy (using random guesser to break ties) can be simply calculated: For K=2, as @cyrilou242 mention, it boils down to the average of upper and lower bound. For arbitrary K the accuarcy will depend on the amount of options in each tie, but it can also be calculated. |
@bazingagin , so in other words if top2 closest points are of the correct label and are different you count this as a correct prediction? This doesn't make sense. It's same thing like saying: Yeah I'll assume that I'm very lucky and even if the probability is 50% of picking correct label I will pick the correct one. I'm seeing your point that it's not TOP-2 accuracy, but it's very unfair heuristic to use to get the predicted labels. |
hey @maxmax1992, sure I do agree it makes the accuracy of gzip inflated. So please check the updated version of the table The table also includes the updated result for the correct (no overlapped) DengueFilipino dataset.
|
I overall can confirm the updated numbers that @bazingagin posted. For 5-shot only roughly, but that is due to the large deviation over runs (as is also evident from the large 95-confidence intervals in the table). I do think gzip is quite interesting and its performance in few shot shows that advanced methods struggle in low resource scenarios. However, I think it should be also noted even simpler methods can also provide good results, for anyone interested: https://arxiv.org/abs/2307.15002 |
npc_gzip/experiments.py
Line 116 in a469915
It appears that in the
calc_acc
method it marks a sample correct if ANY of the labels with the samemost_count
have the correct label.For
k=2
(I think the paper always usedk=2
?), any time there is a1-1
tie (top 2 have different labels), it will be correct if EITHER is correct, rather than a "fair" classifier that would have to make a decision before scoring.This fork has my implementation of the kNN accuracy results for comparisons. see file docstring for notes:
https://github.com/kts/npc_gzip/blob/main/calc_acc.py
The text was updated successfully, but these errors were encountered: