You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently, the threshold option of the Bayes classifier only makes sense if the enable_threshold option is also set to true (at initialization or later dynamically). This tight interdependence makes me think that the threshold enabler flag is an overhead. However, by making the default value of the threshold property as Float::INFINITY we can cleverly get rid of the enable_threshold option. Let's have a look at the current implementation of the classify method:
The modified method will behave exactly the same as the current implementation with threshold_enabled set to true if the default value of threshold is Float::INFINITY.
However, there is a case where it will not behave the same as the current implementation. Suppose that the threshold_enabled is set to false and classify_with_score returned a class with score Infinity. In this case the current implementation would return that class while the modified method will return nil instead. This is obviously a breaking change. However, I think the current implementation, without the threshold in action, is broken anyway, so we might as well get rid of that behavior completely. Here is what I mean by broken:
As illustrated above, for the text "Baz", the category Foo is as "irrelevant" as Bar, so why should return one, but not the other. Hence the nil is a better choice here in my opinion. One could argue that the score collision could happen even if the scores are not Infinity, but the keyword here is "irrelevant". In case of a real score value the distance of the classification text vector is same from the competing category vectors, so they are equally relevant and picking one or the other is equally good. However, if the score is Infinity then saying one Infinity is equal to the other Infinity is mathematically incorrect.
With this proposal in place, we can get rid of many unnecessary attributes and convenience methods (including threshold_enabled, threshold, threshold=, threshold_enabled?, threshold_disabled?, enable_threshold, and disable_threshold). I am not sure how useful these dynamic enablers and disablers are if the user has access to raw scores for making custom decisions. This will make the classifier API simpler for consumers, easier to document, and easier to maintain.
The text was updated successfully, but these errors were encountered:
Yeah, I remember #47 pretty well. I agree that it's a bit broken, but I'd like to fix it rather than removing it. Perhaps we can fix it for 3.0, and make the false flag work correctly.
Looking at this line: result = nil if score < @threshold || score == Float::INFINITY if threshold_enabled?
I feel like we do need a fix. I'm not opposed to cleaning up the API though.
I think you can add a label like 3.x so that we can resurface it when the times comes.
I agree that it's a bit broken, but I'd like to fix it rather than removing it.
The issue I described, if it is fixed, then I think enable_threshold will become completely irrelevant that can be achieved by setting the default threshold to Infinity. However, for the sake of backward compatibility, we can perhaps keep it for 2.x.
Currently, the
threshold
option of theBayes
classifier only makes sense if theenable_threshold
option is also set totrue
(at initialization or later dynamically). This tight interdependence makes me think that the threshold enabler flag is an overhead. However, by making the default value of thethreshold
property asFloat::INFINITY
we can cleverly get rid of theenable_threshold
option. Let's have a look at the current implementation of theclassify
method:If we change it to something like this:
The modified method will behave exactly the same as the current implementation with
threshold_enabled
set totrue
if the default value ofthreshold
isFloat::INFINITY
.However, there is a case where it will not behave the same as the current implementation. Suppose that the
threshold_enabled
is set tofalse
andclassify_with_score
returned a class with scoreInfinity
. In this case the current implementation would return that class while the modified method will returnnil
instead. This is obviously a breaking change. However, I think the current implementation, without the threshold in action, is broken anyway, so we might as well get rid of that behavior completely. Here is what I mean by broken:As illustrated above, for the text "Baz", the category
Foo
is as "irrelevant" asBar
, so why should return one, but not the other. Hence thenil
is a better choice here in my opinion. One could argue that the score collision could happen even if the scores are notInfinity
, but the keyword here is "irrelevant". In case of a real score value the distance of the classification text vector is same from the competing category vectors, so they are equally relevant and picking one or the other is equally good. However, if the score isInfinity
then saying oneInfinity
is equal to the otherInfinity
is mathematically incorrect.With this proposal in place, we can get rid of many unnecessary attributes and convenience methods (including
threshold_enabled
,threshold
,threshold=
,threshold_enabled?
,threshold_disabled?
,enable_threshold
, anddisable_threshold
). I am not sure how useful these dynamic enablers and disablers are if the user has access to raw scores for making custom decisions. This will make the classifier API simpler for consumers, easier to document, and easier to maintain.The text was updated successfully, but these errors were encountered: