Fixes issue #36 #37

Closed · wants to merge 12 commits

Conversation

jmcastagnetto

Adds sensitivity, specificity and other related metrics. To make it simpler, added a simple binary confusion matrix function that is not exported.
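For reference, a sketch of what such an unexported helper looks like, reconstructed from the review hunks quoted further down (an illustration, not the exact committed code):

confusion_matrix <- function(actual, predicted) {
    # naive version: assumes actual and predicted are 0/1 vectors of equal length
    data.frame(
        "tp" = sum(actual == 1 & predicted == 1),
        "fn" = sum(actual == 1 & predicted == 0),
        "fp" = sum(actual == 0 & predicted == 1),
        "tn" = sum(actual == 0 & predicted == 0)
    )
}

Metrics such as precision, sensitivity, and specificity can then be expressed as ratios of these four counts.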

@codecov-io

codecov-io commented Jul 12, 2019

Codecov Report

Merging #37 into master will increase coverage by 0.3%.
The diff coverage is 100%.


@@           Coverage Diff            @@
##           master     #37     +/-   ##
========================================
+ Coverage      98%   98.3%   +0.3%     
========================================
  Files           5       5             
  Lines         100     118     +18     
========================================
+ Hits           98     116     +18     
  Misses          2       2
Impacted Files Coverage Δ
R/binary_classification.R 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ca12765...a189e12. Read the comment docs.

@jmcastagnetto
Author

Oops, sorry, wrong button, reopening.

@jmcastagnetto reopened this Jul 12, 2019
Owner

@mfrasco left a comment


Hi! I'm sorry that it has taken me so long to respond. I was away on vacation, and I just got back to my computer.

Thanks for your contributions. I really appreciate all of the effort that you put into this PR. I left several comments on the code in your PR. One was about making cmat more robust to abnormal inputs. Most of them were about making improvements to the documentation. I care a lot about the quality of documentation in this package, which is why I put a lot of time into reviewing it.

Also, in this PR, you didn't commit any of the documentation files. Can you run devtools::document() and commit the .Rd files that get generated?

Let me know if you are interested in making the changes that I requested on this PR and if you have any questions about the requests that I made.

Lastly, please add yourself as a contributor to the Authors section in the DESCRIPTION file!

@mfrasco
Owner

mfrasco commented Jul 21, 2019

Addresses issue #36

@jmcastagnetto
Author

Hi @mfrasco, I will take a look at all your suggestions for improvement and fix them, hopefully sooner rather than later. Fortunately, this week here in Peru is shorter than usual because of our national holidays, so I'll have time to work on this, in particular the bit about making the confusion matrix robust (it is very naive right now).

if (!(
setequal(binvals, b_actual) |
setequal(c(0), b_actual) |
setequal(c(1), b_actual)

You don't need b_actual or the three calls to setequal(); instead, !all(actual %in% 0:1) should do.

Owner

Yes. I agree that !all(actual %in% 0:1) is a good implementation.

This would be a breaking change to the existing functionality of the package, as the current implementation does not error if the user provides a vector containing values other than 0 and 1.

When I took over maintenance of this package, I thought a lot about whether I should validate the inputs. At the time, it wasn't worth the effort, but I understand why we would want to implement it now.

If we do add input validation, I'd want this logic to be pulled into a separate function and used consistently across all binary classification metrics.
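For illustration only, not code from this PR: such a shared validation helper might look like

check_binary <- function(x, name) {
    # error when x contains anything other than 0 and 1
    if (!all(x %in% 0:1)) {
        stop(paste0("Expecting a vector of 0s and 1s for '", name, "'"))
    }
    invisible(TRUE)
}

with every binary classification metric calling check_binary(actual, "actual") and check_binary(predicted, "predicted") before computing anything. (The name check_binary is hypothetical.)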

Author

Wanted to write so you know I am still interested in helping here. Sorry for not being responsive for quite some time; I've had some health issues that are now resolving, and I have not had time to look at the code and your suggestions. Hopefully by the end of this week.

At a quick glance, @mllg's suggestions are on point. As for input validation, it would be a "good thing" (TM), but I think it needs some thought, in particular if it involves breaking backwards compatibility.

Perhaps putting this into a side branch would be better, so things can be merged into the main branch once they are in their final form.

if (!(
setequal(binvals, b_predicted) |
setequal(c(0), b_predicted) |
setequal(c(1), b_predicted)

See above, same for b_predicted.

"fn" = sum(actual == 1 & predicted == 0),
"fp" = sum(actual == 0 & predicted == 1),
"tn" = sum(actual == 0 & predicted == 0)
)

A data.frame() seems like an odd choice to return an integer(4).

Owner

I agree. A named vector would be simpler.
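For comparison, a named-vector version would be along these lines (a sketch, not the committed code):

confusion_matrix <- function(actual, predicted) {
    c(
        tp = sum(actual == 1 & predicted == 1),
        fn = sum(actual == 1 & predicted == 0),
        fp = sum(actual == 0 & predicted == 1),
        tn = sum(actual == 0 & predicted == 0)
    )
}

Note that callers would then index with cm["tp"] (or unname the result) rather than cm$tp, since $ does not work on atomic vectors.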

#' actual <- c(1, 1, 1, 0, 0, 0)
#' predicted <- c(1, 1, 1, 1, 1, 1)
#' precision(actual, predicted)
precision <- function(actual, predicted) {
-    return(mean(actual[predicted == 1]))
+    cm <- confusion_matrix(actual, predicted)
+    cm$tp / (cm$tp + cm$fp)

This returns NaN if tp and fp are both 0. This is absolutely fine, but it should be clearly documented. Also note that other toolkits decided to return either 0 or 1 in such a case. There is a discussion about it on SO: https://stats.stackexchange.com/questions/1773/what-are-correct-values-for-precision-and-recall-in-edge-cases.

Owner

The existing implementation of the function returns NaN, so we should be intentional about making this change. It'd need to be consistent across the entire package.

I'd support returning 0 when cm$tp + cm$fp == 0 and raising a warning that says something like Precision is called with no positive predictions. The proper definition of precision is undefined. Returning 0 instead.
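A sketch of the behaviour proposed here, assuming the confusion_matrix helper from this PR (illustration only, not committed code):

precision <- function(actual, predicted) {
    cm <- confusion_matrix(actual, predicted)
    if (cm$tp + cm$fp == 0) {
        # no positive predictions: precision is undefined, fall back to 0
        warning("Precision is called with no positive predictions. ",
                "The proper definition of precision is undefined. Returning 0 instead.")
        return(0)
    }
    cm$tp / (cm$tp + cm$fp)
}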

@mllg

mllg commented Aug 2, 2019

Sorry for hijacking this PR, just stumbled over it as I thought about creating a feature request for more binary classification measures.

FWIW, I implemented something similar for mlr3. Maybe you'll find it useful:

Owner

@mfrasco left a comment

I really appreciate the effort you put into this PR, and I'm sorry that it's taken me a while to respond. The changes proposed in this PR are significant because they change the behavior of existing functions, which means we need to be thoughtful about what users expect from this package. For that reason, I have more feedback on the types of changes that I'd like to make in this PR.

I also recognize that this PR is becoming quite large in scope. For that reason, we might take a first step of implementing the basic logic in all of the functions but not exporting them from the package. Or we might take the strategy of merging this PR into a non-master branch within this repo, so that further development can continue there. Does that make sense?

Let me know what you want to do.


setequal(c(1), b_actual)
)) {
stop(paste("Expecting a vector of 0s and 1s for 'actual'. Got:",
paste(actual, collapse = ", ")))
Owner

If actual has many values, this print statement could be very large. You'd probably want to only show the first 5 or so unique values.
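For example, something along these lines (illustration only):

shown <- head(unique(actual), 5)
stop(paste("Expecting a vector of 0s and 1s for 'actual'. First unique values seen:",
           paste(shown, collapse = ", ")))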

"fn" = sum(actual == 1 & predicted == 0),
"fp" = sum(actual == 0 & predicted == 1),
"tn" = sum(actual == 0 & predicted == 0)
)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. A named vector would be simpler.

#' predicted <- c(1, 0, 1, 1, 1, 1)
#' sensitivity(actual, predicted)
sensitivity <- function(actual, predicted) {
recall(actual, predicted)
Owner

Thanks for implementing sensitivity as a function of recall.

I'm thinking of a better way to implement this so that sensitivity and recall actually are defined as the same function, not just one function that calls another. I want the functions to share documentation too. I'll look into the best ways for doing this.

I'm thinking about

foo <- bar <- baz <- function(actual, predicted) {
    ...
}

But I need to do research into how that impacts R packages and documentation.
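One common roxygen2 pattern for this is to document a single function and make the other name an alias of the same object via @rdname; a sketch (assuming the confusion-matrix-based recall proposed in this PR, not the package's current code):

#' Recall / Sensitivity
#'
#' @param actual vector of 0s and 1s
#' @param predicted vector of 0s and 1s
#' @export
recall <- function(actual, predicted) {
    cm <- confusion_matrix(actual, predicted)
    cm$tp / (cm$tp + cm$fn)
}

#' @rdname recall
#' @export
sensitivity <- recall

Both names then share one .Rd file, and devtools::document() generates it automatically.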

#' predicted <- c(1, 0, 1, 1, 1, 1)
#' fnr(actual, predicted)
fnr <- function(actual, predicted) {
1 - sensitivity(actual, predicted)
Owner

This is good. I need to decide whether this should be 0 or 1 in the case where sum(actual) == 0.
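For illustration, the "return 0" option would look something like this (which value is correct is still the open question above):

fnr <- function(actual, predicted) {
    if (sum(actual) == 0) {
        # no positive cases in actual: the false negative rate is undefined
        warning("fnr is called with no positive cases in 'actual'. Returning 0.")
        return(0)
    }
    1 - sensitivity(actual, predicted)
}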

@mfrasco
Owner

mfrasco commented Aug 15, 2019

@mllg Thanks for your feedback on the PR. I really appreciate it and I thought all of your ideas were terrific. I took a look at mlr3. I like how it defined the measures that have multiple names in one location. That's something that I want to accomplish with this PR too.

How do you think the Metrics package should handle backwards compatibility issues (e.g. returning 0 instead of NaN in precision when sum(predicted) == 0)? The more I think about it, the more I question whether changing the functionality of the package is appropriate.

@jmcastagnetto
Author

Moved the proposed code changes to https://github.com/jmcastagnetto/Metrics/tree/fix36; I haven't reviewed them since last August. Closing this until I get time to make the suggested improvements or something else is decided.
