Commit
Log warnings for numIterations * miniBatchFraction < 1.0
## What changes were proposed in this pull request?

Add a warning log for the case that `numIterations * miniBatchFraction < 1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and `numIterations = 3`. Then, 3 iterations will occur, each sampling approximately 6 examples. In the best case, each of the 6 examples is unique; hence 18/100 examples are used. This may be counter-intuitive to most users and led to an issue during the development of another Spark ML model: zhengruifeng/spark-libFM#11. If a user actually does not require the full training data set, it would be easier and more intuitive to use `RDD.sample`.

## How was this patch tested?

`build/mvn -DskipTests clean package` build succeeds

Author: Gio Borje <[email protected]>

Closes #13265 from Hydrotoast/master.
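As a rough illustration of the proposed behavior, here is a minimal, self-contained sketch of this kind of check; the object name, method name, and warning text are made up for the example and are not the exact code added to MLlib's `GradientDescent`.

```scala
// Illustrative sketch only; the real change lives inside MLlib's
// GradientDescent and uses its own logging facility.
object MiniBatchWarningSketch {
  // Warn when numIterations * miniBatchFraction < 1.0, i.e. when fewer
  // (possibly repeated) examples than the dataset size will be sampled.
  def warnIfUnderSampled(numIterations: Int, miniBatchFraction: Double): Unit = {
    if (numIterations * miniBatchFraction < 1.0) {
      Console.err.println(
        s"WARNING: not all examples may be used: " +
        s"numIterations ($numIterations) * miniBatchFraction ($miniBatchFraction) < 1.0")
    }
  }

  def main(args: Array[String]): Unit = {
    warnIfUnderSampled(numIterations = 3, miniBatchFraction = 0.2)   // 0.6 < 1.0 -> warns
    warnIfUnderSampled(numIterations = 10, miniBatchFraction = 0.2)  // 2.0 >= 1.0 -> silent
  }
}
```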
589cce9
Hello, thank you for your change. I think your PR comment may be a bit off; for your example to work, `miniBatchFraction` should be 0.02, not 0.2.

Furthermore, the message is a bit confusing. It seems to imply that having the product `P = numIterations * miniBatchFraction >= 1` guarantees use of all distinct examples. In fact, the expected fraction of distinct points sampled is about `1 - exp(-P)`; you need P to be a lot more than 1 to ensure that, say, 90% of distinct examples are used in expectation.

In any case, I don't think we care too much about using distinct examples here; that's the whole point of using replacement. I think it would be more appropriate to warn the user that the true number of (possibly repeated) examples used is less than the size of the dataset.
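To make the `1 - exp(-P)` estimate concrete, here is a small self-contained sketch; it assumes each of the `numIterations` iterations independently includes each example with probability `miniBatchFraction`, which is a simplification of MLlib's actual per-iteration sampling.

```scala
// Rough check of the 1 - exp(-P) approximation. Under the independence
// assumption above, an example is never seen with probability
// (1 - f)^T ~= exp(-T * f) = exp(-P), so the fraction seen is ~ 1 - exp(-P).
object DistinctFractionSketch {
  def fractionSeen(numIterations: Int, miniBatchFraction: Double): Double =
    1.0 - math.pow(1.0 - miniBatchFraction, numIterations)

  def main(args: Array[String]): Unit = {
    val f = 0.1
    for (t <- Seq(1, 5, 10, 23, 40)) {
      val p = t * f
      println(f"P = $p%4.1f  exact = ${fractionSeen(t, f)}%.3f  1 - exp(-P) = ${1.0 - math.exp(-p)}%.3f")
    }
  }
}
```

With P = 1, only about 63% of distinct examples are expected to be touched; reaching roughly 90% takes P around 2.3.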
I'm not sure it implies that (because it doesn't logically follow), and it doesn't say that in any event. Distinct examples aren't relevant here.
It doesn't imply it logically, but it certainly implicates it (conversationally). It's kind of silly to say "X won't happen if you don't Y" but then expect people to understand that X still won't happen even if you do Y.
Also, regarding distinct examples, I completely agree, which is why I think the warning should be something like "The number of examples used for training ..."
Here, X may not happen if you Y. X may or may not be a problem, even, so I don't know if a more aggressive warning is needed. The message doesn't talk about distinct examples; what are you referring to?
Also agree that X may not be a problem; to be honest I don't know whether a warning rather than an info log is appropriate. Regarding distinct: "all examples" means just that, though what's really meant is "a number of examples equal to the number of examples inputted." Of course, that's a bit verbose; all I'm saying is that a warning like
"The number of examples used for training is less than the input data size since numIterations * miniBatchFraction < 1.0"
is neither verbose nor aggressive. Also, its (conversational) implication is technically correct, which of course is the best kind of correct one can have.