Log warnings for numIterations * miniBatchFraction < 1.0
## What changes were proposed in this pull request?

Add a warning log for the case that `numIterations * miniBatchFraction < 1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and `numIterations = 3`. Then, 3 iterations will occur, each sampling approximately 6 examples. In the best case, each of those examples is unique; hence at most 18/100 examples are used.
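As a sketch of the arithmetic behind the new check (plain Scala, no Spark dependency; the object and method names here are mine, not Spark's):

```scala
// Best-case fraction of the dataset that mini-batch gradient descent can
// touch: each iteration samples about miniBatchFraction of the examples,
// so across the whole run at most numIterations * miniBatchFraction of
// the data is ever used, capped at 1.0.
object MiniBatchCoverage {
  def bestCaseFractionUsed(numIterations: Int, miniBatchFraction: Double): Double =
    math.min(1.0, numIterations * miniBatchFraction)

  def main(args: Array[String]): Unit = {
    // 3 iterations at a 2% batch fraction can reach at most 6% of the data.
    println(bestCaseFractionUsed(3, 0.02))
    // 100 iterations at 10% could, in the best case, cover everything.
    println(bestCaseFractionUsed(100, 0.1))
  }
}
```

When this product is below 1.0, the run cannot see every example even if every draw happened to be distinct, which is exactly the situation the added warning flags.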

This may be counter-intuitive to most users, and it led to an issue during the development of another Spark ML model: zhengruifeng/spark-libFM#11. If a user deliberately does not want to train on the full data set, it would be easier and more intuitive to use `RDD.sample` directly.

## How was this patch tested?

`build/mvn -DskipTests clean package` build succeeds

Author: Gio Borje <[email protected]>

Closes #13265 from Hydrotoast/master.
Gio Borje authored and srowen committed May 25, 2016
1 parent 9c297df commit 589cce9
Showing 1 changed file with 5 additions and 0 deletions.
@@ -197,6 +197,11 @@ object GradientDescent extends Logging {
         "< 1.0 can be unstable because of the stochasticity in sampling.")
     }

+    if (numIterations * miniBatchFraction < 1.0) {
+      logWarning("Not all examples will be used if numIterations * miniBatchFraction < 1.0: " +
+        s"numIterations=$numIterations and miniBatchFraction=$miniBatchFraction")
+    }
+
     val stochasticLossHistory = new ArrayBuffer[Double](numIterations)
     // Record previous weight and current one to calculate solution vector difference

5 comments on commit 589cce9

@vlad17 vlad17 commented on 589cce9 Aug 5, 2016

Hello, thank you for your change. I think your PR comment may be a bit off: for your example to work, miniBatchFraction should be 0.02, not 0.2.

Furthermore, the message is a bit confusing. It seems to imply that having the product P = numIterations * miniBatchFraction >= 1 guarantees use of all distinct examples. In fact, the expected fraction of distinct points sampled is about 1 - exp(-P); P needs to be a lot more than 1 to ensure that, say, 90% of distinct examples are used in expectation.
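That estimate can be sketched in plain Scala (illustrative only; it assumes each of the k iterations independently samples a fraction f of the data, with P = k * f):

```scala
// Each example is missed by one iteration with probability (1 - f); after
// k independent iterations the miss probability is (1 - f)^k ≈ exp(-k*f).
// With P = k * f, the expected fraction of distinct examples seen at
// least once is therefore about 1 - exp(-P).
object DistinctCoverage {
  def expectedDistinctFraction(p: Double): Double = 1.0 - math.exp(-p)

  def main(args: Array[String]): Unit = {
    println(expectedDistinctFraction(1.0)) // ~0.63: P = 1 covers only ~63%
    println(expectedDistinctFraction(2.3)) // ~0.90: P ≈ 2.3 needed for 90%
  }
}
```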

In any case, I don't think we care too much about using distinct examples here; that's the whole point of sampling with replacement. I think it would be more appropriate to warn the user that the true number of (possibly repeated) examples used is less than the size of the dataset.


@srowen srowen commented on 589cce9 Aug 5, 2016

I'm not sure it implies that (because it doesn't logically follow), and it doesn't say that in any event. Distinct examples aren't relevant here.

@vlad17 vlad17 commented on 589cce9 Aug 5, 2016

It doesn't imply it logically, but it certainly implicates it (conversationally). It's kind of silly to say "X won't happen if you don't Y" but then expect people to understand that X still might not happen if you Y.

Also, regarding distinct examples, I completely agree, which is why I think the warning should be something like "The number of examples used for training ..."


@srowen srowen commented on 589cce9 Aug 5, 2016

Here, X may not happen if you Y. X may not even be a problem, so I don't know whether a more aggressive warning is needed. The message doesn't talk about distinct examples; what are you referring to?

@vlad17 vlad17 commented on 589cce9 Aug 5, 2016

I also agree that X may not be a problem; to be honest, I don't know whether a warning, rather than an info log, is appropriate. Regarding distinct: "all examples" means just that, though what's really meant is "a number of examples equal to the input data size." Of course, that's a bit verbose; all I'm saying is that a warning like

"The number of examples used for training is less than the input data size since numIterations * miniBatchFraction < 1.0"

is neither verbose nor aggressive. Also, its (conversational) implication is technically correct, which of course is the best kind of correct one can have.
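For concreteness, that suggested wording could look something like the following (a hypothetical revision, not the merged code; `logWarning` is stubbed with stderr so the snippet is self-contained):

```scala
// Hypothetical rewording of the merged warning; sketch only.
object RewordedWarning {
  // Extracted predicate: true when even the best case cannot touch
  // every input example.
  def undersampled(numIterations: Int, miniBatchFraction: Double): Boolean =
    numIterations * miniBatchFraction < 1.0

  // Stub standing in for the logger provided by Spark's Logging trait.
  def logWarning(msg: String): Unit = Console.err.println(s"WARN $msg")

  def warnIfUndersampled(numIterations: Int, miniBatchFraction: Double): Unit =
    if (undersampled(numIterations, miniBatchFraction)) {
      logWarning("The number of examples used for training is less than the input data size " +
        s"since numIterations * miniBatchFraction < 1.0: numIterations=$numIterations, " +
        s"miniBatchFraction=$miniBatchFraction")
    }

  def main(args: Array[String]): Unit = {
    warnIfUndersampled(3, 0.02)  // warns: product is 0.06
    warnIfUndersampled(100, 0.1) // silent: product is 10
  }
}
```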
