FMWithSGD default constructor parameters are inconsistent/too small #11

Open
Hydrotoast opened this issue May 23, 2016 · 0 comments

Hydrotoast commented May 23, 2016

From the FMWithSGD file:

  /**
    * Construct an object with default parameters: {task: 0, stepSize: 1.0, numIterations: 100,
    * dim: (true, true, 8), regParam: (0, 0.01, 0.01), miniBatchFraction: 1.0}.
    */
  def this() = this(0, 1.0, 100, (true, true, 8), (0, 1e-3, 1e-4), 1e-5)

The Scaladoc is inconsistent with the values actually passed: it documents regParam as (0, 0.01, 0.01) and miniBatchFraction as 1.0, while the constructor passes (0, 1e-3, 1e-4) and 1e-5.
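Until the defaults are reconciled, a caller can sidestep the ambiguity by spelling out every parameter. A minimal sketch, assuming the primary constructor is accessible and takes its parameters in the same order as the auxiliary constructor quoted above (task, stepSize, numIterations, dim, regParam, miniBatchFraction):

```scala
// Illustrative only; parameter order assumed from the auxiliary constructor above.
val fm = new FMWithSGD(
  0,                 // task
  1.0,               // stepSize
  100,               // numIterations
  (true, true, 8),   // dim
  (0.0, 1e-3, 1e-4), // regParam
  1.0                // miniBatchFraction: sample (in expectation) the full data set each iteration
)
```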

It is also worth noting that 1e-5 may be too small a mini-batch fraction to train all of the parameters. Since the GradientDescent implementation in Scala performs numIterations iterations of mini-batch SGD, each iteration sampling a miniBatchFraction of the data, at most roughly numIterations * miniBatchFraction of the labeled points are ever visited. For numIterations = 100 and miniBatchFraction = 1e-5, that means at most a 1e-3 fraction (0.1%) of the labeled points is actually used during training!
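A quick back-of-the-envelope check of that claim (plain Scala, no Spark needed; the data set size below is just an example):

```scala
// Expected fraction of the data set touched across all iterations, assuming each
// iteration samples a fresh miniBatchFraction of the data.
val numIterations = 100
val miniBatchFraction = 1e-5
val maxFractionUsed = numIterations * miniBatchFraction
println(f"At most $maxFractionUsed%.5f of the training data is ever sampled") // 0.00100

// For a data set of, say, one million labeled points:
val numExamples = 1000000L
println(s"i.e. at most ${(numExamples * maxFractionUsed).toLong} of $numExamples points") // 1000
```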

Further implications: since the model keeps a set of parameters per feature, any feature that goes unseen during training simply retains its default initialization: a latent vector drawn from a Normal distribution and a weight of 0.0.
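To make that concrete, here is a hypothetical sketch (not the library's actual code) of the per-feature state described above; the initialization scale is an assumption for illustration only:

```scala
import scala.util.Random

// If a feature never appears in any sampled mini-batch, SGD never updates these values.
val numFactors = 8                    // matches dim = (true, true, 8)
val initStd = 0.01                    // assumed scale, illustration only
val rng = new Random(42)

val latentVector = Array.fill(numFactors)(rng.nextGaussian() * initStd) // ~ Normal
val linearWeight = 0.0                                                  // stays 0.0 if unseen
```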

asfgit pushed a commit to apache/spark that referenced this issue May 25, 2016
## What changes were proposed in this pull request?

Add a warning log for the case that `numIterations * miniBatchFraction < 1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and `numIterations = 3`. Then 3 iterations will occur, each sampling approximately 20 examples. In the best case all of the sampled examples are distinct, so at most 60 of the 100 examples are used.
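A sketch of the kind of check this describes, assuming it sits inside `GradientDescent.runMiniBatchSGD` where `numIterations`, `miniBatchFraction`, and Spark's `logWarning` are in scope (a paraphrase, not necessarily the exact patch):

```scala
// Sketch: warn when the expected coverage of the training set is below 100%.
if (numIterations * miniBatchFraction < 1.0) {
  logWarning("Not all examples will be used if numIterations * miniBatchFraction < 1.0: " +
    s"numIterations=$numIterations, miniBatchFraction=$miniBatchFraction")
}
```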

This may be counter-intuitive to most users, and it led to an issue during the development of another Spark ML model: zhengruifeng/spark-libFM#11. If a user actually does not need the full training data set, it would be easier and more intuitive to subsample it explicitly with `RDD.sample`.
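For reference, explicit subsampling with the standard RDD API looks like this (`trainingData` is a hypothetical `RDD[LabeledPoint]`):

```scala
// Draw an explicit ~20% subsample once, up front, rather than relying on a tiny
// miniBatchFraction to implicitly discard most of the data during optimization.
val subsample = trainingData.sample(withReplacement = false, fraction = 0.2, seed = 11L)
```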

## How was this patch tested?

`build/mvn -DskipTests clean package` build succeeds

Author: Gio Borje <[email protected]>

Closes #13265 from Hydrotoast/master.

(cherry picked from commit 589cce9)
Signed-off-by: Sean Owen <[email protected]>