# FMWithSGD: default constructor parameters are inconsistent/too small (#11)
asfgit pushed a commit to apache/spark that referenced this issue on May 25, 2016:
## What changes were proposed in this pull request?

Add a warning log for the case that `numIterations * miniBatchFraction < 1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and `numIterations = 3`. Then 3 iterations will occur, each sampling approximately 20 examples. In the best case, every sampled example is unique, so at most 60/100 examples are used. This may be counter-intuitive to most users and led to the issue during the development of another Spark ML model: zhengruifeng/spark-libFM#11. If a user actually does not require the full training data set, it would be easier and more intuitive to use `RDD.sample` directly.

## How was this patch tested?

`build/mvn -DskipTests clean package` build succeeds

Author: Gio Borje <[email protected]>

Closes #13265 from Hydrotoast/master.

(cherry picked from commit 589cce9)
Signed-off-by: Sean Owen <[email protected]>
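As a self-contained illustration of the arithmetic in that commit message (the variable names mirror the Spark parameters and the values are the ones from the example above; the `println` merely stands in for Spark's actual warning log):

```scala
object MiniBatchCoverage extends App {
  val numExamples = 100
  val miniBatchFraction = 0.2
  val numIterations = 3

  // Each iteration samples roughly numExamples * miniBatchFraction points.
  val perIteration = numExamples * miniBatchFraction        // ~20 examples
  // Even if every sampled point were unique, only this fraction is ever seen.
  val bestCaseFraction = numIterations * miniBatchFraction  // 0.6

  if (bestCaseFraction < 1.0) {
    println(s"Warning: numIterations * miniBatchFraction = $bestCaseFraction < 1.0; " +
      "not all training examples will be used during optimization.")
  }
  println(s"~${perIteration.toInt} examples per iteration; best case " +
    s"${(bestCaseFraction * numExamples).toInt}/$numExamples unique examples used.")
}
```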
From the `FMWithSGD` file: the comment is inconsistent with the actual values passed.

It is also worth noting that `1e-5` may be too small a batch fraction to train over all parameters. Since the `GradientDescent` implementation in Scala performs `numIterations` iterations of mini-batch SGD with batch fraction `miniBatchFraction`, approximately a `numIterations * miniBatchFraction` fraction of the labeled points is updated. For `numIterations = 100` and `miniBatchFraction = 1e-5`, this means at most a `1e-3` fraction of the labeled points is actually used during training!
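Plugging in those defaults makes the scale of the problem concrete (`numExamples` here is a hypothetical dataset size, since the issue does not fix one):

```scala
// Sketch of the arithmetic above; numExamples is an assumed, illustrative size.
val numIterations = 100
val miniBatchFraction = 1e-5
val numExamples = 1000000L                                // hypothetical: one million points

val maxFractionUsed = numIterations * miniBatchFraction   // 1e-3, i.e. 0.1% of the data
val pointsPerBatch = numExamples * miniBatchFraction      // ~10 points per mini-batch
val maxPointsUsed = numExamples * maxFractionUsed         // at most ~1,000 distinct points
```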
Further implications: since the model has a set of parameters per feature, if a feature is unseen during training, its parameters will simply keep their default values: latent vectors initialized from a Normal distribution and weights initialized to `0.0`.
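A minimal sketch of the default initialization described here (not spark-libFM's actual code; `numFeatures`, `numFactors`, and `initStd` are illustrative values). An unseen feature's row in the factor matrix and its weight are never touched by SGD, so they keep exactly these initial values:

```scala
import scala.util.Random

val numFeatures = 1000  // illustrative
val numFactors  = 8     // illustrative
val initStd     = 0.01  // illustrative standard deviation
val rng = new Random(42)

// One latent vector per feature, drawn from N(0, initStd^2).
val factors = Array.fill(numFeatures, numFactors)(rng.nextGaussian() * initStd)
// One linear weight per feature, initialized to 0.0.
val weights = Array.fill(numFeatures)(0.0)
// A feature never sampled during training retains these defaults, so any
// interaction terms it participates in reflect only the random initialization.
```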