Implementation for cyclic and mini-batch option #24

wzzlcss · 2019-06-24T17:01:38Z

This pull request adds options for cyclic and mini-batch SAGA. The new features are enabled by setting batchsize (default = 1) and cyclic (default = FALSE). I calculate epoch = floor(n_samples/batchsize) as the number of coefficient updates in one iteration. Before each new iteration, an index matrix is regenerated so that each column contains the indices of the samples used for one update. This implementation draws fresh batches every iteration.
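The index construction described above can be sketched roughly as follows. This is a minimal illustration and not the PR's IndexBatch: the name MakeBatchIndex, the nested-vector layout, and the use of std::mt19937 are all assumptions made so the sketch compiles without Eigen or R, and it assumes batches are drawn without replacement within one pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// One entry per coefficient update; each entry holds `batchsize` sample indices
// (the analogue of one column of the index matrix).
using IndexMatrix = std::vector<std::vector<std::size_t>>;

// Hypothetical helper (not from the PR): draws fresh mini-batches for one pass.
// std::mt19937 is a stand-in RNG chosen only to keep the sketch self-contained.
inline IndexMatrix
MakeBatchIndex(const std::size_t n_samples,
               const std::size_t batchsize,
               std::mt19937& rng)
{
  assert(batchsize >= 1 && batchsize <= n_samples);

  // Number of coefficient updates per pass: floor(n_samples / batchsize)
  const std::size_t n_updates = n_samples / batchsize;

  // Shuffle the pool of sample indices 0, 1, ..., n_samples - 1
  std::vector<std::size_t> pool(n_samples);
  std::iota(pool.begin(), pool.end(), std::size_t{0});
  std::shuffle(pool.begin(), pool.end(), rng);

  // Cut the shuffled pool into consecutive batches; the remaining
  // n_samples % batchsize samples are left out of this pass
  IndexMatrix index(n_updates, std::vector<std::size_t>(batchsize));
  for (std::size_t j = 0; j < n_updates; ++j) {
    for (std::size_t i = 0; i < batchsize; ++i) {
      index[j][i] = pool[j * batchsize + i];
    }
  }
  return index;
}
```

With the abalone data (4177 samples) and batchsize = 10, this gives 417 updates per pass and leaves 7 samples unused in that pass.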

Correctness check using the abalone dataset (4177 samples) and a Gaussian model

Full batch without penalty is expected to give the same fit as (X^T X)^{-1} X^T y

X <- matrix(as.numeric(unlist(abalone$x)), ncol = 9)
y <- as.numeric(unlist(abalone$y))
X2 <- matrix(cbind(rep(1, 4177), X), ncol = 10) 
ols <- solve(t(X2)%*%X2)%*%t(X2)%*%y
fit <- sgdnet(X, y, alpha = 0, lambda = 0, maxit = 10000000, 
              thresh = 0.00000001, batchsize = 4177)
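For reference, the reason this comparison is a valid check can be written out as follows, assuming the Gaussian loss is the (scaled) residual sum of squares and writing X_2 = [1, X] for the design matrix with an intercept column, as in the code above. With lambda = 0 there is no penalty, so the minimizer satisfies the normal equations:

```latex
\hat{\beta}
  = \arg\min_{\beta} \frac{1}{2n} \lVert y - X_2 \beta \rVert_2^2
\quad\Longrightarrow\quad
X_2^\top X_2 \, \hat{\beta} = X_2^\top y
\quad\Longrightarrow\quad
\hat{\beta} = (X_2^\top X_2)^{-1} X_2^\top y
```

The last step requires X_2^\top X_2 to be invertible, which the call to solve() above also assumes.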

With ridge penalty

glmfit_ridge <- glmnet(X, y, alpha = 0, lambda = 0.4)
sgdfit_ridge <- sgdnet(X, y, alpha = 0, lambda = 0.4)
batch_ridge  <- sgdnet(X, y, alpha = 0, lambda = 0.4, thresh = 0.00001, batchsize = 10)

With lasso penalty

glmfit_lasso <- glmnet(X, y, alpha = 1, lambda = 0.4)
sgdfit_lasso <- sgdnet(X, y, alpha = 1, lambda = 0.4)
batch_lasso  <- sgdnet(X, y, alpha = 1, lambda = 0.4, thresh = 0.00001, batchsize = 10)

Performance

The mini-batch version often requires more time on benchmark datasets to reach the same loss, but outperforms SAGA on randomly generated GLM datasets.

michaelweylandt · 2019-06-25T17:42:31Z

DESCRIPTION

@@ -32,5 +32,6 @@ Suggests:
    latticeExtra
 LinkingTo: 
    Rcpp (>= 0.12.16),
-    RcppEigen (>= 0.3.3.4.0)
+    RcppEigen (>= 0.3.3.4.0),
+    testthat


Do we need to link to testthat? In R parlance, "LinkingTo" refers to sharing compiled (C / C++) code but I don't see that being used here

When I wanted to add a test file for functions from the package's compiled library, it seemed that without this link, testthat::use_catch cannot find <testthat.h>. I will remove this since most functions are in header files.

No need to remove it - it seems like a good thing to use. (I just couldn't find the actual use of it since you deleted the tests in subsequent commits on this branch.)

So where is this being used? I don't see it in the test files.

Hi mentor, it is not being used now. I previously thought I was going to formally test the C++ helper functions.

michaelweylandt · 2019-06-25T17:44:12Z

tests/testthat/test-families.R

@@ -1,40 +0,0 @@
-context("general family tests")


Why delete this file without a replacement?

I tried to isolate problems on Travis. I will add it back.

michaelweylandt · 2019-06-25T17:49:10Z

Hi Daisy,

This PR and its commit history are a bit messy. Could you rebase it (create a new more direct history) before we review it? If you need help with this process, let me know. (It's pretty confusing your first time through, but such a useful thing to learn in the long run.)

It also looks like you've been removing most of the test files - is this intentional or just trying to isolate problems on Travis?

wzzlcss · 2019-06-25T21:18:56Z

Hi Michael, thank you for the comments! I will rebase the commit history. Earlier, Travis CI reported that test-families.R fails (but it passes in my environment), so I will add it back. Testing C++ code via testthat::use_catch works for cpp files, but since our helper functions are in header files and the exported main functions do not require it, I deleted this infrastructure. I might only add test cases for the main functions with cyclic and mini-batch.

jolars

Good job, Daisy. A few pointers:

Please add tests along with any enhancements that you make. Any changes need to be validated before they are merged. Also, please avoid deleting test files unless there is a good reason to do so (reiterating Michael's point). And if you do need to disable a test, it is usually better to just add a skip() call instead of commenting out the tests.
Try to adhere to the coding style that I've tried to use previously for the code. For the C++ side, this is mostly based on Google's C++ style guide with a few exceptions (which I apologize for). The only really big exception is that I like to put return type declarations and inline modifiers on separate lines.
I get a warning and a note when I run R CMD check (and these are on Travis too):

* checking Rd \usage sections ... WARNING
Undocumented arguments in documentation object 'sgdnet'
  ‘batchsize’

and

* checking compiled code ... NOTE
File ‘sgdnet/libs/sgdnet.so’:
  Found ‘rand’, possibly from ‘rand’ (C)
    Object: ‘sgdnet.o’

Compiled code should not call entry points which might terminate R nor
write to stdout/stderr instead of to the console, nor use Fortran I/O
nor system RNGs.

Could you try fixing these? I think the NOTE is related to your use of std::random_shuffle in utils.h, which you shouldn't use anyway: if I'm not mistaken, C++ does not respect or use the random number generators from R, so results cannot be controlled by setting the seed on the R side. I found this post in the Rcpp gallery, which is probably exactly what you need.
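To illustrate the point about seeding, here is a minimal sketch of a shuffle whose randomness is supplied by the caller. It is not the code from the PR or from the Rcpp gallery post: ShuffleWith is a hypothetical name, and the uniform draw is passed in as a std::function so the sketch compiles without R. Inside the package, that draw would presumably be a thin wrapper around R's generator, so that set.seed() on the R side controls the result. (As a side note, std::random_shuffle was deprecated in C++14 and removed in C++17.)

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Fisher-Yates shuffle driven by a caller-supplied uniform draw on [0, 1).
// Hypothetical helper: the source of randomness is a parameter, so an
// R-backed generator could be plugged in instead of a C or C++ one.
inline void
ShuffleWith(std::vector<int>& x, const std::function<double()>& unif)
{
  for (std::size_t i = x.size(); i > 1; --i) {
    const double u = unif();
    assert(u >= 0.0 && u < 1.0);

    // Map u to an integer position j in [0, i)
    const std::size_t j = static_cast<std::size_t>(u * static_cast<double>(i));

    // Move the chosen element to the end of the unshuffled prefix
    std::swap(x[i - 1], x[j]);
  }
}
```

Because the generator is injected, the same sequence of draws always yields the same permutation, which is the reproducibility property the R seed is meant to provide.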

jolars · 2019-06-30T14:26:52Z

src/families.h

  {
-    gradient = linear_predictor - y.array().col(i);
+    for (unsigned i = 0; i < ind.rows(); ++i){


Suggested change

for (unsigned i = 0; i < ind.rows(); ++i){

for (unsigned i = 0; i < ind.rows(); ++i) {

jolars · 2019-06-30T14:28:09Z

src/saga-dense.h

-        // Unlag and rescale coefficients
-        w *= wscale;
-        wscale = 1.0;
+  Saga(Penalty&               penalty,


The indentation seems off here. I am guessing this is a result of copy-pasting code in RStudio with auto-indentation on. Could you please undo the indentation?

jolars · 2019-06-30T14:29:10Z

src/saga-sparse.h

-    if (lagged_amount != 0) {
-      penalty(w, j, wscale, lag_scaling[lagged_amount], g_sum);
-      lag[j] = k;
+  for (unsigned m = 0; m < subx.cols(); ++m){


Suggested change

for (unsigned m = 0; m < subx.cols(); ++m){

for (unsigned m = 0; m < subx.cols(); ++m) {

Please add a space between ) and {

jolars · 2019-06-30T14:30:32Z

src/utils.h

+                const unsigned length)
+{
+  Eigen::ArrayXXi index(1, length);
+  for (unsigned i = 0; i < length; ++i){


Suggested change

for (unsigned i = 0; i < length; ++i){

for (unsigned i = 0; i < length; ++i) {

jolars · 2019-06-30T14:32:22Z

src/utils.h

+    unsigned s_ind = floor(R::runif(0.0, n_samples));
+    index.col(i) = s_ind;
+  }
+  return(index);


Suggested change

return(index);

return index;

Try to use a consistent coding style. Most of the code currently uses return foo; rather than return(foo);, so please try to stick to this scheme.

jolars · 2019-06-30T14:33:10Z

src/utils.h

+    g_change_col = g_change.col(i);
+    step += g_change_col.rowwise()*subx.col(i).transpose().array();
+  }
+  return(step);


Suggested change

return(step);

return step;

jolars · 2019-06-30T14:39:15Z

src/utils.h

+      const unsigned  B,
+      const bool cyclic)
+{
+  if (B > 1)  return(IndexBatch(n_samples, B));


Suggested change

if (B > 1) return(IndexBatch(n_samples, B));

if (B > 1)

return IndexBatch(n_samples, B);

jolars · 2019-06-30T14:39:31Z

src/utils.h

+      const bool cyclic)
+{
+  if (B > 1)  return(IndexBatch(n_samples, B));
+  if (cyclic) return(IndexCyclic(n_samples, n_samples));


Suggested change

if (cyclic) return(IndexCyclic(n_samples, n_samples));

if (cyclic)

return IndexCyclic(n_samples, n_samples);

jolars · 2019-06-30T14:39:47Z

src/utils.h

+{
+  if (B > 1)  return(IndexBatch(n_samples, B));
+  if (cyclic) return(IndexCyclic(n_samples, n_samples));
+  else        return(IndexStochastic(n_samples, n_samples));


Suggested change

else return(IndexStochastic(n_samples, n_samples));

else

return IndexStochastic(n_samples, n_samples);

jolars · 2019-06-30T14:40:10Z

src/utils.h

+  Eigen::ArrayXXi index(B, n_iter);
+  Eigen::ArrayXi  pool = Eigen::ArrayXi::LinSpaced(n_samples, 0, n_samples);
+
+  for (unsigned i = 0; i < n_iter; ++i){


Suggested change

for (unsigned i = 0; i < n_iter; ++i){

for (unsigned i = 0; i < n_iter; ++i) {


jolars

I think this looks good now, Daisy. Good job!

I noticed that you are force-pushing your commits; please try to avoid this if you can. I know it's necessary sometimes, but make it a habit to push normally if possible.

jolars · 2019-07-04T09:01:12Z

tests/testthat/test-gaussian.R

@@ -30,9 +36,15 @@ test_that("all weights are zero when lambda > lambda_max", {
  lambda_max <- max(abs(crossprod(yy, xx)) * sy)/NROW(x)

  fit <- sgdnet(x, y, maxit = 1000, thresh = 0.0001)
+  fit_batch <- sgdnet(x, y, maxit = 1000, thresh = 0.0001, B = 10)


Shouldn't this be batchsize instead of B?

wzzlcss · 2019-07-04T09:19:34Z

Hi Johan, thank you! Your comments help me a lot! I tried to delete some unnecessary commits. I will be more careful to keep a clean commit history.

michaelweylandt · 2019-07-09T02:38:20Z

R/sgdnet-package.R

@@ -16,6 +16,7 @@

 #' @useDynLib sgdnet, .registration = TRUE
 #' @importFrom Rcpp sourceCpp
+#' @import methods


Why are we importing the whole methods package? AFAIR, we have no S4 anywhere in this code (except possibly sparse matrix support, but we shouldn't need methods imported for that.)

I have removed the methods package. Some earlier Travis CI runs wanted it, but now it is fine.

michaelweylandt · 2019-07-09T02:40:00Z

R/sgdnet.R

+  if (batchsize > n_samples)
+    stop("batch size cannot be larger than sample size.")
+
+  if (batchsize%%1 > 0)


Maybe use the is.wholenumber function given in ?is.integer here - this check is a bit opaque to me.

Hi, I am using the method from is.wholenumber instead.
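For readers unfamiliar with it, the is.wholenumber example on R's ?is.integer help page tests abs(x - round(x)) < tol with tol = .Machine$double.eps^0.5. A C++ analogue is sketched below purely for illustration. The check in this PR lives in R, so IsWholeNumber and IsLegalBatchSize are hypothetical names, and the lower bound of 1 is an added assumption that this thread does not discuss.

```cpp
#include <cmath>
#include <limits>

// C++ analogue of is.wholenumber from R's ?is.integer help page:
// x counts as whole if it is within tol of the nearest integer.
inline bool
IsWholeNumber(const double x,
              const double tol = std::sqrt(std::numeric_limits<double>::epsilon()))
{
  return std::abs(x - std::round(x)) < tol;
}

// Hypothetical validity check combining the conditions discussed here:
// batchsize must be a whole number and no larger than n_samples.
// The lower bound of 1 is an assumption added for this sketch.
inline bool
IsLegalBatchSize(const double batchsize, const double n_samples)
{
  return IsWholeNumber(batchsize) && batchsize >= 1.0 && batchsize <= n_samples;
}
```

Compared with batchsize %% 1 > 0, the tolerance-based form states its intent and accepts values that differ from an integer only by floating-point error.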

michaelweylandt · 2019-07-09T02:41:22Z

tests/testthat/test-mgaussian.R

  gfit <- glmnet(x, y, family = "mgaussian", standardize.response = TRUE)

  expect_equal(sfit$lambda, gfit$lambda)
  expect_equivalent(coef(sfit), coef(gfit))
+  expect_equivalent(coef(bfit), coef(gfit), tolerance = 1e-2)


This is still really loose and seems prone to false negatives: can we tighten this test?

I have tightened the test to have smaller tolerance.

michaelweylandt · 2019-07-09T02:41:46Z

tests/testthat/test-families.R

  }
-})
+})


Missing EOL here


michaelweylandt · 2019-07-09T02:46:12Z

src/saga-dense.h

  // Outer loop
  unsigned it_outer = 0;
  bool converged = false;
  do {
+
+    // Pull samples
+    index = Index(n_samples, B, cyclic);


sb Eigen::Index

Sorry about this, I have changed its name to "Ind" now.


michaelweylandt · 2019-07-09T02:54:02Z

src/utils.h

+            const unsigned length)
+{
+  Eigen::ArrayXXi index(1, length);
+  index.row(0) = Eigen::ArrayXi::LinSpaced(n_samples, 0, n_samples);


Can't we just return the LinSpaced object here?

I'm also a bit confused on the design - this looks like it will always give the same samples at each iteration.

Hi Michael, I changed it to give cyclic samples with random start at each iteration.
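A cyclic pass with a random starting point could look roughly like the sketch below. This is an assumption about the intended behaviour rather than the PR's actual IndexCyclic: the name CyclicIndex and the std::vector return type are hypothetical, and the starting index is taken as an argument so the caller decides how it is drawn (for example from R's RNG, once per pass).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cyclic index generator: visits every sample exactly once per
// pass, in order, beginning at `start` and wrapping around at n_samples.
inline std::vector<std::size_t>
CyclicIndex(const std::size_t n_samples, const std::size_t start)
{
  assert(start < n_samples);

  std::vector<std::size_t> index(n_samples);
  for (std::size_t i = 0; i < n_samples; ++i) {
    index[i] = (start + i) % n_samples;
  }
  return index;
}
```

For example, CyclicIndex(5, 3) yields 3, 4, 0, 1, 2, so successive passes differ only in where the cycle begins.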

michaelweylandt · 2019-07-09T02:55:44Z

src/utils.h

+
+//' wrapper aroud R's RNG such that we get a unifrom distribution over 
+//' [0,n) as required by the STL algorithm
+inline int randWrapper(const int n) { return floor(unif_rand()*n); }


Is there an Rcpp version we could use here instead? It's pretty hard to tell without going hunting that unif_rand is from R's C API.

Hi Michael, it seems that C++ does not respect the random number generators from R, so I am using this so that the results can be controlled by setting the seed on the R side, following Johan's suggestion.

I didn't mean to use a C++ RNG: I still meant to use the R RNGs via C++.

Maybe Rcpp::as<int>(Rcpp::sample(n, 1)) - 1

My point was that it's not obvious (to me) that unif_rand does use the R RNGs, as opposed to being a C++ standard function or something from Eigen.

michaelweylandt · 2019-08-19T21:31:52Z

Hi @wzzlcss: Is this ready for final reviewing / merging?

wzzlcss · 2019-08-20T06:32:38Z

Hi mentors, I think this is ready for final reviewing.

michaelweylandt · 2019-08-21T22:41:02Z

@jolars - Can you take a look at this again? There's been quite a lot of work on it since July.

michaelweylandt

I think this looks good, though I haven't checked all the C++. Will let @jolars give the final thumbs up.


michaelweylandt · 2019-08-21T22:42:36Z

tests/testthat/test-mgaussian.R


  expect_equal(sfit$lambda, gfit$lambda)
  expect_equivalent(coef(sfit), coef(gfit))
+  expect_equivalent(coef(bfit), coef(gfit), tolerance = 1e-3)


Why do we need a looser tolerance here than with the regular sfit?

michaelweylandt · 2019-08-21T22:43:47Z

tests/testthat/test-gaussian.R

+                      intercept = FALSE,
+                      thresh = 0.000001,
+                      maxit = 1000,
+                      batchsize = 500)


Can we set this 500 to n or to NROW(x) to make the intent (full gradient) a bit clearer?

jolars

Apart from the issues @michaelweylandt pointed out, I think this looks good.

michaelweylandt · 2019-08-22T17:19:04Z

Great - thanks @jolars. @wzzlcss, I think one more commit addressing the simple stuff I noted and we'll be good to merge this.

michaelweylandt · 2019-08-28T23:27:28Z

Hi @wzzlcss, Is this ready to merge? (I just noted one test that looks a bit loose, but I think everything else has been addressed.)

define a iterator class

1fb4ae0

michaelweylandt reviewed Jun 25, 2019


jolars self-requested a review June 30, 2019 14:21

jolars added the enhancement New feature or request label Jun 30, 2019

jolars requested changes Jun 30, 2019


wzzlcss force-pushed the dev branch from 1f0a8d8 to 9d640fa Compare June 30, 2019 23:43

wzzlcss added 3 commits July 4, 2019 00:52

use Eigen::ArrayXi for index sampler instead and replace previous sto…

38fcd0c

…chastic sampler

add docs directory to .Rbuildignore

f70009a

add cyclic flag and documentation

cc11e47

wzzlcss force-pushed the dev branch 2 times, most recently from 487e627 to ca03008 Compare July 4, 2019 07:57

modify test cases

f18c650

modify test case (roll back) test case

wzzlcss force-pushed the dev branch from ca03008 to 56c2142 Compare July 4, 2019 07:59

wzzlcss added 5 commits July 4, 2019 01:03

add infrastructure for cpp unit test

e9149f1

modify test file

add method for submatrix

9669739

add method for update multiple columns for gradient table

336a505

add batch size option [skip ci]

af68985

change to feed Gradient with an array of indices

1a0f116

wzzlcss force-pushed the dev branch from 56c2142 to d3ca34b Compare July 4, 2019 08:04

wzzlcss added 7 commits July 4, 2019 01:06

add cyclic and mini-batch method [skip ci]

8a47f7c

Add helper function for cyclic and mini-batch [skip ci] Some changes for Lag update [skip ci] modify cyclic and mini batch [skip ci]

Delete test-example.cpp

ecb30ca

Delete test-runner.cpp Delete catch-routine-registration.R Delete test-cpp.R Update test-lambda-path.R Update test-cross-validation.R Update RcppExports.cpp

add documentation for batchsize [skip ci]

334d4cc

modify coding style [skip ci]

c604af6

fix std::random_shuffle

fe811ca

add test cases for mini-bach saga

dfd70d8

add check for legal batch size

01ef623

delete space check for integer batch size

wzzlcss force-pushed the dev branch from d3ca34b to 01ef623 Compare July 4, 2019 08:07

jolars approved these changes Jul 4, 2019


jolars requested a review from michaelweylandt July 4, 2019 09:08

change error in test case parameter

d230e1b

michaelweylandt requested changes Jul 9, 2019


wzzlcss added 6 commits August 13, 2019 12:37

change legal test for batchsize [skip ci]

8178519

add full batch test case for gaussian and set tight thresh for mgaussian

6db818f

use is.wholenumber instead for legal batchsize check [skip ci]

f88ace7

delete white space in file [skip ci]

feb3dfb

do not import methods package

1627d8b

change name of index generator and method of cyclic index

4c68139

michaelweylandt mentioned this pull request Aug 21, 2019

Faster SubSampling for Minibatches #30

Open

michaelweylandt requested a review from jolars August 21, 2019 22:41

michaelweylandt reviewed Aug 21, 2019


jolars approved these changes Aug 22, 2019


fix R RNG via C++ and test functions

16e9850

wzzlcss force-pushed the dev branch from c90c765 to 16e9850 Compare August 26, 2019 08:43

fix test cases

7f2dacb
