MoMA: Modern Multivariate Analysis in R

`MoMA`: Modern Multivariate Analysis in `R`

Background

Multivariate analysis -- the study of finding meaningful patterns in datasets -- is a key technique in any data scientist's toolbox. Beyond its use for Exploratory Data Analysis ("EDA"), multivariate analysis also allows for principled Data-Driven Discovery: finding meaningful, actionable, and reproducible structure in large data sets. Classical techniques for multivariate analysis have proven immensely successful through history, but modern Data-Driven Discovery requires new techniques to account for the specific complexities of modern data. We propose a new unified framework for Modern Multivariate Analysis ("MoMA"), which will provide a unified and flexible baseline for future research in multivariate analysis. Even more importantly, we anticipate that an easy-to-use R package will increase adoption of these powerful new models by end users and, in conjunction with R's rich graphics libraries, position R as the leading platform for modern exploratory data analysis and data-driven discovery.

Multivariate analysis techniques date back to the earliest days of statistics, pre-dating other foundational concepts like hypothesis testing by several decades. Classical techniques such as Principal Components Analysis ("PCA") [1, 2], Partial Least Squares ("PLS"), Canonical Correlation Analysis ("CCA") [3], and Linear Discriminant Analysis ("LDA"), have a long and distinguished history of use in statistics and are still among the most widely used methods for EDA. Their importance is reflected in the CRAN Task View dedicated to Multivariate Analysis [4], as well as the specialized implementations available for a range of application areas. Somewhat surprisingly, each of these techniques can be interpreted as a variant of the well-studied eigendecomposition problem, allowing statisticians to build upon a rich mathematical and computational literature.

In the early 2000s, researchers noted that naive extensions of classical multivariate techniques to the high-dimensional setting produced unsatisfactory results, a finding later confirmed by advances in random matrix theory [5]. In response to these findings, multivariate analysis experienced a renaissance as researchers developed a wide array of new techniques to incorporate sparsity, smoothness, and other structure into classical techniques [6,7,8,9,10,11,12,13,14 among many others], resulting in a rich literature on "modern multivariate analysis." Around the same time, theoretical advances showed that these techniques avoided many of the pitfalls associated with naive extensions [15,16,17,18,19,20].

While this literature is vast, it relies on a single basic principle: it is essential to adapt classical techniques to account for known characteristics and complexities of the dataset at hand for multivariate analysis to succeed. For example, a neuroscientist investigating the brain's response to an external stimulus may expect a response which is simultaneously spatially smooth and sparse: spatially smooth because the brain processes related stimuli in well-localized areas (e.g., the visual cortex) and sparse because not all regions of the brain are used to respond to a given stimulus. Alternatively, a statistical factor model used to understand financial returns may be significantly improved by incorporating known industry sector data, motivating a form of group sparsity. A sociologist studying how pollution leads to higher levels of respiratory illnesses may combine spatial smoothness and sparsity (indicating "pockets" of activity) with a non-negativity constraint, knowing that pollution and illness have a positive effect.

To incorporate these different forms of prior knowledge into multivariate analysis, a wide variety of algorithms and approaches have been proposed. In 2013, Allen proposed a general framework that unified existing techniques for "modern" PCA, as well as proposing a number of novel extensions [21]. The mentors' recently developed MoMA algorithm [42] builds on this work, allowing more forms of regularization and structure, as well as supporting more forms of multivariate analysis.

The project MoMA has started since last summer and we are about halfway through the project. So far we have implemented the core algorithm of MoMA. Specifically, in addition to the options discussed in [21], the package includes a generalized algorithm with options for:

L1 ("Lasso") [26], SCAD [27], and MCP [28] sparsity-inducing penalization
Group sparsity [29]
Ordered fusion using the "fused lasso" [30]
Unordered fusion using a convex clustering penalty [31,32]
Non-Negativity constraints
SLOPE (sorted L1 norm) [44]

In this summer, we will focus on the last element of MoMA: parameter selection. First, we provide greedy BIC selection scheme for PCA, PLS, CCA and LDA. We also expose a low level regularized rank-1 SVD function that takes a grid of parameters and returns the estimated eigenvectors for each of those values, which enables users to build their own problem-specific CV scheme. For UX consideration, we provide higher level wrappers for all these functionality. Second, we build visualization platforms using Shiny. Although the automation of parameter selection is crucial, especially in our multi-parameter model, we also emphasize human intervention in such process. Visualizing how the level of penalization affects the estimators helps scientists gain more insights about the data. Last but not the least, we supplement MoMA with detailed and thorough documentations.

Related work

Plain PCA is implemented in base R's stats::prcomp function. A number of packages which perform either sparse [34] or smooth [35] PCA can be found on CRAN. The spls [36] and RGCCA packages [37] implement sparse, but not smooth PLS and CCA respectively.

The PMA package [38], based on [12], is the most similar in spirit to our package, providing a unified approach to PCA and CCA, as well as multi-way CCA, but it does not implement PLS or LDA. More importantly, it does not implement the smoothing options, orthogonalization, or unordered fusion (clustering) penalties that the MoMA algorithm provides.

Coding Project Details

User Interface and Tuning Parameter Selection (Estimated time: 5 Weeks)

Wrapper functions for PCA, PLS, CCA, and LDA will be implemented. Helper functions to select regularization parameters (either via cross-validation, BIC, or extended BIC [33]) will be provided.
Visualization (Estimated time: 5 weeks)

We will implement some basic all-purpose visualizations.
Documentation and Case Study (Estimated time: 3 Weeks)

Finally, the package requires extensive documentation and write two vignettes: one with a detailed example application to neuroscience, loosely based on Section 6 of [21] and one describing the implementation details of the package. At the end of the summer, time will be dedicated to preparing MoMA for its initial CRAN release.

Future development (not in scope for GSoC-2019) may incorporate the Generalized PCA algorithm, with natural extensions to PLS and CCA [41], Multi-Way CCA, missing value support (including imputation), or more general loss functions such as those described in [43]. Tensor (higher-order array) support is of interest to the mentors, but would require more comprehensive support for higher-order arrays in Armadillo or Eigen.

Expected Impact

Multivariate Analysis techniques are indispensable for true Data Driven Discovery, but a unified and easy-to-use framework has been lacking to date. Individual packages allow for certain special cases handled by MoMA, but for advanced cases, no standard packaged solution is available. The MoMA package will provide the first unified toolbox for all forms of high-dimensional multivariate analysis in R or any other language. MoMA will empower statisticians and data scientists to flexibly find patterns that respect the specific structure of their data and allow for truly Modern Multivariate Analysis.

Mentors

Genevera Allen ([email protected]): Theory and Algorithms

Departments of Statistics, CS, and ECE, Rice University

Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children's Hospital
Michael Weylandt ([email protected]): Implementation

Department of Statistics, Rice University

Required Skills

Convex Optimization
Coding with R, C++, Rcpp and Shiny.

Tests

Potential applicants should:

Implement greedy BIC scheme for PCA;
Wrap their implementation using Rcpp;
Test their implementation using testthat;
Package their implementation and pass R CMD check on at least two of the three major platforms: Windows, MacOS, and Linux (Debian/Ubuntu).

Solutions of Tests

Students, please add a link to your solutions below. Mentors will check that the package passes R CMD check without any WARNING(s) or ERROR(s).

Coding Milestones

Community Bonding Period

This part is not needed since the project is deployed on TravisCI already. However, deploying MoMA on codecov is of interest. We should also use a clang-format file for C++ code and the styler package for R code formatting.

Coding Period

Phase I

Weeks 1-2:

Parameter selection via BIC

Weeks 3-5

PCA and LDA wrappers

Phase II

Week 6

CCA and PLS wrappers

Weeks 7-8:

PCA and CCA visualization

Weeks 9-10:

LDA and PLS visulization

Phase III

Weeks 11:

Function Level Documentation
Algorithm + Timing Vignette

Week 12:

Neuroscience Vignette

Week 13:

GitHub pages + pkgdown documentation
"Polishing" as needed

References

[1] K. Pearson. "On Lines and Planes of Closest Fit to Systems of Points in Space." The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, p.559-572, 1901. https://doi.org/10.1080/14786440109462720

[2] H. Hotelling. Analysis of a Complex of Statistical Variables into Principal Components. Journal of Educational Psychology 24(6), p.417-441, 1933. http://dx.doi.org/10.1037/h0071325

[3] H. Hotelling. "Relations Between Two Sets of Variates" Biometrika 28(3-4), p.321-377, 1936. https://doi.org/10.1093/biomet/28.3-4.321

[4] See CRAN Task View: Multivariate Statistics

[5] I. Johnstone, A. Lu. "On Consistency and Sparsity for Principal Components Analysis in High Dimensions." Journal of the American Statistical Association: Theory and Methods 104(486), p.682-693, 2009. https://doi.org/10.1198/jasa.2009.0121

[6] B. Silverman. "Smoothed Functional Frincipal Fomponents Analysis by Choice of Norm." Annals of Statistics 24(1), p.1-24, 1996. https://projecteuclid.org/euclid.aos/1033066196

[7] J. Huang, H. Shen, A. Buja. "Functional Principal Components Analysis via Penalized Rank One Approximation." Electronic Journal of Statistics 2, p.678-695, 2008. https://projecteuclid.org/euclid.ejs/1217450800

[8] I.T. Jolliffe, N.T. Trendafilov, M. Uddin. "A Modified Principal Component Technique Based on the Lasso." Journal of Computational and Graphical Statistics 12(3), p.531-547, 2003. https://doi.org/10.1198/1061860032148

[9] H. Zou, and T. Hastie, and R. Tibshirani. "Sparse Principal Component Analysis." Journal of Computational and Graphical Statistics 15(2), p.265-286, 2006. https://doi.org/10.1198/106186006X113430

[10] A. d'Aspremont, L. El Gahoui, M.I. Jordan, G.R.G. Lanckriet. "A Direct Formulation for Sparse PCA Using Semidefinite Programming." SIAM Review 49(3), p.434-448, 2007. https://doi.org/10.1137/050645506

[11] A. d'Aspremont, F. Bach, L. El Gahoui. "Optimal Solutions for Sparse Principal Component Analysis." Journal of Machine Learning Research 9, p.1269-1294, 2008. http://www.jmlr.org/papers/v9/aspremont08a.htm

[12] D. Witten, R. Tibshirani, T. Hastie. "A Penalized Matrix Decomposition, with Applications to Sparse Principal Components and Canonical Correlation Analysis." Biostatistics 10(3), p.515-534, 2009. https://doi.org/10.1093/biostatistics/kxp008

[13] R. Jenatton, G. Obozinski. F. Bach. "Structured Sparse Principal Component Analysis." Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010. http://proceedings.mlr.press/v9/jenatton10a.html

[14] G.I. Allen, M. Maletic-Savatic. "Sparse Non-Negative Generalized PCA with Applications to Metabolomics." Bioinformatics 27(21), p.3029-3035, 2011. https://doi.org/10.1093/bioinformatics/btr522

[15] A.A. Amini, M.J. Wainwright. "High-Dimensional Analysis of Semidefinite Relaxations for Sparse Principal Components." Annals of Statistics 37(5B), p.2877-2921, 2009. https://projecteuclid.org/euclid.aos/1247836672

[16] S. Jung, J.S. Marron. "PCA Consistency in High-Dimension, Low Sample Size Context." Annals of Statistics 37(6B), p.4104-4130, 2009. https://projecteuclid.org/euclid.aos/1256303538

[17] Z. Ma. "Sparse Principal Component Analysis and Iterative Thresholding." Annals of Statistics 41(2), p.772-801, 2013. https://projecteuclid.org/euclid.aos/1368018173

[18] T.T. Cai, Z. Ma, Y. Wu. "Sparse PCA: Optimal Rates and Adaptive Estimation." Annals of Statistics 41(6), p.3074-3110, 2013. https://projecteuclid.org/euclid.aos/1388545679

[19] V.Q. Vu, J. Lei. "Minimax Sparse Principal Subspace Estimation in High Dimensions." Annals of Statistics 41(6), p.2905-2947, 2013. https://projecteuclid.org/euclid.aos/1388545673

[20] D. Shen, H. Shen, J.S. Marron. "Consistency of Sparse PCA in High Dimension, Low Sample Size Contexts." Journal of Multivariate Analysis 115, p.317-333, 2013. https://doi.org/10.1016/j.jmva.2012.10.007

[21] G.I. Allen. "Sparse and Functional Principal Components Analysis." ArXiv Pre-Print 1309.2895 (2013). https://arxiv.org/abs/1309.2895

[22] D. Eddelbuettel, R. Francois. "Rcpp: Seamless R and C++ Integration." Journal of Statistical Software 40(8), p.1-18, 2011. https://doi.org/10.18637/jss.v040.i08

[23] D. Eddelbuettel, C. Sanderson. "RcppArmadillo: Accelerating R with High-Performance C++ Linear Algebra." Computational Statistics and Data Analysis 71, p.1054-1063, 2014. https://doi.org/10.1016/j.csda.2013.02.005

[24] C. Sanderson, R. Curtin. "Armadillo: A Template-Based C++ Library for Linear Algebra." Journal of Open Source Software 1(2), p.26, 2016. https://doi.org/10.21105/joss.00026

[25] A. Beck, M. Teboulle. "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems." SIAM Journal on Imaging Sciences 2(1), p.183-202, 2009. https://doi.org/10.1137/080716542

[26] R. Tibshirani. "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B. 58(1), p.267-288, 1996. http://www.jstor.org/stable/2346178

[27] J. Fan, R. Li. "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties." Journal of the American Statistical Association 96(456), p.1348-1360, 2011. https://doi.org/10.1198/016214501753382273

[28] C.-H. Zhang. "Nearly Unbiased Variable Selection Uner Minimax Convex Penalty." Annals of Statistics 38(2), p.894-942, 2010. https://projecteuclid.org/euclid.aos/1266586618

[29] M. Yuan, "Model Selection and Estimation in Regression with Grouped Variables." Journal of the Royal Statistical Society, Series B 68(1), p.49-67, 2006. https://doi.org/10.1111/j.1467-9868.2005.00532.x

[30] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, K. Knight. "Sparsity and Smoothness via the Fused Lasso." Journal of the Royal Statistical Society, Series B 67(1), p. 91-108, 2005. https://doi.org/10.1111/j.1467-9868.2005.00490.x

[31] T.D. Hocking, A. Joulin, F. Bach, J.-P. Vert. "ClusterPath: An Algorithm for Clustering using Convex Fusion Penalties." Proceedings of the 28th International Conference on Machine Learning (ICML) 2011.

[32] E.C. Chi, K. Lange. "Splitting Methods for Convex Clustering." Journal of Computational and Graphical Statistics 24(4), p.994-1013, 2015. https://doi.org/10.1080/10618600.2014.948181

[33] J. Chen, Z. Chen. "Extended Bayesian Information Criteria for Model Selection with Large Model Spaces." Biometrika 95(3), p.759-771, 2008. https://doi.org/10.1093/biomet/asn034

[34] For example, the elasticnet, nsprcomp, and rospca packages.

[35] For example, the fda, fdapace, and fdasrvf packages.

[36] Available on CRAN: spls

[37] Available on CRAN: RGCCA

[38] Available on CRAN: PMA

[39] http://www.stat.rice.edu/~gallen/software/sfpca/

[40] Slide 25 of http://www.stat.rice.edu/~gallen/gallen_sfpca_talk.pdf

[41] G.I. Allen, L. Grosenick, J. Taylor. "A Generalized Least-Square Matrix Decomposition." Journal of the American Statistical Association 109(505), p.145-159, 2014. https://doi.org/10.1080/01621459.2013.852978

[42] G.I. Allen, M. Weylandt. "MoMA: A General Algorithm for Modern Multivariate Analysis." (In Preparation)

[43] M. Udell, C. Horn, R. Zadeh S. Boyd. "Generalized Low Rank Models." Foundations and Trends in Machine Learning 9(1), p.1-118, 2016. https://doi.org/10.1561/2200000055

[44] M. Bogdan, Ewout van den Berg, C. Sabatti, W. Su, E.J. Candès. "SLOPE—adaptive variable selection via convex optimization." The annals of applied statistics 9.3, pp.1103-1140, 2015. https://doi.org/10.1214/15-aoas842

Provide feedback

Saved searches

Use saved searches to filter your results more quickly