-
Notifications
You must be signed in to change notification settings - Fork 8
MoMA: Modern Multivariate Analysis in R
Multivariate analysis -- the study of finding meaningful patterns in
datasets -- is a key technique in any data scientist's toolbox.
Beyond its use for Exploratory Data Analysis ("EDA"),
multivariate analysis also allows for principled Data-Driven Discovery:
finding meaningful, actionable, and reproducible structure in large data sets.
Classical techniques for multivariate analysis have proven immensely
successful through history, but modern Data-Driven Discovery requires
new techniques to account for the specific complexities of modern data.
We propose a new unified framework for Modern Multivariate Analysis ("MoMA"),
which will provide a unified and flexible baseline for future research in multivariate
analysis. Even more importantly, we anticipate that an easy-to-use R
package will increase adoption of these powerful new models by end
users and, in conjunction with R
's rich graphics libraries, position
R
as the leading platform for modern exploratory data analysis and
data-driven discovery.
Multivariate analysis techniques date back to the earliest days of statistics, pre-dating other foundational concepts like hypothesis testing by several decades. Classical techniques such as Principal Components Analysis ("PCA") [1, 2], Partial Least Squares ("PLS"), Canonical Correlation Analysis ("CCA") [3], and Linear Discriminant Analysis ("LDA"), have a long and distinguished history of use in statistics and are still among the most widely used methods for EDA. Their importance is reflected in the CRAN Task View dedicated to Multivariate Analysis [4], as well as the specialized implementations available for a range of application areas. Somewhat surprisingly, each of these techniques can be interpreted as a variant of the well-studied eigendecomposition problem, allowing statisticians to build upon a rich mathematical and computational literature.
In the early 2000s, researchers noted that naive extensions of classical multivariate techniques to the high-dimensional setting produced unsatisfactory results, a finding later confirmed by advances in random matrix theory [5]. In response to these findings, multivariate analysis experienced a renaissance as researchers developed a wide array of new techniques to incorporate sparsity, smoothness, and other structure into classical techniques [6,7,8,9,10,11,12,13,14 among many others], resulting in a rich literature on "modern multivariate analysis." Around the same time, theoretical advances showed that these techniques avoided many of the pitfalls associated with naive extensions [15,16,17,18,19,20].
While this literature is vast, it relies on a single basic principle: it is essential to adapt classical techniques to account for known characteristics and complexities of the dataset at hand for multivariate analysis to succeed. For example, a neuroscientist investigating the brain's response to an external stimulus may expect a response which is simultaneously spatially smooth and sparse: spatially smooth because the brain processes related stimuli in well-localized areas (e.g., the visual cortex) and sparse because not all regions of the brain are used to respond to a given stimulus. Alternatively, a statistical factor model used to understand financial returns may be significantly improved by incorporating known industry sector data, motivating a form of group sparsity. A sociologist studying how pollution leads to higher levels of respiratory illnesses may combine spatial smoothness and sparsity (indicating "pockets" of activity) with a non-negativity constraint, knowing that pollution and illness have a positive effect.
To incorporate these different forms of prior knowledge into multivariate analysis,
a wide variety of algorithms and approaches have been proposed. In 2013, Allen
proposed a general framework that unified existing techniques for "modern" PCA,
as well as proposing a number of novel extensions [21]. The mentors' recently developed
MoMA
algorithm [42] builds on this work, allowing more forms of regularization
and structure, as well as supporting more forms of multivariate analysis.
The project MoMA
has started since last summer and we are about halfway through the project. So far we have implemented the core algorithm of MoMA
. Specifically, in addition to the options discussed in [21], the package includes a generalized algorithm with options for:
- L1 ("Lasso") [26], SCAD [27], and MCP [28] sparsity-inducing penalization
- Group sparsity [29]
- Ordered fusion using the "fused lasso" [30]
- Unordered fusion using a convex clustering penalty [31,32]
- Non-Negativity constraints
- SLOPE (sorted L1 norm) [44]
In this summer, we will focus on the last element of MoMA
: parameter selection. First, we provide greedy BIC selection scheme for PCA, PLS, CCA and LDA. We also expose a low level regularized rank-1 SVD function that takes a grid of parameters and returns the estimated eigenvectors for each of those values, which enables users to build their own problem-specific CV scheme. For UX consideration, we provide higher level wrappers for all these functionality. Second, we build visualization platforms using Shiny
. Although the automation of parameter selection is crucial, especially in our multi-parameter model, we also emphasize human intervention in such process. Visualizing how the level of penalization affects the estimators helps scientists gain more insights about the data. Last but not the least, we supplement MoMA
with detailed and thorough documentations.
Plain PCA is implemented in base R
's stats::prcomp
function. A
number of packages which perform either sparse [34] or smooth [35] PCA
can be found on CRAN
. The spls
[36] and RGCCA
packages [37]
implement sparse, but not smooth PLS and CCA respectively.
The PMA
package [38], based on [12], is the most similar in spirit
to our package, providing a unified approach to PCA and CCA, as well
as multi-way CCA, but it does not implement PLS or LDA. More
importantly, it does not implement the smoothing options,
orthogonalization, or unordered fusion (clustering) penalties that the
MoMA
algorithm provides.
-
User Interface and Tuning Parameter Selection (Estimated time: 5 Weeks)
Wrapper functions for PCA, PLS, CCA, and LDA will be implemented. Helper functions to select regularization parameters (either via cross-validation, BIC, or extended BIC [33]) will be provided.
-
Visualization (Estimated time: 5 weeks)
We will implement some basic all-purpose visualizations.
-
Documentation and Case Study (Estimated time: 3 Weeks)
Finally, the package requires extensive documentation and write two vignettes: one with a detailed example application to neuroscience, loosely based on Section 6 of [21] and one describing the implementation details of the package. At the end of the summer, time will be dedicated to preparing
MoMA
for its initial CRAN release.
Future development (not in scope for GSoC-2019) may incorporate the Generalized
PCA algorithm, with natural extensions to PLS and CCA [41], Multi-Way CCA,
missing value support (including imputation), or more general loss functions
such as those described in [43]. Tensor (higher-order array) support is of
interest to the mentors, but would require more comprehensive support for
higher-order arrays in Armadillo
or Eigen
.
Multivariate Analysis techniques are indispensable for true Data Driven Discovery,
but a unified and easy-to-use framework has been lacking to date. Individual
packages allow for certain special cases handled by MoMA
, but for advanced
cases, no standard packaged solution is available. The MoMA
package will
provide the first unified toolbox for all forms of high-dimensional
multivariate analysis in R
or any other language. MoMA
will empower
statisticians and data scientists to flexibly find patterns that respect the
specific structure of their data and allow for truly Modern Multivariate Analysis.
-
Genevera Allen ([email protected]): Theory and Algorithms
Departments of Statistics, CS, and ECE, Rice University
Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children's Hospital
-
Michael Weylandt ([email protected]): Implementation
Department of Statistics, Rice University
- Convex Optimization
- Coding with
R
,C++
,Rcpp
andShiny
.
Potential applicants should:
- Implement greedy BIC scheme for PCA;
- Wrap their implementation using
Rcpp
; - Test their implementation using
testthat
; - Package their implementation and pass
R CMD check
on at least two of the three major platforms: Windows, MacOS, and Linux (Debian/Ubuntu).
Students, please add a link to your solutions below. Mentors will check that the package passes R CMD check
without any WARNING(s)
or ERROR(s)
.
This part is not needed since the project is deployed on TravisCI already. However, deploying MoMA
on codecov
is of interest. We should also use a clang-format
file for C++
code and the styler
package for R
code formatting.
Weeks 1-2:
- Parameter selection via BIC
Weeks 3-5
- PCA and LDA wrappers
Week 6
- CCA and PLS wrappers
Weeks 7-8:
- PCA and CCA visualization
Weeks 9-10:
- LDA and PLS visulization
Weeks 11:
-
Function Level Documentation
-
Algorithm + Timing Vignette
Week 12:
- Neuroscience Vignette
Week 13:
-
GitHub pages +
pkgdown
documentation -
"Polishing" as needed
[1] K. Pearson. "On Lines and Planes of Closest Fit to Systems of Points in Space." The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, p.559-572, 1901. https://doi.org/10.1080/14786440109462720
[2] H. Hotelling. Analysis of a Complex of Statistical Variables into Principal Components. Journal of Educational Psychology 24(6), p.417-441, 1933. http://dx.doi.org/10.1037/h0071325
[3] H. Hotelling. "Relations Between Two Sets of Variates" Biometrika 28(3-4), p.321-377, 1936. https://doi.org/10.1093/biomet/28.3-4.321
[4] See CRAN Task View: Multivariate Statistics
[5] I. Johnstone, A. Lu. "On Consistency and Sparsity for Principal Components Analysis in High Dimensions." Journal of the American Statistical Association: Theory and Methods 104(486), p.682-693, 2009. https://doi.org/10.1198/jasa.2009.0121
[6] B. Silverman. "Smoothed Functional Frincipal Fomponents Analysis by Choice of Norm." Annals of Statistics 24(1), p.1-24, 1996. https://projecteuclid.org/euclid.aos/1033066196
[7] J. Huang, H. Shen, A. Buja. "Functional Principal Components Analysis via Penalized Rank One Approximation." Electronic Journal of Statistics 2, p.678-695, 2008. https://projecteuclid.org/euclid.ejs/1217450800
[8] I.T. Jolliffe, N.T. Trendafilov, M. Uddin. "A Modified Principal Component Technique Based on the Lasso." Journal of Computational and Graphical Statistics 12(3), p.531-547, 2003. https://doi.org/10.1198/1061860032148
[9] H. Zou, and T. Hastie, and R. Tibshirani. "Sparse Principal Component Analysis." Journal of Computational and Graphical Statistics 15(2), p.265-286, 2006. https://doi.org/10.1198/106186006X113430
[10] A. d'Aspremont, L. El Gahoui, M.I. Jordan, G.R.G. Lanckriet. "A Direct Formulation for Sparse PCA Using Semidefinite Programming." SIAM Review 49(3), p.434-448, 2007. https://doi.org/10.1137/050645506
[11] A. d'Aspremont, F. Bach, L. El Gahoui. "Optimal Solutions for Sparse Principal Component Analysis." Journal of Machine Learning Research 9, p.1269-1294, 2008. http://www.jmlr.org/papers/v9/aspremont08a.htm
[12] D. Witten, R. Tibshirani, T. Hastie. "A Penalized Matrix Decomposition, with Applications to Sparse Principal Components and Canonical Correlation Analysis." Biostatistics 10(3), p.515-534, 2009. https://doi.org/10.1093/biostatistics/kxp008
[13] R. Jenatton, G. Obozinski. F. Bach. "Structured Sparse Principal Component Analysis." Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010. http://proceedings.mlr.press/v9/jenatton10a.html
[14] G.I. Allen, M. Maletic-Savatic. "Sparse Non-Negative Generalized PCA with Applications to Metabolomics." Bioinformatics 27(21), p.3029-3035, 2011. https://doi.org/10.1093/bioinformatics/btr522
[15] A.A. Amini, M.J. Wainwright. "High-Dimensional Analysis of Semidefinite Relaxations for Sparse Principal Components." Annals of Statistics 37(5B), p.2877-2921, 2009. https://projecteuclid.org/euclid.aos/1247836672
[16] S. Jung, J.S. Marron. "PCA Consistency in High-Dimension, Low Sample Size Context." Annals of Statistics 37(6B), p.4104-4130, 2009. https://projecteuclid.org/euclid.aos/1256303538
[17] Z. Ma. "Sparse Principal Component Analysis and Iterative Thresholding." Annals of Statistics 41(2), p.772-801, 2013. https://projecteuclid.org/euclid.aos/1368018173
[18] T.T. Cai, Z. Ma, Y. Wu. "Sparse PCA: Optimal Rates and Adaptive Estimation." Annals of Statistics 41(6), p.3074-3110, 2013. https://projecteuclid.org/euclid.aos/1388545679
[19] V.Q. Vu, J. Lei. "Minimax Sparse Principal Subspace Estimation in High Dimensions." Annals of Statistics 41(6), p.2905-2947, 2013. https://projecteuclid.org/euclid.aos/1388545673
[20] D. Shen, H. Shen, J.S. Marron. "Consistency of Sparse PCA in High Dimension, Low Sample Size Contexts." Journal of Multivariate Analysis 115, p.317-333, 2013. https://doi.org/10.1016/j.jmva.2012.10.007
[21] G.I. Allen. "Sparse and Functional Principal Components Analysis." ArXiv Pre-Print 1309.2895 (2013). https://arxiv.org/abs/1309.2895
[22] D. Eddelbuettel, R. Francois. "Rcpp
: Seamless R
and C++
Integration."
Journal of Statistical Software 40(8), p.1-18, 2011. https://doi.org/10.18637/jss.v040.i08
[23] D. Eddelbuettel, C. Sanderson. "RcppArmadillo
: Accelerating R
with High-Performance
C++
Linear Algebra." Computational Statistics and Data Analysis 71, p.1054-1063,
2014. https://doi.org/10.1016/j.csda.2013.02.005
[24] C. Sanderson, R. Curtin. "Armadillo
: A Template-Based C++
Library for
Linear Algebra." Journal of Open Source Software 1(2), p.26, 2016.
https://doi.org/10.21105/joss.00026
[25] A. Beck, M. Teboulle. "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems." SIAM Journal on Imaging Sciences 2(1), p.183-202, 2009. https://doi.org/10.1137/080716542
[26] R. Tibshirani. "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B. 58(1), p.267-288, 1996. http://www.jstor.org/stable/2346178
[27] J. Fan, R. Li. "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties." Journal of the American Statistical Association 96(456), p.1348-1360, 2011. https://doi.org/10.1198/016214501753382273
[28] C.-H. Zhang. "Nearly Unbiased Variable Selection Uner Minimax Convex Penalty." Annals of Statistics 38(2), p.894-942, 2010. https://projecteuclid.org/euclid.aos/1266586618
[29] M. Yuan, "Model Selection and Estimation in Regression with Grouped Variables." Journal of the Royal Statistical Society, Series B 68(1), p.49-67, 2006. https://doi.org/10.1111/j.1467-9868.2005.00532.x
[30] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, K. Knight. "Sparsity and Smoothness via the Fused Lasso." Journal of the Royal Statistical Society, Series B 67(1), p. 91-108, 2005. https://doi.org/10.1111/j.1467-9868.2005.00490.x
[31] T.D. Hocking, A. Joulin, F. Bach, J.-P. Vert. "ClusterPath: An Algorithm for Clustering using Convex Fusion Penalties." Proceedings of the 28th International Conference on Machine Learning (ICML) 2011.
[32] E.C. Chi, K. Lange. "Splitting Methods for Convex Clustering." Journal of Computational and Graphical Statistics 24(4), p.994-1013, 2015. https://doi.org/10.1080/10618600.2014.948181
[33] J. Chen, Z. Chen. "Extended Bayesian Information Criteria for Model Selection with Large Model Spaces." Biometrika 95(3), p.759-771, 2008. https://doi.org/10.1093/biomet/asn034
[34] For example, the elasticnet
,
nsprcomp
, and
rospca
packages.
[35] For example, the fda
,
fdapace
, and
fdasrvf
packages.
[36] Available on CRAN: spls
[37] Available on CRAN: RGCCA
[38] Available on CRAN: PMA
[39] http://www.stat.rice.edu/~gallen/software/sfpca/
[40] Slide 25 of http://www.stat.rice.edu/~gallen/gallen_sfpca_talk.pdf
[41] G.I. Allen, L. Grosenick, J. Taylor. "A Generalized Least-Square Matrix Decomposition." Journal of the American Statistical Association 109(505), p.145-159, 2014. https://doi.org/10.1080/01621459.2013.852978
[42] G.I. Allen, M. Weylandt. "MoMA
: A General Algorithm for Modern Multivariate
Analysis." (In Preparation)
[43] M. Udell, C. Horn, R. Zadeh S. Boyd. "Generalized Low Rank Models." Foundations and Trends in Machine Learning 9(1), p.1-118, 2016. https://doi.org/10.1561/2200000055
[44] M. Bogdan, Ewout van den Berg, C. Sabatti, W. Su, E.J. Candès. "SLOPE—adaptive variable selection via convex optimization." The annals of applied statistics 9.3, pp.1103-1140, 2015. https://doi.org/10.1214/15-aoas842