Sampling correlation matrices

Overview

Sampling correlation matrices is useful in many applications in statistics and bioinformatics, and it is a fundamental problem in many Bayesian models. This coding project considers sampling from a log-concave distribution restricted to the set of correlation matrices. The aim is to develop efficient open-source software that extends the volesti package. The student will implement these methods in C++, perform an extended empirical comparison with existing software, and report on the results.

Sampling correlation matrices is a relatively difficult problem because of three constraints imposed on a square matrix: positive semidefiniteness (the matrix is symmetric with non-negative eigenvalues), unit diagonal elements, and off-diagonal elements bounded in [-1, 1]. The project relies on the geometric representation of correlation matrices in [1] and on the Markov Chain Monte Carlo methods implemented in volesti for sampling from a multivariate truncated distribution.
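In symbols, the constraints can be written as below. The name $\mathcal{K}_n$ is introduced here only for illustration. The bound on the off-diagonal elements follows from the other two constraints, because every 2x2 principal minor of a positive semidefinite matrix is non-negative.

```latex
\mathcal{K}_n = \{\, X \in \mathbb{R}^{n \times n} : X = X^\top,\ X \succeq 0,\ X_{ii} = 1 \ (i = 1, \dots, n) \,\},
\qquad
X_{ii} X_{jj} - X_{ij}^2 \ge 0 \ \Longrightarrow\ |X_{ij}| \le 1 .
```

A correlation matrix is therefore determined by its n(n-1)/2 entries above the diagonal.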

Related work

The simplest method for constructing a correlation matrix is rejection sampling: generate the correlation coefficients as uniform random variables in the closed interval [-1, 1], check whether the resulting matrix is positive definite, and, if it is not, generate another candidate. For high-dimensional problems, where rejection sampling becomes impractical, several other techniques exist. In [2, 3, 4] the authors generate correlation matrices with a predetermined spectrum (set of eigenvalues). In [5] the authors generate correlation matrices with a given mean value, structure, or eigenvalues. In [6, 7] the authors generate correlation matrices with MCMC algorithms, sampling from the posterior distribution in certain Bayesian models.
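The rejection scheme can be sketched in a few lines of C++. This is a minimal illustration under stated assumptions, not code from volesti: the `Matrix` alias and the function names `is_positive_definite` and `sample_correlation_rejection` are hypothetical, and positive definiteness is tested with a hand-written Cholesky factorization so that the example needs only the standard library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Dense row-major matrix; a plain-STL stand-in for a real matrix type
// (assumption made for this sketch).
using Matrix = std::vector<std::vector<double>>;

// Returns true if the symmetric matrix A is positive definite, by attempting
// a Cholesky factorization A = L L^T; a non-positive pivot means failure.
bool is_positive_definite(const Matrix& A) {
    const std::size_t n = A.size();
    Matrix L(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j <= i; ++j) {
            double s = A[i][j];
            for (std::size_t k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            if (i == j) {
                if (s <= 0.0) return false;
                L[i][i] = std::sqrt(s);
            } else {
                L[i][j] = s / L[j][j];
            }
        }
    }
    return true;
}

// Hypothetical helper: draws every off-diagonal entry uniformly from [-1, 1)
// and retries until the candidate is positive definite.
// If `trials` is non-null it receives the number of candidates generated.
Matrix sample_correlation_rejection(std::size_t n, std::mt19937_64& rng,
                                    std::size_t* trials = nullptr) {
    assert(n >= 1);
    std::uniform_real_distribution<double> unif(-1.0, 1.0);
    std::size_t count = 0;
    for (;;) {
        ++count;
        Matrix C(n, std::vector<double>(n, 0.0));
        for (std::size_t i = 0; i < n; ++i) {
            C[i][i] = 1.0;  // unit diagonal
            for (std::size_t j = i + 1; j < n; ++j) {
                C[i][j] = C[j][i] = unif(rng);  // keep the matrix symmetric
            }
        }
        if (is_positive_definite(C)) {
            if (trials) *trials = count;
            return C;
        }
    }
}
```

A caller would construct a generator such as `std::mt19937_64 rng(42);` and call `sample_correlation_rejection(4, rng, &trials)`. The number of rejected candidates grows quickly with n, which is why the method is only practical for small matrices.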

Details of your coding project

The student will implement in C++ a class for the convex body that represents the set of correlation matrices. In that class, she/he has to implement the membership, boundary, and reflection oracles that the random walks in volesti require to operate (see [1] for more details). Moreover, the student will have to perform experiments to justify the efficiency of the implementation.

MATLAB prototypes for the implementations will be given to the student.
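To make the role of the oracles concrete, here is a minimal sketch of a membership oracle and a boundary oracle. It rests on several assumptions: the class `CorrelationBody` and its method names are hypothetical and do not reproduce the interface volesti expects, a point is stored as the n(n-1)/2 entries above the diagonal, and the boundary is located by bisection on the membership test as a simple stand-in for the method described in [1]. Bisection is valid here because the body is convex, so the feasible steps along a ray form an interval. The reflection oracle is not covered.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch class (not volesti's interface). A point x holds the
// n(n-1)/2 entries above the diagonal, stored row by row.
class CorrelationBody {
public:
    explicit CorrelationBody(std::size_t n) : n_(n) { assert(n >= 2); }

    std::size_t dimension() const { return n_ * (n_ - 1) / 2; }

    // Membership oracle: true iff the matrix built from x is positive
    // definite, i.e. x lies in the interior of the body.
    bool is_in(const std::vector<double>& x) const {
        assert(x.size() == dimension());
        // Assemble the symmetric matrix with unit diagonal.
        std::vector<double> A(n_ * n_, 0.0);
        std::size_t k = 0;
        for (std::size_t i = 0; i < n_; ++i) {
            A[i * n_ + i] = 1.0;
            for (std::size_t j = i + 1; j < n_; ++j, ++k) {
                A[i * n_ + j] = A[j * n_ + i] = x[k];
            }
        }
        // In-place Cholesky on the lower triangle; a non-positive pivot
        // means the matrix is not positive definite.
        for (std::size_t i = 0; i < n_; ++i) {
            for (std::size_t j = 0; j <= i; ++j) {
                double s = A[i * n_ + j];
                for (std::size_t m = 0; m < j; ++m) {
                    s -= A[i * n_ + m] * A[j * n_ + m];
                }
                if (i == j) {
                    if (s <= 0.0) return false;
                    A[i * n_ + i] = std::sqrt(s);
                } else {
                    A[i * n_ + j] = s / A[j * n_ + j];
                }
            }
        }
        return true;
    }

    // Boundary oracle: approximates the largest t >= 0 such that p + t*v is
    // still inside, for an interior point p and a nonzero direction v.
    double boundary_step(const std::vector<double>& p,
                         const std::vector<double>& v,
                         double tol = 1e-10) const {
        assert(p.size() == dimension() && v.size() == dimension());
        assert(is_in(p));
        // Every coordinate must stay in [-1, 1]; this bounds the step.
        double hi = -1.0;  // sentinel: no bound found yet
        for (std::size_t k = 0; k < p.size(); ++k) {
            if (v[k] == 0.0) continue;
            const double t = ((v[k] > 0.0 ? 1.0 : -1.0) - p[k]) / v[k];
            hi = (hi < 0.0) ? t : std::min(hi, t);
        }
        assert(hi >= 0.0);  // fails only if v is the zero vector
        double lo = 0.0;
        std::vector<double> q(p.size());
        while (hi - lo > tol) {
            const double mid = 0.5 * (lo + hi);
            for (std::size_t k = 0; k < p.size(); ++k) q[k] = p[k] + mid * v[k];
            if (is_in(q)) lo = mid; else hi = mid;
        }
        return lo;  // last step known to be inside
    }

private:
    std::size_t n_;
};
```

For n = 3, starting from the identity matrix (all coordinates zero) and moving along (-1, -1, -1), the matrix has eigenvalues 1 + t and 1 - 2t, so the step returned is close to 0.5. Each bisection step costs one Cholesky factorization, so an eigenvalue-based boundary computation is likely to be preferable in an efficient implementation.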

Difficulty: Medium

Skills

Required: C++, Probability theory, Basic applied math background
Preferred: Experience with statistical or other mathematical software

Expected impact

The project will be a very useful addition to the volesti package. It will contribute crucially to the implementation of efficient Bayesian models for learning the covariance matrix and for fitting a copula to given data.

[1] Efficient Bayesian inference of systemic risk interlinkages, V. Arakelian and A. Chalkis (2021).
[2] Generation of correlation matrices with a given eigenstructure, C. Chalmers (1975).
[3] Population correlation matrices for sampling experiments, R. B. Bendel and M. R. Mickey (1978).
[4] Generating correlation matrices with specified eigenvalues using the method of alternating projections, N. G. Waller (2020).
[5] Generating correlation matrices, G. Marsaglia and I. Olkin (1984).
[6] Efficient estimation of covariance selection models, F. Wong, C. K. Carter, and R. Kohn (2003).
[7] Efficient Bayesian inference for Gaussian copula regression models, M. Pitt, D. Chan, and R. Kohn (2006).

Mentors

Apostolos Chalkis <tolis.chal at gmail.com> is a PhD student in Computer Science. His research focuses on mathematical computing, optimization, and computational finance. He participated in GSoC 2018 and 2019 as a student with the R project organization, implementing state-of-the-art algorithms for sampling from high-dimensional multivariate distributions. He was a GSoC mentor for three projects with GeomScale (2020). He is one of the authors of volesti.
Veni Arakelian is an academic expert in macro-finance and financial econometrics. She holds a Ph.D. in econometrics from Athens University of Economics and Business. She is skilled in econometrics, macroeconomics, and machine learning techniques. She has more than fifteen years of experience in research, teaching, and working for the banking sector. She is the author and co-author of several publications in leading scientific journals in finance, statistics, and econometrics.

Students, please contact the mentors after completing at least one of the tests below.

Tests

Students, please do one or more of the following tests before contacting the mentors above.

Easy: Implement the acceptance-rejection method in C++ to sample uniformly distributed correlation matrices, and create a PR to add the code to volesti.
Medium: Compile and run volesti. Create a C++ method that generates the matrices in the Linear Matrix Inequality of the spectrahedron that represents the set of n x n correlation matrices.
Hard: Use the generator from the Medium test and the existing C++ spectrahedron class in volesti to sample uniformly distributed correlation matrices with Billiard Walk. Report the value of n at which the sampler becomes inefficient (e.g. takes more than 3 minutes to sample 1000 matrices).
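
As a starting point for the Medium test, one way to write the set of n x n correlation matrices as a Linear Matrix Inequality is A_0 + x_1 A_1 + ... + x_d A_d >= 0 (positive semidefinite), with d = n(n-1)/2, A_0 the identity, and each A_k holding ones at a single symmetric pair of off-diagonal positions. The sketch below builds these matrices with the standard library only. The names `correlation_lmi` and `evaluate_lmi` are hypothetical, the sign convention is an assumption, and the code does not use volesti's matrix or LMI types; a library that expects the opposite inequality would need the negated matrices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dense row-major matrix; a plain-STL stand-in (assumption for this sketch).
using Matrix = std::vector<std::vector<double>>;

// Returns {A_0, A_1, ..., A_d} with d = n(n-1)/2, such that
// A_0 + sum_k x_k A_k is the symmetric matrix with unit diagonal whose
// entries above the diagonal are x_1, ..., x_d, taken row by row.
std::vector<Matrix> correlation_lmi(std::size_t n) {
    assert(n >= 2);
    std::vector<Matrix> lmi;
    lmi.reserve(1 + n * (n - 1) / 2);

    // A_0 is the identity: it fixes the unit diagonal.
    Matrix A0(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) A0[i][i] = 1.0;
    lmi.push_back(A0);

    // One matrix per pair i < j, with ones at positions (i, j) and (j, i).
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            Matrix Ak(n, std::vector<double>(n, 0.0));
            Ak[i][j] = Ak[j][i] = 1.0;
            lmi.push_back(Ak);
        }
    }
    return lmi;
}

// Evaluates A(x) = A_0 + sum_k x_k A_k for a coefficient vector x.
Matrix evaluate_lmi(const std::vector<Matrix>& lmi,
                    const std::vector<double>& x) {
    assert(!lmi.empty() && x.size() + 1 == lmi.size());
    Matrix A = lmi[0];
    const std::size_t n = A.size();
    for (std::size_t k = 0; k < x.size(); ++k) {
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                A[i][j] += x[k] * lmi[k + 1][i][j];
            }
        }
    }
    return A;
}
```

For n = 3 the generator returns four matrices, and `evaluate_lmi(correlation_lmi(3), {0.1, 0.2, 0.3})` yields the symmetric matrix with unit diagonal and entries 0.1, 0.2, 0.3 at positions (1,2), (1,3), (2,3).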

For tips and references contact the Mentors!