Skip to content
wrathematics edited this page Oct 16, 2012 · 2 revisions

The R package pbdDMAT is a collection of methods (in the oop sense) which offer high-level syntax for performing numeric computations with a distributed matrix object, but with low-level programming speed. As such, much of the focus is on linear algebra.

The methods are designed to very closely resemble native R syntax. Assuming you can get your data read into R in this special class, called ddmatrix, then really most of the hard work is over. The table below illustrates some examples of code use. On the left is native R syntax for an ordinary matrix, and on the right is pbdDMAT syntax for a distributed matrix.

Matrix Distributed Matrix
Logarithm log(x) log(dx)
Add x + y dx + dy
Multiply x %*% y dx %*% dy
Cholesky chol(x) chol(dx)
Var-Cov Matrix cov(x) cov(dx)
PCA prcomp(x) prcomp(dx)

Note that the only difference here is in our naming convention (there is no syntactic or semantic meaning in prefixing the object name with a "d"; it is done purely for demonstration purposes).

So clearly the real trick here is getting the data into our special distributed matrix class. For people working on smaller datasets but who still wish to have the advantages of parallelism with a handful of cores, it is probably sufficient to read the data into an R matrix on one core and call a special distributor function to convert to a distributed matrix. For many, this could be as simple as:

if (comm.rank() == 0){ # first processor only
    x <- data.matrix(read.table("myfile"))
} else {
    x <- NULL
}

dx <- as.ddmatrix(x)

When you are done, you can even convert back to an ordinary R matrix via the command as.matrix(dx).

For those working at a large scale with a parallel file system, other techniques are necessary. In the future, after getting a better sense for the kinds of parallel readers desired by the community, we hope to develop some packages which can utilize these parallel readers.

Clone this wiki locally