Commit

fixing URLs in vignette
pbreheny committed Oct 6, 2024
1 parent db94e95 commit 9ba4511
Showing 1 changed file with 5 additions and 8 deletions.
13 changes: 5 additions & 8 deletions vignettes/getting-started.Rmd
@@ -25,7 +25,7 @@ library(plmmr)

`plmmr` is a package for fitting **P**enalized **L**inear **M**ixed **M**odels in **R**. This package was created for the purpose of fitting penalized regression models to high dimensional data in which the observations are correlated in some complex way. For instance, this kind of data arises often in the context of genetics (*e.g.*, GWAS dealing with population structure/family structure).

For more information on the theoretical context of penalized regression, check out [these online course notes](https://myweb.uiowa.edu/pbreheny/7240/s23/notes.html) from Prof. [Patrick Breheny](https://myweb.uiowa.edu/pbreheny/) (the PI of our lab). The section on 'ridge regression' may be a good place to start.
For more information on the theoretical context of penalized regression, check out [these online course notes](https://myweb.uiowa.edu/pbreheny/7240/s23/notes.html); the section on ridge regression may be a good place to start.

The novelties of `plmmr` are:

@@ -38,13 +38,13 @@ The novelties of `plmmr` are:
## Computational capability

### File-backing
In many applications of high dimensional data analysis, the dataset is too large to read into R -- the session will crash for lack of memory. This is almost always an issue with analyzing genetic data from [PLINK](https://www.cog-genomics.org/plink/) files. To analyze such large datasets, `plmmr` is equipped to analyze data using *filebacking* - a strategy that lets R 'point' to a file on disk, rather than reading the file into the R session. Many other packages use this technique - [bigstatsr](https://privefl.github.io/bigstatsr/) and [biglasso](https://github.com/pbreheny/biglasso) are two examples of packages that use the filebacking technique. The package that `plmmr` uses to create and store filebacked objects is [bigmemory](https://cran.r-project.org/web/packages/bigmemory/bigmemory.pdf). The filebacked computation relies on the `biglasso` [package](https://github.com/pbreheny/biglasso) by [Yaohui Zeng](https://scholar.google.com/citations?user=jpEmf04AAAAJ&hl=en) et al. and on [bigalgebra](https://cran.r-project.org/web/packages/bigalgebra/bigalgebra.pdf) by Michael Kane et al. For processing PLINK files, we use methods from the `bigsnpr` [package](https://privefl.github.io/bigsnpr/) by [Florian Privé](https://privefl.github.io/).
In many applications of high dimensional data analysis, the dataset is too large to read into R -- the session will crash for lack of memory. This is particularly common when analyzing data from genome-wide association studies (GWAS). To analyze such large datasets, `plmmr` is equipped to analyze data using *filebacking* - a strategy that lets R 'point' to a file on disk, rather than reading the file into the R session. Many other packages use this technique - [bigstatsr](https://privefl.github.io/bigstatsr/) and [biglasso](https://pbreheny.github.io/biglasso/) are two examples of packages that use the filebacking technique. The package that `plmmr` uses to create and store filebacked objects is [bigmemory](https://CRAN.R-project.org/package=bigmemory). The filebacked computation relies on the [biglasso](https://pbreheny.github.io/biglasso/) package by [Yaohui Zeng](https://scholar.google.com/citations?user=jpEmf04AAAAJ&hl=en) et al. and on [bigalgebra](https://CRAN.R-project.org/package=bigalgebra) by Michael Kane et al. For processing [PLINK](https://www.cog-genomics.org/plink/) files, we use methods from the `bigsnpr` [package](https://privefl.github.io/bigsnpr/) by [Florian Privé](https://privefl.github.io/).
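
As a rough illustration of the filebacking idea, the sketch below uses `bigmemory` directly to create a matrix whose values live in a file on disk, and then re-attaches it through its descriptor file. The directory, file names, and dimensions are hypothetical, chosen only for this example; this is not necessarily how `plmmr` creates its filebacked objects internally.

```r
library(bigmemory)

set.seed(1)

# Hypothetical location for the backing files (illustration only)
backing_dir <- tempdir()

# Create a file-backed matrix: values are stored in "X.bk" on disk,
# and "X.desc" records how to find and interpret that file
X <- filebacked.big.matrix(
  nrow = 1000, ncol = 50, type = "double",
  backingfile = "X.bk",
  descriptorfile = "X.desc",
  backingpath = backing_dir
)

# Writing to the object writes to the backing file
X[, 1] <- rnorm(1000)

# Re-attach through the descriptor file: R only 'points' to the data,
# so nothing is read into memory until elements are requested
X2 <- attach.big.matrix(file.path(backing_dir, "X.desc"))
dim(X2)
```

Only the slices that are explicitly indexed (for example, `X2[, 1]`) are pulled into the R session, which is what makes this approach workable for data too large to hold in memory.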

### Numeric outcomes only
At this time, the package is designed for linear regression only -- that is, we are considering only continuous (numeric) outcomes. We maintain that treating binary outcomes as numeric values is appropriate in some contexts, as described by Hastie et al. in the [Elements of Statistical Learning](https://hastie.su.domains/Papers/ESLII.pdf), chapter 4. In the future, we would like to extend this package to handle dichotomous outcomes via logistic regression; the theoretical work underlying this is an open problem.
At this time, the package is designed for linear regression only -- that is, we are considering only continuous (numeric) outcomes. We maintain that treating binary outcomes as numeric values is appropriate in some contexts, as described by Hastie et al. in the [Elements of Statistical Learning](https://hastie.su.domains/ElemStatLearn/), chapter 4. In the future, we would like to extend this package to handle dichotomous outcomes via logistic regression; the theoretical work underlying this is an open problem.

### 3 types of penalization
Since we are focused on penalized regression in this package, `plmmr` offers 3 choices of penalty: the minimax concave (MCP), the smoothly clipped absolute deviation (SCAD), and the least absolute shrinkage and selection operator (LASSO). The implementation of these penalties is built on the concepts/techniques provided in the `ncvreg` [package](https://github.com/pbreheny/ncvreg) by Patrick Breheny.
Since we are focused on penalized regression in this package, `plmmr` offers 3 choices of penalty: the minimax concave (MCP), the smoothly clipped absolute deviation (SCAD), and the least absolute shrinkage and selection operator (LASSO). The implementation of these penalties is built on the concepts/techniques provided in the [ncvreg](https://pbreheny.github.io/ncvreg/) package.
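
For reference, the three penalties are commonly written as follows for a single coefficient $\beta$, with regularization parameter $\lambda > 0$ and an additional tuning parameter $\gamma$ ($\gamma > 1$ for MCP, $\gamma > 2$ for SCAD). The notation here is an assumption based on the general literature on these penalties; the parameterization used inside `plmmr` may differ.

$$
P_{\text{LASSO}}(\beta; \lambda) = \lambda |\beta|
$$

$$
P_{\text{MCP}}(\beta; \lambda, \gamma) =
\begin{cases}
\lambda |\beta| - \dfrac{\beta^2}{2\gamma}, & |\beta| \le \gamma\lambda, \\[1ex]
\dfrac{\gamma \lambda^2}{2}, & |\beta| > \gamma\lambda
\end{cases}
$$

$$
P_{\text{SCAD}}(\beta; \lambda, \gamma) =
\begin{cases}
\lambda |\beta|, & |\beta| \le \lambda, \\[1ex]
\dfrac{2\gamma\lambda |\beta| - \beta^2 - \lambda^2}{2(\gamma - 1)}, & \lambda < |\beta| \le \gamma\lambda, \\[1ex]
\dfrac{\lambda^2 (\gamma + 1)}{2}, & |\beta| > \gamma\lambda
\end{cases}
$$

All three behave like the lasso near zero; MCP and SCAD flatten out for large $|\beta|$, which reduces the shrinkage applied to large coefficients.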

### Data size and dimensionality

@@ -71,7 +71,4 @@ We distinguish between the data attributes 'big' and 'high dimensional.' 'Big' d

* The `penncath_lite` data is our example of PLINK input data. `penncath_lite` (data on coronary artery disease from the [PennCath study](https://pubmed.ncbi.nlm.nih.gov/21239051/)) is a high dimensional data set (1401 observations, 4217 SNPs) with several health outcomes as well as age and sex information. The features in this data set represent a small subset of a much larger GWAS data set (the original data has over 800K SNPs). For more information on this data set, refer to the [original publication](https://pubmed.ncbi.nlm.nih.gov/21239051/). An example analysis with the `penncath_lite` data is available in `vignette('plink_files', package = "plmmr")`.

* The `colon2` data is our example of delimited-file input data. `colon2` is a variation of the `colon` data included in the [`biglasso` package](https://github.com/pbreheny/biglasso). `colon2` has 62 observations and 2,001 features representing a study of colon disease. 2000 features are original to the data, and the 'sex' feature is simulated. An example analysis with the `colon2` data is available in `vignette('delim_files', package = "plmmr")`.



* The `colon2` data is our example of delimited-file input data. `colon2` is a variation of the `colon` data included in the [biglasso](https://pbreheny.github.io/biglasso/) package. `colon2` has 62 observations and 2,001 features representing a study of colon disease. 2000 features are original to the data, and the 'sex' feature is simulated. An example analysis with the `colon2` data is available in `vignette('delim_files', package = "plmmr")`.
