
ProjectPCA


classify_project_pca

Purpose

PCA (Principal Component Analysis) is a linear mapping technique intended to map a set of high-dimensional input data into a lower-dimensional space. Using this program you can project the data set onto a predefined number of principal components calculated by the PCA program.
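
The mapping itself is a linear projection: each input vector is multiplied by the matrix whose columns are the leading eigenvectors of the data covariance. The following NumPy sketch only illustrates that operation; it is not the program's internal code, and the mean-centering step is an assumption here:


import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # 1000 input vectors of dimension 3
cov = np.cov(X, rowvar=False)             # 3x3 covariance matrix
eigvals, W = np.linalg.eigh(cov)          # eigen-decomposition
order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
eigvals, W = eigvals[order], W[:, order]

k = 2
Xc = X - X.mean(axis=0)                   # mean-centering (assumed)
Y = Xc @ W[:, :k]                         # project onto the first k components
print(Y.shape)                            # (1000, 2)
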

Usage


$ classify_project_pca ...


Parameters

  • `-i` The input data file (raw file). It should be a text file in which each row represents a data item and each column a variable (a small reader sketch is given after this parameter list). It should have the following format:

 3 1000
 12 34 54
 -12 45 76
 ...
 32 45 76

The first line indicates the dimension of the vectors (in this case 3) and the number of vectors (in this case 1000). Please note that vector components (variables) are separated by spaces. Additionally, an extra last column can be used as a label for the vector. Example:

 3 1000
 12 34 54     labelA
 -12 45 76    labelB
 ...
 32 45 76     labelN
  • `-ein` The input eigen vectors file generated by the PCA program.

  • `-evin` The input eigen values file generated by the PCA program.

  • `-o` The base name for the generated output files. projectPCA produces one output file, basename.dat, where the mapped input vectors are stored. The generated data follows the same format as the input data. Example:

 2 1000
 0.4 0.2 labelA
 -0.1 0.3 labelB
 ...
 0.2 0.5 labelN

The first line indicates the dimension of the vectors (in this case 2, depending on the -k parameter) and the number of vectors (in this case 1000). The remaining lines are the mapped input vectors; the number of output vectors equals the number of input vectors.

  • `-k` The number of components of the projected subspace. If you use this option, the projectPCA program will produce a k-dimensional dataset as output.
  • `-p` If you select this option, the number of components of the projected subspace is calculated automatically so that the resulting components explain at least p% of the variance. NOTE: -p and -k cannot be used at the same time; use only one of them.
  • `` Reconstruct the original data using the first k principal components. The output file will have a .recon extension.
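
For reference, here is a hypothetical reader for the text format described above (header line with the dimension and the number of vectors, one whitespace-separated vector per row, optional trailing label). It is only a sketch of the format, not part of the program:


def read_dataset(path):
    """Read a dataset in the text format described above."""
    with open(path) as fh:
        dim, nvec = (int(t) for t in fh.readline().split())
        vectors, labels = [], []
        for _ in range(nvec):
            tokens = fh.readline().split()
            vectors.append([float(t) for t in tokens[:dim]])
            labels.append(tokens[dim] if len(tokens) > dim else None)
    return vectors, labels
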

Examples and notes

Example 1: Map a set of data stored in the "test.dat" file onto 10 components generated by the PCA program


$ classify_project_pca -i test.dat -o out -ein pcaTest.evec -evin pcaTest.eval -k 10


In this case the parameters take the following values:


Input data file : test.dat
Input eigen vector file : pcaTest.evec
Input eigen values  file : pcaTest.eval
Output file : out.dat
Algorithm information output file : out.inf
Output space dimension = 10


A 10-dimensional output file (out.dat) is generated, storing the newly mapped data. The algorithm information file is stored in out.inf. This file contains the percentage of the variance explained by these 10 components.
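
The percentage reported in out.inf can also be reproduced from the eigenvalues alone. A minimal sketch, assuming the .eval file simply lists one eigenvalue per line in decreasing order (an assumption about its layout):


import numpy as np

eigvals = np.loadtxt("pcaTest.eval")      # assumed: one eigenvalue per line, decreasing
k = 10
explained = 100.0 * eigvals[:k].sum() / eigvals.sum()
print(f"first {k} components explain {explained:.2f}% of the variance")
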

Example 2: Map a set of data stored in the "test.dat" file onto a number of components generated by the PCA program such that those components explain 99.9 percent of the variance of the input data


$ classify_project_pca -i test.dat -o out -ein pcaTest.evec -evin pcaTest.eval -p 99.9


In this case the parameters take the following values:


Input data file : test.dat
Input eigen vector file : pcaTest.evec
Input eigen values  file : pcaTest.eval
Output file : out.dat
Algorithm information output file : out.inf
Percent of explained variance = 99.9


A k-dimensional output file (out.dat) is generated, storing the newly mapped data. The value of k is calculated automatically so that the selected components explain at least 99.9 percent of the variance. The algorithm information file is stored in out.inf.
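
The selection rule behind -p can be sketched as: choose the smallest k whose leading eigenvalues account for at least p percent of the total variance. A minimal illustration, assuming the eigenvalues are available as a 1-D array sorted in decreasing order:


import numpy as np

def components_for_variance(eigvals, p):
    # smallest k such that the first k eigenvalues explain at least p% of the variance
    cumulative = np.cumsum(eigvals) / eigvals.sum() * 100.0
    return int(np.searchsorted(cumulative, p) + 1)

eigvals = np.array([5.0, 3.0, 1.5, 0.4, 0.1])   # toy eigenvalues, decreasing
print(components_for_variance(eigvals, 99.9))    # -> 5
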

--Main.AlfredoSolano - 26 Jan 2007
