This package is deprecated. Please use TableTransforms.jl instead.
Utility package that provides end-user-friendly methods for feature scaling and polynomial basis expansion. Feature scaling works on `Matrix`, `Vector` and `DataFrame` inputs. Observations can be stored as either the rows or the columns of a matrix. To distinguish between these cases, pass the parameter `obsdim`, where `obsdim=1` corresponds to "observations as rows" and `obsdim=2` to "observations as columns". Transformations can be restricted to a subset of columns/rows by passing a vector `operate_on`.
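The two layouts can be illustrated without the package. The sketch below uses only the `Statistics` standard library; the data and variable names are made up for illustration, and the column selection only mimics the idea behind `operate_on`.

```julia
using Statistics

# Hypothetical data: 3 observations of 2 features, stored as rows (obsdim = 1).
X_rows = [1.0 10.0;
          2.0 20.0;
          3.0 30.0]

# The same data with observations stored as columns (obsdim = 2).
X_cols = permutedims(X_rows)

# Per-feature means: reduce over the observation dimension.
μ_rows = vec(mean(X_rows; dims=1))   # features are columns
μ_cols = vec(mean(X_cols; dims=2))   # features are rows

# Restrict the computation to feature 2 only, in the spirit of operate_on = [2].
operate_on = [2]
μ_subset = vec(mean(X_rows[:, operate_on]; dims=1))
```

Both layouts yield the same per-feature statistics; only the dimension being reduced changes.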
Standardizing a data set results in variables with a mean of 0 and a variance of 1. A common use case is to fit a `StandardScaler` to the training data and later apply the same transformation to the test data. `StandardScaler` is used with the functions `fit()`, `transform()` and `fit_transform()` as shown below.
fit(StandardScaler, X[, μ, σ; obsdim, operate_on])
fit_transform(StandardScaler, X[, μ, σ; obsdim, operate_on])
`X`: Data of type `Matrix` or `DataFrame`.

`μ`: Vector or scalar describing the translation. Defaults to `mean(X; dims=obsdim)`.

`σ`: Vector or scalar describing the scale. Defaults to `std(X; dims=obsdim)`.

`obsdim`: Specify which axis corresponds to observations. Defaults to `obsdim=2` (observations are columns of the matrix). For DataFrames `obsdim` has no effect and rescaling occurs column-wise.

`operate_on`: Specify the indices of the columns or rows to be standardized. Defaults to all columns/rows. For DataFrames this must be a vector of symbols, not indices. E.g. `operate_on=[1,3]` standardizes only the columns with index 1 and 3 (if `obsdim=1`, else rows 1 and 3).
Note on DataFrames:
Columns containing `missing` values are skipped.
Columns containing non-numeric elements are skipped.
Examples:
Xtrain = rand(100, 4)
Xtest = rand(10, 4)
x = rand(4)
Dtrain = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10])
Dtest = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10])
scaler = fit(StandardScaler, Xtrain)
scaler = fit(StandardScaler, Xtrain, obsdim=1)
scaler = fit(StandardScaler, Xtrain, obsdim=1, operate_on=[1,3])
transform(Xtest, scaler)
transform!(Xtest, scaler)
transform(x, scaler)
transform!(x, scaler)
scaler = fit(StandardScaler, Dtrain)
scaler = fit(StandardScaler, Dtrain, operate_on=[:A,:B])
transform(Dtest, scaler)
transform!(Dtest, scaler)
Xscaled, scaler = fit_transform(StandardScaler, Xtrain, obsdim=1, operate_on=[1,2,4])
scaler = fit_transform!(StandardScaler, Xtrain, obsdim=1, operate_on=[1,2,4])
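The fit-then-transform pattern can also be sketched in plain Julia to show what is being computed. The helpers `fit_standard` and `apply_standard` below are hypothetical names, not part of the package, and the sketch assumes observations are stored as rows.

```julia
using Statistics

# Hypothetical helpers, not the package API.
# Observations are rows here (the obsdim = 1 layout).
# Note: `std` uses the corrected (n - 1) estimator by default.
fit_standard(X) = (μ = mean(X; dims=1), σ = std(X; dims=1))
apply_standard(X, s) = (X .- s.μ) ./ s.σ

Xtrain = [1.0 2.0; 3.0 6.0; 5.0 10.0]
Xtest  = [3.0 6.0; 7.0 14.0]

s = fit_standard(Xtrain)             # statistics come from the training data only
Ztrain = apply_standard(Xtrain, s)   # columns now have mean 0 and unit variance
Ztest  = apply_standard(Xtest, s)    # same shift and scale reused on the test data
```

Reusing the training statistics on the test set means the test columns will generally not have a mean of exactly 0 or a variance of exactly 1.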
Note that for `transform!` the data matrix `X` must have an element type `<: AbstractFloat`, because the scaling occurs in place (e.g. it cannot be a `Matrix{Int64}`). This restriction does not apply to `transform`. For DataFrames, `transform!` can also be used on columns of type `<: Integer`.
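The reason behind the element-type restriction can be reproduced with plain broadcasting. This is a generic Julia illustration, not the package's code: writing fractional results back into an integer matrix fails, while an out-of-place computation simply allocates a floating-point result.

```julia
Xint = [1 2; 3 4]                 # Matrix{Int64}

# Out-of-place: the result is a new Matrix{Float64}.
Xout = (Xint .- 2) ./ 2

# In-place: results must fit the existing element type, so this throws.
err = try
    Xint .= (Xint .- 2) ./ 2
    nothing
catch e
    e
end

# Converting to a floating-point matrix first makes in-place scaling possible.
Xf = float(Xint)
Xf .= (Xf .- 2) ./ 2
```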
`FixedRangeScaler` is used with the functions `fit()`, `transform()` and `fit_transform()` to scale data in a `Matrix` `X` or a `DataFrame` to a fixed range `[lower, upper]`. After fitting a `FixedRangeScaler` to one data set, it can be used to apply the same transformation to a new set of data. E.g. fit the `FixedRangeScaler` to your training data and then apply the scaling to the test data at a later stage (see the examples below).
fit(FixedRangeScaler, X[, lower, upper; obsdim, operate_on])
fit_transform(FixedRangeScaler, X[, lower, upper; obsdim, operate_on])
`X`: Data of type `Matrix` or `DataFrame`.

`lower`: (Scalar) Lower limit of the new range. Defaults to 0.

`upper`: (Scalar) Upper limit of the new range. Defaults to 1.

`obsdim`: Specify which axis corresponds to observations. Defaults to `obsdim=2` (observations are columns of the matrix). For DataFrames `obsdim` has no effect and rescaling occurs column-wise.

`operate_on`: Specify the indices of the columns or rows to be scaled. Defaults to all columns/rows. For DataFrames this must be a vector of symbols, not indices. E.g. `operate_on=[1,3]` scales only the columns with index 1 and 3 (if `obsdim=1`, else rows 1 and 3).
Note on DataFrames:
Columns containing `missing` values are skipped.
Columns containing non-numeric elements are skipped.
Examples:
Xtrain = rand(100, 4)
Xtest = rand(10, 4)
x = rand(10)
D = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10])
scaler = fit(FixedRangeScaler, Xtrain)
scaler = fit(FixedRangeScaler, Xtrain, -1, 1)
scaler = fit(FixedRangeScaler, Xtrain, -1, 1, obsdim=1)
scaler = fit(FixedRangeScaler, Xtrain, -1, 1, obsdim=1, operate_on=[1,3])
scaler = fit(FixedRangeScaler, D, -1, 1, operate_on=[:A,:B])
Xscaled = transform(Xtest, scaler)
transform!(Xtest, scaler)
Xscaled, scaler = fit_transform(FixedRangeScaler, Xtrain, -1, 1, obsdim=1, operate_on=[1,2,4])
scaler = fit_transform!(FixedRangeScaler, Xtrain, -1, 1, obsdim=1, operate_on=[1,2,4])
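For intuition, fixed-range scaling is commonly implemented as min-max rescaling. The sketch below assumes that formula and uses hypothetical helper names (`fit_range`, `apply_range`) rather than the package API; observations are stored as rows.

```julia
# Hypothetical helpers; observations are rows (the obsdim = 1 layout).
function fit_range(X, lower=0.0, upper=1.0)
    xmin = minimum(X; dims=1)   # per-column minimum of the fitted data
    xmax = maximum(X; dims=1)   # per-column maximum of the fitted data
    return (lower=lower, upper=upper, xmin=xmin, xmax=xmax)
end

# Map [xmin, xmax] linearly onto [lower, upper].
# A constant column (xmax == xmin) would divide by zero; not handled here.
apply_range(X, s) =
    s.lower .+ (X .- s.xmin) ./ (s.xmax .- s.xmin) .* (s.upper - s.lower)

Xtrain = [0.0 10.0; 5.0 20.0; 10.0 30.0]
Xtest  = [2.5 40.0]

s = fit_range(Xtrain, -1.0, 1.0)
Strain = apply_range(Xtrain, s)   # every column spans exactly [-1, 1]
Stest  = apply_range(Xtest, s)    # test values can fall outside [-1, 1]
```

Because the minimum and maximum come from the training data, a test value beyond the training range (40.0 in the second column here) maps outside `[lower, upper]`.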
The lower-level functions on which `StandardScaler` and `FixedRangeScaler` are built can also be used separately.
μ = center!(X[, μ; obsdim, operate_on])
Shift `X` along `obsdim` by `μ` according to `X = X - μ`, where `X` is of type `Matrix` or `Vector`. A `DataFrame` `D` can be passed in place of `X`.
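As a rough picture of in-place centering on a subset of columns, consider the following sketch. `center_cols!` is a hypothetical stand-in written for this example, not the package's `center!`, and it assumes observations are stored as rows.

```julia
using Statistics

# Hypothetical stand-in for the idea behind center!; not the package code.
# Observations are rows; only the columns listed in `cols` are shifted.
function center_cols!(X::AbstractMatrix{<:AbstractFloat}, cols=axes(X, 2))
    μ = [mean(view(X, :, j)) for j in cols]
    for (k, j) in enumerate(cols)
        X[:, j] .-= μ[k]          # subtract the column mean in place
    end
    return μ                      # return the shift so it can be reused later
end

X = [1.0 10.0; 3.0 30.0]
μ = center_cols!(X, [2])          # column 1 is left untouched
```

Returning `μ` mirrors the signature shown above, where the computed shift is handed back so that it can be applied to other data.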
lower, upper, xmin, xmax = fixedrange!(X[, lower, upper, xmin, xmax; obsdim, operate_on])
Normalize `X` or `D` along `obsdim` to the interval `[lower, upper]`, where `X` is of type `Matrix` or `Vector` and `D` of type `DataFrame`. If `lower` and `upper` are omitted, the default range is `[0, 1]`.
μ, σ = standardize!(X[, μ, σ; obsdim, operate_on])
Standardize `X` along `obsdim` according to `X = (X - μ) / σ`. If `μ` and `σ` are omitted, they are computed from `X` such that the variables have a mean of zero and a standard deviation of one.
M = expand_poly(x[, degree=5, obsdim])
Perform a polynomial basis expansion of the given `degree` for the vector `x`.
julia> expand_poly(1:5, degree=3)
3×5 Array{Float64,2}:
1.0 2.0 3.0 4.0 5.0
1.0 4.0 9.0 16.0 25.0
1.0 8.0 27.0 64.0 125.0
julia> expand_poly(1:5, degree=3, obsdim=1)
5×3 Array{Float64,2}:
1.0 1.0 1.0
2.0 4.0 8.0
3.0 9.0 27.0
4.0 16.0 64.0
5.0 25.0 125.0
julia> expand_poly(1:5, 3, ObsDim.First()); # same but type-stable
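The matrices printed above can be reproduced with a two-dimensional comprehension. `poly_rows` below is a hypothetical re-implementation for illustration only, not the package source.

```julia
# Hypothetical re-implementation for illustration; not the package source.
# Row k holds x.^k, so each column is one observation (the obsdim = 2 layout).
poly_rows(x, degree) = [float(xi)^k for k in 1:degree, xi in x]

M  = poly_rows(1:5, 3)    # 3×5 matrix, observations as columns
Mt = permutedims(M)       # 5×3 matrix, observations as rows
```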