DEPRECATED

This package is deprecated. Please use TableTransforms.jl instead.

MLPreprocessing


Overview

Utility package that provides user-friendly methods for feature scaling and polynomial basis expansion. Feature scaling works on Matrix, Vector and DataFrame inputs. Observations can be stored as either the rows or the columns of a matrix. To distinguish between these cases, pass the parameter obsdim, where obsdim=1 corresponds to "observations as rows" and obsdim=2 to "observations as columns". Transformations can be restricted to a subset of columns/rows by passing a vector operate_on.
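
The obsdim convention can be illustrated without the package, using only Julia's Statistics standard library. The following is a minimal sketch, not the package's implementation; all variable names are made up for illustration. It shows that reducing along the observation axis yields one statistic per feature, whichever layout is used.

```julia
using Statistics

# 3 observations of 2 features, stored as rows (the obsdim=1 layout).
X_rows = [1.0 10.0;
          2.0 20.0;
          3.0 30.0]

# The same data with observations stored as columns (the obsdim=2 layout).
X_cols = permutedims(X_rows)

# Reducing along the observation axis yields one mean per feature.
μ_rows = vec(mean(X_rows; dims=1))   # reduce over rows
μ_cols = vec(mean(X_cols; dims=2))   # reduce over columns
```

Both layouts give the same per-feature means; only the axis that is reduced changes.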

StandardScaler

Standardizing a data set results in variables with a mean of 0 and a variance of 1. A common use case is to fit a StandardScaler to the training data and later apply the same transformation to the test data. StandardScaler is used with the functions fit(), transform() and fit_transform() as shown below.

    fit(StandardScaler, X[, μ, σ; obsdim, operate_on])

    fit_transform(StandardScaler, X[, μ, σ; obsdim, operate_on])

X : Data of type Matrix or DataFrame.

μ : Vector or scalar describing the translation. Defaults to mean(X; dims=obsdim)

σ : Vector or scalar describing the scale. Defaults to std(X; dims=obsdim)

obsdim : Specifies which axis corresponds to observations. Defaults to obsdim=2 (observations are columns of the matrix). For DataFrames, obsdim is ignored and rescaling occurs column-wise.

operate_on : Specifies the indices of the columns or rows to be standardized. Defaults to all columns/rows. For DataFrames this must be a vector of symbols, not indices. E.g. operate_on=[1,3] standardizes only the columns with index 1 and 3 (if obsdim=1; otherwise rows 1 and 3).

Note on DataFrames: Columns containing missing values are skipped. Columns containing non numeric elements are skipped.
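
The fit-on-training, apply-to-test workflow described above can be sketched in plain Julia. This is an illustration under stated assumptions, not the package's code: SketchScaler, fit_sketch and apply_sketch are hypothetical names, and observations are assumed to be stored as rows.

```julia
using Statistics

# Hypothetical container for the fitted parameters (not the package's type).
struct SketchScaler
    μ::Vector{Float64}
    σ::Vector{Float64}
end

# Estimate the per-feature mean and standard deviation; observations are rows.
fit_sketch(X::AbstractMatrix) =
    SketchScaler(vec(mean(X; dims=1)), vec(std(X; dims=1)))

# Apply previously fitted parameters to any matrix with the same features.
apply_sketch(X::AbstractMatrix, s::SketchScaler) = (X .- s.μ') ./ s.σ'

Xtrain = [1.0 2.0; 3.0 6.0; 5.0 10.0]
Xtest  = [3.0 6.0; 7.0 14.0]

scaler = fit_sketch(Xtrain)
Ztrain = apply_sketch(Xtrain, scaler)
Ztest  = apply_sketch(Xtest, scaler)   # reuses the training μ and σ
```

The point of keeping μ and σ in a separate object is that the test data is scaled with the training statistics, so Ztest does not necessarily have zero mean or unit variance.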

Examples:

    using MLPreprocessing, DataFrames

    Xtrain = rand(100, 4)
    Xtest  = rand(10, 4)
    x = rand(4)
    Dtrain = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10])
    Dtest = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10])

    # Fit on a matrix; each call below overwrites the previous scaler
    scaler = fit(StandardScaler, Xtrain)
    scaler = fit(StandardScaler, Xtrain, obsdim=1)
    scaler = fit(StandardScaler, Xtrain, obsdim=1, operate_on=[1,3])
    # Apply the fitted scaler to a matrix or to a single observation
    transform(Xtest, scaler)
    transform!(Xtest, scaler)
    transform(x, scaler)
    transform!(x, scaler)

    # Fit on a DataFrame; operate_on takes column symbols
    scaler = fit(StandardScaler, Dtrain)
    scaler = fit(StandardScaler, Dtrain, operate_on=[:A,:B])
    transform(Dtest, scaler)
    transform!(Dtest, scaler)

    # Fit and transform in a single step
    Xscaled, scaler = fit_transform(StandardScaler, Xtrain, obsdim=1, operate_on=[1,2,4])
    scaler = fit_transform!(StandardScaler, Xtrain, obsdim=1, operate_on=[1,2,4])

Note that for transform! the data matrix X must have an element type <: AbstractFloat, because the scaling occurs in place (e.g. it cannot be of type Matrix{Int64}). This restriction does not apply to transform. For DataFrames, transform! can also be used on columns of type <: Integer.
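
The reason behind the element-type restriction can be seen with plain Julia broadcasting. The sketch below does not call the package; it only demonstrates the general language behavior that an in-place update cannot store non-integer results in an integer array. How the package itself reports this case is not shown here.

```julia
Xint = [1 2; 3 4]            # Matrix{Int64}

# Out-of-place: broadcasting allocates a new Float64 matrix, so this works.
Xshifted = Xint .- 2.5

# In-place: the results must be written back into Int64 storage, which
# fails for non-integer values.
failed = try
    Xint .= Xint .- 2.5
    false
catch err
    err isa InexactError
end

# Converting to a floating-point matrix first makes the in-place update valid.
Xfloat = float(Xint)
Xfloat .-= 2.5
```

Converting with float(X) before calling an in-place function is one way to work with integer input.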

FixedRangeScaler

FixedRangeScaler is used with the functions fit(), transform() and fit_transform() to scale data in a Matrix X or DataFrame to a fixed range [lower, upper]. After fitting a FixedRangeScaler to one data set, it can be used to apply the same transformation to a new set of data, e.g. fit the FixedRangeScaler to your training data and then apply the scaling to the test data at a later stage (see the examples below).

    fit(FixedRangeScaler, X[, lower, upper; obsdim, operate_on])

    fit_transform(FixedRangeScaler, X[, lower, upper; obsdim, operate_on])

X : Data of type Matrix or DataFrame.

lower : (Scalar) Lower limit of new range. Defaults to 0.

upper : (Scalar) Upper limit of new range. Defaults to 1.

obsdim : Specifies which axis corresponds to observations. Defaults to obsdim=2 (observations are columns of the matrix). For DataFrames, obsdim is ignored and rescaling occurs column-wise.

operate_on : Specifies the indices of the columns or rows to be rescaled. Defaults to all columns/rows. For DataFrames this must be a vector of symbols, not indices. E.g. operate_on=[1,3] rescales only the columns with index 1 and 3 (if obsdim=1; otherwise rows 1 and 3).

Note on DataFrames: Columns containing missing values are skipped. Columns containing non numeric elements are skipped.

Examples:

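    using MLPreprocessing, DataFrames

    Xtrain = rand(100, 4)
    Xtest  = rand(10, 4)
    x = rand(10)
    D = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10])

    # Fit with the default range, then with an explicit range of -1 to 1
    scaler = fit(FixedRangeScaler, Xtrain)
    scaler = fit(FixedRangeScaler, Xtrain, -1, 1)
    scaler = fit(FixedRangeScaler, Xtrain, -1, 1, obsdim=1)
    scaler = fit(FixedRangeScaler, Xtrain, -1, 1, obsdim=1, operate_on=[1,3])
    # Fit on a DataFrame; operate_on takes column symbols
    scaler = fit(FixedRangeScaler, D, -1, 1, operate_on=[:A,:B])

    # Refit on the matrix, then apply the same scaling to new data
    scaler = fit(FixedRangeScaler, Xtrain, -1, 1, obsdim=1)
    Xscaled = transform(Xtest, scaler)
    transform!(Xtest, scaler)

    # Fit and transform in a single step
    Xscaled, scaler = fit_transform(FixedRangeScaler, Xtrain, -1, 1, obsdim=1, operate_on=[1,2,4])
    scaler = fit_transform!(FixedRangeScaler, Xtrain, -1, 1, obsdim=1, operate_on=[1,2,4])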

Lower Level Functions

The lower-level functions on which StandardScaler and FixedRangeScaler are built can also be used separately.

center!()

    μ = center!(X[, μ; obsdim, operate_on])

Shift X or D along obsdim by μ according to X = X - μ, where X is of type Matrix or Vector and D is of type DataFrame.
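
As a rough illustration of in-place centering restricted to a subset of columns (in the spirit of operate_on), here is a plain-Julia sketch. center_sketch! is a hypothetical helper, not the package's center!, and it assumes that observations are rows and that the shift defaults to the column mean.

```julia
using Statistics

# Hypothetical helper (not the package's implementation): subtract the mean
# from the selected columns of X in place; observations are rows.
function center_sketch!(X::AbstractMatrix{<:AbstractFloat}, cols=axes(X, 2))
    μ = [mean(view(X, :, j)) for j in cols]
    for (k, j) in enumerate(cols)
        X[:, j] .-= μ[k]      # in-place update of column j
    end
    return μ
end

X = [1.0 10.0 7.0;
     3.0 30.0 7.0]

μ = center_sketch!(X, [1, 2])   # the third column is left untouched
```

Returning μ mirrors the signature above, so the same shift can later be applied to other data.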

fixedrange!()

    lower, upper, xmin, xmax = fixedrange!(X[, lower, upper, xmin, xmax; obsdim, operate_on])

Normalize X or D along obsdim to the interval [lower, upper], where X is of type Matrix or Vector and D is of type DataFrame. If lower and upper are omitted, the default range is [0, 1].
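
The normalization can be sketched in plain Julia as the usual min-max mapping. This is an assumption about the formula, since it is not spelled out here, and rescale_sketch is a hypothetical helper rather than the package's fixedrange!. Observations are assumed to be rows.

```julia
# Hypothetical helper (not the package's implementation): map each column of X
# from [xmin, xmax] to [lower, upper]; observations are rows.
function rescale_sketch(X::AbstractMatrix, lower, upper,
                        xmin=vec(minimum(X; dims=1)),
                        xmax=vec(maximum(X; dims=1)))
    scale = (upper - lower) ./ (xmax .- xmin)
    Y = lower .+ (X .- xmin') .* scale'
    return Y, xmin, xmax
end

Xtrain = [0.0 10.0; 5.0 20.0; 10.0 30.0]
Xtest  = [2.5 40.0]

Ytrain, xmin, xmax = rescale_sketch(Xtrain, -1, 1)

# Reusing the training extrema on new data; values outside the training
# range land outside [lower, upper].
Ytest, _, _ = rescale_sketch(Xtest, -1, 1, xmin, xmax)
```

Passing xmin and xmax explicitly is what allows the same mapping to be reused on new data. The sketch does not guard against constant columns, where xmax equals xmin and the division is undefined.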

standardize!()

    μ, σ = standardize!(X[, μ, σ; obsdim, operate_on])

Standardize X along obsdim according to X = (X - μ) / σ. If μ and σ are omitted, they are computed such that the variables have a mean of zero and a variance of one.

Polynomial Basis Expansion

    M = expand_poly(x[, degree=5, obsdim])

Perform a polynomial basis expansion of the given degree for the vector x.

    julia> expand_poly(1:5, degree=3)
    3×5 Array{Float64,2}:
     1.0  2.0   3.0   4.0    5.0
     1.0  4.0   9.0  16.0   25.0
     1.0  8.0  27.0  64.0  125.0

    julia> expand_poly(1:5, degree=3, obsdim=1)
    5×3 Array{Float64,2}:
     1.0   1.0    1.0
     2.0   4.0    8.0
     3.0   9.0   27.0
     4.0  16.0   64.0
     5.0  25.0  125.0

    julia> expand_poly(1:5, 3, ObsDim.First()); # same but type-stable
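
To make the layout of the result explicit, the expansion shown above can be reproduced with a plain comprehension. poly_sketch is a hypothetical stand-in written for illustration, not the package's expand_poly.

```julia
# Hypothetical re-implementation for illustration only: row d holds x.^d.
poly_sketch(x, degree) = [float(xi)^d for d in 1:degree, xi in x]

M  = poly_sketch(1:5, 3)      # 3×5, one observation per column
Mt = permutedims(M)           # 5×3, one observation per row
```

Row d of M contains the d-th powers of x, which matches the default output printed above. The transposed matrix corresponds to the obsdim=1 layout.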
