Scikit-Learn Framework

# General Scikit-Learn workflow (ModelAlgo, param1/param2, and error_metric are placeholders)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Choose a model family and estimator, then fit on the training data
from sklearn.model_family import ModelAlgo
mymodel = ModelAlgo(param1, param2)
mymodel.fit(X_train, y_train)
predictions = mymodel.predict(X_test)

# Evaluate the predictions with an appropriate error metric
from sklearn.metrics import error_metric
performance = error_metric(y_test, predictions)
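
As a concrete instance of the placeholder pattern above, here is a minimal sketch assuming a plain linear regression on a numeric feature matrix X and target y (LinearRegression and mean_squared_error are example choices, not fixed by the notes):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Split the data, fit the model on the training set, and score on the hold-out set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
performance = mean_squared_error(y_test, predictions)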

Regularization

  • Regularization seeks to solve a few common model issues by:

    • Minimizing model complexity
    • Penalizing the loss function
    • Reducing model overfitting (adding more bias to reduce model variance)
  • In general, we can think of regularization as a way to reduce model overfitting and variance.

    • Requires some additional bias
    • Requires a search for optimal penalty hyperparameters.
  • Three main types of Regularization:

    • L1 Regularization (LASSO Regression - Least Absolute Shrinkage and Selection Operator) $$ \sum_{i=1}^n{(y_i - \beta_0 - \sum_{j=1}^p(\beta_j x_{ij}))^2} + \lambda \sum_{j=1}^p{|\beta_j|} = RSS + \lambda \sum_{j=1}^p{|\beta_j|} $$

      L1 Regularization adds a penalty equal to the absolute value of the magnitude of coefficients.

      • Limits the size of the coefficients.
      • Can yield sparse models where some coefficients can become zero.
      # LASSO with Cross Validation
      from sklearn.linear_model import LassoCV
      lasso_cv_model = LassoCV(eps=0.1, n_alphas=100, cv=5, max_iter=1000000)
      lasso_cv_model.fit(X_train, y_train)
      lasso_cv_model.alpha_
      test_predictions = lasso_cv_model.predict(X_test)
      lasso_cv_model.coef_
    • L2 Regularization (Ridge Regression) $$ \sum_{i=1}^n{(y_i - \beta_0 - \sum_{j=1}^p(\beta_j x_{ij}))^2} + \lambda \sum_{j=1}^p{\beta_j^2} = RSS + \lambda \sum_{j=1}^p{\beta_j^2} $$ L2 Regularization adds a penalty equal to the square of the magnitude of coefficients.

      • All coefficients are shrunk by the same factor.
      • Does not necessarily eliminate coefficients.
      from sklearn.linear_model import Ridge
      ridge_model = Ridge(alpha=10)
      ridge_model.fit(X_train, y_train)
      test_predictions = ridge_model.predict(X_test)
      
      # Ridge with Cross Validation
      from sklearn.linear_model import RidgeCV
      ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.0), scoring='neg_mean_absolute_error')
      ridge_cv_model.fit(X_train, y_train)
      ridge_cv_model.alpha_
      test_predictions = ridge_cv_model.predict(X_test)
      ridge_cv_model.best_score_
    • Combining L1 and L2 (Elastic Net) $$ \frac{\sum_{i=1}^n{(y_i - x_i^T \hat\beta)^2}}{2n} + \lambda (\frac{1-\alpha}{2} \sum_{j=1}^m{\hat\beta_j^2} + \alpha \sum_{j=1}^m{|\hat\beta_j|}) $$ Elastic Net combines L1 and L2 with the addition of an alpha parameter that decides the ratio between them.

      from sklearn.linear_model import ElasticNetCV
      elastic_model = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1],
                                  eps=0.001, n_alphas=100, max_iter=1000000)
      elastic_model.fit(X_train, y_train)
      elastic_model.l1_ratio_
      elastic_model.alpha_
      test_predictions = elastic_model.predict(X_test)
      elastic_model.coef_
  • These regularization methods do have a cost:

    • Introduce an additional hyperparameter that needs to be tuned.
    • A multiplier on the penalty decides the "strength" of the penalty (one way to search for it is shown in the sketch below).
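
The penalty strength can also be tuned with a generic grid search rather than the specialized CV estimators above. A minimal sketch, assuming the train/test split from the framework section and a Ridge model (the alpha grid values are arbitrary):

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Search candidate penalty strengths (alpha) with 5-fold cross validation
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
grid_model = GridSearchCV(Ridge(), param_grid, scoring='neg_mean_absolute_error', cv=5)
grid_model.fit(X_train, y_train)
grid_model.best_params_          # best alpha found by the search
test_predictions = grid_model.predict(X_test)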

Feature Scaling

  • Feature scaling provides many benefits to our machine learning process!

  • Some machine learning models that rely on distance metrics (e.g. KNN) require scaling to perform well.

  • Feature scaling improves the convergence of steepest descent algorithms, which do not possess the property of scale invariance.

  • If features are on different scales, certain weights may update faster than others since the feature values $x_j$ play a role in the weight updates.

  • This is a critical benefit of feature scaling related to gradient descent.

  • There are some ML Algos where scaling won't have an effect (e.g. CART based methods).

  • Scaling the features so that their respective ranges are uniform is important in comparing measurements that have different units.

  • Allows us to directly compare model coefficients to each other.

  • Feature scaling caveats:

    • Must always scale new unseen data before feeding it to the model.
    • Affects direct interpretability of feature coefficients
      • Easier to compare coefficients to one another, harder to relate back to the original unscaled feature.
  • Feature scaling benefits:

    • Can lead to great increases in performance.
    • Absolutely necessary for some models.
    • Virtually no "real" downside to scaling features.
  • Two main ways to scale features:

    • Standardization:

      • Rescales data to have a mean ($\mu$) of 0 and standard deviation ($\sigma$) of 1 (unit variance). $$ X_{changed} = \frac{X - \mu}{\sigma} $$

      • The naming can be confusing since standardization is also referred to as "Z-score normalization".

    • Normalization:

      • Rescales all data values to be between 0-1. $$ X_{changed} = \frac{X - X_{min}}{X_{max} - X_{min}} $$

      • Simple and easy to understand (a MinMaxScaler sketch follows the code at the end of this section).

  • There are many more methods of scaling features, and Scikit-Learn provides easy-to-use classes that "fit" and "transform" feature data for scaling.

    • A .fit() method call simply calculates the necessary statistics (min, max, mean, standard deviation).
    • A .transform() call actually scales data and returns the new scaled version of data.
    • Very important consideration for fit and transform:
      • We only fit to training data.
      • Calculating statistical information should only come from training data.
      • Don't want to assume prior knowledge of the test set!
      • Using the full data set would cause data leakage:
        • Calculating statistics from full data leads to some information of the test set leaking into the training process upon transform() conversion.
  • Feature scaling process:

    • Perform train test split
    • Fit to training feature data
    • Transform training feature data
    • Transform test feature data
  • Do we need to scale the label? In general it is not necessary, since scaling the target does not change the underlying relationship being modeled, and it can make interpreting the predictions harder.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)                        # statistics computed from the training data only
scaled_X_train = scaler.transform(X_train)
scaled_X_test = scaler.transform(X_test)   # test data scaled with the training statistics
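
For normalization, the same fit/transform pattern applies. A minimal sketch using MinMaxScaler, assuming the same train/test split as above:

from sklearn.preprocessing import MinMaxScaler

# Normalization: rescale each feature to the 0-1 range using min/max computed from training data only
minmax_scaler = MinMaxScaler()
minmax_scaler.fit(X_train)
norm_X_train = minmax_scaler.transform(X_train)
norm_X_test = minmax_scaler.transform(X_test)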