
Meta Algorithms


Introduction

Meta-algorithms are approaches that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), decrease bias (boosting), or improve predictive force (stacking, also called ensembling). This wiki page covers only bagging and boosting.

Here is a quick comparison of the three approaches:

|                                    | Bagging            | Boosting                                    | Stacking            |
|------------------------------------|--------------------|---------------------------------------------|---------------------|
| Partition of the data into subsets | Random             | Misclassified samples get higher preference | Various             |
| Goal to achieve                    | Minimize variance  | Increase predictive force                   | Both                |
| Methods where this is used         | Random subspace    | Gradient descent                            | Blending            |
| Function to combine single models  | (Weighted) average | Weighted majority vote                      | Logistic regression |

Bagging

Bagging decreases the variance of your prediction by generating additional training data from your original dataset: sampling with replacement (the bootstrap) produces multisets of the same size as the original data. Increasing the training set this way cannot improve the model's predictive force; it only decreases the variance, tuning the prediction more narrowly toward the expected outcome.

If you have a very unreliable model, e.g. a decision tree, bagging makes the result more robust by building several models on resampled versions of the data and combining them. Random Forest, for example, is a bagging algorithm applied to decision trees; it reduces variance and makes the model more stable.
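
As an illustration, here is a minimal bagging sketch using scikit-learn; the dataset, number of estimators, and other settings are arbitrary choices for the example, not part of the original page:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# toy dataset (illustrative only)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# a single decision tree: a high-variance base learner
tree = DecisionTreeClassifier(random_state=0)

# bagging: fit many trees on bootstrap resamples and average their votes
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=100, random_state=0)

# random forest: bagging of decision trees plus random feature subsets
forest = RandomForestClassifier(n_estimators=100, random_state=0)

for name, model in [('single tree', tree), ('bagging', bagged), ('random forest', forest)]:
    scores = cross_val_score(model, X, y, cv=5)
    print('%-13s accuracy: %.3f' % (name, scores.mean()))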

Boosting

A two-step approach: first use subsets of the original data to produce a series of averagely performing models, then boost their performance by combining them with a particular cost function (e.g. a majority vote). Unlike bagging, in classical boosting the subset creation is not random but depends on the performance of the previous models: each new subset contains the elements that were misclassified by the previous models.

Boosting reduces variance by combining multiple models (as bagging does), and it also reduces bias by telling each subsequent model which errors the previous models made.

  • AdaBoost: tells subsequent models to punish more heavily the observations that the previous models got wrong
  • Gradient boosting: trains each subsequent model on the residuals (the differences between the predicted and true values), e.g. XGBoost
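
To make the residual-fitting idea concrete, here is a minimal hand-rolled gradient-boosting sketch for squared-error regression; the toy data, learning rate, and tree depth are illustrative assumptions, not part of the original page:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy 1-D regression problem (illustrative only)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1
pred = np.zeros_like(y)          # start from a constant (zero) prediction
trees = []

for _ in range(100):
    residuals = y - pred                          # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)     # a weak learner
    tree.fit(X, residuals)                        # fit the next tree on the residuals
    pred += learning_rate * tree.predict(X)       # shrink its correction and add it
    trees.append(tree)

print('final training MSE: %.4f' % np.mean((y - pred) ** 2))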

Discussion of these Ensembles

  • To use these ensembles, your base learner must be weak.

If the base model already overfits the data, there won't be any residuals or errors for the subsequent models to build upon (the residuals will be zero).

  • Gradient boosting is attractive because it is easy to plug in different loss functions, even non-convex ones, as long as their derivatives can be computed (see the custom-objective sketch after this section).

  • Don't mix up random forest and gradient-boosted trees (e.g. XGBoost).

People sometimes confuse random forest and gradient-boosted trees just because both use decision trees, but they are two very different families of ensembles!
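
As a rough illustration of plugging in a loss function, here is a sketch of a custom objective passed to xgb.train; the synthetic data and hyperparameters are assumptions for the example, and the same pattern works for any loss whose first and second derivatives you can compute:

import numpy as np
import xgboost as xgb

# synthetic regression data (illustrative only)
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
dtrain = xgb.DMatrix(X, label=y)

# custom squared-error objective: gradient boosting only needs the first
# derivative (grad) and second derivative (hess) of the loss with respect
# to the current prediction, so any differentiable loss can be supplied
def squared_error_obj(preds, dtrain):
    labels = dtrain.get_label()
    grad = preds - labels              # d/dpred of 0.5 * (pred - label)^2
    hess = np.ones_like(preds)         # second derivative is constant
    return grad, hess

bst = xgb.train({'max_depth': 3, 'eta': 0.1}, dtrain,
                num_boost_round=50, obj=squared_error_obj)
print('training RMSE: %.4f' % np.sqrt(np.mean((bst.predict(dtrain) - y) ** 2)))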

XGBoost

XGBoost is a very fast, scalable implementation of gradient boosting.

Installation

Read the installation guide on the XGBoost GitHub, or follow this quick tutorial for Mac users on Python 3.6:

  1. Obtain gcc-7.x.x with Anaconda:
$ conda install -c anaconda gcc
  2. Clone the XGBoost repository:
$ git clone --recursive https://github.com/dmlc/xgboost
  3. Build XGBoost:
$ cd xgboost; cp make/config.mk ./config.mk; make -j4

Python Package Installation

  1. Install system-wide:
$ cd python-package; sudo python setup.py install
  2. Or, only set the environment variable PYTHONPATH to tell Python where to find the library:
$ export PYTHONPATH=~/xgboost/python-package
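
Either way, a quick check that Python can find the package (assuming the paths above):
$ python -c "import xgboost; print(xgboost.__version__)"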

Quick test

Try the following code from the XGBoost GitHub to test your installation; run it from the xgboost repository root, since the required data files ship with the repo:

import xgboost as xgb

# read in data
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')
# specify parameters via map
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
num_round = 2
bst = xgb.train(param, dtrain, num_round)
# make prediction
preds = bst.predict(dtest)
print(preds)
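
To sanity-check the output, the predicted probabilities can be compared against the test labels; this small addition is not part of the original snippet:

import numpy as np

# preds are probabilities from the binary:logistic objective; threshold at 0.5
labels = dtest.get_label()
accuracy = np.mean((preds > 0.5) == labels)
print('test accuracy: %.4f' % accuracy)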
