Supervised learning in Stata with scikit-learn
version 0.70 21jul2020
pylearn is a set of Stata modules that lets Stata users run many popular supervised learning algorithms (decision trees, random forests, adaptive boosting, gradient boosting, and multi-layer perceptrons, i.e. neural networks) directly from Stata. Behind the scenes, pylearn uses the Python integration introduced in Stata 16.0 to call the popular Python library scikit-learn.
pylearn consists of five Stata commands, each implementing a popular supervised learning algorithm:
Stata Function Name | Description | Related scikit-learn classes
pytree | Decision trees | DecisionTreeClassifier, DecisionTreeRegressor
pyforest | Random forests | RandomForestClassifier, RandomForestRegressor
pymlp | Neural networks (multi-layer perceptrons) | MLPClassifier, MLPRegressor
pyadaboost | Adaptive boosting (AdaBoost) | AdaBoostClassifier, AdaBoostRegressor
pygradboost | Gradient boosting | GradientBoostingClassifier, GradientBoostingRegressor
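The table can also be read as a lookup from a pylearn command to its related scikit-learn class names. The snippet below is only a sketch of that reading in plain Python: the names `SKLEARN_CLASSES` and `related_class` are hypothetical, and how pylearn actually selects a class internally is not described here.

```python
# Command-to-class lookup copied from the table above.
# Each entry is (classifier class name, regressor class name).
SKLEARN_CLASSES = {
    "pytree": ("DecisionTreeClassifier", "DecisionTreeRegressor"),
    "pyforest": ("RandomForestClassifier", "RandomForestRegressor"),
    "pymlp": ("MLPClassifier", "MLPRegressor"),
    "pyadaboost": ("AdaBoostClassifier", "AdaBoostRegressor"),
    "pygradboost": ("GradientBoostingClassifier", "GradientBoostingRegressor"),
}


def related_class(command, regression):
    """Return the scikit-learn class name listed for a pylearn command.

    `regression` is True for a regression task and False for classification.
    Raises KeyError if `command` is not one of the five commands in the table.
    """
    classifier, regressor = SKLEARN_CLASSES[command]
    return regressor if regression else classifier


print(related_class("pyforest", regression=True))  # prints RandomForestRegressor
```

A regression call such as pyforest with type(regress) would presumably correspond to the regressor entry in this lookup.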
Each of the five commands ships with a detailed help file. For instance, to view the documentation for pyforest, type the following Stata command:
help pyforest
pylearn requires Stata 16+, since it relies on the Python integration introduced in Stata 16.0.
pylearn also requires Python 3.6+, scikit-learn, and pandas. If you do not have these Python libraries installed, pylearn will try to install them automatically - see the installation section below.
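To check by hand whether a given Python installation meets these requirements, something like the following standalone script could be run with the interpreter that Stata is configured to use. It is a minimal sketch and not part of pylearn: the function name `check_prerequisites` is hypothetical, and the script only reports what is missing without installing anything.

```python
import importlib.util
import sys

# Minimum Python version stated in the prerequisites above.
MIN_PYTHON = (3, 6)
# Import names of the required libraries; scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ("sklearn", "pandas")


def check_prerequisites():
    """Return a dict mapping each prerequisite to True (satisfied) or False."""
    status = {"python": sys.version_info[:2] >= MIN_PYTHON}
    for name in REQUIRED_MODULES:
        # find_spec returns None when a top-level module cannot be located.
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for item, ok in check_prerequisites().items():
        print(f"{item}: {'ok' if ok else 'missing'}")
```

Using `find_spec` rather than a plain `import` keeps the check cheap, since it locates each library without loading it.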
Installing pylearn takes two steps.
- First, install the Stata code and documentation. You can run the following Stata command to install everything directly from this GitHub repository:
net install pylearn, from(https://raw.githubusercontent.com/mdroste/stata-pylearn/master/src/) replace
- Second, install Python if you haven't already. Then run the included setup command, which checks for the required Python libraries and tries to install any that are missing:
pylearn, setup
If Stata cannot find your Python installation, refer to the installation guide.
To upgrade to the latest version of pylearn, simply run the following:
pylearn, upgrade
The syntax of each pylearn command closely follows that of other Stata modeling commands.
Here is a quick example of a random forest regression with pylearn's 'pyforest' command:
* Load auto dataset
sysuse auto, clear
* Estimate random forest regression model, predicting price using mpg, trunk, and weight
* Train only on cars with foreign==1 and test on foreign==0
pyforest price mpg trunk weight, type(regress) training(foreign)
predict price_predicted
* We can also use Stata-like 'if' conditions to specify the training sample, but we won't get out-of-sample RMSE in that case
pyforest price mpg trunk weight if foreign==1, type(regress)
predict price_predicted_2
* pyforest picks default hyperparameters automatically, but you can set any hyperparameter that scikit-learn supports
pyforest price mpg trunk weight, type(regress) training(foreign) max_depth(5) min_samples_split(4)
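The difference between the training() option and an 'if' condition comes down to whether any observations are held out for evaluation. The plain-Python sketch below illustrates the idea of splitting on a 0/1 indicator and computing in-sample and out-of-sample RMSE. It rests on several assumptions: the data are made-up toy numbers (not the auto dataset), the "model" is a stand-in that predicts the training mean rather than a random forest, and the helper names `rmse` and `split_by_indicator` are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length, non-empty sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def split_by_indicator(rows, indicator):
    """Split rows into (training, test) by a 0/1 indicator field."""
    training = [r for r in rows if r[indicator] == 1]
    test = [r for r in rows if r[indicator] == 0]
    return training, test


# Toy data: two training rows (foreign == 1) and two test rows (foreign == 0).
rows = [
    {"price": 10.0, "foreign": 1},
    {"price": 14.0, "foreign": 1},
    {"price": 9.0, "foreign": 0},
    {"price": 15.0, "foreign": 0},
]

training, test = split_by_indicator(rows, "foreign")

# Stand-in model: always predict the mean price of the training rows.
train_mean = sum(r["price"] for r in training) / len(training)

in_sample_rmse = rmse([r["price"] for r in training], [train_mean] * len(training))
out_of_sample_rmse = rmse([r["price"] for r in test], [train_mean] * len(test))

print(in_sample_rmse, out_of_sample_rmse)
```

With an 'if' condition alone there is no designated test set, so only the in-sample figure could be computed.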
Detailed documentation and usage examples are provided with each Stata file. For instance, see:
help pyforest
The following items will be addressed soon:
- Weights: Add support for weights
- Post-estimation: return feature importance (where applicable)
- Model selection: cross-validation
- Exception handling: more elegant exception handling in Python
pylearn is MIT-licensed.