All notable changes to this project will be documented in this file.
- Fixed `predict_model` throwing an exception with loaded pipelines (pycaret#2349)
- Fixed potential parameter leaking for `ParallelBackend` - thanks to @goodwanghan (pycaret#2339)
- Refactored a piece of logic in arules - thanks to @daikikatsuragawa (pycaret#2316)
- Added Two Tutorials in Chinese - thanks to @ryanxjhan (pycaret#2352)
- Added CLF101 in Chinese - thanks to @ryanxjhan (pycaret#2353)
- Added new tutorials in Chinese - thanks to @ryanxjhan (pycaret#2375)
- Made `log_experiment` more configurable (pycaret#2334, pycaret#2335)
- Made `return_train_score=False` use the old output format (pycaret#2333)
- Fixed `dashboard_logger` key error during `setup` (pycaret#2311)
- Fugue integration - thanks to @goodwanghan (pycaret#2035)
- Added W&B experiment logger - thanks to @AyushExel (pycaret#2231)
- Fixed `check_fairness` exception when index is not an ordinal number - thanks to @reza1615 (pycaret#2055)
- Unsupported characters in dataframes are now replaced - thanks to @reza1615 (pycaret#2058)
- Fixed drift report with categorical columns - thanks to @reza1615 (pycaret#2063)
- Added multivariable time series dataset from UCI - thanks to @reza1615 (pycaret#2094)
- Fixed a UTF error during installation - thanks to @reza1615 (pycaret#2113)
- MLFlow tracking API can now take in custom tags - thanks to @netoferraz (pycaret#1526)
- Updated `create_api` function (pycaret#2146)
- `drift_report` can now work with unseen data - thanks to @reza1615 (pycaret#2183)
- Added Japanese tutorial - thanks to @hanaseleb (pycaret#2215)
- Added Traffic and Drugs Related Violations dataset and example - thanks to @HaithemH (pycaret#2191)
- Train score can now be returned from various supervised learning functions (`return_train_score=True`). Passing an unseen dataset with the label column to `predict_model` will now calculate the metrics for that dataset - thanks to @levelalphaone (pycaret#2237)
- Fixed spelling mistakes in function docstrings - thanks to @aadarshsingh191198 (pycaret#2269)
- Pinned `numba<0.55` (pycaret#2056)
- Added new function `create_app` (pycaret#2044)
- Refactored `optimize_threshold` function (pycaret#2041)
- Added new function `create_docker` (pycaret#2005)
- Added new function `create_api` (pycaret#2000)
- Added new function `check_fairness` (pycaret#1997)
- Added new function `eda` (pycaret#1983)
- Added new function `convert_model` (pycaret#1959)
- Added an ability to pass kwargs to plots in `plot_model` (https://github.com/pycaret/pycaret/pull/19400)
- Added `drift_report` functionality to `predict_model` (pycaret#1935)
- Added new function `create_dashboard` (pycaret#1925)
- Added `grid_interval` parameter to `optimize_threshold` - thanks to @wolfryu (pycaret#1938)
- Made logging level configurable by environment variable (pycaret#2026)
- Made the optional path in AWS configurable (pycaret#2045)
- Fixed TSNE plot with PCA (pycaret#2032)
- Fixed rendering of streamlit plots (pycaret#2008)
- Fixed class names in `tree` plot - thanks to @yamasakih (pycaret#1982)
- Fixed NearZeroVariance preprocessor not being configurable - thanks to @Flyfoxs (pycaret#1952)
- Removed duplicated code - thanks to @Flyfoxs (pycaret#1882)
- Documentation improvements - thanks to @harsh204016, @khrapovs (https://github.com/pycaret/pycaret/pull/1931/files, pycaret#1956, pycaret#1946, pycaret#1949)
- Pinned `pyyaml<6.0.0` to fix issues with Google Colab
- Fixed an issue where `Fix_multicollinearity` would fail if the target was a float (pycaret#1640)
- MLFlow runs are now nested - thanks to @jfagn (pycaret#1660)
- Fixed a typo in REG102 tutorial - thanks to @bobo-jamson (pycaret#1684)
- Fixed `interpret_model` not always respecting `save_path` (pycaret#1707)
- Fixed certain plots not being logged by MLFlow (pycaret#1769)
- Added dummy models to set a baseline in `compare_models` - thanks to @reza1615 (pycaret#1739)
- Improved error message if a column specified in `ignore_features` doesn't exist in the dataset - thanks to @reza1615 (pycaret#1793)
- Added an ability to set a custom probability threshold for binary classification through the `probability_threshold` argument in various methods (pycaret#1858)
- Separated internal CV from validation CV for `stack_models` and `calibrate_models` (pycaret#1849, pycaret#1858)
- A `RuntimeError` will now be raised if an incorrect version of `scikit-learn` is installed (pycaret#1870)
- Improved readme, documentation and repository structure
- Unpinned `numba` (pycaret#1735)
- Added `get_leaderboard` function for classification and regression modules
- It is now possible to specify the plot save path with the save argument of `plot_model` and `interpret_model` - thanks to @bhanuteja2001 (pycaret#1537)
- Fixed `interpret_model` affecting `plot_model` behavior - thanks to @naujgf (pycaret#1600)
- Fixed issues with conda builds - thanks to @melonhead901 (pycaret#1479)
- Documentation improvements - thanks to @caron14 and @harsh204016 (pycaret#1499, pycaret#1502)
- Fixed `blend_models` and `stack_models` throwing an exception when using custom estimators (pycaret#1500)
- Fixed a "Target Missing" issue with "Remove Multicollinearity" option (pycaret#1508)
- `errors="ignore"` parameter for `compare_models` now correctly ignores errors during full fit (pycaret#1510)
- Fixed certain data types being incorrectly encoded as int64 during setup (pycaret#1515)
- Pinned `numba<0.54` (pycaret#1530)
- Fixed issues with `[full]` install by pinning `interpret<=0.2.4`
- Added support for S3 folder path in `deploy_model()` with AWS
- Enabled experimental Optuna `TPESampler` options to improve convergence (in `tune_model()`)
- Implemented PDP, MSA and PFI plots in `interpret_model` - thanks to @IncubatorShokuhou (pycaret#1415)
- Implemented Kolmogorov-Smirnov (KS) plot in `plot_model` under `pycaret.classification` module
- Fixed a typo "RVF" to "RBF" - thanks to @baturayo (pycaret#1220)
- Readme & license updates and improvements
- Fixed `remove_multicollinearity` considering categorical features
- Fixed keyword issues with PyCaret's cuML wrappers
- Improved performance of iterative imputation
- Fixed `gain` and `lift` plots taking wrong arguments, creating misleading plots
- `interpret_model` on LightGBM will now show a beeswarm plot
- Multiple improvements to exception handling and documentation in `pycaret.persistence` (pycaret#1324)
- `remove_perfect_collinearity` option will now be shown in the `setup()` summary - thanks to @mjkanji (pycaret#1342)
- Fixed `IterativeImputer` setting wrong float precision
- Fixed custom grids in `tune_model` raising an exception when composed of lists
- Improved documentation in `pycaret.clustering` - thanks to @susmitpy (pycaret#1372)
- Added support for LightGBM CUDA version - thanks to @IncubatorShokuhou (pycaret#1396)
- Exposed `address` in `get_data` for alternative data sources - thanks to @IncubatorShokuhou (pycaret#1416)
- Fixed an exception with missing variables (display_container etc.) during load_config()
- Fixed exceptions when using Ridge and RF estimators with cuML (GPU mode)
- Fixed PyCaret's cuML wrappers not being pickleable
- Added an extra check to get_all_object_vars_and_properties internal method, fixing exceptions with certain estimators
- save_model() now supports kwargs, which will be passed to joblib.dump()
- Fixed an issue with load_model() from AWS (duplicate .pkl extension) - thanks to markgrujic (pycaret#1128)
- Fixed a typo in documentation - thanks to koorukuroo (pycaret#1149)
- Optimized Fix_multicollinearity transformer, drastically reducing the size of saved pipeline
- interpret_model() now supports data passed as an argument - thanks to jbechtel (pycaret#1184)
- Removed `infer_signature` from MLflow logging when `log_experiment=True`.
- Fixed a rare issue where binary_multiclass_score_func was not pickleable
- Fixed edge case exceptions in feature selection
- Fixed an exception with `finalize_model` when using GroupKFold CV
- Pinned `mlxtend>=0.17.0`, `imbalanced-learn==0.7.0`, and `gensim<4.0.0`
- Modules Impacted: `pycaret.classification`, `pycaret.regression`, `pycaret.clustering`, `pycaret.anomaly`, `pycaret.arules`
- Added new interactive residual plots in `pycaret.regression` module. You can now generate interactive residual plots by using `residuals_interactive` in the `plot_model` function.
- Added plot rendering support for streamlit applications. A new parameter `display_format` is added in the `plot_model` function. To render a plot in a streamlit app, set this to `streamlit`.
- Revamped Boruta feature selection algorithm (give it a try!).
- `tune_model` in `pycaret.classification` and `pycaret.regression` is now compatible with custom models.
- Added low_memory and max_len support to association rules module (pycaret#1008).
- Increased robustness of DataFrame checks (pycaret#1005).
- Improved loading of models from AWS (pycaret#1005).
- Catboost and XGBoost are now optional dependencies. They are not automatically installed with default slim installation. To install optional dependencies use `pip install pycaret[full]`.
- Added `raw_score` argument in the `predict_model` function for `pycaret.classification` module. When set to True, scores for each class will be returned separately.
- PyCaret now returns base scikit-learn objects, whenever possible.
- When `handle_unknown_categorical` is set to False in the `setup` function, an exception will be raised during prediction if the data contains unknown levels in categorical features.
- `predict_model` for multiclass classification now returns labels as an integer.
- Fixed an edge case where an IndexError would be raised in `pycaret.clustering` and `pycaret.anomaly`.
- Fixed text formatting for certain plots in `pycaret.classification` and `pycaret.regression`.
- If a `logs.log` file cannot be created when `setup` is initialized, no exception will be raised now (support for more configurable logging to come in future).
- User added metrics will not raise exceptions now and instead return 0.0.
- Compatibility with tune-sklearn>=0.2.0.
- Fixed an edge case for dropping NaNs in target column.
- Fixed stacked models not being tuned correctly.
- Fixed an exception with KFold when fold_shuffle=False.
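The custom-metric entry in this list (user-added metrics return 0.0 instead of raising) can be pictured with a small wrapper. This is a sketch under assumptions, not PyCaret's code: `safe_metric`, `accuracy` and `safe_accuracy` are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence


def safe_metric(
    score_func: Callable[[Sequence, Sequence], float]
) -> Callable[[Sequence, Sequence], float]:
    """Wrap a metric so that any exception yields 0.0 (hypothetical helper)."""

    def wrapped(y_true: Sequence, y_pred: Sequence) -> float:
        try:
            return float(score_func(y_true, y_pred))
        except Exception:
            # Fail soft with 0.0 rather than propagate the error.
            return 0.0

    return wrapped


def accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    """Fraction of matching labels; raises ZeroDivisionError on empty input."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


safe_accuracy = safe_metric(accuracy)
```

The trade-off of this design is that a broken metric shows up as a row of zeros in the score grid rather than as a traceback, so a column of 0.0 values is worth investigating.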
Release: PyCaret 2.2.3 | Release Date: December 22, 2020 (SEVERAL BUGS FIX | CRITICAL COMPATIBILITY FIX)
- Fixed exceptions with the `predict_model` function when data columns had non-string characters.
- Fixed a rare exception with the `remove_multicollinearity` parameter in the `setup` function.
- Improved performance and robustness of conversion of date features to categoricals.
- Fixed an exception with the `models` function when the `type` parameter was passed.
- The data frame displayed after setup can now be accessed with the `pull` function.
- Fixed an exception with save_config
- Fixed a rare case where the target column would be treated as an ID column and thus dropped.
- SHAP plots can now be saved (pass save parameter as True)
- | CRITICAL | Compatibility broke for catboost, pyod (other impacts unknown as of now) with sklearn=0.24 (released on Dec 22, 2020). A temporary fix is requiring 0.23.2 specifically in the `requirements.txt`.
- Fixed an issue with the `optimize_threshold` function in the `pycaret.classification` module. It now returns a float instead of an array.
- Fixed issue with the `predict_model` function. It now uses the original data frame to append the predictions. As such, any extra columns given at the time of inference are not removed when returning the predictions. Instead, they are internally ignored at the time of prediction.
- Fixed edge case exceptions for the `create_model` function in `pycaret.clustering`.
- Fixed exceptions when column names are not strings.
- Fixed exceptions in `pycaret.regression` when `transform_target` is True in the `setup` function.
- Fixed an exception in the `models` function if the `type` parameter is specified.
Post-release `2.2`, the following issues have been fixed:
- Fixed `plot_model = 'tree'` exceptions.
- Fixed issue with `predict_model` causing errors with non-contiguous indices.
- Fixed issue with `remove_outliers` parameter in the `setup` function. It was introducing extra columns in training data. The issue has been fixed now.
- Fixed issue with `plot_model` in `pycaret.clustering` causing errors with non-contiguous indices.
- Fixed an exception when the model was saved or logged when `imputation_type` is set to 'iterative' in the `setup` function.
- `compare_models` now prints intermediate output when `html=False`.
- Metrics in `pycaret.classification` for binary classification are now calculated with `average='binary'`. Before, they were a weighted average of the positive and negative class; now they are calculated for the positive class only. For multiclass classification, `average='weighted'` is used.
- `optimize_threshold` now returns optimized probability threshold value as numpy object.
- Fixed issue with certain exceptions in `compare_models`.
- Added `profile_kwargs` argument in the `setup` function to pass keyword arguments to Pandas Profiler.
- `plot_model`, `interpret_model`, and `evaluate_model` now accept a new parameter `use_train_data` which, when set to True, generates the plot on train data instead of test data.
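The switch from weighted to binary averaging noted in this list changes reported numbers noticeably on imbalanced data. The sketch below recomputes precision both ways in plain Python to show the difference; it follows the usual definitions (positive-class precision versus a support-weighted mean over classes) and is not PyCaret's implementation. The function names and the toy labels are illustrative assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def precision_for(label: Hashable, y_true: Sequence, y_pred: Sequence) -> float:
    """Precision of a single class: correct predictions of `label` / all predictions of `label`."""
    predicted = [t for t, p in zip(y_true, y_pred) if p == label]
    if not predicted:
        return 0.0
    return sum(t == label for t in predicted) / len(predicted)


def precision(y_true: Sequence, y_pred: Sequence, average: str = "binary", pos_label: Hashable = 1) -> float:
    """Precision with 'binary' (positive class only) or 'weighted' (by class support) averaging."""
    if average == "binary":
        return precision_for(pos_label, y_true, y_pred)
    if average == "weighted":
        support = Counter(y_true)  # number of true samples per class
        total = len(y_true)
        return sum(precision_for(lbl, y_true, y_pred) * n / total for lbl, n in support.items())
    raise ValueError(f"unsupported average: {average!r}")


y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0]
binary_precision = precision(y_true, y_pred, average="binary")      # positive class only
weighted_precision = precision(y_true, y_pred, average="weighted")  # pulled up by the majority class
```

On this toy data the positive-class precision is 0.5, while the weighted figure is about 0.67 because the majority class is predicted well.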
- Modules Impacted: `pycaret.classification`, `pycaret.regression`, `pycaret.clustering`, `pycaret.anomaly`
- Separate Train and Test Set: New parameter `test_data` has been added in the `setup` function of `pycaret.classification` and `pycaret.regression`. When a DataFrame is passed into `test_data`, it is used as a holdout set and the `train_size` parameter is ignored. `test_data` must be labeled and the shape of `test_data` must match with the shape of `data`.
- Disable Default Preprocessing: A new parameter `preprocess` has been added into the `setup` function. When `preprocess` is set to `False`, no transformations are applied except for `train_test_split` and custom transformations passed in the `custom_pipeline` param. Data must be ready for modeling (no missing values, no dates, categorical data encoding) when preprocess is set to False.
- Custom Metrics: New functions `get_metrics`, `add_metric` and `remove_metric` are now added in `pycaret.classification`, `pycaret.regression`, and `pycaret.clustering`, that can be used to add / remove metrics used in model evaluation.
- Custom Transformations: A new parameter `custom_pipeline` has been added into the `setup` function. It takes a tuple of `(str, transformer)` or a list of tuples. When passed, the custom transformers are appended to the preprocessing pipeline and are applied on each CV fold separately and on the final fit. All the custom transformations are applied after `train_test_split` and before pycaret's internal transformations.
- GPU enabled Training: To use GPU for training, the `use_gpu` parameter in the `setup` function can be set to `True` or `force`. When set to True, it will use GPU with algorithms that support it and fall back on CPU for the remaining ones. When set to `force`, it will only use GPU-enabled algorithms and raise exceptions if they are unavailable for use. The following algorithms are supported on GPU:
  - Extreme Gradient Boosting: `pycaret.classification`, `pycaret.regression`
  - LightGBM: `pycaret.classification`, `pycaret.regression`
  - CatBoost: `pycaret.classification`, `pycaret.regression`
  - Random Forest: `pycaret.classification`, `pycaret.regression`
  - K-Nearest Neighbors: `pycaret.classification`, `pycaret.regression`
  - Support Vector Machine: `pycaret.classification`, `pycaret.regression`
  - Logistic Regression: `pycaret.classification`
  - Ridge Classifier: `pycaret.classification`
  - Linear Regression: `pycaret.regression`
  - Lasso Regression: `pycaret.regression`
  - Ridge Regression: `pycaret.regression`
  - Elastic Net (Regression): `pycaret.regression`
  - K-Means: `pycaret.clustering`
  - Density-Based Spatial Clustering: `pycaret.clustering`
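The difference between `use_gpu=True` and `use_gpu='force'` described above is a fallback policy. The following is a minimal sketch of that policy only, not PyCaret's code: `pick_device`, the `gpu_capable` set and the algorithm ids are hypothetical placeholders.

```python
from typing import Set, Union


def pick_device(algorithm: str, use_gpu: Union[bool, str], gpu_capable: Set[str]) -> str:
    """Return 'gpu' or 'cpu' for one algorithm under the described use_gpu policy."""
    if use_gpu is False:
        return "cpu"  # CPU-only training
    if use_gpu is not True and use_gpu != "force":
        raise ValueError(f"use_gpu must be True, False or 'force', got {use_gpu!r}")
    if algorithm in gpu_capable:
        return "gpu"
    if use_gpu == "force":
        # 'force' refuses to fall back silently.
        raise RuntimeError(f"{algorithm!r} has no usable GPU implementation")
    return "cpu"  # use_gpu=True falls back to CPU


# Illustrative ids only; these are not guaranteed to match PyCaret's model ids.
GPU_CAPABLE = {"xgboost", "lightgbm", "kmeans"}
```

With `True`, a mixed run quietly trains unsupported algorithms on CPU; with `'force'`, the same run stops at the first algorithm that cannot use the GPU, which is the safer choice when GPU use must be guaranteed.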
- Hyperparameter Tuning: New methods for hyperparameter tuning have been added in the `tune_model` function for `pycaret.classification` and `pycaret.regression`. New parameters `search_library` and `search_algorithm` are added in the `tune_model` function. `search_library` can be `scikit-learn`, `scikit-optimize`, `tune-sklearn`, or `optuna`. The `search_algorithm` param can take the following values based on its `search_library`:
  - scikit-learn: `random`, `grid`
  - scikit-optimize: `bayesian`
  - tune-sklearn: `random`, `grid`, `bayesian`, `hyperopt`, `bohb`
  - optuna: `random`, `tpe`

  Except for `scikit-learn`, all the other search libraries are not hard dependencies of pycaret and must be installed separately.
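The library-to-algorithm pairing above amounts to a lookup table with validation. Here is a sketch of that check in plain Python, with the per-library defaults taken from the `search_algorithm` parameter notes later in this changelog; `resolve_search_algorithm` is a hypothetical helper, not a PyCaret function.

```python
from typing import Optional

# Allowed algorithms per search library; the first entry is the documented default.
SEARCH_ALGORITHMS = {
    "scikit-learn": ("random", "grid"),
    "scikit-optimize": ("bayesian",),
    "tune-sklearn": ("random", "grid", "bayesian", "hyperopt", "bohb"),
    "optuna": ("tpe", "random"),
}


def resolve_search_algorithm(
    search_library: str = "scikit-learn", search_algorithm: Optional[str] = None
) -> str:
    """Validate the pair and fill in the library-specific default when None."""
    if search_library not in SEARCH_ALGORITHMS:
        raise ValueError(f"unknown search_library: {search_library!r}")
    allowed = SEARCH_ALGORITHMS[search_library]
    if search_algorithm is None:
        return allowed[0]
    if search_algorithm not in allowed:
        raise ValueError(
            f"{search_algorithm!r} is not available for {search_library!r}; choose one of {allowed}"
        )
    return search_algorithm
```

Validating the pair up front gives a clear error before any optional dependency is imported.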
- Early Stopping: Early stopping is now supported for hyperparameter tuning. A new parameter `early_stopping` is added in the `tune_model` function for `pycaret.classification` and `pycaret.regression`. It is ignored when `search_library` is `scikit-learn`, or if the estimator doesn't have a 'partial_fit' attribute. It can be either an object accepted by the search library or one of the following:
  - `asha` for Asynchronous Successive Halving Algorithm
  - `hyperband` for Hyperband
  - `median` for median stopping rule
  - When `False` or `None`, early stopping will not be used.
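The conditions under which `early_stopping` is ignored can be written as a single predicate. This is a sketch of the rule as stated above, not PyCaret's implementation; `early_stopping_applies` and the two stand-in estimator classes are hypothetical.

```python
from typing import Any


class IncrementalModel:
    """Stand-in estimator that supports incremental fitting."""

    def partial_fit(self, X, y):
        return self


class BatchModel:
    """Stand-in estimator without a partial_fit method."""

    def fit(self, X, y):
        return self


def early_stopping_applies(estimator: Any, search_library: str, early_stopping: Any) -> bool:
    """True only when early stopping is requested and would not be ignored."""
    if early_stopping is False or early_stopping is None:
        return False  # not requested
    if search_library == "scikit-learn":
        return False  # ignored for this search library
    # Otherwise it depends on whether the estimator can be fitted incrementally.
    return hasattr(estimator, "partial_fit")
```

For example, `early_stopping_applies(BatchModel(), "optuna", "asha")` is `False`: the request is silently ignored because the estimator has no `partial_fit`.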
- Iterative Imputation: Iterative imputation type for numeric and categorical missing values is now implemented. New parameters `imputation_type`, `iterative_imputation_iters`, `categorical_iterative_imputer`, and `numeric_iterative_imputer` added in the `setup` function. Read the blog post for more details: https://www.linkedin.com/pulse/iterative-imputation-pycaret-22-antoni-baum/?trackingId=Shg1zF%2F%2FR5BE7XFpzfTHkA%3D%3D
- New Plots: Following new plots have been added:
  - lift: `pycaret.classification`
  - gain: `pycaret.classification`
  - tree: `pycaret.classification`, `pycaret.regression`
  - feature_all: `pycaret.classification`, `pycaret.regression`
- CatBoost Compatibility: `CatBoostClassifier` and `CatBoostRegressor` are now compatible with `plot_model`. It requires `catboost>=0.23.2`.
- Log Plots in MLFlow Server: You can now log any plot in the `MLFlow` tracking server that is available in the `plot_model` function. To log specific plots, pass a list containing plot IDs in the `log_plots` parameter. Check the documentation of `plot_model` to see all available plots.
- Data Split Stratification: A new parameter `data_split_stratify` is added in the `setup` function of `pycaret.classification` and `pycaret.regression`. It controls stratification during `train_test_split`. When set to True, will stratify by target column. To stratify on any other columns, pass a list of column names.
- Fold Strategy: A new parameter `fold_strategy` is added in the `setup` function for `pycaret.classification` and `pycaret.regression`. By default, it is 'stratifiedkfold' for `pycaret.classification` and 'kfold' for `pycaret.regression`. Possible values are:
  - `kfold` for KFold CV;
  - `stratifiedkfold` for Stratified KFold CV;
  - `groupkfold` for Group KFold CV;
  - `timeseries` for TimeSeriesSplit CV; or
  - a custom CV generator object compatible with scikit-learn.
- Global Fold Parameter: A new parameter `fold` has been added in the `setup` function for `pycaret.classification` and `pycaret.regression`. It controls the number of folds to be used in cross validation. This is a global setting that can be over-written at function level by using the `fold` parameter within each function. Ignored when `fold_strategy` is a custom object.
- Fold Groups: Optional Group labels when `fold_strategy` is `groupkfold`. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing the group label.
- Transformation Pipeline: All transformations are now applied after `train_test_split`.
- Data Type Handling: All data types handling internally has been changed from `int64` and `float64` to `int32` and `float32` respectively in order to improve memory usage and performance, as well as for better compatibility with GPU-based algorithms.
- AutoML Behavior Change: The `automl` function in `pycaret.classification` and `pycaret.regression` no longer re-fits the model on the entire dataset. As such, if the model needs to be fitted on the entire dataset including the holdout set, `finalize_model` must be explicitly used.
- Default Tuning Grid: Default hyperparameter tuning grids for `RandomForest`, `XGBoost`, `CatBoost`, and `LightGBM` have been amended to remove extreme values for `max_depth` and other training intense parameters to speed up the tuning process.
- Random Forest Default Values: Default value of `n_estimators` for `RandomForestClassifier` and `RandomForestRegressor` has been changed from `10` to `100` to make it consistent with the default behavior of `scikit-learn`.
- AUC for Multiclass Classification: AUC for Multiclass target is now available in the metric evaluation.
- Google Colab Display: All output printed on screen (information grid, score grids) is now format compatible with Google Colab resulting in semantic improvements.
- Sampling Parameter Removed: `sampling` parameter is now removed from the `setup` function of `pycaret.classification` and `pycaret.regression`.
- Type Hinting: In order to make both the usage and development easier, type hints have been added to all updated pycaret functions, in accordance with best practices. Users can leverage those by using an IDE with support for type hints.
- Documentation: All Modules documentation on the website is now retired. Updated documentation is available here: https://pycaret.readthedocs.io/en/latest/
- get_metrics: Returns table of available metrics used for CV.
  `pycaret.classification`, `pycaret.regression`, `pycaret.clustering`
- add_metric: Adds a custom metric for model evaluation.
  `pycaret.classification`, `pycaret.regression`, `pycaret.clustering`
- remove_metric: Remove custom metrics.
  `pycaret.classification`, `pycaret.regression`, `pycaret.clustering`
- save_config: Save all global variables to a pickle file, allowing to later resume without rerunning the `setup` function.
  `pycaret.classification`, `pycaret.regression`, `pycaret.clustering`, `pycaret.anomaly`
- load_config: Load global variables from pickle file into Python environment.
  `pycaret.classification`, `pycaret.regression`, `pycaret.clustering`, `pycaret.anomaly`
`pycaret.classification`, `pycaret.regression`, `pycaret.clustering`, `pycaret.anomaly`
Following new parameters have been added:
- test_data: pandas.DataFrame, default = None
  If not None, test_data is used as a hold-out set, and the `train_size` parameter is ignored. test_data must be labeled and the shape of data and test_data must match.
- preprocess: bool, default = True
  When set to False, no transformations are applied except for `train_test_split` and custom transformations passed in `custom_pipeline` param. Data must be ready for modeling (no missing values, no dates, categorical data encoding) when `preprocess` is set to False.
- imputation_type: str, default = 'simple'
  The type of imputation to use. Can be either 'simple' or 'iterative'.
- iterative_imputation_iters: int, default = 5
  The number of iterations. Ignored when `imputation_type` is not 'iterative'.
- categorical_iterative_imputer: str, default = 'lightgbm'
  Estimator for iterative imputation of missing values in categorical features. Ignored when `imputation_type` is not 'iterative'.
- numeric_iterative_imputer: str, default = 'lightgbm'
  Estimator for iterative imputation of missing values in numeric features. Ignored when `imputation_type` is set to 'simple'.
- data_split_stratify: bool or list, default = False
  Controls stratification during 'train_test_split'. When set to True, will stratify by target column. To stratify on any other columns, pass a list of column names. Ignored when `data_split_shuffle` is False.
- fold_strategy: str or sklearn CV generator object, default = 'stratifiedkfold' / 'kfold'
  Choice of cross validation strategy. Possible values are:
  - 'kfold'
  - 'stratifiedkfold'
  - 'groupkfold'
  - 'timeseries'
  - a custom CV generator object compatible with scikit-learn.
- fold: int, default = 10
  The number of folds to be used in cross-validation. Must be at least 2. This is a global setting that can be over-written at the function level by using the `fold` parameter. Ignored when `fold_strategy` is a custom object.
- fold_shuffle: bool, default = False
  Controls the shuffle parameter of CV. Only applicable when `fold_strategy` is 'kfold' or 'stratifiedkfold'. Ignored when `fold_strategy` is a custom object.
- fold_groups: str or array-like, with shape (n_samples,), default = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
- use_gpu: str or bool, default = False
  When set to 'force', will try to use GPU with all algorithms that support it, and raise exceptions if they are unavailable. When set to True, will use GPU with algorithms that support it, and fall back to CPU if they are unavailable. When False, all algorithms are trained using CPU only.
- custom_pipeline: transformer or list of transformers or tuple, default = None
  When passed, the custom transformers are appended to the preprocessing pipeline and are applied on each CV fold separately and on the final fit. All the custom transformations are applied after 'train_test_split' and before pycaret's internal transformations.
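Because `custom_pipeline` accepts a bare transformer, a `(str, transformer)` tuple, or a list of either, a caller-side way to think about it is as a normalization into named steps. The sketch below shows one way to do that; it is an assumption about a reasonable normalization, not PyCaret's internal logic, and `normalize_custom_pipeline`, the auto-generated step names and the `Doubler` class are all hypothetical.

```python
from typing import Any, List, Tuple


class Doubler:
    """Toy transformer with the usual fit/transform shape (illustrative only)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[2 * value for value in row] for row in X]


def normalize_custom_pipeline(custom_pipeline: Any) -> List[Tuple[str, Any]]:
    """Turn None, a transformer, a (name, transformer) tuple, or a list of those into named steps."""
    if custom_pipeline is None:
        return []
    items = custom_pipeline if isinstance(custom_pipeline, list) else [custom_pipeline]
    steps = []
    for position, item in enumerate(items):
        if isinstance(item, tuple):
            name, transformer = item  # already a (str, transformer) pair
        else:
            # Unnamed transformer: generate a placeholder name (an assumption of this sketch).
            name, transformer = f"custom_step_{position}", item
        steps.append((str(name), transformer))
    return steps


doubler = Doubler()
steps = normalize_custom_pipeline([doubler, ("double_again", doubler)])
```

The resulting list has the same `(name, transformer)` shape that scikit-learn pipelines use for their steps.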
`pycaret.classification`, `pycaret.regression`
Following new parameters have been added:
- cross_validation: bool = True
  When set to False, metrics are evaluated on holdout set. `fold` param is ignored when cross_validation is set to False.
- errors: str = "ignore"
  When set to 'ignore', will skip the model with exceptions and continue. If 'raise', will stop the function when exceptions are raised.
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
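The `errors` parameter above describes a skip-or-stop policy for a loop over candidate models. A minimal sketch of that control flow follows; it is not PyCaret's code, and `compare_candidates` plus the toy scoring callables are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def compare_candidates(
    candidates: Dict[str, Callable[[], float]], errors: str = "ignore"
) -> List[Tuple[str, float]]:
    """Score every candidate; skip failures when errors='ignore', re-raise when errors='raise'."""
    if errors not in ("ignore", "raise"):
        raise ValueError("errors must be 'ignore' or 'raise'")
    results = {}
    for name, fit_and_score in candidates.items():
        try:
            results[name] = fit_and_score()
        except Exception:
            if errors == "raise":
                raise  # stop at the first failing model
            # errors == 'ignore': drop this model and continue
    # Best score first, as in a leaderboard.
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


candidates = {
    "model_a": lambda: 0.8,
    "model_b": lambda: 1 / 0,  # simulates a model that fails to fit
    "model_c": lambda: 0.9,
}
leaderboard = compare_candidates(candidates, errors="ignore")
```

With `errors="ignore"` the failing candidate simply does not appear in the leaderboard, so a missing model is the only visible sign of a problem.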
`pycaret.classification`, `pycaret.regression`
Following new parameters have been added:
- cross_validation: bool = True
  When set to False, metrics are evaluated on holdout set. `fold` param is ignored when cross_validation is set to False.
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.

Following parameters have been removed:
- ensemble - Deprecated - use `ensemble_model` function directly.
- method - Deprecated - use `ensemble_model` function directly.
- system - Moved to private API.
`pycaret.classification`, `pycaret.regression`
Following new parameters have been added:
- search_library: str, default = 'scikit-learn'
  The search library used for tuning hyperparameters. Possible values:
  - 'scikit-learn' - default, requires no further installation https://github.com/scikit-learn/scikit-learn
  - 'scikit-optimize' - `pip install scikit-optimize` https://scikit-optimize.github.io/stable/
  - 'tune-sklearn' - `pip install tune-sklearn ray[tune]` https://github.com/ray-project/tune-sklearn
  - 'optuna' - `pip install optuna` https://optuna.org/
- search_algorithm: str, default = None
  The search algorithm depends on the `search_library` parameter. Some search algorithms require additional libraries to be installed. When None, will use the search library-specific default algorithm.
  - `scikit-learn` possible values: random (default), grid
  - `scikit-optimize` possible values: bayesian (default)
  - `tune-sklearn` possible values: random (default), grid, bayesian (`pip install scikit-optimize`), hyperopt (`pip install hyperopt`), bohb (`pip install hpbandster ConfigSpace`)
  - `optuna` possible values: tpe (default), random
- early_stopping: bool or str or object, default = False
  Use early stopping to stop fitting to a hyperparameter configuration if it performs poorly. Ignored when `search_library` is scikit-learn, or if the estimator does not have 'partial_fit' attribute. Can be either an object accepted by the search library or one of the following:
  - 'asha' for Asynchronous Successive Halving Algorithm
  - 'hyperband' for Hyperband
  - 'median' for Median Stopping Rule
  - If False or None, early stopping will not be used.
- early_stopping_max_iters: int, default = 10
  The maximum number of epochs to run for each sampled configuration. Ignored if `early_stopping` is False or None.
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
- return_tuner: bool, default = False
  When set to True, will return a tuple of (model, tuner_object).
- tuner_verbose: bool or int, default = True
  If True or above 0, will print messages from the tuner. Higher values print more messages. Ignored when `verbose` param is False.
`pycaret.classification`, `pycaret.regression`
Following new parameters have been added:
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
`pycaret.classification`, `pycaret.regression`
Following new parameters have been added:
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
- weights: list, default = None
  Sequence of weights (float or int) to weight the occurrences of predicted class labels (hard voting) or class probabilities before averaging (soft voting). Uses uniform weights when None.
- The default value for the `method` parameter has been changed from `hard` to `auto`.
`pycaret.classification`, `pycaret.regression`
Following new parameters have been added:
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
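The `groups` parameter accepts either a column name or an array of length n_samples. A sketch of how such an argument could be resolved is shown below, using a plain dict of columns in place of a DataFrame so it runs without dependencies; `resolve_groups` and the sample data are hypothetical and this is not PyCaret's code.

```python
from typing import Dict, List, Optional, Sequence, Union


def resolve_groups(
    groups: Optional[Union[str, Sequence]], data: Dict[str, List]
) -> Optional[List]:
    """Return one group label per row, looking up a column when a string is passed."""
    n_samples = len(next(iter(data.values())))
    if groups is None:
        return None
    if isinstance(groups, str):
        if groups not in data:
            raise KeyError(f"column {groups!r} not found in the dataset")
        resolved = list(data[groups])  # string: interpreted as a column name
    else:
        resolved = list(groups)  # array-like: used as given
    if len(resolved) != n_samples:
        raise ValueError(f"expected {n_samples} group labels, got {len(resolved)}")
    return resolved


data = {
    "patient_id": ["p1", "p1", "p2", "p3"],
    "feature": [0.1, 0.4, 0.3, 0.9],
}
by_column = resolve_groups("patient_id", data)
by_array = resolve_groups(["a", "a", "b", "b"], data)
```

Grouping by an identifier such as `patient_id` keeps all rows of one entity in the same fold, which is the point of group-based cross-validation.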
`pycaret.classification`
Following new parameters have been added:
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
`pycaret.classification`, `pycaret.regression`
Following new parameters have been added:
- fold: int or scikit-learn compatible CV generator, default = None
  Controls cross-validation. If None, the CV generator in the `fold_strategy` parameter of the `setup` function is used. When an integer is passed, it is interpreted as the 'n_splits' parameter of the CV generator in the `setup` function.
- fit_kwargs: Optional[dict] = None
  Dictionary of arguments passed to the fit method of the model.
- groups: Optional[Union[str, Any]] = None
  Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
pycaret.classification
pycaret.regression
The following new parameters have been added:
- `fold`: int or scikit-learn compatible CV generator, default = None. Controls cross-validation. If None, the CV generator in the `fold_strategy` parameter of the `setup` function is used. When an integer is passed, it is interpreted as the 'n_splits' parameter of the CV generator in the `setup` function.
- `fit_kwargs`: Optional[dict] = None. Dictionary of arguments passed to the fit method of the model.
- `groups`: Optional[Union[str, Any]] = None. Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
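To illustrate what the `groups` parameter is for, here is a minimal, dependency-free sketch of group-aware fold assignment. The helper `group_folds` is hypothetical and is not part of PyCaret or scikit-learn; it assigns groups to folds round-robin, whereas scikit-learn's `GroupKFold` balances fold sizes differently. The point it shows is only that all rows sharing a group label land in the same test fold.

```python
from typing import Hashable, List, Sequence, Tuple


def group_folds(
    groups: Sequence[Hashable], n_splits: int
) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) pairs that never split a group.

    Hypothetical helper for illustration only; groups are assigned to
    folds round-robin in order of first appearance.
    """
    distinct = list(dict.fromkeys(groups))  # unique labels, order preserved
    if n_splits < 2 or n_splits > len(distinct):
        raise ValueError("n_splits must be between 2 and the number of groups")

    # Map every group label to exactly one fold.
    fold_of = {label: i % n_splits for i, label in enumerate(distinct)}

    splits = []
    for k in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == k]
        train = [i for i, g in enumerate(groups) if fold_of[g] != k]
        splits.append((train, test))
    return splits


# One group label per row of the training data, shape (n_samples,).
row_groups = ["a", "a", "b", "b", "c", "c"]
splits = group_folds(row_groups, n_splits=3)
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

In PyCaret itself, the same labels would be supplied through `groups`, either as an array of this shape or as the name of a column in the dataset.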
pycaret.classification
pycaret.regression
The following new parameters have been added:
- `fit_kwargs`: Optional[dict] = None. Dictionary of arguments passed to the fit method of the model.
- `groups`: Optional[Union[str, Any]] = None. Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
- `model_only`: bool, default = True. When set to False, only the model object is re-trained and all the transformations in Pipeline are ignored.
pycaret.classification
pycaret.regression
pycaret.clustering
pycaret.anomaly
The following new parameters have been added:
- `internal`: bool, default = False. When True, will return extra columns and rows used internally.
- `raise_errors`: bool, default = True. When False, will suppress all exceptions, ignoring models that couldn't be created.
- Post-release `2.1`, a bug was reported that prevented the `predict_model` function from working in the `regression` module in a new notebook session when `transform_target` was set to `False` during model training. This issue has been fixed in PyCaret release `2.1.2`. To learn more about the issue: pycaret#525
- Post-release `2.1`, a bug was identified in the MLFlow back-end. The error occurs only when `log_experiment` in the `setup` function is set to True, and it applies to all modules. The cause has been identified and an issue has been opened with `MLFlow`. The error is caused by the `infer_signature` function in `mlflow.sklearn.log_model` and is raised only when there are missing values in the dataset. This issue has been fixed in PyCaret release `2.1.1` by skipping the signature in cases where `MLFlow` raises an exception.
- Model Deployment: Model deployment support for `gcp` and `azure` has been added in the `deploy_model` function for all modules. See the documentation for details.
- Compare Models Budget Time: A new parameter `budget_time` has been added in the `compare_models` function. Use `budget_time` to set an upper limit on `compare_models` training time.
- Feature Selection: A new feature selection method, `boruta`, has been added. By default, the `feature_selection_method` parameter in the `setup` function is set to `classic`, but it can be set to `boruta` for feature selection using the Boruta algorithm. This change is applicable to `pycaret.classification` and `pycaret.regression`.
- Numeric Imputation: A new method `zero` has been added to `numeric_imputation` in the `setup` function. When the method is set to `zero`, missing values are replaced with the constant 0. The default behavior of `numeric_imputation` is unchanged.
- Plot Model: A new parameter `scale` has been added in `plot_model` for all modules to enable high-quality images for research publications.
- User Defined Loss Function: You can now pass `custom_scorer` to optimize a user-defined loss function in `tune_model` for `pycaret.classification` and `pycaret.regression`. You must use `make_scorer` from `sklearn` to create the custom loss function that is passed into `custom_scorer` for the `tune_model` function.
- Change in Pipeline Behavior: When using `save_model`, the `model` object is appended to the `Pipeline`; as such, the behavior of `Pipeline` and `predict_model` has changed. Instead of saving a `list`, `save_model` now saves a `Pipeline` object in which the trained model is in the last position. The front-end user functionality of `predict_model` remains the same.
- Compare Models: The parameters `blacklist` and `whitelist` are now renamed to `exclude` and `include` with no change in functionality.
- Predict Model Labels: The `Label` column returned by the `predict_model` function in `pycaret.classification` now returns the original label instead of the encoded value. This change makes the output from `predict_model` more human-readable. A new parameter `encoded_labels` is added, which is `False` by default. When set to `True`, it returns encoded labels.
- Model Logging: Model persistence in the backend when `log_experiment` is set to `True` has changed. Instead of using the internal `save_model` functionality, it now uses `mlflow.sklearn.save_model` to allow the use of Model Registry and `MLFlow` native deployment functionalities.
- CatBoost Compatibility: `CatBoostClassifier` is now compatible with `blend_models` in `pycaret.classification`. As such, `blend_models` without any `estimator_list` will now blend a total of `15` estimators, including `CatBoostClassifier`.
- Stack Models: `stack_models` in `pycaret.classification` and `pycaret.regression` now uses `StackingClassifier()` and `StackingRegressor` from `sklearn`. As such, the `stack_models` function now returns an `sklearn` object instead of the custom `list` returned in previous versions.
- Create Stacknet: `create_stacknet` in `pycaret.classification` and `pycaret.regression` is now removed.
- Tune Model: `tune_model` in `pycaret.classification` and `pycaret.regression` now inherits params from the input `estimator`. As such, if you have trained `xgboost`, `lightgbm` or `catboost` on GPU, it will not inherit the training method from `estimator`.
- Interpret Model: A `**kwargs` argument is now added in `interpret_model`.
- Pandas Categorical Type: All modules are now compatible with the `pandas.Categorical` object. Internally, such columns are converted to object and treated the same way as `object` or `bool`.
- use_gpu: A new parameter added in the `setup` function for `pycaret.classification` and `pycaret.regression`. In `2.1` it was added to prepare for the backend work required to make this change in future releases. As such, using the `use_gpu` param in `2.1` has no impact.
- Unit Tests: Unit testing enhanced. Continuous improvement in progress: https://github.com/pycaret/pycaret/tree/master/pycaret/tests
- Automated Documentation: Automated documentation has been added. Documentation on the website will only update for `major` releases 0.X. For all minor monthly releases, documentation will be available at: https://pycaret.readthedocs.io/en/latest/
- Introduction of GitHub Actions: CI/CD build testing has moved from `travis-ci` to `github-actions`. `pycaret-nightly` is now published every 24 hours automatically.
- Tutorials: All tutorials are now updated using `pycaret==2.0`. https://github.com/pycaret/pycaret/tree/master/tutorials
- Resources: New resources added under `/pycaret/resources/` https://github.com/pycaret/pycaret/tree/master/resources
- Example Notebook: Many example notebooks added under `/pycaret/examples/` https://github.com/pycaret/pycaret/tree/master/examples
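As a rough illustration of the `budget_time` idea from the `compare_models` entry in these notes, the sketch below stops starting new models once a time budget is spent and ranks whatever has finished. Everything here is an assumption made for illustration: the helper `compare_with_budget`, the budget being expressed in minutes, and the injectable `clock` argument are not taken from PyCaret's implementation.

```python
import time
from typing import Callable, Dict, List, Tuple


def compare_with_budget(
    trainers: Dict[str, Callable[[], float]],
    budget_minutes: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[Tuple[str, float]]:
    """Train models in order until the budget is exhausted.

    Hypothetical helper: each trainer returns a score where higher is
    better. A model that has already started is allowed to finish; the
    budget is only checked before starting the next one.
    """
    deadline = clock() + budget_minutes * 60
    results: List[Tuple[str, float]] = []
    for name, train in trainers.items():
        if clock() >= deadline:
            break  # budget spent: return what has been trained so far
        results.append((name, train()))
    # Best score first, mirroring a sorted score grid.
    return sorted(results, key=lambda item: item[1], reverse=True)


# Stand-in "training" functions that simply return a fixed score.
candidates = {
    "lr": lambda: 0.71,
    "rf": lambda: 0.84,
    "knn": lambda: 0.66,
}
leaderboard = compare_with_budget(candidates, budget_minutes=1)
print(leaderboard)
```

Passing a custom `clock` makes the cut-off behavior easy to exercise without waiting for real time to elapse.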
- Experiment Logging: MLFlow logging backend added. New parameters `log_experiment`, `experiment_name`, `log_profile` and `log_data` added in `setup`. Available in `pycaret.classification`, `pycaret.regression`, `pycaret.clustering`, `pycaret.anomaly` and `pycaret.nlp`.
- Save / Load Experiment: The `save_experiment` and `load_experiment` functions from `pycaret.classification`, `pycaret.regression`, `pycaret.clustering`, `pycaret.anomaly` and `pycaret.nlp` are removed in PyCaret 2.0.
- System Logging: System log files are now generated when `setup` is executed. The `logs.log` file is saved in the current working directory. The function `get_system_logs` can be used to access the log file in a notebook.
- Command Line Support: When using PyCaret 2.0 outside of a notebook, the `html` parameter in `setup` must be set to False.
- Imbalance Dataset: `fix_imbalance` and `fix_imbalance_method` parameters added in `setup` for `pycaret.classification`. When set to True, SMOTE is applied by default to create synthetic datapoints for the minority class. To change the method, pass any class from `imblearn` that supports the `fit_resample` method in the `fix_imbalance_method` parameter.
- Save Plot: `save` parameter added in `plot_model`. When set to True, it saves the plot as `png` or `html` in the current working directory.
- kwargs: `**kwargs` added in `create_model` for `pycaret.classification`, `pycaret.regression`, `pycaret.clustering` and `pycaret.anomaly`.
- choose_better: `choose_better` and `optimize` parameters added in `tune_model`, `ensemble_model`, `blend_models`, `stack_models` and `create_stacknet` in `pycaret.classification` and `pycaret.regression`. Read the details below to learn more about this.
- Training Time: `TT (Sec)` added in the `compare_models` function for `pycaret.classification` and `pycaret.regression`.
- New Metric: `MCC` metric added in the score grid for `pycaret.classification`.
- NEW FUNCTION: `automl()` added in `pycaret.classification` and `pycaret.regression`.
- NEW FUNCTION: `pull()` added in `pycaret.classification` and `pycaret.regression`.
- NEW FUNCTION: `models()` added in `pycaret.classification`, `pycaret.regression`, `pycaret.clustering`, `pycaret.anomaly` and `pycaret.nlp`.
- NEW FUNCTION: `get_logs()` added in `pycaret.classification`, `pycaret.regression`, `pycaret.clustering`, `pycaret.anomaly` and `pycaret.nlp`.
- NEW FUNCTION: `get_config()` added in `pycaret.classification`, `pycaret.regression`, `pycaret.clustering`, `pycaret.anomaly` and `pycaret.nlp`.
- NEW FUNCTION: `set_config()` added in `pycaret.classification`, `pycaret.regression`, `pycaret.clustering`, `pycaret.anomaly` and `pycaret.nlp`.
- NEW FUNCTION: `get_system_logs()` added in `pycaret.classification`, `pycaret.regression`, `pycaret.clustering`, `pycaret.anomaly` and `pycaret.nlp`.
- CHANGE IN BEHAVIOR: `compare_models` now returns the top_n models defined by the `n_select` parameter, which is set to 1 by default.
- CHANGE IN BEHAVIOR: The `tune_model` function in `pycaret.classification` and `pycaret.regression` now requires a trained model object to be passed as `estimator` instead of a string abbreviation / ID.
- REMOVED DEPENDENCIES: `awscli` and `shap` removed from requirements.txt. To use the `interpret_model` function in `pycaret.classification` and `pycaret.regression`, and the `deploy_model` function in `pycaret.classification`, `pycaret.regression`, `pycaret.clustering` and `pycaret.anomaly`, these libraries must be installed separately.
pycaret.classification
pycaret.regression
pycaret.clustering
pycaret.anomaly
pycaret.nlp
- `remove_perfect_collinearity` parameter added in `setup()`. Default set to False.
  When set to True, perfect collinearity (features with correlation = 1) is removed from the dataset. When two features are 100% correlated, one of them is randomly dropped from the dataset.
- `fix_imbalance` parameter added in `setup()`. Default set to False.
  When the dataset has an unequal distribution of the target class, it can be fixed using the fix_imbalance parameter. When set to True, SMOTE (Synthetic Minority Over-sampling Technique) is applied by default to create synthetic datapoints for the minority class.
- `fix_imbalance_method` parameter added in `setup()`. Default set to None.
  When fix_imbalance is set to True and fix_imbalance_method is None, 'smote' is applied by default to oversample the minority class during cross validation. This parameter accepts any module from 'imblearn' that supports the 'fit_resample' method.
- `data_split_shuffle` parameter added in `setup()`. Default set to True.
  If set to False, prevents shuffling of rows when splitting data.
- `folds_shuffle` parameter added in `setup()`. Default set to False.
  If set to False, prevents shuffling of rows when using cross validation.
- `n_jobs` parameter added in `setup()`. Default set to -1.
  The number of jobs to run in parallel (for functions that support parallel processing). -1 means using all processors. To run all functions on a single processor, set n_jobs to None.
- `html` parameter added in `setup()`. Default set to True.
  If set to False, prevents runtime display of the monitor. This must be set to False when using an environment that doesn't support HTML.
- `log_experiment` parameter added in `setup()`. Default set to False.
  When set to True, all metrics and parameters are logged on the MLFlow server.
- `experiment_name` parameter added in `setup()`. Default set to None.
  Name of the experiment for logging. When set to None, 'clf' is used by default as the alias for the experiment name.
- `log_plots` parameter added in `setup()`. Default set to False.
  When set to True, specific plots are logged in MLflow as a png file.
- `log_profile` parameter added in `setup()`. Default set to False.
  When set to True, the data profile is also logged on MLflow as an html file.
- `log_data` parameter added in `setup()`. Default set to False.
  When set to True, the train and test datasets are logged as csv.
- `verbose` parameter added in `setup()`. Default set to True.
  The information grid is not printed when verbose is set to False.
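The `remove_perfect_collinearity` behavior can be pictured with a small, dependency-free sketch. The helper names below are hypothetical and this is not PyCaret's implementation: the notes say one of the two features is dropped at random, whereas this sketch deterministically keeps the first feature seen so that the result is reproducible. A small tolerance is assumed when comparing the correlation with 1 to absorb floating-point error.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long numeric columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (norm_x * norm_y)


def drop_perfectly_collinear(
    columns: Dict[str, List[float]], tol: float = 1e-9
) -> Dict[str, List[float]]:
    """Keep a column only if it is not perfectly correlated with a kept one."""
    kept: Dict[str, List[float]] = {}
    for name, values in columns.items():
        duplicate = any(
            abs(pearson(values, other) - 1.0) <= tol for other in kept.values()
        )
        if not duplicate:
            kept[name] = values
    return kept


data = {
    "length_cm": [1.0, 2.0, 3.0, 4.0],
    "length_mm": [10.0, 20.0, 30.0, 40.0],  # exact multiple of length_cm
    "weight": [3.0, 1.0, 4.0, 1.5],
}
reduced = drop_perfectly_collinear(data)
print(list(reduced))
```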
pycaret.classification
pycaret.regression
- `whitelist` parameter added in `compare_models`. Default set to None.
  To run only certain models in the comparison, pass the model IDs as a list of strings in the whitelist param.
- `n_select` parameter added in `compare_models`. Default set to 1.
  Number of top_n models to return. Use a negative argument for bottom selection. For example, n_select = -3 means the bottom 3 models.
- `verbose` parameter added in `compare_models`. Default set to True.
  The score grid is not printed when verbose is set to False.
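The `n_select` semantics described above (positive values select from the top, negative values from the bottom) map naturally onto list slicing. The sketch below uses a hypothetical helper, `select_models`, over a ranking that is already sorted best-first; it is not PyCaret code, and the order in which PyCaret returns bottom models is not stated in these notes.

```python
from typing import List, Sequence


def select_models(ranked: Sequence[str], n_select: int = 1) -> List[str]:
    """Pick models from a best-first ranking.

    Hypothetical helper: a positive n_select returns the top n entries,
    a negative n_select returns the bottom |n| entries.
    """
    if n_select == 0:
        raise ValueError("n_select must be a non-zero integer")
    if n_select > 0:
        return list(ranked[:n_select])
    return list(ranked[n_select:])


ranking = ["catboost", "lightgbm", "rf", "lr", "knn"]  # best model first
top_two = select_models(ranking, 2)
bottom_three = select_models(ranking, -3)
print(top_two)
print(bottom_three)
```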
pycaret.classification
pycaret.regression
pycaret.clustering
pycaret.anomaly
- `cross_validation` parameter added in `create_model`. Default set to True.
  When cross_validation is set to False, the fold parameter is ignored and the model is trained on the entire training dataset. No metric evaluation is returned. Only applicable in `pycaret.classification` and `pycaret.regression`.
- `system` parameter added in `create_model`. Default set to True.
  Must remain True at all times. Only to be changed by internal functions.
- `ground_truth` parameter added in `create_model`. Default set to None.
  When ground_truth is provided, Homogeneity Score, Rand Index, and Completeness Score are evaluated and printed along with other metrics. This is only available in `pycaret.clustering`.
- `kwargs` parameter added in `create_model`.
  Additional keyword arguments to pass to the estimator.
pycaret.classification
pycaret.regression
pycaret.clustering
pycaret.anomaly
pycaret.nlp
- `custom_grid` parameter added in `tune_model`. Default set to None.
  To use custom hyperparameters for tuning, pass a dictionary with parameter names and the values to be iterated. When set to None, the pre-defined tuning grid is used. For `pycaret.clustering`, `pycaret.anomaly` and `pycaret.nlp`, the custom_grid param must be a list of values to iterate over.
- `choose_better` parameter added in `tune_model`. Default set to False.
  When set to True, the base estimator is returned when the performance doesn't improve by tune_model. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
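The `choose_better` rule can be summarized as "keep the original unless the new model scores strictly better on the `optimize` metric". The sketch below is a hypothetical illustration, not PyCaret's implementation. It assumes that the error metrics listed in these notes for regression ('MAE', 'MSE', 'RMSE', 'RMSLE', 'MAPE') are better when lower and that all other listed metrics are better when higher.

```python
from typing import Any, Tuple

# Assumption: for these error metrics a lower score is the better one.
LOWER_IS_BETTER = {"MAE", "MSE", "RMSE", "RMSLE", "MAPE"}


def choose_better(
    base: Tuple[Any, float],
    candidate: Tuple[Any, float],
    optimize: str = "Accuracy",
) -> Any:
    """Return the candidate model only if it beats the base model.

    Hypothetical helper: each argument is a (model, score) pair where
    the score was measured with the metric named in `optimize`.
    """
    base_model, base_score = base
    candidate_model, candidate_score = candidate
    if optimize in LOWER_IS_BETTER:
        improved = candidate_score < base_score
    else:
        improved = candidate_score > base_score
    return candidate_model if improved else base_model


# Tuning made accuracy worse, so the base estimator is kept.
kept = choose_better(("base_rf", 0.81), ("tuned_rf", 0.79), optimize="Accuracy")
# Tuning lowered the error, so the tuned estimator is kept.
kept_reg = choose_better(("base_lr", 4.2), ("tuned_lr", 3.9), optimize="MAE")
print(kept, kept_reg)
```

A tie returns the base model here, which matches the wording "returned when the performance doesn't improve".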
pycaret.classification
pycaret.regression
- `choose_better` parameter added in `ensemble_model`. Default set to False.
  When set to True, the base estimator is returned when the performance doesn't improve by ensemble_model. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
- `optimize` parameter added in `ensemble_model`. Default set to `Accuracy` for `pycaret.classification` and `R2` for `pycaret.regression`.
  Only used when choose_better is set to True. The optimize parameter is used to compare the ensembled model with the base estimator. Values accepted in the optimize parameter for `pycaret.classification` are 'Accuracy', 'AUC', 'Recall', 'Precision', 'F1', 'Kappa', 'MCC' and for `pycaret.regression` are 'MAE', 'MSE', 'RMSE', 'R2', 'RMSLE' and 'MAPE'.
pycaret.classification
pycaret.regression
- `choose_better` parameter added in `blend_models`. Default set to False.
  When set to True, the base estimator is returned when the performance doesn't improve by blend_models. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
- `optimize` parameter added in `blend_models`. Default set to `Accuracy` for `pycaret.classification` and `R2` for `pycaret.regression`.
  Only used when choose_better is set to True. The optimize parameter is used to compare the ensembled model with the base estimator. Values accepted in the optimize parameter for `pycaret.classification` are 'Accuracy', 'AUC', 'Recall', 'Precision', 'F1', 'Kappa', 'MCC' and for `pycaret.regression` are 'MAE', 'MSE', 'RMSE', 'R2', 'RMSLE' and 'MAPE'.
pycaret.classification
pycaret.regression
- `choose_better` parameter added in `stack_models`. Default set to False.
  When set to True, the base estimator is returned when the performance doesn't improve by stack_models. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
- `optimize` parameter added in `stack_models`. Default set to `Accuracy` for `pycaret.classification` and `R2` for `pycaret.regression`.
  Only used when choose_better is set to True. The optimize parameter is used to compare the ensembled model with the base estimator. Values accepted in the optimize parameter for `pycaret.classification` are 'Accuracy', 'AUC', 'Recall', 'Precision', 'F1', 'Kappa', 'MCC' and for `pycaret.regression` are 'MAE', 'MSE', 'RMSE', 'R2', 'RMSLE' and 'MAPE'.
pycaret.classification
pycaret.regression
- `choose_better` parameter added in `create_stacknet`. Default set to False.
  When set to True, the base estimator is returned when the performance doesn't improve by create_stacknet. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
- `optimize` parameter added in `create_stacknet`. Default set to `Accuracy` for `pycaret.classification` and `R2` for `pycaret.regression`.
  Only used when choose_better is set to True. The optimize parameter is used to compare the ensembled model with the base estimator. Values accepted in the optimize parameter for `pycaret.classification` are 'Accuracy', 'AUC', 'Recall', 'Precision', 'F1', 'Kappa', 'MCC' and for `pycaret.regression` are 'MAE', 'MSE', 'RMSE', 'R2', 'RMSLE' and 'MAPE'.
pycaret.classification
pycaret.regression
- `verbose` parameter added in `predict_model`. Default set to True.
  The holdout score grid is not printed when verbose is set to False.
pycaret.classification
pycaret.regression
pycaret.clustering
pycaret.anomaly
pycaret.nlp
- `save` parameter added in `plot_model`. Default set to False.
  When set to True, the plot is saved as a 'png' file in the current working directory.
- `verbose` parameter added in `plot_model`. Default set to True.
  The progress bar is not shown when verbose is set to False.
- `system` parameter added in `plot_model`. Default set to True.
  Must remain True at all times. Only to be changed by internal functions.
pycaret.classification
pycaret.regression
- This function returns the best model out of all models created in the current active environment, based on the metric defined in the optimize parameter.
- `optimize`: string, default = 'Accuracy' for `pycaret.classification` and 'R2' for `pycaret.regression`.
  Other values you can pass in the optimize param are 'AUC', 'Recall', 'Precision', 'F1', 'Kappa', and 'MCC' for `pycaret.classification` and 'MAE', 'MSE', 'RMSE', 'R2', 'RMSLE', and 'MAPE' for `pycaret.regression`.
- `use_holdout`: bool, default = False.
  When set to True, metrics are evaluated on the holdout set instead of CV.
pycaret.classification
pycaret.regression
- This function returns the last printed score grid as a pandas dataframe.
pycaret.classification
pycaret.regression
pycaret.clustering
pycaret.anomaly
pycaret.nlp
- This function returns the table of models available in the model library.
- `type`: string, default = None.
  - linear: filters and returns only linear models
  - tree: filters and returns only tree-based models
  - ensemble: filters and returns only ensemble models
- The `type` parameter is only available in `pycaret.classification` and `pycaret.regression`.
pycaret.classification
pycaret.regression
pycaret.clustering
pycaret.anomaly
pycaret.nlp
- This function returns a table with experiment logs consisting of run details, parameters, metrics and tags.
- `experiment_name`: string, default = None.
  When set to None, the current active run is used.
- `save`: bool, default = False.
  When set to True, a csv file is saved in the current directory.
pycaret.classification
pycaret.regression
pycaret.clustering
pycaret.anomaly
pycaret.nlp
- This function is used to access global environment variables. Check the docstring for the list of accessible global variables.
pycaret.classification
pycaret.regression
pycaret.clustering
pycaret.anomaly
pycaret.nlp
- This function is used to reset global environment variables. Check the docstring for the list of accessible global variables.
pycaret.classification
pycaret.regression
pycaret.clustering
pycaret.anomaly
pycaret.nlp
- This function reads and prints the 'logs.log' file from the current active directory. logs.log is generated when `setup` is initialized in any module.