Pandas ML Utils is intended to support your journey through statistical and machine learning models without ever needing to leave the world of pandas.
- install:
  `pip install pandas-ml-utils`
- optional finance:
  `pip install pandas-ml-utils[finance]`
  allows you to `pd.fetch_yahoo(...)`
- optional crypto:
  `pip install pandas-ml-utils[crypto]`
  allows you to `pd.fetch_crypto(...)`
- optional notebook:
  `pip install pandas-ml-utils[notebook]`
  renders results nicely in notebooks
- optional development:
  `pip install pandas-ml-utils[development]`
  if you want to develop
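Once installed, simply importing the package attaches the methods used throughout this README (such as `fit` and `predict`) to pandas DataFrames. A quick, illustrative sanity check:

```python
import pandas as pd
import pandas_ml_utils as pmu  # the import patches DataFrame with fit/predict/feature_selection

df = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
print(hasattr(df, "fit"), hasattr(df, "predict"))  # expected: True True
```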
It provides you with handy tools to:
- analyze your features
- find a model
- save and reuse your model
Or read the docs.
The `feature_selection` functionality helps you analyze your features, filter out highly correlated ones, and focus on the most important features. This function also applies an autoregression and embeds an ACF plot.
```python
import pandas as pd
import pandas_ml_utils as pmu  # the import attaches feature_selection to DataFrame

# keep only the numeric rating columns of the burrito data set
df = pd.read_csv('burritos.csv')[["Tortilla", "Temp", "Meat", "Fillings", "Meat:filling",
                                  "Uniformity", "Salsa", "Synergy", "Wrap", "overall"]]
df.feature_selection(label_column="overall")
```
          Tortilla   overall   Synergy  Fillings      Temp     Salsa      Meat  Uniformity  Meat:filling      Wrap
Tortilla       1.0  0.403981  0.367575  0.345613  0.290702  0.267212  0.260194    0.208666      0.207518  0.160831
label is continuous: True
Feature ranking:
['Synergy', 'Meat', 'Fillings', 'Meat:filling', 'Wrap', 'Tortilla', 'Uniformity', 'Salsa', 'Temp']
TOP 5 features
Synergy Meat Fillings Meat:filling Wrap
Synergy 1.0 0.601545 0.663328 0.428505 0.08685
filtered features with correlation < 0.5
Synergy Meat:filling Wrap
Tortilla 0.367575 0.207518 0.160831
Synergy 1.000000
Synergy_0 1.000000
Synergy_1 0.147495
Synergy_56 0.128449
Synergy_78 0.119272
Synergy_55 0.111832
Synergy_79 0.086466
Synergy_47 0.085117
Synergy_53 0.084786
Synergy_37 0.084312
Name: Synergy, dtype: float64
Meat:filling 1.000000
Meat:filling_0 1.000000
Meat:filling_15 0.185946
Meat:filling_35 0.175837
Meat:filling_1 0.122546
Meat:filling_87 0.118597
Meat:filling_33 0.112875
Meat:filling_73 0.103090
Meat:filling_72 0.103054
Meat:filling_71 0.089437
Name: Meat:filling, dtype: float64
Wrap 1.000000
Wrap_0 1.000000
Wrap_63 0.210823
Wrap_88 0.189735
Wrap_1 0.169132
Wrap_87 0.166502
Wrap_66 0.146689
Wrap_89 0.141822
Wrap_74 0.120047
Wrap_11 0.115095
Name: Wrap, dtype: float64
best lags are
[(1, '-1.00'), (2, '-0.15'), (88, '-0.10'), (64, '-0.07'), (19, '-0.07'), (89, '-0.06'), (36, '-0.05'), (43, '-0.05'), (16, '-0.05'), (68, '-0.04'), (90, '-0.04'), (87, '-0.04'), (3, '-0.03'), (20, '-0.03'), (59, '-0.03'), (75, '-0.03'), (91, '-0.03'), (57, '-0.03'), (46, '-0.02'), (48, '-0.02'), (54, '-0.02'), (73, '-0.02'), (25, '-0.02'), (79, '-0.02'), (76, '-0.02'), (37, '-0.02'), (71, '-0.02'), (15, '-0.02'), (49, '-0.02'), (12, '-0.02'), (65, '-0.02'), (40, '-0.02'), (24, '-0.02'), (78, '-0.02'), (53, '-0.02'), (8, '-0.02'), (44, '-0.01'), (45, '0.01'), (56, '0.01'), (26, '0.01'), (82, '0.01'), (77, '0.02'), (22, '0.02'), (83, '0.02'), (11, '0.02'), (66, '0.02'), (31, '0.02'), (80, '0.02'), (92, '0.02'), (39, '0.03'), (27, '0.03'), (70, '0.04'), (41, '0.04'), (51, '0.04'), (4, '0.04'), (7, '0.05'), (13, '0.05'), (97, '0.06'), (60, '0.06'), (42, '0.06'), (96, '0.06'), (95, '0.06'), (30, '0.07'), (81, '0.07'), (52, '0.07'), (9, '0.07'), (61, '0.07'), (84, '0.07'), (29, '0.08'), (94, '0.08'), (28, '0.11')]
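For intuition, the "filtered features with correlation < 0.5" step above can be approximated in plain pandas. The following greedy sketch is illustrative only and not the library's actual implementation:

```python
import pandas as pd

def filter_correlated(df: pd.DataFrame, features: list, threshold: float = 0.5) -> list:
    """Greedily keep a feature only if its absolute correlation with every
    previously kept (more important) feature stays below the threshold."""
    corr = df[features].corr().abs()
    kept = []
    for feature in features:  # assumed to be sorted by feature ranking
        if all(corr.loc[feature, k] < threshold for k in kept):
            kept.append(feature)
    return kept

# e.g. using the feature ranking printed above:
# filter_correlated(df, ['Synergy', 'Meat', 'Fillings', 'Meat:filling', 'Wrap',
#                        'Tortilla', 'Uniformity', 'Salsa', 'Temp'])
```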
Once you know your features you can start trying out different models, e.g. a very basic LogisticRegression. It is also possible to search over a set of hyperparameters.
```python
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.linear_model import LogisticRegression
from pandas_ml_utils.summary.binary_classification_summary import BinaryClassificationSummary

df = pd.read_csv('burritos.csv')
columns = ["Tortilla", "Temp", "Meat", "Fillings", "Meat:filling", "Uniformity",
           "Salsa", "Synergy", "Wrap", "overall", "with_fires", "price"]

fit = df.fit(pmu.SkitModel(LogisticRegression(solver='lbfgs'),
                           pmu.FeaturesAndLabels(["Tortilla", "Temp", "Meat", "Fillings", "Meat:filling",
                                                  "Uniformity", "Salsa", "Synergy", "Wrap", "overall"],
                                                 ["with_fires"],
                                                 pre_processor=lambda _df: pmu.LazyDataFrame(
                                                     _df,
                                                     with_fires=lambda f: f["Fries"].apply(lambda x: str(x).lower() == "x"),
                                                     price=lambda f: f["Cost"] * -1).to_dataframe()[columns].dropna()),
                           BinaryClassificationSummary))
fit  # display the fit summary
```
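For clarity, the `pre_processor` above only derives the binary label and an auxiliary price column before fitting. In plain pandas the same transformation would look roughly like this (an illustrative equivalent, not the library's internals):

```python
import pandas as pd

# illustrative plain-pandas equivalent of the pre_processor above (not library code)
df = pd.read_csv('burritos.csv')
df["with_fires"] = df["Fries"].apply(lambda x: str(x).lower() == "x")  # label: fries marked with an "x"
df["price"] = df["Cost"] * -1                                          # auxiliary target: negative cost
df = df[columns].dropna()                                              # `columns` as defined in the snippet above
```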
Once you are happy with your model you can save it and apply it to any DataFrame that provides the columns required by your features.

```python
fit.save_model("/tmp/burrito.model")
```
And then just apply the model to a data frame coming straight from your data source:

```python
df = pd.read_csv('burritos.csv')
df.predict(pmu.Model.load("/tmp/burrito.model")).tail()
```
               price
          prediction               target
               value  value_proba   value
380            False     0.251311   -6.85
381            False     0.328659   -6.85
382            False     0.064751  -11.50
383            False     0.428745   -7.89
384            False     0.265546   -7.89
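Since the prediction frame carries both the predicted class and its probability, a custom decision threshold can be applied directly. The column tuple below is inferred from the printed frame above and may need to be adjusted to your actual layout:

```python
prediction = df.predict(pmu.Model.load("/tmp/burrito.model"))

# hypothetical post-processing: classify as True above a 0.3 probability threshold
proba = prediction[("price", "prediction", "value_proba")]
custom_decision = proba > 0.3
```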
- allow multiple classes for classification
- replace hard-coded summary objects by a summary provider function
- add more tests
- add Proximity https://stats.stackexchange.com/questions/270201/pooling-levels-of-categorical-variables-for-regression-trees/275867#275867
- implement a new TfKeras model for tensorflow 2.x
- for non-classification problems you might want to augment the `Summary`:
  - write some tests
  - add more and different charts for a better understanding/interpretation of the models
  - add whatever you need for yourself and share it with us
- refactored how training and test data sets are split
- allow controlling the amount of young test data being used (useful for time series)
- add sample weights, e.g. to penalize the loss per sample in a keras model
- changed SkitModel to SkModel
- some minor bug fixes
- introduce proper keras session and graph handling in case of tensorflow backend
- rename features_and_labels.loss to gross_loss to avoid confusion with the training loss
- added engineered source frame to backtest
- introduced pre-processing of data frame in features and labels
- changed the lambda parameters of the target and loss providers (can be a 1, 2 or 3 parameter lambda)
- bug fixes in LazyDataFrame
- refactored the data frame logic in the feature and label extractor for using multi level index
- there is now only one `fit`, one `backtest` and one `predict` method
- the Summary class has to be provided as part of the model, i.e. BinaryClassificationSummary
- added sphinx documentation
- added multi model as a regular model, which has quite a big impact
- features and labels signature changed
- multiple targets now have the consequence that a lot of things return a dict
- everything now uses DataFrames instead of arrays after the plain model invocation
- added some tests
- fixed some bugs along the way
- Added hyper parameter tuning:

```python
import numpy as np
import pandas_ml_utils as pmu
from hyperopt import hp
from sklearn.neural_network import MLPClassifier

# df is assumed to contain the vix_Close / label / vix_Open / spy_Volume columns
fit = df.fit_classifier(
    pmu.SkitModel(MLPClassifier(activation='tanh', hidden_layer_sizes=(60, 50), random_state=42),
                  pmu.FeaturesAndLabels(features=['vix_Close'], labels=['label'],
                                        targets=("vix_Open", "spy_Volume"))),
    test_size=0.4,
    test_validate_split_seed=42,
    hyper_parameter_space={'alpha': hp.choice('alpha', [0.001, 0.1]), 'early_stopping': True, 'max_iter': 50,
                           '__max_evals': 4, '__rstate': np.random.RandomState(42)})
```
NOTE: there is currently a bug in the hyperopt module ("bson has no attribute BSON")! However, there is a workaround:
sudo pip uninstall bson
pip install pymongo
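As the example suggests, `hyper_parameter_space` mixes hyperopt search expressions with fixed parameters, while double-underscore keys steer the search itself. A minimal space following that convention (a sketch derived from the example above, not an exhaustive spec):

```python
import numpy as np
from hyperopt import hp

hyper_parameter_space = {
    'alpha': hp.choice('alpha', [0.001, 0.1]),  # dimension to be searched
    'early_stopping': True,                     # fixed parameter, passed to the model as-is
    '__max_evals': 4,                           # search budget: number of evaluations
    '__rstate': np.random.RandomState(42),      # random state for a reproducible search
}
```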
- Added support for rescaling features within the auto regressive lags. The following example re-scales the joint min/max of featureA and featureC to the range of -1 and 1.

```python
FeaturesAndLabels(["featureA", "featureB", "featureC"],
                  ["labelA"],
                  feature_rescaling={("featureA", "featureC"): (-1, 1)})
```
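Conceptually the rescaling maps the joint min/max of the grouped features onto the target range, roughly like this (an illustrative sketch, not the library's implementation):

```python
import numpy as np

def rescale(values: np.ndarray, target=(-1.0, 1.0)) -> np.ndarray:
    """Linearly map [values.min(), values.max()] onto the target range."""
    lo, hi = target
    vmin, vmax = values.min(), values.max()
    return (values - vmin) / (vmax - vmin) * (hi - lo) + lo
```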
- added a feature selection functionality. When starting from scratch this just helps
  to analyze the data for feature importance and feature (auto-)correlation. E.g.
  `df.feature_selection(label_column='delta')`
  takes all columns as features except for the delta column (which is the label) and reduces the feature space by some heuristics.