diff --git a/README.md b/README.md index 39328f2..b4b4082 100644 --- a/README.md +++ b/README.md @@ -8,8 +8,8 @@ [![Build Status](https://travis-ci.org/joemccann/dillinger.svg?branch=master)](https://github.com/AutoViML) ## Table of Contents +## Update (Jan 2022): Now with mlflow! +You can now add `mlflow` experiment tracking to all your deep_autoviml runs. [mlflow](https://mlflow.org/) is a popular python library for experiment tracking and MLOps in general. See more details below under `mlflow`. + ## Motivation ✨ deep_autoviml is a powerful new deep learning library with a very simple design goal: ✨ ```Make it easy for novices and experts to experiment and build tensorflow.keras preprocessing pipelines and models in fewest steps.``` @@ -38,13 +41,13 @@ deep autoviml is primarily meant for sophisticated data engineers, data scientis 1. Instead, you can "bring your own model" ("BYOM" option) model to attach keras data pipelines to your model. 1. Additionally, you can choose any Tensorflow Hub model (TFHub) to custom train on your data. Just look for instructions below in "Tips for using deep_autoviml" section. 1. There are 4 ways to build your model quickly or slowly depending on your needs: -
-  • fast: a quick model that uses only dense layers (deep layers)
-  • fast1: a deep and wide model that uses both deep and wide layers
-  • fast2: a deep and cross model that crosses some variables (hence deep and cross)
-  • auto: This will try out multiple combinations of dense layers and optimize them using either Optuna or Storm-Tuner. This will take the longest time
  • +- fast: a quick model that uses only dense layers (deep layers) +- fast1: a deep and wide model that uses both deep and wide layers. This is slightly slower than `fast` model. +- fast2: a deep and cross model that crosses some variables (hence deep and cross). This is about the same speed as 'fast1` model. +- auto: This uses `Optuna` or `Storm-Tuner` to perform combinations of dense layers and select best architecture. This will take the longest time. ![why_deep](deep_2.jpg) -## InnerWorking +## Features These are the main features that distinguish deep_autoviml from other libraries: - It uses keras preprocessing layers which are more intuitive, and are included inside your model to simplify deployment - The pipeline is available to you to use as inputs in your own functional model (if you so wish - you must specify that option in the input - see below for "pipeline") @@ -57,7 +60,6 @@ These are the main features that distinguish deep_autoviml from other libraries: ![how_it_works](deep_1.jpg) ## Technology - deep_autoviml uses the latest in tensorflow (2.4.1+) td.data.Datasets and tf.keras preprocessing technologies: the Keras preprocessing layers enable you to encapsulate feature engineering and preprocessing into the model itself. This makes the process for training and predictions the same: just feed input data (in the form of files or dataframes) and the model will take care of all preprocessing before predictions. To perform its preprocessing on the model itself, deep_autoviml uses [tensorflow](https://www.tensorflow.org/) (TF 2.4.1+ and later versions) and [tf.keras](https://www.tensorflow.org/api_docs/python/tf/keras) experimental preprocessing layers: these layers are part of your saved model. They become part of the model's computational graph that can be optimized and executed on any device including GPU's and TPU's. By packaging everything as a single unit, we save the effort in reimplementing the preprocessing logic on the production server. The new model can take raw tabular data with numeric and categorical variables or strings text directly without any preprocessing. This avoids missing or incorrect configuration for the preprocesing_layer during production. @@ -67,7 +69,6 @@ In addition, to select the best hyper parameters for the model, it uses a new op ![how_deep](deep_4.jpg) ## Install - deep_autoviml requires [tensorflow](https://www.tensorflow.org/api_docs/python/tf) v2.4.1+ and [storm-tuner](https://github.com/ben-arnao/StoRM) to run. Don't worry! We will install these libraries when you install deep_autoviml. 
``` @@ -85,7 +86,6 @@ pip install git+https://github.com/AutoViML/deep_autoviml.git ``` ## Usage - ![deep_usage](deep_5.jpg) deep_autoviml can be invoked with a simple import and run statement: @@ -98,7 +98,8 @@ Load a data set (any .csv or .gzip or .gz or .txt file) into deep_autoviml and i ``` model, cat_vocab_dict = deepauto.fit(train, target, keras_model_type="auto", project_name="deep_autoviml", keras_options={}, model_options={}, - save_model_flag=True, use_my_model='', model_use_case='', verbose=0) + save_model_flag=True, use_my_model='', model_use_case='', verbose=0, + use_mlflow=False, mlflow_exp_name='autoviml', mlflow_run_name='first_run') ``` Once deep_autoviml writes your saved model and cat_vocab_dict files to disk in the project_name directory, you can load it from anywhere (including cloud) for predictions like this using the model and cat_vocab_dict generated above: @@ -132,6 +133,11 @@ deep_autoviml requires only a single line of code to get started. You can howeve - `save_model_flag`: must be True or False. The model will be saved in keras model format. - `use_my_model`: This is where "bring your own model" (BYOM) option comes into play. This BYOM model must be a keras Sequential model with NO input layers and output layers! You can define it and send it as input here. We will add input and preprocessing layers to it automatically. Your custom defined model must contain only hidden layers (Dense, Conv1D, Conv2D, etc.), and dropouts, activations, etc. The default for this argument is "" (empty string) which means we will build your model. If you provide your custom model object here, we will use it instead. - `verbose`: must be 0, 1 or 2. Can also be True or False. You can see more and more outputs as you increase the verbose level. If you want to see a chart of your model, use verbose = 2. But you must have graphviz and pydot installed in your machine to see the model plot. +-`use_mlflow`: default = False. Use for MLflow lifecycle tracking. You can set it to True. MLflow is an open source python library useed to manage ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. +Once the model training (via `fit` method) is done, you need to run MLflow locally from your working directory. Run below command on command line. This will start MLflow UI on port 5000 (http://localhost:5000/) and user can manage and visualize the end-to-end machine learning lifecycle.
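For example, a training run with MLflow tracking turned on might look like the sketch below; the experiment and run names are placeholder strings, and every other argument follows the `fit` signature shown above:
```
model, cat_vocab_dict = deepauto.fit(train, target, keras_model_type="fast",
                    project_name="deep_autoviml", keras_options={}, model_options={},
                    save_model_flag=True, use_my_model='', model_use_case='', verbose=0,
                    use_mlflow=True, mlflow_exp_name='my_experiment', mlflow_run_name='run_1')
```
Once the run finishes, start the MLflow UI with the command below: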
    +`$ mlflow ui` +-`mlflow_exp_name`: Default value is 'autoviml'. MLflow experiment name. You can change this to any string you want. +-`mlflow_run_name`: Default value is'first_run'. Each run under an experiment can have a unique run name. You can change this. ## Image ![image_deep](deep_7.jpg) @@ -203,4 +209,4 @@ PRs accepted. Apache License 2.0 © 2020 Ram Seshadri ## DISCLAIMER -This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose. \ No newline at end of file +This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose. diff --git a/build/lib/deep_autoviml/__init__.py b/build/lib/deep_autoviml/__init__.py new file mode 100644 index 0000000..ef9c5ba --- /dev/null +++ b/build/lib/deep_autoviml/__init__.py @@ -0,0 +1,52 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +# -*- coding: utf-8 -*- +################################################################################ +# deep_auto_viml - build and test multiple Tensorflow 2.0 models and pipelines +# Python v3.6+ tensorflow v2.4.1+ +# Created by Ram Seshadri +# Licensed under Apache License v2 +################################################################################ +# Version +from .__version__ import __version__ +__all__ = ['data_load', 'models', 'modeling', 'preprocessing', 'utilities'] +import pdb + +from .deep_autoviml import fit +from deep_autoviml.modeling.predict_model import load_test_data, predict, predict_images, predict_text +from deep_autoviml.utilities.utilities import print_one_row_from_tf_dataset, print_one_row_from_tf_label +from deep_autoviml.utilities.utilities import print_classification_metrics, print_regression_model_stats +from deep_autoviml.utilities.utilities import print_classification_model_stats, plot_history, plot_classification_results +################################################################################ +if __name__ == "__main__": + module_type = 'Running' +else: + module_type = 'Imported' +version_number = __version__ +print(""" +%s deep_auto_viml. 
version=%s +from deep_autoviml import deep_autoviml as deepauto +------------------- +model, cat_vocab_dict = deepauto.fit(train, target, keras_model_type="fast", + project_name="deep_autoviml", keras_options=keras_options, + model_options=model_options, save_model_flag=True, use_my_model='', + model_use_case='', verbose=0) + +predictions = deepauto.predict(model, project_name, test_dataset=test, + keras_model_type=keras_model_type, + cat_vocab_dict=cat_vocab_dict) + """ %(module_type, version_number)) +################################################################################ diff --git a/build/lib/deep_autoviml/__version__.py b/build/lib/deep_autoviml/__version__.py new file mode 100644 index 0000000..8df4dba --- /dev/null +++ b/build/lib/deep_autoviml/__version__.py @@ -0,0 +1,25 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +# -*- coding: utf-8 -*- +"""Specifies the version of the deep_autoviml package.""" + +__title__ = "deep_autoviml" +__author__ = "Ram Seshadri" +__description__ = "deep_autoviml - build and test multiple Tensorflow 2.0 models and pipelines" +__url__ = "https://github.com/Auto_ViML/deep_autoviml.git" +__version__ = "0.0.82" +__license__ = "Apache License 2.0" +__copyright__ = "2020-21 Google" diff --git a/build/lib/deep_autoviml/data_load/classify_features.py b/build/lib/deep_autoviml/data_load/classify_features.py new file mode 100644 index 0000000..e451e40 --- /dev/null +++ b/build/lib/deep_autoviml/data_load/classify_features.py @@ -0,0 +1,1219 @@ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
+############################################################################################ +# -*- coding: utf-8 -*- +################################################################################ +# deep_autoviml - build and test multiple Tensorflow 2.0 models and pipelines +# Python v3.6+ tensorflow v2.4.1+ +# Created by Ram Seshadri +# Licensed under Apache License v2 +################################################################################ +import pandas as pd +import numpy as np +np.random.seed(99) +#### The warnings from Sklearn are so annoying that I have to shut it off ####### +import warnings +warnings.filterwarnings("ignore") +from sklearn.exceptions import DataConversionWarning +warnings.filterwarnings(action='ignore', category=DataConversionWarning) +def warn(*args, **kwargs): + pass +warnings.warn = warn +#################################################################################### +import re +import pdb +import pprint +from itertools import cycle, combinations +from collections import defaultdict, OrderedDict +import time +import sys +import random +import xlrd +import statsmodels +from io import BytesIO +import base64 +from functools import reduce +import copy +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +import tempfile +import copy +import warnings +warnings.filterwarnings(action='ignore') +import functools +# Make numpy values easier to read. +np.set_printoptions(precision=3, suppress=True) +############################################################################################ +# data pipelines and feature engg here + +# pre-defined TF2 Keras models and your own models here + +# Utils + +############################################################################################ +# TensorFlow ≥2.4 is required +import tensorflow as tf +np.random.seed(42) +tf.random.set_seed(42) +from tensorflow.keras import layers +from tensorflow import keras +from tensorflow.keras.layers.experimental.preprocessing import Normalization, StringLookup +from tensorflow.keras.layers.experimental.preprocessing import IntegerLookup, CategoryEncoding +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization + +from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras.models import Model, load_model +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers + +from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error +from IPython.core.display import Image, display +import pickle + +##### Suppress all TF2 and TF1.x warnings ################### +tf2logger = tf.get_logger() +tf2logger.warning('Silencing TF2.x warnings') +tf2logger.root.removeHandler(tf2logger.root.handlers) +tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) +############################################################################################ +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D, AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D, GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization + 
+####################################################################################################### +def classify_features(dfte, depVar, model_options={}, verbose=0): + max_cols_analyzed = 30 + dfte = copy.deepcopy(dfte) + if isinstance(depVar, list): + orig_preds = [x for x in list(dfte) if x not in depVar] + else: + orig_preds = [x for x in list(dfte) if x not in [depVar]] + ################# CLASSIFY COLUMNS HERE ###################### + preds_copy_copy = copy.deepcopy(orig_preds) + for key in preds_copy_copy: + if len(dfte[key].map(type).value_counts()) > 1: + print('Alert! %s has %d mixed data types: %s ' %(key,len(dfte[key].map(type).value_counts()), + dfte[key].map(type).value_counts().index)) + var_df = classify_columns(dfte[orig_preds], model_options, verbose) + ##### Classify Columns ################ + IDcols = var_df['id_vars'] + nlp_vars = var_df['nlp_vars'] + discrete_string_vars = var_df['discrete_string_vars'] + cols_delete = var_df['cols_delete'] + int_vars = var_df['int_vars'] + var_df['num_bool_vars'] + categorical_vars = var_df['cat_vars'] + var_df['factor_vars'] + var_df['string_bool_vars'] + date_vars = var_df['date_vars'] + continuous_vars = var_df['continuous_vars'] + ####### Now search for latitude and longitude variables ###### + lats, lons, matched_pairs = find_latitude_longitude_columns_in_df(dfte[orig_preds], verbose) + if len(lats+lons) > 0: + continuous_vars = left_subtract(continuous_vars, lats+lons) + categorical_vars = left_subtract(categorical_vars, lats+lons) + discrete_string_vars = left_subtract(discrete_string_vars, lats+lons) + ###################################################################### + #cols_delete = list(set(IDcols+cols_delete)) ## leave IDcols in dataset. We need ID's to track rows. + preds = [x for x in orig_preds if x not in cols_delete] + + var_df['cols_delete'] = cols_delete + if len(cols_delete) == 0: + print(' No variables removed since no ID or low-information variables found in data set') + else: + print(' %d variable(s) to be removed since they were ID or low-information variables' + %len(cols_delete)) + if verbose >= 1: + print(' List of variables to be removed: %s' %cols_delete) + ############# Check if there are too many columns to visualize ################ + ppt = pprint.PrettyPrinter(indent=4) + if verbose > 1 and len(preds) <= max_cols_analyzed: + marthas_columns(dfte,verbose) + print(" Columns to delete:") + ppt.pprint(' %s' % cols_delete) + print(" Categorical variables: ") + ppt.pprint(' %s' % categorical_vars) + print(" Continuous variables:" ) + ppt.pprint(' %s' % continuous_vars) + print(" Discrete string variables: " ) + ppt.pprint(' %s' % discrete_string_vars) + print(" NLP string variables: " ) + ppt.pprint(' %s' % nlp_vars) + print(" Date and time variables: " ) + ppt.pprint(' %s' % date_vars) + if len(lats) > 0: + print(" Latitude variables:" ) + ppt.pprint(' %s' % lats) + if len(lons) > 0: + print(" Longitude variables:" ) + ppt.pprint(' %s' % lons) + if len(matched_pairs) > 0: + print(" Matched Latitude and Longitude variables:" ) + ppt.pprint(' %s' % matched_pairs) + print(" ID variables %s ") + ppt.pprint(' %s' % IDcols) + print(" Target variable %s ") + ppt.pprint(' %s' % depVar) + elif verbose==1 and len(preds) > max_cols_analyzed: + print(' Total columns > %d, too numerous to list.' 
%max_cols_analyzed) + features_dict = dict([('IDcols',IDcols),('cols_delete',cols_delete),('categorical_vars',categorical_vars), ( + 'lat_vars',lats),('lon_vars',lons),('matched_pairs',matched_pairs), ('int_vars',int_vars), + ('continuous_vars',continuous_vars),('discrete_string_vars',discrete_string_vars), + ('nlp_vars',nlp_vars), ('date_vars',date_vars)]) + return features_dict +####################################################################################################### +def marthas_columns(data,verbose=0): + """ + This program is named in honor of my one of students who came up with the idea for it. + It's a neat way of printing data types and information compared to the boring describe() function in Pandas. + """ + data = data[:] + print('Data Set Shape: %d rows, %d cols' % data.shape) + if data.shape[1] > 30: + print('Too many columns to print') + else: + if verbose==1: + print('Data Set columns info:') + for col in data.columns: + print('* %s: %d nulls, %d unique vals, most common: %s' % ( + col, + data[col].isnull().sum(), + data[col].nunique(), + data[col].value_counts().head(2).to_dict() + )) + print('--------------------------------------------------------------------') +################################################################################ +######### NEW And FAST WAY to CLASSIFY COLUMNS IN A DATA SET ####### +################################################################################ +from collections import defaultdict +def classify_columns(df_preds, model_options={}, verbose=0): + """ + This actually does Exploratory data analysis - it means this function performs EDA + ###################################################################################### + Takes a dataframe containing only predictors to be classified into various types. + DO NOT SEND IN A TARGET COLUMN since it will try to include that into various columns. + Returns a data frame containing columns and the class it belongs to such as numeric, + categorical, date or id column, boolean, nlp, discrete_string and cols to delete... 
+ ####### Returns a dictionary with 10 kinds of vars like the following: # continuous_vars,int_vars + # cat_vars,factor_vars, bool_vars,discrete_string_vars,nlp_vars,date_vars,id_vars,cols_delete + """ + train = copy.deepcopy(df_preds) + #### If there are 30 chars are more in a discrete_string_var, it is then considered an NLP variable + ### if a variable has more than this many chars, it will be treated like a NLP variable + + max_nlp_char_size = check_model_options(model_options, "nlp_char_limit", 50) + ### if a variable has more than this limit, it will not be treated like a cat variable # + #### Cat_Limit defines the max number of categories a column can have to be called a categorical colum + cat_limit = check_model_options(model_options, "variable_cat_limit", 30) + max_cols_to_print = 30 + #### Make this limit low so that float variables below this limit become cat vars ### + float_limit = 15 + print('############## C L A S S I F Y I N G V A R I A B L E S ####################') + print('Classifying variables in data set...') + def add(a,b): + return a+b + sum_all_cols = defaultdict(list) + orig_cols_total = train.shape[1] + #Types of columns + cols_delete = [col for col in list(train) if (len(train[col].value_counts()) == 1 + ) | (train[col].isnull().sum()/len(train) >= 0.90)] + train = train[left_subtract(list(train),cols_delete)] + var_df = pd.Series(dict(train.dtypes)).reset_index(drop=False).rename( + columns={0:'type_of_column'}) + sum_all_cols['cols_delete'] = cols_delete + var_df['bool'] = var_df.apply(lambda x: 1 if x['type_of_column'] in ['bool','object'] + and len(train[x['index']].value_counts()) == 2 else 0, axis=1) + string_bool_vars = list(var_df[(var_df['bool'] ==1)]['index']) + sum_all_cols['string_bool_vars'] = string_bool_vars + var_df['num_bool'] = var_df.apply(lambda x: 1 if x['type_of_column'] in [np.uint8, + np.uint16, np.uint32, np.uint64, + 'int8','int16','int32','int64', + 'float16','float32','float64'] and len( + train[x['index']].value_counts()) == 2 else 0, axis=1) + num_bool_vars = list(var_df[(var_df['num_bool'] ==1)]['index']) + sum_all_cols['num_bool_vars'] = num_bool_vars + ###### This is where we take all Object vars and split them into diff kinds ### + discrete_or_nlp = var_df.apply(lambda x: 1 if x['type_of_column'] in ['object'] and x[ + 'index'] not in string_bool_vars+cols_delete else 0,axis=1) + ######### This is where we figure out whether a string var is nlp or discrete_string var ### + var_df['nlp_strings'] = 0 + var_df['discrete_strings'] = 0 + var_df['cat'] = 0 + var_df['id_col'] = 0 + var_df['date_time'] = 0 + discrete_or_nlp_vars = var_df.loc[discrete_or_nlp==1]['index'].values.tolist() + ###### This is where we detect categorical variables based on category limit ####### + if len(var_df.loc[discrete_or_nlp==1]) != 0: + for col in discrete_or_nlp_vars: + #### first fill empty or missing vals since it will blowup ### + train[col] = train[col].fillna(' ') + if train[col].map(lambda x: len(x) if type(x)==str else 0).mean( + ) >= max_nlp_char_size and len(train[col].value_counts() + ) >= int(0.9*len(train)) and col not in string_bool_vars: + try: + pd.to_datetime(train[col],infer_datetime_format=True) + var_df.loc[var_df['index']==col,'date_time'] = 1 + except: + var_df.loc[var_df['index']==col,'nlp_strings'] = 1 + elif len(train[col].value_counts()) > cat_limit and len(train[col].value_counts() + ) <= int(0.9*len(train)) and col not in string_bool_vars: + try: + pd.to_datetime(train[col],infer_datetime_format=True) + 
var_df.loc[var_df['index']==col,'date_time'] = 1 + except: + var_df.loc[var_df['index']==col,'discrete_strings'] = 1 + elif len(train[col].value_counts()) > cat_limit and len(train[col].value_counts() + ) == len(train) and col not in string_bool_vars: + try: + pd.to_datetime(train[col],infer_datetime_format=True) + var_df.loc[var_df['index']==col,'date_time'] = 1 + except: + var_df.loc[var_df['index']==col,'id_col'] = 1 + else: + var_df.loc[var_df['index']==col,'cat'] = 1 + nlp_vars = list(var_df[(var_df['nlp_strings'] ==1)]['index']) + sum_all_cols['nlp_vars'] = nlp_vars + discrete_string_vars = list(var_df[(var_df['discrete_strings'] ==1) ]['index']) + sum_all_cols['discrete_string_vars'] = discrete_string_vars + date_vars = list(var_df[(var_df['date_time'] == 1)]['index']) + ###### This happens only if a string column happens to be an ID column ####### + #### DO NOT Add this to ID_VARS yet. It will be done later.. Dont change it easily... + #### Category DTYPE vars are very special = they can be left as is and not disturbed in Python. ### + var_df['dcat'] = var_df.apply(lambda x: 1 if str(x['type_of_column'])=='category' else 0, + axis=1) + factor_vars = list(var_df[(var_df['dcat'] ==1)]['index']) + sum_all_cols['factor_vars'] = factor_vars + ######################################################################## + date_or_id = var_df.apply(lambda x: 1 if x['type_of_column'] in [np.uint8, + np.uint16, np.uint32, np.uint64, + 'int8','int16', + 'int32','int64'] and x[ + 'index'] not in string_bool_vars+num_bool_vars+discrete_string_vars+nlp_vars+date_vars else 0, + axis=1) + ######### This is where we figure out whether a numeric col is date or id variable ### + var_df['int'] = 0 + ### if a particular column is date-time type, now set it as a date time variable ## + var_df['date_time'] = var_df.apply(lambda x: 1 if x['type_of_column'] in [' 2050: + var_df.loc[var_df['index']==col,'id_col'] = 1 + else: + try: + pd.to_datetime(train[col],infer_datetime_format=True) + var_df.loc[var_df['index']==col,'date_time'] = 1 + except: + var_df.loc[var_df['index']==col,'id_col'] = 1 + else: + if train[col].min() < 1900 or train[col].max() > 2050: + if col not in num_bool_vars: + var_df.loc[var_df['index']==col,'int'] = 1 + else: + try: + pd.to_datetime(train[col],infer_datetime_format=True) + var_df.loc[var_df['index']==col,'date_time'] = 1 + except: + if col not in num_bool_vars: + var_df.loc[var_df['index']==col,'int'] = 1 + else: + pass + int_vars = list(var_df[(var_df['int'] ==1)]['index']) + date_vars = list(var_df[(var_df['date_time'] == 1)]['index']) + id_vars = list(var_df[(var_df['id_col'] == 1)]['index']) + sum_all_cols['int_vars'] = int_vars + copy_date_vars = copy.deepcopy(date_vars) + ###### In Tensorflow there is no need to create age variables from year-dates. Hence removing them! 
+ for date_var in copy_date_vars: + if train[date_var].dtype in ['int16','int32','int64']: + if train[date_var].min() >= 1900 or train[date_var].max() <= 2050: + ### if it is between these numbers, its probably a year so avoid adding it + date_items = train[date_var].dropna(axis=0).apply(str).apply(len).values + if all(date_items[0] == item for item in date_items): + if date_items[0] == 4: + print(' Changing %s from date-var to int-var' %date_var) + int_vars.append(date_var) + date_vars.remove(date_var) + continue + else: + date_items = train[date_var].dropna(axis=0).apply(str).apply(len).values + #### In some extreme cases, 4 digit date variables are not useful + if all(date_items[0] == item for item in date_items): + if date_items[0] == 4: + print(' Changing %s from date-var to discrete-string-var' %date_var) + discrete_string_vars.append(date_var) + date_vars.remove(date_var) + continue + #### This test is to make sure sure date vars are actually date vars + try: + pd.to_datetime(train[date_var],infer_datetime_format=True) + except: + ##### if not a date var, then just add it to delete it from processing + cols_delete.append(date_var) + date_vars.remove(date_var) + sum_all_cols['date_vars'] = date_vars + sum_all_cols['id_vars'] = id_vars + sum_all_cols['cols_delete'] = cols_delete + ## This is an EXTREMELY complicated logic for cat vars. Don't change it unless you test it many times! + var_df['numeric'] = 0 + float_or_cat = var_df.apply(lambda x: 1 if x['type_of_column'] in ['float16', + 'float32','float64'] else 0, + axis=1) + if len(var_df.loc[float_or_cat == 1]) > 0: + for col in var_df.loc[float_or_cat == 1]['index'].values.tolist(): + if len(train[col].value_counts()) > 2 and len(train[col].value_counts() + ) <= float_limit and len(train[col].value_counts()) <= len(train): + var_df.loc[var_df['index']==col,'cat'] = 1 + else: + if col not in num_bool_vars: + var_df.loc[var_df['index']==col,'numeric'] = 1 + cat_vars = list(var_df[(var_df['cat'] ==1)]['index']) + continuous_vars = list(var_df[(var_df['numeric'] ==1)]['index']) + ######## V E R Y I M P O R T A N T ################################################### + ##### There are a couple of extra tests you need to do to remove abberations in cat_vars ### + cat_vars_copy = copy.deepcopy(cat_vars) + for cat in cat_vars_copy: + if df_preds[cat].dtype==float: + continuous_vars.append(cat) + cat_vars.remove(cat) + var_df.loc[var_df['index']==cat,'cat'] = 0 + var_df.loc[var_df['index']==cat,'numeric'] = 1 + elif len(df_preds[cat].value_counts()) == df_preds.shape[0]: + id_vars.append(cat) + cat_vars.remove(cat) + var_df.loc[var_df['index']==cat,'cat'] = 0 + var_df.loc[var_df['index']==cat,'id_col'] = 1 + sum_all_cols['cat_vars'] = cat_vars + sum_all_cols['continuous_vars'] = continuous_vars + sum_all_cols['id_vars'] = id_vars + cols_delete = find_remove_duplicates(cols_delete+id_vars) + sum_all_cols['cols_delete'] = cols_delete + ###### This is where you consoldate the numbers ########### + var_dict_sum = dict(zip(var_df.values[:,0], var_df.values[:,2:].sum(1))) + for col, sumval in var_dict_sum.items(): + if sumval == 0: + print('%s of type=%s is not classified' %(col,train[col].dtype)) + elif sumval > 1: + print('%s of type=%s is classified into more then one type' %(col,train[col].dtype)) + else: + pass + ############### This is where you print all the types of variables ############## + ####### Returns 8 vars in the following order: continuous_vars,int_vars,cat_vars, + ### 
string_bool_vars,discrete_string_vars,nlp_vars,date_or_id_vars,cols_delete + cat_vars_copy = copy.deepcopy(cat_vars) + for each_cat in cat_vars_copy: + if len(train[each_cat].value_counts()) > cat_limit: + discrete_string_vars.append(each_cat) + cat_vars.remove(each_cat) + sum_all_cols['cat_vars'] = cat_vars + sum_all_cols['discrete_string_vars'] = discrete_string_vars + ######### The variables can now be printed ############## + + if verbose == 1: + print(" Number of Numeric Columns = ", len(continuous_vars)) + print(" Number of Integer-Categorical Columns = ", len(int_vars)) + print(" Number of String-Categorical Columns = ", len(cat_vars)) + print(" Number of Factor-Categorical Columns = ", len(factor_vars)) + print(" Number of String-Boolean Columns = ", len(string_bool_vars)) + print(" Number of Numeric-Boolean Columns = ", len(num_bool_vars)) + print(" Number of Discrete String Columns = ", len(discrete_string_vars)) + print(" Number of NLP String Columns = ", len(nlp_vars)) + print(" Number of Date Time Columns = ", len(date_vars)) + print(" Number of ID Columns = ", len(id_vars)) + print(" Number of Columns to Delete = ", len(cols_delete)) + if verbose == 2: + marthas_columns(df_preds,verbose=1) + if verbose >=1 and orig_cols_total > max_cols_to_print: + print(" Numeric Columns: %s" %continuous_vars[:max_cols_to_print]) + print(" Integer-Categorical Columns: %s" %int_vars[:max_cols_to_print]) + print(" String-Categorical Columns: %s" %cat_vars[:max_cols_to_print]) + print(" Factor-Categorical Columns: %s" %factor_vars[:max_cols_to_print]) + print(" String-Boolean Columns: %s" %string_bool_vars[:max_cols_to_print]) + print(" Numeric-Boolean Columns: %s" %num_bool_vars[:max_cols_to_print]) + print(" Discrete String Columns: %s" %discrete_string_vars[:max_cols_to_print]) + print(" NLP text Columns: %s" %nlp_vars[:max_cols_to_print]) + print(" Date Time Columns: %s" %date_vars[:max_cols_to_print]) + print(" ID Columns: %s" %id_vars[:max_cols_to_print]) + print(" Columns that will not be considered in modeling: %s" %cols_delete[:max_cols_to_print]) + ##### now collect all the column types and column names into a single dictionary to return! + #### Since cols_delete and id_vars have the same columns, you need to subtract id_vars from this! + len_sum_all_cols = reduce(add,[len(v) for v in sum_all_cols.values()]) - len(id_vars) + if len_sum_all_cols == orig_cols_total: + print(' %d Predictors classified...' %orig_cols_total) + #print(' This does not include the Target column(s)') + else: + print('Number columns classified %d does not match %d total cols. Continuing...' 
%( + len_sum_all_cols, orig_cols_total)) + ls = sum_all_cols.values() + flat_list = [item for sublist in ls for item in sublist] + if len(left_subtract(list(train),flat_list)) == 0: + print(' Missing columns = None') + else: + print(' Missing columns = %s' %left_subtract(list(train),flat_list)) + return sum_all_cols +################################################################################# +import re +WORD = re.compile(r'\w+') +def tokenize_fast(text): + """ + This is a fast function that tokenizes a string of text into its words + """ + words = WORD.findall(text) + return words +############################################################################################ +def check_model_options(model_options, name, default): + try: + if model_options[name]: + value = model_options[name] + else: + value = default + except: + value = default + return value +##################################################################################### +def classify_features_using_pandas(data_sample, target, model_options={}, verbose=0): + """ + If you send in a small pandas dataframe with the name of target variable(s), you will get back + all the features classified by type such as dates, cats, ints, floats and nlps. This is all done using pandas. + """ + data_sample = copy.deepcopy(data_sample) + ###### This is where you get the cat_vocab_dict is created in the form of feats_max_min ##### + feats_max_min = nested_dictionary() + print_features = False + nlps = [] + bools = [] + ### if a variable has more than this many chars, it will be treated like a NLP variable + nlp_char_limit = check_model_options(model_options, "nlp_char_limit", 50) + ### if a variable has more than this limit, it will not be treated like a cat variable # + cat_limit = check_model_options(model_options, "variable_cat_limit", 30) + ### Classify features using the previously define function ############# + var_df1 = classify_features(data_sample, target, model_options, verbose=verbose) + ##### This might be useful for users to know whether to use feature-crosses or not ### + stri, numi, cat_feature_cross_flag = fast_classify_features(data_sample) + convert_cols = [] + if len(numi['veryhighcats']) > 0: + convert_cols = numi['veryhighcats'] + if convert_cols: + var_df1['int_vars'] = left_subtract(var_df1['int_vars'], convert_cols) + var_df1['continuous_vars'] = var_df1['continuous_vars'] + convert_cols + ########################## Set the default variable types here ############# + dates = var_df1['date_vars'] + cats = var_df1['categorical_vars'] + discrete_strings = var_df1['discrete_string_vars'] + lats = var_df1['lat_vars'] + lons = var_df1['lon_vars'] + ignore_variables = copy.deepcopy(var_df1['cols_delete']) + all_ints = var_df1['int_vars'] + if isinstance(target, list): + preds = [x for x in list(data_sample) if x not in target+ignore_variables] + else: + preds = [x for x in list(data_sample) if x not in [target]+ignore_variables] + feats_max_min['predictors_in_train'] = copy.deepcopy(preds) + #### Take(1) always displays only one batch only if num_epochs is set to 1 or a number. Otherwise No print! ######## + #### If you execute the below code without take, then it will go into an infinite loop if num_epochs was set to None. 
+ if verbose >= 1 and target: + print(f"printing first five values of {target}: {data_sample[target].values[:5]}") + if len(preds) <= 30: + print_features = True + if print_features and verbose > 1: + print("printing features and their max, min, datatypes in one batch ") + ###### Now we do the creation of cat_vocab_dict though it is called feats_max_min here ##### + floats = [] + preds_copy = copy.deepcopy(preds) + for key in preds_copy: + if str(data_sample[key].dtype) in ['object', 'category']: + if type('str') in data_sample[key].map(type).value_counts().index: + feats_max_min[key]["dtype"] = "string" + elif data_sample[key].map(type).value_counts().index[0] == int: + data_sample[key] = data_sample[key].astype(np.int32).values + feats_max_min[key]["dtype"] = np.int32 + all_ints.append(key) + if key in cats: + cats.remove(key) + var_df1['categorical_vars'] = copy.deepcopy(cats) + elif key in discrete_strings: + discrete_strings.remove(key) + var_df1['discrete_string_vars'] = copy.deepcopy(discrete_strings) + elif data_sample[key].map(type).value_counts().index[0] == float: + data_sample[key] = data_sample[key].astype(np.float32).values + feats_max_min[key]["dtype"] = np.float32 + floats.append(key) + if key in cats: + cats.remove(key) + var_df1['categorical_vars'] = copy.deepcopy(cats) + elif key in discrete_strings: + discrete_strings.remove(key) + var_df1['discrete_string_vars'] = copy.deepcopy(discrete_strings) + elif data_sample[key].map(type).value_counts().index[0] == bool: + data_sample[key] = data_sample[key].astype(bool).values + feats_max_min[key]["dtype"] = "bool" + bools.append(key) + if key in cats: + cats.remove(key) + var_df1['categorical_vars'] = copy.deepcopy(cats) + elif key in discrete_strings: + discrete_strings.remove(key) + var_df1['discrete_string_vars'] = copy.deepcopy(discrete_strings) + #### This is not a mistake - you have to test it again. 
That way we make sure type is safe + if str(data_sample[key].dtype) in ['object', 'category']: + if data_sample[key].map(type).value_counts().index[0] == object or data_sample[key].map(type).value_counts().index[0] == str: + feats_max_min[key]["dtype"] = "string" + elif data_sample[key].dtype in ['bool']: + feats_max_min[key]["dtype"] = "bool" + bools.append(key) + if key in cats: + cats.remove(key) + elif str(data_sample[key].dtype).split("[")[0] in ['datetime64','datetime32','datetime16']: + feats_max_min[key]["dtype"] = "string" + elif data_sample[key].dtype in [np.int16, np.int32, np.int64]: + if key in convert_cols: + feats_max_min[key]["dtype"] = np.float32 + floats.append(key) + else: + feats_max_min[key]["dtype"] = np.int32 + else: + floats.append(key) + feats_max_min[key]["dtype"] = np.float32 + if feats_max_min[key]['dtype'] in [np.int16, np.int32, np.int64, + np.float16, np.float32, np.float64]: + ##### This is for integer and float variables ####### + if key in lats+lons: + ### For lats and lons you need the vocab to create bins using pd.qcut #### + vocab = data_sample[key].unique() + feats_max_min[key]["vocab"] = vocab + feats_max_min[key]['size_of_vocab'] = len(vocab) + feats_max_min[key]["max"] = max(data_sample[key].values) + feats_max_min[key]["min"] = min(data_sample[key].values) + else: + if feats_max_min[key]['dtype'] in [np.int16, np.int32, np.int64]: + vocab = data_sample[key].unique() + feats_max_min[key]["vocab"] = vocab + feats_max_min[key]['size_of_vocab'] = len(vocab) + else: + ### For the rest of the numeric variables, you just need mean and variance ### + vocab = data_sample[key].unique() + feats_max_min[key]["vocab_min_var"] = [data_sample[key].mean(), data_sample[key].var()] + feats_max_min[key]["vocab"] = vocab + feats_max_min[key]['size_of_vocab'] = len(vocab) + feats_max_min[key]["max"] = max(data_sample[key].values) + feats_max_min[key]["min"] = min(data_sample[key].values) + elif feats_max_min[key]['dtype'] in ['bool']: + ### we are going to convert boolean to float type ##### + vocab = data_sample[key].unique() + full_array = data_sample[key].values + full_array = np.array([0.0 if type(x) == float else float(x) for x in full_array]) + ### Don't change the next line even though it appears wrong. I have tested and it works! + vocab = [0.0 if type(x) == float else float(x) for x in vocab] + feats_max_min[key]["vocab_min_var"] = [full_array.mean(), full_array.var()] + feats_max_min[key]["vocab"] = vocab + feats_max_min[key]['size_of_vocab'] = len(vocab) + elif feats_max_min[key]['dtype'] in ['string']: + data_sample[[key]] = data_sample[[key]].fillna("missing") + data_types = len(data_sample[key].map(type).value_counts()) + if data_types > 1: + print('\nDATA CLEANING ALERT: Dropping %s since it has %s mixed data types.' %(key, data_types)) + print(' Convert this variable to a single data type and re-run deep_autoviml.') + ignore_variables.append(key) + preds.remove(key) + feats_max_min['predictors_in_train'] = preds + var_df1['cols_delete'] = copy.deepcopy(ignore_variables) + if key in cats: + cats.remove(key) + var_df1['categorical_vars'] = copy.deepcopy(cats) + elif key in discrete_strings: + discrete_strings.remove(key) + var_df1['discrete_string_vars'] = copy.deepcopy(discrete_strings) + if not key in ignore_variables: + if np.max(data_sample[key].map(len)) >= nlp_char_limit: + ### This is for NLP variables. 
You want to remove duplicates ##### + if key in dates: + continue + elif key in cats: + cats.remove(key) + var_df1['categorical_vars'] = cats + elif key in discrete_strings: + discrete_strings.remove(key) + var_df1['discrete_string_vars'] = discrete_strings + print('%s is detected as an NLP variable' %key) + if key not in var_df1['nlp_vars']: + var_df1['nlp_vars'].append(key) + #### Now let's calculate some statistics on this NLP variable ### + num_rows_in_data = model_options['DS_LEN'] + if data_sample.shape[0]*data_sample[key].map(len).mean()/1e6 > 1 or data_sample[key].memory_usage(deep=True)/1e6 > 100: + ## number of unique words in a document may be only 10% of the total num of words + ### If this number exceeds 1, it means there are 1 million words in that document + ### Immediately cap the vocab size to 300,000 - don't measure its vocab!! + data_sample = data_sample.sample(frac=0.1, random_state=0) + try: + vocab = np.concatenate(data_sample[key].map(tokenize_fast)) + except: + vocab = np.concatenate(data_sample[key].map(tokenize_fast).values) + vocab = np.unique(vocab).tolist() + feats_max_min[key]["vocab"] = vocab + try: + feats_max_min[key]['seq_length'] = int(data_sample[key].map(len).max()) + num_words_in_each_row = data_sample[key].map(lambda x: len(x.split(" "))).mean() + feats_max_min[key]['size_of_vocab'] = int(num_rows_in_data*num_words_in_each_row) + except: + feats_max_min[key]['seq_length'] = len(vocab) // num_rows_in_data + feats_max_min[key]['size_of_vocab'] = len(vocab) + else: + ### This is for string variables ######## + #### Now we select features if they are present in the data set ### + num_rows_in_data = model_options['DS_LEN'] + data_sample[[key]] = data_sample[[key]].fillna("missing") + vocab = data_sample[key].unique().tolist() + vocab = np.unique(vocab).tolist() + #vocab = ['missing' if type(x) != str else x for x in vocab] + feats_max_min[key]["vocab"] = vocab + feats_max_min[key]['size_of_vocab'] = len(vocab) + feats_max_min[key]['seq_length'] = len(vocab) // num_rows_in_data + else: + #### Now we treat bool and other variable types ### + #feats_max_min[key]["vocab"] = data_sample[key].unique() + vocab = data_sample[key].unique() + #### just leave this as is - it works for other data types + vocab = ['missing' if type(x) == str else x for x in vocab] + feats_max_min[key]["vocab"] = vocab + feats_max_min[key]['size_of_vocab'] = len(vocab) + #feats_max_min[key]['size_of_vocab'] = len(feats_max_min[key]["vocab"]) + if print_features and verbose > 1: + print(" {!r:20s}: {}".format(' sample words from vocab', key, data_sample[key].values[:4])) + print(" {!r:25s}: {}".format(' size of vocab', feats_max_min[key]["size_of_vocab"])) + print(" {!r:25s}: {}".format(' max', feats_max_min[key]["max"])) + print(" {!r:25s}: {}".format(' min', feats_max_min[key]["min"])) + print(" {!r:25s}: {}".format(' dtype', feats_max_min[key]["dtype"])) + if not print_features: + print('Number of variables in dataset is too numerous to print...skipping print') + ##### Save the variable changes back to the variable type dictionary ## + var_df1['discrete_string_vars'] = copy.deepcopy(discrete_strings) + var_df1['categorical_vars'] = copy.deepcopy(cats) + var_df1['cols_delete'] = ignore_variables + + ##### Make some changes to integer variables to select those with less than certain category limits ## + ints = [ x for x in all_ints if feats_max_min[x]['size_of_vocab'] > cat_limit and x not in floats] + + int_bools = [ x for x in all_ints if feats_max_min[x]['size_of_vocab'] == 2 and x 
not in floats] + + int_cats = [ x for x in all_ints if feats_max_min[x]['size_of_vocab'] <= cat_limit and x not in floats+int_bools] + + var_df1['int_vars'] = ints + var_df1['int_cats'] = int_cats + var_df1['int_bools'] = int_bools + var_df1["continuous_vars"] = floats + var_df1['bools'] = bools + #### It is better to have a baseline number for the size of the dataset here ######## + feats_max_min['DS_LEN'] = len(data_sample) + feats_max_min["predictors_in_train"] = preds + print(' after data cleaning, number of predictors used in modeling = %s' %len(preds)) + ### check if cat_vocab_dict has cat_feature_cross_flag in it ### + if "cat_feature_cross_flag" in model_options.keys(): + ### If they mistakenly type it as cat_feature_cross_flag then use it #### + if model_options["cat_feature_cross_flag"]: + cat_feature_cross_flag = model_options["cat_feature_cross_flag"] + print('performing feature crossing for %s variables' %cat_feature_cross_flag) + else: + model_options["cat_feat_cross_flag"] = cat_feature_cross_flag + print('Not performing feature crossing for categorical nor integer variables' ) + elif "cat_feat_cross_flag" in model_options.keys(): + if model_options["cat_feat_cross_flag"]: + cat_feature_cross_flag = model_options["cat_feat_cross_flag"] + print('performing feature crossing for %s variables' %cat_feature_cross_flag) + else: + model_options["cat_feat_cross_flag"] = cat_feature_cross_flag + print('Not performing feature crossing for categorical nor integer variables' ) + else: + ### If there is no input for cat_feature_cross_flag, then don't do it ### + cat_feature_cross_flag = "" + print('Not performing feature crossing for categorical nor integer variables' ) + return data_sample, var_df1, feats_max_min +############################################################################################ +def EDA_classify_and_return_cols_by_type(df1, nlp_char_limit=50): + """ + EDA stands for Exploratory data analysis. This function performs EDA - hence the name + ######################################################################################## + This handy function classifies your columns into different types : make sure you send only predictors. + Beware sending target column into the dataframe. You don't want to start modifying it. + ##################################################################################### + It returns a list of categorical columns, integer cols and float columns in that order. + """ + ### Let's find all the categorical excluding integer columns in dataset: unfortunately not all integers are categorical! 
+ catcols = df1.select_dtypes(include='object').columns.tolist() + df1.select_dtypes(include='category').columns.tolist() + cats = copy.deepcopy(catcols) + nlpcols = [] + for each_cat in cats: + try: + df1[[each_cat]] = df1[[each_cat]].fillna('missing') + if df1[each_cat].map(len).max() >=nlp_char_limit: + nlpcols.append(each_cat) + catcols.remove(each_cat) + except: + continue + intcols = df1.select_dtypes(include='integer').columns.tolist() + int_cats = [ x for x in intcols if df1[x].nunique() <= 30 and x not in idcols] + intcols = left_subtract(intcols, int_cats) + # let's find all the float numeric columns in data + floatcols = df1.select_dtypes(include='float').columns.tolist() + return catcols, int_cats, intcols, floatcols, nlpcols +############################################################################################ +def EDA_classify_features(train, target, idcols, nlp_char_limit=50): + ### Test Labeler is a very important dictionary that will help transform test data same as train #### + test_labeler = defaultdict(list) + + #### all columns are features except the target column and the folds column ### + if isinstance(target, str): + features = [x for x in list(train) if x not in [target]+idcols] + else: + ### in this case target is a list and hence can be added to idcols + features = [x for x in list(train) if x not in target+idcols] + + ### first find all the types of columns in your data set #### + cats, int_cats, ints, floats, nlps = EDA_classify_and_return_cols_by_type(train[features], + nlp_char_limit) + + numeric_features = ints + floats + categoricals_features = copy.deepcopy(cats) + nlp_features = copy.deepcopy(nlps) + + test_labeler['categoricals_features'] = categoricals_features + test_labeler['numeric_features'] = numeric_features + test_labeler['nlp_features'] = nlp_features + + return cats, int_cats, ints, floats, nlps +############################################################################################# +def left_subtract(l1,l2): + lst = [] + for i in l1: + if i not in l2: + lst.append(i) + return lst +############################################################################################# +def find_number_bins(series): + """ + Input must be a pandas series. Otherwise it will blow up. Be careful! + Returns the recommended number of bins for any Series in pandas + Input must be a float or integer column. Don't send in alphabetical series! 
+ """ + return int(np.log2(series.nunique())+1) +######################################################################################### +import re +def find_words_in_list(words, in_list): + result = [] + for each_word in words: + for in_src in in_list: + if re.findall(each_word, in_src): + result.append(in_src) + return list(set(result)) + +############################################################################################# +from collections import defaultdict +def find_latitude_longitude_columns_in_df(df, verbose=0): + matched_pairs = [] + lats, lat_keywords = find_latitude_columns(df, verbose) + lons, lon_keywords = find_longitude_columns(df, verbose) + if len(lats) > 0 and len(lons) > 0: + if len(lats) > 1: + for each_lat in lats: + for each_lon in lons: + if lat_keywords[each_lat] == lon_keywords[each_lon]: + matched_pairs.append((each_lat, each_lon)) + elif len(lats) == 1 and len(lons) == 1: + each_lat = lats[0] + each_lon = lons[0] + matched_pairs.append((each_lat, each_lon)) + if len(matched_pairs) > 0 and verbose: + print('Matched pairs of latitudes and longitudes: %s' %matched_pairs) + return lats, lons, matched_pairs + +def find_latitude_columns(df, verbose=0): + columns = df.select_dtypes(include='float').columns.tolist() + df.select_dtypes(include='object').columns.tolist() + lat_words = find_words_in_list(['Lat','lat','LAT','Latitude','latitude','LATITUDE'], columns) + sel_columns = lat_words[:] + lat_keywords = defaultdict(str) + if len(columns) > 0: + for lat_word in columns: + lati_keyword = find_latitude_keyword(lat_word, columns, sel_columns) + if not lati_keyword == '': + if lati_keyword == lat_word: + lat_keywords[lat_word] = lati_keyword + else: + lat_keywords[lat_word] = lat_word.replace(lati_keyword,'') + ###### This is where we find whether they are truly latitudes ############ + print(' possible latitude columns in dataset: %s' %sel_columns) + sel_columns_copy = copy.deepcopy(sel_columns) + for sel_col in sel_columns_copy: + if not lat_keywords[sel_col]: + sel_columns.remove(sel_col) + #### If there are any more columns left, then do further analysis ####### + if len(sel_columns) > 0: + sel_cols_float = df[sel_columns].select_dtypes(include='float').columns.tolist() + if len(sel_cols_float) > 0: + for sel_column in sel_cols_float: + if df[sel_column].isnull().sum() > 0: + print('Null values in possible latitude column %s. 
Removing it' %sel_column) + sel_columns.remove(sel_column) + continue + if df[sel_column].min() >= -90 and df[sel_column].max() <= 90: + if verbose: + print(' %s found as latitude column' %sel_column) + if sel_column not in sel_columns: + sel_columns.append(sel_column) + lati_keyword = find_latitude_keyword(sel_column, columns, sel_columns) + if not lati_keyword == '': + lat_keywords[lat_word] = sel_column.replace(lati_keyword,'') + else: + sel_columns.remove(sel_column) + sel_cols_string = df[sel_columns].select_dtypes(include='object').columns.tolist() + if len(sel_cols_string) > 0: + for sel_column in sel_cols_string: + if len(df[df[sel_column].str.endswith(('N','S'))]) > 0: + if verbose: + print(' %s found as latitude column' %sel_column) + if sel_column not in sel_columns: + sel_columns.append(sel_column) + lati_keyword = find_latitude_keyword(sel_column, columns, sel_columns) + if not lati_keyword == '': + lat_keywords[lat_word] = sel_column.replace(lati_keyword,'') + else: + sel_columns.remove(sel_column) + ### after everything what is left is shown here #### + if len(sel_columns) == 0: + print(' after further analysis, no latitude columns found') + else: + print(' after further analysis, selected latitude columns = %s' %sel_columns) + return sel_columns, lat_keywords + +def find_latitude_keyword(lat_word, columns, sel_columns=[]): + lat_keywords = defaultdict(str) + #### This is where we find the text that is present in column related to latitude ## + if len(columns) > 0: + if lat_word.lower() == 'lat': + if lat_word not in sel_columns: + sel_columns.append(lat_word) + lat_keywords[lat_word] = 'lat' + elif lat_word.lower() == 'latitude': + if lat_word not in sel_columns: + sel_columns.append(lat_word) + lat_keywords[lat_word] = 'latitude' + elif 'latitude' in lat_word.lower().split(" "): + if lat_word not in sel_columns: + sel_columns.append(lat_word) + lat_keywords[lat_word] = 'latitude' + elif 'latitude' in lat_word.lower().split("_"): + if lat_word not in sel_columns: + sel_columns.append(lat_word) + lat_keywords[lat_word] = 'latitude' + elif 'latitude' in lat_word.lower().split("-"): + if lat_word not in sel_columns: + sel_columns.append(lat_word) + lat_keywords[lat_word] = 'latitude' + elif 'latitude' in lat_word.lower().split("/"): + if lat_word not in sel_columns: + sel_columns.append(lat_word) + lat_keywords[lat_word] = 'latitude' + elif 'latitude' in lat_word.lower(): + if lat_word not in sel_columns: + sel_columns.append(lat_word) + lat_keywords[lat_word] = 'latitude' + elif 'lat' in lat_word.lower().split(" "): + if lat_word not in sel_columns: + sel_columns.append(lat_word) + lat_keywords[lat_word] = 'lat' + elif 'lat' in lat_word.lower().split("_"): + if lat_word not in sel_columns: + sel_columns.append(lat_word) + lat_keywords[lat_word] = 'lat' + elif 'lat' in lat_word.lower().split("-"): + if lat_word not in sel_columns: + sel_columns.append(lat_word) + lat_keywords[lat_word] = 'lat' + elif 'lat' in lat_word.lower().split("/"): + if lat_word not in sel_columns: + sel_columns.append(lat_word) + lat_keywords[lat_word] = 'lat' + elif 'lat' in lat_word.lower(): + if lat_word not in sel_columns: + sel_columns.append(lat_word) + lat_keywords[lat_word] = 'lat' + return lat_keywords[lat_word] + +def find_longitude_keyword(lon_word, columns, sel_columns=[]): + lon_keywords = defaultdict(str) + #### This is where we find the text that is present in column related to longitude ## + if len(columns) > 0: + if lon_word.lower() == 'lon': + if lon_word not in sel_columns: + 
sel_columns.append(lon_word) + lon_keywords[lon_word] = 'lon' + elif lon_word.lower() == 'longitude': + if lon_word not in sel_columns: + sel_columns.append(lon_word) + lon_keywords[lon_word] = 'longitude' + elif 'longitude' in lon_word.lower().split(" "): + if lon_word not in sel_columns: + sel_columns.append(lon_word) + lon_keywords[lon_word] = 'longitude' + elif 'longitude' in lon_word.lower().split("_"): + if lon_word not in sel_columns: + sel_columns.append(lon_word) + lon_keywords[lon_word] = 'longitude' + elif 'longitude' in lon_word.lower().split("-"): + if lon_word not in sel_columns: + sel_columns.append(lon_word) + lon_keywords[lon_word] = 'longitude' + elif 'longitude' in lon_word.lower().split("/"): + if lon_word not in sel_columns: + sel_columns.append(lon_word) + lon_keywords[lon_word] = 'longitude' + elif 'longitude' in lon_word.lower(): + if lon_word not in sel_columns: + sel_columns.append(lon_word) + lon_keywords[lon_word] = 'longitude' + elif 'lon' in lon_word.lower().split(" "): + if lon_word not in sel_columns: + sel_columns.append(lon_word) + lon_keywords[lon_word] = 'lon' + elif 'lon' in lon_word.lower().split("_"): + if lon_word not in sel_columns: + sel_columns.append(lon_word) + lon_keywords[lon_word] = 'lon' + elif 'lon' in lon_word.lower().split("-"): + if lon_word not in sel_columns: + sel_columns.append(lon_word) + lon_keywords[lon_word] = 'lon' + elif 'lon' in lon_word.lower().split("/"): + if lon_word not in sel_columns: + sel_columns.append(lon_word) + lon_keywords[lon_word] = 'lon' + elif 'lon' in lon_word.lower(): + if lon_word not in sel_columns: + sel_columns.append(lon_word) + lon_keywords[lon_word] = 'lon' + return lon_keywords[lon_word] + +def find_longitude_columns(df, verbose=0): + columns = df.select_dtypes(include='float').columns.tolist() + df.select_dtypes(include='object').columns.tolist() + lon_words = find_words_in_list(['Lon','lon','LON','Longitude','Longitude', "LONGITUDE"], columns) + sel_columns = lon_words[:] + lon_keywords = defaultdict(str) + if len(columns) > 0: + for lon_word in columns: + long_keyword = find_longitude_keyword(lon_word, columns, sel_columns) + if not long_keyword == '': + if lon_word == long_keyword: + lon_keywords[lon_word] = long_keyword + else: + lon_keywords[lon_word] = lon_word.replace(long_keyword,'') + ##### This is where we test whether they are indeed longitude columns #### + print(' possible longitude columns in dataset: %s' %sel_columns) + sel_columns_copy = copy.deepcopy(sel_columns) + for sel_col in sel_columns_copy: + if not lon_keywords[sel_col]: + sel_columns.remove(sel_col) + ###### This is where we find whether they are truly longitudes ############ + if len(sel_columns) > 0: + sel_cols_float = df[sel_columns].select_dtypes(include='float').columns.tolist() + if len(sel_cols_float) > 0: + for sel_column in sel_cols_float: + if df[sel_column].isnull().sum() > 0: + print('Null values in possible longitude column %s. 
Removing it' %sel_column) + sel_columns.remove(sel_column) + continue + if df[sel_column].min() >= -180 and df[sel_column].max() <= 180: + if verbose: + print(' %s found as longitude column' %sel_column) + if sel_column not in sel_columns: + sel_columns.append(sel_column) + long_keyword = find_longitude_keyword(sel_column, columns, sel_columns) + if not long_keyword == '': + lon_keywords[lon_word] = sel_column.replace(long_keyword,'') + else: + sel_columns.remove(sel_column) + sel_cols_string = df[sel_columns].select_dtypes(include='object').columns.tolist() + if len(sel_cols_string) > 0: + for sel_column in sel_cols_string: + if len(df[df[sel_column].str.endswith(('N','S'))]) > 0: + if verbose: + print(' %s found as longitude column' %sel_column) + if sel_column not in sel_columns: + sel_columns.append(sel_column) + long_keyword = find_longitude_keyword(sel_column, columns, sel_columns) + if not long_keyword == '': + lon_keywords[lon_word] = sel_column.replace(long_keyword,'') + else: + sel_columns.remove(sel_column) + ### this is where we finally can select columns ## + if len(sel_columns) == 0: + print(' after further analysis, no longitude columns found') + else: + print(' after further analysis, selected longitude columns = %s' %sel_columns) + return sel_columns, lon_keywords +########################################################################################### +from collections import defaultdict +def nested_dictionary(): + return defaultdict(nested_dictionary) +############################################################################################ +def classify_dtypes_using_TF2(data_sample, preds, idcols, verbose=0): + """ + #### This works only on train data sets since they have both features and labels. ################ + #### It also works in only certain types of tf.data.datasets since every TF dataset is different format. + If you send in a batch of Ttf.data.dataset with the name of target variable(s), you will get back + all the features classified by type such as cats, ints, floats and nlps. This is all done using TF2. + """ + print_features = False + nlps = [] + nlp_char_limit = 50 + all_ints = [] + floats = [] + cats = [] + int_vocab = 0 + feats_max_min = nested_dictionary() + #### Take(1) always displays only one batch only if num_epochs is set to 1 or a number. Otherwise No print! ######## + #### If you execute the below code without take, then it will go into an infinite loop if num_epochs was set to None. + if data_sample.element_spec[0][preds[0]].shape[0] is None: + for feature_batch, label_batch in data_sample.take(1): + if verbose >= 1: + print(f"{target}: {label_batch[:4]}") + if len(feature_batch.keys()) <= 30: + print_features = True + if verbose >= 1: + print("features and their max, min, datatypes in one batch of size: ",batch_size) + for key, value in feature_batch.items(): + feats_max_min[key]["dtype"] = data_sample.element_spec[0][key].dtype + if feats_max_min[key]['dtype'] in [tf.float16, tf.float32, tf.float64]: + ## no need to find vocab of floating point variables! + floats.append(key) + elif feats_max_min[key]['dtype'] in [tf.int16, tf.int32, tf.int64]: + ### if it is an integer var, it is worth finding their vocab! 
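+                    ### the vocabulary size recorded below decides later whether an integer column is
+                    ### treated as a categorical (30 or fewer unique values) or as a numeric feature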
+                    all_ints.append(key)
+                    int_vocab = tf.unique(value)[0].numpy().tolist()
+                    feats_max_min[key]['size_of_vocab'] = len(int_vocab)
+                elif feats_max_min[key]['dtype'] in [tf.string]:
+                    ### string tensors cannot be filled here: missing strings are handled upstream
+                    ### (column defaults for files, pandas fillna for dataframes)
+                    if tf.reduce_max(tf.strings.length(feature_batch[key])).numpy() >= nlp_char_limit:
+                        print('%s is detected and will be treated as an NLP variable' %key)
+                        nlps.append(key)
+                    else:
+                        cats.append(key)
+            if not print_features:
+                print('Number of variables in dataset is too numerous to print...skipping print')
+
+    ints = [ x for x in all_ints if feats_max_min[x]['size_of_vocab'] > 30 and x not in idcols]
+
+    int_cats = [ x for x in all_ints if feats_max_min[x]['size_of_vocab'] <= 30 and x not in idcols]
+
+    return cats, int_cats, ints, floats, nlps
+############################################################################################
+# Define feature columns (including feature-engineered ones).
+# These are the features which come from the TF Data pipeline.
+def create_feature_cols(data_batches, preds):
+    # Keras format features
+    keras_dict_input = {}
+    if data_batches.element_spec[0][preds[0]].shape[0] is None:
+        print("Creating keras features dictionary...")
+        for feature_batch, label_batch in data_batches.take(1):
+            for key, _ in feature_batch.items():
+                k_month = tf.keras.Input(name=key, shape=(1,), dtype=tf.string)
+                keras_dict_input[key] = k_month
+        print('    completed.')
+    return {'K': keras_dict_input}
+##############################################################################################
+# Removes duplicates from a list to return unique values - used only once
+def find_remove_duplicates(values):
+    output = []
+    seen = set()
+    for value in values:
+        if value not in seen:
+            output.append(value)
+            seen.add(value)
+    return output
+#################################################################################
+from collections import defaultdict
+import copy
+def fast_classify_features(df):
+    """
+    This is a very fast way to get a handle on what a dataset looks like. Just send in df and get a print.
+    It prints how many features of each type are in the dataframe and returns a string-column dict,
+    an integer-column dict and a feature-cross flag.
+    """
+    num_list = df.select_dtypes(include='integer').columns.tolist()
+    float_list = df.select_dtypes(include='float').columns.tolist()
+    str_list = left_subtract(df.columns.tolist(), num_list+float_list)
+    all_list = [str_list, num_list]
+    str_dict = defaultdict(dict)
+    int_dict = defaultdict(dict)
+    for inum, dicti in enumerate([str_dict, int_dict]):
+        bincols = []
+        catcols = []
+        highcols = []
+        numcols = []
+        for col in all_list[inum]:
+            leng = len(df[col].value_counts())
+            if leng <= 2:
+                bincols.append(col)
+            elif leng > 2 and leng <= 15:
+                catcols.append(col)
+            elif leng > 15 and leng < 100:
+                highcols.append(col)
+            else:
+                numcols.append(col)
+        dicti['bincols'] = bincols
+        dicti['catcols'] = catcols
+        dicti['highcats'] = highcols
+        dicti['veryhighcats'] = numcols
+        if inum == 0:
+            str_dict = copy.deepcopy(dicti)
+            print('Distribution of string columns in dataset:')
+            print('    number of binary = %d, cats = %d, high cats = %d, very high cats = %d' %(
+                len(bincols), len(catcols), len(highcols), len(numcols)))
+        else:
+            print('Distribution of integer columns in dataset:')
+            int_dict = copy.deepcopy(dicti)
+            print('    number of binary = %d, cats = %d, high cats = %d, very high cats = %d' %(
+                len(bincols), len(catcols), len(highcols), len(numcols)))
+    ### Check if worth doing cat_feature_cross_flag on this dataset ###
+    int_dict['floats'] = float_list
+    print('Distribution of floats:')
+    print('    number float variables = %d' %len(float_list))
+    cat_feature_cross_flag = []
+    if len(str_dict['bincols']+str_dict['catcols']) > 2 and len(str_dict['bincols']+str_dict['catcols']) <= 10:
+        cat_feature_cross_flag.append("cat")
+    if len(int_dict['bincols']+int_dict['catcols']) > 2 and len(int_dict['bincols']+int_dict['catcols']) <= 10:
+        cat_feature_cross_flag.append("num")
+    ######## This is where we advise and act on whether to do feature crosses or not #########
+    #### the combined case is checked first so that it is reachable ####
+    if cat_feature_cross_flag:
+        print('Data Transformation Advisory:')
+        if "cat" in cat_feature_cross_flag and "num" in cat_feature_cross_flag:
+            cat_feature_cross_flag = "both"
+            print('    performing both integer and cat feature crosses: changed cat_feat_cross_flag to "both" ')
+        elif "cat" in cat_feature_cross_flag:
+            cat_feature_cross_flag = "cat"
+            print('    performing categorical feature crosses: changed cat_feat_cross_flag to "cat"')
+        elif "num" in cat_feature_cross_flag:
+            cat_feature_cross_flag = "num"
+            print('    performing integer feature crosses: changed cat_feat_cross_flag to "num" ')
+    else:
+        print('No data transformations needed in this dataset')
+        cat_feature_cross_flag = ""
+    if len(int_dict['veryhighcats']) > 0:
+        print('    we transformed %s from integer to float' %int_dict['veryhighcats'])
+    return str_dict, int_dict, cat_feature_cross_flag
+###################################################################################################
diff --git a/build/lib/deep_autoviml/data_load/extract.py b/build/lib/deep_autoviml/data_load/extract.py
new file mode 100644
index 0000000..4fe5adc
--- /dev/null
+++ b/build/lib/deep_autoviml/data_load/extract.py
@@ -0,0 +1,1326 @@
+#Copyright 2021 Google LLC
+
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +# -*- coding: utf-8 -*- +################################################################################ +# deep_auto_viml - build and test multiple Tensorflow 2.0 models and pipelines +# Python v3.6+ tensorflow v2.4.1+ +# Created by Ram Seshadri +# Licensed under Apache License v2 +################################################################################ +# data pipelines and feature engg here + +# pre-defined TF2 Keras models and your own models here + +# Utils +from .classify_features import classify_features_using_pandas +from .classify_features import check_model_options, fast_classify_features +# Utils +from deep_autoviml.utilities.utilities import print_one_row_from_tf_dataset, print_one_row_from_tf_label +from deep_autoviml.utilities.utilities import My_LabelEncoder, print_one_image_from_dataset +from deep_autoviml.utilities.utilities import print_one_text_from_dataset +from deep_autoviml.utilities.utilities import find_columns_with_infinity, drop_rows_with_infinity +############################################################################################ +import pandas as pd +import numpy as np +pd.set_option('display.max_columns',500) +from sklearn.model_selection import train_test_split +import matplotlib.pyplot as plt +import tempfile +import pdb +import copy +import warnings +warnings.filterwarnings(action='ignore') +import functools +# Make numpy values easier to read. +np.set_printoptions(precision=3, suppress=True) +# TensorFlow ≥2.4 is required +import tensorflow as tf +np.random.seed(42) +tf.random.set_seed(42) +from tensorflow.keras import layers +from tensorflow import keras +############################################################################################ +#### probably the most handy function of all! 
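+#### left_subtract(l1, l2) returns the items of l1 that are not in l2, preserving their original order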
+def left_subtract(l1,l2): + lst = [] + for i in l1: + if i not in l2: + lst.append(i) + return lst +import re +def find_words_in_list(words, in_list): + result = [] + for each_word in words: + for in_src in in_list: + if re.findall(each_word, in_src): + result.append(in_src) + return list(set(result)) + +############################################################################################## +def find_problem_type(train, target, model_options={}, verbose=0) : + """ + ############################################################################ + ##### Now find the problem type of this train dataset using its target variable + ############################################################################ + """ + target = copy.deepcopy(target) + ### this determines the number of categories to name integers as classification ## + ### if a variable has more than this limit, it will not be treated like a cat variable # + cat_limit = check_model_options(model_options, "variable_cat_limit", 30) + float_limit = 15 ### this limits the number of float variable categories for it to become cat var + model_label = 'Single_Label' + model_class = 'Classification' + if isinstance(target, str): + if target == '': + model_class ='Clustering' + model_label = 'Single_Label' + return model_class, model_label, target + targ = copy.deepcopy(target) + target = [target] + elif isinstance(target, list): + if len(target) == 1: + targ = target[0] + else: + targ = target[0] + model_label = 'Multi_Label' + else: + print('target is Not detected. Default chosen is %s, %s' %(model_class, model_label)) + #### This is where you detect what kind of problem it is ################# + + if train[targ].dtype in ['int64', 'int32','int16']: + if len(train[targ].unique()) <= 2: + model_class = 'Classification' + elif len(train[targ].unique()) > 2 and len(train[targ].unique()) <= cat_limit: + model_class = 'Multi_Classification' + else: + model_class = 'Regression' + elif train[targ].dtype in ['float']: + if len(train[targ].unique()) <= 2: + model_class = 'Classification' + elif len(train[targ].unique()) > 2 and len(train[targ].unique()) <= float_limit: + model_class = 'Multi_Classification' + else: + model_class = 'Regression' + else: + if len(train[targ].unique()) <= 2: + model_class = 'Classification' + else: + model_class = 'Multi_Classification' + ########### print this for the start of next step ########### + print(' Model type is %s and %s problem' %(model_class,model_label)) + return model_class, model_label, target +###################################################################################### + +def transform_train_target(train_target, target, modeltype, model_label, cat_vocab_dict): + train_target = copy.deepcopy(train_target) + cat_vocab_dict = copy.deepcopy(cat_vocab_dict) + ### Just have to change the target from string to Numeric in entire dataframe! ### + + if modeltype != 'Regression': + if model_label == 'Multi_Label': + target_copy = copy.deepcopy(target) + print('Train target shape = %s' %(train_target.shape,)) + #### This is for multi-label problems ##### + cat_vocab_dict['target_le'] = [] + for each_target in target_copy: + cat_vocab_dict[each_target+'_original_classes'] = np.unique(train_target[target]) + target_le = My_LabelEncoder() + print('Transforming %s target labels...' 
%each_target) + print(' Original target labels data type is %s ' %train_target[each_target].dtype) + train_values = target_le.fit_transform(train_target[each_target]) + cat_vocab_dict[each_target+'_transformed_classes'] = np.unique(train_values) + train_target[each_target] = train_values + cat_vocab_dict['target_le'].append(target_le) + print('%s transformed as follows: %s' %(each_target, target_le.transformer)) + print(' transformed target labels data type to numeric or ordered from 0') + else: + #### This is for Single Label problems #### + cat_vocab_dict['original_classes'] = np.unique(train_target[target]) + target_le = My_LabelEncoder() + print('Transforming %s target labels...' %target) + print(' Original labels dtype is %s ' %train_target[target].dtype) + train_values = target_le.fit_transform(train_target[target]) + cat_vocab_dict['transformed_classes'] = np.unique(train_values) + train_target[target] = train_values + cat_vocab_dict['target_le'] = target_le + print('%s transformed as follows: %s' %(target, target_le.transformer)) + print(' transformed target labels data type to numeric or ordered from 0') + else: + target_le = "" + cat_vocab_dict['target_le'] = target_le + print('No Target transformation needed since target dtype is numeric') + train_target = train_target[target] + return train_target, cat_vocab_dict + +def split_combined_ds_into_two(x, usecols, preds): + """ + This is useful for splitting a single dataset which has both features and labels into two. + usecols is basically target column in the form of a list: [target] + preds is basically predictor columns in the form of a list: a list of predictors + """ + labels = {k: x[k] for k in x if k in usecols} + features = {k: x[k] for k in x if k in preds} + return (features, labels) +###################################################################################################### +import pathlib +import os +import random +def load_train_data_file(train_datafile, target, keras_options, model_options, verbose=0): + """ + This handy function loads a file from a local or remote machine provided the filename and path are given. + It loads the file(s) into a Tensorflow Dataset using the make_csv_dataset function from Tensorflow 2.0 + """ + train_datafile = copy.deepcopy(train_datafile) + http_url = False + if find_words_in_list(['http'], [train_datafile]): + print('http urls file: will be loaded into pandas and then into tensorflow datasets') + http_url = True + try: + DS_LEN = model_options['DS_LEN'] + except: + ### Choose a default option in case it is not given + DS_LEN = 100000 + shuffle_flag = False + ################################################################################# + try: + compression = None + ### see if there is a . in the file name. If it is, then do this process. + split_str = train_datafile.split(".")[-1] + if split_str=='csv': + print("CSV file being loaded into tf.data.Dataset") + compression_type = None + elif split_str=='zip' : + print("Zip file being loaded into tf.data.Dataset") + compression_type="GZIP" ### don't change this. It is correct. + compression = "zip" ### don't change this. It is correct. + print(' Using %s compression_type in make_csv_dataset argument' %compression_type) + elif split_str=='gz': + print("Zip file being loaded into tf.data.Dataset") + compression_type="GZIP" + compression = "gzip" + print(' Using %s compression_type in make_csv_dataset argument' %compression_type) + else: + compression_type = None + except: + #### if . 
is not there, it means it is a folder and we need to collect all files in that folder + font_csvs = sorted(str(p) for p in pathlib.Path(train_datafile).glob("*.csv")) + print('Printing the first 5 files in the %s folder:\n%s' %(train_datafile,font_csv[:5])) + train_datafile_list = pathlib.Path(train_datafile).glob("*.csv") + print(' collecting files matching this file pattern in directory: %s' %train_datafile_list) + try: + list_files = [] + filetype = train_datafile.split(".")[-1] + list_files = [x for x in os.listdir(inpath) if x.endswith(filetype)] + if list_files == []: + print('No csv, tsv or Excel files found in the given directory') + return + else: + print('%d files found in directory matching pattern: %s' %(len(list_files), train_datafile)) + ### now you must use this file_pattern in make_csv_dataset argument + train_datafile = list_files[0] + except: + print('not able to collect files matching given pattern = %s' %train_datafile) + return + ################################################################################# + model_options['compression'] = compression + #### About 25% of the data or 10,000 rows which ever is higher is loaded ####### + if http_url: + maxrows = 100000 ### set it very high so that all rows are read into dataframe ### + else: + maxrows = min(100000, int(0.25*DS_LEN)) + print('Max rows loaded to classify features = %s' %maxrows) + ### first load a small sample of the dataframe and the entire target if it needs transform + try: + modeltype = model_options["modeltype"] + except: + modeltype, model_label, usecols = find_problem_type(train_small, target, model_options, verbose) + model_options['modeltype'] = modeltype + if isinstance(target, str): + targets = [target] + else: + targets = copy.deepcopy(target) + ###### This is where you select a small sample of a file to do classification of variables #### + if compression_type: + ### this reads the entire file and loads it into a dataset if it is a zip file ###### + train_small = pd.read_csv(train_datafile, sep=sep, nrows=maxrows, compression=compression, + header=header, encoding=csv_encoding) + train_small, data_batches, var_df, cat_vocab_dict, keras_options, model_options = load_train_data_frame( + train_small, target, keras_options, model_options, verbose) + ##### This might be useful for users to know whether to use feature-crosses or not ### + return train_small, data_batches, var_df, cat_vocab_dict, keras_options, model_options + else: + ### It reads only a small dataframe if it is a regular CSV file ####### + train_small = select_rows_from_file_or_frame(train_datafile, model_options, targets, maxrows) + ##### Now detect modeltype if it is not given ############### + print(' small sample dataset from train loaded. Shape = %s' %(train_small.shape,)) + #### All column names in Tensorflow should have no spaces ! So you must convert them here! 
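+        #### Spaces, parentheses, slashes, backslashes and question marks are replaced with
+        #### underscores below, and names are lower-cased, so they become valid TF feature names.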
+ sel_preds = ["_".join(x.split(" ")) for x in list(train_small) ] + header = model_options['header'] + if header is None: + sel_preds = ["col_"+str(x) for x in range(train_small.shape[1])] + else: + sel_preds = ["_".join(x.split("(")) for x in sel_preds ] + sel_preds = ["_".join(x.split(")")) for x in sel_preds ] + sel_preds = ["_".join(x.split("/")) for x in sel_preds ] + sel_preds = ["_".join(x.split("\\")) for x in sel_preds ] + sel_preds = ["_".join(x.split("?")) for x in sel_preds ] + sel_preds = [x.lower() for x in sel_preds ] + + if isinstance(target, str): + target = "_".join(target.split(" ")) + target = "_".join(target.split("(")) + target = "_".join(target.split(")")) + target = "_".join(target.split("/")) + target = "_".join(target.split("\\")) + target = "_".join(target.split("?")) + target = target.lower() + model_label = 'Single_Label' + else: + target = ["_".join(x.split(" ")) for x in target ] + target = ["_".join(x.split("(")) for x in target ] + target = ["_".join(x.split(")")) for x in target ] + target = ["_".join(x.split("/")) for x in target ] + target = ["_".join(x.split("\\")) for x in target ] + target = ["_".join(x.split("?")) for x in target ] + target = [x.lower() for x in target ] + model_label = 'Multi_Label' + + train_small.columns = sel_preds + + print('Alert! Modified column names to satisfy rules for column names in Tensorflow...') + + ### modeltype and usecols are very important to know before doing any processing ##### + #### usecols is a very handy tool to handle a target which can be single label or multi-label! + if modeltype == '': + ### usecols is basically target in a list format. Very handy to know when target is a list. + modeltype, _, usecols = find_problem_type(train_small, target, model_options, verbose) + else: + ### if modeltype is given, then do not find the model type using this function + _, _, usecols = find_problem_type(train_small, target, model_options, verbose) + + + ########## Find small details about the data to help create the right model ### + label_encode_flag = model_options["label_encode_flag"] + if isinstance(label_encode_flag, str): + if modeltype == 'Classification' or modeltype == 'Multi_Classification': + if isinstance(target, str): + #### This is for Single-Label problems ######## + if train_small[target].dtype == 'object' or str(train_small[target].dtype).lower() == 'category': + label_encode_flag = True + elif 0 not in np.unique(train_small[target]): + label_encode_flag = False + if verbose: + print(' label encoding must be done since there is no zero class!') + target_vocab = train_small[target].unique() + num_classes = len(target_vocab) + elif isinstance(target, list): + #### This is for Multi-Label problems ######## + num_classes = [] + for each_target in target: + if train_small[each_target].dtype == 'object' or str(train_small[target[0]].dtype).lower() == 'category': + label_encode_flag = True + elif 0 not in np.unique(train_small[each_target]): + label_encode_flag = False + if verbose: + print(' label encoding must be done since there is no zero class!') + target_vocab = train_small[each_target].unique().tolist() + num_classes.append(len(target_vocab)) + else: + num_classes = 1 + target_vocab = [] + label_encode_flag = False + else: + if isinstance(target, str): + target_vocab = train_small[target].unique() + num_classes = len(target_vocab) + else: + for each_target in copy_target: + target_vocab = train_small[target].unique().tolist() + num_classes_each = len(target_vocab) + 
num_classes.append(int(num_classes_each)) + + #### This is where we set the model_options for num_classes and num_labels ######### + model_options['num_classes'] = num_classes + + ############# Sample Data classifying features into variaous types ################## + print('Loaded a small data sample of size = %s into pandas dataframe to analyze...' %(train_small.shape,)) + ### classify variables using the small dataframe ## + print(' Classifying variables using data sample in pandas...') + train_small, var_df1, cat_vocab_dict = classify_features_using_pandas(train_small, target, model_options, verbose=verbose) + + ########## Just transfer all the values from var_df to cat_vocab_dict ################################## + for each_key in var_df1: + cat_vocab_dict[each_key] = var_df1[each_key] + ############################################################################################################ + + model_options['modeltype'] = modeltype + model_options['model_label'] = model_label + cat_vocab_dict['modeltype'] = modeltype + cat_vocab_dict['target_variables'] = usecols + cat_vocab_dict['num_classes'] = num_classes + cat_vocab_dict["target_transformed"] = label_encode_flag + + # Construct a lookup table to map string chars to indexes, + + # using the vocab loaded above: + if label_encode_flag: + #### Sometimes, using tf.int64 works. Hence this is needed. + table = tf.lookup.StaticHashTable( + tf.lookup.KeyValueTensorInitializer( + keys=target_vocab, values=tf.constant(list(range(len(target_vocab))), + dtype=tf.int64)), + default_value=int(len(target_vocab)+1)) + + #### Set column defaults while reading dataset from CSV files - that way, missing values avoided! + ### The following are valid CSV dtypes for missing values: float32, float64, int32, int64, or string + ### fill all missing values in categorical variables with "None" + ### Similarly. fill all missing values in float variables with -99 + if train_small.isnull().sum().sum() > 0: + print(' %d missing values in dataset: filling them with default values...' %( + train_small.isnull().sum().sum())) + string_cols = train_small.select_dtypes(include='object').columns.tolist() + train_small.select_dtypes( + include='category').columns.tolist() + integer_cols = train_small.select_dtypes(include='integer').columns.tolist() + float_cols = train_small.select_dtypes(include='float').columns.tolist() + bool_cols = train_small.select_dtypes(include='bool').columns.tolist() + ### Bool_columns become string after you set their defaults since missing is default ## + column_defaults = [-99.0 if x in float_cols else -99 if x in integer_cols else "missing" + for x in list(train_small)] + ### So we need to put back bool columns as boolean right after we load them into data_batches + ####### Make sure you don't move this next stage. It should be after column defaults! ### + if label_encode_flag: + trans_output, cat_vocab_dict = transform_train_target(train_small, target, modeltype, + model_label, cat_vocab_dict) + train_small[target] = trans_output.values + + #### CAUTION: (num_epochs=None) will automatically repeat the data forever! Be Careful with it! + ### setting num_epochs to 1 is always good practice since it ensures that your dataset is readable later + ### If you set num_epochs to None it will throw your dataset batches into infinite loop. Be careful! + #### Also the dataset will display the batch size as 4 (or whatever) if you set num_epochs as None. + #### However, if you set num_epochs=1, then you will see dataset shape as None! 
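+        #### In short: num_epochs=None makes the dataset repeat forever, while num_epochs=1 reads it exactly once.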
+ #### Also num_epochs=1 need to do repeat() on the dataset to loop it forever. + num_epochs = 1 + + ########### find the number of labels in data #### + if isinstance(target, str): + num_labels = 1 + elif isinstance(target, list): + if len(target) == 1: + num_labels = 1 + else: + num_labels = len(target) + cat_vocab_dict['num_labels'] = num_labels + model_options['num_labels'] = num_labels + + ### Initially set this batch_size low so that you can do better model training with small batches ### + #### It is a good idea to shuffle and batch it with small batch size like 4 immediately ### + if http_url: + ### Once a file is in gzip format, you have to load it into pandas and then find file size and batch + cat_vocab_dict["DS_LEN"] = train_small.shape[0] + model_options['DS_LEN'] = train_small.shape[0] + DS_LEN = train_small.shape[0] + try: + keras_options["batchsize"] = batch_size + if isinstance(keras_options["batchsize"], str): + batch_size = find_batch_size(DS_LEN) + cat_vocab_dict['batch_size'] = batch_size + except: + batch_size = find_batch_size(DS_LEN) + keras_options["batchsize"] = batch_size + cat_vocab_dict['batch_size'] = batch_size + ###### Do this for selecting what columns to load into TF.Data ####### + #### This means it is not a test dataset - hence it has target columns - load it too! + if isinstance(target, str): + if target == '': + target_name = None + else: + target_name = copy.deepcopy(target) + preds = [x for x in list(train_small) if x not in [target]] + elif isinstance(target, list): + #### then it is a multi-label problem + target_name = None + preds = left_subtract(list(train_small), target) + else: + print('Error: Target %s type not understood' %type(target)) + return + + ############################################################################################ + ########### C H E C K F O R BOOL and I N F I N I T E V A L U E S H E R E ########### + ############################################################################################ + cols_with_infinity = find_columns_with_infinity(train_small) + @tf.function + def convert_boolean_to_string(features, target): + """ + This is how you convert all your boolean features into float variables. + The reason you have to do this is because tf.keras does not know how to handle boolean types. + It takes as input an ordered dict named features and returns the same in features format. + """ + for feature_name in features: + if feature_name in bool_cols: + # Cast boolean feature values to string. 
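+                # casting to string (rather than float) lets these columns flow through the same
+                # categorical string preprocessing path as other small-vocabulary string features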
+ #features[feature_name] = tf.cast(features[feature_name], tf.dtypes.float32) + features[feature_name] = tf.dtypes.cast(features[feature_name], tf.string) + return (features, target) + + ################ T F D A T A D A T A S E T L O A D I N G H E R E ################ + ############ Create a Tensorflow Dataset using the make_csv function ##################### + if http_url: + print('Since input is http URL file we load it into pandas and then tf.data.Dataset...') + ### Now load the URL file loaded into pandas into a tf.data.dataset ############# + if isinstance(target, str): + if target != '': + labels = train_small.pop(target) + data_batches = tf.data.Dataset.from_tensor_slices((dict(train_small), labels)) + else: + print('target variable is blank - please fix input and try again') + return + elif isinstance(target, list): + ##### For multi-label problems, you need to use dict of labels as well ### + labels = train_small.pop(target) + data_batches = tf.data.Dataset.from_tensor_slices((dict(train_small), dict(labels))) + else: + data_batches = tf.data.Dataset.from_tensor_slices(dict(train_small)) + ### batch it if you are creating it from a dataframe + data_batches = data_batches.batch(batch_size, drop_remainder=True) + else: + print('Loading your input file(s) data directly into tf.data.Dataset...') + data_batches = tf.data.experimental.make_csv_dataset(train_datafile, + batch_size=batch_size, + column_names=sel_preds, + label_name=target_name, + num_epochs = num_epochs, + column_defaults=column_defaults, + compression_type=compression_type, + shuffle=shuffle_flag, + num_parallel_reads=tf.data.experimental.AUTOTUNE) + ############### Additional post-processing checkes needed - do it here ####### + #### here is where we need to put back boolean columns that were strings back to boolean + if bool_cols: + data_batches = data_batches.map(convert_boolean_to_string) + ### Remove this not after testing the function below ### + if cols_with_infinity: + data_batches = data_batches.map(drop_non_finite_rows) + print(' ALERT! Dropping non-finite values in %d columns: %s ' %( + len(cols_with_infinity), cols_with_infinity)) + + ######## P E R F O R M L A B E L E N C O D I N G H E R E ############ + if label_encode_flag: + print(' target label encoding now...') + data_batches = data_batches.map(lambda x, y: to_ids(x, y, table)) + print(' target label encoding completed.') + print(' train data loaded successfully.') + drop_cols = var_df1['cols_delete'] + preds = [x for x in list(train_small) if x not in usecols+drop_cols] + print('\nNumber of predictors to be used = %s in predict step: keras preprocessing...' %len(preds)) + cat_vocab_dict['columns_deleted'] = drop_cols + if len(drop_cols) > 0: ### drop cols that have been identified for deletion ### + print('Dropping %s columns marked for deletion...' %drop_cols) + train_small.drop(drop_cols,axis=1,inplace=True) + model_options['train_data_is_file'] = True + return train_small, data_batches, var_df1, cat_vocab_dict, keras_options, model_options +############################################################################################ +def drop_non_finite_rows(features, targets): + cols = [] + for key, col in features.items(): + cols.append(col) + # stack the columns to build a matrix + cols = tf.stack(cols, axis=-1) + # The good rows are the ones where all the elements are finite + good = tf.reduce_all(tf.math.is_finite(cols), axis=-1) + + # Apply the boolean mask to each column and return it as a dict. 
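+    # tf.boolean_mask keeps, for every feature column, only the rows that were flagged as finite above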
+ result = {} + for name, value in features.items(): + result[name] = tf.boolean_mask(value,good) + return result, targets +############################################################################################ +def to_ids(features, labels, table): + if labels.dtype==np.int32: labels = tf.cast(labels, tf.int64) + #labels = tf.cast(labels, tf.int64) ## this should not have been used ## + labels = table.lookup(labels) + return (features, labels) +############################################################################################# +def lenopenreadlines(filename): + with open(filename) as f: + return len(f.readlines()) +######################################################################################### +def closest(lst, K): + """ + Find a number in list lst that is closest to the value K. + """ + return lst[min(range(len(lst)), key = lambda i: abs(lst[i]-K))] +########################################################################################## +def find_batch_size(DS_LEN): + ### Since you cannot deal with a very large dataset in pandas, let's look into how big the file is + maxrows = 10000 + if DS_LEN < 100: + batch_ratio = 0.16 + elif DS_LEN >= 100 and DS_LEN < 1000: + batch_ratio = 0.05 + elif DS_LEN >= 1000 and DS_LEN < 10000: + batch_ratio = 0.01 + elif DS_LEN >= maxrows and DS_LEN <= 100000: + batch_ratio = 0.001 + else: + batch_ratio = 0.0001 + batch_len = int(batch_ratio*DS_LEN) + #print(' Batch size selected as %d' %batch_len) + lst = [32, 48, 64, 96, 128, 256] + batch_len = closest(lst, batch_len) + return batch_len +######################################################################################### +def fill_missing_values_for_TF2(train_small, var_df): + """ + ######################################################################################## + ### As of now (TF 2.4.1) we still cannot load pd.dataframe with nulls in string columns! + ### You must first remove nulls from the objects in dataframe and use that sample + ### to build a normalizer layer. You can use Mean and SD from that sample. + ### Using that sample, you can build the layer for complete dataset + #### in that case the input is a dataframe, you must first remove nulls from it + ######################################################################################## + ### Filling Strategy (this is not Imputation - mind you) + ### 1. Fill all missing values in categorical variables with "None" + ### 2. 
Similarly, fill all missing values in float variables with -99 + ######################################################################################## + """ + train_small = copy.deepcopy(train_small) + bools = var_df['bools'] + cols_delete = var_df['cols_delete'] + cat_cols = var_df['categorical_vars'] + var_df['discrete_string_vars'] + bools + int_bools = var_df['int_bools'] + int_cats = var_df['int_cats'] + ints = var_df['int_vars'] + int_cols = int_cats + ints + int_bools + float_cols = var_df['continuous_vars'] + nlp_cols = var_df['nlp_vars'] + date_vars = var_df['date_vars'] + lats = var_df['lat_vars'] + lons = var_df['lon_vars'] + ffill_cols = lats + lons + date_vars + + if len(cat_cols) > 0: + if train_small[cat_cols].isnull().sum().sum() > 0: + for col in cat_cols: + colcount = "Missing" + train_small[col].fillna(colcount, inplace=True) + + if len(nlp_cols) > 0: + if train_small[nlp_cols].isnull().sum().sum() > 0: + for col in nlp_cols: + colcount = "Missing" + train_small[col].fillna(colcount, inplace=True) + + ints_copy = int_cols + int_cats + if len(ints_copy) > 0: + if train_small[ints_copy].isnull().sum().sum() > 0: + for col in ints_copy: + colcount = 0 + train_small[col].fillna(colcount,inplace=True) + + if len(float_cols) > 0: + if train_small[float_cols].isnull().sum().sum() > 0: + for col in float_cols: + colcount = 0.0 + train_small[col].fillna(colcount,inplace=True) + + ffill_cols = train_small.columns[train_small.isnull().sum()>0] + + if len(ffill_cols) > 0: + ffill_cols_copy = copy.deepcopy(ffill_cols) + if train_small[ffill_cols].isnull().sum().sum() > 0: + for col in ffill_cols: + train_small[col].fillna(method='ffill', inplace=True) + #### Sometimes forward-fill doesn't do it. You need to try back-fill too! + if train_small[ffill_cols].isnull().sum().sum() > 0: + for col in ffill_cols_copy: + train_small[col].fillna(method='bfill', inplace=True) + return train_small +######################################################################################## +def load_train_data_frame(train_dataframe, target, keras_options, model_options, verbose=0): + """ + ### CAUTION: TF2.4 Still cannot load a DataFrame with Nulls in string or categoricals! + ############################################################################ + #### TF 2.4 still cannot load tensor_slices into ds if an object or string column + #### that has nulls in it! So we need to find other ways to load tensor_slices by + #### first filling dataframe with pandas fillna() function! + ############################################################################# + """ + train_dataframe = copy.deepcopy(train_dataframe) + DS_LEN = model_options['DS_LEN'] + print('Max rows loaded to classify features = %s' %DS_LEN) + print(' small sample dataset from train loaded. Shape = %s' %(train_dataframe.shape,)) + #### do this for dataframes ################## + maxrows = 100000 + try: + batch_size = keras_options["batchsize"] + if isinstance(keras_options["batchsize"], str): + batch_size = find_batch_size(DS_LEN) + except: + #### If it is not given find it here #### + batch_size = find_batch_size(DS_LEN) + ######### Modify or Convert column names to fit tensorflow rules of no space in names! 
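+    #### This mirrors the renaming done in load_train_data_file, so column names coming from a
+    #### dataframe and from a file end up identical.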
+ sel_preds = ["_".join(x.split(" ")) for x in list(train_dataframe) ] + #### This can also be a problem with other special characters ### + sel_preds = ["_".join(x.split("(")) for x in sel_preds ] + sel_preds = ["_".join(x.split(")")) for x in sel_preds ] + sel_preds = ["_".join(x.split("/")) for x in sel_preds ] + sel_preds = ["_".join(x.split("\\")) for x in sel_preds ] + sel_preds = ["_".join(x.split("?")) for x in sel_preds ] + sel_preds = [x.lower() for x in sel_preds ] + + if isinstance(target, str): + target = "_".join(target.split(" ")) + target = "_".join(target.split("(")) + target = "_".join(target.split(")")) + target = "_".join(target.split("/")) + target = "_".join(target.split("\\")) + target = "_".join(target.split("?")) + target = target.lower() + model_label = 'Single_Label' + else: + target = ["_".join(x.split(" ")) for x in target ] + target = ["_".join(x.split("(")) for x in target ] + target = ["_".join(x.split(")")) for x in target ] + target = ["_".join(x.split("/")) for x in target ] + target = ["_".join(x.split("\\")) for x in target ] + target = ["_".join(x.split("?")) for x in target ] + target = [x.lower() for x in target ] + model_label = 'Multi_Label' + + train_dataframe.columns = sel_preds + + print('Alert! Modified column names to satisfy rules for column names in Tensorflow...') + + + #### if target is changed you must send that modified target back to other processes ###### + ### usecols is basically target in a list format. Very handy to know when target is a list. + + try: + modeltype = model_options["modeltype"] + if model_options["modeltype"] == '': + ### usecols is basically target in a list format. Very handy to know when target is a list. + modeltype, model_label, usecols = find_problem_type(train_dataframe, target, model_options, verbose) + model_options["modeltype"] = modeltype + else: + if isinstance(target, str): + usecols = [target] + else: + usecols = copy.deepcopy(target) + except: + ### if modeltype is given, then do not find the model type using this function + modeltype, model_label, usecols = find_problem_type(train_dataframe, target, model_options, verbose) + + if isinstance(target, str): + targets = [target] + else: + targets = copy.deepcopy(target) + + ##### This is a simple function to load a small sample of data to do analysis ############ + train_small = select_rows_from_file_or_frame(train_dataframe, model_options, targets, maxrows) + + ### Cat_Vocab_Dict contains all info about vocabulary in each variable and their size + print(' Classifying variables using data sample in pandas...') + train_small, var_df, cat_vocab_dict = classify_features_using_pandas(train_small, target, model_options, verbose=verbose) + + ########## Just transfer all the values from var_df to cat_vocab_dict ################################## + for each_key in var_df: + cat_vocab_dict[each_key] = var_df[each_key] + ############################################################################################################ + model_options['modeltype'] = modeltype + model_options['model_label'] = model_label + cat_vocab_dict['target_variables'] = usecols + cat_vocab_dict['modeltype'] = modeltype + model_options['batch_size'] = batch_size + ########## Find small details about the data to help create the right model ### + target_transformed = model_options["label_encode_flag"] + if isinstance(target_transformed, str): + if modeltype != 'Regression': + if isinstance(target, str): + #### This is for Single Label Problems ###### + if train_small[target].dtype == 
'object' or str(train_small[target].dtype).lower() == 'category': + target_transformed = True + target_vocab = train_small[target].unique() + num_classes = len(target_vocab) + else: + if 0 not in np.unique(train_small[target]): + target_transformed = True ### label encoding must be done since no zero class! + target_vocab = train_small[target].unique() + num_classes = len(train_small[target].value_counts()) + elif isinstance(target, list): + #### This is for Multi-Label Problems ####### + copy_target = copy.deepcopy(target) + num_classes = [] + for each_target in copy_target: + if train_small[target[0]].dtype == 'object' or str(train_small[target[0]].dtype).lower() == 'category': + target_transformed = True + target_vocab = train_small[target].unique().tolist() + num_classes_each = len(target_vocab) + else: + if 0 not in np.unique(train_small[target[0]]): + target_transformed = True ### label encoding must be done since no zero class! + target_vocab = train_small[target[0]].unique() + num_classes_each = train_small[target].apply(np.unique).apply(len).max() + num_classes.append(int(num_classes_each)) + else: + num_classes = 1 + target_vocab = [] + target_transformed = False + else: + if isinstance(target, str): + target_vocab = train_small[target].unique() + num_classes = len(target_vocab) + else: + for each_target in copy_target: + target_vocab = train_small[target].unique().tolist() + num_classes_each = len(target_vocab) + num_classes.append(int(num_classes_each)) + + ########### find the number of labels in data #### + if isinstance(target, str): + num_labels = 1 + elif isinstance(target, list): + if len(target) == 1: + num_labels = 1 + else: + num_labels = len(target) + #### This is where we set the model_options for num_classes and num_labels ######### + model_options['num_labels'] = num_labels + model_options['num_classes'] = num_classes + cat_vocab_dict['num_labels'] = num_labels + cat_vocab_dict['num_classes'] = num_classes + cat_vocab_dict["target_transformed"] = target_transformed + + #### once the dataframe has been classified, you can again change train_small to original dataframe ## + train_small = copy.deepcopy(train_dataframe) + + #### fill missing values using this function ############## + train_small = fill_missing_values_for_TF2(train_small, cat_vocab_dict) + + ##### Do the deletion of cols after filling with missing values since otherwise fill errors! + drop_cols = var_df['cols_delete'] + cat_vocab_dict['columns_deleted'] = drop_cols + if len(drop_cols) > 0: ### drop cols that have been identified for deletion ### + print(' Dropping %s columns marked for deletion...' 
%drop_cols) + train_small.drop(drop_cols,axis=1,inplace=True) + + ######### Now load the train Dataframe into a tf.data.dataset ############# + if target_transformed: + ####################### T R A N S F O R M I N G T A R G E T ######################## + train_small[target], cat_vocab_dict = transform_train_target(train_small, target, modeltype, + model_label, cat_vocab_dict) + + if isinstance(target, str): + #### For single label do this: labels can be without names since there is only one label + if target != '': + labels = train_small[target] + features = train_small.drop(target, axis=1) + ds = tf.data.Dataset.from_tensor_slices((dict(features), labels)) + else: + print('target variable is blank - please fix input and try again') + return + elif isinstance(target, list): + #### For multi label do this: labels must be dict and hence with names since there are many targets + labels = train_small[target] + features = train_small.drop(target, axis=1) + ds = tf.data.Dataset.from_tensor_slices((dict(features), dict(labels))) + else: + ds = tf.data.Dataset.from_tensor_slices(dict(train_small)) + ###### Now save some defaults in cat_vocab_dict ########################## + try: + keras_options["batchsize"] = batch_size + cat_vocab_dict['batch_size'] = batch_size + except: + batch_size = find_batch_size(DS_LEN) + keras_options["batchsize"] = batch_size + cat_vocab_dict['batch_size'] = batch_size + + ########################################################################## + #### C H E C K F O R I N F I N I T E V A L U E S H E R E ########## + ########################################################################## + cols_with_infinity = find_columns_with_infinity(train_small) + if cols_with_infinity: + train_small = drop_rows_with_infinity(train_small, cols_with_infinity, fill_value=True) + model_options['train_data_is_file'] = False + return train_small, ds, var_df, cat_vocab_dict, keras_options, model_options +############################################################################################### +def load_image_data(image_directory, project_name, keras_options, model_options, + verbose=0): + """ + Handy function that collects a sequence of image files into a tf.data generator. + + Your images input folder or directory must be like the following. If not, you will get error. + main_directory/ + ...class_a/ + ......image_1.png + ......image_2.png + ...class_b/ + ......image_3.png + ......image_4.png + + Inputs: + ----------- + image_directory: This is the folder that contains image files organized by class_label. + project_name: This is where the model will be stored once it is trained. + keras_options: a data dictionary that contains keras model options you can send. + model_options: a data dictionary that saves all the characteristics of your model + + Outputs: + ----------- + train_ds: a train dataset in tf.data.Dataset + valid_ds: a validation dataset in tf.data.Dataset format + cat_vocab_dict: a data dictionary that saves all the characteristics of your data + model_options: a data dictionary that saves all the characteristics of your model + """ + cat_vocab_dict = dict() + cat_vocab_dict['target_variables'] = "target" + cat_vocab_dict['project_name'] = project_name + if 'image_height' in model_options.keys(): + print(' Image height given as %d' %model_options['image_height']) + else: + print(" No image height given. Returning. 
Provide image height and width...") + return + if 'image_width' in model_options.keys(): + print(' Image width given as %d' %model_options['image_width']) + else: + print(" No image width given. Returning. Provide image height and width...") + return + if 'image_channels' in model_options.keys(): + print(' Image channels given as %d' %model_options['image_channels']) + else: + print(" No image_channels given. Returning. Provide image height and width...") + return + try: + image_train_folder = os.path.join(image_directory,"train") + if not os.path.exists(image_train_folder): + print("Image use case. No train folder exists under given directory. Returning...") + return + except: + print('Error: Not able to find any train or test image folder in the given folder %s' %image_train_folder) + print(""" You must put images under folders named train, + validation (optional) and test folders under given %s folder. + Otherwise deep_autoviml won't work. """ %image_directory) + image_train_split = False + image_train_folder = os.path.join(image_directory,"train") + if not os.path.exists(image_train_folder): + print("No train folder found under given image directory %s. Returning..." %image_directory) + image_train_folder = os.path.join(image_directory,"validation") + image_valid_folder = os.path.join(image_directory,"validation") + if not os.path.exists(image_valid_folder): + print("No validation folder found under given image directory %s. Returning..." %image_directory) + image_train_split = True + img_height = model_options['image_height'] + img_width = model_options['image_width'] + img_channels = model_options['image_channels'] + #### make this a small number - default batch_size ### + batch_size = check_model_options(model_options, "batch_size", 64) + model_options["batch_size"] = batch_size + full_ds = tf.keras.preprocessing.image_dataset_from_directory(image_train_folder, + seed=111, + image_size=(img_height, img_width), + batch_size=batch_size) + if image_train_split: + ############## Split train into train and validation datasets here ############### + classes = full_ds.class_names + recover = lambda x,y: y + print('\nSplitting train into two: train and validation data') + valid_ds = full_ds.enumerate().filter(is_valid).map(recover) + train_ds = full_ds.enumerate().filter(is_train).map(recover) + else: + train_ds = full_ds + valid_ds = tf.keras.preprocessing.image_dataset_from_directory(image_valid_folder, + seed=111, + image_size=(img_height, img_width), + batch_size=batch_size) + classes = train_ds.class_names + #### Successfully loaded train and validation data sets ################ + cat_vocab_dict["image_classes"] = classes + cat_vocab_dict["target_transformed"] = True + cat_vocab_dict['modeltype'] = 'Classification' + MLB = My_LabelEncoder() + ins = copy.deepcopy(classes) + outs = np.arange(len(classes)) + MLB.transformer = dict(zip(ins,outs)) + MLB.inverse_transformer = dict(zip(outs,ins)) + cat_vocab_dict['target_le'] = MLB + print('Number of image classes = %d and they are: %s' %(len(classes), classes)) + if len(classes) <= 2: + model_options["num_predicts"] = 1 + else: + model_options["num_predicts"] = len(classes) + print_one_image_from_dataset(train_ds, classes) + AUTOTUNE = tf.data.AUTOTUNE + train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE) + valid_ds = valid_ds.cache().prefetch(buffer_size=AUTOTUNE) + cat_vocab_dict["image_height"] = img_height + cat_vocab_dict["image_width"] = img_width + cat_vocab_dict["batch_size"] = batch_size + cat_vocab_dict["image_channels"] = 
img_channels + return train_ds, valid_ds, cat_vocab_dict, model_options +######################################################################################## +from collections import defaultdict +from collections import Counter + +def load_train_data(train_data_or_file, target, project_name, keras_options, model_options, + keras_model_type, verbose=0): + """ + Handy function that loads a file or a sequence of files (*.csv) into a tf.data.Dataset + You can also load a pandas dataframe instead of a file if you wanted to. It accepts both! + It will automatically figure out whether input is a file or file(s) or a pandas dataframe. + + Inputs: train_data_or_file, target + ------------------------------------------------------------------------------- + train_data_or_file: this can be a name of file to load or can be a pandas dataframe to load into tf.data + either option will work. This function will detect that automatically and load them. + target: target name as a string or a list + + Outputs: train_small, model_options, ds, var_df, cat_vocab_dict, keras_options + ------------------------------------------------------------------------------- + train_small: a sample of data into a pandas dataframe + model_options: a dictionary describing the data + ds: a tf.data.Dataset containing a symbolic link to the data at rest in your train_data_or_file + var_df: a dictionary classifying features in data to multiple types such as numeric, category, etc. + cat_vocab_dict: a dictionary containing artifacts from the data that will be used during inference + keras_options: a dictionary containing keras defaults for the model that will be built using this data + """ + shuffle_flag = False + cat_vocab_dict = defaultdict(list) + train_data_or_file = copy.deepcopy(train_data_or_file) + maxrows = 10000 ### the number of maximum rows read by pandas to sample data ## + ### Since you cannot deal with a very large dataset in pandas, let's look into how big the file is + try: + if isinstance(train_data_or_file, str): + DS_LEN = lenopenreadlines(train_data_or_file) + else: + DS_LEN = train_data_or_file.shape[0] + except: + if find_words_in_list(['http'], [train_data_or_file.lower()]): + print('http url file: cannot find size of dataset. 
Setting default...') + DS_LEN = maxrows #### set to an artificial low number ### + keras_options["data_size"] = DS_LEN + model_options["DS_LEN"] = DS_LEN + ########## LOADING EITHER FILE OR DATAFRAME INTO TF DATASET HERE ################## + if isinstance(train_data_or_file, str): + #### do this for files only ################## + train_small, train_ds, var_df, cat_vocab_dict, keras_options, model_options = load_train_data_file(train_data_or_file, target, + keras_options, model_options, verbose) + else: + train_small, train_ds, var_df, cat_vocab_dict, keras_options, model_options = load_train_data_frame(train_data_or_file, target, + keras_options, model_options, verbose) + + ### This is where we do all kinds of feature engineering - this needs to be in predict #### + cat_vocab_dict['bools_converted'] = False + if isinstance(train_data_or_file, str): + ### if train_data is a file, boolean vars have to be converted to strings ### + BOOLS = [] + cat_vocab_dict['bools_converted'] = True + cat_vocab_dict['categorical_vars'] += cat_vocab_dict['bools'] + else: + ### if train is a dataframe, you can leave bools as it is ### + BOOLS = cat_vocab_dict['bools'] + ################################################################################# + ##### F E A T U R E E N G I N E E R I N G H E R E ############# + ################################################################################# + def process_boolean(features, target): + """ + This is how you convert all your boolean features into float variables. + The reason you have to do this is because tf.keras does not know how to handle boolean types. + It takes as input an ordered dict named features and returns the same in features format. + """ + for feature_name in features: + if feature_name in BOOLS: + # Cast boolean feature values to int32 only if the train_data is a dataframe ## + #features[feature_name] = tf.cast(features[feature_name], tf.dtypes.float32) + features[feature_name] = tf.dtypes.cast(features[feature_name], tf.int32) + return (features, target) + ################################################################################# + train_ds = train_ds.map(process_boolean) + ################################################################################# + ################## process boolean target if needed ############################# + ################################################################################# + #@tf.autograph.experimental.do_not_convert + @tf.function + def process_target(features, target): + target = tf.cast(target, tf.dtypes.float32) + return (features, target) + if bool in train_small[cat_vocab_dict['target_variables']].dtypes.values: + train_ds = train_ds.map(process_target) + print('Boolean columns successfully processed') + ################################################################################# + if keras_model_type.lower() in ['nlp', 'text']: + NLP_VARS = cat_vocab_dict['predictors_in_train'] + else: + NLP_VARS = cat_vocab_dict['nlp_vars'] + ################################################################ + @tf.function + def process_NLP_features(features): + """ + This is how you combine all your string NLP features into a single new feature. + Then you can perform embedding on this combined feature. + It takes as input an ordered dict named features and returns the same features format. 
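+ Illustrative sketch (the feature names here are hypothetical, not from the library): if
+ NLP_VARS = ['title', 'review'] and each feature holds one string per example, the
+ tf.strings.reduce_join below yields a single space-separated text such as 'great movie'.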
+ """ + return tf.strings.reduce_join([features[i] for i in NLP_VARS],axis=0, + keepdims=False, separator=' ', name="combined") + ################################################################ + NLP_COLUMN = "combined_nlp_text" + ################################################################ + @tf.function + def combine_nlp_text(features): + ##use x to derive additional columns u want. Set the shape as well + y = {} + y.update(features) + y[NLP_COLUMN] = tf.strings.reduce_join([features[i] for i in NLP_VARS],axis=0, + keepdims=False, separator=' ') + return y + ###################################################################################### + ### You have to load only the NLP or text variables into dataset. + ### otherwise, it will fail during predict. Yo still need to create input for them. + ### In mixed_NLP models, you drop original NLP vars and combine them into one NLP var. + ###################################################################################### + + if NLP_VARS and keras_model_type.lower() in ['nlp','text', 'mixed_nlp', 'combined_nlp']: + if keras_model_type.lower() in ['nlp', 'text']: + train_ds = train_ds.map(lambda x, y: (process_NLP_features(x), y)) + #train_ds = train_ds.unbatch().batch(batch_size) + print(' processed NLP or text vars: %s successfully' %NLP_VARS) + elif keras_model_type.lower() in ['combined_nlp']: + train_ds = train_ds.map(lambda x, y: (combine_nlp_text(x), y)) + print(' combined NLP or text vars: %s into a single feature successfully' %NLP_VARS) + else: + ### Mixed NLP is to keep NLP vars separate so they can be processed individually ## + print(' keeping NLP vars separate') + else: + print(' No special text preprocessing done for NLP vars.') + ############################################################################ + ### You must batch it if you are creating it from a dataframe + batch_size = cat_vocab_dict['batch_size'] + if not isinstance(train_data_or_file, str): + train_ds = train_ds.batch(batch_size, drop_remainder=True) + #### if Target is modified in the above processes such as removing spaces, etc. you must re-init here + usecols = cat_vocab_dict['target_variables'] + cat_vocab_dict['DS_LEN'] = DS_LEN + if verbose >= 1 and train_small.shape[1] <= 30: + print_one_row_from_tf_dataset(train_ds) + #### Set Class Weights for Imbalanced Data Sets here ########## + modeltype = model_options["modeltype"] + #### You need to do this transform only for files. Otherwise, it is done already for dataframes. 
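+ ### usecols holds the target column name(s) gathered above: a single entry means a
+ ### single-label problem, more than one entry means a multi-label problem. Class weights
+ ### further below are computed only for single-label classification, per the Keras note there.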
+ if len(usecols) == 1: + target = usecols[0] + ### This is a single label problem ######## + y_train = train_small[target] + if modeltype != 'Regression' and not cat_vocab_dict['target_transformed']: + cat_vocab_dict["original_classes"] = np.unique(train_small[target]) + else: + ### This is a Multi-label label problem ######## + target = usecols[0] + y_train = train_small[usecols[0]] + target_copy = copy.deepcopy(usecols) + if modeltype != 'Regression' and not cat_vocab_dict['target_transformed']: + for each_t in target_copy: + cat_vocab_dict[each_t+"_original_classes"] = np.unique(train_small[each_t]) + #### CREATE CLASS_WEIGHTS HERE ################# + if modeltype != 'Regression': + find_rare_class(y_train, verbose=1) + if 'class_weight' in keras_options.keys() and not model_options['model_label']=='Multi_Label': + # Class weights are only applicable to single labels in Keras right now + class_weights = get_class_distribution(y_train) + keras_options['class_weight'] = class_weights + print(' Class weights calculated: %s' %class_weights) + else: + keras_options['class_weight'] = {} + else: + keras_options['class_weight'] = {} + print(' No class weights specified. Continuing...') + return train_small, model_options, train_ds, var_df, cat_vocab_dict, keras_options +########################################################################################################## +from collections import OrderedDict +def find_rare_class(classes, verbose=0): + ######### Print the % count of each class in a Target variable ##### + """ + Works on Multi Class too. Prints class percentages count of target variable. + It returns the name of the Rare class (the one with the minimum class member count). + This can also be helpful in using it as pos_label in Binary and Multi Class problems. + """ + counts = OrderedDict(Counter(classes)) + total = sum(counts.values()) + if verbose >= 1: + print(' Class -> Counts -> Percent') + sorted_keys = sorted(counts.keys()) + for cls in sorted_keys: + print("%12s: % 7d -> % 5.1f%%" % (cls, counts[cls], counts[cls]/total*100)) + if type(pd.Series(counts).idxmin())==str: + return pd.Series(counts).idxmin() + else: + return int(pd.Series(counts).idxmin()) +############################################################################### +from sklearn.utils.class_weight import compute_class_weight +import copy +from collections import Counter +def get_class_distribution(y_input): + y_input = copy.deepcopy(y_input) + classes = np.unique(y_input) + xp = Counter(y_input) + class_weights = compute_class_weight('balanced', classes=np.unique(y_input), y=y_input) + if len(class_weights[(class_weights> 10)]) > 0: + class_weights = (class_weights/10) + else: + class_weights = (class_weights) + #print(' class_weights = %s' %class_weights) + class_weights[(class_weights<1)]=1 + class_rows = class_weights*[xp[x] for x in classes] + class_rows = class_rows.astype(int) + class_weighted_rows = dict(zip(classes,class_weights)) + #print(' class_weighted_rows = %s' %class_weighted_rows) + return class_weighted_rows +######################################################################## +### Split raw_train_set into train and valid data sets first +### This is a better way to split a dataset into train and test #### +### It does not assume a pre-defined size for the data set. 
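+### Illustrative note on how these predicates are used (pattern taken from the loaders above):
+###     valid_ds = full_ds.enumerate().filter(is_valid).map(lambda x, y: y)
+###     train_ds = full_ds.enumerate().filter(is_train).map(lambda x, y: y)
+### is_valid keeps every 5th enumerated batch (index % 5 == 0) while is_train keeps the
+### odd-indexed batches; odd multiples of 5 satisfy both, so a small overlap is possible.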
+def is_valid(x, y): + return x % 5 == 0 +def is_test(x, y): + return x % 2 == 0 +def is_train(x, y): + return not is_test(x, y) +################################################################################## +def load_text_data(text_directory, project_name, keras_options, model_options, + verbose=0): + """ + Handy function that collects a sequence of text files into a tf.data generator. + Your text input folder or directory must be like the following. If not, you will get error. + main_directory/ + ...class_a/ + ......a_text_1.txt + ......a_text_2.txt + ...class_b/ + ......b_text_1.txt + ......b_text_2.txt + + Inputs: + ----------- + text_directory: This is the folder that contains .txt files organized by class_label. + project_name: This is where the model will be stored once it is trained. + keras_options: a data dictionary that contains keras model options you can send. + model_options: a data dictionary that saves all the characteristics of your model + + Outputs: + ----------- + train_ds: a train dataset in tf.data.Dataset + valid_ds: a validation dataset in tf.data.Dataset format + cat_vocab_dict: a data dictionary that saves all the characteristics of your data + model_options: a data dictionary that saves all the characteristics of your model + """ + cat_vocab_dict = dict() + cat_vocab_dict['target_variables'] = "target" + cat_vocab_dict['project_name'] = project_name + try: + text_train_folder = os.path.join(text_directory,"train") + if not os.path.exists(text_train_folder): + print("text use case. No train folder exists under given directory. Returning...") + return + except: + print('Error: Not able to find any train or test folder in the given folder %s' %text_train_folder) + print(""" You must put texts under folders named train, + validation (optional) and test folders under given %s folder. + Otherwise deep_autoviml won't work. """ %text_directory) + text_train_split = False + text_train_folder = os.path.join(text_directory,"train") + if not os.path.exists(text_train_folder): + print("No train folder found under given text directory %s. Returning..." %text_directory) + text_train_folder = os.path.join(text_directory,"validation") + text_valid_folder = os.path.join(text_directory,"validation") + if not os.path.exists(text_valid_folder): + print("No validation folder found under given text directory %s. Returning..." 
%text_directory) + text_train_split = True + #### make this a small number - default batch_size ### + batch_size = check_model_options(model_options, "batch_size", 64) + model_options["batch_size"] = batch_size + full_ds = tf.keras.preprocessing.text_dataset_from_directory(text_train_folder, + seed=111, + batch_size=batch_size) + if text_train_split: + ############## Split train into train and validation datasets here ############### + classes = full_ds.class_names + recover = lambda x,y: y + print('\nSplitting train into two: train and validation data') + valid_ds = full_ds.enumerate().filter(is_valid).map(recover) + train_ds = full_ds.enumerate().filter(is_train).map(recover) + else: + train_ds = full_ds + valid_ds = tf.keras.preprocessing.text_dataset_from_directory(text_valid_folder, + seed=111, + batch_size=batch_size) + classes = train_ds.class_names + #### Successfully loaded train and validation data sets ################ + cat_vocab_dict["text_classes"] = classes + cat_vocab_dict["target_transformed"] = True + cat_vocab_dict['modeltype'] = 'Classification' + MLB = My_LabelEncoder() + ins = copy.deepcopy(classes) + outs = np.arange(len(classes)) + MLB.transformer = dict(zip(ins,outs)) + MLB.inverse_transformer = dict(zip(outs,ins)) + cat_vocab_dict['target_le'] = MLB + print('Number of text classes = %d and they are: %s' %(len(classes), classes)) + print_one_text_from_dataset(train_ds, classes) + AUTOTUNE = tf.data.AUTOTUNE + train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE) + valid_ds = valid_ds.cache().prefetch(buffer_size=AUTOTUNE) + model_options["num_classes"] = len(classes) + cat_vocab_dict["batch_size"] = batch_size + return train_ds, valid_ds, cat_vocab_dict, model_options +################################################################################### +def select_rows_from_file_or_frame(train_datafile, model_options, targets, nrows_limit): + train_datafile = copy.deepcopy(train_datafile) + #### Set some defaults from model options which is required ## + DS_LEN = model_options["DS_LEN"] + sep = model_options["sep"] + header = model_options["header"] + csv_encoding = model_options["csv_encoding"] + modeltype = model_options['modeltype'] + compression = model_options['compression'] + ####### we randomly sample a small dataset to classify features ##################### + test_size = min(0.9, (1 - (nrows_limit/DS_LEN))) ### make sure there is a small train size + if test_size <= 0: + test_size = 0.9 + if DS_LEN > nrows_limit: + print(' Since number of rows > %s, loading a random sample of %d rows into pandas for EDA' %(nrows_limit, DS_LEN)) + ###### If it is a file you need to load it into a dataframe, it not leave it as is ### + if isinstance(train_datafile, str): + ###### load a small sample of data into a pandas dataframe ## + if DS_LEN >= 1e5: + train_small = pd.read_csv(train_datafile, nrows=nrows_limit, sep=sep, header=header, + encoding=csv_encoding, compression=compression) + else: + train_small = pd.read_csv(train_datafile, sep=sep, header=header, + encoding=csv_encoding, compression=compression) + else: + train_small = copy.deepcopy(train_datafile) + ####### If it is a classification problem, you need to stratify and select sample ### + + if modeltype != 'Regression': + copy_targets = copy.deepcopy(targets) + for each_target in copy_targets: + ### You need to remove rows that have very class samples - that is a problem while splitting train_small + list_of_few_classes = 
train_small[each_target].value_counts()[train_small[each_target].value_counts()<=10].index.tolist() + train_small = train_small.loc[~(train_small[each_target].isin(list_of_few_classes))] + train_small, _ = train_test_split(train_small, test_size=test_size, stratify=train_small[targets]) + else: + ### For Regression problems: load a small sample of data into a pandas dataframe ## + if DS_LEN <= nrows_limit: + train_small = train_small.sample(n=DS_LEN, random_state=99) + else: + train_small = train_small.sample(n=nrows_limit, random_state=99) + return train_small +###################################################################################### \ No newline at end of file diff --git a/build/lib/deep_autoviml/deep_autoviml.py b/build/lib/deep_autoviml/deep_autoviml.py new file mode 100644 index 0000000..5acd739 --- /dev/null +++ b/build/lib/deep_autoviml/deep_autoviml.py @@ -0,0 +1,524 @@ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import pandas as pd +import numpy as np +pd.set_option('display.max_columns',500) +import matplotlib.pyplot as plt +import tempfile +import pdb +import os +import copy +import warnings +warnings.filterwarnings(action='ignore') +import functools +# Make numpy values easier to read. 
+np.set_printoptions(precision=3, suppress=True) +############################################################################################ +# TensorFlow ≥2.4 is required +import tensorflow as tf +from tensorflow import keras +#print('Tensorflow version on this machine: %s' %tf.__version__) +np.random.seed(42) +tf.random.set_seed(42) +from tensorflow.keras import layers +from tensorflow.keras.layers.experimental.preprocessing import Normalization, StringLookup, CategoryCrossing +from tensorflow.keras.layers.experimental.preprocessing import IntegerLookup, CategoryEncoding +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization, Discretization, Hashing +from tensorflow.keras.layers import Embedding, Reshape, Dropout, Dense + +from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras.models import Model, load_model +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +from tensorflow.keras.models import Model, load_model +import tensorflow_hub as hub + +############################################################################################# +from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error +from IPython.core.display import Image, display +import pickle +############################################################################################# +##### Suppress all TF2 and TF1.x warnings ################### +tf2logger = tf.get_logger() +tf2logger.warning('Silencing TF2.x warnings') +tf2logger.root.removeHandler(tf2logger.root.handlers) +tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) +############################################################################################ +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D, AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D, GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ +import time +import os +import datetime + +from sklearn.metrics import balanced_accuracy_score, classification_report, confusion_matrix +from sklearn.metrics import roc_auc_score +from collections import defaultdict +############################################################################################ +# data pipelines +from .data_load.classify_features import classify_features +from .data_load.classify_features import classify_features_using_pandas + +from .data_load.classify_features import EDA_classify_and_return_cols_by_type +from .data_load.classify_features import EDA_classify_features +from .data_load.extract import find_problem_type, transform_train_target +from .data_load.extract import load_train_data, load_train_data_file +from .data_load.extract import load_train_data_frame, load_image_data +from .data_load.extract import load_text_data + +# keras preprocessing +from .preprocessing.preprocessing import perform_preprocessing +from .preprocessing.preprocessing_tabular import preprocessing_tabular +from .preprocessing.preprocessing_nlp import preprocessing_nlp 
+from .preprocessing.preprocessing_images import preprocessing_images +from .preprocessing.preprocessing_text import preprocessing_text + +# keras models and bring-your-own models +from .modeling.create_model import create_model +from .models import basic, dnn, reg_dnn, dnn_drop, giant_deep, cnn1, cnn2 +from .modeling.train_model import train_model +from .modeling.train_custom_model import train_custom_model +from .modeling.predict_model import predict, predict_images, predict_text +from .modeling.train_image_model import train_image_model +from .modeling.train_text_model import train_text_model + +# Utils +from .utilities.utilities import print_one_row_from_tf_dataset +from .utilities.utilities import print_one_row_from_tf_label +from .utilities.utilities import check_if_GPU_exists, plot_history +from .utilities.utilities import save_model_architecture + +############################################################################################# +### Split raw_train_set into train and valid data sets first +### This is a better way to split a dataset into train and test #### +### It does not assume a pre-defined size for the data set. +def is_valid(x, y): + return x % 5 == 0 +def is_test(x, y): + return x % 2 == 0 +def is_train(x, y): + return not is_test(x, y) +############################################################################################# +#### probably the most handy function of all! ############################################## +def left_subtract(l1,l2): + lst = [] + for i in l1: + if i not in l2: + lst.append(i) + return lst +############################################################################################## +def fit(train_data_or_file, target, keras_model_type="basic", project_name="deep_autoviml", + save_model_flag=True, model_options={}, keras_options={}, + use_my_model='', model_use_case='', verbose=0, + use_mlflow=False,mlflow_exp_name='autoviml',mlflow_run_name='first_run' + ): + """ + #################################################################################### + #### Deep AutoViML #### + #### Developed by Ram Seshadri (2021) #### + #### Python 3, Tensforflow >= 2.4 #### + #################################################################################### + Inputs: + train_data_or_file: can be file or pandas dataframe: you need to give path to filename. + target: string or list. You can give one variable (string) or multiple variables (list) + keras_model_type: default = "fast". That will build a keras model and pipeline very fast. + Then you can try other options like 'fast1', 'fast2' and finally 'auto'. + You can also try 'CNN1', 'CNN2'. + If you are using it on NLP dataset, then set this to 'BERT' or 'USE'. + 'USE' stands for Universal Sentence Encoder. That's also a good model. + Then it will automatically download a base BERT model and use it. + project_name: default = "deep_autoviml". This is used to name the folder to save model. + save_model_flag: default = False: it determines wher you want to save your trained model + to local drive. If True, it will save it locally in project_name folder. + use_my_model: default = '' - you can create a file with any model architecture you + want and send in name of that file here. We will import that model + file and use it as model to run with inputs and output pipeline + we create. You can name file anything you want but Don't name + your model file as tensorflow.py or keras.py since when we import + that file, it will overwrite tensorflow and keras functions in + your code (disaster!) 
Also, you must name model variable as "model" + in that file. So that way, when we import it, we will use it as + "import model from xyz" file. Important! + Additionally, you can create a Sequential model variable and send it. + keras_options: dictionary: you can send in any keras model option you want: optimizer, + epochs, batchsize, etc. + "batchsize": default = "": you can leave it blank and we will automatically + calculate a batchsize + "patience": default = 10 ### patience of 10 seems ideal. You can raise or lower it + "epochs": default = 500 ## 500 seems ideal for most scenarios #### + "steps_per_epoch": default = 5 ### 5 seems ideal for most scenarios + 'optimizer': default = RMSprop(lr=0.1, rho=0.9) ##Adam(lr=0.1) #SGD(lr=0.1) + 'kernel_initializer': default = 'glorot_uniform' ### Others: 'he_uniform', etc. + 'num_layers': default = 2 : # this defines number of layers if you choose custom model + 'loss': default = it will choose automatically based on modeltype + ### you can define any keras loss function such as mae, mse, etc. + 'metrics': default = it will choose automatically based on modeltype + ## you can define any keras metric you like + 'monitor': default = it will choose automatically based on modeltype + 'mode': default = it will choose automatically based on modeltype + "lr_scheduler": default = "onecycle" but you can choose from any below: + ## ["scheduler", 'onecycle', 'rlr' (reduce LR on plateau), 'decay'] + "early_stopping": default = True. You can change it to False. + "class_weight": {}: you can send in class weights for imbalanced classes as a dictionary. + model_options: dictionary: you can send in any deep autoviml model option you + want to change using this dictionary. + You can change following as long as you use this option and same exact wordings: + For example: let's say you want to change number of categories in a variable + above which it is not a cat variable. + You can change that using following option: + model_options_defaults["variable_cat_limit"] = 30 + Similarly for the number of characters above which a string variable will be + considered an NLP variable: model_options_defaults["nlp_char_limit"] = 30 + Another option would be to inform autoviml about encoding in CSV file for it to + read such as 'latin-1' by setting {"csv_encoding": 'latin-1'} + Other examples: + "nlp_char_limit": default 50. Beyond this max limit of chars in column, it + will be considered NLP column and treated as such. + "variable_cat_limit": default 30. if a variable has more than this limit, it + will NOT be treated as a categorical variable. + "DS_LEN": default "". Number of rows in dataset. You can leave it "" to calculate automatically. + "csv_encoding": default='utf-8'. You can change to 'latin-1', 'iso-8859-1', 'cp1252', etc. + "cat_feat_cross_flag": if you want to cross categorical features such as A*B, B*C... + "sep" : default = "," comma but you can override it. Separator used in read_csv. + "idcols": default: empty list. Specify which variables you want to exclude from model. + "save_model_format": default is "" (empty string) which means tensorflow default .pb format: + Specify "h5" if you want to save it in ".h5" format. + "modeltype": default = '': if you leave it blank we will automatically determine it. + If you want to override, your options are: 'Regression', 'Classification', + 'Multi_Classification'. + We will figure out single label or multi-label problem based on your target + being string or list. 
+ "header": default = 0 ### this is the header row for pandas to read
+ "compression": None => you can set it to zip or other file compression formats if your data is compressed
+ "csv_encoding": default 'utf-8'. But you can set it to any other csv encoding format your data is in
+ "label_encode_flag": False. But you can set it to True if you want it encoded.
+ "max_trials": default = 30 ## number of Storm Tuner trials ### Lower this for faster processing.
+ "tuner": default = 'storm' ## Storm Tuner is the default tuner. Optuna is the other option.
+ "embedding_size": default = 50 ## this is the NLP embedding size minimum
+ "tf_hub_model": default "" (empty string). If you want to supply a TF Hub model, provide its URL here.
+ "image_directory": If you choose model_use_case as "image", then you must provide the image folder.
+ "image_height": default is "" (empty string). Needed only for "image" use case.
+ "image_width": default is "" (empty string). Needed only for "image" use case.
+ "image_channels": default is "" (empty string). Needed only for image use case. Number of channels.
+ 'save_model_path': default is project_name/keras_model_type/datetime-hour-min/
+ If you provide your own model path as a string, it will save it there.
+ model_use_case: default is "" (empty string). If "pipeline", you will get back the pipeline only, not the model.
+ It is a placeholder for future purposes. At the moment, leave it as an empty string.
+ verbose = 1 will give you more charts and outputs. verbose 0 will run silently
+ with minimal outputs.
+ use_mlflow = Set this to True to enable MLflow lifecycle and experiment tracking. This is False by default.
+ MLflow is used to manage the ML lifecycle, including experimentation, reproducibility,
+ deployment, and a central model registry.
+ mlflow_exp_name = MLflow experiment name.
+ mlflow_run_name = MLflow run name. You have the flexibility to use a custom run name.
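+ Example (an illustrative sketch only: the file name, target and option values below are
+ assumptions, and the import path is inferred from the package layout):
+ from deep_autoviml import deep_autoviml as deepauto
+ model, cat_vocab_dict = deepauto.fit("train.csv", target="label",
+ keras_model_type="fast", project_name="my_project",
+ keras_options={"epochs": 20}, model_options={"max_trials": 10})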
+ + + """ + my_strategy = check_if_GPU_exists(1) + ######## C H E CK T Y P E O F K E R A S M O D E L ##################### + print() #### create a new line that's all ### + model_options_copy = copy.deepcopy(model_options) + keras_options_copy = copy.deepcopy(keras_options) + + #############MLFLOW Check#################################### + if use_mlflow: + import mlflow + mlflow.set_experiment(mlflow_exp_name) + mlflow.start_run(run_name=mlflow_run_name) + mlflow.tensorflow.autolog(every_n_iter=1) + + if isinstance(project_name,str): + if project_name == '': + project_name = "deep_autoviml" + else: + print('Project name must be a string and helps create a folder to store model.') + project_name = "deep_autoviml" + + save_model_path = os.path.join(project_name,keras_model_type) + save_model_path = get_save_folder(save_model_path) + if not os.path.exists(save_model_path): + os.makedirs(save_model_path, exist_ok = True) + trials_saved_path = os.path.join(save_model_path, "trials") + os.makedirs(trials_saved_path, exist_ok = True) + save_artifacts_path = os.path.join(save_model_path, "artifacts") + os.makedirs(save_artifacts_path, exist_ok = True) + save_logs_path = os.path.join(save_model_path, "mylogs") + os.makedirs(save_logs_path, exist_ok = True) + + print('Model and logs being saved in %s' %save_model_path) + + if keras_model_type.lower() in ['image', 'images', "image_classification"]: + ############### Now do special IMAGE processing here ################################### + if 'image_directory' in model_options.keys(): + print(' Image directory given as %s' %model_options['image_directory']) + image_dir = model_options["image_directory"] + else: + print(" No image directory given. Provide image directory in model_options...") + return + try: + print('For image use case:') + train_ds, valid_ds, cat_vocab_dict, model_options = load_image_data(image_dir, + project_name, keras_options_copy, + model_options_copy, verbose) + except: + print(' Error in image loading: check your model_options and try again.') + return + try: + deep_model = preprocessing_images(train_ds, model_options) + except: + print(' Error in image preprocessing: check your model_options and try again.') + return + ########## E N D O F S T R A T E G Y S C O P E ############# + deep_model, cat_vocab_dict = train_image_model(deep_model, train_ds, valid_ds, + cat_vocab_dict, keras_options_copy, model_options_copy, + project_name, save_model_flag) + print(deep_model.summary()) + return deep_model, cat_vocab_dict + elif keras_model_type.lower() in ['text', 'text classification', "text_classification"]: + ############### Now do special TEXT processing here ################################### + text_alt = True ### This means you use the text directory option + if 'text_directory' in model_options.keys(): + print(' text directory given as %s' %model_options['text_directory']) + text_dir = model_options["text_directory"] + else: + print(" No text directory given. Using train data given as input..." 
) + text_alt = False ## this means you use the text file given + ################ T E X T C L A S S I F I C A T I O N ######### + if text_alt: + try: + train_ds, valid_ds, cat_vocab_dict, model_options = load_text_data(text_dir, + project_name, keras_options_copy, + model_options_copy, verbose) + except: + print(' Error in text folder loading: check your folder name and try again.') + return + else: + #### Use the text file given and split it into train and valid_ds #### + dft, model_options, full_ds, var_df, cat_vocab_dict, keras_options = load_train_data( + train_data_or_file, target, project_name, keras_options_copy, + model_options_copy, keras_model_type, verbose=verbose) + print('Loaded text classification file or dataframe using input given:') + ############## Split train into train and validation datasets here ############### + recover = lambda x,y: y + print('\nSplitting train into 80+20 percent: train and validation data') + valid_ds = full_ds.enumerate().filter(is_valid).map(recover) + train_ds = full_ds.enumerate().filter(is_train).map(recover) + ################### P R E P R O C E S S T E X T ######################### + try: + deep_model = preprocessing_text(train_ds, keras_model_type, model_options) + except: + print(' Error in text preprocessing: check your model_options and try again.') + return + + deep_model, cat_vocab_dict = train_text_model(deep_model, train_ds, valid_ds, + cat_vocab_dict, keras_options_copy, + project_name, save_model_flag) + print(deep_model.summary()) + return deep_model, cat_vocab_dict + + shuffle_flag = False + #### K E R A S O P T I O N S - THESE CAN BE OVERRIDDEN by your input keras_options dictionary #### + keras_options_defaults = {} + keras_options_defaults["batchsize"] = "" + keras_options_defaults['activation'] = '' + keras_options_defaults['save_weights_only'] = True + keras_options_defaults['use_bias'] = True + keras_options_defaults["patience"] = "" ### patience of 20 seems ideal. 
+ keras_options_defaults["epochs"] = "" ## 500 seems ideal for most scenarios #### + keras_options_defaults["steps_per_epoch"] = "" ### 10 seems ideal for most scenarios + keras_options_defaults['optimizer'] = "RMSprop" + keras_options_defaults['kernel_initializer'] = '' + keras_options_defaults['num_layers'] = "" + keras_options_defaults['loss'] = "" + keras_options_defaults['metrics'] = "" + keras_options_defaults['monitor'] = "" + keras_options_defaults['mode'] = "" + keras_options_defaults["lr_scheduler"] = "" + keras_options_defaults["early_stopping"] = True + keras_options_defaults["class_weight"] = {} + + list_of_keras_options = ["batchsize", "activation", "save_weights_only", "use_bias", + "patience", "epochs", "steps_per_epoch", "optimizer", + "kernel_initializer", "num_layers", "class_weight", + "loss", "metrics", "monitor","mode", "lr_scheduler","early_stopping", + "class_weight"] + + keras_options = copy.deepcopy(keras_options_defaults) + if len(keras_options_copy) > 0: + print('Using following keras_options given as input:') + for key in list_of_keras_options: + if key in keras_options_copy.keys(): + print(' %s : %s' %(key, keras_options_copy[key])) + keras_options[key] = keras_options_copy[key] + + list_of_model_options = ["idcols","modeltype","sep","cat_feat_cross_flag", "model_use_case", "label_encode_flag", + "nlp_char_limit", "variable_cat_limit", "compression", "csv_encoding", "header", + "max_trials","tuner", "embedding_size", "tf_hub_model", "image_directory", + 'image_height', 'image_width', "image_channels", "save_model_path"] + + model_options_defaults = defaultdict(str) + model_options_defaults["idcols"] = [] + model_options_defaults["modeltype"] = '' + model_options_defaults["save_model_format"] = "" + model_options_defaults["sep"] = "," + model_options_defaults["cat_feat_cross_flag"] = False + model_options_defaults["model_use_case"] = '' + model_options_defaults["nlp_char_limit"] = 30 + model_options_defaults["variable_cat_limit"] = 30 + model_options_defaults["csv_encoding"] = 'utf-8' + model_options_defaults['compression'] = None ## is is needed in case to read Zip files + model_options_defaults["label_encode_flag"] = '' ## User can set it to True or False depending on their need. + model_options_defaults["header"] = 0 ### this is the header row for pandas to read + model_options_defaults["max_trials"] = 30 ## number of Storm Tuner trials ### + model_options_defaults['tuner'] = 'storm' ## Storm Tuner is the default tuner. Optuna is the other option. + model_options_defaults["embedding_size"] = "" ## this is the NLP embedding size minimum + model_options_defaults["tf_hub_model"] = "" ## If you want to use a pretrained Hub model, provide URL here. 
+ model_options_defaults["image_directory"] = "" ## this is where images are input in form of folder + model_options_defaults['image_height'] = "" ## the height of the image must be given in number of pixels + model_options_defaults['image_width'] = "" ## the width of the image must be given in number of pixels + model_options_defaults["image_channels"] = "" ## number of channels in images provided + model_options_defaults['save_model_path'] = save_model_path + + model_options = copy.deepcopy(model_options_defaults) + if len(model_options_copy) > 0: + print('Using following model_options given as input:') + for key in list_of_model_options: + if key in model_options_copy.keys(): + print(' %s : %s' %(key, model_options_copy[key])) + model_options[key] = model_options_copy[key] + + fast_models = ['deep_and_wide','deep_wide','wide_deep', 'wide_and_deep','deep wide', + 'wide deep', 'fast','fast1', 'fast2', 'deep_and_cross', 'deep cross', 'deep and cross'] + if keras_model_type.lower() in fast_models: + print('max_trials set to 10 for fast models. Please increase it if you want better performance...') + model_options["max_trials"] = 10 + else: + if model_options["max_trials"] <= 20: + print('Your max_trials %s is below recommended 20. Please increase max_trials if you want better accuracy or a better model' %model_options["max_trials"]) + else: + print('Your max_trials %s is above recommended 20. Please reduce max_trials if you want it to run faster...' %model_options["max_trials"]) + + print(""" +################################################################################# +########### L O A D I N G D A T A I N T O TF.DATA.DATASET H E R E # +################################################################################# + """) + dft, model_options, batched_data, var_df, cat_vocab_dict, keras_options = load_train_data( + train_data_or_file, target, project_name, keras_options, + model_options, keras_model_type, verbose=verbose) + + try: + data_size = cat_vocab_dict['DS_LEN'] + except: + data_size = 10000 + cat_vocab_dict['DS_LEN'] = data_size + + modeltype = model_options['modeltype'] + + ########## Perform keras preprocessing here by building all layers needed ############# + print(""" +################################################################################# +########### K E R A S F E A T U R E P R E P R O C E S S I N G ####### +################################################################################# + """) + + nlp_inputs, meta_inputs, meta_outputs, nlp_outputs = perform_preprocessing(batched_data, var_df, + cat_vocab_dict, keras_model_type, + keras_options, model_options, + verbose) + + if isinstance(model_use_case, str): + if model_use_case: + if model_use_case.lower() == 'pipeline': + ########## Perform keras preprocessing only and return inputs + keras layers created ## + print('\nReturning a keras pipeline so you can create your own Functional model.') + return nlp_inputs, meta_inputs, meta_outputs, nlp_outputs + #### There may be other use cases for model_use_case in future hence leave this empty for now # + + #### you must create a functional model here + print('\nCreating a new Functional keras model now...') + print(''' +################################################################################# +########### C R E A T I N G A K E R A S M O D E L ############ +################################################################################# + ''') + ######### this is where you get the model body either by yourself or sent as input ## + ##### This takes 
care of providing multi-output predictions! ###### + model_body, keras_options = create_model(use_my_model, nlp_inputs, meta_inputs, meta_outputs, + nlp_outputs, keras_options, var_df, keras_model_type, + model_options, cat_vocab_dict) + + ########### C O M P I L E M O D E L H E R E ############# + ### For auto models we will add input and output layers later. See below... ######### + deep_model = model_body + + + if dft.shape[1] <= 100 : + plot_filename = save_model_architecture(deep_model, project_name, keras_model_type, cat_vocab_dict, + model_options, chart_name="model_before") + if plot_filename != "": + try: + display(Image(retina=True, filename=plot_filename)) + except: + print('Cannot save plot. Install pydot and graphviz if you want plots saved.') + print(""" +################################################################################# +########### T R A I N I N G K E R A S M O D E L H E R E ######### +################################################################################# + """) + + if keras_model_type.lower() not in ['auto','mixed_nlp']: + print('Training a %s model option...' %keras_model_type) + deep_model, cat_vocab_dict = train_model(deep_model, batched_data, target, keras_model_type, + keras_options, model_options, var_df, cat_vocab_dict, project_name, save_model_flag, verbose) + else: + #### This is used only for custom auto models and is out of the strategy scope ####### + print('Building and training a(n) %s model using %s Tuner...' %(keras_model_type, model_options['tuner'])) + deep_model, cat_vocab_dict = train_custom_model(nlp_inputs, meta_inputs, meta_outputs, nlp_outputs, + batched_data, target, keras_model_type, keras_options, + model_options, var_df, cat_vocab_dict, project_name, + save_model_flag, use_my_model, verbose) + if verbose >= 1: + print(deep_model.summary()) + if dft.shape[1] <= 100 : + plot_filename = save_model_architecture(deep_model, project_name, keras_model_type, cat_vocab_dict, + model_options, chart_name="model_after") + if plot_filename != "": + try: + display(Image(retina=True, filename=plot_filename)) + except: + print('Cannot save plot. Install pydot and graphviz if you want plots saved.') + distributed_values = (deep_model, cat_vocab_dict) + if use_mlflow: + mlflow.end_run() + print("""####################################################### + Please start Mlflow locally to track machine learning lifecycle and use as below + http://localhost:5000/ + ####################################################### """) + return distributed_values + +############################################################################################ +def get_save_folder(save_dir): + run_id = time.strftime("model_%Y_%m_%d_%H_%M_%S") + return os.path.join(save_dir, run_id) +############################################################################################ \ No newline at end of file diff --git a/build/lib/deep_autoviml/modeling/create_model.py b/build/lib/deep_autoviml/modeling/create_model.py new file mode 100644 index 0000000..eb1d59b --- /dev/null +++ b/build/lib/deep_autoviml/modeling/create_model.py @@ -0,0 +1,487 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. 
+#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +import tempfile +import pdb +import copy +import warnings +warnings.filterwarnings(action='ignore') +import functools +# Make numpy values easier to read. +np.set_printoptions(precision=3, suppress=True) +from collections import defaultdict +############################################################################################ +# data pipelines and feature engg here +from deep_autoviml.models import basic, dnn, reg_dnn, dnn_drop, giant_deep, cnn1, cnn2 +from deep_autoviml.preprocessing.preprocessing_tabular import encode_fast_inputs, create_fast_inputs +from deep_autoviml.preprocessing.preprocessing_tabular import encode_all_inputs, create_all_inputs +from deep_autoviml.preprocessing.preprocessing_tabular import encode_num_inputs, encode_auto_inputs + +from deep_autoviml.preprocessing.preprocessing_tabular import encode_nlp_inputs, create_nlp_inputs +from deep_autoviml.preprocessing.preprocessing_nlp import aggregate_nlp_dictionaries + +# Utils +from deep_autoviml.utilities.utilities import check_if_GPU_exists +from deep_autoviml.utilities.utilities import get_model_defaults, get_compiled_model, get_uncompiled_model +from deep_autoviml.utilities.utilities import check_model_options, check_keras_options +from deep_autoviml.utilities.utilities import add_outputs_to_model_body +from deep_autoviml.utilities.utilities import get_hidden_layers, add_outputs_to_auto_model_body + +############################################################################################ +# TensorFlow ≥2.4 is required +import tensorflow as tf +np.random.seed(42) +tf.random.set_seed(42) +from tensorflow.keras import layers +from tensorflow import keras +from tensorflow.keras.layers.experimental.preprocessing import Normalization, StringLookup +from tensorflow.keras.layers.experimental.preprocessing import IntegerLookup, CategoryEncoding +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization + +from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras.models import Model, load_model +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers + +from tensorflow.keras import layers +####################################################################################### +from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error +from IPython.core.display import Image, display +import pickle + +##### Suppress all TF2 and TF1.x warnings ################### +tf2logger = tf.get_logger() +tf2logger.warning('Silencing TF2.x warnings') +tf2logger.root.removeHandler(tf2logger.root.handlers) +tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) 
+############################################################################################ +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D, AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D, GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ +#### probably the most handy function of all! +def left_subtract(l1,l2): + lst = [] + for i in l1: + if i not in l2: + lst.append(i) + return lst +############################################################################################# +from sklearn.metrics import balanced_accuracy_score, accuracy_score +import tensorflow as tf +from sklearn.metrics import confusion_matrix +import numpy as np +from tensorflow.python.keras import backend as K +import sys +class BalancedAccuracy(tf.keras.metrics.Metric): + """ + ########################################################################################## + ###### Many thanks to the source below for this Balanced Accuracy Metric ################# + ### https://github.com/saeyslab/DeepLearning_for_ImagingFlowCytometry/blob/master/model.py + ########################################################################################## + """ + def __init__(self, noc, name="balanced_accuracy", **kwargs): + super(BalancedAccuracy, self).__init__(name=name, **kwargs) + + self.noc = noc + self.confusion_matrix = self.add_weight( + name = "confusion_matrix", + shape = (noc, noc), + initializer = "zeros", + dtype = tf.int32 + ) + + def reset_states(self): + K.batch_set_value([(v, np.zeros(shape=v.get_shape())) for v in self.variables]) + + def update_state(self, y_true, y_pred, sample_weight=None): + confusion_matrix = tf.math.confusion_matrix(y_true, tf.argmax(y_pred, axis=1), num_classes=self.noc) + return self.confusion_matrix.assign_add(confusion_matrix) + + def result(self): + diag = tf.linalg.diag_part(self.confusion_matrix) + rowsums = tf.math.reduce_sum(self.confusion_matrix, axis=1) + result = tf.math.reduce_mean(diag/rowsums, axis=0) + return result +########################################################################################## +def create_model(use_my_model, nlp_inputs, meta_inputs, meta_outputs, nlp_outputs, keras_options, var_df, + keras_model_type, model_options, cat_vocab_dict): + """ + This is a handy function to create a Sequential model architecture depending on keras_model_type option given. + It also can re-use a model_body (without input and output layers) given by the user as input for model_body. + It returns a model_body as well as a tuple containing a number of parameters used on defining the model and training it. 
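+ In this implementation the return value is the pair (model_body, keras_options), where
+ keras_options may have been updated with defaults (loss, metrics, optimizer) chosen for the data.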
+ """ + data_size = model_options['DS_LEN'] + num_classes = model_options["num_classes"] + num_labels = model_options["num_labels"] + modeltype = model_options["modeltype"] + targets = cat_vocab_dict['target_variables'] + patience = check_keras_options(keras_options, "patience", 10) + cols_len = len([item for sublist in list(var_df.values()) for item in sublist]) + if not isinstance(meta_outputs, list): + data_dim = int(data_size*meta_outputs.shape[1]) + else: + data_dim = int(data_size*cols_len) + #### These can be standard for every keras option that you use layers ###### + kernel_initializer = check_keras_options(keras_options, 'kernel_initializer', 'glorot_uniform') + activation='relu' + if nlp_inputs: + nlp_flag = True + else: + nlp_flag = False + + ############## S E T T I N G U P DEEP_WIDE, DEEP_CROSS, FAST MODELS ######################## + cats = var_df['categorical_vars'] ### these are low cardinality vars - you can one-hot encode them ## + high_string_vars = var_df['discrete_string_vars'] ## discrete_string_vars are high cardinality vars ## embed them! + bools = var_df['bools'] + int_cats = var_df['int_cats'] + var_df['int_bools'] + ints = var_df['int_vars'] + floats = var_df['continuous_vars'] + nlps = var_df['nlp_vars'] + lats = var_df['lat_vars'] + lons = var_df['lon_vars'] + floats = left_subtract(floats, lats+lons) + + FEATURE_NAMES = bools + cats + high_string_vars + int_cats + ints + floats + NUMERIC_FEATURE_NAMES = int_cats + ints + FLOATS = floats + bools + CATEGORICAL_FEATURE_NAMES = cats + high_string_vars + + vocab_dict = defaultdict(list) + cats_copy = copy.deepcopy(CATEGORICAL_FEATURE_NAMES+NUMERIC_FEATURE_NAMES) + if len(cats_copy) > 0: + for each_name in cats_copy: + vocab_dict[each_name] = cat_vocab_dict[each_name]['vocab'] + + floats_copy = copy.deepcopy(FLOATS) + if len(floats_copy) > 0: + for each_float in floats_copy: + vocab_dict[each_float] = cat_vocab_dict[each_float]['vocab_min_var'] + + ###################### set some defaults for model parameters here ############## + keras_options, model_options, num_predicts, output_activation = get_model_defaults(keras_options, + model_options, targets) + ###### This is where you compile the model after it is built ############### + num_classes = model_options["num_classes"] + num_labels = model_options["num_labels"] + modeltype = model_options["modeltype"] + val_mode = keras_options["mode"] + val_monitor = keras_options["monitor"] + val_loss = keras_options["loss"] + val_metrics = keras_options["metrics"] + learning_rate = 5e-2 + ############################################################################ + print('Creating a(n) %s Functional model...' 
%keras_model_type) + try: + print(' number of outputs = %s, output_activation = %s' %( + num_labels, output_activation)) + print(' loss function: %s' %str(val_loss).split(".")[-1].split(" ")[0]) + except: + print(' loss fn = %s number of outputs = %s, output_activation = %s' %( + val_loss, num_labels, output_activation)) + try: + #optimizer = return_optimizer(keras_options['optimizer']) + ### you should use only string names for optimizers if strategy.scope is used + optimizer = keras_options['optimizer'] + except: + ##### set some default optimizers here for model parameters here ## + if not keras_options['optimizer']: + optimizer = keras.optimizers.SGD(learning_rate) + elif keras_options["optimizer"] in ['RMS', 'RMSprop']: + optimizer = keras.optimizers.RMSprop(learning_rate) + elif keras_options['optimizer'] in ['Adam', 'adam', 'ADAM', 'NADAM', 'Nadam']: + optimizer = keras.optimizers.Adam(learning_rate) + else: + optimizer = keras.optimizers.Adagrad(learning_rate) + keras_options['optimizer'] = optimizer + print(' initial learning rate = %s' %learning_rate) + print(' initial optimizer = %s' %str(optimizer).split(".")[-1].split(" ")[0]) + ################################################################################### + dense_layer1, dense_layer2, dense_layer3 = get_hidden_layers(data_dim) + print(' Recommended hidden layers (with units in each Dense Layer) = (%d, %d, %d)\n' %( + dense_layer1,dense_layer2,dense_layer3)) + fast_models = ['fast'] + fast_models1 = ['deep_and_wide','deep_wide','wide_deep', + 'wide_and_deep','deep wide', 'wide deep', 'fast1'] + fast_models2 = ['deep_and_cross', 'deep_cross', 'deep cross', 'fast2'] + nlp_models = ['bert', 'use', 'text', 'mixed_nlp'] + #### The Deep and Wide Model is a bit more complicated. So it needs some changes in inputs! ###### + prebuilt_models = ['basic', 'simple', 'default','dnn','reg_dnn', 'deep', 'big deep', + 'dnn_drop', 'big_deep', 'giant_deep', 'giant deep', + 'cnn1', 'cnn','cnn2'] + ###### Just do a simple check for auto models here #################### + preds = cat_vocab_dict["predictors_in_train"] + NON_NLP_VARS = left_subtract(preds, nlps) + if keras_model_type.lower() in fast_models+fast_models1+prebuilt_models+fast_models2+nlp_models: + if len(NON_NLP_VARS) == 0: + ## there are no non-NLP vars in dataset then just use NLP outputs Only + all_inputs = nlp_inputs + meta_outputs = nlp_outputs + model_body = Sequential([layers.Dense(dense_layer3, activation='relu')]) + model_body = add_outputs_to_model_body(model_body, meta_outputs) + model_body = get_compiled_model(all_inputs, model_body, output_activation, num_predicts, + modeltype, optimizer, val_loss, val_metrics, cols_len, targets) + print(' %s model loaded and compiled successfully...' %keras_model_type) + print(model_body.summary()) + return model_body, keras_options + else: + all_inputs = nlp_inputs + meta_inputs + else: + ### this means it's an auto model and you create one here + print(' creating %s model body...' 
%keras_model_type) + #num_layers = check_keras_options(keras_options, 'num_layers', 1) + model_body = tf.keras.Sequential([]) + #for l_ in range(num_layers): + # model_body.add(layers.Dense(dense_layer1, activation='selu', kernel_initializer="lecun_normal", + # activity_regularizer=tf.keras.regularizers.l2(0.01))) + return model_body, keras_options + ########################## This is for non-auto models ##################################### + if isinstance(use_my_model, str) : + if use_my_model == "": + if keras_model_type.lower() in ['basic', 'simple', 'default','sample model']: + ########## Now that we have setup the layers correctly, we can build some more hidden layers + model_body = basic.model + elif keras_model_type.lower() in ['reg_dnn', 'deep']: + ########## Now that we have setup the layers correctly, we can build some more hidden layers + model_body = reg_dnn.model + elif keras_model_type.lower() in ['dnn', 'simple_dnn']: + ########## Now that we have setup the layers correctly, we can build some more hidden layers + model_body = dnn.model + elif keras_model_type.lower() in ['dnn_drop', 'big_deep', 'big deep']: + #################################################### + model_body = dnn_drop.model + elif keras_model_type.lower() in ['giant', 'giant_deep', 'giant deep']: + #################################################### + model_body = giant_deep.model + elif keras_model_type.lower() in ['cnn', 'cnn1','cnn2']: + ########## Now that we have setup the layers correctly, we can build some more hidden layers + # Conv1D + global max pooling + if keras_model_type.lower() in ['cnn', 'cnn1']: + model_body = cnn1.model + else: + model_body = cnn2.model + ###### You have to do this for all prebuilt models #################### + if keras_model_type.lower() in prebuilt_models: + print('Adding inputs and outputs to a pre-built %s model...' %keras_model_type) + if not isinstance(meta_outputs, list): + model_body = add_outputs_to_model_body(model_body, meta_outputs) + else: + model_body = add_outputs_to_auto_model_body(model_body, meta_outputs, nlp_flag) + #### This final outputs is the one that is taken into final dense layer and compiled + print(' %s model loaded successfully. Now compiling model...' %keras_model_type) + if keras_model_type.lower() in fast_models: + ########## This is a simple fast model ######################### + dropout_rate = 0.1 + hidden_units = [dense_layer2, dense_layer3] + inputs = create_fast_inputs(FEATURE_NAMES, NUMERIC_FEATURE_NAMES, FLOATS) + features = encode_fast_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, vocab_dict, + use_embedding=False) + for units in hidden_units: + features = layers.Dense(units)(features) + features = layers.BatchNormalization()(features) + features = layers.ReLU()(features) + features = layers.Dropout(dropout_rate)(features) + all_inputs = list(inputs.values()) ### convert input layers to a list + if len(nlps) > 0: + print('Starting NLP string column layer preprocessing...') + nlp_inputs = create_nlp_inputs(nlps) + max_tokens_zip, seq_tokens_zip, embed_tokens_zip, vocab_train_small = aggregate_nlp_dictionaries(nlps, cat_vocab_dict, model_options) + nlp_outputs = encode_nlp_inputs(nlp_inputs, cat_vocab_dict) + ### we call nlp_outputs as embedding in this section of the program #### + print(' NLP Preprocessing completed.') + nlp_inputs = list(nlp_inputs.values()) + all_inputs += nlp_inputs + model_body = layers.concatenate([features, nlp_outputs]) + print(' %s combined non-nlp and nlp outputs successfully...' 
%keras_model_type) + else: + model_body = features + print(' %s combined non-nlp outputs successfully...' %keras_model_type) + elif keras_model_type.lower() in fast_models1: + ############################################################################################### + # In a Wide & Deep model, the wide part of the model is a linear model, while the deep + # part of the model is a multi-layer feed-forward network. We use the sparse representation + # of the input features in the wide part of the model and the dense representation of the + # input features for the deep part of the model. + # VERY IMPORTANT TO NOTE that every input features contributes to both parts of the model with + # different representations. + ############################################################################################### + dropout_rate = 0.1 + hidden_units = [dense_layer2, dense_layer3] + inputs = create_all_inputs(FEATURE_NAMES, NUMERIC_FEATURE_NAMES, FLOATS) + wide = encode_all_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, vocab_dict, + use_embedding=False) + wide = layers.BatchNormalization()(wide) + deep = encode_all_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, vocab_dict, + use_embedding=True) + for units in hidden_units: + deep = layers.Dense(units)(deep) + deep = layers.BatchNormalization()(deep) + deep = layers.ReLU()(deep) + deep = layers.Dropout(dropout_rate)(deep) + #### If there are NLP vars in dataset, you must combine them ## + all_inputs = list(inputs.values()) ### convert input layers to a list + if len(nlps) > 0: + print('Starting NLP string column layer preprocessing...') + nlp_inputs = create_nlp_inputs(nlps) + max_tokens_zip, seq_tokens_zip, embed_tokens_zip, vocab_train_small = aggregate_nlp_dictionaries(nlps, cat_vocab_dict, model_options) + nlp_outputs = encode_nlp_inputs(nlp_inputs, cat_vocab_dict) + ### we call nlp_outputs as embedding in this section of the program #### + print(' NLP Preprocessing completed.') + nlp_inputs = list(nlp_inputs.values()) + all_inputs += nlp_inputs + model_body = layers.concatenate([wide, deep, nlp_outputs]) + print(' %s combined wide, deep and nlp outputs successfully...' %keras_model_type) + else: + model_body = layers.concatenate([wide, deep]) + print(' %s combined wide and deep successfully...' %keras_model_type) + elif keras_model_type.lower() in fast_models2: + ############################################################################################### + # In a Deep & Cross model, the deep part of this model is the same as the deep part + # created in the previous model. The key idea of the cross part is to apply explicit + # feature crossing in an efficient way, where the degree of cross features grows with layer depth. 
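The cross part described in the comment above boils down to the recurrence x_{l+1} = x0 * (W_l x_l + b_l) + x_l, which the patch applies a few lines further down (`cross = x0 * x + cross`). A minimal standalone sketch of that same recurrence, with made-up shapes and layer count, not part of the patch itself:

```python
import tensorflow as tf
from tensorflow.keras import layers

# toy "encoded inputs": a batch of 4 rows with 8 features (purely illustrative)
x0 = tf.random.normal((4, 8))

# cross network: x_{l+1} = x0 * (W_l x_l + b_l) + x_l
cross = x0
for _ in range(3):
    x = layers.Dense(cross.shape[-1])(cross)      # W_l x_l + b_l
    cross = x0 * x + cross                        # explicit feature crossing plus residual
    cross = layers.BatchNormalization()(cross)

print(cross.shape)  # (4, 8) -- the cross part keeps the width of the encoded inputs
```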
+ ############################################################################################### + dropout_rate = 0.1 + hidden_units = [dense_layer2, dense_layer3] + #hidden_units = [dense_layer3] + inputs = create_all_inputs(FEATURE_NAMES, NUMERIC_FEATURE_NAMES, FLOATS) + x0 = encode_all_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, vocab_dict, + use_embedding=True) + cross = x0 + for _ in hidden_units: + units = cross.shape[-1] + x = layers.Dense(units)(cross) + cross = x0 * x + cross + cross = layers.BatchNormalization()(cross) + + deep = x0 + for units in hidden_units: + deep = layers.Dense(units)(deep) + deep = layers.BatchNormalization()(deep) + deep = layers.ReLU()(deep) + deep = layers.Dropout(dropout_rate)(deep) + #### If there are NLP vars in dataset, you must combine them ## + all_inputs = list(inputs.values()) ### convert input layers to a list + if len(nlps) > 0: + print('Starting NLP string column layer preprocessing...') + nlp_inputs = create_nlp_inputs(nlps) + max_tokens_zip, seq_tokens_zip, embed_tokens_zip, vocab_train_small = aggregate_nlp_dictionaries(nlps, cat_vocab_dict, model_options) + nlp_outputs = encode_nlp_inputs(nlp_inputs, cat_vocab_dict) + ### we call nlp_outputs as embedding in this section of the program #### + print(' NLP Preprocessing completed.') + nlp_inputs = list(nlp_inputs.values()) + all_inputs += nlp_inputs + model_body = layers.concatenate([cross, deep, nlp_outputs]) + print(' %s combined wide, deep and nlp outputs successfully...' %keras_model_type) + else: + model_body = layers.concatenate([cross, deep]) + print(' %s combined wide and deep successfully...' %keras_model_type) + ################################################################################ + elif keras_model_type.lower() in nlp_models: + print(' creating %s model body...' %keras_model_type) + num_layers = check_keras_options(keras_options, 'num_layers', 1) + model_body = tf.keras.Sequential([]) + for l_ in range(num_layers): + model_body.add(layers.Dense(dense_layer1, activation='selu', kernel_initializer="lecun_normal", + #activity_regularizer=tf.keras.regularizers.l2(0.01) + )) + print('Adding inputs and outputs to a pre-built %s model...' %keras_model_type) + if not isinstance(meta_outputs, list): + model_body = add_outputs_to_model_body(model_body, meta_outputs) + else: + model_body = add_outputs_to_auto_model_body(model_body, meta_outputs, nlp_flag) + #### This final outputs is the one that is taken into final dense layer and compiled + else: + try: + new_module = __import__(use_my_model) + print('Using the model given as input to build model body...') + model_body = new_module.model + print(' Loaded model from %s file successfully...' %use_my_model) + except: + print(' Loading %s model is erroring, hence building a simple sequential model with one layer...' %keras_model_type) + ########## In case none of the loading of files works, then set up a simple model! + model_body = Sequential([layers.Dense(dense_layer1, activation='relu')]) + ############ This is what you need to add to pre-built model body shells ### + print('Adding inputs and outputs to a pre-built %s model...' %keras_model_type) + if not isinstance(meta_outputs, list): + model_body = add_outputs_to_model_body(model_body, meta_outputs) + else: + model_body = add_outputs_to_auto_model_body(model_body, meta_outputs, nlp_flag) + #### This final outputs is the one that is taken into final dense layer and compiled + print(' %s model loaded successfully. Now compiling model...' 
%keras_model_type) + else: + print(' Using your custom model given as input...') + model_body = use_my_model + ############ This is what you need to add to pre-built model body shells ### + print('Adding inputs and outputs to a pre-built %s model...' %keras_model_type) + if not isinstance(meta_outputs, list): + model_body = add_outputs_to_model_body(model_body, meta_outputs) + else: + model_body = add_outputs_to_auto_model_body(model_body, meta_outputs, nlp_flag) + #### This final outputs is the one that is taken into final dense layer and compiled + print(' %s model loaded successfully. Now compiling model...' %keras_model_type) + ############# You need to compile the non-auto models here ############### + + model_body = get_compiled_model(all_inputs, model_body, output_activation, num_predicts, + modeltype, optimizer, val_loss, val_metrics, cols_len, targets) + print(' %s model loaded and compiled successfully...' %keras_model_type) + if cols_len > 100: + print('Too many columns to show model summary. Continuing...') + else: + print(model_body.summary()) + return model_body, keras_options +############################################################################### +def return_optimizer(hpq_optimizer): + """ + This returns the keras optimizer with proper inputs if you send the string. + hpq_optimizer: input string that stands for an optimizer such as "Adam", etc. + """ + learning_rate_set = 5e-2 + ##### These are the various optimizers we use ################################ + momentum = keras.optimizers.SGD(lr=learning_rate_set, momentum=0.9) + nesterov = keras.optimizers.SGD(lr=learning_rate_set, momentum=0.9, nesterov=True) + adagrad = keras.optimizers.Adagrad(lr=learning_rate_set) + rmsprop = keras.optimizers.RMSprop(lr=learning_rate_set, rho=0.9) + adam = keras.optimizers.Adam(lr=learning_rate_set, beta_1=0.9, beta_2=0.999) + adamax = keras.optimizers.Adamax(lr=learning_rate_set, beta_1=0.9, beta_2=0.999) + nadam = keras.optimizers.Nadam(lr=learning_rate_set, beta_1=0.9, beta_2=0.999) + best_optimizer = '' + ############################################################################# + #### This could be turned into a dictionary but for now leave is as is for readability ## + if hpq_optimizer.lower() == 'adam': + best_optimizer = adam + elif hpq_optimizer.lower() == 'sgd': + best_optimizer = momentum + elif hpq_optimizer.lower() == 'nadam': + best_optimizer = nadam + elif hpq_optimizer.lower() == 'adamax': + best_optimizer = adamax + elif hpq_optimizer.lower() == 'adagrad': + best_optimizer = adagrad + elif hpq_optimizer.lower() == 'rmsprop': + best_optimizer = rmsprop + else: + best_optimizer = nesterov + return best_optimizer +########################################################################################## diff --git a/build/lib/deep_autoviml/modeling/one_cycle.py b/build/lib/deep_autoviml/modeling/one_cycle.py new file mode 100644 index 0000000..2723186 --- /dev/null +++ b/build/lib/deep_autoviml/modeling/one_cycle.py @@ -0,0 +1,139 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. 
+#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import tensorflow as tf +import numpy as np +import matplotlib.pyplot as plt +import logging + +logging.getLogger('tensorflow').setLevel(logging.ERROR) + +from tensorflow.keras.callbacks import Callback +######################################################################################################### +###### One Cycle is a Super-Convergence technique developed by Leslie Smith: https://arxiv.org/abs/1708.07120 +###### Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates +###### This particular implementation is by Andrich van Wyk • September 02, 2019 +###### Used with permission: https://www.avanwyk.com/tensorflow-2-super-convergence-with-the-1cycle-policy/ +######################################################################################################### +class CosineAnnealer: + + def __init__(self, start, end, steps): + self.start = start + self.end = end + self.steps = steps + self.n = 0 + + def step(self): + self.n += 1 + cos = np.cos(np.pi * (self.n / self.steps)) + 1 + return self.end + (self.start - self.end) / 2. * cos + + +class OneCycleScheduler(Callback): + """ + ######################################################################################################### + ###### One Cycle is a Super-Convergence technique developed by Leslie Smith: https://arxiv.org/abs/1708.07120 + ###### Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates + ###### This particular implementation is by Andrich van Wyk • September 02, 2019 + ###### Credit: https://www.avanwyk.com/tensorflow-2-super-convergence-with-the-1cycle-policy/ + ######################################################################################################### + Callback that schedules the learning rate on a 1cycle policy as per Leslie Smith's paper(https://arxiv.org/pdf/1803.09820.pdf). + If the model supports a momentum parameter, it will also be adapted by the schedule. + The implementation adopts additional improvements as per the fastai library: https://docs.fast.ai/callbacks.one_cycle.html, where + only two phases are used and the adaptation is done using cosine annealing. 
+ """ + + def __init__(self, lr_max, steps, mom_min=0.85, mom_max=0.95, phase_1_pct=0.3, div_factor=25.): + super(OneCycleScheduler, self).__init__() + lr_min = lr_max / div_factor + final_lr = lr_max / (div_factor * 1e4) + phase_1_steps = steps * phase_1_pct + phase_2_steps = steps - phase_1_steps + + self.phase_1_steps = phase_1_steps + self.phase_2_steps = phase_2_steps + self.phase = 0 + self.step = 0 + + self.phases = [[CosineAnnealer(lr_min, lr_max, phase_1_steps), CosineAnnealer(mom_max, mom_min, phase_1_steps)], + [CosineAnnealer(lr_max, final_lr, phase_2_steps), CosineAnnealer(mom_min, mom_max, phase_2_steps)]] + + self.lrs = [] + self.moms = [] + + def on_train_begin(self, logs=None): + self.phase = 0 + self.step = 0 + + self.set_lr(self.lr_schedule().start) + self.set_momentum(self.mom_schedule().start) + + def on_train_batch_begin(self, batch, logs=None): + self.lrs.append(self.get_lr()) + self.moms.append(self.get_momentum()) + + def on_train_batch_end(self, batch, logs=None): + self.step += 1 + if self.step >= self.phase_1_steps: + self.phase = 1 + self.set_lr(self.lr_schedule().step()) + self.set_momentum(self.mom_schedule().step()) + + def get_lr(self): + try: + return tf.keras.backend.get_value(self.model.optimizer.lr) + except AttributeError: + return None + + def get_momentum(self): + try: + return tf.keras.backend.get_value(self.model.optimizer.momentum) + except AttributeError: + return None + + def set_lr(self, lr): + try: + if lr < 0: + lr = 0.1 + self.phase = 0 + self.step = 0 + + self.set_lr(self.lr_schedule().start) + self.set_momentum(self.mom_schedule().start) + tf.keras.backend.clear_session() + + tf.keras.backend.set_value(self.model.optimizer.lr, lr) + except AttributeError: + pass # ignore + + def set_momentum(self, mom): + try: + tf.keras.backend.set_value(self.model.optimizer.momentum, mom) + except AttributeError: + pass # ignore + + def lr_schedule(self): + return self.phases[self.phase][0] + + def mom_schedule(self): + return self.phases[self.phase][1] + + def plot(self): + ax = plt.subplot(1, 2, 1) + ax.plot(self.lrs) + ax.set_title('Learning Rate') + ax = plt.subplot(1, 2, 2) + ax.plot(self.moms) + ax.set_title('Momentum') \ No newline at end of file diff --git a/build/lib/deep_autoviml/modeling/predict_model.py b/build/lib/deep_autoviml/modeling/predict_model.py new file mode 100644 index 0000000..c1e2cb3 --- /dev/null +++ b/build/lib/deep_autoviml/modeling/predict_model.py @@ -0,0 +1,733 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import pandas as pd +import numpy as np +pd.set_option('display.max_columns',500) +import matplotlib.pyplot as plt +import tempfile +import pdb +import copy +import warnings +warnings.filterwarnings(action='ignore') +import functools +# Make numpy values easier to read. 
+np.set_printoptions(precision=3, suppress=True) +############################################################################################ +# TensorFlow ≥2.4 is required +import tensorflow as tf + +np.random.seed(42) +tf.random.set_seed(42) +from tensorflow.keras import layers +from tensorflow import keras +from tensorflow.keras.layers.experimental.preprocessing import Normalization, StringLookup, CategoryCrossing +from tensorflow.keras.layers.experimental.preprocessing import IntegerLookup, CategoryEncoding +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization, Discretization, Hashing +from tensorflow.keras.layers import Embedding, Reshape, Dropout, Dense + +from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras.models import Model, load_model +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers + +############################################################################################ +# data pipelines +from deep_autoviml.data_load.classify_features import classify_features_using_pandas + +from deep_autoviml.data_load.extract import fill_missing_values_for_TF2 +from deep_autoviml.utilities.utilities import print_one_row_from_tf_dataset, print_one_row_from_tf_label +############################################################################################ +from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error +from IPython.core.display import Image, display +import pickle +##### Suppress all TF2 and TF1.x warnings ################### +##### Suppress all TF2 and TF1.x warnings ################### +tf2logger = tf.get_logger() +tf2logger.warning('Silencing TF2.x warnings') +tf2logger.root.removeHandler(tf2logger.root.handlers) +tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) +############################################################################################ +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D +from tensorflow.keras.layers import AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D +from tensorflow.keras.layers import GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +######################################################################################### +import os +import pickle +import time +############################################################################################ +def load_test_data(test_data_or_file, project_name, cat_vocab_dict="", + verbose=0): + """ + Load a CSV file and given a project name, it will load the artifacts in project_name folder. + Optionally you can provide the artifacts dictionary as "cat_vocab_dict" in this input. + + Outputs: + -------- + data_batches: a tf.data.Dataset which will be Repeat batched dataset + cat_vocab_dict: artifacts dictionary that you can feed to the predict function of model. 
+ """ + ### load a small sample of data into a pandas dataframe ## + if isinstance(test_data_or_file, str): + test_small = pd.read_csv(test_data_or_file) ### this reads the entire file + else: + test_small = copy.deepcopy(test_data_or_file) + filesize = test_small.shape[0] + print('Loaded test data size: %d' %filesize) + #### All column names in Tensorflow should have no spaces ! So you must convert them here! + sel_preds = ["_".join(x.split(" ")) for x in list(test_small) ] + sel_preds = ["_".join(x.split("(")) for x in sel_preds ] + sel_preds = ["_".join(x.split(")")) for x in sel_preds ] + sel_preds = ["_".join(x.split("/")) for x in sel_preds ] + sel_preds = ["_".join(x.split("\\")) for x in sel_preds ] + sel_preds = [x.lower() for x in sel_preds ] + + test_small.columns = sel_preds + + print('Alert! Modified column names to satisfy rules for column names in Tensorflow...') + + ################### if cat_vocab_dict is not given, load it #### + no_cat_vocab_dict = False + if not cat_vocab_dict: + ### You must load it from disk ### + try: + pickle_path = os.path.join(project_name, "cat_vocab_dict.pickle") + print('\nLoading cat_vocab_dict file using pickle in %s...' %pickle_path) + cat_vocab_dict = pickle.load(open(pickle_path,"rb")) + print(' Loaded pickle file in %s' %pickle_path) + except: + print('Unable to load pickle file. Continuing...') + no_cat_vocab_dict = True + #################################################### + if not no_cat_vocab_dict: + target = cat_vocab_dict['target_variables'] + usecols = cat_vocab_dict['target_variables'] + if len(target) == 1: + target_name = target[0] + else: + target_name = target + else: + target = [] + target_name = '' + print('no target variable found since model artifacts dictionary could not be found') + ### classify variables using the small dataframe ## + model_options = {} + + if no_cat_vocab_dict: + model_options['DS_LEN'] = 10000 ### Just set some default ####### + ###### Just send in entire dataframe to convert and correct dtypes using this function ## + ###### If you don't do this, in some data sets due to mixed types it errors ### + ###### Just send in target_name as '' since we want even target to be corrected if it + ##### has the wrong data type since tensorflow automatically detects data types. + test_small, var_df, cat_vocab_dict = classify_features_using_pandas(test_small, target='', + model_options=model_options, verbose=verbose) + ########## Just transfer all the values from var_df to cat_vocab_dict ########## + for each_key in var_df: + cat_vocab_dict[each_key] = var_df[each_key] + #################################################################################### + else: + ###### Just send in entire dataframe to convert and correct dtypes using this function ## + ###### If you don't do this, in some data sets due to mixed types it errors ### + ###### Just send in target_name as '' since we want even target to be corrected if it + ##### has the wrong data type since tensorflow automatically detects data types. 
+ model_options['DS_LEN'] = cat_vocab_dict['DS_LEN'] ### you need this to classify features + test_small, _, _ = classify_features_using_pandas(test_small, target='', + model_options=model_options, verbose=verbose) + ############ Now load the file or dataframe into tf.data.DataSet here ################# + + preds = list(test_small) + #batch_size = 64 ## artificially set a size ### + batch_size = cat_vocab_dict["batch_size"] + cat_vocab_dict["DS_LEN"] = filesize + num_epochs = None + ### Initially set this batch_size very high so that you can get the max(), min() and vocab to be realistic + if isinstance(test_data_or_file, str): + #### Set column defaults while reading dataset from CSV files - that way, missing values avoided! + ### The following are valid CSV dtypes for missing values: float32, float64, int32, int64, or string + ### fill all missing values in categorical variables with "None" + ### Similarly. fill all missing values in float variables with -99 + if test_small.isnull().sum().sum() > 0: + print('There are %d missing values in dataset - filling with default values...' %( + test_small.isnull().sum().sum())) + string_cols = test_small.select_dtypes(include='object').columns.tolist() + test_small.select_dtypes( + include='category').columns.tolist() + integer_cols = test_small.select_dtypes(include='integer').columns.tolist() + float_cols = test_small.select_dtypes(include='float').columns.tolist() + column_defaults = [-99.0 if x in float_cols else -99 if x in integer_cols else "missing" for x in test_small] + #### Once the missing data is filled, it's ready to load into tf.data.DataSet ############ + data_batches = tf.data.experimental.make_csv_dataset(test_data_or_file, + batch_size=batch_size, + column_names=preds, + label_name=None, + num_epochs = num_epochs, + column_defaults=column_defaults, + shuffle=False, + num_parallel_reads=tf.data.experimental.AUTOTUNE) + else: + #### This is to load dataframes into datasets ######################## + if test_small.isnull().sum().sum() > 0: + test_small = fill_missing_values_for_TF2(test_small, cat_vocab_dict) + + drop_cols = cat_vocab_dict['columns_deleted'] + if len(drop_cols) > 0: + print(' Dropping %s columns from dataset...' %drop_cols) + try: + test_small.drop(drop_cols, axis=1, inplace=True) + #### In some datasets, due to mixed data types in test_small, this next line errors. Beware!! + except: + print(' in some datasets, due to mixed data types in test, this errors. 
Continuing...') + data_batches = tf.data.Dataset.from_tensor_slices(dict(test_small)) + ### batch it if you are creating it from a dataframe + data_batches = data_batches.batch(batch_size, drop_remainder=False).repeat() + + print(' test data loaded successfully.') + + if verbose >= 1: + try: + print_one_row_from_tf_dataset(data_batches) + except: + pass + #### These are the input variables for which we are going to create keras.Inputs ###\ + return data_batches, cat_vocab_dict, test_small +################################################################################################# +def lenopenreadlines(filename): + with open(filename) as f: + return len(f.readlines()) +################################################################################################# +def find_batch_size(DS_LEN): + ### Since you cannot deal with a very large dataset in pandas, let's look into how big the file is + maxrows = 10000 + if DS_LEN < 100: + batch_ratio = 0.16 + elif DS_LEN >= 100 and DS_LEN < 1000: + batch_ratio = 0.05 + elif DS_LEN >= 1000 and DS_LEN < 10000: + batch_ratio = 0.01 + elif DS_LEN >= maxrows and DS_LEN <= 100000: + batch_ratio = 0.001 + else: + batch_ration = 0.0001 + batch_len = int(batch_ratio*DS_LEN) + print(' Batch size selected as %d' %batch_len) + return batch_len +############################################################################################### +class BalancedSparseCategoricalAccuracy(keras.metrics.SparseCategoricalAccuracy): + def __init__(self, name='balanced_sparse_categorical_accuracy', dtype=None): + super().__init__(name, dtype=dtype) + + def update_state(self, y_true, y_pred, sample_weight=None): + y_flat = y_true + if y_true.shape.ndims == y_pred.shape.ndims: + y_flat = tf.squeeze(y_flat, axis=[-1]) + y_true_int = tf.cast(y_flat, tf.int32) + + cls_counts = tf.math.bincount(y_true_int) + cls_counts = tf.math.reciprocal_no_nan(tf.cast(cls_counts, self.dtype)) + weight = tf.gather(cls_counts, y_true_int) + return super().update_state(y_true, y_pred, sample_weight=weight) +##################################################################################### +def load_model_dict(model_or_model_path, cat_vocab_dict, project_name, keras_model_type): + start_time = time.time() + if not cat_vocab_dict: + ### No cat_vocab_dict is given. Hence you must load it from disk ### + print('\nNo model artifacts file given. Loading cat_vocab_dict file using pickle. Will take time...') + if isinstance(model_or_model_path, str): + if model_or_model_path: + try: + pickle_path = os.path.join(model_or_model_path,os.path.join("artifacts", "cat_vocab_dict.pickle")) + cat_vocab_dict = pickle.load(open(pickle_path,"rb")) + print(' Loaded pickle file in %s' %pickle_path) + except: + print('Unable to load model and data artifacts cat_vocab_dictionary file. Returning...') + return [] + modeltype = cat_vocab_dict['modeltype'] + else: + try: + ### Since model_path is not given, we will use project_name and keras_model_type to find it ## + pickle_path = os.path.join(project_name, keras_model_type) + list_folders = os.listdir(pickle_path) + folder_path = list_folders[0] + pickle_path = os.path.join(pickle_path, folder_path) + pickle_path = os.path.join(pickle_path, "artifacts") + print('Selecting first artifacts file in folder %s. Change model path if you want different.' 
%folder_path) + pickle_path = os.path.join(pickle_path, "cat_vocab_dict.pickle") + cat_vocab_dict = pickle.load(open(pickle_path,"rb")) + print(' Loaded pickle file in %s' %pickle_path) + modeltype = cat_vocab_dict['modeltype'] + except: + print('Error: No path for model artifacts such as model_path given. Returning') + return + else: + ### cat_vocab_dictionary is given ##### + modeltype = cat_vocab_dict['modeltype'] + ### Check if model is available to be loaded ####### + print('\nLoading deep_autoviml model from %s folder. This will take time...' %model_or_model_path) + if isinstance(model_or_model_path, str): + try: + if model_or_model_path == "": + model_or_model_path = os.path.join(project_name, keras_model_type) + else: + if modeltype == 'Regression': + model = tf.keras.models.load_model(os.path.join(model_or_model_path)) + else: + model = tf.keras.models.load_model(os.path.join(model_or_model_path), + custom_objects={'BalancedSparseCategoricalAccuracy': BalancedSparseCategoricalAccuracy}) + except Exception as error: + print('Could not load given model.\nError: %s\n Please check your model path and try again.' %error) + return + else: + print('\nUsing %s model provided as input...' %model_or_model_path) + model = model_or_model_path + print('Time taken to load saved model = %0.0f seconds' %((time.time()-start_time))) + return model, cat_vocab_dict +################################################################################################### +########## THIS IS THE MAIN PREDICT() FUNCTION ##################################### +################################################################################################### +def predict(model_or_model_path, project_name, test_dataset, + keras_model_type, cat_vocab_dict="", verbose=0): + start_time2 = time.time() + model, cat_vocab_dict = load_model_dict(model_or_model_path, cat_vocab_dict, project_name, keras_model_type) + ##### load the test data set here ####### + if keras_model_type.lower() in ['nlp', 'text']: + NLP_VARS = cat_vocab_dict['predictors_in_train'] + else: + NLP_VARS = cat_vocab_dict['nlp_vars'] + ################################################################ + @tf.function + def process_NLP_features(features): + """ + This is how you combine all your string NLP features into a single new feature. + Then you can perform embedding on this combined feature. + It takes as input an ordered dict named features and returns the same features format. + """ + return tf.strings.reduce_join([features[i] for i in NLP_VARS],axis=0, + keepdims=False, separator=' ', name="combined") + ################################################################ + NLP_COLUMN = "combined_nlp_text" + ################################################################ + @tf.function + def combine_nlp_text(features): + ##use x to derive additional columns u want. Set the shape as well + y = {} + y.update(features) + y[NLP_COLUMN] = tf.strings.reduce_join([features[i] for i in NLP_VARS],axis=0, + keepdims=False, separator=' ') + return y + ################################################################ + if isinstance(test_dataset, str): + test_ds, cat_vocab_dict2, test_small = load_test_data(test_dataset, project_name=project_name, + cat_vocab_dict=cat_vocab_dict, verbose=verbose) + ### You have to load only the NLP or text variables into dataset. 
otherwise, it will fail during predict + batch_size = cat_vocab_dict2["batch_size"] + if NLP_VARS: + if keras_model_type.lower() in ['nlp', 'text']: + test_ds = test_ds.map(process_NLP_features) + test_ds = test_ds.unbatch().batch(batch_size) + print(' combined NLP or text vars: %s into a single feature successfully' %NLP_VARS) + else: + test_ds = test_ds.map(combine_nlp_text) + print(' combined NLP or text vars: %s into a single feature successfully' %NLP_VARS) + else: + print('No NLP vars in data set. No preprocessing done.') + DS_LEN = cat_vocab_dict2["DS_LEN"] + print("test data size = ",DS_LEN, ', batch_size = ',batch_size) + elif isinstance(test_dataset, pd.DataFrame) or isinstance(test_dataset, pd.Series): + if keras_model_type.lower() in ['nlp', 'text']: + #### You must only load the text or nlp columns into the dataset. Otherwise, it will fail during predict. + test_dataset = test_dataset[NLP_VARS] + test_ds, cat_vocab_dict2, test_small = load_test_data(test_dataset, project_name=project_name, + cat_vocab_dict=cat_vocab_dict, verbose=verbose) + batch_size = cat_vocab_dict2["batch_size"] + DS_LEN = cat_vocab_dict2["DS_LEN"] + print("test data size = ",DS_LEN, ', batch_size = ',batch_size) + if NLP_VARS: + if keras_model_type.lower() in ['nlp', 'text']: + test_ds = test_ds.map(process_NLP_features) + test_ds = test_ds.unbatch().batch(batch_size) + print(' processed NLP and text vars: %s successfully' %NLP_VARS) + else: + test_ds = test_ds.map(combine_nlp_text) + print(' combined NLP or text vars: %s into a single combined_nlp_text successfully' %NLP_VARS) + else: + print('No NLP vars in data set. No preprocessing done.') + else: + ### It must be a tf.data.Dataset hence just load it as is #### + if cat_vocab_dict: + DS_LEN = cat_vocab_dict["DS_LEN"] + batch_size = cat_vocab_dict["batch_size"] + else: + print('Since artifacts dictionary (cat_vocab_dict) not provided, using 100,000 as default test data size and batch size as 64.') + print(' if you want to modify them, send in cat_vocab_dict["DS_LEN"] and cat_vocab_dict["batch_size"]') + DS_LEN = 100000 + batch_size = 64 + test_ds = test_dataset + if NLP_VARS: + if keras_model_type.lower() in ['nlp', 'text']: + test_ds = test_ds.map(process_NLP_features) + test_ds = test_ds.unbatch().batch(batch_size) + print(' processed NLP and text vars: %s successfully' %NLP_VARS) + else: + test_ds = test_ds.map(combine_nlp_text) + print(' combined NLP or text vars: %s into a single combined_nlp_text successfully' %NLP_VARS) + else: + print('No NLP vars in data set. No preprocessing done.') + cat_vocab_dict2 = copy.deepcopy(cat_vocab_dict) + ################################################################################## + if cat_vocab_dict2['bools_converted']: + BOOLS = [] + print('Boolean cols=%s converted to strings' %BOOLS) + else: + BOOLS = cat_vocab_dict2['bools'] + print('Boolean cols=%s not converted to strings' %BOOLS) + ################################################################################# + #@tf.function + @tf.autograph.experimental.do_not_convert + def process_boolean_features(features_copy): + """ + This is how you convert all your boolean features into float variables. + The reason you have to do this is because tf.keras does not know how to handle boolean types. + It takes as input an ordered dict named features and returns the same in features format. + """ + for feature_name in features_copy: + if feature_name in BOOLS: + # Cast boolean feature values to string. 
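Both `process_NLP_features` and `combine_nlp_text` above lean on `tf.strings.reduce_join` to collapse several string columns into one text feature. A tiny standalone sketch of that call (the feature names here are made up):

```python
import tensorflow as tf

features = {
    "title":  tf.constant(["good movie", "bad plot"]),
    "review": tf.constant(["loved it", "way too long"]),
}
nlp_vars = ["title", "review"]

# stack the selected string columns and join them element-wise along axis 0
combined = tf.strings.reduce_join([features[c] for c in nlp_vars],
                                  axis=0, separator=" ")
print(combined.numpy())   # [b'good movie loved it' b'bad plot way too long']
```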
+ #features[feature_name] = tf.cast(features[feature_name], tf.dtypes.float32) + features_copy[feature_name] = tf.dtypes.cast(features_copy[feature_name], tf.int32) + return features_copy + ################################################################## + try: + print(' Boolean columns successfully converted to Integers') + test_ds = test_ds.map(process_boolean_features) + except: + print(' Error in converting Boolean columns to Integers. Continuing...') + ################################################################################# + bool_cols = cat_vocab_dict2['bools'] + #@tf.function + @tf.autograph.experimental.do_not_convert + def convert_boolean_to_string_predict(features_copy): + """ + This is how you convert all your boolean features into float variables. + The reason you have to do this is because tf.keras does not know how to handle boolean types. + It takes as input an ordered dict named features and returns the same in features format. + """ + for feature_name in features_copy: + if feature_name in bool_cols: + features_copy[feature_name] = tf.where(features_copy[feature_name], 'True', 'False') + return features_copy + ################## process next steps from here on ############################# + if cat_vocab_dict2['bools_converted']: + try: + print('Since booleans=%s have been converted to strings in Training, converting them to strings' %bool_cols) + test_ds = test_ds.map(convert_boolean_to_string_predict) + except: + print(' Error in converting boolean to string variables. Continuing...') + ################################################################################# + try: + ### This is merely to give hints to newbies that they may have mistakenly given different data types to train and test + bool_cols.sort() + _, _, _, _, _, bools = classify_dtypes_using_TF2_in_test(test_ds, idcols=cat_vocab_dict2['cols_delete'], verbose=0) + bools.sort() + if len(bools) == 0 and len(bool_cols) == 0: + pass + elif bool_cols == bools: + print('Possible Conflict: Boolean columns in train and test data were passed differently. Check your test data types.') + else: + print('Congratulations! boolean columns were passed identically in both train and test datasets.') + except: + print('Possible Conflict: Boolean columns in train and test data were passed differently. 
Check your test data types.') + #################################################################################################################### + ## num_steps is needed to predict on whole dataset once ## + try: + num_steps = int(np.ceil(DS_LEN/batch_size)) + print('Batch size = %s' %batch_size) + except: + num_steps = 1 + ######### See if you can predict here if not return the null result ##### + print(' number of steps needed to predict: %d' %num_steps) + y_test_preds_list = [] + targets = cat_vocab_dict2['target_variables'] + ##### Now you need to save the predictions ### + modeltype = cat_vocab_dict2['modeltype'] + num_labels = cat_vocab_dict2['num_labels'] + num_classes = cat_vocab_dict2['num_classes'] + ####### save the predictions only upto input size ### + ######## This is where we start predictions on test data set ############## + try: + y_probas = model.predict(test_ds, steps=num_steps) + except: + print('ERROR: Predictions from model erroring.') + print(' Check your model and ensure test data and their dtypes are same as train data and retry again.') + return + ###### Now convert the model predictions into classes ######### + try: + y_test_preds_list = convert_predictions_from_model(y_probas, cat_vocab_dict2, DS_LEN) + except: + print('Converting model predictions into classes or other forms is erroring. Convert it yourself.') + return y_probas + + + ##### We now show how many items are in the output ################### + print('Returning model predictions in form of a list...of length %d' %len(y_test_preds_list)) + print('Time taken in mins for predictions = %0.0f' %((time.time()-start_time2)/60)) + return y_test_preds_list +############################################################################################ +def convert_predictions_from_model(y_probas, cat_vocab_dict, DS_LEN): + y_test_preds_list = [] + target = cat_vocab_dict['target_variables'] + modeltype = cat_vocab_dict["modeltype"] + if isinstance(target, list): + if len(target) == 1: + target = target[0] + #### This is where predictions are converted back to classes ### + if isinstance(target, str): + if modeltype != 'Regression': + #### This is for Single Label classification problems ###### + y_test_preds_list.append(y_probas) + y_test_preds = y_probas.argmax(axis=1) + print(' Sample predictions before inverse_transform: %s' %y_test_preds[:5]) + if cat_vocab_dict["target_transformed"]: + try: + y_test_preds = cat_vocab_dict['target_le'].inverse_transform(y_test_preds) + print(' Sample predictions after inverse_transform: %s' %y_test_preds[:5]) + y_test_preds_t = y_test_preds[:DS_LEN] + y_test_preds_list.append(y_test_preds_t) + except: + print(' Inverse transform erroring. 
Continuing...') + y_test_preds_t = y_test_preds[:DS_LEN] + y_test_preds_list.append(y_test_preds_t) + else: + print(' Sample predictions after transform: %s' %y_test_preds[:5]) + y_test_preds_t = y_test_preds[:DS_LEN] + y_test_preds_list.append(y_test_preds_t) + else: + #### This is for Single Label regression problems ###### + y_test_preds = y_probas.ravel() + y_test_preds_t = y_test_preds[:DS_LEN] + y_test_preds_list.append(y_test_preds_t) + else: + if modeltype == 'Regression': + y_test_preds_list.append(y_probas) + ### This is for Multi-Label Regresison problems ### + for each_t in range(len(y_probas)): + if each_t == 0: + y_test_preds = y_probas[each_t].mean(axis=1) + else: + y_test_preds = np.c_[y_test_preds, y_probas[each_t].mean(axis=1)] + y_test_preds_t = y_test_preds[:DS_LEN] + y_test_preds_list.append(y_test_preds_t) + else: + y_preds = [] + #### This is Multi-Label Classification problems ###### + y_test_preds_list.append(y_probas) + ### in Multi-Label predictions, output is a list if loading test datafile or dataframe ## + if isinstance(y_probas, list): + print('Multi-Label Predictions has %s outputs' %len(y_probas)) + else: + print('Multi-Label Predictions shape:%s' %(y_probas.shape,)) + for each_t in range(len(y_probas)): + y_test_preds = y_probas[each_t].argmax(axis=1) + print(' Sample predictions for label: %s before transform: %s' %(each_t, y_test_preds[:5])) + if cat_vocab_dict["target_transformed"]: + try: + y_test_preds = cat_vocab_dict[each_target]['target_le'].inverse_transform( + y_test_preds) + print(' Sample predictions after inverse_transform: %s' %y_test_preds[:5]) + if each_t == 0: + y_preds = y_test_preds + else: + y_preds = np.c_[y_preds, y_test_preds] + y_test_preds_t = y_preds[:DS_LEN,:] + y_test_preds_list.append(y_test_preds_t) + except: + print(' Inverse transform erroring. 
Continuing...') + if each_t == 0: + y_preds = y_test_preds + else: + y_preds = np.c_[y_preds, y_test_preds] + y_test_preds_t = y_preds[:DS_LEN] + y_test_preds_list.append(y_test_preds_t) + else: + if each_t == 0: + y_preds = y_test_preds + else: + y_preds = np.c_[y_preds, y_test_preds] + y_test_preds_t = y_preds[:DS_LEN] + y_test_preds_list.append(y_test_preds_t) + return y_test_preds_list +########################################################################################### +from PIL import Image +import numpy as np +from skimage import transform +def process_image_file(filename, img_height, img_weight, img_channels): + np_image = Image.open(filename) + np_image = np.array(np_image).astype('float32') + np_image = transform.resize(np_image, (224, 224, 3)) + np_image = np.expand_dims(np_image, axis=0) + return np_image +############################################################################################## +def predict_images(test_image_dir, model_or_model_path, cat_vocab_dict, keras_model_type): + project_name = cat_vocab_dict["project_name"] + model_loaded, cat_vocab_dict = load_model_dict(model_or_model_path, cat_vocab_dict, project_name, keras_model_type) + ##### Now load the classes neede for predictions ### + y_test_preds_list = [] + classes = cat_vocab_dict['image_classes'] + img_height = cat_vocab_dict["image_height"] + img_width = cat_vocab_dict["image_width"] + batch_size = cat_vocab_dict["batch_size"] + img_channels = cat_vocab_dict["image_channels"] + if isinstance(test_image_dir, str): + if test_image_dir.split(".")[-1] in ["jpg","png"]: + print(" loading and predicting on file : %s" %test_image_dir) + pred_label = model_loaded.predict(process_image_file(test_image_dir, + img_height, img_weight, img_channels)) + print('Predicted Label: %s' %(classes[np.argmax(pred_label)])) + print('Predicted probabilities: %s' %pred_label) + else: + print(" loading and predicting on folder: %s" %test_image_dir) + test_ds = tf.keras.preprocessing.image_dataset_from_directory(test_image_dir, + seed=111, + image_size=(img_height, img_width), + batch_size=batch_size) + y_probas = model_loaded.predict(test_ds) + ### DS_LEN for image directories rarely exceeds 10000 so just set this as default ## + DS_LEN = 10000 + pred_label = convert_predictions_from_model(y_probas, cat_vocab_dict, DS_LEN) + return pred_label + else: + print('Error: test_image_dir should be either a directory containining test folder or a single JPG or PNG image file') + return None +######################################################################################################## +def predict_text(test_text_dir, model_or_model_path, cat_vocab_dict, keras_model_type): + project_name = cat_vocab_dict["project_name"] + model_loaded, cat_vocab_dict = load_model_dict(model_or_model_path, cat_vocab_dict, project_name, keras_model_type) + ##### Now load the classes neede for predictions ### + y_test_preds_list = [] + classes = cat_vocab_dict['text_classes'] + if isinstance(test_text_dir, str): + if test_text_dir.split(".")[-1] in ["txt"]: + try: + batch_size = cat_vocab_dict["batch_size"] + except: + batch_size = 32 + print(" loading and predicting on TXT file : %s" %test_text_dir) + pred_label = model_loaded.predict(test_text_dir) + print('Predicted Label: %s' %(classes[np.argmax(pred_label)])) + print('Predicted probabilities: %s' %pred_label) + elif test_text_dir.split(".")[-1] in ["csv"]: + print(" loading and predicting on CSV file : %s" %test_text_dir) + test_ds, cat_vocab_dict, test_small = 
load_test_data(test_text_dir, project_name, cat_vocab_dict=cat_vocab_dict, + verbose=0) + DS_LEN = cat_vocab_dict['DS_LEN'] + batch_size = cat_vocab_dict["batch_size"] + try: + num_steps = int(np.ceil(DS_LEN/batch_size)) + except: + num_steps = 1 + ######### See if you can predict here if not return the null result ##### + print(' number of steps needed to predict: %d' %num_steps) + y_probas = model_loaded.predict(test_ds, steps=num_steps) + pred_label = convert_predictions_from_model(y_probas, cat_vocab_dict, DS_LEN) + return pred_label + else: + try: + batch_size = cat_vocab_dict["batch_size"] + except: + batch_size = 32 + print(" loading and predicting on folder: %s" %test_text_dir) + test_ds = tf.keras.preprocessing.text_dataset_from_directory(test_text_dir, + seed=111, + batch_size=batch_size) + y_probas = model_loaded.predict(test_ds) + try: + DS_LEN = cat_vocab_dict['DS_LEN'] + except: + ### just set up a default number 10,000 + DS_LEN = 10000 + pred_label = convert_predictions_from_model(y_probas, cat_vocab_dict, DS_LEN) + return pred_label + else: + print('Error: test_text_dir should be either a directory containining test folder or a single .txt file') + return None +########################################################################################################## +from collections import defaultdict +def nested_dictionary(): + return defaultdict(nested_dictionary) +def find_preds(data_sample): + for featurex in data_sample.take(1): + return list(featurex.keys()) +def classify_dtypes_using_TF2_in_test(data_sample, idcols, verbose=0): + """ + If you send in a batch of Ttf.data.dataset with the name of target variable(s), you will get back + all the features classified by type such as cats, ints, floats and nlps. This is all done using TF2. + """ + print_features = False + nlps = [] + nlp_char_limit = 30 + all_ints = [] + floats = [] + cats = [] + bools = [] + int_vocab = 0 + feats_max_min = nested_dictionary() + preds = find_preds(data_sample) + #### Take(1) always displays only one batch only if num_epochs is set to 1 or a number. Otherwise No print! ######## + #### If you execute the below code without take, then it will go into an infinite loop if num_epochs was set to None. + if data_sample.element_spec[preds[0]].shape[0] is None or data_sample.element_spec[preds[0]].shape[0]: + for feature_batch in data_sample.take(1): + if verbose >= 1: + print(f"{target}: {label_batch[:4]}") + if len(feature_batch.keys()) <= 30: + print_features = True + if verbose >= 1: + print("features and their max, min, datatypes in one batch of size: ",batch_size) + for key, value in feature_batch.items(): + feats_max_min[key]["dtype"] = data_sample.element_spec[key].dtype + if feats_max_min[key]['dtype'] in [tf.float16, tf.float32, tf.float64]: + ## no need to find vocab of floating point variables! + floats.append(key) + elif feats_max_min[key]['dtype'] in [tf.int16, tf.int32, tf.int64]: + ### if it is an integer var, it is worth finding their vocab! + all_ints.append(key) + int_vocab = tf.unique(value)[0].numpy().tolist() + feats_max_min[key]['size_of_vocab'] = len(int_vocab) + elif feats_max_min[key]['dtype'] == 'bool': + ### if it is an integer var, it is worth finding their vocab! 
+ bools.append(key) + int_vocab = tf.unique(value)[0].numpy().tolist() + feats_max_min[key]['size_of_vocab'] = len(int_vocab) + elif feats_max_min[key]['dtype'] in [tf.string]: + if tf.reduce_mean(tf.strings.length(feature_batch[key])).numpy() >= nlp_char_limit: + print('%s is detected and will be treated as an NLP variable') + nlps.append(key) + else: + cats.append(key) + if not print_features: + print('Number of variables in dataset is too numerous to print...skipping print') + + ints = [ x for x in all_ints if feats_max_min[x]['size_of_vocab'] > 30 and x not in idcols] + + int_cats = [ x for x in all_ints if feats_max_min[x]['size_of_vocab'] <= 30 and x not in idcols] + + return cats, int_cats, ints, floats, nlps, bools +############################################################################################################# \ No newline at end of file diff --git a/build/lib/deep_autoviml/modeling/train_custom_model.py b/build/lib/deep_autoviml/modeling/train_custom_model.py new file mode 100644 index 0000000..7922a3d --- /dev/null +++ b/build/lib/deep_autoviml/modeling/train_custom_model.py @@ -0,0 +1,1081 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import pandas as pd +import numpy as np +pd.set_option('display.max_columns',500) +import matplotlib.pyplot as plt +import tempfile +import pdb +import copy +import warnings +warnings.filterwarnings(action='ignore') +import functools +# Make numpy values easier to read. 
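The thresholds used in `classify_dtypes_using_TF2_in_test` above (integer columns with a vocabulary of at most 30 values are treated as categorical, string columns whose mean length is at least 30 characters are treated as NLP) can be seen on a toy batch. The sketch below is illustrative only, with made-up feature names:

```python
import tensorflow as tf

n = 40
batch = {
    "store_id": tf.constant([1, 2, 3, 4, 5] * 8),             # 5 unique ints  -> integer-categorical
    "clicks":   tf.constant(list(range(n))),                   # 40 unique ints -> numeric integer
    "review":   tf.constant(["this product was great and arrived really fast"] * n),  # long text -> NLP
    "color":    tf.constant(["red", "blue", "green", "grey"] * 10),                   # short text -> categorical
}

vocab_limit, nlp_char_limit = 30, 30
for name, value in batch.items():
    if value.dtype in (tf.int16, tf.int32, tf.int64):
        vocab_size = int(tf.size(tf.unique(value)[0]))
        kind = "integer-categorical" if vocab_size <= vocab_limit else "numeric integer"
    elif value.dtype == tf.string:
        mean_len = tf.reduce_mean(tf.strings.length(value)).numpy()
        kind = "NLP" if mean_len >= nlp_char_limit else "string-categorical"
    else:
        kind = "float / other"
    print(f"{name:10s} -> {kind}")
```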
+np.set_printoptions(precision=3, suppress=True) +############################################################################################ +# TensorFlow ≥2.4 is required +import tensorflow as tf +import os +os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' +def set_seed(seed=31415): + np.random.seed(seed) + tf.random.set_seed(seed) + os.environ['PYTHONHASHSEED'] = str(seed) + os.environ['TF_DETERMINISTIC_OPS'] = '1' +from tensorflow.keras import layers +from tensorflow import keras +from tensorflow.keras.layers.experimental.preprocessing import Normalization, StringLookup, CategoryCrossing +from tensorflow.keras.layers.experimental.preprocessing import IntegerLookup, CategoryEncoding +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization, Discretization, Hashing +from tensorflow.keras.layers import Embedding, Reshape, Dropout, Dense, GaussianNoise + +from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras.models import Model, load_model +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +from tensorflow.keras.layers import LeakyReLU +##################################################################################### +from deep_autoviml.modeling.create_model import return_optimizer +from deep_autoviml.utilities.utilities import get_model_defaults, get_compiled_model + +# Utils +from deep_autoviml.utilities.utilities import print_one_row_from_tf_dataset, print_one_row_from_tf_label +from deep_autoviml.utilities.utilities import print_classification_metrics, print_regression_model_stats +from deep_autoviml.utilities.utilities import plot_regression_residuals +from deep_autoviml.utilities.utilities import print_classification_model_stats, plot_history, plot_classification_results +from deep_autoviml.utilities.utilities import add_outputs_to_model_body +from deep_autoviml.utilities.utilities import add_outputs_to_auto_model_body +from deep_autoviml.utilities.utilities import check_if_GPU_exists, get_chosen_callback +from deep_autoviml.utilities.utilities import save_valid_predictions, get_callbacks +from deep_autoviml.utilities.utilities import print_classification_header +from deep_autoviml.utilities.utilities import check_keras_options +from deep_autoviml.utilities.utilities import save_model_artifacts, save_model_architecture + +from deep_autoviml.data_load.extract import find_batch_size +from deep_autoviml.modeling.one_cycle import OneCycleScheduler +##################################################################################### +from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error +from IPython.core.display import Image, display +import pickle +############################################################################################# +##### Suppress all TF2 and TF1.x warnings ################### +tf2logger = tf.get_logger() +tf2logger.warning('Silencing TF2.x warnings') +tf2logger.root.removeHandler(tf2logger.root.handlers) +tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) +############################################################################################ +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D, AveragePooling2D, AveragePooling1D +from tensorflow.keras import 
Model, Sequential +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D, GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ +#### probably the most handy function of all! +def left_subtract(l1,l2): + lst = [] + for i in l1: + if i not in l2: + lst.append(i) + return lst +############################################################################################## +import time +import os +import math + +from sklearn.metrics import balanced_accuracy_score, classification_report +from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score +from collections import defaultdict +from tensorflow.keras import callbacks +######################################################################################### +### This is the Storm-Tuner which stands for Stochastic Random Mutator tuner +### More details can be found in this github: https://github.com/ben-arnao/stochasticmutatortuner +### You can also pip install storm-tuner --upgrade to get the latest version ########## +######################################################################################### +from storm_tuner import Tuner +######################################################################################### +### Split raw_train_set into train and valid data sets first +### This is a better way to split a dataset into train and test #### +### It does not assume a pre-defined size for the data set. +def is_valid(x, y): + return x % 5 == 0 +def is_test(x, y): + return x % 2 == 0 +def is_train(x, y): + return not is_test(x, y) +################################################################################## +# Reset Keras Session +def reset_keras(): + sess = get_session() + K.clear_session() + sess.close() + sess = get_session() + + try: + del opt_model ### delete this if it exists + del best_model # this is from global space - change this as you need + del deep_model ### delete this if it exists + print('deleted deep and best models from memory') + except: + pass + + print(gc.collect()) # if it does something you should see a number as output + + # use the same config as you used to create the session + config = tf.compat.v1.ConfigProto() + config.gpu_options.per_process_gpu_memory_fraction = 1 + config.gpu_options.visible_device_list = "0" + set_session(tf.compat.v1.Session(config=config)) +################################################################################## +def build_model_optuna(trial, inputs, meta_outputs, output_activation, num_predicts, modeltype, + optimizer_options, loss_fn, val_metrics, cols_len, targets, nlp_flag, regular_body): + + #tf.compat.v1.reset_default_graph() + #K.clear_session() + #reset_keras() + #tf.keras.backend.reset_uids() + ### Keep the number of layers slightly higher to increase model complexity ## + n_layers = trial.suggest_int("n_layers", 2, 8) + #num_hidden = trial.suggest_categorical("n_units", [32, 48, 64, 96, 128]) + num_hidden = trial.suggest_categorical("n_units", [50, 100, 150, 200, 250, 300, 350, 400, 450, 500]) + weight_decay = trial.suggest_float("weight_decay", 1e-8, 1e-3, log=True) + use_bias = trial.suggest_categorical("use_bias", [True, False]) + batch_norm = trial.suggest_categorical("batch_norm", [True, False]) + add_noise = trial.suggest_categorical("add_noise", [True, False]) + dropout = trial.suggest_float("dropout", 0.5, 0.9) + activation_fn = 
trial.suggest_categorical("activation", ['relu', 'elu', 'selu']) + kernel_initializer = trial.suggest_categorical("kernel_initializer", + ['glorot_uniform','he_normal','lecun_normal','he_uniform']) + kernel_size = num_hidden + model = tf.keras.Sequential() + + for i in range(n_layers): + kernel_size = int(kernel_size*0.80) + + model.add( + tf.keras.layers.Dense( + kernel_size, + name="opt_dense_"+str(i), use_bias=use_bias, + kernel_initializer=kernel_initializer, + #kernel_regularizer=tf.keras.regularizers.l2(weight_decay), + ) + ) + model.add(Activation(activation_fn,name="opt_activation_"+str(i))) + + if batch_norm: + model.add(BatchNormalization(name="opt_batchnorm_"+str(i))) + + if add_noise: + model.add(GaussianNoise(trial.suggest_float("adam_learning_rate", 1e-7, 1e-3, log=True))) + + model.add(Dropout(dropout, name="opt_drop_"+str(i))) + + #### Now we add the final layers to the model ######### + kwargs = {} + if isinstance(optimizer_options,str): + if optimizer_options == "": + optimizer_options = ["Adam", "SGD"] + optimizer_selected = trial.suggest_categorical("optimizer", optimizer_options) + else: + optimizer_selected = optimizer_options + else: + optimizer_selected = trial.suggest_categorical("optimizer", optimizer_options) + if optimizer_selected == "Adam": + kwargs["learning_rate"] = trial.suggest_float("adam_learning_rate", 1e-7, 1e-3, log=True) + kwargs["epsilon"] = trial.suggest_float( + "adam_epsilon", 1e-14, 1e-4, log=True + ) + elif optimizer_selected == "SGD": + kwargs["learning_rate"] = trial.suggest_float( + "sgd_opt_learning_rate", 1e-7, 1e-3, log=True + ) + kwargs["momentum"] = trial.suggest_float("sgd_opt_momentum", 0.8, 0.95) + + optimizer = getattr(tf.optimizers, optimizer_selected)(**kwargs) + ##### This is the simplest way to convert a sequential model to functional! 
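+    #### Note (descriptive comment, inferred from usage): add_outputs_to_model_body() attaches
+    #### this tuned Sequential body to the keras preprocessing outputs (meta_outputs), while the
+    #### add_outputs_to_auto_model_body() variant is used when there is no regular body and also
+    #### takes nlp_flag so that NLP outputs can be wired in before the model is compiled below.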
+ if regular_body: + opt_outputs = add_outputs_to_model_body(model, meta_outputs) + else: + opt_outputs = add_outputs_to_auto_model_body(model, meta_outputs, nlp_flag) + + comp_model = get_compiled_model(inputs, opt_outputs, output_activation, num_predicts, + modeltype, optimizer, loss_fn, val_metrics, cols_len, targets) + + return comp_model + +############################################################################### +def build_model_storm(hp, *args): + #### Before every sequential model definition you need to clear the Keras backend ## + keras.backend.clear_session() + + ###### we need to use the batch_size in a few small sizes #### + if len(args) == 2: + batch_limit, batch_nums = args[0], args[1] + batch_size = hp.Param('batch_size', [32, 64, 128, 256, 512, 1024, 2048], + ordered=True) + elif len(args) == 1: + batch_size = args[0] + batch_size = hp.Param('batch_size', [batch_size]) + else: + batch_size = hp.Param('batch_size', [32, 64, 128, 256, 512, 1024, 2048]) + + num_layers = hp.Param('num_layers', [1, 2, 3], ordered=True) + ##### Now let us build the model body ############### + model_body = Sequential([]) + + # example of model-wide unordered categorical parameter + activation_fn = hp.Param('activation', ['relu', 'selu', 'elu']) + use_bias = hp.Param('use_bias', [True, False]) + weight_decay = hp.Param("weight_decay", np.logspace(-8, -3, 10)) + #weight_decay = hp.Param("weight_decay", [1e-8, 1e-7,1e-6, 1e-5,1e-4]) + + batch_norm = hp.Param("batch_norm", [True, False]) + kernel_initializer = hp.Param("kernel_initializer", + ['glorot_uniform','he_normal','lecun_normal','he_uniform'], ordered=False) + dropout_flag = hp.Param('use_dropout', [True, False]) + batch_norm_flag = hp.Param('use_batch_norm', [True, False]) + + # example of per-block parameter + num_hidden = [50, 100, 150, 200, 250, 300, 350, 400, 450, 500] + + model_body.add(Dense(hp.Param('kernel_size_' + str(0), + num_hidden, ordered=True), + use_bias=use_bias, + kernel_initializer = kernel_initializer, + name="storm_dense_0", + #kernel_regularizer=keras.regularizers.l2(weight_decay) + )) + + model_body.add(Activation(activation_fn,name="activation_0")) + + # example of boolean param + if batch_norm_flag: + model_body.add(BatchNormalization(name="batch_norm_0")) + + if dropout_flag: + # example of nested param + # + # this param will not affect the configuration hash, if this block of code isn't executed + # this is to ensure we do not test configurations that are functionally the same + # but have different values for unused parameters + model_body.add(Dropout(hp.Param('dropout_value', [0.5, 0.6, 0.7, 0.8, 0.9], ordered=True), + name="dropout_0")) + + kernel_size = hp.values['kernel_size_' + str(0)] + if dropout_flag: + dropout_value = hp.values['dropout_value'] + else: + dropout_value = 0.5 + batch_norm_flag = hp.values['use_batch_norm'] + # example of inline ordered parameter + num_copy = copy.deepcopy(num_layers) + + for x in range(num_copy): + #### slowly reduce the kernel size after each round #### + kernel_size = int(0.75*kernel_size) + # example of per-block parameter + model_body.add(Dense(kernel_size, name="storm_dense_"+str(x+1), + use_bias=use_bias, + kernel_initializer = kernel_initializer, + #kernel_regularizer=keras.regularizers.l2(weight_decay) + )) + + model_body.add(Activation(activation_fn, name="activation_"+str(x+100))) + + # example of boolean param + if batch_norm_flag: + model_body.add(BatchNormalization(name="batch_norm_"+str(x+1))) + + if dropout_flag: + # example of nested param + # this 
param will not affect the configuration hash, if this block of code isn't executed + # this is to ensure we do not test configurations that are functionally the same + # but have different values for unused parameters + model_body.add(Dropout(dropout_value, name="dropout_"+str(x+1))) + + selected_optimizer = hp.Param('optimizer', ["Adam", "AdaMax", "Adagrad", "SGD", "RMSprop", "Nadam", 'nesterov'], + ordered=False) + + optimizer = return_optimizer_trials(hp, selected_optimizer) + + return model_body, optimizer + +############################################################################################ +class MyTuner(Tuner): + + def run_trial(self, trial, *args): + hp = trial.hyperparameters + #### Before every sequential model definition you need to clear the Keras backend ## + keras.backend.clear_session() + + ########## E N D O F S T R A T E G Y S C O P E ############# + train_ds, valid_ds = args[0], args[1] + epochs, steps = args[2], args[3] + inputs, meta_outputs = args[4], args[5] + cols_len, output_activation = args[6], args[7] + num_predicts, modeltype = args[8], args[9] + optimizer, val_loss = args[10], args[11] + val_metrics, patience = args[12], args[13] + val_mode, DS_LEN = args[14], args[15] + learning_rate, val_monitor = args[16], args[17] + callbacks_list, modeltype = args[18], args[19] + class_weights, batch_size = args[20], args[21] + batch_limit, batch_nums = args[22], args[23] + targets, nlp_flag, regular_body = args[24], args[25], args[26] + project_name, keras_model_type, = args[27], args[28] + cat_vocab_dict, model_options = args[29], args[30] + + model_body, optimizer = build_model_storm(hp, batch_limit, batch_nums) + + ##### This is the simplest way to convert a sequential model to functional model! + if regular_body: + storm_outputs = add_outputs_to_model_body(model_body, meta_outputs) + else: + storm_outputs = add_outputs_to_auto_model_body(model_body, meta_outputs, nlp_flag) + + #### This final outputs is the one that is taken into final dense layer and compiled + #print(' Custom model loaded successfully. Now compiling model...') + + ###### This is where you compile the model after it is built ############### + #### Add a final layer for outputs during compiled model phase ############# + + comp_model = get_compiled_model(inputs, storm_outputs, output_activation, num_predicts, + modeltype, optimizer, val_loss, val_metrics, cols_len, targets) + + #opt = comp_model.optimizer + #for var in opt.variables(): + # var.assign(tf.zeros_like(var)) + if len(self.trials) == 0: + ### Just save it once. Don't save it again and again. + save_model_architecture(comp_model, project_name, keras_model_type, cat_vocab_dict, + model_options, chart_name="model_before") + #print(' Custom model compiled successfully. 
Training model next...') + batch_numbers = [32, 64, 128, 256, 512, 1024, 2048, 4096] + shuffle_size = 1000 + batch_sizes = batch_numbers[:batch_nums] + #print('storm batch sizes = %s' %batch_sizes) + batch_size = np.random.choice(batch_sizes) + #print(' selected batch size = %s' %batch_size) + train_ds = train_ds.unbatch().batch(batch_size) + train_ds = train_ds.shuffle(shuffle_size, + reshuffle_each_iteration=False, seed=42).prefetch(batch_size)#.repeat(5) + valid_ds = valid_ds.unbatch().batch(batch_size) + valid_ds = valid_ds.prefetch(batch_size)#.repeat(5) + steps = 20 + storm_epochs = 5 + + history = comp_model.fit(train_ds, epochs=storm_epochs, #steps_per_epoch=steps,# batch_size=batch_size, + validation_data=valid_ds, #validation_steps=steps, + callbacks=callbacks_list, shuffle=True, class_weight=class_weights, + verbose=0) + # here we can define custom logic to assign a score to a configuration + if len(targets) == 1: + score = np.mean(history.history[val_monitor][-5:]) + else: + for i in range(len(targets)): + ### the next line uses the list of metrics to find one that is a closest match + metric1 = [x for x in history.history.keys() if (targets[i] in x) & ("loss" not in x) ] + val_metric = metric1[0] + if i == 0: + results = history.history[val_metric][-5:] + else: + results = np.c_[results,history.history[val_metric][-5:]] + score = results.mean(axis=1).mean() + #scores.append(score) + ##### This is where we capture the best learning rate from the optimizer chosen ###### + model_lr = comp_model.optimizer.learning_rate.numpy() + #self.user_var = model_lr + print(' found best learning rate = %s' %model_lr) + trial.metrics['final_lr'] = model_lr + #print(' trial final_lr = %s' %trial.metrics['final_lr']) + self.score_trial(trial, score) + #self.score_trial(trial, min(scores)) +##################################################################################### +def return_optimizer_trials(hp, hpq_optimizer): + """ + This returns the keras optimizer with proper inputs if you send the string. + hpq_optimizer: input string that stands for an optimizer such as "Adam", etc. 
+ """ + ##### These are the various optimizers we use ################################ + momentum = keras.optimizers.SGD(lr=0.001, momentum=0.9) + nesterov = keras.optimizers.SGD(lr=0.001, momentum=0.9, nesterov=True) + adagrad = keras.optimizers.Adagrad(lr=0.001) + rmsprop = keras.optimizers.RMSprop(lr=0.001, rho=0.9) + adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999) + adamax = keras.optimizers.Adamax(lr=0.001, beta_1=0.9, beta_2=0.999) + nadam = keras.optimizers.Nadam(lr=0.001, beta_1=0.9, beta_2=0.999) + best_optimizer = '' + ############################################################################# + lr_list = [1e-2, 1e-3, 1e-4] + if hpq_optimizer.lower() in ['adam']: + best_optimizer = tf.keras.optimizers.Adam(lr=hp.Param('init_lr', lr_list), + epsilon=hp.Param('epsilon', [1e-6, 1e-8, 1e-10, 1e-12, 1e-14], ordered=True)) + elif hpq_optimizer.lower() in ['sgd']: + best_optimizer = keras.optimizers.SGD(lr=hp.Param('init_lr', lr_list), + momentum=0.9) + elif hpq_optimizer.lower() in ['nadam']: + best_optimizer = keras.optimizers.Nadam(lr=hp.Param('init_lr', lr_list), + beta_1=0.9, beta_2=0.999) + elif hpq_optimizer.lower() in ['adamax']: + best_optimizer = keras.optimizers.Adamax(lr=hp.Param('init_lr', lr_list), + beta_1=0.9, beta_2=0.999) + elif hpq_optimizer.lower() in ['adagrad']: + best_optimizer = keras.optimizers.Adagrad(lr=hp.Param('init_lr', lr_list)) + elif hpq_optimizer.lower() in ['rmsprop']: + best_optimizer = keras.optimizers.RMSprop(lr=hp.Param('init_lr', lr_list), + rho=0.9) + elif hpq_optimizer.lower() in ['nesterov']: + best_optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9, nesterov=True) + else: + best_optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9) + + return best_optimizer +##################################################################################### +from tensorflow.keras import backend as K +import tensorflow +import gc +from tensorflow.python.keras.backend import get_session, set_session +import tensorflow as tf + +########################################################################################## +import optuna +def train_custom_model(nlp_inputs, meta_inputs, meta_outputs, nlp_outputs, full_ds, target, + keras_model_type, keras_options, model_options, var_df, cat_vocab_dict, + project_name="", save_model_flag=True, use_my_model='', verbose=0 ): + """ + Given a keras model and a tf.data.dataset that is batched, this function will + train a keras model. It will first split the batched_data into train_ds and + valid_ds (80/20). Then it will select the right parameters based on model type and + train the model and evaluate it on valid_ds. It will return a keras model fully + trained on the full batched_data finally and train history. 
+ """ + save_model_path = model_options['save_model_path'] + inputs = nlp_inputs + meta_inputs + nlps = var_df["nlp_vars"] + lats = var_df["lat_vars"] + lons = var_df["lon_vars"] + if nlp_inputs: + nlp_flag = True + else: + nlp_flag = False + start_time = time.time() + ######################## STORM TUNER and other DEFAULTS #################### + targets = cat_vocab_dict['target_variables'] + max_trials = model_options["max_trials"] + overwrite_flag = True ### This overwrites the trials so every time it runs it is new + data_size = check_keras_options(keras_options, 'data_size', 10000) + batch_size = check_keras_options(keras_options, 'batchsize', 64) + class_weights = check_keras_options(keras_options, 'class_weight', {}) + if not isinstance(model_options["label_encode_flag"], str): + if not model_options["label_encode_flag"]: + print(' removing class weights since label_encode_flag is set to False which means classes can be anything.') + class_weights = {} + print(' Class weights: %s' %class_weights) + num_classes = model_options["num_classes"] + num_labels = model_options["num_labels"] + modeltype = model_options["modeltype"] + patience = keras_options["patience"] + cols_len = len([item for sublist in list(var_df.values()) for item in sublist]) + if isinstance(meta_outputs, list): + data_dim = int(data_size) + NON_NLP_VARS = [] + else: + NON_NLP_VARS = left_subtract(cat_vocab_dict["predictors_in_train"], nlps) + try: + data_dim = int(data_size*meta_outputs.shape[1]) + except: + data_dim = int(data_size*(meta_outputs[0].shape[1])) + optimizer = keras_options['optimizer'] + early_stopping = check_keras_options(keras_options, "early_stopping", False) + print(' original datasize = %s, initial batchsize = %s' %(data_size, batch_size)) + print(" Early stopping : %s" %early_stopping) + NUMBER_OF_EPOCHS = check_keras_options(keras_options, "epochs", 100) + if keras_options['lr_scheduler'] in ['expo', 'ExponentialDecay', 'exponentialdecay']: + print(' chosen ExponentialDecay learning rate scheduler') + expo_steps = (NUMBER_OF_EPOCHS*data_size)//batch_size + learning_rate = keras.optimizers.schedules.ExponentialDecay(0.0001, expo_steps, 0.1) + else: + learning_rate = check_keras_options(keras_options, "learning_rate", 5e-2) + #### The steps are actually not needed but remove them later.### + if len(var_df['nlp_vars']) > 0: + steps = 10 + else: + steps = max(10, (data_size//(batch_size*2))) + steps = min(300, steps) + print(' recommended steps per epoch = %d' %steps) + STEPS_PER_EPOCH = check_keras_options(keras_options, "steps_per_epoch", + steps) + #### These can be standard for every keras option that you use layers ###### + kernel_initializer = check_keras_options(keras_options, 'kernel_initializer', 'lecun_normal') + activation='selu' + print(' default initializer = %s, default activation = %s' %(kernel_initializer, activation)) + ############################################################################ + use_bias = check_keras_options(keras_options, 'use_bias', True) + lr_scheduler = check_keras_options(keras_options, 'lr_scheduler', "") + onecycle_steps = max(10, np.ceil(data_size/(2*batch_size))*NUMBER_OF_EPOCHS) + print(' Onecycle steps = %d' %onecycle_steps) + ###################### set some defaults for model parameters here ############## + keras_options, model_options, num_predicts, output_activation = get_model_defaults(keras_options, + model_options, targets) + ################################################################################### + val_mode = keras_options["mode"] + 
val_monitor = keras_options["monitor"] + val_loss = keras_options["loss"] + val_metrics = keras_options["metrics"] + ######################################################################## + try: + print(' number of classes = %s, output_activation = %s' %( + num_predicts, output_activation)) + print(' loss function: %s' %str(val_loss).split(".")[-1].split(" ")[0]) + except: + print(' loss fn = %s number of classes = %s, output_activation = %s' %( + val_loss, num_predicts, output_activation)) + #### just use modeltype for printing that's all ### + modeltype = cat_vocab_dict['modeltype'] + + ############################################################################ + ### A Regular body does not have separate NLP outputs. #################### + ### However an Irregular body like fast models have separate NLP outputs. ## + ############################################################################ + regular_body = True + if isinstance(meta_outputs, list): + if nlp_flag: + if len(nlp_outputs) > 0: + ### This is a true nlp and we need to use nlp inputs ## + regular_body = False + else: + regular_body = True + else: + regular_body = False + ############################################################################ + + ### check the defaults for the following! + save_weights_only = check_keras_options(keras_options, "save_weights_only", False) + + print(' steps_per_epoch = %s, number epochs = %s' %(STEPS_PER_EPOCH, NUMBER_OF_EPOCHS)) + print(' val mode = %s, val monitor = %s, patience = %s' %(val_mode, val_monitor, patience)) + + callbacks_dict, tb_logpath = get_callbacks(val_mode, val_monitor, patience, + learning_rate, save_weights_only, + onecycle_steps, save_model_path) + chosen_callback = get_chosen_callback(callbacks_dict, keras_options) + if not keras_options["lr_scheduler"]: + print(' chosen keras LR scheduler = default') + else: + print(' chosen keras LR scheduler = %s' %keras_options['lr_scheduler']) + + ## You cannot use Unbatch to remove batch since we need it finding size below #### + #full_ds = full_ds.unbatch() + ############## Split train into train and validation datasets here ############### + recover = lambda x,y: y + print('\nSplitting train into 80+20 percent: train and validation data') + valid_ds1 = full_ds.enumerate().filter(is_valid).map(recover) + train_ds = full_ds.enumerate().filter(is_train).map(recover) + heldout_ds1 = valid_ds1 + ################################################################################## + print(' Splitting validation 20 into 10+10 percent: valid and heldout data') + valid_ds = heldout_ds1.enumerate().filter(is_test).map(recover) + heldout_ds = heldout_ds1.enumerate().filter(is_test).map(recover) + print('\nLoading model and setting params. Will take 2-3 mins. 
Please be patient.') + ################################################################################## + ### V E R Y I M P O R T A N T S T E P B E F O R E M O D E L F I T ### + ################################################################################## + shuffle_size = 1000 + if num_labels <= 1: + try: + y_test = np.concatenate(list(heldout_ds.map(lambda x,y: y).as_numpy_iterator())) + print(' Single-Label: Heldout data shape: %s' %(y_test.shape,)) + max_batch_size = int(min(y_test.shape[0], 4096)) + except: + max_batch_size = 48 + pass + else: + iters = int(data_size/batch_size) + 1 + for inum, each_target in enumerate(target): + add_ls = [] + for feats, labs in heldout_ds.take(iters): + add_ls.append(list(labs[each_target].numpy())) + flat_list = [item for sublist in add_ls for item in sublist] + if inum == 0: + each_array = np.array(flat_list) + else: + each_array = np.c_[each_array, np.array(flat_list)] + y_test = copy.deepcopy(each_array) + print(' Multi-Label: Heldout data shape: %s' %(y_test.shape,)) + max_batch_size = y_test.shape[0] + + if modeltype == 'Regression': + if (y_test>=0).all() : + ### if there are no negative values, then set output as positives only + output_activation = 'softplus' + print('Setting output activation layer as softplus since there are no negative values') + print(' Shuffle size = %d' %shuffle_size) + + train_ds = train_ds.prefetch(tf.data.AUTOTUNE).shuffle( + shuffle_size, reshuffle_each_iteration=False, seed=42)#.repeat() + valid_ds = valid_ds.prefetch(tf.data.AUTOTUNE)#.repeat() + if not isinstance(use_my_model, str): ### That means no tuner in this case #### + tuner = "None" + else: + tuner = model_options["tuner"] + print(' Training %s model using %s. This will take time...' %(keras_model_type, tuner)) + + from secrets import randbelow + rand_num = randbelow(10000) + tf.compat.v1.reset_default_graph() + K.clear_session() + + ####################################################################################### + ### E A R L Y S T O P P I N G T O P R E V E N T O V E R F I T T I N G ## + ####################################################################################### + if keras_options['lr_scheduler'] in ['expo', 'ExponentialDecay', 'exponentialdecay']: + callbacks_list_tuner = callbacks_dict['early_stop'] + else: + callbacks_list_tuner = [chosen_callback, callbacks_dict['early_stop']] + + targets = cat_vocab_dict["target_variables"] + ############################################################################ + ######## P E R FO R M T U N I N G H E R E ###################### + ############################################################################ + tune_mode = 'min' + trials_saved_path = os.path.join(save_model_path, "trials") + if num_labels > 1 and modeltype != 'Regression': + tune_mode = 'max' + else: + tune_mode = val_mode + if tuner.lower() == "storm": + ######## S T O R M T U N E R D E F I N E D H E R E ########### + randomization_factor = 0.5 + tuner = MyTuner(project_dir=trials_saved_path, + build_fn=build_model_storm, + objective_direction=tune_mode, + init_random=5, + max_iters=max_trials, + randomize_axis_factor=randomization_factor, + overwrite=True) + ################### S T o R M T U N E R ############################### + # parameters passed through 'search' go directly to the 'run_trial' method ## + #### This is where you find best model parameters for keras using SToRM ##### + ############################################################################# + start_time1 = time.time() + print(' STORM Tuner 
max_trials = %d, randomization factor = %0.2f' %( + max_trials, randomization_factor)) + tuner_epochs = 100 ### keep this low so you can run fast + tuner_steps = STEPS_PER_EPOCH ## keep this also very low + batch_limit = min(max_batch_size, int(5 * find_batch_size(data_size))) + batch_nums = int(min(8, math.log(batch_limit, 3))) + print('Max. batch size = %d, number of batch sizes to try: %d' %(batch_limit, batch_nums)) + + #### You have to make sure that inputs are unique, otherwise error #### + tuner.search(train_ds, valid_ds, tuner_epochs, tuner_steps, + inputs, meta_outputs, cols_len, output_activation, + num_predicts, modeltype, optimizer, val_loss, + val_metrics, patience, val_mode, data_size, + learning_rate, val_monitor, callbacks_list_tuner, + modeltype, class_weights, batch_size, + batch_limit, batch_nums, targets, nlp_flag, regular_body, + project_name, keras_model_type, cat_vocab_dict, + model_options) + best_trial = tuner.get_best_trial() + print(' best trial selected as %s' %best_trial) + ##### get the best model parameters now. Also split it into two models ########### + print('Time taken for tuning hyperparameters = %0.0f (mins)' %((time.time()-start_time1)/60)) + ########## S E L E C T B E S T O P T I M I Z E R and L R H E R E ############ + try: + hpq = tuner.get_best_config() + best_model, best_optimizer = build_model_storm(hpq, batch_size) + best_batch = hpq.values['batch_size'] + hpq_optimizer = hpq.values['optimizer'] + if best_trial.metrics['final_lr'] < 0: + print(' best learning rate less than zero. Resetting it....') + optimizer_lr = 0.01 + else: + optimizer_lr = best_trial.metrics['final_lr'] + print('Best hyperparameters: %s' %hpq.values) + except: + ### Sometimes the tuner cannot find a config that works! + deep_model = return_model_body(keras_options) + ### In some cases, the tuner doesn't select a good config in that case ## + best_batch = batch_size + hpq_optimizer = 'SGD' + best_optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9, nesterov=True) + optimizer_lr = 0.01 + print(' Storm Tuner is erroring. Hence picking defaults including lr = %s' %optimizer_lr) + + ### Sometimes the learning rate is below zero - so reset it here! + ### Set the learning rate for the best optimizer here ###### + print('\nSetting best optimizer %s its best learning_rate = %s' %(hpq_optimizer, optimizer_lr)) + best_optimizer = return_optimizer(hpq_optimizer) + K.set_value(best_optimizer.learning_rate, optimizer_lr) + + ##### This is the simplest way to convert a sequential model to functional model! + if regular_body: + storm_outputs = add_outputs_to_model_body(best_model, meta_outputs) + else: + storm_outputs = add_outputs_to_auto_model_body(best_model, meta_outputs, nlp_flag) + #### This final outputs is the one that is taken into final dense layer and compiled + #print(' Custom model loaded successfully. 
Now compiling model...') + + ###### This is where you compile the model after it is built ############### + #### Add a final layer for outputs during compiled model phase ############# + + best_model = get_compiled_model(inputs, storm_outputs, output_activation, num_predicts, modeltype, + best_optimizer, val_loss, val_metrics, cols_len, targets) + deep_model = best_model + ####################################################################################### + elif tuner.lower() == "optuna": + ###### O P T U N A ########################## + ### This is where you build the optuna model ##### + #################################################### + optuna_scores = [] + def objective(trial): + optimizer_options = "" + opt_model = build_model_optuna(trial, inputs, meta_outputs, output_activation, num_predicts, + modeltype, optimizer_options, val_loss, val_metrics, cols_len, targets, nlp_flag, regular_body) + optuna_epochs = 5 + history = opt_model.fit(train_ds, validation_data=valid_ds, + epochs=optuna_epochs, shuffle=True, + callbacks=callbacks_list_tuner, + verbose=0) + if num_labels == 1: + score = np.mean(history.history[val_monitor][-5:]) + else: + for i in range(num_labels): + ### the next line uses the list of metrics to find one that is a closest match + metric1 = [x for x in history.history.keys() if (targets[i] in x) & ("loss" not in x) ] + val_metric = metric1[0] + if i == 0: + results = history.history[val_metric][-5:] + else: + results = np.c_[results,history.history[val_metric][-5:]] + score = results.mean(axis=1).mean() + optuna_scores.append(score) + return score + ##### This where you run optuna ################### + study_name = project_name+'_'+keras_model_type+'_study_'+str(rand_num) + if tune_mode == 'max': + study = optuna.create_study(study_name=study_name, direction="maximize", load_if_exists=False) + else: + study = optuna.create_study(study_name=study_name, direction="minimize", load_if_exists=False) + ### now find the best tuning hyper params here #### + study.optimize(objective, n_trials=max_trials) + print('Best trial score in Optuna: %s' %study.best_trial.value) + print(' Scores mean:', np.mean(optuna_scores), 'std:', np.std(optuna_scores)) + print(' Best params: %s' %study.best_params) + optimizer_options = study.best_params['optimizer'] + best_model = build_model_optuna(study.best_trial, inputs, meta_outputs, output_activation, num_predicts, + modeltype, optimizer_options, val_loss, val_metrics, cols_len, targets, nlp_flag, regular_body) + best_optimizer = best_model.optimizer + deep_model = build_model_optuna(study.best_trial, inputs, meta_outputs, output_activation, num_predicts, + modeltype, optimizer_options, val_loss, val_metrics, cols_len, targets, nlp_flag, regular_body) + best_batch = batch_size + optimizer_lr = best_optimizer.learning_rate.numpy() + print('\nBest optimizer = %s and best learning_rate = %s' %(best_optimizer, optimizer_lr)) + K.set_value(best_optimizer.learning_rate, optimizer_lr) + elif tuner.lower() == "none": + print('skipping tuner search since use_my_model flag set to True...') + best_model = use_my_model + deep_model = use_my_model + if regular_body: + best_outputs = add_outputs_to_model_body(best_model, meta_outputs) + deep_outputs = add_outputs_to_model_body(deep_model, meta_outputs) + else: + best_outputs = add_outputs_to_auto_model_body(best_model, meta_outputs, nlp_flag) + deep_outputs = add_outputs_to_auto_model_body(deep_model, meta_outputs, nlp_flag) + best_optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9, 
nesterov=True) + best_batch = batch_size + optimizer_lr = best_optimizer.learning_rate.numpy() + print('\nBest optimizer = %s and best learning_rate = %s' %(best_optimizer, optimizer_lr)) + K.set_value(best_optimizer.learning_rate, optimizer_lr) + ####################################################################################### + #### The best_model will be used for predictions on valid_ds to get metrics ######### + best_model = get_compiled_model(inputs, best_outputs, output_activation, num_predicts, + modeltype, best_optimizer, val_loss, val_metrics, cols_len, targets) + deep_model = get_compiled_model(inputs, deep_outputs, output_activation, num_predicts, + modeltype, best_optimizer, val_loss, val_metrics, cols_len, targets) + ####################################################################################### + + #################################################################################### + ##### T R A IN A N D V A L I D A T I O N F O U N D H E R E ###### + #################################################################################### + + train_ds = train_ds.unbatch().batch(best_batch, drop_remainder=True) + train_ds = train_ds.shuffle(shuffle_size, + reshuffle_each_iteration=False, seed=42).prefetch(tf.data.AUTOTUNE)#.repeat() + + valid_ds = valid_ds.unbatch().batch(best_batch, drop_remainder=True) + valid_ds = valid_ds.prefetch(tf.data.AUTOTUNE)#.repeat() + + #################################################################################### + ############### F I R S T T R A I N F O R 1 0 0 E P O C H S ################## + ### You have to set both callbacks in order to learn what the best learning rate is + #################################################################################### + if keras_options['lr_scheduler'] in ['expo', 'ExponentialDecay', 'exponentialdecay']: + #### Exponential decay will take care of automatic reduction of Learning Rate + if early_stopping: + callbacks_list = [callbacks_dict['early_stop'], callbacks_dict['tensor_board']] + else: + callbacks_list = [callbacks_dict['tensor_board']] + else: + #### here you have to explicitly include Learning Rate reducer + if early_stopping: + callbacks_list = [callbacks_dict['early_stop'], callbacks_dict['tensor_board'], chosen_callback] + else: + callbacks_list = [callbacks_dict['tensor_board'], chosen_callback] + + print('Model training with best hyperparameters for %d epochs' %NUMBER_OF_EPOCHS) + for each_callback in callbacks_list: + print(' Callback added: %s' %str(each_callback).split(".")[-1]) + + ############################ M O D E L T R A I N I N G ################## + np.random.seed(42) + tf.random.set_seed(42) + history = best_model.fit(train_ds, validation_data=valid_ds, #batch_size=best_batch, + epochs=NUMBER_OF_EPOCHS, #steps_per_epoch=STEPS_PER_EPOCH, + callbacks=callbacks_list, class_weight=class_weights, + #validation_steps=STEPS_PER_EPOCH, + shuffle=True) + print(' Model training completed. 
Following metrics available: %s' %history.history.keys()) + print('Time taken to train model (in mins) = %0.0f' %((time.time()-start_time)/60)) + + ################################################################################# + ####### R E S E T K E R A S S E S S I O N + ################################################################################# + # Reset Keras Session + + K.clear_session() + reset_keras() + tf.compat.v1.reset_default_graph() + tf.keras.backend.reset_uids() + + ### Once the best learning rate is chosen the model is ready to be trained on full data + try: + ## this is where it stopped + stopped_epoch = max(5, int(pd.DataFrame(history.history).shape[0] - patience)) + except: + stopped_epoch = 100 + print(' Stopped epoch = %s' %stopped_epoch) + + ### Plot the epochs and loss metrics here ##################### + try: + if modeltype == 'Regression': + plot_history(history, val_monitor[4:], target) + elif modeltype == 'Classification': + plot_history(history, val_monitor[4:], target) + else: + plot_history(history, val_monitor[4:], target) + except: + print(' Plot history is erroring. Tensorboard logs can be found here: %s' %tb_logpath) + + print('Time taken to train model (in mins) = %0.0f' %((time.time()-start_time)/60)) + print(' Stopped epoch = %s' %stopped_epoch) + + ################################################################################# + ######## P R E D I C T O N H E L D O U T D A T A H E R E ###### + ################################################################################# + scores = [] + ls = [] + + print('Held out data actuals shape: %s' %(y_test.shape,)) + if verbose >= 1: + try: + print_one_row_from_tf_label(heldout_ds) + except: + print('could not print samples from heldout ds labels') + ########################################################################### + y_probas = best_model.predict(heldout_ds) + + if isinstance(target, str): + if modeltype != 'Regression': + y_test_preds = y_probas.argmax(axis=1) + else: + if y_test.dtype == 'int': + y_test_preds = y_probas.round().astype(int) + else: + y_test_preds = y_probas.ravel() + else: + if modeltype != 'Regression': + #### This is for multi-label binary or multi-class problems ## + for each_t in range(len(target)): + if each_t == 0: + y_test_preds = y_probas[each_t].argmax(axis=1).astype(int) + else: + y_test_preds = np.c_[y_test_preds, y_probas[each_t].argmax(axis=1).astype(int)] + else: + ### This is for Multi-Label Regression ### + for each_t in range(len(target)): + if each_t == 0: + y_test_preds = y_probas[each_t].mean(axis=1) + else: + y_test_preds = np.c_[y_test_preds, y_probas[each_t].mean(axis=1)] + if y_test.dtype == 'int': + y_test_preds = y_test_preds.round().astype(int) + + print('\nHeld out predictions shape:%s' %(y_test_preds.shape,)) + if verbose >= 1: + if modeltype != 'Regression': + print(' Sample predictions: %s' %y_test_preds[:10]) + else: + if num_labels == 1: + print(' Sample predictions: %s' %y_test_preds.ravel()[:10]) + else: + print(' Sample predictions:\n%s' %y_test_preds[:10]) + + ################################################################################# + ######## P L O T T I N G V A L I D A T I O N R E S U L T S ###### + ################################################################################# + print('\n###########################################################') + print(' Held-out test data set Results:') + num_labels = cat_vocab_dict['num_labels'] + num_classes = cat_vocab_dict['num_classes'] + + ######## Check for NaN in predictions 
############################### + if check_for_nan_in_array(y_probas): + y_probas = pd.DataFrame(y_probas).fillna(0).values + elif check_for_nan_in_array(y_test_preds): + y_test_preds = pd.DataFrame(y_test_preds).fillna(0).values.ravel() + + ############### P R I N T I N G R E S U L T S ################# + if num_labels <= 1: + #### This is for Single-Label Problems only ################################ + if modeltype == 'Regression': + print_regression_model_stats(y_test, y_test_preds,target,plot_name=project_name) + ### plot the regression results here ######### + plot_regression_residuals(y_test, y_test_preds, target, project_name, num_labels) + else: + print_classification_header(num_classes, num_labels, target) + labels = cat_vocab_dict['original_classes'] + if cat_vocab_dict['target_transformed']: + target_names = cat_vocab_dict['transformed_classes'] + target_le = cat_vocab_dict['target_le'] + y_pred = y_probas.argmax(axis=1) + y_test_trans = target_le.inverse_transform(y_test) + y_pred_trans = target_le.inverse_transform(y_pred) + labels = np.unique(y_test_trans) ### sometimes there is less classes + plot_classification_results(y_test_trans, y_pred_trans, labels, labels, target) + else: + y_pred = y_probas.argmax(axis=1) + labels = np.unique(y_test) ### sometimes there are fewer classes ## + plot_classification_results(y_test, y_pred, labels, labels, target) + print_classification_metrics(y_test, y_probas, proba_flag=True) + else: + if modeltype == 'Regression': + #### This is for Multi-Label Regression ################################ + print_regression_model_stats(y_test, y_test_preds,target,plot_name=project_name) + ### plot the regression results here ######### + plot_regression_residuals(y_test, y_test_preds, target, project_name, num_labels) + else: + #### This is for Multi-Label Classification ################################ + try: + targets = cat_vocab_dict["target_variables"] + for i, each_target in enumerate(targets): + print_classification_header(num_classes, num_labels, each_target) + labels = cat_vocab_dict[each_target+'_original_classes'] + if cat_vocab_dict['target_transformed']: + ###### Use a nice classification matrix printing module here ######### + target_names = cat_vocab_dict[each_target+'_transformed_classes'] + target_le = cat_vocab_dict['target_le'][i] + y_pred = y_probas[i].argmax(axis=1) + y_test_trans = target_le.inverse_transform(y_test[:,i]) + y_pred_trans = target_le.inverse_transform(y_pred) + labels = np.unique(y_test_trans) ### sometimes there is less classes + plot_classification_results(y_test_trans, y_pred_trans, labels, labels, each_target) + else: + y_pred = y_probas[i].argmax(axis=1) + labels = np.unique(y_test[:,i]) ### sometimes there are fewer classes ## + plot_classification_results(y_test[:,i], y_pred, labels, labels, each_target) + print_classification_metrics(y_test[:,i], y_probas[i], proba_flag=True) + #### This prints additional metrics ############# + print(classification_report(y_test[:,i],y_test_preds[:,i])) + print(confusion_matrix(y_test[:,i], y_test_preds[:,i])) + except: + print_classification_metrics(y_test, y_test_preds, False) + print(classification_report(y_test, y_test_preds )) + ############### P R I N T I N G C O M P L E T E D ################# + + ################################################################################## + ### S E C O N D T R A I N O N F U L L T R A I N D A T A S E T ### + ################################################################################## + ############ train the model on full 
train data set now ############### + print('\nFinally, training on full train dataset. This will take time...') + full_ds = full_ds.unbatch().batch(best_batch) + full_ds = full_ds.shuffle(shuffle_size, + reshuffle_each_iteration=False, seed=42).prefetch(best_batch)#.repeat() + + ################# B E S T D E E P M O D E L ########################## + ##### You need to set the best learning rate from the best_model ################# + best_rate = best_model.optimizer.lr.numpy() + if best_rate < 0: + print(' best learning rate less than zero. Resetting it....') + best_rate = 0.01 + else: + pass + print(' best learning rate = %s' %best_rate) + K.set_value(deep_model.optimizer.learning_rate, best_rate) + print(" set learning rate using best model:", deep_model.optimizer.learning_rate.numpy()) + #### Dont set the epochs too low - let them be back to where they were stopped #### + print(' max epochs for training = %d' %stopped_epoch) + + ##### You save deep_model finally here using checkpoints ############## + callbacks_list = [ callbacks_dict['check_point'] ] + deep_model.fit(full_ds, epochs=stopped_epoch, #steps_per_epoch=STEPS_PER_EPOCH, batch_size=best_batch, + class_weight = class_weights, + callbacks=callbacks_list, shuffle=True, verbose=0) + ################################################################################## + ####### S A V E the model here using save_model_name ################# + ################################################################################## + + save_model_artifacts(deep_model, cat_vocab_dict, var_df, save_model_path, + save_model_flag, model_options) + + ############################################################################# + ##### C L E A R S E S S I O N B E F O R E C L O S I N G #### + ############################################################################# + #from numba import cuda + #cuda.select_device(0) + #cuda.close() + # Reset Keras Session + K.clear_session() + tf.compat.v1.reset_default_graph() + reset_keras() + tf.keras.backend.reset_uids() + + print('\nDeep_Auto_ViML completed. Total time taken = %0.0f (in mins)' %((time.time()-start_time)/60)) + + return deep_model, cat_vocab_dict +###################################################################################### +def return_model_body(keras_options): + num_layers = check_keras_options(keras_options, 'num_layers', 2) + model_body = tf.keras.Sequential([]) + for l_ in range(num_layers): + model_body.add(layers.Dense(64, activation='relu', kernel_initializer="lecun_normal", + #activity_regularizer=tf.keras.regularizers.l2(0.01) + )) + return model_body +######################################################################################## +def check_for_nan_in_array(array_in): + """ + If an array has NaN in it, this will return True. Otherwise, it will return False. + """ + array_sum = np.sum(array_in) + array_nan = np.isnan(array_sum) + return array_nan +######################################################################################## + diff --git a/build/lib/deep_autoviml/modeling/train_image_model.py b/build/lib/deep_autoviml/modeling/train_image_model.py new file mode 100644 index 0000000..9f3e704 --- /dev/null +++ b/build/lib/deep_autoviml/modeling/train_image_model.py @@ -0,0 +1,138 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. 
+#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import pandas as pd +import numpy as np +pd.set_option('display.max_columns',500) +import matplotlib.pyplot as plt +import tempfile +import pdb +import copy +import warnings +warnings.filterwarnings(action='ignore') +import functools +# Make numpy values easier to read. +np.set_printoptions(precision=3, suppress=True) +############################################################################################ +# TensorFlow ≥2.4 is required +import tensorflow as tf +import os +def set_seed(seed=31415): + np.random.seed(seed) + tf.random.set_seed(seed) + os.environ['PYTHONHASHSEED'] = str(seed) + os.environ['TF_DETERMINISTIC_OPS'] = '1' +from tensorflow.keras import layers +from tensorflow import keras +from tensorflow.keras.layers.experimental.preprocessing import Normalization, StringLookup, CategoryCrossing +from tensorflow.keras.layers.experimental.preprocessing import IntegerLookup, CategoryEncoding +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization, Discretization, Hashing +from tensorflow.keras.layers import Embedding, Reshape, Dropout, Dense, GaussianNoise + +from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras.models import Model, load_model +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +##################################################################################### +# Utils +from deep_autoviml.utilities.utilities import print_one_row_from_tf_dataset, print_one_row_from_tf_label +from deep_autoviml.utilities.utilities import print_classification_metrics, print_regression_model_stats +from deep_autoviml.utilities.utilities import print_classification_model_stats, plot_history, plot_classification_results +from deep_autoviml.utilities.utilities import plot_one_history_metric +from deep_autoviml.utilities.utilities import check_if_GPU_exists +from deep_autoviml.utilities.utilities import save_valid_predictions, predict_plot_images + +from deep_autoviml.data_load.extract import find_batch_size +from deep_autoviml.modeling.create_model import check_keras_options +from deep_autoviml.modeling.one_cycle import OneCycleScheduler +##################################################################################### +from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error +from IPython.core.display import Image, display +import pickle +##### Suppress all TF2 and TF1.x warnings ################### +tf2logger = tf.get_logger() +tf2logger.warning('Silencing TF2.x warnings') +tf2logger.root.removeHandler(tf2logger.root.handlers) +tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) +############################################################################################ +from tensorflow.keras.layers import 
Reshape, MaxPooling1D, MaxPooling2D, AveragePooling2D, AveragePooling1D
+from tensorflow.keras import Model, Sequential
+from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D, GlobalMaxPooling1D, Dropout, Conv1D
+from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
+############################################################################################
+#### probably the most handy function of all!
+def left_subtract(l1,l2):
+    lst = []
+    for i in l1:
+        if i not in l2:
+            lst.append(i)
+    return lst
+##############################################################################################
+import time
+import os
+from sklearn.metrics import balanced_accuracy_score, classification_report
+from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score
+from collections import defaultdict
+from tensorflow.keras import callbacks
+#############################################################################################
+def train_image_model(deep_model, train_ds, valid_ds, cat_vocab_dict,
+                    keras_options, model_options, project_name, save_model_flag):
+    epochs = check_keras_options(keras_options, "epochs", 20)
+    save_model_path = model_options['save_model_path']
+    tensorboard_logpath = os.path.join(save_model_path,"mylogs")
+    print('Tensorboard log directory can be found at: %s' %tensorboard_logpath)
+    #### monitor, mode and patience are not passed in explicitly, so they are read from
+    #### keras_options; the fallback values below are assumed defaults for image models
+    val_monitor = check_keras_options(keras_options, "monitor", "val_accuracy")
+    val_mode = check_keras_options(keras_options, "mode", "max")
+    patience = check_keras_options(keras_options, "patience", 10)
+    cp = keras.callbacks.ModelCheckpoint(project_name, save_best_only=True,
+                                         save_weights_only=True, save_format='tf')
+    es = keras.callbacks.EarlyStopping(monitor=val_monitor, min_delta=0.00001, patience=patience,
+                        verbose=1, mode=val_mode, baseline=None, restore_best_weights=True)
+
+    tb = keras.callbacks.TensorBoard(log_dir=tensorboard_logpath,
+                         histogram_freq=0,
+                         write_graph=True,
+                         write_images=True,
+                         update_freq='epoch',
+                         profile_batch=2,
+                         embeddings_freq=1
+                         )
+    callbacks_list = [cp, es, tb]
+    print('Training image model. This will take time...')
+    history = deep_model.fit(train_ds, epochs=epochs, validation_data=valid_ds,
+                        callbacks=callbacks_list)
+    result = deep_model.evaluate(valid_ds)
+    print('    Model accuracy in Image validation data: %s' %result[1])
+    #plot_history(history, "accuracy", 1)
+    fig = plt.figure(figsize=(8,6))
+    ax1 = plt.subplot(1, 1, 1)
+    ax1.set_title('Model Training vs Validation Loss')
+    plot_one_history_metric(history, "accuracy", ax1)
+    classes = cat_vocab_dict["image_classes"]
+    predict_plot_images(deep_model, valid_ds, classes)
+    cat_vocab_dict['project_name'] = project_name
+    if save_model_flag:
+        print('\nSaving model in %s now...this will take time...' %save_model_path)
+        if not os.path.exists(save_model_path):
+            os.makedirs(save_model_path)
+        deep_model.save(save_model_path)
+        cat_vocab_dict['saved_model_path'] = save_model_path
+        print('    deep_autoviml image_model saved in %s directory' %save_model_path)
+    else:
+        print('\nModel not being saved since save_model_flag set to False...')
+    return deep_model, cat_vocab_dict
diff --git a/build/lib/deep_autoviml/modeling/train_model.py b/build/lib/deep_autoviml/modeling/train_model.py
new file mode 100644
index 0000000..64218d4
--- /dev/null
+++ b/build/lib/deep_autoviml/modeling/train_model.py
@@ -0,0 +1,387 @@
+############################################################################################
+#Copyright 2021 Google LLC
+
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License. 
+#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import pandas as pd +import numpy as np +pd.set_option('display.max_columns',500) +import matplotlib.pyplot as plt +import tempfile +import pdb +import copy +import warnings +warnings.filterwarnings(action='ignore') +import functools +# Make numpy values easier to read. +np.set_printoptions(precision=3, suppress=True) +############################################################################################ +# TensorFlow ≥2.4 is required +import tensorflow as tf +np.random.seed(42) +tf.random.set_seed(42) +from tensorflow.keras import layers +from tensorflow import keras +from tensorflow.keras.layers.experimental.preprocessing import Normalization, StringLookup, CategoryCrossing +from tensorflow.keras.layers.experimental.preprocessing import IntegerLookup, CategoryEncoding +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization, Discretization, Hashing +from tensorflow.keras.layers import Embedding, Reshape, Dropout, Dense + +from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras.models import Model, load_model +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +##################################################################################### +# Utils +from deep_autoviml.utilities.utilities import print_one_row_from_tf_dataset, print_one_row_from_tf_label +from deep_autoviml.utilities.utilities import print_classification_metrics, print_regression_model_stats +from deep_autoviml.utilities.utilities import plot_regression_residuals +from deep_autoviml.utilities.utilities import print_classification_model_stats, plot_history, plot_classification_results +from deep_autoviml.utilities.utilities import save_valid_predictions, print_classification_header +from deep_autoviml.utilities.utilities import get_callbacks, get_chosen_callback +from deep_autoviml.utilities.utilities import save_model_artifacts +from deep_autoviml.modeling.create_model import check_keras_options + +##################################################################################### +from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error +from IPython.core.display import Image, display +import pickle +############################################################################################# +##### Suppress all TF2 and TF1.x warnings ################### +tf2logger = tf.get_logger() +tf2logger.warning('Silencing TF2.x warnings') +tf2logger.root.removeHandler(tf2logger.root.handlers) +tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) +############################################################################################ +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D, AveragePooling2D, AveragePooling1D +from 
tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D, GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ +#### probably the most handy function of all! +def left_subtract(l1,l2): + lst = [] + for i in l1: + if i not in l2: + lst.append(i) + return lst +############################################################################################## +import time +import os +from sklearn.metrics import balanced_accuracy_score, classification_report, confusion_matrix, roc_auc_score +import math +######################################################################################### +### Split raw_train_set into train and valid data sets first +### This is a better way to split a dataset into train and test #### +### It does not assume a pre-defined size for the data set. +def is_valid(x, y): + return x % 5 == 0 +def is_test(x, y): + return x % 2 == 0 +def is_train(x, y): + return not is_test(x, y) +################################################################################## +def train_model(deep_model, full_ds, target, keras_model_type, keras_options, + model_options, var_df, cat_vocab_dict, project_name="", save_model_flag=True, + verbose=0 ): + """ + Given a keras model and a tf.data.dataset that is batched, this function will + train a keras model. It will first split the batched_data into train_ds and + valid_ds (80/20). Then it will select the right parameters based on model type and + train the model and evaluate it on valid_ds. It will return a keras model fully + trained on the full batched_data finally and train history. + """ + #### just use modeltype for printing that's all ### + start_time = time.time() + ### check the defaults for the following! 
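+    ### check_keras_options() (imported above from deep_autoviml.modeling.create_model) appears to
+    ### return the user-supplied value for each option when present in keras_options, and otherwise
+    ### fall back to the default passed in here.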
+    save_model_path = model_options['save_model_path']
+    save_weights_only = check_keras_options(keras_options, "save_weights_only", False)
+    data_size = check_keras_options(keras_options, 'data_size', 10000)
+    batch_size = check_keras_options(keras_options, 'batchsize', 64)
+    num_classes = model_options["num_classes"]
+    num_labels = model_options["num_labels"]
+    modeltype = model_options["modeltype"]
+    patience = check_keras_options(keras_options, "patience", 10)
+    optimizer = keras_options['optimizer']
+    #### early_stopping flag is read here since it is used to pick callbacks below
+    early_stopping = check_keras_options(keras_options, "early_stopping", False)
+    class_weights = check_keras_options(keras_options, "class_weight", {})
+    if not isinstance(model_options["label_encode_flag"], str):
+        if not model_options["label_encode_flag"]:
+            print('    removing class weights since label_encode_flag is set to False which means classes can be anything.')
+            class_weights = {}
+    print('    class_weights: %s' %class_weights)
+    cols_len = len([item for sublist in list(var_df.values()) for item in sublist])
+    print('    original datasize = %s, initial batchsize = %s' %(data_size, batch_size))
+    NUMBER_OF_EPOCHS = check_keras_options(keras_options, "epochs", 100)
+    if keras_options['lr_scheduler'] in ['expo', 'ExponentialDecay', 'exponentialdecay']:
+        print('    chosen ExponentialDecay learning rate scheduler')
+        expo_steps = (NUMBER_OF_EPOCHS*data_size)//batch_size
+        learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, expo_steps, 0.1)
+    else:
+        learning_rate = check_keras_options(keras_options, "learning_rate", 5e-1)
+    steps = max(10, (data_size//(2*batch_size)))
+    print('    recommended steps per epoch = %d' %steps)
+    onecycle_steps = math.ceil(data_size / batch_size) * NUMBER_OF_EPOCHS
+    print('    recommended OneCycle steps = %d' %onecycle_steps)
+    STEPS_PER_EPOCH = check_keras_options(keras_options, "steps_per_epoch",
+                        steps)
+    #### These can be standard for every keras option that you use layers ######
+    kernel_initializer = check_keras_options(keras_options, 'kernel_initializer', 'lecun_normal')
+    activation='selu'
+    print('    default initializer = %s, default activation = %s' %(kernel_initializer, activation))
+    default_optimizer = keras.optimizers.SGD(learning_rate)
+    use_bias = check_keras_options(keras_options, 'use_bias', True)
+    val_monitor = keras_options['monitor']
+    val_mode = keras_options['mode']
+    patience = keras_options["patience"]
+
+    if keras_options['lr_scheduler'] in ['',"onecycle", "onecycle2"]:
+        #### you need to double the amount of patience for onecycle scheduler ##
+        print('    Recommended: Increase patience for "onecycle" scheduler')
+        patience = patience * 1.0
+    callbacks_dict, tb_logpath = get_callbacks(val_mode, val_monitor, patience, learning_rate,
+                        save_weights_only, onecycle_steps, save_model_path)
+
+    if keras_options['lr_scheduler'] in ['expo', 'ExponentialDecay', 'exponentialdecay']:
+        if early_stopping:
+            callbacks_list = [callbacks_dict['early_stop'], callbacks_dict['print']]
+        else:
+            callbacks_list = [callbacks_dict['print']]
+    else:
+        chosen_callback = get_chosen_callback(callbacks_dict, keras_options)
+        if not keras_options['lr_scheduler']:
+            print('    chosen keras LR scheduler = default')
+        else:
+            print('    chosen keras LR scheduler = %s' %keras_options['lr_scheduler'])
+        if keras_options['early_stopping']:
+            callbacks_list = [chosen_callback, callbacks_dict['tensor_board'], callbacks_dict['early_stop']]
+        else:
+            callbacks_list = [chosen_callback, callbacks_dict['tensor_board'], callbacks_dict['print']]
+
+    print('    val mode = %s, val monitor = %s, patience = %s' %(val_mode, val_monitor, patience))
+    print('    number of 
epochs = %d, steps per epoch = %d' %(NUMBER_OF_EPOCHS, STEPS_PER_EPOCH)) + ############## Split train into train and validation datasets here ############### + ################################################################################## + recover = lambda x,y: y + print('\nSplitting train into 80+20 percent: train and validation data') + valid_ds1 = full_ds.enumerate().filter(is_valid).map(recover) + train_ds = full_ds.enumerate().filter(is_train).map(recover) + heldout_ds1 = valid_ds1 + ################################################################################## + valid_ds = heldout_ds1.enumerate().filter(is_test).map(recover) + heldout_ds = heldout_ds1.enumerate().filter(is_test).map(recover) + print(' Splitting validation 20 into 10+10 percent: valid and heldout data') + ################################################################################## + ### V E R Y I M P O R T A N T S T E P B E F O R E M O D E L F I T ### + ################################################################################## + shuffle_size = int(data_size) + #shuffle_size = 100000 + print(' shuffle size = %d' %shuffle_size) + train_ds = train_ds.cache().prefetch(batch_size).shuffle(shuffle_size, + reshuffle_each_iteration=False, seed=42)#.repeat() + valid_ds = valid_ds.prefetch(batch_size)#.repeat() + + print('Model training with best hyperparameters for %d epochs' %NUMBER_OF_EPOCHS) + for each_callback in callbacks_list: + print(' Callback added: %s' %str(each_callback).split(".")[-1]) + + ############################ M O D E L T R A I N I N G ################## + np.random.seed(42) + tf.random.set_seed(42) + history = deep_model.fit(train_ds, validation_data=valid_ds, class_weight=class_weights, + epochs=NUMBER_OF_EPOCHS, #steps_per_epoch=STEPS_PER_EPOCH, + callbacks=callbacks_list, #validation_steps=STEPS_PER_EPOCH, + shuffle=False) + + print(' Model training completed. 
Following metrics available: %s' %history.history.keys()) + try: + ##### this is where it stopped - you have to subtract patience from it + stopped_epoch = max(5,int(pd.DataFrame(history.history).shape[0] - patience)) + except: + stopped_epoch = 100 + + print('Time taken to train model (in mins) = %0.0f' %((time.time()-start_time)/60)) + + #### train the model on full train data set now ### + start_time = time.time() + print(' Stopped epoch = %s' %stopped_epoch) + + ################################################################################## + ####### S A V E the model here using save_model_name ################# + ################################################################################## + + save_model_artifacts(deep_model, cat_vocab_dict, var_df, save_model_path, + save_model_flag, model_options) + print() ### just create an extra line after saving that is all + + ################################################################################# + ######## P R E D I C T O N H E L D O U T D A T A H E R E ###### + ################################################################################# + try: + if num_labels <= 1: + y_test = np.concatenate(list(heldout_ds.map(lambda x,y: y).as_numpy_iterator())) + print(' Single-Label: Heldout data shape: %s' %(y_test.shape,)) + else: + iters = int(data_size/batch_size) + 1 + for inum, each_target in enumerate(target): + add_ls = [] + for feats, labs in heldout_ds.take(iters): + add_ls.append(list(labs[each_target].numpy())) + flat_list = [item for sublist in add_ls for item in sublist] + if inum == 0: + each_array = np.array(flat_list) + else: + each_array = np.c_[each_array, np.array(flat_list)] + y_test = copy.deepcopy(each_array) + print(' Multi-Label: Heldout data shape: %s' %(y_test.shape,)) + scores = [] + ls = [] + if verbose >= 1: + try: + print_one_row_from_tf_label(heldout_ds) + except: + print('could not print samples from heldout ds labels') + ########################################################################### + except: + print('Model erroring on heldout_ds predictions.
Returning with model and artifacts dictionary.') + return deep_model, cat_vocab_dict + + y_probas = deep_model.predict(heldout_ds) + + if isinstance(target, str): + if modeltype != 'Regression': + y_test_preds = y_probas.argmax(axis=1) + else: + if y_test.dtype == 'int': + y_test_preds = y_probas.round().astype(int) + else: + y_test_preds = y_probas.ravel() + else: + if modeltype != 'Regression': + #### This is for multi-label binary or multi-class problems ## + for each_t in range(len(target)): + if each_t == 0: + y_test_preds = y_probas[each_t].argmax(axis=1).astype(int) + else: + y_test_preds = np.c_[y_test_preds, y_probas[each_t].argmax(axis=1).astype(int)] + else: + ### This is for Multi-Label Regression ### + for each_t in range(len(target)): + if each_t == 0: + y_test_preds = y_probas[each_t].mean(axis=1) + else: + y_test_preds = np.c_[y_test_preds, y_probas[each_t].mean(axis=1)] + if y_test.dtype == 'int': + y_test_preds = y_test_preds.round().astype(int) + + print('\nHeld out predictions shape:%s' %(y_test_preds.shape,)) + if verbose >= 1: + if modeltype != 'Regression': + print(' Sample predictions: %s' %y_test_preds[:10]) + else: + if num_labels == 1: + print(' Sample predictions: %s' %y_test_preds.ravel()[:10]) + else: + print(' Sample predictions:\n%s' %y_test_preds[:10]) + + ################################################################################# + ######## P L O T T I N G V A L I D A T I O N R E S U L T S ###### + ################################################################################# + ### Plot the epochs and loss metrics here ##################### + try: + #print(' Additionally, Tensorboard logs can be found here: %s' %tb_logpath) + if modeltype == 'Regression': + plot_history(history, val_monitor[4:], target) + elif modeltype == 'Classification': + plot_history(history, val_monitor[4:], target) + else: + plot_history(history, val_monitor[4:], target) + except: + print(' Plot history is erroring. 
Tensorboard logs can be found here: %s' %tb_logpath) + + print('\n###########################################################') + print(' Held-out test data set Results:') + num_labels = cat_vocab_dict['num_labels'] + num_classes = cat_vocab_dict['num_classes'] + if num_labels <= 1: + #### This is for Single-Label Problems only ################################ + if modeltype == 'Regression': + print_regression_model_stats(y_test, y_test_preds,target,plot_name=project_name) + ### plot the regression results here ######### + plot_regression_residuals(y_test, y_test_preds, target, project_name, num_labels) + else: + print_classification_header(num_classes, num_labels, target) + labels = cat_vocab_dict['original_classes'] + if cat_vocab_dict['target_transformed']: + target_names = cat_vocab_dict['transformed_classes'] + target_le = cat_vocab_dict['target_le'] + y_pred = y_probas.argmax(axis=1) + y_test_trans = target_le.inverse_transform(y_test) + y_pred_trans = target_le.inverse_transform(y_pred) + plot_classification_results(y_test_trans, y_pred_trans, labels, labels, target) + else: + y_pred = y_probas.argmax(axis=1) + plot_classification_results(y_test, y_pred, labels, labels, target) + print_classification_metrics(y_test, y_probas, proba_flag=True) + else: + if modeltype == 'Regression': + #### This is for Multi-Label Regression ################################ + print_regression_model_stats(y_test, y_test_preds,target,plot_name=project_name) + ### plot the regression results here ######### + plot_regression_residuals(y_test, y_test_preds, target, project_name, num_labels) + else: + #### This is for Multi-Label Classification ################################ + try: + targets = cat_vocab_dict["target_variables"] + for i, each_target in enumerate(targets): + print_classification_header(num_classes, num_labels, each_target) + labels = cat_vocab_dict[each_target+'_original_classes'] + if cat_vocab_dict['target_transformed']: + ###### Use a nice classification matrix printing module here ######### + target_names = cat_vocab_dict[each_target+'_transformed_classes'] + target_le = cat_vocab_dict['target_le'][i] + y_pred = y_probas[i].argmax(axis=1) + y_test_trans = target_le.inverse_transform(y_test[:,i]) + y_pred_trans = target_le.inverse_transform(y_pred) + labels = np.unique(y_test_trans) ### sometimes there is less classes + plot_classification_results(y_test_trans, y_pred_trans, labels, labels, each_target) + else: + y_pred = y_probas[i].argmax(axis=1) + labels = np.unique(y_test[:,i]) ### sometimes there are fewer classes ## + plot_classification_results(y_test[:,i], y_pred, labels, labels, each_target) + print_classification_metrics(y_test[:,i], y_probas[i], proba_flag=True) + #### This prints additional metrics ############# + print(classification_report(y_test[:,i],y_test_preds[:,i])) + print(confusion_matrix(y_test[:,i], y_test_preds[:,i])) + except: + print_classification_metrics(y_test, y_test_preds, False) + print(classification_report(y_test, y_test_preds )) + + ################################################################################## + ### V E R Y I M P O R T A N T S T E P B E F O R E M O D E L F I T ### + ################################################################################## + print('\nTraining on full train dataset for %d epochs. This will take time...' 
%stopped_epoch) + full_ds = full_ds.cache().shuffle(shuffle_size).prefetch(batch_size) #.repeat() + #heldout_ds = heldout_ds.shuffle(shuffle_size).prefetch(batch_size) + deep_model.fit(full_ds, epochs=stopped_epoch, #steps_per_epoch=STEPS_PER_EPOCH, + class_weight=class_weights, verbose=0) + + print(' completed. Time taken (in mins) = %0.0f' %((time.time()-start_time)/60)) + + return deep_model, cat_vocab_dict +###################################################################################### diff --git a/build/lib/deep_autoviml/modeling/train_text_model.py b/build/lib/deep_autoviml/modeling/train_text_model.py new file mode 100644 index 0000000..4902091 --- /dev/null +++ b/build/lib/deep_autoviml/modeling/train_text_model.py @@ -0,0 +1,147 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import pandas as pd +import numpy as np +pd.set_option('display.max_columns',500) +import matplotlib.pyplot as plt +import tempfile +import pdb +import copy +import warnings +warnings.filterwarnings(action='ignore') +import functools +# Make numpy values easier to read.
+np.set_printoptions(precision=3, suppress=True) +############################################################################################ +# TensorFlow ≥2.4 is required +import tensorflow as tf +import os +def set_seed(seed=31415): + np.random.seed(seed) + tf.random.set_seed(seed) + os.environ['PYTHONHASHSEED'] = str(seed) + os.environ['TF_DETERMINISTIC_OPS'] = '1' +from tensorflow.keras import layers +from tensorflow import keras +from tensorflow.keras.layers.experimental.preprocessing import Normalization, StringLookup, CategoryCrossing +from tensorflow.keras.layers.experimental.preprocessing import IntegerLookup, CategoryEncoding +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization, Discretization, Hashing +from tensorflow.keras.layers import Embedding, Reshape, Dropout, Dense, GaussianNoise + +from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras.models import Model, load_model +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +##################################################################################### +# Utils +from deep_autoviml.utilities.utilities import print_one_row_from_tf_dataset, print_one_row_from_tf_label +from deep_autoviml.utilities.utilities import print_classification_metrics, print_regression_model_stats +from deep_autoviml.utilities.utilities import print_classification_model_stats, plot_history, plot_classification_results +from deep_autoviml.utilities.utilities import plot_one_history_metric +from deep_autoviml.utilities.utilities import check_if_GPU_exists +from deep_autoviml.utilities.utilities import save_valid_predictions, predict_plot_images + +from deep_autoviml.data_load.extract import find_batch_size +from deep_autoviml.modeling.create_model import check_keras_options +from deep_autoviml.modeling.one_cycle import OneCycleScheduler +##################################################################################### +from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error +from IPython.core.display import Image, display +import pickle +############################################################################################# +##### Suppress all TF2 and TF1.x warnings ################### +try: + tf.logging.set_verbosity(tf.logging.ERROR) +except: + tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) +############################################################################################ +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D, AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D, GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ +#### probably the most handy function of all! 
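+#### left_subtract(l1, l2) returns the items of l1 that are not in l2, preserving order,
+#### e.g. left_subtract(['a','b','c'], ['b']) -> ['a','c'].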
+def left_subtract(l1,l2): + lst = [] + for i in l1: + if i not in l2: + lst.append(i) + return lst +############################################################################################## +import time +import os +from sklearn.metrics import balanced_accuracy_score, classification_report +from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score +from collections import defaultdict +from tensorflow.keras import callbacks +############################################################################################# +def train_text_model(deep_model, train_ds, valid_ds, cat_vocab_dict, + keras_options, model_options, project_name, save_model_flag): + epochs = check_keras_options(keras_options, "epochs", 20) + save_model_path = model_options['save_model_path'] + tensorboard_logpath = os.path.join(save_model_path,"mylogs") + print('Tensorboard log directory can be found at: %s' %tensorboard_logpath) + cp = keras.callbacks.ModelCheckpoint(project_name, save_best_only=True, + save_weights_only=True, save_format='tf') + ### sometimes a model falters and restore_best_weights gives len() not found error. So avoid True option! + val_mode = "max" + val_monitor = "val_accuracy" + patience = check_keras_options(keras_options, "patience", 10) + + es = keras.callbacks.EarlyStopping(monitor=val_monitor, min_delta=0.00001, patience=patience, + verbose=1, mode=val_mode, baseline=None, restore_best_weights=True) + + tb = keras.callbacks.TensorBoard(log_dir=tensorboard_logpath, + histogram_freq=0, + write_graph=True, + write_images=True, + update_freq='epoch', + profile_batch=2, + embeddings_freq=1 + ) + callbacks_list = [cp, es, tb] + print('Training text model. This will take time...') + history = deep_model.fit(train_ds, epochs=epochs, validation_data=valid_ds, + callbacks=callbacks_list) + result = deep_model.evaluate(valid_ds) + print(' Model accuracy in text validation data: %s' %result[1]) + #plot_history(history, "accuracy", 1) + fig = plt.figure(figsize=(8,6)) + ax1 = plt.subplot(1, 1, 1) + ax1.set_title('Model Training vs Validation Loss') + plot_one_history_metric(history, "accuracy", ax1) + classes = cat_vocab_dict["text_classes"] + loss, accuracy = deep_model.evaluate(valid_ds) + print("Loss: ", loss) + print("Accuracy: ", accuracy) + cat_vocab_dict['project_name'] = project_name + if save_model_flag: + print('\nSaving model. This will take time...' ) + if not os.path.exists(save_model_path): + os.makedirs(save_model_path) + deep_model.save(save_model_path) + cat_vocab_dict['saved_model_path'] = save_model_path + print(' deep_autoviml text saved in %s directory' %save_model_path) + else: + print('\nModel not being saved since save_model_flag set to False...') + return deep_model, cat_vocab_dict +################################################################################# diff --git a/build/lib/deep_autoviml/models/basic.py b/build/lib/deep_autoviml/models/basic.py new file mode 100644 index 0000000..a2b9803 --- /dev/null +++ b/build/lib/deep_autoviml/models/basic.py @@ -0,0 +1,51 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. 
+#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import tensorflow as tf +from tensorflow import keras +#### Make sure it is Tensorflow 2.4 or greater! +from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras import models +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras import layers +from tensorflow.keras.layers import BatchNormalization, Activation +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D +from tensorflow.keras.layers import AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Embedding, Reshape, Dropout, Dense +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D +from tensorflow.keras.layers import GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ +from functools import partial + +RegDense = partial(Dense, kernel_initializer="he_normal", kernel_regularizer=keras.regularizers.l2(0.01)) + +model = Sequential([ + BatchNormalization(), + Activation("elu"), + RegDense(100), + BatchNormalization(), + Activation("elu"), + RegDense(100), + Activation("elu"), + RegDense(100), +]); diff --git a/build/lib/deep_autoviml/models/cnn1.py b/build/lib/deep_autoviml/models/cnn1.py new file mode 100644 index 0000000..efdfd89 --- /dev/null +++ b/build/lib/deep_autoviml/models/cnn1.py @@ -0,0 +1,52 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import tensorflow as tf +from tensorflow import keras +#### Make sure it is Tensorflow 2.4 or greater! 
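+#### cnn1.py defines a headless Conv1D model body: the input is reshaped to 3-D, passed
+#### through two strided Conv1D layers with pooling and dropout, then flattened into a
+#### Dense(32) layer. No input or output layers are defined in this file.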
+from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras import models +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras import layers +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D +from tensorflow.keras.layers import AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Embedding, Reshape, Dropout, Dense +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D +from tensorflow.keras.layers import GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ +model = tf.keras.Sequential() +model.add(Reshape((-1, 1))) ### you need to make input as 3-D for CNN models +#model.add(Conv1D(100, 32, name='conv1', padding="same", activation="relu", strides=2, data_format='channels_first')) +model.add(Conv1D(100, 32, name='conv1', padding="same", activation="relu", strides=2, data_format='channels_last')) +model.add(MaxPooling1D(pool_size=5)) +model.add(Dropout(0.5)) +model.add(Reshape((-1, 1))) ### you need to make input as 3-D for CNN models +#model.add(Conv1D(64, 16, name='conv2', padding="same", activation="relu", strides=2, data_format='channels_first')) +model.add(Conv1D(64, 16, name='conv2', padding="same", activation="relu", strides=2, data_format='channels_last')) +model.add(GlobalAveragePooling1D()) +model.add(Dropout(0.5)) +model.add(layers.Flatten()) +model.add(layers.Dense(32, activation="relu")) +model.add(layers.Dropout(0.25)) + diff --git a/build/lib/deep_autoviml/models/cnn2.py b/build/lib/deep_autoviml/models/cnn2.py new file mode 100644 index 0000000..2e867e6 --- /dev/null +++ b/build/lib/deep_autoviml/models/cnn2.py @@ -0,0 +1,52 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import tensorflow as tf +from tensorflow import keras +#### Make sure it is Tensorflow 2.4 or greater! 
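+#### cnn2.py defines a second headless Conv1D body built with tf.keras.Sequential: two
+#### Reshape + Conv1D + GlobalMaxPooling1D + Dropout blocks followed by Flatten, Dense(32)
+#### and Dropout. As with cnn1.py, no input or output layers are defined here.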
+from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras import models +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras import layers +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D +from tensorflow.keras.layers import AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Embedding, Reshape, Dropout, Dense +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D +from tensorflow.keras.layers import GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ + +model = tf.keras.Sequential([ + layers.Reshape((-1, 1)), ### you need to make input as 3-D for CNN models + layers.Conv1D(100, 64, padding="same", activation="relu", strides=3), + layers.GlobalMaxPooling1D(), + layers.Dropout(0.5), + layers.Reshape((-1, 1)), ### you need to make input as 3-D for CNN models + layers.Conv1D(64, 32, padding="same", activation="relu", strides=3), + layers.GlobalMaxPooling1D(), + layers.Dropout(0.2), + layers.Flatten(), + layers.Dense(32, activation="relu"), + layers.Dropout(0.25), + ]) + diff --git a/build/lib/deep_autoviml/models/deep_and_wide.py b/build/lib/deep_autoviml/models/deep_and_wide.py new file mode 100644 index 0000000..280df8f --- /dev/null +++ b/build/lib/deep_autoviml/models/deep_and_wide.py @@ -0,0 +1,48 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import tensorflow as tf +from tensorflow import keras +#### Make sure it is Tensorflow 2.4 or greater! 
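+#### deep_and_wide.py defines a compact dense body: two Dense layers of 128 and 64 relu
+#### units (he_normal initializer), each wrapped in BatchNormalization and Dropout, with a
+#### final BatchNormalization + Dropout(0.2); no output head is defined here.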
+from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras import models +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras import layers +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D +from tensorflow.keras.layers import AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Embedding, Reshape, Dropout, Dense +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D +from tensorflow.keras.layers import GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ + +model = models.Sequential([ + BatchNormalization(), + Dropout(0.5), + layers.Dense(128, activation='relu', kernel_initializer='he_normal'), + BatchNormalization(), + Dropout(0.5), + layers.Dense(64, activation='relu', kernel_initializer='he_normal'), + BatchNormalization(), + Dropout(0.2), + ]) \ No newline at end of file diff --git a/build/lib/deep_autoviml/models/dnn.py b/build/lib/deep_autoviml/models/dnn.py new file mode 100644 index 0000000..f676ff0 --- /dev/null +++ b/build/lib/deep_autoviml/models/dnn.py @@ -0,0 +1,48 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import tensorflow as tf +from tensorflow import keras +#### Make sure it is Tensorflow 2.4 or greater! 
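+#### dnn.py defines a plain dense body: three Dense(200) layers with "elu" activations and
+#### BatchNormalization, without an input layer or an output head.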
+from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras import models +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras import layers +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D +from tensorflow.keras.layers import AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Embedding, Reshape, Dropout, Dense +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D +from tensorflow.keras.layers import GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ + +model = Sequential([ + BatchNormalization(), + Activation("elu"), + Dense(200), + BatchNormalization(), + Activation("elu"), + Dense(200), + Activation("elu"), + Dense(200), +]); diff --git a/build/lib/deep_autoviml/models/dnn_drop.py b/build/lib/deep_autoviml/models/dnn_drop.py new file mode 100644 index 0000000..a7fdb26 --- /dev/null +++ b/build/lib/deep_autoviml/models/dnn_drop.py @@ -0,0 +1,51 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import tensorflow as tf +from tensorflow import keras +#### Make sure it is Tensorflow 2.4 or greater! 
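+#### dnn_drop.py defines a dropout-regularized dense body: three Dense(300, relu, he_normal)
+#### layers, each preceded by BatchNormalization and Dropout (0.5, 0.5, 0.2), plus a trailing
+#### BatchNormalization + Dropout(0.2) block.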
+from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras import models +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras import layers +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D +from tensorflow.keras.layers import AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Embedding, Reshape, Dropout, Dense +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D +from tensorflow.keras.layers import GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ + +model = models.Sequential([ + BatchNormalization(), + Dropout(0.5), + layers.Dense(300, activation='relu', kernel_initializer='he_normal'), + BatchNormalization(), + Dropout(0.5), + layers.Dense(300, activation='relu', kernel_initializer='he_normal'), + BatchNormalization(), + Dropout(0.2), + layers.Dense(300, activation='relu', kernel_initializer='he_normal'), + BatchNormalization(), + Dropout(0.2), + ]) \ No newline at end of file diff --git a/build/lib/deep_autoviml/models/giant_deep.py b/build/lib/deep_autoviml/models/giant_deep.py new file mode 100644 index 0000000..48aa3de --- /dev/null +++ b/build/lib/deep_autoviml/models/giant_deep.py @@ -0,0 +1,53 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import tensorflow as tf +from tensorflow import keras +#### Make sure it is Tensorflow 2.4 or greater! 
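+#### giant_deep.py defines a tapering dense body: Dense layers of 300, 200 and 100 relu
+#### units (he_normal initializer) with BatchNormalization and Dropout (0.5, 0.5, 0.25)
+#### in between; no output layer is defined here.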
+from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras import models +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras import layers +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D +from tensorflow.keras.layers import AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Embedding, Reshape, Dropout, Dense +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D +from tensorflow.keras.layers import GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ + +model = models.Sequential([ + layers.BatchNormalization(), + layers.Dropout(0.5), + layers.Dense(300, activation='relu', use_bias=True, + kernel_initializer='he_normal'), + layers.BatchNormalization(), + layers.Dropout(0.50), + layers.Dense(200, activation='relu', use_bias=True, + kernel_initializer='he_normal'), + layers.BatchNormalization(), + layers.Dropout(0.25), + layers.Dense(100, activation='relu', use_bias=True, + kernel_initializer='he_normal') + ]) + diff --git a/build/lib/deep_autoviml/models/reg_dnn.py b/build/lib/deep_autoviml/models/reg_dnn.py new file mode 100644 index 0000000..6f07be8 --- /dev/null +++ b/build/lib/deep_autoviml/models/reg_dnn.py @@ -0,0 +1,51 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import tensorflow as tf +from tensorflow import keras +#### Make sure it is Tensorflow 2.4 or greater! 
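+#### reg_dnn.py defines a weight-decayed dense body: RegDense is a functools.partial of
+#### Dense with kernel_initializer="he_normal" and an l2(0.01) kernel_regularizer, so
+#### RegDense(200) == Dense(200, kernel_initializer="he_normal", kernel_regularizer=l2(0.01)).
+#### Three such 200-unit layers are stacked with BatchNormalization and "elu" activations.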
+from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras import models +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras import layers +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D +from tensorflow.keras.layers import AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Embedding, Reshape, Dropout, Dense +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D +from tensorflow.keras.layers import GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ +from functools import partial + +RegDense = partial(Dense, kernel_initializer="he_normal", kernel_regularizer=keras.regularizers.l2(0.01)) + +model = Sequential([ + BatchNormalization(), + Activation("elu"), + RegDense(200), + BatchNormalization(), + Activation("elu"), + RegDense(200), + Activation("elu"), + RegDense(200), +]); diff --git a/build/lib/deep_autoviml/models/tf_hub_lookup.py b/build/lib/deep_autoviml/models/tf_hub_lookup.py new file mode 100644 index 0000000..f2f0c57 --- /dev/null +++ b/build/lib/deep_autoviml/models/tf_hub_lookup.py @@ -0,0 +1,139 @@ +map_name_to_handle = { + 'bert_en_uncased_L-12_H-768_A-12': + 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3', + 'bert_en_cased_L-12_H-768_A-12': + 'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3', + 'bert_multi_cased_L-12_H-768_A-12': + 'https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3', + 'small_bert/bert_en_uncased_L-2_H-128_A-2': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2', + 'small_bert/bert_en_uncased_L-2_H-256_A-4': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/2', + 'small_bert/bert_en_uncased_L-2_H-512_A-8': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/2', + 'small_bert/bert_en_uncased_L-2_H-768_A-12': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/2', + 'small_bert/bert_en_uncased_L-4_H-128_A-2': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/2', + 'small_bert/bert_en_uncased_L-4_H-256_A-4': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/2', + 'small_bert/bert_en_uncased_L-4_H-512_A-8': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2', + 'small_bert/bert_en_uncased_L-4_H-768_A-12': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/2', + 'small_bert/bert_en_uncased_L-6_H-128_A-2': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/2', + 'small_bert/bert_en_uncased_L-6_H-256_A-4': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/2', + 'small_bert/bert_en_uncased_L-6_H-512_A-8': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/2', + 'small_bert/bert_en_uncased_L-6_H-768_A-12': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/2', + 'small_bert/bert_en_uncased_L-8_H-128_A-2': + 
'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/2', + 'small_bert/bert_en_uncased_L-8_H-256_A-4': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/2', + 'small_bert/bert_en_uncased_L-8_H-512_A-8': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/2', + 'small_bert/bert_en_uncased_L-8_H-768_A-12': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/2', + 'small_bert/bert_en_uncased_L-10_H-128_A-2': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/2', + 'small_bert/bert_en_uncased_L-10_H-256_A-4': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/2', + 'small_bert/bert_en_uncased_L-10_H-512_A-8': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/2', + 'small_bert/bert_en_uncased_L-10_H-768_A-12': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/2', + 'small_bert/bert_en_uncased_L-12_H-128_A-2': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/2', + 'small_bert/bert_en_uncased_L-12_H-256_A-4': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/2', + 'small_bert/bert_en_uncased_L-12_H-512_A-8': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/2', + 'small_bert/bert_en_uncased_L-12_H-768_A-12': + 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/2', + 'albert_en_base': + 'https://tfhub.dev/tensorflow/albert_en_base/2', + 'electra_small': + 'https://tfhub.dev/google/electra_small/2', + 'electra_base': + 'https://tfhub.dev/google/electra_base/2', + 'experts_pubmed': + 'https://tfhub.dev/google/experts/bert/pubmed/2', + 'experts_wiki_books': + 'https://tfhub.dev/google/experts/bert/wiki_books/2', + 'talking-heads_base': + 'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/2', +} + +map_hub_to_name = dict([(v,k) for (k,v) in map_name_to_handle.items()]) + +map_name_to_preprocess = { + 'bert_en_uncased_L-12_H-768_A-12': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'bert_en_cased_L-12_H-768_A-12': + 'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3', + 'small_bert/bert_en_uncased_L-2_H-128_A-2': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-2_H-256_A-4': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-2_H-512_A-8': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-2_H-768_A-12': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-4_H-128_A-2': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-4_H-256_A-4': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-4_H-512_A-8': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-4_H-768_A-12': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-6_H-128_A-2': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-6_H-256_A-4': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-6_H-512_A-8': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-6_H-768_A-12': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-8_H-128_A-2': + 
'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-8_H-256_A-4': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-8_H-512_A-8': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-8_H-768_A-12': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-10_H-128_A-2': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-10_H-256_A-4': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-10_H-512_A-8': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-10_H-768_A-12': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-12_H-128_A-2': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-12_H-256_A-4': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-12_H-512_A-8': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'small_bert/bert_en_uncased_L-12_H-768_A-12': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'bert_multi_cased_L-12_H-768_A-12': + 'https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3', + 'albert_en_base': + 'https://tfhub.dev/tensorflow/albert_en_preprocess/3', + 'electra_small': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'electra_base': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'experts_pubmed': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'experts_wiki_books': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', + 'talking-heads_base': + 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', +} diff --git a/build/lib/deep_autoviml/preprocessing/preprocessing.py b/build/lib/deep_autoviml/preprocessing/preprocessing.py new file mode 100644 index 0000000..0e9779f --- /dev/null +++ b/build/lib/deep_autoviml/preprocessing/preprocessing.py @@ -0,0 +1,468 @@ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +import tempfile +import pdb +import copy +import warnings +warnings.filterwarnings(action='ignore') +import functools +# Make numpy values easier to read. 
+np.set_printoptions(precision=3, suppress=True) +from collections import defaultdict +import os +############################################################################################ +# data pipelines and feature engg here +from deep_autoviml.preprocessing.preprocessing_tabular import preprocessing_tabular +from deep_autoviml.preprocessing.preprocessing_nlp import preprocessing_nlp, aggregate_nlp_dictionaries +from deep_autoviml.preprocessing.preprocessing_tabular import encode_auto_inputs +from deep_autoviml.preprocessing.preprocessing_tabular import create_fast_inputs +from deep_autoviml.preprocessing.preprocessing_tabular import encode_all_inputs, create_all_inputs +from deep_autoviml.data_load.classify_features import find_remove_duplicates +from deep_autoviml.preprocessing.preprocessing_tabular import encode_nlp_inputs, create_nlp_inputs + + +# Utils +#from deep_autoviml.utilities.utilities import get_model_defaults +from deep_autoviml.modeling.create_model import get_model_defaults +from deep_autoviml.utilities.utilities import get_hidden_layers +from deep_autoviml.utilities.utilities import check_model_options + +############################################################################################ +# TensorFlow ≥2.4 is required +import tensorflow as tf +np.random.seed(42) +tf.random.set_seed(42) +from tensorflow.keras import layers +from tensorflow import keras +from tensorflow.keras.layers.experimental.preprocessing import Normalization, StringLookup +from tensorflow.keras.layers.experimental.preprocessing import IntegerLookup, CategoryEncoding +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization + +from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras.models import Model, load_model +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +from tensorflow.keras.layers import Dense, LSTM, GRU, Input, concatenate, Embedding +from tensorflow.keras.layers import Reshape, Activation, Flatten +import tensorflow_hub as hub + +from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error +from IPython.core.display import Image, display +import pickle + +##### Suppress all TF2 and TF1.x warnings ################### +##### Suppress all TF2 and TF1.x warnings ################### +tf2logger = tf.get_logger() +tf2logger.warning('Silencing TF2.x warnings') +tf2logger.root.removeHandler(tf2logger.root.handlers) +tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) +############################################################################################ +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D, AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D, GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ +#### probably the most handy function of all! 
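+#### left_subtract(l1, l2) returns the items of l1 that are not in l2, preserving order.
+#### In this module it is used to drop lat/lon columns from the continuous variables and to
+#### separate the NLP columns from the rest of the predictors before building the layers.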
+def left_subtract(l1,l2): + lst = [] + for i in l1: + if i not in l2: + lst.append(i) + return lst +############################################################################################# +def perform_preprocessing(train_ds, var_df, cat_vocab_dict, keras_model_type, + keras_options, model_options, verbose=0): + """ + Remember this is the most valuable part of this entire library! + This is one humongous preprocessing step to build everything needed for preprocessing into a keras model! + But it will break in some cases since we cannot handle every known dataset! + It will be good enough for most instances to create a fast keras pipeline + baseline model. + You can always fine tune it. + You can always create your own model and feed it once you have successfully created preprocessing pipeline. + """ + num_classes = model_options["num_classes"] + num_labels = model_options["num_labels"] + modeltype = model_options["modeltype"] + embedding_size = model_options["embedding_size"] + train_data_is_file = model_options['train_data_is_file'] + cat_feat_cross_flag = check_model_options(model_options,"cat_feat_cross_flag", False) + targets = cat_vocab_dict["target_variables"] + preds = cat_vocab_dict["predictors_in_train"] + ############ This is where you get all the classified features ######## + cats = var_df['categorical_vars'] ### these are low cardinality vars - you can one-hot encode them ## + high_string_vars = var_df['discrete_string_vars'] ## discrete_string_vars are high cardinality vars ## embed them! + bools = var_df['bools'] + int_cats = var_df['int_cats'] + var_df['int_bools'] + ints = var_df['int_vars'] + floats = var_df['continuous_vars'] + nlps = var_df['nlp_vars'] + lats = var_df['lat_vars'] + lons = var_df['lon_vars'] + floats = left_subtract(floats, lats+lons) + #### You must exclude NLP vars from this since they have their own preprocesing + NON_NLP_VARS = left_subtract(preds, nlps) + FEATURE_NAMES = bools + cats + high_string_vars + int_cats + ints + floats + NUMERIC_FEATURE_NAMES = int_cats + ints + ######## Check if booleans have to be treated as strings or as floats here ## + if train_data_is_file: + FLOATS = floats + CATEGORICAL_FEATURE_NAMES = cats + high_string_vars +bools + else: + FLOATS = floats + bools + CATEGORICAL_FEATURE_NAMES = cats + high_string_vars + ##################################################################### + + vocab_dict = defaultdict(list) + cats_copy = copy.deepcopy(CATEGORICAL_FEATURE_NAMES+NUMERIC_FEATURE_NAMES) + if len(cats_copy) > 0: + for each_name in cats_copy: + vocab_dict[each_name] = cat_vocab_dict[each_name]['vocab'] + + bools_copy = copy.deepcopy(bools) + if len(bools_copy) > 0: + for each_name in bools_copy: + vocab_dict[each_name] = ['True','False','missing'] + + + floats_copy = copy.deepcopy(FLOATS) + if len(floats_copy) > 0: + for each_float in floats_copy: + vocab_dict[each_float] = cat_vocab_dict[each_float]['vocab_min_var'] + ##### set the defaults for the LSTM or GRU model here ######################### + batch_size = 32 + # Convolution + kernel_size = 3 + filters = 128 + pool_size = 4 + + # LSTM + lstm_output_size = 96 + gru_units = 96 + + # Training + drop_out = 0.2 + if modeltype == 'Regression': + class_size = 1 + else: + if num_classes == 2: + class_size = 1 + else: + class_size = num_classes + ###### Now calculate some layer sizes ##### + data_size = cat_vocab_dict["DS_LEN"] + data_dim = data_size*len(FEATURE_NAMES) + dense_layer1, dense_layer2, dense_layer3 = get_hidden_layers(data_dim) + 
################################################################################# + ########### F E A T U R E P R E P R O C E S S I N G H E R E ####### + ################################################################################# + nlps = var_df['nlp_vars'] + keras_options, model_options, num_predicts, output_activation = get_model_defaults(keras_options, + model_options, targets) + ################## NLP Text Features are Proprocessed Here ################ + nlp_inputs = [] + nlp_names = [] + embedding = [] + ################## All other Features are Proprocessed Here ################ + ### make sure you include mixed_nlp and combined_nlp in this list since you want it separated + fast_models = ['fast','deep_and_wide','deep_wide','wide_deep', "mixed_nlp","combined_nlp", + 'wide_and_deep','deep wide', 'wide deep', 'fast1', + 'deep_and_cross', 'deep_cross', 'deep cross', 'fast2',"text"] + ############################################################################## + meta_outputs = [] + print('Preprocessing non-NLP layers for %s Keras model...' %keras_model_type) + + if not keras_model_type.lower() in fast_models: + ############################################################################################ + ############ I N "A U T O" M O D E L S we use Lat and Lon with NLP right here ######### + ############################################################################################ + if len(lats+lons) > 0: + print(' Now combine all numeric and non-numeric vars into a Deep only model...') + meta_outputs, meta_inputs, meta_names = preprocessing_tabular(train_ds, var_df, + cat_feat_cross_flag, model_options, + cat_vocab_dict, keras_model_type, verbose) + print(' All Non-NLP feature preprocessing completed.') + ### this is the order in which columns have been trained ### + if len(nlps) > 0: + print('Starting NLP string column layer preprocessing...') + nlp_inputs = create_nlp_inputs(nlps) + max_tokens_zip, seq_tokens_zip, embed_tokens_zip, vocab_train_small = aggregate_nlp_dictionaries(nlps, cat_vocab_dict, model_options) + nlp_encoded = encode_nlp_inputs(nlp_inputs, cat_vocab_dict) + ### we call nlp_outputs as embedding in this section of the program #### + print('NLP Preprocessing completed.') + #merged = [meta_outputs, nlp_encoded] + merged = layers.concatenate([nlp_encoded, meta_outputs]) + print(' combined categorical+numeric with nlp outputs successfully for %s model...' %keras_model_type) + nlp_inputs = list(nlp_inputs.values()) + else: + merged = meta_outputs + final_training_order = nlp_names + meta_names + ### find their dtypes - remember to use element_spec[0] for train data sets! + ds_types = dict([(col_name, train_ds.element_spec[0][col_name].dtype) for col_name in final_training_order ]) + col_type_tuples = [(name,ds_types[name]) for name in final_training_order] + if verbose >= 2: + print('Inferred column names, layers and types (double-check for duplicates and correctness!): \n%s' %col_type_tuples) + print(' %s model loaded and compiled successfully...' %keras_model_type) + else: + ############################################################################################ + #### In "auto" vs. "mixed_nlp", the NLP processings are different. Numeric process is same. 
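+ #### (wide = encode_auto_inputs, deep = encode_all_inputs, both using embeddings; each is
+ #### batch-normalized and combined with the encode_nlp_inputs output when NLP columns exist.)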
+ #### Here both NLP and NON-NLP varas are combined with embedding to form a deep wide model # + ############################################################################################ + print(' Now combine all numeric+cat+NLP vars into a Deep and Wide model') + ## Since we are processing NLPs separately we need to remove them from inputs ### + if len(NON_NLP_VARS) == 0: + print(' There are zero non-NLP variables in this dataset. No non-NLP preprocesing needed...') + meta_inputs = [] + else: + FEATURE_NAMES = left_subtract(FEATURE_NAMES, nlps) + dropout_rate = 0.1 + hidden_units = [dense_layer2, dense_layer3] + inputs = create_fast_inputs(FEATURE_NAMES, NUMERIC_FEATURE_NAMES, FLOATS) + #all_inputs = dict(zip(meta_names,meta_inputs)) + #### In auto models we want "wide" to be short. Hence use_embedding to be True. + wide = encode_auto_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, vocab_dict, + hidden_units, use_embedding=True) + wide = layers.BatchNormalization()(wide) + deep = encode_all_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, vocab_dict, + use_embedding=True) + deep = layers.BatchNormalization()(deep) + meta_inputs = list(inputs.values()) ### convert input layers to a list + #### If there are NLP vars in dataset, you must combine the nlp_outputs ## + print(' All Non-NLP feature preprocessing completed.') + if len(nlps) > 0: + print('Starting NLP string column layer preprocessing...') + nlp_inputs = create_nlp_inputs(nlps) + max_tokens_zip, seq_tokens_zip, embed_tokens_zip, vocab_train_small = aggregate_nlp_dictionaries(nlps, cat_vocab_dict, model_options) + nlp_encoded = encode_nlp_inputs(nlp_inputs, cat_vocab_dict) + ### we call nlp_outputs as embedding in this section of the program #### + print('NLP preprocessing completed.') + merged = [wide, deep, nlp_encoded] + print(' Combined wide, deep and nlp outputs successfully') + nlp_inputs = list(nlp_inputs.values()) + else: + merged = [wide, deep] + print(' %s combined wide and deep successfully...' %keras_model_type) + ### if NLP_outputs is NOT a list, it means there is some NLP variable in the data set + if not isinstance(merged, list): + print('Shape of output from all preprocessing layers before model training = %s' %(merged.shape,)) + return nlp_inputs, meta_inputs, merged, embedding + elif keras_model_type.lower() in ['mixed_nlp', 'combined_nlp']: + ### this is similar to auto models but uses TFHub models for NLP preprocessing ##### + if len(NON_NLP_VARS) == 0: + print(' Non-NLP vars is zero in this dataset. No tabular preprocesing needed...') + meta_inputs = [] + else: + ############################################################################################ + #### In "auto" vs. "mixed_nlp", the NLP processings are different. Numeric process is same. 
+ ############################################################################################ + print(' Now combine all numeric and non-numeric vars into a Deep and Wide model...') + #### Here both NLP and NON-NLP varas are combined with embedding to form a deep wide model # + FEATURE_NAMES = left_subtract(FEATURE_NAMES, nlps) + dropout_rate = 0.5 + hidden_units = [dense_layer2, dense_layer3] + inputs = create_fast_inputs(FEATURE_NAMES, NUMERIC_FEATURE_NAMES, FLOATS) + #all_inputs = dict(zip(meta_names,meta_inputs)) + wide = encode_auto_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, vocab_dict, + hidden_units, use_embedding=False) + wide = layers.BatchNormalization()(wide) + deep = encode_all_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, vocab_dict, + use_embedding=True) + deep = layers.BatchNormalization()(deep) + meta_inputs = list(inputs.values()) ### convert input layers to a list + print(' All Non-NLP feature preprocessing completed.') + #### If there are NLP vars in dataset, you use TFHub models in this case ## + if len(nlps) > 0: + print('Starting NLP string column layer preprocessing...') + nlp_inputs, embedding, nlp_names = mixed_preprocessing_nlp(train_ds, model_options, + var_df, cat_vocab_dict, + keras_model_type, verbose) + ### we call nlp_outputs as embedding in this section of the program #### + print(' NLP Preprocessing completed.') + else: + print('There are no NLP variables in this dataset for preprocessing...') + embedding = [] + if isinstance(embedding, list): + ### This means embedding is an empty list with nothing in it ### + meta_outputs = layers.concatenate([wide, deep]) + print(' Combined wide, deep layers successfully.') + else: + meta_outputs = layers.concatenate([wide, deep, embedding]) + print(' Combined wide, deep and NLP (with TFHub) successfully.') + else: + meta_inputs = [] + ##### You need to send in the ouput from embedding layer to this sequence of layers #### + + nlp_outputs = [] + if not isinstance(embedding, list): + if keras_model_type.lower() in ['bert','text', 'use',"nnlm"]: + ###### This is where you define the NLP Embedded Layers ######## + #x = layers.Dense(64, activation='relu')(embedding) + #x = layers.Dense(32, activation='relu')(x) + #nlp_outputs = layers.Dropout(0.2)(x) + #nlp_outputs = layers.Dropout(0.2)(embedding) + if isinstance(meta_outputs, list): + #### if there are no other variables then leave it as embedding output + nlp_outputs = embedding + else: + #### If there are other variables, then convert this embedding to an output + nlp_outputs = layers.Dense(num_predicts, activation=output_activation)(embedding) + elif keras_model_type.lower() in ['lstm']: + x = layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))(embedding) + x = layers.Bidirectional(tf.keras.layers.LSTM(64))(x) + x = layers.Dense(64, activation='relu')(x) + x = layers.Dense(32, activation='relu')(x) + x = layers.Dropout(0.2)(x) + nlp_outputs = layers.Dense(num_predicts, activation=output_activation)(x) + # = layers.Bidirectional(layers.LSTM(dense_layer1, dropout=0.3, recurrent_dropout=0.3, + # return_sequences=False, batch_size=batch_size, + # kernel_regularizer=regularizers.l2(0.01)))(x) + + elif keras_model_type.lower() in ['cnn1']: + # Conv1D + global max pooling + x = Conv1D(dense_layer1, 14, name='cnn_dense1', padding="same", + activation="relu", strides=3)(embedding) + x = GlobalMaxPooling1D()(x) + nlp_outputs = layers.Dense(num_predicts, activation=output_activation)(x) + elif keras_model_type.lower() in fast_models: + # We add a 
vanilla hidden layer that's all + #nlp_outputs = layers.Dense(num_predicts, activation=output_activation)(embedding) + nlp_outputs = embedding + elif keras_model_type.lower() in ['gru','cnn2']: + #### Use this only for Binary-Class classification problems ######## + #### LSTM with 1D convnet with maxpooling ######## + x = Conv1D(filters, + kernel_size, + padding='valid', + activation='relu', + strides=1)(embedding) + x = MaxPooling1D(pool_size=pool_size)(x) + x = GRU(units=gru_units, dropout=drop_out, recurrent_dropout=drop_out)(x) + if modeltype == 'Regression': + #nlp_outputs = Dense(class_size, activation='linear')(x) + x = Dense(128, activation='relu')(x) + else: + #nlp_outputs = Dense(class_size, activation='sigmoid')(x) + x = Dense(128, activation='relu')(x) + nlp_outputs = layers.Dense(num_predicts, activation=output_activation)(x) + elif keras_model_type.lower() in ['cnn']: + #### Use this only for future Multi-Class classification problems ######### + #### CNN Model: create a 1D convnet with global maxpooling ######## + x = Conv1D(128, kernel_size, activation='relu')(embedding) + x = MaxPooling1D(kernel_size)(x) + x = Conv1D(128, kernel_size, activation='relu')(x) + x = MaxPooling1D(kernel_size)(x) + x = Conv1D(128, kernel_size, activation='relu')(x) + x = GlobalMaxPooling1D()(x) + x = Dense(128, activation='relu')(x) + #nlp_outputs = Dense(class_size, activation='softmax')(x) + nlp_outputs = layers.Dense(num_predicts, activation=output_activation)(x) + + #### This is only for all "fast" and "auto" with latitude and longitude columns ## + if isinstance(meta_outputs, list): + ### if meta_outputs is a list, it means there is no int, float or cat variable in this data set + print('There is no numeric or cat or int variables in this data set.') + if isinstance(nlp_outputs, list): + ### if NLP_outputs is a list, it means there is no NLP variable in the data set + print('There is no NLP variable in this data set. Returning') + consolidated_outputs = meta_outputs + else: + print('Shape of encoded NLP variables just before training: %s' %(nlp_outputs.shape,)) + consolidated_outputs = nlp_outputs + else: + print('Shape of non-NLP encoded variables just before model training = %s' %(meta_outputs.shape,)) + if isinstance(nlp_outputs, list): + ### if NLP_outputs is a list, it means there is no NLP variable in the data set + print(' There is no NLP variable in this data set. 
Continuing...') + #x = layers.concatenate([meta_outputs]) + consolidated_outputs = meta_outputs + else: + ### if NLP_outputs is NOT a list, it means there is some NLP variable in the data set + print(' Shape of encoded NLP variables just before training: %s' %(nlp_outputs.shape,)) + consolidated_outputs = layers.concatenate([nlp_outputs, meta_outputs]) + print('Shape of output from all preprocessing layers before model training = %s' %(consolidated_outputs.shape,)) + return nlp_inputs, meta_inputs, consolidated_outputs, nlp_outputs +########################################################################################## +def mixed_preprocessing_nlp(train_ds, model_options, + var_df, cat_vocab_dict, + keras_model_type, verbose=0): + """ + This is only for mixed NLP preprocessing of tabular and nlp datasets + """ + nlp_inputs = [] + all_nlp_encoded = [] + all_nlp_embeddings = [] + nlp_col_names = [] + nlp_columns = var_df['nlp_vars'] + nlp_columns = list(set(nlp_columns)) + + if len(nlp_columns) == 1: + nlp_column = nlp_columns[0] + elif keras_model_type.lower() == 'combined_nlp': + nlp_column = 'combined_nlp_text' ### this is when there are multiple nlp columns ## + else: + ### This is to keep nlp columns separate ### + nlp_column = '' + + #### Now perform NLP preproprocessing for each nlp_column ###### + ######### This is where we load Swivel model and process each nlp column ### + try: + bert_model_name = "Swivel-20" + if os.name == 'nt': + tfhub_path = os.path.join(keras_model_type, 'tf_cache') + os.environ['TFHUB_CACHE_DIR'] = tfhub_path + tfhub_handle_encoder = 'https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1' + else: + tfhub_handle_encoder = 'https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1' + hub_layer = hub.KerasLayer(tfhub_handle_encoder, + input_shape=[], + dtype=tf.string, + trainable=False, name="Swivel20_encoder") + print(f' {bert_model_name} selected from: {tfhub_handle_encoder}') + ### this is for mixed nlp models. You use Swivel to embed NLP columns fast #### + if len(nlp_columns) > 1: + copy_nlp_columns = copy.deepcopy(nlp_columns) + for each_nlp in copy_nlp_columns: + nlp_input = tf.keras.Input(shape=(), dtype=tf.string, name=each_nlp) + nlp_inputs.append(nlp_input) + x = hub_layer(nlp_input) + all_nlp_encoded.append(x) + nlp_col_names.append(each_nlp) + else: + nlp_input = tf.keras.Input(shape=(), dtype=tf.string, name=nlp_column) + x = hub_layer(nlp_input) + ### Now we combine all inputs and outputs in one place here ########### + nlp_inputs.append(nlp_input) + all_nlp_encoded.append(x) + nlp_col_names.append(nlp_column) + except: + print(' Error: Skipping %s for keras layer preprocessing...' %nlp_column) + ### we gather all outputs above into a single list here called all_features! 
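+    ### For reference: the Swivel handle used above (gnews-swivel-20dim/1) maps a batch of
+    ### raw strings of shape (None,) to sentence embeddings of shape (None, 20), so each NLP
+    ### column contributes a 20-dimensional vector to whatever gets concatenated below.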
+ if len(all_nlp_encoded) == 0: + print('There are no NLP string variables in this dataset to preprocess!') + elif len(all_nlp_encoded) == 1: + all_nlp_embeddings = all_nlp_encoded[0] + else: + all_nlp_embeddings = layers.concatenate(all_nlp_encoded) + + return nlp_inputs, all_nlp_embeddings, nlp_col_names +################################################################################# diff --git a/build/lib/deep_autoviml/preprocessing/preprocessing_images.py b/build/lib/deep_autoviml/preprocessing/preprocessing_images.py new file mode 100644 index 0000000..a5c4269 --- /dev/null +++ b/build/lib/deep_autoviml/preprocessing/preprocessing_images.py @@ -0,0 +1,116 @@ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +import tempfile +import pdb +import copy +import warnings +warnings.filterwarnings(action='ignore') +import functools +from itertools import combinations +from collections import defaultdict + +# Make numpy values easier to read. +np.set_printoptions(precision=3, suppress=True) +############################################################################################ +# data pipelines and feature engg here + +# pre-defined TF2 Keras models and your own models here +from deep_autoviml.data_load.classify_features import check_model_options + +# Utils + +############################################################################################ +# TensorFlow ≥2.4 is required +import tensorflow as tf +np.random.seed(42) +tf.random.set_seed(42) +from tensorflow.keras import layers +from tensorflow import keras +from tensorflow.keras.layers.experimental.preprocessing import Normalization, StringLookup, Hashing +from tensorflow.keras.layers.experimental.preprocessing import IntegerLookup, CategoryEncoding, CategoryCrossing +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization, Discretization +from tensorflow.keras.layers import Embedding, Flatten + +from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras.models import Model, load_model +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +import tensorflow_hub as hub + + +from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error +from IPython.core.display import Image, display +import pickle +############################################################################################# +##### Suppress all TF2 and TF1.x warnings ################### +try: + tf.logging.set_verbosity(tf.logging.ERROR) +except: + tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) 
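+##### Note: tf.logging exists only in TF 1.x, so on TF 2.x the except branch above falls
+##### back to the tf.compat.v1 logging API and warnings stay suppressed on either version.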
+############################################################################################ +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D, AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D, GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ +def preprocessing_images(train_ds, model_options): + """ + This produces a preprocessing layer for an incoming tf.data.Dataset. It can be images only. + You need to just send in a tf.data.DataSet from the training folder and a model_options dictionary. + It will return a full-model-ready layer that you can add to your Keras Functional model as image layer! + ########### Motivation and suggestions for coding for Image processing came from this blog ######### + Greatly indebted to Srivatsan for his Github and notebooks: https://github.com/srivatsan88/YouTubeLI + #################################################################################################### + """ + try: + ####### L O A D F E A T U R E E X T R A C T O R ################ + url = "https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/4" + feature_extractor = check_model_options(model_options, "tf_hub_model", url) + img_height = model_options["image_height"] + img_width = model_options["image_width"] + image_channels = model_options["image_channels"] + num_predicts = model_options["num_predicts"] + try: + feature_extractor_layer = hub.KerasLayer(feature_extractor, input_shape=( + img_height,img_width,image_channels)) + except: + print('Loading model from Tensorflow Hub failed. Check the URL and try again...') + return + feature_extractor_layer.trainable = False + normalization_layer = tf.keras.layers.experimental.preprocessing.Rescaling(1./255) + tf.random.set_seed(111) + model = tf.keras.Sequential([ + normalization_layer, + feature_extractor_layer, + tf.keras.layers.Dropout(0.3), + tf.keras.layers.Dense(num_predicts,activation='softmax') + ]) + model.compile( + optimizer='adam', + loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True), + metrics=['accuracy']) + except: + print(' Error: Failed image preprocessing layer. Returning...') + return + return model diff --git a/build/lib/deep_autoviml/preprocessing/preprocessing_nlp.py b/build/lib/deep_autoviml/preprocessing/preprocessing_nlp.py new file mode 100644 index 0000000..fc18fc2 --- /dev/null +++ b/build/lib/deep_autoviml/preprocessing/preprocessing_nlp.py @@ -0,0 +1,405 @@ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
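+############################################################################################
+# Rough usage sketch (illustrative only, based on the signature defined further below):
+#   nlp_inputs, nlp_embeddings, nlp_names = preprocessing_nlp(train_ds, model_options,
+#                                               var_df, cat_vocab_dict, keras_model_type)
+# where train_ds is a batched tf.data.Dataset of (features, labels). The returned inputs and
+# embeddings can then be wired into a Keras Functional model downstream.
+############################################################################################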
+############################################################################################ +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +import tempfile +import pdb +import copy +import warnings +warnings.filterwarnings(action='ignore') +import functools +from itertools import combinations +from collections import defaultdict + +# Make numpy values easier to read. +np.set_printoptions(precision=3, suppress=True) +############################################################################################ +# data pipelines and feature engg here +from deep_autoviml.data_load.classify_features import check_model_options +from deep_autoviml.data_load.classify_features import find_remove_duplicates + +# pre-defined TF2 Keras models and your own models here +from deep_autoviml.models.tf_hub_lookup import map_hub_to_name, map_name_to_handle +from deep_autoviml.models.tf_hub_lookup import map_name_to_preprocess + +# Utils + +############################################################################################ +# TensorFlow ≥2.4 is required +import tensorflow as tf +np.random.seed(42) +tf.random.set_seed(42) +from tensorflow.keras import layers +from tensorflow import keras +from tensorflow.keras.layers.experimental.preprocessing import Normalization, StringLookup, Hashing +from tensorflow.keras.layers.experimental.preprocessing import IntegerLookup, CategoryEncoding, CategoryCrossing +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization, Discretization +from tensorflow.keras.layers import Embedding, Flatten + +from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras.models import Model, load_model +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +import tensorflow_hub as hub + + +from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error +from IPython.core.display import Image, display +import pickle +############################################################################################# +##### Suppress all TF2 and TF1.x warnings ################### +tf2logger = tf.get_logger() +tf2logger.warning('Silencing TF2.x warnings') +tf2logger.root.removeHandler(tf2logger.root.handlers) +tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) +############################################################################################ +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D, AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D, GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ +#### probably the most handy function of all! +def left_subtract(l1,l2): + lst = [] + for i in l1: + if i not in l2: + lst.append(i) + return lst +############################################################################################## +import os +########################################################################################### +# We remove punctuations and HTMLs from tweets. 
This is done in a function,
+# so that it can be passed as a parameter to the TextVectorization object.
+import re
+import string
+def custom_standardization(input_data):
+    lowercase = tf.strings.lower(input_data)
+    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
+    return tf.strings.regex_replace(
+        stripped_html, "[%s]" % re.escape(string.punctuation), ""
+    )
+##############################################################################################
+def closest(lst, K):
+    """
+    Find the number in list lst that is closest to the value K.
+    """
+    return lst[min(range(len(lst)), key = lambda i: abs(lst[i]-K))]
+##############################################################################################
+def preprocessing_nlp(train_ds, model_options, var_df, cat_vocab_dict, keras_model_type, verbose=0):
+    """
+    This produces a preprocessing layer for an incoming NLP column using TextVectorization from keras.
+    Just send in a tf.data.Dataset from the training portion of your dataset and an nlp_column name.
+    It will return a full-model-ready layer that you can add to your Keras Functional model as an NLP_layer!
+    max_tokens_zip is a dictionary of each NLP column name and its max_tokens as defined by train data.
+    """
+    nlp_inputs = []
+    all_nlp_encoded = []
+    all_nlp_embeddings = []
+    nlp_col_names = []
+    nlp_columns = var_df['nlp_vars']
+    nlp_columns = list(set(nlp_columns))
+    fast_models = ['fast']
+
+    fast_models1 = ['deep_and_wide','deep_wide','wide_deep',
+                        'wide_and_deep','deep wide', 'wide deep', 'fast1',
+                        'deep_and_cross', 'deep_cross', 'deep cross', 'fast2']
+
+    max_tokens_zip, seq_tokens_zip, embed_tokens_zip, vocab_train_small = aggregate_nlp_dictionaries(
+                                        nlp_columns, cat_vocab_dict, model_options, verbose)
+
+    if len(nlp_columns) == 1:
+        nlp_column = nlp_columns[0]
+    else:
+        nlp_column = 'combined_nlp_text' ### this is used when there are multiple nlp columns ##
+
+    ### Find the best sizes for the various dimensions here ###########
+    seq_lengths = list(seq_tokens_zip.values())
+    maximum_sequence_length = max(seq_lengths)
+    ## ideally you should create an unduplicated list of vocabs here and find its size
+    ### the vocab_train_small holds the entire vocab of the train_small data set!
+    max_vocab_size = len(vocab_train_small) + 10
+    best_embedding_size = max(list(embed_tokens_zip.values()))
+    print('Max vocab size = %s' %max_vocab_size)
+
+    ###### Let us set up the defaults for embedding size and max tokens to process each column
+    NLP_VARS = copy.deepcopy(nlp_columns)
+    max_features = max_vocab_size ## this is the size of the vocab of the whole corpus
+    embedding_dim = best_embedding_size ## this is the vector size
+    sequence_length = maximum_sequence_length ## this is the length of each sentence consisting of words
+
+    #### Now perform NLP preprocessing for each nlp_column ######
+    tf_hub_model = model_options["tf_hub_model"]
+    tf_hub = False
+    if not tf_hub_model == "":
+        print('Using Tensorflow Hub model: %s given as input' %tf_hub_model)
+        tf_hub = True
+    ##### This is where we use different pre-trained models to create word and sentence embeddings ##
+    if keras_model_type.lower() in ['bert']:
+        print('Loading %s model, this will take time...'
%keras_model_type) + if os.name == 'nt': + tfhub_path = os.path.join(keras_model_type, 'tf_cache') + os.environ['TFHUB_CACHE_DIR'] = tfhub_path + if tf_hub: + tfhub_handle_encoder = model_options['tf_hub_model'] + try: + bert_model_name = map_hub_to_name[tfhub_handle_encoder] + tfhub_handle_preprocess = map_name_to_preprocess[bert_model_name] + except: + bert_model_name = 'BERT_given_by_user_input' + tfhub_handle_preprocess = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3" + else: + bert_model_name = "BERT Uncased Small" + tfhub_handle_preprocess = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3" + tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2' + preprocessor = hub.KerasLayer(tfhub_handle_preprocess, name='BERT_preprocessing') + encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder') + print(f' {bert_model_name} selected: {tfhub_handle_encoder}') + print(f' Preprocessor auto-selected: {tfhub_handle_preprocess}') + elif keras_model_type.lower() in ["use"]: + print('Loading %s model this will take time...' %keras_model_type) + if os.name == 'nt': + tfhub_path = os.path.join(keras_model_type, 'tf_cache') + os.environ['TFHUB_CACHE_DIR'] = tfhub_path + if tf_hub: + bert_model_name = "USE given" + tfhub_handle_encoder = model_options['tf_hub_model'] + else: + bert_model_name = "Universal Sentence Encoder4" + tfhub_handle_encoder = "https://tfhub.dev/google/universal-sentence-encoder/4" + encoder = hub.KerasLayer(tfhub_handle_encoder, + input_shape=[], + dtype=tf.string, + trainable=True, name='USE4_encoder') + print(f' {bert_model_name} selected: {tfhub_handle_encoder}') + elif keras_model_type.lower() in fast_models1: + bert_model_name = "fast NNLM 50 with Normalization" + if os.name == 'nt': + tfhub_path = os.path.join(keras_model_type, 'tf_cache') + os.environ['TFHUB_CACHE_DIR'] = tfhub_path + tfhub_handle_encoder = 'https://tfhub.dev/google/nnlm-en-dim50-with-normalization/2' + else: + tfhub_handle_encoder = 'https://tfhub.dev/google/nnlm-en-dim50-with-normalization/2' + hub_layer = hub.KerasLayer(tfhub_handle_encoder, + input_shape=[], + dtype=tf.string, + trainable=False, name="NNLM50_encoder") + print(f' {bert_model_name} selected from: {tfhub_handle_encoder}') + elif keras_model_type.lower() in ["nlp"]: + bert_model_name = "Swivel-20" + if os.name == 'nt': + tfhub_path = os.path.join(keras_model_type, 'tf_cache') + os.environ['TFHUB_CACHE_DIR'] = tfhub_path + tfhub_handle_encoder = 'https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1' + else: + tfhub_handle_encoder = 'https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1' + hub_layer = hub.KerasLayer(tfhub_handle_encoder, + input_shape=[], + dtype=tf.string, + trainable=False, name="Swivel20_encoder") + print(f' {bert_model_name} selected from: {tfhub_handle_encoder}') + elif keras_model_type.lower() in fast_models: + #### For fast models you just use Vectorization and Embedding that's all ####### + # Use the text vectorization layer to normalize, split, and map strings to + # integers. Note that the layer uses the custom standardization defined above. + # Set maximum_sequence length as all samples are not of the same length. + ### if you used custom_standardization function, you cannot load the saved model!! be careful! 
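+        ### Illustration only (hypothetical sentence): with output_mode="int", the vectorizer
+        ### defined below turns a string like "great fast delivery" into a fixed-length integer
+        ### sequence of length `sequence_length`, which then feeds the Embedding layer created
+        ### further down for the 'fast' models.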
+ bert_model_name = 'Text Vectorization' + vectorize_layer = TextVectorization( + standardize='lower_and_strip_punctuation', + max_tokens=max_features, + output_mode="int", + split="whitespace", + ngrams=None, + output_sequence_length=sequence_length, + pad_to_max_tokens=True, + vocabulary=vocab_train_small, + ) + print(f' {bert_model_name} selected along with Embedding layer') + else: + #### This is for auto model option. You can ignore their models in tfhub in that case + #### If they give the default NLP or text as input, then we would use a default model. + bert_model_name = 'Swivel_20_model' + #bert_model_name = "Auto NNLM 50 with Normalization" + tfhub_handle_encoder = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1" + #if os.name == 'nt': + # tfhub_path = os.path.join(keras_model_type, 'tf_cache') + # os.environ['TFHUB_CACHE_DIR'] = tfhub_path + # tfhub_handle_encoder = 'https://tfhub.dev/google/nnlm-en-dim50-with-normalization/2' + #else: + # tfhub_handle_encoder = 'https://tfhub.dev/google/nnlm-en-dim50-with-normalization/2' + hub_layer = hub.KerasLayer(tfhub_handle_encoder, output_shape=[20], + input_shape=[], + dtype=tf.string, trainable=False, name="Swivel_encoder") + #hub_layer = hub.KerasLayer(tfhub_handle_encoder, + # input_shape=[], + # dtype=tf.string, trainable=False, name="NNLM50_encoder") + print(f' {bert_model_name} selected from: {tfhub_handle_encoder}') + + #### Next, we add an NLP layer to map those vocab indices into a space of dimensionality + #### Vocabulary size defines how many unique words you think might be in that training data + ### Sequence length defines how we should convert each word into a sequence of integers of fixed length + + #### Now let us process all NLP columns by using embeddings from Keras #### + ###### A string input for each string column ############################### + ##### Now we handle multiple choices in embedding and model building ### + try: + if keras_model_type.lower() in ['bert']: + nlp_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name=nlp_column) + ### You need to do some special pre-processing if it is a BERT model + x = encoder(preprocessor(nlp_input))['pooled_output'] + elif keras_model_type.lower() in ["use"]: + nlp_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name=nlp_column) + ### You need to do some special pre-processing if it is a BERT model + x = encoder(nlp_input) + elif keras_model_type.lower() in fast_models: + nlp_input = tf.keras.Input(shape=(), dtype=tf.string, name=nlp_column) + x = vectorize_layer(nlp_input) + x = layers.Embedding(max_features+1, embedding_dim, input_length=sequence_length, name=nlp_column+'_embedding')(x) + x = Flatten()(x) + elif keras_model_type.lower() in ["nlp"]: + ### this is for NLP models. You use Swivel to embed NLP columns fast #### + for each_nlp in nlp_columns: + nlp_input = tf.keras.Input(shape=(), dtype=tf.string, name=each_nlp) + nlp_inputs.append(nlp_input) + elif keras_model_type.lower() in fast_models1: + ### this is for AUTO models. You use NNLM or NLNM to embed NLP columns fast #### + nlp_input = tf.keras.Input(shape=(), dtype=tf.string, name=nlp_column) + x = hub_layer(nlp_input) + else: + ### this is for AUTO models. 
You use Swivel to embed NLP columns fast ####
+            nlp_input = tf.keras.Input(shape=(), dtype=tf.string, name=nlp_column)
+            x = hub_layer(nlp_input)
+        ### Now we combine all inputs and outputs in one place here ###########
+        nlp_inputs.append(nlp_input)
+        all_nlp_encoded.append(x)
+        nlp_col_names.append(nlp_column)
+    except:
+        print('    Error: Skipping %s for keras layer preprocessing...' %nlp_column)
+    ### we gather all outputs above into a single list here called all_features!
+    if len(all_nlp_encoded) == 0:
+        print('There are no NLP string variables in this dataset to preprocess!')
+    elif len(all_nlp_encoded) == 1:
+        all_nlp_embeddings = all_nlp_encoded[0]
+    else:
+        all_nlp_embeddings = layers.concatenate(all_nlp_encoded)
+
+    return nlp_inputs, all_nlp_embeddings, nlp_col_names
+###############################################################################################
+def one_hot_encode_categorical_target(features, labels, categories):
+    """Returns a one-hot encoded tensor representing categorical values."""
+    # The entire encoding can fit on one line:
+    labels = tf.cast(tf.equal(categories, tf.reshape(labels, [-1, 1])), tf.int32)
+    return (features, labels)
+##############################################################################################
+def convert_classifier_targets(labels):
+    """
+    This handy function converts target labels that are binary or multi-class (whether integer or string) into integers.
+    It works like a scikit-learn label encoder but operates on label tensors from tensorflow tf.data.Datasets.
+    Just send in the labels and it returns the same labels encoded as integers.
+    """
+    _, converted_labels = tf.unique(labels)
+    return converted_labels
+#########################################################################################
+def compare_two_datasets_with_idcol(train_ds, validate_ds, idcol, verbose=0):
+    ls_test = list(validate_ds.as_numpy_iterator())
+    ls_train = list(train_ds.as_numpy_iterator())
+    if verbose >= 1:
+        print('    Size of dataset 1 = %d' %(len(ls_train)))
+        print('    Size of dataset 2 = %d' %(len(ls_test)))
+    ts_list = [ls_test[x][0][idcol] for x in range(len(ls_test))]
+    tra_list = [ls_train[x][0][idcol] for x in range(len(ls_train))]
+    print('Alert! %d rows in common between dataset 1 and 2' %(len(tra_list) - len(left_subtract(tra_list, ts_list))))
+##########################################################################################
+def process_continuous_data(data):
+    # Min-max normalize data to the 0-1 range
+    max_data = tf.reduce_max(data)
+    min_data = tf.reduce_min(data)
+    data = (tf.cast(data, tf.float32) - min_data)/(max_data - min_data)
+    return tf.reshape(data, [-1, 1])
+##########################################################################################
+# Process continuous features.
+def preprocess(features, labels): + for feature in floats: + features[feature] = process_continuous_data(features[feature]) + return features, labels +########################################################################################## +def encode_NLP_column(train_ds, nlp_column, nlp_input, vocab_size, sequence_length): + text_ds = train_ds.map(lambda x,y: x[nlp_column]) + vectorize_layer = TextVectorization( + #standardize=custom_standardization, + standardize = 'lower_and_strip_punctuation', + max_tokens=vocab_size, + output_mode='int', + output_sequence_length=sequence_length) + # Tensorflow uses the word "adapt" to mean "fit" when learning vocabulary from a data set + # You must call adapt first on a training data set and let it learn from that data set + vectorize_layer.adapt(text_ds) + + ###### This is where you put NLP embedding layer into your data #### + nlp_vectorized = vectorize_layer(nlp_input) + ### Sometimes the get_vocabulary() errors due to special chars in utf-8. Hence avoid it. + #print(f" {nlp_column} vocab size = {vocab_size}, sequence_length={sequence_length}") + return nlp_vectorized +################################################################################################ +def aggregate_nlp_dictionaries(nlp_columns, cat_vocab_dict, model_options, verbose=0): + """ + This function aggregates all the dictionaries you need for nlp processing. + Just send in a list of nlp variables and a small data sample and it will compute all + the seq lengths, embedding_dims and vocabs for each nlp variable in the input list. + """ + lst = [8, 16, 24, 32, 48, 64, 96, 128, 256] + #### max_tokens_zip calculate the max number of unique words in a vocabulary #### + max_tokens_zip = defaultdict(int) + #### seq_tokens_zip calculates the max sequence length in a vocabulary #### + seq_tokens_zip = defaultdict(int) + #### embed_tokens_zip calculates the embedding dimension for each nlp_column #### + embed_tokens_zip = defaultdict(int) + #### This carries the + nlps_copy = copy.deepcopy(nlp_columns) + seq_lengths = [] + vocab_train_small = [] + if len(nlps_copy) > 0: + vocab_train_small = [] + for each_name in nlps_copy: + if verbose >= 2: + print('Creating aggregate_nlp_dictionaries for nlp column = %s' %each_name) + max_tokens_zip[each_name] = cat_vocab_dict[each_name]['size_of_vocab'] + print(' size of vocabulary = %s' %max_tokens_zip[each_name]) + seq_tokens_zip[each_name] = cat_vocab_dict[each_name]['seq_length'] + seq_lengths.append(seq_tokens_zip[each_name]) + if verbose >= 2: + print(' sequence length = %s' %seq_tokens_zip[each_name]) + vocab_size = cat_vocab_dict[each_name]['size_of_vocab'] + vocab_train_small += cat_vocab_dict[each_name]['vocab'] + vocab_train_small = np.unique(vocab_train_small).tolist() + best_embedding_size = closest(lst, vocab_size//50000) + if verbose >= 2: + print(' recommended embedding_size = %s' %best_embedding_size) + input_embedding_size = check_model_options(model_options, "embedding_size", best_embedding_size) + if input_embedding_size != best_embedding_size: + if verbose >= 2: + print(' input embedding size given as %d. Overriding recommended embedding_size...' 
%input_embedding_size) + best_embedding_size = input_embedding_size + embed_tokens_zip[each_name] = best_embedding_size + return max_tokens_zip, seq_tokens_zip, embed_tokens_zip, vocab_train_small +################################################################################################## \ No newline at end of file diff --git a/build/lib/deep_autoviml/preprocessing/preprocessing_tabular.py b/build/lib/deep_autoviml/preprocessing/preprocessing_tabular.py new file mode 100644 index 0000000..7c260b0 --- /dev/null +++ b/build/lib/deep_autoviml/preprocessing/preprocessing_tabular.py @@ -0,0 +1,1692 @@ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +######################################################################################################################## +###### Many thanks to Hasan Rafiq for his excellent tutorial on Tensorflow pipelines where I learnt many helpful hints: +###### https://colab.research.google.com/gist/rafiqhasan/6f00aecf1feafd83ba9dfefef8907ee8/dl-e2e-taxi-dataset-tf2-keras.ipynb +###### Watch the entire video below on Srivatsan Srinivasan's excellent YouTube channel: AI Engineering ############## +###### https://youtu.be/wPri78CFSEw ############## +######################################################################################################################## +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +import tempfile +import pdb +import copy +import warnings +warnings.filterwarnings(action='ignore') +import functools +# Make numpy values easier to read. 
+np.set_printoptions(precision=3, suppress=True) +############################################################################################ +# data pipelines and feature engg here + +# pre-defined TF2 Keras models and your own models here + +# Utils + +############################################################################################ +# TensorFlow ≥2.4 is required +import tensorflow as tf +np.random.seed(42) +tf.random.set_seed(42) +from tensorflow.keras import layers +from tensorflow import keras +from tensorflow.keras.layers.experimental.preprocessing import Normalization, StringLookup, Hashing +from tensorflow.keras.layers.experimental.preprocessing import IntegerLookup, CategoryEncoding, CategoryCrossing +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization, Discretization +from tensorflow.keras.layers import Embedding, Flatten + +from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras.models import Model, load_model +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers + +from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error +from IPython.core.display import Image, display +import pickle +############################################################################################# +##### Suppress all TF2 and TF1.x warnings ################### +tf2logger = tf.get_logger() +tf2logger.warning('Silencing TF2.x warnings') +tf2logger.root.removeHandler(tf2logger.root.handlers) +tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) +######################################################################################################################## +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D, AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D, GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +#### probably the most handy function of all! +def left_subtract(l1,l2): + lst = [] + for i in l1: + if i not in l2: + lst.append(i) + return lst +############################################################################################################################# +###### Many thanks to ML Design Patterns by Lak Lakshmanan which provided the following date-time TF functions below: +###### You can find more at : https://github.com/GoogleCloudPlatform/ml-design-patterns/tree/master/02_data_representation +############################################################################################################################ +import datetime +def get_dayofweek(s): + DAYS = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat','Sun'] + ts = parse_datetime(s) + return DAYS[ts.weekday()] + +def get_monthofyear(s): + MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun','Jul','Aug','Sep','Oct','Nov','Dec'] + ts = parse_datetime(s) + return MONTHS[ts.month-1] + +def get_hourofday(s): + ts = parse_datetime(s) + return str(ts.hour) + +@tf.function +def dayofweek(ts_in): + """ + This function converts dayofweek as a number to a string such as 4 means Thursday in dayofweek format. 
+ """ + return tf.map_fn( + lambda dayofweek_number: tf.py_function(get_dayofweek, inp=[dayofweek_number], Tout=tf.string), + ts_in) + +@tf.function +def hourofday(ts_in): + """ + This function converts dayofweek as a number to a string such as 4 means Thursday in dayofweek format. + """ + return tf.map_fn( + lambda dayofweek_number: tf.py_function(get_hourofday, inp=[dayofweek_number], Tout=tf.string), + ts_in) + +@tf.function +def monthofyear(ts_in): + """ + This function converts dayofweek as a number to a string such as 4 means Thursday in dayofweek format. + """ + return tf.map_fn( + lambda dayofweek_number: tf.py_function(get_monthofyear, inp=[dayofweek_number], Tout=tf.string), + ts_in) + + +def parse_datetime(timestring): + if type(timestring) is not str: + timestring = timestring.numpy().decode('utf-8') # if it is a Tensor + return pd.to_datetime(timestring, infer_datetime_format=True, errors='coerce') + +########################################################################################################## +from itertools import combinations +from collections import defaultdict +import copy +import time +def preprocessing_tabular(train_ds, var_df, cat_feature_cross_flag, model_options, cat_vocab_dict, + keras_model_type,verbose=0): + """ + ############################################################################################ + # This preprocessing layer returns a tuple (all_features, all_inputs) as arguments to create_model function + # You must then create a Functional model by transforming all_features into outputs like this: + # The final step in create_model will use all_inputs as inputs + x = tf.keras.layers.Dense(32, activation="relu")(all_features) + x = tf.keras.layers.Dropout(0.5)(x) + output = tf.keras.layers.Dense(1)(x) + model = tf.keras.Model(all_inputs, output) + ############################################################################################ + """ + start_time = time.time() + drop_cols = var_df['cols_delete'] + train_data_is_file = model_options['train_data_is_file'] + ######### Now that you have the variable classification dictionary, just separate them out! ## + cats = var_df['categorical_vars'] ### these are low cardinality vars - you can one-hot encode them ## + high_string_vars = var_df['discrete_string_vars'] ## discrete_string_vars are high cardinality vars ## embed them! 
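+    #### Note: 'bools' are True/False columns; when the training data comes directly from files
+    #### they arrive as strings, which is why they get a string Input (not float) further below.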
+ bools = var_df['bools'] + int_bools = var_df['int_bools'] + int_cats = var_df['int_cats'] + var_df['int_bools'] + ints = var_df['int_vars'] + floats = var_df['continuous_vars'] + nlps = var_df['nlp_vars'] + idcols = var_df['IDcols'] + dates = var_df['date_vars'] + lats = var_df['lat_vars'] + lons = var_df['lon_vars'] + matched_lat_lons = var_df['matched_pairs'] + floats = left_subtract(floats, lats+lons) + + #### These are the most important variables from this program: all inputs and outputs + all_inputs = [] + all_encoded = [] + all_features = [] + all_input_names = [] + dropout_rate = 0.1 ### this is needed for feature crossing + + ### just use this to set the limit for max tokens for different variables ### + ### we are setting the number of max_tokens to be 2X the number of tokens found in train + max_tokens_zip = defaultdict(int) + ##### Just combine Boolean and cat variables here to set the vocab #### + cats_copy = copy.deepcopy(cats) + if len(cats_copy) > 0: + for each_name in cats_copy: + max_tokens_zip[each_name] = cat_vocab_dict[each_name]['vocab'] ### just send vocab in + high_cats_copy = copy.deepcopy(high_string_vars) + if len(high_cats_copy) > 0: + for each_name in high_cats_copy: + max_tokens_zip[each_name] = cat_vocab_dict[each_name]['vocab'] ### just send vocab in + copy_int_cats = copy.deepcopy(int_cats) + if len(copy_int_cats) > 0: + for each_int in copy_int_cats: + max_tokens_zip[each_int] = cat_vocab_dict[each_int]['vocab'].tolist() ### just send vocab in + copy_int_bools = copy.deepcopy(int_bools) + if len(copy_int_bools) > 0: + for each_int in copy_int_bools: + max_tokens_zip[each_int] = cat_vocab_dict[each_int]['vocab'].tolist() ### just send vocab in + copy_floats = copy.deepcopy(floats+bools) + if len(copy_floats) > 0: + for each_float in copy_floats: + max_tokens_zip[each_float] = cat_vocab_dict[each_float]['vocab_min_var'] ### just send vocab as its a list + copy_ints = copy.deepcopy(ints) + if len(copy_ints) > 0: + for each_int in copy_ints: + max_tokens_zip[each_int] = int(1*(cat_vocab_dict[each_int]['size_of_vocab'])) ### size of vocab here + if verbose > 1: + print(' Number of categories in:') + for each_name in max_tokens_zip.keys(): + if not each_name in high_cats_copy: + print(' %s: %s' %(each_name, max_tokens_zip[each_name])) + else: + continue + + ####### CAVEAT : All the inputs and outputs should follow this same sequence below! ###### + all_date_inputs = [] + all_bool_inputs = [] + all_int_inputs = [] + all_int_cat_inputs = [] + all_cat_inputs = [] + all_num_inputs = [] + all_latlon_inputs = [] + ############## CAVEAT: This is the new way of concatenating different kinds of variables together #### + all_new_cat_encoded = [] ## this is the new way of encoding cat vars and tying them together ### + all_new_numeric_encoded = [] ### This is the new way of encoding num vars and tying them together ## + + ############## CAVEAT: The encoded outputs should follow the same sequence as inputs above! + all_date_encoded = [] + all_bool_encoded = [] + all_int_bool_encoded = [] + all_int_encoded = [] + all_int_cat_encoded = [] + all_cat_encoded = [] + all_high_cat_encoded = [] + all_feat_cross_encoded = [] + all_num_encoded = [] + all_latlon_encoded = [] + lat_lon_paired_encoded = [] + cat_encoded_dict = dict([]) + cat_input_dict = dict([]) + date_input_dict = dict([]) + bool_input_dict = dict([]) + ############################### + high_cats_alert = 50 ### set this number to alery you when a variable has high dimensions. Should it? 
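+    ### high_cats_alert does two things below: it triggers the high-dimension warning prints,
+    ### and integer columns whose vocabulary size reaches it are hash-bucketed instead of
+    ### being encoded with a plain category lookup.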
+ hidden_units = [32, 32] ## this is the number needed for feature crossing + ####### We start creating variables encoding with date-time variables first ########### + dates_copy = copy.deepcopy(dates) + if len(dates) > 0: + for each_date in dates_copy: + #### You just create the date-time input only once and reuse the same input again and again + date_input = keras.Input(shape=(1,), name=each_date, dtype="string") + date_input_dict[each_date] = date_input + all_date_inputs.append(date_input) + all_input_names.append(each_date) + try: + ### for datetime strings, you need to split them into hour, day and month ###### + encoded_hour = encode_date_time_var_hourofday_categorical(date_input, each_date, train_ds) + all_date_encoded.append(encoded_hour) + if verbose: + print(' %s : after date-hour encoding shape: %s' %(each_date, encoded_hour.shape[1])) + if encoded_hour.shape[1] > high_cats_alert: + print(' Alert! excessive feature dimension created. Check if necessary to have this many.') + except: + print(' Error: Skipping %s since Keras Date hourofday preprocessing erroring' %each_date) + try: + ### for datetime strings, you need to split them into hour, day and month ###### + encoded_day = encode_date_time_var_dayofweek_categorical(date_input, each_date, train_ds) + all_date_encoded.append(encoded_day) + if verbose: + print(' %s : after date-day encoding shape: %s' %(each_date, encoded_day.shape[1])) + if encoded_day.shape[1] > high_cats_alert: + print(' Alert! excessive feature dimension created. Check if necessary to have this many.') + except: + print(' Error: Skipping %s since Keras Date dayofweek preprocessing erroring' %each_date) + try: + ### for datetime strings, you need to split them into hour, day and month ###### + encoded_month = encode_date_time_var_monthofyear_categorical(date_input, each_date, train_ds) + all_date_encoded.append(encoded_month) + if verbose: + print(' %s : after date-month encoding shape: %s' %(each_date, encoded_month.shape[1])) + if encoded_month.shape[1] > high_cats_alert: + print(' Alert! excessive feature dimension created. Check if necessary to have this many.') + except: + print(' Error: Skipping %s since Keras Date dayofweek preprocessing erroring' %each_date) + #### This is where you do the category crossing of hourofday and dayofweek first 24*7 bins + + try: + encoded_hour_day = encode_cat_feature_crosses_numeric(encoded_day, encoded_hour, train_ds, + bins_num=24*7) + all_date_encoded.append(encoded_hour_day) + if verbose: + print(' %s : after date-hour-day encoding shape: %s' %(each_date, encoded_hour_day.shape[1])) + if encoded_hour_day.shape[1] > high_cats_alert: + print(' Alert! excessive feature dimension created. Check if necessary to have this many.') + except: + print(' Error: Skipping %s since Keras Date day-hour cross preprocessing erroring' %each_date) + #### This is where you do the category crossing of dayofweek and monthofyear first 12*7 bins + try: + encoded_month_day = encode_cat_feature_crosses_numeric(encoded_month, encoded_day, train_ds, + bins_num=12*7) + all_date_encoded.append(encoded_month_day) + if verbose: + print(' %s : after date-day-month encoding shape: %s' %(each_date, encoded_month_day.shape[1])) + if encoded_month_day.shape[1] > high_cats_alert: + print(' Alert! excessive feature dimension created. 
Check if necessary to have this many.') + except: + print(' Error: Skipping %s since Keras Date month-day cross preprocessing erroring' %each_date) + + + ##### If boolean variables exist, you must do this here ###### + if len(bools) > 0: + bools_copy = copy.deepcopy(bools) + try: + for each_bool in bools_copy: + #### You just create the boolean input as float since we are converting it ### + if train_data_is_file: + bool_input = keras.Input(shape=(1,), name=each_bool, dtype=tf.string) + else: + bool_input = keras.Input(shape=(1,), name=each_bool, dtype="float32") + bool_input_dict[each_bool] = bool_input + encoded = bool_input + all_bool_encoded.append(encoded) + all_bool_inputs.append(bool_input) + all_input_names.append(each_bool) + if verbose: + print(' %s : after boolean encoding, is now float with shape: %s' %(each_bool, encoded.shape[1])) + except: + print(' Error: Skipping %s since Keras Bolean preprocessing is erroring' %each_bool) + + ###### This is where we handle Boolean + Integer variables - we just combine them ################## + int_bools_copy = copy.deepcopy(int_bools) + if len(int_bools_copy) > 0: + for each_int in int_bools_copy: + try: + int_input = keras.Input(shape=(1,), name=each_int, dtype="int32") + cat_input_dict[each_int] = int_input + vocab = max_tokens_zip[each_int] + layer = tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=vocab, + mask_token=None, num_oov_indices=1, output_mode="int") + # Convert the string input values into a one hot encoding. + encoded = layer(int_input) + all_int_inputs.append(int_input) + all_int_bool_encoded.append(encoded) + all_input_names.append(each_int) + if verbose: + print(' %s number of categories = %d and after integer encoding shape: %s' %(each_int, + len(max_tokens_zip[each_int]), encoded.shape[1])) + except: + print(' Error: Skipping %s since Keras Boolean Integer preprocessing erroring' %each_int) + + ###### This is where we handle high cardinality >50 categories integers ################## + ints_copy = copy.deepcopy(ints) + if len(ints_copy) > 0: + for each_int in ints_copy: + try: + ### for integers that are very high cardinality, you can cut them down by half for starters + if max_tokens_zip[each_int] <= 100: + nums_bin = max(5, int(max_tokens_zip[each_int]/10)) + elif max_tokens_zip[each_int] > 100 and max_tokens_zip[each_int] <= 1000: + nums_bin = max(10, int(max_tokens_zip[each_int]/10)) + else: + nums_bin = max(20, int(max_tokens_zip[each_int]/40)) + int_input = keras.Input(shape=(1,), name=each_int, dtype="int32") + if (max_tokens_zip[each_int] >= high_cats_alert): + encoded = encode_any_integer_to_hash_categorical(int_input, each_int, + train_ds, nums_bin) + if verbose: + print(' %s encoded: %d categories, %d bins. After integer HASH encoding shape = %s' %(each_int, + max_tokens_zip[each_int], nums_bin, encoded.shape[1])) + else: + encoded = encode_categorical_and_integer_features(int_input, each_int, + train_ds, is_string=False) + if verbose: + print(' %s encoded: %d categories. After integer encoding shape: %s' %(each_int, + max_tokens_zip[each_int], encoded.shape[1])) + all_int_inputs.append(int_input) + all_int_encoded.append(encoded) + all_input_names.append(each_int) + if verbose: + if (encoded.shape[1] >= high_cats_alert): + print(' High Dims Alert! Convert %s to float??' 
%each_int) + except: + print(' Error: Skipping %s since Keras Integer preprocessing erroring' %each_int) + + + ###### This is where we handle low cardinality <=50 categories integers ################## + + ints_cat_copy = copy.deepcopy(int_cats) + if len(ints_cat_copy) > 0: + for each_int in ints_cat_copy: + try: + int_input = keras.Input(shape=(1,), name=each_int, dtype="int32") + cat_input_dict[each_int] = int_input + vocab = max_tokens_zip[each_int] + #encoded = encode_integer_to_categorical_feature(int_input, each_int, + # train_ds, vocab) + encoded = encode_categorical_and_integer_features(int_input, each_int, + train_ds, is_string=False) + all_int_cat_inputs.append(int_input) + all_int_cat_encoded.append(encoded) + all_input_names.append(each_int) + if verbose: + print(' %s encoded: %d categories. After integer encoding shape: %s' %(each_int, + len(vocab), encoded.shape[1])) + if encoded.shape[1] > high_cats_alert: + if verbose: + print(' High Dims Alert! Convert %s to float??' %each_int) + except: + print(' Error: Skipping %s since Keras Integer Categorical preprocessing erroring' %each_int) + + ##### All Discrete String and Categorical features are encoded as strings ########### + cats_copy = copy.deepcopy(cats) + if len(cats_copy) > 0: + for each_cat in cats_copy: + if each_cat in lats+lons: + continue ### skip if these variables are already in another list + try: + cat_input = keras.Input(shape=(1,), name=each_cat, dtype="string") + cat_input_dict[each_cat] = cat_input + vocab = max_tokens_zip[each_cat] + max_tokens = len(vocab) + cat_encoded = encode_categorical_and_integer_features(cat_input, each_cat, + train_ds, is_string=True) + #cat_encoded = encode_string_categorical_feature_categorical(cat_input, each_cat, + # train_ds, vocab) + all_cat_inputs.append(cat_input) + all_cat_encoded.append(cat_encoded) + cat_encoded_dict[each_cat] = cat_encoded + all_input_names.append(each_cat) + if verbose: + print(' %s number of categories = %d: after string to categorical encoding shape: %s' %( + each_cat, max_tokens, cat_encoded.shape[1])) + if cat_encoded.shape[1] > high_cats_alert: + if verbose: + print(' High Dims Alert! Convert %s to float??' %each_int) + except: + print(' Error: Skipping %s since Keras Categorical preprocessing erroring' %each_cat) + + ##### All Discrete String and Categorical features are encoded as strings ########### + high_cats_copy = copy.deepcopy(high_string_vars) + if len(high_cats_copy) > 0: + for each_cat in high_cats_copy: + if each_cat in lats+lons: + continue ### skip if these variables are already in another list + try: + #### The next line is not a typo: this input should be left without shape. Be Careful! + cat_input = keras.Input(shape=(1,), name=each_cat, dtype="string") + vocabulary = max_tokens_zip[each_cat] + encoded = encode_any_feature_to_embed_categorical(cat_input, each_cat, + train_ds, vocabulary) + all_cat_inputs.append(cat_input) + all_high_cat_encoded.append(encoded) + cat_encoded_dict[each_cat] = encoded + all_input_names.append(each_cat) + if verbose: + print(' %s : after high cardinality cat encoding shape: %s' %(each_cat, encoded.shape[1])) + if encoded.shape[1] > high_cats_alert: + print(' Alert! excessive feature dimension created. 
Check if necessary to have this many.') + except: + print(' Error: Skipping %s since Keras Discrete Strings (high cats) preprocessing erroring' %each_cat) + + #### If the feature crosses for categorical variables are requested, then do this here ### + + if cat_feature_cross_flag: + if isinstance(cat_feature_cross_flag, str): + if cat_feature_cross_flag in ['cat','categorical']: + cross_cats = copy.deepcopy(cats) + elif cat_feature_cross_flag in ['num', 'numeric']: + cross_cats = int_cats+int_bools + elif cat_feature_cross_flag in ['both', 'Both']: + cross_cats = cats + int_cats + int_bools + else: + #### If it is true just perform string categorical crosses ### + cross_cats = copy.deepcopy(cats) + if len(cross_cats) < 2: + print('Feature crossing requested but not many cat or int-cat or int-bool variables in data') + else: + print(' Performing %s feature crossing using %d variables: \n %s' %(cat_feature_cross_flag, len(cross_cats), cross_cats)) + ##### Now you perform crosses for each kind of crosses requested #################### + if cat_feature_cross_flag and len(cross_cats) > 1: + feat_cross_encoded = perform_new_feature_crossing(cat_input_dict, cross_cats, train_ds) + all_feat_cross_encoded += feat_cross_encoded + else: + print(' no feature crossing performed') + ################################################################################## + # Numerical features are treated as Numericals ### this is a direct feed to the final layer ### + nums_copy = left_subtract(floats,lats+lons) + num_only_encoded = [] + + if len(nums_copy) > 0: + for each_num in nums_copy: + try: + num_input = keras.Input(shape=(1,), name=each_num, dtype="float32") + ### Let's assume we don't do any encoding but use Normalization ## + feat_mean = max_tokens_zip[each_num][0] + feat_var = max_tokens_zip[each_num][1] + normalizer = Normalization(mean=feat_mean, variance=feat_var) + encoded = normalizer(num_input) + #encoded = encode_numerical_feature_numeric(num_input, each_num, train_ds) + all_num_inputs.append(num_input) + all_num_encoded.append(encoded) + num_only_encoded.append(encoded) + all_input_names.append(each_num) + print(' %s numeric column left as is since float' %each_num) + except: + print(' Error: Skipping %s due to Keras float preprocessing error' %each_num) + + + # Latitude and Longitude Numerical features are Binned first and then Category Encoded ####### + lat_lon_paired_dict = dict([]) + #### Just remember that dtype of Input should match the dtype of the column! 
##### + # Latitude and Longitude Numerical features are Binned first and then Category Encoded ####### + lat_lists = [] + lats_copy = copy.deepcopy(lats) + if len(lats_copy) > 0: + for each_lat in lats_copy: + lat_lists += list(cat_vocab_dict[each_lat]['vocab']) + lats_copy = copy.deepcopy(lats) + for each_lat in lats_copy: + try: + bins_lat = pd.qcut(lat_lists, q=find_number_bins(cat_vocab_dict[each_lat]['vocab']), + duplicates='drop', retbins=True)[1] + ##### Now we create the inputs and the encoded outputs ###### + lat_lon_input = keras.Input(shape=(1,), name=each_lat, dtype="float32") + all_latlon_inputs.append(lat_lon_input) + lat_lon_encoded = encode_binning_numeric_feature_categorical(lat_lon_input, each_lat, train_ds, + bins_lat=bins_lat, + bins_num=len(bins_lat)) + all_latlon_encoded.append(lat_lon_encoded) + lat_lon_paired_dict[each_lat] = lat_lon_encoded + all_input_names.append(each_lat) + if verbose: + print(' %s : after latitude binning encoding shape: %s' %(each_lat, lat_lon_encoded.shape[1])) + if lat_lon_encoded.shape[1] > high_cats_alert: + print(' Alert! excessive feature dimension created. Check if necessary to have this many.') + except: + print(' Error: Skipping %s since Keras latitudes var preprocessing erroring' %each_lat) + + lon_lists = [] + lons_copy = copy.deepcopy(lons) + if len(lons_copy) > 0: + for each_lon in lons_copy: + lon_lists += list(cat_vocab_dict[each_lon]['vocab']) + lons_copy = copy.deepcopy(lons) + for each_lon in lons_copy: + try: + bins_lon = pd.qcut(lon_lists, q=find_number_bins(cat_vocab_dict[each_lon]['vocab']), + duplicates='drop', retbins=True)[1] + ##### Now we create the inputs and the encoded outputs ###### + lat_lon_input = keras.Input(shape=(1,), name=each_lon, dtype="float32") + all_latlon_inputs.append(lat_lon_input) + lat_lon_encoded = encode_binning_numeric_feature_categorical(lat_lon_input, each_lon, train_ds, + bins_lat=bins_lon, + bins_num=len(bins_lon)) + all_latlon_encoded.append(lat_lon_encoded) + lat_lon_paired_dict[each_lon] = lat_lon_encoded + all_input_names.append(each_lon) + if verbose: + print(' %s : after longitude binning encoding shape: %s' %(each_lon, lat_lon_encoded.shape[1])) + if lat_lon_encoded.shape[1] > high_cats_alert: + print(' Alert! excessive feature dimension created. Check if necessary to have this many.') + except: + print(' Error: Skipping %s since Keras longitudes var preprocessing erroring' %each_lon) + + #### this is where you match the pairs of latitudes and longitudes to create an embedding layer + if len(matched_lat_lons) > 0: + matched_lat_lons_copy = copy.deepcopy(matched_lat_lons) + for (lat_in_pair, lon_in_pair) in matched_lat_lons_copy: + try: + encoded_pair = encode_feature_crosses_lat_lon_numeric(lat_lon_paired_dict[lat_in_pair], + lat_lon_paired_dict[lon_in_pair], + dataset=train_ds, bins_lat=bins_lat) + lat_lon_paired_encoded.append(encoded_pair) + if verbose: + print(' %s + %s : after matched lat-lon crosses encoding shape: %s' %(lat_in_pair, lon_in_pair, encoded_pair.shape[1])) + if encoded_pair.shape[1] > high_cats_alert: + print(' Alert! excessive feature dimension created. 
Check if necessary to have this many.') + except: + print(' Error: Skipping (%s, %s) since Keras lat-lon paired preprocessing erroring' %(lat_in_pair, lon_in_pair)) + + ##### SEQUENCE OF THESE INPUTS AND OUTPUTS MUST MATCH ABOVE - we gather all outputs above into a single list + all_inputs = all_bool_inputs + all_date_inputs + all_int_inputs + all_int_cat_inputs + all_cat_inputs + all_num_inputs + all_latlon_inputs + all_encoded = all_bool_encoded+all_high_cat_encoded+all_date_encoded+all_int_bool_encoded+all_int_encoded+all_int_cat_encoded+all_cat_encoded+all_feat_cross_encoded+all_num_encoded+all_latlon_encoded+lat_lon_paired_encoded + ###### This is new way of processing categorical and numeric variables ######### + all_low_cat_encoded = all_int_bool_encoded+all_cat_encoded+all_int_encoded+all_int_cat_encoded ## these are integer outputs #### + all_numeric_encoded = all_num_encoded+all_bool_encoded+all_date_encoded+all_high_cat_encoded+all_feat_cross_encoded+all_latlon_encoded+lat_lon_paired_encoded## these are all float ## + + ###### This is new way of processing categorical and numeric variables ######### + #all_low_cat_encoded = all_int_bool_encoded + all_cat_encoded + all_int_cat_encoded + #all_numeric_encoded = all_num_encoded + all_latlon_encoded + lat_lon_paired_encoded + all_feat_cross_encoded + + + ###### This is where we determine the size of different layers ######### + data_size = model_options['DS_LEN'] + if len(all_numeric_encoded) == 0: + meta_numeric_len = 1 + elif len(all_numeric_encoded) == 1: + meta_numeric_len = all_numeric_encoded[0].shape[1] + else: + meta_numeric_len = layers.concatenate(all_numeric_encoded).shape[1] + data_dim = int(data_size*meta_numeric_len) + if data_dim <= 1e6: + dense_layer1 = max(16,int(data_dim/30000)) + dense_layer2 = max(8,int(dense_layer1*0.5)) + dense_layer3 = max(4,int(dense_layer2*0.5)) + elif data_dim > 1e6 and data_dim <= 1e8: + dense_layer1 = max(24,int(data_dim/500000)) + dense_layer2 = max(12,int(dense_layer1*0.5)) + dense_layer3 = max(6,int(dense_layer2*0.5)) + elif data_dim > 1e8 or keras_model_type.lower() in ['big_deep', 'big deep']: + dense_layer1 = 300 + dense_layer2 = 200 + dense_layer3 = 100 + dense_layer1 = min(300,dense_layer1) + dense_layer2 = min(200,dense_layer2) + dense_layer3 = min(100,dense_layer3) + ############ D E E P and W I D E M O D E L S P R E P R O C E S S I N G ######## + #### P R E P R O C E S S I N G F O R A L L O T H E R M O D E L S ######## + ####Concatenate all features( Numerical input ) + skip_meta_categ1 = False + #concat_kernel_initializer = "glorot_uniform" + concat_kernel_initializer = "he_normal" + concat_activation = 'relu' + concat_layer_neurons = dense_layer1 + + ####Concatenate all categorical features( Categorical input ) ######### + if len(all_low_cat_encoded) == 0: + skip_meta_categ1 = True + meta_categ1 = None + elif len(all_low_cat_encoded) == 1: + meta_input_categ1 = all_low_cat_encoded[0] + meta_categ1 = layers.Dense(concat_layer_neurons, kernel_initializer=concat_kernel_initializer)(meta_input_categ1) + else: + int_list = [x for x in all_low_cat_encoded if x.dtype in [np.int8, np.int16, np.int32, np.int64]] + float_list = [ x for x in all_low_cat_encoded if x.dtype in [np.float32, np.float64]] + if len(float_list) == len(all_low_cat_encoded): + ### All of them are floats ### + all_high_cat_encoded += float_list + else: + meta_input_categ1 = layers.concatenate(int_list) + all_high_cat_encoded += float_list + #WIDE - This Dense layer connects to input layer - Categorical Data + 
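+            # This Dense layer forms the "wide" arm of the model: the integer-encoded,
+            # low-cardinality categoricals concatenated above feed a single Dense layer,
+            # while any float-typed encodings were routed into all_high_cat_encoded and
+            # are picked up by the "deep" arm below.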
meta_categ1 = layers.Dense(concat_layer_neurons, kernel_initializer=concat_kernel_initializer)(meta_input_categ1) + + skip_meta_categ2 = False + if len(all_high_cat_encoded) == 0: + skip_meta_categ2 = True + meta_categ2 = None + elif len(all_high_cat_encoded) == 1: + meta_input_categ2 = all_high_cat_encoded[0] + meta_categ2 = layers.Dense(concat_layer_neurons, kernel_initializer=concat_kernel_initializer)(meta_input_categ2) + meta_categ2 = layers.BatchNormalization()(meta_categ2) + else: + meta_input_categ2 = layers.concatenate(all_high_cat_encoded) + meta_categ2 = layers.Dense(concat_layer_neurons, kernel_initializer=concat_kernel_initializer)(meta_input_categ2) + meta_categ2 = layers.BatchNormalization()(meta_categ2) + + skip_meta_numeric = False + if len(all_numeric_encoded) == 0: + skip_meta_numeric = True + meta_numeric = None + elif len(all_numeric_encoded) == 1: + meta_input_numeric = all_numeric_encoded[0] + meta_numeric = layers.Dense(concat_layer_neurons, kernel_initializer=concat_kernel_initializer)(meta_input_numeric) + meta_numeric = layers.BatchNormalization()(meta_numeric) + else: + #### You must concatenate these encoded outputs before sending them out! + #DEEP - This Dense layer connects to input layer - Numeric Data + meta_input_numeric = layers.concatenate(all_numeric_encoded) + meta_numeric = layers.Dense(concat_layer_neurons, kernel_initializer=concat_kernel_initializer)(meta_input_numeric) + meta_numeric = layers.BatchNormalization()(meta_numeric) + + + ####Concatenate both Wide and Deep layers + #### in the end, you copy it into another variable called all_features so that you can easily remember name + all_encoded_dict = list(zip([skip_meta_categ1, skip_meta_categ2, skip_meta_numeric], + ['meta_categ1', 'meta_categ2', 'meta_numeric'])) + ###### This is how you concatenate the various layers ############################### + concat_layers = [] + try: + for (each_skip, each_encoded) in all_encoded_dict: + # The order in which we feed the inputs is as follows: nlps + dates + ints + cats + floats + lat-lons + if each_skip: + ### This means that you must skip adding this layer ###### + continue + else: + #### This means you must add that layer ########## + concat_layers.append(eval(each_encoded)) + except: + print(' Error: preprocessing layers for %s models is erroring' %keras_model_type) + + if len(concat_layers) == 0: + print('There are no cat, integer or float variables in this data set. Hence continuing...') + all_features = [] + elif len(concat_layers) == 1: + all_features = concat_layers[0] + else: + all_features = layers.concatenate(concat_layers) + + print('Time taken for preprocessing (in seconds) = %d' %(time.time()-start_time)) + return all_features, all_inputs, all_input_names + +############################################################################################# +def find_number_bins(series): + """ + Input can be a numpy array or pandas series. Otherwise it will blow up. Be careful! + Returns the recommended number of bins for any Series in pandas + Input must be a float or integer column. Don't send in alphabetical series! + """ + try: + num_of_quantiles = int(np.log2(series.nunique())+1) + except: + num_of_quantiles = max(2, int(np.log2(len(series)/5))) + return num_of_quantiles +############################################################################################# +##### Thanks to Francois Chollet for his excellent tutorial on Keras Preprocessing functions! 
+##### https://keras.io/examples/structured_data/structured_data_classification_from_scratch/ +##### Some of the functions below are derived from the tutorial. I have added many more here. +############################################################################################ +def encode_numerical_feature_numeric(feature_input, name, dataset): + """ + Inputs: + ---------- + feature_input: must be a keras.Input variable, so make sure you create a variable first for the + column in your dataset that want to transform. Please make sure it has a + shape of (None, 1). + name: this is the name of the column in your dataset that you want to transform + dataset: this is the variable holding the tf.data.Dataset of your data. Can be any kind of dataset. + for example: it can be a batched or a prefetched dataset. + Warning: You must be careful to set num_epochs when creating this dataset. + If num_epochs=None, this function will loop forever. If you set it to a number, + it will stop after that many epochs. So be careful! + + Outputs: + ----------- + encoded_feature: a keras.Tensor. You can use this tensor in keras models for training. + The Tensor has a shape of (None, 1) - None indicates that it has not been + """ + # Create a Normalization layer for our feature + normalizer = Normalization() + + # Prepare a Dataset that only yields our feature + feature_ds = dataset.map(lambda x, y: x[name]) + feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1)) + + # Learn the statistics of the data + normalizer.adapt(feature_ds) + + # Normalize the input feature + encoded_feature = normalizer(feature_input) + return encoded_feature + +########################################################################################### +def encode_binning_numeric_feature_categorical(feature, name, dataset, bins_lat, bins_num=10): + """ + Inputs: + ---------- + feature: must be a keras.Input variable, so make sure you create a variable first for the + column in your dataset that want to transform. Please make sure it has a + shape of (None, 1). + name: this is the name of the column in your dataset that you want to transform + dataset: this is the variable holding the tf.data.Dataset of your data. Can be any kind of dataset. + for example: it can be a batched or a prefetched dataset. + Warning: You must be careful to set num_epochs when creating this dataset. + If num_epochs=None, this function will loop forever. If you set it to a number, + it will stop after that many epochs. So be careful! + + Outputs: + ----------- + encoded_feature: a keras.Tensor. You can use this tensor in keras models for training. + The Tensor has a shape of (None, 1) - None indicates that it has not been + CategoryEncoding output dtype is float32 even if output is binary or count. + """ + # Create a StringLookup layer which will turn strings into integer indices + index = Discretization(bin_boundaries = bins_lat) + + # Prepare a Dataset that only yields our feature + feature_ds = dataset.map(lambda x, y: x[name]) + feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1)) + + # If the bin_boundaries are given, then no adapt is needed since it is already known. 
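+    # Discretization configured with explicit bin_boundaries is stateless: for example,
+    # Discretization(bin_boundaries=[0.0, 1.0, 2.0]) maps inputs straight into bucket
+    # indices 0..3 with no fitting step, which is why the adapt() call below is left
+    # commented out.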
+ #index.adapt(feature_ds) + + # Turn the string input into integer indices + encoded_feature = index(feature) + + # Create a CategoryEncoding for our integer indices + lat_bins = 21 + #encoder = CategoryEncoding(num_tokens=bins_num+1, output_mode="binary") + encoder = CategoryEncoding(num_tokens=lat_bins, output_mode="binary") + + # Prepare a dataset of indices + feature_ds = feature_ds.map(index) + + # Learn the space of possible indices + encoder.adapt(feature_ds) + + # Apply one-hot encoding to our indices + encoded_feature = encoder(encoded_feature) + return encoded_feature + +########################################################################################### +def encode_categorical_and_integer_features(feature, name, dataset, is_string): + lookup_class = StringLookup if is_string else IntegerLookup + # Create a lookup layer which will turn strings into integer indices + lookup = lookup_class(output_mode="binary") + + # Prepare a Dataset that only yields our feature + feature_ds = dataset.map(lambda x, y: x[name]) + feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1)) + + # Learn the set of possible string values and assign them a fixed integer index + lookup.adapt(feature_ds) + + # Turn the string input into integer indices + encoded_feature = lookup(feature) + return encoded_feature +############################################################################## +def encode_string_categorical_feature_categorical(feature_input, name, dataset, vocab): + """ + Inputs: + ---------- + feature_input: must be a keras.Input variable, so make sure you create a variable first for the + column in your dataset that want to transform. Please make sure it has a + shape of (None, 1). + name: this is the name of the column in your dataset that you want to transform + dataset: this is the variable holding the tf.data.Dataset of your data. Can be any kind of dataset. + for example: it can be a batched or a prefetched dataset. + Warning: You must be careful to set num_epochs when creating this dataset. + If num_epochs=None, this function will loop forever. If you set it to a number, + it will stop after that many epochs. So be careful! + + Outputs: + ----------- + encoded_feature: a keras.Tensor. You can use this tensor in keras models for training. + The Tensor has a shape of (None, 1) - None indicates that it is not batched. + When the output_mode = "binary" or "count", the output is in float otherwise it is integer. 
+ """ + extra_oov = 3 + # Create a StringLookup layer which will turn strings into integer indices + index = StringLookup(max_tokens=None, num_oov_indices=extra_oov, + vocabulary=vocab, output_mode="count") + + # Prepare a Dataset that only yields our feature + feature_ds = dataset.map(lambda x, y: x[name]) + feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1)) + + ### turn it into an index first ### + encoded_feature = index(feature_input) + + # Create a CategoryEncoding for our integer indices + lat_bins = 10 + encoder = CategoryEncoding(num_tokens=lat_bins, output_mode="binary") + + # Prepare a dataset of indices + feature_ds = feature_ds.map(index) + + # Learn the set of possible string values and assign them a fixed integer index + encoder.adapt(feature_ds) + + # Turn the string input into integer indices + encoded_feature = encoder(encoded_feature) + + return encoded_feature + +########################################################################################### +def encode_integer_to_categorical_feature(feature_input, name, dataset, vocab): + """ + Inputs: + ---------- + feature_input: must be a keras.Input variable, so make sure you create a variable first for the + column in your dataset that want to transform. Please make sure it has a + shape of (None, 1). + name: this is the name of the column in your dataset that you want to transform + dataset: this is the variable holding the tf.data.Dataset of your data. Can be any kind of dataset. + for example: it can be a batched or a prefetched dataset. + Warning: You must be careful to set num_epochs when creating this dataset. + If num_epochs=None, this function will loop forever. If you set it to a number, + it will stop after that many epochs. So be careful! + + Outputs: + ----------- + encoded_feature: a keras.Tensor. You can use this tensor in keras models for training. + The Tensor has a shape of (None, 1) - None indicates that it has not been + When the output_mode = "binary" or "count", the output is in float otherwise it is integer. + """ + extra_oov = 3 + # Create a StringLookup layer which will turn strings into integer indices + ### For now we will leave the max_values as None which means there is no limit. + index = IntegerLookup(vocabulary=vocab, mask_token=None, + num_oov_indices=extra_oov, oov_token=-9999, + output_mode='int') + + # Prepare a Dataset that only yields our feature + feature_ds = dataset.map(lambda x, y: x[name]) + feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1)) + + # Learn the set of possible string values and assign them a fixed integer index + index.adapt(feature_ds) + + # Turn the string input into integer indices + encoded_feature = index(feature_input) + + return encoded_feature + +########################################################################################### +def encode_cat_feature_crosses_numeric(encoded_input1, encoded_input2, dataset, bins_num=64): + """ + This function does feature crosses of two categorical features sent in as encoded inputs. + DO NOT SEND in RAW KERAS.INPUTs = they won't work here. This function takes those that are encoded. + It then creates a feature cross, hashes the resulting categories and then category encodes them. + The resulting output can be directly used an encoded variable for building pipelines. + + Inputs: + ---------- + encoded_input1: This must be an encoded input - create a Keras.input variable first. + Then do a StringLookup column on it and then a CategoryEncoding of it. 
Now you + can feed that encoded variable into this as the first input. + encoded_input1: This must be an encoded input - Similar to above: create a Keras.input variable first. + Then do a StringLookup column on it and then a CategoryEncoding of it. Now you + can feed that encoded variable into this as the second input. + dataset: this is the variable holding the tf.data.Dataset of your data. Can be any kind of dataset. + for example: it can be a batched or a prefetched dataset. + Warning: You must be careful to set num_epochs when creating this dataset. + If num_epochs=None, this function will loop forever. If you set it to a number, + it will stop after that many epochs. So be careful! + bins_num: this is the number of bins you want to use in the hashing of the column + Typically this can be 64. But you can make it smaller or larger. + + + Outputs: + ----------- + cat_cross_cat1_cat2: a keras.Tensor. You can use this tensor in keras models for training. + The Tensor has a shape of (None, 1) - None indicates it is batched. + CategoryEncoding output dtype is float32 even if output is binary or count. + """ + ########### Categorical cross of two categorical features is done here ######### + cross_cat1_cat2 = tf.keras.layers.experimental.preprocessing.CategoryCrossing()( + [encoded_input1, encoded_input2]) + hash_cross_cat1_cat2 = tf.keras.layers.experimental.preprocessing.Hashing(num_bins=bins_num)( + cross_cat1_cat2) + #cat_cross_cat1_cat2 = tf.keras.layers.experimental.preprocessing.CategoryEncoding( + # num_tokens = bins_num)(hash_cross_cat1_cat2) + + # Cross to embedding + cat_cross_cat1_cat2 = tf.keras.layers.Embedding( + (bins_num + 1) , 10) (hash_cross_cat1_cat2) + cat_cross_cat1_cat2 = tf.reduce_sum(cat_cross_cat1_cat2, axis=-2) + + return cat_cross_cat1_cat2 +########################################################################################### +def encode_cat_feature_crosses(encoded_input1, encoded_input2, dataset, bins_num=64): + """ + This function does feature crosses of two categorical features sent in as encoded inputs. + DO NOT SEND in RAW KERAS.INPUTs = they won't work here. This function takes those that are encoded. + It then creates a feature cross, hashes the resulting categories and then category encodes them. + The resulting output can be directly used an encoded variable for building pipelines. + + Inputs: + ---------- + encoded_input1: This must be an encoded input - create a Keras.input variable first. + Then do a StringLookup column on it and then a CategoryEncoding of it. Now you + can feed that encoded variable into this as the first input. + encoded_input1: This must be an encoded input - Similar to above: create a Keras.input variable first. + Then do a StringLookup column on it and then a CategoryEncoding of it. Now you + can feed that encoded variable into this as the second input. + dataset: this is the variable holding the tf.data.Dataset of your data. Can be any kind of dataset. + for example: it can be a batched or a prefetched dataset. + Warning: You must be careful to set num_epochs when creating this dataset. + If num_epochs=None, this function will loop forever. If you set it to a number, + it will stop after that many epochs. So be careful! + bins_num: this is the number of bins you want to use in the hashing of the column + Typically this can be 64. But you can make it smaller or larger. + + + Outputs: + ----------- + cat_cross_cat1_cat2: a keras.Tensor. You can use this tensor in keras models for training. 
+ The Tensor has a shape of (None, 1) - None indicates it is batched. + CategoryEncoding output dtype is float32 even if output is binary or count. + """ + ########### Categorical cross of two categorical features is done here ######### + cross_cat1_cat2 = tf.keras.layers.experimental.preprocessing.CategoryCrossing()( + [encoded_input1, encoded_input2]) + hash_cross_cat1_cat2 = tf.keras.layers.experimental.preprocessing.Hashing(num_bins=bins_num)( + cross_cat1_cat2) + cat_cross_cat1_cat2 = tf.keras.layers.experimental.preprocessing.CategoryEncoding( + num_tokens = bins_num)(hash_cross_cat1_cat2) + + return cat_cross_cat1_cat2 +########################################################################################### +def encode_feature_crosses_lat_lon_numeric(cat_pickup_lat, cat_pickup_lon, dataset, bins_lat): + """ + This function does feature crosses of a paired latitude and logitude sent in as encoded inputs. + DO NOT SEND in RAW KERAS.INPUTs = they won't work here. This function takes those that are encoded. + It then creates a feature cross, hashes the resulting categories and then category encodes them. + The resulting output can be directly used an encoded variable for building pipelines. + + Inputs: + ---------- + cat_pickup_lat: This must be an encoded input - create a Keras.input variable first. + Then do a Discretization column on it and then a CategoryEncoding of it. Now you + can feed that encoded variable into this as the first input. + cat_pickup_lon: This must be an encoded input - Similar to above: create a Keras.input variable first. + Then do a Discretization column on it and then a CategoryEncoding of it. Now you + can feed that encoded variable into this as the second input. + dataset: this is the variable holding the tf.data.Dataset of your data. Can be any kind of dataset. + for example: it can be a batched or a prefetched dataset. + Warning: You must be careful to set num_epochs when creating this dataset. + If num_epochs=None, this function will loop forever. If you set it to a number, + it will stop after that many epochs. So be careful! + bins_lat: this is a pandas qcut bins - DO NOT SEND IN A NUMBER. It will fail! + Typically you do this after binning the Latitude or Longitude after pd.qcut and set ret_bins=True. + + + Outputs: + ----------- + embed_cross_pick_lon_lat: a keras.Tensor. You can use this tensor in keras models for training. + The Tensor has a shape of (None, embedding_dim) - None indicates it is batched. + hashing creates integer outputs. So don't concatenate it with float outputs. 
+    """
+    num_bins_lat = len(bins_lat)
+    if num_bins_lat <= 100:
+        nums_bin = max(5, int(num_bins_lat/10))
+    elif num_bins_lat > 100 and num_bins_lat <= 1000:
+        nums_bin = max(10, int(num_bins_lat/10))
+    else:
+        nums_bin = max(20, int(num_bins_lat/40))
+    ### nums_bin = (len(bins_lat) + 1) ** 2 ## this was the old one
+    ########### Categorical cross of two categorical features is done here #########
+    cross_pick_lon_lat = tf.keras.layers.experimental.preprocessing.CategoryCrossing()(
+        [cat_pickup_lat, cat_pickup_lon])
+    hash_cross_pick_lon_lat = tf.keras.layers.experimental.preprocessing.Hashing(
+        num_bins=nums_bin)(cross_pick_lon_lat)
+
+    # Cross to embedding
+    embed_cross_pick_lon_lat = tf.keras.layers.Embedding(
+        nums_bin, 10)(hash_cross_pick_lon_lat)
+    embed_cross_pick_lon_lat = tf.reduce_sum(embed_cross_pick_lon_lat, axis=-2)
+
+    return embed_cross_pick_lon_lat
+################################################################################
+def encode_any_integer_to_hash_categorical(feature_input, name, dataset, bins_num=30):
+    """
+    Inputs:
+    ----------
+    feature_input: must be a keras.Input variable, so make sure you create a variable first for the
+                 column in your dataset that you want to transform. Please make sure it has a
+                 shape of (None, 1).
+    name: this is the name of the column in your dataset that you want to transform
+    dataset: this is the variable holding the tf.data.Dataset of your data. Can be any kind of dataset.
+                 for example: it can be a batched or a prefetched dataset.
+                 Warning: You must be careful to set num_epochs when creating this dataset.
+                 If num_epochs=None, this function will loop forever. If you set it to a number,
+                 it will stop after that many epochs. So be careful!
+    bins_num: this is the number of bins you want the hashing layer to split the data into
+
+    Outputs:
+    -----------
+    encoded_feature: a keras.Tensor. You can use this tensor in keras models for training.
+                 The Tensor has a shape of (None, bins_num) - None indicates the data has been batched.
+                 Hashing always gives you int64 encodings, so make sure you concatenate it only with other integer encoders.
+    """
+    # Use the Hashing layer to hash the values into the range [0, bins_num)
+    hasher = Hashing(num_bins=bins_num, salt=1337)
+
+    # Prepare a Dataset that only yields our feature
+    feature_ds = dataset.map(lambda x, y: x[name])
+    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))
+
+    # Hashing is stateless, so there is no vocabulary to learn here
+    hasher.adapt(feature_ds)
+
+    # Turn the integer input into hashed bin indices
+    encoded_feature = hasher(feature_input)
+
+    return encoded_feature
+
+###########################################################################################
+def encode_any_feature_to_embed_categorical(feature_input, name, dataset, vocabulary):
+    """
+    Inputs:
+    ----------
+    feature_input: must be a keras.Input variable, so make sure you create a variable first for the
+                 column in your dataset that you want to transform. Please make sure it has a
+                 shape of (None, 1).
+    name: this is the name of the column in your dataset that you want to transform
+    dataset: this is the variable holding the tf.data.Dataset of your data. Can be any kind of dataset.
+                 for example: it can be a batched or a prefetched dataset.
+                 Warning: You must be careful to set num_epochs when creating this dataset.
+                 If num_epochs=None, this function will loop forever. If you set it to a number,
+                 it will stop after that many epochs. So be careful!
+ vocabulary: this is the number of bins you want the hashing layer to split the data into + + Outputs: + ----------- + encoded_feature: a keras.Tensor. You can use this tensor in keras models for training. + The Tensor has a shape of (None, bins_num) - None indicates data has been batched + """ + extra_oov = 4 + # Use the lookup table first and then use embedding layer + lookup = StringLookup( + vocabulary=vocabulary, + mask_token=None, + num_oov_indices=extra_oov, + max_tokens=None, + output_mode="int" + ) + # Convert the string input values into integer indices. + #feature_ds = dataset.map(lambda x, y: x[name]) + #feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1)) + + # Learn the set of possible string values and assign them a fixed integer index + #lookup.adapt(feature_ds) + encoded_feature = lookup(feature_input) + #embedding_dims = int(math.sqrt(len(vocabulary))) + embedding_dims = int(max(2, math.log(len(vocabulary), 2))) + # Create an embedding layer with the specified dimensions. + embedding = tf.keras.layers.Embedding( + input_dim=len(vocabulary)+extra_oov, output_dim=embedding_dims + ) + # Convert the index values to embedding representations. + encoded_feature = embedding(encoded_feature) + encoded_feature = Flatten()(encoded_feature) + + return encoded_feature + +########################################################################################### +def encode_date_time_var_dayofweek_categorical(feature_input, name, dataset): + """ + This function will split the day of week from date-time column and create a new column. + It will take a keras.Input variable as input and return a keras.layers variable as output. + + Inputs: + ---------- + feature_input: must be a keras.Input variable, so make sure you create a variable first for the + date-time column in your dataset that you want to transform. Please make sure it has a + shape of (None, 1). It will split the hour of day from that column and create a new column. + name: this is the name of the column in your dataset that you want to transform + dataset: this is the variable holding the tf.data.Dataset of your data. Can be any kind of dataset. + for example: it can be a batched or a prefetched dataset. + Warning: You must be careful to set num_epochs when creating this dataset. + If num_epochs=None, this function will loop forever. If you set it to a number, + it will stop after that many epochs. So be careful! + + Outputs: + ----------- + encoded_feature: a keras.Tensor. You can use this tensor in keras models for training. + The Tensor has a shape of (None, 1) - None indicates that it has not been + CategoryEncoding output dtype is float32 even if output is binary or count. 
+ """ + index = StringLookup() + + # Prepare a Dataset that only yields our feature + feature_ds = dataset.map(lambda x, y: dayofweek(x[name])) + feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1)) + + # Learn the set of possible string values and assign them a fixed integer index + index.adapt(feature_ds) + + # Turn the string input into integer indices + encoded_feature = index(feature_input) + + # Create a CategoryEncoding for our integer indices + encoder = CategoryEncoding(num_tokens=8, output_mode="binary") + + # Prepare a dataset of indices + feature_ds = feature_ds.map(index) + + # Learn the space of possible indices + encoder.adapt(feature_ds) + + # Apply one-hot encoding to our indices + encoded_feature = encoder(encoded_feature) + return encoded_feature + +def encode_date_time_var_monthofyear_categorical(feature_input, name, dataset): + """ + This function will split the month of year from date-time column and create a new column. + It will take a keras.Input variable as input and return a keras.layers variable as output. + + Inputs: + ---------- + feature_input: must be a keras.Input variable, so make sure you create a variable first for the + date-time column in your dataset that you want to transform. Please make sure it has a + shape of (None, 1). It will split the hour of day from that column and create a new column. + name: this is the name of the column in your dataset that you want to transform + dataset: this is the variable holding the tf.data.Dataset of your data. Can be any kind of dataset. + for example: it can be a batched or a prefetched dataset. + Warning: You must be careful to set num_epochs when creating this dataset. + If num_epochs=None, this function will loop forever. If you set it to a number, + it will stop after that many epochs. So be careful! + + Outputs: + ----------- + encoded_feature: a keras.Tensor. You can use this tensor in keras models for training. + The Tensor has a shape of (None, 1) - None indicates that it has not been + CategoryEncoding output dtype is float32 even if output is binary or count. + """ + index = StringLookup() + + # Prepare a Dataset that only yields our feature + feature_ds = dataset.map(lambda x, y: monthofyear(x[name])) + feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1)) + + # Learn the set of possible string values and assign them a fixed integer index + index.adapt(feature_ds) + + # Turn the string input into integer indices + encoded_feature = index(feature_input) + + # Create a CategoryEncoding for our integer indices + encoder = CategoryEncoding(num_tokens=13, output_mode="binary") + + # Prepare a dataset of indices + feature_ds = feature_ds.map(index) + + # Learn the space of possible indices + encoder.adapt(feature_ds) + + # Apply one-hot encoding to our indices + encoded_feature = encoder(encoded_feature) + return encoded_feature + +def encode_date_time_var_hourofday_categorical(feature_input, name, dataset): + """ + This function will split the hour of day from date-time column and create a new column. + It will take a keras.Input variable as input and return a keras.layers variable as output. + + Inputs: + ---------- + feature_input: must be a keras.Input variable, so make sure you create a variable first for the + date-time column in your dataset that you want to transform. Please make sure it has a + shape of (None, 1). It will split the hour of day from that column and create a new column. 
+ name: this is the name of the column in your dataset that you want to transform + dataset: this is the variable holding the tf.data.Dataset of your data. Can be any kind of dataset. + for example: it can be a batched or a prefetched dataset. + Warning: You must be careful to set num_epochs when creating this dataset. + If num_epochs=None, this function will loop forever. If you set it to a number, + it will stop after that many epochs. So be careful! + + Outputs: + ----------- + encoded_feature: a keras.Tensor. You can use this tensor in keras models for training. + The Tensor has a shape of (None, 1) - None indicates that it has not been + CategoryEncoding output dtype is float32 even if output is binary or count. + """ + index = StringLookup() + + # Prepare a Dataset that only yields our feature + feature_ds = dataset.map(lambda x, y: hourofday(x[name])) + feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1)) + + # Learn the set of possible string values and assign them a fixed integer index + index.adapt(feature_ds) + + # Turn the string input into integer indices + encoded_feature = index(feature_input) + + # Create a CategoryEncoding for our integer indices + encoder = CategoryEncoding(num_tokens=25, output_mode="binary") + + # Prepare a dataset of indices + feature_ds = feature_ds.map(index) + + # Learn the space of possible indices + encoder.adapt(feature_ds) + + # Apply one-hot encoding to our indices + encoded_feature = encoder(encoded_feature) + return encoded_feature +################################################################################ +def one_hot_encode_categorical_target(features, labels, categories): + """Returns a one-hot encoded tensor representing categorical values.""" + # The entire encoding can fit on one line: + labels = tf.cast(tf.equal(categories, tf.reshape(labels, [-1, 1])), tf.int32) + return (features, labels) +############################################################################################## +def convert_classifier_targets(labels): + """ + This handy function converts target labels that are binary or multi-class (whether integer or string) into integers. + This is similar to a label encoder in scikit-learn but works on tensorflow tf.data.Datasets. + Just send in a tf.data.Dataset and it will split it into features and labels and then turn them into correct labels. + It returns the converted labels and a dictionary which you can use to convert it back to original labels. Neat! + """ + _, converted_labels = tf.unique(labels) + return converted_labels +######################################################################################### +def compare_two_datasets_with_idcol(train_ds, validate_ds, idcol,verbose=0): + ls_test = list(validate_ds.as_numpy_iterator()) + ls_train = list(train_ds.as_numpy_iterator()) + if verbose >= 1: + print(' Size of dataset 1 = %d' %(len(ls_train))) + print(' Size of dataset 2 = %d' %(len(ls_test))) + ts_list = [ls_test[x][0][idcol] for x in range(len(ls_test)) ] + tra_list = [ls_train[x][0][idcol] for x in range(len(ls_train)) ] + print('Alert! 
%d rows in common between dataset 1 and 2' %(len(tra_list) - len(left_subtract(tra_list, ts_list))))
+##########################################################################################
+def process_continuous_data(data):
+    # Min-max normalize the data into the [0, 1] range
+    max_data = tf.reduce_max(data)
+    min_data = tf.reduce_min(data)
+    data = (tf.cast(data, tf.float32) - min_data)/(max_data - min_data)
+    return tf.reshape(data, [-1, 1])
+##########################################################################################
+# Process continuous features.
+def preprocess(features, labels):
+    for feature in floats:
+        features[feature] = process_continuous_data(features[feature])
+    return features, labels
+##########################################################################################
+###### This code is a modified version of keras.io documentation code examples ##########
+######   https://keras.io/examples/structured_data/wide_deep_cross_networks/    ##########
+##########################################################################################
+def encode_auto_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, CATEGORICAL_FEATURES_WITH_VOCABULARY,
+                  hidden_units, use_embedding=False):
+    ### These are for string cat variables with small number of categories #############
+    cat_encoded = []
+    numeric_encoded = []
+    text_encoded = []
+    encoded_features = []
+    #### In "auto" model, "wide" part is short. Hence we use "count" with "embedding" flag.
+    for feature_name in inputs:
+        vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name]
+        extra_oov = 3
+        if feature_name in CATEGORICAL_FEATURE_NAMES:
+            cat_encoded.append('')
+            cat_len = len(vocabulary)
+            lookup = StringLookup(vocabulary=vocabulary,
+                                  mask_token=None,
+                                  oov_token = '~UNK~')
+            if len(vocabulary) > 32:
+                # Convert the string input values into integer indices.
+                encoded_feature = inputs[feature_name]
+                encoded_feature = lookup(encoded_feature)
+                embedding_dims = int(max(2, math.log(len(vocabulary), 2)))
+                # Create an embedding layer with the specified dimensions.
+                embedding = Embedding(
+                    input_dim=len(vocabulary)+extra_oov, output_dim=embedding_dims
+                )
+                # Convert the index values to embedding representations.
+                encoded_feature = embedding(encoded_feature)
+                cat_encoded[-1] = Flatten()(encoded_feature)
+            else:
+                encoded_feature = inputs[feature_name]
+                encoded_feature = lookup(encoded_feature)
+                cat_encoded[-1] = CategoryEncoding(num_tokens = cat_len + 1)(encoded_feature)
+        elif feature_name in FLOATS:
+            ### Floats are simply normalized using the mean and variance stored for each column ####
+            numeric_encoded.append('')
+            feat_mean = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name][0]
+            feat_var = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name][1]
+            normalizer = Normalization(mean=feat_mean, variance=feat_var)
+            numeric_encoded[-1] = normalizer(inputs[feature_name])
+        else:
+            cat_encoded.append('')
+            if len(vocabulary) > 100:
+                print('    ALERT! Excessive dimensions (%d) in %s. Should this integer column be a float variable?' %(
+                                len(vocabulary), feature_name))
+                use_embedding = True
+            lookup = IntegerLookup(
+                vocabulary=vocabulary,
+                mask_token=None,
+                num_oov_indices=extra_oov,
+                max_tokens=None,
+                oov_token=-9999,
+                output_mode="count" if use_embedding else "binary",
+            )
+            # Encode the integer features with the lookup defined above.
+ encoded_feature = inputs[feature_name] + cat_encoded[-1] = lookup(encoded_feature) + ##### This is where are float encoded features are combined ### + ####Concatenate all features( Numerical input ) + encoded_features = layers.concatenate(cat_encoded+numeric_encoded) + return encoded_features +################################################################################################## +import math +def encode_fast_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, CATEGORICAL_FEATURES_WITH_VOCABULARY, + use_embedding=False): + encoded_features = [] + for feature_name in inputs: + vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name] + extra_oov = 3 + if feature_name in CATEGORICAL_FEATURE_NAMES: + # Create a lookup to convert string values to an integer indices. + # Since we are not using a mask token but expecting some out of vocabulary + # (oov) tokens, we set mask_token to None and num_oov_indices to extra_oov. + if len(vocabulary) > 32: + use_embedding = True + lookup = StringLookup( + vocabulary=vocabulary, + mask_token=None, + num_oov_indices=extra_oov, + max_tokens=None, + output_mode="int" if use_embedding else "binary", + ) + if use_embedding: + # Convert the string input values into integer indices. + encoded_feature = inputs[feature_name] + encoded_feature = lookup(encoded_feature) + #embedding_dims = int(math.sqrt(len(vocabulary))) + embedding_dims = int(max(2, math.log(len(vocabulary), 2))) + # Create an embedding layer with the specified dimensions. + embedding = layers.Embedding( + input_dim=len(vocabulary)+extra_oov, output_dim=embedding_dims + ) + # Convert the index values to embedding representations. + encoded_feature = embedding(encoded_feature) + encoded_feature = Flatten()(encoded_feature) + else: + # Convert the string input values into a one hot encoding. + encoded_feature = lookup(inputs[feature_name]) + elif feature_name in FLOATS: + ### you just ignore the floats in cross models #### + feat_mean = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name][0] + feat_var = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name][1] + normalizer = Normalization(mean=feat_mean, variance=feat_var) + encoded_feature = normalizer(inputs[feature_name]) + else: + if len(vocabulary) > 100: + print(' ALERT! Excessive feature dimension in %s. Should %s be a float variable?' %( + len(vocabulary), feature_name)) + use_embedding = True + lookup = IntegerLookup( + vocabulary=vocabulary, + mask_token=None, + num_oov_indices=extra_oov, + max_tokens=None, + oov_token=-9999, + output_mode="count" if use_embedding else "binary", + ) + # Use the numerical features as-is. + encoded_feature = inputs[feature_name] + encoded_feature = lookup(encoded_feature) + encoded_features.append(encoded_feature) + ##### This is where are float encoded features are combined ### + all_features = layers.concatenate(encoded_features) + return all_features +########################################################################################## +###### This code is a modified version of keras.io documentation code examples ########## +###### https://keras.io/examples/structured_data/wide_deep_cross_networks/ ########## +########################################################################################## +def closest(lst, K): + """ + Find a number in list lst that is closest to the value K. 
+ """ + return lst[min(range(len(lst)), key = lambda i: abs(lst[i]-K))] +############################################################################################## +def create_nlp_inputs(nlp_vars): + inputs = {} + for feature_name in nlp_vars: + inputs[feature_name] = layers.Input( + name=feature_name, shape=(1,), dtype=tf.string + ) + return inputs +################################################################################# +def encode_nlp_inputs(inputs, CATEGORICAL_FEATURES_WITH_VOCABULARY): + layerlist = [] + list_embedding_sizes = [8, 16, 24, 32, 48, 64, 96, 128, 256] + for feature_name in inputs: + vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name]['vocab'] + extra_oov = 50 + #vocab_size = int(math.sqrt(len(vocabulary))) + #best_embedding_size = closest(list_embedding_sizes, vocab_size//4000) + best_embedding_size = int(max(2, math.log(len(vocabulary), 2))) + + lookup = StringLookup( + vocabulary=vocabulary, + mask_token=None, + num_oov_indices=extra_oov, + max_tokens=None, + output_mode="int", + ) + # Convert the string input values into integer indices. + encoded_feature = inputs[feature_name] + encoded_feature = lookup(encoded_feature) + #embedding_dims = int(math.sqrt(len(vocabulary))) + # Create an embedding layer with the specified dimensions. + embedding = layers.Embedding( + input_dim=len(vocabulary)+extra_oov, output_dim=best_embedding_size + ) + # Convert the index values to embedding representations. + encoded_feature = embedding(encoded_feature) + encoded_feature = Flatten()(encoded_feature) + layerlist.append(encoded_feature) + ##### This is where are float encoded features are combined ### + all_features = layers.concatenate(layerlist) + #all_features = layers.Concatenate(axis = -1)(layerlist) + #all_features = Flatten()(all_features) + return all_features +################################################################################# +def create_fast_inputs(FEATURE_NAMES, NUMERIC_FEATURE_NAMES, FLOATS): + inputs = {} + for feature_name in FEATURE_NAMES: + if feature_name in FLOATS: + inputs[feature_name] = layers.Input( + name=feature_name, shape=(1,), dtype=tf.float32 + ) + elif feature_name in NUMERIC_FEATURE_NAMES: + inputs[feature_name] = layers.Input( + name=feature_name, shape=(1,), dtype=tf.int32 + ) + else: + inputs[feature_name] = layers.Input( + name=feature_name, shape=(1,), dtype=tf.string + ) + return inputs +################################################################################# +def create_all_inputs(FEATURE_NAMES, NUMERIC_FEATURE_NAMES, FLOATS): + inputs = {} + for feature_name in FEATURE_NAMES: + if feature_name in FLOATS: + inputs[feature_name] = layers.Input( + name=feature_name, shape=(1,), dtype=tf.float32 + ) + elif feature_name in NUMERIC_FEATURE_NAMES: + inputs[feature_name] = layers.Input( + name=feature_name, shape=(1,), dtype=tf.float32 + ) + else: + inputs[feature_name] = layers.Input( + name=feature_name, shape=(1,), dtype=tf.string + ) + return inputs +######################################################################################## +def encode_num_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, CATEGORICAL_FEATURES_WITH_VOCABULARY, + use_embedding=False): + encoded_features = [] + for feature_name in inputs: + if feature_name in FLOATS: + # Convert the string input values into a one hot encoding. 
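+            # Note: float features are passed through unchanged in this branch; no lookup
+            # or one-hot encoding is applied to them here.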
+ encoded_feature = inputs[feature_name] + encoded_features.append(encoded_feature) + ##### This is where are float encoded features are combined ### + all_features = layers.concatenate(encoded_features) + return all_features +#################################################################################################### +def encode_all_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, CATEGORICAL_FEATURES_WITH_VOCABULARY, + use_embedding=False): + #### This is a new version intended to reduce dimensions ################# + encoded_features = [] + for feature_name in inputs: + vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name] + extra_oov = 3 + if feature_name in CATEGORICAL_FEATURE_NAMES: + # Create a lookup to convert string values to an integer indices. + # Since we are not using a mask token but expecting some out of vocabulary + # (oov) tokens, we set mask_token to None and num_oov_indices to extra_oov. + if len(vocabulary) > 32: + use_embedding = True + lookup = StringLookup( + vocabulary=vocabulary, + mask_token=None, + num_oov_indices=extra_oov, + max_tokens=None, + output_mode="int" if use_embedding else "binary", + ) + if use_embedding: + # Convert the string input values into integer indices. + encoded_feature = inputs[feature_name] + encoded_feature = lookup(encoded_feature) + embedding_dims = int(max(2, math.log(len(vocabulary), 2))) + # Create an embedding layer with the specified dimensions. + embedding = layers.Embedding( + input_dim=len(vocabulary)+extra_oov, output_dim=embedding_dims + ) + # Convert the index values to embedding representations. + encoded_feature = embedding(encoded_feature) + encoded_feature = Flatten()(encoded_feature) + else: + # Convert the string input values into a one hot encoding. + encoded_feature = lookup(inputs[feature_name]) + else: + ##### This is for both integer and floats = they are considered same ### + ### you just ignore the floats in cross models #### + feat_mean = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name][0] + feat_var = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name][1] + normalizer = Normalization(mean=feat_mean, variance=feat_var) + encoded_feature = normalizer(inputs[feature_name]) + #encoded_feature = inputs[feature_name] + encoded_features.append(encoded_feature) + ################### + int_list = [x for x in encoded_features if x.dtype in [np.int8, np.int16, np.int32, np.int64]] + float_list = [ x for x in encoded_features if x.dtype in [np.float32, np.float64]] + if len(int_list) > 0: + all_int_features = layers.concatenate(int_list) + meta_int1 = layers.Dense(32)(all_int_features) + if len(float_list) > 0: + all_float_features = layers.concatenate(float_list) + meta_float1 = layers.Dense(32)(all_float_features) + #### You can add a Dense layer if needed here ########### + if len(int_list) > 0: + if len(float_list) > 0: + all_features = layers.concatenate([meta_int1, meta_float1]) + else: + all_features = layers.concatenate([meta_int1]) + else: + all_features = layers.concatenate([meta_float1]) + ##### This is where are float encoded features are combined ### + return all_features +################################################################################ +from itertools import combinations +def perform_new_feature_crossing(cat_input_dict, cross_cats, dataset): + combos = combinations(cross_cats, 2) + combos_encoded_list = [] + for cat1, cat2 in combos: + cat_combo_encoded = encode_cat_feature_crosses(cat_input_dict[cat1], cat_input_dict[cat2], dataset, bins_num=64) + 
combos_encoded_list.append(cat_combo_encoded) + return combos_encoded_list +################################################################################## +def perform_feature_crossing(cat_input_dict, cross_cats, cats, floats, max_tokens_zip, verbose=0): + + high_cats_alert = 50 ### set this number to alery you when a variable has high dimensions. Should it? + hidden_units = [32, 32] ## this is the number needed for feature crossing + dropout_rate = 0.1 ### this is set at a low rate ### + if len(cross_cats) > 20: + print(' Too many categorical features (>20). hence taking first 20 features for crossing.') + ls = cross_cats[:20] + elif len(cross_cats) != len(cat_input_dict): + ls = cross_cats + else: + ls = list(cat_input_dict.keys()) + ################################################################################### + keys = list(cat_input_dict.keys()) + for each in keys: + if each not in ls: + cat_input_dict.pop(each) + cross_cats = copy.deepcopy(ls) + ################################################################################## + try: + # This is a deep and cross network for cat and int-cat + int-bool feature crosses + each_cat_coded = encode_all_inputs(cat_input_dict, cats, floats, max_tokens_zip, + use_embedding=True) + cross = each_cat_coded + for _ in hidden_units: + units = cross.shape[-1] + x = layers.Dense(units)(cross) + if cross.dtype == x.dtype: + cross = each_cat_coded * x + cross + else: + each_cat_coded = tf.cast(each_cat_coded, tf.float32) + cross = tf.cast(cross, tf.float32) + cross = each_cat_coded * x + cross + + cross = layers.BatchNormalization()(cross) + + deep = each_cat_coded + for units in hidden_units: + deep = layers.Dense(units)(deep) + deep = layers.BatchNormalization()(deep) + deep = layers.ReLU()(deep) + deep = layers.Dropout(dropout_rate)(deep) + + feat_cross_encoded = layers.concatenate([cross, deep]) + #feat_cross_encoded = cross + if verbose: + print(' feature crossing completed: cross encoding shape: %s' %(feat_cross_encoded.shape[1])) + if feat_cross_encoded.shape[1] > high_cats_alert: + print(' Alert! excessive feature dimension created. Check if necessary to have this many.') + except: + print(' Error: Skipping feature crossing since Keras preprocessing step is erroring') + return feat_cross_encoded +######################################################################################## +class OutputLayer(tf.keras.layers.Layer): + def __init__(self): + super(OutputLayer, self).__init__() + + def call(self, inputs, **kwargs): + outputs = tf.keras.backend.cast(inputs, "string") + return outputs +########################################################################################## +def encode_bool_inputs(inputs): + encoded_features = [] + for feature_name in inputs: + # Convert the string input values into a one hot encoding. 
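+        # Boolean columns reach this function as the strings 'True'/'False' (their Inputs
+        # are created with dtype=tf.string when reading from files), so a two-token
+        # StringLookup with output_mode="binary" is enough to encode them.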
+ lookup = StringLookup( + vocabulary=['True', 'False'], + mask_token=None, + num_oov_indices=1, + max_tokens=None, + output_mode="binary" + ) + encoded_feature = inputs[feature_name] + encoded_feature = lookup(encoded_feature) + encoded_features.append(encoded_feature) + ##### This is where are float encoded features are combined ### + if len(inputs) == 1: + all_features = encoded_feature + else: + all_features = layers.concatenate(encoded_features) + return all_features +############################################################################################ diff --git a/build/lib/deep_autoviml/preprocessing/preprocessing_text.py b/build/lib/deep_autoviml/preprocessing/preprocessing_text.py new file mode 100644 index 0000000..443c7a5 --- /dev/null +++ b/build/lib/deep_autoviml/preprocessing/preprocessing_text.py @@ -0,0 +1,117 @@ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +import tempfile +import pdb +import copy +import warnings +warnings.filterwarnings(action='ignore') +import functools +from itertools import combinations +from collections import defaultdict + +# Make numpy values easier to read. 
+np.set_printoptions(precision=3, suppress=True) +############################################################################################ +# data pipelines and feature engg here + +# pre-defined TF2 Keras models and your own models here +from deep_autoviml.data_load.classify_features import check_model_options + +# Utils + +############################################################################################ +# TensorFlow ≥2.4 is required +import tensorflow as tf +np.random.seed(42) +tf.random.set_seed(42) +from tensorflow.keras import layers +from tensorflow import keras +from tensorflow.keras.layers.experimental.preprocessing import Normalization, StringLookup, Hashing +from tensorflow.keras.layers.experimental.preprocessing import IntegerLookup, CategoryEncoding, CategoryCrossing +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization, Discretization +from tensorflow.keras.layers import Embedding, Flatten + +from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras.models import Model, load_model +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers +import tensorflow_hub as hub + + +from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error +from IPython.core.display import Image, display +import pickle +############################################################################################# +##### Suppress all TF2 and TF1.x warnings ################### +try: + tf.logging.set_verbosity(tf.logging.ERROR) +except: + tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) +############################################################################################ +from tensorflow.keras.layers import Reshape, MaxPooling1D, MaxPooling2D, AveragePooling2D, AveragePooling1D +from tensorflow.keras import Model, Sequential +from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D, GlobalMaxPooling1D, Dropout, Conv1D +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization +############################################################################################ +def preprocessing_text(train_ds, keras_model_type, model_options): + """ + #################################################################################################### + This produces a preprocessing layer for an incoming NLP column using TextVectorization from keras. + You need to just send in a tf.data.DataSet from training folder and it will automatically apply NLP. + It will return a full-model-ready layer that you can add to your Keras Functional model as an NLP_layer! + max_tokens_zip is a dictionary of each NLP column name and its max_tokens as defined by train data. 
+ ########### Motivation and suggestions for coding for Image processing came from this blog ######### + Greatly indebted to Srivatsan for his Github and notebooks: https://github.com/srivatsan88/YouTubeLI + #################################################################################################### + """ + num_predicts = model_options["num_classes"] + try: + if keras_model_type.lower() in ["text"]: + ####### L O A D F E A T U R E E X T R A C T O R ################ + url = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1" + tf_hub_model = check_model_options(model_options, "tf_hub_model", url) + feature_extractor_layer = hub.KerasLayer(tf_hub_model, output_shape=[20], + input_shape=[], dtype=tf.string, trainable=False) + units = 16 + print('Using Swivel-20D model from TensorFlow Hub') + else: + tf_hub_model = "https://tfhub.dev/google/nnlm-en-dim50/2" + feature_extractor_layer = hub.KerasLayer(tf_hub_model, output_shape=[50], + input_shape=[], dtype=tf.string, trainable=True) + units = 32 + print(' Using NNLM-50D model from TensorFlow Hub') + tf.random.set_seed(111) + model = tf.keras.Sequential([ + feature_extractor_layer, + tf.keras.layers.Dense(units, activation='relu'), + tf.keras.layers.Dense(num_predicts,activation='sigmoid') + ]) + model.compile( + optimizer='adam', + loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True), + metrics=['accuracy']) + except: + print(' Error: Failed NLP preprocessing layer. Returning...') + return + return model diff --git a/build/lib/deep_autoviml/utilities/utilities.py b/build/lib/deep_autoviml/utilities/utilities.py new file mode 100644 index 0000000..f12dc39 --- /dev/null +++ b/build/lib/deep_autoviml/utilities/utilities.py @@ -0,0 +1,1330 @@ +############################################################################################ +#Copyright 2021 Google LLC + +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +############################################################################################ +import pdb +from sklearn.metrics import balanced_accuracy_score, classification_report +from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score +from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error +from collections import defaultdict +import pandas as pd +import numpy as np +pd.set_option('display.max_columns',500) +import matplotlib.pyplot as plt +import copy +import warnings +warnings.filterwarnings(action='ignore') +import functools +# Make numpy values easier to read. 
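For reference, a minimal sketch of the TF Hub text path that `preprocessing_text` (above) builds when `keras_model_type='text'`. The Swivel-20D URL is the one used in that function; the class count and Dense width are illustrative, and leaving the output layer linear to match `from_logits=True` is my assumption (the library's version uses a sigmoid activation instead):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Same Swivel-20D embedding URL used by preprocessing_text for "text" models.
url = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
feature_extractor = hub.KerasLayer(url, output_shape=[20], input_shape=[],
                                   dtype=tf.string, trainable=False)

num_classes = 3  # hypothetical number of target classes
model = tf.keras.Sequential([
    feature_extractor,
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(num_classes),   # linear logits to pair with from_logits=True
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

# The model consumes raw string tensors directly.
logits = model(tf.constant(["this movie was great", "terrible film"]))
```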
+np.set_printoptions(precision=3, suppress=True) +################################################################################ +import tensorflow as tf +np.random.seed(42) +tf.random.set_seed(42) +from tensorflow.keras import layers +from tensorflow import keras +from tensorflow.keras.layers.experimental.preprocessing import Normalization, StringLookup +from tensorflow.keras.layers.experimental.preprocessing import IntegerLookup, CategoryEncoding +from tensorflow.keras.layers.experimental.preprocessing import TextVectorization + +from tensorflow.keras.optimizers import SGD, Adam, RMSprop +from tensorflow.keras import layers +from tensorflow.keras import optimizers +from tensorflow.keras.models import Model, load_model +from tensorflow.keras import callbacks +from tensorflow.keras import backend as K +from tensorflow.keras import utils +from tensorflow.keras.layers import BatchNormalization +from tensorflow.keras.optimizers import SGD +from tensorflow.keras import regularizers + +from tensorflow.keras import layers +from tensorflow.keras.models import Model, load_model +################################################################################ +from deep_autoviml.modeling.one_cycle import OneCycleScheduler +##### Suppress all TF2 and TF1.x warnings ################### +tf2logger = tf.get_logger() +tf2logger.warning('Silencing TF2.x warnings') +tf2logger.root.removeHandler(tf2logger.root.handlers) +tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) + +################################################################################ +import os +def check_if_GPU_exists(verbose=0): + GPU_exists = False + gpus = tf.config.list_physical_devices('GPU') + logical_gpus = tf.config.list_logical_devices('GPU') + tpus = tf.config.list_logical_devices('TPU') + #### In some cases like Kaggle kernels, the GPU is not enabled. Hence this check. + if logical_gpus: + # Restrict TensorFlow to only use the first GPU + if verbose: + print("GPUs found in this device...: ") + print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU") + if len(logical_gpus) > 1: + device = "gpus" + else: + device = "gpu" + try: + tf.config.experimental.set_visible_devices(gpus[0], 'GPU') + except RuntimeError as e: + # Visible devices must be set before GPUs have been initialized + print(e) + try: + # Currently, memory growth needs to be the same across GPUs + for gpu in gpus: + tf.config.experimental.set_memory_growth(gpu, True) + print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs") + except RuntimeError as e: + # Memory growth must be set before GPUs have been initialized + print(e) + elif tpus: + device = "tpu" + print("GPUs found in this device...: ") + if verbose: + print("Listing all TPU devices: ") + for tpu in tpus: + print(tpu) + else: + print('Only CPU found on this device') + device = "cpu" + #### Set Strategy ########## + if device == "tpu": + try: + resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='') + tf.config.experimental_connect_to_cluster(resolver) + # This is the TPU initialization code that has to be at the beginning. 
+ tf.tpu.experimental.initialize_tpu_system(resolver) + strategy = tf.distribute.TPUStrategy(resolver) + if verbose: + print('Setting TPU strategy using %d devices' %strategy.num_replicas_in_sync) + except: + if verbose: + print('Setting TPU strategy using Colab...') + resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR']) + tf.config.experimental_connect_to_cluster(resolver) + # This is the TPU initialization code that has to be at the beginning. + tf.tpu.experimental.initialize_tpu_system(resolver) + strategy = tf.distribute.TPUStrategy(resolver) + elif device == "gpu": + strategy = tf.distribute.MirroredStrategy() + if verbose: + print('Setting Mirrored GPU strategy using %d devices' %strategy.num_replicas_in_sync) + elif device == "gpus": + strategy = tf.distribute.MultiWorkerMirroredStrategy() + if verbose: + print('Setting Multiworker GPU strategy using %d devices' %strategy.num_replicas_in_sync) + else: + strategy = tf.distribute.OneDeviceStrategy(device='/device:CPU:0') + if verbose: + print('Setting CPU strategy using %d devices' %strategy.num_replicas_in_sync) + return strategy +###################################################################################### +def print_one_row_from_tf_dataset(test_ds): + """ + No matter how big a dataset or batch size, this handy function will print the first row. + This way you can test what's in each row of a tensorflow dataset that you sent in as input + You need to provide at least one column in the dataset for it to check if it should print it. + Inputs: + ------- + test_ds: tf.data.DataSet - this must be batched and num_epochs must be an integer. + - otherwise it won't print! + """ + try: + if isinstance(test_ds, tuple): + dict_row = list(test_ds.as_numpy_iterator())[0] + else: + dict_row = test_ds + print("Printing one batch from the dataset:") + preds = list(dict_row.element_spec[0].keys()) + if dict_row.element_spec[0][preds[0]].shape[0] is None or isinstance( + dict_row.element_spec[0][preds[0]].shape[0], int): + for batch, head in dict_row.take(1): + for labels, value in batch.items(): + print("{:40s}: {}".format(labels, value.numpy()[:4])) + except: + print(' Error printing. Continuing...') +######################################################################################### +def print_one_row_from_tf_label(test_label): + """ + No matter how big a dataset or batch size, this handy function will print the first row. + This way you can test what's in each row of a tensorflow dataset that you sent in as input + You need to provide at least one column in the dataset for it to check if it should print it. + Inputs: + ------- + test_label: tf.data.DataSet - this must be batched and num_epochs must be an integer. + - otherwise it won't print! 
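`check_if_GPU_exists` (above) returns a `tf.distribute` strategy rather than a device string, so callers are expected to build and compile their model inside the strategy scope. A hedged usage sketch, assuming the function is importable and using a toy model:

```python
import tensorflow as tf

strategy = check_if_GPU_exists(verbose=1)   # falls back to OneDeviceStrategy on CPU

with strategy.scope():
    # Model and optimizer creation must happen inside the scope so that
    # variables are placed/mirrored on the devices the strategy manages.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mae")

print("replicas in sync:", strategy.num_replicas_in_sync)
```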
+ """ + if isinstance(test_label, tuple): + dict_row = list(test_label.as_numpy_iterator())[0] + else: + dict_row = test_label + preds = list(dict_row.element_spec[0].keys()) + try: + ### This is for multilabel problems only #### + if len(dict_row.element_spec[1]) >= 1: + labels = list(dict_row.element_spec[1].keys()) + for feats, labs in dict_row.take(1): + for each_label in labels: + print(' label = %s, samples: %s' %(each_label, labs[each_label])) + except: + ### This is for single problems only #### + if dict_row.element_spec[0][preds[0]].shape[0] is None or isinstance( + dict_row.element_spec[0][preds[0]].shape[0], int): + for feats, labs in dict_row.take(1): + print(" samples from label: %s" %(labs.numpy().tolist()[:10])) +########################################################################################## +from sklearn.base import TransformerMixin +from collections import defaultdict +import pandas as pd +import numpy as np +class My_LabelEncoder(TransformerMixin): + """ + ################################################################################################ + ###### This Label Encoder class works just like sklearn's Label Encoder! ##################### + ##### You can label encode any column in a data frame using this new class. But unlike sklearn, + the beauty of this function is that it can take care of NaN's and unknown (future) values. + It uses the same fit() and fit_transform() methods of sklearn's LabelEncoder class. + ################################################################################################ + Usage: + MLB = My_LabelEncoder() + train[column] = MLB.fit_transform(train[column]) + test[column] = MLB.transform(test[column]) + """ + def __init__(self): + self.transformer = defaultdict(str) + self.inverse_transformer = defaultdict(str) + + def fit(self,testx): + if isinstance(testx, pd.Series): + pass + elif isinstance(testx, np.ndarray): + testx = pd.Series(testx) + else: + return testx + outs = np.unique(testx.factorize()[0]) + ins = np.unique(testx.factorize()[1]).tolist() + if -1 in outs: + ins.insert(0,np.nan) + self.transformer = dict(zip(ins,outs.tolist())) + self.inverse_transformer = dict(zip(outs.tolist(),ins)) + return self + + def transform(self, testx): + if isinstance(testx, pd.Series): + pass + elif isinstance(testx, np.ndarray): + testx = pd.Series(testx) + else: + return testx + ins = np.unique(testx.factorize()[1]).tolist() + missing = [x for x in ins if x not in self.transformer.keys()] + if len(missing) > 0: + for each_missing in missing: + max_val = np.max(list(self.transformer.values())) + 1 + self.transformer[each_missing] = max_val + self.inverse_transformer[max_val] = each_missing + ### now convert the input to transformer dictionary values + outs = testx.map(self.transformer).values + return outs + + def inverse_transform(self, testx): + ### now convert the input to transformer dictionary values + if isinstance(testx, pd.Series): + outs = testx.map(self.inverse_transformer).values + elif isinstance(testx, np.ndarray): + outs = pd.Series(testx).map(self.inverse_transformer).values + else: + outs = testx[:] + return outs +################################################################################# +import matplotlib.patches as mpatches +import matplotlib.pyplot as plt +from sklearn.metrics import accuracy_score, balanced_accuracy_score +################################################################################# +def plot_history(history, metric, targets): + if isinstance(targets, str): + #### This is for 
single label problems + fig = plt.figure(figsize=(15,6)) + #### first metric is always the loss - just plot it! + hist = pd.DataFrame(history.history) + hist['epoch'] = history.epoch + ax1 = plt.subplot(1, 2, 1) + ax1.set_title('Model Training vs Validation Loss') + plot_one_history_metric(history, "loss", ax1) + ax2 = plt.subplot(1, 2, 2) + ax2.set_title('Model Training vs Validation %s' %metric) + ##### Now let's plot the second metric #### + plot_one_history_metric(history, metric, ax2) + else: + ### This is for Multi-Label problems + for each_target in targets: + fig = plt.figure(figsize=(15,6)) + hist = pd.DataFrame(history.history) + hist['epoch'] = history.epoch + ax1 = plt.subplot(1, 2, 1) + ax1.set_title('Model Training vs Validation Loss') + plot_one_history_metric(history, each_target+"_loss", ax1) + ax2 = plt.subplot(1, 2, 2) + ### Since we are using total loss, we must find another metric to show. + ### This is how we do it - by collecting all metrics with target name + ### and pick the first one. This may or may not always get the best answer, but we will see. + metric1 = [x for x in hist.columns.tolist() if (each_target in x) & ("loss" not in x) ] + metric2 = metric1[0] + ### the next line is not a typo! it should always be target[0] + ### since val_monitor is based on the first target's metric only! + #metric1 = metric.replace(targets[0],'') + #metric2 = each_target + metric1 + ax2.set_title('Model Training vs Validation %s' %metric2) + plot_one_history_metric(history, metric2, ax2) + plt.show(); +####################################################################################### +def plot_one_history_metric(history, metric, ax): + train_metrics = history.history[metric] + val_metrics = history.history['val_'+metric] + epochs = range(1, len(train_metrics) + 1) + ax.plot(epochs, train_metrics) + ax.plot(epochs, val_metrics) + ax.set_xlabel("Epochs") + ax.set_ylabel(metric) + ax.legend(["train_"+metric, 'val_'+metric]) +#################################################################################### +from sklearn.metrics import classification_report, confusion_matrix +from sklearn.metrics import balanced_accuracy_score +from collections import OrderedDict +from collections import Counter +def print_classification_model_stats(y_test, y_preds): + """ + This will print both multi-label and multi-class metrics. + You must send in only actual and predicted labels. No probabilities!! 
+ """ + try: + assert y_preds.shape[1] + for i in range(y_preds.shape[1]): + if y_preds.shape[1] == y_test.shape[1]: + print('Target label %s results:' %(i+1)) + print_classification_model_metrics(y_test[:,i], y_preds[:,i]) + else: + print('error printing: number of labels in actuals and predicted are different ') + except: + ### This is a binary class only ####### + print_classification_model_metrics(y_test, y_preds) + +def print_classification_model_metrics(y_true, predicted): + """ + This prints classification metrics in a nice format only for binary classes + """ + #### Use this to Test Classification Problems Only #### + try: + y_pred = predicted.argmax(axis=1) + except: + y_pred = predicted + print('Balanced Accuracy = %0.2f%%' %( + 100*balanced_accuracy_score(y_true, y_pred))) + print('Confusion Matrix:') + print(confusion_matrix(y_true, y_pred)) + print(classification_report(y_true, y_pred)) + print('#####################################################################') + return balanced_accuracy_score(y_true, y_pred) +################################################################################### +from sklearn.metrics import classification_report +import matplotlib.pyplot as plt +import seaborn as sns +def plot_classification_results(y_true, y_pred, labels, target_names, title_string=""): + try: + fig, axes = plt.subplots(1,2,figsize=(15,6)) + draw_confusion_matrix(y_true, y_pred, labels, target_names, '%s Confusion Matrix' %title_string, ax=axes[0]) + try: + clf_report = classification_report(y_true, + y_pred, + labels=labels, + target_names=target_names, + output_dict=True) + except: + clf_report = classification_report(y_true,y_pred,labels=target_names, + target_names=labels,output_dict=True) + sns.heatmap(pd.DataFrame(clf_report).iloc[:, :].T, annot=True,ax=axes[1],fmt='0.2f'); + axes[1].set_title('Classification Report') + except: + print('Error: could not plot classification results. Continuing...') +###################################################################################### +import matplotlib.pyplot as plt +import seaborn as sns +from sklearn.metrics import confusion_matrix, f1_score +def draw_confusion_matrix(y_test,y_pred, labels, target_names, model_name='Model',ax=''): + """ + This plots a beautiful confusion matrix based on input: ground truths and predictions + """ + #Confusion Matrix + '''Plotting CONFUSION MATRIX''' + import seaborn as sns + sns.set_style('darkgrid') + + '''Display''' + from IPython.core.display import display, HTML + display(HTML("")) + pd.options.display.float_format = '{:,.2f}'.format + + #Get the confusion matrix and put it into a df + + cm = confusion_matrix(y_test, y_pred) + + cm_df = pd.DataFrame(cm, + index = labels, + columns = target_names, + ) + + sns.heatmap(cm_df, + center=0, + cmap=sns.diverging_palette(220, 15, as_cmap=True), + annot=True, + fmt='g', + ax=ax) + + ax.set_title(' %s \nF1 Score(avg = micro): %0.2f \nF1 Score(avg = macro): %0.2f' %( + model_name,f1_score(y_test, y_pred, average='micro'),f1_score(y_test, y_pred, average='macro')), + fontsize = 13) + ax.set_ylabel('True label', fontsize = 13) + ax.set_xlabel('Predicted label', fontsize = 13) +################################################################################ +def print_regression_model_stats(actuals, predicted, targets='', plot_name=''): + """ + This program prints and returns MAE, RMSE, MAPE. 
+ If you like the MAE and RMSE to have a title or something, just give that + in the input as "title" and it will print that title on the MAE and RMSE as a + chart for that model. Returns MAE, MAE_as_percentage, and RMSE_as_percentage + """ + if isinstance(actuals,pd.Series) or isinstance(actuals,pd.DataFrame): + actuals = actuals.values + if isinstance(predicted,pd.Series) or isinstance(predicted,pd.DataFrame): + predicted = predicted.values + if len(actuals) != len(predicted): + print('Error: Number of actuals and predicted dont match. Continuing...') + if targets == "": + try: + ### This is for Multi_Label Problems ### + assert actuals.shape[1] + multi_label = True + if isinstance(actuals,pd.Series): + cols = [actuals.name] + elif isinstance(actuals,pd.DataFrame): + cols = actuals.columns.tolist() + else: + cols = ['target_'+str(i) for i in range(actuals.shape[1])] + except: + #### THis is for Single Label problems ##### + multi_label = False + if isinstance(actuals,pd.Series): + cols = [actuals.name] + elif isinstance(actuals,pd.DataFrame): + cols = actuals.columns.tolist() + else: + cols = ['target_1'] + else: + cols = copy.deepcopy(targets) + if isinstance(targets, str): + cols = [targets] + if len(cols) == 1: + multi_label = False + else: + multi_label = True + try: + plot_regression_scatters(actuals,predicted,cols,plot_name=plot_name) + except: + print('Could not draw regression plot but continuing...') + if multi_label: + for i in range(actuals.shape[1]): + actuals_x = actuals[:,i] + try: + predicted_x = predicted[:,i] + except: + predicted_x = predicted.ravel() + print('Regression Metrics for Target=%s' %cols[i]) + mae, mae_asp, rmse_asp = print_regression_metrics(actuals_x, predicted_x) + else: + mae, mae_asp, rmse_asp = print_regression_metrics(actuals, predicted) + return mae, mae_asp, rmse_asp +################################################################################ +from sklearn.metrics import r2_score +def print_regression_metrics(actuals, predicted): + predicted = np.nan_to_num(predicted) + mae = mean_absolute_error(actuals, predicted) + mae_asp = (mean_absolute_error(actuals, predicted)/actuals.std())*100 + rmse_asp = (np.sqrt(mean_squared_error(actuals,predicted))/actuals.std())*100 + rmse = print_rmse(actuals, predicted) + _ = print_mape(actuals, predicted) + mape = print_mape(actuals, predicted) + print(' MAE = %0.4f' %mae) + print(" MAPE = %0.0f%%" %(mape)) + print(' RMSE = %0.4f' %rmse) + print(' MAE as %% std dev of Actuals = %0.1f%%' %(mae/abs(actuals).std()*100)) + # Normalized RMSE print('RMSE = {:,.Of}'.format(rmse)) + print(' R-Squared (%% ) = %0.0f%%' %(100*r2_score(actuals,predicted))) + print(' Normalized RMSE (%% of Std Dev of Actuals) = %0.0f%%' %(100*rmse/actuals.std())) + return mae, mae_asp, rmse_asp +################################################################################ +def print_static_rmse(actuals, predicted, start_from=0,verbose=0): + """ + this calculates the ratio of the rmse error to the standard deviation of the actuals. + This ratio should be below 1 for a model to be considered useful. + The comparison starts from the row indicated in the "start_from" variable. 
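The "normalized" figures printed by `print_regression_metrics` above are just the raw errors divided by the standard deviation of the actuals. A tiny worked example with invented numbers shows the quantities it reports:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actuals = np.array([10.0, 12.0, 9.0, 14.0, 11.0])
preds   = np.array([11.0, 11.5, 8.0, 15.0, 10.0])

mae  = mean_absolute_error(actuals, preds)                   # 0.9
rmse = np.sqrt(mean_squared_error(actuals, preds))           # ~0.92
mape = np.mean(np.abs(100.0 * (actuals - preds) / actuals))  # ~8.3%; assumes no zero actuals

print("MAE  =", mae)
print("RMSE =", rmse)
print("MAPE = %0.1f%%" % mape)
print("Normalized RMSE = %0.1f%% of std dev of actuals" % (100 * rmse / actuals.std()))
```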
+ """ + predicted = np.nan_to_num(predicted) + rmse = np.sqrt(mean_squared_error(actuals[start_from:],predicted[start_from:])) + std_dev = actuals[start_from:].std() + if verbose >= 1: + print(' RMSE = %0.2f' %rmse) + print(' Std Deviation of Actuals = %0.2f' %(std_dev)) + print(' Normalized RMSE = %0.1f%%' %(rmse*100/std_dev)) + print(' R-Squared (%% ) = %0.0f%%' %(100*r2_score(actuals,predicted))) + return rmse, rmse/std_dev +################################################################################ +from sklearn.metrics import mean_squared_error,mean_absolute_error +def print_rmse(y, y_hat): + """ + Calculating Root Mean Square Error https://en.wikipedia.org/wiki/Root-mean-square_deviation + """ + mse = np.mean((y - y_hat)**2) + return np.sqrt(mse) + +def print_mape(y, y_hat): + """ + Calculating Mean Absolute Percent Error https://en.wikipedia.org/wiki/Mean_absolute_percentage_error + """ + perc_err = (100*(y - y_hat))/y + return np.mean(abs(perc_err)) +################################################################################ +from sklearn import metrics +import matplotlib.pyplot as plt +import copy +def print_classification_header(num_classes, num_labels, target_name): + ######## This is where you start printing metrics ############### + if isinstance(num_classes, list) : + if np.max(num_classes) > 2: + print('Multi Label (multi-output) Multi-Class Report: %s' %target_name) + print('#################################################################') + else: + print('Multi Label (multi-output) Binary Class Metrics Report: %s' %target_name) + print('#################################################################') + else: + if num_classes > 2: + print('Single Label (single-output), Multi-Class Report: %s' %target_name) + print('#################################################################') + else: + print('Single Label, Multi Class Model Metrics Report: %s' %target_name) + print('#################################################################') + +def print_classification_metrics(y_test, y_probs, proba_flag=True): + """ + ####### Send in the actual_values and prediction_probabilities for binary classes + This will return back metrics and print them all in a neat format + """ + y_test = copy.deepcopy(y_test) + multi_label_flag = False + multi_class_flag = False + #### for some cases, you won't get proba, so check the proba_flag + if proba_flag: + y_preds = y_probs.argmax(axis=1) + else: + y_preds = copy.deepcopy(y_probs) + ##### check if it is multi-class ##### + if len(np.unique(y_test)) > 2 or max(np.unique(y_test)) >= 2: + multi_class_flag = True + elif len(np.unique(y_preds)) > 2 or max(np.unique(y_preds)) >= 2: + multi_class_flag = True + ########### This is where we print the metrics ################### + try: + if not multi_class_flag and not multi_label_flag: + # Calculate comparison metrics for Binary classification results. 
+ accuracy = metrics.accuracy_score(y_test, y_preds) + balanced_accuracy = metrics.balanced_accuracy_score(y_test, y_preds) + precision = metrics.precision_score(y_test, y_preds) + f1_score = metrics.f1_score(y_test, y_preds) + recall = metrics.recall_score(y_test, y_preds) + if type(np.mean((y_test==y_preds))) == pd.Series: + print(' Accuracy = %0.1f%%' %(np.mean(accuracy)*100)) + else: + print(' Accuracy = %0.1f%%' %(accuracy*100)) + print(' Balanced Accuracy = %0.1f%%' %(balanced_accuracy*100)) + print(' Precision = %0.1f%%' %(precision*100)) + if proba_flag: + average_precision = np.mean(metrics.precision_score(y_test, y_preds, average=None)) + else: + average_precision = metrics.precision_score(y_test, y_preds, average='macro') + print(' Average Precision = %0.1f%%' %(average_precision*100)) + print(' Recall = %0.1f%%' %(recall*100)) + print(' F1 Score = %0.1f%%' %(f1_score*100)) + if proba_flag: + roc_auc = metrics.roc_auc_score(y_test, y_probs[:,1]) + #fpr, tpr, threshold = metrics.roc_curve(y_test, y_probs[:,1]) + #roc_auc = metrics.auc(fpr, tpr) + print(' ROC AUC = %0.1f%%' %(roc_auc*100)) + else: + roc_auc = 0 + print('#####################################################') + return [accuracy, balanced_accuracy, precision, average_precision, f1_score, recall, roc_auc] + else: + # Calculate comparison metrics for Multi-Class classification results. + accuracy = np.mean((y_test==y_preds)) + if multi_label_flag: + balanced_accuracy = np.mean(metrics.recall_score(y_test, y_preds, average=None)) + precision = metrics.precision_score(y_test, y_preds, average=None) + average_precision = metrics.precision_score(y_test, y_preds, average='macro') + f1_score = metrics.f1_score(y_test, y_preds, average=None) + recall = metrics.recall_score(y_test, y_preds, average=None) + else: + balanced_accuracy = metrics.balanced_accuracy_score(y_test, y_preds) + precision = metrics.precision_score(y_test, y_preds, average = None) + average_precision = metrics.precision_score(y_test, y_preds,average='macro') + f1_score = metrics.f1_score(y_test, y_preds, average = None) + recall = metrics.recall_score(y_test, y_preds, average = None) + if type(np.mean((y_test==y_preds))) == pd.Series: + print(' Accuracy = %0.1f%%' %(np.mean(accuracy)*100)) + else: + print(' Accuracy = %0.1f%%' %(accuracy*100)) + print(' Balanced Accuracy (average recall) = %0.1f%%' %(balanced_accuracy*100)) + print(' Average Precision (macro) = %0.1f%%' %(average_precision*100)) + ### these are basically one for each class ##### + print(' Precisions by class:') + for precisions in precision: + print(' %0.1f%% ' %(precisions*100),end="") + print('\n Recall Scores by class:') + for recalls in recall: + print(' %0.1f%% ' %(recalls*100), end="") + print('\n F1 Scores by class:') + for f1_scores in f1_score: + print(' %0.1f%% ' %(f1_scores*100),end="") + # Return list of metrics to be added to a Dataframe to compare models. + except: + print(' print classification metrics erroring. Continuing...') + print('\n#####################################################') + return [accuracy, balanced_accuracy, precision, average_precision, f1_score, recall, 0] +################################################################################################## +def find_rare_class(classes, verbose=0): + ######### Print the % count of each class in a Target variable ##### + """ + Works on Multi Class too. Prints class percentages count of target variable. + It returns the name of the Rare class (the one with the minimum class member count). 
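A quick illustration of what the `find_rare_class` helper reports; the class labels and counts here are invented:

```python
from collections import Counter

y = ["spam"] * 5 + ["ham"] * 95
counts = Counter(y)
rare_class = min(counts, key=counts.get)   # same idea as pd.Series(counts).idxmin()

print(counts)                    # Counter({'ham': 95, 'spam': 5})
print("rare class:", rare_class) # 'spam' -> handy as pos_label in binary problems
```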
+ This can also be helpful in using it as pos_label in Binary and Multi Class problems. + """ + counts = OrderedDict(Counter(classes)) + total = sum(counts.values()) + if verbose >= 1: + print(' Class -> Counts -> Percent') + sorted_keys = sorted(counts.keys()) + for cls in sorted_keys: + print("%12s: % 7d -> % 5.1f%%" % (cls, counts[cls], counts[cls]/total*100)) + if type(pd.Series(counts).idxmin())==str: + return pd.Series(counts).idxmin() + else: + return int(pd.Series(counts).idxmin()) +############################################################################### +##################################################################### +##### REGRESSION CHARTS AND METRICS ARE PRINTED PLOTTED HERE +##################################################################### +import time +from itertools import cycle +def plot_regression_scatters(df, df2, num_vars, kind='scatter', plot_name=''): + """ + Great way to plot continuous variables fast. Just sent them in and it will take care of the rest! + """ + figsize = (10, 10) + colors = cycle('byrcmgkbyrcmgkbyrcmgkbyrcmgk') + num_vars_len = len(num_vars) + col = 2 + start_time = time.time() + row = len(num_vars) + fig, ax = plt.subplots(row, col) + if col < 2: + fig.set_size_inches(min(15,8),row*5) + fig.subplots_adjust(hspace=0.5) ### This controls the space betwen rows + fig.subplots_adjust(wspace=0.3) ### This controls the space between columns + else: + fig.set_size_inches(min(col*10,20),row*5) + fig.subplots_adjust(hspace=0.3) ### This controls the space betwen rows + fig.subplots_adjust(wspace=0.3) ### This controls the space between columns + fig.suptitle('Regression Metrics Plots for %s Model' %plot_name, fontsize=20) + counter = 0 + if row == 1: + ax = ax.reshape(-1,1).T + for k in np.arange(row): + row_color = next(colors) + for l in np.arange(col): + try: + if col==1: + if row == 1: + x = df[:] + y = df2[:] + else: + x = df[:,k] + y = df2[:,k] + ax1 = ax[k][l] + lineStart = x.min() + lineEnd = x.max() + ax1.scatter(x, y, color=row_color) + ax1.plot([lineStart, lineEnd], [lineStart, lineEnd], 'k-', color=next(colors)) + ax1.set_xlabel('Actuals') + ax1.set_ylabel('Predicted') + ax1.set_title('Predicted vs Actuals Plot for Target = %s' %num_vars[k]) + else: + if row == 1: + x = df[:] + y = df2[:] + else: + x = df[:,k] + y = df2[:,k] + lineStart = x.min() + lineEnd = x.max() + if l == 0: + ax1 = ax[k][l] + ax1.scatter(x, y, color = row_color) + ax1.plot([lineStart, lineEnd], [lineStart, lineEnd], 'k-', color = next(colors)) + ax1.set_xlabel('Actuals') + ax1.set_ylabel('Predicted') + ax1.set_title('Predicted vs Actuals Plot for Target = %s' %num_vars[k]) + else: + ax1 = ax[k][l] + try: + assert y.shape[1] + ax1.hist((x-y.ravel()), density=True,color = row_color) + except: + ax1.hist((x-y), density=True,color = row_color) + ax1.axvline(linewidth=2, color='k') + ax1.set_title('Residuals Plot for Target = %s' %num_vars[k]) + except: + if col == 1: + counter += 1 + else: + ax[k][l].set_title('No Predicted vs Actuals Plot for plot as %s is not numeric' %num_vars[k]) + counter += 1 + print('Regression Plots completed in %0.3f seconds' %(time.time()-start_time)) +################################################################################ +def plot_regression_residuals(y_test, y_test_preds, target, project_name, num_labels): + """ + Another set of plots for continuous variables. 
+ """ + try: + if isinstance(target, str): + colors = cycle('byrcmgkbyrcmgkbyrcmgkbyrcmgk') + row_color = next(colors) + plt.figure(figsize=(15,6)) + ax1 = plt.subplot(1, 2, 1) + residual = pd.Series((y_test - y_test_preds)) + residual.plot(ax=ax1, color='b') + ax1.set_title('Residuals by each row in held-out set') + ax1.axhline(y=0.0, linewidth=2, color=next(colors)) + pdf = save_valid_predictions(y_test, y_test_preds.ravel(), project_name, num_labels) + ax2 = plt.subplot(1, 2, 2) + pdf.plot(ax=ax2) + ax2.set_title('Actuals vs Predictions by each row in held-out set') + else: + pdf = save_valid_predictions(y_test, y_test_preds, project_name, num_labels) + plt.figure(figsize=(15,6)) + colors = cycle('byrcmgkbyrcmgkbyrcmgkbyrcmgk') + for i in range(num_labels): + row_color = next(colors) + ax1 = plt.subplot(1, num_labels, i+1) + residual = pd.Series((y_test[:,i] - y_test_preds[:,i])) + residual.plot(ax=ax1, color=row_color) + ax1.set_title(f"Actuals_{i} (x-axis) vs. Residuals_{i} (y-axis)") + ax1.axhline(y=0.0, linewidth=2, color=next(colors)) + plt.figure(figsize=(15, 6)) + colors = cycle('byrcmgkbyrcmgkbyrcmgkbyrcmgk') + for j in range(num_labels): + row_color = next(colors) + pair_cols = ['actuals_'+str(j), 'predictions_'+str(j)] + ax2 = plt.subplot(1, num_labels, j+1) + pdf[pair_cols].plot(ax=ax2) + ax2.set_title('Actuals_{j} vs Predictions_{j} for each row ') + except: + print('Regression plots erroring. Continuing...') +############################################################################################# +import os +def save_valid_predictions(y_test, y_preds, project_name, num_labels): + if num_labels == 1: + pdf = pd.DataFrame([y_test, y_preds]) + pdf = pdf.T + pdf.columns= ['actuals','predictions'] + else: + pdf = pd.DataFrame(np.c_[y_test, y_preds]) + act_names = ['actuals_'+str(x) for x in range(y_test.shape[1])] + pred_names = ['predictions_'+str(x) for x in range(y_preds.shape[1])] + pdf.columns = act_names + pred_names + preds_file = project_name+'_predictions.csv' + preds_path = os.path.join(project_name, preds_file) + pdf.to_csv(preds_path,index=False) + print('Saved predictions in %s file' %preds_path) + return pdf +######################################################################################### +import matplotlib.pyplot as plt +from IPython.display import Image, display +def print_one_image_from_dataset(train_ds, classes): + plt.figure(figsize=(10, 10)) + for images, labels in train_ds.take(1): + for i in range(9): + ax = plt.subplot(3, 3, i + 1) + plt.imshow(images[i].numpy().astype("uint8")) + plt.title(classes[labels[i]]) + plt.axis("off"); +######################################################################################### +def predict_plot_images(model, test_ds, classes): + plt.figure(figsize=(10, 10)) + for images, labels in test_ds.take(1): + for i in range(9): + ax = plt.subplot(3, 3, i + 1) + + plt.tight_layout() + + img = tf.keras.preprocessing.image.img_to_array(images[i]) + img = np.expand_dims(img, axis=0) + + pred=model.predict(img) + plt.imshow(images[i].numpy().astype("uint8")) + plt.title("Actual Label: %s" % classes[labels[i]]) + plt.text(1, 240, "Predicted Label: %s" % classes[np.argmax(pred)], fontsize=12) + + plt.axis("off") +####################################################################################### +def add_outputs_to_auto_model_body(model_body, meta_outputs, nlp_flag=False): + """ + This is specially for "auto" model types only. It requires special handling. 
+ """ + if nlp_flag: + wide, deep, nlp_outputs = meta_outputs + else: + wide, deep = meta_outputs + ##### This is the simplest way to convert a sequential model to functional! + for num, each_layer in enumerate(model_body.layers): + if num == 0: + final_outputs = each_layer(deep) + else: + final_outputs = each_layer(final_outputs) + if nlp_flag: + model_body = layers.concatenate([nlp_outputs, wide, final_outputs], name='auto_concatenate_layer') + else: + model_body = layers.concatenate([wide, final_outputs], name='auto_concatenate_layer') + return model_body +################################################################################# +def add_outputs_to_model_body(model_body, meta_outputs): + ##### This is the simplest way to convert a sequential model to functional! + for num, each_layer in enumerate(model_body.layers): + if num == 0: + final_outputs = each_layer(meta_outputs) + else: + final_outputs = each_layer(final_outputs) + return final_outputs +############################################################################### +def check_keras_options(keras_options, name, default): + try: + if keras_options[name]: + value = keras_options[name] + else: + value = default + except: + value = default + return value +##################################################################################### +def check_model_options(model_options, name, default): + try: + if model_options[name]: + value = model_options[name] + else: + value = default + except: + value = default + return value +##################################################################################### +# Callback for printing the LR at the end of each epoch. +class PrintLR(tf.keras.callbacks.Callback): + def on_epoch_end(self, epoch, logs=None): + print('\nLearning rate for epoch {} is {}'.format(epoch + 1, + self.model.optimizer.lr.numpy())) +##################################################################################### +# Function for decaying the learning rate. +# You can define any decay function you need. +def schedules(epoch, lr): + return lr * (0.997 ** np.floor(epoch / 2)) + +LEARNING_RATE = 0.01 +def decay(epoch): + return LEARNING_RATE - 0.0099 * (0.75 ** (1 + epoch/2)) +####################################################################################### +import os +def get_callbacks(val_mode, val_monitor, patience, learning_rate, save_weights_only, steps=100, + save_dir='./deep_autoviml'): + """ + #################################################################################### + ##### LEARNING RATE SCHEDULING : Setup Learning Rate Multiple Ways ######### + #################################################################################### + """ + keras.backend.clear_session() + callbacks_dict = {} + tensorboard_logpath = os.path.join(save_dir,"mylogs") + print('Tensorboard log directory can be found at: %s' %tensorboard_logpath) + + cp = keras.callbacks.ModelCheckpoint("deep_autoviml", save_best_only=True, + save_weights_only=save_weights_only, save_format='tf') + ### sometimes a model falters and restore_best_weights gives len() not found error. So avoid True option! 
+ lr_patience = max(5,int(patience*0.5)) + rlr = keras.callbacks.ReduceLROnPlateau(monitor=val_monitor, factor=0.90, + patience=lr_patience, min_lr=1e-6, mode='auto', min_delta=0.00001, cooldown=0, verbose=1) + + ###### This is one variation of onecycle ##### + onecycle = OneCycleScheduler(steps=steps, lr_max=0.1) + + ###### This is another variation of onecycle ##### + onecycle2 = OneCycleScheduler2(iterations=steps, max_rate=0.05) + + lr_sched = keras.callbacks.LearningRateScheduler(schedules) + + # Setup Learning Rate decay. + lr_decay_cb = keras.callbacks.LearningRateScheduler(decay) + + es = keras.callbacks.EarlyStopping(monitor=val_monitor, min_delta=0.00001, patience=patience, + verbose=1, mode=val_mode, baseline=None, restore_best_weights=False) + + tb = keras.callbacks.TensorBoard(log_dir=tensorboard_logpath, + histogram_freq=0, + write_graph=True, + write_images=True, + update_freq='epoch', + profile_batch=2, + embeddings_freq=1 + ) + + pr = PrintLR() + + callbacks_dict['onecycle'] = onecycle + callbacks_dict['onecycle2'] = onecycle2 + callbacks_dict['check_point'] = cp + callbacks_dict['scheduler'] = lr_sched + callbacks_dict['early_stop'] = es + callbacks_dict['tensor_board'] = tb + callbacks_dict['print'] = pr + callbacks_dict['reducer'] = rlr + callbacks_dict['rlr'] = rlr + callbacks_dict['decay'] = lr_decay_cb + + return callbacks_dict, tensorboard_logpath +#################################################################################### +def get_chosen_callback(callbacks_dict, keras_options): + ##### Setup Learning Rate decay. + #### Onecycle is another fast way to find the best learning in large datasets ###### + ########## This is where we look for various schedulers + if keras_options['lr_scheduler'] == "onecycle2": + lr_scheduler = callbacks_dict['onecycle2'] + elif keras_options['lr_scheduler'] == 'onecycle': + lr_scheduler = callbacks_dict['onecycle'] + elif keras_options['lr_scheduler'] in ['reducer', 'rlr']: + lr_scheduler = callbacks_dict['reducer'] + elif keras_options['lr_scheduler'] == 'decay': + lr_scheduler = callbacks_dict['decay'] + elif keras_options['lr_scheduler'] == "scheduler": + lr_scheduler = callbacks_dict['scheduler'] + else: + lr_scheduler = callbacks_dict['rlr'] + return lr_scheduler +################################################################################################ +def get_chosen_callback2(callbacks_dict, keras_options): + ##### Setup Learning Rate decay. 
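A hedged usage sketch of `get_callbacks` together with `get_chosen_callback` above. It assumes both functions are importable and that `model`, `train_ds` and `valid_ds` already exist; the option values are illustrative:

```python
keras_options = {"lr_scheduler": "rlr"}   # pick ReduceLROnPlateau

callbacks_dict, tb_logpath = get_callbacks(
    val_mode="min", val_monitor="val_loss", patience=10,
    learning_rate=0.01, save_weights_only=True, steps=100)

lr_cb = get_chosen_callback(callbacks_dict, keras_options)

history = model.fit(
    train_ds, validation_data=valid_ds, epochs=50,
    callbacks=[callbacks_dict["early_stop"], lr_cb, callbacks_dict["tensor_board"]])
```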
+ #### Onecycle is another fast way to find the best learning in large datasets ###### + ########## This is where we look for various schedulers + if keras_options['lr_scheduler'] == "onecycle2": + lr_scheduler = callbacks_dict['onecycle2'] + elif keras_options['lr_scheduler'] == 'onecycle': + lr_scheduler = callbacks_dict['onecycle'] + elif keras_options['lr_scheduler'] == 'rlr': + lr_scheduler = callbacks_dict['rlr'] + elif keras_options['lr_scheduler'] == 'decay': + lr_scheduler = callbacks_dict['lr_decay_cb'] + else: + lr_scheduler = callbacks_dict['rlr'] + keras_options['lr_scheduler'] = "rlr" + return lr_scheduler +################################################################################################ +import math +class OneCycleScheduler2(keras.callbacks.Callback): + def __init__(self, iterations, max_rate, start_rate=None, + last_iterations=None, last_rate=None): + self.iterations = iterations + self.max_rate = max_rate + self.start_rate = start_rate or max_rate / 10 + self.last_iterations = last_iterations or iterations // 10 + 1 + self.half_iteration = (iterations - self.last_iterations) // 2 + self.last_rate = last_rate or self.start_rate / 1000 + self.iteration = 0 + def _interpolate(self, iter1, iter2, rate1, rate2): + return ((rate2 - rate1) * (self.iteration - iter1) + / (iter2 - iter1) + rate1) + def on_batch_begin(self, batch, logs): + if self.iteration < self.half_iteration: + rate = self._interpolate(0, self.half_iteration, self.start_rate, self.max_rate) + elif self.iteration < 2 * self.half_iteration: + rate = self._interpolate(self.half_iteration, 2 * self.half_iteration, + self.max_rate, self.start_rate) + else: + rate = self._interpolate(2 * self.half_iteration, self.iterations, + self.start_rate, self.last_rate) + self.iteration += 1 + if rate < 0: + rate = self.start_rate + self.iteration = 0 + K.clear_session() + K.set_value(self.model.optimizer.lr, rate) +##################################################################################################### +def find_columns_with_infinity(df): + """ + This function finds all columns in a dataframe that have inifinite values (np.inf or -np.inf) + It returns a list of column names. If the list is empty, it means no columns were found. + """ + add_cols = [] + sum_cols = 0 + for col in df.columns: + inf_sum1 = 0 + inf_sum2 = 0 + inf_sum1 = len(df[df[col]==np.inf]) + inf_sum2 = len(df[df[col]==-np.inf]) + if (inf_sum1 > 0) or (inf_sum2 > 0): + add_cols.append(col) + sum_cols += inf_sum1 + sum_cols += inf_sum2 + return add_cols +##################################################################################################### +import copy +def drop_rows_with_infinity(df, cols_list, fill_value=None): + """ + This feature engineering function will fill infinite values in your data with a fill_value. + You might need this function during deep_learning models where infinite values don't work. + You can also leave the fill_value as None which means we will drop the rows with infinity. + This function checks for both negative and positive infinity values to fill or remove. + """ + # first you must drop rows that have inf in them #### + print(' Shape of dataset initial: %s' %(df.shape[0])) + corr_list_copy = copy.deepcopy(cols_list) + init_rows = df.shape[0] + if fill_value: + for col in corr_list_copy: + ### Capping using the n largest value based on n given in input. + maxval = df[col].max() ## what is the maximum value in this column? 
+ minval = df[col].min() + if maxval == np.inf: + sorted_list = sorted(df[col].unique()) + ### find the n_smallest values after the maximum value based on given input n + next_best_value_index = sorted_list.index(np.inf) - 1 + capped_value = sorted_list[next_best_value_index] + df.loc[df[col]==maxval, col] = capped_value ## maximum values are now capped + if minval == -np.inf: + sorted_list = sorted(df[col].unique()) + ### find the n_smallest values after the maximum value based on given input n + next_best_value_index = sorted_list.index(-np.inf)+1 + capped_value = sorted_list[next_best_value_index] + df.loc[df[col]==minval, col] = capped_value ## maximum values are now capped + print(' capped all rows with infinite values in data') + else: + for col in corr_list_copy: + df = df[df[col]!=np.inf] + df = df[df[col]!=-np.inf] + dropped_rows = init_rows - df.shape[0] + print(' dropped %d rows due to infinite values in data' %dropped_rows) + print(' Shape of dataset after dropping rows: %s' %(df.shape[0])) + ### Double check that all columns have been fixed ############### + cols_with_infinity = find_columns_with_infinity(df) + if cols_with_infinity: + print('There are still %d columns with infinite values. Returning...' %len(cols_with_infinity)) + else: + print('There are no columns with infinite values.') + return df +################################################################################################ +def get_hidden_layers(data_dim): + if data_dim <= 1e6: + dense_layer1 = max(96,int(data_dim/30000)) + dense_layer2 = max(64,int(dense_layer1*0.5)) + dense_layer3 = max(32,int(dense_layer2*0.5)) + elif data_dim > 1e6 and data_dim <= 1e8: + dense_layer1 = max(192,int(data_dim/50000)) + dense_layer2 = max(128,int(dense_layer1*0.5)) + dense_layer3 = max(64,int(dense_layer2*0.5)) + elif data_dim > 1e8 or keras_model_type == 'big_deep': + dense_layer1 = 400 + dense_layer2 = 200 + dense_layer3 = 100 + dense_layer1 = min(300,dense_layer1) + dense_layer2 = min(200,dense_layer2) + dense_layer3 = min(100,dense_layer3) + return dense_layer1, dense_layer2, dense_layer3 +################################################################################################### +def print_one_text_from_dataset(raw_train_ds, class_names): + """ + print one row from the dataset - only works for text data + """ + for text_batch, label_batch in raw_train_ds.take(1): + for i in range(1): + print("Text", text_batch.numpy()[i]) + print("Label", label_batch.numpy()[i]) + print("Label 0 corresponds to", class_names[0]) + print("Label 1 corresponds to", class_names[1]) +######################################################################################### +import pickle +def save_model_artifacts(deep_model, cat_vocab_dict, var_df, save_model_path, + save_model_flag, model_options): + """ + This saves the model if the save_model_flag os set to True. + However it also saves the model artifacts such as cat_vocab_dict and var_df also. 
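A small self-contained example of the infinity handling above (`find_columns_with_infinity` / `drop_rows_with_infinity`) on a toy DataFrame, using `np.isinf` as a compact equivalent of the element-wise comparisons in the source:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, np.inf, 4.0],
                   "b": [5.0, -np.inf, 7.0, 8.0],
                   "c": [9.0, 10.0, 11.0, 12.0]})

# Equivalent of find_columns_with_infinity: which columns contain +/- inf?
inf_cols = [col for col in df.columns if np.isinf(df[col]).any()]
print(inf_cols)                      # ['a', 'b']

# fill_value=None behaviour: simply drop every row holding an infinity.
clean = df[~np.isinf(df[inf_cols]).any(axis=1)]
print(clean.shape)                   # (2, 3) - two rows dropped
```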
+ """ + if save_model_flag: + try: + print('\nSaving model...this will take time...') + if model_options["save_model_format"]: + deep_model.save(save_model_path, save_format=model_options["save_model_format"]) + print(' deep model saved in %s directory in %s format' %( + save_model_path, model_options["save_model_format"])) + else: + deep_model.save(save_model_path) + print(' deep model saved in %s directory in .pb format' %save_model_path) + cat_vocab_dict['saved_model_path'] = save_model_path + cat_vocab_dict['save_model_format'] = model_options["save_model_format"] + except: + print('Erroring: Model not saved') + else: + cat_vocab_dict['saved_model_path'] = save_model_path + print('\nModel not being saved since save_model_flag set to False...') + + #### make sure you save the cat_vocab_dict to use later during predictions + save_artifacts_path = os.path.join(save_model_path, "artifacts") + try: + pickle_path = os.path.join(save_artifacts_path,"cat_vocab_dict")+".pickle" + print('\nSaving vocab dictionary using pickle...will take time...' ) + with open(pickle_path, "wb") as fileopen: + fileopen.write(pickle.dumps(cat_vocab_dict)) + print(' Saved pickle file in %s' %pickle_path) + except: + print('Unable to save cat_vocab_dict - please save or pickle it yourself.') + ####### make sure you save the variable definitions file ########### + try: + pickle_path = os.path.join(save_artifacts_path,"var_df")+".pickle" + print('\nSaving variable definitions file using pickle...will take time...' ) + with open(pickle_path, "wb") as fileopen: + fileopen.write(pickle.dumps(var_df)) + print(' Saved pickle file in %s' %pickle_path) + except: + print('Unable to save cat_vocab_dict - please save or pickle it yourself.') +###################################################################################### +def get_model_defaults(keras_options, model_options, targets): + num_classes = model_options["num_classes"] + num_labels = model_options["num_labels"] + modeltype = model_options["modeltype"] + patience = check_keras_options(keras_options, "patience", 10) + use_bias = check_keras_options(keras_options, 'use_bias', True) + optimizer = check_keras_options(keras_options,'optimizer', Adam(lr=0.01, beta_1=0.9, beta_2=0.999)) + if modeltype == 'Regression': + reg_loss = check_keras_options(keras_options,'loss','mae') ### you can use tf.keras.losses.Huber() instead + #val_metrics = [check_keras_options(keras_options,'metrics',keras.metrics.RootMeanSquaredError(name='rmse'))] + METRICS = [keras.metrics.RootMeanSquaredError(name='rmse'), keras.metrics.MeanAbsoluteError(name='mae')] + #METRICS = [keras.metrics.MeanSquaredError(name="mean_squared_error", dtype=None)] + #METRICS = ['mean_squared_error'] + val_metrics = check_keras_options(keras_options,'metrics',METRICS) + num_predicts = 1*num_labels + if num_labels <= 1: + val_loss = check_keras_options(keras_options,'loss', reg_loss) + val_metric = 'rmse' + else: + val_loss = [] + for i in range(num_labels): + val_loss.append(reg_loss) + val_metric = 'loss' + ####### If you change the val_metrics above, you must also change its name here #### + output_activation = 'linear' ### use "relu" or "softplus" if you want positive values as output + elif modeltype == 'Classification': + ##### This is for Binary Classification Problems + #val_loss = check_keras_options(keras_options,'loss','sparse_categorical_crossentropy') + #val_metrics = [check_keras_options(keras_options,'metrics','AUC')] + #val_metrics = check_keras_options(keras_options,'metrics','accuracy') + 
cat_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) + val_loss = check_keras_options(keras_options,'loss', cat_loss) + bal_acc = BalancedSparseCategoricalAccuracy() + #bal_acc = 'accuracy' + val_metrics = check_keras_options(keras_options,'metrics',bal_acc) + if num_labels <= 1: + num_predicts = int(num_classes*num_labels) + else: + #### This is for multi-label problems wihere number of classes will be a list + num_predicts = num_classes + output_activation = "sigmoid" + ####### If you change the val_metrics above, you must also change its name here #### + val_metric = 'balanced_sparse_categorical_accuracy' + #val_metric = 'accuracy' + else: + #### this is for multi-class problems #### + cat_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) + #val_loss = check_keras_options(keras_options,'loss','sparse_categorical_crossentropy') + #val_metrics = check_keras_options(keras_options,'metrics','accuracy') + if num_labels <= 1: + num_predicts = int(num_classes*num_labels) + val_loss = check_keras_options(keras_options,'loss', cat_loss) + bal_acc = BalancedSparseCategoricalAccuracy() + #bal_acc = 'accuracy' + val_metrics = check_keras_options(keras_options, 'metrics', bal_acc) + else: + #### This is for multi-label problems wihere number of classes will be a list + num_predicts = num_classes + val_loss = [] + for i in range(num_labels): + val_loss.append(cat_loss) + bal_acc = BalancedSparseCategoricalAccuracy() + #bal_acc = 'accuracy' + val_metrics = check_keras_options(keras_options, 'metrics', bal_acc) + output_activation = 'softmax' + ####### If you change the val_metrics above, you must also change its name here #### + val_metric = 'balanced_sparse_categorical_accuracy' + #val_metric = 'accuracy' + ############## Suggested number of neurons in each layer ################## + if modeltype == 'Regression': + val_monitor = check_keras_options(keras_options, 'monitor', 'val_'+val_metric) + val_mode = check_keras_options(keras_options,'mode', 'min') + elif modeltype == 'Classification': + ##### This is for Binary Classification Problems + if num_labels <= 1: + val_monitor = check_keras_options(keras_options,'monitor', 'val_'+val_metric) + val_mode = check_keras_options(keras_options,'mode', 'max') + val_metric = 'balanced_sparse_categorical_accuracy' + else: + val_metric = 'balanced_sparse_categorical_accuracy' + #val_metric = 'accuracy' + ### you cannot combine multiple metrics here unless you write a new function. + target_A = targets[0] + val_monitor = 'val_loss' ### this combines all losses and is best to minimize + val_mode = check_keras_options(keras_options,'mode', 'min') + #val_monitor = check_keras_options(keras_options,'monitor', 'val_auc') + #val_monitor = check_keras_options(keras_options,'monitor', 'val_accuracy') + else: + #### this is for multi-class problems + if num_labels <= 1: + val_metric = 'balanced_sparse_categorical_accuracy' + val_monitor = check_keras_options(keras_options,'monitor', 'val_'+val_metric) + val_mode = check_keras_options(keras_options, 'mode', 'max') + else: + val_metric = 'balanced_sparse_categorical_accuracy' + #val_metric = 'accuracy' + ### you cannot combine multiple metrics here unless you write a new function. 
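The `BalancedSparseCategoricalAccuracy` metric instantiated here (and defined further below) is ordinary sparse categorical accuracy with per-sample weights of 1/class-count, so every class contributes equally regardless of imbalance. A small sketch of that weighting on made-up labels:

```python
import tensorflow as tf

y_true = tf.constant([0, 0, 0, 0, 1])          # class 0 dominates
cls_counts = tf.math.bincount(y_true)          # [4, 1]
weights = tf.gather(
    tf.math.reciprocal_no_nan(tf.cast(cls_counts, tf.float32)), y_true)
print(weights.numpy())                         # [0.25 0.25 0.25 0.25 1.  ]

# Passing those weights to SparseCategoricalAccuracy is exactly what the
# custom metric's update_state does.
metric = tf.keras.metrics.SparseCategoricalAccuracy()
y_pred = tf.constant([[0.9, 0.1]] * 5)         # always predicts class 0
metric.update_state(y_true, y_pred, sample_weight=weights)
print(metric.result().numpy())                 # 0.5 -> balanced, not the raw 0.8
```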
+ target_A = targets[0] + val_monitor = 'val_loss' + val_mode = check_keras_options(keras_options,'mode', 'min') + #val_monitor = check_keras_options(keras_options, 'monitor','val_accuracy') + ############################################################################## + keras_options["mode"] = val_mode + keras_options["monitor"] = val_monitor + keras_options["metrics"] = val_metrics + keras_options['loss'] = val_loss + keras_options["patience"] = patience + keras_options['use_bias'] = use_bias + keras_options['optimizer'] = optimizer + return keras_options, model_options, num_predicts, output_activation +############################################################################### +def get_uncompiled_model(inputs, result, output_activation, + num_predicts, modeltype, cols_len, targets): + ### The next 3 steps are most important! Don't mess with them! + #model_preprocessing = Model(inputs, meta_outputs) + #preprocessed_inputs = model_preprocessing(inputs) + #result = model_body(preprocessed_inputs) + ##### now you can add the final layer here ######### + multi_label_predictions = defaultdict(list) + if isinstance(num_predicts, int): + key = 'predictions' + if modeltype == 'Regression': + ### this will be just 1 in regression #### + if num_predicts > 1: + for each_label in range(num_predicts): + value = layers.Dense(1, activation=output_activation, + name=targets[each_label])(result) + multi_label_predictions[key].append(value) + else: + value = layers.Dense(1, activation=output_activation, + name=targets[0])(result) + multi_label_predictions[key].append(value) + else: + ### this will be number of classes in classification ### + value = layers.Dense(num_predicts, activation=output_activation, + name=targets[0])(result) + multi_label_predictions[key].append(value) + else: + #### This will be for multi-label, multi-class predictions only ### + for each_label in range(len(num_predicts)): + key = 'predictions' + if modeltype == 'Regression': + ### this will be just 1 in regression #### + value = layers.Dense(1, activation=output_activation, + name=targets[0])(result) + else: + ### this will be number of classes in classification ### + value = layers.Dense(num_predicts[each_label], activation=output_activation, + name=targets[each_label])(result) + multi_label_predictions[key].append(value) + outputs = multi_label_predictions[key] ### outputs will be a list of Dense layers + ##### Set the inputs and outputs of the model here + + uncompiled_model = Model(inputs=inputs, outputs=outputs) + return uncompiled_model + +##################################################################################### +def get_compiled_model(inputs, meta_outputs, output_activation, num_predicts, modeltype, + optimizer, val_loss, val_metrics, cols_len, targets): + model = get_uncompiled_model(inputs, meta_outputs, output_activation, + num_predicts, modeltype, cols_len, targets) + + model.compile( + optimizer=optimizer, + loss=val_loss, + metrics=val_metrics, + ) + return model +############################################################################### +class BalancedSparseCategoricalAccuracy(keras.metrics.SparseCategoricalAccuracy): + def __init__(self, name='balanced_sparse_categorical_accuracy', dtype=None): + super().__init__(name, dtype=dtype) + + def update_state(self, y_true, y_pred, sample_weight=None): + y_flat = y_true + if y_true.shape.ndims == y_pred.shape.ndims: + y_flat = tf.squeeze(y_flat, axis=[-1]) + y_true_int = tf.cast(y_flat, tf.int32) + + cls_counts = tf.math.bincount(y_true_int) + 
cls_counts = tf.math.reciprocal_no_nan(tf.cast(cls_counts, self.dtype)) + weight = tf.gather(cls_counts, y_true_int) + return super().update_state(y_true, y_pred, sample_weight=weight) +##################################################################################### +def save_model_architecture(model, project_name, keras_model_type, cat_vocab_dict, + model_options, chart_name="model_before"): + """ + This function saves the model architecture in a PNG file in the artifacts sub-folder of project_name folder + """ + if isinstance(project_name,str): + if project_name == '': + project_name = "deep_autoviml" + else: + print('Project name must be a string and helps create a folder to store model.') + project_name = "deep_autoviml" + save_model_path = model_options['save_model_path'] + save_artifacts_path = os.path.join(save_model_path, "artifacts") + try: + plot_filename = os.path.join(save_artifacts_path,chart_name)+".png" + print('\nSaving model architecture...') + tf.keras.utils.plot_model(model = model, to_file=plot_filename, dpi=72, + show_layer_names=True, rankdir="LR", show_shapes=True) + print(' model architecture saved in file: %s' %plot_filename) + except: + print('Model architecture not saved due to error. Continuing...') + plot_filename = "" + return plot_filename +######################################################################################################### diff --git a/deep_autoviml.egg-info/PKG-INFO b/deep_autoviml.egg-info/PKG-INFO new file mode 100644 index 0000000..7e8aa7e --- /dev/null +++ b/deep_autoviml.egg-info/PKG-INFO @@ -0,0 +1,224 @@ +Metadata-Version: 2.1 +Name: deep-autoviml +Version: 0.0.82 +Summary: Automatically Build Deep Learning Models and Pipelines fast! +Home-page: https://github.com/AutoViML/deep_autoviml +Author: Ram Seshadri +License: Apache License 2.0 +Description: # deep_autoviml + ## Build keras pipelines and models in a single line of code! + ![banner](logo.jpg) + [![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/) + [![ForTheBadge built-with-love](http://ForTheBadge.com/images/badges/built-with-love.svg)](https://github.com/AutoViML) + [![standard-readme compliant](https://img.shields.io/badge/standard--readme-OK-green.svg?style=flat-square)](https://github.com/RichardLitt/standard-readme) + [![Python Versions](https://img.shields.io/pypi/pyversions/autoviml.svg?logo=python&logoColor=white)](https://pypi.org/project/autoviml) + [![Build Status](https://travis-ci.org/joemccann/dillinger.svg?branch=master)](https://github.com/AutoViML) + ## Table of Contents + + + ## Update (Jan 2022): Now with mlflow! + You can now add `mlflow` experiment tracking to all your deep_autoviml runs. [mlflow](https://mlflow.org/) is a popular python library for experiment tracking and MLOps in general. See more details below under `mlflow`. + + ## Motivation + ✨ deep_autoviml is a powerful new deep learning library with a very simple design goal: ✨ + ```Make it easy for novices and experts to experiment and build tensorflow.keras preprocessing pipelines and models in fewest steps.``` + But just because we make it easy, does not mean you should trust everything that it does or treat it like a black box. You must still use your own judgement and intutition to make sure the results are accurate and explainable, not to mention that the model conforms to Responsbile AI principles. 
+ + ### Watch YouTube Video for Demo of Deep_AutoViML + [![YouTube Demo](deep_6.jpg)](https://www.youtube.com/watch?v=IcpwNNNXsWE) + + ### What is Deep AutoViML? + Deep AutoViML is the next version of AutoViML, a popular automl library that was developed using pandas, scikit-learn and xgboost+catboost. Deep AutoViML takes the best features of AutoViML and uses the latest generation of tensorflow and keras libraries to build a fast model and data pipeline for MLOps use cases. + + deep autoviml is primarily meant for sophisticated data engineers, data scientists and ML engineers to quickly prototype and build tensorflow 2.4.1+ models and pipelines for any data set, any size using a single line of code. It can build models for structured data, NLP and image datasets. It can also handle time series data sets in the future. + 1. You can either choose deep_autoviml to automatically buid a custom Tensorflow model + 1. Instead, you can "bring your own model" ("BYOM" option) model to attach keras data pipelines to your model. + 1. Additionally, you can choose any Tensorflow Hub model (TFHub) to custom train on your data. Just look for instructions below in "Tips for using deep_autoviml" section. + 1. There are 4 ways to build your model quickly or slowly depending on your needs: + - fast: a quick model that uses only dense layers (deep layers) + - fast1: a deep and wide model that uses both deep and wide layers. This is slightly slower than `fast` model. + - fast2: a deep and cross model that crosses some variables (hence deep and cross). This is about the same speed as 'fast1` model. + - auto: This uses `Optuna` or `Storm-Tuner` to perform combinations of dense layers and select best architecture. This will take the longest time. + + ![why_deep](deep_2.jpg) + ## Features + These are the main features that distinguish deep_autoviml from other libraries: + - It uses keras preprocessing layers which are more intuitive, and are included inside your model to simplify deployment + - The pipeline is available to you to use as inputs in your own functional model (if you so wish - you must specify that option in the input - see below for "pipeline") + - It can import any csv, txt or gzip file or file patterns (that fit multiple files) and it can scale to any data set of any size due to tf.data.Dataset's superior data pipelining features (such as cache, prefetch, batch, etc.) + - It uses an amazing new tuner called [STORM tuner](https://github.com/ben-arnao/StoRM) that quickly searches for the best hyperparameters for your keras model in fewer than 25 trials + - If you want to fine tune your model even further, you can fiddle with a wide variety of model options or keras options using **kwargs like dictionaries + - You can import your own custom Sequential model and watch it transform it into a functional model with additional preprocessing and output layers and train the model with your data + - You can save the model on your local machine or copy it to any cloud provider's storage bucket and serve it from there using tensorflow Serving (TF.Serving) + - Since your model contains preprocessing layers built-in, you just need to provide your Tensorflow serving model with raw data to test and get back predictions in the same format as your training labels. 
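+ 
+ As a quick illustration of the custom Sequential model ("bring your own model") option mentioned in the list above, here is a minimal sketch (see `use_my_model` in the API and Tips sections below). The file name `train.csv`, the target column name `target`, the project name and the layer sizes are illustrative assumptions, not defaults:
+ 
+ ```
+ from tensorflow.keras.models import Sequential
+ from tensorflow.keras.layers import Dense, Dropout
+ from deep_autoviml import deep_autoviml as deepauto
+ 
+ # Hidden layers only: no Input layer and no final output layer --
+ # deep_autoviml attaches the preprocessing inputs and the output head itself.
+ my_sequential_model = Sequential([
+     Dense(128, activation="relu"),
+     Dropout(0.2),
+     Dense(64, activation="relu"),
+ ])
+ 
+ model, cat_vocab_dict = deepauto.fit("train.csv", "target",
+     project_name="byom_demo", save_model_flag=False,
+     use_my_model=my_sequential_model, verbose=0)
+ ```
+ 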
+ ![how_it_works](deep_1.jpg) + + ## Technology + deep_autoviml uses the latest in tensorflow (2.4.1+) td.data.Datasets and tf.keras preprocessing technologies: the Keras preprocessing layers enable you to encapsulate feature engineering and preprocessing into the model itself. This makes the process for training and predictions the same: just feed input data (in the form of files or dataframes) and the model will take care of all preprocessing before predictions. + + To perform its preprocessing on the model itself, deep_autoviml uses [tensorflow](https://www.tensorflow.org/) (TF 2.4.1+ and later versions) and [tf.keras](https://www.tensorflow.org/api_docs/python/tf/keras) experimental preprocessing layers: these layers are part of your saved model. They become part of the model's computational graph that can be optimized and executed on any device including GPU's and TPU's. By packaging everything as a single unit, we save the effort in reimplementing the preprocessing logic on the production server. The new model can take raw tabular data with numeric and categorical variables or strings text directly without any preprocessing. This avoids missing or incorrect configuration for the preprocesing_layer during production. + + In addition, to select the best hyper parameters for the model, it uses a new open source library: + - [storm-tuner](https://github.com/ben-arnao/StoRM) - storm-tuner is an amazing new library that enables us to quickly fine tune our keras sequential models with hyperparameters and find a performant model within a few trials. + ![how_deep](deep_4.jpg) + + ## Install + deep_autoviml requires [tensorflow](https://www.tensorflow.org/api_docs/python/tf) v2.4.1+ and [storm-tuner](https://github.com/ben-arnao/StoRM) to run. Don't worry! We will install these libraries when you install deep_autoviml. + + ``` + pip install deep_autoviml + ``` + + For your own conda environment... + + ``` + conda create -n python=3.7 anaconda + conda activate # ON WINDOWS: `source activate ` + pip install deep_autoviml + or + pip install git+https://github.com/AutoViML/deep_autoviml.git + ``` + + ## Usage + ![deep_usage](deep_5.jpg) + deep_autoviml can be invoked with a simple import and run statement: + + ``` + from deep_autoviml import deep_autoviml as deepauto + ``` + + Load a data set (any .csv or .gzip or .gz or .txt file) into deep_autoviml and it will split it into Train and Validation datasets inside. You only need to provide a target variable, a project_name to store files in your local machine and leave the rest to defaults: + + ``` + model, cat_vocab_dict = deepauto.fit(train, target, keras_model_type="auto", + project_name="deep_autoviml", keras_options={}, model_options={}, + save_model_flag=True, use_my_model='', model_use_case='', verbose=0, + use_mlflow=False, mlflow_exp_name='autoviml', mlflow_run_name='first_run') + ``` + + Once deep_autoviml writes your saved model and cat_vocab_dict files to disk in the project_name directory, you can load it from anywhere (including cloud) for predictions like this using the model and cat_vocab_dict generated above: + + There are two kinds of predictions: This is the usual (typical) format. + ``` + predictions = deepauto.predict(model, project_name, test_dataset=test, + keras_model_type=keras_model_type, cat_vocab_dict=cat_vocab_dict) + ``` + + In case you are performing image classification, then you need to use `deepauto.predict_images()` for making predictions. See the Image section below for more details. 
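+ 
+ Below is a minimal sketch that combines the `fit()` call above with mlflow tracking (see `use_mlflow` in the API section below; this assumes `mlflow` is installed) and then re-loads the `cat_vocab_dict` pickle that deep_autoviml writes to the `artifacts` sub-folder of the saved-model directory. The file names `train.csv` and `test.csv` and the target column name are placeholders; adapt them to your data:
+ 
+ ```
+ import os
+ import pickle
+ import pandas as pd
+ from deep_autoviml import deep_autoviml as deepauto
+ 
+ model, cat_vocab_dict = deepauto.fit("train.csv", "target", keras_model_type="fast",
+     project_name="deep_autoviml", save_model_flag=True,
+     use_mlflow=True, mlflow_exp_name="autoviml", mlflow_run_name="first_run")
+ 
+ # fit() records the timestamped saved-model folder inside cat_vocab_dict
+ saved_model_path = cat_vocab_dict["saved_model_path"]
+ pickle_path = os.path.join(saved_model_path, "artifacts", "cat_vocab_dict.pickle")
+ with open(pickle_path, "rb") as f:
+     cat_vocab_dict = pickle.load(f)
+ 
+ test = pd.read_csv("test.csv")
+ predictions = deepauto.predict(model, "deep_autoviml", test_dataset=test,
+     keras_model_type="fast", cat_vocab_dict=cat_vocab_dict)
+ ```
+ 
+ After the run finishes, start the tracking UI with `mlflow ui` and open http://localhost:5000/ to inspect the experiment.
+ 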
+ 
+ ## API
+ **Arguments**
+ 
+ deep_autoviml requires only a single line of code to get started. You can, however, fine-tune the model it builds with multiple options using dictionaries named "model_options" and "keras_options". These two dictionaries act like python **kwargs and let you fine-tune hyperparameters for building the tf.keras model. Instructions on how to use them are provided below.
+ 
+ ![how_deep](deep_3.jpg)
+ 
+ - `train`: could be a datapath+filename or a pandas dataframe. Deep Auto_ViML even handles gz or gzip files. You must specify the full path and file name for deep_autoviml to find and load it.
+ - `target`: name of the target variable in the data set.
+ - `keras_model_type`: default is "auto". But always try "fast", then "fast1" and "fast2", and finally "auto". If you want to run NLP, use "BERT"; if you want to do image classification, set it to "image". For most structured data sets, keras_model_type is a quick way to select model architectures that have been successful in the past. For example:
+     - fast: a quick model that applies deep layers to all variables.
+     - fast1: a deep and wide model that sends the same variables to both a deep and a wide layer simultaneously.
+     - fast2: a deep and cross model that crosses some variables to build a deep and cross layer simultaneously.
+     - auto: this builds multiple dense layers in sequence and then uses Storm-Tuner to fine-tune the hyperparameters of your model.
+ - `project_name`: must be a string. Name of the folder where we will save your keras saved model and logs for tensorboard.
+ - `model_options`: must be a dictionary. For example: {'max_trials':5} sets the number of trials Storm-Tuner runs to search for the best hyperparameters for your keras model.
+ - `keras_options`: must be a dictionary. You can use it to change any keras model option you want, such as "epochs", "kernel_initializer", "activation", "loss", "metrics", etc.
+ - `model_use_case`: must be a string. You can use it to tell deep_autoviml what kind of use case you have, such as "time series", "seq2seq" modeling, etc. This option is currently not used, but watch this space for more model announcements.
+ - `save_model_flag`: must be True or False. The model will be saved in keras model format.
+ - `use_my_model`: This is where the "bring your own model" (BYOM) option comes into play. This BYOM model must be a keras Sequential model with NO input layers and NO output layers! You can define it and send it as input here. We will add input and preprocessing layers to it automatically. Your custom model must contain only hidden layers (Dense, Conv1D, Conv2D, etc.) plus dropouts, activations, etc. The default for this argument is "" (empty string), which means we will build your model. If you provide your custom model object here, we will use it instead.
+ - `verbose`: must be 0, 1 or 2. Can also be True or False. You will see more outputs as you increase the verbose level. If you want to see a chart of your model, use verbose = 2. But you must have graphviz and pydot installed on your machine to see the model plot.
+ - `use_mlflow`: default = False. Use it for MLflow lifecycle tracking. You can set it to True. MLflow is an open source python library used to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
+     Once model training (via the `fit` method) is done, run MLflow locally from your working directory using the command below. This will start the MLflow UI on port 5000 (http://localhost:5000/), where you can manage and visualize the end-to-end machine learning lifecycle.
+     `$ mlflow ui`
+ - `mlflow_exp_name`: Default value is 'autoviml'. MLflow experiment name. You can change this to any string you want.
+ - `mlflow_run_name`: Default value is 'first_run'. Each run under an experiment can have a unique run name. You can change this.
+ 
+ ## Image
+ ![image_deep](deep_7.jpg)
+ Leaf Images referred to here are from Kaggle and are copyright of Kaggle. They are shown for illustrative purposes.
+ [Kaggle Leaf Image Classification](https://www.kaggle.com/c/leaf-classification)
+ 
+ deep_autoviml can do image classification. All you need to do is organize your image_dir folder under train, validation and test sub-folders. The train folder, for example, can contain images for each label as a sub-folder. All you need to provide is the name of the image directory, for example "leaf_classification", and deep_autoviml will automatically read the images and assign them the correct labels and the correct dataset (train, test, etc.)
+ 
+ `image_dir` = `"leaf_classification"`
+ You also need to provide the height and width of each image as well as the number of channels for each image.
+ ```
+ img_height = 224
+ img_width = 224
+ img_channels = 3
+ ```
+ You then need to set the keras model type argument to "image".
+ 
+ `keras_model_type` = `"image"`
+ 
+ You also need to send in the above arguments as model options as follows:
+ `model_options = {'image_directory': image_dir, 'image_height': img_height, 'image_width':img_width, 'image_channels':img_channels }`
+ 
+ You can then call deep_autoviml to train the model as usual with these inputs:
+ ```model, dicti = deepauto.fit(trainfile, target, keras_model_type=keras_model_type, project_name='leaf_classification', save_model_flag=False, model_options=model_options, keras_options=keras_options, use_my_model='', verbose=0)```
+ 
+ To make predictions, you need to provide the dictionary ("dicti") from above and the trained model. You also need to provide the folder where the test images are stored, as follows:
+ `test_image_dir = 'leaf_classification/test'`
+ `predictions = deepauto.predict_images(test_image_dir, model, dicti)`
+ 
+ ## NLP
+ ![NLP_deep](deep_8.jpg)
+ deep_autoviml can also do NLP text classification. There are two ways to do NLP:
+ 
+ 1. Using folders and sub-folders
+ All you need to do is organize your text_dir folder under train, validation and test sub-folders. The train folder, for example, can contain text files for each label as a sub-folder. All you have to do is set:
+ 
+ `keras_model_type` as `"BERT"` or `keras_model_type` as `"USE"` and it will use either BERT or Universal Sentence Encoder to preprocess and transform your text into embeddings to feed to a model.
+ 
+ 2. Using a CSV file
+ Just provide a CSV file with column names and text. If you have multiple text columns, it will handle all of them automatically. If you want to mix numeric and text columns, you can do so in the same CSV file. deep_autoviml will automatically detect which columns are text (NLP) and which columns are numeric and do the preprocessing automatically. You can specify whether to use:
+ 
+ `keras_model_type` as `"BERT"` or `keras_model_type` as `"USE"` and it will use either BERT or Universal Sentence Encoder on your text columns. If you want to use neither of them, you can just specify:
+ 
+ `keras_model_type` as `"auto"` and deep_autoviml will automatically choose the best embedding for your model.
+ 
+ 
+ ## Tips
+ You can use the following arguments in your input to make deep_autoviml work best for you:
+ - `model_options = {"model_use_case":'pipeline'}`: If you only want keras preprocessing layers (i.e. a keras pipeline), then set the model_use_case input to "pipeline" and Deep Auto_ViML will not build a model but just return the keras input and preprocessing layers. You can feed these input and output layers to any sequential model you choose and build your own custom model.
+ - `model_options = {'max_trials':5}`: Always start with a small number of max_trials in the model_options dictionary, whether your input is a file or a dataframe. Start with 5 trials and increase it by 20 each time to see if performance improves. Stop when the model's performance no longer improves. This takes time.
+ - `model_options = {'cat_feat_cross_flag':True}`: default is False, but change it to True and see if adding feature crosses with your categorical features helps improve the model. However, do not do this for a large data set! It will explode the number of features in your model. Be careful!
+ - `model_options = {'nlp_char_limit':20}`: If you want to run NLP text preprocessing on any column, set this character limit low and deep_autoviml will then detect that column as an NLP column automatically. The default is 30 chars.
+ - `keras_options = {"patience":30}`: If you want to reduce Early Stopping, then increase the patience to 30 or higher. Your model will train longer but you might get better performance.
+ - `use_my_model = my_sequential_model`: If you want to bring your own custom model for training, then define a Keras Sequential model (you can name it anything, but for example purposes we have named it my_sequential_model) but don't include input or output layers! Just define your hidden layers. Deep Auto_ViML will automatically add input and output layers to your model and train it. It will also save your model after training. You can use this model for predictions.
+ - `keras_model_type = "image"`: If you want to build a model for image classification, then you can use this option. But you must add the following additional options in the model_options dictionary: `model_options = {"image_height":__, "image_width": __, "image_channels": __, "image_directory": __}`.
+ - `model_options = {"tf_hub_model": "URL"}`: If you want to use a pre-trained Tensorflow Hub model such as [BERT](https://tfhub.dev/google/collections/bert/1) or a [feature extractor](https://tfhub.dev/google/imagenet/mobilenet_v3_small_100_224/feature_vector/5) for image classification, then you can use its TF Hub model URL by providing it in model_options dictionary as follows: `model_options = {"tf_hub_model": "URL of TF hub model"}` + - `keras_model_type = "BERT"` or `keras_model_type = "USE"`: If you want to use a default [BERT](https://tfhub.dev/google/collections/bert/1) model or a Universal Sentence Encoder model, just set this option to either "BERT" or "USE" and we will load a default small pre-trained model from TF Hub, train it on your dataset and give you back a pipeline with BERT/USE in it! If you want to use some other BERT model than the one we have chosen, please go to Tensorflow Hub and find your model's URL and set `model_options = {"tf_hub_model": "URL of TF hub model"}` and we will train whatever BERT model you have chosen with your data. + + ## Maintainers + + * [@AutoViML](https://github.com/AutoViML) + + ## Contributing + + See [the contributing file](contributing.md)! + + PRs accepted. + + ## License + + Apache License 2.0 © 2020 Ram Seshadri + + ## DISCLAIMER + This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose. + +Platform: UNKNOWN +Classifier: Programming Language :: Python :: 3 +Classifier: Operating System :: OS Independent +Description-Content-Type: text/markdown diff --git a/deep_autoviml.egg-info/SOURCES.txt b/deep_autoviml.egg-info/SOURCES.txt new file mode 100644 index 0000000..34259b5 --- /dev/null +++ b/deep_autoviml.egg-info/SOURCES.txt @@ -0,0 +1,34 @@ +README.md +setup.py +deep_autoviml/__init__.py +deep_autoviml/__version__.py +deep_autoviml/deep_autoviml.py +deep_autoviml.egg-info/PKG-INFO +deep_autoviml.egg-info/SOURCES.txt +deep_autoviml.egg-info/dependency_links.txt +deep_autoviml.egg-info/requires.txt +deep_autoviml.egg-info/top_level.txt +deep_autoviml/data_load/classify_features.py +deep_autoviml/data_load/extract.py +deep_autoviml/modeling/create_model.py +deep_autoviml/modeling/one_cycle.py +deep_autoviml/modeling/predict_model.py +deep_autoviml/modeling/train_custom_model.py +deep_autoviml/modeling/train_image_model.py +deep_autoviml/modeling/train_model.py +deep_autoviml/modeling/train_text_model.py +deep_autoviml/models/basic.py +deep_autoviml/models/cnn1.py +deep_autoviml/models/cnn2.py +deep_autoviml/models/deep_and_wide.py +deep_autoviml/models/dnn.py +deep_autoviml/models/dnn_drop.py +deep_autoviml/models/giant_deep.py +deep_autoviml/models/reg_dnn.py +deep_autoviml/models/tf_hub_lookup.py +deep_autoviml/preprocessing/preprocessing.py +deep_autoviml/preprocessing/preprocessing_images.py +deep_autoviml/preprocessing/preprocessing_nlp.py +deep_autoviml/preprocessing/preprocessing_tabular.py +deep_autoviml/preprocessing/preprocessing_text.py +deep_autoviml/utilities/utilities.py \ No newline at end of file diff --git a/deep_autoviml.egg-info/dependency_links.txt b/deep_autoviml.egg-info/dependency_links.txt new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/deep_autoviml.egg-info/dependency_links.txt @@ -0,0 +1 @@ + diff --git a/deep_autoviml.egg-info/requires.txt b/deep_autoviml.egg-info/requires.txt new file mode 100644 index 0000000..dcfe390 --- /dev/null +++ 
b/deep_autoviml.egg-info/requires.txt @@ -0,0 +1,14 @@ +emoji +ipython +jupyter +matplotlib +numpy~=1.19.2 +optuna +pandas +regex +scikit-learn<=0.24.2,>=0.23.1 +storm-tuner>=0.0.8 +tensorflow-text~=2.5 +tensorflow_hub~=0.12.0 +tensorflow~=2.5 +xlrd diff --git a/deep_autoviml.egg-info/top_level.txt b/deep_autoviml.egg-info/top_level.txt new file mode 100644 index 0000000..2af75d5 --- /dev/null +++ b/deep_autoviml.egg-info/top_level.txt @@ -0,0 +1 @@ +deep_autoviml diff --git a/deep_autoviml/__version__.py b/deep_autoviml/__version__.py index 4160163..f0a3fbc 100644 --- a/deep_autoviml/__version__.py +++ b/deep_autoviml/__version__.py @@ -20,6 +20,6 @@ __author__ = "Ram Seshadri" __description__ = "deep_autoviml - build and test multiple Tensorflow 2.0 models and pipelines" __url__ = "https://github.com/Auto_ViML/deep_autoviml.git" -__version__ = "0.0.77" +__version__ = "0.0.83" __license__ = "Apache License 2.0" __copyright__ = "2020-21 Google" diff --git a/deep_autoviml/data_load/__pycache__/classify_features.cpython-38.pyc b/deep_autoviml/data_load/__pycache__/classify_features.cpython-38.pyc index b96bc36..758e194 100644 Binary files a/deep_autoviml/data_load/__pycache__/classify_features.cpython-38.pyc and b/deep_autoviml/data_load/__pycache__/classify_features.cpython-38.pyc differ diff --git a/deep_autoviml/data_load/__pycache__/extract.cpython-38.pyc b/deep_autoviml/data_load/__pycache__/extract.cpython-38.pyc index f61f657..bdc0a6e 100644 Binary files a/deep_autoviml/data_load/__pycache__/extract.cpython-38.pyc and b/deep_autoviml/data_load/__pycache__/extract.cpython-38.pyc differ diff --git a/deep_autoviml/data_load/classify_features.py b/deep_autoviml/data_load/classify_features.py index da64635..e451e40 100644 --- a/deep_autoviml/data_load/classify_features.py +++ b/deep_autoviml/data_load/classify_features.py @@ -217,7 +217,7 @@ def classify_columns(df_preds, model_options={}, verbose=0): #### If there are 30 chars are more in a discrete_string_var, it is then considered an NLP variable ### if a variable has more than this many chars, it will be treated like a NLP variable - max_nlp_char_size = check_model_options(model_options, "nlp_char_limit", 30) + max_nlp_char_size = check_model_options(model_options, "nlp_char_limit", 50) ### if a variable has more than this limit, it will not be treated like a cat variable # #### Cat_Limit defines the max number of categories a column can have to be called a categorical colum cat_limit = check_model_options(model_options, "variable_cat_limit", 30) @@ -502,7 +502,7 @@ def classify_features_using_pandas(data_sample, target, model_options={}, verbos nlps = [] bools = [] ### if a variable has more than this many chars, it will be treated like a NLP variable - nlp_char_limit = check_model_options(model_options, "nlp_char_limit", 30) + nlp_char_limit = check_model_options(model_options, "nlp_char_limit", 50) ### if a variable has more than this limit, it will not be treated like a cat variable # cat_limit = check_model_options(model_options, "variable_cat_limit", 30) ### Classify features using the previously define function ############# @@ -540,7 +540,7 @@ def classify_features_using_pandas(data_sample, target, model_options={}, verbos floats = [] preds_copy = copy.deepcopy(preds) for key in preds_copy: - if data_sample[key].dtype in ['object'] or str(data_sample[key].dtype) == 'category': + if str(data_sample[key].dtype) in ['object', 'category']: if type('str') in data_sample[key].map(type).value_counts().index: 
feats_max_min[key]["dtype"] = "string" elif data_sample[key].map(type).value_counts().index[0] == int: @@ -574,7 +574,7 @@ def classify_features_using_pandas(data_sample, target, model_options={}, verbos discrete_strings.remove(key) var_df1['discrete_string_vars'] = copy.deepcopy(discrete_strings) #### This is not a mistake - you have to test it again. That way we make sure type is safe - if data_sample[key].dtype in ['object'] or str(data_sample[key].dtype) == 'category': + if str(data_sample[key].dtype) in ['object', 'category']: if data_sample[key].map(type).value_counts().index[0] == object or data_sample[key].map(type).value_counts().index[0] == str: feats_max_min[key]["dtype"] = "string" elif data_sample[key].dtype in ['bool']: @@ -627,10 +627,11 @@ def classify_features_using_pandas(data_sample, target, model_options={}, verbos feats_max_min[key]["vocab"] = vocab feats_max_min[key]['size_of_vocab'] = len(vocab) elif feats_max_min[key]['dtype'] in ['string']: - data_types = len(data_sample[key].fillna("missing").map(type).value_counts()) + data_sample[[key]] = data_sample[[key]].fillna("missing") + data_types = len(data_sample[key].map(type).value_counts()) if data_types > 1: print('\nDATA CLEANING ALERT: Dropping %s since it has %s mixed data types.' %(key, data_types)) - print(' Transform variable to single data type and re-run. Continuing...') + print(' Convert this variable to a single data type and re-run deep_autoviml.') ignore_variables.append(key) preds.remove(key) feats_max_min['predictors_in_train'] = preds @@ -642,7 +643,7 @@ def classify_features_using_pandas(data_sample, target, model_options={}, verbos discrete_strings.remove(key) var_df1['discrete_string_vars'] = copy.deepcopy(discrete_strings) if not key in ignore_variables: - if np.mean(data_sample[key].fillna("missing").map(len)) >= nlp_char_limit: + if np.max(data_sample[key].map(len)) >= nlp_char_limit: ### This is for NLP variables. You want to remove duplicates ##### if key in dates: continue @@ -652,7 +653,7 @@ def classify_features_using_pandas(data_sample, target, model_options={}, verbos elif key in discrete_strings: discrete_strings.remove(key) var_df1['discrete_string_vars'] = discrete_strings - print('%s is detected and will be treated as an NLP variable' %key) + print('%s is detected as an NLP variable' %key) if key not in var_df1['nlp_vars']: var_df1['nlp_vars'].append(key) #### Now let's calculate some statistics on this NLP variable ### @@ -663,14 +664,14 @@ def classify_features_using_pandas(data_sample, target, model_options={}, verbos ### Immediately cap the vocab size to 300,000 - don't measure its vocab!! 
data_sample = data_sample.sample(frac=0.1, random_state=0) try: - vocab = np.concatenate(data_sample[key].fillna('missing').map(tokenize_fast)) + vocab = np.concatenate(data_sample[key].map(tokenize_fast)) except: - vocab = np.concatenate(data_sample[key].fillna('missing').map(tokenize_fast).values) + vocab = np.concatenate(data_sample[key].map(tokenize_fast).values) vocab = np.unique(vocab).tolist() feats_max_min[key]["vocab"] = vocab try: - feats_max_min[key]['seq_length'] = int(data_sample[key].fillna("missing").map(len).max()) - num_words_in_each_row = data_sample[key].fillna("missing").map(lambda x: len(x.split(" "))).mean() + feats_max_min[key]['seq_length'] = int(data_sample[key].map(len).max()) + num_words_in_each_row = data_sample[key].map(lambda x: len(x.split(" "))).mean() feats_max_min[key]['size_of_vocab'] = int(num_rows_in_data*num_words_in_each_row) except: feats_max_min[key]['seq_length'] = len(vocab) // num_rows_in_data @@ -679,10 +680,8 @@ def classify_features_using_pandas(data_sample, target, model_options={}, verbos ### This is for string variables ######## #### Now we select features if they are present in the data set ### num_rows_in_data = model_options['DS_LEN'] - if data_sample[key].isnull().sum() > 0: - vocab = data_sample[key].fillna("missing").unique().tolist() - else: - vocab = data_sample[key].unique().tolist() + data_sample[[key]] = data_sample[[key]].fillna("missing") + vocab = data_sample[key].unique().tolist() vocab = np.unique(vocab).tolist() #vocab = ['missing' if type(x) != str else x for x in vocab] feats_max_min[key]["vocab"] = vocab @@ -748,7 +747,7 @@ def classify_features_using_pandas(data_sample, target, model_options={}, verbos print('Not performing feature crossing for categorical nor integer variables' ) return data_sample, var_df1, feats_max_min ############################################################################################ -def EDA_classify_and_return_cols_by_type(df1, nlp_char_limit=20): +def EDA_classify_and_return_cols_by_type(df1, nlp_char_limit=50): """ EDA stands for Exploratory data analysis. 
This function performs EDA - hence the name ######################################################################################## @@ -763,7 +762,8 @@ def EDA_classify_and_return_cols_by_type(df1, nlp_char_limit=20): nlpcols = [] for each_cat in cats: try: - if df1[each_cat].map(len).mean() >=nlp_char_limit: + df1[[each_cat]] = df1[[each_cat]].fillna('missing') + if df1[each_cat].map(len).max() >=nlp_char_limit: nlpcols.append(each_cat) catcols.remove(each_cat) except: @@ -775,7 +775,7 @@ def EDA_classify_and_return_cols_by_type(df1, nlp_char_limit=20): floatcols = df1.select_dtypes(include='float').columns.tolist() return catcols, int_cats, intcols, floatcols, nlpcols ############################################################################################ -def EDA_classify_features(train, target, idcols, nlp_char_limit=20): +def EDA_classify_features(train, target, idcols, nlp_char_limit=50): ### Test Labeler is a very important dictionary that will help transform test data same as train #### test_labeler = defaultdict(list) @@ -1081,7 +1081,7 @@ def classify_dtypes_using_TF2(data_sample, preds, idcols, verbose=0): """ print_features = False nlps = [] - nlp_char_limit = 30 + nlp_char_limit = 50 all_ints = [] floats = [] cats = [] @@ -1108,7 +1108,8 @@ def classify_dtypes_using_TF2(data_sample, preds, idcols, verbose=0): int_vocab = tf.unique(value)[0].numpy().tolist() feats_max_min[key]['size_of_vocab'] = len(int_vocab) elif feats_max_min[key]['dtype'] in [tf.string]: - if tf.reduce_mean(tf.strings.length(feature_batch[key])).numpy() >= nlp_char_limit: + feature_batch[[key]] = feature_batch[[key]].fillna("missing") + if tf.reduce_max(tf.strings.length(feature_batch[key])).numpy() >= nlp_char_limit: print('%s is detected and will be treated as an NLP variable') nlps.append(key) else: diff --git a/deep_autoviml/data_load/extract.py b/deep_autoviml/data_load/extract.py index a826883..5c8c4d3 100644 --- a/deep_autoviml/data_load/extract.py +++ b/deep_autoviml/data_load/extract.py @@ -132,7 +132,7 @@ def transform_train_target(train_target, target, modeltype, model_label, cat_voc train_target = copy.deepcopy(train_target) cat_vocab_dict = copy.deepcopy(cat_vocab_dict) ### Just have to change the target from string to Numeric in entire dataframe! 
### - + if modeltype != 'Regression': if model_label == 'Multi_Label': target_copy = copy.deepcopy(target) @@ -315,34 +315,50 @@ def load_train_data_file(train_datafile, target, keras_options, model_options, v ### if modeltype is given, then do not find the model type using this function _, _, usecols = find_problem_type(train_small, target, model_options, verbose) - label_encode_flag = False + ########## Find small details about the data to help create the right model ### - - if modeltype == 'Classification' or modeltype == 'Multi_Classification': + label_encode_flag = model_options["label_encode_flag"] + if isinstance(label_encode_flag, str): + if modeltype == 'Classification' or modeltype == 'Multi_Classification': + if isinstance(target, str): + #### This is for Single-Label problems ######## + if train_small[target].dtype == 'object' or str(train_small[target].dtype).lower() == 'category': + label_encode_flag = True + elif 0 not in np.unique(train_small[target]): + label_encode_flag = False + if verbose: + print(' label encoding must be done since there is no zero class!') + target_vocab = train_small[target].unique() + num_classes = len(target_vocab) + elif isinstance(target, list): + #### This is for Multi-Label problems ######## + num_classes = [] + for each_target in target: + if train_small[each_target].dtype == 'object' or str(train_small[target[0]].dtype).lower() == 'category': + label_encode_flag = True + elif 0 not in np.unique(train_small[each_target]): + label_encode_flag = False + if verbose: + print(' label encoding must be done since there is no zero class!') + target_vocab = train_small[each_target].unique().tolist() + num_classes.append(len(target_vocab)) + else: + num_classes = 1 + target_vocab = [] + label_encode_flag = False + else: if isinstance(target, str): - #### This is for Single-Label problems ######## - if train_small[target].dtype == 'object' or str(train_small[target].dtype).lower() == 'category': - label_encode_flag = True - elif 0 not in np.unique(train_small[target]): - label_encode_flag = True ### label encoding must be done since no zero class! target_vocab = train_small[target].unique() num_classes = len(target_vocab) - elif isinstance(target, list): - #### This is for Multi-Label problems ######## - num_classes = [] - for each_target in target: - if train_small[each_target].dtype == 'object' or str(train_small[target[0]].dtype).lower() == 'category': - label_encode_flag = True - elif 0 not in np.unique(train_small[each_target]): - label_encode_flag = True - target_vocab = train_small[each_target].unique().tolist() - num_classes.append(len(target_vocab)) - else: - num_classes = 1 - target_vocab = [] + else: + for each_target in copy_target: + target_vocab = train_small[target].unique().tolist() + num_classes_each = len(target_vocab) + num_classes.append(int(num_classes_each)) + #### This is where we set the model_options for num_classes and num_labels ######### model_options['num_classes'] = num_classes - + ############# Sample Data classifying features into variaous types ################## print('Loaded a small data sample of size = %s into pandas dataframe to analyze...' %(train_small.shape,)) ### classify variables using the small dataframe ## @@ -695,12 +711,13 @@ def load_train_data_frame(train_dataframe, target, keras_options, model_options, #### if target is changed you must send that modified target back to other processes ###### ### usecols is basically target in a list format. Very handy to know when target is a list. 
- + try: modeltype = model_options["modeltype"] if model_options["modeltype"] == '': ### usecols is basically target in a list format. Very handy to know when target is a list. modeltype, model_label, usecols = find_problem_type(train_dataframe, target, model_options, verbose) + model_options["modeltype"] = modeltype else: if isinstance(target, str): usecols = [target] @@ -732,37 +749,49 @@ def load_train_data_frame(train_dataframe, target, keras_options, model_options, cat_vocab_dict['modeltype'] = modeltype model_options['batch_size'] = batch_size ########## Find small details about the data to help create the right model ### - target_transformed = False - if modeltype != 'Regression': - if isinstance(target, str): - #### This is for Single Label Problems ###### - if train_small[target].dtype == 'object' or str(train_small[target].dtype).lower() == 'category': - target_transformed = True - target_vocab = train_small[target].unique() - num_classes = len(target_vocab) - else: - if 0 not in np.unique(train_small[target]): - target_transformed = True ### label encoding must be done since no zero class! - target_vocab = train_small[target].unique() - num_classes = len(train_small[target].value_counts()) - elif isinstance(target, list): - #### This is for Multi-Label Problems ####### - copy_target = copy.deepcopy(target) - num_classes = [] - for each_target in copy_target: - if train_small[target[0]].dtype == 'object' or str(train_small[target[0]].dtype).lower() == 'category': + target_transformed = model_options["label_encode_flag"] + if isinstance(target_transformed, str): + if modeltype != 'Regression': + if isinstance(target, str): + #### This is for Single Label Problems ###### + if train_small[target].dtype == 'object' or str(train_small[target].dtype).lower() == 'category': target_transformed = True - target_vocab = train_small[target].unique().tolist() - num_classes_each = len(target_vocab) + target_vocab = train_small[target].unique() + num_classes = len(target_vocab) else: - if 0 not in np.unique(train_small[target[0]]): + if 0 not in np.unique(train_small[target]): target_transformed = True ### label encoding must be done since no zero class! - target_vocab = train_small[target[0]].unique() - num_classes_each = train_small[target].apply(np.unique).apply(len).max() - num_classes.append(int(num_classes_each)) + target_vocab = train_small[target].unique() + num_classes = len(train_small[target].value_counts()) + elif isinstance(target, list): + #### This is for Multi-Label Problems ####### + copy_target = copy.deepcopy(target) + num_classes = [] + for each_target in copy_target: + if train_small[target[0]].dtype == 'object' or str(train_small[target[0]].dtype).lower() == 'category': + target_transformed = True + target_vocab = train_small[target].unique().tolist() + num_classes_each = len(target_vocab) + else: + if 0 not in np.unique(train_small[target[0]]): + target_transformed = True ### label encoding must be done since no zero class! 
+ target_vocab = train_small[target[0]].unique() + num_classes_each = train_small[target].apply(np.unique).apply(len).max() + num_classes.append(int(num_classes_each)) + else: + num_classes = 1 + target_vocab = [] + target_transformed = False else: - num_classes = 1 - target_vocab = [] + if isinstance(target, str): + target_vocab = train_small[target].unique() + num_classes = len(target_vocab) + else: + for each_target in copy_target: + target_vocab = train_small[target].unique().tolist() + num_classes_each = len(target_vocab) + num_classes.append(int(num_classes_each)) + ########### find the number of labels in data #### if isinstance(target, str): num_labels = 1 @@ -777,7 +806,7 @@ def load_train_data_frame(train_dataframe, target, keras_options, model_options, cat_vocab_dict['num_labels'] = num_labels cat_vocab_dict['num_classes'] = num_classes cat_vocab_dict["target_transformed"] = target_transformed - + #### once the dataframe has been classified, you can again change train_small to original dataframe ## train_small = copy.deepcopy(train_dataframe) @@ -1059,18 +1088,23 @@ def combine_nlp_text(features): y[NLP_COLUMN] = tf.strings.reduce_join([features[i] for i in NLP_VARS],axis=0, keepdims=False, separator=' ') return y - ################################################################ + ###################################################################################### ### You have to load only the NLP or text variables into dataset. ### otherwise, it will fail during predict. Yo still need to create input for them. ### In mixed_NLP models, you drop original NLP vars and combine them into one NLP var. - if NLP_VARS and keras_model_type.lower() in ['nlp','text']: + ###################################################################################### + + if NLP_VARS and keras_model_type.lower() in ['nlp','text', 'mixed_nlp', 'combined_nlp']: if keras_model_type.lower() in ['nlp', 'text']: train_ds = train_ds.map(lambda x, y: (process_NLP_features(x), y)) #train_ds = train_ds.unbatch().batch(batch_size) print(' processed NLP or text vars: %s successfully' %NLP_VARS) - else: + elif keras_model_type.lower() in ['combined_nlp']: train_ds = train_ds.map(lambda x, y: (combine_nlp_text(x), y)) print(' combined NLP or text vars: %s into a single feature successfully' %NLP_VARS) + else: + ### Mixed NLP is to keep NLP vars separate so they can be processed individually ## + print(' keeping NLP vars separate') else: print(' No special text preprocessing done for NLP vars.') ############################################################################ @@ -1305,10 +1339,11 @@ def select_rows_from_file_or_frame(train_datafile, model_options, targets, nrows modeltype = model_options['modeltype'] compression = model_options['compression'] ####### we randomly sample a small dataset to classify features ##################### - test_size = min(0.9, (1 - (nrows_limit/DS_LEN))) ### make sure there is a small train size + test_size = min(0.9, (1 - (nrows_limit/DS_LEN))) ### make sure there is a small train size if test_size <= 0: test_size = 0.9 - print(' Since number of rows > maxrows, loading a random sample of %d rows into pandas for EDA' %nrows_limit) + if DS_LEN > nrows_limit: + print(' Since number of rows > %s, loading a random sample of %d rows into pandas for EDA' %(nrows_limit, DS_LEN)) ###### If it is a file you need to load it into a dataframe, it not leave it as is ### if isinstance(train_datafile, str): ###### load a small sample of data into a pandas dataframe ## @@ -1321,6 +1356,7 @@ def 
select_rows_from_file_or_frame(train_datafile, model_options, targets, nrows else: train_small = copy.deepcopy(train_datafile) ####### If it is a classification problem, you need to stratify and select sample ### + if modeltype != 'Regression': copy_targets = copy.deepcopy(targets) for each_target in copy_targets: @@ -1330,6 +1366,9 @@ def select_rows_from_file_or_frame(train_datafile, model_options, targets, nrows train_small, _ = train_test_split(train_small, test_size=test_size, stratify=train_small[targets]) else: ### For Regression problems: load a small sample of data into a pandas dataframe ## - train_small = train_small.sample(n=nrows_limit, random_state=99) + if DS_LEN <= nrows_limit: + train_small = train_small.sample(n=DS_LEN, random_state=99) + else: + train_small = train_small.sample(n=nrows_limit, random_state=99) return train_small ###################################################################################### \ No newline at end of file diff --git a/deep_autoviml/deep_autoviml.py b/deep_autoviml/deep_autoviml.py index fb4720d..96a64b7 100644 --- a/deep_autoviml/deep_autoviml.py +++ b/deep_autoviml/deep_autoviml.py @@ -50,7 +50,6 @@ from tensorflow.keras import regularizers from tensorflow.keras.models import Model, load_model import tensorflow_hub as hub -import tensorflow_text as text ############################################################################################# from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error @@ -135,7 +134,9 @@ def left_subtract(l1,l2): ############################################################################################## def fit(train_data_or_file, target, keras_model_type="basic", project_name="deep_autoviml", save_model_flag=True, model_options={}, keras_options={}, - use_my_model='', model_use_case='', verbose=0): + use_my_model='', model_use_case='', verbose=0, + use_mlflow=False,mlflow_exp_name='autoviml',mlflow_run_name='first_run' + ): """ #################################################################################### #### Deep AutoViML #### @@ -196,6 +197,11 @@ def fit(train_data_or_file, target, keras_model_type="basic", project_name="deep Another option would be to inform autoviml about encoding in CSV file for it to read such as 'latin-1' by setting {"csv_encoding": 'latin-1'} Other examples: + "nlp_char_limit": default 50. Beyond this max limit of chars in column, it + will be considered NLP column and treated as such. + "variable_cat_limit": default 30. if a variable has more than this limit, it + will NOT be treated as a categorical variable. + "DS_LEN": default "". Number of rows in dataset. You can leave it "" to calculate automatically. "csv_encoding": default='utf-8'. You can change to 'latin-1', 'iso-8859-1', 'cp1252', etc. "cat_feat_cross_flag": if you want to cross categorical features such as A*B, B*C... "sep" : default = "," comma but you can override it. Separator used in read_csv. @@ -208,6 +214,9 @@ def fit(train_data_or_file, target, keras_model_type="basic", project_name="deep We will figure out single label or multi-label problem based on your target being string or list. "header": default = 0 ### this is the header row for pandas to read + "compression": None => you can set it to zip or other file compression formats if your data is compressed + "csv_encoding": default 'utf-8'. But you can set it to any other csv encoding format your data is in + "label_encode_flag": False. But you can set it to True if you want it encoded. 
"max_trials": default = 30 ## number of Storm Tuner trials ### Lower this for faster processing. "tuner": default = 'storm' ## Storm Tuner is the default tuner. Optuna is the other option. "embedding_size": default = 50 ## this is the NLP embedding size minimum @@ -228,6 +237,13 @@ def fit(train_data_or_file, target, keras_model_type="basic", project_name="deep It is a placeholder for future purposes. At the moment, leave it as empty string. verbose = 1 will give you more charts and outputs. verbose 0 will run silently with minimal outputs. + use_mlflow = This is used to enabling MLflow lifecycle and tracking. This is False be default. + MLflow is useed to manage the ML lifecycle, including experimentation, reproducibility, + deployment, and a central model registry. + mlflow_exp_name = MLflow experiment name. + mlflow_run_name = User has flexibilty to use custom run name. + + """ my_strategy = check_if_GPU_exists(1) ######## C H E CK T Y P E O F K E R A S M O D E L ##################### @@ -235,6 +251,13 @@ def fit(train_data_or_file, target, keras_model_type="basic", project_name="deep model_options_copy = copy.deepcopy(model_options) keras_options_copy = copy.deepcopy(keras_options) + #############MLFLOW Check#################################### + if use_mlflow: + import mlflow + mlflow.set_experiment(mlflow_exp_name) + mlflow.start_run(run_name=mlflow_run_name) + mlflow.tensorflow.autolog(every_n_iter=1) + if isinstance(project_name,str): if project_name == '': project_name = "deep_autoviml" @@ -255,7 +278,7 @@ def fit(train_data_or_file, target, keras_model_type="basic", project_name="deep print('Model and logs being saved in %s' %save_model_path) if keras_model_type.lower() in ['image', 'images', "image_classification"]: - ############### Now do special image processing here ################################### + ############### Now do special IMAGE processing here ################################### if 'image_directory' in model_options.keys(): print(' Image directory given as %s' %model_options['image_directory']) image_dir = model_options["image_directory"] @@ -282,7 +305,7 @@ def fit(train_data_or_file, target, keras_model_type="basic", project_name="deep print(deep_model.summary()) return deep_model, cat_vocab_dict elif keras_model_type.lower() in ['text', 'text classification', "text_classification"]: - ############### Now do special text processing here ################################### + ############### Now do special TEXT processing here ################################### text_alt = True ### This means you use the text directory option if 'text_directory' in model_options.keys(): print(' text directory given as %s' %model_options['text_directory']) @@ -415,8 +438,8 @@ def fit(train_data_or_file, target, keras_model_type="basic", project_name="deep print(' %s : %s' %(key, keras_options_copy[key])) keras_options[key] = keras_options_copy[key] - list_of_model_options = ["idcols","modeltype","sep","cat_feat_cross_flag", "model_use_case", - "nlp_char_limit", "variable_cat_limit", "csv_encoding", "header", + list_of_model_options = ["idcols","modeltype","sep","cat_feat_cross_flag", "model_use_case", "label_encode_flag", + "nlp_char_limit", "variable_cat_limit", "compression", "csv_encoding", "header", "max_trials","tuner", "embedding_size", "tf_hub_model", "image_directory", 'image_height', 'image_width', "image_channels", "save_model_path"] @@ -430,6 +453,8 @@ def fit(train_data_or_file, target, keras_model_type="basic", project_name="deep 
model_options_defaults["nlp_char_limit"] = 30 model_options_defaults["variable_cat_limit"] = 30 model_options_defaults["csv_encoding"] = 'utf-8' + model_options_defaults['compression'] = None ## is is needed in case to read Zip files + model_options_defaults["label_encode_flag"] = '' ## User can set it to True or False depending on their need. model_options_defaults["header"] = 0 ### this is the header row for pandas to read model_options_defaults["max_trials"] = 30 ## number of Storm Tuner trials ### model_options_defaults['tuner'] = 'storm' ## Storm Tuner is the default tuner. Optuna is the other option. @@ -498,7 +523,7 @@ def fit(train_data_or_file, target, keras_model_type="basic", project_name="deep #### There may be other use cases for model_use_case in future hence leave this empty for now # #### you must create a functional model here - print('\nCreating a new Functional model here...') + print('\nCreating a new Functional keras model now...') print(''' ################################################################################# ########### C R E A T I N G A K E R A S M O D E L ############ @@ -535,7 +560,7 @@ def fit(train_data_or_file, target, keras_model_type="basic", project_name="deep keras_options, model_options, var_df, cat_vocab_dict, project_name, save_model_flag, verbose) else: #### This is used only for custom auto models and is out of the strategy scope ####### - print('Building and training an automatic model using %s Tuner...' %model_options['tuner']) + print('Building and training a(n) %s model using %s Tuner...' %(keras_model_type, model_options['tuner'])) deep_model, cat_vocab_dict = train_custom_model(nlp_inputs, meta_inputs, meta_outputs, nlp_outputs, batched_data, target, keras_model_type, keras_options, model_options, var_df, cat_vocab_dict, project_name, @@ -551,10 +576,16 @@ def fit(train_data_or_file, target, keras_model_type="basic", project_name="deep except: print('Cannot save plot. 
Install pydot and graphviz if you want plots saved.') distributed_values = (deep_model, cat_vocab_dict) + if use_mlflow: + mlflow.end_run() + print("""####################################################### + Please start Mlflow locally to track machine learning lifecycle and use as below + http://localhost:5000/ + ####################################################### """) return distributed_values ############################################################################################ def get_save_folder(save_dir): - run_id = time.strftime("model_%Y_%m_%d-%H_%M_%S") + run_id = time.strftime("model_%Y_%m_%d_%H_%M_%S") return os.path.join(save_dir, run_id) -###################################################################################### +############################################################################################ \ No newline at end of file diff --git a/deep_autoviml/modeling/__pycache__/create_model.cpython-38.pyc b/deep_autoviml/modeling/__pycache__/create_model.cpython-38.pyc index 6c8996f..a3260f5 100644 Binary files a/deep_autoviml/modeling/__pycache__/create_model.cpython-38.pyc and b/deep_autoviml/modeling/__pycache__/create_model.cpython-38.pyc differ diff --git a/deep_autoviml/modeling/__pycache__/predict_model.cpython-38.pyc b/deep_autoviml/modeling/__pycache__/predict_model.cpython-38.pyc index 96a2da1..7e5756c 100644 Binary files a/deep_autoviml/modeling/__pycache__/predict_model.cpython-38.pyc and b/deep_autoviml/modeling/__pycache__/predict_model.cpython-38.pyc differ diff --git a/deep_autoviml/modeling/__pycache__/train_custom_model.cpython-38.pyc b/deep_autoviml/modeling/__pycache__/train_custom_model.cpython-38.pyc index 4a6978a..708afa3 100644 Binary files a/deep_autoviml/modeling/__pycache__/train_custom_model.cpython-38.pyc and b/deep_autoviml/modeling/__pycache__/train_custom_model.cpython-38.pyc differ diff --git a/deep_autoviml/modeling/__pycache__/train_model.cpython-38.pyc b/deep_autoviml/modeling/__pycache__/train_model.cpython-38.pyc index 0d06530..74b8523 100644 Binary files a/deep_autoviml/modeling/__pycache__/train_model.cpython-38.pyc and b/deep_autoviml/modeling/__pycache__/train_model.cpython-38.pyc differ diff --git a/deep_autoviml/modeling/create_model.py b/deep_autoviml/modeling/create_model.py index d4259b4..28bd3ba 100644 --- a/deep_autoviml/modeling/create_model.py +++ b/deep_autoviml/modeling/create_model.py @@ -228,7 +228,7 @@ def create_model(use_my_model, nlp_inputs, meta_inputs, meta_outputs, nlp_output fast_models2 = ['deep_and_cross', 'deep_cross', 'deep cross', 'fast2'] nlp_models = ['bert', 'use', 'text', 'mixed_nlp'] #### The Deep and Wide Model is a bit more complicated. So it needs some changes in inputs! 
###### - prebuilt_models = ['basic', 'simple', 'default','dnn','reg_dnn', + prebuilt_models = ['basic', 'simple', 'default','dnn','reg_dnn', 'deep', 'big deep', 'dnn_drop', 'big_deep', 'giant_deep', 'giant deep', 'cnn1', 'cnn','cnn2'] ###### Just do a simple check for auto models here #################### @@ -270,10 +270,10 @@ def create_model(use_my_model, nlp_inputs, meta_inputs, meta_outputs, nlp_output elif keras_model_type.lower() in ['dnn', 'simple_dnn']: ########## Now that we have setup the layers correctly, we can build some more hidden layers model_body = dnn.model - elif keras_model_type.lower() in ['dnn_drop', 'big_deep']: + elif keras_model_type.lower() in ['dnn_drop', 'big_deep', 'big deep']: #################################################### model_body = dnn_drop.model - elif keras_model_type.lower() in ['giant', 'giant_deep']: + elif keras_model_type.lower() in ['giant', 'giant_deep', 'giant deep']: #################################################### model_body = giant_deep.model elif keras_model_type.lower() in ['cnn', 'cnn1','cnn2']: @@ -443,6 +443,7 @@ def create_model(use_my_model, nlp_inputs, meta_inputs, meta_outputs, nlp_output #### This final outputs is the one that is taken into final dense layer and compiled print(' %s model loaded successfully. Now compiling model...' %keras_model_type) ############# You need to compile the non-auto models here ############### + model_body = get_compiled_model(all_inputs, model_body, output_activation, num_predicts, modeltype, optimizer, val_loss, val_metrics, cols_len, targets) print(' %s model loaded and compiled successfully...' %keras_model_type) diff --git a/deep_autoviml/modeling/predict_model.py b/deep_autoviml/modeling/predict_model.py index 57be62b..3ad0c32 100644 --- a/deep_autoviml/modeling/predict_model.py +++ b/deep_autoviml/modeling/predict_model.py @@ -28,7 +28,6 @@ ############################################################################################ # TensorFlow ≥2.4 is required import tensorflow as tf -import tensorflow_text as text np.random.seed(42) tf.random.set_seed(42) diff --git a/deep_autoviml/modeling/train_custom_model.py b/deep_autoviml/modeling/train_custom_model.py index 6a2361b..7922a3d 100644 --- a/deep_autoviml/modeling/train_custom_model.py +++ b/deep_autoviml/modeling/train_custom_model.py @@ -52,6 +52,7 @@ def set_seed(seed=31415): from tensorflow.keras.layers import BatchNormalization from tensorflow.keras.optimizers import SGD from tensorflow.keras import regularizers +from tensorflow.keras.layers import LeakyReLU ##################################################################################### from deep_autoviml.modeling.create_model import return_optimizer from deep_autoviml.utilities.utilities import get_model_defaults, get_compiled_model @@ -150,17 +151,16 @@ def build_model_optuna(trial, inputs, meta_outputs, output_activation, num_predi #K.clear_session() #reset_keras() #tf.keras.backend.reset_uids() - - n_layers = trial.suggest_int("n_layers", 1, 4) + ### Keep the number of layers slightly higher to increase model complexity ## + n_layers = trial.suggest_int("n_layers", 2, 8) #num_hidden = trial.suggest_categorical("n_units", [32, 48, 64, 96, 128]) num_hidden = trial.suggest_categorical("n_units", [50, 100, 150, 200, 250, 300, 350, 400, 450, 500]) - #weight_decay = trial.suggest_float("weight_decay", 1e-8, 1e-3, log=True) - weight_decay = trial.suggest_float("weight_decay", 1e-8, 1e-7,1e-6, 1e-5,1e-4, 1e-3,1e-2, 1e-1) + weight_decay = trial.suggest_float("weight_decay", 
1e-8, 1e-3, log=True) use_bias = trial.suggest_categorical("use_bias", [True, False]) batch_norm = trial.suggest_categorical("batch_norm", [True, False]) add_noise = trial.suggest_categorical("add_noise", [True, False]) - dropout = trial.suggest_float("dropout", 0, 0.5) - activation_fn = trial.suggest_categorical("activation", ['relu', 'tanh', 'elu', 'selu']) + dropout = trial.suggest_float("dropout", 0.5, 0.9) + activation_fn = trial.suggest_categorical("activation", ['relu', 'elu', 'selu']) kernel_initializer = trial.suggest_categorical("kernel_initializer", ['glorot_uniform','he_normal','lecun_normal','he_uniform']) kernel_size = num_hidden @@ -183,7 +183,7 @@ def build_model_optuna(trial, inputs, meta_outputs, output_activation, num_predi model.add(BatchNormalization(name="opt_batchnorm_"+str(i))) if add_noise: - model.add(GaussianNoise(trial.suggest_float("adam_learning_rate", 1e-5, 1e-1, log=True))) + model.add(GaussianNoise(trial.suggest_float("adam_learning_rate", 1e-7, 1e-3, log=True))) model.add(Dropout(dropout, name="opt_drop_"+str(i))) @@ -198,13 +198,13 @@ def build_model_optuna(trial, inputs, meta_outputs, output_activation, num_predi else: optimizer_selected = trial.suggest_categorical("optimizer", optimizer_options) if optimizer_selected == "Adam": - kwargs["learning_rate"] = trial.suggest_float("adam_learning_rate", 1e-5, 1e-1, log=True) + kwargs["learning_rate"] = trial.suggest_float("adam_learning_rate", 1e-7, 1e-3, log=True) kwargs["epsilon"] = trial.suggest_float( "adam_epsilon", 1e-14, 1e-4, log=True ) elif optimizer_selected == "SGD": kwargs["learning_rate"] = trial.suggest_float( - "sgd_opt_learning_rate", 1e-5, 1e-2, log=True + "sgd_opt_learning_rate", 1e-7, 1e-3, log=True ) kwargs["momentum"] = trial.suggest_float("sgd_opt_momentum", 0.8, 0.95) @@ -224,27 +224,27 @@ def build_model_optuna(trial, inputs, meta_outputs, output_activation, num_predi def build_model_storm(hp, *args): #### Before every sequential model definition you need to clear the Keras backend ## keras.backend.clear_session() - + ###### we need to use the batch_size in a few small sizes #### if len(args) == 2: batch_limit, batch_nums = args[0], args[1] - batch_size = hp.Param('batch_size', [32, 48, 64, 96, 128, 256], + batch_size = hp.Param('batch_size', [32, 64, 128, 256, 512, 1024, 2048], ordered=True) elif len(args) == 1: batch_size = args[0] - hp.Param('batch_size', [batch_size]) + batch_size = hp.Param('batch_size', [batch_size]) else: - hp.Param('batch_size', [32]) + batch_size = hp.Param('batch_size', [32, 64, 128, 256, 512, 1024, 2048]) num_layers = hp.Param('num_layers', [1, 2, 3], ordered=True) ##### Now let us build the model body ############### model_body = Sequential([]) # example of model-wide unordered categorical parameter - activation_fn = hp.Param('activation', ['tanh','relu', 'selu', 'elu']) + activation_fn = hp.Param('activation', ['relu', 'selu', 'elu']) use_bias = hp.Param('use_bias', [True, False]) - #weight_decay = hp.Param("weight_decay", np.logspace(-8, -3)) - weight_decay = hp.Param("weight_decay", [1e-8, 1e-7,1e-6, 1e-5,1e-4, 1e-3,1e-2, 1e-1]) + weight_decay = hp.Param("weight_decay", np.logspace(-8, -3, 10)) + #weight_decay = hp.Param("weight_decay", [1e-8, 1e-7,1e-6, 1e-5,1e-4]) batch_norm = hp.Param("batch_norm", [True, False]) kernel_initializer = hp.Param("kernel_initializer", @@ -275,14 +275,14 @@ def build_model_storm(hp, *args): # this param will not affect the configuration hash, if this block of code isn't executed # this is to ensure we do not test 
configurations that are functionally the same # but have different values for unused parameters - model_body.add(Dropout(hp.Param('dropout_value', [0.1, 0.2, 0.3, 0.4, 0.5], ordered=True), + model_body.add(Dropout(hp.Param('dropout_value', [0.5, 0.6, 0.7, 0.8, 0.9], ordered=True), name="dropout_0")) kernel_size = hp.values['kernel_size_' + str(0)] if dropout_flag: dropout_value = hp.values['dropout_value'] else: - dropout_value = 0.00 + dropout_value = 0.5 batch_norm_flag = hp.values['use_batch_norm'] # example of inline ordered parameter num_copy = copy.deepcopy(num_layers) @@ -367,10 +367,12 @@ def run_trial(self, trial, *args): save_model_architecture(comp_model, project_name, keras_model_type, cat_vocab_dict, model_options, chart_name="model_before") #print(' Custom model compiled successfully. Training model next...') + batch_numbers = [32, 64, 128, 256, 512, 1024, 2048, 4096] shuffle_size = 1000 - batch_sizes = np.linspace(8, batch_limit,batch_nums).astype(int).tolist() - batch_size = hp.Param('batch_size', batch_sizes, ordered=True) - #print('storm batch size = %s' %batch_size) + batch_sizes = batch_numbers[:batch_nums] + #print('storm batch sizes = %s' %batch_sizes) + batch_size = np.random.choice(batch_sizes) + #print(' selected batch size = %s' %batch_size) train_ds = train_ds.unbatch().batch(batch_size) train_ds = train_ds.shuffle(shuffle_size, reshuffle_each_iteration=False, seed=42).prefetch(batch_size)#.repeat(5) @@ -421,22 +423,23 @@ def return_optimizer_trials(hp, hpq_optimizer): nadam = keras.optimizers.Nadam(lr=0.001, beta_1=0.9, beta_2=0.999) best_optimizer = '' ############################################################################# + lr_list = [1e-2, 1e-3, 1e-4] if hpq_optimizer.lower() in ['adam']: - best_optimizer = tf.keras.optimizers.Adam(lr=hp.Param('init_lr', [1e-2, 1e-3, 1e-4]), + best_optimizer = tf.keras.optimizers.Adam(lr=hp.Param('init_lr', lr_list), epsilon=hp.Param('epsilon', [1e-6, 1e-8, 1e-10, 1e-12, 1e-14], ordered=True)) elif hpq_optimizer.lower() in ['sgd']: - best_optimizer = keras.optimizers.SGD(lr=hp.Param('init_lr', [1e-2, 1e-3, 1e-4]), + best_optimizer = keras.optimizers.SGD(lr=hp.Param('init_lr', lr_list), momentum=0.9) elif hpq_optimizer.lower() in ['nadam']: - best_optimizer = keras.optimizers.Nadam(lr=hp.Param('init_lr', [1e-2, 1e-3, 1e-4]), + best_optimizer = keras.optimizers.Nadam(lr=hp.Param('init_lr', lr_list), beta_1=0.9, beta_2=0.999) elif hpq_optimizer.lower() in ['adamax']: - best_optimizer = keras.optimizers.Adamax(lr=hp.Param('init_lr', [1e-2, 1e-3, 1e-4]), + best_optimizer = keras.optimizers.Adamax(lr=hp.Param('init_lr', lr_list), beta_1=0.9, beta_2=0.999) elif hpq_optimizer.lower() in ['adagrad']: - best_optimizer = keras.optimizers.Adagrad(lr=hp.Param('init_lr', [1e-2, 1e-3, 1e-4])) + best_optimizer = keras.optimizers.Adagrad(lr=hp.Param('init_lr', lr_list)) elif hpq_optimizer.lower() in ['rmsprop']: - best_optimizer = keras.optimizers.RMSprop(lr=hp.Param('init_lr', [1e-2, 1e-3, 1e-4]), + best_optimizer = keras.optimizers.RMSprop(lr=hp.Param('init_lr', lr_list), rho=0.9) elif hpq_optimizer.lower() in ['nesterov']: best_optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9, nesterov=True) @@ -480,6 +483,10 @@ def train_custom_model(nlp_inputs, meta_inputs, meta_outputs, nlp_outputs, full_ data_size = check_keras_options(keras_options, 'data_size', 10000) batch_size = check_keras_options(keras_options, 'batchsize', 64) class_weights = check_keras_options(keras_options, 'class_weight', {}) + if not 
isinstance(model_options["label_encode_flag"], str): + if not model_options["label_encode_flag"]: + print(' removing class weights since label_encode_flag is set to False which means classes can be anything.') + class_weights = {} print(' Class weights: %s' %class_weights) num_classes = model_options["num_classes"] num_labels = model_options["num_labels"] @@ -503,7 +510,7 @@ def train_custom_model(nlp_inputs, meta_inputs, meta_outputs, nlp_outputs, full_ if keras_options['lr_scheduler'] in ['expo', 'ExponentialDecay', 'exponentialdecay']: print(' chosen ExponentialDecay learning rate scheduler') expo_steps = (NUMBER_OF_EPOCHS*data_size)//batch_size - learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, expo_steps, 0.1) + learning_rate = keras.optimizers.schedules.ExponentialDecay(0.0001, expo_steps, 0.1) else: learning_rate = check_keras_options(keras_options, "learning_rate", 5e-2) #### The steps are actually not needed but remove them later.### @@ -542,10 +549,21 @@ def train_custom_model(nlp_inputs, meta_inputs, meta_outputs, nlp_outputs, full_ val_loss, num_predicts, output_activation)) #### just use modeltype for printing that's all ### modeltype = cat_vocab_dict['modeltype'] - ### set some flags for choosing the right model buy here ################### + + ############################################################################ + ### A Regular body does not have separate NLP outputs. #################### + ### However an Irregular body like fast models have separate NLP outputs. ## + ############################################################################ regular_body = True if isinstance(meta_outputs, list): - regular_body = False + if nlp_flag: + if len(nlp_outputs) > 0: + ### This is a true nlp and we need to use nlp inputs ## + regular_body = False + else: + regular_body = True + else: + regular_body = False ############################################################################ ### check the defaults for the following! 
@@ -584,7 +602,7 @@ def train_custom_model(nlp_inputs, meta_inputs, meta_outputs, nlp_outputs, full_ try: y_test = np.concatenate(list(heldout_ds.map(lambda x,y: y).as_numpy_iterator())) print(' Single-Label: Heldout data shape: %s' %(y_test.shape,)) - max_batch_size = y_test.shape[0] + max_batch_size = int(min(y_test.shape[0], 4096)) except: max_batch_size = 48 pass @@ -644,7 +662,7 @@ def train_custom_model(nlp_inputs, meta_inputs, meta_outputs, nlp_outputs, full_ tune_mode = val_mode if tuner.lower() == "storm": ######## S T O R M T U N E R D E F I N E D H E R E ########### - randomization_factor = 0.25 + randomization_factor = 0.5 tuner = MyTuner(project_dir=trials_saved_path, build_fn=build_model_storm, objective_direction=tune_mode, @@ -657,14 +675,14 @@ def train_custom_model(nlp_inputs, meta_inputs, meta_outputs, nlp_outputs, full_ #### This is where you find best model parameters for keras using SToRM ##### ############################################################################# start_time1 = time.time() - print(' STORM Tuner max_trials = %d, randomization factor = %0.1f' %( + print(' STORM Tuner max_trials = %d, randomization factor = %0.2f' %( max_trials, randomization_factor)) tuner_epochs = 100 ### keep this low so you can run fast tuner_steps = STEPS_PER_EPOCH ## keep this also very low - batch_limit = min(max_batch_size, int(2 * find_batch_size(data_size))) - batch_nums = int(min(5, 0.1 * batch_limit)) + batch_limit = min(max_batch_size, int(5 * find_batch_size(data_size))) + batch_nums = int(min(8, math.log(batch_limit, 3))) print('Max. batch size = %d, number of batch sizes to try: %d' %(batch_limit, batch_nums)) - + #### You have to make sure that inputs are unique, otherwise error #### tuner.search(train_ds, valid_ds, tuner_epochs, tuner_steps, inputs, meta_outputs, cols_len, output_activation, @@ -825,7 +843,7 @@ def objective(trial): print('Model training with best hyperparameters for %d epochs' %NUMBER_OF_EPOCHS) for each_callback in callbacks_list: print(' Callback added: %s' %str(each_callback).split(".")[-1]) - + ############################ M O D E L T R A I N I N G ################## np.random.seed(42) tf.random.set_seed(42) diff --git a/deep_autoviml/modeling/train_model.py b/deep_autoviml/modeling/train_model.py index ad23afa..64218d4 100644 --- a/deep_autoviml/modeling/train_model.py +++ b/deep_autoviml/modeling/train_model.py @@ -120,6 +120,10 @@ def train_model(deep_model, full_ds, target, keras_model_type, keras_options, patience = check_keras_options(keras_options, "patience", 10) optimizer = keras_options['optimizer'] class_weights = check_keras_options(keras_options, "class_weight", {}) + if not isinstance(model_options["label_encode_flag"], str): + if not model_options["label_encode_flag"]: + print(' removing class weights since label_encode_flag is set to False which means classes can be anything.') + class_weights = {} print(' class_weights: %s' %class_weights) cols_len = len([item for sublist in list(var_df.values()) for item in sublist]) print(' original datasize = %s, initial batchsize = %s' %(data_size, batch_size)) diff --git a/deep_autoviml/models/tf_hub_lookup.py b/deep_autoviml/models/tf_hub_lookup.py index f2f0c57..e9f6fad 100644 --- a/deep_autoviml/models/tf_hub_lookup.py +++ b/deep_autoviml/models/tf_hub_lookup.py @@ -67,7 +67,7 @@ 'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/2', } -map_hub_to_name = dict([(v,k) for (k,v) in map_name_to_handle.items()]) +map_hub_to_name = {v: k for (k,v) in map_name_to_handle.items()} 
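
Stepping back to the batch-size changes in `train_custom_model.py` above, here is a small sketch of how the new `batch_limit`/`batch_nums` computation feeds the `batch_numbers[:batch_nums]` slice used in `run_trial`. The heldout size of 10,000 rows and the `find_batch_size()` result of 64 are assumptions chosen purely for illustration.

```python
import math
import numpy as np

# Candidate batch sizes, mirroring the batch_numbers list added in run_trial()
batch_numbers = [32, 64, 128, 256, 512, 1024, 2048, 4096]

# Assumed for illustration: 10,000 heldout rows, find_batch_size() returned 64
max_batch_size = int(min(10000, 4096))               # capped at 4096 as above
batch_limit = min(max_batch_size, int(5 * 64))       # -> 320
batch_nums = int(min(8, math.log(batch_limit, 3)))   # log base 3 of 320 ~ 5.2 -> 5

batch_sizes = batch_numbers[:batch_nums]             # [32, 64, 128, 256, 512]
batch_size = np.random.choice(batch_sizes)           # one size sampled per trial
```
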
map_name_to_preprocess = { 'bert_en_uncased_L-12_H-768_A-12': diff --git a/deep_autoviml/preprocessing/__pycache__/preprocessing.cpython-38.pyc b/deep_autoviml/preprocessing/__pycache__/preprocessing.cpython-38.pyc index fb3fc8f..55f9c3d 100644 Binary files a/deep_autoviml/preprocessing/__pycache__/preprocessing.cpython-38.pyc and b/deep_autoviml/preprocessing/__pycache__/preprocessing.cpython-38.pyc differ diff --git a/deep_autoviml/preprocessing/__pycache__/preprocessing_images.cpython-38.pyc b/deep_autoviml/preprocessing/__pycache__/preprocessing_images.cpython-38.pyc index d93de86..730f3d4 100644 Binary files a/deep_autoviml/preprocessing/__pycache__/preprocessing_images.cpython-38.pyc and b/deep_autoviml/preprocessing/__pycache__/preprocessing_images.cpython-38.pyc differ diff --git a/deep_autoviml/preprocessing/__pycache__/preprocessing_nlp.cpython-38.pyc b/deep_autoviml/preprocessing/__pycache__/preprocessing_nlp.cpython-38.pyc index 6004ebf..02c3187 100644 Binary files a/deep_autoviml/preprocessing/__pycache__/preprocessing_nlp.cpython-38.pyc and b/deep_autoviml/preprocessing/__pycache__/preprocessing_nlp.cpython-38.pyc differ diff --git a/deep_autoviml/preprocessing/__pycache__/preprocessing_tabular.cpython-38.pyc b/deep_autoviml/preprocessing/__pycache__/preprocessing_tabular.cpython-38.pyc index cecbfcc..d5812c9 100644 Binary files a/deep_autoviml/preprocessing/__pycache__/preprocessing_tabular.cpython-38.pyc and b/deep_autoviml/preprocessing/__pycache__/preprocessing_tabular.cpython-38.pyc differ diff --git a/deep_autoviml/preprocessing/__pycache__/preprocessing_text.cpython-38.pyc b/deep_autoviml/preprocessing/__pycache__/preprocessing_text.cpython-38.pyc index 7b82048..07413b2 100644 Binary files a/deep_autoviml/preprocessing/__pycache__/preprocessing_text.cpython-38.pyc and b/deep_autoviml/preprocessing/__pycache__/preprocessing_text.cpython-38.pyc differ diff --git a/deep_autoviml/preprocessing/preprocessing.py b/deep_autoviml/preprocessing/preprocessing.py index 23e944f..6222438 100644 --- a/deep_autoviml/preprocessing/preprocessing.py +++ b/deep_autoviml/preprocessing/preprocessing.py @@ -24,7 +24,7 @@ # Make numpy values easier to read. 
np.set_printoptions(precision=3, suppress=True) from collections import defaultdict - +import os ############################################################################################ # data pipelines and feature engg here from deep_autoviml.preprocessing.preprocessing_tabular import preprocessing_tabular @@ -65,6 +65,7 @@ from tensorflow.keras import regularizers from tensorflow.keras.layers import Dense, LSTM, GRU, Input, concatenate, Embedding from tensorflow.keras.layers import Reshape, Activation, Flatten +import tensorflow_hub as hub from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error from IPython.core.display import Image, display @@ -183,74 +184,101 @@ def perform_preprocessing(train_ds, var_df, cat_vocab_dict, keras_model_type, nlp_names = [] embedding = [] ################## All other Features are Proprocessed Here ################ - fast_models = ['fast','deep_and_wide','deep_wide','wide_deep', 'mixed_nlp', + ### make sure you include mixed_nlp and combined_nlp in this list since you want it separated + fast_models = ['fast','deep_and_wide','deep_wide','wide_deep', "mixed_nlp","combined_nlp", 'wide_and_deep','deep wide', 'wide deep', 'fast1', 'deep_and_cross', 'deep_cross', 'deep cross', 'fast2',"text"] ############################################################################## meta_outputs = [] print('Preprocessing non-NLP layers for %s Keras model...' %keras_model_type) - + if not keras_model_type.lower() in fast_models: - ################################################################################ - ############ T H I S I S F O R "A U T O" M O D E L S O N L Y ######### - ################################################################################ + ############################################################################################ + ############ I N "A U T O" M O D E L S we use Lat and Lon with NLP right here ######### + ############################################################################################ if len(lats+lons) > 0: - print(' starting categorical, float and integer layer preprocessing...') + print(' Now combine all numeric and non-numeric vars into a Deep only model...') meta_outputs, meta_inputs, meta_names = preprocessing_tabular(train_ds, var_df, cat_feat_cross_flag, model_options, cat_vocab_dict, keras_model_type, verbose) - print(' All Non-NLP feature preprocessing for %s completed.' %keras_model_type) + print(' All Non-NLP feature preprocessing completed.') ### this is the order in which columns have been trained ### + if len(nlps) > 0: + print('Starting NLP string column layer preprocessing...') + nlp_inputs = create_nlp_inputs(nlps) + max_tokens_zip, seq_tokens_zip, embed_tokens_zip, vocab_train_small = aggregate_nlp_dictionaries(nlps, cat_vocab_dict, model_options) + nlp_encoded = encode_nlp_inputs(nlp_inputs, cat_vocab_dict) + ### we call nlp_outputs as embedding in this section of the program #### + print('NLP Preprocessing completed.') + #merged = [meta_outputs, nlp_encoded] + merged = layers.concatenate([nlp_encoded, meta_outputs]) + print(' combined categorical+numeric with nlp outputs successfully for %s model...' %keras_model_type) + nlp_inputs = list(nlp_inputs.values()) + else: + merged = meta_outputs final_training_order = nlp_names + meta_names ### find their dtypes - remember to use element_spec[0] for train data sets! 
- ds_types = dict([(col_name, train_ds.element_spec[0][col_name].dtype) for col_name in final_training_order ]) + ds_types = {col_name: train_ds.element_spec[0][col_name].dtype for col_name in final_training_order } col_type_tuples = [(name,ds_types[name]) for name in final_training_order] if verbose >= 2: print('Inferred column names, layers and types (double-check for duplicates and correctness!): \n%s' %col_type_tuples) print(' %s model loaded and compiled successfully...' %keras_model_type) else: - ####### Now combine all vars into a complete auto deep and wide model ############## + ############################################################################################ + #### In "auto" vs. "mixed_nlp", the NLP processings are different. Numeric process is same. + #### Here both NLP and NON-NLP varas are combined with embedding to form a deep wide model # + ############################################################################################ + print(' Now combine all numeric+cat+NLP vars into a Deep and Wide model') ## Since we are processing NLPs separately we need to remove them from inputs ### if len(NON_NLP_VARS) == 0: - print(' Non-NLP vars is zero in this dataset. No tabular preprocesing needed...') + print(' There are zero non-NLP variables in this dataset. No non-NLP preprocesing needed...') meta_inputs = [] else: - #### Here both NLP and NON-NLP varas are combined with embedding to form a deep wide model # FEATURE_NAMES = left_subtract(FEATURE_NAMES, nlps) dropout_rate = 0.1 hidden_units = [dense_layer2, dense_layer3] inputs = create_fast_inputs(FEATURE_NAMES, NUMERIC_FEATURE_NAMES, FLOATS) #all_inputs = dict(zip(meta_names,meta_inputs)) + #### In auto models we want "wide" to be short. Hence use_embedding to be True. wide = encode_auto_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, vocab_dict, - hidden_units, use_embedding=False) + hidden_units, use_embedding=True) wide = layers.BatchNormalization()(wide) deep = encode_all_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, vocab_dict, use_embedding=True) + deep = layers.BatchNormalization()(deep) meta_inputs = list(inputs.values()) ### convert input layers to a list #### If there are NLP vars in dataset, you must combine the nlp_outputs ## + print(' All Non-NLP feature preprocessing completed.') if len(nlps) > 0: print('Starting NLP string column layer preprocessing...') nlp_inputs = create_nlp_inputs(nlps) max_tokens_zip, seq_tokens_zip, embed_tokens_zip, vocab_train_small = aggregate_nlp_dictionaries(nlps, cat_vocab_dict, model_options) nlp_encoded = encode_nlp_inputs(nlp_inputs, cat_vocab_dict) ### we call nlp_outputs as embedding in this section of the program #### - print(' NLP Preprocessing completed.') + print('NLP preprocessing completed.') merged = [wide, deep, nlp_encoded] - print(' %s combined wide, deep and nlp outputs successfully...' %keras_model_type) + print(' Combined wide, deep and nlp outputs successfully') nlp_inputs = list(nlp_inputs.values()) else: merged = [wide, deep] print(' %s combined wide and deep successfully...' 
%keras_model_type) - return nlp_inputs, meta_inputs, merged, embedding - elif keras_model_type.lower() == 'mixed_nlp': + ### if NLP_outputs is NOT a list, it means there is some NLP variable in the data set + if not isinstance(merged, list): + print('Shape of output from all preprocessing layers before model training = %s' %(merged.shape,)) + return nlp_inputs, meta_inputs, merged, embedding + elif keras_model_type.lower() in ['mixed_nlp', 'combined_nlp']: ### this is similar to auto models but uses TFHub models for NLP preprocessing ##### if len(NON_NLP_VARS) == 0: print(' Non-NLP vars is zero in this dataset. No tabular preprocesing needed...') meta_inputs = [] else: + ############################################################################################ + #### In "auto" vs. "mixed_nlp", the NLP processings are different. Numeric process is same. + ############################################################################################ + print(' Now combine all numeric and non-numeric vars into a Deep and Wide model...') #### Here both NLP and NON-NLP varas are combined with embedding to form a deep wide model # FEATURE_NAMES = left_subtract(FEATURE_NAMES, nlps) - dropout_rate = 0.1 + dropout_rate = 0.5 hidden_units = [dense_layer2, dense_layer3] inputs = create_fast_inputs(FEATURE_NAMES, NUMERIC_FEATURE_NAMES, FLOATS) #all_inputs = dict(zip(meta_names,meta_inputs)) @@ -259,20 +287,27 @@ def perform_preprocessing(train_ds, var_df, cat_vocab_dict, keras_model_type, wide = layers.BatchNormalization()(wide) deep = encode_all_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, vocab_dict, use_embedding=True) + deep = layers.BatchNormalization()(deep) meta_inputs = list(inputs.values()) ### convert input layers to a list + print(' All Non-NLP feature preprocessing completed.') #### If there are NLP vars in dataset, you use TFHub models in this case ## if len(nlps) > 0: print('Starting NLP string column layer preprocessing...') - nlp_inputs, embedding, nlp_names = preprocessing_nlp(train_ds, model_options, + nlp_inputs, embedding, nlp_names = mixed_preprocessing_nlp(train_ds, model_options, var_df, cat_vocab_dict, keras_model_type, verbose) ### we call nlp_outputs as embedding in this section of the program #### print(' NLP Preprocessing completed.') - print('There are no NLP variables in this dataset for preprocessing...') else: + print('There are no NLP variables in this dataset for preprocessing...') embedding = [] - meta_outputs = layers.concatenate([wide, deep]) - print(' %s model: combined wide, deep and NLP (with TFHub) successfully...' %keras_model_type) + if isinstance(embedding, list): + ### This means embedding is an empty list with nothing in it ### + meta_outputs = layers.concatenate([wide, deep]) + print(' Combined wide, deep layers successfully.') + else: + meta_outputs = layers.concatenate([wide, deep, embedding]) + print(' Combined wide, deep and NLP (with TFHub) successfully.') else: meta_inputs = [] ##### You need to send in the ouput from embedding layer to this sequence of layers #### @@ -348,13 +383,13 @@ def perform_preprocessing(train_ds, var_df, cat_vocab_dict, keras_model_type, print('There is no numeric or cat or int variables in this data set.') if isinstance(nlp_outputs, list): ### if NLP_outputs is a list, it means there is no NLP variable in the data set - print(' There is no NLP variable in this data set. Returning') + print('There is no NLP variable in this data set. 
Returning') consolidated_outputs = meta_outputs else: - print(' %s vector dimensions from NLP variable' %(nlp_outputs.shape,)) + print('Shape of encoded NLP variables just before training: %s' %(nlp_outputs.shape,)) consolidated_outputs = nlp_outputs else: - print(' Shape of output from numeric+integer+cat variables before model training = %s' %(meta_outputs.shape,)) + print('Shape of non-NLP encoded variables just before model training = %s' %(meta_outputs.shape,)) if isinstance(nlp_outputs, list): ### if NLP_outputs is a list, it means there is no NLP variable in the data set print(' There is no NLP variable in this data set. Continuing...') @@ -362,8 +397,72 @@ def perform_preprocessing(train_ds, var_df, cat_vocab_dict, keras_model_type, consolidated_outputs = meta_outputs else: ### if NLP_outputs is NOT a list, it means there is some NLP variable in the data set - print(' %s vector dimensions from NLP variable' %(nlp_outputs.shape,)) + print(' Shape of encoded NLP variables just before training: %s' %(nlp_outputs.shape,)) consolidated_outputs = layers.concatenate([nlp_outputs, meta_outputs]) print('Shape of output from all preprocessing layers before model training = %s' %(consolidated_outputs.shape,)) return nlp_inputs, meta_inputs, consolidated_outputs, nlp_outputs ########################################################################################## +def mixed_preprocessing_nlp(train_ds, model_options, + var_df, cat_vocab_dict, + keras_model_type, verbose=0): + """ + This is only for mixed NLP preprocessing of tabular and nlp datasets + """ + nlp_inputs = [] + all_nlp_encoded = [] + all_nlp_embeddings = [] + nlp_col_names = [] + nlp_columns = var_df['nlp_vars'] + nlp_columns = list(set(nlp_columns)) + + if len(nlp_columns) == 1: + nlp_column = nlp_columns[0] + elif keras_model_type.lower() == 'combined_nlp': + nlp_column = 'combined_nlp_text' ### this is when there are multiple nlp columns ## + else: + ### This is to keep nlp columns separate ### + nlp_column = '' + + #### Now perform NLP preproprocessing for each nlp_column ###### + ######### This is where we load Swivel model and process each nlp column ### + try: + bert_model_name = "Swivel-20" + if os.name == 'nt': + tfhub_path = os.path.join(keras_model_type, 'tf_cache') + os.environ['TFHUB_CACHE_DIR'] = tfhub_path + tfhub_handle_encoder = 'https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1' + else: + tfhub_handle_encoder = 'https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1' + hub_layer = hub.KerasLayer(tfhub_handle_encoder, + input_shape=[], + dtype=tf.string, + trainable=False, name="Swivel20_encoder") + print(f' {bert_model_name} selected from: {tfhub_handle_encoder}') + ### this is for mixed nlp models. You use Swivel to embed NLP columns fast #### + if len(nlp_columns) > 1: + copy_nlp_columns = copy.deepcopy(nlp_columns) + for each_nlp in copy_nlp_columns: + nlp_input = tf.keras.Input(shape=(), dtype=tf.string, name=each_nlp) + nlp_inputs.append(nlp_input) + x = hub_layer(nlp_input) + all_nlp_encoded.append(x) + nlp_col_names.append(each_nlp) + else: + nlp_input = tf.keras.Input(shape=(), dtype=tf.string, name=nlp_column) + x = hub_layer(nlp_input) + ### Now we combine all inputs and outputs in one place here ########### + nlp_inputs.append(nlp_input) + all_nlp_encoded.append(x) + nlp_col_names.append(nlp_column) + except: + print(' Error: Skipping %s for keras layer preprocessing...' %nlp_column) + ### we gather all outputs above into a single list here called all_features! 
+ if len(all_nlp_encoded) == 0: + print('There are no NLP string variables in this dataset to preprocess!') + elif len(all_nlp_encoded) == 1: + all_nlp_embeddings = all_nlp_encoded[0] + else: + all_nlp_embeddings = layers.concatenate(all_nlp_encoded) + + return nlp_inputs, all_nlp_embeddings, nlp_col_names +################################################################################# diff --git a/deep_autoviml/preprocessing/preprocessing_images.py b/deep_autoviml/preprocessing/preprocessing_images.py index f70b445..a5c4269 100644 --- a/deep_autoviml/preprocessing/preprocessing_images.py +++ b/deep_autoviml/preprocessing/preprocessing_images.py @@ -57,7 +57,7 @@ from tensorflow.keras.optimizers import SGD from tensorflow.keras import regularizers import tensorflow_hub as hub -import tensorflow_text as text + from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error from IPython.core.display import Image, display diff --git a/deep_autoviml/preprocessing/preprocessing_nlp.py b/deep_autoviml/preprocessing/preprocessing_nlp.py index 8990457..fc18fc2 100644 --- a/deep_autoviml/preprocessing/preprocessing_nlp.py +++ b/deep_autoviml/preprocessing/preprocessing_nlp.py @@ -60,7 +60,7 @@ from tensorflow.keras.optimizers import SGD from tensorflow.keras import regularizers import tensorflow_hub as hub -import tensorflow_text as text + from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error from IPython.core.display import Image, display @@ -123,7 +123,8 @@ def preprocessing_nlp(train_ds, model_options, var_df, cat_vocab_dict, keras_mod 'wide_and_deep','deep wide', 'wide deep', 'fast1', 'deep_and_cross', 'deep_cross', 'deep cross', 'fast2'] - max_tokens_zip, seq_tokens_zip, embed_tokens_zip, vocab_train_small = aggregate_nlp_dictionaries(nlp_columns, cat_vocab_dict, model_options) + max_tokens_zip, seq_tokens_zip, embed_tokens_zip, vocab_train_small = aggregate_nlp_dictionaries( + nlp_columns, cat_vocab_dict, model_options, verbose) if len(nlp_columns) == 1: nlp_column = nlp_columns[0] @@ -360,7 +361,7 @@ def encode_NLP_column(train_ds, nlp_column, nlp_input, vocab_size, sequence_leng #print(f" {nlp_column} vocab size = {vocab_size}, sequence_length={sequence_length}") return nlp_vectorized ################################################################################################ -def aggregate_nlp_dictionaries(nlp_columns, cat_vocab_dict, model_options): +def aggregate_nlp_dictionaries(nlp_columns, cat_vocab_dict, model_options, verbose=0): """ This function aggregates all the dictionaries you need for nlp processing. 
Just send in a list of nlp variables and a small data sample and it will compute all @@ -380,20 +381,24 @@ def aggregate_nlp_dictionaries(nlp_columns, cat_vocab_dict, model_options): if len(nlps_copy) > 0: vocab_train_small = [] for each_name in nlps_copy: - print('Creating aggregate_nlp_dictionaries for nlp column = %s' %each_name) + if verbose >= 2: + print('Creating aggregate_nlp_dictionaries for nlp column = %s' %each_name) max_tokens_zip[each_name] = cat_vocab_dict[each_name]['size_of_vocab'] print(' size of vocabulary = %s' %max_tokens_zip[each_name]) seq_tokens_zip[each_name] = cat_vocab_dict[each_name]['seq_length'] seq_lengths.append(seq_tokens_zip[each_name]) - print(' sequence length = %s' %seq_tokens_zip[each_name]) + if verbose >= 2: + print(' sequence length = %s' %seq_tokens_zip[each_name]) vocab_size = cat_vocab_dict[each_name]['size_of_vocab'] vocab_train_small += cat_vocab_dict[each_name]['vocab'] vocab_train_small = np.unique(vocab_train_small).tolist() - best_embedding_size = closest(lst, vocab_size//4000) - print(' recommended embedding_size = %s' %best_embedding_size) + best_embedding_size = closest(lst, vocab_size//50000) + if verbose >= 2: + print(' recommended embedding_size = %s' %best_embedding_size) input_embedding_size = check_model_options(model_options, "embedding_size", best_embedding_size) if input_embedding_size != best_embedding_size: - print(' input embedding size given as %d. Overriding recommended embedding_size...' %input_embedding_size) + if verbose >= 2: + print(' input embedding size given as %d. Overriding recommended embedding_size...' %input_embedding_size) best_embedding_size = input_embedding_size embed_tokens_zip[each_name] = best_embedding_size return max_tokens_zip, seq_tokens_zip, embed_tokens_zip, vocab_train_small diff --git a/deep_autoviml/preprocessing/preprocessing_tabular.py b/deep_autoviml/preprocessing/preprocessing_tabular.py index 46c82af..7c260b0 100644 --- a/deep_autoviml/preprocessing/preprocessing_tabular.py +++ b/deep_autoviml/preprocessing/preprocessing_tabular.py @@ -327,7 +327,7 @@ def preprocessing_tabular(train_ds, var_df, cat_feature_cross_flag, model_option except: print(' Error: Skipping %s since Keras Bolean preprocessing is erroring' %each_bool) - ###### This is where we handle Boolean Integer variables - we just combine them ################## + ###### This is where we handle Boolean + Integer variables - we just combine them ################## int_bools_copy = copy.deepcopy(int_bools) if len(int_bools_copy) > 0: for each_int in int_bools_copy: @@ -361,16 +361,24 @@ def preprocessing_tabular(train_ds, var_df, cat_feature_cross_flag, model_option else: nums_bin = max(20, int(max_tokens_zip[each_int]/40)) int_input = keras.Input(shape=(1,), name=each_int, dtype="int32") - encoded = encode_any_integer_to_hash_categorical(int_input, each_int, - train_ds, nums_bin) + if (max_tokens_zip[each_int] >= high_cats_alert): + encoded = encode_any_integer_to_hash_categorical(int_input, each_int, + train_ds, nums_bin) + if verbose: + print(' %s encoded: %d categories, %d bins. After integer HASH encoding shape = %s' %(each_int, + max_tokens_zip[each_int], nums_bin, encoded.shape[1])) + else: + encoded = encode_categorical_and_integer_features(int_input, each_int, + train_ds, is_string=False) + if verbose: + print(' %s encoded: %d categories. 
After integer encoding shape: %s' %(each_int, + max_tokens_zip[each_int], encoded.shape[1])) all_int_inputs.append(int_input) all_int_encoded.append(encoded) all_input_names.append(each_int) if verbose: - print(' %s number of categories = %d and bins = %d: after integer hash encoding shape: %s' %(each_int, - max_tokens_zip[each_int], nums_bin, encoded.shape[1])) - if (encoded.shape[1] >= high_cats_alert) or (max_tokens_zip[each_int] >= high_cats_alert): - print(' Alert! excessive feature trap. Should this not be a float variable?? %s' %each_int) + if (encoded.shape[1] >= high_cats_alert): + print(' High Dims Alert! Convert %s to float??' %each_int) except: print(' Error: Skipping %s since Keras Integer preprocessing erroring' %each_int) @@ -384,16 +392,19 @@ def preprocessing_tabular(train_ds, var_df, cat_feature_cross_flag, model_option int_input = keras.Input(shape=(1,), name=each_int, dtype="int32") cat_input_dict[each_int] = int_input vocab = max_tokens_zip[each_int] - encoded = encode_integer_to_categorical_feature(int_input, each_int, - train_ds, vocab) + #encoded = encode_integer_to_categorical_feature(int_input, each_int, + # train_ds, vocab) + encoded = encode_categorical_and_integer_features(int_input, each_int, + train_ds, is_string=False) all_int_cat_inputs.append(int_input) all_int_cat_encoded.append(encoded) all_input_names.append(each_int) if verbose: - print(' %s number of categories = %d: after integer categorical encoding shape: %s' %( - each_int, len(vocab), encoded.shape[1])) + print(' %s encoded: %d categories. After integer encoding shape: %s' %(each_int, + len(vocab), encoded.shape[1])) if encoded.shape[1] > high_cats_alert: - print(' Alert! excessive feature dimension created. Check if necessary to have this many.') + if verbose: + print(' High Dims Alert! Convert %s to float??' %each_int) except: print(' Error: Skipping %s since Keras Integer Categorical preprocessing erroring' %each_int) @@ -408,8 +419,10 @@ def preprocessing_tabular(train_ds, var_df, cat_feature_cross_flag, model_option cat_input_dict[each_cat] = cat_input vocab = max_tokens_zip[each_cat] max_tokens = len(vocab) - cat_encoded = encode_string_categorical_feature_categorical(cat_input, each_cat, - train_ds, vocab) + cat_encoded = encode_categorical_and_integer_features(cat_input, each_cat, + train_ds, is_string=True) + #cat_encoded = encode_string_categorical_feature_categorical(cat_input, each_cat, + # train_ds, vocab) all_cat_inputs.append(cat_input) all_cat_encoded.append(cat_encoded) cat_encoded_dict[each_cat] = cat_encoded @@ -418,7 +431,8 @@ def preprocessing_tabular(train_ds, var_df, cat_feature_cross_flag, model_option print(' %s number of categories = %d: after string to categorical encoding shape: %s' %( each_cat, max_tokens, cat_encoded.shape[1])) if cat_encoded.shape[1] > high_cats_alert: - print(' Alert! excessive feature dimension created. Check if necessary to have this many.') + if verbose: + print(' High Dims Alert! Convert %s to float??' 
%each_cat) except: print(' Error: Skipping %s since Keras Categorical preprocessing erroring' %each_cat) @@ -487,9 +501,9 @@ def preprocessing_tabular(train_ds, var_df, cat_feature_cross_flag, model_option all_num_encoded.append(encoded) num_only_encoded.append(encoded) all_input_names.append(each_num) - print(' %s numeric column left as is for feature preprocessing' %each_num) + print(' %s numeric column left as is since it is a float' %each_num) except: - print(' Error: Skipping %s since Keras Float preprocessing erroring' %each_num) + print(' Error: Skipping %s due to Keras float preprocessing error' %each_num) # Latitude and Longitude Numerical features are Binned first and then Category Encoded ####### @@ -617,9 +631,16 @@ def preprocessing_tabular(train_ds, var_df, cat_feature_cross_flag, model_option meta_input_categ1 = all_low_cat_encoded[0] meta_categ1 = layers.Dense(concat_layer_neurons, kernel_initializer=concat_kernel_initializer)(meta_input_categ1) else: - meta_input_categ1 = layers.concatenate(all_low_cat_encoded) - #WIDE - This Dense layer connects to input layer - Categorical Data - meta_categ1 = layers.Dense(concat_layer_neurons, kernel_initializer=concat_kernel_initializer)(meta_input_categ1) + int_list = [x for x in all_low_cat_encoded if x.dtype in [np.int8, np.int16, np.int32, np.int64]] + float_list = [ x for x in all_low_cat_encoded if x.dtype in [np.float32, np.float64]] + if len(float_list) == len(all_low_cat_encoded): + ### All of them are floats ### + all_high_cat_encoded += float_list + else: + meta_input_categ1 = layers.concatenate(int_list) + all_high_cat_encoded += float_list + #WIDE - This Dense layer connects to input layer - Categorical Data + meta_categ1 = layers.Dense(concat_layer_neurons, kernel_initializer=concat_kernel_initializer)(meta_input_categ1) skip_meta_categ2 = False if len(all_high_cat_encoded) == 0: @@ -779,6 +800,22 @@ def encode_binning_numeric_feature_categorical(feature, name, dataset, bins_lat, return encoded_feature ########################################################################################### +def encode_categorical_and_integer_features(feature, name, dataset, is_string): + lookup_class = StringLookup if is_string else IntegerLookup + # Create a lookup layer that maps raw categories (strings or integers) to vocabulary indices + lookup = lookup_class(output_mode="binary") + + # Prepare a Dataset that only yields our feature + feature_ds = dataset.map(lambda x, y: x[name]) + feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1)) + + # Learn the set of possible values and assign each a fixed integer index + lookup.adapt(feature_ds) + + # Encode the input as a multi-hot binary vector over the learned vocabulary + encoded_feature = lookup(feature) + return encoded_feature +############################################################################## def encode_string_categorical_feature_categorical(feature_input, name, dataset, vocab): """ Inputs: @@ -796,7 +833,7 @@ def encode_string_categorical_feature_categorical(feature_input, name, dataset, Outputs: ----------- encoded_feature: a keras.Tensor. You can use this tensor in keras models for training. - The Tensor has a shape of (None, 1) - None indicates that it has not been + The Tensor has a shape of (None, 1) - None indicates that it is not batched. When the output_mode = "binary" or "count", the output is in float otherwise it is integer.
""" extra_oov = 3 @@ -1076,7 +1113,8 @@ def encode_any_feature_to_embed_categorical(feature_input, name, dataset, vocabu # Learn the set of possible string values and assign them a fixed integer index #lookup.adapt(feature_ds) encoded_feature = lookup(feature_input) - embedding_dims = int(math.sqrt(len(vocabulary))) + #embedding_dims = int(math.sqrt(len(vocabulary))) + embedding_dims = int(max(2, math.log(len(vocabulary), 2))) # Create an embedding layer with the specified dimensions. embedding = tf.keras.layers.Embedding( input_dim=len(vocabulary)+extra_oov, output_dim=embedding_dims @@ -1281,18 +1319,32 @@ def encode_auto_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, CATEGORICAL_FE numeric_encoded = [] text_encoded = [] encoded_features = [] - + #### In "auto" model, "wide" part is short. Hence we use "count" with "embedding" flag. for feature_name in inputs: vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name] extra_oov = 3 if feature_name in CATEGORICAL_FEATURE_NAMES: cat_encoded.append('') cat_len = len(vocabulary) - encoded_feature = inputs[feature_name] - encoded_feature = tf.keras.layers.experimental.preprocessing.StringLookup( - vocabulary=vocabulary, mask_token=None, oov_token = '~UNK~')(encoded_feature) - cat_encoded[-1] = tf.keras.layers.experimental.preprocessing.CategoryEncoding( - num_tokens = cat_len + 1)(encoded_feature) + lookup = StringLookup(vocabulary=vocabulary, + mask_token=None, + oov_token = '~UNK~') + if len(vocabulary) > 32: + # Convert the string input values into integer indices. + encoded_feature = inputs[feature_name] + encoded_feature = lookup(encoded_feature) + embedding_dims = int(max(2, math.log(len(vocabulary), 2))) + # Create an embedding layer with the specified dimensions. + embedding = Embedding( + input_dim=len(vocabulary)+extra_oov, output_dim=embedding_dims + ) + # Convert the index values to embedding representations. + encoded_feature = embedding(encoded_feature) + cat_encoded[-1] = Flatten()(encoded_feature) + else: + encoded_feature = inputs[feature_name] + encoded_feature = lookup(encoded_feature) + cat_encoded[-1] = CategoryEncoding(num_tokens = cat_len + 1)(encoded_feature) elif feature_name in FLOATS: ### you just ignore the floats in cross models #### numeric_encoded.append('') @@ -1303,7 +1355,7 @@ def encode_auto_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, CATEGORICAL_FE else: cat_encoded.append('') if len(vocabulary) > 100: - print(' ALERT! Excessive feature dimension of %s. Should %s be a float variable?' %( + print(' ALERT! Excessive dimensions in %s. Should integer %s be a float variable?' %( len(vocabulary), feature_name)) use_embedding = True lookup = IntegerLookup( @@ -1333,7 +1385,7 @@ def encode_fast_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, CATEGORICAL_FE # Create a lookup to convert string values to an integer indices. # Since we are not using a mask token but expecting some out of vocabulary # (oov) tokens, we set mask_token to None and num_oov_indices to extra_oov. - if len(vocabulary) > 50: + if len(vocabulary) > 32: use_embedding = True lookup = StringLookup( vocabulary=vocabulary, @@ -1346,7 +1398,8 @@ def encode_fast_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, CATEGORICAL_FE # Convert the string input values into integer indices. 
encoded_feature = inputs[feature_name] encoded_feature = lookup(encoded_feature) - embedding_dims = int(math.sqrt(len(vocabulary))) + #embedding_dims = int(math.sqrt(len(vocabulary))) + embedding_dims = int(max(2, math.log(len(vocabulary), 2))) # Create an embedding layer with the specified dimensions. embedding = layers.Embedding( input_dim=len(vocabulary)+extra_oov, output_dim=embedding_dims @@ -1365,7 +1418,7 @@ def encode_fast_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, CATEGORICAL_FE encoded_feature = normalizer(inputs[feature_name]) else: if len(vocabulary) > 100: - print(' ALERT! Excessive feature dimension of %s. Should %s be a float variable?' %( + print(' ALERT! Excessive feature dimension in %s. Should %s be a float variable?' %( len(vocabulary), feature_name)) use_embedding = True lookup = IntegerLookup( @@ -1374,7 +1427,7 @@ def encode_fast_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, CATEGORICAL_FE num_oov_indices=extra_oov, max_tokens=None, oov_token=-9999, - output_mode="count" if not use_embedding else "binary", + output_mode="count" if use_embedding else "binary", ) # Use the numerical features as-is. encoded_feature = inputs[feature_name] @@ -1407,8 +1460,9 @@ def encode_nlp_inputs(inputs, CATEGORICAL_FEATURES_WITH_VOCABULARY): for feature_name in inputs: vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name]['vocab'] extra_oov = 50 - vocab_size = int(math.sqrt(len(vocabulary))) - best_embedding_size = closest(list_embedding_sizes, vocab_size//4000) + #vocab_size = int(math.sqrt(len(vocabulary))) + #best_embedding_size = closest(list_embedding_sizes, vocab_size//4000) + best_embedding_size = int(max(2, math.log(len(vocabulary), 2))) lookup = StringLookup( vocabulary=vocabulary, @@ -1483,7 +1537,7 @@ def encode_num_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, CATEGORICAL_FEA #################################################################################################### def encode_all_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, CATEGORICAL_FEATURES_WITH_VOCABULARY, use_embedding=False): - + #### This is a new version intended to reduce dimensions ################# encoded_features = [] for feature_name in inputs: vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name] @@ -1492,7 +1546,7 @@ def encode_all_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, CATEGORICAL_FEA # Create a lookup to convert string values to an integer indices. # Since we are not using a mask token but expecting some out of vocabulary # (oov) tokens, we set mask_token to None and num_oov_indices to extra_oov. - if len(vocabulary) > 50: + if len(vocabulary) > 32: use_embedding = True lookup = StringLookup( vocabulary=vocabulary, @@ -1505,7 +1559,7 @@ def encode_all_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, CATEGORICAL_FEA # Convert the string input values into integer indices. encoded_feature = inputs[feature_name] encoded_feature = lookup(encoded_feature) - embedding_dims = int(math.sqrt(len(vocabulary))) + embedding_dims = int(max(2, math.log(len(vocabulary), 2))) # Create an embedding layer with the specified dimensions. 
embedding = layers.Embedding( input_dim=len(vocabulary)+extra_oov, output_dim=embedding_dims @@ -1525,8 +1579,24 @@ def encode_all_inputs(inputs, CATEGORICAL_FEATURE_NAMES, FLOATS, CATEGORICAL_FEA encoded_feature = normalizer(inputs[feature_name]) #encoded_feature = inputs[feature_name] encoded_features.append(encoded_feature) + ################### + int_list = [x for x in encoded_features if x.dtype in [np.int8, np.int16, np.int32, np.int64]] + float_list = [ x for x in encoded_features if x.dtype in [np.float32, np.float64]] + if len(int_list) > 0: + all_int_features = layers.concatenate(int_list) + meta_int1 = layers.Dense(32)(all_int_features) + if len(float_list) > 0: + all_float_features = layers.concatenate(float_list) + meta_float1 = layers.Dense(32)(all_float_features) + #### You can add a Dense layer if needed here ########### + if len(int_list) > 0: + if len(float_list) > 0: + all_features = layers.concatenate([meta_int1, meta_float1]) + else: + all_features = layers.concatenate([meta_int1]) + else: + all_features = layers.concatenate([meta_float1]) ##### This is where are float encoded features are combined ### - all_features = layers.concatenate(encoded_features) return all_features ################################################################################ from itertools import combinations diff --git a/deep_autoviml/preprocessing/preprocessing_text.py b/deep_autoviml/preprocessing/preprocessing_text.py index 4f5cbe1..443c7a5 100644 --- a/deep_autoviml/preprocessing/preprocessing_text.py +++ b/deep_autoviml/preprocessing/preprocessing_text.py @@ -57,7 +57,7 @@ from tensorflow.keras.optimizers import SGD from tensorflow.keras import regularizers import tensorflow_hub as hub -import tensorflow_text as text + from sklearn.metrics import roc_auc_score, mean_squared_error, mean_absolute_error from IPython.core.display import Image, display diff --git a/deep_autoviml/utilities/__pycache__/utilities.cpython-38.pyc b/deep_autoviml/utilities/__pycache__/utilities.cpython-38.pyc index 14a9697..076be18 100644 Binary files a/deep_autoviml/utilities/__pycache__/utilities.cpython-38.pyc and b/deep_autoviml/utilities/__pycache__/utilities.cpython-38.pyc differ diff --git a/deep_autoviml/utilities/utilities.py b/deep_autoviml/utilities/utilities.py index ec1a952..f12dc39 100644 --- a/deep_autoviml/utilities/utilities.py +++ b/deep_autoviml/utilities/utilities.py @@ -913,6 +913,7 @@ def get_callbacks(val_mode, val_monitor, patience, learning_rate, save_weights_o callbacks_dict['tensor_board'] = tb callbacks_dict['print'] = pr callbacks_dict['reducer'] = rlr + callbacks_dict['rlr'] = rlr callbacks_dict['decay'] = lr_decay_cb return callbacks_dict, tensorboard_logpath @@ -925,14 +926,14 @@ def get_chosen_callback(callbacks_dict, keras_options): lr_scheduler = callbacks_dict['onecycle2'] elif keras_options['lr_scheduler'] == 'onecycle': lr_scheduler = callbacks_dict['onecycle'] - elif keras_options['lr_scheduler'] == 'reducer': + elif keras_options['lr_scheduler'] in ['reducer', 'rlr']: lr_scheduler = callbacks_dict['reducer'] elif keras_options['lr_scheduler'] == 'decay': lr_scheduler = callbacks_dict['decay'] elif keras_options['lr_scheduler'] == "scheduler": lr_scheduler = callbacks_dict['scheduler'] else: - lr_scheduler = callbacks_dict['scheduler'] + lr_scheduler = callbacks_dict['rlr'] return lr_scheduler ################################################################################################ def get_chosen_callback2(callbacks_dict, keras_options): @@ -948,8 +949,8 @@ def 
get_chosen_callback2(callbacks_dict, keras_options): elif keras_options['lr_scheduler'] == 'decay': lr_scheduler = callbacks_dict['lr_decay_cb'] else: - lr_scheduler = callbacks_dict['lr_sched'] - keras_options['lr_scheduler'] = "lr_sched" + lr_scheduler = callbacks_dict['rlr'] + keras_options['lr_scheduler'] = "rlr" return lr_scheduler ################################################################################################ import math diff --git a/dist/deep_autoviml-0.0.82-py3-none-any.whl b/dist/deep_autoviml-0.0.82-py3-none-any.whl new file mode 100644 index 0000000..662cc54 Binary files /dev/null and b/dist/deep_autoviml-0.0.82-py3-none-any.whl differ diff --git a/dist/deep_autoviml-0.0.82.tar.gz b/dist/deep_autoviml-0.0.82.tar.gz new file mode 100644 index 0000000..4cd77ca Binary files /dev/null and b/dist/deep_autoviml-0.0.82.tar.gz differ diff --git a/requirements.txt b/requirements.txt index 5a4983c..11c4029 100644 --- a/requirements.txt +++ b/requirements.txt @@ -3,9 +3,9 @@ jupyter tensorflow==2.7.0 keras=2.7.0 pandas -numpy==1.19.2 +numpy~=1.19.2 matplotlib -scikit-learn>=0.23.1 +scikit-learn>=0.23.1, <=0.24.2 regex storm-tuner>=0.0.8 emoji diff --git a/setup.py b/setup.py index ac578a5..9d4fb48 100644 --- a/setup.py +++ b/setup.py @@ -20,7 +20,7 @@ setuptools.setup( name="deep_autoviml", - version="0.0.77", + version="0.0.83", author="Ram Seshadri", # author_email="author@example.com", description="Automatically Build Deep Learning Models and Pipelines fast!", @@ -40,17 +40,17 @@ install_requires=[ "ipython", "jupyter", - "tensorflow==2.5.2", + "tensorflow~=2.5", "pandas", "matplotlib", - "numpy==1.19.2", - "scikit-learn>=0.23.1", + "numpy~=1.19.2", + "scikit-learn>=0.23.1, <=0.24.2", "regex", "emoji", "storm-tuner>=0.0.8", "optuna", - "tensorflow_hub==0.12.0", - "tensorflow-text==2.5.0", + "tensorflow_hub~=0.12.0", + "tensorflow-text~=2.5", "xlrd" ], classifiers=[