Data utils #3 (Draft)

Wants to merge 5 commits into base: main
3 changes: 2 additions & 1 deletion .gitignore
@@ -1 +1,2 @@
__pycache__/
__pycache__/
venv*/
1 change: 1 addition & 0 deletions chatgpt-responsible-ai/.gitignore
@@ -0,0 +1 @@
venv_*/
14 changes: 14 additions & 0 deletions chatgpt-responsible-ai/README.md
@@ -1,5 +1,19 @@
# Using ChatGPT to build a (responsible) classification model

## Summary of Learnings / Takeaways

[AI Fairness 360](https://github.com/Trusted-AI/AIF360) is a really great (and well documented!) package.

I think I'm going to start having the AI generate this kind of boilerplate ML pipeline code whenever possible (though I 100% don't trust it with anything company proprietary). It did a nice job constructing the scikit-learn pipeline-style code, although at the time of writing I have yet to test it out and make sure it works.

As someone without a PhD in ML, it's SUPER nice to be able to abstract away some of this knowledge. I think I know what to ask, but these techniques are intimidating to approach and seem too complex, so I tend to assume I shouldn't use them. A big benefit of creating tooling is that it reduces the "activation energy" of good practices, which leads to higher adoption (an obvious statement, but worth calling out).

Also, whoever is reading this, I'd love to have your feedback on aif360...are there better alternatives? I might not know what I don't know in this space. Drop an idea in the repo issues or discussions, or contact me on my socials: [Twitter @BrentBrewington](https://twitter.com/BrentBrewington) / [LinkedIn](https://linkedin.com/in/BrentBrewington)

## Entire conversation

Conversation details are saved in [chatgpt_conversation.md](chatgpt_conversation.md) - here's how I grabbed them:

1. Have a conversation with ChatGPT
2. Export the conversation to text via the Chrome extension [Export ChatGPT Conversation](https://chrome.google.com/webstore/detail/export-chatgpt-conversati/clgidpflecgaaabfcmdmkcgebpbadgoc?hl=en-US) by [[email protected]](mailto:[email protected]) --> saved in [chatgpt_conversation.md](chatgpt_conversation.md)
3. Reformat it so it generally mirrors how it looks in ChatGPT (I wonder if there's a better extension for this, e.g. one that handles markdown tables?)
8 changes: 5 additions & 3 deletions chatgpt-responsible-ai/chatgpt_conversation.md
@@ -7,9 +7,11 @@ __________________________

Brent: write a scikit-learn pipeline using XGBoost that can ingest data from a BigQuery table, preprocess the data, and fit and hyperparameter tune a classification model

ChatGPT: Sure, I can help you with that. Here's an example Scikit-learn pipeline that uses XGBoost to train a classification model on data from a BigQuery table. This pipeline will preprocess the data, fit and hyperparameter tune the model, and output evaluation metrics:pythonCopy code# Import necessary libraries
ChatGPT: Sure, I can help you with that. Here's an example Scikit-learn pipeline that uses XGBoost to train a classification model on data from a BigQuery table. This pipeline will preprocess the data, fit and hyperparameter tune the model, and output evaluation metrics:


```python
# Import necessary libraries
import pandas as pd
import numpy as np
from google.cloud import bigquery
@@ -83,7 +85,7 @@ In this pipeline, we first use BigQuery to retrieve the data from a specified ta

Brent: now build in a way to visualize shap values

ChatGPT: Sure, I can help you with that. Here's an updated version of the pipeline that includes SHAP value visualization:pythonCopy code
ChatGPT: Sure, I can help you with that. Here's an updated version of the pipeline that includes SHAP value visualization:

```python
# Import necessary libraries
@@ -304,7 +306,7 @@ Note that in this example, we assume that the preprocess_data function is encaps

Brent: now, show me the preprocess_data function

ChatGPT: Sure, here's an example of a preprocess_data function that performs some basic preprocessing steps:pythonCopy code
ChatGPT: Sure, here's an example of a preprocess_data function that performs some basic preprocessing steps:

```python
import pandas as pd
30 changes: 30 additions & 0 deletions chatgpt-responsible-ai/classification_model.ipynb
@@ -0,0 +1,30 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11.1"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "5c7b89af1651d0b8571dde13640ecdccf7d5a6204171d6ab33e7c296e100e08a"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
21 changes: 21 additions & 0 deletions chatgpt-responsible-ai/requirements.txt
@@ -0,0 +1,21 @@
# Note: pinning with "~=" plus a ".0" patch level means "compatible release":
# pip can grab the latest bugfix while staying on the pinned major.minor version
# (e.g. pandas~=1.5.0 allows any 1.5.x but not 1.6)

# Developer Tools
ipython~=8.11.0
jupyter~=1.0.0

# General Tools
joblib~=1.2.0

# Data Tools
pandas~=1.5.0
numpy~=1.24.0
google-cloud-bigquery~=3.6.0

# ML Tools
scikit-learn~=1.2.0
aif360~=0.5.0
xgboost~=1.7.0
shap~=0.41.0
yellowbrick~=1.5.0
18 changes: 18 additions & 0 deletions chatgpt-responsible-ai/setup.sh
@@ -0,0 +1,18 @@
# Change working directory into git_repo_root/chatgpt-responsible-ai
cd $(git rev-parse --show-toplevel)/chatgpt-responsible-ai

# Create & Activate Virtual Environment named "venv_cg_rai"
python -m venv venv_cg_rai
source venv_cg_rai/bin/activate

# Install stuff
pip install --upgrade pip
pip install ipykernel~=6.13.0
pip install -r requirements.txt

# On Brent's Mac this step hit a weird error, so it's left commented out and .py files are used instead of .ipynb
# Error: (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))
# Tried these and they didn't work:
# https://stackoverflow.com/questions/72619143/unable-to-import-psutil-on-m1-mac-with-miniforge-mach-o-file-but-is-an-incomp
# https://stackoverflow.com/questions/71882029/mach-o-file-but-is-an-incompatible-architecture-have-arm64-need-x86-64-i
# ipython kernel install --name "venv_cg_rai" --user
28 changes: 28 additions & 0 deletions chatgpt-responsible-ai/src/ai_fairness.py
@@ -0,0 +1,28 @@
from aif360.metrics import ClassificationMetric
from sklearn.metrics import classification_report, confusion_matrix

def monitor_for_bias(model, test_data, privileged_group, unprivileged_group):
# Make predictions on the test data using the model
y_pred = model.predict(test_data.drop('churn', axis=1))

# Calculate the confusion matrix and classification report
tn, fp, fn, tp = confusion_matrix(test_data['churn'], y_pred).ravel()

# Calculate the equal opportunity difference using the AIF360 package
cm = ClassificationMetric(test_data['churn'], y_pred,
privileged_groups=[{privileged_group: 1}],
unprivileged_groups=[{unprivileged_group: 1}])
eod = cm.equal_opportunity_difference()

return {'confusion_matrix': {'TN': tn, 'FP': fp, 'FN': fn, 'TP': tp},
'classification_report': classification_report(test_data['churn'], y_pred),
'equal_opportunity_difference': eod}

def choose_features(df, excluded_features):
# Define a list of all features in the dataset
all_features = df.columns.tolist()

# Remove the excluded features from the list
features_to_include = [f for f in all_features if f not in excluded_features]

return features_to_include
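Reviewer note on `monitor_for_bias`: aif360's `ClassificationMetric` is constructed from `BinaryLabelDataset` objects rather than raw label arrays, so passing `test_data['churn']` and `y_pred` directly will likely fail. Below is a minimal sketch of the expected pattern; `equal_opportunity_difference`, `test_df`, `label_col`, and `protected_attr` are illustrative names, and it assumes `test_df` contains only numeric columns with the protected attribute coded 1 (privileged) / 0 (unprivileged).

```python
import numpy as np
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import ClassificationMetric

def equal_opportunity_difference(test_df, y_pred, label_col, protected_attr):
    # Wrap the true labels in a BinaryLabelDataset, as aif360 expects
    dataset_true = BinaryLabelDataset(
        df=test_df,
        label_names=[label_col],
        protected_attribute_names=[protected_attr],
    )

    # Copy the dataset and swap in the predicted labels
    dataset_pred = dataset_true.copy()
    dataset_pred.labels = np.asarray(y_pred).reshape(-1, 1).astype(float)

    metric = ClassificationMetric(
        dataset_true, dataset_pred,
        privileged_groups=[{protected_attr: 1}],
        unprivileged_groups=[{protected_attr: 0}],
    )
    return metric.equal_opportunity_difference()
```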
14 changes: 14 additions & 0 deletions chatgpt-responsible-ai/src/get_data.py
@@ -0,0 +1,14 @@
from google.cloud import bigquery
import pandas as pd

def read_from_bigquery(project_id, dataset_id, table_id):
# Initialize a BigQuery client
client = bigquery.Client(project=project_id)

# Read the data from BigQuery into a Pandas DataFrame
# (build the fully qualified table name directly; Client.dataset() is deprecated)
df = client.query(f"SELECT * FROM `{project_id}.{dataset_id}.{table_id}`").to_dataframe()

return df
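A quick usage sketch for `read_from_bigquery`; the project, dataset, and table names below are placeholders, not values from this repo.

```python
from get_data import read_from_bigquery

# Placeholder identifiers -- substitute a real project/dataset/table
df = read_from_bigquery(
    project_id="my-gcp-project",
    dataset_id="analytics",
    table_id="customer_churn",
)
print(df.shape)
```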
76 changes: 76 additions & 0 deletions chatgpt-responsible-ai/src/pipeline.py
@@ -0,0 +1,76 @@
import pandas as pd
import numpy as np
from google.cloud import bigquery
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
import shap

# Initialize BigQuery client
client = bigquery.Client()

# Define SQL query to retrieve data from BigQuery
query = """
SELECT *
FROM your_project.your_dataset.your_table
"""

# Retrieve data from BigQuery and convert to Pandas dataframe
df = client.query(query).to_dataframe()

# Define features to be used in the model
numeric_features = ['feature1', 'feature2', ...]
categorical_features = ['feature3', 'feature4', ...]
target = 'target_variable'

# Define preprocessing steps
numeric_transformer = Pipeline(steps=[
('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])

# Define XGBoost model
model = xgb.XGBClassifier()

# Define hyperparameter tuning grid
param_grid = {
'learning_rate': [0.1, 0.01],
'max_depth': [3, 5],
'n_estimators': [100, 500],
}

# Define Scikit-learn pipeline
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('model', model)
])

# Define GridSearchCV to tune hyperparameters
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)

# Fit model and tune hyperparameters
X = df.drop(columns=[target])
y = df[target]
grid_search.fit(X, y)

# Output evaluation metrics
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

# Generate SHAP values for each instance (SHAP can't explain a full sklearn
# Pipeline directly, so transform the features and explain the fitted XGBoost step)
best_pipeline = grid_search.best_estimator_
X_transformed = best_pipeline.named_steps['preprocessor'].transform(X)
if hasattr(X_transformed, 'toarray'):  # densify if one-hot encoding produced a sparse matrix
    X_transformed = X_transformed.toarray()
explainer = shap.Explainer(best_pipeline.named_steps['model'])
shap_values = explainer(X_transformed)

# Plot the SHAP values for each feature
shap.summary_plot(shap_values, X_transformed, plot_type='bar')
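Reviewer note: pipeline.py tunes and reports the best cross-validation score on the full dataset; holding out a test set the grid search never sees gives a less optimistic estimate. A minimal sketch, reusing the `X`, `y`, and `grid_search` defined above:

```python
from sklearn.model_selection import train_test_split

# Keep a held-out test set, tune on the training split, then report held-out accuracy
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Held-out accuracy:", grid_search.score(X_test, y_test))
```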
25 changes: 25 additions & 0 deletions chatgpt-responsible-ai/src/preprocessing.py
@@ -0,0 +1,25 @@
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

def preprocess_data(df):
# Define numeric and categorical features
numeric_features = ['feature1', 'feature2', ...]
categorical_features = ['feature3', 'feature4', ...]

# Impute missing values in numeric features using mean imputation
numeric_transformer = SimpleImputer(strategy='mean')
transformed_numeric = numeric_transformer.fit_transform(df[numeric_features])

# One-hot encode categorical features
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
transformed_categorical = categorical_transformer.fit_transform(df[categorical_features])

# Combine numeric and categorical features
transformed_features = pd.concat([pd.DataFrame(transformed_numeric), pd.DataFrame(transformed_categorical.toarray())], axis=1)

# Get target variable
target = df['target_variable']

return transformed_features, target, numeric_features, categorical_features
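Reviewer note: `preprocess_data` imports `ColumnTransformer` but never uses it, and fitting the imputer/encoder on the whole dataset leaks information from test rows into preprocessing. A hedged alternative (`build_preprocessor` is an illustrative name, not part of this PR) is to return an unfitted `ColumnTransformer` and let it be fit inside the model pipeline:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

def build_preprocessor(numeric_features, categorical_features):
    # Bundle imputation and encoding so they are fit only on the training folds
    return ColumnTransformer(transformers=[
        ('num', SimpleImputer(strategy='mean'), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
    ])
```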
44 changes: 44 additions & 0 deletions chatgpt-responsible-ai/src/visualize.py
@@ -0,0 +1,44 @@
# Import necessary libraries
import pandas as pd
import numpy as np
import xgboost as xgb
import joblib
from yellowbrick.classifier import ConfusionMatrix, ROCAUC, ClassificationReport, ClassPredictionError, DiscriminationThreshold
from yellowbrick.features import Rank2D
from preprocessing import preprocess_data

# Load trained model from .pkl file
model = joblib.load('path/to/trained/model.pkl')

# Load data from parquet file
df = pd.read_parquet('path/to/data.parquet')

# Preprocess data
X, y, numeric_features, categorical_features = preprocess_data(df)

# Instantiate visualizers
cm = ConfusionMatrix(model)
roc_auc = ROCAUC(model)
cr = ClassificationReport(model)
cpe = ClassPredictionError(model)
dt = DiscriminationThreshold(model)
r2d = Rank2D(features=numeric_features)

# Visualize data
cm.fit(X, y)
cm.show()

roc_auc.fit(X, y)
roc_auc.show()

cr.fit(X, y)
cr.show()

cpe.fit(X, y)
cpe.show()

dt.fit(X, y)
dt.show()

r2d.fit_transform(X)
r2d.show()
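Reviewer note: the yellowbrick visualizers above are fit but never scored, and `score()` is usually what actually draws the plot; they also evaluate on the same data they were fit on. A minimal sketch of the typical flow for one visualizer, assuming the same `model`, `X`, and `y` as in visualize.py:

```python
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ConfusionMatrix

# Fit on a training split, then score() on a held-out split -- score() renders the plot
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
cm = ConfusionMatrix(model)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)
cm.show()
```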
1 change: 1 addition & 0 deletions data_utils/.gitignore
@@ -0,0 +1 @@
venv*/
2 changes: 2 additions & 0 deletions data_utils/README.md
@@ -0,0 +1,2 @@
# data_utils

61 changes: 61 additions & 0 deletions data_utils/data_profiler/dataset.py
@@ -0,0 +1,61 @@
from gcp_client import GcpClient
from google.cloud import bigquery, storage
import pandas as pd
from pathlib import Path

SAMPLE_SPEC = {"max_rows": 10000,
"tablesample_pct": 0.1}

def bigquery_query_runner(query, gcp_client):
"""
Args:
query (str): BigQuery query string
gcp_client (GcpClient): GcpClient object
"""
query_job = gcp_client.bq_client.query(query)
return query_job.to_dataframe()

class Table:
"""
types:
- bigquery.TableReference: BigQuery table (support '*' globbing table name)
- pandas.DataFrame
- Path: Local CSV file (support '*' globbing file name)
- storage.Blob: GCS CSV file (support '*' globbing file name)
"""
def __init__(self, gcp_client, table_obj, sample_spec=SAMPLE_SPEC, col_subset: list[str] = None):
_table_types = {
bigquery.TableReference: "bigquery",
pd.DataFrame: ("local", "pandas_dataframe"),
Path: ("local", "csv"),
storage.Blob: ("remote", "gcs_csv")
}

self.gcp_client = gcp_client
self._table_obj = table_obj
self.storage_loc, self.table_type = _table_types[type(table_obj)]

self.sample_spec = sample_spec
self.col_subset = col_subset

def _df_col_subset(self, df):
    # Keep only the requested column subset, if one was given
    return df[self.col_subset] if self.col_subset is not None else df

def read_table(self):
if isinstance(self._table_obj, bigquery.TableReference):
query = f"SELECT * FROM `{self._table_obj.project}.{self._table_obj.dataset_id}.{self._table_obj.table_id}`"
if self.sample_spec['tablesample_pct'] is not None:
    query += f"\nTABLESAMPLE SYSTEM ({self.sample_spec['tablesample_pct']} PERCENT)"
if self.sample_spec['max_rows'] is not None:
    query += f"\nLIMIT {self.sample_spec['max_rows']}"

# Run via the wrapped BigQuery client (GcpClient exposes bq_client, per bigquery_query_runner)
query_job = self.gcp_client.bq_client.query(query)
df = query_job.result().to_dataframe()
elif isinstance(self._table_obj, pd.DataFrame):
df = self._table_obj
elif isinstance(self._table_obj, Path):
df = pd.read_csv(self._table_obj)
elif isinstance(self._table_obj, storage.Blob):
# TODO: pretty sure GCS won't work this way
df = pd.read_csv(self._table_obj)

return self._df_col_subset(df)
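For reference, a hypothetical usage of the `Table` class against a sampled BigQuery table; the table path is a placeholder, and it assumes `GcpClient` can be constructed with no arguments and exposes `bq_client` as `bigquery_query_runner` does.

```python
from google.cloud import bigquery
from gcp_client import GcpClient  # assumed no-arg constructor exposing .bq_client
from dataset import Table

# Placeholder table path -- substitute a real project/dataset/table
table_ref = bigquery.TableReference.from_string("my-project.analytics.events")
table = Table(GcpClient(), table_ref, col_subset=["user_id", "event_type"])
df = table.read_table()
print(df.head())
```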