Training an explainable multivariate machine learning model

Business Objectives

The goal of the project is to use machine learning to understand, in a flexible way, how distributor and customer behavior metrics drive revenue. Explanations can be taken upstream to the C-suite to understand customer behavior across the whole business, and downstream to account managers for tactical interventions on their specific set of accounts. While this method was used in a real business context, here I use a dummy dataset of distributor information (distributors.csv).

Technical Overview

This project uses XGBoost to train a multivariate model that predicts revenue from distributor and customer behavior metrics. The model was tuned with cross-validated grid search, and SHAP values were used to explain feature importance at both the local level (i.e. an individual distributor) and the global level (i.e. across all distributors).
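
To make the tuning step concrete, here is a minimal, library-free sketch of what a cross-validated grid search does: every parameter combination is fitted on k-1 folds and scored on the held-out fold, and the combination with the lowest mean validation error wins. This is not the notebook's code. The one-feature ridge model, the `l2` parameter, and the toy data are stand-ins chosen so the example runs without XGBoost; the function names are hypothetical.

```python
from itertools import product
from statistics import mean

def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k contiguous (train, validation) pairs."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        folds.append((train, val))
        start += size
    return folds

def grid_search_cv(fit, predict, X, y, param_grid, k=3):
    """Return (best_params, best_score, results) using mean validation MSE."""
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, val in k_fold_indices(len(X), k):
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            errors = [(predict(model, X[i]) - y[i]) ** 2 for i in val]
            fold_scores.append(mean(errors))
        results.append((params, mean(fold_scores)))
    best_params, best_score = min(results, key=lambda r: r[1])
    return best_params, best_score, results

# Stand-in model (assumption): a single-feature ridge regression without intercept.
def fit_ridge_slope(X, y, l2=0.0):
    return sum(x * t for x, t in zip(X, y)) / (sum(x * x for x in X) + l2)

def predict_ridge(slope, x):
    return slope * x

# Toy data (illustrative only, not taken from distributors.csv).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

best_params, best_score, results = grid_search_cv(
    fit_ridge_slope, predict_ridge, X, y, {"l2": [0.0, 1.0, 100.0]}, k=3
)
print(best_params, round(best_score, 4))
```

With a real gradient-boosted model the same loop would range over parameters such as tree depth or learning rate; a library implementation of grid search would normally replace the hand-written loop.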

Output of Project

Impact on whole business (Global explanations)

Once the model is trained, SHAP values can be used to analyze how distributor behavior affects revenue across the whole business. New customers had the highest impact, followed by contract renewals, number of orders, and order value. Most metrics behave as expected: the higher the metric value, the higher its impact on revenue. An exception is months_purchased, whose behavior is less obvious and merits further investigation. Interestingly, a number of distributors with low repeat-customer values still show a positive impact on revenue.
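
A global ranking like the one described above is commonly obtained by averaging the absolute SHAP value of each feature over all rows. The sketch below shows that aggregation on a hand-made matrix. The numbers are illustrative only and are not results from this project, and every column name other than months_purchased is a hypothetical stand-in.

```python
def global_importance(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value across all rows, largest first."""
    n_rows = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical column names and made-up SHAP values (one row per distributor).
feature_names = ["new_customers", "contract_renewals", "months_purchased"]
shap_rows = [
    [4.0, -1.0, 0.5],
    [-2.0, 3.0, -0.5],
    [6.0, 2.0, 0.5],
]

ranking = global_importance(shap_rows, feature_names)
for name, score in ranking:
    print(f"{name:20s} {score:.2f}")
```

Taking the absolute value before averaging matters: a feature that pushes revenue up for some distributors and down for others would otherwise cancel out and look unimportant.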

Impact on individual distributor (Local explanations)

One of the valuable things about using a tree-based model is that we can also see explanations at a localized level, i.e. for a single distributor, which is useful for tactical business decisions such as account management. The waterfall chart shows the impact of each metric on a single distributor's revenue, relative to the expected average across all distributors.
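
The waterfall reading rests on one property of SHAP values: the expected (average) prediction plus the per-feature contributions adds up exactly to the prediction for that distributor. The sketch below computes exact Shapley values by brute-force enumeration for a tiny hand-written model and checks that property. It is an illustration of the definition, not the tree-specific algorithm the SHAP package uses for XGBoost; the revenue formula, column names, and numbers are all assumptions.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background):
    """Exact Shapley values for one row `x`, averaging over a background dataset."""
    n = len(x)

    def value(subset):
        # Mean prediction when features in `subset` take x's values and the
        # remaining features take each background row's values.
        total = 0.0
        for b in background:
            mixed = [x[j] if j in subset else b[j] for j in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return value(set()), phi

# Hypothetical stand-in for the trained model (assumption, not the real one).
def predict_revenue(row):
    new_customers, contract_renewals, number_of_orders = row
    return 100 + 50 * new_customers + 30 * contract_renewals + 2 * number_of_orders

background = [[2, 1, 10], [4, 3, 20]]   # made-up "all distributors" sample
x = [5, 2, 25]                          # made-up single distributor

base_value, phi = shapley_values(predict_revenue, x, background)

# Print the waterfall as a running total from the average to the prediction.
running = base_value
for name, contribution in zip(["new_customers", "contract_renewals", "number_of_orders"], phi):
    running += contribution
    print(f"{name:20s} {contribution:+8.1f} -> {running:8.1f}")
```

Enumeration is exponential in the number of features, so it is only practical for toy cases like this one.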

Creation of interactive yearly explanations

A limitation of the SHAP package is that its waterfall charts only compare a single prediction (i.e. one distributor) to the average over all predictions (i.e. the average of all distributors). What if we want to compare a single prediction to the same distributor's prediction from a prior time period? Here we devise a way to analyze the change in SHAP impact between a distributor and their own performance last year. In the code, an interactive Plotly plot is created to visualize the impact of a distributor's behavior on revenue between 2022 and 2023.
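
One plausible way to build such a comparison, sketched here under assumptions and not necessarily the approach taken in the notebook, is to explain both years with the same model and base value, then subtract the SHAP values feature by feature. Because each year's prediction equals the base value plus its contributions, the per-feature differences add up to the change in predicted revenue. Column names and numbers below are hypothetical.

```python
def yearly_shap_delta(shap_prev, shap_curr):
    """Per-feature change in SHAP contribution between two periods, largest first."""
    deltas = {name: shap_curr[name] - shap_prev[name] for name in shap_prev}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)

def waterfall_steps(start_value, ordered_deltas):
    """Turn ordered deltas into (feature, bar_start, bar_end) tuples."""
    steps, running = [], start_value
    for name, delta in ordered_deltas:
        steps.append((name, running, running + delta))
        running += delta
    return steps, running

# Made-up SHAP values for one distributor, explained against the same base value.
base_value = 340.0
shap_2022 = {"new_customers": 100.0, "contract_renewals": 0.0, "number_of_orders": 20.0}
shap_2023 = {"new_customers": 40.0, "contract_renewals": 30.0, "number_of_orders": 25.0}

pred_2022 = base_value + sum(shap_2022.values())
pred_2023 = base_value + sum(shap_2023.values())

ordered = yearly_shap_delta(shap_2022, shap_2023)
steps, end_value = waterfall_steps(pred_2022, ordered)

print(f"2022 prediction: {pred_2022:.1f}")
for name, bar_start, bar_end in steps:
    print(f"{name:20s} {bar_start:8.1f} -> {bar_end:8.1f}")
print(f"2023 prediction: {end_value:.1f}")
```

The resulting steps start at the 2022 prediction and end at the 2023 prediction, so they could be passed to an interactive waterfall trace such as Plotly's `go.Waterfall`.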


Full documentation in notebook.

dkwik/xgboost-shap-business-analysis
