A multivariate model trained with XGBoost, with Shapley (SHAP) values used to make the black-box model explainable. The analysis was used for account management and C-suite reporting. The SHAP values were reworked so that they explain a customer's performance relative to that same customer's performance in the previous year. The output is an interactive Plotly waterfall chart.

dkwik/xgboost-shap-business-analysis

Training an explainable multivariate machine learning model

Business Objectives

The goal of the project is to use machine learning to understand, in a flexible way, how distributor and customer behavior metrics drive revenue. Explanations can be taken upstream to the C-suite to understand customer behavior across the whole business, and downstream to account managers for tactical interventions on their specific accounts. While this method was used in a real business context, here I use a dummy dataset of distributor information (distributors.csv).

Technical Overview

This project uses XGBoost to train a multivariate model that predicts revenue from distributor and customer behavior metrics. The model was tuned with a cross-validated grid search, and SHAP values were used to explain it by quantifying feature importance at both a local and a global level (i.e., for an individual distributor and across all distributors).

Output of Project

Impact on whole business (Global explanations)

Once the model is trained, SHAP values can be used to analyze how distributor behavior affects revenue for the whole business. We see that new customers had the highest impact, followed by contract renewals, number of orders, and order value. Most metrics behave as expected: the higher the metric value, the higher its impact on revenue. An exception is months_purchased, whose behavior is less obvious and could be investigated in further work. Interestingly, there are also a number of distributors with few repeat customers whose revenue is nonetheless positively impacted.

[Figure: global_shap]
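As a rough illustration of how a global ranking like the one above can be derived, the sketch below averages the absolute SHAP value of each feature across all rows, which is a common global importance measure. The feature names (other than months_purchased) and all numbers are made up for illustration; they are not taken from distributors.csv or from the notebook.

```python
# Hypothetical SHAP matrix: one row per distributor, one column per feature.
# Feature names and values are illustrative assumptions, not project data.
feature_names = ["new_customers", "contract_renewals", "num_orders", "months_purchased"]
shap_matrix = [
    [120.0, -40.0, 15.0, 5.0],
    [-80.0, 60.0, -10.0, -5.0],
    [100.0, 50.0, 35.0, 2.0],
]


def global_importance(shap_rows, names):
    """Return (feature, mean |SHAP|) pairs sorted from most to least important."""
    n_rows = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n_rows
        for j, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = global_importance(shap_matrix, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.1f}")
```

Taking the absolute value before averaging matters here: a feature that pushes revenue up for some distributors and down for others would otherwise cancel out and look unimportant.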

Impact on individual distributor (Local explanations)

One of the valuable things about using SHAP with a tree-based model is that we can also see explanations at a local level, i.e., for a single distributor. This is useful for tactical business decisions such as account management. Here we see a waterfall chart showing the impact of each metric on a single distributor's revenue, compared to the expected average across all distributors.

[Figure: waterfall_shap]
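A waterfall chart of this kind relies on the additive property of SHAP values: for one row, the expected value plus the sum of that row's SHAP values gives the model's prediction. The sketch below walks through that arithmetic for a single hypothetical distributor. The expected value, feature names, and contributions are illustrative assumptions, not output from the project.

```python
# Hypothetical local explanation for one distributor (illustrative numbers).
expected_value = 1000.0  # assumed average model output across all distributors
contributions = {
    "new_customers": 250.0,
    "contract_renewals": -120.0,
    "num_orders": 60.0,
    "months_purchased": -10.0,
}


def waterfall_steps(base, contribs):
    """Order features by |contribution| and track the running prediction.

    Returns a list of (feature, start, end) bars and the final prediction.
    """
    steps = []
    running = base
    ordered = sorted(contribs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    for name, value in ordered:
        start = running
        running += value
        steps.append((name, start, running))
    return steps, running


steps, prediction = waterfall_steps(expected_value, contributions)
for name, start, end in steps:
    print(f"{name}: {start:.0f} -> {end:.0f}")
print(f"prediction: {prediction:.0f}")
```

Each bar starts where the previous one ended, so the last bar ends at the distributor's predicted revenue.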

Creation of interactive yearly explanations

A limitation of the SHAP package is that its waterfall charts only compare a single prediction (i.e., one distributor) to the average over all predictions (i.e., the average across all distributors). What if we want to compare a single prediction to the same distributor's prediction from a prior time period? Here we devise a way to analyze SHAP impact between a distributor and their own performance last year. In the code, an interactive Plotly plot is created to visualize the impact of a distributor's behavior on revenue in 2022 vs. 2023.

[Figure: waterfall_interactive]
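One possible way to build such a year-over-year comparison is sketched below; it is an assumption about the approach, not necessarily what the notebook does. If both years are explained by the same model with the same expected value, the expected value cancels out, and the change in prediction equals the sum of per-feature differences in SHAP values. The function name, feature names, and numbers are hypothetical.

```python
# Hypothetical SHAP values for one distributor in two years (illustrative).
# Assumes both years are explained by the same model and share one base value.
base_value = 1000.0
shap_2022 = {"new_customers": 100.0, "contract_renewals": 50.0, "num_orders": -30.0}
shap_2023 = {"new_customers": 180.0, "contract_renewals": 20.0, "num_orders": 10.0}


def yearly_waterfall(base, shap_prev, shap_curr):
    """Build waterfall bars from last year's prediction to this year's.

    Returns (start_prediction, bars, end_prediction), where each bar is
    (feature, delta) and deltas are ordered by absolute size.
    """
    start = base + sum(shap_prev.values())
    deltas = {name: shap_curr[name] - shap_prev[name] for name in shap_prev}
    bars = sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)
    end = start + sum(delta for _, delta in bars)
    return start, bars, end


pred_2022, bars, pred_2023 = yearly_waterfall(base_value, shap_2022, shap_2023)
print(f"2022 prediction: {pred_2022:.0f}")
for name, delta in bars:
    print(f"{name}: {delta:+.0f}")
print(f"2023 prediction: {pred_2023:.0f}")
```

The resulting deltas could then be passed as relative bars to a Plotly `go.Waterfall` trace, with the 2022 prediction as the starting point, to get an interactive chart.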

Full documentation is in the notebook.
