-
Notifications
You must be signed in to change notification settings - Fork 0
/
Pricing_Application.Rmd
374 lines (262 loc) · 15.3 KB
/
Pricing_Application.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
---
title: "Bayesian_Pricing_Exercise"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## Introduction
In this last scenario, I want to emphasize that the goal of analytics, is **not** building models, **not** prediction, and **not** data and technology. These are all tools - some of which are useful. I began this course describing how the term Analytics was first defined by Aristotle as a system of logic and inference, a way of understanding.
### Pricing Scenario
RT Inc. sells engineered industrial equipment. They track opportunties to orders in the CRM system, which integrates with the ERP to create orders and invoices *(typical archiecture)*. Sales engineers are expected to track opportunities in the CRM and record detials on competitive position *(e.g., competitor prices)*. There's typically a lot of data that goes into CRMs *(contacts, calls, issues, feedback, meetings, actions, etc.)*, and much of that is summarized in forecasting indices. This exercise takes a small portion of that data *(most of these analyses involve 20+ dimensions)*.
Pricing is a frequent project in consulting *(interal and external)* - a favorite of CEO's. Why? Because every gain in pricing fall directly to the bottom line *(very little cost)*. These projects are also high visibility because poor pricing policies can lower margins and/or reduce sales volume. So it's an opporutnity to be a hero *(or a bozo)*. In this particular case, RT *(and many ETO companies)* assigns engineering resources to opportunities based on their preception *(intuition)* of Risk - probability of winning a deal.
We'll use the following data for our first analysis:
* SampleID
* RSF *(Relationship Factor - divided into 4 tiers: Green, Silver, Gold and Platinum - based on potential, purchase history, brand, etc. - with service levels allocated baed on tier)*
* QuoteDiff *(diffence between our quote and primary competitor quote)*
* RFPDiff *(difference between the dates the RFP response was requested, and when it was returned)*
* ATPDifference *(diffeence between the available to promise - ATP - date and the date required)*
* Result *(whether the opportunity was won or lost)*
Our goal in this case is to use Bayesian analysis for exploration and understanding. So our emphasis is not on prediction *(although we use prediction parameters for analysis)*, but on probabilistic inference.
## Development of Analysis Model
First let's try to understand what causes wins and losses. The data are not nested - each observation is exchangable, so a crossed-effects, multiple regression model, transformed for classification using a logit function is where we start:
A few final transformations were implemented:
* The ATPDifference as scaled down *(divided by 1000)* to bring it into scale with the other dimensions and help the sampler.
* Data was divided into using a holdout validation set of 100 observations, with the rest for training *(dividing data into training and test is not necessary for exploratory analysis - just a levelset)*.
* Result is transformed between logisitic and regression models to adapt to a *c(0,1)* outcome in logreg, and a *(1,2)* factor as an independent variable.
We use a typical logistic regression equation equation:
$P(\hat{Y})=exp(\hat{\beta}X) / exp(1+ exp(\hat{\beta}X)$
Results are "levelset" *(not a predictive model)* with a small holdout set. After pulling the parameters from the sampler, we used the above equation to compute probability and then generated a binomial result using the following:
$Class = ifelse(Prob < .5, 0, 1))$
These results were run through a confusion matrix to begin analysis:
```{r, message=F, warning=F, echo=T, results="hide"}
library(lme4)
library(tidyverse)
library(rstan)
library(shinystan)
library(gridExtra)
library(caret)
library(cowplot)
library(lubridate)
library(stringr)
library(ggridges)
set.seed(102)
setwd("C:/Users/ellen/Documents/UH/Spring 2020/DA2/tmpGitHub/EllenwTerry/Data_Files")
quoteData <- read.csv("QuoteData2.csv")
RSF = data.frame(RSF = c(1, 2, 3, 4))
Desc <- data.frame(RSFDesc = factor(c("Green", "Silver", "Gold", "Platinum"),
levels = c("Green", "Silver", "Gold", "Platinum")))
RSFDesc = cbind(RSF, Desc)
quoteData = quoteData %>% inner_join(RSFDesc, by = "RSF")
quoteData <- quoteData %>% rownames_to_column("SampleID")
quoteData$SampleID <- as.numeric(quoteData$SampleID)
quoteData$QuoteDiff <- quoteData$QuoteDiff/1000
train <- sample_n(quoteData, nrow(quoteData)-100)
test <- quoteData %>% anti_join(train, by = "SampleID")
xTrain <- select(train, RSF, QuoteDiff, RFPDiff, ATPDiff)
xTest <- select(test, RSF, QuoteDiff, RFPDiff, ATPDiff)
x_train <- as.numeric(train$QuoteDiff)
y_train <- as.integer(train$Result)
N_train <- length(x_train)
# Quick Priors:
glm.priors <- glm(Result ~ RSF + QuoteDiff + RFPDiff + ATPDiff, data = quoteData, family = binomial)
mQuotes = model.matrix(Result ~ RSF + QuoteDiff + RFPDiff + ATPDiff, data = quoteData)
mTrain = model.matrix(Result ~ RSF + QuoteDiff + RFPDiff + ATPDiff, data = train)
mTest = model.matrix(Result ~ RSF + QuoteDiff + RFPDiff + ATPDiff, data = test)
beta = as.numeric(glm.priors$coefficients)
stanMod <- '
data {
int N_train;
int K;
int y_train[N_train];
matrix[N_train, K] x_train;
real p_b[K];
real<lower = 0> p_sb[K];
}
parameters {
vector[K] beta;
}
transformed parameters {
vector[N_train] y_hat;
for(n in 1:N_train)
y_hat[n] = x_train[n]*beta;
}
model {
target += normal_lpdf(beta | p_b, p_sb);
target += bernoulli_lpmf(y_train | inv_logit(y_hat));
}
'
beta = as.numeric(glm.priors$coefficients)
betaSigma = rep(.2, 5)
fit <- stan(model_code=stanMod,
data = list(
N_train=nrow(mTrain),
K=ncol(mTrain),
y_train=train$Result,
x_train=mTrain,
p_b = beta,
p_sb = betaSigma
), refresh = 0)
Coef <- summary(fit, pars = c('beta'), probs = c(0.1, 0.9))$summary[,1]
Stanbeta = summary(fit, pars = c('beta'), probs = c(0.1, 0.9))$summary[,1]
test$Prob <- exp(t(Stanbeta%*%t(mTest)))/(1+exp(t(Stanbeta%*%t(mTest))))
test = test %>% mutate (Class = if_else(Prob < .5, 0, 1))
```
Confusion Matrix:
```{r, message=F, warning=F, fig.width=4, fig.height=3, fig.align="center"}
confusionMatrix(factor(test$Class), factor(test$Result), positive = "1")
```
70% is a good start for exploratory analysis *(80+% accuracy in operational sales models is good - trust me)*. But for analysis, we're really interested in the parameters - that's the story. So, let's take a look at those:
```{r, message=F, warning=F, fig.width=6, fig.height=6, fig.align="center", echo=T}
dfPlotRSF <- test %>% group_by(RSF) %>% summarise(meanP = mean(Prob), sdP = sd(Prob))
p1 <- ggplot(dfPlotRSF, aes(x=meanP)) + geom_point(aes(y = RSF), color = 'blue') +
geom_errorbarh(aes(xmin= (meanP-sdP) , xmax = (meanP + sdP), y = RSF), height = 1, color = "red") +
theme(panel.background = element_rect(fill = "white")) +
theme(axis.title.x=element_blank())
dfPlotQuote <- test %>% group_by(QuoteDiff) %>% summarise(meanP = mean(Prob), sdP = sd(Prob))
p2 <- ggplot(dfPlotQuote, aes(x=meanP)) + geom_point(aes(y = QuoteDiff), color = 'blue') +
geom_errorbarh(aes(xmin= (meanP-sdP) , xmax = (meanP + sdP), y = QuoteDiff), height = 1, color = "red") +
theme(panel.background = element_rect(fill = "white")) +
theme(axis.title.x=element_blank())
dfPlotATP <- test %>% group_by(ATPDiff) %>% summarise(meanP = mean(Prob), sdP = sd(Prob))
p3 <- ggplot(dfPlotATP, aes(x=meanP)) + geom_point(aes(y = ATPDiff), color = 'blue') +
geom_errorbarh(aes(xmin= (meanP - sdP) , xmax = (meanP + sdP), y = ATPDiff), height = 1, color = "red") +
theme(panel.background = element_rect(fill = "white")) +
theme(axis.title.x=element_blank())
dfPlotRFP <- test %>% group_by(RFPDiff) %>% summarise(meanP = mean(Prob), sdP = sd(Prob))
p4 <- ggplot(dfPlotRFP, aes(x=meanP)) + geom_point(aes(y = RFPDiff), color = 'blue') +
geom_errorbarh(aes(xmin= (meanP - sdP) , xmax = (meanP + sdP), y = RFPDiff), height = 1, color = "red") +
theme(panel.background = element_rect(fill = "white")) +
theme(axis.title.x=element_blank())
plot_grid(p1, p2, p3, p4, align = 'h')
```
Lots of stories - when you start to dig into data like this, keep an holistic, business analytics mindset. Here are some initial questions:
* **RSF** is the customer tier. The "green" tier has a wider standard deviation than the others. Could it be that reponse to new customers is less consistent? Do new customers ask engineering for unrealisitic function? Why is the variance in "silver" customers less than than the others - is that just the sample or some characteristic of transition from "green" to "gold"? And why is success rate with "green" customers higher than "silver"?
* **QuoteDiff** is follows an expected pattern *(remeber, a positive number means that competitors' prices are higher, so probabity increases as QuoteDiff increases)*. Why does the standard deviation increase around 0?
* **ATPDiff** Difference between Available to Promise and Date Required is all over the board. Why? is this reliable? Is it too complex? Is there a consistent policy?
* **RFPDiff** Difference between RFP Due Date and Date Submitted is all over the place too. Why? Policy? What if we look at ATP and RFP by Customer Tier - do you think that might provide some insight?
Maybe we should dig a little deeper...
### Extending Analysis of QuoteDiff with Regression
QuoteDiff appears to be the best predictor of Win %. Let's turn the analysis around and model QuoteDiff as a reponse variable - maybe there's a causal chain here *(maybe the customer tier affects price and engineering, which affects Win %)*. Let's create a regression model with QuoteDiff as the response.
Quick Priors:
```{r, message=F, warning=F, fig.width=6, fig.height=6, fig.align="center", echo=T}
# quoteDataBU = quoteData
# quoteData = quoteDataBU
quoteData$Result = quoteData$Result + 1
quoteData$Result = factor(quoteData$Result)
tstMod1 = lmer(QuoteDiff ~ 1 + ATPDiff +
(1 | RSF) +
(1 | Result),
data = quoteData)
```
Before modeling, let's look at WvL% vs. RSF and ATP:
```{r, message=F, warning=F, fig.width=5, fig.height=5, fig.align="center", echo=T}
quoteData$ATPInterval = as.factor(cut_interval(quoteData$ATPDiff, 4))
p <- ggplot(quoteData, aes(y = ATPInterval, x = QuoteDiff, fill = Result)) +
guides(fill=FALSE) +
geom_density_ridges(bandwidth = 10, alpha = .2) +
geom_vline(xintercept = 0) +
facet_grid(rows = vars(RSFDesc)) +
theme(panel.background = element_rect(fill = "white"))
p
```
What do you see? Any patterns? Is there a relationship between ATP, Tier and WvL? What could be causes? Why does the spread between W and L change so much for Green? Let's so this:
Build a Stan Regression model and determine the Win probability for green vs platinum tier for different QuoteDiffs:
```{r, message=F, warning=F, fig.width=6, fig.height=6, fig.align="center", echo=T}
stanModel1 <- '
data {
int<lower=0> N;
int<lower=0> K;
int<lower=0> L;
vector[N] y;
real ATP[N];
int Result[N];
int RSF[N];
real p_ATP;
real p_Result[K];
real p_RSF[L];
}
parameters {
real<lower = 0> sigma[L];
real aATP;
vector[K] aResult;
vector[L] aRSF;
}
transformed parameters {
vector[N] y_hat;
for (i in 1:N)
y_hat[i] = aResult[Result[i]] + aRSF[RSF[i]] + aATP *ATP[i];
}
model {
target += normal_lpdf(y | y_hat, sigma[RSF]);
target += normal_lpdf(aATP | p_ATP, .5);
target += normal_lpdf(aResult | p_Result, 2);
target += normal_lpdf(aRSF | p_RSF, .5);
}
'
stanData <- list(
N=nrow(quoteData),
K=length(unique(quoteData$Result)),
L=length(unique(quoteData$RSF)),
y=quoteData$QuoteDiff,
ATP=as.numeric(quoteData$ATPDiff),
Result = as.numeric(quoteData$Result),
RSF = as.numeric(quoteData$RSF),
p_ATP = fixef(tstMod1)[2],
p_Result = coef(tstMod1)$`Result`[,1],
p_RSF = coef(tstMod1)$`RSF`[,1]
)
fit1 <- stan(model_code = stanModel1, data = stanData, refresh = 0)
```
And look at the
```{r, message=F, warning=F, fig.width=6, fig.height=3, fig.align="center", echo=T}
sumFit1 <- data.frame(summary(fit1))
Coef <- summary(fit1, pars = c('aATP', 'aResult', 'aRSF', 'sigma'), probs = c(0.1, 0.9))$summary
sigma = summary(fit1, pars = c("sigma"))$summary[,1]
ATPEffect = summary(fit1, pars = c("aATP"))$summary[,1]
ResultEffect = summary(fit1, pars = c("aResult"))$summary[,1]
RSFEffect = summary(fit1, pars = c("aRSF"))$summary[,1]
fDf = data.frame(QuoteDiff = seq(from = -60, to = 60, by = 1))
dMn = mean(quoteData$QuoteDiff)
dSD = sd(quoteData$QuoteDiff)
dWMn = mean(filter(quoteData, Result == 2)$QuoteDiff)
dWSD = sd(filter(quoteData, Result == 2)$QuoteDiff)
dLMn = mean(filter(quoteData, Result == 1)$QuoteDiff)
dLSD = sd(filter(quoteData, Result == 1)$QuoteDiff)
p10 <- ggplot(quoteData) +
geom_line(aes(QuoteDiff,y= dnorm(QuoteDiff, mean = dMn, sd = dSD)), color = 'gray') +
geom_line(aes(QuoteDiff,y=
dnorm(QuoteDiff, mean = mean(filter(quoteData, Result == 2)$QuoteDiff),
sd = sd(filter(quoteData, Result == 2)$QuoteDiff))), color = 'skyblue1') +
geom_line(aes(QuoteDiff,y= dnorm(QuoteDiff, mean = mean(filter(quoteData, Result == 1)$QuoteDiff),
sd = sd(filter(quoteData, Result == 1)$QuoteDiff))), color = 'tomato1') +
xlim(-60, 60) + ylab("density") +
theme(panel.background = element_rect(fill = "white"))
# p10
```
```{r, message=F, warning=F, fig.width=6, fig.height=3, fig.align="center", echo=T}
1 - pnorm(0, dWMn, dWSD)
# RSF Effects - quote = compeitor, ATP = Required
p10 <- ggplot(quoteData) +
geom_line(aes(QuoteDiff,y= dnorm(QuoteDiff, mean = dMn+RSFEffect[1]+ResultEffect[1]+(ATPEffect*ATPDiff), sd = sigma[1])), color = 'darkgreen') +
geom_line(aes(QuoteDiff,y= dnorm(QuoteDiff, mean = dMn+RSFEffect[4]+ResultEffect[1]+(ATPEffect*ATPDiff), sd = sigma[4])), color = 'darkgray') +
xlim(-30, 30) +
ylab("Win Density") +
theme(panel.background = element_rect(fill = "white"))
p10
```
Just looking at these distributions, we can see that the platinum customers are more likely to place orders when RT's prices are higher than competitors. Perhaps that's because new customers are just price shopping?
But we already knew that didn't we? And we can get this from an dashboard. The mark of an analyst is whether we can provide **credible, quantitative metrics to describe the relationships - the drivers of business**. In this case, Baysian analysis again provides us with the tools - do you know how to calculate that? What if we want to know the probability of a win if we match competitor price, and ATP is 70 *(no problem delivering on time)*. Compare Green and Platinum:
Here's what I get:
```{r, message=F, warning=F, fig.width=6, fig.height=6, fig.align="center", echo=T}
# QuoteDiff = 0, RSFEffect[Green] + Result[Win] + ATPEffect[70]
pnorm(0, dMn+RSFEffect[1]+ResultEffect[1]+(ATPEffect*70), dWSD)
# QuoteDiff = 0, RSFEffect[Green] + Result[Win] + ATPEffect[70]
pnorm(0, dMn+RSFEffect[4]+ResultEffect[1]+(ATPEffect*70), dWSD)
```
So, about 90% for Platimum, but 70% for Green *(more price sensitivity)*?
### Final Thoughts
I've really enjoyed knowing this class, and I hope you'll keep these tools, and especially this mindset with you as you seek to understand the data and transactions in front of you. **Be Compasionate**, **Be Conscientious**, **Be Curious**, and **Be Cool**.
Best of luck and stay in touch.
Love,
Ellen