Ben/what is probability #141

Open
wants to merge 4 commits into
base: main
Changes from 2 commits
2 changes: 1 addition & 1 deletion source/bayes_simulation.Rmd
@@ -1201,7 +1201,7 @@ Creating the correct simulation procedure is not trivial, because Bayesian
reasoning is subtle. But it certainly is not easier to create a
correct procedure using analytic tools (except in the cookbook sense of
plug-and-pray). If one is interested in insight, a combination of theory and simulation procedure
-might well be the answer[^sequentially]
+might well be the answer[^sequentially].

[^sequentially]: We can use a similar procedure to illustrate an aspect of the
Bayesian procedure that Box and Tiao emphasize, its
102 changes: 100 additions & 2 deletions source/what_is_probability.Rmd
@@ -36,13 +36,13 @@ Not ready for review
## Introduction, what is probability?

Uncertainty is part of life. Although there is not much we can do to answer questions such as,
-will I loose my job within the next year?, or what is my life expectency?, it is possible to
+will I loose my job within the next year?, or what is my life expectancy?, it is possible to
answer questions such as, what are my chances of winning the lottery? (exceedingly small).

Scientists from any discipline that we can think of, also need to deal with uncertainties,
to come up with specific answers when possible, or know the limits of what it is possible to know.

-You will encounter numerous examples of this kind throughout this book. In fatc, this
+You will encounter numerous examples of this kind throughout this book. In fact, this
is what this book is all about.

The uncertainty stems from different sources, some or all may be present in any given application.
@@ -115,6 +115,104 @@ on developing causality models and is still an exciting, active field of researc
The rest of this chapter is about the general considerations that are important for principled
reasoning in the presence of uncertainty.

## Understand your problem

In practice you will be asked to provide answers based on data. For example, you may be given data about customer
behaviour in a large bank and asked to develop a model that will provide the probability of default of the customers in
the bank. This is an important problem for all financial institutions - if it does not have a good credit risk model,
it will either loose money by being too conservative in the way it lends money, or loose money bay taking on too much
Member

Suggested change
it will either loose money by being too conservative in the way it lends money, or loose money bay taking on too much
it will either loose money by being too conservative in the way it lends money, or loose money by taking on too much

of a risk. You can opt for applying a sophisticated model such as a deep neural network but you are almost guaranteed
Member

Won't they be confused by "deep neural network" here? Is there a more general way of saying this - such as:

Suggested change
of a risk. You can opt for applying a sophisticated model such as a deep neural network but you are almost guaranteed
of a risk. You can opt for applying an extremely complex "machine-learning" model with many parameters but you are almost guaranteed

to come to grief. First study the problem and come to terms of all the many issues at stake. We speak of experience!
Member

Better as?

Suggested change
to come to grief. First study the problem and come to terms of all the many issues at stake. We speak of experience!
to come to grief. First study the problem and come to terms of all the many issues at stake. We speak from experience!


Let's illustrate the idea with a simple example. The teacher asks little Annie to solve the following problem: Ten
sheep are on this side of the road and one sheep crosses to the other side, how many sheep remain on this side? Annie
knows the answer of course, and replies, correctly, none. This quite agitates the teacher, who asks: there were ten sheep
on this side of the road and one crosses over to the other side, how is it that you tell non remain? Annie replies,
Member

Suggested change
on this side of the road and one crosses over to the other side, how is it that you tell non remain? Annie replies,
on this side of the road and one crosses over to the other side, why are you saying that none remain? Annie replies,

You don't know sheep, if one crosses the road everyone else follows and none remain. We can confirm this, also from
experience!

It is easy to get the arithmetic right, but just as easy to get the problem wrong if you don't understand it.

Please make sure you know what problem you have to solve. You may even run into situations where a company provides you
with lots of data and then ask you to extract meaningful information from it. Our advice it, work with the company to
Member

Suggested change
with lots of data and then ask you to extract meaningful information from it. Our advice it, work with the company to
with lots of data and then asks you to extract meaningful information from it. Our advice is, work with the company to

first formulate a meaningful problem. Then you can direct your investigation to solving this problem.

Work with the domain experts! You may also find that your expertise is in more demand if you have specialised domain knowledge!

## Understand your data

You will often find that you spend more time trying to understand your data than solving the problem. During this
investigation you will probably look at things like,

1. Possible correlations between the variables.
2. Is the data complete? Real-world data often has empty fields that are filled with `NaN`'s. How do you deal with this?
Member

Suggested change
2. Is the data complete? Real-world data often has empty fields that are filled with `NaN`'s. How do you deal with this?
2. Is the data complete? Real-world data often has empty fields with missing data. How do you deal with this?

Do you discard these fields or do you impute values for them?
Member

Suggested change
Do you discard these fields or do you impute values for them?
Do you discard these fields or do you try and estimate values for them?

3. Do you have reason to believe that the information you need to solve the problem is in the data? This is a much-neglected
problem, perhaps because it is not easy to provide an answer. There are certainly situations where some of the
variables can act as a proxy for what is needed. This simply means that the information is hidden, but present in
the data. Precisely because it is not an easy problem to answer, it is necessary to give it some thought and
most definitely keep it in mind while you are modelling.

One example that is often quoted in the literature is about a group of researchers that wanted to build a model to
visually distinguish between criminals and non-criminals. For this they used a dataset of photographs of known criminals
and non-criminals. For their criminals they relied on police mug shots. You have to seriously ask yourself whether
Member

Reference?

Member

Suggested change
and non-criminals. For their criminals they relied on police mug shots. You have to seriously ask yourself whether
and non-criminals. For their criminals they relied on police mug shots - but of course they had to use other types of photograph for the non-criminals. But even without that difference, you have to seriously ask yourself whether

you believe that one can tell whether a person is a criminal based on visual appearance.
4. Is your data balanced? If not, how are you going to deal with it? If you are ever asked to develop a credit risk model,
you will have a vast quantity of non-defaulting examples and few, perhaps 5%, of defaulting examples. No financial
institution will survive a high percentage of defaulting customers. How will you deal with it?
5. Is you data biased? Because of historical reasons and social inequalities your data may be biased against minorities,
Member

Suggested change
5. Is you data biased? Because of historical reasons and social inequalities your data may be biased against minorities,
5. Is your data biased? For historical reasons and because of social inequalities your data may be biased against minorities,

or on race, gender, etc. If this is the case your model will be seriously deficient. You may also find that the institution
you are working for has strict policies in place against the use of potentially harmful variables.

These raise serious ethical questions that the practitioner should be aware of.
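
A few of the checks in the list above (missing values, class balance, correlation) can be sketched in plain Python. This is only an illustrative sketch under stated assumptions: the toy records, the field names `income`, `debt` and `default`, and the helper functions are all hypothetical and are not part of the book's code.

```python
import math
from collections import Counter

# Hypothetical toy data set: ten customers. None marks a missing field.
records = [
    {"income": 10.0 * i, "debt": 5.0 * i, "default": 0} for i in range(1, 11)
]
records[3]["debt"] = None   # simulate one empty field
records[9]["default"] = 1   # a single defaulting customer


def is_missing(value):
    """True for None or a floating-point NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_fraction(rows, field):
    """Fraction of rows in which `field` is missing."""
    return sum(is_missing(row[field]) for row in rows) / len(rows)


def class_balance(rows, label):
    """Proportion of rows carrying each value of `label`."""
    counts = Counter(row[label] for row in rows)
    return {value: count / len(rows) for value, count in counts.items()}


def pearson(rows, field_x, field_y):
    """Pearson correlation over the rows where both fields are present."""
    pairs = [
        (row[field_x], row[field_y])
        for row in rows
        if not is_missing(row[field_x]) and not is_missing(row[field_y])
    ]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


print("missing debt:", missing_fraction(records, "debt"))
print("class balance:", class_balance(records, "default"))
print("income/debt correlation:", pearson(records, "income", "debt"))
```

With one defaulter in ten, a model that always predicts "no default" would be right 90% of the time while being useless for the purpose, which is one reason the class balance deserves attention before any modelling starts. Note also that `pearson` silently drops incomplete rows; that is itself a choice about how to handle missing data.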

Returning to the criminal detection problem mentioned above, it failed. Let's think of what the model does. Since it
it given samples of photographs of criminals and non-criminals, i.e. each photograph comes with the label, `criminal`
Member

Suggested change
it given samples of photographs of criminals and non-criminals, i.e. each photograph comes with the label, `criminal`
is given samples of photographs of criminals and non-criminals, i.e. each photograph comes with the label, `criminal`

or `non-criminal`, it is adjusting its parameters in order to find the maximum correlation between samples belonging
to the same class, and to maximize the difference between the two classes. Your model will latch on to any feature that
satisfies these requirements, including unacceptable bias.

## How is your model going to be used?

The responsibility of the technical developer does not end with providing the model, or the analytics needed for the
purpose. It is important to know how your model is going to be used. If you are to develop a system that need to
Member

Suggested change
purpose. It is important to know how your model is going to be used. If you are to develop a system that need to
purpose. It is important to know how your model is going to be used. If you are to develop a system that needs to

detect a terrorist before they board an aircraft, your thinking will be very different from when the result of an error
by your system is relatively benign.

The institution you work for may also need to be able to audit the output of your system. This brings us to the next issue.

## Involve all stakeholders

You cannot, and do not want to, take on the responsibility for all the choices outlined above, and this is by no means an
exhaustive list! Work with the domain knowledge experts within the institution, involve all managers that have a stake in
your system. And work within a team.

Make sure you are that team member that everyone wants to work with, because of your expertise, because you are trustworthy,
and easy to work with.

## Follow good software practices

This might appear to be an unusual topic for a book on statistics. However, this book emphasizes resampling methods, and
those are all about coding. Anyway, you will become so much more marketable if you learn the basics of solid software practices.
We don't have the space to do it here but we do want to stress its importance. Always keep in mind the following:

1. Use versioning control, we recommend using git. If you regularly push to the git repo this will protect you from
Member

Suggested change
1. Use versioning control, we recommend using git. If you regularly push to the git repo this will protect you from
1. Use version control; we recommend using git. If you regularly push to the git repo this will protect you from

accidental software loss. It also makes it easy to share your code. You want other people to use your code; it makes you
so much more useful!
2. Ask someone elso to critically review your code. Even better if you work in an environment where there is a formal
Member

Suggested change
2. Ask someone elso to critically review your code. Even better if you work in an environment where there is a formal
2. Ask someone else to critically review your code. Even better if you work in an environment where there is a formal

system of code review.
3. Read other people's code. You will learn a lot.
4. Always tests for your code. This means that you run your code on small examples for which you know the answer. Every
Member

Suggested change
4. Always tests for your code. This means that you run your code on small examples for which you know the answer. Every
4. Always add tests for your code. This means that you run your code on small examples for which you know the answer. Every

time you make changes to your code you can check that no unwanted side effects occurred.
5. Even better if you start your coding exercise with a small example for which you know the answer. This is known as
test-driven development and we do this all the time.
6. Python has several formal testing environments that help you write tests. We recommend `pytest`.
7. Always be critical of yourself. You will run into bugs and make errors. This is inevitable and you should learn how
to recognise non-obvious errors. Are the results what you expect, are they reasonable? Think of everything you
possibly can to find fault with the output of your system.
8. Always be critical of yourself. Know that you can and will make mistakes. That is no shame; being sloppy and not following
good practices, is.
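
Items 4 to 6 above can be illustrated with a small resampling function and tests on examples where the answer is known in advance. This is a minimal sketch, not the book's own code: `bootstrap_means` is a hypothetical helper written for this illustration. The test functions use the `test_` naming prefix that `pytest` collects, so if the code were saved in a file such as `test_bootstrap.py` (an assumed file name), running `pytest` in that directory should find and run them.

```python
import random


def bootstrap_means(data, n_resamples, seed=None):
    """Return the means of `n_resamples` bootstrap resamples of `data`.

    Each resample draws len(data) values from `data` with replacement.
    """
    rng = random.Random(seed)  # local generator, so results are repeatable
    n = len(data)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(data, k=n)
        means.append(sum(sample) / n)
    return means


def test_constant_data_gives_constant_means():
    # Known answer: resampling identical values can only give that value.
    assert bootstrap_means([5.0] * 4, n_resamples=100, seed=0) == [5.0] * 100


def test_same_seed_is_reproducible():
    first = bootstrap_means([1, 2, 3, 4], n_resamples=50, seed=42)
    second = bootstrap_means([1, 2, 3, 4], n_resamples=50, seed=42)
    assert first == second


def test_means_stay_within_data_range():
    # A mean of resampled values cannot fall outside the observed range.
    means = bootstrap_means([1, 2, 3, 4], n_resamples=50, seed=1)
    assert len(means) == 50
    assert all(1 <= m <= 4 for m in means)
```

The tests deliberately avoid checking a specific random draw; they check properties that must hold for any correct implementation, so they keep passing if the internals of the function change.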


<!---
