resampling-stats · ben-herbst · Sep 12, 2023 · Sep 13, 2023 · Sep 15, 2023 · Sep 25, 2023
diff --git a/source/bayes_simulation.Rmd b/source/bayes_simulation.Rmd
@@ -1201,7 +1201,7 @@ Creating the correct simulation procedure is not trivial, because Bayesian
 reasoning is subtle. But it certainly is not easier to create a
 correct procedure using analytic tools (except in the cookbook sense of
 plug-and-pray). If one is interested in insight, a combination of theory and  simulation procedure
-might well be the answer[^sequentially]
+might well be the answer[^sequentially].
 
 [^sequentially]: We can use a similar procedure to illustrate an aspect of the
     Bayesian procedure that Box and Tiao emphasize, its

diff --git a/source/what_is_probability.Rmd b/source/what_is_probability.Rmd
@@ -36,13 +36,13 @@ Not ready for review
 ## Introduction, what is probability?
 
 Uncertainty is part of life. Although there is not much we can do to answer questions such as,
-will I loose my job within the next year?, or what is my life expectency?, it is possible to
+will I lose my job within the next year?, or what is my life expectancy?, it is possible to
 answer questions such as, what is my chances to win the lottery? (exceedingly small).
 
 Scientists from any discipline that we can think of, also need to deal with uncertainties,
 to come up with specific answers when possible, or know the limits of what it is possible to know.
 
-You will encounter numerous examples of this kind throughout this book. In fatc, this
+You will encounter numerous examples of this kind throughout this book. In fact, this
 is what this book is all about.
 
 The uncertainty stems from different sources, some or all may be present in any given application.
@@ -115,6 +115,104 @@ on developing causality models and is still an exciting, active field of researc
 The rest of this chapter is about the general considerations that are important for principled
 reasoning in the presence of uncertainty.
 
+## Understand your problem
+
+In practice you will be asked to provide answers based on data. For example, you may be given data about customer
+behaviour in a large bank and asked to develop a model that will provide the probability of default of the customers in
+the bank. This is an important problem for all financial institutions - if an institution does not have a good credit risk model,
+it will either loose money by being too conservative in the way it lends money, or loose money bay taking on too much
-it will either loose money by being too conservative in the way it lends money, or loose money bay taking on too much
+it will either lose money by being too conservative in the way it lends money, or lose money by taking on too much
+of a risk. You can opt for applying a sophisticated model such as a deep neural network but you are almost guaranteed
-of a risk. You can opt for applying a sophisticated model such as a deep neural network but you are almost guaranteed
+of a risk. You can opt for applying an extremely complex "machine-learning" model with many parameters but you are almost guaranteed
+to come to grief. First study the problem and come to terms of all the many issues at stake. We speak of experience!
-to come to grief. First study the problem and come to terms of all the many issues at stake. We speak of experience!
+to come to grief. First study the problem and come to terms with the many issues at stake. We speak from experience!
+
+Let's illustrate the idea with a simple example. The teacher asks little Annie to solve the following problem: Ten
+sheep are on this side of the road and one sheep crosses to the other side, how many sheep remain on this side? Annie
+knows the answer of course, and replies, correctly, none. This quite agitates the teacher, who asks, there were ten sheep
+on this side of the road and one crosses over to the other side, how is it that you tell non remain? Annie replies,
-on this side of the road and one crosses over to the other side, how is it that you tell non remain? Annie replies,
+on this side of the road and one crosses over to the other side, why are you saying that none remain? Annie replies,
+You don't know sheep, if one crosses the road everyone else follows and none remain. We can confirm this, also from
+experience!
+
+It is easy to get the arithmetic right, but just as easy to get the problem wrong if you don't understand it.
+
+Please make sure you know what problem you have to solve. You may even run into situations where a company provides you
+with lots of data and then ask you to extract meaningful information from it. Our advice it, work with the company to
-with lots of data and then ask you to extract meaningful information from it. Our advice it, work with the company to
+with lots of data and then asks you to extract meaningful information from it. Our advice is, work with the company to
+first formulate a meaningful problem. Then you can direct your investigation to solving this problem.
+
+Work with the domain experts! You may also find that your expertise is in greater demand if you have specialised domain knowledge!
+
+## Understand your data
+
+You will often find that you spend more time trying to understand your data than solving the problem. During this
+investigation you will probably look at things like,
+
+1. Possible correlations between the variables.
+2. Is the data complete? Real-world data often has empty fields that are filled with `NaN`'s. How do you deal with this?
-2. Is the data complete? Real-world data often has empty fields that are filled with `NaN`'s. How do you deal with this?
+2. Is the data complete? Real-world data often has empty fields with missing data. How do you deal with this?
+   Do you discard these fields or do you impute values for them?
-   Do you discard these fields or do you impute values for them?
+   Do you discard these fields or do you try and estimate values for them?
+3. Do you have reason to believe that the information you need to solve the problem is in the data? This is a much-neglected
+   problem, perhaps because it is not easy to provide an answer. There are certainly situations where some of the
+   variables can act as a proxy for what is needed. This simply means that the information is hidden, but present in
+   the data. Precisely because it is not an easy question to answer, it is necessary to give it some thought and
+   most definitely keep it in mind while you are modelling.
+
+   One example that is often quoted in the literature is about a group of researchers that wanted to build a model to
+   visually distinguish between criminals and non-criminals. For this they used a dataset of photographs of known criminals
+   and non-criminals. For their criminals they relied on police mug shots. You have to seriously ask yourself whether
-   and non-criminals. For their criminals they relied on police mug shots. You have to seriously ask yourself whether
+   and non-criminals. For their criminals they relied on police mug shots - but of course they had to use other types of photograph for the non-criminals.  But even without that difference, you have to seriously ask yourself whether
+   you believe that one can tell whether a person is a criminal based on visual appearance.
+4. Is your data balanced? If not, how are you going to deal with it? If you are ever asked to develop a credit risk model,
+   you will have a vast quantity of non-defaulting examples and few, perhaps 5%, defaulting examples. No financial
+   institution will survive a high percentage of defaulting customers. How will you deal with it?
+5. Is you data biased? Because of historical reasons and social inequalities your data may be biased against minorities,
-5. Is you data biased? Because of historical reasons and social inequalities your data may be biased against minorities,
+5. Is your data biased? For historical reasons and because of social inequalities your data may be biased against minorities,
+   or on race, gender, etc. If this is the case your model will be seriously deficient. You may also find that the institution
+   you are working for has strict policies in place against the use of potentially harmful variables.
+
+   These raise serious ethical questions that the practitioner should be aware of.
+
+   Returning to the criminal detection problem mentioned above, it failed. Let's think of what the model does. Since it
+   it given samples of photographs of criminals and non-criminals, i.e. each photograph comes with the label, `criminal`
-   it given samples of photographs of criminals and non-criminals, i.e. each photograph comes with the label, `criminal`
+   is given samples of photographs of criminals and non-criminals, i.e. each photograph comes with the label, `criminal`
+   or `non-criminal`, it is adjusting its parameters in order to find the maximum correlation between samples belonging
+   to the same class, and to maximize the difference between the two classes. Your model will latch on to any feature that
+   satisfies these requirements, including unacceptable bias.
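Items 2 and 4 in the list above (missing values and class balance) are easy to check before any modelling starts. What follows is only a minimal sketch, assuming the data arrive as a list of dictionaries with `None` marking a missing field. The helper names `missing_fraction` and `class_balance`, the column names, and the toy records are all invented for illustration.

```python
from collections import Counter


def missing_fraction(rows, columns):
    """Fraction of rows in which each column is missing (None)."""
    n = len(rows)
    return {c: sum(r.get(c) is None for r in rows) / n for c in columns}


def class_balance(rows, label):
    """Share of each label value among rows where the label is present."""
    counts = Counter(r[label] for r in rows if r.get(label) is not None)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Hypothetical toy records, not real bank data.
customers = [
    {"income": 52000, "age": 41, "defaulted": False},
    {"income": None, "age": 29, "defaulted": False},
    {"income": 31000, "age": None, "defaulted": True},
    {"income": 78000, "age": 55, "defaulted": False},
]

missing = missing_fraction(customers, ["income", "age"])
balance = class_balance(customers, "defaulted")
print(missing)
print(balance)
```

The numbers only tell you that there is a problem. Whether to discard or estimate missing values, and how to handle a rare class, remain decisions to make with the domain experts.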
+
+## How is your model going to be used?
+
+The responsibility of the technical developer does not end with providing the model, or the analytics needed for the
+purpose. It is important to know how your model is going to be used. If you are to develop a system that need to
-purpose. It is important to know how your model is going to be used. If you are to develop a system that need to
+purpose. It is important to know how your model is going to be used. If you are to develop a system that needs to
+detect a terrorist before they board an aircraft, your thinking will be very different from when the result of an error
+by your system is relatively benign.
+
+The institution you work for may also need to be able to audit the output of your system. This brings us to the next issue.
+
+## Involve all stakeholders
+
+You cannot, and don't want to, take on the responsibility for all the choices outlined above, and this is by no means an
+exhaustive list! Work with the domain knowledge experts within the institution, involve all managers that have a stake in
+your system. And work within a team.
+
+Make sure you are that team member that everyone wants to work with, because of your expertise, because you are trustworthy,
+and easy to work with.
+
+## Follow good software practices
+
+This might appear to be an unusual topic for a book on statistics. However, this book emphasizes resampling methods. And that
+is all about coding. Anyway, you will become so much more marketable if you learn the basics of solid software practices.
+We don't have the space to do it here but we do want to stress its importance. Always keep in mind the following:
+
+1. Use versioning control, we recommend using git. If you regularly push to the git repo this will protect you from
-1. Use versioning control, we recommend using git. If you regularly push to the git repo this will protect you from
+1. Use version control; we recommend using git. If you regularly push to the git repo this will protect you from
+   accidental software loss. It also makes it easy to share your code. You want other people to use your code; it makes you
+   so much more useful!
+2. Ask someone elso to critically review your code. Even better if you work in an environment where there is a formal
-2. Ask someone elso to critically review your code. Even better if you work in an environment where there is a formal
+2. Ask someone else to critically review your code. Even better if you work in an environment where there is a formal
+   system of code review.
+3. Read other people's code. You will learn a lot.
+4. Always tests for your code. This means that you run your code on small examples for which you know the answer. Every
-4. Always tests for your code. This means that you run your code on small examples for which you know the answer. Every
+4. Always add tests for your code. This means that you run your code on small examples for which you know the answer. Every
+   time you make changes to your code you can check that no unwanted side effects occurred.
+5. Even better if you start your coding exercise with a small example for which you know the answer. This is known as
+   test-driven development and we do this all the time.
+6. Python has several formal testing frameworks that help you write tests. We recommend `pytest`.
+7. Always be critical of yourself. You will run into bugs and make errors. This is inevitable and you should learn how
+   to recognise non-obvious errors. Are the results what you expect, are they reasonable? Think of everything you
+   possibly can to find fault with the output of your system.
+8. Always be critical of yourself. Know that you can and will make mistakes. That is no shame; being sloppy and not following
+   good practices, is.
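To make items 4 to 6 concrete, here is a small sketch of tests on examples with known answers, written in the plain `test_*` function style that `pytest` collects. The function `bootstrap_means` is a hypothetical helper invented for this illustration, not code from this book. It assumes resampling with replacement using the standard-library `random` module.

```python
import random


def bootstrap_means(values, n_trials, seed=None):
    """Return the means of n_trials resamples (with replacement) of values."""
    rng = random.Random(seed)
    n = len(values)
    return [sum(rng.choices(values, k=n)) / n for _ in range(n_trials)]


def test_constant_data_gives_constant_means():
    # Every resample of identical values has the same mean: a known answer.
    assert bootstrap_means([3, 3, 3, 3], n_trials=50, seed=1) == [3.0] * 50


def test_means_stay_within_data_range():
    # A mean of resampled values can never fall outside the original range.
    means = bootstrap_means([1, 2, 3, 4, 5], n_trials=200, seed=2)
    assert len(means) == 200
    assert all(1 <= m <= 5 for m in means)


def test_same_seed_reproduces_results():
    # Fixing the seed makes a random procedure repeatable, so it can be tested.
    first = bootstrap_means([1, 2, 3], n_trials=10, seed=7)
    second = bootstrap_means([1, 2, 3], n_trials=10, seed=7)
    assert first == second
```

Saved in a file whose name starts with `test_`, these functions would be run by the `pytest` command. Re-running them after every change is the check for unwanted side effects described in item 4.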
+
 
 <!---