From 41fed3f8467a4cb8f0fad08cb7b7d1216f0e9ca5 Mon Sep 17 00:00:00 2001 From: Matthew Brett Date: Fri, 18 Oct 2024 15:13:27 +0100 Subject: [PATCH 01/10] Clean up trailing spaces after emphasis. --- source/confidence_1.Rmd | 4 ++-- source/confidence_2.Rmd | 8 +++----- source/correlation_causation.Rmd | 11 +++++------ source/framing_questions.Rmd | 10 +++++----- source/how_big_sample.Rmd | 4 ++-- source/inference_ideas.Rmd | 5 ++--- source/inference_intro.Rmd | 10 +++++----- source/intro.Rmd | 5 ++--- source/monte_carlo.Rmd | 4 ++-- source/point_estimation.Rmd | 6 ++---- source/preface_second.Rmd | 5 ++--- source/probability_theory_1a.Rmd | 15 +++++++-------- source/probability_theory_1b.Rmd | 17 +++++++---------- source/probability_theory_2_compound.Rmd | 3 +-- source/probability_theory_3.Rmd | 7 +++---- source/resampling_method.Rmd | 6 +++--- source/sampling_variability.Rmd | 4 ++-- source/significance.Rmd | 5 ++--- source/testing_counts_1.Rmd | 5 ++--- source/testing_counts_2.Rmd | 2 +- source/testing_measured.Rmd | 2 +- source/what_is_probability.Rmd | 18 +++++++++--------- 22 files changed, 70 insertions(+), 86 deletions(-) diff --git a/source/confidence_1.Rmd b/source/confidence_1.Rmd index 9a3bc2df..e1a71f7c 100644 --- a/source/confidence_1.Rmd +++ b/source/confidence_1.Rmd @@ -89,7 +89,7 @@ hand — from a universe whose characteristics one *knows* , one then can *deduce* the probability that the sample mean will fall within a given distance of the population mean. Intuitively, it *seems* as if one should also be able to reverse the process — to infer something about -the location of the population mean *from the sample mean* . But this +the location of the population mean *from the sample mean*. But this inverse inference turns out to be a slippery business indeed. Let's put it differently: It is all very well to say — as one logically @@ -266,7 +266,7 @@ If the pattern of the 10 apples is tight, you imagine the pattern of the likely locations of the population mean to be tight; if not, not. That is, *it is intuitively clear that there is some connection between how spread out are the sample observations and your confidence about the -location of the population mean* . For example, consider two patterns of +location of the population mean*. For example, consider two patterns of a thousand apples, one with twice the spread of another, where we measure spread by (say) the diameter of the circle that holds the inner half of the apples for each tree, or by the standard deviation. It makes diff --git a/source/confidence_2.Rmd b/source/confidence_2.Rmd index 4428f8c4..b4c473d4 100644 --- a/source/confidence_2.Rmd +++ b/source/confidence_2.Rmd @@ -100,7 +100,7 @@ Please notice that the distribution (universe) assumed at the beginning of this approach did not include the assumption that the distribution is centered on the sample mean or anywhere else. It is true that the sample mean is used *for purposes of reporting the location of the estimated -universe mean* . But despite how the subject is treated in the +universe mean*. But despite how the subject is treated in the conventional approach, the estimated population mean is not part of the work of constructing confidence intervals. Rather, the calculations apply in the same way to *all universes in the neighborhood of the @@ -1124,8 +1124,7 @@ dispersion as the sample. 
We can then say that *distributions centered at the two endpoints of the 95 percent confidence interval (each of them including a tail in the direction of the observed sample mean with 2.5 percent of the area), or even further away from the sample mean, will -produce the observed sample only 5 percent of the time or less* . - +produce the observed sample only 5 percent of the time or less*. The result of the second approach is more in the spirit of a hypothesis test than of the usual interpretation of confidence intervals. Another statement of the result of the second approach is: We postulate a given @@ -1173,8 +1172,7 @@ lies to the (say) right of it. As noted in the preview to this chapter, we do not learn about the reliability of sample estimates of the population mean (and other parameters) by logical inference from any one particular sample to any -one particular universe, because *in principle this cannot be done* . -Instead, in this second approach we investigate the behavior of various +one particular universe, because *in principle this cannot be done*. Instead, in this second approach we investigate the behavior of various universes at the borderline of the neighborhood of the sample, those universes being chosen on the basis of their resemblances to the sample. We seek, for example, to find the universes that would produce samples diff --git a/source/correlation_causation.Rmd b/source/correlation_causation.Rmd index f251ceb3..3af557a7 100644 --- a/source/correlation_causation.Rmd +++ b/source/correlation_causation.Rmd @@ -78,7 +78,7 @@ is, even in a controlled experiment there is often no way except subject-matter knowledge to avoid erroneous conclusions about causality. Nothing except substantive knowledge or scientific intuition would have led them to the recognition that it is the alcohol rather than the soda that made them drunk, -*as long as they always took soda with their drinks* . And no statistical +*as long as they always took soda with their drinks*. And no statistical procedure can suggest to them that they ought to experiment with the presence and absence of soda. If this is true for an experiment, it must also be true for an uncontrolled study. @@ -106,7 +106,7 @@ in a statement that has these important characteristics: observable so that the relationship will apply under a wide enough range of conditions to be considered useful or interesting. In other words, *the relationship must not require too many "if"s, "and"s, - and "but"s in order to hold* . For example, one might say that an + and "but"s in order to hold*. For example, one might say that an increase in income caused an increase in the birth rate if this relationship were observed everywhere. But, if the relationship were found to hold only in developed countries, among the educated @@ -130,7 +130,7 @@ in a statement that has these important characteristics: previous criterion for side conditions is that a plenitude of very restrictive side conditions may take the relationship out of the class of causal relationships, *even though the effects of the side - conditions are known* . This criterion of nonspuriousness concerns + conditions are known*. This criterion of nonspuriousness concerns variables that are as yet *unknown and unevaluated* but that have a *possible* ability to *upset* the observed association. @@ -799,8 +799,7 @@ occur if the I.Q. scores were ranked from best to worst (column 3) and worst to best (column 5). 
The extent of correlation (association) can thus be measured by whether the sum of the multiples of the observed *x* and *y* values is relatively much higher or much lower than are sums of randomly-chosen pairs of -*x* and *y* . - +*x* and *y*. ```{python echo=FALSE, eval=TRUE, results="asis", message=FALSE} import numpy as np import pandas as pd @@ -2336,7 +2335,7 @@ the fact that the player sometimes does not get a hit for an abnormally long period of time. One way of testing whether or not the coach is right is by comparing an average player's longest slump in a 100-at-bat season with the longest run of outs in the first card trial. Assume that -Slug is a player picked *at random* . Then compare Slug's longest +Slug is a player picked *at random*. Then compare Slug's longest slump — say, 10 outs in a row — with the longest cluster of a single simulated 100-at-bat trial with the cards, 9 outs. This result suggests that Slug's apparent slump might well have resulted by chance. diff --git a/source/framing_questions.Rmd b/source/framing_questions.Rmd index b5b33af5..2c0001a4 100644 --- a/source/framing_questions.Rmd +++ b/source/framing_questions.Rmd @@ -37,7 +37,7 @@ philosophy of statistical inference. Now we turn to inferential-statistical problems. Up until now, we have been estimating the complex probabilities of *known* universes — the -topic of *probability* . Now as we turn to problems in *statistics* , we +topic of *probability*. Now as we turn to problems in *statistics* , we seek to learn the characteristics of an unknown system — the basic probabilities of its simple events and parameters. (Here we note again, however, that in the process of dealing with them, all @@ -49,7 +49,7 @@ been drawn from it. For further discussion on the distinction between inferential statistics and probability theory, see @sec-resampling-method - @sec-what-is-probability. -This chapter begins the topic of *hypothesis testing* . The issue is: +This chapter begins the topic of *hypothesis testing*. The issue is: whether to adjudge that a particular sample (or samples) come(s) from a particular universe. A two-outcome yes-no universe is discussed first. Then we move on to "measured-data" universes, which are more complex @@ -372,7 +372,7 @@ sample. In the case of the medicine, the universe with which we compare the sample who took the medicine is the benchmark universe to which that sample would belong if the medicine had had no effect. This comparison leads to the benchmark (null) hypothesis that the sample comes from a population in which -the medicine (or other experimental treatment) seems to have *no effect* . It +the medicine (or other experimental treatment) seems to have *no effect*. It is to avoid confusion inherent in the term "null hypothesis" that I replace it with the term "benchmark hypothesis." @@ -464,7 +464,7 @@ Once you know exactly which probability-statistical question you want to ask — that is, exactly which probability you want to determine — the rest of the work is relatively easy. The stage at which you are most likely to make mistakes is in stating the question you want to answer in probabilistic terms. Though this -step is hard, *it involves no mathematics* . This step requires only *hard, -clear thinking* . You cannot beg off by saying "I have no brain for math!" To +step is hard, *it involves no mathematics*. This step requires only *hard, +clear thinking*. You cannot beg off by saying "I have no brain for math!" 
To flub this step is to admit that you have no brain for clear thinking, rather than no brain for mathematics. diff --git a/source/how_big_sample.Rmd b/source/how_big_sample.Rmd index 62a155ed..2db24216 100644 --- a/source/how_big_sample.Rmd +++ b/source/how_big_sample.Rmd @@ -141,7 +141,7 @@ perhaps 400; this number is selected from my general experience, but it is just a starting point. Then proceed through the first 400 numbers in a random-number table, marking down a *yes* for numbers 1-3 and *no* for numbers 4-10 (because 3/10 was your estimate of the proportion listening). Then add the number of -*yes* and *no* . Carry out perhaps ten sets of such trials, the results of +*yes* and *no*. Carry out perhaps ten sets of such trials, the results of which are in @tbl-phone-trials. +--------------+--------------+-------------+-------------------+ @@ -852,7 +852,7 @@ a small (>5%) chance that arose from the 50:50 world. Therefore, designate numbers 1-30 as *no* and 31-00 as *yes* in the random-number table (that is, 70 percent, as in your estimate based on your presample of ten), work through a trial sample size of fifty, and -count the number of *yeses* . Run through perhaps ten or fifteen trials, +count the number of *yeses*. Run through perhaps ten or fifteen trials, and reckon how often the observed number of *yeses* is >= 32 (the number you must exceed for a result you can rely on). In @tbl-cable-yes we see that a sample of fifty respondents, from a universe split 70-30, will show that many diff --git a/source/inference_ideas.Rmd b/source/inference_ideas.Rmd index 314bc4a8..4cbef50e 100644 --- a/source/inference_ideas.Rmd +++ b/source/inference_ideas.Rmd @@ -115,8 +115,7 @@ be male also; d) paths in the village stay much the same through a person's life; e) religious ritual changes little through the decades; f) your best guess about tomorrow's temperature or stock price is that will be the same as today's. This principle of constancy is related to David Hume's concept of -*constant conjunction* . - +*constant conjunction*. When my children were young, I would point to a tree on our lawn and ask: "Do you think that tree will be there tomorrow?" And when they would answer "Yes," I'd ask, "Why doesn't the tree fall?" That's a tough question to answer. @@ -402,7 +401,7 @@ mattresses, despite the wrong theory about the causation of plague; see publish. So far I have spoken only of *predictability* and not of other elements of -statistical knowledge such as *understanding* and *control* . This is simply +statistical knowledge such as *understanding* and *control*. This is simply because statistical *correlation* is the bed rock of most scientific understanding, and predictability. Later we will expand the discussion beyond predictability; it holds no sacred place here. diff --git a/source/inference_intro.Rmd b/source/inference_intro.Rmd index ca4c2fe9..3f884ed0 100644 --- a/source/inference_intro.Rmd +++ b/source/inference_intro.Rmd @@ -90,7 +90,7 @@ balls is not very small, because if there are only a few green balls and many other-colored balls, it would be unusual — that is, the event would have a low probability — to draw a green ball. Not impossible, but unlikely. And *we can compute the probability of drawing a green ball* — or any other combination of -colors — *for different assumed compositions within the box* . So the knowledge +colors — *for different assumed compositions within the box*. 
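To make the computation just described concrete, here is a minimal simulation sketch. The particular box compositions (10, 50 and 90 percent green) and the sample asked about (two or more greens in three draws) are arbitrary illustrative choices, not taken from the passage above:

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical compositions for the box: the proportion of green balls.
compositions = [0.1, 0.5, 0.9]
n_trials = 10_000

for p_green in compositions:
    # 10,000 simulated samples of three draws, with replacement;
    # 1 stands for a green ball, 0 for any other color.
    draws = rng.choice([1, 0], p=[p_green, 1 - p_green], size=(n_trials, 3))
    # In what proportion of the samples were two or more draws green?
    p_two_plus = np.mean(np.sum(draws, axis=1) >= 2)
    print(f'{p_green:.0%} green: P(2+ green in 3 draws) is about {p_two_plus:.3f}')
```

The same few lines serve for any assumed composition; only the probability passed to the random draw changes.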
So the knowledge that the sampling process is random greatly increases our ability — or our confidence in our ability — to infer the contents of the box. @@ -103,7 +103,7 @@ this idea shortly. There are several kinds of questions one might ask about the contents of the box. One general category includes questions about our best guesses of the -box's contents — that is, questions of *estimation* . Another category includes +box's contents — that is, questions of *estimation*. Another category includes questions about our *surety* of that description, and our surety that the contents are similar or different from the contents of other boxes; the consideration of surety follows after estimates are made. The estimation @@ -190,7 +190,7 @@ canonical procedures for hypothesis testing and for the finding of confidence intervals in the chapters on those subjects. The discussion so far has been in the spirit of what is known as *hypothesis -testing* . The result of a hypothesis test is a decision about whether or not +testing*. The result of a hypothesis test is a decision about whether or not one believes that the sample is likely to have been drawn randomly from the "benchmark universe" X. The logic is that if the probability of such a sample coming from that universe is low, we will then choose to believe the @@ -235,7 +235,7 @@ that you are told that samples of balls are alternately drawn from one of two and the other with 80 percent green balls. Now you are shown a sample of nine green and one red balls drawn from one of those buckets. On the basis of your sample you can then say how probable it is that the sample came *from one or -the other universe* . You proceed by computing the probabilities (often called +the other universe*. You proceed by computing the probabilities (often called the *likelihoods* in this situation) that each of those two universes would individually produce the observed samples — probabilities that you could arrive at with resampling, with Pascal's Triangle, or with a table of binomial @@ -371,7 +371,7 @@ decisions about whether to remove a basketball player from the game or to produce a new product. Some key points: 1) In statistical inference as in all sound thinking, one's -*purpose is central* . All judgments should be made relative to that purpose, +*purpose is central*. All judgments should be made relative to that purpose, and in light of costs and benefits. (This is the spirit of the Neyman-Pearson approach). 2) One cannot avoid making judgments; the process of statistical inference cannot ever be perfectly routinized or objectified. Even in science, diff --git a/source/intro.Rmd b/source/intro.Rmd index 1e09b116..0b3ed302 100644 --- a/source/intro.Rmd +++ b/source/intro.Rmd @@ -256,7 +256,7 @@ event will occur. Second, the specific elements of the overall decision-making process taught in this book belong to the interrelated subjects of *probability theory* and -*statistics* . Though probabilistic and statistical theory ultimately is +*statistics*. Though probabilistic and statistical theory ultimately is intended to be part of the general decision-making process, often only the estimation of probabilities is done systematically, and the rest of the decision-making process — for example, the decision whether or not to proceed @@ -344,8 +344,7 @@ part of students is *not* the root of the problem. Scan this book and you will find almost no formal mathematics. 
Yet nearly every student finds the subject very difficult — as difficult as anything taught at universities. The root of the difficulty is that the -*subject matter* is extremely difficult. Let's find out *why* . - +*subject matter* is extremely difficult. Let's find out *why*. It is easy to find out with high precision which movie is playing tonight at the local cinema; you can look it up on the web or call the cinema and ask. But consider by contrast how difficult it is to determine with diff --git a/source/monte_carlo.Rmd b/source/monte_carlo.Rmd index 1ae7c0f8..2ba61353 100644 --- a/source/monte_carlo.Rmd +++ b/source/monte_carlo.Rmd @@ -40,13 +40,13 @@ This is what we shall mean by the term *Monte Carlo simulation* when discussing problems in probability: *Using the given data-generating mechanism (such as a coin or die) that is a model of the process you wish to understand, produce new samples of simulated data, and examine -the results of those samples* . That's it in a nutshell. In some cases, +the results of those samples*. That's it in a nutshell. In some cases, it may also be appropriate to amplify this procedure with additional assumptions. This definition fits both problems in pure probability as well as problems in statistics, but in the latter case the process is called -*resampling* . The reason that the same definition fits is that *at the +*resampling*. The reason that the same definition fits is that *at the core of every problem in inferential statistics lies a problem in probability* ; that is, the procedure for handling every statistics problem is the procedure for handling a problem in probability. (There is diff --git a/source/point_estimation.Rmd b/source/point_estimation.Rmd index b83dcc7b..f8af2136 100644 --- a/source/point_estimation.Rmd +++ b/source/point_estimation.Rmd @@ -207,8 +207,7 @@ the particular set of facts about the situation at hand. ## Criteria of estimates How should one judge the soundness of the process that produces an -estimate? General criteria include *representativeness* and *accuracy* . -But these are pretty vague; we'll have to get more specific. +estimate? General criteria include *representativeness* and *accuracy*. But these are pretty vague; we'll have to get more specific. ### Unbiasedness @@ -254,8 +253,7 @@ population. (Why not? Well, why should they split evenly? There is no general reason why they should.) But if the sample proportions do *not* equal the population proportion, we can say that the extent of the difference between the two sample proportions and the population -proportion will be *identical but in the opposite direction* . - +proportion will be *identical but in the opposite direction*. If the population proportion is 600/1000 = 60 percent, and one sample's proportion is 340/500 = 68 percent, then the other sample's proportion must be (600-340 = 260)/500 = 52 percent. So if in the very long run you diff --git a/source/preface_second.Rmd b/source/preface_second.Rmd index 0b287ef0..0d8ed5cd 100644 --- a/source/preface_second.Rmd +++ b/source/preface_second.Rmd @@ -188,7 +188,7 @@ taxpayers and their possessions. Up until the beginning of the 20th century, the term *statistic* meant the number of something — soldiers, births, taxes, or what-have-you. In many cases, the term *statistic* still means the number of something; the most important statistics for the United -States are in the *Statistical Abstract of the United States* . These numbers +States are in the *Statistical Abstract of the United States*. 
These numbers are now known as descriptive statistics. This book will not deal at all with the making or interpretation of descriptive statistics, because the topic is handled very well in most conventional statistics texts. @@ -207,8 +207,7 @@ descriptive statistics originating from sample surveys, but also the numbers arising from experiments. Statisticians began to apply the theory of probability to the accuracy of the data arising from sample surveys and experiments, and that became the theory of *inferential -statistics* . - +statistics*. Here we find a guidepost: probability theory and statistics are relevant whenever there is uncertainty about events occurring in the world, or in the numbers describing those events. diff --git a/source/probability_theory_1a.Rmd b/source/probability_theory_1a.Rmd index 78b46ece..3764076b 100644 --- a/source/probability_theory_1a.Rmd +++ b/source/probability_theory_1a.Rmd @@ -158,7 +158,7 @@ any probability problem: *theoretical-deductive* and *empirical* , each of which has two sub-types. These concepts have complicated links with the concept of "frequency series" discussed earlier. -- *Empirical Methods* . One empirical method is to look at *actual cases in +- *Empirical Methods*. One empirical method is to look at *actual cases in nature* — for example, examine all (or a sample of) the families in Brazil that have four children and count the proportion that have three girls among them. (This is the most fundamental process in science and in @@ -171,7 +171,7 @@ concept of "frequency series" discussed earlier. fashion as to produce hypothetical experience with how the simple elements behave. This is the heart of the resampling method, as well as of physical simulations such as wind tunnels. -- *Theoretical Methods* . The most fundamental theoretical approach is to +- *Theoretical Methods*. The most fundamental theoretical approach is to resort to first principles, working with the elements in their full deductive simplicity, and examining all possibilities. This is what we do when we use a tree diagram to calculate the probability of three girls in @@ -186,7 +186,7 @@ The formulaic approach is a theoretical method that aims to avoid the inconvenience of resorting to first principles, and instead uses calculation shortcuts that have been worked out in the past. -*What the Book Teaches* . This book teaches you the empirical method +*What the Book Teaches*. This book teaches you the empirical method using hypothetical cases. Formulas can be misleading for most people in most situations, and should be used as a shortcut only when a person understands exactly which first principles are embodied in the formulas. @@ -222,7 +222,7 @@ The idea of a census and a population and a sample above? For every sample there must also be a universe "behind" it. But "universe" is harder to define, partly because it is often an *imaginary* concept. A universe is the collection of things or people -*that you want to say that your sample was taken from* . A universe can +*that you want to say that your sample was taken from*. A universe can be finite and well defined — "all live holders of the Congressional Medal of Honor," "all presidents of major universities," "all billion-dollar corporations in the United States." Of course, these @@ -365,8 +365,7 @@ winning depends upon the weather; on a nice day we estimate a 0.65 chance of winning, on a nasty (rainy or snowy) day a chance of 0.55. 
It is obvious that we then want to know the chance of a nice day, and we estimate a probability of 0.7. Let's now ask the probability that both will happen — *it will be -a nice day and the Commanders will win* . - +a nice day and the Commanders will win*. Before getting on with the process of estimation itself, let's tarry a moment to discuss the probability estimates. Where do we get the notion that the probability of a nice day next Sunday is 0.7? We might have done so by @@ -870,7 +869,7 @@ upon" or "given that." That is, the vertical line indicates a "conditional probability," a concept we must consider in a minute. The multiplication rule is a formula that produces the probability of -the *combination (juncture) of two or more events* . More discussion of +the *combination (juncture) of two or more events*. More discussion of it will follow below. ## Conditional and unconditional probabilities {#sec-cond-uncond} @@ -1231,7 +1230,7 @@ is different than the original probability. And an illustrative joke: The best way to avoid there being a live bomb aboard your plane flight is to take an inoperative bomb aboard with you; the probability of one bomb is very low, and by the multiplication rule, -*the probability of two bombs is very very low* . Two hundred years ago +*the probability of two bombs is very very low*. Two hundred years ago the same joke was told about the midshipman who, during a battle, stuck his head through a hole in the ship's side that had just been made by an enemy cannon ball because he had heard that the probability of two diff --git a/source/probability_theory_1b.Rmd b/source/probability_theory_1b.Rmd index 2dc084cc..3cdd96ef 100644 --- a/source/probability_theory_1b.Rmd +++ b/source/probability_theory_1b.Rmd @@ -68,7 +68,7 @@ See section @sec-cond-uncond for an explanation of this notation. In this case we need only one set of two buckets to make all the estimates. Independence means that the elements are drawn from 2 or more *separate -sets of possibilities* . **That is, $P(A | B) = P(A | \ \hat{} B) = P(A)$ and +sets of possibilities*. **That is, $P(A | B) = P(A | \ \hat{} B) = P(A)$ and vice versa**. @@ -213,8 +212,7 @@ unavoidable when the situation gets more complex. But in this simple case, you are likely to see that you can compute the probability by *adding* the .7 probability of a nice day and the .2 probability of a rainy day to get the desired probability. This procedure of formulaic -deductive probability theory is called the *addition rule* . - +deductive probability theory is called the *addition rule*. ## The addition rule The addition rule applies to *mutually exclusive* outcomes — that is, @@ -222,7 +220,7 @@ the case where if one outcome occurs, the other(s) cannot occur; one event implies the absence of the other when events are mutually exclusive. Green and red coats are mutually exclusive if you never wear more than one coat at a time. If there are only two possible -mutually-exclusive outcomes, the outcomes are *complementary* . It may +mutually-exclusive outcomes, the outcomes are *complementary*. It may be helpful to note that mutual exclusivity equals total dependence; if one outcome occurs, the other cannot. Hence we write formally that @@ -384,7 +382,7 @@ in the list must have an equal chance, to distinguish the coin falling on its side from the other possibilities (so ignore it). Or, if it is impossible to make the probabilities equal, make special allowance for inequality. 
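To make the list idea concrete, here is a minimal sketch that enumerates such a list in full. The two dice are chosen here only as a familiar case where every element of the list has an equal chance; the example is not from the passage above:

```python
from itertools import product

# The sample space for a throw of two dice: all 36 equally likely pairs.
sample_space = list(product(range(1, 7), repeat=2))

# The "success" elements: pairs whose faces sum to 7.
successes = [pair for pair in sample_space if sum(pair) == 7]

# 6 successes out of 36 equally likely elements: probability 1/6.
print(len(successes), '/', len(sample_space), '=',
      len(successes) / len(sample_space))
```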
Working directly with the sample space is the *method of -first principles* . The idea of a list was refined to the idea of sample +first principles*. The idea of a list was refined to the idea of sample space, and "for" and "against" were refined to the "success" and "failure" elements among the total elements. @@ -406,8 +404,7 @@ analogy requires judgment. A *Venn diagram* is another device for displaying the elements that make up an event. But unlike a tree diagram, it does not show the sequence of those elements; rather, it shows the *extent of overlap among various -classes of elements* . - +classes of elements*. A Venn diagram expresses by *areas* (especially rectangular Venn diagrams) the numbers at the end of the branches in a tree. diff --git a/source/probability_theory_2_compound.Rmd b/source/probability_theory_2_compound.Rmd index 1152f5fe..c65a42c8 100644 --- a/source/probability_theory_2_compound.Rmd +++ b/source/probability_theory_2_compound.Rmd @@ -868,8 +868,7 @@ message(n_triples / 10000) To estimate the probability of getting a two-pair hand, we revert to the original program (counting pairs), except that we examine all the results in the score-keeping vector `z` for hands in which we had *two* pairs, -instead of *one* . - +instead of *one*. ::: {.notebook name="two_pairs" title="Two pairs"} ::: nb-only diff --git a/source/probability_theory_3.Rmd b/source/probability_theory_3.Rmd index 28689082..fbaa9671 100644 --- a/source/probability_theory_3.Rmd +++ b/source/probability_theory_3.Rmd @@ -189,7 +189,7 @@ a *good enough* approximation for your purposes.) The probability that a fair coin will turn up heads is .50 or 50-50, close to the probability of having a daughter. Therefore, flip a coin in groups of four -flips, and count how often three of the flips produce *heads* . (You must +flips, and count how often three of the flips produce *heads*. (You must decide in *advance* whether three heads means three girls or three boys.) It is as simple as that. @@ -317,7 +317,7 @@ the computer solution we would have used the statement It is important that, in this case, in contrast to what we did in the example from @sec-one-pair (the introductory poker example), the card is *replaced* each time so that each card is dealt from a full deck. This method -is known as *sampling with replacement* . One samples with replacement whenever +is known as *sampling with replacement*. One samples with replacement whenever the successive events are *independent* ; in this case we assume that the chance of having a daughter remains the same (1 girl in 2 births) no matter what sex the previous births were [^what-sex]. But, if the first card dealt is @@ -340,8 +340,7 @@ replacement, it is *impossible* to obtain 4 "daughters" with the 6-card deck because there are only 3 "daughters" in the deck. To repeat, then, whenever you want to estimate the probability of some series of events where each event is independent of the other, you must sample *with -replacement* . - +replacement*. ## Variations of the daughters problem In later chapters we will frequently refer to a problem which is diff --git a/source/resampling_method.Rmd b/source/resampling_method.Rmd index a8caa3d0..2510a2ca 100644 --- a/source/resampling_method.Rmd +++ b/source/resampling_method.Rmd @@ -757,18 +757,18 @@ pure probability, and in teaching *beginning rather than advanced* students to solve problems this way. 
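As a taste of what solving a simple problem this way looks like, here is a minimal sketch of the three-girls-in-four-births problem discussed above, with the coin flips replaced by random draws and assuming, as the text does, a 50-50 chance of a girl at each birth:

```python
import numpy as np

rng = np.random.default_rng()

n_trials = 10_000
n_three_girls = 0

for i in range(n_trials):
    # One simulated family of four births, drawn *with replacement*
    # because each birth is independent of the others.
    births = rng.choice(['girl', 'boy'], size=4)
    if np.sum(births == 'girl') == 3:
        n_three_girls += 1

print(n_three_girls / n_trials)  # near the exact answer, 4/16 = 0.25
```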
(Here it is necessary to emphasize that the resampling method is used to *solve the problems themselves rather than as a demonstration device to teach the notions found in the -standard conventional approach* . Simulation has been used in elementary +standard conventional approach*. Simulation has been used in elementary courses in the past, but only to demonstrate the operation of the analytical mathematical ideas. That is very different than using the resampling approach to solve statistics problems themselves, as is done here.) Once we get rid of the formulas and tables, we can see that statistics -is a matter of *clear thinking, not fancy mathematics* . Then we can get +is a matter of *clear thinking, not fancy mathematics*. Then we can get down to the business of learning how to do that clear statistical thinking, and putting it to work for you. *The study of probability* is purely mathematics (though not necessarily formulas) and technique. But -*statistics has to do with meaning* . For example, what is the meaning +*statistics has to do with meaning*. For example, what is the meaning of data showing an association just discovered between a type of behavior and a disease? Of differences in the pay of men and women in your firm? Issues of causation, acceptability of control, and design of diff --git a/source/sampling_variability.Rmd b/source/sampling_variability.Rmd index d29c1d46..b11097a8 100644 --- a/source/sampling_variability.Rmd +++ b/source/sampling_variability.Rmd @@ -52,7 +52,7 @@ Perhaps the most important idea for sound statistical inference — the section of the book we are now beginning, in contrast to problems in probability, which we have studied in the previous chapters — is recognition of the *presence of variability in the results of small -samples* . The fatal error of relying on too-small samples is all too +samples*. The fatal error of relying on too-small samples is all too common among economic forecasters, journalists, and others who deal with trends and public opinion. Athletes, sports coaches, sportswriters, and fans too frequently disregard this principle both in their decisions and @@ -478,7 +478,7 @@ best. How did their rookie season compare to their "true" average? ::: -The explanation is the presence of *variability* . And lack of recognition of +The explanation is the presence of *variability*. And lack of recognition of the role of variability is at the heart of much fallacious reasoning. Being alert to the role of variability is crucial. diff --git a/source/significance.Rmd b/source/significance.Rmd index c84db58d..2e0731de 100644 --- a/source/significance.Rmd +++ b/source/significance.Rmd @@ -44,8 +44,7 @@ they have ever heard of a real case where a dog ate somebody's homework. You are a teacher, and a student comes in without homework and says that a dog ate the homework. It could have happened — your survey reports that it really has happened in three lifetimes out of a million. But the event happens *only very -infrequently* . - +infrequently*. Therefore, you probably conclude that because the event is so unlikely, something else must have happened — and the likeliest alternative is that the student did not do the homework. The logic is that if an event seems very @@ -177,7 +176,7 @@ series of tosses: At *some* point you examine it to see if it has two heads. 
But if your investigation is negative, in the absence of an indication *other than the behavior in question* , you continue to believe that there is no explanation and you assume that the event is "chance" and *should not be acted -upon* . In the same way, a coach might *ask* a player if there is an +upon*. In the same way, a coach might *ask* a player if there is an explanation for the many misses. But if the player answers "no," the coach should not bench him. (There are difficulties here with truth-telling, of course, but let that go for now.) diff --git a/source/testing_counts_1.Rmd b/source/testing_counts_1.Rmd index 57db5ffe..7add4061 100644 --- a/source/testing_counts_1.Rmd +++ b/source/testing_counts_1.Rmd @@ -60,7 +60,7 @@ is strong reason to believe *a priori* that the difference between the benchmark (null) universe and the sample will be in a given direction — for example if you hypothesize that the sample mean will be *smaller* than the mean of the benchmark universe — you should then -employ a *one-tailed test* . If you do *not* have strong basis for such +employ a *one-tailed test*. If you do *not* have strong basis for such a prediction, use the *two-tailed* test. As an example, when a scientist tests a new medication, his/her hypothesis would be that the number of patients who get well will be higher in the treated group than in the control @@ -1786,8 +1786,7 @@ to employ another sort of test for a slightly more precise evaluation.) But if the goal is a decision on which type of ration to buy for a small farm and they are the same price, just go ahead and buy ration A because, even if it is no better than ration B, you -have strong evidence that it is *no worse* . - +have strong evidence that it is *no worse*. ::: How about if we investigate further and find that 4 of 40 *elms* fell, but only -one of 60 *oaks* , and ours is an oak tree. Should we consider that oaks and +one of 60 *oaks*, and ours is an oak tree. Should we consider that oaks and elms have different chances of falling? Proceeding a bit further, we can think of the question as: Should we or should we not consider oaks and elms as different? This is the type of statistical inference called "hypothesis diff --git a/source/inference_intro.Rmd b/source/inference_intro.Rmd index 3f884ed0..72505df6 100644 --- a/source/inference_intro.Rmd +++ b/source/inference_intro.Rmd @@ -96,9 +96,9 @@ confidence in our ability — to infer the contents of the box. Let us note well the strategy of the previous paragraph: *Ask about the probability that one or more various possible contents of the box (the -"universe") will produce the observed sample* , on the assumption that the +"universe") will produce the observed sample*, on the assumption that the sample was drawn randomly. *This is the central strategy of all statistical -inference* , though I do not find it so stated elsewhere. We shall come back to +inference*, though I do not find it so stated elsewhere. We shall come back to this idea shortly. There are several kinds of questions one might ask about the contents of the @@ -150,7 +150,7 @@ other samples as well. It is of the highest importance to recognize that without additional knowledge (or assumption) one cannot make any statements about the probability of the -sample having come from *any particular universe* , on the basis of the sample +sample having come from *any particular universe*, on the basis of the sample evidence. (Better read that last sentence again.) 
We can only speak about the probability that a particular universe *will produce* the observed sample, a very different matter. This issue will arise again very sharply in the context diff --git a/source/point_estimation.Rmd b/source/point_estimation.Rmd index f8af2136..f69aa1c3 100644 --- a/source/point_estimation.Rmd +++ b/source/point_estimation.Rmd @@ -451,7 +451,7 @@ if you wish to compare two sets of data where the distributions of observations overlap each other, comparing the means of the two distributions can often help you better understand the matter. -Another complication is the confusion between *description* and *estimation* , +Another complication is the confusion between *description* and *estimation*, which makes it difficult to decide where to place the topic of descriptive statistics in a textbook. For example, compare the mean income of all men in the U. S., as measured by the decennial census. This mean of the universe can diff --git a/source/probability_theory_1a.Rmd b/source/probability_theory_1a.Rmd index 3764076b..78fcd6c3 100644 --- a/source/probability_theory_1a.Rmd +++ b/source/probability_theory_1a.Rmd @@ -154,7 +154,7 @@ sorts of probabilities. ## Theoretical and historical methods of estimation As introduced in @sec-probability-ways, there are two general ways to tackle -any probability problem: *theoretical-deductive* and *empirical* , each of +any probability problem: *theoretical-deductive* and *empirical*, each of which has two sub-types. These concepts have complicated links with the concept of "frequency series" discussed earlier. diff --git a/source/probability_theory_1b.Rmd b/source/probability_theory_1b.Rmd index 3cdd96ef..6a34d890 100644 --- a/source/probability_theory_1b.Rmd +++ b/source/probability_theory_1b.Rmd @@ -360,7 +360,7 @@ outcomes occur. ## The Concept of Sample Space -The formulaic approach begins with the idea of *sample space* , which is +The formulaic approach begins with the idea of *sample space*, which is the set of all possible outcomes of the "experiment" or other situation that interests us. Here is a formal definition from Goldberg [-@goldberg1986probability, page 46]: diff --git a/source/probability_theory_3.Rmd b/source/probability_theory_3.Rmd index fbaa9671..76bec8f4 100644 --- a/source/probability_theory_3.Rmd +++ b/source/probability_theory_3.Rmd @@ -630,7 +630,7 @@ can have only *two outcomes* (boy or girl, heads or tails), "binomial" meaning errors resulting from incorrect pigeonholing of problems. A fundamental property of binomial processes is that the individual trials are -*independent* , a concept discussed earlier. A binomial sampling process is a +*independent*, a concept discussed earlier. A binomial sampling process is a *series* of binomial (one-of-two-outcome) events about which one may ask many sorts of questions — the probability of exactly X heads ("successes") in N trials, or the probability of X or more "successes" in N trials, and so on. diff --git a/source/probability_theory_4_finite.Rmd b/source/probability_theory_4_finite.Rmd index 10bd71dc..c8f4a9f4 100644 --- a/source/probability_theory_4_finite.Rmd +++ b/source/probability_theory_4_finite.Rmd @@ -24,7 +24,7 @@ source("_common.R") ## Introduction -The examples in @sec-infinite-universes dealt with *infinite universes* , in +The examples in @sec-infinite-universes dealt with *infinite universes*, in which the probability of a given simple event is unaffected by the outcome of the previous simple event. 
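The practical difference shows up directly in simulation code. Here is a minimal sketch, reusing the six-card deck of the daughters example above (three "girl" cards and three "boy" cards), that contrasts the two sampling regimes:

```python
import numpy as np

rng = np.random.default_rng()

# A small finite universe: the six-card deck from the daughters
# example, three "girl" cards and three "boy" cards.
deck = ['girl'] * 3 + ['boy'] * 3
n_trials = 10_000

for replace in (True, False):
    n_four_girls = 0
    for i in range(n_trials):
        hand = rng.choice(deck, size=4, replace=replace)
        if np.sum(hand == 'girl') == 4:
            n_four_girls += 1
    print(f'replace={replace}: P(4 girls) is about {n_four_girls / n_trials:.4f}')

# With replacement, four "girls" appear in about (1/2)**4, or 6%, of
# trials; without replacement the event is impossible, because the
# deck holds only three "girl" cards.
```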
But now we move on to finite universes, situations in which you begin with a *given set of objects* whose number is not enormous — @@ -532,7 +532,7 @@ kk outcomes, several of each item.** What is the probability of getting an ordered series of *four girls and then -one boy* , from a universe of 25 girls and 25 boys? This illustrates Case 3 +one boy*, from a universe of 25 girls and 25 boys? This illustrates Case 3 above. Clearly we can use the same sampling mechanism as in the example @sec-four-girls-one-boy, but now we record "yes" for a smaller number of composite events. @@ -1303,6 +1303,6 @@ unrelated. This completes the discussion of problems in probability — that is, problems where we assume that the structure is known. Whereas @sec-infinite-universes -dealt with samples drawn from universes considered *not finite* , this chapter +dealt with samples drawn from universes considered *not finite*, this chapter deals with problems drawn from *finite universes* and therefore you *sample without replacement*. diff --git a/source/reliability_average.Rmd b/source/reliability_average.Rmd index 022acd8c..95ebd4e8 100644 --- a/source/reliability_average.Rmd +++ b/source/reliability_average.Rmd @@ -101,7 +101,7 @@ include_svg('diagrams/pop_prop_disp.svg') Perhaps it will help to clarify the issue of estimating dispersion if we consider this: If we compare estimates for a second sample based on a) the -*population* , versus b) the *first sample* , the former will be more accurate +*population*, versus b) the *first sample*, the former will be more accurate than the latter, because of the sampling variation in the first sample that affects the latter estimate. But we cannot estimate that sampling variation without knowing more about the population. diff --git a/source/resampling_method.Rmd b/source/resampling_method.Rmd index 2510a2ca..ba64fa19 100644 --- a/source/resampling_method.Rmd +++ b/source/resampling_method.Rmd @@ -752,7 +752,7 @@ Better examples. These are out of date. But those five examples are all complex problems. This book and its earlier editions break new ground by using this method for *simple -rather than complex problems* , especially in statistics rather than +rather than complex problems*, especially in statistics rather than pure probability, and in teaching *beginning rather than advanced* students to solve problems this way. (Here it is necessary to emphasize that the resampling method is used to *solve the problems themselves diff --git a/source/sampling_variability.Rmd b/source/sampling_variability.Rmd index b11097a8..19dff7ae 100644 --- a/source/sampling_variability.Rmd +++ b/source/sampling_variability.Rmd @@ -35,7 +35,7 @@ Ford." > The Plymouth Horizon: "The disaster of all disasters. That should've been painted bright yellow. What a lemon." -(From *Washington Post Magazine* , May 17, 1992, p. 19) +(From *Washington Post Magazine*, May 17, 1992, p. 19) Do the quotes above convince you that Japanese cars are better than American? Has Debra got enough evidence to reach the conclusion she now holds? That sort diff --git a/source/significance.Rmd b/source/significance.Rmd index 2e0731de..44080f4a 100644 --- a/source/significance.Rmd +++ b/source/significance.Rmd @@ -65,15 +65,15 @@ Kentucky in 1987. Out of 219 questions, 211 of the answers were identical, including many that were wrong. Student A was a high school athlete in Kentucky who had failed two previous SAT exams, and Student B thought he saw Student A copying from him. Should one believe that Student A cheated? 
(*The Washington -Post* , April 19, 1992, p. D2.) +Post*, April 19, 1992, p. D2.) You say to yourself: It would be most unlikely that the two test-takers would answer that many questions identically by chance — and we can compute how unlikely that event would be. Because that event is so unlikely, we therefore conclude that one or both cheated. And indeed, the testing service invalidated the athlete's exam. On the other hand, if all the questions that were answered -identically were *correct* , the result might not be unreasonable. If we knew -in how many cases they made the *same mistakes* , the inquiry would have been +identically were *correct*, the result might not be unreasonable. If we knew +in how many cases they made the *same mistakes*, the inquiry would have been clearer, but the newspaper did not contain those details. The court is hearing a murder case. There is no eye-witness, and the evidence @@ -103,8 +103,8 @@ Note: You don't attach the same meaning to any *other* permutation (say 3, 6, 7, 7, and king of various suits), even though that permutation is just as rare — unless the person announced exactly that permutation in advance. -Indeed, even if the person says *nothing* , you will be surprised at a royal -flush, because this hand has *meaning* , whereas another given set of five +Indeed, even if the person says *nothing*, you will be surprised at a royal +flush, because this hand has *meaning*, whereas another given set of five cards do not have any special meaning. You see six Volvos in one home's driveway, and you conclude that it is a Volvo @@ -174,7 +174,7 @@ So, how should one proceed? Perhaps proceed the same way as with a coin that keeps coming down heads a very large proportion of the throws, over a long series of tosses: At *some* point you examine it to see if it has two heads. But if your investigation is negative, in the absence of an indication *other -than the behavior in question* , you continue to believe that there is no +than the behavior in question*, you continue to believe that there is no explanation and you assume that the event is "chance" and *should not be acted upon*. In the same way, a coach might *ask* a player if there is an explanation for the many misses. But if the player answers "no," the coach @@ -183,7 +183,7 @@ course, but let that go for now.) The key point for the basketball case and other repetitive situations is not to judge that there is an unusual explanation from the behavior of a *single -sample alone* , just as with a short sequence of stock-price changes. +sample alone*, just as with a short sequence of stock-price changes. We all need to learn that "irregular" (a good word here) sequences are less unusual than they seem to the naked intuition. A streak of 10 out of 12 misses diff --git a/source/testing_counts_1.Rmd b/source/testing_counts_1.Rmd index 7add4061..b188ccd4 100644 --- a/source/testing_counts_1.Rmd +++ b/source/testing_counts_1.Rmd @@ -472,7 +472,7 @@ Notice that the strength of the evidence for the effectiveness of the radiation treatment depends upon the original question: whether or not the treatment had *any* effect on the sex of the fruit fly, which is a two-tailed question. If there were reason to believe at the start that -the treatment could increase *only* the number of *males* , then we +the treatment could increase *only* the number of *males*, then we would focus our attention on the result that in only `r int2tw(n_25_gte_14)` of the twenty-five trials were fourteen or more males. 
There would then be only a `r n_25_gte_14`/25 = @@ -1127,7 +1127,7 @@ benchmark (null) hypothesis that the flies came from a universe with a one-to-one sex ratio, and the poll data problem also compared results to a 50-50 hypothesis. The calves problem also compared the results to a single benchmark universe — a proportion of 100/206 females. Now we want to compare -*two samples with each other* , rather than comparing one sample with +*two samples with each other*, rather than comparing one sample with a hypothesized universe. That is, in this example we are not comparing one sample to a benchmark universe, but rather asking whether *both* samples come from the *same* universe. The universe from which both samples come, *if* both diff --git a/source/testing_counts_2.Rmd b/source/testing_counts_2.Rmd index 22d74817..e0ed29ec 100644 --- a/source/testing_counts_2.Rmd +++ b/source/testing_counts_2.Rmd @@ -29,7 +29,7 @@ around us is difficult to understand, and spoon-fed mathematical simplifications which you manipulate mechanically simply mislead you into thinking you understand that about which you have not got a clue. -The good news is that you — and that means *you* , even if you say you are "no +The good news is that you — and that means *you*, even if you say you are "no good at mathematics" — can understand these problems with a layperson's hard thinking, even if you have no mathematical background beyond arithmetic and you think that you have no mathematical capability. That's because the difficulty diff --git a/source/testing_measured.Rmd b/source/testing_measured.Rmd index 0b0dad1a..ed553efc 100644 --- a/source/testing_measured.Rmd +++ b/source/testing_measured.Rmd @@ -78,7 +78,7 @@ same, then each of the observed weight gains came from the *same benchmark universe*. This is the basic tactic in our statistical strategy. That is, if the two foods came from the same universe, *our best guess about the composition of that universe is that it includes weight gains just like the -twenty-four we have observed* , and in the same proportions, because that is +twenty-four we have observed*, and in the same proportions, because that is all the information that we have about the universe; this is the bootstrap method. Since ours is (by definition) a sample from an infinite (or at least, a very large) universe of possible weight gains, we assume that there are diff --git a/source/testing_procedures.Rmd b/source/testing_procedures.Rmd index 87c92356..4c71c253 100644 --- a/source/testing_procedures.Rmd +++ b/source/testing_procedures.Rmd @@ -171,7 +171,7 @@ significance statements? Ten. *What is the benchmark universe* that embodies the null hypothesis? 50-50 female, or 100/206 female. -*If there is to be a Neyman-Pearson alternative universe* , what is it? +*If there is to be a Neyman-Pearson alternative universe*, what is it? None. *Which symbols for the observed entities?* Balls in bucket, or numbers. diff --git a/source/what_is_probability.Rmd b/source/what_is_probability.Rmd index 8c2c1136..61d49c44 100644 --- a/source/what_is_probability.Rmd +++ b/source/what_is_probability.Rmd @@ -149,7 +149,7 @@ proxy. For example, the director of the CIA, Robert Gates, said in 1993 "that in May 1989, the CIA reported that the problems in the Soviet Union were so serious and the situation so volatile that Gorbachev had only a 50-50 chance of surviving the next three to four years unless he retreated from his reform -policies" (*The Washington Post* , January 17, 1993, p. A42). 
Can such a +policies" (*The Washington Post*, January 17, 1993, p. A42). Can such a statement be based on solid enough data to be more than a crude guess? The conceptual probability in any specific situation is *an interpretation of @@ -433,7 +433,7 @@ odd). Here are several general methods of estimation, where we define each metho same probability is either 0 or 1, and since the chance of each in turn is .5, the probability of heads is ultimately .5 once again. Nothing is to be gained by saying that one .5 is sharply defined and that the other is - fuzzy. Of course, *if* , and this is a big "if," you could experiment with + fuzzy. Of course, *if*, and this is a big "if," you could experiment with the coin you will toss before you are obliged to declare, then the two options are manifestly asymmetrical. Barring this privilege, the two options are equivalent [@raiffa1968decision, page 108]. @@ -465,7 +465,7 @@ complication that enters into estimating probabilities. > all of NASA's claims — that they were much more careful with manned flights, > that the typical rocket isn't a valid comparison, etcetera. " > -> But then a new problem came up: the Jupiter probe, *Galileo* , was going to +> But then a new problem came up: the Jupiter probe, *Galileo*, was going to > use a power supply that runs on heat generated by radioactivity. If the > shuttle carrying *Galileo* failed, radioactivity could be spread over a > large area. So the argument continued: NASA kept saying 1 in 100,000 and Mr. From f63ecb551e6966d1bd70cb533aba6d73193607e7 Mon Sep 17 00:00:00 2001 From: Matthew Brett Date: Fri, 18 Oct 2024 15:18:35 +0100 Subject: [PATCH 03/10] Spaces before colon, after emphasis. --- source/probability_theory_1a.Rmd | 10 +++++----- source/resampling_with_code.Rmd | 4 ++-- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/source/probability_theory_1a.Rmd b/source/probability_theory_1a.Rmd index 78fcd6c3..f192d8de 100644 --- a/source/probability_theory_1a.Rmd +++ b/source/probability_theory_1a.Rmd @@ -64,7 +64,7 @@ known as resampling. A few definitions first: -- *Simple Event* : An event such as a single flip of a coin, or one draw of a +- *Simple Event*: An event such as a single flip of a coin, or one draw of a single card. A simple event cannot be broken down into simpler events of a similar sort. - *Simple Probability* (also called "primitive probability"): The probability @@ -92,12 +92,12 @@ costs, and other subjects of measurement and judgment. Some more definitions: -- *Composite Event* : A composite event is the combination of two or more +- *Composite Event*: A composite event is the combination of two or more simple events. Examples include all heads in three throws of a single coin; all heads in one throw of three coins at once; Sunday being a nice day *and* the Commanders winning; and the birth of nine females out of the next ten calves born if the chance of a female in a single birth is 0.48. -- *Compound Probability* : The probability that a composite event will occur. +- *Compound Probability*: The probability that a composite event will occur. The difficulty in estimating *simple* probabilities such as the chance of the Commanders winning on Sunday arises from our lack of understanding of @@ -138,7 +138,7 @@ Commanders winning (say) 3 of their next 4 games. We will return to this illustration again and we will see how it enables us to estimate many other sorts of probabilities. 
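As a preview, here is one possible computational rendering of that illustration, a sketch using the estimates quoted earlier (a 0.7 chance of a nice day, a 0.65 chance of the Commanders winning on a nice day, 0.55 on a nasty one); the bucket draws become calls to a random-number generator:

```python
import numpy as np

rng = np.random.default_rng()

n_trials = 10_000
n_nice_and_win = 0

for i in range(n_trials):
    # First draw: the weather, with a 0.7 chance of a nice day.
    weather = rng.choice(['nice', 'nasty'], p=[0.7, 0.3])
    # Second draw: the game, with a win probability that depends on
    # the weather (0.65 on a nice day, 0.55 on a nasty one).
    p_win = 0.65 if weather == 'nice' else 0.55
    result = rng.choice(['win', 'lose'], p=[p_win, 1 - p_win])
    if weather == 'nice' and result == 'win':
        n_nice_and_win += 1

print(n_nice_and_win / n_trials)  # near 0.7 * 0.65 = 0.455
```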
-- *Experiment or Experimental Trial, or Trial, or Resampling Experiment* : A +- *Experiment or Experimental Trial, or Trial, or Resampling Experiment*: A simulation experiment or trial is a randomly-generated composite event which has the same characteristics as the actual composite event in which we are interested (except that in inferential statistics the resampling experiment @@ -147,7 +147,7 @@ sorts of probabilities. -- *Parameter* : A numerical property of a universe. For example, the "true" +- *Parameter*: A numerical property of a universe. For example, the "true" mean (don't worry about the meaning of "true"), and the range between largest and smallest members, are two of its parameters. diff --git a/source/resampling_with_code.Rmd b/source/resampling_with_code.Rmd index 0fdb2e1a..31df5390 100644 --- a/source/resampling_with_code.Rmd +++ b/source/resampling_with_code.Rmd @@ -320,8 +320,8 @@ rounded to the nearest integer. The components we send to a function are called *arguments*. The finished result the function sends back is the *return value*. -* **Arguments** : the value or values we send to a function. -* **Return value** : the values the function sends back. +* **Arguments**: the value or values we send to a function. +* **Return value**: the values the function sends back. See @fig-round_function_pl for an illustration of [`round`]{.r}[`np.round`]{.python} as a production line. From a1ea9651bbbaec4949e109c7b09f830833838bd7 Mon Sep 17 00:00:00 2001 From: Matthew Brett Date: Fri, 18 Oct 2024 16:24:06 +0100 Subject: [PATCH 04/10] Working through proofread to end of introduction. --- source/diagrams/covid-tree.svg | 57 +++++------ source/index.Rmd | 10 +- source/intro.Rmd | 30 +++--- source/preface_second.Rmd | 19 ++-- source/preface_third.Rmd | 170 +++++++++++++++++---------------- source/simon_refs.bib | 6 +- 6 files changed, 150 insertions(+), 142 deletions(-) diff --git a/source/diagrams/covid-tree.svg b/source/diagrams/covid-tree.svg index f8b1929a..1d3aedfa 100644 --- a/source/diagrams/covid-tree.svg +++ b/source/diagrams/covid-tree.svg @@ -1,19 +1,19 @@ + inkscape:version="1.2.2 (b0a8486, 2022-12-01)" + sodipodi:docname="covid-tree.svg" + xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" + xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" + xmlns="http://www.w3.org/2000/svg" + xmlns:svg="http://www.w3.org/2000/svg" + xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" + xmlns:cc="http://creativecommons.org/ns#" + xmlns:dc="http://purl.org/dc/elements/1.1/"> + inkscape:window-y="25" + inkscape:window-maximized="0" + inkscape:showpageshadow="2" + inkscape:pagecheckerboard="0" + inkscape:deskcolor="#d1d1d1" /> @@ -72,7 +75,7 @@ image/svg+xml - + @@ -123,7 +126,7 @@ x="64.851036" y="58.438557" style="stroke-width:0.264583" - id="tspan879">COVID + id="tspan879">Covid have COVID + id="tspan906">have Covid 0.5% of testsyield falsepositivespositives forCOVID + id="tspan941">Covid without COVIDwithout Covidtest negatively + id="tspan986">test negative COVID tests failCovid tests failCOVID's presence + id="tspan1030">Covid's presence ... no statistical treatment can put validity into generalizations which are > based on data that were not reasonably accurate and complete to begin with. > It is unfortunate that academic departments so often offer courses on the -> statistical manipulation of human material to students who have little -> understanding of the problems involved in securing the original data. ... 
-> When training in these things replaces or at least precedes some of the +> statistical manipulation of [data from human behavior] to students who have +> little understanding of the problems involved in securing the original data. +> ... When training in these things replaces or at least precedes some of the > college courses on the mathematical treatment of data, we shall come nearer > to having a science of human behavior. [@kinsey1948sexual, p 35]. @@ -345,10 +345,10 @@ Scan this book and you will find almost no formal mathematics. Yet nearly every student finds the subject very difficult — as difficult as anything taught at universities. The root of the difficulty is that the *subject matter* is extremely difficult. Let's find out *why*. -It is easy to find out with high precision which movie is playing -tonight at the local cinema; you can look it up on the web or call the cinema -and ask. But consider by contrast how difficult it is to determine with -accuracy: + +It is easy to find out with high precision which movie is playing tonight at +the local cinema; you can look it up on the web or call the cinema and ask. But +consider by contrast how difficult it is to determine with accuracy: 1. Whether we will save lives by recommending vitamin D supplements for the whole population as protection against viral infections. Some evidence diff --git a/source/preface_second.Rmd b/source/preface_second.Rmd index 0d8ed5cd..e5bc9b2f 100644 --- a/source/preface_second.Rmd +++ b/source/preface_second.Rmd @@ -104,7 +104,7 @@ this success. The method was first presented at some length in the 1969 edition of my book *Basic Research Methods in Social Science* [@simon1969basic] (third edition -with Paul Burstein -@simon1985basic). +with Paul Burstein [-@simon1985basic]). For some years, the resampling method failed to ignite interest among statisticians. While many factors (including the accumulated @@ -184,7 +184,7 @@ separate publication where it might be overlooked. In ancient times, mathematics developed from the needs of governments and rich men to number armies, flocks, and especially to count the -taxpayers and their possessions. Up until the beginning of the 20th +taxpayers and their possessions. Up until the beginning of the 20^th^ century, the term *statistic* meant the number of something — soldiers, births, taxes, or what-have-you. In many cases, the term *statistic* still means the number of something; the most important statistics for the United @@ -194,12 +194,14 @@ the making or interpretation of descriptive statistics, because the topic is handled very well in most conventional statistics texts. Another stream of thought entered the field of probability and statistics in -the 17th century by way of gambling in France. Throughout history people had +the 17^th^ century by way of gambling in France. Throughout history people had learned about the odds in gambling games by repeated plays of the game. But in -the year 1654, the French nobleman Chevalier de Mere asked the great +the year 1654, the French nobleman Chevalier de Méré asked the great mathematician and philosopher Pascal to help him develop correct odds for some -gambling games. Pascal, the famous Fermat, and others went on to develop -modern probability theory. +gambling games[^problem-points]. Pascal, the famous Fermat, and others went on +to develop modern probability theory. + +[^problem-points]: Later these two streams of thought came together. 
Researchers wanted to know how accurate their
descriptive statistics were — not only the
@@ -208,9 +210,10 @@ numbers arising from experiments. Statisticians began to apply the theory of
probability to the accuracy of the data arising from sample surveys and
experiments, and that became the theory of *inferential statistics*.

+
Here we find a guidepost: probability theory and statistics are relevant
-whenever there is uncertainty about events occurring in the world, or in
-the numbers describing those events.
+whenever there is uncertainty about events occurring in the world, or in the
+numbers describing those events.

Later, probability theory was also applied to another context in which
there is uncertainty — decision-making situations. Descriptive
diff --git a/source/preface_third.Rmd b/source/preface_third.Rmd
index 7de8e953..09f78c7c 100644
--- a/source/preface_third.Rmd
+++ b/source/preface_third.Rmd
@@ -29,12 +29,12 @@ The book in your hands, or on your screen, is the third edition of a book
originally called "Resampling: the new statistics", by Julian Lincoln
Simon [-@simon1992resampling].

-One of the pleasures of writing an edition of someone else's book is that
-we have some freedom to praise a previous version of our own book. We will
-do that, in the next section. Next we talk about the resampling methods in
-this book, and their place at the heart of "data science", Finally, we
-discuss what we have changed, and why, and make some suggestions about where
-this book could fit into your learning and teaching.
+One of the pleasures of writing a new edition of a work by another author is
+that we can praise the previous version of our own book. We will do that in
+the next section. Next we talk about the resampling methods in this book, and
+their place at the heart of "data science". We then discuss what we have
+changed, what we haven't, and why. Finally, we make some suggestions about
+where this book could fit into your learning and teaching.

## What Simon saw

@@ -53,10 +53,10 @@ mathematics. Most students cannot follow along and quickly get lost, reducing
the subject to — as Simon puts it — "mumbo-jumbo".

On its own, this was not a new realization. Simon quotes a classic textbook by
-Wallis and Roberts [-@wallis1956statistics], in which they compare teaching
-statistics through mathematics to teaching in a foreign language. More
-recently, other teachers of statistics have come to the same conclusion. Cobb
-[-@cobb2007introductory] argues that it is practically impossible to teach
+Wallis and Roberts [-@wallis1956statistics], to the effect that teaching
+statistics through mathematics is like teaching philosophy in ancient Greek.
+More recently, other teachers of statistics have come to the same conclusion.
+Cobb [-@cobb2007introductory] argues that it is not practical to teach
students the level of mathematics they would need to understand standard
introductory courses. As you will see below, Cobb also agrees with Simon about
the solution.

@@ -67,37 +67,37 @@ appears in the original preface: "Beneath the logic of a statistical
inference there necessarily lies a physical process". Drawing conclusions
from noisy data means building a *model* of the noisy world, and seeing how
that model behaves. That model can be physical, where we generate the noisiness of the
-world using physical devices like dice and spinners and coin-tosses. In fact,
-Simon used exactly these kinds of devices in his first experiments
-in teaching [@simon1969basic].
He then saw that it was much more efficient to -build these models with simple computer code, and the result was the first and -second editions of this book, with their associated software, the *Resampling -Stats* language. - -Simon's second conclusion follows from the first. Now that Simon had stripped -away the unnecessary barrier of mathematics, he had got to the heart of what is -interesting and difficult in statistics. Drawing conclusions from noisy data -involves a lot of hard, clear thinking. We need to be honest with our students -about that; statistics is hard, not because it is obscure (it need not be), but -because it deals with difficult problems. It is exactly that hard logical -thinking that can make statistics so interesting to our best students; +world using physical devices like dice and spinners and coin-tosses. +Simon used exactly these kinds of devices in his first experiments in teaching +[@simon1969basic]. He then saw that it was much more efficient to build these +models with simple computer code, and the result was the first and second +editions of this book, with their associated software, the *Resampling Stats* +language. + +Simon's second conclusion follows from the first. Now he had found a path +round the unnecessary barrier of mathematics, he had got to the heart of what +is interesting and difficult in statistics. Drawing conclusions from noisy +data involves a lot of hard, clear thinking. We should be honest with our +students about that; statistics is hard, not because it is obscure (it need +not be), but because it deals with difficult problems. It is exactly that hard +logical thinking that can make statistics so interesting to our best students; "statistics" is just reasoning about the world when the world is noisy. Simon writes eloquently about this in a section in the introduction — "Why is statistics such a difficult subject" (@sec-stats-difficult). -We needed both of Simon's conclusions to get anywhere. We cannot hope to +We need both of Simon's conclusions to make progress. We cannot hope to teach two hard subjects at the same time; mathematics, and statistical -reasoning. That is what Simon has done: he replaced the mathematics with -something that is much easier to reason about. Then he can concentrate on the -real, interesting problem — the hard thinking about data, and the world it -comes from. To quote from a later section in this book -(@sec-resamp-differs): "Once we get rid of the formulas and tables, we can -see that statistics is a matter of *clear thinking, not fancy mathematics*." -Instead of asking "where would I look up the right recipe for this", you -find yourself asking "what kind of world do these data come from?" and "how -can I reason about that world?". Like Simon, we have found that this way of -thinking and teaching is almost magically liberating and satisfying. We hope -and believe that you will find the same. +reasoning. He replaced the mathematics with something that is much easier for +most of us to reason about. By doing that, he can concentrate on the real, +interesting problem — the hard thinking about data, and the world it comes +from. To quote from a later section in this book (@sec-resamp-differs): "Once +we get rid of the formulas and tables, we can see that statistics is a matter +of *clear thinking, not fancy mathematics*." Instead of asking "where would +I look up the right recipe for this?", you find yourself asking "what kind of +world do these data come from?" and "How can I reason about that world?". 
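For a concrete taste of what "building these models with simple computer code" means, here is a minimal coin-toss model of the kind Simon replaced his physical devices with. The fair coin and the 100-toss count are arbitrary choices for this illustration, not an example from the book itself.

```python
import numpy as np

rnd = np.random.default_rng()

# A code stand-in for a physical device: 100 tosses of a fair coin.
tosses = rnd.choice(['heads', 'tails'], size=100)

# Count of heads in this one simulated experiment.
print(np.sum(tosses == 'heads'))
```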
+Like Simon, we have found that this way of thinking and teaching brings rich
+rewards — for insight and practice. We hope and believe that you will find
+the same.

## Resampling and data science {#sec-resampling-data-science}

@@ -135,11 +135,11 @@ what it should do with data; it is the native language of data analysis.

This insight transforms the way we think of code. In the past, we have
thought of code as a separate, specialized skill, that some of us learn. We
-take coding courses — we "learn to code". If code is the fundamental
-language for analyzing data, then we need code to express what data analysis
-does, and explain how it works. Here we "code to learn". Code is not an aim
-in itself, but a language we can use to express the simple ideas behind data
-analysis and statistics.
+take coding courses — we "learn to code". But if we use code as the
+fundamental language for analyzing data, then we need code to express what
+data analysis does, and explain how it works. Here we "code to learn". Code
+is not an aim in itself, but a language we can use to express the simple ideas
+behind data analysis and statistics.

Thus the data science movement started from code as the foundation for data
analysis, to using code to explain statistics. It ends at the same place as
@@ -153,37 +153,41 @@ goes on to explain why there is so much mathematics, and why we should remove
it. In the age before ubiquitous computing, we needed mathematics to simplify
calculations that we could not practically do by hand. Now we have great
computing power in our phones and laptops, we do not have this constraint, and
-we can use simpler resampling methods to solve the same problems. As Simon
-shows, these are much easier to describe and understand. Data science, and
-teaching with resampling, are the obvious consequences of ubiquitous
-computing.
+we can use simpler ideas from resampling methods to solve the same problems.
+As Simon shows, these are much easier to describe and understand. Data
+science, and teaching with resampling, are the obvious consequences of
+ubiquitous computing.

## What we changed

This diversion, through data science, leads us to the changes that we have made
for the new edition. The previous edition of this book is still excellent, and
-you can read it free, online, at .
+you can read it freely at .
It continues to be ahead of its time, and ahead of our time. Its one major
drawback is that Simon bases much of the book around code written in a special
-language that he developed with Dan Weidenfeld, called *Resampling Stats*. The
-Resampling Stats language is well designed for expressing the steps in
-simulating worlds that include elements of randomness, and it was a useful
-contribution at the time that it was written. Since then, and particularly in
-the last decade, there have been many improvements in more powerful and general
-languages, such as {{< var lang >}} and {{< var other_lang >}}. These
-languages are particularly suitable for beginners in data analysis, and they
-come with a huge range of tools and libraries for a many tasks in data
-analysis, including the kinds of models and simulations you will see in this
-book. We have updated the book to use {{< var lang >}}, instead of *Resampling
-Stats*. If you already know {{< var lang >}} or a similar language, such as
-{{< var other_lang >}}, you will have a big head start in reading this book,
-but even if you do not, we have written the book so it will be possible to pick
-up the {{< var lang >}} code that you need to understand and build the kind of
-models that Simon uses. The advantage to us, your authors, is that we can use
-the very powerful tools associated with {{< var lang >}} to make it easier to
-run and explain the code. The advantage to you, our readers, is that you can
-also learn these tools, and the {{< var lang >}} language. They will serve you
-well for the rest of your career in data analysis.
+language that he developed with Dan Weidenfeld, called *Resampling
+Stats*[^stats101]. The Resampling Stats language is well designed for
+expressing the steps in simulating worlds that include elements of randomness,
+and it was a useful contribution at the time that it was written. Since then,
+and particularly in the last decade, there have been many improvements in more
+powerful and general languages, such as {{< var lang >}} and {{< var
+other_lang >}}. These languages are particularly suitable for beginners in
+data analysis, and they come with a huge range of tools and libraries for
+many tasks in data analysis, including the kinds of models and simulations
+you will see in this book. We have updated the book to use {{< var lang >}},
+instead of *Resampling Stats*.
If you already know {{< var lang >}} or a similar language, such as -{{< var other_lang >}}, you will have a big head start in reading this book, -but even if you do not, we have written the book so it will be possible to pick -up the {{< var lang >}} code that you need to understand and build the kind of -models that Simon uses. The advantage to us, your authors, is that we can use -the very powerful tools associated with {{< var lang >}} to make it easier to -run and explain the code. The advantage to you, our readers, is that you can -also learn these tools, and the {{< var lang >}} language. They will serve you -well for the rest of your career in data analysis. +language that he developed with Dan Weidenfeld, called *Resampling +Stats*^[stats101]. The Resampling Stats language is well designed for +expressing the steps in simulating worlds that include elements of randomness, +and it was a useful contribution at the time that it was written. Since then, +and particularly in the last decade, there have been many improvements in more +powerful and general languages, such as {{< var lang >}} and {{< var +other_lang >}}. These languages are particularly suitable for beginners in +data analysis, and they come with a huge range of tools and libraries for +a many tasks in data analysis, including the kinds of models and simulations +you will see in this book. We have updated the book to use {{< var lang >}}, +instead of *Resampling Stats*. If you already know {{< var lang >}} or +a similar language, such as {{< var other_lang >}}, you will have a big head +start in reading this book, but even if you do not, we have written the book +so it will be possible to pick up the {{< var lang >}} code that you need to +understand and build the kind of models that Simon uses. The advantage to us, +your authors, is that we can use the very powerful tools associated with {{< +var lang >}} to make it easier to run and explain the code. The advantage to +you, our readers, is that you can also learn these tools, and the +{{< var lang >}} language. They will serve you well for the rest of your +career in data analysis. + +[^stats101]: If you are interested, has + a free modern version of the original Resampling Stats language. -1. For a flight to Mars, calculating the correct route involves a great many - variables, too many to solve with formulas. Hence, the Monte Carlo - simulation method is used. +1. Imagine a large train station such as Grand Central Terminal in New York or + King's Cross in London. We are responsible for planning the new station + layout so that passengers can move as quickly as possible to and from their + trains in rush-hour. It will likely be far too complicated to make + formulas to represent the passenger flows, but we could use the computer to + simulate passengers, and their movements, and try different potential + layouts within the simulation. 2. The Navy might want to know how long the average ship will have to wait for dock facilities. The time of completion varies from ship to diff --git a/source/resampling_with_code.Rmd b/source/resampling_with_code.Rmd index 31df5390..4c23cef9 100644 --- a/source/resampling_with_code.Rmd +++ b/source/resampling_with_code.Rmd @@ -113,7 +113,7 @@ of cure. Given that 90% cure rate, what is the chance that 17 out of 17 of the Hypothetical group will be cured? You may notice that this question about the Hypothetical group is similar to -the problem of the 16 ambulances in Chapter @sec-resampling-method. 
In that +the problem of the 16 ambulances in @sec-resampling-method. In that problem, we were interested to know how likely it was that 3 or more of 16 ambulances would be out of action on any one day, given that each ambulance had a 10% chance of being out of action. Here we would like to know the chances @@ -858,7 +858,7 @@ np.sum(a) sum(a) ``` -## Counting results +## Counting results {#sec-counting-results} We now have the code to do the equivalent of throwing 17 ten-sided dice. This is the basis for one simulated trial in the world of Saint Hypothetical From f950512605b2be43338f5412042ade6afc13e54b Mon Sep 17 00:00:00 2001 From: Matthew Brett Date: Fri, 18 Oct 2024 21:05:58 +0100 Subject: [PATCH 07/10] Notes on process_notebooks --- scripts/process_notebooks.py | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/scripts/process_notebooks.py b/scripts/process_notebooks.py index 3c47a250..24beed91 100755 --- a/scripts/process_notebooks.py +++ b/scripts/process_notebooks.py @@ -1,4 +1,11 @@ #!/usr/bin/env python3 +""" Process notebooks + +* Copy all files in given directory. +* Write notebooks with given extension. +* Replace local kernel with Pyodide kernel in metadata. +* If url_root specified, replace local file with URL, add message. +""" from argparse import ArgumentParser, RawDescriptionHelpFormatter from copy import deepcopy From ede9b27862ca17e9fccaf4569629b73c5f4b8c63 Mon Sep 17 00:00:00 2001 From: Matthew Brett Date: Fri, 18 Oct 2024 21:10:05 +0100 Subject: [PATCH 08/10] Restore random number generator code block. --- source/resampling_method.Rmd | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/source/resampling_method.Rmd b/source/resampling_method.Rmd index 600784a4..c7106e52 100644 --- a/source/resampling_method.Rmd +++ b/source/resampling_method.Rmd @@ -275,6 +275,12 @@ import numpy as np We also need to ask Numpy for something (that we will call an "object") that can generate random numbers. Such an object is known as a "random number generator". + +```{python} +# Ask NumPy for a random number generator. +# Name it `rnd` — short for "random" +rnd = np.random.default_rng() +``` ::: Recall that we want 16 10-sided dice — one per ambulance. Our dice should From a3f8f4da60237c9f5a7033966cf2b7286caabc43 Mon Sep 17 00:00:00 2001 From: Matthew Brett Date: Tue, 22 Oct 2024 11:49:49 +0100 Subject: [PATCH 09/10] Clean up, format some ketables --- README.md | 8 ++++++-- source/intro.Rmd | 14 ++++++++++---- source/resampling_method.Rmd | 10 +++++++--- 3 files changed, 23 insertions(+), 9 deletions(-) diff --git a/README.md b/README.md index a0712d72..b174ca9f 100644 --- a/README.md +++ b/README.md @@ -441,9 +441,9 @@ lake = pd.read_csv(op.join('data', 'lough_erne.csv')) yearly_srp = lake.loc[:, ['Year', 'SRP']].copy() ``` -```{r, eval=TRUE, echo=FALSE} +```{r, label="tbl-yearly-srp", eval=TRUE, echo=FALSE} ketable(py$yearly_srp, - caption = "Soluble Reactive Phosphorus in Lough Erne {#tbl-yearly-srp}") + caption = "Soluble Reactive Phosphorus in Lough Erne") ``` ~~~ @@ -455,6 +455,10 @@ See [the Knitr chunk options documentation](https://bookdown.org/yihui/rmarkdown-cookbook/chunk-options.html) for more detail. +You can use the [`kableExtra::column_spec` +options](https://www.rdocumentation.org/packages/kableExtra/versions/1.4.0/topics/column_spec) +to tune table formatting — see `resampling_method.Rmd` for an example. 
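For orientation when editing these chunks: the tables formatted above summarize simulations like the 16-ambulance example in `resampling_method.Rmd`. A minimal standalone sketch of that simulation follows — the 10,000-day trial count is an arbitrary illustrative choice, and we follow the chapter's convention of letting the digit 9 stand for "out of action" (one chance in ten per ambulance).

```python
import numpy as np

rnd = np.random.default_rng()

n_trials = 10_000
counts = np.zeros(n_trials)
for i in range(n_trials):
    # One simulated day: a digit from 0 through 9 for each of 16 ambulances.
    a = rnd.integers(0, 10, size=16)
    # Count the ambulances that drew a 9 — "out of action" for that day.
    counts[i] = np.sum(a == 9)

# Proportion of simulated days with 3 or more ambulances out of action.
print(np.mean(counts >= 3))
```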
+ ## More setup for Jupyter For the Jupyter notebook, you might want to enable the R magics, to allow you diff --git a/source/intro.Rmd b/source/intro.Rmd index e45dea32..f1d68a08 100644 --- a/source/intro.Rmd +++ b/source/intro.Rmd @@ -147,9 +147,9 @@ lake = pd.read_csv(Path('data') / 'lough_erne.csv') yearly_srp = lake.loc[:, ['Year', 'SRP']].copy() ``` -```{r, eval=TRUE, echo=FALSE} +```{r, label="tbl-yearly-srp", eval=TRUE, echo=FALSE} ketable(py$yearly_srp, - caption = "Soluble Reactive Phosphorus in Lough Erne {#tbl-yearly-srp}") + caption = "Soluble Reactive Phosphorus in Lough Erne") ``` We may want to *summarize* this set of SRP measurements. For example, we could @@ -170,9 +170,9 @@ srp_stats = pd.DataFrame({'Descriptive statistics for SRP': pd.Series({ 'Maximum': np.max(srp)})}) ``` -```{r, eval=TRUE, echo=FALSE} +```{r, label="tbl-srp-stats", eval=TRUE, echo=FALSE} ketable(head(py$srp_stats), - caption = "Statistics for SRP levels {#tbl-srp-stats}") + caption = "Statistics for SRP levels") ``` Descriptive statistics are nothing new to you; you have been using many of them @@ -247,6 +247,12 @@ only one part of the overall web of consequences and evaluations that must be taken into account when making your decision whether or not to do further research on medicine AntiAnyVir. + + Why does this book limit itself to the specific probability questions when ultimately we are interested in decisions? A first reason is division of labor. The more general aspects of the decision-making process in the face of diff --git a/source/resampling_method.Rmd b/source/resampling_method.Rmd index c7106e52..997bd76a 100644 --- a/source/resampling_method.Rmd +++ b/source/resampling_method.Rmd @@ -178,7 +178,9 @@ df.insert(0, 'Day', range(1, n_trials + 1)) ``` ```{r, label="tbl-veh-numbers", eval=TRUE, echo=FALSE} -ketable(py$df, caption='25 simulations of 16 ambulances') +kableExtra::column_spec( + ketable(py$df, caption='25 simulations of 16 ambulances'), + 1, bold=TRUE, include_thead=TRUE) ``` To know how many ambulances were "out of order" on any given day, we count @@ -192,8 +194,10 @@ df_counts['#9'] = (df == 9).sum(axis=1) ``` ```{r, label="tbl-veh-numbers-counts", eval=TRUE, echo=FALSE} -ketable(py$df_counts, - caption='25 simulations of 16 ambulances, with counts') +kableExtra::column_spec( + ketable(py$df_counts, + caption='25 simulations of 16 ambulances, with counts'), + c(1, 18), bold=TRUE, include_thead=TRUE) ``` Each value in the last column of @tbl-veh-numbers-counts is the count of From f1b91d15fc8d9a5058c2daef199287d2e6275444 Mon Sep 17 00:00:00 2001 From: Matthew Brett Date: Tue, 22 Oct 2024 11:57:18 +0100 Subject: [PATCH 10/10] Maybe fix to R pdf build --- .github/workflows/build.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/build.yaml b/.github/workflows/build.yaml index 4be11505..5e83a133 100644 --- a/.github/workflows/build.yaml +++ b/.github/workflows/build.yaml @@ -121,7 +121,7 @@ jobs: touch r-book/.nojekyll - name: Build R PDF - run: cd source && ninja python-book-pdf + run: cd source && ninja r-book-pdf - name: Deploy R book uses: JamesIves/github-pages-deploy-action@v4.6.1
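The "More setup for Jupyter" section above is cut off at the hunk boundary, but for readers wondering what enabling the R magics can look like: one common route is the `rpy2` package's IPython extension. The sketch below assumes `rpy2` is installed; take it as an illustration, not as the project's documented setup.

```python
# In a Jupyter notebook cell — this assumes the rpy2 package is installed.
%load_ext rpy2.ipython

# With the extension loaded, a cell that starts with %%R runs as R code,
# for example:
# %%R
# x <- c(1, 2, 3)
# mean(x)
```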