Merge branch 'main' into edition-3
* main:
  Maybe fix to R pdf build
  Clean up, format some ketables
  Restore random number generator code block.
  Notes on process_notebooks
  Edits from resampling with code
  Fix references to Tables (don't write "Table")
  Working through proofread to end of introduction.
  Spaces before colon, after emphasis.
  Remove spaces after emphasis, before commas
  Clean up trailing spaces after emphasis.
matthew-brett committed Oct 22, 2024
2 parents 28c3687 + f1b91d1 commit cb82b89
Showing 36 changed files with 374 additions and 402 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build.yaml
```diff
@@ -121,7 +121,7 @@ jobs:
         touch r-book/.nojekyll
     - name: Build R PDF
-      run: cd source && ninja python-book-pdf
+      run: cd source && ninja r-book-pdf

     - name: Deploy R book
      uses: JamesIves/[email protected]
```
8 changes: 6 additions & 2 deletions README.md
````diff
@@ -441,9 +441,9 @@ lake = pd.read_csv(op.join('data', 'lough_erne.csv'))
 yearly_srp = lake.loc[:, ['Year', 'SRP']].copy()
 ```

-```{r, eval=TRUE, echo=FALSE}
+```{r, label="tbl-yearly-srp", eval=TRUE, echo=FALSE}
 ketable(py$yearly_srp,
-        caption = "Soluble Reactive Phosphorus in Lough Erne {#tbl-yearly-srp}")
+        caption = "Soluble Reactive Phosphorus in Lough Erne")
 ```
 ~~~
````
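With the label moved into the chunk options, Quarto reads the table ID from the chunk rather than from a `{#tbl-...}` suffix in the caption, and book text can cross-reference the table in the usual way. A minimal usage sketch:

~~~
See @tbl-yearly-srp for the Lough Erne phosphorus measurements.
~~~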
````diff
@@ -455,6 +455,10 @@ See [the Knitr chunk options
 documentation](https://bookdown.org/yihui/rmarkdown-cookbook/chunk-options.html)
 for more detail.

+You can use the [`kableExtra::column_spec`
+options](https://www.rdocumentation.org/packages/kableExtra/versions/1.4.0/topics/column_spec)
+to tune table formatting — see `resampling_method.Rmd` for an example.
+
 ## More setup for Jupyter

 For the Jupyter notebook, you might want to enable the R magics, to allow you
````
7 changes: 7 additions & 0 deletions scripts/process_notebooks.py
```diff
@@ -1,4 +1,11 @@
 #!/usr/bin/env python3
+""" Process notebooks
+
+* Copy all files in given directory.
+* Write notebooks with given extension.
+* Replace local kernel with Pyodide kernel in metadata.
+* If url_root specified, replace local file with URL, add message.
+"""

 from argparse import ArgumentParser, RawDescriptionHelpFormatter
 from copy import deepcopy
```
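The kernel replacement that the new docstring mentions is the interesting step. A minimal sketch of that one operation using `nbformat` (not the repo script itself; the Pyodide kernelspec values and file paths here are assumptions):

```python
import nbformat

def set_pyodide_kernel(in_path, out_path):
    """Rewrite a notebook's kernelspec to point at a Pyodide kernel."""
    nb = nbformat.read(in_path, as_version=4)
    # Replace the local kernel recorded in the notebook metadata.
    nb.metadata["kernelspec"] = {
        "name": "python",  # assumed kernel name for the Pyodide kernel
        "display_name": "Python (Pyodide)",  # assumed display name
        "language": "python",
    }
    nbformat.write(nb, out_path)

set_pyodide_kernel("intro.ipynb", "processed/intro.ipynb")  # hypothetical paths
```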
4 changes: 2 additions & 2 deletions source/bayes_simulation.Rmd
```diff
@@ -904,7 +904,7 @@ But instead of doing that, let's take the easy route out and simulate
 the situation instead.

 1. We begin, as do Box and Tiao, by restricting our attention to the third
-   line in Table @tbl-mice-genetics. We draw a mouse with label 'BB', 'Bb', or
+   line in @tbl-mice-genetics. We draw a mouse with label 'BB', 'Bb', or
    'bb', using those probabilities. We were told that the "test mouse" is
    black, so if we draw 'bb', we try again. (Alternatively, we could draw 'BB'
    and 'Bb' with probabilities of 1/3 and 2/3 respectively.)
@@ -917,7 +917,7 @@ the situation instead.
    If our test mouse is "BB", we already know that all their offspring will
    be black ("Bb"). Thus, store "BB" in the parent list.
 3. If our test mouse is "Bb", we have a bit more work to do. Draw
-   seven offspring from the middle row of Table tbl-mice-genetics.
+   seven offspring from the middle row of @tbl-mice-genetics.
    If all the offspring are black, store "Bb" in the parent list.
 4. Repeat steps 1-3 perhaps 10000 times.
 5. Now, out of all parents count the numbers of "BB" vs "Bb".
```
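The five steps above translate almost line for line into code. A minimal Python sketch, assuming (since the table is not shown in this excerpt) that a 'Bb' test mouse mated with 'bb' gives a black ('Bb') offspring with probability 1/2:

```python
import numpy as np

rng = np.random.default_rng()

parents = []
for _ in range(10_000):
    # Step 1: a black test mouse is 'BB' with probability 1/3,
    # 'Bb' with probability 2/3.
    test_mouse = rng.choice(['BB', 'Bb'], p=[1/3, 2/3])
    if test_mouse == 'BB':
        # Step 2: 'BB' test mice always have black offspring.
        parents.append('BB')
        continue
    # Step 3: draw seven offspring; each is black ('Bb') with
    # assumed probability 1/2.
    offspring = rng.choice(['Bb', 'bb'], size=7)
    if np.all(offspring == 'Bb'):
        parents.append('Bb')

# Step 5: among parents whose seven offspring were all black,
# compare the counts of 'BB' and 'Bb'.
parents = np.array(parents)
print('P(test mouse is BB | 7 black offspring) ≈',
      np.mean(parents == 'BB'))
```

Under these assumptions the printed proportion should land near 0.98, matching the analytic answer (1/3) / (1/3 + (2/3)(1/2)^7).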
10 changes: 5 additions & 5 deletions source/confidence_1.Rmd
```diff
@@ -75,7 +75,7 @@ respected statisticians who argue that the logic of confidence intervals
 is better grounded and leads less often to error.

 Confidence intervals are considered by many to be part of the same topic
-as *estimation* , being an estimation of accuracy, in their view. And
+as *estimation*, being an estimation of accuracy, in their view. And
 confidence intervals and hypothesis testing are seen as sub-cases of
 each other by some people. Whatever the importance of these distinctions
 among these intellectual tasks in other contexts, they need not concern
@@ -85,11 +85,11 @@ us here.

 If one draws a sample that is very, very large — large enough so that
 one need not worry about sample size and dispersion in the case at
-hand — from a universe whose characteristics one *knows* , one then can
+hand — from a universe whose characteristics one *knows*, one then can
 *deduce* the probability that the sample mean will fall within a given
 distance of the population mean. Intuitively, it *seems* as if one
 should also be able to reverse the process — to infer something about
-the location of the population mean *from the sample mean* . But this
+the location of the population mean *from the sample mean*. But this
 inverse inference turns out to be a slippery business indeed.

 Let's put it differently: It is all very well to say — as one logically
@@ -233,7 +233,7 @@ might be, you would offer higher odds that the center (the trunk) is in
 any unit of area close to the center of your two apples than in a unit
 of area far from the center. That is, if you are told that either one
 apple, or two apples, came from *one of two specified trees whose
-locations are given* , with no reason to believe it is one tree or the
+locations are given*, with no reason to believe it is one tree or the
 other (later, we can put other prior probabilities on the two trees),
 and you are also told the dispersions, you now can put *relative*
 probabilities on *one tree or the other* being the source. (Note to the
@@ -266,7 +266,7 @@ If the pattern of the 10 apples is tight, you imagine the pattern of the
 likely locations of the population mean to be tight; if not, not. That
 is, *it is intuitively clear that there is some connection between how
 spread out are the sample observations and your confidence about the
-location of the population mean* . For example, consider two patterns of
+location of the population mean*. For example, consider two patterns of
 a thousand apples, one with twice the spread of another, where we
 measure spread by (say) the diameter of the circle that holds the inner
 half of the apples for each tree, or by the standard deviation. It makes
```
10 changes: 4 additions & 6 deletions source/confidence_2.Rmd
```diff
@@ -100,7 +100,7 @@ Please notice that the distribution (universe) assumed at the beginning
 of this approach did not include the assumption that the distribution is
 centered on the sample mean or anywhere else. It is true that the sample
 mean is used *for purposes of reporting the location of the estimated
-universe mean* . But despite how the subject is treated in the
+universe mean*. But despite how the subject is treated in the
 conventional approach, the estimated population mean is not part of the
 work of constructing confidence intervals. Rather, the calculations
 apply in the same way to *all universes in the neighborhood of the
@@ -1038,7 +1038,7 @@ resampling. We shall not discuss the latter method here.

 As with approach 1, we do not make any probability statements about
 where the population mean may be found. Rather, we discuss only what
-various hypothetical universes *might produce* , and make inferences
+various hypothetical universes *might produce*, and make inferences
 about the "actual" population's characteristics by comparison with those
 hypothesized universes.

@@ -1124,8 +1124,7 @@ dispersion as the sample. We can then say that *distributions centered
 at the two endpoints of the 95 percent confidence interval (each of them
 including a tail in the direction of the observed sample mean with 2.5
 percent of the area), or even further away from the sample mean, will
-produce the observed sample only 5 percent of the time or less* .
-
+produce the observed sample only 5 percent of the time or less*.
 The result of the second approach is more in the spirit of a hypothesis
 test than of the usual interpretation of confidence intervals. Another
 statement of the result of the second approach is: We postulate a given
@@ -1173,8 +1172,7 @@ lies to the (say) right of it.
 As noted in the preview to this chapter, we do not learn about the
 reliability of sample estimates of the population mean (and other
 parameters) by logical inference from any one particular sample to any
-one particular universe, because *in principle this cannot be done* .
-Instead, in this second approach we investigate the behavior of various
+one particular universe, because *in principle this cannot be done*. Instead, in this second approach we investigate the behavior of various
 universes at the borderline of the neighborhood of the sample, those
 universes being chosen on the basis of their resemblances to the sample.
 We seek, for example, to find the universes that would produce samples
```
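The endpoint logic in the hunk above can be simulated directly. A minimal sketch with made-up data: shift a universe shaped like the sample until samples drawn from it reach the observed mean only 2.5 percent of the time; that shifted mean approximates the lower endpoint:

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical observed sample -- stand-in data for illustration.
sample = np.array([47, 52, 55, 49, 51, 54, 48, 50, 53, 46])
obs_mean = sample.mean()

def prop_reaching_observed(mu, n_trials=2_000):
    # A universe shaped like the sample but centered at mu: the observed
    # values shifted so their mean sits at mu; resample from it.
    universe = sample - obs_mean + mu
    means = np.array([rng.choice(universe, size=len(sample)).mean()
                      for _ in range(n_trials)])
    return np.mean(means >= obs_mean)

# Slide the candidate universe mean downward until it produces a sample
# mean as large as the observed one only 2.5 percent of the time.
for mu in np.arange(obs_mean, obs_mean - 5, -0.1):
    if prop_reaching_observed(mu) < 0.025:
        print('Lower 95% endpoint is near', round(mu, 1))
        break
```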
21 changes: 10 additions & 11 deletions source/correlation_causation.Rmd
````diff
@@ -78,7 +78,7 @@ is, even in a controlled experiment there is often no way except subject-matter
 knowledge to avoid erroneous conclusions about causality. Nothing except
 substantive knowledge or scientific intuition would have led them to the
 recognition that it is the alcohol rather than the soda that made them drunk,
-*as long as they always took soda with their drinks* . And no statistical
+*as long as they always took soda with their drinks*. And no statistical
 procedure can suggest to them that they ought to experiment with the presence
 and absence of soda. If this is true for an experiment, it must also be true
 for an uncontrolled study.
@@ -106,7 +106,7 @@ in a statement that has these important characteristics:
 observable so that the relationship will apply under a wide enough
 range of conditions to be considered useful or interesting. In other
 words, *the relationship must not require too many "if"s, "and"s,
-and "but"s in order to hold* . For example, one might say that an
+and "but"s in order to hold*. For example, one might say that an
 increase in income caused an increase in the birth rate if this
 relationship were observed everywhere. But, if the relationship were
 found to hold only in developed countries, among the educated
@@ -130,7 +130,7 @@ in a statement that has these important characteristics:
 previous criterion for side conditions is that a plenitude of very
 restrictive side conditions may take the relationship out of the
 class of causal relationships, *even though the effects of the side
-conditions are known* . This criterion of nonspuriousness concerns
+conditions are known*. This criterion of nonspuriousness concerns
 variables that are as yet *unknown and unevaluated* but that have a
 *possible* ability to *upset* the observed association.

@@ -224,14 +224,14 @@ or whether they are not independent but rather are related.

 ## A Note on Association Compared to Testing a Hypothesis

-Problems in which we investigate a) whether there is an *association* , versus
+Problems in which we investigate a) whether there is an *association*, versus
 b) whether there is a *difference* between just two groups, often look very
 similar, especially when the data constitute a 2-by-2 table. There is this
 important difference between the two types of analysis, however: Questions
 about *association* refer to *variables* — say weight and age — and it never
 makes sense to ask whether there is a difference between variables (except when
 asking whether they measure the same quantity). Questions about *similarity or
-difference* refer to *groups of individuals* , and in such a situation it does
+difference* refer to *groups of individuals*, and in such a situation it does
 make sense to ask whether or not two groups are observably different from each
 other.

@@ -794,8 +794,7 @@ occur if the I.Q. scores were ranked from best to worst (column 3) and worst to
 best (column 5). The extent of correlation (association) can thus be measured
 by whether the sum of the multiples of the observed *x* and *y* values is
 relatively much higher or much lower than are sums of randomly-chosen pairs of
-*x* and *y* .
-
+*x* and *y*.
 ```{python echo=FALSE, eval=TRUE, results="asis", message=FALSE}
 import numpy as np
 import pandas as pd
````
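The sum-of-products statistic in the last hunk is simple to try out: multiply each observed x-y pair, sum, and compare with the sums that random re-pairings produce. A minimal sketch with hypothetical scores:

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical paired scores, for illustration only.
x = np.array([95, 110, 88, 102, 130, 97, 105, 120])
y = np.array([90, 115, 85, 100, 128, 99, 101, 118])

observed = np.sum(x * y)

n_trials = 10_000
n_as_high = 0
for _ in range(n_trials):
    # Random pairing: shuffle y, then form the sum of products again.
    if np.sum(x * rng.permutation(y)) >= observed:
        n_as_high += 1

# A small proportion means the observed pairing gives an unusually
# high sum of products -- evidence of association.
print('Proportion of random pairings at least as high:',
      n_as_high / n_trials)
```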
```diff
@@ -1804,7 +1803,7 @@ thirteen index cards. Now take *another* set of seventy-eight index
 cards, preferably of a different color, and write "yes" on fifty-two of
 them and "no" on twenty-six of them, corresponding to the numbers of
 people who do and do not drink beer in the sample. Now lay them down in
-random *pairs* , one from each pile.
+random *pairs*, one from each pile.

 If there is a high association between the variables, then real life
 observations will bunch up in the two diagonal cells in the upper left and
@@ -1837,8 +1836,8 @@ We can carry out a resampling test with this procedure:
 * **Step 2.** Pair the two sets of cards randomly. Count the numbers of the
 four possible pairs: (1) "approve-drink," (2) "disapprove-don't drink," (3)
 "disapprove-drink," and (4) "approve-don't drink." Record the number of these
-combinations, as in Table 23-10, where columns 1-4 correspond to the four cells
-in Table 23-9.
+combinations, as in @tbl-beerpol-trial, where columns 1-4 correspond to the
+four data cells in @tbl-beerpol-data.
 * **Step 3.** Add (column 1 plus column 4), then add (column 2 plus column 3),
 and subtract the result in the second parenthesis from the result in the first
 parenthesis. If the difference is equal to or greater than 24, record "yes,"
```
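Steps 2 and 3 take only a few lines of code. A minimal sketch, with two loud assumptions: the approve/disapprove split is taken as 65/13 (the excerpt shows only "thirteen", so these counts are a guess that sums to 78), and the step-3 arithmetic is read as the diagonal cells (approve-drink, disapprove-don't drink) minus the off-diagonal cells:

```python
import numpy as np

rng = np.random.default_rng()

# Assumed 65/13 approve/disapprove split; 52/26 drink/don't as stated.
approve = np.repeat(['approve', 'disapprove'], [65, 13])
drink = np.repeat(['yes', 'no'], [52, 26])

n_trials = 10_000
n_ge_24 = 0
for _ in range(n_trials):
    shuffled = rng.permutation(drink)  # random pairing of the card sets
    diagonal = (((approve == 'approve') & (shuffled == 'yes')) |
                ((approve == 'disapprove') & (shuffled == 'no')))
    # Diagonal cells minus off-diagonal cells (step 3).
    if np.sum(diagonal) - np.sum(~diagonal) >= 24:
        n_ge_24 += 1

print('Proportion of random pairings with difference >= 24:',
      n_ge_24 / n_trials)
```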
```diff
@@ -2331,7 +2330,7 @@ the fact that the player sometimes does not get a hit for an abnormally
 long period of time. One way of testing whether or not the coach is
 right is by comparing an average player's longest slump in a 100-at-bat
 season with the longest run of outs in the first card trial. Assume that
-Slug is a player picked *at random* . Then compare Slug's longest
+Slug is a player picked *at random*. Then compare Slug's longest
 slump — say, 10 outs in a row — with the longest cluster of a single
 simulated 100-at-bat trial with the cards, 9 outs. This result suggests
 that Slug's apparent slump might well have resulted by chance.
```
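The slump comparison in the last hunk can also be simulated. A minimal sketch, assuming a .250 hitter, so an out with probability 0.75 at each at-bat:

```python
import numpy as np

rng = np.random.default_rng()

def longest_out_run(p_out=0.75, n_at_bats=100):
    # One simulated season: True = out, False = hit.
    outs = rng.random(n_at_bats) < p_out
    longest = run = 0
    for out in outs:
        run = run + 1 if out else 0
        longest = max(longest, run)
    return longest

runs = np.array([longest_out_run() for _ in range(10_000)])
# How often does a random 100-at-bat season contain a slump of
# ten or more consecutive outs?
print('Proportion of seasons with a run of 10+ outs:',
      np.mean(runs >= 10))
```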
57 changes: 30 additions & 27 deletions source/diagrams/covid-tree.svg
4 changes: 2 additions & 2 deletions source/dramatizing_resampling.Rmd
```diff
@@ -46,7 +46,7 @@ one coin, and if it is a head (boy), I then flip another coin and see
 how often that will be a boy, also, and then actually flip the coin once
 and record the outcomes. But that experiment does not resemble the
 situation of interest. A proper modeling throws *two* coins, examines to
-see if there is a head on *either* , and then examines the *other*.
+see if there is a head on *either*, and then examines the *other*.

 Or consider table @tbl-otherboy, where we asked the computer to do this
 work for us. The computer has done 50 trials, where one trial is one family of
@@ -110,7 +110,7 @@ Someone might wonder whether formal mathematics can help us with this
 problem. Formal (even though not formulaic) analysis can certainly
 provide an answer. We can use what is known as the "sample space"
 approach which reasons from first principles; here it consists of making
-a *list of the possibilities* , and examining the proportion of
+a *list of the possibilities*, and examining the proportion of
 "successes" to "failures" in that list.

 First we write down the equally-likely ways that two coins can fall:
```