Merge branch 'main' into edition-3
* main:
  Maybe fix to R pdf build
  Clean up, format some ketables
  Restore random number generator code block.
  Notes on process_notebooks
  Edits from resampling with code
  Fix references to Tables (don't write "Table")
  Working through proofread to end of introduction.
  Spaces before colon, after emphasis.
  Remove spaces after emphasis, before commas
  Clean up trailing spaces after emphasis.
matthew-brett committed Oct 22, 2024
2 parents 28c3687 + f1b91d1 commit cb82b89
Showing 36 changed files with 374 additions and 402 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build.yaml
```diff
@@ -121,7 +121,7 @@ jobs:
         touch r-book/.nojekyll
     - name: Build R PDF
-      run: cd source && ninja python-book-pdf
+      run: cd source && ninja r-book-pdf

     - name: Deploy R book
      uses: JamesIves/[email protected]
```
8 changes: 6 additions & 2 deletions README.md
````diff
@@ -441,9 +441,9 @@ lake = pd.read_csv(op.join('data', 'lough_erne.csv'))
 yearly_srp = lake.loc[:, ['Year', 'SRP']].copy()
 ```

-```{r, eval=TRUE, echo=FALSE}
+```{r, label="tbl-yearly-srp", eval=TRUE, echo=FALSE}
 ketable(py$yearly_srp,
-        caption = "Soluble Reactive Phosphorus in Lough Erne {#tbl-yearly-srp}")
+        caption = "Soluble Reactive Phosphorus in Lough Erne")
 ```
 ~~~
````
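With the label moved into the chunk options, Quarto reads the table ID from the chunk rather than from a `{#tbl-...}` suffix in the caption, and book text can cross-reference the table in the usual way. A minimal usage sketch:

~~~
See @tbl-yearly-srp for the Lough Erne phosphorus measurements.
~~~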
````diff
@@ -455,6 +455,10 @@ See [the Knitr chunk options
 documentation](https://bookdown.org/yihui/rmarkdown-cookbook/chunk-options.html)
 for more detail.

+You can use the [`kableExtra::column_spec`
+options](https://www.rdocumentation.org/packages/kableExtra/versions/1.4.0/topics/column_spec)
+to tune table formatting — see `resampling_method.Rmd` for an example.
+
 ## More setup for Jupyter

 For the Jupyter notebook, you might want to enable the R magics, to allow you
````
7 changes: 7 additions & 0 deletions scripts/process_notebooks.py
```diff
@@ -1,4 +1,11 @@
 #!/usr/bin/env python3
+""" Process notebooks
+
+* Copy all files in given directory.
+* Write notebooks with given extension.
+* Replace local kernel with Pyodide kernel in metadata.
+* If url_root specified, replace local file with URL, add message.
+"""

 from argparse import ArgumentParser, RawDescriptionHelpFormatter
 from copy import deepcopy
```
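The kernel replacement that the new docstring mentions is the interesting step. A minimal sketch of that one operation using `nbformat` (not the repo script itself; the Pyodide kernelspec values and file paths here are assumptions):

```python
import nbformat

def set_pyodide_kernel(in_path, out_path):
    """Rewrite a notebook's kernelspec to point at a Pyodide kernel."""
    nb = nbformat.read(in_path, as_version=4)
    # Replace the local kernel recorded in the notebook metadata.
    nb.metadata["kernelspec"] = {
        "name": "python",  # assumed kernel name for the Pyodide kernel
        "display_name": "Python (Pyodide)",  # assumed display name
        "language": "python",
    }
    nbformat.write(nb, out_path)

set_pyodide_kernel("intro.ipynb", "processed/intro.ipynb")  # hypothetical paths
```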
4 changes: 2 additions & 2 deletions source/bayes_simulation.Rmd
```diff
@@ -904,7 +904,7 @@ But instead of doing that, let's take the easy route out and simulate
 the situation instead.

 1. We begin, as do Box and Tiao, by restricting our attention to the third
-   line in Table @tbl-mice-genetics. We draw a mouse with label 'BB', 'Bb', or
+   line in @tbl-mice-genetics. We draw a mouse with label 'BB', 'Bb', or
    'bb', using those probabilities. We were told that the "test mouse" is
    black, so if we draw 'bb', we try again. (Alternatively, we could draw 'BB'
    and 'Bb' with probabilities of 1/3 and 2/3 respectively.)
@@ -917,7 +917,7 @@ the situation instead.
    If our test mouse is "BB", we already know that all their offspring will
    be black ("Bb"). Thus, store "BB" in the parent list.
 3. If our test mouse is "Bb", we have a bit more work to do. Draw
-   seven offspring from the middle row of Table tbl-mice-genetics.
+   seven offspring from the middle row of @tbl-mice-genetics.
    If all the offspring are black, store "Bb" in the parent list.
 4. Repeat steps 1-3 perhaps 10000 times.
 5. Now, out of all parents count the numbers of "BB" vs "Bb".
```
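The five steps above translate almost line for line into code. A minimal Python sketch, assuming (since the table is not shown in this excerpt) that a 'Bb' test mouse mated with 'bb' gives a black ('Bb') offspring with probability 1/2:

```python
import numpy as np

rng = np.random.default_rng()

parents = []
for _ in range(10_000):
    # Step 1: a black test mouse is 'BB' with probability 1/3,
    # 'Bb' with probability 2/3.
    test_mouse = rng.choice(['BB', 'Bb'], p=[1/3, 2/3])
    if test_mouse == 'BB':
        # Step 2: 'BB' test mice always have black offspring.
        parents.append('BB')
        continue
    # Step 3: draw seven offspring; each is black ('Bb') with
    # assumed probability 1/2.
    offspring = rng.choice(['Bb', 'bb'], size=7)
    if np.all(offspring == 'Bb'):
        parents.append('Bb')

# Step 5: among parents whose seven offspring were all black,
# compare the counts of 'BB' and 'Bb'.
parents = np.array(parents)
print('P(test mouse is BB | 7 black offspring) ≈',
      np.mean(parents == 'BB'))
```

Under these assumptions the printed proportion should land near 0.98, matching the analytic answer (1/3) / (1/3 + (2/3)(1/2)^7).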
10 changes: 5 additions & 5 deletions source/confidence_1.Rmd
```diff
@@ -75,7 +75,7 @@ respected statisticians who argue that the logic of confidence intervals
 is better grounded and leads less often to error.

 Confidence intervals are considered by many to be part of the same topic
-as *estimation* , being an estimation of accuracy, in their view. And
+as *estimation*, being an estimation of accuracy, in their view. And
 confidence intervals and hypothesis testing are seen as sub-cases of
 each other by some people. Whatever the importance of these distinctions
 among these intellectual tasks in other contexts, they need not concern
@@ -85,11 +85,11 @@ us here.

 If one draws a sample that is very, very large — large enough so that
 one need not worry about sample size and dispersion in the case at
-hand — from a universe whose characteristics one *knows* , one then can
+hand — from a universe whose characteristics one *knows*, one then can
 *deduce* the probability that the sample mean will fall within a given
 distance of the population mean. Intuitively, it *seems* as if one
 should also be able to reverse the process — to infer something about
-the location of the population mean *from the sample mean* . But this
+the location of the population mean *from the sample mean*. But this
 inverse inference turns out to be a slippery business indeed.

 Let's put it differently: It is all very well to say — as one logically
@@ -233,7 +233,7 @@ might be, you would offer higher odds that the center (the trunk) is in
 any unit of area close to the center of your two apples than in a unit
 of area far from the center. That is, if you are told that either one
 apple, or two apples, came from *one of two specified trees whose
-locations are given* , with no reason to believe it is one tree or the
+locations are given*, with no reason to believe it is one tree or the
 other (later, we can put other prior probabilities on the two trees),
 and you are also told the dispersions, you now can put *relative*
 probabilities on *one tree or the other* being the source. (Note to the
@@ -266,7 +266,7 @@ If the pattern of the 10 apples is tight, you imagine the pattern of the
 likely locations of the population mean to be tight; if not, not. That
 is, *it is intuitively clear that there is some connection between how
 spread out are the sample observations and your confidence about the
-location of the population mean* . For example, consider two patterns of
+location of the population mean*. For example, consider two patterns of
 a thousand apples, one with twice the spread of another, where we
 measure spread by (say) the diameter of the circle that holds the inner
 half of the apples for each tree, or by the standard deviation. It makes
```
10 changes: 4 additions & 6 deletions source/confidence_2.Rmd
```diff
@@ -100,7 +100,7 @@ Please notice that the distribution (universe) assumed at the beginning
 of this approach did not include the assumption that the distribution is
 centered on the sample mean or anywhere else. It is true that the sample
 mean is used *for purposes of reporting the location of the estimated
-universe mean* . But despite how the subject is treated in the
+universe mean*. But despite how the subject is treated in the
 conventional approach, the estimated population mean is not part of the
 work of constructing confidence intervals. Rather, the calculations
 apply in the same way to *all universes in the neighborhood of the
@@ -1038,7 +1038,7 @@ resampling. We shall not discuss the latter method here.

 As with approach 1, we do not make any probability statements about
 where the population mean may be found. Rather, we discuss only what
-various hypothetical universes *might produce* , and make inferences
+various hypothetical universes *might produce*, and make inferences
 about the "actual" population's characteristics by comparison with those
 hypothesized universes.

@@ -1124,8 +1124,7 @@ dispersion as the sample. We can then say that *distributions centered
 at the two endpoints of the 95 percent confidence interval (each of them
 including a tail in the direction of the observed sample mean with 2.5
 percent of the area), or even further away from the sample mean, will
-produce the observed sample only 5 percent of the time or less* .
-
+produce the observed sample only 5 percent of the time or less*.
 The result of the second approach is more in the spirit of a hypothesis
 test than of the usual interpretation of confidence intervals. Another
 statement of the result of the second approach is: We postulate a given
@@ -1173,8 +1172,7 @@ lies to the (say) right of it.
 As noted in the preview to this chapter, we do not learn about the
 reliability of sample estimates of the population mean (and other
 parameters) by logical inference from any one particular sample to any
-one particular universe, because *in principle this cannot be done* .
-Instead, in this second approach we investigate the behavior of various
+one particular universe, because *in principle this cannot be done*. Instead, in this second approach we investigate the behavior of various
 universes at the borderline of the neighborhood of the sample, those
 universes being chosen on the basis of their resemblances to the sample.
 We seek, for example, to find the universes that would produce samples
```
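The endpoint logic in the hunk above can be simulated directly. A minimal sketch with made-up data: shift a universe shaped like the sample until samples drawn from it reach the observed mean only 2.5 percent of the time; that shifted mean approximates the lower endpoint:

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical observed sample -- stand-in data for illustration.
sample = np.array([47, 52, 55, 49, 51, 54, 48, 50, 53, 46])
obs_mean = sample.mean()

def prop_reaching_observed(mu, n_trials=2_000):
    # A universe shaped like the sample but centered at mu: the observed
    # values shifted so their mean sits at mu; resample from it.
    universe = sample - obs_mean + mu
    means = np.array([rng.choice(universe, size=len(sample)).mean()
                      for _ in range(n_trials)])
    return np.mean(means >= obs_mean)

# Slide the candidate universe mean downward until it produces a sample
# mean as large as the observed one only 2.5 percent of the time.
for mu in np.arange(obs_mean, obs_mean - 5, -0.1):
    if prop_reaching_observed(mu) < 0.025:
        print('Lower 95% endpoint is near', round(mu, 1))
        break
```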
21 changes: 10 additions & 11 deletions source/correlation_causation.Rmd
````diff
@@ -78,7 +78,7 @@ is, even in a controlled experiment there is often no way except subject-matter
 knowledge to avoid erroneous conclusions about causality. Nothing except
 substantive knowledge or scientific intuition would have led them to the
 recognition that it is the alcohol rather than the soda that made them drunk,
-*as long as they always took soda with their drinks* . And no statistical
+*as long as they always took soda with their drinks*. And no statistical
 procedure can suggest to them that they ought to experiment with the presence
 and absence of soda. If this is true for an experiment, it must also be true
 for an uncontrolled study.
@@ -106,7 +106,7 @@ in a statement that has these important characteristics:
 observable so that the relationship will apply under a wide enough
 range of conditions to be considered useful or interesting. In other
 words, *the relationship must not require too many "if"s, "and"s,
-and "but"s in order to hold* . For example, one might say that an
+and "but"s in order to hold*. For example, one might say that an
 increase in income caused an increase in the birth rate if this
 relationship were observed everywhere. But, if the relationship were
 found to hold only in developed countries, among the educated
@@ -130,7 +130,7 @@ in a statement that has these important characteristics:
 previous criterion for side conditions is that a plenitude of very
 restrictive side conditions may take the relationship out of the
 class of causal relationships, *even though the effects of the side
-conditions are known* . This criterion of nonspuriousness concerns
+conditions are known*. This criterion of nonspuriousness concerns
 variables that are as yet *unknown and unevaluated* but that have a
 *possible* ability to *upset* the observed association.

@@ -224,14 +224,14 @@ or whether they are not independent but rather are related.

 ## A Note on Association Compared to Testing a Hypothesis

-Problems in which we investigate a) whether there is an *association* , versus
+Problems in which we investigate a) whether there is an *association*, versus
 b) whether there is a *difference* between just two groups, often look very
 similar, especially when the data constitute a 2-by-2 table. There is this
 important difference between the two types of analysis, however: Questions
 about *association* refer to *variables* — say weight and age — and it never
 makes sense to ask whether there is a difference between variables (except when
 asking whether they measure the same quantity). Questions about *similarity or
-difference* refer to *groups of individuals* , and in such a situation it does
+difference* refer to *groups of individuals*, and in such a situation it does
 make sense to ask whether or not two groups are observably different from each
 other.

@@ -794,8 +794,7 @@ occur if the I.Q. scores were ranked from best to worst (column 3) and worst to
 best (column 5). The extent of correlation (association) can thus be measured
 by whether the sum of the multiples of the observed *x* and *y* values is
 relatively much higher or much lower than are sums of randomly-chosen pairs of
-*x* and *y* .
-
+*x* and *y*.
 ```{python echo=FALSE, eval=TRUE, results="asis", message=FALSE}
 import numpy as np
 import pandas as pd
````
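The sum-of-products statistic in the last hunk is simple to try out: multiply each observed x-y pair, sum, and compare with the sums that random re-pairings produce. A minimal sketch with hypothetical scores:

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical paired scores, for illustration only.
x = np.array([95, 110, 88, 102, 130, 97, 105, 120])
y = np.array([90, 115, 85, 100, 128, 99, 101, 118])

observed = np.sum(x * y)

n_trials = 10_000
n_as_high = 0
for _ in range(n_trials):
    # Random pairing: shuffle y, then form the sum of products again.
    if np.sum(x * rng.permutation(y)) >= observed:
        n_as_high += 1

# A small proportion means the observed pairing gives an unusually
# high sum of products -- evidence of association.
print('Proportion of random pairings at least as high:',
      n_as_high / n_trials)
```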
```diff
@@ -1804,7 +1803,7 @@ thirteen index cards. Now take *another* set of seventy-eight index
 cards, preferably of a different color, and write "yes" on fifty-two of
 them and "no" on twenty-six of them, corresponding to the numbers of
 people who do and do not drink beer in the sample. Now lay them down in
-random *pairs* , one from each pile.
+random *pairs*, one from each pile.

 If there is a high association between the variables, then real life
 observations will bunch up in the two diagonal cells in the upper left and
@@ -1837,8 +1836,8 @@ We can carry out a resampling test with this procedure:
 * **Step 2.** Pair the two sets of cards randomly. Count the numbers of the
 four possible pairs: (1) "approve-drink," (2) "disapprove-don't drink," (3)
 "disapprove-drink," and (4) "approve-don't drink." Record the number of these
-combinations, as in Table 23-10, where columns 1-4 correspond to the four cells
-in Table 23-9.
+combinations, as in @tbl-beerpol-trial, where columns 1-4 correspond to the
+four data cells in @tbl-beerpol-data.
 * **Step 3.** Add (column 1 plus column 4), then add (column 2 plus column 3),
 and subtract the result in the second parenthesis from the result in the first
 parenthesis. If the difference is equal to or greater than 24, record "yes,"
```
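Steps 2 and 3 take only a few lines of code. A minimal sketch, with two loud assumptions: the approve/disapprove split is taken as 65/13 (the excerpt shows only "thirteen", so these counts are a guess that sums to 78), and the step-3 arithmetic is read as the diagonal cells (approve-drink, disapprove-don't drink) minus the off-diagonal cells:

```python
import numpy as np

rng = np.random.default_rng()

# Assumed 65/13 approve/disapprove split; 52/26 drink/don't as stated.
approve = np.repeat(['approve', 'disapprove'], [65, 13])
drink = np.repeat(['yes', 'no'], [52, 26])

n_trials = 10_000
n_ge_24 = 0
for _ in range(n_trials):
    shuffled = rng.permutation(drink)  # random pairing of the card sets
    diagonal = (((approve == 'approve') & (shuffled == 'yes')) |
                ((approve == 'disapprove') & (shuffled == 'no')))
    # Diagonal cells minus off-diagonal cells (step 3).
    if np.sum(diagonal) - np.sum(~diagonal) >= 24:
        n_ge_24 += 1

print('Proportion of random pairings with difference >= 24:',
      n_ge_24 / n_trials)
```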
```diff
@@ -2331,7 +2330,7 @@ the fact that the player sometimes does not get a hit for an abnormally
 long period of time. One way of testing whether or not the coach is
 right is by comparing an average player's longest slump in a 100-at-bat
 season with the longest run of outs in the first card trial. Assume that
-Slug is a player picked *at random* . Then compare Slug's longest
+Slug is a player picked *at random*. Then compare Slug's longest
 slump — say, 10 outs in a row — with the longest cluster of a single
 simulated 100-at-bat trial with the cards, 9 outs. This result suggests
 that Slug's apparent slump might well have resulted by chance.
```
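The slump comparison in the last hunk can also be simulated. A minimal sketch, assuming a .250 hitter, so an out with probability 0.75 at each at-bat:

```python
import numpy as np

rng = np.random.default_rng()

def longest_out_run(p_out=0.75, n_at_bats=100):
    # One simulated season: True = out, False = hit.
    outs = rng.random(n_at_bats) < p_out
    longest = run = 0
    for out in outs:
        run = run + 1 if out else 0
        longest = max(longest, run)
    return longest

runs = np.array([longest_out_run() for _ in range(10_000)])
# How often does a random 100-at-bat season contain a slump of
# ten or more consecutive outs?
print('Proportion of seasons with a run of 10+ outs:',
      np.mean(runs >= 10))
```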
57 changes: 30 additions & 27 deletions source/diagrams/covid-tree.svg
4 changes: 2 additions & 2 deletions source/dramatizing_resampling.Rmd
```diff
@@ -46,7 +46,7 @@ one coin, and if it is a head (boy), I then flip another coin and see
 how often that will be a boy, also, and then actually flip the coin once
 and record the outcomes. But that experiment does not resemble the
 situation of interest. A proper modeling throws *two* coins, examines to
-see if there is a head on *either* , and then examines the *other*.
+see if there is a head on *either*, and then examines the *other*.

 Or consider table @tbl-otherboy, where we asked the computer to do this
 work for us. The computer has done 50 trials, where one trial is one family of
@@ -110,7 +110,7 @@ Someone might wonder whether formal mathematics can help us with this
 problem. Formal (even though not formulaic) analysis can certainly
 provide an answer. We can use what is known as the "sample space"
 approach which reasons from first principles; here it consists of making
-a *list of the possibilities* , and examining the proportion of
+a *list of the possibilities*, and examining the proportion of
 "successes" to "failures" in that list.

 First we write down the equally-likely ways that two coins can fall:
```