diff --git a/source/diagrams/covid-tree.svg b/source/diagrams/covid-tree.svg index f8b1929a..1d3aedfa 100644 --- a/source/diagrams/covid-tree.svg +++ b/source/diagrams/covid-tree.svg @@ -1,19 +1,19 @@ + inkscape:version="1.2.2 (b0a8486, 2022-12-01)" + sodipodi:docname="covid-tree.svg" + xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" + xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" + xmlns="http://www.w3.org/2000/svg" + xmlns:svg="http://www.w3.org/2000/svg" + xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" + xmlns:cc="http://creativecommons.org/ns#" + xmlns:dc="http://purl.org/dc/elements/1.1/"> + inkscape:window-y="25" + inkscape:window-maximized="0" + inkscape:showpageshadow="2" + inkscape:pagecheckerboard="0" + inkscape:deskcolor="#d1d1d1" /> @@ -72,7 +75,7 @@ image/svg+xml - + @@ -123,7 +126,7 @@ x="64.851036" y="58.438557" style="stroke-width:0.264583" - id="tspan879">COVID + id="tspan879">Covid have COVID + id="tspan906">have Covid 0.5% of testsyield falsepositivespositives forCOVID + id="tspan941">Covid without COVIDwithout Covidtest negatively + id="tspan986">test negative COVID tests failCovid tests failCOVID's presence + id="tspan1030">Covid's presence ... no statistical treatment can put validity into generalizations which are > based on data that were not reasonably accurate and complete to begin with. > It is unfortunate that academic departments so often offer courses on the -> statistical manipulation of human material to students who have little -> understanding of the problems involved in securing the original data. ... -> When training in these things replaces or at least precedes some of the +> statistical manipulation of [data from human behavior] to students who have +> little understanding of the problems involved in securing the original data. +> ... 
When training in these things replaces or at least precedes some of the > college courses on the mathematical treatment of data, we shall come nearer > to having a science of human behavior. [@kinsey1948sexual, p 35]. @@ -345,10 +345,10 @@ Scan this book and you will find almost no formal mathematics. Yet nearly every student finds the subject very difficult — as difficult as anything taught at universities. The root of the difficulty is that the *subject matter* is extremely difficult. Let's find out *why*. -It is easy to find out with high precision which movie is playing -tonight at the local cinema; you can look it up on the web or call the cinema -and ask. But consider by contrast how difficult it is to determine with -accuracy: + +It is easy to find out with high precision which movie is playing tonight at +the local cinema; you can look it up on the web or call the cinema and ask. But +consider by contrast how difficult it is to determine with accuracy: 1. Whether we will save lives by recommending vitamin D supplements for the whole population as protection against viral infections. Some evidence diff --git a/source/preface_second.Rmd b/source/preface_second.Rmd index 0d8ed5cd..e5bc9b2f 100644 --- a/source/preface_second.Rmd +++ b/source/preface_second.Rmd @@ -104,7 +104,7 @@ this success. The method was first presented at some length in the 1969 edition of my book *Basic Research Methods in Social Science* [@simon1969basic] (third edition -with Paul Burstein -@simon1985basic). +with Paul Burstein [-@simon1985basic]). For some years, the resampling method failed to ignite interest among statisticians. While many factors (including the accumulated @@ -184,7 +184,7 @@ separate publication where it might be overlooked. In ancient times, mathematics developed from the needs of governments and rich men to number armies, flocks, and especially to count the -taxpayers and their possessions. Up until the beginning of the 20th +taxpayers and their possessions. 
Up until the beginning of the 20^th^ century, the term *statistic* meant the number of something — soldiers, births, taxes, or what-have-you. In many cases, the term *statistic* still means the number of something; the most important statistics for the United @@ -194,12 +194,14 @@ the making or interpretation of descriptive statistics, because the topic is handled very well in most conventional statistics texts. Another stream of thought entered the field of probability and statistics in -the 17th century by way of gambling in France. Throughout history people had +the 17^th^ century by way of gambling in France. Throughout history people had learned about the odds in gambling games by repeated plays of the game. But in -the year 1654, the French nobleman Chevalier de Mere asked the great +the year 1654, the French nobleman Chevalier de Méré asked the great mathematician and philosopher Pascal to help him develop correct odds for some -gambling games. Pascal, the famous Fermat, and others went on to develop -modern probability theory. +gambling games[^problem-points]. Pascal, the famous Fermat, and others went on +to develop modern probability theory. + +[^problem-points]: Later these two streams of thought came together. Researchers wanted to know how accurate their descriptive statistics were — not only the @@ -208,9 +210,10 @@ numbers arising from experiments. Statisticians began to apply the theory of probability to the accuracy of the data arising from sample surveys and experiments, and that became the theory of *inferential statistics*. + Here we find a guidepost: probability theory and statistics are relevant -whenever there is uncertainty about events occurring in the world, or in -the numbers describing those events. +whenever there is uncertainty about events occurring in the world, or in the +numbers describing those events. Later, probability theory was also applied to another context in which there is uncertainty — decision-making situations. 
Descriptive diff --git a/source/preface_third.Rmd b/source/preface_third.Rmd index 7de8e953..09f78c7c 100644 --- a/source/preface_third.Rmd +++ b/source/preface_third.Rmd @@ -29,12 +29,12 @@ The book in your hands, or on your screen, is the third edition of a book originally called "Resampling: the new statistics", by Julian Lincoln Simon [-@simon1992resampling]. -One of the pleasures of writing an edition of someone else's book is that -we have some freedom to praise a previous version of our own book. We will -do that, in the next section. Next we talk about the resampling methods in -this book, and their place at the heart of "data science", Finally, we -discuss what we have changed, and why, and make some suggestions about where -this book could fit into your learning and teaching. +One of the pleasures of writing a new edition of a work by another author, is +that we can praise the previous version of our own book. We will do that, in +the next section. Next we talk about the resampling methods in this book, and +their place at the heart of "data science". We then discuss what we have +changed, what we haven't, and why. Finally, we make some suggestions about +where this book could fit into your learning and teaching. ## What Simon saw @@ -53,10 +53,10 @@ mathematics. Most students cannot follow along and quickly get lost, reducing the subject to — as Simon puts it — "mumbo-jumbo". On its own, this was not a new realization. Simon quotes a classic textbook by -Wallis and Roberts [-@wallis1956statistics], in which they compare teaching -statistics through mathematics to teaching in a foreign language. More -recently, other teachers of statistics have come to the same conclusion. Cobb -[-@cobb2007introductory] argues that it is practically impossible to teach +Wallis and Roberts [-@wallis1956statistics], to the effect that teaching +statistics through mathematics is like teaching philosophy in ancient Greek. 
+More recently, other teachers of statistics have come to the same conclusion. +Cobb [-@cobb2007introductory] argues that it is not practical to teach students the level of mathematics they would need to understand standard introductory courses. As you will see below, Cobb also agrees with Simon about the solution. @@ -67,37 +67,37 @@ appears in the original preface: "Beneath the logic of a statistical inference there necessarily lies a physical process". Drawing conclusions from noisy data means building a *model* of the noisy world, and seeing how that model behaves. That model can be physical, where we generate the noisiness of the -world using physical devices like dice and spinners and coin-tosses. In fact, -Simon used exactly these kinds of devices in his first experiments -in teaching [@simon1969basic]. He then saw that it was much more efficient to -build these models with simple computer code, and the result was the first and -second editions of this book, with their associated software, the *Resampling -Stats* language. - -Simon's second conclusion follows from the first. Now that Simon had stripped -away the unnecessary barrier of mathematics, he had got to the heart of what is -interesting and difficult in statistics. Drawing conclusions from noisy data -involves a lot of hard, clear thinking. We need to be honest with our students -about that; statistics is hard, not because it is obscure (it need not be), but -because it deals with difficult problems. It is exactly that hard logical -thinking that can make statistics so interesting to our best students; +world using physical devices like dice and spinners and coin-tosses. +Simon used exactly these kinds of devices in his first experiments in teaching +[@simon1969basic]. He then saw that it was much more efficient to build these +models with simple computer code, and the result was the first and second +editions of this book, with their associated software, the *Resampling Stats* +language. 
+ +Simon's second conclusion follows from the first. Now he had found a path +round the unnecessary barrier of mathematics, he had got to the heart of what +is interesting and difficult in statistics. Drawing conclusions from noisy +data involves a lot of hard, clear thinking. We should be honest with our +students about that; statistics is hard, not because it is obscure (it need +not be), but because it deals with difficult problems. It is exactly that hard +logical thinking that can make statistics so interesting to our best students; "statistics" is just reasoning about the world when the world is noisy. Simon writes eloquently about this in a section in the introduction — "Why is statistics such a difficult subject" (@sec-stats-difficult). -We needed both of Simon's conclusions to get anywhere. We cannot hope to +We need both of Simon's conclusions to make progress. We cannot hope to teach two hard subjects at the same time; mathematics, and statistical -reasoning. That is what Simon has done: he replaced the mathematics with -something that is much easier to reason about. Then he can concentrate on the -real, interesting problem — the hard thinking about data, and the world it -comes from. To quote from a later section in this book -(@sec-resamp-differs): "Once we get rid of the formulas and tables, we can -see that statistics is a matter of *clear thinking, not fancy mathematics*." -Instead of asking "where would I look up the right recipe for this", you -find yourself asking "what kind of world do these data come from?" and "how -can I reason about that world?". Like Simon, we have found that this way of -thinking and teaching is almost magically liberating and satisfying. We hope -and believe that you will find the same. +reasoning. He replaced the mathematics with something that is much easier for +most of us to reason about. By doing that, he can concentrate on the real, +interesting problem — the hard thinking about data, and the world it comes +from. 
To quote from a later section in this book (@sec-resamp-differs): "Once +we get rid of the formulas and tables, we can see that statistics is a matter +of *clear thinking, not fancy mathematics*." Instead of asking "where would +I look up the right recipe for this?", you find yourself asking "what kind of +world do these data come from?" and "How can I reason about that world?". +Like Simon, we have found that this way of thinking and teaching brings rich +rewards — for insight and practice. We hope and believe that you will find +the same. ## Resampling and data science {#sec-resampling-data-science} @@ -135,11 +135,11 @@ what it should do with data; it is the native language of data analysis. This insight transforms the way we think of code. In the past, we have thought of code as a separate, specialized skill, that some of us learn. We -take coding courses — we "learn to code". If code is the fundamental -language for analyzing data, then we need code to express what data analysis -does, and explain how it works. Here we "code to learn". Code is not an aim -in itself, but a language we can use to express the simple ideas behind data -analysis and statistics. +take coding courses — we "learn to code". But if we use code as the +fundamental language for analyzing data, then we need code to express what +data analysis does, and explain how it works. Here we "code to learn". Code +is not an aim in itself, but a language we can use to express the simple ideas +behind data analysis and statistics. Thus the data science movement started from code as the foundation for data analysis, to using code to explain statistics. It ends at the same place as @@ -153,37 +153,41 @@ goes on to explain why there is so much mathematics, and why we should remove it. In the age before ubiquitous computing, we needed mathematics to simplify calculations that we could not practically do by hand. 
Now we have great computing power in our phones and laptops, we do not have this constraint, and -we can use simpler resampling methods to solve the same problems. As Simon -shows, these are much easier to describe and understand. Data science, and -teaching with resampling, are the obvious consequences of ubiquitous -computing. +we can use simpler ideas from resampling methods to solve the same problems. +As Simon shows, these are much easier to describe and understand. Data +science, and teaching with resampling, are the obvious consequences of +ubiquitous computing. ## What we changed This diversion, through data science, leads us to the changes that we have made for the new edition. The previous edition of this book is still excellent, and -you can read it free, online, at . +you can read it freely at . It continues to be ahead of its time, and ahead of our time. Its one major drawback is that Simon bases much of the book around code written in a special -language that he developed with Dan Weidenfeld, called *Resampling Stats*. The -Resampling Stats language is well designed for expressing the steps in -simulating worlds that include elements of randomness, and it was a useful -contribution at the time that it was written. Since then, and particularly in -the last decade, there have been many improvements in more powerful and general -languages, such as {{< var lang >}} and {{< var other_lang >}}. These -languages are particularly suitable for beginners in data analysis, and they -come with a huge range of tools and libraries for a many tasks in data -analysis, including the kinds of models and simulations you will see in this -book. We have updated the book to use {{< var lang >}}, instead of *Resampling -Stats*. 
If you already know {{< var lang >}} or a similar language, such as -{{< var other_lang >}}, you will have a big head start in reading this book, -but even if you do not, we have written the book so it will be possible to pick -up the {{< var lang >}} code that you need to understand and build the kind of -models that Simon uses. The advantage to us, your authors, is that we can use -the very powerful tools associated with {{< var lang >}} to make it easier to -run and explain the code. The advantage to you, our readers, is that you can -also learn these tools, and the {{< var lang >}} language. They will serve you -well for the rest of your career in data analysis. +language that he developed with Dan Weidenfeld, called *Resampling +Stats*[^stats101]. The Resampling Stats language is well designed for +expressing the steps in simulating worlds that include elements of randomness, +and it was a useful contribution at the time that it was written. Since then, +and particularly in the last decade, there have been many improvements in more +powerful and general languages, such as {{< var lang >}} and {{< var +other_lang >}}. These languages are particularly suitable for beginners in +data analysis, and they come with a huge range of tools and libraries for +many tasks in data analysis, including the kinds of models and simulations +you will see in this book. We have updated the book to use {{< var lang >}}, +instead of *Resampling Stats*. If you already know {{< var lang >}} or +a similar language, such as {{< var other_lang >}}, you will have a big head +start in reading this book, but even if you do not, we have written the book +so it will be possible to pick up the {{< var lang >}} code that you need to +understand and build the kind of models that Simon uses. The advantage to us, +your authors, is that we can use the very powerful tools associated with {{< +var lang >}} to make it easier to run and explain the code. 
The advantage to +you, our readers, is that you can also learn these tools, and the +{{< var lang >}} language. They will serve you well for the rest of your +career in data analysis. + +[^stats101]: If you are interested, has + a free modern version of the original Resampling Stats language.