diff --git a/citytemps/citytemps.md b/citytemps/citytemps.md index 0eefeee..ae9e23e 100644 --- a/citytemps/citytemps.md +++ b/citytemps/citytemps.md @@ -1,7 +1,11 @@ + +### Learning Objectives + In this walk-through, you'll learn how to measure and visualize dispersion of a single quantitative variable. You will also learn how to change some of the default plot settings in R, like changing the axis labels or the number of breaks in a histogram. +---------- Data files: \* @@ -99,11 +103,19 @@ distribution: the standard deviation. ## [1] 5.698457 -Another measure of dispersion is the coverage interval: that is, an +Another measure of dispersion is the *coverage* or *prediction* interval: that is, an interval covering a specified fraction of the observations. For example, to get a central 50% coverage interval, we'd need the 25th and 75 percentiles of the distribution. By definition, 50% of the observations -are between these two numbers. You can get these from the `qdata` +are between these two numbers. So if we were to repeatedly sample single observations +from this dataset completely at random, about 50% of the time they would fall into this interval +by construction. This usually isn't so useful by itself, but if we think about trying to predict the +temperature on some random day in the future, we might expect the temperature on that future +day to lie in the same interval with probability 0.50. That's why these kinds of intervals are +most commonly called *prediction* rather than *coverage* intervals, since they're trying +to bracket the value of a future data point. + +You can get these from the `qdata` function. qdata(citytemps$Temp.SanDiego) @@ -184,7 +196,7 @@ San Diego is actually more extreme than a 10-degree day in Rapid City! As this example suggests, z-scores are useful for comparing numbers that come from different distributions, with different statistical properties. 
It tells you how extreme a number is, relative to other -numbers from that some distribution. +numbers from that same distribution. ### Fancier histograms diff --git a/gonefishing/gonefishing.md b/gonefishing/gonefishing.md index cd710c2..3b40879 100644 --- a/gonefishing/gonefishing.md +++ b/gonefishing/gonefishing.md @@ -8,7 +8,7 @@ Sampling distributions In this walk-through, you'll learn about sampling distributions. Data files: -\* [gonefishing.csv](gonefishing.csv): fictional data on fictional fish +\* [gonefishing.csv](https://raw.githubusercontent.com/jaredsmurray/learnR/master/gonefishing/gonefishing.csv): fictional data on fictional fish in a fictional lake. As usual, load the mosaic library. diff --git a/heights/files/import_options_new.png b/heights/files/import_options_new.png new file mode 100644 index 0000000..2a64527 Binary files /dev/null and b/heights/files/import_options_new.png differ diff --git a/heights/heights.md b/heights/heights.md index 4a34c7d..2b9ddbf 100644 --- a/heights/heights.md +++ b/heights/heights.md @@ -57,17 +57,17 @@ Read in the heights.csv data set by clicking the Import Dataset button in RStudi ![](files/import_dataset.png) -When you click Import Dataset, choose the "From Text File..." option, and in the window that pops up, surf to wherever you've downloaded the heights.csv file. +When you click Import Dataset, choose the "From CSV File..." option, and in the window that pops up, surf to wherever you've downloaded the heights.csv file. ![](files/import_file_window.png) Select the heights.csv file and open it from this window. Now you should see a new window pop up, like this: -![](files/import_options.png) +![](files/import_options_new.png) Three common things that you'll want to double-check in this window: - What do you want the data set to be called within the R environment? 
By default, RStudio will name the data set after the file, so in this case the imported data frame will be stored as ``heights'' unless you provide an alternative in the "Name" field. -- Does the data file have a header row (i.e. is the first row the names of the variables)? If so, make sure the "Yes" button next to "Heading" is selection. In this case, we do have a header row providing the variable names (SHGT, MHGT, and FHGT). +- Does the data file have a header row (i.e. is the first row the names of the variables)? If so, make sure the "First Row as Names" option is checked. In this case, we do have a header row providing the variable names (SHGT, MHGT, and FHGT). - What separates the data fields? Comma-separated files (like this one) are common; so are tab-separated files. Usually RStudio does a good job at auto-detecting these features of the file. But sometimes it can get tripped up, so it's good to verify what the program thinks it is seeing in this window. diff --git a/sat/sat.md b/sat/sat.md index 10a7eb9..475f644 100644 --- a/sat/sat.md +++ b/sat/sat.md @@ -5,6 +5,7 @@ layout: page Test scores and GPA for UT graduates ------------------------------------ +### Learning Objectives In this walk-through, you'll learn how to summarize and visualize the following kinds of relationships: - between a numerical variable and a categorical variable, via @@ -16,6 +17,8 @@ coefficients. You will also learn how to change more of the default plot settings in R plots. +------------------------------------ + You'll need this data file: \* [ut2000.csv](http://jgscott.github.io/teaching/data/ut2000.csv): data on SAT scores and graduating GPA for every student who entered the diff --git a/titanic/titanic_permtest.md b/titanic/titanic_permtest.md index 1a4706e..61f0219 100644 --- a/titanic/titanic_permtest.md +++ b/titanic/titanic_permtest.md @@ -5,7 +5,7 @@ permutation tests in the context of a 2x2 contingency table. 
Data files: \* -[TitanicSurvival.csv](http://jgscott.github.io/teaching/data/TitanicSurvival.csv) +[TitanicSurvival.csv](https://github.com/jgscott/ECO394D/raw/master/data/TitanicSurvival.csv) (right click the link and use "Save As") First download the TitanicSurvival.csv file and read it in. You can use RStudio's Import Dataset button, or the read.csv command: @@ -28,8 +28,8 @@ they survived, along with their age, sex, and cabin class. ### Relative risk in 2x2 tables -One of the very first contingency tables we made looked at survival -status stratified by sex: +We can use the `xtabs` (short for crosstabulations) and `prop.table` (to compute proportions/frequencies from a +table of counts) commands to compute the survival probability for men and women: t1 = xtabs(~sex + survived, data=TitanicSurvival) prop.table(t1, margin=1) @@ -38,8 +38,12 @@ status stratified by sex: ## sex no yes ## female 0.2725322 0.7274678 ## male 0.8090154 0.1909846 + +Without the `margin=1` argument, `prop.table` would compute a table +of joint probabilities. Adding that argument computes conditional probabilities +of survival for men and women. -This seems to suggest a strong association between survival status and +The data seem to suggest a strong association between survival status and sex. A natural *test statistic* to quantify this association between the rows and columns of this table is the [relative risk](http://en.wikipedia.org/wiki/Relative_risk) of dying: that is, the @@ -242,6 +246,6 @@ Again, it is zero, up to Monte Carlo accuracy. There are advantages and disadvantages to chi-square as a test statistic. The relative risk is certainly a lot easier to understand and interpret, especially for non-experts. On the other hand, relative risk -only makes sense 2x2 tables, while the chi-squared statistic generalizes +only makes sense in 2x2 tables, while the chi-squared statistic generalizes quite readily to tables with more than two rows or more than two columns.