new load data interface #1

Open
wants to merge 13 commits into
base: master
from
18 changes: 15 additions & 3 deletions citytemps/citytemps.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,11 @@

### Learning Objectives

In this walk-through, you'll learn how to measure and visualize
dispersion of a single quantitative variable. You will also learn how to
change some of the default plot settings in R, like changing the axis
labels or the number of breaks in a histogram.
----------

Data files:
\*
@@ -99,11 +103,19 @@ distribution: the standard deviation.

## [1] 5.698457

Another measure of dispersion is the coverage interval: that is, an
Another measure of dispersion is the *coverage* or *prediction* interval: that is, an
interval covering a specified fraction of the observations. For example,
to get a central 50% coverage interval, we'd need the 25th and 75th
percentiles of the distribution. By definition, 50% of the observations
are between these two numbers. You can get these from the `qdata`
are between these two numbers. So if we were to repeatedly sample single observations
from this dataset completely at random, about 50% of the time they would fall into this interval
by construction. This usually isn't so useful by itself, but if we think about trying to predict the
temperature on some random day in the future, we might expect the temperature on that future
day to lie in the same interval with probability 0.50. That's why these kinds of intervals are
most commonly called *prediction* rather than *coverage* intervals, since they're trying
to bracket the value of a future data point.

You can get these from the `qdata`
function.

qdata(citytemps$Temp.SanDiego)
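
As a rough sketch of the "by construction" claim above, the same interval can be computed with base R's `quantile` function and then checked directly. The `temps` vector here is simulated and purely hypothetical; it is a stand-in for `citytemps$Temp.SanDiego`, not the real data.

```r
# Hypothetical stand-in for citytemps$Temp.SanDiego (simulated, not the real data)
set.seed(1)
temps <- rnorm(1000, mean = 63, sd = 5.7)

# Central 50% coverage interval: the 25th and 75th percentiles
interval <- quantile(temps, probs = c(0.25, 0.75))
interval

# Fraction of observations that fall inside the interval;
# this should be very close to 0.50 by construction
inside <- mean(temps >= interval[1] & temps <= interval[2])
inside
```

Changing `probs` to, say, `c(0.025, 0.975)` would give a central 95% interval in the same way.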
@@ -184,7 +196,7 @@ San Diego is actually more extreme than a 10-degree day in Rapid City!
As this example suggests, z-scores are useful for comparing numbers that
come from different distributions, with different statistical
properties. A z-score tells you how extreme a number is, relative to other
numbers from that some distribution.
numbers from that same distribution.
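
Here is a minimal sketch of that comparison in base R. Everything in it is an assumption made for illustration: the two cities are hypothetical, their temperatures are simulated, and `z_score` is a helper defined here rather than a function from the walk-through.

```r
# Simulated daily temperatures for two hypothetical cities
# (made-up parameters; not the real citytemps data)
set.seed(42)
mild_city   <- rnorm(365, mean = 63, sd = 6)
varied_city <- rnorm(365, mean = 47, sd = 20)

# z-score: how many standard deviations x lies from the mean of `values`
z_score <- function(x, values) {
  (x - mean(values)) / sd(values)
}

# Compare a 50-degree day in the mild city with a 10-degree day in the varied city
z_mild   <- z_score(50, mild_city)
z_varied <- z_score(10, varied_city)
c(mild = z_mild, varied = z_varied)
```

The z-score further from zero marks the day that is more unusual for its own city, even if the raw temperature is less extreme.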

### Fancier histograms

2 changes: 1 addition & 1 deletion gonefishing/gonefishing.md
@@ -8,7 +8,7 @@ Sampling distributions
In this walk-through, you'll learn about sampling distributions.

Data files:
\* [gonefishing.csv](gonefishing.csv): fictional data on fictional fish
\* [gonefishing.csv](https://raw.githubusercontent.com/jaredsmurray/learnR/master/gonefishing/gonefishing.csv): fictional data on fictional fish
in a fictional lake.

As usual, load the mosaic library.
Binary file added heights/files/import_options_new.png
6 changes: 3 additions & 3 deletions heights/heights.md
@@ -57,17 +57,17 @@ Read in the heights.csv data set by clicking the Import Dataset button in RStudi

![](files/import_dataset.png)

When you click Import Dataset, choose the "From Text File..." option, and in the window that pops up, surf to wherever you've downloaded the heights.csv file.
When you click Import Dataset, choose the "From CSV File..." option, and in the window that pops up, surf to wherever you've downloaded the heights.csv file.

![](files/import_file_window.png)

Select the heights.csv file and open it from this window. Now you should see a new window pop up, like this:

![](files/import_options.png)
![](files/import_options_new.png)

Three common things that you'll want to double-check in this window:
- What do you want the data set to be called within the R environment? By default, RStudio will name the data set after the file, so in this case the imported data frame will be stored as "heights" unless you provide an alternative in the "Name" field.
- Does the data file have a header row (i.e. is the first row the names of the variables)? If so, make sure the "Yes" button next to "Heading" is selection. In this case, we do have a header row providing the variable names (SHGT, MHGT, and FHGT).
- Does the data file have a header row (i.e. is the first row the names of the variables)? If so, make sure the "First Row as Names" option is checked. In this case, we do have a header row providing the variable names (SHGT, MHGT, and FHGT).
- What separates the data fields? Comma-separated files (like this one) are common; so are tab-separated files.

Usually RStudio does a good job at auto-detecting these features of the file. But sometimes it can get tripped up, so it's good to verify what the program thinks it is seeing in this window.
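
The same three choices (name, header row, separator) show up as arguments if you import from the console instead. The sketch below uses base R's `read.csv` on a small temporary file; the file contents are made up for illustration and only borrow the column names from heights.csv.

```r
# Write a tiny made-up CSV to a temporary file
# (stand-in for heights.csv; the values are invented)
tmp <- tempfile(fileext = ".csv")
writeLines(c("SHGT,MHGT,FHGT",
             "70,64,69",
             "66,62,68"), tmp)

# header = TRUE: the first row holds the variable names
# sep = ",":     fields are comma-separated
# The name on the left of the assignment is what the data set is called in R
heights <- read.csv(tmp, header = TRUE, sep = ",")

names(heights)
str(heights)
```

Calling `str` on the result is a quick way to verify that R read the file the way you expected.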
3 changes: 3 additions & 0 deletions sat/sat.md
@@ -5,6 +5,7 @@ layout: page
Test scores and GPA for UT graduates
------------------------------------

### Learning Objectives
In this walk-through, you'll learn how to summarize and visualize the
following kinds of relationships:
- between a numerical variable and a categorical variable, via
@@ -16,6 +17,8 @@ coefficients.
You will also learn how to change more of the default plot settings in R
plots.

------------------------------------

You'll need this data file:
\* [ut2000.csv](http://jgscott.github.io/teaching/data/ut2000.csv): data
on SAT scores and graduating GPA for every student who entered the
14 changes: 9 additions & 5 deletions titanic/titanic_permtest.md
@@ -5,7 +5,7 @@ permutation tests in the context of a 2x2 contingency table.

Data files:
\*
[TitanicSurvival.csv](http://jgscott.github.io/teaching/data/TitanicSurvival.csv)
[TitanicSurvival.csv](https://github.com/jgscott/ECO394D/raw/master/data/TitanicSurvival.csv) (right click the link and use "Save As")

First download the TitanicSurvival.csv file and read it in. You can use
RStudio's Import Dataset button, or the read.csv command:
@@ -28,8 +28,8 @@ they survived, along with their age, sex, and cabin class.

### Relative risk in 2x2 tables

One of the very first contingency tables we made looked at survival
status stratified by sex:
We can use the `xtabs` (short for crosstabulations) and `prop.table` (to compute proportions/frequencies from a
table of counts) commands to compute the survival probability for men and women:

t1 = xtabs(~sex + survived, data=TitanicSurvival)
prop.table(t1, margin=1)
@@ -38,8 +38,12 @@ status stratified by sex:
## sex no yes
## female 0.2725322 0.7274678
## male 0.8090154 0.1909846

Without the `margin=1` argument, `prop.table` would compute a table
of joint probabilities. Adding that argument computes the conditional probabilities
of survival for men and women.
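
To see the difference between the two kinds of table, here is a small sketch on a toy data frame. The `toy` data are made up for illustration and are not the Titanic data; only `xtabs` and `prop.table` come from the text above.

```r
# Toy data (made up for illustration; not the Titanic data)
toy <- data.frame(
  sex      = c("female", "female", "female",
               "male", "male", "male", "male", "male"),
  survived = c("yes", "yes", "no",
               "no", "no", "no", "yes", "no")
)
t_toy <- xtabs(~ sex + survived, data = toy)

# Joint proportions: every cell is divided by the grand total,
# so the whole table sums to 1
joint <- prop.table(t_toy)

# Conditional proportions: each cell is divided by its row total,
# so each row (each sex) sums to 1
conditional <- prop.table(t_toy, margin = 1)

joint
conditional
```

Using `margin = 2` instead would condition on the columns, giving the split by sex within each survival status.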

This seems to suggest a strong association between survival status and
The data seem to suggest a strong association between survival status and
sex. A natural *test statistic* to quantify this association between the
rows and columns of this table is the [relative
risk](http://en.wikipedia.org/wiki/Relative_risk) of dying: that is, the
@@ -242,6 +246,6 @@ Again, it is zero, up to Monte Carlo accuracy.
There are advantages and disadvantages to chi-square as a test
statistic. The relative risk is certainly a lot easier to understand and
interpret, especially for non-experts. On the other hand, relative risk
only makes sense 2x2 tables, while the chi-squared statistic generalizes
only makes sense in 2x2 tables, while the chi-squared statistic generalizes
quite readily to tables with more than two rows or more than two
columns.