Style all code samples #125

Open

wants to merge 6 commits into base: source
1 change: 1 addition & 0 deletions Contributing/Contributing.md
Original file line number Diff line number Diff line change
@@ -184,3 +184,4 @@ Please be sure to add alt text to images for sight-impaired users. Image filenam
- **How can I discuss what I'm doing with other contributors?** Head to the [Issues](https://github.com/LOST-STATS/LOST-STATS.github.io/issues) page and find (or post) a thread with the title of the page you're talking about.
- **How can I [add an image/link to another LOST page/add an external link/bold text] in the LOST wiki?** See the Markdown section above.
- **I want to contribute but I do not like all the rules and structure on this page. I don't even want my FAQ entry to be a question. Just let me write what I want.** If you have valuable knowledge about statistical techniques to share with people and are able to explain things clearly, I don't want to stop you. So go for it. Maybe post something in [Issues](https://github.com/LOST-STATS/LOST-STATS.github.io/issues) when you're done and perhaps someone else will help make your page more consistent with the rest of the Wiki. I mean, it would be nicer if you did that yourself, but hey, we all have different strengths, right?

3 changes: 2 additions & 1 deletion Data/README.md
@@ -1 +1,2 @@
Folder for adding user-contributed data.

@@ -29,15 +29,20 @@ There are three main ways to join datasets horizontally in python using the `mer
```python
import pandas as pd

gdp_2018 = pd.DataFrame({'country': ['UK', 'USA', 'France'],
'currency': ['GBP', 'USD', 'EUR'],
'gdp_trillions': [2.1, 20.58, 2.78]})

dollar_value_2018 = pd.DataFrame({'currency': ['EUR', 'GBP', 'YEN', 'USD'],
'in_dollars': [1.104, 1.256, .00926, 1]})
gdp_2018 = pd.DataFrame(
{
"country": ["UK", "USA", "France"],
"currency": ["GBP", "USD", "EUR"],
"gdp_trillions": [2.1, 20.58, 2.78],
}
)

dollar_value_2018 = pd.DataFrame(
{"currency": ["EUR", "GBP", "YEN", "USD"], "in_dollars": [1.104, 1.256, 0.00926, 1]}
)

# Perform a left merge, which discards 'YEN'
GDPandExchange = pd.merge(gdp_2018, dollar_value_2018, how='left', on='currency')
GDPandExchange = pd.merge(gdp_2018, dollar_value_2018, how="left", on="currency")
```

## R
@@ -50,12 +55,16 @@ There are several ways to combine data sets horizontally in R, including base-R
library(dplyr)

# This data set contains information on GDP in local currency
GDP2018 <- data.frame(Country = c("UK", "USA", "France"),
Currency = c("Pound", "Dollar", "Euro"),
GDPTrillions = c(2.1, 20.58, 2.78))
GDP2018 <- data.frame(
Country = c("UK", "USA", "France"),
Currency = c("Pound", "Dollar", "Euro"),
GDPTrillions = c(2.1, 20.58, 2.78)
)
# This data set contains dollar exchange rates
DollarValue2018 <- data.frame(Currency = c("Euro", "Pound", "Yen", "Dollar"),
InDollars = c(1.104, 1.256, .00926, 1))
DollarValue2018 <- data.frame(
Currency = c("Euro", "Pound", "Yen", "Dollar"),
InDollars = c(1.104, 1.256, .00926, 1)
)
```

Next we want to join together `GDP2018` and `DollarValue2018` so we can convert all the GDPs to dollars and compare them. There are three kinds of observations we could get: observations in `GDP2018` but not `DollarValue2018`, observations in `DollarValue2018` but not `GDP2018`, and observations in both. Use `help(join)` to pick the variant of `join` that keeps the observations we want. The "Yen" observation won't have a match, and we don't need to keep it. So let's do a `left_join` and list `GDP2018` first, so it keeps matched observations plus any observations only in `GDP2018`.
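The same three kinds of observations can be inspected in pandas with the `indicator` option of `merge` — a sketch reusing the example frames from the Python section above:

```python
import pandas as pd

gdp_2018 = pd.DataFrame(
    {
        "country": ["UK", "USA", "France"],
        "currency": ["GBP", "USD", "EUR"],
        "gdp_trillions": [2.1, 20.58, 2.78],
    }
)
dollar_value_2018 = pd.DataFrame(
    {"currency": ["EUR", "GBP", "YEN", "USD"], "in_dollars": [1.104, 1.256, 0.00926, 1]}
)

# An outer merge keeps all rows from both sides; indicator=True adds a
# _merge column labeling each row 'left_only', 'right_only', or 'both'
check = pd.merge(
    gdp_2018, dollar_value_2018, how="outer", on="currency", indicator=True
)
print(check[["currency", "_merge"]])
```

Here "YEN" shows up as `right_only`, which is exactly the row a left merge would discard.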
@@ -117,3 +126,4 @@ A one-to-many merge is the opposite of a many-to-one merge, with multiple observ
#### Many-to-Many

A many-to-many merge is intended for use when there are multiple observations for each combination of the set of merging variables in both master and using data. However, `merge m:m` has strange behavior that is effectively never what you want, and it is not recommended.
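The same pitfall is easy to demonstrate in pandas: when the merging key is duplicated on both sides, `merge` returns every pairwise combination of matching rows. pandas also offers a `validate` argument (a pandas safeguard, not a Stata feature) that raises an error when the merge is not the shape you claimed:

```python
import pandas as pd

# Illustrative frames with the key 'id' duplicated on both sides
left = pd.DataFrame({"id": [1, 1], "x": ["a", "b"]})
right = pd.DataFrame({"id": [1, 1], "y": ["c", "d"]})

# The merge forms all 2 x 2 = 4 row pairs, which is rarely what you want
both = pd.merge(left, right, on="id")
print(len(both))  # 4

# validate raises MergeError if the merge is not one-to-one as claimed
try:
    pd.merge(left, right, on="id", validate="one_to_one")
except pd.errors.MergeError as err:
    print("caught:", err)
```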

@@ -39,3 +39,4 @@ Alternatively, the below example has two datasets that collect the same informat
| Donald Akliberti | B72197 | 34 |

These ways of combining data are referred to by different names across different programming languages, but here they will largely be referred to by one common set of terms (used by Stata and Python's Pandas): merge for horizontal combination and append for vertical combination.
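In pandas terms, the two operations are `merge` (horizontal) and `concat` (vertical) — a minimal sketch with made-up example frames:

```python
import pandas as pd

people_a = pd.DataFrame({"name": ["Ann", "Bo"], "age": [34, 29]})
people_b = pd.DataFrame({"name": ["Cy"], "age": [41]})
incomes = pd.DataFrame({"name": ["Ann", "Bo", "Cy"], "income": [50, 40, 60]})

# Vertical combination ("append"): same variables, new observations
stacked = pd.concat([people_a, people_b], ignore_index=True)

# Horizontal combination ("merge"): same observations, new variables
merged = pd.merge(stacked, incomes, on="name")
```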

@@ -8,11 +8,11 @@

# Combining Datasets: Vertical Combination

When two datasets collect the same information about different people, they are combined vertically because they have variables in common but different observations. The result of this combination will have more rows than either original dataset because it contains all of the people present in each of the original datasets. Here we combine the files based on the name or position of the columns in the dataset. It is a "vertical" combination in the sense that one set of observations gets added to the bottom of the other set of observations.

# Keep in Mind
- Vertical combinations require datasets to have variables in common to be of much use. That said, it may not be necessary for the two datasets to have exactly the same variables. Be aware of how your statistical package handles observations for a variable that is in one dataset but not another (e.g. are such observations set to missing?).
- It may be the case that the datasets you are combining have the same variables but those variables are stored differently (numeric vs. string storage types). Be aware of how the variables are stored across datasets and how your statistical package handles attempts to combine the same variable with different storage types (e.g. Stata throws an error and will not allow the combination unless the `, force` option is specified).
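A quick pandas illustration of both points — a variable present in only one dataset is filled with missing values rather than causing an error, and mismatched storage types are silently coerced to a generic object column:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.7]})
df2 = pd.DataFrame({"id": ["3", "4"], "extra": ["x", "y"]})

combined = pd.concat([df1, df2], ignore_index=True)

# 'score' is absent from df2, so its rows come back as NaN
print(combined["score"])

# 'id' was numeric in df1 and string in df2, so the combined
# column falls back to the generic object dtype
print(combined["id"].dtype)  # object
```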

# Implementations

@@ -24,12 +24,11 @@ When combining two datasets that collect the same information about different pe
import pandas as pd

# Load California Population data from the internet
df_ca = pd.read_stata('http://www.stata-press.com/data/r14/capop.dta')
df_il = pd.read_stata('http://www.stata-press.com/data/r14/ilpop.dta')
df_ca = pd.read_stata("http://www.stata-press.com/data/r14/capop.dta")
df_il = pd.read_stata("http://www.stata-press.com/data/r14/ilpop.dta")

# Concatenate a list of the dataframes (works on any number of dataframes)
df = pd.concat([df_ca, df_il])

```

## R
@@ -45,8 +44,8 @@ library(dplyr)
data(mtcars)

# Split it in two, so we can combine them back together
mtcars1 <- mtcars[1:10,]
mtcars2 <- mtcars[11:32,]
mtcars1 <- mtcars[1:10, ]
mtcars2 <- mtcars[11:32, ]

# Use bind_rows to vertically combine the data sets
mtcarswhole <- bind_rows(mtcars1, mtcars2)
@@ -56,26 +55,27 @@ mtcarswhole <- bind_rows(mtcars1, mtcars2)

```stata
* Load California Population data
webuse http://www.stata-press.com/data/r14/capop.dta // Import data from the web

append using http://www.stata-press.com/data/r14/ilpop.dta // Merge on Illinois population data from the web
```
You can also append multiple datasets at once by simply listing the datasets separated by spaces:

```stata
* Load California Population data
* Import data from the web
webuse http://www.stata-press.com/data/r14/capop.dta

* Merge on Illinois and Texas population data from the web
append using http://www.stata-press.com/data/r14/ilpop.dta http://www.stata-press.com/data/r14/txpop.dta
```
Note that, if there are columns in one but not the other of the datasets, Stata will still append the two datasets, but observations from the dataset that did not contain those columns will have their values for that variable set to missing.

```stata
* Load Odd Number Data
webuse odd.dta, clear

append using http://www.stata-press.com/data/r14/even.dta

```

@@ -25,11 +25,12 @@ Several python libraries have functions to turn categorical variables into dummi
import pandas as pd

# Create a dataframe
df = pd.DataFrame({'colors': ['red', 'green', 'blue', 'red', 'blue'],
'numbers': [5, 13, 1, 7, 5]})
df = pd.DataFrame(
{"colors": ["red", "green", "blue", "red", "blue"], "numbers": [5, 13, 1, 7, 5]}
)

# Replace the colors column with a dummy column for each color
df = pd.get_dummies(df, columns=['colors'])
df = pd.get_dummies(df, columns=["colors"])
```

## R
@@ -41,10 +42,10 @@ data(iris)

# To retain the column of dummies for the first
# categorical value we remove the intercept
model.matrix(~-1+Species, data=iris)
model.matrix(~ -1 + Species, data = iris)

# Then we can add the dummies to the original data
iris <- cbind(iris, model.matrix(~-1+Species, data=iris))
iris <- cbind(iris, model.matrix(~ -1 + Species, data = iris))

# Of course, in a regression we can skip this process
summary(lm(Sepal.Length ~ Petal.Length + Species, data = iris))
@@ -69,50 +70,49 @@ data(iris)
# mutated_data.
# Note: new variables do not have to be based on old
# variables
mutated_data = iris %>%
mutated_data <- iris %>%
mutate(Long.Petal = Petal.Length > Petal.Width)
```

This will create a new column of logical (`TRUE`/`FALSE`) values. This works just fine for most uses of dummy variables. However, if you need the variables to be 1s and 0s, you can convert them by multiplying by 1:

```r?example=dplyr
mutated_data <- mutated_data %>%
mutate(Long.Petal = Long.Petal*1)
mutate(Long.Petal = Long.Petal * 1)
```

You could also nest that operation inside the original creation of `Long.Petal` like so:

```r?example=dplyr
mutated_data = iris %>%
mutate(Long.Petal = (Petal.Length > Petal.Width)*1)
mutated_data <- iris %>%
mutate(Long.Petal = (Petal.Length > Petal.Width) * 1)
```

### Base R

```r?example=baser
#the following creates a 5 x 2 data frame
letters = c("a","b","c", "d", "e")
numbers = c(1,2,3,4,5)
df = data.frame(letters,numbers)
# the following creates a 5 x 2 data frame
letters <- c("a", "b", "c", "d", "e")
numbers <- c(1, 2, 3, 4, 5)
df <- data.frame(letters, numbers)
```

Now I'll show several different ways to create a dummy indicating if the numbers variable is odd.

```r?example=baser
df$dummy = df$numbers%%2
df$dummy <- df$numbers %% 2

df$dummy = ifelse(df$numbers%%2==1,1,0)
df$dummy <- ifelse(df$numbers %% 2 == 1, 1, 0)

df$dummy = df$numbers%%2==1
df$dummy <- df$numbers %% 2 == 1

# The last one created a logical outcome; to convert it to numeric we can either

df$dummy = df$dummy * 1
df$dummy <- df$dummy * 1

# or

df$dummy = (df$numbers%%2==1) *1

df$dummy <- (df$numbers %% 2 == 1) * 1
```

## MATLAB
@@ -121,7 +121,7 @@ df$dummy <- (df$numbers %% 2 == 1) * 1

The equivalent of `model.matrix()` in MATLAB is `dummyvar` which creates columns of one-hot encoded dummies from categorical variables. The following example is taken from MathWorks documentation.

```matlab
Colors = {'Red';'Blue';'Green';'Red';'Green';'Blue'};
Colors = categorical(Colors);

Expand All @@ -132,7 +132,7 @@ D = dummyvar(Colors)

In MATLAB you can store variables as columns in arrays. If you know you are going to add columns multiple times to the same array, it is best practice to pre-allocate the final size of the array for computational efficiency. If you do this, you can simply select the column you are designating for your dummy variable and store the dummies in that column.

```matlab
arr = [1,2,3;5,2,6;1,8,3];
dum = sum(arr, 2) < 10;
data = horzcat(arr,dum);
@@ -167,3 +167,4 @@ regress mpg weight b_*
* Create a logical variable
gen highmpg = mpg > 30
```

1 change: 1 addition & 0 deletions Data_Manipulation/Reshaping/reshape.md
@@ -7,3 +7,4 @@
---

# Reshaping Data

41 changes: 21 additions & 20 deletions Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.md
@@ -56,8 +56,10 @@ import pandas as pd

# Load WHO data on population as an example, which has 'country', 'year',
# and 'population' columns.
df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/tidyr/population.csv',
index_col=0)
df = pd.read_csv(
"https://vincentarelbundock.github.io/Rdatasets/csv/tidyr/population.csv",
index_col=0,
)

# In this example, we would like to have one row per country but the data have
# multiple rows per country, each corresponding with
@@ -69,9 +71,7 @@ print(df.head())
# the pivot function and set 'country' as the index. As we'd like to
# split out years into different columns, we set columns to 'years', and the
# values within this new dataframe will be population:
df_wide = df.pivot(index='country',
columns='year',
values='population')
df_wide = df.pivot(index="country", columns="year", values="population")

# What if there are multiple year-country pairs? Pivot can't work
# because it needs unique combinations. In this case, we can use
@@ -81,20 +81,19 @@ df_wide = df.pivot(index='country',
# 5% higher values for all years.

# Copy the data for France
synth_fr_data = df.loc[df['country'] == 'France']
synth_fr_data = df.loc[df["country"] == "France"]

# Add 5% for all years
synth_fr_data['population'] = synth_fr_data['population']*1.05
synth_fr_data["population"] = synth_fr_data["population"] * 1.05

# Append it to the end of the original data
df = pd.concat([df, synth_fr_data], axis=0)

# Compute the wide data - averaging over the two estimates for France for each
# year.
df_wide = df.pivot_table(index='country',
columns='year',
values='population',
aggfunc='mean')
df_wide = df.pivot_table(
index="country", columns="year", values="population", aggfunc="mean"
)
```

## R
@@ -121,27 +120,28 @@ Now we think:

```r?example=pivot_wider
pop_wide <- pivot_wider(population,
names_from = year,
values_from = population,
names_prefix = "pop_")
names_from = year,
values_from = population,
names_prefix = "pop_"
)
```

Another way to do this is using `data.table`.

```r?example=pivot_wider
#install.packages('data.table')
# install.packages('data.table')
library(data.table)

# The second argument here is the formula describing the observation level of the data
# The full set of variables together is the current observation level (one row per country and year)
# The parts before the ~ are what we want the new observation level to be in the wide data (one row per country)
# The parts after the ~ are for the variables we want to no longer be part of the observation level (we no longer want a row per year)

population = as.data.table(population)
pop_wide = dcast(population,
country ~ year,
value.var = "population"
)
population <- as.data.table(population)
pop_wide <- dcast(population,
country ~ year,
value.var = "population"
)
```

## Stata
@@ -210,3 +210,4 @@ restore
```

Note: there is much more guidance to the usage of greshape on the [Gtools reshape page](https://gtools.readthedocs.io/en/latest/usage/greshape/index.html).
