Style all code samples #125

Open

wants to merge 6 commits into base: source
1 change: 1 addition & 0 deletions Contributing/Contributing.md
Original file line number Diff line number Diff line change
@@ -184,3 +184,4 @@ Please be sure to add alt text to images for sight-impaired users. Image filenam
- **How can I discuss what I'm doing with other contributors?** Head to the [Issues](https://github.com/LOST-STATS/LOST-STATS.github.io/issues) page and find (or post) a thread with the title of the page you're talking about.
- **How can I [add an image/link to another LOST page/add an external link/bold text] in the LOST wiki?** See the Markdown section above.
- **I want to contribute but I do not like all the rules and structure on this page. I don't even want my FAQ entry to be a question. Just let me write what I want.** If you have valuable knowledge about statistical techniques to share with people and are able to explain things clearly, I don't want to stop you. So go for it. Maybe post something in [Issues](https://github.com/LOST-STATS/LOST-STATS.github.io/issues) when you're done and perhaps someone else will help make your page more consistent with the rest of the Wiki. I mean, it would be nicer if you did that yourself, but hey, we all have different strengths, right?

3 changes: 2 additions & 1 deletion Data/README.md
@@ -1 +1,2 @@
Folder for adding user-contributed data.

@@ -29,15 +29,20 @@ There are three main ways to join datasets horizontally in python using the `mer
```python
import pandas as pd

gdp_2018 = pd.DataFrame({'country': ['UK', 'USA', 'France'],
'currency': ['GBP', 'USD', 'EUR'],
'gdp_trillions': [2.1, 20.58, 2.78]})

dollar_value_2018 = pd.DataFrame({'currency': ['EUR', 'GBP', 'YEN', 'USD'],
'in_dollars': [1.104, 1.256, .00926, 1]})
gdp_2018 = pd.DataFrame(
{
"country": ["UK", "USA", "France"],
"currency": ["GBP", "USD", "EUR"],
"gdp_trillions": [2.1, 20.58, 2.78],
}
)

dollar_value_2018 = pd.DataFrame(
{"currency": ["EUR", "GBP", "YEN", "USD"], "in_dollars": [1.104, 1.256, 0.00926, 1]}
)

# Perform a left merge, which discards 'YEN'
GDPandExchange = pd.merge(gdp_2018, dollar_value_2018, how='left', on='currency')
GDPandExchange = pd.merge(gdp_2018, dollar_value_2018, how="left", on="currency")
```

## R
@@ -50,12 +55,16 @@ There are several ways to combine data sets horizontally in R, including base-R
library(dplyr)

# This data set contains information on GDP in local currency
GDP2018 <- data.frame(Country = c("UK", "USA", "France"),
Currency = c("Pound", "Dollar", "Euro"),
GDPTrillions = c(2.1, 20.58, 2.78))
GDP2018 <- data.frame(
Country = c("UK", "USA", "France"),
Currency = c("Pound", "Dollar", "Euro"),
GDPTrillions = c(2.1, 20.58, 2.78)
)
# This data set contains dollar exchange rates
DollarValue2018 <- data.frame(Currency = c("Euro", "Pound", "Yen", "Dollar"),
InDollars = c(1.104, 1.256, .00926, 1))
DollarValue2018 <- data.frame(
Currency = c("Euro", "Pound", "Yen", "Dollar"),
InDollars = c(1.104, 1.256, .00926, 1)
)
```

Next we want to join together `GDP2018` and `DollarValue2018` so we can convert all the GDPs to dollars and compare them. There are three kinds of observations we could get: observations in `GDP2018` but not `DollarValue2018`, observations in `DollarValue2018` but not `GDP2018`, and observations in both. Use `help(join)` to pick the variant of `join` that keeps the observations we want. The "Yen" observation won't have a match, and we don't need to keep it. So let's do a `left_join` and list `GDP2018` first, so it keeps matched observations plus any observations only in `GDP2018`.
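The same three kinds of observations can be inspected in pandas with the `indicator` option of `merge` — a sketch reusing the example frames from the Python section above:

```python
import pandas as pd

gdp_2018 = pd.DataFrame(
    {
        "country": ["UK", "USA", "France"],
        "currency": ["GBP", "USD", "EUR"],
        "gdp_trillions": [2.1, 20.58, 2.78],
    }
)
dollar_value_2018 = pd.DataFrame(
    {"currency": ["EUR", "GBP", "YEN", "USD"], "in_dollars": [1.104, 1.256, 0.00926, 1]}
)

# An outer merge keeps all rows from both sides; indicator=True adds a
# _merge column labeling each row 'left_only', 'right_only', or 'both'
check = pd.merge(
    gdp_2018, dollar_value_2018, how="outer", on="currency", indicator=True
)
print(check[["currency", "_merge"]])
```

Here "YEN" shows up as `right_only`, which is exactly the row a left merge would discard.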
@@ -117,3 +126,4 @@ A one-to-many merge is the opposite of a many-to-one merge, with multiple observ
#### Many-to-Many

A many-to-many merge is intended for use when there are multiple observations for each combination of the set of merging variables in both master and using data. However, `merge m:m` has strange behavior that is effectively never what you want, and it is not recommended.
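The same pitfall is easy to demonstrate in pandas: when the merging key is duplicated on both sides, `merge` returns every pairwise combination of matching rows. pandas also offers a `validate` argument (a pandas safeguard, not a Stata feature) that raises an error when the merge is not the shape you claimed:

```python
import pandas as pd

# Illustrative frames with the key 'id' duplicated on both sides
left = pd.DataFrame({"id": [1, 1], "x": ["a", "b"]})
right = pd.DataFrame({"id": [1, 1], "y": ["c", "d"]})

# The merge forms all 2 x 2 = 4 row pairs, which is rarely what you want
both = pd.merge(left, right, on="id")
print(len(both))  # 4

# validate raises MergeError if the merge is not one-to-one as claimed
try:
    pd.merge(left, right, on="id", validate="one_to_one")
except pd.errors.MergeError as err:
    print("caught:", err)
```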

@@ -39,3 +39,4 @@ Alternatively, the below example has two datasets that collect the same informat
| Donald Akliberti | B72197 | 34 |

These ways of combining data are referred to by different names across different programming languages, but here they will largely be referred to by one common set of terms (used by Stata and Python's Pandas): merge for horizontal combination and append for vertical combination.
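In pandas terms, the two operations are `merge` (horizontal) and `concat` (vertical) — a minimal sketch with made-up example frames:

```python
import pandas as pd

people_a = pd.DataFrame({"name": ["Ann", "Bo"], "age": [34, 29]})
people_b = pd.DataFrame({"name": ["Cy"], "age": [41]})
incomes = pd.DataFrame({"name": ["Ann", "Bo", "Cy"], "income": [50, 40, 60]})

# Vertical combination ("append"): same variables, new observations
stacked = pd.concat([people_a, people_b], ignore_index=True)

# Horizontal combination ("merge"): same observations, new variables
merged = pd.merge(stacked, incomes, on="name")
```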

@@ -8,11 +8,11 @@

# Combining Datasets: Vertical Combination

When two datasets collect the same information about different people, they are combined vertically because they have variables in common but different observations. The result of this combination will have more rows than either original dataset because it contains all of the people present in each of the original datasets. Here we combine the files based on the name or position of the columns in the dataset. It is a "vertical" combination in the sense that one set of observations gets added to the bottom of the other set of observations.

# Keep in Mind
- Vertical combinations require datasets to have variables in common to be of much use. That said, it may not be necessary for the two datasets to have exactly the same variables. Be aware of how your statistical package handles observations for a variable that is in one dataset but not another (e.g. are such observations set to missing?).
- It may be the case that the datasets you are combining have the same variables but those variables are stored differently (numeric vs. string storage types). Be aware of how the variables are stored across datasets and how your statistical package handles attempts to combine the same variable with different storage types (e.g. Stata throws an error and will not allow the combination unless the `, force` option is specified).
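A quick pandas illustration of both points — a variable present in only one dataset is filled with missing values rather than causing an error, and mismatched storage types are silently coerced to a generic object column:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.7]})
df2 = pd.DataFrame({"id": ["3", "4"], "extra": ["x", "y"]})

combined = pd.concat([df1, df2], ignore_index=True)

# 'score' is absent from df2, so its rows come back as NaN
print(combined["score"])

# 'id' was numeric in df1 and string in df2, so the combined
# column falls back to the generic object dtype
print(combined["id"].dtype)  # object
```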

# Implementations

@@ -24,12 +24,11 @@ When combining two datasets that collect the same information about different pe
import pandas as pd

# Load California Population data from the internet
df_ca = pd.read_stata('http://www.stata-press.com/data/r14/capop.dta')
df_il = pd.read_stata('http://www.stata-press.com/data/r14/ilpop.dta')
df_ca = pd.read_stata("http://www.stata-press.com/data/r14/capop.dta")
df_il = pd.read_stata("http://www.stata-press.com/data/r14/ilpop.dta")

# Concatenate a list of the dataframes (works on any number of dataframes)
df = pd.concat([df_ca, df_il])

```

## R
@@ -45,8 +44,8 @@ library(dplyr)
data(mtcars)

# Split it in two, so we can combine them back together
mtcars1 <- mtcars[1:10,]
mtcars2 <- mtcars[11:32,]
mtcars1 <- mtcars[1:10, ]
mtcars2 <- mtcars[11:32, ]

# Use bind_rows to vertically combine the data sets
mtcarswhole <- bind_rows(mtcars1, mtcars2)
@@ -56,26 +55,27 @@ mtcarswhole <- bind_rows(mtcars1, mtcars2)

```stata
* Load California Population data
webuse http://www.stata-press.com/data/r14/capop.dta // Import data from the web

append using http://www.stata-press.com/data/r14/ilpop.dta // Merge on Illinois population data from the web
```
You can also append multiple datasets at once by simply listing the datasets separated by spaces:

```stata
* Load California Population data
* Import data from the web
webuse http://www.stata-press.com/data/r14/capop.dta

* Merge on Illinois and Texas population data from the web
append using http://www.stata-press.com/data/r14/ilpop.dta http://www.stata-press.com/data/r14/txpop.dta
```
Note that, if there are columns in one but not the other of the datasets, Stata will still append the two datasets, but observations from the dataset that did not contain those columns will have their values for that variable set to missing.

```stata
* Load Odd Number Data
webuse odd.dta, clear

append using http://www.stata-press.com/data/r14/even.dta

```

@@ -25,11 +25,12 @@ Several python libraries have functions to turn categorical variables into dummi
import pandas as pd

# Create a dataframe
df = pd.DataFrame({'colors': ['red', 'green', 'blue', 'red', 'blue'],
'numbers': [5, 13, 1, 7, 5]})
df = pd.DataFrame(
{"colors": ["red", "green", "blue", "red", "blue"], "numbers": [5, 13, 1, 7, 5]}
)

# Replace the colors column with a dummy column for each color
df = pd.get_dummies(df, columns=['colors'])
df = pd.get_dummies(df, columns=["colors"])
```

## R
@@ -41,10 +42,10 @@ data(iris)

# To retain the column of dummies for the first
# categorical value we remove the intercept
model.matrix(~-1+Species, data=iris)
model.matrix(~ -1 + Species, data = iris)

# Then we can add the dummies to the original data
iris <- cbind(iris, model.matrix(~-1+Species, data=iris))
iris <- cbind(iris, model.matrix(~ -1 + Species, data = iris))

# Of course, in a regression we can skip this process
summary(lm(Sepal.Length ~ Petal.Length + Species, data = iris))
@@ -69,50 +70,49 @@ data(iris)
# mutated_data.
# Note: new variables do not have to be based on old
# variables
mutated_data = iris %>%
mutated_data <- iris %>%
mutate(Long.Petal = Petal.Length > Petal.Width)
```

This will create a new column of logical (`TRUE`/`FALSE`) values. This works just fine for most uses of dummy variables. However, if you need the variables to be 1s and 0s, you can convert them by multiplying by 1:

```r?example=dplyr
mutated_data <- mutated_data %>%
mutate(Long.Petal = Long.Petal*1)
mutate(Long.Petal = Long.Petal * 1)
```

You could also nest that operation inside the original creation of `Long.Petal` like so:

```r?example=dplyr
mutated_data = iris %>%
mutate(Long.Petal = (Petal.Length > Petal.Width)*1)
mutated_data <- iris %>%
mutate(Long.Petal = (Petal.Length > Petal.Width) * 1)
```

### Base R

```r?example=baser
#the following creates a 5 x 2 data frame
letters = c("a","b","c", "d", "e")
numbers = c(1,2,3,4,5)
df = data.frame(letters,numbers)
# the following creates a 5 x 2 data frame
letters <- c("a", "b", "c", "d", "e")
numbers <- c(1, 2, 3, 4, 5)
df <- data.frame(letters, numbers)
```

Now I'll show several different ways to create a dummy indicating if the numbers variable is odd.

```r?example=baser
df$dummy = df$numbers%%2
df$dummy <- df$numbers %% 2

df$dummy = ifelse(df$numbers%%2==1,1,0)
df$dummy <- ifelse(df$numbers %% 2 == 1, 1, 0)

df$dummy = df$numbers%%2==1
df$dummy <- df$numbers %% 2 == 1

# The last one created a logical outcome; to convert it to numeric we can either

df$dummy = df$dummy * 1
df$dummy <- df$dummy * 1

# or

df$dummy = (df$numbers%%2==1) *1

df$dummy <- (df$numbers %% 2 == 1) * 1
```

## MATLAB
@@ -121,7 +121,7 @@ df$dummy <- (df$numbers %% 2 == 1) * 1

The equivalent of `model.matrix()` in MATLAB is `dummyvar` which creates columns of one-hot encoded dummies from categorical variables. The following example is taken from MathWorks documentation.

```matlab
Colors = {'Red';'Blue';'Green';'Red';'Green';'Blue'};
Colors = categorical(Colors);

Expand All @@ -132,7 +132,7 @@ D = dummyvar(Colors)

In MATLAB you can store variables as columns in arrays. If you know you are going to add columns multiple times to the same array, it is best practice to pre-allocate the final size of the array for computational efficiency. If you do this, you can simply select the column you are designating for your dummy variable and store the dummies in that column.

```matlab
arr = [1,2,3;5,2,6;1,8,3];
dum = sum(arr, 2) < 10;
data = horzcat(arr,dum);
@@ -167,3 +167,4 @@ regress mpg weight b_*
* Create a logical variable
gen highmpg = mpg > 30
```

1 change: 1 addition & 0 deletions Data_Manipulation/Reshaping/reshape.md
@@ -7,3 +7,4 @@
---

# Reshaping Data

41 changes: 21 additions & 20 deletions Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.md
@@ -56,8 +56,10 @@ import pandas as pd

# Load WHO data on population as an example, which has 'country', 'year',
# and 'population' columns.
df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/tidyr/population.csv',
index_col=0)
df = pd.read_csv(
"https://vincentarelbundock.github.io/Rdatasets/csv/tidyr/population.csv",
index_col=0,
)

# In this example, we would like to have one row per country but the data have
# multiple rows per country, each corresponding with
@@ -69,9 +71,7 @@ print(df.head())
# the pivot function and set 'country' as the index. As we'd like to
# split out years into different columns, we set columns to 'years', and the
# values within this new dataframe will be population:
df_wide = df.pivot(index='country',
columns='year',
values='population')
df_wide = df.pivot(index="country", columns="year", values="population")

# What if there are multiple year-country pairs? Pivot can't work
# because it needs unique combinations. In this case, we can use
@@ -81,20 +81,19 @@ df_wide = df.pivot(index='country',
# 5% higher values for all years.

# Copy the data for France
synth_fr_data = df.loc[df['country'] == 'France']
synth_fr_data = df.loc[df["country"] == "France"]

# Add 5% for all years
synth_fr_data['population'] = synth_fr_data['population']*1.05
synth_fr_data["population"] = synth_fr_data["population"] * 1.05

# Append it to the end of the original data
df = pd.concat([df, synth_fr_data], axis=0)

# Compute the wide data - averaging over the two estimates for France for each
# year.
df_wide = df.pivot_table(index='country',
columns='year',
values='population',
aggfunc='mean')
df_wide = df.pivot_table(
index="country", columns="year", values="population", aggfunc="mean"
)
```

## R
@@ -121,27 +120,28 @@ Now we think:

```r?example=pivot_wider
pop_wide <- pivot_wider(population,
names_from = year,
values_from = population,
names_prefix = "pop_")
names_from = year,
values_from = population,
names_prefix = "pop_"
)
```

Another way to do this is using `data.table`.

```r?example=pivot_wider
#install.packages('data.table')
# install.packages('data.table')
library(data.table)

# The second argument here is the formula describing the observation level of the data
# The full set of variables together is the current observation level (one row per country and year)
# The parts before the ~ are what we want the new observation level to be in the wide data (one row per country)
# The parts after the ~ are for the variables we want to no longer be part of the observation level (we no longer want a row per year)

population = as.data.table(population)
pop_wide = dcast(population,
country ~ year,
value.var = "population"
)
population <- as.data.table(population)
pop_wide <- dcast(population,
country ~ year,
value.var = "population"
)
```

## Stata
@@ -210,3 +210,4 @@ restore
```

Note: there is much more guidance to the usage of greshape on the [Gtools reshape page](https://gtools.readthedocs.io/en/latest/usage/greshape/index.html).
