Update chapter07.qmd
Minor typos corrected
carlosarcila authored Dec 1, 2023
1 parent c91de4e commit f21f0dd
Showing 1 changed file with 6 additions and 6 deletions.
12 changes: 6 additions & 6 deletions content/chapter07.qmd
@@ -75,7 +75,7 @@ Now that you are familiar with data structures (Chapter [-@sec-chap-filetodata])

As we outlined in Chapter [-@sec-chap-introduction], the computational analysis
of communication can be bottom-up or top-down, inductive or
-deductive. Just as in traditional research methods @Bryman2012, sometimes, an inductive
+deductive. Just as in traditional research methods (see @Bryman2012), sometimes, an inductive
bottom-up approach is a goal in itself: after all, explorative
analyses are invaluable for generating hypotheses that can be tested
in follow-up research. But even when you are conducting a deductive,
@@ -96,7 +96,7 @@ Furthermore, before making any multivariate or inferential analysis we might wan
To illustrate how to do this in R and Python, we will use existing representative survey data to analyze how support for migrants or refugees in Europe changes over time and differs per country.
The Eurobarometer (freely available at the Leibniz Institute for the Social Sciences -- GESIS) has contained these specific questions since 2015. We might pose questions about the variation of a single variable or also describe the covariation of different variables to find patterns in our data. In this section, we will compute basic statistics to answer these questions and in the next section we will visualize them by plotting *within* and *between* variable behaviors of a selected group of features of the Eurobarometer conducted in November 2017 to 33193 Europeans.

-For most of the EDA we will use *tidyverse* in R and *pandas* as well as *numpy* and *scipy* in Python (Example 7.1). After loading a clean version of the survey data[^1] stored in a csv file (using the *tidyverse* function `read_csv` in R and the *pandas* function `read_csv` in R), checking the dimensions of our data frame (33193 x 17), we probably want to get a global picture of each of our variables by getting a frequency table. This table shows the frequency of different outcomes for every case in a distribution. This means that we can know how many cases we have for each number or category in the distribution of every variable, which is useful in order to have an initial understanding of our data.
+For most of the EDA we will use *tidyverse* in R and *pandas* as well as *numpy* and *scipy* in Python (Example [-@exm-load]). After loading a clean version of the survey data[^1] stored in a csv file (using the *tidyverse* function `read_csv` in R and the *pandas* function `read_csv` in Python), checking the dimensions of our data frame (33193 x 17), we probably want to get a global picture of each of our variables by getting a frequency table. This table shows the frequency of different outcomes for every case in a distribution. This means that we can know how many cases we have for each number or category in the distribution of every variable, which is useful in order to have an initial understanding of our data.
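The load-then-inspect step described in the changed line can be sketched in Python roughly as follows. This is a minimal sketch with a synthetic stand-in frame, since the cleaned Eurobarometer csv is not part of this diff; the column names are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned survey csv the chapter loads with
# pd.read_csv(...) -- the real data frame has shape (33193, 17).
df = pd.DataFrame({
    "country": ["DE", "DE", "FR", "ES", "FR", "DE"],
    "support_refugees": [4, 3, 4, 2, 5, 4],
})

print(df.shape)  # quick dimension check, as in the chapter

# A frequency table: how many cases fall into each category of a variable.
freq = df["country"].value_counts()
print(freq)
```

`value_counts` is the *pandas* counterpart of a one-variable frequency table; applying it column by column gives the global picture of the data the text describes.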

::: {.callout-note icon=false collapse=true}
## pandas versus pure numpy/scipy
@@ -458,7 +458,7 @@ In *ggplot* (R), you can use the `facet_grid` function to automatically create s
::: {.callout-note appearance="simple" icon=false}

::: {#exm-combine}
-Creating subfigures)
+Creating subfigures

::: {.panel-tabset}
## Python code
@@ -483,12 +483,12 @@ ggplot(support_long, aes(x=date_n, y=support)) +
:::
:::
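The subfigure idea from this example (whose own code is collapsed in this diff) can be sketched in Python with matplotlib's `plt.subplots`. The two daily-mean series below are synthetic stand-ins for the chapter's data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
dates = np.arange(30)
# Two hypothetical daily-mean series, one per subplot.
support_refugees = 3.0 + 0.02 * dates + rng.normal(0, 0.1, 30)
support_migrants = 2.8 + 0.01 * dates + rng.normal(0, 0.1, 30)

fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
axes[0].plot(dates, support_refugees)
axes[0].set_title("Support for refugees")
axes[1].plot(dates, support_migrants)
axes[1].set_title("Support for migrants")
fig.tight_layout()
fig.savefig("subfigures.png")
```

Sharing the y-axis (`sharey=True`) keeps the panels directly comparable, which is the point of placing them side by side.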

-Now if you want to explore the possible correlation between the average support for refugees (`mean_support_refugees_by_day`) and the average support to migrants by year (`mean_support_migrants_by_day`), you might need a scatterplot, which is a better way to visualize the type and strength of this relationship *scatter*.
+Now if you want to explore the possible correlation between the average support for refugees (`mean_support_refugees_by_day`) and the average support to migrants (`mean_support_migrants_by_day`), you might need a scatterplot, which is a better way to visualize the type and strength of this relationship *scatter*.
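A minimal Python sketch of such a scatterplot, with `np.corrcoef` to quantify the strength the plot shows visually. The two arrays are hypothetical stand-ins for the chapter's daily means:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical daily means: migrants roughly tracks refugees, plus noise.
mean_support_refugees_by_day = rng.normal(3.0, 0.2, 60)
mean_support_migrants_by_day = (
    0.8 * mean_support_refugees_by_day + rng.normal(0, 0.1, 60)
)

# Pearson's r summarizes the type (sign) and strength of the relationship.
r = np.corrcoef(mean_support_refugees_by_day, mean_support_migrants_by_day)[0, 1]

fig, ax = plt.subplots()
ax.scatter(mean_support_refugees_by_day, mean_support_migrants_by_day)
ax.set_xlabel("mean_support_refugees_by_day")
ax.set_ylabel("mean_support_migrants_by_day")
ax.set_title(f"Pearson r = {r:.2f}")
fig.savefig("scatter.png")
```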

::: {.callout-note appearance="simple" icon=false}

::: {#exm-scatter}
-Scatterplot of average support for refugees and migrants by year
+Scatterplot of average support for refugees and migrants

::: {.panel-tabset}
## Python code
@@ -1225,7 +1225,7 @@ pca$rotation
:::
:::

-The generated object with the PCA contains different elements (in R `sdev`, `rotation`, `center`, `scale` and `x`) or attributes in Python (`components_`, `explained_variance_`, `explained_variance_ratio`, `singular_values_`, `mean_`, `n_components_`, `n_features_`, `n_samples_`, and `noise_variance_`). In the resulting object we can see the values of four principal components of each country, and the values of the loadings, technically called *eigenvalues*, for the variables in each principal component. In our example we can see that support for refugees and migrants are more represented on PC1, while age and educational level are more represented on PC2. If we plot the first two principal components using base function `biplot` in R and the library *bioinfokit* in Python (Example [-@exm-plot_pca]), we can clearly see how the variables are associated with either PC1 or with PC2 (we might also want to plot any pair of the four components!). But we can also get a picture of how countries are grouped based only in these two new variables.
+The generated object with the PCA contains different elements (in R `sdev`, `rotation`, `center`, `scale` and `x`) or attributes (in Python `components_`, `explained_variance_`, `explained_variance_ratio`, `singular_values_`, `mean_`, `n_components_`, `n_features_`, `n_samples_`, and `noise_variance_`). In the resulting object we can see the values of four principal components of each country, and the values of the loadings, technically called *eigenvalues*, for the variables in each principal component. In our example we can see that support for refugees and migrants are more represented on PC1, while age and educational level are more represented on PC2. If we plot the first two principal components using base function `biplot` in R and the library *bioinfokit* in Python (Example [-@exm-plot_pca]), we can clearly see how the variables are associated with either PC1 or with PC2 (we might also want to plot any pair of the four components!). But we can also get a picture of how countries are grouped based only in these two new variables.
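A self-contained Python sketch of the fit this paragraph describes, using synthetic country-level data in place of the chapter's aggregates. Note that in scikit-learn the variance-ratio attribute is spelled `explained_variance_ratio_`, with a trailing underscore:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical country-level aggregates standing in for the chapter's four
# variables (support for refugees, support for migrants, age, education).
rng = np.random.default_rng(3)
X = rng.normal(size=(28, 4))  # 28 "countries", 4 variables

X_std = StandardScaler().fit_transform(X)  # PCA expects standardized input
pca = PCA(n_components=4).fit(X_std)

print(pca.components_)                # loadings per component and variable
print(pca.explained_variance_ratio_)  # variance share of each component
scores = pca.transform(X_std)         # per-country values on PC1..PC4
print(scores.shape)                   # one row per country, one column per PC
```

With all four components kept, the entries of `explained_variance_ratio_` sum to 1; inspecting `components_` row by row is the Python counterpart of reading `pca$rotation` in R.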

::: {.callout-note appearance="simple" icon=false}

