Commit

Update corpus-analysis-with-spacy.md
- Correct typing errors, lines 72 and 194
- Adjust formatting of Research Questions to remove headers
anisa-hawes authored Oct 20, 2023
1 parent bf918fd commit 9039656
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions en/lessons/corpus-analysis-with-spacy.md
@@ -66,10 +66,10 @@ This lesson will describe how spaCy's utilities in **stopword removal,** **token

The following research questions will be investigated:

-#### Research Question 1: Do students use certain parts-of-speech more frequently in Biology texts versus English texts, and does this linguistic discrepancy signify differences in disciplinary conventions?
+**1: Do students use certain parts-of-speech more frequently in Biology texts versus English texts, and does this linguistic discrepancy signify differences in disciplinary conventions?**
Prior research has shown that even when writing in the same genres, writers in the sciences follow different conventions than those in the humanities. Notably, academic writing in the sciences has been characterized as informational, descriptive, and procedural, while scholarly writing in the humanities is narrativized, evaluative, and situation-dependent (that is, focused on discussing a particular text or prompt)[^5]. By deploying spaCy on the MICUSP texts, researchers can determine whether there are any significant differences between the part-of-speech tag frequencies in English and Biology texts. For example, we might expect students writing Biology texts to use more adjectives than those in the humanities, given their focus on description. Conversely, we might suspect English texts to contain more verbs and verb auxiliaries, indicating a more narrative structure. To test these hypotheses, you'll learn to analyze part-of-speech counts generated by spaCy, as well as to explore other part-of-speech count differences that could prompt further investigation.

-#### Research Question 2: Do students use certain named entities more frequently in different academic genres, and do these varing word frequencies signify broader differences in genre conventions?
+**2: Do students use certain named entities more frequently in different academic genres, and do these varying word frequencies signify broader differences in genre conventions?**
As with disciplinary differences, research has shown that different genres of writing have their own conventions and expectations. For example, explanatory genres such as research papers, proposals and reports tend to focus on description and explanation, whereas argumentative and critique-driven texts are driven by evaluations and arguments[^6]. By deploying spaCy on the MICUSP texts, researchers can determine whether there are any significant differences between the named entity frequencies in texts within the seven different genres represented (Argumentative Essay, Creative Writing, Critique/Evaluation, Proposal, Report, Research Paper, and Response Paper). We may suspect that argumentative genres engage more with people or works of art, since these could be entities serving to support their arguments or as the subject of their critiques. Conversely, perhaps dates and numbers are more prevalent in evidence-heavy genres, such as research papers and proposals. To test these hypotheses, you'll learn to analyze the nouns and noun phrases spaCy has tagged as 'named entities.'

In addition to exploring the research questions above, this lesson will address how a dataset enriched by spaCy can be exported in a usable format for further machine learning tasks including [sentiment analysis](/en/lessons/sentiment-analysis#calculate-sentiment-for-a-paragraph) or [topic modeling](/en/lessons/topic-modeling-and-mallet).
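
Editor's sketch (not part of the commit): the first research question compares part-of-speech frequencies across disciplines, and raw counts are hard to compare when texts differ in length. The example below shows one way to turn tag counts into proportions with pandas. The tag sequences are invented toy data, not MICUSP output, and `pos_proportions` is a hypothetical helper name. With spaCy, the tags would typically come from something like `[token.pos_ for token in doc]`.

```python
from collections import Counter

import pandas as pd


def pos_proportions(pos_tags):
    """Return each tag's share of all tags in one text (hypothetical helper)."""
    counts = Counter(pos_tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


# Toy tag sequences invented for illustration only (not MICUSP data).
toy_tags = {
    "Biology": ["DET", "ADJ", "NOUN", "VERB", "ADJ", "NOUN"],
    "English": ["PRON", "AUX", "VERB", "DET", "NOUN", "VERB"],
}

# One row per discipline, one column per tag; tags absent from a text become 0.
pos_df = pd.DataFrame(
    {name: pos_proportions(tags) for name, tags in toy_tags.items()}
).T.fillna(0)

print(pos_df.round(2))
```

The same normalization could plausibly be applied to named entity labels (for example `ent.label_` values from `doc.ents`) when comparing genres for the second research question.
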
@@ -191,7 +191,7 @@ metadata_df = pd.read_csv('metadata.csv')
metadata_df = metadata_df.dropna(axis=1, how='all')
```

-Display the first five rows to check that the data is as expected. Four rows should be present: the paper IDs, their titles, their discipline, and their type (genre).
+Display the first five rows to check that the data is as expected. Four columns should be present: the paper IDs, their titles, their discipline, and their type (genre).

{% include figure.html filename="or-en-corpus-analysis-with-spacy-04.png" alt="First five rows of student paper metadata DataFrame, including columns for paper ID, title, discipline, and paper type." caption="Figure 4: Head of DataFrame with paper metadata-ID, title, discpline and type in Google Colab" %}
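
Editor's sketch (not part of the commit): the corrected sentence asks the reader to confirm that four columns are present after empty columns are dropped. A minimal way to make that check explicit is shown below. The in-memory CSV, its column names (`PAPER ID`, `TITLE`, `DISCIPLINE`, `PAPER TYPE`, `NOTES`), and its rows are assumptions standing in for the real `metadata.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for metadata.csv; column names and rows are assumptions.
csv_text = """PAPER ID,TITLE,DISCIPLINE,PAPER TYPE,NOTES
BIO.G0.01.1,Example title A,Biology,Report,
ENG.G0.02.1,Example title B,English,Argumentative Essay,
"""

metadata_df = pd.read_csv(io.StringIO(csv_text))
# Drop columns in which every value is missing, as in the lesson code.
metadata_df = metadata_df.dropna(axis=1, how="all")

# Compare the remaining columns against the four the lesson describes.
expected_columns = ["PAPER ID", "TITLE", "DISCIPLINE", "PAPER TYPE"]
missing = [col for col in expected_columns if col not in metadata_df.columns]

print(metadata_df.head())
print("Missing columns:", missing)
```

Here the all-empty `NOTES` column is removed by `dropna(axis=1, how="all")`, so only the four expected columns remain.
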
