
Commit

Update corpus-analysis-with-spacy.md
Small adjustments to tables
anisa-hawes authored Oct 27, 2023
1 parent 2784caf commit 68ea6cf
Showing 1 changed file with 10 additions and 6 deletions.
16 changes: 10 additions & 6 deletions en/lessons/corpus-analysis-with-spacy.md
@@ -66,10 +66,10 @@ This lesson will describe how spaCy's utilities in **stopword removal,** **token

The following research questions will be investigated:

**1: Do students use certain parts-of-speech more frequently in Biology texts versus English texts, and does this linguistic discrepancy signify differences in disciplinary conventions?**
Prior research has shown that even when writing in the same genres, writers in the sciences follow different conventions than those in the humanities. Notably, academic writing in the sciences has been characterized as informational, descriptive, and procedural, while scholarly writing in the humanities is narrativized, evaluative, and situation-dependent (that is, focused on discussing a particular text or prompt)[^5]. By deploying spaCy on the MICUSP texts, researchers can determine whether there are any significant differences between the part-of-speech tag frequencies in English and Biology texts. For example, we might expect students writing Biology texts to use more adjectives than those in the humanities, given their focus on description. Conversely, we might suspect English texts to contain more verbs and verb auxiliaries, indicating a more narrative structure. To test these hypotheses, you'll learn to analyze part-of-speech counts generated by spaCy, as well as to explore other part-of-speech count differences that could prompt further investigation.

**2: Do students use certain named entities more frequently in different academic genres, and do these varying word frequencies signify broader differences in genre conventions?**
As with disciplinary differences, research has shown that different genres of writing have their own conventions and expectations. For example, explanatory genres such as research papers, proposals, and reports tend to focus on description and explanation, whereas argumentative and critique-driven texts center on evaluations and arguments[^6]. By deploying spaCy on the MICUSP texts, researchers can determine whether there are any significant differences between the named entity frequencies in texts within the seven different genres represented (Argumentative Essay, Creative Writing, Critique/Evaluation, Proposal, Report, Research Paper, and Response Paper). We may suspect that argumentative genres engage more with people or works of art, since these entities could support their arguments or be the subject of their critiques. Conversely, perhaps dates and numbers are more prevalent in evidence-heavy genres, such as research papers and proposals. To test these hypotheses, you'll learn to analyze the nouns and noun phrases spaCy has tagged as 'named entities.'

In addition to exploring the research questions above, this lesson will address how a dataset enriched by spaCy can be exported in a usable format for further machine learning tasks including [sentiment analysis](/en/lessons/sentiment-analysis#calculate-sentiment-for-a-paragraph) or [topic modeling](/en/lessons/topic-modeling-and-mallet).
@@ -239,6 +239,7 @@ final_paper_df = metadata_df.merge(paper_df,on='Filename')

Check the first five rows to make sure each has a filename, title, discipline, paper type and text (the full paper). At this point, you'll also see that any extra spaces have been removed from the beginning of the texts.


  | Filename | TITLE | DISCIPLINE | PAPER TYPE | Text
-- | -- | -- | -- | -- | --
0 | BIO.G0.15.1 | Invading the Territory of Invasives: The Dange... | Biology | Argumentative Essay | New York City, 1908: different colors of skin ...
@@ -669,6 +670,7 @@ This table shows the DataFrame including appearance counts of each part-of-speec
162 | English | 487 | 715 | 175 | 240 | 324 | 500 | 2 | 1474 | 55 | 157 | 334 | 226 | 820 | 147 | 691 | 7.0 | 5.0
163 | English | 68 | 94 | 23 | 34 | 26 | 79 | 3 | 144 | 2 | 25 | 36 | 54 | 80 | 22 | 69 | 1.0 | 2.0
164 | English | 53 | 86 | 27 | 28 | 19 | 90 | 1 | 148 | 6 | 15 | 37 | 43 | 80 | 15 | 67 | NaN | NaN
</div>
Now you can calculate the average number of times each part-of-speech appears in Biology versus English papers. To do so, you use the `.groupby()` and `.mean()` methods to group all part-of-speech counts from the Biology texts together and calculate the mean usage of each part-of-speech, before doing the same for the English texts. The following code also rounds the counts to the nearest whole number:
@@ -691,6 +693,7 @@ Our DataFrame now contains average counts of each part-of-speech tag within each
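As a rough illustration of the grouping step described above, here is a minimal sketch on a toy DataFrame. The variable names (`pos_counts_df`, `average_pos_df`) and every count in it are invented for this example and are not taken from the lesson's code or from MICUSP; only the `DISCIPLINE` column name follows the table shown earlier.

```python
import pandas as pd

# Toy stand-in for the per-paper part-of-speech counts.
# All values are invented for illustration only.
pos_counts_df = pd.DataFrame({
    "DISCIPLINE": ["Biology", "Biology", "Biology", "English", "English"],
    "ADJ": [10, 14, 13, 8, 10],
    "VERB": [20, 24, 23, 30, 34],
    "NUM": [7, 5, 7, 1, 3],
})

# Group rows by discipline, average each count column, and round to the
# nearest whole number. reset_index() turns the group labels back into
# an ordinary column.
average_pos_df = (
    pos_counts_df.groupby("DISCIPLINE")
    .mean(numeric_only=True)
    .round(0)
    .reset_index()
)

print(average_pos_df)
```

Because `.groupby()` sorts its keys by default, the Biology row comes first and the English row second, matching the layout of the averaged table in this lesson.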
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
0| Biology | 237.0 | 299.0 | 93.0 | 141.0 | 89.0 | 234.0 | 1.0 | 614.0 | 81.0 | 44.0 | 74.0 | 194.0 | 343.0 | 50.0 | 237.0 | 8.0 | 6.0
1| English | 211.0 | 364.0 | 127.0 | 141.0 | 108.0 | 283.0 | 2.0 | 578.0 | 34.0 | 99.0 | 223.0 | 189.0 | 367.0 | 70.0 | 306.0 | 7.0 | 5.0
</div>
Here we can examine the differences between average part-of-speech usage per discipline. As suspected, Biology student papers use slightly more adjectives (235 per paper on average) than English student papers (209 per paper on average), while English papers use considerably more verbs on average (306) than Biology papers (237). Another interesting contrast is in the `NUM` tag: almost 50 more numbers are used in Biology papers, on average, than in English papers. Given the conventions of scientific research, this makes sense; studies are much more frequently quantitative, incorporating lab measurements and statistical calculations.
@@ -744,6 +747,7 @@ Now, our DataFrame contains average counts of each fine-grained part-of-speech:
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
0 | Biology | 5.0 | 94.0 | 10.0 | 198.0 | 339.0 | 35.0 | 6.0 | 4.0 | 38.0 | ... | 2.0 | 3.0 | 1.0 | 16.0 | 3.0 | 6.0 | 2.0 | 5.0 | 3.0 | 2.0
1 | English | 35.0 | 138.0 | 7.0 | 141.0 | 414.0 | 50.0 | 6.0 | 3.0 | 25.0 | ... | 2.0 | 2.0 | 2.0 | 3.0 | NaN | 1.0 | 3.0 | 5.0 | 3.0 | 5.0
</div>
As evidenced by the above DataFrame, spaCy identifies around 50 fine-grained part-of-speech tags. Researchers can investigate trends in the average usage of any or all of them. For example, is there a difference in the average usage of past tense versus present tense verbs in English and Biology papers? Three fine-grained tags that could help with this analysis are `VBD` (past tense verbs), `VBP` (non third-person singular present tense verbs), and `VBZ` (third-person singular present tense verbs).
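One possible way to approach the tense question is sketched below. It assumes an averaged DataFrame with one row per discipline and one column per fine-grained tag; the name `average_tag_df`, the derived columns `PRESENT` and `PAST_SHARE`, and all the numbers are hypothetical and chosen only to show the calculation.

```python
import pandas as pd

# Hypothetical averaged fine-grained tag counts, one row per discipline.
# The values are invented for illustration and are not lesson results.
average_tag_df = pd.DataFrame({
    "DISCIPLINE": ["Biology", "English"],
    "VBD": [40.0, 30.0],   # past tense verbs
    "VBP": [25.0, 35.0],   # non third-person singular present tense verbs
    "VBZ": [30.0, 45.0],   # third-person singular present tense verbs
    "NN": [300.0, 280.0],  # an unrelated tag, dropped below
})

# Keep only the three verb tags relevant to the tense comparison.
tense_df = average_tag_df[["DISCIPLINE", "VBD", "VBP", "VBZ"]].copy()

# Combine both present tense tags, then express past tense verbs as a
# share of all past and present tense verbs counted here.
tense_df["PRESENT"] = tense_df["VBP"] + tense_df["VBZ"]
tense_df["PAST_SHARE"] = (
    tense_df["VBD"] / (tense_df["VBD"] + tense_df["PRESENT"])
).round(2)

print(tense_df)
```

Comparing a share rather than raw averages keeps the result from being dominated by differences in paper length between the two disciplines.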
@@ -784,12 +788,12 @@ ner_counts_df['CARDINAL_Counts'] = cardinal_counts
The DataFrame's column headings now give each paper's genre and the four named entities (PERSON, ORG, DATE, and CARDINAL) whose usage spaCy will count:
  | Genre | PERSON_Counts | LOC_Counts | DATE_Counts | WORK_OF_ART_Counts
-- | -- | :--: | :--: | :--: | :--:
0 | Argumentative Essay | 9 | 3 | 20 | 3
1 | Argumentative Essay | 90 | 13 | 151 | 6
2 | Argumentative Essay | 0 | 0 | 2 | 2
3 | Proposal | 11 | 6 | 21 | 4
4 | Proposal | 44 | 7 | 65 | 3
From here, we can compare the average usage of each named entity and plot across paper type.
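A minimal sketch of that comparison follows, using only the five rows visible in the table above. The names `average_ner_df` and `long_ner_df`, and the `Entity` and `Average_Count` column labels, are assumptions made for this example. Plotting itself is left out; the long-format table is simply one convenient shape to hand to a plotting library.

```python
import pandas as pd

# The five rows shown in the table above, rebuilt by hand.
ner_counts_df = pd.DataFrame({
    "Genre": ["Argumentative Essay", "Argumentative Essay",
              "Argumentative Essay", "Proposal", "Proposal"],
    "PERSON_Counts": [9, 90, 0, 11, 44],
    "LOC_Counts": [3, 13, 0, 6, 7],
    "DATE_Counts": [20, 151, 2, 21, 65],
    "WORK_OF_ART_Counts": [3, 6, 2, 4, 3],
})

# Average each entity count within each genre, rounded to one decimal place.
average_ner_df = (
    ner_counts_df.groupby("Genre")
    .mean(numeric_only=True)
    .round(1)
    .reset_index()
)

# Reshape to long format: one row per (genre, entity) pair.
long_ner_df = average_ner_df.melt(
    id_vars="Genre", var_name="Entity", value_name="Average_Count"
)

print(long_ner_df)
```

With only five papers these averages say nothing about the corpus as a whole; the same two steps would apply unchanged to the full DataFrame.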
