From 68ea6cf3c31c43c4c9faad5819475a81f1fb5b0c Mon Sep 17 00:00:00 2001
From: Anisa Hawes <87070441+anisa-hawes@users.noreply.github.com>
Date: Sat, 28 Oct 2023 00:49:37 +0100
Subject: [PATCH] Update corpus-analysis-with-spacy.md

Small adjustments to tables
---
 en/lessons/corpus-analysis-with-spacy.md | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/en/lessons/corpus-analysis-with-spacy.md b/en/lessons/corpus-analysis-with-spacy.md
index 08aaab45d..ffabff6f2 100644
--- a/en/lessons/corpus-analysis-with-spacy.md
+++ b/en/lessons/corpus-analysis-with-spacy.md
@@ -66,10 +66,10 @@ This lesson will describe how spaCy's utilities in **stopword removal,** **token

 The following research questions will be investigated:

-**1: Do students use certain parts-of-speech more frequently in Biology texts versus English texts, and does this linguistic discrepancy signify differences in disciplinary conventions?**
+**1: Do students use certain parts-of-speech more frequently in Biology texts versus English texts, and does this linguistic discrepancy signify differences in disciplinary conventions?**
 Prior research has shown that even when writing in the same genres, writers in the sciences follow different conventions than those in the humanities. Notably, academic writing in the sciences has been characterized as informational, descriptive, and procedural, while scholarly writing in the humanities is narrativized, evaluative, and situation-dependent (that is, focused on discussing a particular text or prompt)[^5]. By deploying spaCy on the MICUSP texts, researchers can determine whether there are any significant differences between the part-of-speech tag frequencies in English and Biology texts. For example, we might expect students writing Biology texts to use more adjectives than those in the humanities, given their focus on description. Conversely, we might suspect English texts to contain more verbs and verb auxiliaries, indicating a more narrative structure. To test these hypotheses, you'll learn to analyze part-of-speech counts generated by spaCy, as well as to explore other part-of-speech count differences that could prompt further investigation.

-**2: Do students use certain named entities more frequently in different academic genres, and do these varying word frequencies signify broader differences in genre conventions?**
+**2: Do students use certain named entities more frequently in different academic genres, and do these varying word frequencies signify broader differences in genre conventions?**
 As with disciplinary differences, research has shown that different genres of writing have their own conventions and expectations. For example, explanatory genres such as research papers, proposals and reports tend to focus on description and explanation, whereas argumentative and critique-driven texts are driven by evaluations and arguments[^6]. By deploying spaCy on the MICUSP texts, researchers can determine whether there are any significant differences between the named entity frequencies in texts within the seven different genres represented (Argumentative Essay, Creative Writing, Critique/Evaluation, Proposal, Report, Research Paper, and Response Paper). We may suspect that argumentative genres engage more with people or works of art, since these could be entities serving to support their arguments or as the subject of their critiques. Conversely, perhaps dates and numbers are more prevalent in evidence-heavy genres, such as research papers and proposals. To test these hypotheses, you'll learn to analyze the nouns and noun phrases spaCy has tagged as 'named entities.'

 In addition to exploring the research questions above, this lesson will address how a dataset enriched by spaCy can be exported in a usable format for further machine learning tasks including [sentiment analysis](/en/lessons/sentiment-analysis#calculate-sentiment-for-a-paragraph) or [topic modeling](/en/lessons/topic-modeling-and-mallet).
@@ -239,6 +239,7 @@ final_paper_df = metadata_df.merge(paper_df,on='Filename')
 Check the first five rows to make sure each has a filename, title, discipline, paper type and text (the full paper). At this point, you'll also see that any extra spaces have been removed from the beginning of the texts.
+
   | Filename | TITLE | DISCIPLINE | PAPER TYPE | Text
 -- | -- | -- | -- | -- | --
 0 | BIO.G0.15.1 | Invading the Territory of Invasives: The Dange... | Biology | Argumentative Essay | New York City, 1908: different colors of skin ...
@@ -669,6 +670,7 @@ This table shows the DataFrame including appearance counts of each part-of-speec
 162 | English | 487 | 715 | 175 | 240 | 324 | 500 | 2 | 1474 | 55 | 157 | 334 | 226 | 820 | 147 | 691 | 7.0 | 5.0
 163 | English | 68 | 94 | 23 | 34 | 26 | 79 | 3 | 144 | 2 | 25 | 36 | 54 | 80 | 22 | 69 | 1.0 | 2.0
 164 | English | 53 | 86 | 27 | 28 | 19 | 90 | 1 | 148 | 6 | 15 | 37 | 43 | 80 | 15 | 67 | NaN | NaN
+
 Now you can calculate the amount of times, on average, that each part-of-speech appears in Biology versus English papers. To do so, you use the `.groupby()` and `.mean()` functions to group all part-of-speech counts from the Biology texts together and calculate the mean usage of each part-of-speech, before doing the same for the English texts. The following code also rounds the counts to the nearest whole number:
@@ -691,6 +693,7 @@ Our DataFrame now contains average counts of each part-of-speech tag within each
 -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
 0| Biology | 237.0 | 299.0 | 93.0 | 141.0 | 89.0 | 234.0 | 1.0 | 614.0 | 81.0 | 44.0 | 74.0 | 194.0 | 343.0 | 50.0 | 237.0 | 8.0 | 6.0
 1| English | 211.0 | 364.0 | 127.0 | 141.0 | 108.0 | 283.0 | 2.0 | 578.0 | 34.0 | 99.0 | 223.0 | 189.0 | 367.0 | 70.0 | 306.0 | 7.0 | 5.0
+
 Here we can examine the differences between average part-of-speech usage per genre. As suspected, Biology student papers use slightly more adjectives (235 per paper on average) than English student papers (209 per paper on average), while an even greater number of verbs (306) are used on average in English papers than in Biology papers (237). Another interesting contrast is in the `NUM` tag: almost 50 more numbers are used in Biology papers, on average, than in English papers. Given the conventions of scientific research, this does makes sense; studies are much more frequently quantitative, incorporating lab measurements and statistical calculations.
@@ -744,6 +747,7 @@ Now, our DataFrame contains average counts of each fine-grained part-of-speech:
 -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
 0 | Biology | 5.0 | 94.0 | 10.0 | 198.0 | 339.0 | 35.0 | 6.0 | 4.0 | 38.0 | ... | 2.0 | 3.0 | 1.0 | 16.0 | 3.0 | 6.0 | 2.0 | 5.0 | 3.0 | 2.0
 1 | English | 35.0 | 138.0 | 7.0 | 141.0 | 414.0 | 50.0 | 6.0 | 3.0 | 25.0 | ... | 2.0 | 2.0 | 2.0 | 3.0 | NaN | 1.0 | 3.0 | 5.0 | 3.0 | 5.0
+
 As evidenced by the above DataFrame, spaCy identifies around 50 fine-grained part-of-speech tags. Researchers can investigate trends in the average usage of any or all of them. For example, is there a difference in the average usage of past tense versus present tense verbs in English and Biology papers? Three fine-grained tags that could help with this analysis are `VBD` (past tense verbs), `VBP` (non third-person singular present tense verbs), and `VBZ` (third-person singular present tense verbs).
@@ -784,12 +788,12 @@ ner_counts_df['CARDINAL_Counts'] = cardinal_counts
 Reviewing the DataFrame now, our column headings define each paper's genre and four named entities (PERSON, ORG, DATE, and CARDINAL) of which spaCy will count usage:

   | Genre | PERSON_Counts | LOC_Counts | DATE_Counts | WORK_OF_ART_Counts
--- | -- | -- | -- | -- | --
+-- | -- | :--: | :--: | :--: | :--:
 0 | Argumentative Essay | 9 | 3 | 20 | 3
 1 | Argumentative Essay | 90 | 13 | 151 | 6
-2| Argumentative Essay | 0 | 0 | 2 | 2
-3| Proposal | 11 | 6 | 21 | 4
-4| Proposal | 44 | 7 | 65 | 3
+2 | Argumentative Essay | 0 | 0 | 2 | 2
+3 | Proposal | 11 | 6 | 21 | 4
+4 | Proposal | 44 | 7 | 65 | 3

 From here, we can compare the average usage of each named entity and plot across paper type.
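
The comparison described in the final context line of the patch can be sketched as follows. This is a minimal illustration, not the lesson's own code: the stand-in `ner_counts_df` contains only the five sample rows shown in the last table, and the names `average_ner_df` and `long_df` are assumptions introduced here. It reuses the `.groupby()`, `.mean()` and rounding pattern the lesson describes for part-of-speech counts, then reshapes the result into a long format that most plotting libraries accept.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's ner_counts_df, built only from
# the five sample rows shown in the table above (not the full corpus).
ner_counts_df = pd.DataFrame({
    'Genre': ['Argumentative Essay', 'Argumentative Essay',
              'Argumentative Essay', 'Proposal', 'Proposal'],
    'PERSON_Counts': [9, 90, 0, 11, 44],
    'LOC_Counts': [3, 13, 0, 6, 7],
    'DATE_Counts': [20, 151, 2, 21, 65],
    'WORK_OF_ART_Counts': [3, 6, 2, 4, 3],
})

# Group rows by genre, average each entity count, and round to whole numbers.
average_ner_df = ner_counts_df.groupby('Genre').mean().round(0).reset_index()

# Reshape to one row per (genre, entity) pair, a convenient shape for bar charts.
long_df = average_ner_df.melt(
    id_vars='Genre', var_name='Entity', value_name='Average_Count'
)

print(average_ner_df)
print(long_df)
```

With only five rows the averages are not meaningful in themselves; the point is the shape of the result: one row per genre in `average_ner_df`, and one row per genre and entity pair in `long_df`.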