Issue 3052 #3053

Merged
merged 33 commits into from Nov 2, 2023
Conversation

anisa-hawes
Contributor

@anisa-hawes anisa-hawes commented Oct 19, 2023

Preparing files for publication on behalf of AWC.

Checklist

  • Assign yourself in the "Assignees" menu
  • Add the appropriate "Label"
  • If this PR closes an Issue, add the phrase Closes #ISSUENUMBER to your summary above
  • Ensure the status checks pass: if you have difficulty fixing build errors, please contact our Publishing Assistant @anisa-hawes
  • Check the Netlify Preview: navigate to netlify/ph-preview/deploy-preview and click 'details' (at right)
  • Assign at least one individual or team to "Reviewers"
    • If the text needs to be translated, please follow the translation request guidelines, then assign the relevant language team(s) as "Reviewers" and tag both the team and the managing editor in your PR.

Add bio for Megan S. Kane
Create corpus-analysis-with-spacy.md
- Update links to the lesson's `.ipynb` (to be rendered with nbviewer)
- Slightly adjust wording at lines 80, 82 and 84.
@anisa-hawes anisa-hawes self-assigned this Oct 19, 2023
anisa-hawes and others added 9 commits October 19, 2023 18:43
Delete image to remove transparent background.
(without transparent background)
- Correct typing errors, lines 72 and 194
- Adjust formatting of Research Questions to remove headers
Delete image to remove transparent background.
(without transparent background)
Delete to replace with cropped image.
@hawc2
Contributor

hawc2 commented Oct 24, 2023

@anisa-hawes is this ready for/waiting on my review?

@anisa-hawes
Contributor Author

anisa-hawes commented Oct 25, 2023

Hello @hawc2,

Yes. Please do read through, and let me know if you spot anything that needs adjustment.

As explained via Slack, I've gone through the process of re-cropping some of the images which appeared to be surrounded by 'transparent' background space.

I remain puzzled by the fact that several of the figures are aligned to the left margin while all the others are centred (Figures 9, 10, 25 and 26 most glaringly, although these do not feature that extra 'transparent' space).

I also have some concerns about the accessibility of the figures in general.

This is a broader question to tackle across all journals, but I think we should aim to avoid any screenshots of tabular data, replacing them with data tables formatted in Markdown (the ones I'd suggest replacing are, for example, Figures 3, 4, 5 and 7). I also think we could significantly improve the accessibility of this lesson by providing the excerpts from spaCy's outputs written as code (for example, Figures 9, 10, 13, 14, 16 and 17). I think something like this could work:

A bounded box which contains a descriptor followed by the excerpt:

output as code 
across several lines
as necessary

This doesn't render here as it would on our website, where I think it would appear as a grey-shaded box (like our 'notes' boxes) with a narrow-line frame around it.

Depending on what you think of these suggestions, we can agree a solution for making these adjustments.

Do you/Megan have the raw output excerpts at hand? If so, you could share them and Charlotte and I would be happy to implement the changes and make the necessary adjustments to the figure number sequence.

Transforming the tabular data might be a bit more cumbersome, but I think the accessibility benefits would be significant. Charlotte and I can help with this, or take the task on if you/Megan don't have the capacity.

@hawc2
Contributor

hawc2 commented Oct 25, 2023

@mkane968 can you provide the original spreadsheet data for screenshot spreadsheets?

@mkane968

Hi @hawc2 and @anisa-hawes,

Here is the tabular data for the specified images:

Figure 2:

  0
BIO.G0.01.1.txt b"Introduction\xe2\x80\xa6\xe2\x80\xa6\xe2\x80...
BIO.G0.02.1.txt b' Ernst Mayr once wrote, sympatric speci...
BIO.G0.02.2.txt b" Do ecological constraints favour certa...
BIO.G0.02.3.txt b" Perhaps one of the most intriguing va...
BIO.G0.02.4.txt b" The causal link between chromosomal re...

Figure 3:

  Filename Text
0 BIO.G0.01.1.txt Introduction……………………………………………………..1 Brief Hist...
1 BIO.G0.02.1.txt Ernst Mayr once wrote, sympatric speciation is...
2 BIO.G0.02.2.txt Do ecological constraints favour certain perce...
3 BIO.G0.02.3.txt Perhaps one of the most intriguing varieties o...
4 BIO.G0.02.4.txt The causal link between chromosomal rearrangem...

Figure 4:

  PAPER ID TITLE DISCIPLINE PAPER TYPE
0 BIO.G0.15.1 Invading the Territory of Invasives: The Dange... Biology Argumentative Essay
1 BIO.G1.04.1 The Evolution of Terrestriality: A Look at the... Biology Argumentative Essay
2 BIO.G3.03.1 Intracellular Electric Field Sensing using Nan... Biology Argumentative Essay
3 BIO.G0.11.1 Exploring the Molecular Responses of Arabidops... Biology Proposal
4 BIO.G1.01.1 V. Cholerae: First Steps towards a Spatially E... Biology Proposal

Figure 5:

  Filename TITLE DISCIPLINE PAPER TYPE Text
0 BIO.G0.15.1 Invading the Territory of Invasives: The Dange... Biology Argumentative Essay New York City, 1908: different colors of skin ...
1 BIO.G1.04.1 The Evolution of Terrestriality: A Look at the... Biology Argumentative Essay The fish-tetrapod transition has been called t...
2 BIO.G3.03.1 Intracellular Electric Field Sensing using Nan... Biology Argumentative Essay Intracellular electric fields are of great int...
3 BIO.G0.11.1 Exploring the Molecular Responses of Arabidops... Biology Proposal Environmental stresses to plants have been stu...
4 BIO.G1.01.1 V. Cholerae: First Steps towards a Spatially E... Biology Proposal The recurrent cholera pandemics have been rela...

Figure 7:

  Text Tokens
0 New York City, 1908: different colors of skin ... [New, York, City, ,, 1908, :, different, color...
1 The fish-tetrapod transition has been called t... [The, fish, -, tetrapod, transition, has, been...
2 Intracellular electric fields are of great int... [Intracellular, electric, fields, are, of, gre...
3 Environmental stresses to plants have been stu... [Environmental, stresses, to, plants, have, be...
4 The recurrent cholera pandemics have been rela... [The, recurrent, cholera, pandemics, have, bee...

Figure 18:

  DISCIPLINE ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ VERB SYM X
0 Biology 180 174 62 106 42 137 1 342 29 29 41 101 196 16 139 NaN NaN
1 Biology 421 458 174 253 187 389 1 868 193 78 121 379 786 99 389 1.0 2.0
2 Biology 163 171 63 91 51 148 1 362 6 31 23 44 134 15 114 4.0 1.0
3 Biology 318 402 120 267 121 317 1 908 101 93 128 151 487 92 387 4.0 NaN
4 Biology 294 388 97 142 97 299 1 734 89 41 36 246 465 36 233 1.0 7.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
160 English 943 1164 365 512 395 954 3 2287 98 315 530 406 1275 221 1122 15.0 8.0
161 English 672 833 219 175 202 650 1 1242 30 168 291 504 595 75 570 NaN 3.0
162 English 487 715 175 240 324 500 2 1474 55 157 334 226 820 147 691 7.0 5.0
163 English 68 94 23 34 26 79 3 144 2 25 36 54 80 22 69 1.0 2.0
164 English 53 86 27 28 19 90 1 148 6 15 37 43 80 15 67 NaN NaN

Figure 19:

  DISCIPLINE ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ VERB SYM X
0 Biology 237.0 299.0 93.0 141.0 89.0 234.0 1.0 614.0 81.0 44.0 74.0 194.0 343.0 50.0 237.0 8.0 6.0
1 English 211.0 364.0 127.0 141.0 108.0 283.0 2.0 578.0 34.0 99.0 223.0 189.0 367.0 70.0 306.0 7.0 5.0

Figure 21:

  DISCIPLINE POS RB JJR NNS IN VBG RBR RBS -RRB- ... FW LS WP$ NFP AFX $ `` XX ADD ''
0 Biology 5.0 94.0 10.0 198.0 339.0 35.0 6.0 4.0 38.0 ... 2.0 3.0 1.0 16.0 3.0 6.0 2.0 5.0 3.0 2.0
1 English 35.0 138.0 7.0 141.0 414.0 50.0 6.0 3.0 25.0 ... 2.0 2.0 2.0 3.0 NaN 1.0 3.0 5.0 3.0 5.0

Figure 23:

  Genre PERSON_Counts LOC_Counts DATE_Counts WORK_OF_ART_Counts
0 Argumentative Essay 9 3 20 3
1 Argumentative Essay 90 13 151 6
2 Argumentative Essay 0 0 2 2
3 Proposal 11 6 21 4
4 Proposal 44 7 65 3

And here is the raw output for the following figures:

Figure 6:

This PRON
is AUX
' PUNCT
an DET
' PUNCT
example NOUN
? PUNCT
sentence NOUN

Figure 8:

"Write" appears in the text tokens column 40 times.
"Write" appears in the lemmas column 310 times.
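The gap between the two counts in Figure 8 comes from lemmatization: inflected forms such as "writes", "wrote" and "writing" all share the lemma "write", so the lemmas column matches the exact string far more often than the raw tokens column. A minimal stdlib sketch of the principle (the token and lemma lists below are invented, not the lesson's data):

```python
# Hypothetical token and lemma sequences for one short text;
# spaCy would produce these via token.text and token.lemma_
tokens = ['She', 'writes', 'while', 'he', 'wrote', 'and', 'they', 'write']
lemmas = ['she', 'write', 'while', 'he', 'write', 'and', 'they', 'write']

# Counting the exact string 'write' in each column
print(tokens.count('write'))  # 1
print(lemmas.count('write'))  # 3
```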

Figure 9:

[[('PROPN', 'NNP'),
  ('PROPN', 'NNP'),
  ('PROPN', 'NNP'),
  ('PUNCT', ','),
  ('NUM', 'CD'),
  ('PUNCT', ':'),
  ('ADJ', 'JJ'),
  ('NOUN', 'NNS'),
  ('ADP', 'IN'),
  ('NOUN', 'NN'),
  ('NOUN', 'NN'),
  ('ADP', 'IN'),
  ('DET', 'DT'),
  ...]]

Figure 10:

[['New',
  'York',
  'City',
  'Earth',
  'Mooney',
  'Cleland',
  'Mack',
  'Dreissena',
  'Facon',
  'Mack',
  'Vredenburg',
  'Polynesia',
  'Euglandina',
  'Achatina',
  'fulica',
  'Hawaii',
  "O'Foighil",
  'Coote',
  'Loeve',
  'Hawaii',
  ...]]

Figure 13:

['New York City',
 'different colors',
 'skin swirl',
 'the great melting pot',
 'a cultural medley',
 'such a metropolis',
 'every last crevice',
 'Earth',
 'time',
 'people',
 'an unprecedented uniformity',
 'discrete identities',
 'Our heritages',
 'the history texts',
   ...]

Figure 14:

CARDINAL : Numerals that do not fall under another type
DATE : Absolute or relative dates or periods
EVENT : Named hurricanes, battles, wars, sports events, etc.
FAC : Buildings, airports, highways, bridges, etc.
GPE : Countries, cities, states
LANGUAGE : Any named language
LAW : Named documents made into laws.
LOC : Non-GPE locations, mountain ranges, bodies of water
MONEY : Monetary values, including unit
NORP : Nationalities or religious or political groups
ORDINAL : "first", "second", etc.
ORG : Companies, agencies, institutions, etc.
PERCENT : Percentage, including "%"
PERSON : People, including fictional
PRODUCT : Objects, vehicles, foods, etc. (not services)
QUANTITY : Measurements, as of weight or distance
TIME : Times smaller than a day
WORK_OF_ART : Titles of books, songs, etc.

Figure 16:

{95: 1, 87: 1, 97: 3, 90: 1, 92: 2}

Figure 17:

{'AUX': 1, 'DET': 1, 'NOUN': 2, 'PRON': 1, 'PUNCT': 3}
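For readers comparing Figures 16 and 17: the integer keys in Figure 16 are spaCy's internal vocabulary IDs, which `doc.vocab[k].text` resolves to the string labels shown in Figure 17. A minimal sketch of that conversion, using a plain dict as a stand-in for spaCy's vocab (the ID-to-label mapping below is inferred from the two figures, not queried from spaCy):

```python
# Stand-in for spaCy's vocab lookup; mapping inferred from Figures 16 and 17
id_to_label = {87: 'AUX', 90: 'DET', 92: 'NOUN', 95: 'PRON', 97: 'PUNCT'}

# Raw counts keyed by internal IDs, as in Figure 16
raw_counts = {95: 1, 87: 1, 97: 3, 90: 1, 92: 2}

# Resolve IDs to labels, mirroring dictionary[doc.vocab[k].text] = v in the lesson
readable = {id_to_label[k]: v for k, v in sorted(raw_counts.items())}
print(readable)  # {'AUX': 1, 'DET': 1, 'NOUN': 2, 'PRON': 1, 'PUNCT': 3}
```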

Figure 25:

2004,      24
2003,      18
the        17
2002,      12
2005,      11
1998,      11
2000,       9
year,       9
1977,       8
season,     8

Figure 26:

the         10
winter,      8
years,       6
2009         5
1950,        5
1960,        5
century,     4
decade,      3
of           3
decades,     3
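Figures 25 and 26 are frequency tables of the most common DATE-entity tokens, and pandas `Series.value_counts()` produces exactly this most-frequent-first layout. A toy illustration (the tokens below are made up, not the lesson's data):

```python
import pandas as pd

# Hypothetical DATE-entity tokens extracted from a set of papers
date_tokens = pd.Series(['2004,', '2004,', 'the', '2003,', '2004,', 'the', 'year,'])

# Most frequent first, mirroring the layout of Figures 25 and 26
counts = date_tokens.value_counts()
print(counts)
```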

Is this all you need? If there's a different or better format for you to revise the figures, I'm happy to provide that instead.

Additionally, when re-running the code, I realized that a cell needs to be added at the top of the Part of Speech Analysis section to create a new dataframe to use for the section:

# Create new DataFrame for analysis purposes
pos_analysis_df = final_paper_df[['Filename','DISCIPLINE', 'Doc']]

A couple of the code blocks after this have to be tweaked to use this new dataframe rather than final_paper_df; if they're not changed, the code will break at the start of the named entity analysis section.

Here is the Colab notebook with the revised change:
https://colab.research.google.com/drive/1vzAdK3o3JfPUUZT7Jpp_a6aB3U0UPlV9?usp=sharing

Can I still edit the markdown file to make this change? Sorry, I'm not sure how this slipped through earlier!

Thanks,

Megan

@hawc2
Contributor

hawc2 commented Oct 27, 2023

@mkane968 thanks so much, this is just what we needed I think.

@anisa-hawes let me know if I can help with anything else in preparing the lesson for publication; your plan sounds good to me. Slack me if I can help with debugging the images that are cropping oddly.

anisa-hawes and others added 7 commits October 27, 2023 11:37
- Replace figures 2, 3, 4, 5, 7, 18, 19, 21, 23 with tabular data
- Adjust/add text to introduce each table and explain what it provides

- Replace figures 6, 8, 9, 10, 13, 14, 16, 17, 25, 26 with raw output
- Adjust/add text to introduce each output and explain what it provides

- Some small typographical corrections
Add `table-wrapper` to make wide tables scrollable width-ways.
Add hard returns to follow `<div class="table-wrapper" markdown="block">`
Deleting directory (to replace with updated figure set)
Upload updated figure set.
- Update image filenames
- Update figure numbers
Small adjustments to tables
To be replaced with updated notebook.
@anisa-hawes
Contributor Author

anisa-hawes commented Oct 28, 2023

Hello @mkane968,

Many thanks for providing the data tables formatted in Markdown + the excerpts from spaCy's outputs written as code. I think these adjustments will significantly improve the accessibility and readability of this lesson, and I really appreciate your collaboration.

  • I've replaced figures 2, 3, 4, 5, 7, 18, 19, 21, 23 with tabular data (adjusting or adding text to introduce each table as necessary)
  • I've replaced figures 6, 8, 9, 10, 13, 14, 16, 17, 25, 26 with raw output (again, adjusting or adding text to introduce each excerpt as necessary)

You can review the Netlify Preview to see the changes as staged. I've renumbered the remaining figures (and their filenames), and adjusted the captions accordingly.

As noted above, I've also made some small adjustments to the text, so that the tables and output are introduced and make sense within the lesson. You can review the rich-diff of the changes I made d4bbbe6, and let me know if anything is incorrect or not as you want it.


  • I've replaced the .ipynb with your updated version. Thank you for re-running the code as a final check, and for noting this error.
  • I'm unclear about where the Markdown file needs to be adjusted, as I do not find the lines you quote within it. Can you tell me exactly which lines of the Markdown file need updating?
# Create new DataFrame for analysis purposes
pos_analysis_df = final_paper_df[['Filename','DISCIPLINE', 'Doc']]

Additionally, I'd like to raise a couple of queries which I noted in the course of making these adjustments:

  • Line 332 ("We'll discuss how and when to transform the lists to strings to conduct frequency counts below.") is unclear to me. Can you let me know which specific section or sub-section you are referring to, so that we can provide a direct link to it?
  • Line 454 ("The third text shown here, for example, involves astronomy concepts; this is likely to have been written for a biology course. In contrast, texts 163 and 164 appear to be analyses of Shakespeare plays and movie adaptations.") is confusing. I agree that the text output is likely to be from a Biology paper, but there are no words related to astronomy? I'm also unclear about the texts 163 and 164 you refer to: these do not appear to be mentioned anywhere else in the lesson.
  • In your comment above, you provided the raw spaCy output to replace Figure 10, beginning with the words:
[['New',
'York',
'City',
'Earth',

however I noticed that Figure 10 (as was) included a different extract (I've included a screenshot below) so I typed this output myself. This means your hypothesis (line 454) that the text output is likely to be from a Biology paper remains true (although, as I explain above, I think that sentence needs adjustment). Let me know how you want to handle this/if you want us to replace it with the New York list?

  • Line 741 ("As evidenced by the above DataFrame, spaCy identifies around 50 fine-grained part-of-speech tags.") is unclear to me. Reviewing the tabular data provided (to replace Figure 21) above that sentence, I cannot see how you reach the total of ~50.

Screenshot 2023-10-27 at 11 39 14

@mkane968

mkane968 commented Oct 30, 2023

Hi @anisa-hawes,

The changes look good! Just one minor edit:

  • Line 860: replace "words" with "standard 4-digit dates" so the sentence reads, "Here, only three of the most-frequently tagged DATE entities are standard 4-digit dates, and the rest are noun references to relative dates or periods."

Revisions related to the part-of-speech dataframe:

  • Revise line 627 to include the sentence, "First, we'll create a new DataFrame for the purposes of part-of-speech analysis, containing the text filenames, disciplines, and Doc objects." Replace "the" with "a" in the first sentence of the paragraph, and add the word "new" after "DataFrame" in the following sentence. The paragraph should read in full (additions in bold): To get the same type of dictionary for each text in a DataFrame, a function can be created to nest the above for loop. First, we'll create a new DataFrame for the purposes of part-of-speech analysis, containing the text filenames, disciplines, and Doc objects. We can then apply the function to each Doc object in the new DataFrame. In this case (and above), we are interested in the simpler, coarse-grained parts of speech.

  • Add the new line of code to the start of the code block at line 632, and revise the block to work with the new dataframe:

pos_analysis_df = final_paper_df[['Filename','DISCIPLINE', 'Doc']]

def get_pos_tags(doc):
    dictionary = {}
    num_pos = doc.count_by(spacy.attrs.POS)
    for k,v in sorted(num_pos.items()):
        dictionary[doc.vocab[k].text] = v
    num_list.append(dictionary)
    
pos_analysis_df['C_POS'] = pos_analysis_df['Doc'].apply(get_pos_tags)
  • Make adjustments to the code block starting at line 644 to work with the new dataframe:
pos_counts = pd.DataFrame(num_list)
columns = list(pos_counts.columns)
idx = 0
new_col = pos_analysis_df['DISCIPLINE']
pos_counts.insert(loc=idx, column='DISCIPLINE', value=new_col)
pos_counts.head()
  • Make adjustments to the code block beginning at line 703 to work with the new dataframe:
tag_num_list = []
def get_fine_pos_tags(doc):
    dictionary = {}
    num_tag = doc.count_by(spacy.attrs.TAG)
    for k,v in sorted(num_tag.items()):
        dictionary[doc.vocab[k].text] = v
    tag_num_list.append(dictionary)
    
pos_analysis_df['F_POS'] = pos_analysis_df['Doc'].apply(get_fine_pos_tags)
average_tag_df
  • Line 752 should be revised to read, "To start, we'll create a new DataFrame with the text filenames, types (genres), and named entity words and tags:" instead of "...filenames, disciplines, and part of speech tags."
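The pos_counts construction in the block above depends on two pandas behaviours: a DataFrame built from a list of dicts gets one column per tag (ordered by first appearance, with NaN where a text lacks a tag), and `DataFrame.insert(loc=0, ...)` places the discipline labels as the first column. A toy run under those assumptions (the data is invented):

```python
import pandas as pd

# Toy version of num_list: one tag dictionary per text
num_list = [{'NOUN': 3, 'VERB': 2}, {'NOUN': 1, 'ADJ': 4}]
pos_counts = pd.DataFrame(num_list)

# Insert the discipline labels as the first column, as in Megan's block
pos_counts.insert(loc=0, column='DISCIPLINE', value=pd.Series(['Biology', 'English']))
print(list(pos_counts.columns))  # ['DISCIPLINE', 'NOUN', 'VERB', 'ADJ']
```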

In response to your other notes:

  • I think line 332 can be deleted, it was referring to a part of the tutorial that has been condensed/revised. Part-of-speech tags and named entities are counted in the analysis section, but list-to-string conversion is not emphasized, so I don't think it's necessary to mention.

  • Regarding Line 454, the original output in the Colab notebook was scrollable, and my discussion was referring to the proper nouns generated in several different texts. This section could be revised to list and compare the proper nouns in two different texts, with the following text and code to replace lines 417 to 454:

Listing the proper nouns in each text can help us ascertain the texts' subjects. Let's list the proper nouns in two different texts: the text located in row 3 of the DataFrame and the text located in row 163.

list(final_paper_df.loc[[3, 163], 'Proper_Nouns'])

The first text in the list includes botany and astronomy concepts; this is likely to have been written for a biology course.

[['Mars',
  'Arabidopsis',
  'Arabidopsis',
  'LEA',
  'COR',
  'LEA',
  'NASA',
    ...]]

In contrast, the second text appears to be an analysis of Shakespeare plays and movie adaptations, likely written for an English course.

['Shakespeare',
  'Bard',
  'Julie',
  'Taymor',
  'Titus',
  'Shakespeare',
  'Titus',
    ...]

Along with assisting content analyses, extracting nouns has been shown to help build more efficient topic models[^9].

  • The above revision involves replacing the Figure 10 output with the above referenced outputs, so that difference would also be resolved. In the Colab notebook, the line of code that reads list(final_paper_df['Proper_Nouns']) should be replaced with list(final_paper_df.loc[[3, 163], 'Proper_Nouns']). Here is an updated version of the notebook: https://colab.research.google.com/drive/18z5x1X3nzFQ7j0PzzGdcpITPcPWTdj6r?usp=sharing

  • The fine-grained part-of-speech table at line 741 only shows about 20 POS tags, with an ellipsis in the middle to indicate columns not shown; this is how the output was generated in Colab. A note could be added about the format of the table (i.e., since the table has 50+ columns, they are not all displayed; see a full list of the fine-grained part-of-speech tags spaCy generates here: https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/).
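Megan's revised line relies on `DataFrame.loc` accepting a list of row labels and returning just those rows, in the order given. A toy sketch of the pattern (the frame and values here are stand-ins for final_paper_df, not the lesson's data):

```python
import pandas as pd

# Stand-in for final_paper_df with a column of per-text proper-noun lists
df = pd.DataFrame({'Proper_Nouns': [['Mars', 'NASA'],
                                    ['Shakespeare', 'Titus'],
                                    ['Cholerae'],
                                    ['Arabidopsis', 'LEA']]})

# .loc with a list of labels selects only those rows,
# mirroring list(final_paper_df.loc[[3, 163], 'Proper_Nouns'])
subset = list(df.loc[[0, 3], 'Proper_Nouns'])
print(subset)  # [['Mars', 'NASA'], ['Arabidopsis', 'LEA']]
```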

Thanks!

Megan

Integrate Megan's edits.
Delete notebook asset to replace with updated version.
@anisa-hawes
Contributor Author

anisa-hawes commented Nov 1, 2023

Thank you, @mkane968.

  • I've worked through the edits you list above. Please review these amendments to the file here: 0a145ec and let me know that you are happy.

  • One query that came up for me was whether the initial line num_list = [] (previously line 631) is correct to remove when replaced with this block:

pos_analysis_df = final_paper_df[['Filename','DISCIPLINE', 'Doc']]

def get_pos_tags(doc):
    dictionary = {}
    num_pos = doc.count_by(spacy.attrs.POS)
    for k,v in sorted(num_pos.items()):
        dictionary[doc.vocab[k].text] = v
    num_list.append(dictionary)
    
pos_analysis_df['C_POS'] = pos_analysis_df['Doc'].apply(get_pos_tags)
  • I've amended the paragraph including the line spaCy identifies around 50 fine-grained part-of-speech tags (now line 740) to add the note and link as suggested. I've tried to perma.cc the URL https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/#Fine_Grained_POS_Tag_list but multiple Google adverts and YouTube pop-ups on that page make it near-impossible to read, and cause the archived link to fail after a few moments.

  • Is there an alternative fine-grained part-of-speech tag list available online which you could suggest?

  • I've removed the existing .ipynb and replaced it with your updated version 64775c1.

@mkane968

mkane968 commented Nov 1, 2023

Thanks @anisa-hawes! The changes look good, just a few notes from a final read-through:

  • Line 210 essentially restates Line 212, so it should be removed.
  • Same with Line 280, can be removed because it reiterates Line 273.
  • In the second sentence of the paragraph starting at line 744, replace the phrase "third-person singular tense part-of-speech verbs" with "third-person singular present tense verbs". The full sentence should read, "However, in English papers, an average of 130 third-person singular present tense verbs are used per paper, compared to around 40 of the other two categories."
  • In the second sentence in the paragraph starting in line 757, replace the phrase, "CARDINAL (numbers)" with WORKS_OF_ART. In the following code block, line 765 should read woa_counts = ner_analysis_df['Named_Entities'].str.count('WORK_OF_ART') instead of cardinal_counts = ner_analysis_df['Named_Entities'].str.count('CARDINAL')
  • and the very last line (line 772) should read ner_counts_df['WORK_OF_ART_Counts'] = woa_counts instead of ner_counts_df['CARDINAL_Counts'] = cardinal_counts.
  • Finally, replace CARDINAL with WORKS_OF_ART in line 775 so it reads "Reviewing the DataFrame now, our column headings define each paper's genre and four named entities (PERSON, ORG, DATE, and WORKS_OF_ART) of which spaCy will count usage." The table, bar chart, and following analysis feature works of art as the 4th named entity instead of cardinal numbers, as does the Colab script.
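For reference, the `.str.count()` calls being swapped here count non-overlapping occurrences of a pattern in each row of a string column, so renaming the pattern from CARDINAL to WORK_OF_ART changes which entity label is tallied. A minimal illustration with invented entity strings (not the lesson's data):

```python
import pandas as pd

# Hypothetical Named_Entities column: one space-joined label string per paper
named_entities = pd.Series(['PERSON WORK_OF_ART DATE WORK_OF_ART',
                            'PERSON DATE DATE',
                            'ORG'])

# Per-row counts of the WORK_OF_ART label, as in the revised line 765
woa_counts = named_entities.str.count('WORK_OF_ART')
print(list(woa_counts))  # [2, 0, 0]
```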

In response to your other questions:

  • the line num_list = [] should NOT be removed from the code block; it creates the list in which to store the part-of-speech tag dictionaries, which are then added to the DataFrame by the following function. So it's needed regardless of whether we use the original DataFrame or the new part-of-speech DataFrame. Sorry if that was unclear in my earlier comments! The correct block should read in full:
# Create new DataFrame for analysis purposes
pos_analysis_df = final_paper_df[['Filename','DISCIPLINE', 'Doc']]

# Create list to store each dictionary
num_list = []

# Define a function to get part of speech tags and counts and append them to a new dictionary
def get_pos_tags(doc):
    dictionary = {}
    num_pos = doc.count_by(spacy.attrs.POS)
    for k,v in sorted(num_pos.items()):
        dictionary[doc.vocab[k].text] = v
    num_list.append(dictionary)

# Apply function to each doc object in DataFrame
pos_analysis_df['C_POS'] = pos_analysis_df['Doc'].apply(get_pos_tags)

@anisa-hawes
Contributor Author

anisa-hawes commented Nov 1, 2023

Thank you for these clarifications, @mkane968.

  • Removed line 210: Display the first five rows to check that the data is as expected. Four columns should be present: the paper IDs, their titles, their discipline, and their type (genre).
  • Removed line 280: Upon this command, spaCy prints a list of each word in the sentence along with their corresponding part-of-speech tags, for example:
  • Reinstated num_list = [] (now) line 622
  • Updated link, line 742 (now 738) (strangely, perma.cc can't handle that one either so I have put in the original link)
  • Adjusted line 744: However, in English papers, an average of 130 third-person singular present tense verbs are used per paper, compared to around 40 of the other two categories.
  • Replaced the phrase, "CARDINAL (numbers)" with WORKS_OF_ART at line 757 (now 755)
  • Updated code block that following on lines 765-772 (now 758-770)
  • Replaced CARDINAL with WORKS_OF_ART at line 775 (now 773)

Integrate Megan's edits.
Replace perma.cc link with live link. (Perma.cc cannot archive that URL).
@hawc2
Contributor

hawc2 commented Nov 1, 2023

This looks great to me. Thank you @mkane968 and @anisa-hawes for your careful attention to details and meticulous corrections/improvements to this lesson. It's ready for publication!

hawc2
hawc2 previously approved these changes Nov 1, 2023
- Adjust capitalisation of 'spaCy' in the lesson title
- Update `date:`
@anisa-hawes
Contributor Author

Hello @hawc2.

Sorry to trouble you for a re-review. I made one tiny change, which was to adjust the capitalisation of 'spaCy' in the title so that it's consistent with the lesson.

This is aligned with how we've titled Installing Python Modules with pip, for example.

@anisa-hawes anisa-hawes requested a review from hawc2 November 2, 2023 12:51
@anisa-hawes anisa-hawes merged commit 98aec01 into gh-pages Nov 2, 2023
5 checks passed
@anisa-hawes anisa-hawes deleted the Issue-3052 branch November 2, 2023 14:02