Issue 3052 #3053

anisa-hawes · 2023-10-19T17:28:43Z

Preparing files for publication on behalf of AWC.

Checklist

- Assign yourself in the "Assignees" menu
- Add the appropriate "Label"
- If this PR closes an Issue, add the phrase Closes #ISSUENUMBER to your summary above
- Ensure the status checks pass: if you have difficulty fixing build errors, please contact our Publishing Assistant @anisa-hawes
- Check the Netlify Preview: navigate to netlify/ph-preview/deploy-preview and click 'details' (at right)
- Assign at least one individual or team to "Reviewers"
- [ ] if the text needs to be translated, please follow the translation request guidelines, then assign the relevant language team(s) as "Reviewers" and tag both the team as well as the managing editor in your PR.

Add bio for Megan S. Kane

Create corpus-analysis-with-spacy.md

Upload additional assets

Upload images directory

Upload original avatar

Upload gallery avatar

- Update links to the lesson's `.ipynb` (to be rendered with nbviewer)
- Slightly adjust wording at lines 80, 82 and 84.

Correct link, line 47

Delete image to remove transparent background.

(without transparent background)

- Correct typing errors, lines 72 and 194
- Adjust formatting of Research Questions to remove headers

Delete image to remove transparent background.

(without transparent background)

Delete to replace with cropped image.

(cropped image)

hawc2 · 2023-10-24T01:08:35Z

@anisa-hawes is this ready for/waiting on my review?

anisa-hawes · 2023-10-25T10:58:38Z

Hello @hawc2,

Yes. Please do read through, and let me know if you spot anything that needs adjustment.

As explained via Slack, I've gone through the process of re-cropping some of the images which appeared to be surrounded by 'transparent' background space.

I remain puzzled by the fact that several of the Figures are aligned to the left margin while all the others are centred (for example, Figures 9., 10., 25., 26., most glaringly – although these do not feature that extra 'transparent' space).

I also have some concerns about the accessibility of the figures in general.

This is a broader question to tackle across all journals, but I think we should be aiming to avoid any screenshots of tabular data. Rather, we should be replacing them with data tables formatted in Markdown (the ones I'd suggest replacing are, for example, Figures 3., 4., 5., 7.). I also think we could significantly improve the accessibility of this lesson by providing the excerpts from spaCy's outputs written as code (for example, Figure 9., 10., 13., 14., 16., 17.). I think something like this could work:

A bounded box which contains a descriptor followed by the excerpt:
output as code 
across several lines
as necessary

This doesn't display as it would on our website, where I think it would display as a grey-shaded box (our 'notes' boxes) with a narrow-line frame around it.

Depending on what you think of these suggestions, we can agree a solution for making these adjustments.

Do you/Megan have the raw output excerpts at hand? If so, you could share them and Charlotte and I would be happy to implement the changes and make the necessary adjustments to the figure number sequence.

Transforming the tabular data might be a bit more cumbersome, but I think the accessibility benefits would be significant. Charlotte and I can help with this, or take the task on if you/Megan don't have the capacity.
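As a rough illustration of the table conversion discussed above, here is a minimal sketch that renders rows of data as a pipe-delimited Markdown table using only the Python standard library. The function name `to_markdown_table` and the sample rows are assumptions made for illustration; this is not the workflow actually used for the lesson.

```python
def to_markdown_table(headers, rows):
    """Render a header list and a list of rows as a pipe-delimited Markdown table."""
    def fmt(cells):
        # Escape pipes so that cell text cannot break the column layout
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(headers), "| " + " | ".join("---" for _ in headers) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


# Illustrative rows only; real values would come from the lesson's DataFrames
headers = ["Filename", "DISCIPLINE"]
rows = [["BIO.G0.15.1", "Biology"], ["BIO.G1.04.1", "Biology"]]
table = to_markdown_table(headers, rows)
print(table)
```

A hand-rolled helper like this avoids extra dependencies, at the cost of doing no column alignment or truncation of long cell text.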

hawc2 · 2023-10-25T12:22:47Z

@mkane968 can you provide the original spreadsheet data for screenshot spreadsheets?

mkane968 · 2023-10-26T22:59:24Z

Hi @hawc2 and @anisa-hawes,

Here is the tabular data for the specified images:

Figure 2:

	0
BIO.G0.01.1.txt	b"Introduction\xe2\x80\xa6\xe2\x80\xa6\xe2\x80...
BIO.G0.02.1.txt	b' Ernst Mayr once wrote, sympatric speci...
BIO.G0.02.2.txt	b" Do ecological constraints favour certa...
BIO.G0.02.3.txt	b" Perhaps one of the most intriguing va...
BIO.G0.02.4.txt	b" The causal link between chromosomal re...

Figure 3:

	Filename	Text
0	BIO.G0.01.1.txt	Introduction……………………………………………………..1 Brief Hist...
1	BIO.G0.02.1.txt	Ernst Mayr once wrote, sympatric speciation is...
2	BIO.G0.02.2.txt	Do ecological constraints favour certain perce...
3	BIO.G0.02.3.txt	Perhaps one of the most intriguing varieties o...
4	BIO.G0.02.4.txt	The causal link between chromosomal rearrangem...

Figure 4:

	PAPER ID	TITLE	DISCIPLINE	PAPER TYPE
0	BIO.G0.15.1	Invading the Territory of Invasives: The Dange...	Biology	Argumentative Essay
1	BIO.G1.04.1	The Evolution of Terrestriality: A Look at the...	Biology	Argumentative Essay
2	BIO.G3.03.1	Intracellular Electric Field Sensing using Nan...	Biology	Argumentative Essay
3	BIO.G0.11.1	Exploring the Molecular Responses of Arabidops...	Biology	Proposal
4	BIO.G1.01.1	V. Cholerae: First Steps towards a Spatially E...	Biology	Proposal

Figure 5:

	Filename	TITLE	DISCIPLINE	PAPER TYPE	Text
0	BIO.G0.15.1	Invading the Territory of Invasives: The Dange...	Biology	Argumentative Essay	New York City, 1908: different colors of skin ...
1	BIO.G1.04.1	The Evolution of Terrestriality: A Look at the...	Biology	Argumentative Essay	The fish-tetrapod transition has been called t...
2	BIO.G3.03.1	Intracellular Electric Field Sensing using Nan...	Biology	Argumentative Essay	Intracellular electric fields are of great int...
3	BIO.G0.11.1	Exploring the Molecular Responses of Arabidops...	Biology	Proposal	Environmental stresses to plants have been stu...
4	BIO.G1.01.1	V. Cholerae: First Steps towards a Spatially E...	Biology	Proposal	The recurrent cholera pandemics have been rela...

Figure 7:

	Text	Tokens
0	New York City, 1908: different colors of skin ...	[New, York, City, ,, 1908, :, different, color...
1	The fish-tetrapod transition has been called t...	[The, fish, -, tetrapod, transition, has, been...
2	Intracellular electric fields are of great int...	[Intracellular, electric, fields, are, of, gre...
3	Environmental stresses to plants have been stu...	[Environmental, stresses, to, plants, have, be...
4	The recurrent cholera pandemics have been rela...	[The, recurrent, cholera, pandemics, have, bee...

Figure 18:

	DISCIPLINE	ADJ	ADP	ADV	AUX	CCONJ	DET	INTJ	NOUN	NUM	PART	PRON	PROPN	PUNCT	SCONJ	VERB	SYM	X
0	Biology	180	174	62	106	42	137	1	342	29	29	41	101	196	16	139	NaN	NaN
1	Biology	421	458	174	253	187	389	1	868	193	78	121	379	786	99	389	1.0	2.0
2	Biology	163	171	63	91	51	148	1	362	6	31	23	44	134	15	114	4.0	1.0
3	Biology	318	402	120	267	121	317	1	908	101	93	128	151	487	92	387	4.0	NaN
4	Biology	294	388	97	142	97	299	1	734	89	41	36	246	465	36	233	1.0	7.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
160	English	943	1164	365	512	395	954	3	2287	98	315	530	406	1275	221	1122	15.0	8.0
161	English	672	833	219	175	202	650	1	1242	30	168	291	504	595	75	570	NaN	3.0
162	English	487	715	175	240	324	500	2	1474	55	157	334	226	820	147	691	7.0	5.0
163	English	68	94	23	34	26	79	3	144	2	25	36	54	80	22	69	1.0	2.0
164	English	53	86	27	28	19	90	1	148	6	15	37	43	80	15	67	NaN	NaN

Figure 19:

	DISCIPLINE	ADJ	ADP	ADV	AUX	CCONJ	DET	INTJ	NOUN	NUM	PART	PRON	PROPN	PUNCT	SCONJ	VERB	SYM	X
0	Biology	237.0	299.0	93.0	141.0	89.0	234.0	1.0	614.0	81.0	44.0	74.0	194.0	343.0	50.0	237.0	8.0	6.0
1	English	211.0	364.0	127.0	141.0	108.0	283.0	2.0	578.0	34.0	99.0	223.0	189.0	367.0	70.0	306.0	7.0	5.0

Figure 21:

	DISCIPLINE	POS	RB	JJR	NNS	IN	VBG	RBR	RBS	-RRB-	...	FW	LS	WP$	NFP	AFX	$	``	XX	ADD	''
0	Biology	5.0	94.0	10.0	198.0	339.0	35.0	6.0	4.0	38.0	...	2.0	3.0	1.0	16.0	3.0	6.0	2.0	5.0	3.0	2.0
1	English	35.0	138.0	7.0	141.0	414.0	50.0	6.0	3.0	25.0	...	2.0	2.0	2.0	3.0	NaN	1.0	3.0	5.0	3.0	5.0

Figure 23:

	Genre	PERSON_Counts	LOC_Counts	DATE_Counts	WORK_OF_ART_Counts
0	Argumentative Essay	9	3	20	3
1	Argumentative Essay	90	13	151	6
2	Argumentative Essay	0	0	2	2
3	Proposal	11	6	21	4
4	Proposal	44	7	65	3

And here is the raw output for the following figures:

Figure 6:

This PRON
is AUX
' PUNCT
an DET
' PUNCT
example NOUN
? PUNCT
sentence NOUN

Figure 8:

"Write" appears in the text tokens column 40 times.
"Write" appears in the lemmas column 310 times.

Figure 9:

[[('PROPN', 'NNP'),
  ('PROPN', 'NNP'),
  ('PROPN', 'NNP'),
  ('PUNCT', ','),
  ('NUM', 'CD'),
  ('PUNCT', ':'),
  ('ADJ', 'JJ'),
  ('NOUN', 'NNS'),
  ('ADP', 'IN'),
  ('NOUN', 'NN'),
  ('NOUN', 'NN'),
  ('ADP', 'IN'),
  ('DET', 'DT'),
  ...]]

Figure 10:

[['New',
  'York',
  'City',
  'Earth',
  'Mooney',
  'Cleland',
  'Mack',
  'Dreissena',
  'Facon',
  'Mack',
  'Vredenburg',
  'Polynesia',
  'Euglandina',
  'Achatina',
  'fulica',
  'Hawaii',
  "O'Foighil",
  'Coote',
  'Loeve',
  'Hawaii',
  ...]]

Figure 13:

['New York City',
 'different colors',
 'skin swirl',
 'the great melting pot',
 'a cultural medley',
 'such a metropolis',
 'every last crevice',
 'Earth',
 'time',
 'people',
 'an unprecedented uniformity',
 'discrete identities',
 'Our heritages',
 'the history texts',
   ...]]

Figure 14:

CARDINAL : Numerals that do not fall under another type
DATE : Absolute or relative dates or periods
EVENT : Named hurricanes, battles, wars, sports events, etc.
FAC : Buildings, airports, highways, bridges, etc.
GPE : Countries, cities, states
LANGUAGE : Any named language
LAW : Named documents made into laws.
LOC : Non-GPE locations, mountain ranges, bodies of water
MONEY : Monetary values, including unit
NORP : Nationalities or religious or political groups
ORDINAL : "first", "second", etc.
ORG : Companies, agencies, institutions, etc.
PERCENT : Percentage, including "%"
PERSON : People, including fictional
PRODUCT : Objects, vehicles, foods, etc. (not services)
QUANTITY : Measurements, as of weight or distance
TIME : Times smaller than a day
WORK_OF_ART : Titles of books, songs, etc.

Figure 16:

{95: 1, 87: 1, 97: 3, 90: 1, 92: 2}

Figure 17:

{'AUX': 1, 'DET': 1, 'NOUN': 2, 'PRON': 1, 'PUNCT': 3}

Figure 25:

2004,      24
2003,      18
the        17
2002,      12
2005,      11
1998,      11
2000,       9
year,       9
1977,       8
season,     8

Figure 26:

the         10
winter,      8
years,       6
2009         5
1950,        5
1960,        5
century,     4
decade,      3
of           3
decades,     3
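For readers wondering how frequency lists like those in Figures 25 and 26 might arise, the sketch below tallies whitespace-separated words with `collections.Counter`. The input strings are hypothetical stand-ins, and this is only one plausible way to produce such counts, not necessarily the lesson's own code.

```python
from collections import Counter

# Hypothetical stand-in for entity text that was labelled DATE in a few papers
date_entity_strings = [
    "2004, 2003, the year, 2004,",
    "the winter, 2004, season,",
]

# Split on whitespace and tally each word. Punctuation stays attached to the
# word, which would be consistent with entries such as "2004," in the output.
word_counts = Counter(
    word for text in date_entity_strings for word in text.split()
)

# Print the three most frequent words in two aligned columns
for word, count in word_counts.most_common(3):
    print(f"{word:<10}{count}")
```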

Is this all you need? If there is a different/better format for you to revise the figures, happy to provide that instead.

Additionally, when re-running the code, I realized that a cell needs to be added at the top of the Part of Speech Analysis section to create a new dataframe to use for the section:

# Create new DataFrame for analysis purposes
pos_analysis_df = final_paper_df[['Filename','DISCIPLINE', 'Doc']]

A couple of the code blocks after this have to be tweaked to reflect the use of this new dataframe rather than the final_paper_df; if it's not changed, the code will break at the start of the named entity analysis section.

Here is the Colab notebook with the revised change:
https://colab.research.google.com/drive/1vzAdK3o3JfPUUZT7Jpp_a6aB3U0UPlV9?usp=sharing

Can I still edit the markdown file to make this change? Sorry, I'm not sure how this slipped through earlier!

Thanks,

Megan

hawc2 · 2023-10-27T01:17:25Z

@mkane968 thanks so much, this is just what we needed I think.

@anisa-hawes let me know if I can help with anything else in preparing the lesson for publication; your plan sounds good to me. Slack me if I can help debug the images that are cropping oddly.

- Replace figures 2, 3, 4, 5, 7, 18, 19, 21, 23 with tabular data
- Adjust/add text to introduce each table and explain what it provides
- Replace figures 6, 8, 9, 10, 13, 14, 16, 17, 25, 26 with raw output
- Adjust/add text to introduce each output and explain what it provides
- + some small typographical corrections

Add `table-wrapper` to make wide tables scrollable width-ways.

Add hard returns to follow `<div class="table-wrapper" markdown="block">`

Deleting directory (to replace with updated figure set)

Upload updated figure set.

- Update image filenames
- Update figure numbers

Small adjustments to tables

To be replaced with updated notebook.

anisa-hawes · 2023-10-28T07:24:55Z

Hello @mkane968,

Many thanks for providing the data tables formatted in Markdown + the excerpts from spaCy's outputs written as code. I think these adjustments will significantly improve the accessibility and readability of this lesson, and I really appreciate your collaboration.

I've replaced figures 2, 3, 4, 5, 7, 18, 19, 21, 23 with tabular data (adjusting or adding text to introduce each table as necessary)
I've replaced figures 6, 8, 9, 10, 13, 14, 16, 17, 25, 26 with raw output (again, adjusting or adding text to introduce each excerpt as necessary)

You can review the Netlify Preview to see the changes as staged. I've renumbered the remaining figures (and their filenames), and adjusted the captions accordingly.

As noted above, I've also made some small adjustments to the text, so that the tables and output are introduced and make sense within the lesson. You can review the rich-diff of the changes I made d4bbbe6, and let me know if anything is incorrect or not as you want it.

I've replaced the .ipynb with your updated version. Thank you for re-running the code as a final check, and for noting this error.
I'm unclear about where the Markdown file needs to be adjusted, as I do not find the lines you quote within it. Can you tell me exactly which lines of the Markdown file need updating?

# Create new DataFrame for analysis purposes
pos_analysis_df = final_paper_df[['Filename','DISCIPLINE', 'Doc']]

Additionally, I'd like to raise a couple of queries which I noted in the course of making these adjustments:

Line 332, "We'll discuss how and when to transform the lists to strings to conduct frequency counts below.", is unclear to me. Can you let me know which specific section or sub-section you are referring to so that we can provide a direct link to it?
Line 454, "The third text shown here, for example, involves astronomy concepts; this is likely to have been written for a biology course. In contrast, texts 163 and 164 appear to be analyses of Shakespeare plays and movie adaptations.", is confusing. I agree that the text output is likely to be from a Biology paper, but there are no words related to astronomy. I'm also unclear which texts 163 and 164 refer to. These do not appear to be mentioned anywhere else in the lesson.
In your comment above, you provided the raw spaCy output to replace Figure 10, beginning with the words:

[['New',
'York',
'City',
'Earth',

however I noticed that Figure 10 (as was) included a different extract (I've included a screenshot below) so I typed this output myself. This means your hypothesis (line 454) that the text output is likely to be from a Biology paper remains true (although, as I explain above, I think that sentence needs adjustment). Let me know how you want to handle this/if you want us to replace it with the New York list?

Line 741, "As evidenced by the above DataFrame, spaCy identifies around 50 fine-grained part-of-speech tags.", is unclear to me. Reviewing the tabular data provided above (to replace Figure 21), I cannot ascertain how you reach the total of ~50.

mkane968 · 2023-10-30T14:16:10Z

Hi @anisa-hawes,

The changes look good! Just one minor edit:

Line 860: replace "words" with "standard 4-digit dates" so the sentence reads, "Here, only three of the most-frequently tagged DATE entities are standard 4-digit dates, and the rest are noun references to relative dates or periods."

Revisions related to the part-of-speech dataframe:

Revise line 627 to include the sentence, "First, we'll create a new DataFrame for the purposes of part-of-speech analysis, containing the text filenames, disciplines, and Doc objects." Replace "the" with "a" in the first sentence in the paragraph and include the word "new" in the following sentence, after DataFrame. The paragraph should read in full (additions in bold): To get the same type of dictionary for each text in a DataFrame, a function can be created to nest the above for loop. First, we'll create a new DataFrame for the purposes of part-of-speech analysis, containing the text filenames, disciplines, and Doc objects. We can then apply the function to each Doc object in the new DataFrame. In this case (and above), we are interested in the simpler, coarse-grained parts of speech.
Add the new line of code to the start of the code block in line 632 and make revisions in block to work with new dataframe:

pos_analysis_df = final_paper_df[['Filename','DISCIPLINE', 'Doc']]

def get_pos_tags(doc):
    dictionary = {}
    num_pos = doc.count_by(spacy.attrs.POS)
    for k,v in sorted(num_pos.items()):
        dictionary[doc.vocab[k].text] = v
    num_list.append(dictionary)
    
pos_analysis_df['C_POS'] = pos_analysis_df['Doc'].apply(get_pos_tags)

Make adjustments to code block starting at 644 to work with new dataframe:

pos_counts = pd.DataFrame(num_list)
columns = list(pos_counts.columns)
idx = 0
new_col = pos_analysis_df['DISCIPLINE']
pos_counts.insert(loc=idx, column='DISCIPLINE', value=new_col)
pos_counts.head()

Make adjustments to code block beginning in line 703 to work with new dataframe:

tag_num_list = []
def get_fine_pos_tags(doc):
    dictionary = {}
    num_tag = doc.count_by(spacy.attrs.TAG)
    for k,v in sorted(num_tag.items()):
        dictionary[doc.vocab[k].text] = v
    tag_num_list.append(dictionary)
    
pos_analysis_df['F_POS'] = pos_analysis_df['Doc'].apply(get_fine_pos_tags)
average_tag_df

Line 752 should be revised to read, "To start, we'll create a new DataFrame with the text filenames, types (genres), and named entity words and tags:" instead of "...filenames, disciplines, and part of speech tags."

In response to your other notes:

I think line 332 can be deleted; it was referring to a part of the tutorial that has been condensed/revised. Part-of-speech tags and named entities are counted in the analysis section, but list-to-string conversion is not emphasized, so I don't think it's necessary to mention.
Regarding Line 454, the original output in the Colab notebook was scrollable, and my discussion was referring to the proper nouns generated in several different texts. This section could be revised to list and compare the proper nouns in two different texts, with the following text and code to replace lines 417 to 454:

Listing the nouns in each text can help us ascertain the texts' subjects. Let's list the nouns in two different texts, the text located in row 3 of the DataFrame and the text located in row 163.

list(final_paper_df.loc[[3, 163], 'Proper_Nouns'])

The first text in the list includes botany and astronomy concepts; this is likely to have been written for a biology course.

[['Mars',
  'Arabidopsis',
  'Arabidopsis',
  'LEA',
  'COR',
  'LEA',
  'NASA',
    ...]]

In contrast, the second text appears to be an analysis of Shakespeare plays and movie adaptations, likely written for an English course.

['Shakespeare',
  'Bard',
  'Julie',
  'Taymor',
  'Titus',
  'Shakespeare',
  'Titus',
    ...]]

Along with assisting content analyses, extracting nouns has been shown to help build more efficient topic models[^9].

The above revision involves replacing the Figure 10 output with the above referenced outputs, so that difference would also be resolved. In the Colab notebook, the line of code that reads list(final_paper_df['Proper_Nouns']) should be replaced with list(final_paper_df.loc[[3, 163], 'Proper_Nouns']). Here is an updated version of the notebook: https://colab.research.google.com/drive/18z5x1X3nzFQ7j0PzzGdcpITPcPWTdj6r?usp=sharing
The fine-grained part-of-speech table in line 741 only shows about 20 POS tags, but has an ellipsis in the middle column to indicate more data not shown. This was how the output was generated in Colab. A note could be added here about the format of the table (i.e., since the table has 50+ columns, they are not all displayed), pointing to a full list of the fine-grained part-of-speech tags spaCy generates here: https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/
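To make the row 3 / row 163 comparison proposed above easier to scan, the following sketch collapses repeated proper nouns while keeping their first-seen order. The dictionary `proper_nouns_by_row` is a hypothetical stand-in for the `Proper_Nouns` column, filled with the abbreviated excerpts quoted above, and `unique_in_order` is an illustrative helper rather than part of the lesson.

```python
# Hypothetical stand-in for the 'Proper_Nouns' column: row index -> proper nouns
# (values abbreviated from the excerpts quoted above)
proper_nouns_by_row = {
    3: ["Mars", "Arabidopsis", "Arabidopsis", "LEA", "COR", "LEA", "NASA"],
    163: ["Shakespeare", "Bard", "Julie", "Taymor", "Titus", "Shakespeare", "Titus"],
}


def unique_in_order(items):
    """Drop repeats while keeping first-seen order."""
    return list(dict.fromkeys(items))


# Show the distinct proper nouns for each of the two selected rows
for row in (3, 163):
    print(row, unique_in_order(proper_nouns_by_row[row]))
```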

Thanks!

Megan

Integrate Megan's edits.

Delete notebook asset to replace with updated version.

anisa-hawes · 2023-11-01T13:47:55Z

Thank you, @mkane968.

I've worked through the edits you list above. Please review these amendments to the file here: 0a145ec and let me know that you are happy.
One query that came up for me was whether it is correct to remove the initial line `num_list = []` (previously line 631) when replacing it with this block:

pos_analysis_df = final_paper_df[['Filename','DISCIPLINE', 'Doc']]

def get_pos_tags(doc):
    dictionary = {}
    num_pos = doc.count_by(spacy.attrs.POS)
    for k,v in sorted(num_pos.items()):
        dictionary[doc.vocab[k].text] = v
    num_list.append(dictionary)
    
pos_analysis_df['C_POS'] = pos_analysis_df['Doc'].apply(get_pos_tags)

I've amended the paragraph including the line spaCy identifies around 50 fine-grained part-of-speech tags (now line 740) to add the note and link as suggested. I've tried to perma.cc the URL https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/#Fine_Grained_POS_Tag_list but multiple Google adverts and YouTube pop-ups on that page make it near-impossible to read, and cause the archived link to fail after a few moments....
Is there an alternative fine-grained part-of-speech tag list available online which you could suggest?
I've removed the existing .ipynb and replaced it with your updated version 64775c1.

mkane968 · 2023-11-01T15:01:03Z

Thanks @anisa-hawes! The changes look good, just a few notes from a final read-through:

Line 210 essentially restates Line 212, so it should be removed.
Same with Line 280, can be removed because it reiterates Line 273.
In the second sentence in the paragraph starting in line 744, replace the phrase "third-person singular tense part-of-speech verbs" with "third-person singular present tense verbs". The full sentence should read, "However, in English papers, an average of 130 third-person singular present tense verbs are used per paper, compared to around 40 of the other two categories."
In the second sentence in the paragraph starting in line 757, replace the phrase "CARDINAL (numbers)" with WORKS_OF_ART. In the following code block, line 765 should read woa_counts = ner_analysis_df['Named_Entities'].str.count('WORK_OF_ART') instead of cardinal_counts = ner_analysis_df['Named_Entities'].str.count('CARDINAL'), and the very last line (line 772) should read ner_counts_df['WORK_OF_ART_Counts'] = woa_counts instead of ner_counts_df['CARDINAL_Counts'] = cardinal_counts.
Finally, replace CARDINAL with WORKS_OF_ART in line 775 so it reads "Reviewing the DataFrame now, our column headings define each paper's genre and four named entities (PERSON, ORG, DATE, and WORKS_OF_ART) of which spaCy will count usage." The table, bar chart, and following analysis feature works of art as the 4th named entity instead of cardinal numbers, as does the Colab script.
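As a small illustration of the counting step referred to in these edits, the sketch below tallies a label in plain strings with Python's built-in `str.count`. It assumes, hypothetically, that each paper's entity labels are stored as one space-separated string; the lesson itself applies the pandas `.str.count` accessor to a DataFrame column, so this is an analogy, not the same call. The label spelling `WORK_OF_ART` follows the entity list quoted earlier in this thread.

```python
# Hypothetical stand-in for the 'Named_Entities' column: one string of labels per paper
named_entities = [
    "PERSON DATE WORK_OF_ART DATE",
    "DATE WORK_OF_ART WORK_OF_ART",
    "PERSON",
]

# str.count tallies non-overlapping occurrences of the label in each string
woa_counts = [labels.count("WORK_OF_ART") for labels in named_entities]
cardinal_counts = [labels.count("CARDINAL") for labels in named_entities]

print(woa_counts)
```

Because the match is on the exact substring, a misspelt label would silently produce zero counts rather than an error.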

In response to your other questions:

The line num_list = [] should NOT be removed from the code block; it creates the list in which the part-of-speech tags are stored, which are then added to the DataFrame with the following function. So it's needed regardless of whether we use the original DataFrame or the new part-of-speech DataFrame. Sorry if that was unclear in my earlier comments! The correct block should read in full:

# Create new DataFrame for analysis purposes
pos_analysis_df = final_paper_df[['Filename','DISCIPLINE', 'Doc']]

# Create list to store each dictionary
num_list = []

# Define a function to get part of speech tags and counts and append them to a new dictionary
def get_pos_tags(doc):
    dictionary = {}
    num_pos = doc.count_by(spacy.attrs.POS)
    for k,v in sorted(num_pos.items()):
        dictionary[doc.vocab[k].text] = v
    num_list.append(dictionary)

# Apply function to each doc object in DataFrame
pos_analysis_df['C_POS'] = pos_analysis_df['Doc'].apply(get_pos_tags)

Here's another list of the fine-grained POS tags, from spaCy's Github glossary: https://github.com/explosion/spaCy/blob/master/spacy/glossary.py
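The role of `num_list` in the block above can be seen in a dependency-free sketch. Here each "doc" is a hypothetical list of coarse part-of-speech labels, and `Counter` stands in for `doc.count_by(spacy.attrs.POS)` plus the label lookup; none of this is spaCy's API. Note that the function in the block above returns nothing, so the values assigned to the `C_POS` column would be `None` and the useful data lives in `num_list`. Returning the dictionary as well, as below, is a variation and not the lesson's code.

```python
from collections import Counter

# Hypothetical stand-ins: each "doc" is simply a list of coarse POS labels
docs = [
    ["PRON", "AUX", "PUNCT", "DET", "PUNCT", "NOUN", "PUNCT", "NOUN"],
    ["NOUN", "VERB", "PUNCT"],
]

# Create the list once, before the function is applied to any document
num_list = []


def get_pos_tags(doc):
    # Counter stands in for doc.count_by(spacy.attrs.POS) and the label lookup
    dictionary = dict(sorted(Counter(doc).items()))
    num_list.append(dictionary)
    # Returning the dictionary too means the caller also receives a value
    return dictionary


results = [get_pos_tags(doc) for doc in docs]
print(num_list[0])
```

If `num_list = []` were missing, the first call would raise a `NameError`, which is why the line needs to stay ahead of the function.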

anisa-hawes · 2023-11-01T16:11:25Z

Thank you for these clarifications, @mkane968.

Removed line 210: "Display the first five rows to check that the data is as expected. Four columns should be present: the paper IDs, their titles, their discipline, and their type (genre)."
Removed line 280: "Upon this command, spaCy prints a list of each word in the sentence along with their corresponding part-of-speech tags, for example:"
Reinstated num_list = [] (now) line 622
Updated link, line 742 (now 738) (strangely, perma.cc can't handle that one either so I have put in the original link)
Adjusted line 744: However, in English papers, an average of 130 third-person singular present tense verbs are used per paper, compared to around 40 of the other two categories.
Replaced the phrase, "CARDINAL (numbers)" with WORKS_OF_ART at line 757 (now 755)
Updated the code block that follows on lines 765-772 (now 758-770)
Replaced CARDINAL with WORKS_OF_ART at line 775 (now 773)

Integrate Megan's edits.

Replace perma.cc link with live link. (Perma.cc cannot archive that URL).

hawc2 · 2023-11-01T23:03:40Z

This looks great to me. Thank you @mkane968 and @anisa-hawes for your careful attention to details and meticulous corrections/improvements to this lesson. It's ready for publication!

- Adjust capitalisation of 'spaCy' in the lesson title
- Update `date:`

anisa-hawes · 2023-11-02T12:51:17Z

Hello @hawc2.

Sorry to trouble you for a re-review. I made one tiny change, which was to adjust the capitalisation of 'spaCy' in the title so that it's consistent with the lesson.

This is aligned with how we've titled Installing Python Modules with pip, for example.

anisa-hawes added 7 commits October 19, 2023 18:00

Update ph_authors.yml

2a869a1

Add bio for Megan S. Kane

Create corpus-analysis-with-spacy.md

e96a40e

Create corpus-analysis-with-spacy.md

Upload additional assets to /corpus-analysis-with-spacy

e2f2a07

Upload additional assets

Upload /images/corpus-analysis-with-spacy

6b61521

Upload images directory

Upload corpus-analysis-with-spacy-original.png

870a558

Upload original avatar

Upload corpus-analysis-with-spacy.png

95a1740

Upload gallery avatar

Update corpus-analysis-with-spacy.md

9499f4b

- Update links to the lesson's `.ipynb` (to be rendered with nbviewer)
- Slightly adjust wording at lines 80, 82 and 84.

anisa-hawes added the English label Oct 19, 2023

anisa-hawes self-assigned this Oct 19, 2023

anisa-hawes and others added 9 commits October 19, 2023 18:43

Update corpus-analysis-with-spacy.md

59fe695

Correct link, line 47

Delete or-en-corpus-analysis-with-spacy-04.png

d63c37b

Delete image to remove transparent background.

Upload or-en-corpus-analysis-with-spacy-04.png

bf918fd

(without transparent background)

Update corpus-analysis-with-spacy.md

9039656

- Correct typing errors, lines 72 and 194
- Adjust formatting of Research Questions to remove headers

Delete or-en-corpus-analysis-with-spacy-01.png

eea2edf

Delete image to remove transparent background.

Upload or-en-corpus-analysis-with-spacy-01.png

2e0ad3a

(without transparent background)

Delete or-en-corpus-analysis-with-spacy-10.png

12f95db

Delete to replace with cropped image.

Upload or-en-corpus-analysis-with-spacy-10.png

93e91bf

(cropped image)

Merge branch 'gh-pages' into Issue-3052

7bdb730

anisa-hawes and others added 7 commits October 27, 2023 11:37

Update corpus-analysis-with-spacy.md

473bfdf

Add `table-wrapper` to make wide tables scrollable width-ways.

Update corpus-analysis-with-spacy.md

e17056f

Add hard returns to follow `<div class="table-wrapper" markdown="block">`

Merge branch 'gh-pages' into Issue-3052

31b563f

Merge branch 'Issue-3052' of github.com:programminghistorian/jekyll i…

0f05644

…nto Issue-3052

Delete images/corpus-analysis-with-spacy directory

ec83411

Deleting directory (to replace with updated figure set)

Upload /images/corpus-analysis-with-spacy

7b58a90

Upload updated figure set.

anisa-hawes added 4 commits October 27, 2023 23:22

Update corpus-analysis-with-spacy.md

2784caf

- Update image filenames
- Update figure numbers

Update corpus-analysis-with-spacy.md

68ea6cf

Small adjustments to tables

Delete corpus-analysis-with-spacy.ipynb

f04a516

To be replaced with updated notebook.

Upload corpus-analysis-with-spacy.ipynb (updated)

a93d157

anisa-hawes added 3 commits November 1, 2023 12:10

Update corpus-analysis-with-spacy.md

0a145ec

Integrate Megan's edits.

Delete corpus-analysis-with-spacy.ipynb

d05ed20

Delete notebook asset to replace with updated version.

Upload updated Python notebook.

64775c1

anisa-hawes added 2 commits November 1, 2023 16:11

Update corpus-analysis-with-spacy.md

c8781c3

Integrate Megan's edits.

Update corpus-analysis-with-spacy.md

9febb91

Replace perma.cc link with live link. (Perma.cc cannot archive that URL).

hawc2 previously approved these changes Nov 1, 2023

View reviewed changes

Update corpus-analysis-with-spacy.md

2ba71b9

- Adjust capitalisation of 'spaCy' in the lesson title
- Update `date:`

anisa-hawes dismissed hawc2’s stale review via 2ba71b9 November 2, 2023 09:33

anisa-hawes requested a review from hawc2 November 2, 2023 12:51

hawc2 approved these changes Nov 2, 2023

View reviewed changes

anisa-hawes merged commit 98aec01 into gh-pages Nov 2, 2023
5 checks passed

anisa-hawes deleted the Issue-3052 branch November 2, 2023 14:02

anisa-hawes mentioned this pull request Nov 3, 2023

Preparing publication of new EN lesson #3052

Closed
