diff --git a/_data/ph_authors.yml b/_data/ph_authors.yml index b1504cd17..77a61d2f1 100644 --- a/_data/ph_authors.yml +++ b/_data/ph_authors.yml @@ -2995,3 +2995,10 @@ team_roles: - publishing-assistant status: institutionally-supported + +- name: Megan S. Kane + orcid: 0000-0003-1817-2751 + team: false + bio: + en: | + Megan Kane is a PhD candidate in the English Department at Temple University. diff --git a/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy-16.html b/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy-16.html new file mode 100644 index 000000000..e24bdd7bc --- /dev/null +++ b/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy-16.html @@ -0,0 +1,97 @@ + + + There + PRON + + + + are + VERB + + + + two + NUM + + + + interesting + ADJ + + + + phenomena + NOUN + + + + in + ADP + + + + this + DET + + + + research. + NOUN + + + + + + expl + + + + + + + + nummod + + + + + + + + amod + + + + + + + + attr + + + + + + + + prep + + + + + + + + det + + + + + + + + pobj + + + + \ No newline at end of file diff --git a/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy-17.html b/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy-17.html new file mode 100644 index 000000000..ef96bba98 --- /dev/null +++ b/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy-17.html @@ -0,0 +1,45 @@ + + + There + PRON + + + + interesting + ADJ + + + + phenomena + NOUN + + + + research . + NOUN + + + + + + advmod + + + + + + + + amod + + + + + + + + compound + + + + \ No newline at end of file diff --git a/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy-20.html b/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy-20.html new file mode 100644 index 000000000..0bd0ad9af --- /dev/null +++ b/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy-20.html @@ -0,0 +1,2281 @@ +
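The `corpus-analysis-with-spacy-16.html` asset added above looks like displacy dependency-parse output for the sentence "There are two interesting phenomena in this research." Such a page can be reproduced without downloading a trained pipeline by using displacy's documented *manual* mode; the token tags and arc labels below are transcribed from the asset itself, not re-derived (a sketch, assuming the standard `words`/`arcs` manual format):

```python
from spacy import displacy

# Manual reconstruction of the parse shown in the asset file:
# tokens with their coarse POS tags, plus labeled dependency arcs.
parse = {
    "words": [
        {"text": "There", "tag": "PRON"},
        {"text": "are", "tag": "VERB"},
        {"text": "two", "tag": "NUM"},
        {"text": "interesting", "tag": "ADJ"},
        {"text": "phenomena", "tag": "NOUN"},
        {"text": "in", "tag": "ADP"},
        {"text": "this", "tag": "DET"},
        {"text": "research.", "tag": "NOUN"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "expl", "dir": "left"},
        {"start": 2, "end": 4, "label": "nummod", "dir": "left"},
        {"start": 3, "end": 4, "label": "amod", "dir": "left"},
        {"start": 1, "end": 4, "label": "attr", "dir": "right"},
        {"start": 1, "end": 5, "label": "prep", "dir": "right"},
        {"start": 6, "end": 7, "label": "det", "dir": "left"},
        {"start": 5, "end": 7, "label": "pobj", "dir": "right"},
    ],
}

# manual=True skips the pipeline and renders the dict directly as SVG markup
html = displacy.render(parse, style="dep", manual=True)
```

With a loaded pipeline, the same markup would normally come from `displacy.render(nlp(sentence), style="dep")`; manual mode is useful here only for inspecting the asset's structure.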
The fish-tetrapod transition has been called the greatest step in vertebrate history (Long and + + Gordon + GPE + +, + + 2004 + DATE + +) and even one of the most significant events in the history of life ( + + Carroll + ORG + +, + + 2001 + DATE + +). Indeed, the morphological, physiological, and behavioral changes necessary for such a transformation in lifestyle to occur are astounding. The sum of these modifications occurring during + + the Devonian and Carboniferous + EVENT + + led to the eventual filling of the terrestrial realm with vertebrate life, forever altering the structure and ecology of terrestrial communities. Long and + + Gordon + PERSON + + ( + + 2004 + DATE + +) cited + + six + CARDINAL + + critical questions relating to the evolution of tetrapods. These questions aimed to ascertain which sarcopterygian fish were basal to tetrapods, how morphological changes occurred sequentially, and when, where, how, and why these changes took place. Many researchers have described the morphological changes that occurred (Clack, + + 2002b + DATE + +; + + Eaton + GPE + +, + + 1951 + DATE + +; + + Jarvik + GPE + +, + + 1955 + DATE + +; + + Long + GPE + + and + + Gordon + GPE + +, + + 2004 + DATE + +; + + Thomson + GPE + +, + + 1993 + DATE + +), and others have focused specifically on the development of limbs and digits (Clack, + + 2002b + DATE + +; + + Coates + ORG + + and + + Clack + ORG + +, + + 1990 + DATE + +; + + Coates + ORG + + et al, + + 2002 + DATE + +; + + Daeschler + ORG + + and + + Shubin + ORG + +, + + 1995 + DATE + +; + + Shubin et al, 1997 + PERSON + +; + + Shubin et al + PERSON + +, + + 2004 + DATE + +). As + + Long + PERSON + + and + + Gordon + PERSON + + ( + + 2004 + DATE + +) pointed out, the question that is the least well answered is the question of why these modifications occurred. Exactly what factors drove these changes to take place? 
Many researchers have posited theories over + + the years + DATE + + attempting to answer this question, and the aim of this paper is to assess these arguments and suggest some possible common causes that could tie many of the proposed causal factors together. However, a brief description of known data pertaining to the time and place of tetrapod origins is + + first + ORDINAL + + necessary in order to make valid statements regarding possible influential factors. The + + first + ORDINAL + + tetrapods (defined as vertebrates with paired limbs and digits) appeared during + + the Late Devonian + WORK_OF_ART + +, and it is now well-accepted that the panderichtyid fish are the sister group to tetrapods ( + + Carroll + ORG + +, + + 1995 + DATE + +; + + Long + GPE + + and + + Gordon + ORG + +, + + 2004 + DATE + +). Prior to the past couple of decades, very few + + Devonian + NORP + + tetrapod taxa were known: mainly + + Ichthyostega + GPE + + and + + Acanthostega + PERSON + +, both from the uppermost + + Famennian + NORP + +. The discovery of possible tetrapod trackways in + + Australia + GPE + + and + + Brazil + GPE + + ( + + Bray + ORG + +, + + 1985 + DATE + +; + + Warren + GPE + + and + + Wakefield + GPE + +, + + 1972 + DATE + +) stretched the potential range of tetrapods back into the + + Frasnian + NORP + +, and these speculations were supported by the discovery of + + Elginerpeton + GPE + +, the oldest known stem tetrapod, from + + the Late Frasnian + ORG + + + + about 368 million + CARDINAL + + years ago (mya) (Figure + + 1 + CARDINAL + +) ( + + Ahlberg + GPE + +, + + 1995 + DATE + +; + + Carroll + ORG + +, + + 1995 + DATE + +). Figure + + 1 + CARDINAL + +. Stratigraphic appearances and interrelationships of early tetrapods. Adapted from Long and + + Gordon + PERSON + + ( + + 2004 + DATE + +). However, it is important not to necessarily equate limbs and digits with terrestriality. 
Classically, the idea was that limbs developed to enable tetrapods to locomote on land. + + Jarvik + PERSON + + ( + + 1955 + DATE + +) originally reconstructed + + Ichthyostega + GPE + + in such a way that implied terrestriality, but a recent analysis indicates that + + Ichthyostega + GPE + + was not well-adapted for terrestrial locomotion ( + + Ahlberg et al, 2005 + PERSON + +). It is now assumed that limbs with digits evolved completely for aquatic adaptation ( + + Ahlberg + GPE + + and + + Milner + GPE + +, + + 1994 + DATE + +; + + Ahlberg et al, 2005 + GPE + +; + + Carroll + ORG + +, + + 1995 + DATE + +; + + Clack + ORG + +, + + 2002b + DATE + +; + + Coates + ORG + + and + + Clack + ORG + +, + + 1990 + DATE + +; + + Daeschler + ORG + + and + + Shubin + ORG + +, + + 1995 + DATE + +; + + Lebedev, 1997 + ORG + +). + + Romer + PERSON + + ( + + 1958 + DATE + +) even pointed this out, differentiating between the development of limbs giving the potentiality of terrestrial existence, + + and… + PERSON + +the utilization of these limbs for life on land. The earliest fully terrestrial tetrapod appears to be Pederpes from the + + Tournaisian + NORP + + about 354 to + + 344 + CARDINAL + + mya (Figure + + 1 + CARDINAL + +) (Clack, + + 2002a + DATE + +; + + Long + GPE + + and + + Gordon + ORG + +, + + 2004 + DATE + +). Given the constraints imposed by the fossil record, it appears that the evolution of terrestriality took place in tetrapods between + + the Frasnian of the Late Devonian + ORG + + and the + + Tournaisian + NORP + + of the Early Carboniferous some time + + between 368 and 344 + CARDINAL + + mya. An analysis of the environmental and ecological conditions imposed on creatures during this timeframe can help elucidate the major factors that drove terrestriality in tetrapods. Before making assertions about these environmental and ecological pressures, it is + + first + ORDINAL + + necessary to locate where tetrapods were likely evolving. 
This question of place involves + + at least two + CARDINAL + + aspects: ( + + 1 + CARDINAL + +) where geographically they were evolving, and ( + + 2 + CARDINAL + +) where ecologically they were evolving, i.e. whether they were evolving in marine or freshwater conditions. Geographically, the + + first + ORDINAL + + early tetrapod specimens collected were from + + the Old Red Sandstone of North America + ORG + + and western + + Europe + LOC + + (Clack, + + 2002b + DATE + +; + + Jarvik + GPE + +, + + 1955 + DATE + +), and the majority of Late + + Devonian + NORP + + tetrapods have been concentrated in localities on the southern coastal belt of the + + Euramerican + NORP + + plate, in what is modern-day + + Scotland + GPE + +, + + Greenland + GPE + +, eastern + + North America + LOC + +, and the + + Baltic + NORP + + states (Clack, + + 2002b + DATE + +; + + Daeschler + ORG + + and + + Shubin + ORG + +, + + 1995 + DATE + +; + + Milner + GPE + +, + + 1990 + DATE + +). Some authors hypothesized an + + East Gondwanan + GPE + + origin of tetrapods based on the + + Australian + NORP + + trackways ( + + Milner + PERSON + +, + + 1993 + DATE + +), but the discovery of + + Frasnian + ORG + +-age panderichthyids and tetrapods in + + Latvia + GPE + + and + + Russia + GPE + + offer strong support for a + + Euramerican + NORP + + origin of tetrapods ( + + Ahlberg + GPE + +, + + 1995 + DATE + +; + + Clack 2002b + ORG + +; + + Daeschler + ORG + + and + + Shubin + ORG + +, + + 1995 + DATE + +). 
However, it is clear that by the end of the + + Famennian + NORP + +, tetrapods had achieved a broad geographic distribution in equatorial regions from + + Euramerica + ORG + + all the way to + + Australia + GPE + + and even + + China + GPE + + ( + + Daeschler + ORG + +, + + 2000 + DATE + +; + + Daeschler + PERSON + +, et + + al, 1994 + DATE + +; + + Long + GPE + + and + + Gordon + GPE + +, + + 2004 + DATE + +; + + Milner + GPE + +, + + 1993 + DATE + +; + + Zhu + PERSON + +, et al, + + 2002 + DATE + +). As for marine versus freshwater considerations, it has traditionally been hypothesized that tetrapods evolved in freshwater conditions and that seasonal drying of these water bodies had driven terrestriality (Clack, + + 2002b + DATE + +; + + Gordon + PERSON + + and + + Olson + ORG + +, + + 1995 + DATE + +; + + Long + GPE + + and + + Gordon + GPE + +, + + 2004 + DATE + +; + + Milner + GPE + +, + + 1990 + DATE + +; + + Thomson + ORG + +, + + 1993 + DATE + +). Some authors have argued that certain factors in freshwater conditions that could have driven terrestriality would have exerted an even stronger influence in marine conditions. For instance, + + Packard + ORG + + ( + + 1974, 1976 + DATE + +) argued that anoxia would be even more of a problem in marine habitats than it is in freshwater habitats. However, other authors have argued that the intertidal habitats for early vertebrates proposed by + + Schultze + ORG + + ( + + 1999 + DATE + +, quoted in + + Graham + PERSON + + and + + Lee + PERSON + +, + + 2004 + DATE + +; Long and + + Gordon + GPE + +, + + 2004 + DATE + +) would not have exhibited a strong enough selective force to initiate air breathing or the invasion of land ( + + Graham + PERSON + + and + + Lee + PERSON + +, + + 2004 + DATE + +). 
Most modern amphibians are unable to live in salt water (Clack 2002b), and most amphibian fossils have been discovered in what appear to be freshwater environments (Bendix-Almgreen, et al, + + 1990 + DATE + +; + + Clack + ORG + +, + + 2002b + DATE + +; + + Daeschler + GPE + +, + + 2000 + DATE + +; Long and + + Gordon + ORG + +, + + 2004 + DATE + +). However, + + Bray + ORG + + ( + + 1985 + DATE + +) and + + Clack + PERSON + + ( + + 2002b + DATE + +) noted that it is not always easy to distinguish between fluvially-influenced and tidally-influenced sediments. Bray ( + + 1985 + DATE + +) hypothesized that tetrapods evolved in marginal marine rather than freshwater conditions, arguing that there was less of a salinity gradient between fresh and salt water in the + + Devonian + NORP + + than there is + + today + DATE + +. He noted that the density of terrestrial plants at that time was likely less than what we have + + today + DATE + +, which would allow weathering to occur at a higher rate, thus increasing the dissolved ion concentration in freshwater. Recent fossil finds have included some early tetrapods from possible tidal, lagoonal, marginal marine, and/or brackish water sediments ( + + Carroll + ORG + +, + + 2001 + DATE + +; + + Clack + GPE + +, + + 2002b + DATE + +; + + Daeschler + ORG + + and + + Shubin + ORG + +, + + 1995 + DATE + +; + + Janvier + ORG + +, + + 1996 + DATE + +; + + Long + GPE + + and + + Gordon + GPE + +, + + 2004 + DATE + +), as well as evidence that many early sarcopterygians dwelt in marine habitats (Clack, + + 2002b + DATE + +; + + Thomson + GPE + +, + + 1993 + DATE + +). 
Some authors have even suggested that the apparent widespread geographic range of early tetrapods (from + + modern-day + DATE + + + + North America + LOC + + to + + Australia + GPE + + and + + China + GPE + +) could only be a result of dispersal through epicontinental seas ( + + Carroll + ORG + +, + + 2001 + DATE + +; + + Daeschler + ORG + +, + + 2000 + DATE + +; + + Thomson + GPE + +, + + 1993 + DATE + +). While the evidence is not necessarily conclusive of either freshwater or marine origins, recent evidence seems to indicate that tetrapods likely arose in marginal marine and possibly lowland freshwater environments, and it is possible that they could have been tolerant of both marine and freshwater conditions, as are many modern vertebrate types (Clack, + + 2002b + DATE + +; + + Daeschler + ORG + + and + + Shubin + ORG + +, + + 1995 + DATE + +). So it appears that tetrapods evolved in some sort of coastal wetland environment around the margins of the + + Euramerican + NORP + + plate during + + the Late Devonian + FAC + +. An analysis of terrestrial flora, fauna, climate, and geography at this time could help elucidate some of the factors that would have favored terrestriality in these earliest tetrapods. The advent of land plants had important evolutionary consequences for terrestrial life. During the + + Silurian + NORP + +, the + + first + ORDINAL + + terrestrial plants (mainly lichens, liverworts, and moss-like plants) evolved and were able to grow in habitats near the shore ( + + Kenrick + PERSON + + and + + Crane + ORG + +, + + 1997 + DATE + +). 
True vascular plants with stomata evolved by the end of the + + Silurian + NORP + + (Clack, + + 2002b + DATE + +), and by the end of the + + Devonian + NORP + +, many other advanced characteristics had already evolved as well, including leaves, roots, + + sporangia + GPE + +, seeds, and secondary growth allowing plants to have a tree-like habit ( + + Algeo + PERSON + + and + + Scheckler + PERSON + +, + + 1998 + DATE + +; + + Edwards + GPE + +, + + 1998 + DATE + +; + + Kenrick + PERSON + + and + + Crane + ORG + +, + + 1997 + DATE + +). + + Frasnian + NORP + + floras were dominated by progymnosperms, including + + Archaeopteris + PERSON + + trees with trunks in excess of + + a meter + QUANTITY + + in diameter ( + + DiMichele and Hook + ORG + +, + + 1992 + DATE + +; + + Edwards + GPE + +, + + 1998 + DATE + +). At the + + Frasnian + NORP + +-Famennian boundary, an extinction event occurred that resulted in significant changes to the constituent flora (Clack, + + 2002b + DATE + +). As the plants recovered during the + + Famennian + NORP + +, species diversity and structural complexity of floral communities increased; multi-storied forests developed, and different plant groups evolved down distinct ecological lines ( + + Algeo + PERSON + + and + + Scheckler + PERSON + +, + + 1998 + DATE + +; + + DiMichele and Hook + ORG + +, + + 1992 + DATE + +; + + Kenrick + PERSON + + and + + Crane + ORG + +, + + 1997 + DATE + +). These developing forests generated oxygen as a photosynthetic waste product, thus increasing its abundance in the atmosphere and making the land a much more suitable place for animal life ( + + Bray + ORG + +, + + 1985 + DATE + +; + + Clack + ORG + +, + + 2002b + DATE + +). By the Late + + Silurian + NORP + + and + + Early Devonian + GPE + +, there was already a complex terrestrial ecosystem in place, which included arthropod populations. 
+ + Centipedes + ORG + +, millipedes, arachnids, mites, scorpions, and other terrestrial arthropods were all present by this time ( + + DiMichele and Hook + ORG + +, + + 1992 + DATE + +; + + Gordon + PERSON + + and + + Olson + ORG + +, + + 1995 + DATE + +; + + Jeram + ORG + +, et al, + + 1990 + DATE + +; + + Kenrick + ORG + + and + + Crane + ORG + +, + + 1997 + DATE + +). They appear to have been mainly predators and detritivores ( + + Kenrick + PERSON + + and + + Crane + ORG + +, + + 1997 + DATE + +), thus establishing themselves as a major link between animals and plants (DiMichele and Hook, + + 1992 + DATE + +). The radiation of these terrestrial invertebrates likely had a strong influence on the later radiation of terrestrial vertebrates. The classical idea regarding climate in the Late + + Devonian + NORP + + was that it was a time of warm, arid conditions with only seasonal rainfall ( + + Barrell + GPE + +, + + 1916 + DATE + +; + + Bendix-Almgreen et al, 1990 + GPE + +; + + Clack 2002b + ORG + +; + + DiMichele and Hook + ORG + +, + + 1992 + DATE + +; + + Ewer + ORG + +, + + 1955 + DATE + +; + + Long + GPE + + and + + Gordon + GPE + +, + + 2004 + DATE + +; + + Orton + ORG + +, + + 1954 + DATE + +; + + Romer + ORG + +, + + 1945 + DATE + +, + + 1958 + DATE + +, + + 1966 + DATE + +; + + Warburton + PERSON + + and + + Denman + PERSON + +, + + 1961 + DATE + +). The red beds in which the early tetrapods were found were thought to be indicative of arid conditions. However, Inger ( + + 1957 + DATE + +) cited + + Krynine + PERSON + + ( + + 1949 + DATE + +) as demonstrating that red beds often form in non-drought conditions; thus, red beds in and of themselves are not necessarily indicative of aridity. + + Romer + PERSON + + ( + + 1958 + DATE + +) responded to + + Inger + PERSON + +'s arguments by citing evidence of aridity in these strata other than the red color, including associated evaporites and evidence of subaerial deposition. 
The consensus + + today + DATE + + is that at least some areas appear to have been semi-arid with seasonal rainfall, especially those areas that were around the equator ( + + Gordon + PERSON + + and + + Olson + ORG + +, + + 1995 + DATE + +), but it is clear that not all + + Devonian + NORP + + rocks indicate arid conditions (Clack, + + 2002b + DATE + +). Figure + + 2 + CARDINAL + +. Late + + Devonian + NORP + + ( + + Famennian + NORP + +) paleogeographic reconstruction from + + Scotese + NORP + + and + + McKerrow + PERSON + + ( + + 1990 + DATE + +) in + + Daeschler + PERSON + + and + + Shubin + PERSON + + ( + + 1995 + DATE + +). Filled circles indicate tetrapod body fossils, and open circles represent trackways. During + + the Devonian, Euramerica + ORG + + (also known as + + Laurussia + GPE + +), consisting primarily of + + Laurentia + GPE + + and + + Baltica + ORG + +, is hypothesized as being in an equatorial position ( + + Clack + ORG + +, + + 2002b + DATE + +; + + Daeschler + ORG + + and + + Shubin + ORG + +, + + 1995 + DATE + +; Daeschler et al, + + 1994 + DATE + +; + + DiMichele and Hook + ORG + +, + + 1992 + DATE + +; + + Gordon + PERSON + + and + + Olson + ORG + +, + + 1995 + DATE + +; + + Scotese + NORP + + and + + McKerrow + LOC + +, + + 1990 + DATE + +; + + Thomson + ORG + +, + + 1993 + DATE + +). + + Gondwana + PERSON + + lay southward, with + + the Iapetus Sea + LOC + + separating the + + two + CARDINAL + + (Figure + + 2 + CARDINAL + +). 
While there is not necessarily a consensus as to how much ocean separated the + + two + CARDINAL + + major continents during + + the Late Devonian (Daeschler 2000 + EVENT + +; Dalziel, et + + al, 1994 + DATE + +; + + Milner + GPE + +, + + 1993 + DATE + +; + + Thomson + ORG + +, + + 1993 + DATE + +; + + Van Der Voo + PERSON + +, + + 1988 + DATE + +), it is agreed that + + the Iapetus Sea + LOC + + was in the process of closing up as + + Gondwana + GPE + + and + + Laurussia + GPE + + were moving closer, ultimately coming together in the Carboniferous (Clack, + + 2002b + DATE + +; + + Gordon + PERSON + + and + + Olson + ORG + +, + + 1995 + DATE + +; + + Van Der Voo + PERSON + +, + + 1988 + DATE + +). This tectonically active region would have had a great effect on the lives of early tetrapods as their habitats were being resized, reshaped, and eventually eliminated. Over + + the years + DATE + +, many authors have considered these ecological and environmental factors and posited theories as to why tetrapods evolved into fully terrestrial creatures. + + Barrell + PERSON + + ( + + 1916 + DATE + +) was + + one + CARDINAL + + of the earliest to propose that adverse climatic conditions were the driving factor in the origin of terrestriality. He cited the red beds in which early tetrapods had been found as evidence of aridity and hypothesized that shrinking pools of water during the dry season would have pushed amphibious tetrapods out onto land in order to survive. + + Romer + PERSON + + ( + + 1945, 1966 + DATE + +) advanced this theory, postulating that tetrapods evolved limbs in order to remain in the water. When these small pools dried up, those creatures with the stoutest limbs and most efficient terrestrial locomotion would be more likely to make it to another body of water and survive. He noted that amphibians and + + crossopterygians + NORP + + lived in the same habitat and argued that amphibians would have an obvious advantage if their habitat evaporated. 
He proposed that these short treks on land would eventually increase in duration as some amphibians would possibly linger on land to eat. For + + many years + DATE + +, this was the popular theory, and many authors proposed nuanced versions of this basic idea. During + + the 1950s + DATE + +, a series of papers was published on this topic. + + Orton + ORG + + ( + + 1954 + DATE + +) espoused the possibility that limbs may have been a digging adaptation. She cited extant amphibians digging aestivation burrows to stay moist when surrounding conditions got dry rather than dispersing to find another body of water. However, even she noted that there are many burrowing animals that are able to do so without any limbs at all. Ewer ( + + 1955 + DATE + +) suggested that early tetrapods did not leave the shrinking ponds simply because their habitat was shrinking; rather, he noted that the receding habitat would have greatly increased population pressure, which would have triggered migration if environmental conditions were adequate. Gunter ( + + 1956 + DATE + +) held to the basic + + Romer/Ewer + ORG + + theory, but he argued that the tetrapod limb had to be formed prior to these excursions. He emphasized that this was a gradual process, with the limbs + + first + ORDINAL + + acting as props under water and the tetrapods making very short excursions onto land to escape predators or seek nearby food. The longer terrestrial excursions to escape the drying conditions were the final step of the process towards terrestriality. Goin and Goin ( + + 1956 + DATE + +) theorized that competition for food was the major driving factor, citing the presence of arthropods in the shallows and on the shore that could have served as an untapped food source for early tetrapods, even though + + Romer + PERSON + + ( + + 1958 + DATE + +) argued that these food sources were not nearly adequate. 
In Inger's ( + + 1957 + DATE + +) paper offering a different climatic interpretation, he noted that the hypothesized aridity would have caused a great desiccation problem for migrating amphibians; a humid climate would have offered more favorable migration conditions for early tetrapods. + + Warburton + PERSON + + and + + Denman + PERSON + + ( + + 1961 + DATE + +) pointed out that in order to be a successful frog, one first must be a successful tadpole. They postulated that + + protoamphibians + NORP + + laid their eggs in shallow pools away from competition with larger lungfish and predators. Terrestrial locomotion would have been necessary for these larvae to get back to the water, and they pointed out that in this case, selection would be operating on a large number of individuals. This view was echoed by + + Gordon + PERSON + + and + + Olson + PERSON + + ( + + 1995 + DATE + +) as well. + + Thomson + PERSON + + ( + + 1993 + DATE + +) considered the whole pool-drying scenario to be logically inadequate, instead claiming that ecological conditions had to be the driving factor. It was the emergence of wetlands that fostered the origination of terrestrial tetrapods, offering a moist environment with abundant new food sources and protection from predators. Sayer and Davenport ( + + 1991 + DATE + +) conducted a study of modern-day amphibious fishes and found that they leave water under a variety of factors. Environmental degradation, including decreased oxygen content, increased temperature, and drastically fluctuating salinity, often causes fishes to evacuate the water. Biotic factors, such as competition for food and space, predation, feeding, and reproduction, can have a large influence as well. It is clear that a large number of potential factors could have played a role in the evolution of terrestriality, and because of this, the discussion of tetrapod origins has been unusually-theory laden ( + + Thomson + ORG + +, + + 1993 + DATE + +). 
Yet it is important to try and understand what conditions may have driven such an important evolutionary event. By analyzing the available evidence and assessing the validity of the many theories put forth, it may be possible to elucidate a small number of major factors that could have driven this critical occurrence in the history of life. Long and + + Gordon + PERSON + + ( + + 2004 + DATE + +) cited that both evolutionary pushes and pulls likely influenced the evolution of terrestriality. The pushes-the factors that encouraged tetrapods to leave the water-included poor environmental conditions, predators, competitors, diseases, and parasites. The pulls-the factors that encouraged tetrapods to come onto land-included favorable conditions, empty niches, abundant food resources, and a lack of predators, competitors, diseases, and parasites. The influences of all of these factors seem logical, so how can one disentangle them and determine the most important factors? Maybe it is not possible to sort out these factors and give some of them priority over others; they might all have been of equal importance. But if + + one + CARDINAL + + or + + two + CARDINAL + + primary causes could be determined that would have amplified the effects of these ecological factors, one could assign primary importance to these causes. The evolution of plants could have been one of these primary causes. In order for animals to move on to land, it + + first + ORDINAL + + had to be habitable for them, and, as described earlier, the evolution of land plants drastically altered the composition of the atmosphere and formed the basis of new terrestrial ecosystems. The emergence of coastal wetlands offered an array of habitats previously unseen in + + earth + LOC + +'s history ( + + Thomson + ORG + +, + + 1993 + DATE + +) and encouraged the evolution of terrestriality in arthropods. 
+ + Despite Romer's + PERSON + + ( + + 1958 + DATE + +) objections that these food sources would not have been adequate, it is probable that even piscivorous fishes would have fed on these new prey items (Clack, + + 2002b + DATE + +; + + Goin and Goin, 1956 + ORG + +; + + Thomson + GPE + +, + + 1993 + DATE + +). There were no vertebrate predators on land, so this would basically have been an unexploited niche. Rather than competing with fish in the sea, they could have an untapped source of food on land as long as they could get to it. Yet, in addition to these evolutionary pulls, plants exerted some pushes on tetrapods as well. The evolution of deciduousness in plants could have played a crucial role. Not only would mass senescing of leaves have enhanced the terrestrial ecosystem by enriching soil development (an evolutionary pull), it also likely caused anoxia in near-shore waters ( + + Algeo + PERSON + + and + + Scheckler + PERSON + +, + + 1998 + DATE + +; + + Clack + GPE + +, + + 2002b + DATE + +). As the plant matter decayed in the water, oxygen levels in the water would have decreased. This situation could have encouraged air breathing in some fish (Sayer and Davenport, + + 1991 + DATE + +), which was a requisite step in the transition to life on land. Certainly, of course, the fish could have merely come up to the surface to breathe air and survived that way, but as Sayer and Davenport ( + + 1991 + DATE + +) pointed out, many modern-day fish do leave anoxic waters. The evolution of land plants clearly played a critical role in the evolution of terrestriality. They enhanced the terrestrial ecosystem and offered wide open niches, abundant invertebrate food resources, protection from predators, and an oxygen-rich atmosphere as opposed to anoxic waters. However, aside from the anoxia in the water, all of these would be considered evolutionary pulls rather than pushes. 
There had to have been some factors in their aquatic environment that made a move onto land-and all of the requisite changes-beneficial. The tectonic activity occurring in and around + + the Iapetus Sea + LOC + + at this time could have enhanced the effects of many various evolutionary pushes. As discussed previously, early tetrapod evolution appears to have been concentrated along the southern coast of + + Euramerica + GPE + + during the Late Devonian. During this time, + + Euramerica + ORG + + and + + Gondwana + PERSON + + were converging, eventually forming + + Pangaea + ORG + + during + + the Permo-Carboniferous + ORG + +. This closing of + + the Iapetus Sea + LOC + + would have affected aquatic tetrapods living in this region in several ways. As the continents came together, the major direct effect would have been habitat loss, and this could happen in several ways. The convergence of continents would decrease the amount of coastlines and lower global sea level (Clack 2002b). The arrangement of the continents also led to a short period of global cooling and glaciation during the + + Famennian + NORP + + in + + Gondwana + GPE + + ( + + Algeo + PERSON + + and + + Scheckler + PERSON + +, + + 1998 + DATE + +; + + Johnson + PERSON + +, et al, + + 1985 + DATE + +; + + Van Der Voo + PERSON + +, + + 1988 + DATE + +). The uptake of water by glaciers would have lowered sea level as well. For tetrapods living in coastal habitats, these compounding factors would have led to a great decline in available habitat. As the amount of habitat decreased, previously separated populations of animals would be brought together into more of a confined space. In such a situation, the competition would be very intense. This recalls + + Ewer + ORG + +'s ( + + 1955 + DATE + +) emphasis on the importance of population pressure, as well as + + Goin and Goin's + ORG + + ( + + 1956 + DATE + +) focus on competition, in tetrapod evolution. 
Clack ( + + 2002b + DATE + +) also noted that when previously separated populations are forced to share a common environment, the biodiversity would actually decrease, while distribution of the remaining species would increase. This intense competition would have been a strong evolutionary push for tetrapods to find another suitable habitat. When Inger ( + + 1957 + DATE + +) was contesting the interpretation of red beds as indicative of an arid climate, he argued that discerning the stimulus that pushed terrestriality is dependent on one's climatic interpretation. And it is clear that there is not consensus about the climate in which early tetrapods evolved. But at the heart of + + Romer + PERSON + +'s classic scenario of tetrapods escaping drying pools is the loss of habitat. It has been suggested here that tectonic activity and its effects could have caused the habitat of early tetrapods to be lost; thus, an arid climate need not necessarily be a critical component in theories of the evolution of terrestriality. The evolution of terrestrial tetrapods has certainly sparked much discussion over + + the years + DATE + +, and deservedly so, for a rich terrestrial vertebrate fauna of + + about 360 million years + DATE + + is contingent on this event. During + + the Late Devonian + WORK_OF_ART + +, the + + first + ORDINAL + + tetrapods made their way onto land. As their habitat was shrinking and causing fierce intra- and interspecific competition for resources in the shallows, a new habitat with abundant resources had been brought about by plants in the terrestrial realm. The filling of these new niches available on land forever changed the course of life on earth. To understand the full breadth of evolution, it is crucial to try and understand these landmark events in the history of life on this planet. The origin of new species depends on a complex combination of environmental conditions, ecological factors, and chance. 
The environmental conditions of the Late + + Devonian + NORP + + certainly made life difficult for aquatic organisms, as evidenced by the mass extinction event in the marine community (DiMichele and Hook, + + 1992 + DATE + +; + + Johnson + PERSON + +, et al, + + 1985 + DATE + +; + + Long + GPE + + and + + Gordon + ORG + +, + + 2004 + DATE + +); but had the opportunity for the colonization of land never been presented by the changes brought about by plants, the early aquatic tetrapods may have never survived long excursions in the terrestrial realm. The chance coincidence of increasingly poor quality and decreasingly abundant aquatic habitats, an emerging high quality terrestrial ecosystem, and the acquiring of morphological adaptations by the + + first + ORDINAL + + tetrapods set the stage for + + one + CARDINAL + + of the most important steps in the history of animal life. + + The Michigan Corpus of Upper + ORG + +-level Student Papers ( + + MICUSP + ORG + +) is owned by the Regents of the University of Michigan (UM), who hold the copyright. The corpus has been developed by researchers at + + the UM English Language Institute + FAC + +. The corpus files are freely available for study, research and teaching. However, if any portion of this material is to be used for commercial purposes, such as for textbooks or tests, permission must be obtained in advance and a license fee may be required. For further information about copyright permissions, please contact + + micusp-help@umich.edu + PRODUCT + +. The recommended citation for + + MICUSP + ORG + + is: + + Michigan Corpus + PERSON + + of Upper-level + + Student Papers + WORK_OF_ART + +. ( + + 2009 + DATE + +). + + Ann Arbor + PERSON + +, MI: The Regents of + + the University of Michigan + ORG + +.
\ No newline at end of file diff --git a/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy.ipynb b/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy.ipynb index 5b8574860..82959a9e2 100644 --- a/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy.ipynb +++ b/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy.ipynb @@ -7,7 +7,7 @@ "colab_type": "text" }, "source": [ - "\"Open" + "\"Open" ] }, { @@ -541,6 +541,18 @@ "final_paper_df.head()" ] }, + { + "cell_type": "code", + "source": [ + "tokens = final_paper_df[['Text', 'Tokens']].copy()\n", + "tokens.head()" + ], + "metadata": { + "id": "xSU8Rn57FbSK" + }, + "execution_count": null, + "outputs": [] + }, { "cell_type": "markdown", "metadata": { @@ -656,14 +668,14 @@ }, { "cell_type": "code", - "execution_count": null, + "source": [ + "list(final_paper_df.loc[[3, 163], 'Proper_Nouns'])" + ], "metadata": { - "id": "P2r_x9neA_HG" + "id": "m98pVVJX1ZlK" }, - "outputs": [], - "source": [ - "list(final_paper_df['Proper_Nouns'])" - ] + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", @@ -895,7 +907,7 @@ }, "outputs": [], "source": [ - "# Store dictinoary with indexes and POS counts in a variable\n", + "# Store dictionary with indexes and POS counts in a variable\n", "num_pos = doc.count_by(spacy.attrs.POS)\n", "\n", "dictionary = {}\n", @@ -915,6 +927,9 @@ }, "outputs": [], "source": [ + "# Create new DataFrame for analysis purposes\n", + "pos_analysis_df = final_paper_df[['Filename','DISCIPLINE', 'Doc']]\n", + "\n", "# Create list to store each dictionary\n", "num_list = []\n", "\n", @@ -927,7 +942,7 @@ " num_list.append(dictionary)\n", "\n", "# Apply function to each doc object in DataFrame\n", - "final_paper_df['C_POS'] = final_paper_df['Doc'].apply(get_pos_tags)" + "pos_analysis_df['C_POS'] = pos_analysis_df['Doc'].apply(get_pos_tags)" ] }, { @@ -944,7 +959,7 @@ "\n", "# Add discipline of each paper as new column to dataframe\n", "idx = 0\n", - "new_col = 
final_paper_df['DISCIPLINE']\n", + "new_col = pos_analysis_df['DISCIPLINE']\n", "pos_counts.insert(loc=idx, column='DISCIPLINE', value=new_col)\n", "\n", "pos_counts" @@ -1014,7 +1029,7 @@ " tag_num_list.append(dictionary)\n", "\n", "# Apply function to each doc object in DataFrame\n", - "final_paper_df['F_POS'] = final_paper_df['Doc'].apply(get_fine_pos_tags)\n", + "pos_analysis_df['F_POS'] = pos_analysis_df['Doc'].apply(get_fine_pos_tags)\n", "\n", "# Create new dataframe with part of speech counts\n", "tag_counts = pd.DataFrame(tag_num_list)\n", @@ -1022,7 +1037,7 @@ "\n", "# Add discipline of each paper as new column to dataframe\n", "idx = 0\n", - "new_col = final_paper_df['DISCIPLINE']\n", + "new_col = pos_analysis_df['DISCIPLINE']\n", "tag_counts.insert(loc=idx, column='DISCIPLINE', value=new_col)" ] }, diff --git a/assets/corpus-analysis-with-spacy/metadata.csv b/assets/corpus-analysis-with-spacy/metadata.csv new file mode 100644 index 000000000..262bd28bd --- /dev/null +++ b/assets/corpus-analysis-with-spacy/metadata.csv @@ -0,0 +1,166 @@ +PAPER ID,TITLE,DISCIPLINE,PAPER TYPE,,,,,,,,, +BIO.G0.15.1,Invading the Territory of Invasives: The Dangers of Biotic Disturbance,Biology,Argumentative Essay,,,,,,,,, +BIO.G1.04.1,The Evolution of Terrestriality: A Look at the Factors that Drove Tetrapods to Move Onto Land,Biology,Argumentative Essay,,,,,,,,, +BIO.G3.03.1,Intracellular Electric Field Sensing using Nano-sized Voltmeters,Biology,Argumentative Essay,,,,,,,,, +BIO.G0.11.1,Exploring the Molecular Responses of Arabidopsis in Hypobaric Environments: Identifying Possible Targets for Genetic Engineering,Biology,Proposal,,,,,,,,, +BIO.G1.01.1,V. 
Cholerae: First Steps towards a Spatially Explicit Model ,Biology,Proposal,,,,,,,,, +BIO.G1.07.1,Zebrafish and PGC mis-migration,Biology,Proposal,,,,,,,,, +BIO.G2.06.1,A Conserved Role of Cas-Spg System in Endoderm Specification during Early Vertebrate Development,Biology,Proposal,,,,,,,,, +BIO.G3.02.1,Linking scales to understand diversity,Biology,Proposal,,,,,,,,, +BIO.G0.01.1,The Ecology and Epidemiology of Plague,Biology,Report,,,,,,,,, +BIO.G0.02.1,Host-Parasite Interactions: On the Presumed Sympatric Speciation of Vidua,Biology,Report,,,,,,,,, +BIO.G0.02.2,Sensory Drive and Speciation,Biology,Report,,,,,,,,, +BIO.G0.02.3,Plant Pollination Systems: Evolutionary Trends in Generalization and Specialization,Biology,Report,,,,,,,,, +BIO.G0.02.4,"Chromosomal Rearrangements, Recombination Suppression, and Speciation: A Review of Rieseberg 2001",Biology,Report,,,,,,,,, +BIO.G0.02.5,On the Origins of Man: Understanding the Last Two Million Years,Biology,Report,,,,,,,,, +BIO.G0.04.1,Fetal Endocrine System,Biology,Report,,,,,,,,, +BIO.G0.05.1,Mn (III) TPPS4: A Metallophorphryin Used for Tumor Identification in MRI,Biology,Report,,,,,,,,, +BIO.G0.06.1,Global Reproductive Strategies of Tursiops and Stenella (Family Delphinidae),Biology,Report,,,,,,,,, +BIO.G0.07.1,Complementation Between Histidine-Requiring Mutants of Saccharomyces Cerevisiae,Biology,Report,,,,,,,,, +BIO.G0.09.1,Nest Selection In Weaver Birds,Biology,Report,,,,,,,,, +BIO.G0.11.3,Fungal Eye Infections Due to ReNu MoistureLoc,Biology,Report,,,,,,,,, +BIO.G0.12.2,Lab Report 2: Plant Biodiversity ,Biology,Report,,,,,,,,, +BIO.G0.13.1,Malaria Disease and Transmission,Biology,Report,,,,,,,,, +BIO.G0.16.1,Role of Leptin in Cardiovascular Disease,Biology,Report,,,,,,,,, +BIO.G0.18.1,Mammal Diversification,Biology,Report,,,,,,,,, +BIO.G0.19.1,Malaria Disease and Transmission,Biology,Report,,,,,,,,, +BIO.G0.20.1,A Case for a US International Anti-Malaria Program,Biology,Report,,,,,,,,, +BIO.G0.24.1,Varying 
Infectivity of Diplostomum Flexicaudum in Fish Species of Douglas Lake,Biology,Report,,,,,,,,, +BIO.G0.25.1,Malaria in the Twenty-first Century,Biology,Report,,,,,,,,, +BIO.G0.26.1,"Case Analysis for Jean ""Redhorse"" Osceola ",Biology,Report,,,,,,,,, +BIO.G0.29.1,Sexual Selection and Male Sacrifice: From Darwin until Now,Biology,Report,,,,,,,,, +BIO.G0.30.1,Assessing selection hypotheses for the CCR5-delta32 mutation in Europeans,Biology,Report,,,,,,,,, +BIO.G0.32.1,Mangrove Deforestation,Biology,Report,,,,,,,,, +BIO.G1.03.1,Dispersal: a Review and Synthesis,Biology,Report,,,,,,,,, +BIO.G1.06.1,Temp effects on nitrogen mineralization,Biology,Report,,,,,,,,, +BIO.G1.08.1,"Boc, sonic hedgehog and commissural axons",Biology,Report,,,,,,,,, +BIO.G2.02.1,Modularity and the Evolution of Complex Systems,Biology,Report,,,,,,,,, +BIO.G2.05.1,Application of Microarray Analysis in Drosophila,Biology,Report,,,,,,,,, +BIO.G2.05.2,Assignment: Subcellular Protein Localization,Biology,Report,,,,,,,,, +BIO.G2.07.1,Biofuels and Biodiversity,Biology,Report,,,,,,,,, +BIO.G0.02.6,Genetic Analysis of Drosophila Melanogaster Mutants: To Determine Inheritance and Linkage Patterns,Biology,Research Paper,,,,,,,,, +BIO.G0.03.1,Lab 3: Plant Competition,Biology,Research Paper,,,,,,,,, +BIO.G0.03.2,The Effects of Motor Oil on Aquatic Insect Predation,Biology,Research Paper,,,,,,,,, +BIO.G0.03.3,Comparison of Hypothesis Testing in a High vs. Low-impact Journal,Biology,Research Paper,,,,,,,,, +BIO.G0.04.2,Drosophila Lab Report,Biology,Research Paper,,,,,,,,, +BIO.G0.04.3,Conjugation Lab Report,Biology,Research Paper,,,,,,,,, +BIO.G0.07.2,"Plasmid Transfer, Genetic Stability and Nisin Resistance in Lactococci",Biology,Research Paper,,,,,,,,, +BIO.G0.07.3,Genetic Analysis of a Mutant Strain of Drosophila Melanogaster,Biology,Research Paper,,,,,,,,, +BIO.G0.08.1,Analysis of a Mutant Strain of Drosophila,Biology,Research Paper,,,,,,,,, +BIO.G0.10.1,Bacterial Conjugation and Gene Mapping in E. 
Coli,Biology,Research Paper,,,,,,,,, +BIO.G0.10.2,Drosophila Melanogaster Genetic Analysis Experiment,Biology,Research Paper,,,,,,,,, +BIO.G0.11.2,Mapping of Genes in a Mutant Strain of Drosophila Melanogaster,Biology,Research Paper,,,,,,,,, +BIO.G0.12.1,The Effect of Initial Carbon Dioxide Concentration on the Rate of Photosynthesis in the Aquatic Plant Elodea,Biology,Research Paper,,,,,,,,, +BIO.G0.14.1,Mapping of Unknown Mutations in Drosophila Melanogaster,Biology,Research Paper,,,,,,,,, +BIO.G0.17.1,Effects of Protein Deficiency on Organ Size in Mus Musculus,Biology,Research Paper,,,,,,,,, +BIO.G0.21.1,Prevalence of Haemosporidians in Birds of Northern Michigan,Biology,Research Paper,,,,,,,,, +BIO.G0.22.1,Trunk Forking in Acer Saccharum: a Phototropic Response to Forest Canopy Gaps,Biology,Research Paper,,,,,,,,, +BIO.G0.23.1,Drosophila Melanogaster lab report,Biology,Research Paper,,,,,,,,, +BIO.G0.27.1,Small Mammal Response to Post-fire Forest Succession in Northern Lower Michigan,Biology,Research Paper,,,,,,,,, +BIO.G0.28.1,Niche Partitioning in Bats,Biology,Research Paper,,,,,,,,, +BIO.G1.02.1,Biological Significance of Modular Structures in Protein Networks,Biology,Research Paper,,,,,,,,, +BIO.G1.05.1,Inferring Swimming Mode from Skeletal Proportions in Fossil Pinnipedimorphs,Biology,Research Paper,,,,,,,,, +BIO.G2.01.1,Relationships and Biogeography of Antillean Cichlids,Biology,Research Paper,,,,,,,,, +BIO.G2.03.1,"Cholera Seasonality, Rainfall, and Fadeouts: a Geostatistical Approach",Biology,Research Paper,,,,,,,,, +BIO.G2.04.1,Polyphyly of the Old World Vultures and Phylogenetic Placement of Gypohierax Angolensis (Aves: Accipitridae) Inferred from Mitochondrial DNA,Biology,Research Paper,,,,,,,,, +BIO.G3.01.1,Do Programming-Oriented Environments Provide Favorable Growing Conditions for Hypotheses?,Biology,Research Paper,,,,,,,,, +BIO.G0.31.1, Neurobiology Disease Explanation to a Parent,Biology,Response Paper,,,,,,,,, +BIO.G1.08.2,Drosophila 
neuroblasts,Biology,Response Paper,,,,,,,,, +ENG.G0.02.1,The Vicar of Wakefield as a Failed Morality Story,English,Argumentative Essay,,,,,,,,, +ENG.G0.03.1,HIV-AIDS Funding,English,Argumentative Essay,,,,,,,,, +ENG.G0.03.2,William Faulkner's As I Lay Dying,English,Argumentative Essay,,,,,,,,, +ENG.G0.04.1,The Absolute Necessity of College Level Writing Courses,English,Argumentative Essay,,,,,,,,, +ENG.G0.05.1,People or Property?,English,Argumentative Essay,,,,,,,,, +ENG.G0.06.2,A (Solitary) Place For Fantasy in Reality,English,Argumentative Essay,,,,,,,,, +ENG.G0.06.3,Human-Animal Nature in H.G. Wells and Edgar Allen Poe ,English,Argumentative Essay,,,,,,,,, +ENG.G0.07.1,Effects of digital age on children's literature,English,Argumentative Essay,,,,,,,,, +ENG.G0.09.2,"Historical Places, Violent Spaces: ",English,Argumentative Essay,,,,,,,,, +ENG.G0.11.1,Bloom & Martha: Are You Not Happy in Your Home?,English,Argumentative Essay,,,,,,,,, +ENG.G0.12.1,Women in Beowulf,English,Argumentative Essay,,,,,,,,, +ENG.G0.13.1,The Love Covenant,English,Argumentative Essay,,,,,,,,, +ENG.G0.14.1,James Joyce: A Portrait of the Artist as a Young Man,English,Argumentative Essay,,,,,,,,, +ENG.G0.15.1,Anna Karenina,English,Argumentative Essay,,,,,,,,, +ENG.G0.17.1,The Grey Zone of Shame,English,Argumentative Essay,,,,,,,,, +ENG.G0.18.1,Individuality and Isolation in Moll Flanders,English,Argumentative Essay,,,,,,,,, +ENG.G0.18.2,Frames and Resistance in Pride and Prejudice,English,Argumentative Essay,,,,,,,,, +ENG.G0.18.3,Satire and Morality in the Vicar of Wakefield,English,Argumentative Essay,,,,,,,,, +ENG.G0.18.4,The Space of Dreams in The Age of Innocence,English,Argumentative Essay,,,,,,,,, +ENG.G0.19.2,Paper on Invisible Man for an American Lit course,English,Argumentative Essay,,,,,,,,, +ENG.G0.20.1,Autonomy in Robinson Crusoe,English,Argumentative Essay,,,,,,,,, +ENG.G0.21.1,Elevated Language in Beowulf,English,Argumentative Essay,,,,,,,,, +ENG.G0.22.1,Autumnal Imagery in 
Austen's Persuasion,English,Argumentative Essay,,,,,,,,, +ENG.G0.23.1,Milton's Relativism ,English,Argumentative Essay,,,,,,,,, +ENG.G0.24.1,Contradiction and Religious Critique: The Pardoner in The Canterbury Tales,English,Argumentative Essay,,,,,,,,, +ENG.G0.25.1,Sexualized Violence and Identity in Achy Obejas' Memory Mambo,English,Argumentative Essay,,,,,,,,, +ENG.G0.26.2,Comparison of Hermia in A Midsummer Night's Dream and Jessica in TheMerchant of Venice,English,Argumentative Essay,,,,,,,,, +ENG.G0.27.1,The Representation of Jesus and Women in the Gospel of Mark,English,Argumentative Essay,,,,,,,,, +ENG.G0.28.1,Bloom the Critic in Joyce's Ulysses,English,Argumentative Essay,,,,,,,,, +ENG.G0.29.1,Good People Breaking Rules,English,Argumentative Essay,,,,,,,,, +ENG.G0.32.1,"Slavery in ""Robinson Crusoe""",English,Argumentative Essay,,,,,,,,, +ENG.G0.34.1,The Law of Love,English,Argumentative Essay,,,,,,,,, +ENG.G0.35.1,"""Less is More: Courtship in Twelfth Night""",English,Argumentative Essay,,,,,,,,, +ENG.G0.35.2,"""Andrew Marvell's Definition of Love""",English,Argumentative Essay,,,,,,,,, +ENG.G0.37.1,The Last Paper I Ever Wrote in College,English,Argumentative Essay,,,,,,,,, +ENG.G0.39.2,The Image of Mary,English,Argumentative Essay,,,,,,,,, +ENG.G0.40.1,Deep Sleep And The Inability To Exist In Works of Fantasy,English,Argumentative Essay,,,,,,,,, +ENG.G0.41.1,Anti-Aristotelian Kane,English,Argumentative Essay,,,,,,,,, +ENG.G0.41.3,Classical and Modern Representations in O'Neill,English,Argumentative Essay,,,,,,,,, +ENG.G0.42.2,Sexuality in Ancient Greece,English,Argumentative Essay,,,,,,,,, +ENG.G0.43.1,Margery Kempe's Self-Fashioning: Visioning Herself in God,English,Argumentative Essay,,,,,,,,, +ENG.G0.44.1,Eve's Understanding of Natural Patriarchy in Paradise Lost,English,Argumentative Essay,,,,,,,,, +ENG.G0.45.1,Charity in Sir Thornhill,English,Argumentative Essay,,,,,,,,, +ENG.G0.46.1,"With Magic Comes Power: Exploring Marlowe's ""Doctor Faustus"" and 
Shakespeare's ""The Tempest""",English,Argumentative Essay,,,,,,,,, +ENG.G0.47.1,"Female Bonding in the Novel ""Roxana""",English,Argumentative Essay,,,,,,,,, +ENG.G0.49.1,The Purgatory of the Postmodern,English,Argumentative Essay,,,,,,,,, +ENG.G0.49.2,"The Officer, Solness, and the Dionysian Man",English,Argumentative Essay,,,,,,,,, +ENG.G0.49.3,"This Man Loved Earth, Not Heaven, Enough to Die",English,Argumentative Essay,,,,,,,,, +ENG.G0.51.1,Rejecting Colonial Memory,English,Argumentative Essay,,,,,,,,, +ENG.G0.52.1,Deborah and the Degradation of Israel,English,Argumentative Essay,,,,,,,,, +ENG.G0.53.1,Carwin and the Imp of the Perverse,English,Argumentative Essay,,,,,,,,, +ENG.G0.55.1,Self Destruction in Kindred,English,Argumentative Essay,,,,,,,,, +ENG.G0.55.2,Illustrations in Persepolis,English,Argumentative Essay,,,,,,,,, +ENG.G0.58.1,My Reading of Chaucer,English,Argumentative Essay,,,,,,,,, +ENG.G1.02.1,Seeing Selves in Troilus and Cressida,English,Argumentative Essay,,,,,,,,, +ENG.G1.03.1,Theorizing the Analysand Desire,English,Argumentative Essay,,,,,,,,, +ENG.G1.04.1,Sports Literacy and Rhetoric as Power,English,Argumentative Essay,,,,,,,,, +ENG.G1.05.1,Yeats and Spenser,English,Argumentative Essay,,,,,,,,, +ENG.G1.06.1,Intergenerational Trauma in Nora Okja Keller's Comfort Woman,English,Argumentative Essay,,,,,,,,, +ENG.G2.01.1,Messianic Masochism in H. 
Rider Haggard,English,Argumentative Essay,,,,,,,,, +ENG.G2.02.1,Dramatic Adaptations: Jewish Identity and Narrative Form in The Island Within,English,Argumentative Essay,,,,,,,,, +ENG.G2.02.2,'City Troubles': Miss Lonelyhearts and the Publicized Privacy of Urban Space,English,Argumentative Essay,,,,,,,,, +ENG.G2.03.1,Domesticity in Cold War Black Fiction on the Left,English,Argumentative Essay,,,,,,,,, +ENG.G2.04.1,Augusta Webster Paper,English,Argumentative Essay,,,,,,,,, +ENG.G3.04.1,Creative Multivalence: Social Engagement in Gwendolyn Brooks's 'Maud Martha',English,Argumentative Essay,,,,,,,,, +ENG.G0.06.1,My Life Is Not a Movie Starring Michelle Pfeiffer or Hilary Swank,English,Creative Writing,,,,,,,,, +ENG.G0.26.1,Autoethnography,English,Creative Writing,,,,,,,,, +ENG.G0.38.3,Return to Suomi,English,Creative Writing,,,,,,,,, +ENG.G0.57.1,"Hire Me, You Know You Want To",English,Creative Writing,,,,,,,,, +ENG.G0.30.1,SIgnificance of Menstruation in Joyce's Ulysses,English,Critique/Evaluation,,,,,,,,, +ENG.G0.41.2,"Seasons, Ages, Cycles: All As You Like It",English,Critique/Evaluation,,,,,,,,, +ENG.G0.48.1,Creative Exercise in Style and Content,English,Critique/Evaluation,,,,,,,,, +ENG.G2.05.1,St. 
Alban and English Exemplarity,English,Critique/Evaluation,,,,,,,,, +ENG.G3.03.1,Into the Light: Avedon's Images of Inmates,English,Critique/Evaluation,,,,,,,,, +ENG.G0.01.1,Woolf's Women,English,Report,,,,,,,,, +ENG.G0.01.2,Douglas's Declaration,English,Report,,,,,,,,, +ENG.G0.10.1,Close Reading Paper: The Tempest (4.1.146-163),English,Report,,,,,,,,, +ENG.G0.13.2,Abraham Drafted to Team Galatians,English,Report,,,,,,,,, +ENG.G0.16.1,Humans and Animals in the Book of Genesis,English,Report,,,,,,,,, +ENG.G0.17.2,"Survival in Auschwitz: Irony, Reality, and Power",English,Report,,,,,,,,, +ENG.G0.19.1,A compare and contrast paper using two texts from a Science Fiction course,English,Report,,,,,,,,, +ENG.G0.22.2,"Chaucer's ""The Franklin's Tale"" and Boccaccio's Il Filocolo",English,Report,,,,,,,,, +ENG.G0.31.1,"Cursed Inheritances in Go Down, Moses",English,Report,,,,,,,,, +ENG.G0.31.2,Jack Zipes and Fairy Tales,English,Report,,,,,,,,, +ENG.G0.33.1,Limited Recovery: Trauma in Obejas' Cuban America,English,Report,,,,,,,,, +ENG.G0.34.2,The Words of Love Between Us: The Covenant in Hosea,English,Report,,,,,,,,, +ENG.G0.34.3,Names of God and Man,English,Report,,,,,,,,, +ENG.G0.36.1,Stephen's Deconstruction of Himself in Ulysses,English,Report,,,,,,,,, +ENG.G0.42.1,Perspectives on the English Revolution,English,Report,,,,,,,,, +ENG.G0.43.2,Defining Wild Nature,English,Report,,,,,,,,, +ENG.G0.50.1,Rosamond Vincy and the Real Sphere,English,Report,,,,,,,,, +ENG.G0.54.1,Analysis of T.C. 
Boyle's A Friend of the Earth,English,Report,,,,,,,,, +ENG.G1.01.1,Clarissa Consumed,English,Report,,,,,,,,, +ENG.G1.07.1,Categorical Entanglement ,English,Report,,,,,,,,, +ENG.G2.06.1,Jews in the New York Harbor,English,Report,,,,,,,,, +ENG.G2.07.1,Pedagogical genres,English,Report,,,,,,,,, +ENG.G0.38.1,Roman Polanski's Macbeth,English,Response Paper,,,,,,,,, +ENG.G0.38.2,Julie Taymor's Titus,English,Response Paper,,,,,,,,, diff --git a/assets/corpus-analysis-with-spacy/txt_files.zip b/assets/corpus-analysis-with-spacy/txt_files.zip new file mode 100644 index 000000000..3b02ea9be Binary files /dev/null and b/assets/corpus-analysis-with-spacy/txt_files.zip differ diff --git a/en/lessons/corpus-analysis-with-spacy.md b/en/lessons/corpus-analysis-with-spacy.md new file mode 100644 index 000000000..86fe04589 --- /dev/null +++ b/en/lessons/corpus-analysis-with-spacy.md @@ -0,0 +1,887 @@ +--- +title: "Corpus Analysis with spaCy" +slug: corpus-analysis-with-spacy +layout: lesson +collection: lessons +date: 2023-11-02 +authors: +- Megan S. Kane +reviewers: +- Maria Antoniak +- William Mattingly +editors: +- John R. Ladd +review-ticket: https://github.com/programminghistorian/ph-submissions/issues/546 +difficulty: 2 +activity: analyzing +topics: [data-manipulation, distant-reading, python] +abstract: This lesson demonstrates how to use the Python library spaCy for analysis of large collections of texts. This lesson details the process of using spaCy to enrich a corpus via lemmatization, part-of-speech tagging, dependency parsing, and named entity recognition. Readers will learn how the linguistic annotations produced by spaCy can be analyzed to help researchers explore meaningful trends in language patterns across a set of texts. +avatar_alt: Drawing of the planet Saturn +doi: 10.46430/phen0113 +--- + + +{% include toc.html %} + + +## Introduction +Say you have a big collection of texts. 
Maybe you've gathered speeches from the French Revolution, compiled a bunch of Amazon product reviews, or unearthed a collection of diary entries written during the First World War. In any of these cases, computational analysis can be a good way to complement close reading of your corpus... but where should you start? + +One possible way to begin is with [spaCy](https://spacy.io/), an industrial-strength library for Natural Language Processing (NLP) in [Python](https://perma.cc/4GK2-5EEA). spaCy is capable of processing large corpora, generating linguistic annotations including part-of-speech tags and named entities, as well as preparing texts for further machine classification. This lesson is a 'spaCy 101' of sorts, a primer for researchers who are new to spaCy and want to learn how it can be used for corpus analysis. It may also be useful for those who are curious about natural language processing tools in general, and how they can help us to answer humanities research questions. + +### Lesson Goals +By the end of this lesson, you will be able to: +* Upload a corpus of texts to a platform for Python analysis (using Google Colaboratory) +* Use spaCy to enrich the corpus through tokenization, lemmatization, part-of-speech tagging, dependency parsing and chunking, and named entity recognition +* Conduct frequency analyses using part-of-speech tags and named entities +* Download an enriched dataset for use in future NLP analyses + +### Why Use spaCy for Corpus Analysis? +As the name implies, corpus analysis involves studying corpora, or large collections of documents. Typically, the documents in a corpus are representative of the group(s) a researcher is interested in studying, such as the writings of a specific author or genre. By analyzing these texts at scale, researchers can identify meaningful trends in the way language is used within the target group(s). 
+ +Though computational tools like spaCy can't read and comprehend the meaning of texts like humans do, they excel at 'parsing' (analyzing sentence structure) and 'tagging' (labeling) them. When researchers give spaCy a corpus, it will 'parse' every document in the collection, identifying the grammatical categories to which each word and phrase in each text most likely belongs. NLP tools like spaCy use this information to generate lexico-grammatical tags that are of interest to researchers, such as lemmas (base words), part-of-speech tags and named entities (more on these in the [Part-of-Speech Analysis](#part-of-speech-analysis) and [Named Entity Recognition](#named-entity-recognition) sections below). Furthermore, computational tools like spaCy can perform these parsing and tagging processes much more quickly (in a matter of seconds or minutes) and on much larger corpora (hundreds, thousands, or even millions of texts) than human readers would be able to. + +Though spaCy was designed for industrial use in software development, researchers also find it valuable for several reasons: +* It's [easy to set up and use spaCy's Trained Models and Pipelines](https://perma.cc/Q8QL-N3CX); there is no need to call a wide range of packages and functions for each individual task +* It uses [fast and accurate algorithms](https://perma.cc/W8AD-4QSN) for text-processing tasks, which the developers keep up to date, so it's efficient to run +* It [performs better on text-splitting tasks than Natural Language Toolkit (NLTK)](https://perma.cc/8989-S2Q6), because it constructs [syntactic trees](https://perma.cc/E6UJ-DZ9W) for each sentence + +You may still be wondering: What is the value of extracting language data such as lemmas, part-of-speech tags, and named entities from a corpus? How can this data help researchers answer meaningful humanities research questions? To illustrate, let's look at the example corpus and questions developed for this lesson. 
+ +### Dataset: Michigan Corpus of Upper-Level Student Papers +The [Michigan Corpus of Upper-Level Student Papers (MICUSP)](https://perma.cc/WK67-MQ8A) is a corpus of 829 high-scoring academic writing samples from students at the University of Michigan. The texts come from 16 disciplines and seven genres; all were written by senior undergraduate or graduate students and received an A-range score in a university course.[^1] The texts and their metadata are publicly available on [MICUSP Simple](https://perma.cc/WK67-MQ8A), an online interface which allows users to search for texts by a range of fields (for example genre, discipline, student level, textual features) and conduct simple keyword analyses across disciplines and genres. + +{% include figure.html filename="or-en-corpus-analysis-with-spacy-01.png" alt="MICUSP Simple Interface web page, displaying list of texts included in MICUSP, distribution of texts across disciplines and paper types, and options to sort texts by student level, textual features, paper types, and disciplines" caption="Figure 1: MICUSP Simple Interface" %} + +Metadata from the corpus is available to download in `.csv` format. The text files can be retrieved through webscraping, a process explained further in Jeri Wieringa's [Intro to BeautifulSoup lesson](/en/lessons/retired/intro-to-beautiful-soup), a Programming Historian lesson which remains methodologically useful even if it has been retired due to changes to the scraped website. + +Given its size and robust metadata, MICUSP has become a valuable tool for researchers seeking to study student writing computationally. Notably, Jack Hardy and Ute Römer[^2] use MICUSP to study language features that indicate how student writing differs across disciplines. Laura Aull compares usages of stance markers across student genres[^3], and Sugene Kim highlights discrepancies between prescriptive grammar rules and actual language use in student work[^4]. 
Like much corpus analysis research, these studies are predicated on the fact that computational analysis of language patterns — the discrete lexico-grammatical practices students employ in their writing — can yield insights into larger questions about academic writing. Given its value in generating linguistic annotations, spaCy is well-poised to conduct this type of analysis on MICUSP data. + +This lesson will explore a subset of documents from MICUSP: 67 Biology papers and 98 English papers. Writing samples in this select corpus belong to all seven MICUSP genres: Argumentative Essay, Creative Writing, Critique/Evaluation, Proposal, Report, Research Paper, and Response Paper. This select corpus [`txt_files.zip`](/assets/corpus-analysis-with-spacy/txt_files.zip) and the associated [`metadata.csv`](/assets/corpus-analysis-with-spacy/metadata.csv) are available to download as sample materials for this lesson. The dataset has been culled from the larger corpus in order to investigate the differences between two distinct disciplines of academic writing (Biology and English). It is also a manageable size for the purposes of this lesson. + +**Quick note on corpus size and processing speed:** spaCy is able to process jobs of up to 1 million characters, so it can be used to process the full MICUSP corpus, or other corpora containing hundreds or thousands of texts. You are more than welcome to retrieve the entire MICUSP corpus with [this webscraping code](https://perma.cc/75EV-XDBN) and use that dataset for the analysis. + +### Research Questions: Linguistic Differences Within Student Paper Genres and Disciplines +This lesson will describe how spaCy's utilities in **stopword removal,** **tokenization,** and **lemmatization** can assist in (and hinder) the preparation of student texts for analysis. 
You will learn how spaCy's ability to extract linguistic annotations such as **part-of-speech tags** and **named entities** can be used to compare conventions within subsets of a discursive community of interest. The lesson focuses on lexico-grammatical features that may indicate genre and disciplinary differences in academic writing. + +The following research questions will be investigated: + +**1: Do students use certain parts of speech more frequently in Biology texts versus English texts, and does this linguistic discrepancy signify differences in disciplinary conventions?** +Prior research has shown that even when writing in the same genres, writers in the sciences follow different conventions than those in the humanities. Notably, academic writing in the sciences has been characterized as informational, descriptive, and procedural, while scholarly writing in the humanities is narrativized, evaluative, and situation-dependent (that is, focused on discussing a particular text or prompt)[^5]. By deploying spaCy on the MICUSP texts, researchers can determine whether there are any significant differences between the part-of-speech tag frequencies in English and Biology texts. For example, we might expect students writing Biology texts to use more adjectives than those in the humanities, given their focus on description. Conversely, we might suspect English texts to contain more verbs and verb auxiliaries, indicating a more narrative structure. To test these hypotheses, you'll learn to analyze part-of-speech counts generated by spaCy, as well as to explore other part-of-speech count differences that could prompt further investigation. + +**2: Do students use certain named entities more frequently in different academic genres, and do these varying word frequencies signify broader differences in genre conventions?** +As with disciplinary differences, research has shown that different genres of writing have their own conventions and expectations. 
For example, explanatory genres such as research papers, proposals and reports tend to focus on description and explanation, whereas argumentative and critique-driven texts are built around evaluations and arguments[^6]. By deploying spaCy on the MICUSP texts, researchers can determine whether there are any significant differences between the named entity frequencies in texts within the seven different genres represented (Argumentative Essay, Creative Writing, Critique/Evaluation, Proposal, Report, Research Paper, and Response Paper). We may suspect that argumentative genres engage more with people or works of art, since these could be entities serving to support their arguments or as the subject of their critiques. Conversely, perhaps dates and numbers are more prevalent in evidence-heavy genres, such as research papers and proposals. To test these hypotheses, you'll learn to analyze the nouns and noun phrases spaCy has tagged as 'named entities.' + +In addition to exploring the research questions above, this lesson will address how a dataset enriched by spaCy can be exported in a usable format for further machine learning tasks including [sentiment analysis](/en/lessons/sentiment-analysis#calculate-sentiment-for-a-paragraph) or [topic modeling](/en/lessons/topic-modeling-and-mallet). + +### Prerequisites +You should have some familiarity with Python or a similar coding language. For a brief introduction or refresher, work through some of the _Programming Historian_'s [introductory Python tutorials](/en/lessons/introduction-and-installation). You should also have basic knowledge of spreadsheet (`.csv`) files, as this lesson will primarily use data in a similar format called a [pandas](https://pandas.pydata.org/) DataFrame. Halle Burns's lesson [Crowdsourced-Data Normalization with Python and Pandas](/en/lessons/crowdsourced-data-normalization-with-pandas) provides an overview of creating and manipulating datasets using pandas. 
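If pandas DataFrames are new to you, the following minimal sketch shows the tabular shape this lesson works with throughout. The two filenames and disciplines here are illustrative examples drawn from the lesson's metadata, not a required setup step:

```python
import pandas as pd

# A DataFrame is a table with labeled columns; this one mirrors the
# kind of metadata this lesson attaches to each student paper
df = pd.DataFrame({
    'Filename': ['BIO.G1.04.1.txt', 'ENG.G0.12.1.txt'],
    'DISCIPLINE': ['Biology', 'English'],
})

print(df.head())  # displays the first rows of the table
```

Working with a DataFrame rather than loose text files makes it straightforward to attach new columns of spaCy-generated annotations alongside each paper later in the lesson.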
+ +[The code for this lesson](https://nbviewer.org/github/programminghistorian/jekyll/blob/gh-pages/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy.ipynb) has been prepared as a Jupyter Notebook that is customized and ready to run in Google Colaboratory. + +[Jupyter Notebooks](https://perma.cc/S9GS-83JN) are browser-based, interactive computing environments for Python. Colaboratory is a Google platform which allows you to run cloud-hosted Jupyter Notebooks, with additional built-in features. If you're new to coding and aren't working with sensitive data, Google Colab may be the best option for you. [There is a brief Colab tutorial from Google available for beginners.](https://colab.research.google.com/) + +You can also download [the lesson code](https://nbviewer.org/github/programminghistorian/jekyll/blob/gh-pages/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy.ipynb) and run it on your local machine. The practical steps for running the code locally are the same except when it comes to installing packages and retrieving and downloading files. These divergences are marked in the notebook. Quinn Dombrowski, Tassie Gniady, and David Kloster's lesson [Introduction to Jupyter Notebooks](/en/lessons/jupyter-notebooks) covers the necessary background for setting up and using a Jupyter Notebook with Anaconda. + +It is also recommended, though not required, that before starting this lesson you learn about common text mining methods. Heather Froehlich's lesson [Corpus Analysis with AntConc](/en/lessons/corpus-analysis-with-antconc) shares tips for working with plain text files and outlines possibilities for exploring keywords and collocations in a corpus. William J. Turkel and Adam Crymble's lesson [Counting Word Frequencies with Python](/en/lessons/counting-frequencies) describes the process of counting word frequencies, a practice this lesson will adapt to count part-of-speech and named entity tags. + +No prior knowledge of spaCy is required. 
For a quick overview, go to the [spaCy 101 page](https://perma.cc/Z23P-R252) from the library's developers. + +## Imports, Uploads, and Preprocessing + +### Import Packages +Import spaCy and related packages into your Colab environment. + +``` +# Import spaCy +import spacy + +# Load spaCy visualizer +from spacy import displacy + +# Import os to upload documents and metadata +import os + +# Import pandas DataFrame packages +import pandas as pd + +# Import graphing packages +import plotly.graph_objects as go +import plotly.express as px + +# Import files to facilitate file uploads +from google.colab import files +``` + +### Upload Text Files +After all necessary packages have been imported, it is time to upload the data for analysis with spaCy. Prior to running the code below, make sure the MICUSP text files you are going to analyze are saved to your local machine. + +Run the code below to select multiple files to upload from a local folder: + +``` +uploaded_files = files.upload() +``` + +When the cell has run, navigate to where you stored the MICUSP text files. Select all the files of interest and click Open. The text files should now be uploaded to your Google Colab session. + +Now we have files upon which we can perform analysis. To check what form of data we are working with, you can use the `type()` function. + +``` +type(uploaded_files) +``` + +It should return that your files are contained in a dictionary, where keys are the filenames and values are the content of each file. + +Next, we’ll make the data easier to manage by inserting it into a pandas DataFrame. As the files are currently stored in a dictionary, use the `DataFrame.from_dict()` function to convert them into a new DataFrame: + +``` +paper_df = pd.DataFrame.from_dict(uploaded_files, orient='index') +paper_df.head() +``` + +Use the `.head()` function to display the first five rows of the DataFrame and check that the filenames and text are present. 
You will also notice some strange characters at the start of each row of text; these are byte string characters (`b'` or `b"`) related to the encoding, and they will be removed below. + +This table shows the initial DataFrame with filenames and texts. These are the first five rows of the student text DataFrame, including columns for the title of each text and the body of each text, without column header names and with byte string characters at the start of each line. + +  | 0 +-- | -- +BIO.G0.01.1.txt | b"Introduction\xe2\x80\xa6\xe2\x80\xa6\xe2\x80... +BIO.G0.02.1.txt | b' Ernst Mayr once wrote, sympatric speci... +BIO.G0.02.2.txt | b" Do ecological constraints favour certa... +BIO.G0.02.3.txt | b" Perhaps one of the most intriguing va... +BIO.G0.02.4.txt | b" The causal link between chromosomal re... + +From here, you can reset the index (the very first column of the DataFrame) so it is a true index, rather than the list of filenames. The filenames will become the first column and the texts the second, making data wrangling easier later. + +``` +# Reset index and add column names to make wrangling easier +paper_df = paper_df.reset_index() +paper_df.columns = ["Filename", "Text"] +``` + +Check the head of the DataFrame again to confirm this process has worked. + +### Pre-process Text Files +If you've done any computational analysis before, you're likely familiar with the term 'cleaning', which covers a range of procedures such as lowercasing, punctuation removal, and stopword removal. Such procedures are used to standardize data and make it easier for computational tools to interpret. In the next step, you will convert the uploaded files from byte strings into Unicode strings so that spaCy can process them, and replace runs of extra whitespace with single spaces. + +First, you will notice that each text in your DataFrame starts with `b'` or `b"`. This indicates that the data has been read as 'byte strings', or strings which represent a sequence of bytes. 
`b"hello"`, for example, corresponds to the sequence of bytes `104, 101, 108, 108, 111`. To analyze the texts with spaCy, we need them to be Unicode strings, where the characters are individual letters. + +Converting from bytes to strings is a quick task using `str.decode()`. Within the parentheses, we specify the encoding parameter, UTF-8 (Unicode Transformation Format - 8 bits), which guides the transformation from bytes to Unicode strings. For a more thorough breakdown of encoding in Python, [check out this lesson](https://perma.cc/Z5M2-4EHC). + +``` +paper_df['Text'] = paper_df['Text'].str.decode('utf-8') +paper_df.head() +``` + +Here, we generate a decoded DataFrame with filenames and texts. This table shows the first five rows of student texts DataFrame, including columns for the Filename and the Text of each paper, with byte string characters removed. + +  | Filename | Text +-- | -- | -- +0 | BIO.G0.01.1.txt | Introduction……………………………………………………..1 Brief Hist... +1 | BIO.G0.02.1.txt | Ernst Mayr once wrote, sympatric speciation is... +2 | BIO.G0.02.2.txt | Do ecological constraints favour certain perce... +3 | BIO.G0.02.3.txt | Perhaps one of the most intriguing varieties o... +4 | BIO.G0.02.4.txt | The causal link between chromosomal rearrangem... + +Additionally, the beginnings of some of the texts may also contain extra whitespace (indicated by `\t` or `\n`). These characters can be replaced by a single space using the `str.replace()` method. + +``` +paper_df['Text'] = paper_df['Text'].str.replace('\s+', ' ', regex=True).str.strip() +``` + +Further cleaning is not necessary before running spaCy, and some common cleaning processes will, in fact, skew your results. For example, punctuation markers help spaCy parse grammatical structures and generate part-of-speech tags and dependency trees. 
Recent scholarship suggests that removing stopwords only superficially improves tasks like topic modeling, and that retaining stopwords can support clustering and classification[^8]. At a later stage of this lesson, you will learn to remove stopwords so you can compare the impact of their removal on spaCy's results. + +### Upload and Merge Metadata Files +Next you will retrieve the metadata about the MICUSP corpus: the discipline and genre information connected to the student texts. Later in this lesson, you will use spaCy to trace differences across genre and disciplinary categories. + +In your Colab, run the following code to upload the `.csv` file from your local machine. + +``` +metadata = files.upload() +``` + +Then convert the uploaded `.csv` file to a second DataFrame, dropping any empty columns. + +``` +metadata_df = pd.read_csv('metadata.csv') +metadata_df = metadata_df.dropna(axis=1, how='all') +``` + +The metadata DataFrame will include columns headed PAPER ID, TITLE, DISCIPLINE and PAPER TYPE. This table displays the first five rows: + +  | PAPER ID | TITLE | DISCIPLINE | PAPER TYPE +-- | -- | -- | -- | -- +0 | BIO.G0.15.1 | Invading the Territory of Invasives: The Dange... | Biology | Argumentative Essay +1 | BIO.G1.04.1 | The Evolution of Terrestriality: A Look at the... | Biology | Argumentative Essay +2 | BIO.G3.03.1 | Intracellular Electric Field Sensing using Nan... | Biology | Argumentative Essay +3 | BIO.G0.11.1 | Exploring the Molecular Responses of Arabidops... | Biology | Proposal +4 | BIO.G1.01.1 | V. Cholerae: First Steps towards a Spatially E... | Biology | Proposal + +Notice that the paper IDs in this DataFrame are *almost* the same as the paper filenames in the corpus DataFrame. We're going to make them match exactly so we can merge the two DataFrames together on this column; in effect, linking each text with its title, discipline and genre. 
+ +To match the columns, we'll remove the `.txt` extension from the end of each filename in the corpus DataFrame using a simple `str.replace` function. This function searches for every instance of the string `.txt` in the **Filename** column and replaces it with nothing (in effect, removing it); passing `regex=False` ensures the `.` is matched literally. In the metadata DataFrame, we'll rename the paper ID column **Filename**. + +``` +# Remove .txt from the filename of each paper +paper_df['Filename'] = paper_df['Filename'].str.replace('.txt', '', regex=False) + +# Rename PAPER ID column to Filename +metadata_df.rename(columns={"PAPER ID": "Filename"}, inplace=True) +``` + +Now it is possible to merge the papers and metadata into a single DataFrame: + +``` +final_paper_df = metadata_df.merge(paper_df,on='Filename') +``` + +Check the first five rows to make sure each has a filename, title, discipline, paper type and text (the full paper). At this point, you'll also see that any extra spaces have been removed from the beginning of the texts. + + +  | Filename | TITLE | DISCIPLINE | PAPER TYPE | Text +-- | -- | -- | -- | -- | -- +0 | BIO.G0.15.1 | Invading the Territory of Invasives: The Dange... | Biology | Argumentative Essay | New York City, 1908: different colors of skin ... +1 | BIO.G1.04.1 | The Evolution of Terrestriality: A Look at the... | Biology | Argumentative Essay | The fish-tetrapod transition has been called t... +2 | BIO.G3.03.1 | Intracellular Electric Field Sensing using Nan... | Biology | Argumentative Essay | Intracellular electric fields are of great int... +3 | BIO.G0.11.1 | Exploring the Molecular Responses of Arabidops... | Biology | Proposal | Environmental stresses to plants have been stu... +4 | BIO.G1.01.1 | V. Cholerae: First Steps towards a Spatially E... | Biology | Proposal | The recurrent cholera pandemics have been rela... + +The resulting DataFrame is now ready for analysis. 
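Taken together, the filename cleanup and merge steps can be sketched end-to-end on toy data. The rows below are illustrative stand-ins for the real corpus and metadata DataFrames, not actual MICUSP records:

```
import pandas as pd

# Toy stand-ins for the corpus and metadata DataFrames
paper_df = pd.DataFrame({'Filename': ['BIO.G0.15.1.txt'],
                         'Text': ['New York City, 1908: ...']})
metadata_df = pd.DataFrame({'Filename': ['BIO.G0.15.1'],
                            'DISCIPLINE': ['Biology'],
                            'PAPER TYPE': ['Argumentative Essay']})

# Strip the .txt extension so the join columns match, then merge
paper_df['Filename'] = paper_df['Filename'].str.replace('.txt', '', regex=False)
final_paper_df = metadata_df.merge(paper_df, on='Filename')
print(final_paper_df.shape)  # (1, 4)
```

Because the merge is performed on **Filename**, any paper whose ID appears in only one of the two DataFrames would be dropped from the result; checking the row count after merging is a quick sanity check.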
+ +## Text Enrichment with spaCy +### Creating Doc Objects +To use spaCy, the first step is to load one of spaCy's Trained Models and Pipelines which will be used to perform tokenization, part-of-speech tagging, and other text enrichment tasks. A wide range of options are available ([see the full list here](https://perma.cc/UK2P-ZNM4)), and they vary based on size and language. + +We'll use `en_core_web_sm`, which has been trained on written web texts. It may not perform as accurately as the medium and large English models, but it will deliver results most efficiently. Once we've loaded `en_core_web_sm`, we can check which components it includes: `parser`, `tagger`, `lemmatizer`, and `ner` should be among those listed. + +``` +nlp = spacy.load('en_core_web_sm') + +print(nlp.pipe_names) +``` + +Now that the `nlp` function is loaded, let's test out its capacities on a single sentence. Calling the `nlp` function on a single sentence yields a Doc object. This object stores not only the original text, but also all of the linguistic annotations obtained when spaCy processed the text. + +``` +sentence = "This is 'an' example? sentence" + +doc = nlp(sentence) +``` + +Next we can call on the Doc object to get the information we're interested in. The command below loops through each token in a Doc object and prints each word in the text along with its corresponding part-of-speech: + +``` +for token in doc: +    print(token.text, token.pos_) +``` + +>``` +>This PRON +>is AUX +>' PUNCT +>an DET +>' PUNCT +>example NOUN +>? PUNCT +>sentence NOUN +>``` + +Let's try the same process on the student texts. As we'll be calling the `nlp` function on every text in the DataFrame, we should first define a function that runs `nlp` on whatever input text is given. Functions are a useful way to store operations that will be run multiple times, reducing duplications and improving code readability. 
+ +``` +def process_text(text): +    return nlp(text) +``` + +After the function is defined, use `.apply()` to apply it to every cell in a given DataFrame column. In this case, `nlp` will run on each cell in the **Text** column of the `final_paper_df` DataFrame, creating a Doc object from every student text. These Doc objects will be stored in a new column of the DataFrame called **Doc**. + +Running this function takes several minutes because spaCy is performing all the parsing and tagging tasks on each text. However, when it is complete, we can simply call on the resulting Doc objects to get parts-of-speech, named entities, and other information of interest, just as in the example of the sentence above. + +``` +final_paper_df['Doc'] = final_paper_df['Text'].apply(process_text) +``` + +### Text Reduction +#### Tokenization +A critical first step spaCy performs is tokenization, or the segmentation of strings into individual words and punctuation markers. Tokenization enables spaCy to parse the grammatical structures of a text and identify characteristics of each word, like its part-of-speech. + +To retrieve a tokenized version of each text in the DataFrame, we'll write a function that iterates through any given Doc object and returns all tokens found within it. This can be accomplished by simply putting a `def` wrapper around a `for` loop, similar to the one written above to retrieve the tokens and parts-of-speech from a single sentence. + +``` +def get_token(doc): +    tokens = [] +    for token in doc: +        tokens.append(token.text) +    return tokens +``` + +However, there's a way to write the same function that makes the code more readable and efficient. This is called list comprehension, and it involves condensing the `for` loop into a single line of code and returning a list of tokens within each text it processes: + +``` +def get_token(doc): +    return [(token.text) for token in doc] +``` + +As with the function used to create Doc objects, the `get_token` function can be applied to the DataFrame. 
In this case, we will call the function on the **Doc** column, since this is the column which stores the results from the processing done by spaCy. + +``` +final_paper_df['Tokens'] = final_paper_df['Doc'].apply(get_token) +``` + +If we compare the **Text** and **Tokens** columns, we find a couple of differences. In the table below, you'll notice that, most importantly, the words, spaces, and punctuation markers in the **Tokens** column are separated by commas, indicating that each has been parsed as an individual token. The text in the **Tokens** column is also bracketed; this indicates that the tokens have been generated as a list. + +  | Text | Tokens +-- | -- | -- +0 | New York City, 1908: different colors of skin ... | [New, York, City, ,, 1908, :, different, color... +1 | The fish-tetrapod transition has been called t... | [The, fish, -, tetrapod, transition, has, been... +2 | Intracellular electric fields are of great int... | [Intracellular, electric, fields, are, of, gre... +3 | Environmental stresses to plants have been stu... | [Environmental, stresses, to, plants, have, be... +4 | The recurrent cholera pandemics have been rela... | [The, recurrent, cholera, pandemics, have, bee... + +#### Lemmatization +Another process performed by spaCy is lemmatization, or the retrieval of the dictionary root word of each word (for example “brighten” for “brightening”). We'll perform a similar set of steps to those above to create a function to call the lemmas from the Doc object, then apply it to the DataFrame. + +``` +def get_lemma(doc): +    return [(token.lemma_) for token in doc] + +final_paper_df['Lemmas'] = final_paper_df['Doc'].apply(get_lemma) +``` + +Lemmatization can help reduce noise and refine results for researchers who are conducting keyword searches. For example, let’s compare counts of the word “write” in the original **Tokens** column and in the lemmatized **Lemmas** column. 
+ +``` +token_count = final_paper_df['Tokens'].apply(lambda x: x.count('write')).sum() +lemma_count = final_paper_df['Lemmas'].apply(lambda x: x.count('write')).sum() + +print(f'"Write" appears in the text tokens column {token_count} times.') +print(f'"Write" appears in the lemmas column {lemma_count} times.') +``` + +In response to this command, Python prints the following counts: + +>``` +>"Write" appears in the text tokens column 40 times. +>"Write" appears in the lemmas column 310 times. +>``` + +As expected, there are more instances of "write" in the **Lemmas** column, as the lemmatization process has grouped inflected word forms (writes, writing) into the base word "write." + +### Text Annotation +#### Part-of-Speech Tagging +spaCy facilitates two levels of part-of-speech tagging: coarse-grained tagging, which predicts the simple [universal part-of-speech](https://perma.cc/49ER-GXVW) of each token in a text (such as noun, verb, adjective, adverb), and detailed tagging, which uses a larger, more fine-grained set of part-of-speech tags (for example 3rd person singular present verb). The part-of-speech tags used are determined by the English language model we use. In this case, we're using the small English model, and you can explore the differences between the models on [spaCy's website](https://perma.cc/PC9E-HKHM). + +We can call the part-of-speech tags in the same way as the lemmas. Create a function to extract them from any given Doc object and apply the function to each Doc object in the DataFrame. The function we'll create will extract both the coarse- and fine-grained part-of-speech for each token (`token.pos_` and `token.tag_`, respectively). + +``` +def get_pos(doc): +   return [(token.pos_, token.tag_) for token in doc] + +final_paper_df['POS'] = final_paper_df['Doc'].apply(get_pos) +``` + +We can create a list of the part-of-speech columns to review them further. 
The first (coarse-grained) tag corresponds to a generally recognizable part-of-speech such as a noun, adjective, or punctuation mark, while the second (fine-grained) tags are a bit more difficult to decipher. + +``` +list(final_paper_df['POS']) +``` + +Here's an excerpt from spaCy's list of coarse- and fine-grained part-of-speech tags that appear in the student texts, including `PROPN, NNP` and `NUM, CD` among other pairs: + +>``` +>[[('PROPN', 'NNP'), +> ('PROPN', 'NNP'), +> ('PROPN', 'NNP'), +> ('PUNCT', ','), +> ('NUM', 'CD'), +> ('PUNCT', ':'), +> ('ADJ', 'JJ'), +> ('NOUN', 'NNS'), +> ('ADP', 'IN'), +> ('NOUN', 'NN'), +> ('NOUN', 'NN'), +> ('ADP', 'IN'), +> ('DET', 'DT'), +> ...]] +>``` + +Fortunately, spaCy has a built-in function called `explain` that can provide a short description of any tag of interest. If we try it on the tag `IN` using `spacy.explain("IN")`, the output reads 'conjunction, subordinating or preposition'. + +In some cases, you may want to extract only words with a particular part-of-speech tag for further analysis, like all of the proper nouns. A function can be written to perform this task, extracting only words which have been fitted with the proper noun tag. + +``` +def extract_proper_nouns(doc): +    return [token.text for token in doc if token.pos_ == 'PROPN'] + +final_paper_df['Proper_Nouns'] = final_paper_df['Doc'].apply(extract_proper_nouns) +``` + +Listing the proper nouns in each text can help us ascertain the texts' subjects. Let's list the proper nouns in two different texts, the text located in row 3 of the DataFrame and the text located in row 163. + +``` +list(final_paper_df.loc[[3, 163], 'Proper_Nouns']) +``` + +The first text in the list includes botany and astronomy concepts; this is likely to have been written for a biology course. 
+ +>``` +>[['Mars', +> 'Arabidopsis', +> 'Arabidopsis', +> 'LEA', +> 'COR', +> 'LEA', +> 'NASA', +> ...]] +>``` + +In contrast, the second text appears to be an analysis of Shakespeare plays and movie adaptations, likely written for an English course. + +>``` +>[['Shakespeare', +> 'Bard', +> 'Julie', +> 'Taymor', +> 'Titus', +> 'Shakespeare', +> 'Titus', +> ...]] +>``` + +Along with assisting content analyses, extracting nouns has been shown to help build more efficient topic models[^9]. + +#### Dependency Parsing +Closely related to part-of-speech tagging is 'dependency parsing', wherein spaCy identifies how different segments of a text are related to each other. Once the grammatical structure of each sentence is identified, visualizations can be created to show the connections between different words. Since we are working with large texts, our code will break down each text into sentences (spans) and then create dependency visualizers for each span. We can then visualize one span (sentence) at a time. + +``` +doc = final_paper_df['Doc'][5] +sentences = list(doc.sents) +sentence = sentences[1] +displacy.render(sentence, style="dep", jupyter=True) +``` + +{% include figure.html filename="or-en-corpus-analysis-with-spacy-02.png" alt="Dependency parse visualization of the sentence, 'There are two interesting phenomena in this research', with part-of-speech labels and arrows indicating dependencies between words." caption="Figure 2: Dependency parsing example from one sentence of one text in corpus" %} + +If you'd like to review the output of this code as raw `.html`, you can download it [here](/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy-16.html) and open it with your browser. Here, spaCy has identified relationships between pronouns, verbs, nouns and other parts of speech in one sentence. 
For example, both "two" and "interesting" modify the noun "phenomena," and the pronoun "There" is an expletive filling the noun position before "are" without adding meaning to the sentence. + +Dependency parsing makes it easy to see how removing stopwords can impact spaCy's depiction of the grammatical structure of texts. Let's compare the parse above with one of the same sentence after stopwords are removed. To do so, we'll create a function to remove stopwords from the Doc object, create a new Doc object without stopwords, and extract the part-of-speech tokens from the same sentence in the same text. Then we'll visualize the dependency parse of the same sentence as above, this time without stopwords. + +``` +def remove_stopwords(doc): +    return [token.text for token in doc if token.text not in nlp.Defaults.stop_words] + +final_paper_df['Tokens_NoStops'] = final_paper_df['Doc'].apply(remove_stopwords) + +final_paper_df['Text_NoStops'] = [' '.join(map(str, l)) for l in final_paper_df['Tokens_NoStops']] + +final_paper_df['Doc_NoStops'] = final_paper_df['Text_NoStops'].apply(process_text) + +doc = final_paper_df['Doc_NoStops'][5] +sentences = list(doc.sents) +sentence = sentences[0] + +displacy.render(sentence, style='dep', jupyter=True) +``` + +{% include figure.html filename="or-en-corpus-analysis-with-spacy-03.png" alt="Dependency parse visualization of the sentence without stopwords, 'There interesting phenomena research', with part-of-speech labels and arrows indicating dependencies between words." caption="Figure 3: Dependency parsing example from one sentence of one text in corpus without stopwords" %} + +If you'd like to review the output of this code as raw `.html`, you can download it [here](/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy-17.html). In this example, the verb of the sentence "are" has been removed, along with the adjective "two" and the words "in this" that made up the prepositional phrase. 
Not only do these removals render the sentence illegible, but they also render some of the dependencies inaccurate; "phenomena research" is here identified as a compound noun, and "interesting" as modifying "research" instead of "phenomena." + +This example demonstrates what can be lost in analysis when stopwords are removed, especially when investigating the relationships between words in a text or corpus. Since part-of-speech tagging and named entity recognition are predicated on understanding relationships between words, it's best to keep stopwords so spaCy can use all available linguistic units during the tagging process. + +Dependency parsing also enables the extraction of larger chunks of text, like noun phrases. Let's try it out: + +``` +def extract_noun_phrases(doc): +    return [chunk.text for chunk in doc.noun_chunks] + +final_paper_df['Noun_Phrases'] = final_paper_df['Doc'].apply(extract_noun_phrases) +``` + +Calling the first row in the **Noun_Phrases** column will reveal the words spaCy has classified as noun phrases. In this example, spaCy has identified a wide range of nouns and nouns with modifiers, from locations ("New York City") to phrases with adjectival descriptors ("the great melting pot"): + +>``` +>['New York City', +> 'different colors', +> 'skin swirl', +> 'the great melting pot', +> 'a cultural medley', +> 'such a metropolis', +> 'every last crevice', +> 'Earth', +> 'time', +> 'people', +> 'an unprecedented uniformity', +> 'discrete identities', +> 'Our heritages', +> 'the history texts', +> ...] +>``` + +#### Named Entity Recognition +Finally, spaCy can tag named entities in the text, such as names, dates, organizations, and locations. 
Call the full list of named entities and their descriptions using this code: + +``` +labels = nlp.get_pipe("ner").labels + +for label in labels: +    print(label + ' : ' + spacy.explain(label)) +``` + +spaCy lists the named entity tags that it recognizes, alongside their descriptions: + +>``` +>CARDINAL : Numerals that do not fall under another type +>DATE : Absolute or relative dates or periods +>EVENT : Named hurricanes, battles, wars, sports events, etc. +>FAC : Buildings, airports, highways, bridges, etc. +>GPE : Countries, cities, states +>LANGUAGE : Any named language +>LAW : Named documents made into laws. +>LOC : Non-GPE locations, mountain ranges, bodies of water +>MONEY : Monetary values, including unit +>NORP : Nationalities or religious or political groups +>ORDINAL : "first", "second", etc. +>ORG : Companies, agencies, institutions, etc. +>PERCENT : Percentage, including "%" +>PERSON : People, including fictional +>PRODUCT : Objects, vehicles, foods, etc. (not services) +>QUANTITY : Measurements, as of weight or distance +>TIME : Times smaller than a day +>WORK_OF_ART : Titles of books, songs, etc. +>``` + +We’ll create a function to extract the named entity tags from each Doc object and apply it to the Doc objects in the DataFrame, storing the named entities in a new column: + +``` +def extract_named_entities(doc): +    return [ent.label_ for ent in doc.ents] + +final_paper_df['Named_Entities'] = final_paper_df['Doc'].apply(extract_named_entities) +final_paper_df['Named_Entities'] +``` + +We can add another column with the words and phrases identified as named entities, using a distinct function name so the function above isn't overwritten: + +``` +def extract_named_entity_words(doc): +    return [ent for ent in doc.ents] + +final_paper_df['NE_Words'] = final_paper_df['Doc'].apply(extract_named_entity_words) +final_paper_df['NE_Words'] +``` + +Let's visualize the words and their named entity tags in a single text. 
Call the first text's Doc object and use `displacy.render` to visualize the text with the named entities highlighted and tagged: + +``` +doc = final_paper_df['Doc'][1] +displacy.render(doc, style='ent', jupyter=True) +``` + +{% include figure.html filename="or-en-corpus-analysis-with-spacy-04.png" alt="Visualization of a student text paragraph with named entities labeled and color-coded based on entity type." caption="Figure 4: Visualization of one text with named entity tags" %} + +If you'd like to review the output of this code as raw `.html`, you can download it [here](/assets/corpus-analysis-with-spacy/corpus-analysis-with-spacy-20.html). Named entity recognition enables researchers to take a closer look at the 'real-world objects' that are present in their texts. The rendering allows for close reading of these entities in context, their distinctions helpfully color-coded. In addition to studying named entities that spaCy automatically recognizes, you can use a training dataset to update the categories or create a new entity category, as in [this example](https://perma.cc/TLT6-U88T). + +### Download Enriched Dataset +To save the dataset of Doc objects, text reductions and linguistic annotations generated with spaCy, download the `final_paper_df` DataFrame to your local computer as a `.csv` file: + +``` +# Save DataFrame as csv in the Colab session +final_paper_df.to_csv('MICUSP_papers_with_spaCy_tags.csv') + +# Download csv from the Colab session to your computer +files.download('MICUSP_papers_with_spaCy_tags.csv') +``` + +## Analysis of Linguistic Annotations +Why are spaCy's linguistic annotations useful to researchers? Below are two examples of how researchers can use data about the MICUSP corpus, produced through spaCy, to draw conclusions about discipline and genre conventions in student academic writing. We will use the enriched dataset generated with spaCy for these examples. 
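The spaCy-derived columns can be analyzed with ordinary Python tools. For instance, tag frequencies for a single paper can be tallied with `collections.Counter`; the list below is a hypothetical stand-in for one row of the **Named_Entities** column created earlier, not an actual MICUSP result:

```
from collections import Counter

# Hypothetical named entity labels extracted from one student paper
named_entities = ['PERSON', 'DATE', 'DATE', 'GPE', 'DATE', 'ORG']

counts = Counter(named_entities)
print(counts.most_common(2))  # [('DATE', 3), ('PERSON', 1)]
```

The same pattern scales to the whole DataFrame by applying a `Counter` to each row of a tag column.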
+ +### Part-of-Speech Analysis +In this section, we'll analyze the part-of-speech tags extracted by spaCy to answer the first research question: **Do students use certain parts-of-speech more frequently in Biology texts versus English texts, and does this signify differences in disciplinary conventions?** + +spaCy counts the number of times each part-of-speech tag appears in a document (for example, the number of times the `NOUN` tag appears). This is done using `doc.count_by(spacy.attrs.POS)`. Here's how it works on a single sentence: + +``` +# Create Doc object from single sentence +doc = nlp("This is 'an' example? sentence") + +# Print counts of each part of speech in sentence +print(doc.count_by(spacy.attrs.POS)) +``` + +Upon these commands, spaCy creates a Doc object from our sentence, then prints counts of each part-of-speech along with corresponding part-of-speech indices, for example: + +>``` +>{95: 1, 87: 1, 97: 3, 90: 1, 92: 2} +>``` + +spaCy generates a dictionary where the values represent the counts of each part-of-speech term found in the text. The keys in the dictionary correspond to numerical indices associated with each part-of-speech tag. To make the dictionary more legible, let's associate the numerical index values with their corresponding part-of-speech tags. In the example below, it's now possible to see which parts-of-speech tags correspond to which counts: + +>``` +>{'AUX': 1, 'DET': 1, 'NOUN': 2, 'PRON': 1, 'PUNCT': 3} +>``` + +To get the same type of dictionary for each text in a DataFrame, we can write a function that builds such a dictionary from each Doc object. First, we'll create a new DataFrame for the purposes of part-of-speech analysis, containing the text filenames, disciplines, and Doc objects. We can then apply the function to each Doc object in the new DataFrame. In this case (and above), we are interested in the simpler, coarse-grained parts of speech. 
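The relabeling step can be sketched without spaCy, using the index values shown above. Here the index-to-tag mapping is hard-coded for illustration; in practice it is retrieved via `doc.vocab[k].text`:

```
# Counts keyed by spaCy's internal part-of-speech indices, as shown above
num_pos = {95: 1, 87: 1, 97: 3, 90: 1, 92: 2}

# Hard-coded index-to-tag mapping (normally looked up via doc.vocab[k].text)
index_to_tag = {87: 'AUX', 90: 'DET', 92: 'NOUN', 95: 'PRON', 97: 'PUNCT'}

dictionary = {index_to_tag[k]: v for k, v in sorted(num_pos.items())}
print(dictionary)  # {'AUX': 1, 'DET': 1, 'NOUN': 2, 'PRON': 1, 'PUNCT': 3}
```

Sorting the numeric keys first is what yields the alphabetized dictionary shown above.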
```
num_list = []

# Create new DataFrame for analysis purposes
pos_analysis_df = final_paper_df[['Filename','DISCIPLINE', 'Doc']]

def get_pos_tags(doc):
    # Tally the coarse-grained tags in one Doc, converting
    # numeric indices to tag labels
    dictionary = {}
    num_pos = doc.count_by(spacy.attrs.POS)
    for k,v in sorted(num_pos.items()):
        dictionary[doc.vocab[k].text] = v
    num_list.append(dictionary)
    # Return the dictionary so it is also stored in the new column
    return dictionary

pos_analysis_df['C_POS'] = pos_analysis_df['Doc'].apply(get_pos_tags)
```

From here, we'll take the part-of-speech counts and put them into a new DataFrame where we can calculate the frequency of each part-of-speech per document. In the new DataFrame, if a paper does not contain a particular part-of-speech, the cell will read `NaN` (Not a Number).

```
# Turn the list of per-document dictionaries into a DataFrame and
# insert the discipline labels as the first column
pos_counts = pd.DataFrame(num_list)
pos_counts.insert(loc=0, column='DISCIPLINE', value=pos_analysis_df['DISCIPLINE'])
pos_counts.head()
```

This table shows the DataFrame including appearance counts of each part-of-speech in English and Biology papers. Notice that our column headings define the paper discipline and the part-of-speech tags counted.
  | DISCIPLINE | ADJ | ADP | ADV | AUX | CCONJ | DET | INTJ | NOUN | NUM | PART | PRON | PROPN | PUNCT | SCONJ | VERB | SYM | X
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
0 | Biology | 180 | 174 | 62 | 106 | 42 | 137 | 1 | 342 | 29 | 29 | 41 | 101 | 196 | 16 | 139 | NaN | NaN
1 | Biology | 421 | 458 | 174 | 253 | 187 | 389 | 1 | 868 | 193 | 78 | 121 | 379 | 786 | 99 | 389 | 1.0 | 2.0
2 | Biology | 163 | 171 | 63 | 91 | 51 | 148 | 1 | 362 | 6 | 31 | 23 | 44 | 134 | 15 | 114 | 4.0 | 1.0
3 | Biology | 318 | 402 | 120 | 267 | 121 | 317 | 1 | 908 | 101 | 93 | 128 | 151 | 487 | 92 | 387 | 4.0 | NaN
4 | Biology | 294 | 388 | 97 | 142 | 97 | 299 | 1 | 734 | 89 | 41 | 36 | 246 | 465 | 36 | 233 | 1.0 | 7.0
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
160 | English | 943 | 1164 | 365 | 512 | 395 | 954 | 3 | 2287 | 98 | 315 | 530 | 406 | 1275 | 221 | 1122 | 15.0 | 8.0
161 | English | 672 | 833 | 219 | 175 | 202 | 650 | 1 | 1242 | 30 | 168 | 291 | 504 | 595 | 75 | 570 | NaN | 3.0
162 | English | 487 | 715 | 175 | 240 | 324 | 500 | 2 | 1474 | 55 | 157 | 334 | 226 | 820 | 147 | 691 | 7.0 | 5.0
163 | English | 68 | 94 | 23 | 34 | 26 | 79 | 3 | 144 | 2 | 25 | 36 | 54 | 80 | 22 | 69 | 1.0 | 2.0
164 | English | 53 | 86 | 27 | 28 | 19 | 90 | 1 | 148 | 6 | 15 | 37 | 43 | 80 | 15 | 67 | NaN | NaN
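One caveat about those `NaN` cells: pandas' `.mean()` skips `NaN` by default, so a rare tag like `SYM` is averaged only over the papers that actually use it. If you would rather treat absence as zero occurrences, fill the gaps first. A minimal sketch with made-up values (not the real corpus):

```python
import pandas as pd

# Two hypothetical Biology papers; the second contains no SYM tokens (NaN)
counts = pd.DataFrame({'DISCIPLINE': ['Biology', 'Biology'],
                       'SYM': [4.0, None]})

# .mean() skips NaN, so the average is computed over one paper only
skipna_mean = counts.groupby('DISCIPLINE')['SYM'].mean()
print(skipna_mean['Biology'])  # 4.0

# Filling NaN with 0 counts the absent tag as zero occurrences
zero_filled = counts.fillna({'SYM': 0}).groupby('DISCIPLINE')['SYM'].mean()
print(zero_filled['Biology'])  # 2.0
```

Which behavior is appropriate depends on the research question; the analysis below follows pandas' default.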
Now you can calculate the average number of times each part-of-speech appears in Biology versus English papers. To do so, use the `.groupby()` and `.mean()` functions to group all part-of-speech counts from the Biology texts together and calculate the mean usage of each part-of-speech, before doing the same for the English texts. The following code also rounds the averages to the nearest whole number:

```
average_pos_df = pos_counts.groupby(['DISCIPLINE']).mean()

average_pos_df = average_pos_df.round(0)

average_pos_df = average_pos_df.reset_index()

average_pos_df
```

Our DataFrame now contains average counts of each part-of-speech tag within each discipline (Biology and English):
  | DISCIPLINE | ADJ | ADP | ADV | AUX | CCONJ | DET | INTJ | NOUN | NUM | PART | PRON | PROPN | PUNCT | SCONJ | VERB | SYM | X
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
0 | Biology | 237.0 | 299.0 | 93.0 | 141.0 | 89.0 | 234.0 | 1.0 | 614.0 | 81.0 | 44.0 | 74.0 | 194.0 | 343.0 | 50.0 | 237.0 | 8.0 | 6.0
1 | English | 211.0 | 364.0 | 127.0 | 141.0 | 108.0 | 283.0 | 2.0 | 578.0 | 34.0 | 99.0 | 223.0 | 189.0 | 367.0 | 70.0 | 306.0 | 7.0 | 5.0
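To see at a glance which tags separate the two disciplines most sharply, you can subtract one row of averages from the other and sort by magnitude. The sketch below copies a few values from the table above (a subset of tags, for illustration):

```python
import pandas as pd

# Average per-paper counts copied from the table above (subset of tags)
averages = pd.DataFrame(
    {'ADJ': [237.0, 211.0], 'NUM': [81.0, 34.0],
     'PRON': [74.0, 223.0], 'VERB': [237.0, 306.0]},
    index=['Biology', 'English'])

# Positive values: tag used more in English papers; negative: more in Biology
contrast = averages.loc['English'] - averages.loc['Biology']
print(contrast.sort_values(key=abs, ascending=False))
```

Ranked this way, pronouns show the largest contrast in this subset (149 more per English paper, on average), followed by verbs (+69) and numerals (-47).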
Here we can examine the differences between average part-of-speech usage per discipline. As suspected, Biology student papers use slightly more adjectives (237 per paper on average) than English student papers (211 per paper on average), while an even greater number of verbs (306) are used on average in English papers than in Biology papers (237). Another interesting contrast is in the `NUM` tag: almost 50 more numbers are used in Biology papers, on average, than in English papers. Given the conventions of scientific research, this does make sense; studies are much more frequently quantitative, incorporating lab measurements and statistical calculations.

We can visualize these differences using a bar graph:

{% include figure.html filename="or-en-corpus-analysis-with-spacy-05.png" alt="Bar chart depicting average use of adjectives, verbs and numbers in English versus Biology papers, showing verbs used most and numbers used least in both disciplines, more verbs used in English papers and more adjectives and numbers used in Biology papers." caption="Figure 5: Bar graph showing verb use, adjective use and numeral use, on average, in Biology and English papers" %}

Though admittedly a simple analysis, calculating part-of-speech frequency counts affirms prior studies which posit a correlation between lexico-grammatical features and disciplinary conventions, suggesting this application of spaCy can be adapted to serve other researchers' corpora and part-of-speech usage queries[^10].

### Fine-Grained Part-of-Speech Analysis
The same type of analysis could be performed using the fine-grained part-of-speech tags; for example, we could look at how frequently Biology and English students use different sub-groups of verbs.
Fine-grained tagging can be deployed in a similar loop to those above, but instead of retrieving the `token.pos_` for each word, we call spaCy to retrieve the `token.tag_`:

```
tag_num_list = []

def get_fine_pos_tags(doc):
    # Tally the fine-grained tags in one Doc, converting
    # numeric indices to tag labels
    dictionary = {}
    num_tag = doc.count_by(spacy.attrs.TAG)
    for k,v in sorted(num_tag.items()):
        dictionary[doc.vocab[k].text] = v
    tag_num_list.append(dictionary)
    return dictionary

pos_analysis_df['F_POS'] = pos_analysis_df['Doc'].apply(get_fine_pos_tags)
```

Again, we can calculate the average number of times each fine-grained part-of-speech appears in Biology versus English papers. As with the coarse-grained tags, we first assemble the counts into a DataFrame with a `DISCIPLINE` column, then use the `.groupby()` and `.mean()` functions:

```
# Assemble the fine-grained counts, as we did for the coarse-grained tags
tag_counts = pd.DataFrame(tag_num_list)
tag_counts.insert(loc=0, column='DISCIPLINE', value=pos_analysis_df['DISCIPLINE'])

average_tag_df = tag_counts.groupby(['DISCIPLINE']).mean()

average_tag_df = average_tag_df.round(0)

average_tag_df = average_tag_df.reset_index()

average_tag_df
```

Now, our DataFrame contains average counts of each fine-grained part-of-speech:
  | DISCIPLINE | POS | RB | JJR | NNS | IN | VBG | RBR | RBS | -RRB- | ... | FW | LS | WP$ | NFP | AFX | $ | `` | XX | ADD | ''
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
0 | Biology | 5.0 | 94.0 | 10.0 | 198.0 | 339.0 | 35.0 | 6.0 | 4.0 | 38.0 | ... | 2.0 | 3.0 | 1.0 | 16.0 | 3.0 | 6.0 | 2.0 | 5.0 | 3.0 | 2.0
1 | English | 35.0 | 138.0 | 7.0 | 141.0 | 414.0 | 50.0 | 6.0 | 3.0 | 25.0 | ... | 2.0 | 2.0 | 2.0 | 3.0 | NaN | 1.0 | 3.0 | 5.0 | 3.0 | 5.0
spaCy identifies around 50 fine-grained part-of-speech tags, of which roughly 20 are visible in the DataFrame above; the ellipses in the central column indicate further columns that are not shown. Researchers can investigate trends in the average usage of any or all of them. For example, is there a difference in the average usage of past tense versus present tense verbs in English and Biology papers? Three fine-grained tags that could help with this analysis are `VBD` (past tense verbs), `VBP` (non-third-person singular present tense verbs), and `VBZ` (third-person singular present tense verbs). Readers may find it useful to review [a full list](https://github.com/explosion/spaCy/blob/master/spacy/glossary.py) of the fine-grained part-of-speech tags that spaCy generates.

{% include figure.html filename="or-en-corpus-analysis-with-spacy-06.png" alt="Bar chart depicting average use of three verb types (past-tense, third- and non-third person present tense) in English versus Biology papers, showing third-person present tense verbs used most in both disciplines, many more third-person present tense verbs used in English papers than the other two types and more past tense verbs used in Biology papers." caption="Figure 6: Graph of average usage of three verb types (past tense, third- and non-third person present tense) in English and Biology papers" %}

Graphing these annotations reveals a fairly even distribution of the three verb types in Biology papers. In English papers, however, an average of 130 third-person singular present tense verbs appear per paper, compared to around 40 for each of the other two categories. What these differences indicate about the disciplines is not immediately discernible, but they do demonstrate spaCy's value in surfacing patterns of linguistic annotations for further exploration through computational and close-reading methods.

The analyses above are only a couple of many possible applications for part-of-speech tagging.
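To make such a comparison concrete, one could select just those three tag columns and compute a past-to-present ratio per discipline. The numbers below are illustrative stand-ins shaped like the Figure 6 description, not the real corpus values (which live in `average_tag_df`):

```python
import pandas as pd

# Illustrative per-paper averages for the three verb tags
# (VBD = past tense; VBP/VBZ = present tense) -- made-up values
tags = pd.DataFrame({'VBD': [70.0, 40.0], 'VBP': [60.0, 45.0], 'VBZ': [75.0, 130.0]},
                    index=['Biology', 'English'])

# Ratio of past-tense to present-tense verbs per discipline
past_vs_present = tags['VBD'] / (tags['VBP'] + tags['VBZ'])
print(past_vs_present.round(2))
```

With these stand-in numbers, Biology papers come out at roughly 0.52 past-tense verbs per present-tense verb versus about 0.23 for English, echoing the skew toward third-person present tense visible in Figure 6.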
Part-of-speech tagging is also useful for [research questions about sentence *intent*](https://perma.cc/QXH6-V6FF): the meaning of a text changes depending on whether the past, present, or infinitive form of a particular verb is used. Equally useful for such tasks as word sense disambiguation and language translation, part-of-speech tagging is additionally a building block of named entity recognition, the focus of the analysis below.

### Named Entity Analysis
In this section, you'll use the named entity tags extracted with spaCy to investigate the second research question: **Do students use certain named entities more frequently in different academic genres, and does this signify differences in genre conventions?**

To start, we'll create a new DataFrame with the text filenames, types (genres), and named entity words and tags:

```
ner_analysis_df = final_paper_df[['Filename','PAPER TYPE', 'Named_Entities', 'NE_Words']]
```

Using the `str.count` method, we can get counts of a specific named entity used in each text. Let's get the counts of the named entities of interest here (`PERSON`, `ORG`, `DATE`, and `WORK_OF_ART`) and add them as new columns of the DataFrame.
```
ner_analysis_df['Named_Entities'] = ner_analysis_df['Named_Entities'].apply(lambda x: ' '.join(x))

person_counts = ner_analysis_df['Named_Entities'].str.count('PERSON')
org_counts = ner_analysis_df['Named_Entities'].str.count('ORG')
date_counts = ner_analysis_df['Named_Entities'].str.count('DATE')
woa_counts = ner_analysis_df['Named_Entities'].str.count('WORK_OF_ART')

ner_counts_df = pd.DataFrame()
ner_counts_df['Genre'] = ner_analysis_df["PAPER TYPE"]
ner_counts_df['PERSON_Counts'] = person_counts
ner_counts_df['ORG_Counts'] = org_counts
ner_counts_df['DATE_Counts'] = date_counts
ner_counts_df['WORK_OF_ART_Counts'] = woa_counts
```

Reviewing the DataFrame now, our column headings define each paper's genre and the four named entities (`PERSON`, `ORG`, `DATE`, and `WORK_OF_ART`) whose usage we've counted:

  | Genre | PERSON_Counts | ORG_Counts | DATE_Counts | WORK_OF_ART_Counts
-- | -- | :--: | :--: | :--: | :--:
0 | Argumentative Essay | 9 | 3 | 20 | 3
1 | Argumentative Essay | 90 | 13 | 151 | 6
2 | Argumentative Essay | 0 | 0 | 2 | 2
3 | Proposal | 11 | 6 | 21 | 4
4 | Proposal | 44 | 7 | 65 | 3

From here, we can compare the average usage of each named entity and plot across paper type.

{% include figure.html filename="or-en-corpus-analysis-with-spacy-07.png" alt="Bar chart depicting average use of named entities across seven genres, with highest counts of PERSON and DATE tags across all genres, with more date tags used in proposals, research papers and creative writing papers and more person tags used in argumentative essays, critique/evaluations, reports and response papers." caption="Figure 7: Bar chart depicting average use of Person, Organization, Date, and Work of Art named entities across genres" %}

As hypothesized at the start of this lesson: more dates and numbers are used in description-heavy proposals and research papers, while more people and works of art are referenced in arguments and critiques/evaluations.
Both of these hypotheses are predicated on engaging with and assessing other scholarship.

Interestingly, people and dates are used the most frequently on average across all genres, likely because these entities often appear in citations. Dates are invoked most frequently in proposals and research papers. Though this should be investigated further through close reading, it does follow that these genres would use dates frequently, since they are often grounded in real-world events being reported or proposed.

### Analysis of `DATE` Named Entities
Let's explore patterns in the usage of one of these entities (`DATE`) further by retrieving the words most frequently tagged as dates in various genres. You'll do this by first creating a function to extract the words tagged as date entities in each document, then adding these words to a new DataFrame column:

```
# Extract the words tagged as DATE entities in each Doc
def extract_date_named_entities(doc):
    return [ent for ent in doc.ents if ent.label_ == 'DATE']

ner_analysis_df['Date_Named_Entities'] = final_paper_df['Doc'].apply(extract_date_named_entities)

# Join each Doc's date entities into a single string
ner_analysis_df['Date_Named_Entities'] = [', '.join(map(str, l)) for l in ner_analysis_df['Date_Named_Entities']]
```

Now we can retrieve only the subset of papers that are in the proposal genre, get the words most frequently tagged as dates in these papers, and append them to a list:

```
# Search for only date words in proposal papers
date_word_counts_df = ner_analysis_df[(ner_analysis_df == 'Proposal').any(axis=1)]

# Count the frequency of each word in these papers
date_word_frequencies = date_word_counts_df.Date_Named_Entities.str.split(expand=True).stack().value_counts()

# Get the 10 most common words and their frequencies
date_word_frequencies[:10]
```

This code outputs the 10 words most frequently labeled with the `DATE` named entity tag in Proposal papers (the trailing commas are artifacts of splitting the joined entity strings on whitespace):

>```
>2004, 24
>2003, 18
>the 17
>2002, 12
>2005, 11
>1998, 11
>2000, 9
>year, 9
>1977, 8
>season, 8
>```

The majority are standard 4-digit dates; though further analysis is certainly needed to confirm, these date entities seem to
indicate the presence of citations. This fits with our expectations of the proposal genre, which requires references to prior scholarship to justify students' proposed claims.

Let's contrast this with the top `DATE` entities in Critique/Evaluation papers:

```
# Search for only date words in critique/evaluation papers
date_word_counts_df = ner_analysis_df[(ner_analysis_df == 'Critique/Evaluation').any(axis=1)]

# Count the frequency of each word in these papers and append to list
date_word_frequencies = date_word_counts_df.Date_Named_Entities.str.split(expand=True).stack().value_counts()

# Get top 10 most common words and their frequencies
date_word_frequencies[:10]
```

Now, the code outputs the 10 words most frequently labeled with the `DATE` named entity tag in Critique/Evaluation papers:

>```
>the 10
>winter, 8
>years, 6
>2009 5
>1950, 5
>1960, 5
>century, 4
>decade, 3
>of 3
>decades, 3
>```

Here, only three of the most frequently tagged `DATE` entities are standard 4-digit dates; the rest are noun references to relative dates or periods. This, too, may indicate genre conventions, such as the need to provide context and/or center an argument in relative space and time in evaluative work. Future research could analyze chains of named entities (and parts-of-speech) to get a better understanding of how these features together indicate larger rhetorical tactics.

## Conclusions
Through this lesson, we've gleaned more information about the grammatical makeup of a text corpus. Such information can be valuable to researchers seeking to understand differences between texts in their corpus: What types of named entities are most common across the corpus? How frequently are certain words used as nouns versus objects within individual texts and corpora? What may these frequencies reveal about the content or themes of the texts themselves?
While we've covered the basics of spaCy in this lesson, it has other capacities, such as word vectorization and custom rule-based tagging, that are certainly worth exploring in more detail. This lesson's code can also be altered to work with custom feature sets. A great example of working with custom feature sets is Susan Grunewald and Andrew Janco's lesson, [Finding Places in Text with the World Historical Gazetteer](/en/lessons/finding-places-world-historical-gazetteer#4-building-a-gazetteer), in which spaCy is leveraged to identify place names of German prisoner-of-war camps in World War II memoirs, drawing on a historical gazetteer of camp names.

spaCy is an equally helpful tool for exploring texts without fully-formed research questions in mind. Exploring linguistic annotations can propel further research questions and guide the development of text-mining methods.

Ultimately, this lesson has provided a foundation for corpus analysis with spaCy. Whether you wish to investigate language use in student papers, novels, or another large collection of texts, this code can be repurposed for your use.

## Endnotes
[^1]: Matthew Brooke O'Donnell and Ute Römer, "From student hard drive to web corpus (part 2): The annotation and online distribution of the Michigan Corpus of Upper-level Student Papers (MICUSP)," *Corpora* 7, no. 1 (2012): 1–18. [https://doi.org/10.3366/cor.2012.0015](https://doi.org/10.3366/cor.2012.0015).

[^2]: Jack Hardy and Ute Römer, "Revealing disciplinary variation in student writing: A multi-dimensional analysis of the Michigan Corpus of Upper-level Student Papers (MICUSP)," *Corpora* 8, no. 2 (2013): 183–207. [https://doi.org/10.3366/cor.2013.0040](https://doi.org/10.3366/cor.2013.0040).

[^3]: Laura Aull, "Linguistic Markers of Stance and Genre in Upper-Level Student Writing," *Written Communication* 36, no. 2 (2019): 267–295. [https://doi.org/10.1177/0741088318819472](https://doi.org/10.1177/0741088318819472).
[^4]: Sugene Kim, "‘Two rules are at play when it comes to none’: A corpus-based analysis of singular versus plural none: Most grammar books say that the number of the indefinite pronoun none depends on formality level; corpus findings show otherwise," *English Today* 34, no. 3 (2018): 50–56. [https://doi.org/10.1017/S0266078417000554](https://doi.org/10.1017/S0266078417000554).

[^5]: Carol Berkenkotter and Thomas Huckin, *Genre knowledge in disciplinary communication: Cognition/culture/power* (Lawrence Erlbaum Associates, Inc., 1995).

[^6]: Jack Hardy and Eric Friginal, "Genre variation in student writing: A multi-dimensional analysis," *Journal of English for Academic Purposes* 22 (2016): 119–131. [https://doi.org/10.1016/j.jeap.2016.03.002](https://doi.org/10.1016/j.jeap.2016.03.002).

[^7]: Jack Hardy and Ute Römer, "Revealing disciplinary variation in student writing: A multi-dimensional analysis of the Michigan Corpus of Upper-level Student Papers (MICUSP)," *Corpora* 8, no. 2 (2013): 183–207. [https://doi.org/10.3366/cor.2013.0040](https://doi.org/10.3366/cor.2013.0040).

[^8]: Alexandra Schofield, Måns Magnusson and David Mimno, "Pulling Out the Stops: Rethinking Stopword Removal for Topic Models," *Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics* 2 (2017): 432–436. [https://aclanthology.org/E17-2069](https://perma.cc/JAN8-N296).

[^9]: Fiona Martin and Mark Johnson, "More Efficient Topic Modelling Through a Noun Only Approach," *Proceedings of the Australasian Language Technology Association Workshop* (2015): 111–115. [https://aclanthology.org/U15-1013](https://perma.cc/QH7M-42S3).

[^10]: Jack Hardy and Ute Römer, "Revealing disciplinary variation in student writing: A multi-dimensional analysis of the Michigan Corpus of Upper-level Student Papers (MICUSP)," *Corpora* 8, no. 2 (2013): 183–207. [https://doi.org/10.3366/cor.2013.0040](https://doi.org/10.3366/cor.2013.0040).