# LucidProgramming -- Natural Language Processing in Python: Part 5
# Part 5 Blog Post: http://vprusso.github.io/blog/2018/natural-language-processing-python-5/
# Part 5 YouTube Video: https://www.youtube.com/watch?v=P2PMgnQSHYQ
# In this tutorial, we shall focus on **stemming** and
# **lemmatization**.
"""
Stemming
"""
# Let us first focus on the notion of stemming. According
# to Wikipedia: https://en.wikipedia.org/wiki/Stemming
# "Stemming is the process of reducing inflected (or sometimes
# derived) words to their word stem, base, or root form--generally
# a written word form."
# That definition is a bit hard to follow, so let us consider
# an example.
# Take the word "fishing". This word is based on the so-called
# stem, that is, the word "fish". Likewise, "fished", "fisher",
# etc. all share the stem "fish".
# Writing your own function to determine the stem of a word is
# possible, although there are many potential edge cases. Many
# of these edge cases are automatically accounted for via the
# stemming tools provided by NLTK.
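# To make the "edge cases" remark above concrete, here is a minimal
# sketch of a hand-written suffix stripper. The function name
# `naive_stem`, its suffix list, and the minimum stem length of 3 are
# all assumptions made for illustration; this is not how NLTK's
# stemmers are implemented.
def naive_stem(word, suffixes=("ing", "ed", "er", "s"), min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word


for naive_word in ["fishing", "fished", "fisher", "running", "sing"]:
    print(naive_word, "->", naive_stem(naive_word))
# "fishing", "fished", and "fisher" all reduce to "fish", and the
# length guard leaves "sing" alone, but "running" becomes "runn"
# rather than "run": one of the many edge cases that a real stemmer
# has to handle.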
# Applications of Stemming:
# According to the previously mentioned Wikipedia article on
# stemming:
# "Stemming is used as an approximate method for grouping words
# with a similar basic meaning together. For example, a text
# mentioning "daffodils" is probably closely related to a
# text mentioning "daffodil" (without the "s"). But in some
# cases, words with the same stem have idiomatic meanings which
# are not closely related: a user searching for "marketing" will
# not be satisfied by most documents mentioning "markets" but
# not "marketing"".
# One well-known application of stemming is web search. For
# instance, searching Google for the term "fish" will also
# yield results for the term "fishing", since "fish" is the
# stem of "fishing" and the two words are most likely related.
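# The grouping idea described above can be sketched as follows. This
# snippet assumes NLTK's `PorterStemmer`; the helper name
# `group_by_stem` and the sample words are hypothetical and chosen
# only for illustration.
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words):
    """Map each stem to the list of input words that reduce to it."""
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for w in words:
        groups[stemmer.stem(w)].append(w)
    return dict(groups)


stem_groups = group_by_stem(
    ["fish", "fishing", "daffodil", "daffodils", "markets", "marketing"]
)
for stem_key, members in stem_groups.items():
    print(stem_key, "<-", members)
# A search for "fish" can then be matched against every word in the
# same group. Note that "markets" and "marketing" also share a group,
# which is exactly the kind of mismatch the quotation above warns about.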
# One of the stemming algorithms used via NLTK is the so-called
# **Porter Stemmer**:
# (http://www.cs.odu.edu/~jbollen/IR04/readings/readings5.pdf)
from nltk.stem import PorterStemmer
# Let us attempt to determine the stem for the following words in
# this word list:
porter = PorterStemmer()
word_list = ["connected", "connecting", "connection", "connections"]
for word in word_list:
    print(porter.stem(word))
# The Porter Stemmer identifies "connect" as the stem for
# each of the words in the list above.
# Let us take another example list of words:
word_list = ["argue", "argued", "argues", "arguing", "argus"]
for word in word_list:
    print(porter.stem(word))
# Note that the terms "stem" and "root" are not interchangeable. The
# word "argue" is the root word of the above word list, but the stem
# that the Porter Stemmer produces is "argu".
# NLTK also provides access to a number of other stemmer algorithms.
from nltk.stem import LancasterStemmer
from nltk.stem import SnowballStemmer
lancaster = LancasterStemmer()
snowball = SnowballStemmer(language='english')
# Using the Lancaster Stemmer on the "argue" word list:
for word in word_list:
    print(lancaster.stem(word))
# Using the Snowball Stemmer on the "argue" word list:
for word in word_list:
    print(snowball.stem(word))
# Notice that the stemming algorithms do not all produce the same
# output. Delving into how each of these stemming algorithms works,
# along with the pros and cons of each, is beyond the scope of this
# video. However, if you would like a high-level overview of when to
# use a particular stemming algorithm for your purposes, the
# following StackOverflow answer by Slater Tyranus provides a
# well-written and concise summary of each:
# https://stackoverflow.com/questions/10554052/what-are-the-major-differences-and-benefits-of-porter-and-lancaster-stemming-alg/11210358
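# To see the differences at a glance, the three stemmers can be run
# side by side. This is only a convenience sketch: the helper name
# `compare_stemmers` is hypothetical, and the imports are repeated so
# that the snippet runs on its own.
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer


def compare_stemmers(words):
    """Return {word: {stemmer_name: stem}} for three NLTK stemmers."""
    stemmers = {
        "porter": PorterStemmer(),
        "lancaster": LancasterStemmer(),
        "snowball": SnowballStemmer(language="english"),
    }
    return {w: {name: s.stem(w) for name, s in stemmers.items()} for w in words}


comparison = compare_stemmers(["argue", "argued", "argues", "arguing", "argus"])
for compared_word, stems_by_name in comparison.items():
    print(compared_word, stems_by_name)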
"""
Lemmatizing
"""
# According to Wikipedia, the definition of lemmatization is:
# https://en.wikipedia.org/wiki/Lemmatisation
# "The process of grouping together the inflected forms of
# a word so they can be analyzed as a single item, identified by
# the word's lemma, or dictionary form."
# Lemmatization and stemming are related, but different.
# The difference is that a stemmer operates on a single word
# *without* knowledge of the context, and therefore cannot
# discriminate between words which have different meanings
# depending on their part of speech.
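# The role of context can be illustrated without any library at all.
# The lookup table below is a hypothetical toy, not WordNet data: it
# only shows that a lemmatizer keys on (word, part of speech), while a
# stemmer sees the bare word.
TOY_LEMMAS = {
    ("meeting", "n"): "meeting",  # as in "a meeting was held"
    ("meeting", "v"): "meet",     # as in "we are meeting tomorrow"
}


def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_lemmatize("meeting", "n"))
print(toy_lemmatize("meeting", "v"))
# The same surface form maps to two different lemmas depending on its
# part of speech, a distinction that a context-free stemmer cannot make.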
# Let us consider some examples of lemmatization and also
# of stemming to consider the contrast between the two ideas.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# The `WordNetLemmatizer` class has a method called `lemmatize` which
# takes as arguments a word to lemmatize as well as the part of speech
# of that word, e.g. noun, verb, or adverb.
# Let us attempt to determine the lemma for the word "bats":
print(lemmatizer.lemmatize("bats"))
# By default, the part of speech is noun (unless specified otherwise).
# Note that the lemmatizer identifies "bat" as the lemma of the
# plural "bats".
# Note that "bats" can be considered a noun, as in the plural of the
# animal, but it may also be considered a verb, as in to "hit at"
# something.
# We can specify which part of speech to treat the word as via the
# optional `pos` argument, which stands for "part of speech":
print(lemmatizer.lemmatize("bats", pos="v"))
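# In practice the part of speech often comes from a tagger rather than
# being typed by hand. Assuming the tags follow the Penn Treebank
# convention (as the default tags returned by `nltk.pos_tag` do), a
# small helper can translate them into the single-letter codes that
# `pos` expects. The helper name `penn_to_wordnet_pos` is hypothetical,
# and falling back to noun mirrors the lemmatizer's default.
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to "a", "v", "r", or the default "n"."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # nouns and everything else


for penn_tag in ["NNS", "VBZ", "JJR", "RB"]:
    print(penn_tag, "->", penn_to_wordnet_pos(penn_tag))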
# Let us now consider lemmatizing the word "better". In fact, let us lemmatize
# this word when the term better is an adjective, adverb, noun, and verb,
# respectively.
# Adjective:
print(lemmatizer.lemmatize("better", pos="a"))
# Adverb:
print(lemmatizer.lemmatize("better", pos="r"))
# Noun:
print(lemmatizer.lemmatize("better", pos="n"))
# Verb:
print(lemmatizer.lemmatize("better", pos="v"))
# Notice that "better" lemmatizes to itself when it is considered a
# noun or a verb, whereas it lemmatizes to "good" as an adjective
# and to "well" as an adverb.
# If you consult Google's dictionary tool, you will notice that its
# entries agree with this categorization.