- Python 2.7
- NLTK (Natural Language Toolkit) 3.2.2
- TwitterSearch
- Django 1.9.11
- Semantic UI
- jQuery
- Tokenization
Using : http://www.nltk.org/_modules/nltk/tokenize/casual.html#TweetTokenizer
Example :
>>> from nltk.tokenize import TweetTokenizer
>>> tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
>>> TweetTokenizer().tokenize(tweet)
['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']
- Removing stop words
>>> from nltk.corpus import stopwords
>>> english_stops = set(stopwords.words('english'))
>>> "is" in english_stops
True
>>> "ganteng" in english_stops
False
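The membership test above extends naturally to filtering a whole token list. The following is a minimal sketch: `SAMPLE_STOPS` is a small hand-picked set used here only so the example runs without the NLTK corpus (in the project it would be `set(stopwords.words('english'))`), and `remove_stop_words` is a hypothetical helper name.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
SAMPLE_STOPS = set(["this", "is", "a", "and", "some", "the"])

def remove_stop_words(tokens, stops=SAMPLE_STOPS):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stops]

tokens = ["This", "is", "a", "cooool", "#dummysmiley", "and", "some", "arrows"]
print(remove_stop_words(tokens))
# ['cooool', '#dummysmiley', 'arrows']
```

Comparing on the lowercase form means "This" and "this" are treated the same, even if lowercasing is applied as a separate step later.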
- Stemming with the Porter algorithm
Algorithm : http://snowball.tartarus.org/algorithms/porter/stemmer.html
Simple Explanation :
1.a
$sses -> $ss | caresses -> caress
$ies -> $i | ponies -> poni
$ss -> $ss | caress -> caress
$s -> $ | cats -> cat
1.b
2.(for long stems)
$ational -> $ate | relational -> relate
$izer -> $ize | digitizer -> digitize
3.(for long stems)
$al -> $ | revival -> reviv
$able -> $ | adjustable -> adjust
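The rules listed above can be sketched in plain Python. This illustrates only those rules, not the full Porter algorithm (NLTK ships a complete one as `nltk.stem.PorterStemmer`), so results can differ from a real stemmer. The helper names `measure`, `apply_rules` and `simple_stem` are hypothetical. The "long stem" test is assumed to be Porter's measure m > 0 for step 2 and m > 1 for step 3, and only a, e, i, o, u are treated as vowels.

```python
import re

VOWELS = "aeiou"  # simplification: 'y' is always treated as a consonant

def measure(stem):
    """Return Porter's m: the number of vowel-run/consonant-run (VC) pairs."""
    pattern = "".join("V" if ch in VOWELS else "C" for ch in stem)
    collapsed = re.sub(r"(.)\1+", r"\1", pattern)  # e.g. "CVVCC" -> "CVC"
    return collapsed.count("VC")

# (suffix, replacement) pairs taken from the rules listed above.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
STEP_2 = [("ational", "ate"), ("izer", "ize")]
STEP_3 = [("al", ""), ("able", "")]

def apply_rules(word, rules, min_measure=None):
    """Apply the first rule whose suffix matches, then stop."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            stem = word[:len(word) - len(suffix)]
            if min_measure is None or measure(stem) > min_measure:
                return stem + replacement
            return word  # suffix matched, but the stem is too short
    return word

def simple_stem(word):
    """Run steps 1a, 2 and 3 in order on a lowercased word."""
    word = apply_rules(word.lower(), STEP_1A)
    word = apply_rules(word, STEP_2, min_measure=0)
    word = apply_rules(word, STEP_3, min_measure=1)
    return word

for w in ["caresses", "ponies", "cats", "relational", "revival", "adjustable"]:
    print("%s -> %s" % (w, simple_stem(w)))
```

Each step applies at most one rule, and the rules within a step are ordered longest suffix first so that "sses" is tried before "ss" and "s".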
- Lowercasing using Python's built-in .lower() method
>>> "TwitterPostTweet".lower()
'twitterposttweet'
- Feature extraction using binary term frequency (a term is True if it appears at all, no matter how many times).
>>> tweet = ["apple", "product", "best", "use", "apple", "forever"]
>>> extraction_feature(tweet)
{"apple": True, "product": True, "best": True, "use": True, "forever": True}
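The body of `extraction_feature` is not shown here, so the following is only a guess at a minimal version, assuming it takes a list of already preprocessed tokens and records presence rather than counts.

```python
def extraction_feature(tokens):
    """Map each distinct token to True (binary term frequency)."""
    # A repeated token such as "apple" still yields a single True entry.
    return {token: True for token in tokens}

tweet = ["apple", "product", "best", "use", "apple", "forever"]
features = extraction_feature(tweet)
print(sorted(features))
# ['apple', 'best', 'forever', 'product', 'use']
```

A dict of this shape is what NLTK classifiers expect as a feature set, which is why presence is stored as `True` instead of building a plain set.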
- Classification using NaiveBayesClassifier (NLTK)
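As a rough sketch of how the binary features feed the classifier, the example below trains `nltk.NaiveBayesClassifier` on a tiny made-up data set. The training tweets and the "pos"/"neg" labels are hypothetical and exist only for illustration; the project's real training data and label names are not shown here.

```python
import nltk

# Hypothetical toy training data: (feature dict, label) pairs.
train_set = [
    ({"good": True, "love": True}, "pos"),
    ({"great": True}, "pos"),
    ({"bad": True}, "neg"),
    ({"hate": True, "awful": True}, "neg"),
]

# train() estimates P(label) and P(feature | label) from the pairs.
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify a new tweet that has already been turned into binary features.
label = classifier.classify({"love": True, "great": True})
print(label)
```

With data this small the result says nothing about real accuracy. `classifier.prob_classify(...)` can be used instead of `classify` when the per-label probabilities are needed.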