Merge pull request #11 from ahangarha/improve-clean-tweet
Improve clean tweet
behrouzbakhtiari authored Mar 1, 2020
2 parents 4b70503 + 9b31f93 commit ebc1774
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions twc.py
@@ -70,10 +70,12 @@ def remove_emoji(tweet):
def clean_tweet(tweet):
    tweet = str(tweet)
    tweet = tweet.lower()
    tweet = tweet.replace("#", "") # remove # so we preserve hashtags for the cloud
    tweet = tp.clean(tweet)
    tweet = remove_emoji(tweet)
    normalizer = Normalizer()
    tweet = normalizer.normalize(tweet)
    tweet = re.sub(r'ن?می[‌]\S+','',tweet) # removes verbs such as می‌شود or نمی‌گویند
    tokens = word_tokenize(tweet)
    tokens = [token for token in tokens if token not in stopwords.persian]
    tokens = [token for token in tokens if token not in stopwords.english]
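The character inside `[...]` in the `re.sub` pattern is an invisible zero-width non-joiner (U+200C), which is easy to lose when copying the line. The sketch below isolates the `#` removal and the verb-stripping regex using only the standard library, with the code points written as escapes. It is an illustration under assumptions, not the code in `twc.py`: the helper name `strip_hash_and_mi_verbs` and the constants are hypothetical, and it assumes the text already uses Persian yeh (U+06CC), as it would after normalization.

```python
import re

# Code points written as escapes so the invisible character is explicit.
ZWNJ = "\u200c"        # ZERO WIDTH NON-JOINER
NOON = "\u0646"        # Persian/Arabic letter noon (negation prefix)
MI = "\u0645\u06cc"    # meem + Persian yeh, the "mi" verb prefix

# Optional noon, then "mi", then a ZWNJ, then the rest of the token
# up to the next whitespace character.
VERB_PREFIX_RE = re.compile(NOON + "?" + MI + ZWNJ + r"\S+")


def strip_hash_and_mi_verbs(text):
    """Hypothetical helper: drop '#' characters, then remove mi-prefixed verbs."""
    text = text.replace("#", "")  # keep the hashtag word, drop only the symbol
    return VERB_PREFIX_RE.sub("", text)


if __name__ == "__main__":
    sample = "#tag " + MI + ZWNJ + "\u0634\u0648\u062f" + " ok"
    # split() collapses the double space left behind by the removed verb
    print(strip_hash_and_mi_verbs(sample).split())
```

Because the pattern requires the ZWNJ right after the prefix, a word that merely starts with the same two letters and has no ZWNJ is left alone.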