Commit

Merge pull request behrouzbakhtiari#11 from ahangarha/improve-clean-tweet

Improve clean tweet
behrouzbakhtiari authored Mar 1, 2020
2 parents a439fd6 + cfe28c3 commit f8f8c82
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions twc.py
@@ -70,10 +70,12 @@ def remove_emoji(tweet):
def clean_tweet(tweet):
    tweet = str(tweet)
    tweet = tweet.lower()
    tweet = tweet.replace("#", "") # remove # so we preserve hashtags for the cloud
    tweet = tp.clean(tweet)
    tweet = remove_emoji(tweet)
    normalizer = Normalizer()
    tweet = normalizer.normalize(tweet)
    tweet = re.sub(r'ن?می[‌]\S+','',tweet) # removes verbs such as می‌شود or نمی‌گویند
    tokens = word_tokenize(tweet)
    tokens = [token for token in tokens if token not in stopwords.persian]
    tokens = [token for token in tokens if token not in stopwords.english]
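
The hunk above contains two steps that are easy to check in isolation: stripping the `#` character and removing verbs that start with می (optionally preceded by the negation ن) followed by a zero-width non-joiner. The following is a minimal sketch of those two steps only, not the project's code: `strip_hash_and_mi_verbs` is a hypothetical helper name, and the zero-width non-joiner is written as an explicit `\u200c` escape (an assumption made for readability; the diff uses the literal invisible character inside `[...]`, which matches the same thing).

```python
import re

# Zero-width non-joiner (U+200C), the invisible character that separates
# the prefix "می" from the verb stem in standard Persian orthography.
ZWNJ = "\u200c"

# Equivalent to the pattern in the diff: optional "ن", then "می", then ZWNJ,
# then one or more non-whitespace characters.
VERB_PATTERN = re.compile("ن?می" + ZWNJ + r"\S+")


def strip_hash_and_mi_verbs(text):
    """Hypothetical helper: applies only the '#' removal and the verb removal."""
    text = text.replace("#", "")       # keep the hashtag word, drop the symbol
    text = VERB_PATTERN.sub("", text)  # drop می‌... / نمی‌... tokens
    return text


if __name__ == "__main__":
    sample = "#تهران او می" + ZWNJ + "شود"
    print(repr(strip_hash_and_mi_verbs(sample)))
```

Note that the pattern only matches when the zero-width non-joiner is present, which is presumably why the substitution runs after normalization. It is also not anchored to a word boundary, so any token containing می followed by a zero-width non-joiner is removed, whether or not it is a verb.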
