filter out non-english
longshuicy committed Jul 16, 2024
1 parent e7acf92 commit ce8348c
Showing 3 changed files with 9 additions and 0 deletions.
4 changes: 4 additions & 0 deletions containerized_analytics/smile/topic_modeling/CHANGELOG.md
@@ -4,6 +4,10 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.1.6] - 07-16-2024

### Changed
- Add language detection to filter out non-English text [#123](https://github.com/ncsa/standalone-smm-analytics/issues/123)

## [0.1.5] - 01-23-2024

@@ -7,6 +7,7 @@
from nltk import WordNetLemmatizer
import pyLDAvis
import pyLDAvis.gensim
from langdetect import detect


class Gensim_Topic_Modeling:
@@ -16,6 +17,9 @@ def __init__(self, df, column):
'str').tolist()

    def preprocessing(self):
        # Detect and keep only English texts
        self.data = [sent for sent in self.data if detect(sent) == 'en']

        self.data = [re.sub('\S*@\S*\s?', "", sent) for sent in self.data]
        self.data = [re.sub('\s+', ' ', sent) for sent in self.data]
        self.data = [re.sub("\'", "", sent) for sent in self.data]
@@ -5,3 +5,4 @@ numpy>=1.18.1
pandas>=1.1.4
pyLDAvis==2.1.2
pika>=1.1.0
langdetect>=1.0.7
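
A note on the `detect(sent) == 'en'` filter added in `preprocessing`: `langdetect.detect` raises `LangDetectException` when a text has no usable features (for example an empty string or digits only), and its results are non-deterministic unless `DetectorFactory.seed` is set. The sketch below shows one possible way to guard the filter. It is not part of this commit; the helper name `keep_english` and its `detector`, `errors`, and `target` parameters are hypothetical, introduced only so the logic can be exercised without the real detector.

```python
from typing import Callable, Iterable, List, Optional, Tuple, Type


def keep_english(
    texts: Iterable[str],
    detector: Optional[Callable[[str], str]] = None,
    errors: Tuple[Type[BaseException], ...] = (),
    target: str = "en",
) -> List[str]:
    """Return the texts whose detected language equals `target`.

    Blank texts and texts the detector cannot classify are dropped
    instead of aborting the whole preprocessing step.
    """
    if detector is None:
        # Imported lazily so a stub detector can be passed in tests.
        from langdetect import DetectorFactory, detect
        from langdetect.lang_detect_exception import LangDetectException

        # langdetect is non-deterministic unless the seed is fixed.
        DetectorFactory.seed = 0
        detector = detect
        errors = (LangDetectException,)

    kept = []
    for text in texts:
        # Skip empty or whitespace-only strings before calling the detector.
        if not text or not text.strip():
            continue
        try:
            if detector(text) == target:
                kept.append(text)
        except errors:
            # Undetectable text (e.g. digits only) is treated as non-target.
            continue
    return kept
```

Under these assumptions, the list comprehension in `preprocessing` could become `self.data = keep_english(self.data)`. Whether undetectable texts should be dropped or kept is a design choice this commit does not state; the sketch drops them.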
