
Topic modeling #37

Open · wants to merge 17 commits into dev
Conversation

mgasvoda (Contributor) commented Jan 2, 2018

No description provided.

    ],
    'topic_modeling': [
        'gensim',
        'spacy'
Member:

Do we not need a spacy corpus as well?
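For context, spaCy needs a downloaded language model in addition to the pip package. A minimal sketch of the kind of guard that could sit behind @check_spacy (the model name en_core_web_sm and the error message are assumptions, not code from this PR):

import spacy

def load_spacy_model(name="en_core_web_sm"):
    """Load a spaCy language model, failing with a clear hint if the
    model data has not been downloaded."""
    try:
        return spacy.load(name)
    except OSError as err:
        raise RuntimeError(
            "spaCy model '{0}' is not installed; run "
            "'python -m spacy download {0}'".format(name)
        ) from err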

class QGLdaModel(BaseEstimator, TransformerMixin):
    @check_gensim
    @check_spacy
    def __init__(self, word_regex=r'\b[A-z]{2,}\b', stop_words=STOP_WORDS):
Member:

I would think the options for stop_words should be:

  • None (default): no stop words
  • True: use the built-in stop words
  • A sequence: user-specified stop words
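A minimal sketch of how that three-way option could be resolved (the helper name and the choice of spaCy's STOP_WORDS as the built-in set are assumptions, not code from this PR):

from spacy.lang.en.stop_words import STOP_WORDS

def resolve_stop_words(stop_words=None):
    """Map the stop_words argument onto a concrete set.

    None  -> no stop words (default)
    True  -> the built-in list (spaCy's, purely as an assumption here)
    other -> treated as a user-supplied sequence of words
    """
    if stop_words is None:
        return frozenset()
    if stop_words is True:
        return frozenset(STOP_WORDS)
    return frozenset(stop_words)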

Contributor (Author):

Not having any stop words seems to produce a pretty unusable model. My thinking is that it's best to have a sensible default: if the user chooses to override it with None they can, but the defaults should be able to produce something usable. We could log something when no stop words are provided (e.g. "INFO: No stop words provided, using sklearn builtins"), and potentially emit a warning if None is passed explicitly.
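A rough sketch of that behaviour; the sentinel, logger, and message text are illustrative assumptions rather than anything in this PR:

import logging
import warnings

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

logger = logging.getLogger(__name__)

# Sentinel so an explicit None can be told apart from "not provided".
_UNSET = object()

def resolve_default_stop_words(stop_words=_UNSET):
    if stop_words is _UNSET:
        logger.info("No stop words provided, using sklearn builtins")
        return frozenset(ENGLISH_STOP_WORDS)
    if stop_words is None:
        warnings.warn("Fitting a topic model with no stop words usually "
                      "produces poor topics.")
        return frozenset()
    return frozenset(stop_words)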

class QGLdaModel(BaseEstimator, TransformerMixin):
    @check_gensim
    @check_spacy
    def __init__(self, word_regex=r'\b[A-z]{2,}\b', stop_words=STOP_WORDS):
Member:

word_regex should be word_pattern to match what already exists in SKL.

@@ -85,3 +113,41 @@ class CandidateModel(
    parameter values to test as values
    """
    pass


class QGLdaModel(BaseEstimator, TransformerMixin):
Member:

I don't like either the prefix or the Model specifier. I'd call this GensimLDA or something like that.

import re

try:
    from spacy.lang.en.stop_words import STOP_WORDS
Member:

If we're literally only using spacy here for the stopwords, can't we somehow find the sklearn stopwords used in the CountVectorizer? That's got to be importable from somewhere.
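For reference, the list CountVectorizer uses for stop_words='english' is importable; the exact module path has moved between scikit-learn releases (older versions exposed it as sklearn.feature_extraction.stop_words), so treat this as a sketch:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# ENGLISH_STOP_WORDS is a frozenset of a few hundred English stop words.
print(len(ENGLISH_STOP_WORDS))
print('the' in ENGLISH_STOP_WORDS)  # True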

                                      for doc in driver.stream()])
        stop_ids = [self.dictionary.token2id[stopword] for stopword
                    in self.stop_words if stopword in self.dictionary.token2id]
        once_ids = [tokenid for tokenid, docfreq in
Member:

Why are we doing this?

Contributor (Author):

Filtering out words that only occur once was recommended in the Gensim documentation - beyond that, I don't know if it actually improves the performance of the model.
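For what it's worth, gensim's Dictionary has a built-in that covers the same advice; a small self-contained example (the sample texts are made up):

from gensim.corpora import Dictionary

texts = [["human", "machine", "interface", "time"],
         ["survey", "user", "computer", "time"]]

dictionary = Dictionary(texts)

# Drop tokens that appear in fewer than 2 documents, i.e. the
# "remove words that occur only once" recommendation from the gensim docs.
dictionary.filter_extremes(no_below=2, no_above=1.0)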

                                       for i in self.word_regex
                                           .finditer(doc.text)]
                                      for doc in driver.stream()])
        stop_ids = [self.dictionary.token2id[stopword] for stopword
Member:

Wouldn't it be better to only pass the dictionary words that aren't in stop_words?
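That would mean filtering at tokenization time instead of deleting ids afterwards; a minimal sketch of that alternative (the regex, stop-word set, and sample documents standing in for driver.stream() are all assumptions):

import re

from gensim.corpora import Dictionary
from spacy.lang.en.stop_words import STOP_WORDS

# [A-Za-z] rather than [A-z], which would also match '[', '\', ']', etc.
word_pattern = re.compile(r'\b[A-Za-z]{2,}\b')

docs = ["Topic models need plenty of text.",
        "Stop words are filtered before the dictionary is built."]

tokenized = [[match.group().lower()
              for match in word_pattern.finditer(text)
              if match.group().lower() not in STOP_WORDS]
             for text in docs]

dictionary = Dictionary(tokenized)  # never sees the stop words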

mgasvoda (Contributor, Author):

@OliverSherouse ready for follow up review
