
Topic modeling #37

Open · wants to merge 17 commits into dev
Conversation

mgasvoda (Contributor) commented Jan 2, 2018

No description provided.

    ],
    'topic_modeling': [
        'gensim',
        'spacy'
Member:

Do we not need a spacy corpus as well?
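For context, spaCy needs a downloaded language model in addition to the pip package. A minimal sketch of the kind of guard that could sit behind @check_spacy (the model name en_core_web_sm and the error message are assumptions, not code from this PR):

import spacy

def load_spacy_model(name="en_core_web_sm"):
    """Load a spaCy language model, failing with a clear hint if the
    model data has not been downloaded."""
    try:
        return spacy.load(name)
    except OSError as err:
        raise RuntimeError(
            "spaCy model '{0}' is not installed; run "
            "'python -m spacy download {0}'".format(name)
        ) from err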

class QGLdaModel(BaseEstimator, TransformerMixin):
    @check_gensim
    @check_spacy
    def __init__(self, word_regex=r'\b[A-z]{2,}\b', stop_words=STOP_WORDS):
Member:

I would think the options for stop_words should be:

  • None (default): no stop words
  • True: use the built-in stop words
  • A sequence: user-specified stop words
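A minimal sketch of how that three-way option could be resolved (the helper name and the choice of spaCy's STOP_WORDS as the built-in set are assumptions, not code from this PR):

from spacy.lang.en.stop_words import STOP_WORDS

def resolve_stop_words(stop_words=None):
    """Map the stop_words argument onto a concrete set.

    None  -> no stop words (default)
    True  -> the built-in list (spaCy's, purely as an assumption here)
    other -> treated as a user-supplied sequence of words
    """
    if stop_words is None:
        return frozenset()
    if stop_words is True:
        return frozenset(STOP_WORDS)
    return frozenset(stop_words)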

Contributor (Author):

Not having any stop words seems to produce a pretty unusable model. My thinking is that it's best to have a sensible default: if the user chooses to override it with None they can, but the defaults should be able to produce something usable. We could log something when no stop words are provided (e.g. "INFO: No stop words provided, using sklearn builtins"), and potentially emit a warning if None is passed explicitly.
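A rough sketch of that behaviour; the sentinel, logger, and message text are illustrative assumptions rather than anything in this PR:

import logging
import warnings

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

logger = logging.getLogger(__name__)

# Sentinel so an explicit None can be told apart from "not provided".
_UNSET = object()

def resolve_default_stop_words(stop_words=_UNSET):
    if stop_words is _UNSET:
        logger.info("No stop words provided, using sklearn builtins")
        return frozenset(ENGLISH_STOP_WORDS)
    if stop_words is None:
        warnings.warn("Fitting a topic model with no stop words usually "
                      "produces poor topics.")
        return frozenset()
    return frozenset(stop_words)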

class QGLdaModel(BaseEstimator, TransformerMixin):
    @check_gensim
    @check_spacy
    def __init__(self, word_regex=r'\b[A-z]{2,}\b', stop_words=STOP_WORDS):
Member:

word_regex should be word_pattern to match what already exists in SKL.

@@ -85,3 +113,41 @@ class CandidateModel(
    parameter values to test as values
    """
    pass


class QGLdaModel(BaseEstimator, TransformerMixin):
Member:

I don't like either the prefix or the Model specifier. I'd call this GensimLDA or something like that.

import re

try:
    from spacy.lang.en.stop_words import STOP_WORDS
Member:

If we're literally only using spacy here for the stopwords, can't we somehow find the sklearn stopwords used in the CountVectorizer? That's got to be importable from somewhere.
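For reference, the list CountVectorizer uses for stop_words='english' is importable; the exact module path has moved between scikit-learn releases (older versions exposed it as sklearn.feature_extraction.stop_words), so treat this as a sketch:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# ENGLISH_STOP_WORDS is a frozenset of a few hundred English stop words.
print(len(ENGLISH_STOP_WORDS))
print('the' in ENGLISH_STOP_WORDS)  # True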

                                      for doc in driver.stream()])
        stop_ids = [self.dictionary.token2id[stopword] for stopword
                    in self.stop_words if stopword in self.dictionary.token2id]
        once_ids = [tokenid for tokenid, docfreq in
Member:

Why are we doing this?

Contributor (Author):

Filtering out words that only occur once was recommended in the Gensim documentation - beyond that, I don't know if it actually improves the performance of the model.
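For what it's worth, gensim's Dictionary has a built-in that covers the same advice; a small self-contained example (the sample texts are made up):

from gensim.corpora import Dictionary

texts = [["human", "machine", "interface", "time"],
         ["survey", "user", "computer", "time"]]

dictionary = Dictionary(texts)

# Drop tokens that appear in fewer than 2 documents, i.e. the
# "remove words that occur only once" recommendation from the gensim docs.
dictionary.filter_extremes(no_below=2, no_above=1.0)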

                                       for i in self.word_regex
                                           .finditer(doc.text)]
                                      for doc in driver.stream()])
        stop_ids = [self.dictionary.token2id[stopword] for stopword
Member:

Wouldn't it be better to only pass the dictionary words that aren't in stop_words?
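That would mean filtering at tokenization time instead of deleting ids afterwards; a minimal sketch of that alternative (the regex, stop-word set, and sample documents standing in for driver.stream() are all assumptions):

import re

from gensim.corpora import Dictionary
from spacy.lang.en.stop_words import STOP_WORDS

# [A-Za-z] rather than [A-z], which would also match '[', '\', ']', etc.
word_pattern = re.compile(r'\b[A-Za-z]{2,}\b')

docs = ["Topic models need plenty of text.",
        "Stop words are filtered before the dictionary is built."]

tokenized = [[match.group().lower()
              for match in word_pattern.finditer(text)
              if match.group().lower() not in STOP_WORDS]
             for text in docs]

dictionary = Dictionary(tokenized)  # never sees the stop words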

mgasvoda (Contributor, Author):

@OliverSherouse ready for follow up review
