Merge pull request #8 from andrewtavis/verbose-clean
Verbose and multiprocessed cleaning with code and process factoring
andrewtavis committed Mar 15, 2021
2 parents 2af5253 + aa6bc1b commit a58255f
Showing 14 changed files with 956 additions and 802 deletions.
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,19 @@
### kwx 0.1.5 (March 15, 2021)

Changes include:

- Keyword extraction and selection are now decoupled so that modeling doesn't need to occur again to get new keywords

- Keyword extraction and cleaning are now fully decoupled processes

- kwargs for sentence-transformers BERT, LDA, and TFIDF can now be passed

- The cleaning process is verbose and uses multiprocessing

- The user has greater control over the cleaning process

- Reformatting of the code to make the process clearer

### kwx 0.1.0 (Feb 17, 2021)

First stable release of kwx
94 changes: 46 additions & 48 deletions README.md
@@ -11,7 +11,6 @@
[![pypi](https://img.shields.io/pypi/v/kwx.svg?color=4B8BBE)](https://pypi.org/project/kwx/)
[![pypistatus](https://img.shields.io/pypi/status/kwx.svg)](https://pypi.org/project/kwx/)
[![license](https://img.shields.io/github/license/andrewtavis/kwx.svg)](https://github.com/andrewtavis/kwx/blob/main/LICENSE.txt)
-[![contributions](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](https://github.com/andrewtavis/kwx/blob/main/.github/CONTRIBUTING.md)
[![coc](https://img.shields.io/badge/coc-Contributor%20Covenant-ff69b4.svg)](https://github.com/andrewtavis/kwx/blob/main/.github/CODE_OF_CONDUCT.md)
[![codestyle](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![colab](https://img.shields.io/badge/%20-Open%20in%20Colab-097ABB.svg?logo=google-colab&color=097ABB&labelColor=525252)](https://colab.research.google.com/github/andrewtavis/kwx)
@@ -21,7 +20,7 @@
[//]: # "The '-' after the section links is needed to make them work on GH (because of ↩s)"
**Jump to:**<a id="jumpto"></a> [Models](#models-) • [Usage](#usage-) • [Visuals](#visuals-) • [To-Do](#to-do-)

-**kwx** is a toolkit for multilingual keyword extraction based on [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) and Google's [BERT](https://github.com/google-research/bert). The package provides a suite of methods to process texts of any language to varying degrees and then extract and analyze keywords from the created corpus (see [kwx.languages](https://github.com/andrewtavis/kwx/blob/main/kwx/languages.py) for the various degrees of language support). A unique focus is allowing users to decide which words to not include in outputs, thereby allowing them to use their own intuitions to fine tune the modeling process.
**kwx** is a toolkit for multilingual keyword extraction based on Google's [BERT](https://github.com/google-research/bert) and [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation). The package provides a suite of methods to process texts of any language to varying degrees and then extract and analyze keywords from the created corpus (see [kwx.languages](https://github.com/andrewtavis/kwx/blob/main/kwx/languages.py) for the various degrees of language support). A unique focus is allowing users to decide which words not to include in outputs, thereby letting them use their own intuitions to fine-tune the modeling process.

For a thorough overview of the process and techniques see the [Google slides](https://docs.google.com/presentation/d/1BNddaeipNQG1mUTjBYmrdpGC6xlBvAi3rapT88fkdBU/edit?usp=sharing), and reference the [documentation](https://kwx.readthedocs.io/en/latest/) for explanations of the models and visualization methods.

@@ -45,27 +44,31 @@ import kwx

# Models [`↩`](#jumpto)

Implemented NLP modeling methods within [kwx.model](https://github.com/andrewtavis/kwx/blob/main/kwx/model.py) include:

### BERT

[Bidirectional Encoder Representations from Transformers](https://github.com/google-research/bert) derives representations of words based on NLP models run over open-source Wikipedia data. These representations are then leveraged to derive corpus topics.

kwx uses [sentence-transformers](https://github.com/UKPLab/sentence-transformers) pretrained models. See their GitHub and [documentation](https://www.sbert.net/) for the available models.
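
As a rough sketch of the mechanics (a standalone illustration under assumed data, not kwx's internal pipeline), texts can be embedded with sentence-transformers and then clustered, with keyword candidates drawn from each cluster:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Hypothetical toy corpus for illustration
texts = [
    "the flight to denver was delayed again",
    "lost baggage and no response from support",
    "friendly crew and great service",
]

model = SentenceTransformer("xlm-r-bert-base-nli-stsb-mean-tokens")
embeddings = model.encode(texts, batch_size=32)  # one vector per text

# Cluster the embeddings into topic groups; keyword candidates
# can then be selected from the texts in each cluster
clusters = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
print(clusters)
```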

### LDA

[Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. In the case of kwx, documents or text entries are posited to be a mixture of a given number of topics, and the presence of each word in a text body comes from its relation to these derived topics.

Although not as statistically strong as the other methods, LDA provides quick results that are suitable for many applications.
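
For illustration, a minimal LDA pass with [gensim](https://github.com/RaRe-Technologies/gensim) (which kwx builds on) might look as follows; the toy tokens and parameter values here are placeholders, not kwx defaults:

```python
from gensim import corpora
from gensim.models.ldamulticore import LdaMulticore

# Hypothetical pre-cleaned, tokenized texts
tokenized_texts = [
    ["flight", "delayed", "denver"],
    ["baggage", "lost", "support"],
    ["crew", "friendly", "service"],
    ["flight", "cancelled", "baggage"],
]

dictionary = corpora.Dictionary(tokenized_texts)
bow_corpus = [dictionary.doc2bow(text) for text in tokenized_texts]

# Train a small multicore LDA model and inspect its topic words
lda = LdaMulticore(
    corpus=bow_corpus, id2word=dictionary, num_topics=2, passes=10, workers=2,
)
for topic_id, words in lda.show_topics(num_topics=2, num_words=3, formatted=False):
    print(topic_id, [w for w, _ in words])
```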

### LDA with BERT embeddings

This method combines LDA with BERT embeddings via [kwx.autoencoder](https://github.com/andrewtavis/kwx/blob/main/kwx/autoencoder.py).
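
Conceptually (a simplified sketch under assumed vector sizes, not the package's exact pipeline), the two representations are concatenated and compressed to a small latent space before clustering:

```python
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

# Placeholder arrays standing in for LDA doc-topic vectors and BERT embeddings
lda_vecs = np.random.rand(100, 10)    # e.g. 10 topic probabilities per text
bert_vecs = np.random.rand(100, 768)  # e.g. 768-dim sentence embeddings
combined = np.concatenate([lda_vecs, bert_vecs], axis=1)

# A one-layer autoencoder compresses the concatenated vectors
input_vec = Input(shape=(combined.shape[1],))
encoded = Dense(32, activation="relu")(input_vec)
decoded = Dense(combined.shape[1], activation="relu")(encoded)

autoencoder = Model(input_vec, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(combined, combined, epochs=5, verbose=0)

# The encoder's output is then what gets clustered for topics
latent_vecs = Model(input_vec, encoded).predict(combined)
```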

-### Other
### Other Methods

The user can also choose to simply query the most common words from a text corpus or compute TFIDF ([Term Frequency Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) keywords: those that are unique in a text body when compared to another corpus. The former method is used in kwx as a baseline to check model efficacy, and the latter is a useful baseline when the user has another text body to compare the target corpus against.
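
To ground the TFIDF baseline, here is a hedged sketch using scikit-learn directly rather than kwx's wrapper; the corpora are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical target and comparison corpora
target = ["the flight was delayed again", "delayed baggage and a delayed flight"]
comparison = ["the food was great", "great service and friendly staff"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(target + comparison)

# Sum TFIDF scores over the target documents; the highest-scoring
# terms are those most distinctive to the target corpus
target_scores = tfidf[: len(target)].sum(axis=0).A1
terms = vectorizer.get_feature_names_out()
print(sorted(zip(terms, target_scores), key=lambda x: -x[1])[:5])
```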

# Usage [`↩`](#jumpto)

-Keyword extraction can be useful to analyze surveys, tweets, other kinds of social media posts, research papers, and further classes of texts. [examples.kw_extraction](https://github.com/andrewtavis/kwx/blob/main/examples/kw_extraction.ipynb) provides an example of how to use kwx by deriving keywords from tweets in the Kaggle [Twitter US Airline Sentiment](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) dataset.
Keyword extraction can be useful to analyze surveys, tweets and other kinds of social media posts, research papers, and further classes of texts. [examples.kw_extraction](https://github.com/andrewtavis/kwx/blob/main/examples/kw_extraction.ipynb) provides an example of how to use kwx by deriving keywords from tweets in the Kaggle [Twitter US Airline Sentiment](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) dataset.

The following outlines using kwx to derive keywords from a text corpus with `prompt_remove_words` as `True` (the user will be asked if some of the extracted words need to be replaced):

@@ -78,33 +81,36 @@ num_keywords = 15
```python
num_topics = 10
ignore_words = ["words", "user", "knows", "they", "don't", "want"]

# kwx.utils.clean() can be used on a list of lists
text_corpus = prepare_data(
    data='df_or_csv_xlsx_path',
    target_cols='cols_where_texts_are',
    input_language=input_language,
-   min_freq=2,
-   min_word_len=3,
-   sample_size=1,  # sample size (for testing)
-)[0]
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

-# Remove n-grams for BERT training
# Cleaned texts without n-grams have been found to work better than raw texts for BERT
# Retain n-grams for LDA and TFIDF
corpus_no_ngrams = [
    " ".join([t for t in text.split(" ") if "_" not in t]) for text in text_corpus
]

# We can pass kwargs for sentence_transformers.SentenceTransformer.encode,
# gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer
bert_kws = extract_kws(
    method='BERT',  # 'BERT', 'LDA_BERT', 'LDA', 'TFIDF', 'frequency'
-   text_corpus=corpus_no_ngrams,
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    return_topics=False,  # set True to inspect topics rather than produce keywords
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    batch_size=32,
)
```

@@ -125,7 +131,7 @@ The new BERT keywords are:
```
Are there words that should be removed [y/n]? n
```

-The model will be re-ran until all words known to be unreasonable are removed for a suitable output. `kwx.model.gen_files` could also be used as a run-all function that produces a directory with a keyword text file and visuals (for experienced users wanting quick results).
The model will be rerun until all words known to be unreasonable are removed, giving a suitable output. [kwx.model.gen_files](https://github.com/andrewtavis/kwx/blob/main/kwx/model.py) could also be used as a run-all function that produces a directory with a keyword text file and visuals (for experienced users wanting quick results).
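
A minimal sketch of such a run-all call follows. The argument names here are assumptions that mirror `extract_kws` above rather than a confirmed signature, so check the [documentation](https://kwx.readthedocs.io/en/latest/) before relying on them:

```python
from kwx.model import gen_files

# Hypothetical call; argument names assumed to mirror extract_kws
gen_files(
    method=["lda", "bert"],
    text_corpus=text_corpus,
    input_language=input_language,
    num_keywords=num_keywords,
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
)
```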

# Visuals [`↩`](#jumpto)

@@ -142,12 +148,9 @@ import matplotlib.pyplot as plt
```python
graph_topic_num_evals(
    method=["lda", "bert", "lda_bert"],
    text_corpus=text_corpus,
-   input_language=input_language,
    num_keywords=num_keywords,
    topic_nums_to_compare=list(range(5, 15)),
-   sample_size=1,
-   metrics=True,  # stability and coherence
    return_ideal_metrics=False,  # selects ideal model given metrics for kwx.model.gen_files
    metrics=True,  # stability and coherence
)
plt.show()
```
@@ -156,6 +159,27 @@ plt.show()
<img src="https://raw.githubusercontent.com/andrewtavis/kwx/main/resources/gh_images/topic_num_eval.png" width="600" />
</p>

### t-SNE

[t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) allows the user to visualize their topic distribution in both two and three dimensions. Currently available just for LDA, this technique provides another check for model suitability.

```python
from kwx.visuals import t_sne
import matplotlib.pyplot as plt

t_sne(
dimension="both", # 2d and 3d are options
text_corpus=text_corpus,
num_topics=10,
remove_3d_outliers=True,
)
plt.show()
```

<p align="middle">
<img src="https://raw.githubusercontent.com/andrewtavis/kwx/main/resources/gh_images/t_sne.png" width="600" />
</p>

### pyLDAvis

[pyLDAvis](https://github.com/bmabey/pyLDAvis) is included so that users can inspect LDA-extracted topics, and further so that visualizations can easily be generated for output files.
@@ -166,7 +190,6 @@ from kwx.visuals import pyLDAvis_topics
```python
pyLDAvis_topics(
    method="lda",
    text_corpus=text_corpus,
-   input_language=input_language,
    num_topics=10,
    display_ipython=False,  # For Jupyter integration
)
```
@@ -187,7 +210,6 @@ ignore_words = ["words", "user", "knows", "they", "don't", "want"]
```python
gen_word_cloud(
    text_corpus=text_corpus,
-   input_language=input_language,
    ignore_words=None,
    height=500,
)
```
@@ -197,35 +219,11 @@
<img src="https://raw.githubusercontent.com/andrewtavis/kwx/main/resources/gh_images/word_cloud.png" width="600" />
</p>

# To-Do [`↩`](#jumpto)

-- Including more methods to extract keywords, as well as improving the current ones
-- Adding BERT [sentence-transformers](https://github.com/UKPLab/sentence-transformers) language models as an argument in [kwx.model.extract_kws](https://github.com/andrewtavis/kwx/blob/main/kwx/model.py)
-- Splitting the keyword selection process from [kwx.model.extract_kws](https://github.com/andrewtavis/kwx/blob/main/kwx/model.py) into `kwx.model.select_kws` to allow for faster result iteration given user input
- Including more methods to extract keywords
- Allowing key phrase extraction
- Adding t-SNE and pyLDAvis style visualizations for BERT models
- Including more options to fine tune the cleaning process in [kwx.utils](https://github.com/andrewtavis/kwx/blob/main/kwx/utils.py)
- Updates to [kwx.languages](https://github.com/andrewtavis/kwx/blob/main/kwx/languages.py) as lemmatization and other linguistic package dependencies evolve
- Creating, improving and sharing [examples](https://github.com/andrewtavis/kwx/tree/main/examples)
- Improving [tests](https://github.com/andrewtavis/kwx/tree/main/tests) for greater [code coverage](https://codecov.io/gh/andrewtavis/kwx)
5 changes: 1 addition & 4 deletions docs/source/index.rst
@@ -5,7 +5,7 @@
:target: https://github.com/andrewtavis/kwx
============

-|rtd| |ci| |codecov| |pyversions| |pypi| |pypistatus| |license| |contributions| |coc| |codestyle| |colab|
|rtd| |ci| |codecov| |pyversions| |pypi| |pypistatus| |license| |coc| |codestyle| |colab|

.. |rtd| image:: https://img.shields.io/readthedocs/kwx.svg?logo=read-the-docs
:target: http://kwx.readthedocs.io/en/latest/
@@ -28,9 +28,6 @@
.. |license| image:: https://img.shields.io/github/license/andrewtavis/kwx.svg
:target: https://github.com/andrewtavis/kwx/blob/main/LICENSE.txt

-.. |contributions| image:: https://img.shields.io/badge/contributions-welcome-brightgreen.svg
-   :target: https://github.com/andrewtavis/kwx/blob/main/.github/CONTRIBUTING.md

.. |coc| image:: https://img.shields.io/badge/coc-Contributor%20Covenant-ff69b4.svg
:target: https://github.com/andrewtavis/kwx/blob/main/.github/CODE_OF_CONDUCT.md

2 changes: 2 additions & 0 deletions docs/source/model.rst
@@ -8,11 +8,13 @@ The :py:mod:`model` module provides needed functions for modeling text corpuses
* :py:func:`kwx.model.get_topic_words`
* :py:func:`kwx.model.get_coherence`
* :py:func:`kwx.model._order_and_subset_by_coherence`
* :py:func:`kwx.model._select_kws`
* :py:func:`kwx.model.extract_kws`
* :py:func:`kwx.model.gen_files`

.. autofunction:: kwx.model.get_topic_words
.. autofunction:: kwx.model.get_coherence
.. autofunction:: kwx.model._order_and_subset_by_coherence
.. autofunction:: kwx.model._select_kws
.. autofunction:: kwx.model.extract_kws
.. autofunction:: kwx.model.gen_files
14 changes: 8 additions & 6 deletions docs/source/utils.rst
@@ -6,19 +6,21 @@ The :py:mod:`utils` module provides needed functions for data loading, cleaning,
**Functions**

* :py:func:`kwx.utils.load_data`
-* :py:func:`kwx.utils._combine_tokens_to_str`
-* :py:func:`kwx.utils._clean_text_strings`
-* :py:func:`kwx.utils.clean_and_tokenize_texts`
* :py:func:`kwx.utils._combine_texts_to_str`
* :py:func:`kwx.utils._remove_unwanted`
* :py:func:`kwx.utils._lemmatize`
* :py:func:`kwx.utils.clean`
* :py:func:`kwx.utils.prepare_data`
* :py:func:`kwx.utils._prepare_corpus_path`
* :py:func:`kwx.utils.translate_output`
* :py:func:`kwx.utils.organize_by_pos`
* :py:func:`kwx.utils.prompt_for_word_removal`

.. autofunction:: kwx.utils.load_data
-.. autofunction:: kwx.utils._combine_tokens_to_str
-.. autofunction:: kwx.utils._clean_text_strings
-.. autofunction:: kwx.utils.clean_and_tokenize_texts
.. autofunction:: kwx.utils._combine_texts_to_str
.. autofunction:: kwx.utils._remove_unwanted
.. autofunction:: kwx.utils._lemmatize
.. autofunction:: kwx.utils.clean
.. autofunction:: kwx.utils.prepare_data
.. autofunction:: kwx.utils._prepare_corpus_path
.. autofunction:: kwx.utils.translate_output
13 changes: 11 additions & 2 deletions kwx/autoencoder.py
@@ -10,11 +10,16 @@
fit
"""

-from sklearn.model_selection import train_test_split

import logging
import os
import warnings

logging.disable(logging.WARNING)
warnings.filterwarnings("ignore", message=r"Passing", category=FutureWarning)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

from sklearn.model_selection import train_test_split

import keras
from keras.layers import Input, Dense
from keras.models import Model
Expand Down Expand Up @@ -72,10 +77,13 @@ def _compile(self, input_dim):
input_vec = Input(shape=(input_dim,))
encoded = Dense(units=self.latent_dim, activation=self.activation)(input_vec)
decoded = Dense(units=input_dim, activation=self.activation)(encoded)

self.autoencoder = Model(input_vec, decoded)
self.encoder = Model(input_vec, encoded)

encoded_input = Input(shape=(self.latent_dim,))
decoder_layer = self.autoencoder.layers[-1]

self.decoder = Model(encoded_input, decoder_layer(encoded_input))
self.autoencoder.compile(optimizer="adam", loss=keras.losses.mean_squared_error)

@@ -95,6 +103,7 @@ def fit(self, X):
"""
if not self.autoencoder:
self._compile(X.shape[1])

X_train, X_test = train_test_split(X)
self.his = self.autoencoder.fit(
X_train,
