Merge pull request #8 from andrewtavis/verbose-clean
Verbose and multiprocessed cleaning with code and process factoring
andrewtavis committed Mar 15, 2021
2 parents 2af5253 + aa6bc1b commit a58255f
Showing 14 changed files with 956 additions and 802 deletions.
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,19 @@
### kwx 0.1.5 (March 15, 2021)

Changes include:

- Keyword extraction and selection are now decoupled so that modeling doesn't need to occur again to get new keywords

- Keyword extraction and cleaning are now fully decoupled processes

- kwargs for sentence-transformers BERT, LDA, and TFIDF can now be passed

- The cleaning process is verbose and uses multiprocessing

- The user has greater control over the cleaning process

- Reformatting of the code to make the process clearer

### kwx 0.1.0 (Feb 17, 2021)

First stable release of kwx
94 changes: 46 additions & 48 deletions README.md
@@ -11,7 +11,6 @@
[![pypi](https://img.shields.io/pypi/v/kwx.svg?color=4B8BBE)](https://pypi.org/project/kwx/)
[![pypistatus](https://img.shields.io/pypi/status/kwx.svg)](https://pypi.org/project/kwx/)
[![license](https://img.shields.io/github/license/andrewtavis/kwx.svg)](https://github.com/andrewtavis/kwx/blob/main/LICENSE.txt)
-[![contributions](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](https://github.com/andrewtavis/kwx/blob/main/.github/CONTRIBUTING.md)
[![coc](https://img.shields.io/badge/coc-Contributor%20Covenant-ff69b4.svg)](https://github.com/andrewtavis/kwx/blob/main/.github/CODE_OF_CONDUCT.md)
[![codestyle](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![colab](https://img.shields.io/badge/%20-Open%20in%20Colab-097ABB.svg?logo=google-colab&color=097ABB&labelColor=525252)](https://colab.research.google.com/github/andrewtavis/kwx)
@@ -21,7 +20,7 @@
[//]: # "The '-' after the section links is needed to make them work on GH (because of ↩s)"
**Jump to:**<a id="jumpto"></a> [Models](#models-) • [Usage](#usage-) • [Visuals](#visuals-) • [To-Do](#to-do-)

-**kwx** is a toolkit for multilingual keyword extraction based on [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) and Google's [BERT](https://github.com/google-research/bert). The package provides a suite of methods to process texts of any language to varying degrees and then extract and analyze keywords from the created corpus (see [kwx.languages](https://github.com/andrewtavis/kwx/blob/main/kwx/languages.py) for the various degrees of language support). A unique focus is allowing users to decide which words to not include in outputs, thereby allowing them to use their own intuitions to fine tune the modeling process.
**kwx** is a toolkit for multilingual keyword extraction based on Google's [BERT](https://github.com/google-research/bert) and [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation). The package provides a suite of methods to process texts of any language to varying degrees and then extract and analyze keywords from the created corpus (see [kwx.languages](https://github.com/andrewtavis/kwx/blob/main/kwx/languages.py) for the various degrees of language support). A unique focus is allowing users to decide which words not to include in outputs, thereby letting them use their own intuitions to fine-tune the modeling process.

For a thorough overview of the process and techniques see the [Google slides](https://docs.google.com/presentation/d/1BNddaeipNQG1mUTjBYmrdpGC6xlBvAi3rapT88fkdBU/edit?usp=sharing), and reference the [documentation](https://kwx.readthedocs.io/en/latest/) for explanations of the models and visualization methods.

@@ -45,27 +44,31 @@ import kwx

# Models [`↩`](#jumpto)

Implemented NLP modeling methods within [kwx.model](https://github.com/andrewtavis/kwx/blob/main/kwx/model.py) include:

### BERT

[Bidirectional Encoder Representations from Transformers](https://github.com/google-research/bert) derives representations of words based on NLP models run over open-source Wikipedia data. These representations are then leveraged to derive corpus topics.

kwx uses [sentence-transformers](https://github.com/UKPLab/sentence-transformers) pretrained models. See their GitHub and [documentation](https://www.sbert.net/) for the available models.
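
As a rough sketch of the mechanics (a standalone illustration under assumed data, not kwx's internal pipeline), texts can be embedded with sentence-transformers and then clustered, with keyword candidates drawn from each cluster:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Hypothetical toy corpus for illustration
texts = [
    "the flight to denver was delayed again",
    "lost baggage and no response from support",
    "friendly crew and great service",
]

model = SentenceTransformer("xlm-r-bert-base-nli-stsb-mean-tokens")
embeddings = model.encode(texts, batch_size=32)  # one vector per text

# Cluster the embeddings into topic groups; keyword candidates
# can then be selected from the texts in each cluster
clusters = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
print(clusters)
```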

### LDA

[Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. In the case of kwx, documents or text entries are posited to be a mixture of a given number of topics, and the presence of each word in a text body comes from its relation to these derived topics.

Although not as statistically strong as the other methods, LDA provides quick results that are suitable for many applications.
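
For illustration, a minimal LDA pass with [gensim](https://github.com/RaRe-Technologies/gensim) (which kwx builds on) might look as follows; the toy tokens and parameter values here are placeholders, not kwx defaults:

```python
from gensim import corpora
from gensim.models.ldamulticore import LdaMulticore

# Hypothetical pre-cleaned, tokenized texts
tokenized_texts = [
    ["flight", "delayed", "denver"],
    ["baggage", "lost", "support"],
    ["crew", "friendly", "service"],
    ["flight", "cancelled", "baggage"],
]

dictionary = corpora.Dictionary(tokenized_texts)
bow_corpus = [dictionary.doc2bow(text) for text in tokenized_texts]

# Train a small multicore LDA model and inspect its topic words
lda = LdaMulticore(
    corpus=bow_corpus, id2word=dictionary, num_topics=2, passes=10, workers=2,
)
for topic_id, words in lda.show_topics(num_topics=2, num_words=3, formatted=False):
    print(topic_id, [w for w, _ in words])
```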

### LDA with BERT embeddings

This method combines LDA with BERT embeddings via [kwx.autoencoder](https://github.com/andrewtavis/kwx/blob/main/kwx/autoencoder.py).
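
Conceptually (a simplified sketch under assumed vector sizes, not the package's exact pipeline), the two representations are concatenated and compressed to a small latent space before clustering:

```python
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

# Placeholder arrays standing in for LDA doc-topic vectors and BERT embeddings
lda_vecs = np.random.rand(100, 10)    # e.g. 10 topic probabilities per text
bert_vecs = np.random.rand(100, 768)  # e.g. 768-dim sentence embeddings
combined = np.concatenate([lda_vecs, bert_vecs], axis=1)

# A one-layer autoencoder compresses the concatenated vectors
input_vec = Input(shape=(combined.shape[1],))
encoded = Dense(32, activation="relu")(input_vec)
decoded = Dense(combined.shape[1], activation="relu")(encoded)

autoencoder = Model(input_vec, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(combined, combined, epochs=5, verbose=0)

# The encoder's output is then what gets clustered for topics
latent_vecs = Model(input_vec, encoded).predict(combined)
```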

-### Other
### Other Methods

The user can also choose to simply query the most common words from a text corpus or compute TFIDF ([Term Frequency Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) keywords: those that are unique in a text body when compared to another corpus. The former method is used in kwx as a baseline to check model efficacy, and the latter is a useful baseline when the user has another text body to compare the target corpus against.
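
To ground the TFIDF baseline, here is a hedged sketch using scikit-learn directly rather than kwx's wrapper; the corpora are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical target and comparison corpora
target = ["the flight was delayed again", "delayed baggage and a delayed flight"]
comparison = ["the food was great", "great service and friendly staff"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(target + comparison)

# Sum TFIDF scores over the target documents; the highest-scoring
# terms are those most distinctive to the target corpus
target_scores = tfidf[: len(target)].sum(axis=0).A1
terms = vectorizer.get_feature_names_out()
print(sorted(zip(terms, target_scores), key=lambda x: -x[1])[:5])
```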

# Usage [`↩`](#jumpto)

-Keyword extraction can be useful to analyze surveys, tweets, other kinds of social media posts, research papers, and further classes of texts. [examples.kw_extraction](https://github.com/andrewtavis/kwx/blob/main/examples/kw_extraction.ipynb) provides an example of how to use kwx by deriving keywords from tweets in the Kaggle [Twitter US Airline Sentiment](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) dataset.
Keyword extraction can be useful to analyze surveys, tweets and other kinds of social media posts, research papers, and further classes of texts. [examples.kw_extraction](https://github.com/andrewtavis/kwx/blob/main/examples/kw_extraction.ipynb) provides an example of how to use kwx by deriving keywords from tweets in the Kaggle [Twitter US Airline Sentiment](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) dataset.

The following outlines using kwx to derive keywords from a text corpus with `prompt_remove_words` as `True` (the user will be asked if some of the extracted words need to be replaced):

@@ -78,33 +81,36 @@ num_keywords = 15
```python
num_topics = 10
ignore_words = ["words", "user", "knows", "they", "don't", "want"]

# kwx.utils.clean() can be used on a list of lists
text_corpus = prepare_data(
    data='df_or_csv_xlsx_path',
    target_cols='cols_where_texts_are',
    input_language=input_language,
-   min_freq=2,
-   min_word_len=3,
-   sample_size=1,  # sample size (for testing)
-)[0]
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

-# Remove n-grams for BERT training
# Cleaned texts without n-grams have been found to work better than raw texts for BERT
# Retain n-grams for LDA and TFIDF
corpus_no_ngrams = [
    " ".join([t for t in text.split(" ") if "_" not in t]) for text in text_corpus
]

# We can pass kwargs for sentence_transformers.SentenceTransformer.encode,
# gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer
bert_kws = extract_kws(
    method='BERT',  # 'BERT', 'LDA_BERT', 'LDA', 'TFIDF', 'frequency'
-   text_corpus=corpus_no_ngrams,
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    return_topics=False,  # set True to inspect topics rather than produce keywords
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    batch_size=32,
)
```

@@ -125,7 +131,7 @@ The new BERT keywords are:
```
Are there words that should be removed [y/n]? n
```

-The model will be re-ran until all words known to be unreasonable are removed for a suitable output. `kwx.model.gen_files` could also be used as a run-all function that produces a directory with a keyword text file and visuals (for experienced users wanting quick results).
The model will be rerun until all words known to be unreasonable are removed, giving a suitable output. [kwx.model.gen_files](https://github.com/andrewtavis/kwx/blob/main/kwx/model.py) could also be used as a run-all function that produces a directory with a keyword text file and visuals (for experienced users wanting quick results).
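
A minimal sketch of such a run-all call follows. The argument names here are assumptions that mirror `extract_kws` above rather than a confirmed signature, so check the [documentation](https://kwx.readthedocs.io/en/latest/) before relying on them:

```python
from kwx.model import gen_files

# Hypothetical call; argument names assumed to mirror extract_kws
gen_files(
    method=["lda", "bert"],
    text_corpus=text_corpus,
    input_language=input_language,
    num_keywords=num_keywords,
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
)
```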

# Visuals [`↩`](#jumpto)

@@ -142,12 +148,9 @@ import matplotlib.pyplot as plt
```python
graph_topic_num_evals(
    method=["lda", "bert", "lda_bert"],
    text_corpus=text_corpus,
-   input_language=input_language,
    num_keywords=num_keywords,
    topic_nums_to_compare=list(range(5, 15)),
-   sample_size=1,
-   metrics=True,  # stability and coherence
    return_ideal_metrics=False,  # selects ideal model given metrics for kwx.model.gen_files
    metrics=True,  # stability and coherence
)
plt.show()
```
@@ -156,6 +159,27 @@ plt.show()
<img src="https://raw.githubusercontent.com/andrewtavis/kwx/main/resources/gh_images/topic_num_eval.png" width="600" />
</p>

### t-SNE

[t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) allows the user to visualize their topic distribution in both two and three dimensions. Currently available just for LDA, this technique provides another check for model suitability.

```python
from kwx.visuals import t_sne
import matplotlib.pyplot as plt

t_sne(
dimension="both", # 2d and 3d are options
text_corpus=text_corpus,
num_topics=10,
remove_3d_outliers=True,
)
plt.show()
```

<p align="middle">
<img src="https://raw.githubusercontent.com/andrewtavis/kwx/main/resources/gh_images/t_sne.png" width="600" />
</p>

### pyLDAvis

[pyLDAvis](https://github.com/bmabey/pyLDAvis) is included so that users can inspect LDA-extracted topics, and further so that visualizations can easily be generated for output files.
@@ -166,7 +190,6 @@ from kwx.visuals import pyLDAvis_topics
```python
pyLDAvis_topics(
    method="lda",
    text_corpus=text_corpus,
-   input_language=input_language,
    num_topics=10,
    display_ipython=False,  # For Jupyter integration
)
```
@@ -187,7 +210,6 @@ ignore_words = ["words", "user", "knows", "they", "don't", "want"]
```python
gen_word_cloud(
    text_corpus=text_corpus,
-   input_language=input_language,
    ignore_words=None,
    height=500,
)
```
@@ -197,35 +219,11 @@
<img src="https://raw.githubusercontent.com/andrewtavis/kwx/main/resources/gh_images/word_cloud.png" width="600" />
</p>

# To-Do [`↩`](#jumpto)

-- Including more methods to extract keywords, as well as improving the current ones
-- Adding BERT [sentence-transformers](https://github.com/UKPLab/sentence-transformers) language models as an argument in [kwx.model.extract_kws](https://github.com/andrewtavis/kwx/blob/main/kwx/model.py)
-- Splitting the keyword selection process from [kwx.model.extract_kws](https://github.com/andrewtavis/kwx/blob/main/kwx/model.py) into `kwx.model.select_kws` to allow for faster result iteration given user input
- Including more methods to extract keywords
- Allowing key phrase extraction
- Adding t-SNE and pyLDAvis style visualizations for BERT models
- Including more options to fine tune the cleaning process in [kwx.utils](https://github.com/andrewtavis/kwx/blob/main/kwx/utils.py)
- Updates to [kwx.languages](https://github.com/andrewtavis/kwx/blob/main/kwx/languages.py) as lemmatization and other linguistic package dependencies evolve
- Creating, improving and sharing [examples](https://github.com/andrewtavis/kwx/tree/main/examples)
- Improving [tests](https://github.com/andrewtavis/kwx/tree/main/tests) for greater [code coverage](https://codecov.io/gh/andrewtavis/kwx)
5 changes: 1 addition & 4 deletions docs/source/index.rst
@@ -5,7 +5,7 @@
:target: https://github.com/andrewtavis/kwx
============

-|rtd| |ci| |codecov| |pyversions| |pypi| |pypistatus| |license| |contributions| |coc| |codestyle| |colab|
|rtd| |ci| |codecov| |pyversions| |pypi| |pypistatus| |license| |coc| |codestyle| |colab|

.. |rtd| image:: https://img.shields.io/readthedocs/kwx.svg?logo=read-the-docs
:target: http://kwx.readthedocs.io/en/latest/
@@ -28,9 +28,6 @@
.. |license| image:: https://img.shields.io/github/license/andrewtavis/kwx.svg
:target: https://github.com/andrewtavis/kwx/blob/main/LICENSE.txt

-.. |contributions| image:: https://img.shields.io/badge/contributions-welcome-brightgreen.svg
-   :target: https://github.com/andrewtavis/kwx/blob/main/.github/CONTRIBUTING.md

.. |coc| image:: https://img.shields.io/badge/coc-Contributor%20Covenant-ff69b4.svg
:target: https://github.com/andrewtavis/kwx/blob/main/.github/CODE_OF_CONDUCT.md

2 changes: 2 additions & 0 deletions docs/source/model.rst
@@ -8,11 +8,13 @@ The :py:mod:`model` module provides needed functions for modeling text corpuses
* :py:func:`kwx.model.get_topic_words`
* :py:func:`kwx.model.get_coherence`
* :py:func:`kwx.model._order_and_subset_by_coherence`
* :py:func:`kwx.model._select_kws`
* :py:func:`kwx.model.extract_kws`
* :py:func:`kwx.model.gen_files`

.. autofunction:: kwx.model.get_topic_words
.. autofunction:: kwx.model.get_coherence
.. autofunction:: kwx.model._order_and_subset_by_coherence
.. autofunction:: kwx.model._select_kws
.. autofunction:: kwx.model.extract_kws
.. autofunction:: kwx.model.gen_files
14 changes: 8 additions & 6 deletions docs/source/utils.rst
@@ -6,19 +6,21 @@ The :py:mod:`utils` module provides needed functions for data loading, cleaning,
**Functions**

* :py:func:`kwx.utils.load_data`
-* :py:func:`kwx.utils._combine_tokens_to_str`
-* :py:func:`kwx.utils._clean_text_strings`
-* :py:func:`kwx.utils.clean_and_tokenize_texts`
* :py:func:`kwx.utils._combine_texts_to_str`
* :py:func:`kwx.utils._remove_unwanted`
* :py:func:`kwx.utils._lemmatize`
* :py:func:`kwx.utils.clean`
* :py:func:`kwx.utils.prepare_data`
* :py:func:`kwx.utils._prepare_corpus_path`
* :py:func:`kwx.utils.translate_output`
* :py:func:`kwx.utils.organize_by_pos`
* :py:func:`kwx.utils.prompt_for_word_removal`

.. autofunction:: kwx.utils.load_data
-.. autofunction:: kwx.utils._combine_tokens_to_str
-.. autofunction:: kwx.utils._clean_text_strings
-.. autofunction:: kwx.utils.clean_and_tokenize_texts
.. autofunction:: kwx.utils._combine_texts_to_str
.. autofunction:: kwx.utils._remove_unwanted
.. autofunction:: kwx.utils._lemmatize
.. autofunction:: kwx.utils.clean
.. autofunction:: kwx.utils.prepare_data
.. autofunction:: kwx.utils._prepare_corpus_path
.. autofunction:: kwx.utils.translate_output
13 changes: 11 additions & 2 deletions kwx/autoencoder.py
@@ -10,11 +10,16 @@
fit
"""

-from sklearn.model_selection import train_test_split

import logging
import os
import warnings

logging.disable(logging.WARNING)
warnings.filterwarnings("ignore", message=r"Passing", category=FutureWarning)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

from sklearn.model_selection import train_test_split

import keras
from keras.layers import Input, Dense
from keras.models import Model
Expand Down Expand Up @@ -72,10 +77,13 @@ def _compile(self, input_dim):
input_vec = Input(shape=(input_dim,))
encoded = Dense(units=self.latent_dim, activation=self.activation)(input_vec)
decoded = Dense(units=input_dim, activation=self.activation)(encoded)

self.autoencoder = Model(input_vec, decoded)
self.encoder = Model(input_vec, encoded)

encoded_input = Input(shape=(self.latent_dim,))
decoder_layer = self.autoencoder.layers[-1]

self.decoder = Model(encoded_input, decoder_layer(encoded_input))
self.autoencoder.compile(optimizer="adam", loss=keras.losses.mean_squared_error)

@@ -95,6 +103,7 @@ def fit(self, X):
"""
if not self.autoencoder:
self._compile(X.shape[1])

X_train, X_test = train_test_split(X)
self.his = self.autoencoder.fit(
X_train,
