Topic Modelling and Visualization #163

Open
wants to merge 50 commits into
base: master
fa342a9
added MultiIndex DF support
mk2510 Aug 18, 2020
59a9f8c
beginning with tests
henrifroese Aug 19, 2020
19c52de
implemented correct sparse support
mk2510 Aug 19, 2020
66e566c
Merge branch 'master_upstream' into change_representation_to_multicolumn
mk2510 Aug 21, 2020
41f55a8
added back list() and rm .tolist()
mk2510 Aug 21, 2020
217611a
rm .tolist() and added list()
mk2510 Aug 21, 2020
6a3b56d
Adopted the test to the new dataframes
mk2510 Aug 21, 2020
b8ff561
wrong format
mk2510 Aug 21, 2020
e3af2f9
Address most review comments.
henrifroese Aug 21, 2020
77ad80e
Add more unittests for representation
henrifroese Aug 21, 2020
bee2157
Initial commit to add topic modelling
henrifroese Aug 24, 2020
dece7b5
add pyLDAvis to dependencies
henrifroese Aug 24, 2020
6387ce9
add return_figure option
henrifroese Aug 24, 2020
01c0818
allow display in Console and Jupyter Notebooks
henrifroese Aug 24, 2020
9cd113c
tsvd
mk2510 Aug 24, 2020
187d8f5
Change display at end of function
henrifroese Aug 24, 2020
d9d032c
Merge branch 'topic_modelling' of https://github.com/SummerOfCode-NoH…
henrifroese Aug 24, 2020
85089b1
change display
henrifroese Aug 24, 2020
77d815a
change display for notebook again
henrifroese Aug 24, 2020
242383a
added lda
mk2510 Aug 24, 2020
7edac3a
Merge remote-tracking branch 'origin/topic_modelling' into topic_mode…
mk2510 Aug 24, 2020
9e09e7a
Add tests
henrifroese Aug 24, 2020
fc54ff2
Merge branch 'topic_modelling' of https://github.com/SummerOfCode-NoH…
henrifroese Aug 24, 2020
46289f2
Format; change name; remove new type Signature
henrifroese Aug 24, 2020
bcfa78d
updatewd test
mk2510 Aug 24, 2020
eb2d31b
add docstring
henrifroese Aug 24, 2020
1fd16eb
Merge branch 'topic_modelling' of https://github.com/SummerOfCode-NoH…
henrifroese Aug 24, 2020
3a39346
Implement matrix multiplication changes; fix metadata error
henrifroese Aug 24, 2020
4ad7ee8
added test for lda and tSVD
mk2510 Aug 25, 2020
65504bb
Implement top_words_per_document, top_words_per_topic, topics_from_to…
henrifroese Aug 25, 2020
76e7689
added tests for topic functions
mk2510 Aug 25, 2020
3dfd528
Add docstrings and function comments to new topic modelling functions…
henrifroese Aug 25, 2020
6222448
Merge remote-tracking branch 'origin/topic_modelling' into topic_mode…
henrifroese Aug 25, 2020
79fc37e
fixed index and test
mk2510 Aug 25, 2020
2db883d
- Fix display options
henrifroese Aug 25, 2020
d1c582b
Fix plot visualization
henrifroese Aug 25, 2020
72a7736
remove return_figure parameter
henrifroese Aug 25, 2020
ebc4171
Fix errors and bugs.
henrifroese Aug 25, 2020
9d85c14
remove test-docstring at the end
henrifroese Aug 25, 2020
64edbaf
Start implementing discussed changes
henrifroese Aug 29, 2020
69af26b
Finish implementing the suggested changes.
henrifroese Aug 30, 2020
b9eaf1f
incorporate suggested changes from review
henrifroese Sep 12, 2020
6c30a5e
Fix pyLDAvis PCoA issue.
henrifroese Sep 18, 2020
d12ba7e
Add comment to docstring.
henrifroese Sep 18, 2020
a571925
import _helper in __init__ to overwrite pyLDAvis change
henrifroese Sep 18, 2020
a75aebe
enable auto-display for jupyter notebooks
henrifroese Sep 18, 2020
f8a09c4
Merge branch 'master_upstream' into topic_modelling
mk2510 Sep 22, 2020
cfc78d9
fixed vector series, as pca returns an array
mk2510 Sep 22, 2020
4c5aa0b
fixed the last merged issues
mk2510 Sep 22, 2020
dc42ed1
fix formatting
mk2510 Sep 22, 2020
add docstring
henrifroese committed Aug 24, 2020
commit eb2d31b0408d303a45ecdd7833d0adf47be84296
101 changes: 89 additions & 12 deletions texthero/visualization.py
@@ -9,7 +9,7 @@
from wordcloud import WordCloud

from texthero import preprocessing
from texthero._types import TextSeries, InputSeries, DocumentTermDF
from texthero._types import TextSeries, InputSeries
import string

from matplotlib.colors import LinearSegmentedColormap as lsg
@@ -315,19 +315,21 @@ def top_words(s: TextSeries, normalize=False) -> pd.Series:
def _get_matrices_for_visualize_topics(s_document_term, s_document_topic, vectorizer):

if vectorizer:
# Here, s_document_topic is output of hero.lda or hero.truncatedSVD.

# s_document_topic is output of hero.lda or hero.lsi
document_term_matrix = s_document_term.sparse.to_coo()
document_topic_matrix = np.array(list(s_document_topic))

topic_term_matrix = vectorizer.components_

else:
# s_document_topic is output of some hero clustering function
# Here, s_document_topic is output of some hero clustering function.

# First remove documents that are not assigned to any cluster.
indexes_of_unassigned_documents = s_document_topic == -1
s_document_term = s_document_term[~indexes_of_unassigned_documents]
s_document_topic = s_document_topic[~indexes_of_unassigned_documents]
s_document_topic.cat.remove_unused_categories(inplace=True)
s_document_topic = s_document_topic.cat.remove_unused_categories()

document_term_matrix = s_document_term.sparse.to_coo()
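
The clustering branch above drops documents labelled -1 and then reassigns the result of `remove_unused_categories()` instead of passing `inplace=True`. A minimal sketch of that step on toy data, assuming the cluster labels arrive as a categorical `pd.Series` (the values and column names below are made up for illustration):

```python
import pandas as pd

# Hypothetical cluster labels; -1 marks documents left unassigned,
# as a DBSCAN-style clustering would do.
s_document_topic = pd.Series([0, 1, -1, 1, -1], dtype="category")

# Hypothetical document-term counts, one row per document.
s_document_term = pd.DataFrame(
    {"goal": [1, 0, 2, 0, 1], "match": [0, 3, 0, 1, 1]}
)

# Boolean mask of unassigned documents; filter both inputs with it
# so their rows stay aligned.
unassigned = s_document_topic == -1
s_document_term = s_document_term[~unassigned]
s_document_topic = s_document_topic[~unassigned]

# Filtering rows keeps -1 in the list of categories. Reassigning the
# returned Series drops it without relying on the in-place variant.
s_document_topic = s_document_topic.cat.remove_unused_categories()

print(list(s_document_topic.cat.categories))
```

Reassigning keeps the code independent of the `inplace` argument, which pandas has since deprecated for this method.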

@@ -370,8 +372,40 @@ def _prepare_matrices_for_pyLDAvis(document_topic_matrix, topic_term_matrix):
return document_topic_distributions, topic_term_distributions


def visualize_topics(s_document_term, s_document_topic): # TODO: add types to signature when they're merged
def visualize_topics(s_document_term, s_document_topic):
# TODO: add types everywhere when they're merged
"""
Visualize the topics of your dataset. First input has
to be output of one of
- :meth:`texthero.representation.tfidf`
- :meth:`texthero.representation.count`
- :meth:`texthero.representation.term_frequency`
(tfidf suggested).
Second input can either be the result of
clustering, so output of one of
- :meth:`texthero.representation.kmeans`
- :meth:`texthero.representation.meanshift`
- :meth:`texthero.representation.dbscan`
or the result of a topic modelling function, so
one of
- :meth:`texthero.representation.lda`
- :meth:`texthero.representation.truncatedSVD`
(topic modelling output suggested).
The function uses the given clustering
or topic modelling from the second input, which relates
documents to topics. The first input
relates documents to terms. From those
two relations (documents->topics, documents->terms),
the function calculates a distribution of
documents to topics, and a distribution
of topics to terms. These distributions
are passed to `pyLDAvis <https://pyldavis.readthedocs.io/en/latest/>`_,
which visualizes them.
**To show the plot**:
@@ -385,20 +419,62 @@ def visualize_topics(s_document_term, s_document_topic): # TODO: add types to s
Parameters
----------
s_document_term: pd.DataFrame
One of
- :meth:`texthero.representation.tfidf`
- :meth:`texthero.representation.count`
- :meth:`texthero.representation.term_frequency`
s_document_topic: pd.Series
One of
- :meth:`texthero.representation.kmeans`
- :meth:`texthero.representation.meanshift`
- :meth:`texthero.representation.dbscan`
(using clustering functions, documents
that are not assigned to a cluster are
not considered in the visualization)
or one of
- :meth:`texthero.representation.lda`
- :meth:`texthero.representation.truncatedSVD`
Examples
--------
Using Clustering:
>>> import texthero as hero
>>> import pandas as pd
>>> from sklearn.datasets import fetch_20newsgroups
>>> newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
>>> s = pd.Series(newsgroups.data)
>>> s_tfidf = s.pipe(hero.clean).pipe(hero.tokenize).pipe(hero.tfidf, max_df=0.5, min_df=100)
>>> df = pd.read_csv("https://raw.githubusercontent.com/jbesomi/texthero/master/dataset/bbcsport.csv", usecols=["text"])
>>> # Use max_df=0.5, min_df=100 in tfidf to speed things up (fewer features).
>>> s_tfidf = df["text"].pipe(hero.clean).pipe(hero.tokenize).pipe(hero.tfidf, max_df=0.5, min_df=100)
>>> s_cluster = s_tfidf.pipe(hero.pca, n_components=20).pipe(hero.dbscan)
>>> hero.visualize_topics(s_tfidf, s_cluster) # doctest: +SKIP
>>> # Display in a new browser window:
>>> hero.display_browser(hero.visualize_topics(s_tfidf, s_cluster)) # doctest: +SKIP
>>> # Display inside the current Jupyter Notebook:
>>> hero.display_notebook(hero.visualize_topics(s_tfidf, s_cluster)) # doctest: +SKIP
Using LDA:
>>> import texthero as hero
>>> import pandas as pd
>>> df = pd.read_csv("https://raw.githubusercontent.com/jbesomi/texthero/master/dataset/bbcsport.csv")
>>> # Use max_df=0.5, min_df=100 in tfidf to speed things up (fewer features).
>>> s_tfidf = df["text"].pipe(hero.clean).pipe(hero.tokenize).pipe(hero.tfidf, max_df=0.5, min_df=100)
>>> s_lda = s_tfidf.pipe(hero.lda, n_components=20)
>>> # Display in a new browser window:
>>> hero.display_browser(hero.visualize_topics(s_tfidf, s_lda)) # doctest: +SKIP
>>> # Display inside the current Jupyter Notebook:
>>> hero.display_notebook(hero.visualize_topics(s_tfidf, s_lda)) # doctest: +SKIP
See Also
--------
`pyLDAvis <https://pyldavis.readthedocs.io/en/latest/>`_
TODO add tutorial link
"""
metadata_list = s_document_topic._metadata

@@ -412,13 +488,14 @@ def visualize_topics(s_document_term, s_document_topic): # TODO: add types to s
vectorizer = None

# Get / build matrices from input

(
s_document_term,
s_document_topic,
document_topic_matrix,
topic_term_matrix,
) = _get_matrices_for_visualize_topics(s_document_term, s_document_topic, vectorizer)
) = _get_matrices_for_visualize_topics(
s_document_term, s_document_topic, vectorizer
)

vocab = list(s_document_term.columns.levels[1])
doc_lengths = list(s_document_term.sum(axis=1))
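
The docstring in this commit says the function turns the two relations (documents->topics, documents->terms) into a document-topic distribution and a topic-term distribution before handing them to pyLDAvis. The sketch below shows one way that could be done for hard cluster labels. It is an illustration under stated assumptions, not the code from this PR: the function name `distributions_from_clusters` and the toy data are hypothetical, and it assumes every cluster contains at least one non-zero term weight.

```python
import numpy as np


def distributions_from_clusters(document_term_matrix, cluster_labels):
    """Hypothetical helper: derive row-normalized document-topic and
    topic-term matrices from a dense document-term matrix and one
    cluster label per document."""
    document_term_matrix = np.asarray(document_term_matrix, dtype=float)
    cluster_labels = np.asarray(cluster_labels)

    # Sorted distinct labels; each one is treated as a topic.
    topics = np.unique(cluster_labels)

    # One-hot document-topic matrix: entry (d, t) is 1 if document d
    # belongs to topic t, else 0.
    document_topic = (cluster_labels[:, None] == topics[None, :]).astype(float)

    # Sum the term weights of all documents in each topic:
    # (topics x documents) @ (documents x terms) -> (topics x terms).
    topic_term = document_topic.T @ document_term_matrix

    # Normalize every row to sum to 1 so each row is a distribution.
    # Assumes no row sums to zero (see the note in the lead-in).
    document_topic_dist = document_topic / document_topic.sum(axis=1, keepdims=True)
    topic_term_dist = topic_term / topic_term.sum(axis=1, keepdims=True)
    return document_topic_dist, topic_term_dist


# Toy input: three documents, three terms, two clusters.
doc_term = [[1, 0, 2], [0, 3, 0], [0, 1, 1]]
labels = [0, 1, 1]
doc_topic_dist, topic_term_dist = distributions_from_clusters(doc_term, labels)
print(topic_term_dist)
```

For output of a topic model such as `hero.lda`, the document-topic matrix is already given and the topic-term matrix comes from the fitted model's `components_`, as the first branch of `_get_matrices_for_visualize_topics` shows, so only the normalization step would apply.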