diff --git a/CHANGELOG.md b/CHANGELOG.md
index c07fbe8..9d1f6d6 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,8 +2,10 @@
 
 ## 3.2.3
 
-2022-02-??
+2022-03-06
 
+ * handles missing `noun_chunks` in some language models (e.g., "ru")
+ * adds *TopicRank* algorithm; kudos @tomaarsen
  * improved test suite; fixed tests for newer spacy releases; kudos @tomaarsen
diff --git a/README.md b/README.md
index 6dd1184..0c4638f 100644
--- a/README.md
+++ b/README.md
@@ -20,6 +20,7 @@ This includes the family of
 - *TextRank* by [[mihalcea04textrank]](https://derwen.ai/docs/ptr/biblio/#mihalcea04textrank)
 - *PositionRank* by [[florescuc17]](https://derwen.ai/docs/ptr/biblio/#florescuc17)
 - *Biased TextRank* by [[kazemi-etal-2020-biased]](https://derwen.ai/docs/ptr/biblio/#kazemi-etal-2020-biased)
+ - *TopicRank* by [[bougouin-etal-2013-topicrank]](https://derwen.ai/docs/ptr/biblio/#bougouin-etal-2013-topicrank)
 
 Popular use cases for this library include:
diff --git a/docs/apidocs.yml b/docs/apidocs.yml
index 06a4231..5644290 100644
--- a/docs/apidocs.yml
+++ b/docs/apidocs.yml
@@ -3,4 +3,4 @@ apidocs:
   template: ref.jinja
   package: pytextrank
   git: https://github.com/DerwenAI/pytextrank/blob/main
-  includes: BaseTextRankFactory, BaseTextRank, PositionRankFactory, PositionRank, BiasedTextRankFactory, BiasedTextRank, Lemma, Phrase, Sentence, VectorElem
+  includes: BaseTextRankFactory, BaseTextRank, TopicRankFactory, TopicRank, PositionRankFactory, PositionRank, BiasedTextRankFactory, BiasedTextRank, Lemma, Phrase, Sentence, VectorElem
diff --git a/docs/biblio.md b/docs/biblio.md
new file mode 100644
index 0000000..05edb43
--- /dev/null
+++ b/docs/biblio.md
@@ -0,0 +1,97 @@
# Bibliography

Where possible, the bibliography entries use conventions at

for [*citation keys*](https://bibdesk.sourceforge.io/manual/BibDeskHelp_2.html).

Journal abbreviations come from

based on [*ISO 4*](https://en.wikipedia.org/wiki/ISO_4) standards.

Links to online versions of cited works use
[DOI](https://www.doi.org/)
for [*persistent identifiers*](https://www.crossref.org/education/metadata/persistent-identifiers/).
When available,
[*open access*](https://peerj.com/preprints/3119v1/)
URLs are listed as well.


## – B –

### bougouin-etal-2013-topicrank

["TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction"](https://aclanthology.org/I13-1062/)
[**Adrien Bougouin**](https://derwen.ai/s/t67yc26f6hcg), [**Florian Boudin**](https://derwen.ai/s/y89xdcbr3mj8), [**Béatrice Daille**](https://derwen.ai/s/25nynb9g79jt)
[*IJCNLP*](https://aclanthology.org/I13-1062/) pp. 543-551 (2013-10-14)
open: https://aclanthology.org/I13-1062.pdf
slides: http://adrien-bougouin.github.io/publications/2013/topicrank_ijcnlp_slides.pdf
> Keyphrase extraction is the task of identifying single or multi-word expressions that represent the main topics of a document. In this paper we present TopicRank, a graph-based keyphrase extraction method that relies on a topical representation of the document. Candidate keyphrases are clustered into topics and used as vertices in a complete graph. A graph-based ranking model is applied to assign a significance score to each topic. Keyphrases are then generated by selecting a candidate from each of the top-ranked topics. We conducted experiments on four evaluation datasets of different languages and domains. Results show that TopicRank significantly outperforms state-of-the-art methods on three datasets.
## – F –

### florescuc17

["PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents"](https://doi.org/10.18653/v1/P17-1102)
[**Corina Florescu**](https://derwen.ai/s/y3w6mvj2r9wv), [**Cornelia Caragea**](https://derwen.ai/s/v3rq24nf6426)
[*Comput Linguist Assoc Comput Linguist*](https://www.mitpressjournals.org/loi/coli) pp. 1105-1115 (2017-07-30)
DOI: 10.18653/v1/P17-1102
open: https://www.aclweb.org/anthology/P17-1102.pdf
> The large and growing amounts of online scholarly data present both challenges and opportunities to enhance knowledge discovery. One such challenge is to automatically extract a small set of keyphrases from a document that can accurately describe the document’s content and can facilitate fast information processing. In this paper, we propose PositionRank, an unsupervised model for keyphrase extraction from scholarly documents that incorporates information from all positions of a word’s occurrences into a biased PageRank. Our model obtains remarkable improvements in performance over PageRank models that do not take into account word positions as well as over strong baselines for this task. Specifically, on several datasets of research papers, PositionRank achieves improvements as high as 29.09%.


## – G –

### gleich15

["PageRank Beyond the Web"](https://doi.org/10.1137/140976649)
[**David Gleich**](https://derwen.ai/s/7zk738z8fn9t)
[*SIAM Review*](https://www.siam.org/publications/journals/siam-review-sirev) **57** 3 pp. 321-363 (2015-08-06)
DOI: 10.1137/140976649
open: https://www.cs.purdue.edu/homes/dgleich/publications/Gleich%202015%20-%20prbeyond.pdf
> Google's PageRank method was developed to evaluate the importance of web-pages via their link structure. The mathematics of PageRank, however, are entirely general and apply to any graph or network in any domain. Thus, PageRank is now regularly used in bibliometrics, social and information network analysis, and for link prediction and recommendation. It's even used for systems analysis of road networks, as well as biology, chemistry, neuroscience, and physics. We'll see the mathematics and ideas that unite these diverse applications.


## – K –

### kazemi-etal-2020-biased

["Biased TextRank: Unsupervised Graph-Based Content Extraction"](https://doi.org/10.18653/v1/2020.coling-main.144)
[**Ashkan Kazemi**](https://derwen.ai/s/rjsnrs5jhswk), [**Verónica Pérez-Rosas**](https://derwen.ai/s/svmndvvnndkv), [**Rada Mihalcea**](https://derwen.ai/s/wwrw59tbtzzp)
[*COLING*](https://www.aclweb.org/anthology/venues/coling/) **28** pp. 1642-1652 (2020-12-08)
DOI: 10.18653/v1/2020.coling-main.144
open: https://www.aclweb.org/anthology/2020.coling-main.144.pdf
> We introduce Biased TextRank, a graph-based content extraction method inspired by the popular TextRank algorithm that ranks text spans according to their importance for language processing tasks and according to their relevance to an input 'focus'. Biased TextRank enables focused content extraction for text by modifying the random restarts in the execution of TextRank. The random restart probabilities are assigned based on the relevance of the graph nodes to the focus of the task. We present two applications of Biased TextRank: focused summarization and explanation extraction, and show that our algorithm leads to improved performance on two different datasets by significant ROUGE-N score margins.
Much like its predecessor, Biased TextRank is unsupervised, easy to implement and orders of magnitude faster and lighter than current state-of-the-art Natural Language Processing methods for similar tasks.


## – M –

### mihalcea04textrank

["TextRank: Bringing Order into Text"](https://www.aclweb.org/anthology/W04-3252/)
[**Rada Mihalcea**](https://derwen.ai/s/wwrw59tbtzzp), [**Paul Tarau**](https://derwen.ai/s/vnfvsgvc9gfy)
[*EMNLP*](https://www.aclweb.org/anthology/venues/emnlp/) pp. 404-411 (2004-07-25)
open: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
> In this paper, the authors introduce TextRank, a graph-based ranking model for text processing, and show how this model can be successfully used in natural language applications.


## – P –

### page1998

["The PageRank Citation Ranking: Bringing Order to the Web"](http://ilpubs.stanford.edu:8090/422/)
[**Lawrence Page**](https://derwen.ai/s/mk6xj6cfrrxg), [**Sergey Brin**](https://derwen.ai/s/j636dghdyws5), [**Rajeev Motwani**](https://derwen.ai/s/9hhpmgjs7kwt), [**Terry Winograd**](https://derwen.ai/s/jdxk7fz84nzq)
[*Stanford InfoLab*](http://infolab.stanford.edu/) (1999-11-11)
open: http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
> The importance of a Web page is an inherently subjective matter, which depends on the readers' interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.


## – W –

### williams2016

["Summarizing documents"](https://mike.place/talks/pygotham/)
[**Mike Williams**](https://derwen.ai/s/2t2mbms2x4p3)
(2016-09-25)
> I've recently given a couple of talks (PyGotham video, PyGotham slides, Strata NYC slides) about text summarization. I cover three ways of automatically summarizing text. One is an extremely simple algorithm from the 1950s, one uses Latent Dirichlet Allocation, and one uses skip-thoughts and recurrent neural networks. The talk is conceptual, and avoids code and mathematics. So here is a list of resources if you're interested in text summarization and want to dive deeper. This list is hopefully also useful if you're interested in topic modelling or neural networks for other reasons.
diff --git a/docs/glossary.md b/docs/glossary.md
new file mode 100644
index 0000000..749bc4a
--- /dev/null
+++ b/docs/glossary.md
@@ -0,0 +1,353 @@
# Glossary


## – A –

### abstractive summarization
> Generating a short, concise summary which captures salient ideas of the source text, potentially using new phrases and sentences that may not appear in the source.




## – C –

### cloud
See also: [cloud computing](#cloud-computing)


### cloud computing
> The on-demand availability of computing resources over the Internet without direct active management by the user, often paid for on a short-term basis, giving the illusion of infinite computing resources available and thereby eliminating the need to plan far ahead for provisioning.
References:

 * http://www.wikidata.org/entity/Q483639
 * http://id.loc.gov/authorities/subjects/sh2008004883
 * https://csrc.nist.gov/publications/detail/sp/800-145/final
 * https://rise.cs.berkeley.edu/blog/a-berkeley-view-on-serverless-computing/



### computable content
> Interactive learning materials which leverage remote computation, e.g., Jupyter notebooks.



References:

 * https://youtu.be/EPftGvsXonc
 * https://www.wolfram.com/cdf/
 * https://www.slideshare.net/pacoid/computable-content



### coreference resolution
> Clustering mentions within a text that refer to the same underlying entities.



References:

 * https://paperswithcode.com/task/coreference-resolution
 * http://www.wikidata.org/entity/Q63087
 * http://nlpprogress.com/english/coreference_resolution.html



## – D –

### DL
See also: [deep learning](#deep-learning)


### data science
> An interdisciplinary field which emerged from industry, not academia, focused on deriving insights from data, emphasizing how to leverage curiosity and domain expertise, and applying increasingly advanced mathematics for novel business cases in response to surges in data rates and compute resources.



References:

 * https://ischoolonline.berkeley.edu/data-science/what-is-data-science/
 * https://hbr.org/2018/11/curiosity-driven-data-science
 * https://community.ibm.com/community/user/datascience/blogs/paco-nathan/2019/03/04/what-is-data-science
 * https://projecteuclid.org/euclid.aoms/1177704711
 * http://www.wikidata.org/entity/Q2374463



### data strategy
> The tools, processes, and practices that define how to manage and leverage data to make informed decisions.




### deep learning
> A family of machine learning methods based on artificial neural networks which use representation learning.



References:

 * https://en.wikipedia.org/wiki/Deep_learning
 * http://www.wikidata.org/entity/Q197536



## – E –

### eigenvector centrality
> Measuring the influence of a node within a network.



References:

 * http://www.wikidata.org/entity/Q28401090
 * https://demonstrations.wolfram.com/NetworkCentralityUsingEigenvectors/



### entity linking
> Recognizing named entities within a text, then disambiguating them by linking to specific contexts in a knowledge graph.


Broader:

 * [named entity recognition](#named-entity-recognition)
 * [knowledge graph](#knowledge-graph)


References:

 * http://www.wikidata.org/entity/Q17012245
 * https://paperswithcode.com/task/entity-linking
 * http://nlpprogress.com/english/entity_linking.html



### extractive summarization
> Summarizing the source text by identifying a subset of the sentences as the most important excerpts, then generating a sequence of them verbatim.




## – G –

### graph algorithms
> A family of algorithms that operate on graphs for network analysis, measurement, ranking, partitioning, and other methods that leverage graph theory.



References:

 * http://id.loc.gov/authorities/subjects/sh2002004605
 * https://networkx.org/documentation/stable/reference/algorithms/index.html



## – K –

### KG
See also: [knowledge graph](#knowledge-graph)


### KGC
See also: [knowledge graph conference](#knowledge-graph-conference)


### knowledge graph
> A knowledge base that uses a graph-structured data model, representing and annotating interlinked descriptions of entities, with an overlay of semantic metadata.
References:

 * https://www.poolparty.biz/what-is-a-knowledge-graph/
 * http://www.wikidata.org/entity/Q33002955



### knowledge graph conference
> Founded in 2019 at Columbia University, The Knowledge Graphs Conference is emerging as the premier source of learning around knowledge graph technologies. We believe knowledge graphs are an underutilized yet essential force for solving complex societal challenges like climate change, democratizing access to knowledge and opportunity, and capturing business value made possible by the AI revolution.




## – L –

### language model
> A statistical model used for predicting the next word or character within a document.


Broader:

 * [natural language](#natural-language)
 * https://derwen.ai/d/machine_learning


References:

 * http://www.wikidata.org/entity/Q3621696
 * https://paperswithcode.com/task/language-modelling
 * http://nlpprogress.com/english/language_modeling.html



### lemma graph
> A graph data structure used to represent links among phrases extracted from a source text, during the operation of the TextRank algorithm.

Described in: [[mihalcea04textrank]](../biblio/#mihalcea04textrank)




## – N –

### NER
See also: [named entity recognition](#named-entity-recognition)


### NLP
See also: [natural language](#natural-language)


### named entity recognition
> Extracting mentions of *named entities* from unstructured text, then annotating them with pre-defined categories.



References:

 * http://www.wikidata.org/entity/Q403574
 * http://nlpprogress.com/english/named_entity_recognition.html
 * https://paperswithcode.com/task/named-entity-recognition-ner



### natural language
> Intersection of computer science and linguistics, used to leverage data in the form of text, speech, and images to identify structure and meaning. Also used for enabling people and computer-based agents to interact using natural language.



References:

 * http://www.wikidata.org/entity/Q30642
 * http://id.loc.gov/authorities/subjects/sh88002425
 * https://plato.stanford.edu/entries/computational-linguistics/



## – P –

### personalized pagerank
> Using the *personalized teleportation behaviors* originally described for the PageRank algorithm to focus ranked results within a neighborhood of the graph, given a set of nodes as input.

Described in: [[page1998]](../biblio/#page1998), [[gleich15]](../biblio/#gleich15)




### phrase extraction
> Selecting representative phrases from a document as its characteristic entities; in contrast to *keyword* analysis.




## – R –

### RL
See also: [reinforcement learning](#reinforcement-learning)


### reinforcement learning
> Optimal control theory mixed with deep learning, where software agents learn to take actions within an environment and make sequences of decisions to maximize a cumulative reward -- typically stated in terms of a Markov decision process -- finding a balance between exploration (uncharted territory) and exploitation (current knowledge). Generally a reverse engineering of various psychological learning processes.



References:

 * https://docs.ray.io/en/master/rllib-models.html
 * http://www.wikidata.org/entity/Q830687
 * http://id.loc.gov/authorities/subjects/sh92000704



## – S –

### semantic relations
> Associations that exist between the meanings of phrases.




### stop words
> Words to be filtered out during natural language processing.
References:

 * http://www.wikidata.org/entity/Q80735
 * http://id.loc.gov/authorities/subjects/sh85046249



### summarization
> Producing a shorter version of one or more documents, while preserving most of the input's meaning.



References:

 * http://nlpprogress.com/english/summarization.html
 * http://www.wikidata.org/entity/Q1394144



## – T –

### text summarization
See also: [summarization](#summarization)


### textgraphs
> Use of graph algorithms for NLP, based on a graph representation of a source text.


Broader:

 * [natural language](#natural-language)
 * [graph algorithms](#graph-algorithms)


References:

 * http://www.wikidata.org/entity/Q18388823
 * http://www.gabormelli.com/RKB/Text_Graph
 * http://www.textgraphs.org/



### transformers
> A family of deep learning models, mostly used in NLP, which adopts the mechanism of *attention* to weigh the influence of different parts of the input data.


Broader:

 * [language model](#language-model)
 * [deep learning](#deep-learning)


References:

 * http://www.wikidata.org/entity/Q85810444
 * https://paperswithcode.com/methods/category/transformers

diff --git a/docs/ref.md b/docs/ref.md
index fcf0522..cd13bbc 100644
--- a/docs/ref.md
+++ b/docs/ref.md
@@ -250,6 +250,130 @@ the `altair` chart being rendered



## [`TopicRankFactory` class](#TopicRankFactory)

A factory class that provides the document with its instance of
`TopicRank`

---
#### [`__init__` method](#pytextrank.TopicRankFactory.__init__)
[*\[source\]*](https://github.com/DerwenAI/pytextrank/blob/main/pytextrank/topicrank.py#L31)

```python
__init__(edge_weight=1.0, pos_kept=None, token_lookback=3, scrubber=None, stopwords=None, threshold=0.25, method="average")
```
Constructor for the factory class.



---
#### [`__call__` method](#pytextrank.TopicRankFactory.__call__)
[*\[source\]*](https://github.com/DerwenAI/pytextrank/blob/main/pytextrank/topicrank.py#L58)

```python
__call__(doc)
```
Set the extension attributes on a `spaCy` [`Doc`](https://spacy.io/api/doc)
document to create a *pipeline component* for `TopicRank` as
a stateful component, invoked when the document gets processed.

See:

 * `doc` : `spacy.tokens.doc.Doc`
a document container, providing the annotations produced by earlier stages of the `spaCy` pipeline



## [`TopicRank` class](#TopicRank)

Implements the *TopicRank* algorithm described by
[[bougouin-etal-2013-topicrank]](https://derwen.ai/docs/ptr/biblio/#bougouin-etal-2013-topicrank)
deployed as a `spaCy` pipeline component.

This class does not get called directly; instantiate its factory
instead.

Algorithm Overview:

1. Preprocessing: Sentence segmentation, word tokenization, POS tagging.
   After this stage, we have preprocessed text.
2. Candidate extraction: Extract sequences of nouns and adjectives (i.e. noun chunks).
   After this stage, we have a list of keyphrases that may be topics.
3. Candidate clustering: Hierarchical Agglomerative Clustering algorithm with average
   linking using simple set-based overlap of lemmas. Similarity is achieved at > 25%
   overlap. **Note**: PyTextRank deviates from the original algorithm here, which uses
   stems rather than lemmas.
   After this stage, we have a list of topics.
4. Candidate ranking: Apply *TextRank* on a complete graph, with topics as nodes
   (i.e.
clusters derived in the last step), where edge weights are higher between
   topics that appear closer together within the document.
   After this stage, we have a ranked list of topics.
5. Candidate selection: Select the first occurring keyphrase from each topic to
   represent that topic.
   After this stage, we have a ranked list of topics, with a keyphrase to represent
   the topic.

---
#### [`__init__` method](#pytextrank.TopicRank.__init__)
[*\[source\]*](https://github.com/DerwenAI/pytextrank/blob/main/pytextrank/topicrank.py#L120)

```python
__init__(doc, edge_weight, pos_kept, token_lookback, scrubber, stopwords, threshold, method)
```
Constructor for a factory used to instantiate the PyTextRank pipeline components.

 * `edge_weight` : `float`
default weight for an edge

 * `pos_kept` : `typing.List[str]`
parts of speech tags to be kept; adjust this if strings representing the POS tags change

 * `token_lookback` : `int`
the window for neighboring tokens – similar to a *skip gram*

 * `scrubber` : `typing.Callable`
optional "scrubber" function to clean up punctuation from a token; if `None` then defaults to `pytextrank.default_scrubber`; when running, PyTextRank will throw a `FutureWarning` warning if the configuration uses a deprecated approach for a scrubber function

 * `stopwords` : `typing.Dict[str, typing.List[str]]`
optional dictionary of `lemma: [pos]` items to define the *stop words*, where each item has a key as a lemmatized token and a value as a list of POS tags; may be a file name (string) or a [`pathlib.Path`](https://docs.python.org/3/library/pathlib.html) for a JSON file; otherwise throws a `TypeError` exception

 * `threshold` : `float`
threshold used in *TopicRank* candidate clustering; the original algorithm uses 0.25

 * `method` : `str`
clustering method used in *TopicRank* candidate clustering: see [`scipy.cluster.hierarchy.linkage`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html) for valid methods; the original algorithm uses "average"



---
#### [`calc_textrank` method](#pytextrank.TopicRank.calc_textrank)
[*\[source\]*](https://github.com/DerwenAI/pytextrank/blob/main/pytextrank/topicrank.py#L307)

```python
calc_textrank()
```
Construct a complete graph using potential topics as nodes,
then apply the *TextRank* algorithm to return the top-ranked phrases.

This method represents the heart of the *TopicRank* algorithm.

 * *returns* : `typing.List[pytextrank.base.Phrase]`
list of ranked phrases, in descending order



---
#### [`reset` method](#pytextrank.TopicRank.reset)
[*\[source\]*](https://github.com/DerwenAI/pytextrank/blob/main/pytextrank/topicrank.py#L367)

```python
reset()
```
Reinitialize the data structures needed for extracting phrases,
removing any pre-existing state.



## [`PositionRankFactory` class](#PositionRankFactory)

A factory class that provides the document with its instance of
diff --git a/mkdocs.yml b/mkdocs.yml
index 4a096af..dd59a96 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -6,8 +6,6 @@ site_author: pytextrank contributors, with Derwen, Inc.
 repo_url: https://github.com/DerwenAI/pytextrank
 repo_name: DerwenAI/pytextrank
 
-mkrefs_config: mkrefs.yml
-
 copyright: Source code and documentation are licensed under an MIT License; Copyright © 2016-2022 Derwen, Inc.
 nav:
@@ -43,7 +41,6 @@ theme:
 plugins:
   - git-revision-date
   - mknotebooks
-  - mkrefs
 
 extra_css:
   - stylesheets/extra.css
diff --git a/pkg_doc.py b/pkg_doc.py
index 989fc3e..566aeb6 100755
--- a/pkg_doc.py
+++ b/pkg_doc.py
@@ -19,6 +19,8 @@
 class_list = [
     "BaseTextRankFactory",
     "BaseTextRank",
+    "TopicRankFactory",
+    "TopicRank",
     "PositionRankFactory",
     "PositionRank",
     "BiasedTextRankFactory",
diff --git a/pytextrank/base.py b/pytextrank/base.py
index 164e74a..61d3dbe 100644
--- a/pytextrank/base.py
+++ b/pytextrank/base.py
@@ -357,7 +357,14 @@ def calc_textrank (
         # agglomerate the lemmas ranked in the lemma graph into ranked
         # phrases, leveraging information from earlier stages of the
         # pipeline: noun chunks and named entities
-        nc_phrases: typing.Dict[Span, float] = self._collect_phrases(self.doc.noun_chunks, self.ranks)
+        nc_phrases: typing.Dict[Span, float] = {}
+
+        try:
+            nc_phrases = self._collect_phrases(self.doc.noun_chunks, self.ranks)
+        except AttributeError:
+            # some languages do not have `noun_chunks` support in spaCy models
+            pass
+
         ent_phrases: typing.Dict[Span, float] = self._collect_phrases(self.doc.ents, self.ranks)
         all_phrases: typing.Dict[Span, float] = { **nc_phrases, **ent_phrases }
diff --git a/pytextrank/topicrank.py b/pytextrank/topicrank.py
index 574ea8c..709277a 100644
--- a/pytextrank/topicrank.py
+++ b/pytextrank/topicrank.py
@@ -6,15 +6,15 @@ Implements the *TopicRank* algorithm.
 """
 
-import time
-import typing
 from collections import defaultdict
 from functools import lru_cache
+import time
+import typing
 
-import networkx as nx  # type: ignore # pylint: disable=E0401
 from scipy.cluster.hierarchy import fcluster, linkage  # type: ignore # pylint: disable=E0401
 from scipy.spatial.distance import pdist  # type: ignore # pylint: disable=E0401
 from spacy.tokens import Doc, Span  # type: ignore # pylint: disable=E0401
+import networkx as nx  # type: ignore # pylint: disable=E0401
 
 from .base import BaseTextRank, BaseTextRankFactory, Phrase, StopWordsLike
@@ -28,7 +28,7 @@ class TopicRankFactory (BaseTextRankFactory):
     _CLUSTER_THRESHOLD: float = 0.25
     _CLUSTER_METHOD: str = "average"
 
-    def __init__(
+    def __init__ (
         self,
         *,
         edge_weight: float = BaseTextRankFactory._EDGE_WEIGHT,
@@ -37,8 +37,11 @@ def __init__(
         scrubber: typing.Optional[typing.Callable] = None,
         stopwords: typing.Optional[StopWordsLike] = None,
         threshold: float = _CLUSTER_THRESHOLD,
-        method: str = _CLUSTER_METHOD
-    ) -> None:
+        method: str = _CLUSTER_METHOD,
+    ) -> None:
+        """
+Constructor for the factory class.
+        """
         super().__init__(
             edge_weight=edge_weight,
             pos_kept=pos_kept,
@@ -51,10 +54,11 @@ def __init__(
         self.threshold: float = threshold
         self.method: str = method
 
-    def __call__(
+
+    def __call__ (
         self,
         doc: Doc,
-    ) -> Doc:
+    ) -> Doc:
         """
 Set the extension attributes on a `spaCy` [`Doc`](https://spacy.io/api/doc)
 document to create a *pipeline component* for `TopicRank` as
 a stateful component, invoked when the document gets processed.
@@ -84,10 +88,10 @@ def __call__(
 
 class TopicRank (BaseTextRank):
-    """Implements the *TopicRank* algorithm described by
-TODO, deployed as a `spaCy` pipeline component.
-
-https://aclanthology.org/I13-1062.pdf
+    """
+Implements the *TopicRank* algorithm described by
+[[bougouin-etal-2013-topicrank]](https://derwen.ai/docs/ptr/biblio/#bougouin-etal-2013-topicrank)
+deployed as a `spaCy` pipeline component.
 
 This class does not get called directly; instantiate its factory
 instead.
@@ -123,7 +127,7 @@ def __init__(
         stopwords: typing.Dict[str, typing.List[str]],
         threshold: float,
         method: str,
-    ) -> None:
+    ) -> None:
         """
 Constructor for a factory used to instantiate the PyTextRank pipeline components.
@@ -156,7 +160,11 @@ def __init__(
         self.threshold: float = threshold
         self.method: str = method
 
-    def _cluster(self, candidates: typing.List[Span]) -> typing.List[typing.List[Span]]:
+
+    def _cluster (
+        self,
+        candidates: typing.List[Span],
+    ) -> typing.List[typing.List[Span]]:
         """
 Cluster candidates using Hierarchical Agglomerative Clustering (HAC),
 where similarity is defined by overlap of lemmas in candidates.
@@ -185,8 +193,10 @@ def _cluster(self, candidates: typing.List[Span]) -> typing.List[typing.List[Spa
         # |candidates|x|bag_of_words| matrix
         # matrix = [[0] * len(bag_of_words) for _ in candidates]
         matrix = []
+
         for candidate in candidates:
             matrix.append([0] * len(bag_of_words))
+
             for term in candidate:
                 if not term.is_stop:
                     try:
@@ -198,8 +208,10 @@ def _cluster(self, candidates: typing.List[Span]) -> typing.List[typing.List[Spa
         # using a threshold of 0.01 below 1 - threshold.
         # So, 0.74 for the default 0.25 threshold.
         pairwise_dist = pdist(matrix, "jaccard")
+
         if not pairwise_dist.size:
             return [[candidate] for candidate in candidates]
+
         raw_clusters = linkage(pairwise_dist, method=self.method)
         cluster_ids = fcluster(
             raw_clusters, t=0.99 - self.threshold, criterion="distance"
@@ -208,12 +220,16 @@ def _cluster(self, candidates: typing.List[Span]) -> typing.List[typing.List[Spa
         # Map cluster_ids to the corresponding candidates, and then
         # ignore the cluster id keys.
         clusters = defaultdict(list)
+
         for cluster_id, candidate in zip(cluster_ids, candidates):
             clusters[cluster_id].append(candidate)
 
         return list(clusters.values())
 
-    def _get_candidates(self) -> typing.List[Span]:
+
+    def _get_candidates (
+        self
+    ) -> typing.List[Span]:
         """
 Return a list of spans with candidate topics, such that the start of
 each candidate is a noun chunk that was trimmed of stopwords or tokens
@@ -222,27 +238,35 @@ def _get_candidates(self) -> typing.List[Span]:
 
     returns:
 list of candidate spans
         """
-        noun_chunks = list(self.doc.noun_chunks)
-        candidates = []
-        for chunk in noun_chunks:
-            for token in chunk:
-                if self._keep_token(token):
-                    candidates.append(self.doc[token.i : chunk.end])
-                    break
+        candidates: typing.List[Span] = []
+
+        try:
+            noun_chunks = list(self.doc.noun_chunks)
+
+            for chunk in noun_chunks:
+                for token in chunk:
+                    if self._keep_token(token):
+                        candidates.append(self.doc[token.i : chunk.end])
+                        break
+        except AttributeError:
+            # some languages do not have `noun_chunks` support in spaCy models
+            pass
 
         return candidates
 
+
     @property  # type: ignore
     @lru_cache(maxsize=1)
-    def node_list(self) -> typing.List[typing.Tuple[Span, ...]]:  # type: ignore
+    def node_list ( # type: ignore
+        self
+    ) -> typing.List[typing.Tuple[Span, ...]]:
         """
 Build a list of vertices for the graph, cached for efficiency.
 
     returns:
 list of nodes
         """
-
         # Rely on spaCy to perform *preprocessing* and *candidate extraction*
         # through ``noun_chunks``, thus completing the first two steps of
         # the *TopicRank* algorithm.
@@ -255,35 +278,42 @@ def node_list(self) -> typing.List[typing.Tuple[Span, ...]]:  # type: ignore
 
         return clustered
 
+
     @property
-    def edge_list( # type: ignore
+    def edge_list ( # type: ignore
         self,
-    ) -> typing.List[
-        typing.Tuple[typing.List[Span], typing.List[Span], typing.Dict[str, float]]
-    ]:
+    ) -> typing.List[typing.Tuple[typing.List[Span], typing.List[Span], typing.Dict[str, float]]]:
         """
 Build a list of weighted edges for the graph.
 
     returns:
 list of weighted edges
         """
-        weighted_edges = []
+
+        weighted_edges = []
+
         for i, source_topic in enumerate(self.node_list):  # type: ignore
             for target_topic in self.node_list[i + 1 :]:  # type: ignore
                 weight = 0.0
+
                 for source_member in source_topic:
                     for target_member in target_topic:
                         distance = abs(source_member.start - target_member.start)
+
                         if distance:
                             weight += 1.0 / distance
+
                 weight_dict = {"weight": weight * self.edge_weight}
                 weighted_edges.append((source_topic, target_topic, weight_dict))
                 weighted_edges.append((target_topic, source_topic, weight_dict))
 
         return weighted_edges
 
-    def calc_textrank(self) -> typing.List[Phrase]:
+
+    def calc_textrank (
+        self
+    ) -> typing.List[Phrase]:
         """
 Construct a complete graph using potential topics as nodes,
 then apply the *TextRank* algorithm to return the top-ranked phrases.
@@ -295,6 +323,7 @@ def calc_textrank(self) -> typing.List[Phrase]:
         """
         t0 = time.time()
         self.reset()
+
         # for *TopicRank*, the constructed graph is complete
         # with topics as nodes and a distance measure between
         # two topics as the weights for the edge between them.
@@ -329,6 +358,7 @@ def calc_textrank(self) -> typing.List[Phrase]:
             )
             for topic, score in self.ranks.items()
         ]
+
         phrase_list: typing.List[Phrase] = sorted(
             raw_phrase_list, key=lambda p: p.rank, reverse=True
         )
@@ -338,7 +368,10 @@ def calc_textrank(self) -> typing.List[Phrase]:
 
         return phrase_list
 
-    def reset(self) -> None:
+
+    def reset (
+        self
+    ) -> None:
         """
 Reinitialize the data structures needed for extracting phrases,
 removing any pre-existing state.
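
To make the clustering hunks above easier to follow, here is a minimal, self-contained sketch of the same HAC step as `_cluster`, using the identical `scipy` calls (`pdist` with Jaccard distance, then `linkage` and `fcluster`). The candidate lemma bags below are invented for illustration; in the component, the candidates come from the `noun_chunks` gathered by `_get_candidates`.

```python
# Standalone sketch of the TopicRank candidate-clustering step, mirroring
# the `_cluster` method above; the candidate lemma bags are invented
# examples, not output from the pipeline.
from collections import defaultdict

from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

candidates = [
    ["deep", "blue"],
    ["blue"],
    ["world", "chess", "champion"],
    ["chess"],
]

# vocabulary across all candidate lemmas
bag_of_words = sorted({lemma for cand in candidates for lemma in cand})
index = {lemma: i for i, lemma in enumerate(bag_of_words)}

# |candidates| x |bag_of_words| binary incidence matrix
matrix = [[0] * len(bag_of_words) for _ in candidates]

for row, cand in zip(matrix, candidates):
    for lemma in cand:
        row[index[lemma]] = 1

# Jaccard distance between rows, then average-link HAC; two candidates
# fall into the same topic when they merge below the distance cutoff,
# which is 0.99 - 0.25 = 0.74 for the default threshold, as in `_cluster`
threshold = 0.25
pairwise_dist = pdist(matrix, "jaccard")
raw_clusters = linkage(pairwise_dist, method="average")
cluster_ids = fcluster(raw_clusters, t=0.99 - threshold, criterion="distance")

topics = defaultdict(list)

for cluster_id, cand in zip(cluster_ids, candidates):
    topics[cluster_id].append(cand)

# expected grouping: {"deep blue", "blue"} and {"world chess champion", "chess"}
print(list(topics.values()))
```

Note that `fcluster` with the `"distance"` criterion cuts the dendrogram at a fixed height, so the `threshold` parameter acts on lemma overlap directly rather than on a target number of clusters.
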
diff --git a/pytextrank/version.py b/pytextrank/version.py
index 8c87687..4ec590d 100644
--- a/pytextrank/version.py
+++ b/pytextrank/version.py
@@ -10,7 +10,7 @@
 ## Python version checking
 
 MIN_PY_VERSION: typing.Tuple = (3, 7,)
-__version__: str = "3.2.2"
+__version__: str = "3.2.3"
 
 
 def _versify (
diff --git a/requirements-dev.txt b/requirements-dev.txt
index 2f5212b..3e712cd 100644
--- a/requirements-dev.txt
+++ b/requirements-dev.txt
@@ -3,13 +3,15 @@ codespell
 coverage
 flask
 grayskull
-mistune >= 2.0.1 # not directly required, pinned by Snyk to avoid a vulnerability
+jupyterlab >= 3.1.4
+mistune
 mkdocs-git-revision-date-plugin
 mkdocs-material
 mknotebooks
 mkrefs >= 0.2.0
 mypy
-nbconvert
+nbconvert >= 6.4
+nbmake >= 1.0
 notebook >= 6.1.5
 pipdeptree
 pre-commit
diff --git a/sample.py b/sample.py
index 48661d5..915e143 100755
--- a/sample.py
+++ b/sample.py
@@ -1,12 +1,14 @@
 #!/usr/bin/env python
 # -*- coding: utf-8 -*-
 
-from icecream import ic  # pylint: disable=E0401
 import pathlib
-import pytextrank  # pylint: disable=W0611
-import spacy  # pylint: disable=E0401
 import sys  # pylint: disable=W0611
 
+from icecream import ic  # pylint: disable=E0401
+import spacy  # pylint: disable=E0401
+
+import pytextrank  # pylint: disable=W0611
+
 
 ######################################################################
 ## sample usage
@@ -15,7 +17,7 @@
 nlp = spacy.load("en_core_web_sm")
 
 # add PyTextRank into the spaCy pipeline
-#nlp.add_pipe("positionrank")
+# NB: substitute `"textrank"` with the name of other algorithms, e.g., `"positionrank"`
 nlp.add_pipe("textrank")
 
 # parse the document
@@ -36,28 +38,32 @@
     print("{:.4f} {:5d}  {}".format(phrase.rank, phrase.count, phrase.text))
     ic(phrase.chunks)
 
-print("\n----\n")
 
 # switch to a longer text document...
+print("\n----\n")
+print("dat/lee.txt:")
+
 text = pathlib.Path("dat/lee.txt").read_text()
 doc = nlp(text)
 
 for phrase in doc._.phrases[:20]:
     ic(phrase)
 
-print("\n----\n")
 
 # to show use of stopwords: first we output a baseline...
+print("\n----\n")
+print("dat/gen.txt:")
+
 text = pathlib.Path("dat/gen.txt").read_text()
 doc = nlp(text)
 
 for phrase in doc._.phrases[:10]:
     ic(phrase)
 
-print("\n----\n")
-
 # now add `"word": ["NOUN"]` to the stop words, to remove instances
 # of `"word"` or `"words"` then see how the ranked phrases differ...
+print("\n----\n")
+print("stop words:")
 
 nlp = spacy.load("en_core_web_sm")
 nlp.add_pipe("textrank", config={ "stopwords": { "word": ["NOUN"] } })
@@ -67,32 +73,40 @@
 for phrase in doc._.phrases[:10]:
     ic(phrase)
 
-print("\n----\n")
 
 # generate a GraphViz doc to visualize the lemma graph
 tr = doc._.textrank
 tr.write_dot(path="lemma_graph.dot")
 
+
 # summarize the document based on its top 15 phrases,
 # yielding its top 5 sentences...
+print("\n----\n")
+print("extractive summarization:")
+
 for sent in tr.summary(limit_phrases=15, limit_sentences=5):
     ic(sent)
 
-print("\n----\n")
 
-# show use of Biased TextRank algorithm
+# compare results among the implemented textgraph algorithms
 EXPECTED_PHRASES = [
     "grandmaster Lee Sedol",
     "Lee Sedol",
     "Deep Blue",
     "world chess champion Gary Kasparov",
+    "chess",
     "Gary Kasparov",
     "the following year",
     "Kasparov",
 ]
 
+
+# show use of TopicRank algorithm
+print("\n----\n")
+print("TopicRank:")
+
 nlp = spacy.load("en_core_web_sm")
-nlp.add_pipe("biasedtextrank")
+nlp.add_pipe("topicrank")
 
 text = pathlib.Path("dat/lee.txt").read_text()
 doc = nlp(text)
 
@@ -100,11 +114,41 @@
 for phrase in doc._.phrases[:len(EXPECTED_PHRASES)]:
     ic(phrase)
 
+tr = doc._.textrank
+
+
+# show use of PositionRank algorithm
 print("\n----\n")
+print("PositionRank:")
+
+nlp = spacy.load("en_core_web_sm")
+nlp.add_pipe("positionrank")
+
+text = pathlib.Path("dat/lee.txt").read_text()
+doc = nlp(text)
+
+for phrase in doc._.phrases[:len(EXPECTED_PHRASES)]:
+    ic(phrase)
+
 tr = doc._.textrank
 
-# note how the bias parameters get set here, to help emphasize the
-# *focus set*
+
+# show use of Biased TextRank algorithm
+print("\n----\n")
+print("Biased TextRank:")
+
+nlp = spacy.load("en_core_web_sm")
+nlp.add_pipe("biasedtextrank")
+
+text = pathlib.Path("dat/lee.txt").read_text()
+doc = nlp(text)
+
+for phrase in doc._.phrases[:len(EXPECTED_PHRASES)]:
+    ic(phrase)
+
+# note how the bias parameters get set here, to help emphasize
+# the *focus set*
+tr = doc._.textrank
 
 phrases = tr.change_focus(
     focus="It wasn't until the following year that Deep Blue topped Kasparov over the course of a six-game contest.",
@@ -112,6 +156,9 @@
     default_bias=0.0,
 )
 
+print("\n----\n")
+ic(EXPECTED_PHRASES)
+
 for phrase in phrases[:len(EXPECTED_PHRASES)]:
     ic(phrase.text)
-    assert phrase.text in EXPECTED_PHRASES  # nosec
+    ic(phrase.text in EXPECTED_PHRASES)
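
As a quick smoke test for the new component, a condensed usage sketch follows; it assumes the `en_core_web_sm` model and this branch of `pytextrank` are installed, and mirrors the `sample.py` calls above rather than adding any new API surface.

```python
# Condensed usage sketch for the new "topicrank" component, assuming the
# `en_core_web_sm` model and this branch of `pytextrank` are installed.
import spacy
import pytextrank  # noqa: F401  # registers the spaCy pipeline factories

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("topicrank")

doc = nlp(
    "Deep Blue topped world chess champion Gary Kasparov "
    "over the course of a six-game contest."
)

# each ranked Phrase carries a rank, count, and representative text for a topic
for phrase in doc._.phrases[:10]:
    print("{:.4f} {:5d}  {}".format(phrase.rank, phrase.count, phrase.text))
```

Swapping `"topicrank"` for `"textrank"`, `"positionrank"`, or `"biasedtextrank"` exercises the other pipeline components the same way, as `sample.py` demonstrates.
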