Commit 3f61b59: make lemma_graph undirected

Ankush-Chander committed Sep 7, 2023 (1 parent: 73ee61b)

Showing 2 changed files with 9 additions and 7 deletions.
pytextrank/base.py (12 changes: 7 additions & 5 deletions)
@@ -309,7 +309,7 @@ def __init__ (
         # effectively, performs the same work as the `reset()` method;
         # called explicitly here for the sake of type annotations
         self.elapsed_time: float = 0.0
-        self.lemma_graph: nx.DiGraph = nx.DiGraph()
+        self.lemma_graph: nx.Graph = nx.Graph()
         self.phrases: typing.List[Phrase] = []
         self.ranks: typing.Dict[Lemma, float] = {}
         self.seen_lemma: typing.Dict[Lemma, typing.Set[int]] = OrderedDict()
@@ -323,7 +323,7 @@ def reset (
         removing any pre-existing state.
         """
         self.elapsed_time = 0.0
-        self.lemma_graph = nx.DiGraph()
+        self.lemma_graph = nx.Graph()
         self.phrases = []
         self.ranks = {}
         self.seen_lemma = OrderedDict()
@@ -400,15 +400,15 @@ def get_personalization ( # pylint: disable=R0201

     def _construct_graph (
         self
-        ) -> nx.DiGraph:
+        ) -> nx.Graph:
         """
         Construct the
         [*lemma graph*](https://derwen.ai/docs/ptr/glossary/#lemma-graph).

         returns:
         a directed graph representing the lemma graph
         """
-        g = nx.DiGraph()
+        g = nx.Graph()

         # add nodes made of Lemma(lemma, pos)
         g.add_nodes_from(self.node_list)

yamika-g commented Oct 29, 2024 (inline, on "a directed graph representing the lemma graph"):
The annotation says this is a directed graph. Might consider changing that?
@@ -571,6 +571,8 @@ def _calc_discounted_normalised_rank (
         returns:
         normalized rank metric
         """
+        if len(span) < 1:
+            return 0.0
         non_lemma = len([tok for tok in span if tok.pos_ not in self.pos_kept])
         non_lemma_discount = len(span) / (len(span) + (2.0 * non_lemma) + 1.0)

@@ -877,7 +879,7 @@ def write_dot (
         path:
             path for the output file; defaults to `"graph.dot"`
         """
-        dot = graphviz.Digraph()
+        dot = graphviz.Graph()

         for lemma in self.lemma_graph.nodes():
             rank = self.ranks[lemma]
tests/test_base.py (4 changes: 2 additions & 2 deletions)
@@ -154,13 +154,13 @@ def test_stop_words ():
         for phrase in doc._.phrases[:5]
     ]

-    assert "words" in phrases
+    assert "sentences" in phrases

     # add `"word": ["NOUN"]` to the *stop words*, to remove instances
     # of `"word"` or `"words"` then see how the ranked phrases differ?

     nlp2 = spacy.load("en_core_web_sm")
-    nlp2.add_pipe("textrank", config={ "stopwords": { "word": ["NOUN"] } })
+    nlp2.add_pipe("textrank", config={ "stopwords": { "sentence": ["NOUN"] } })

     with open("dat/gen.txt", "r") as f:
         doc = nlp2(f.read())

5 comments on commit 3f61b59

@yamika-g commented on 3f61b59, Oct 29, 2024:


Hi! Fellow data scientist here. I'm using your work for a project! Can you please explain why you chose to make the lemma graph undirected instead of directed? Doesn't it make more sense to use a directed graph with the PageRank algorithm?

The problem with nx.Graph() is that if you feed it directional information (as I was doing), it simply drops some of it instead of consolidating it. For example, take the code below:

import networkx as nx

weighted_edges = [
    ("come", "play", {'weight': 3.0}),
    ("play", "with", {'weight': 2.0}),
    ("with", "play", {'weight': 11.0}),
    ("say", "nun", {'weight': 2.0}),
    ("nun", "calm", {'weight': 1.0}),
    ("late", "calm", {'weight': 3.0}),
    ("sleep", "sis", {'weight': 1.0}),
    ("sis", "sleep", {'weight': 2.0}),
    ("late", "sis", {'weight': 1.0}),
    ("  ", "sleep", {'weight': 9.0}),
    ("back", "ah", {'weight': 12.0}),
]

graph = nx.Graph()
graph.add_edges_from(weighted_edges)

for edge in graph.edges(data=True):
    print(edge)

You'll see that the 2nd edge (play --> with) (weight = 2) and the 3rd edge (with --> play) (weight = 11) are bidirectional. But when I build an undirected graph from these and look at graph.edges, only (play --> with) has been kept, and surprisingly it has a weight of 11 instead of 2; the edge (with --> play) is nowhere to be found in the graph. Surely that affects the PageRank results?
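Note: in an undirected nx.Graph, (u, v) and (v, u) name the same edge, and add_edges_from() updates an existing edge's attribute dict, so the later weight (11) silently overwrites the earlier one (2). A minimal sketch that consolidates both directions instead, reusing the weighted_edges list above:

import networkx as nx

graph = nx.Graph()

for u, v, data in weighted_edges:
    if graph.has_edge(u, v):
        # the edge already exists (possibly added in the other direction): accumulate
        graph[u][v]["weight"] += data["weight"]
    else:
        graph.add_edge(u, v, weight=data["weight"])

print(graph["play"]["with"])  # {'weight': 13.0}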

@Ankush-Chander (Contributor, Author) replied:


Hi @yamika-g ,
The PageRank algorithm works well with directed graphs, especially where there is a strong notion of incoming/outgoing edges, e.g. web links, where linking to a page is different from being linked to.

However, in the TextRank paper the authors showed that, on text, undirected graphs work better than directed graphs (considering either direction).

To quote:

    Regardless of the direction chosen for the arcs, results obtained with directed graphs are worse than results obtained with undirected graphs, which suggests that despite a natural flow in running text, there is no natural “direction” that can be established between cooccurring words.

Hence the lemma graph is treated as an undirected graph.
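A tiny sketch of that point, using made-up words and networkx's pagerank:

import networkx as nx

edges = [("natural", "language"), ("language", "processing")]

dg = nx.DiGraph(edges)  # direction imposed by running text
ug = nx.Graph(edges)    # plain co-occurrence, no direction

# directed: "natural" has no incoming edges, so it gets only the teleport share;
# undirected: every co-occurring word feeds rank back to its neighbours
print(nx.pagerank(dg))
print(nx.pagerank(ug))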

@yamika-g replied:

I see. Thanks for your answer! I still see one problem: as I mentioned in my previous comment, if you run the code I provided, one direction of the edges gets dropped. In this example, (play --> with) (weight = 2) and (with --> play) (weight = 11) are bidirectional, and both exist in the edge list. But when I build an undirected graph from them and look at graph.edges, only one of the edges has been kept and the other dropped. I think that is causing us to lose information, and nx.Graph() simply keeps whichever edge attributes were added last.

In order to prevent that, I want to count the edges in such a way that, for a bidirectional edge, both directions contribute to the count. So the two edges above should both be normalized to (play, with), with a total weight of 11 + 2 = 13. I think that's a better way to judge the importance of an edge; considering just one direction gives us only half the information.

So here's the change that I'm making for my work:


def edge_list (
    self
    ) -> typing.List[typing.Tuple[Lemma, Lemma, typing.Dict[str, float]]]:
    edges: typing.List[typing.Tuple[Lemma, Lemma]] = []

    for sent in self.doc.sents:
        h = [
            Lemma(token.text)  # node is just the token text, no POS tag
            for token in sent
            if self._keep_token(token) and token.text.strip() != ""
        ]

        for hop in range(self.token_lookback):
            for idx, node in enumerate(h[: -1 - hop]):
                nbor = h[hop + idx + 1]
                # sort the pair so that symmetric edges are counted together
                edge = tuple(sorted((node, nbor)))
                edges.append(edge)

    # include weight on the edge: (2, 3, {'weight': 3.1415})
    weighted_edges: typing.List[typing.Tuple[Lemma, Lemma, typing.Dict[str, float]]] = [
        (*n, {"weight": w * self.edge_weight}) for n, w in Counter(edges).items()
    ]

    return weighted_edges

Normalizing: when creating each edge, edge = tuple(sorted((node, nbor))) ensures that (node, nbor) and (nbor, node) are always represented in the same order.

Counting: Counter(edges) then treats symmetric edges as identical and counts them together.

Weights: each edge is weighted by the total count of its appearances.

I would love to know your thoughts on this.
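A toy check of the normalize-then-count idea above, with plain strings standing in for the Lemma nodes:

from collections import Counter

pairs = [("play", "with"), ("with", "play"), ("come", "play")]

# sorting each pair folds both directions onto one canonical edge
normalized = [tuple(sorted(p)) for p in pairs]

print(Counter(normalized))
# Counter({('play', 'with'): 2, ('come', 'play'): 1})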

@Ankush-Chander (Contributor, Author) replied:


Thanks for pointing this out @yamika-g .
A PR that solves this issue is under review.
Your code is mostly accurate; however, nodes are modelled as a (lemma, pos-tag) tuple, not just the lemma.

One observation: it is rare for two words to occur in a different order while keeping the same POS tags:
(play, with) is usually ("play", "VERB"), ("with", "ADP")
(with, play) is usually ("with", "ADP"), ("play", "NOUN")
Hence they are treated as different nodes despite having the same lemma.
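A quick sketch of that distinction, assuming a (lemma, pos) namedtuple for the node type as described above (pytextrank's actual Lemma type may differ in detail):

import networkx as nx
from collections import namedtuple

Lemma = namedtuple("Lemma", ["lemma", "pos"])  # assumed node shape

g = nx.Graph()
g.add_edge(Lemma("play", "VERB"), Lemma("with", "ADP"))
g.add_edge(Lemma("with", "ADP"), Lemma("play", "NOUN"))

print(g.number_of_nodes())  # 3 -- the lemma "play" appears as two distinct nodes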

@yamika-g replied:

Hi Ankush, I'm glad you liked my feedback!
My corpus contains user-generated text from users all over the world, so it is full of spelling errors, grammatical mistakes, and non-English words. Since the corpus is not very clean, I decided to modify pytextrank to remove all dependency on linguistic features. I also looked inside the graph at the nodes and their POS tags, and I wasn't happy with the tags: the word 'match', for example, was variously tagged VERB, NOUN, or even PROPN, so three different nodes were formed for that one word. This was happening for many words, and it made the graph sparse: the same two words ended up with multiple edges between them because different POS tags produced different nodes. I also didn't want to lemmatize, since misspelt and non-English words don't have a lemma.

So I changed the node structure from (token.lemma_, token.pos_) to just (token.text), and then sorted the (node, nbor) tuple so that each edge accumulates as much information as possible about the co-occurrence of those two words. That's why, in my case, it was important to prevent the loss of information.
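A sketch of the sparsity issue described here, assuming en_core_web_sm (the model used in the repo's tests); the exact tags vary by model version:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("They watched the match. These colours match well.")

pos_nodes = {(tok.lemma_, tok.pos_) for tok in doc if tok.text == "match"}
text_nodes = {tok.text for tok in doc if tok.text == "match"}

print(pos_nodes)   # likely two distinct nodes, e.g. {('match', 'NOUN'), ('match', 'VERB')}
print(text_nodes)  # a single node: {'match'}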
