5 comments on commit 3f61b59
Hi! Fellow data scientist here. I'm using your work for a project! Can you please explain why you chose to make the lemma graph undirected instead of directed? Doesn't it make more sense to have a directed graph for the PageRank algorithm?
The problem with nx.Graph() is that if you feed it directional information (like I was doing), it simply drops some of it instead of consolidating it. For example, take the code below:
import networkx as nx

weighted_edges = [
    ("come", "play", {'weight': 3.0}),
    ("play", "with", {'weight': 2.0}),
    ("with", "play", {'weight': 11.0}),
    ("say", "nun", {'weight': 2.0}),
    ("nun", "calm", {'weight': 1.0}),
    ("late", "calm", {'weight': 3.0}),
    ("sleep", "sis", {'weight': 1.0}),
    ("sis", "sleep", {'weight': 2.0}),
    ("late", "sis", {'weight': 1.0}),
    (" ", "sleep", {'weight': 9.0}),
    ("back", "ah", {'weight': 12.0}),
]

graph = nx.Graph()
graph.add_edges_from(weighted_edges)

for edge in graph.edges(data=True):
    print(edge)
You'll see that the 2nd edge (play --> with) (weight = 2) and the 3rd edge (with --> play) (weight = 11) are bidirectional. But when I build an undirected graph with these and then look at graph.edges, I see that only (play --> with) has been kept, and surprisingly it has a weight of 11 instead of 2. The edge (with --> play) is nowhere to be found in this graph. Surely that affects the PageRank results?
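For what it's worth, here is a minimal standalone sketch of what seems to be happening (assuming a recent networkx): in an undirected graph the reverse edge maps to the same edge and just overwrites its stored attributes, whereas a DiGraph keeps both orientations.

    import networkx as nx

    # Undirected: the reverse edge is the same edge, so its attributes simply
    # overwrite the earlier ones -- the final weight is 11.0, not 2.0 + 11.0.
    g = nx.Graph()
    g.add_edge("play", "with", weight=2.0)
    g.add_edge("with", "play", weight=11.0)
    print(g["play"]["with"])   # {'weight': 11.0}

    # Directed: both orientations survive as separate edges.
    dg = nx.DiGraph()
    dg.add_edge("play", "with", weight=2.0)
    dg.add_edge("with", "play", weight=11.0)
    print(dg["play"]["with"], dg["with"]["play"])   # {'weight': 2.0} {'weight': 11.0}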
Hi @yamika-g ,
The PageRank algorithm works better with directed graphs where there is a strong notion of incoming/outgoing edges, e.g. web links, where citing a page is different from being cited by it.
However, in the TextRank paper the authors have shown that, on text, an undirected graph works better than a directed graph (considering either direction).
To quote:
Regardless of the direction chosen for the arcs, results obtained with directed graphs are worse than
results obtained with undirected graphs, which suggests that despite a natural flow in running text,
there is no natural “direction” that can be established between cooccurring words.
Hence the lemma graph is treated as an undirected graph.
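For reference, a rough sketch of how the two variants can be compared with networkx's pagerank; the toy co-occurrence edges here are made up purely for illustration:

    import networkx as nx

    # Toy co-occurrence edges, purely illustrative.
    edges = [("natural", "flow"), ("flow", "text"), ("text", "running"), ("running", "natural")]

    undirected = nx.Graph(edges)
    forward = nx.DiGraph(edges)
    backward = forward.reverse()

    # pagerank accepts all three; on an undirected graph every edge is
    # effectively traversable in both directions.
    for name, g in [("undirected", undirected), ("forward", forward), ("backward", backward)]:
        print(name, nx.pagerank(g, alpha=0.85))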
I see. Thanks for your answer! I still see one problem: as I mentioned in my previous comment, if you run the code I provided, one direction of the edges gets dropped. In this example, (play --> with) (weight = 2) and (with --> play) (weight = 11) are bidirectional and both exist in the edge list. But when I build an undirected graph with these and then look at graph.edges, I see that only one of the edges has been kept and the other one has been dropped. I think that is causing us to lose information, and it seems that nx.Graph() just arbitrarily picks the edge it wants to keep.
To prevent that, I want to count the edges in such a way that for a bidirectional edge, both directions are considered in the count. So the two edges above should both be normalized to (play, with), and their total weight should be 11 + 2 = 13. I think that's a better way to judge the importance of an edge, right? Considering just one direction gives us only half the information.
So here's the change that I'm making for my work:
def edge_list (
    self
    ) -> typing.List[typing.Tuple[Lemma, Lemma, typing.Dict[str, float]]]:
    edges: typing.List[typing.Tuple[Lemma, Lemma]] = []

    for sent in self.doc.sents:
        h = [
            Lemma(token.text)  # pos tag dropped; nodes keyed by surface text only
            for token in sent
            if self._keep_token(token) and token.text.strip() != ""
        ]

        for hop in range(self.token_lookback):
            for idx, node in enumerate(h[: -1 - hop]):
                nbor = h[hop + idx + 1]
                # sort the pair so that symmetric edges are counted together
                edge = tuple(sorted((node, nbor)))
                edges.append(edge)

    # include weight on the edge: (2, 3, {'weight': 3.1415})
    weighted_edges: typing.List[typing.Tuple[Lemma, Lemma, typing.Dict[str, float]]] = [
        (*n, {"weight": w * self.edge_weight}) for n, w in Counter(edges).items()
    ]

    return weighted_edges
Normalizing: when creating each edge, edge = tuple(sorted((node, nbor))) ensures that (node, nbor) and (nbor, node) are always represented in the same order.
Counting: Counter(edges) will now treat symmetric edges as identical and count them together.
Weights: each edge is weighted based on the total count of its appearances.
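Standalone, outside of pytextrank, the same normalization applied to the toy edge list from my first comment looks like this (plain dictionary accumulation, nothing spaCy-specific):

    from collections import defaultdict

    weighted_edges = [
        ("come", "play", 3.0), ("play", "with", 2.0), ("with", "play", 11.0),
        ("say", "nun", 2.0), ("nun", "calm", 1.0), ("late", "calm", 3.0),
        ("sleep", "sis", 1.0), ("sis", "sleep", 2.0), ("late", "sis", 1.0),
        (" ", "sleep", 9.0), ("back", "ah", 12.0),
    ]

    totals: defaultdict = defaultdict(float)
    for u, v, w in weighted_edges:
        # normalize the direction first, then accumulate the weight
        totals[tuple(sorted((u, v)))] += w

    print(totals[("play", "with")])   # 13.0 = 2.0 + 11.0
    print(totals[("sis", "sleep")])   # 3.0  = 1.0 + 2.0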
I would love to know your thoughts on this.
Thanks for pointing this out, @yamika-g.
A PR that solves this issue is under review.
Your code is mostly accurate; however, nodes are modelled as a (lemma, pos-tag) tuple, not just the lemma.
One observation: it is a rare occurrence that two words appear in a different order while keeping the same POS tags:
(play, with) is usually ("play", "VERB"), ("with", "ADP")
(with, play) is usually ("with", "ADP"), ("play", "NOUN")
Hence they are treated as different nodes despite having the same lemma.
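A minimal sketch of that distinction, with made-up weights:

    import networkx as nx

    g = nx.Graph()
    # Nodes are keyed by (lemma, pos-tag), so the same surface word with a
    # different tag becomes a separate vertex in the lemma graph.
    g.add_edge(("play", "VERB"), ("with", "ADP"), weight=1.0)
    g.add_edge(("with", "ADP"), ("play", "NOUN"), weight=1.0)
    print(g.number_of_nodes())   # 3, not 2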
Hi Ankush, I'm glad you liked my feedback!
My corpus contains user-generated text with spelling errors, grammatical mistakes, and non-English words, since it's been written by users all over the world. Because my corpus is not very clean, I decided to modify pytextrank to remove all dependency on linguistic features. I also looked inside the graph at the nodes and the POS tags, and I wasn't happy with the POS tags generated. For example, the word 'match' was sometimes getting tagged as VERB, NOUN, or even PROPN, which resulted in 3 different nodes being formed for the word 'match'. This was happening for every word, and it was causing sparsity in the graph: the same 2 words had multiple edges between them because different POS tags turned them into different nodes. I also didn't wish to lemmatize my words, because misspelt or non-English words don't have a lemma.
So I changed the node structure from (token.lemma_, token.pos_) to just (token.text), and then sorted the (node, nbor) tuple so that each edge can accumulate as much information as possible about the co-occurrence of those 2 words. That's why in my case it was important to prevent the loss of information.
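A tiny illustration with the 'match' example above: keying nodes by (lemma, pos) splits one word into several vertices, while keying by the raw text collapses them back into one.

    # Hypothetical tags produced for 'match' in my corpus.
    pos_nodes = {("match", "VERB"), ("match", "NOUN"), ("match", "PROPN")}
    text_nodes = {text for text, _ in pos_nodes}

    print(len(pos_nodes), len(text_nodes))   # 3 1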
The annotation says this is a directed graph. Might consider changing that?