context size in sampling #1

deweihu96 · 2023-10-03T12:16:31Z

Hi, thanks for your great work. I've been trying to incorporate the augmentation code into my own workflow.

I'm a little bit confused by the "context_size" variable. Is it the same as the "window_size" variable in gensim word2vec?

The actual length of the sequences used for training in gensim word2vec is 2*window_size - 1.

But the code in sampler.argew suggests that the length of the sequences used for training is "context_size": sequences = rw[:, j:j + self.context_size].

danieljunhee · 2023-10-04T12:59:06Z

@deweihu96
Thank you for your interest in our work.

The two variables you mentioned seem to be conceptually the same (i.e. how many surrounding words/nodes to use); they differ only in how each package implements them.

In gensim, there is a multiplication by 2 probably because the implementation directly extracts words before and after the target word.

On the other hand, in our work, which follows the pytorch-geometric package's node2vec implementation, each random walk is split into overlapping subsequences of length context_size, and within each subsequence the first node is paired with each of the remaining nodes to form positive examples.

For instance, if you set the random walk length as 8 and context_size as 3, then for a random walk [n1, n2, ... , n8], we have:
- subsequence [n1, n2, n3] -> positive examples (n1, n2), (n1, n3)
- subsequence [n2, n3, n4] -> positive examples (n2, n3), (n2, n4)
- ...
- subsequence [n6, n7, n8] -> positive examples (n6, n7), (n6, n8)
Note that this effectively uses the 2 nodes before and the 2 nodes after each target node (context_size - 1 on each side): e.g. n4 forms positive examples with n2, n3, n5, and n6.
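
The example above can be reproduced with a few lines of plain Python. This is only a minimal sketch of the windowing idea, not the actual sampler code: `positive_pairs` is a hypothetical helper name, and it works on a single walk stored as a list rather than on a batched tensor `rw`.

```python
def positive_pairs(walk, context_size):
    """Slide a window of length context_size over one walk and pair the
    first node of each window with every other node in that window."""
    pairs = []
    num_windows = len(walk) - context_size + 1
    for j in range(num_windows):
        # Same slicing idea as rw[:, j:j + self.context_size], for one walk.
        window = walk[j:j + context_size]
        start = window[0]
        pairs.extend((start, other) for other in window[1:])
    return pairs


walk = [f"n{i}" for i in range(1, 9)]  # [n1, n2, ..., n8]
pairs = positive_pairs(walk, context_size=3)

# Every node that shares a positive example with n4, in either position.
partners_of_n4 = sorted(
    {a if b == "n4" else b for a, b in pairs if "n4" in (a, b)}
)

print(pairs[:2])       # [('n1', 'n2'), ('n1', 'n3')]
print(len(pairs))      # 12
print(partners_of_n4)  # ['n2', 'n3', 'n5', 'n6']
```

The 8-node walk gives 8 - 3 + 1 = 6 windows with 2 pairs each, and n4 ends up paired with its two neighbours on either side, which matches the note above.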
