context size in sampling #1

Open
deweihu96 opened this issue Oct 3, 2023 · 1 comment
Comments

@deweihu96

Hi, thanks for your great work. I've been trying to incorporate the augmentation code into my own workflow.

I'm a little bit confused by the "context_size" variable. Is it the same as the "window_size" variable in gensim word2vec?

In gensim word2vec, the sequences actually used for training have length 2*window_size - 1.

But the code in sampler.argew suggests that the length of the sequences used for training is "context_size": sequences = rw[:, j:j + self.context_size].

@danieljunhee
Collaborator

@deweihu96
Thank you for your interest in our work.

It seems the two variables you mentioned are conceptually the same (i.e. how many surrounding words/nodes to use) but are implemented differently in the two packages.

In gensim, the factor of 2 probably arises because the implementation directly extracts the words both before and after the target word.
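To make the "before and after" idea concrete, here is a minimal plain-Python sketch of symmetric window extraction. It is not gensim's code: the function name `symmetric_pairs` and the toy tokens are hypothetical. It only assumes what gensim documents for `window`, namely the maximum distance between the current and the predicted word, and it ignores other details of the real implementation.

```python
def symmetric_pairs(tokens, window):
    """Pair every target token with the tokens at most `window` positions away.

    Illustration only; `symmetric_pairs` is a hypothetical helper, not a gensim API.
    """
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                 # clip at the start of the sentence
        hi = min(len(tokens), i + window + 1)   # clip at the end of the sentence
        for k in range(lo, hi):
            if k != i:                          # a token is not its own context
                pairs.append((target, tokens[k]))
    return pairs


tokens = ["w1", "w2", "w3", "w4", "w5", "w6"]
pairs = symmetric_pairs(tokens, window=2)
print([ctx for tgt, ctx in pairs if tgt == "w3"])  # ['w1', 'w2', 'w4', 'w5']
```

With `window=2`, the target `w3` is paired with two tokens on each side, which is where the factor of 2 comes from.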

On the other hand, in our work, which follows the pytorch-geometric package's node2vec implementation, each random walk is split into overlapping subsequences of length context_size, and for each subsequence, the first node is paired with each of the remaining nodes to form positive examples.

  • For instance, if you set the random walk length to 8 and context_size to 3, then for a random walk [n1, n2, ... , n8], we have:
    • subsequence [n1, n2, n3] -> positive examples (n1, n2), (n1, n3)
    • subsequence [n2, n3, n4] -> positive examples (n2, n3), (n2, n4)
    • ...
    • subsequence [n6, n7, n8] -> positive examples (n6, n7), (n6, n8)
  • Note that this is essentially using the 2 nodes before and the 2 nodes after the target node: e.g. n4 forms positive examples with n2, n3, n5, and n6.
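The example above can be reproduced with a short plain-Python sketch. This is not the repository's sampler (which slices a batch of walks with rw[:, j:j + self.context_size]); the helper names `positive_pairs` and `partners` are hypothetical, and a single walk is represented as a list of node labels.

```python
def positive_pairs(walk, context_size):
    """Slide a window of length `context_size` over one walk and pair the
    first node of each window with every other node in that window."""
    pairs = []
    num_windows = len(walk) - context_size + 1
    for j in range(num_windows):
        window = walk[j:j + context_size]
        head = window[0]
        pairs.extend((head, other) for other in window[1:])
    return pairs


def partners(pairs, node):
    """Nodes that share a positive pair with `node`, in either position."""
    found = set()
    for a, b in pairs:
        if a == node:
            found.add(b)
        elif b == node:
            found.add(a)
    return sorted(found)


walk = [f"n{i}" for i in range(1, 9)]         # [n1, ..., n8]
pairs = positive_pairs(walk, context_size=3)
print(pairs[:2])                # [('n1', 'n2'), ('n1', 'n3')]
print(partners(pairs, "n4"))    # ['n2', 'n3', 'n5', 'n6']
```

In this sketch each interior node ends up paired with context_size - 1 nodes on each side, so a roughly comparable symmetric window would be context_size - 1 rather than context_size.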
