
About sequence length and data splitting #9

Open
Manuel-Medina opened this issue Feb 1, 2023 · 1 comment

Comments

@Manuel-Medina

Hello!

I know it's 2023, but I've been reading the paper and playing around with the code lately, and there's something I don't quite understand regarding data splitting and sequence length.

The paper mentions that for Assist2009 the maximum sequence length was set to 200. In the code, I expected sequences to be truncated at 200 questions per user, with further questions discarded, but it seems this is not the case.

Example

For instance, let's take the example from the comment in load_data.py, in the load_data function:

15
1,1,1,1,7,7,9,10,10,10,10,11,11,45,54
0,1,1,1,1,1,0,0,1,1,1,1,1,0,0

If we set seqlen to 10, what the code would do is:

  • Q will be set to [1,1,1,1,7,7,9,10,10,10,10,11,11,45,54]
  • A will be set to [0,1,1,1,1,1,0,0,1,1,1,1,1,0,0]
  • Since len(Q) > self.seqlen is True, n_split will be math.floor(len(Q) / self.seqlen) = math.floor(15/10) = 1; but since len(Q) % self.seqlen is not 0, n_split is incremented to 2, as per line 43
  • The for loop in line 45 will run twice, splitting the input into two parts: one with the first 10 elements, and one with the remaining 5. Since every sequence must have length 10, the last one is zero-padded. So in the end, q_dataArray will have:
[
    [1,1,1,1,7,7,9,10,10,10],
    [10,11,11,45,54,0,0,0,0,0]
]

meaning that it has shape [n_split, self.seqlen], or [2, 10]
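The walkthrough above can be sketched as a small standalone function (a minimal reconstruction of the behaviour described, not the actual load_data.py code; the joint q/a encoding of the real loader is omitted):

```python
import math

def split_sequence(Q, A, seqlen):
    """Split one student's question/answer lists into fixed-length,
    zero-padded chunks, following the logic described above."""
    n_split = 1
    if len(Q) > seqlen:
        n_split = math.floor(len(Q) / seqlen)
        if len(Q) % seqlen != 0:
            n_split += 1  # extra chunk for the remainder (line 43)
    q_rows, a_rows = [], []
    for k in range(n_split):  # the loop at line 45
        chunk_q = Q[k * seqlen:(k + 1) * seqlen]
        chunk_a = A[k * seqlen:(k + 1) * seqlen]
        pad = seqlen - len(chunk_q)
        q_rows.append(chunk_q + [0] * pad)  # zero-pad the last chunk
        a_rows.append(chunk_a + [0] * pad)
    return q_rows, a_rows

Q = [1, 1, 1, 1, 7, 7, 9, 10, 10, 10, 10, 11, 11, 45, 54]
A = [0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0]
q_rows, a_rows = split_sequence(Q, A, 10)
# q_rows == [[1,1,1,1,7,7,9,10,10,10], [10,11,11,45,54,0,0,0,0,0]]
```

Running this reproduces the [2, 10] q_dataArray shown above.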

Questions

  1. Wouldn't this be incorrect for training? The entry in the data is for a student who answered 15 questions. If I wanted to predict the result on, say, the 12th question, all 11 previous questions and answers should be considered; but, if my understanding is correct, the training code treats the two chunks as separate entries, as if the second one belonged to a completely different student who only answered 5 questions. Moreover, the code actually shuffles the data:
def train(net, params, q_data, qa_data, label):
    N = int(math.floor(len(q_data) / params.batch_size))
    q_data = q_data.T # Shape: (200,3633)  <-- Here
    qa_data = qa_data.T  # Shape: (200,3633) <-- Here
    # Shuffle the data
    shuffled_ind = np.arange(q_data.shape[1]) # <-- Here
    np.random.shuffle(shuffled_ind)
    q_data = q_data[:, shuffled_ind]
    qa_data = qa_data[:, shuffled_ind]
  2. If the above was intended, is the model reliable when predicting something like the 21st question (with the same seqlen as in the example above)?
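The concern in question 1 can be made concrete with a bit of index arithmetic (assuming the splitting behaviour described above, where each chunk of length seqlen becomes an independent training entry):

```python
# Where does the 12th question land after splitting, and how much
# history does the model see there versus what the student really had?
seqlen = 10
global_index = 11                        # 12th question, 0-based
chunk = global_index // seqlen           # which split row it lands in
index_in_chunk = global_index % seqlen   # position inside that row
history_visible = index_in_chunk         # prior interactions the model sees
history_expected = global_index          # prior interactions the student had
print(chunk, history_visible, history_expected)  # 1 1 11
```

So the 12th question ends up at position 1 of the second chunk: the model conditions on 1 prior interaction instead of 11, which is exactly the loss of history being asked about.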

I'd appreciate your input, answers and comments. Feel free to correct me if I didn't get something.

Thank you!

@Tong198-Hu

Hello, I have the same question. It seems that many model implementations in the literature handle the data this way, which makes the input convenient for the model to process, much like in NLP. I still feel this does not make full use of a student's historical information.
