I know it's 2023, but I've been reading the paper and playing around with the code lately, and there's something I don't quite understand regarding data splitting and sequence length.
In the paper it's mentioned that for Assist2009 the maximum sequence length was set to 200. In the code, I expected the sequences to be cut at 200 questions per user, with further questions discarded, but it seems this is not the case.
Example
For instance, let's take as an example the comment in load_data.py, in the function load_data.
If we set seqlen to 10, what the code would do is:
Q will be set to [1,1,1,1,7,7,9,10,10,10,10,11,11,45,54]
A will be set to [0,1,1,1,1,1,0,0,1,1,1,1,1,0,0]
Since the condition len(Q) > self.seqlen is True, n_split will be math.floor(len(Q) / self.seqlen), which is math.floor(15/10), which is 1; but since len(Q) % self.seqlen is not 0, n_split becomes 2, as per line 43.
The for loop in line 45 will run twice, splitting the input in 2 parts: one for the first 10 elements, and the other for the remaining 5, which gets padded since sequences must be of length 10. So in the end, q_dataArray will hold those two rows, meaning that it has shape [n_split, self.seqlen], or [2, 10] (see the sketch below).
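To make sure I'm reading it right, here is a minimal sketch of the splitting behaviour as I understand it. This is not the actual load_data.py code: the function name split_sequence, the use of numpy, and the padding value 0 are my own assumptions.

```python
import math
import numpy as np

def split_sequence(Q, A, seqlen, pad_value=0):
    """Sketch of the splitting described above: chop the interaction
    history into chunks of length seqlen, padding the last chunk."""
    n_split = 1
    if len(Q) > seqlen:
        n_split = math.floor(len(Q) / seqlen)
        if len(Q) % seqlen != 0:
            # leftover questions get an extra, padded chunk
            n_split += 1

    q_rows, a_rows = [], []
    for k in range(n_split):
        q_chunk = np.full(seqlen, pad_value)
        a_chunk = np.full(seqlen, pad_value)
        piece_q = Q[k * seqlen:(k + 1) * seqlen]
        piece_a = A[k * seqlen:(k + 1) * seqlen]
        q_chunk[:len(piece_q)] = piece_q
        a_chunk[:len(piece_a)] = piece_a
        q_rows.append(q_chunk)
        a_rows.append(a_chunk)

    # shape: [n_split, seqlen]
    return np.stack(q_rows), np.stack(a_rows)

Q = [1, 1, 1, 1, 7, 7, 9, 10, 10, 10, 10, 11, 11, 45, 54]
A = [0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0]
q_dataArray, a_dataArray = split_sequence(Q, A, seqlen=10)
print(q_dataArray.shape)  # (2, 10): row 0 = first 10 interactions, row 1 = last 5, padded
```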
Questions
Wouldn't that be incorrect when training? The entry in the data is for a student who answered 15 questions; if I wanted to predict the result on, say, the 12th question, all the previous 11 questions and answers would have to be considered. But, if my understanding is correct, the training code treats them as two separate entries, as if the second one belonged to a completely different student who only answered 5 questions. Moreover, the code actually shuffles these entries.
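What I mean is roughly the following, continuing the sketch above (again only an illustration of the behaviour I'm describing, not the repository's actual shuffling code):

```python
# Shuffling along the first axis treats the two chunks of the same student
# as independent training samples.
shuffle_index = np.random.permutation(q_dataArray.shape[0])
q_shuffled = q_dataArray[shuffle_index]
a_shuffled = a_dataArray[shuffle_index]
# The row holding the last 5 (padded) interactions is now fed to the model
# with no access to the history stored in the other row.
```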
If the above was intended, is the model reliable when predicting something like the 21st question (with seqlen the same as in the example given above)?
I'd appreciate your input, answers and comments. Feel free to correct me if I got something wrong.
Thank you!
Hello, I have the same question. It seems that many model implementations in the literature handle the input this way, which makes it convenient for the model to process, just like in NLP. I always feel that this does not make full use of the students' historical information.