Notes for time-shifting data #15
There is an additional feature among the ones I handpicked that requires special attention: cf20m026 is the partner's birthyear, and it should be time-shifted in the same way as respondent birthyears and cohabitation start years. I implemented a piece of code that does that.
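The idea above can be sketched as follows. This is an illustrative Python sketch, not the project's actual code (which is in R): cf20m026 is the only real variable name taken from the thread, and the shift amount and helper function are assumptions.

```python
# Hedged sketch: time-shift year-coded survey variables so that the
# partner's birthyear (cf20m026) is handled like respondent birthyears
# and cohabitation start years. The shift size is a hypothetical value.
YEAR_SHIFT = 1  # hypothetical: number of years the data are shifted back

def shift_year_values(record, year_vars=("cf20m026",)):
    """Return a copy of one survey record with year-coded variables
    shifted by YEAR_SHIFT, leaving missing values (None) untouched."""
    shifted = dict(record)
    for var in year_vars:
        if shifted.get(var) is not None:
            shifted[var] = shifted[var] - YEAR_SHIFT
    return shifted

record = {"cf20m026": 1980, "other_var": 3}
print(shift_year_values(record))  # {'cf20m026': 1979, 'other_var': 3}
print(shift_year_values({"cf20m026": None}))  # missing stays missing
```

The key design point is that only year-coded variables are shifted; categorical or count variables in the same record pass through unchanged.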
@emilycantrell TL;DR: After including time-shifted data, the maximum cross-validation F1 score increased only slightly, from 0.769 to 0.773. I compared the CV results of the current version of the code to those in the following two files (R converted to txt). training_no_shift.txt The differences between the "no shift" code and our current code are as follows:

Here are the top pipelines from the "no shift" code: Here are the top pipelines from the current code:

As you can see, after including time-shifted data, the maximum cross-validation F1 score increased only slightly, from 0.769 to 0.773, and there is significant uncertainty surrounding both estimates. I am in fact pleasantly surprised by these results, because I half expected F1 scores to go down significantly now that we are training with out-of-sample data. It is quite impressive that even though our training data is now less representative of our test data in many ways, we still achieve similar, if not better, predictions. Perhaps this means that our new model is more robust to moving from one dataset to another, and we will see better performance on the holdout set even though the time shift does not show much improvement in terms of cross-validation on the training set.

Another interesting observation is that with time-shifted data, the winning pipeline seems to involve only tree stumps, i.e. trees with only one split and a depth of 1. This is surprising to me because stumps are not able to capture interactions between variables in the data (though multiple stumps can capture nonlinearities by making splits in different places). I changed the seed from 0 to 1 and 2 and ran the code again twice, and stumps still won for both seeds. This makes me a little concerned about whether there is something wrong in the code, but perhaps our datasets are simply too small for interaction effects to be useful.
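The point about stumps not capturing interactions can be seen on a toy example. This is an illustrative Python sketch (the project code is R) using XOR, the simplest pure interaction: no single-feature split can do better than chance, while one depth-2 tree fits it exactly.

```python
# Illustration: a depth-1 tree (stump) cannot represent an interaction
# between two features, demonstrated on the 4-point XOR dataset.
from itertools import product

# Label is the interaction (XOR) of two binary features.
data = [((x1, x2), x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

def stump_accuracy(feature, threshold, left_label):
    """Accuracy of a stump: predict left_label when the chosen feature
    is <= threshold, otherwise the opposite label."""
    correct = 0
    for (x1, x2), y in data:
        x = (x1, x2)[feature]
        pred = left_label if x <= threshold else 1 - left_label
        correct += (pred == y)
    return correct / len(data)

# Best possible stump over every feature, threshold, and label assignment.
best_stump = max(
    stump_accuracy(f, t, lab)
    for f in (0, 1) for t in (-1, 0) for lab in (0, 1)
)

# A depth-2 tree that splits on x1, then on x2, recovers XOR exactly.
def depth2_tree(x1, x2):
    return x2 if x1 == 0 else 1 - x2  # equivalent to x1 ^ x2

depth2_acc = sum(depth2_tree(x1, x2) == y for (x1, x2), y in data) / len(data)

print(best_stump)  # 0.5: no single split beats chance on XOR
print(depth2_acc)  # 1.0: one extra level captures the interaction
```

Note this limitation is about interactions specifically: as mentioned above, an ensemble of many stumps can still model nonlinear main effects by splitting the same feature at different thresholds, which may be enough if interaction effects in these data are weak.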
@HanzhangRen Thank you for this side-by-side comparison!! Here are some reactions: