This is a template repository to submit your method on the Next platform for phase 1 of the Predicting Fertility Data Challenge (PreFer). Here you can read how to participate in the challenge. The challenge is to predict whether an individual will have a child within a three-year period (2021-2023), based on survey data from previous years (2007-2020). Data come from the LISS panel. For more information on the data challenge, please visit the website and read this paper.
ℹ️ Check out (important dates) to see when this challenge phase will open and close.
- Make a copy of this template repository by forking and cloning, as explained here.
- Make sure to allow Github Actions on your own repository: Go to the “Actions” tab and click “I understand my workflows, go ahead and enable them.”
- If you have registered for the PreFer challenge, you will receive a link to download the data from the Next platform.
- Visit the Next platform and sign in to download the training data and codebooks. Here is a detailed explanation of the datasets that you have downloaded, and here for an explanation of how to use the codebooks.
To participate in the challenge you need to submit a method (i.e. code for data preprocessing, training, and making predictions, and the trained model) using this repository.
ℹ️ You can use either Python or R for your method. By default, Python is used. For Python this repo assumes that your method uses the Anaconda Python distribution.
- Choose your programming language: the default set-up is Python. If you would like to use R, go to `settings.json` and change `{"dockerfile": "python.Dockerfile"}` into `{"dockerfile": "r.Dockerfile"}`. Read here how to update files in your forked repository.
- Choose the main script to work with: go to `submission.py` (Python) or `submission.R` (R), depending on your preferred programming language.
- Preprocess the data: any steps to clean or preprocess the data need to be documented within the function `clean_df` in the `submission.py`/`submission.R` script (depending on your preferred programming language). Note: the function `clean_df` will also be applied to the holdout data when you submit your model. At this point, the codebooks can be useful to make sense of the data.
- Train, tune, and save your model: any steps to train your model need to be documented (e.g., code for the model, number of folds, set seed) within the `training.py`/`training.R` script. The only function in this script is `train_save_model`, in which you can put the steps needed to run the model. The output of this script is your saved model, either `model.joblib` or `model.rds`. Make sure that your model is saved in the same folder as `submission.py`/`submission.R` under the name `model.joblib` (for Python) or `model.rds` (for R). The model will be applied to the holdout data when you submit your model.
- Test your model on fake data: you can test your `clean_df` function and your model (stored in `model.joblib`/`model.rds`) on fake data (`PreFer_fake_data.csv`) through the function `predict_outcomes`. You will also need to adapt this function so that the outputs of your model are predicted classes (i.e., 0s and 1s) rather than, for example, probabilities. If you pass the test (i.e., `predict_outcomes` leads to predictions rather than errors), you can submit your method. If your method does not run on the fake data, it will not run on the holdout data. (If you "push" your method to GitHub, this test will also be run automatically.)
- Submit your method: submit your method as explained here.
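As a rough illustration of how the pieces fit together, the Python workflow might be sketched as below. This is not the template's actual code: the column names (`"age"`, `"new_child"`) and the logistic-regression model are placeholder assumptions, and the real function signatures live in `submission.py` and `training.py` in this repository.

```python
import pandas as pd


def clean_df(df):
    """Preprocess the raw survey data. Sketch only: the "age" column
    used here is a hypothetical stand-in for real feature engineering."""
    df = df.copy()
    # Example step: fill missing values in a numeric feature
    df["age"] = df["age"].fillna(df["age"].median())
    return df


def train_save_model(cleaned_df, outcome_df):
    """Train a model and save it as model.joblib next to submission.py.
    Assumes scikit-learn and joblib are listed in environment.yml."""
    from sklearn.linear_model import LogisticRegression
    import joblib

    model = LogisticRegression()
    model.fit(cleaned_df[["age"]], outcome_df["new_child"])
    joblib.dump(model, "model.joblib")


def predict_outcomes(df, model_path="model.joblib"):
    """Return hard class labels (0/1), not probabilities."""
    import joblib

    model = joblib.load(model_path)
    cleaned = clean_df(df)
    return model.predict(cleaned[["age"]]).astype(int)
```

Note that `predict_outcomes` returns classes rather than probabilities, which is what the scoring expects.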
There are various videos, guides, notebooks, and blogs available to guide you through this process.
For Python users: please see the `environment.yml` file to see which libraries are installed by default. You can add or remove libraries from this `environment.yml` file as you desire. It is recommended to pin particular versions (i.e., `pandas=1.5` rather than `pandas>=1.5`). You have to import those libraries in the `submission.py` file.
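For example, a pinned `environment.yml` might look like the fragment below. The environment name and the package list are illustrative only, not the file's actual contents:

```yaml
name: prefer-submission   # hypothetical name
dependencies:
  - python=3.11
  - pandas=1.5        # pin exact versions (pandas=1.5, not pandas>=1.5)
  - scikit-learn=1.3
  - joblib=1.3
```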
For R users: no packages are pre-installed. You can use the `packages.R` file and add the names of the packages to the code: `install.packages(c("dplyr","data.table","tidyr"), repos="https://cran.r-project.org")`. You then have to load those libraries in the `submission.R` file by adding one `library()` call per package (i.e., `library(dplyr)`, `library(data.table)`, `library(tidyr)`; note that `library()` loads a single package at a time, so `library(c(...))` will not work).
Follow the instructions below to submit your method:
1. Make sure that you describe your model in the `description.md` file in your GitHub repository and commit changes (i.e., save changes locally).
2. Push the commit (i.e., upload the changed version to your online repository).
3. In GitHub, make sure that the checks pass:

   ℹ️ If the check fails, go to the FAQ. You might need to add dependencies as described here. You can also test your implementation locally as explained here.

4. On the main page of your repository, above the file list, click "commits" to view a list of commits, as described here.
5. Go to the commit that you want to submit, right-click on "view commit details", then click "Copy Link Address", see the example below:
6. Add a submission on the Next platform by providing the URL to your GitHub commit (copied at step 5); this commit will serve as your submission to the challenge.
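For orientation, the commit URL you paste into the Next platform has the shape sketched below. The user name, repository name, and SHA are placeholders; in a local checkout you would obtain the SHA of your latest commit with `git rev-parse HEAD`:

```shell
# Hypothetical values: substitute your own GitHub user, repository, and commit SHA.
REPO_URL="https://github.com/your-username/fertility-prediction-challenge"
COMMIT_SHA="0123456789abcdef0123456789abcdef01234567"
echo "${REPO_URL}/commit/${COMMIT_SHA}"
```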
The LISS panel challenge data is separated into an example dataset for tuning your method and a holdout dataset that will be used to validate your method's performance. After submission, your method will be run on the holdout data. Your performance scores on the holdout data will be added to the leaderboards, so your scores can be compared to the performance scores of other methods.
ℹ️ Leaderboards are generated at fixed time points, check out (important dates) for leaderboard submission deadlines.
The following leaderboards will be available:
*For the prediction of having a child in 2021-2023 (positive class).
For this challenge the F1 leaderboard is the main leaderboard.
ℹ️ The Python code to calculate the metric scores used to create the challenge leaderboards is included in this repo. Check out score.py.
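For reference, the F1 score is the harmonic mean of precision and recall on the positive class. The sketch below shows the computation from first principles; it is a minimal illustration, not the contents of `score.py`, which remains the authoritative implementation:

```python
def f1_score(y_true, y_pred):
    """Compute F1 for the positive class (label 1) from hard 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

This is also why `predict_outcomes` must return classes rather than probabilities: precision and recall are defined over hard 0/1 labels.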
Accurate predictions of the number and timing of children are crucial for effective resource allocation in society. However, despite many studies in the social sciences, we have no clear understanding of which factors are most important for fertility prediction or how well we are able to predict fertility behaviour.
The aim of this challenge is to gain insight into how well methods are able to predict fertility within a three-year period (2021-2023), based on survey data from previous years (2007-2020) of people in the LISS panel who were aged 18-45 in 2020. The LISS panel is a representative online longitudinal panel of Dutch households.
This project is licensed under the terms of the MIT license.
The code in this repository is developed by Eyra as part of the Rank program funded by ODISSEI and the NWO VIDI grant awarded to Gert Stulp. The LISS panel data is provided by Centerdata.