Predicting Fertility Data Challenge (PreFer)

This is a template repository to submit your method on the Next platform for phase 1 of the Predicting Fertility Data Challenge (PreFer). Here you can read how to participate in the challenge. The challenge is to predict whether an individual will have a child within a three-year period (2021-2023), based on survey data from previous years (2007-2020). The data come from the LISS panel. For more information on the data challenge, please visit the website and read this paper.

ℹ️ Check out the important dates to see when this challenge phase will open and close.

Prerequisites

  1. Make a copy of this template repository by forking and cloning it, as explained here.
  2. Make sure to allow GitHub Actions on your own repository: go to the "Actions" tab and click "I understand my workflows, go ahead and enable them."
  3. If you have registered for the PreFer challenge, you will receive a link to download the data from the Next platform.
  4. Visit the Next platform and sign in to download the training data and codebooks. Here is a detailed explanation of the datasets that you have downloaded, and here is an explanation of how to use the codebooks.

Prepare your method

To participate in the challenge, you need to submit a method (i.e., code for data preprocessing, training, and making predictions, plus the trained model) using this repository.

ℹ️ You can use either Python or R for your method; Python is used by default. For Python, this repo assumes that your method uses the Anaconda Python distribution.

  1. Choose your programming language: the default setup is Python. If you would like to use R, go to settings.json and change {"dockerfile": "python.Dockerfile"} into {"dockerfile": "r.Dockerfile"}. Read here how to update files in your forked repository.

  2. Choose the main script to work with: go to submission.py (Python) or submission.R (R), depending on your preferred programming language.

  3. Preprocess the data: any steps to clean or preprocess the data need to be documented within the function clean_df in the submission.py / submission.R script (depending on your preferred programming language); a minimal sketch of this and the following two steps appears after this list. Note: the function clean_df will also be applied to the holdout data when you submit your model. At this point, the codebooks can be useful to make sense of the data.

  4. Train, tune, and save your model: any steps to train your model need to be documented (e.g., code for the model, number of folds, random seed) within the training.py / training.R script. The only function in this script is train_save_model, in which you put the steps needed to train the model. The output of this script is your saved model, which must be stored in the same folder as submission.py / submission.R under the name model.joblib (for Python) or model.rds (for R). The model will be applied to the holdout data when you submit your method.

  5. Test your model on fake data: you can test your clean_df function and your model (stored in model.joblib / model.rds) on fake data (PreFer_fake_data.csv) through the function predict_outcomes. You will also need to adapt this function such that the outputs of your model are predicted classes (i.e., 0s and 1s) rather than, for example, probabilities. If you pass the test (i.e., predict_outcomes leads to predictions rather than errors), you can submit your method. If your method does not run on the fake data, it will not run on the holdout data. (If you push your method to GitHub, this test will also run automatically.)

  6. Submit your method: Submit your method as explained here.
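
Below is a minimal sketch of the three functions from steps 3-5, assuming a scikit-learn workflow. The function and file names (clean_df, train_save_model, predict_outcomes, model.joblib) come from this README; the signatures, column names (nomem_encr, new_child, and the predictor columns), and the model choice are illustrative assumptions, not the template's actual contents.

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURES = ["age", "partner", "intention"]  # hypothetical predictor columns


def clean_df(df):
    """Step 3: select predictors and impute missing values."""
    # "nomem_encr" is assumed here to be the person identifier column.
    df = df[["nomem_encr"] + FEATURES].copy()
    df[FEATURES] = df[FEATURES].fillna(df[FEATURES].median())
    return df


def train_save_model(cleaned_df, outcome_df):
    """Step 4: fit a model and save it next to submission.py."""
    data = cleaned_df.merge(outcome_df, on="nomem_encr")
    data = data[data["new_child"].notna()]  # "new_child": assumed outcome column
    model = LogisticRegression(max_iter=1000)
    model.fit(data[FEATURES], data["new_child"])
    joblib.dump(model, "model.joblib")  # saved model, as required above


def predict_outcomes(df):
    """Step 5: return predicted classes (0/1), not probabilities."""
    cleaned = clean_df(df)
    model = joblib.load("model.joblib")
    return pd.DataFrame({
        "nomem_encr": cleaned["nomem_encr"],
        "prediction": model.predict(cleaned[FEATURES]),  # .predict yields classes
    })
```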

There are a number of videos, guides, notebooks, and blogs available that walk you through this process.

(Adding) libraries / packages

For Python users: see the environment.yml file to check which libraries are installed by default. You can add or remove libraries from this environment.yml file as you wish, as in the hypothetical excerpt below. It is recommended to pin particular versions (i.e., pandas=1.5 rather than pandas>=1.5). You then have to import those libraries in the submission.py file.
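
For illustration, a hypothetical environment.yml excerpt with pinned versions; the names and versions below are placeholders, and the template's actual file lists its own defaults:

```yaml
# Hypothetical excerpt; the template's environment.yml defines its own
# name, channels, and default dependencies.
name: prefer-submission   # placeholder environment name
channels:
  - defaults
dependencies:
  - pandas=1.5        # pinned, as recommended, rather than pandas>=1.5
  - scikit-learn=1.2
  - joblib=1.2
```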

For R users: no packages are pre-installed. You can use the packages.R file and add the names of the packages to the code: install.packages(c("dplyr","data.table","tidyr"), repos="https://cran.r-project.org"). You then have to load those packages in the submission.R file, one call per package (e.g., library(dplyr); library(data.table); library(tidyr)), since library() takes a single package name rather than a character vector.

Submit your method

Follow the instructions below to submit your method:

  1. Make sure that you describe your model in the description.md file in your GitHub repository and commit the changes (i.e., save them locally).
  2. Push the commit (i.e., upload the changed version to your online repository).
  3. In GitHub, make sure that the checks pass:

ℹ️ If a check fails, go to the FAQ. You might need to add dependencies as described here. You can also test your implementation locally as explained here.

  4. On the main page of your repository, above the file list, click "commits" to view a list of commits, as described here.
  5. Go to the commit that you want to submit, right-click on "view commit details", then click "Copy Link Address".
  6. Add a submission on the Next platform by providing the URL to your GitHub commit (copied at step 5); this commit will serve as your submission to the challenge.

Leaderboards

The LISS panel challenge data is split into a training dataset for developing and tuning your method and a holdout dataset that will be used to evaluate your method's performance. After submission, your method will be run on the holdout data, and your performance scores on the holdout data will be added to the leaderboards, so your scores can be compared to those of other methods.

ℹ️ Leaderboards are generated at fixed time points; check out the important dates for leaderboard submission deadlines.

The following leaderboards will be available:

  • F1 score*
  • Precision*
  • Recall*
  • Overall accuracy

*For the prediction of having a child in 2021-2023 (positive class).

For this challenge, the F1 leaderboard is the main leaderboard.

ℹ️ The Python code to calculate the metric scores used to create the challenge leaderboards is included in this repo. Check out score.py.
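
For a quick local sanity check of your own predictions, the same metrics can be computed with scikit-learn; the sketch below uses toy data, and score.py in this repo remains the authoritative implementation used for the leaderboards.

```python
# Minimal sketch of the leaderboard metrics on toy data; score.py in this
# repo is the authoritative implementation used for the leaderboards.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 1, 1, 0, 1]  # toy ground truth: had a child in 2021-2023
y_pred = [0, 1, 0, 0, 1]  # toy predicted classes from predict_outcomes

print("F1:       ", f1_score(y_true, y_pred))         # main leaderboard metric
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("Accuracy: ", accuracy_score(y_true, y_pred))
```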

PreFer Challenge scope

Research problem

Accurate predictions of the number and timing of children are crucial for effective resource allocation in society. However, despite many studies in the social sciences, we have no clear understanding of which factors are most important for fertility prediction or how well we are able to predict fertility behaviour.

Purpose statement

To gain insight into how well methods are able to predict fertility within a three-year period (2021-2023), based on survey data from previous years (2007-2020) of people in the LISS panel who were aged 18-45 in 2020. The LISS panel is a representative online longitudinal panel of Dutch households.

License

This project is licensed under the terms of the MIT license.

Acknowledgements

The code in this repository is developed by Eyra as part of the Rank program funded by ODISSEI and the NWO VIDI grant awarded to Gert Stulp. The LISS panel data is provided by Centerdata.
