Final project for course: How to Win a Data Science Competition: Learn from Top Kagglers.
Kaggle competition.
Best public result: 0.969141, best private result: 0.975202.
The biggest problem was the gap between the leaderboard and my validation scores: my validation was too optimistic, possibly because of target leakage in my mean-encoded features.
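For context, the leakage risk with mean encoding comes from using a row's own target value when computing the mean for its category. An out-of-fold scheme avoids this; below is a minimal sketch (the function and column names are illustrative, not taken from this project's code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_mean_encode(df, cat_col, target_col, n_splits=5, seed=0):
    """Out-of-fold mean encoding: each row's encoding is computed
    from the target values of the *other* folds only, so the row's
    own target never leaks into its feature."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    global_mean = df[target_col].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        # category means computed on the training folds only
        fold_means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[val_idx] = (
            df.iloc[val_idx][cat_col].map(fold_means).to_numpy()
        )
    # categories unseen in a training fold fall back to the global mean
    return encoded.fillna(global_mean)
```

Even out-of-fold encoding can leak on time-series data if folds mix past and future, which may be part of what happened here.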
To reproduce my final submission, run the following notebooks in order:
- 2.0-db-creating-dataset.ipynb
- 1.0-db-EDA.ipynb
- 4.0-db-text-features.ipynb
- 5.0-db-lgb.ipynb
Some comments:
- The notebook you should run first is numbered 2, which makes sense logically: we start solving the task with an investigation, and in notebook 1 I also explore some results of notebook 2.
- The baseline model is not needed to produce the final solution.
- The XGBoost model did not make it into the final stacking because it took too long to train.
- You can also read the `src` files; the `TimeSeriesGroupSplit` class helped me a lot.
- You can also look at my stacking scheme in notebook 8.0-db-stacking.ipynb, but it didn't help.
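The class itself lives in `src`; as a rough illustration of the idea, a splitter like this yields expanding-window folds over an ordered group column (e.g. a month index), so each validation fold lies strictly after its training data. This is a sketch of the concept, not the project's actual implementation:

```python
import numpy as np

class TimeSeriesGroupSplit:
    """Expanding-window CV over ordered groups (e.g. month numbers):
    each fold trains on the earliest groups and validates on the next
    block of groups, so validation is always in the 'future'."""

    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y=None, groups=None):
        uniq = np.sort(np.unique(groups))
        # reserve one equally sized block of groups per validation fold
        fold_size = len(uniq) // (self.n_splits + 1)
        fold_sizes = np.full(self.n_splits, fold_size)
        test_starts = len(uniq) - fold_sizes.cumsum()[::-1]
        for start, size in zip(test_starts, fold_sizes):
            train_groups = uniq[:start]
            test_groups = uniq[start:start + size]
            train_idx = np.where(np.isin(groups, train_groups))[0]
            test_idx = np.where(np.isin(groups, test_groups))[0]
            yield train_idx, test_idx

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits
```

Such a splitter plugs into scikit-learn's `cross_val_score`-style APIs via the `groups` argument.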
To follow the logic, read the notebooks in this order:
- 1.0-db-EDA.ipynb (before train exploration)
- 2.0-db-creating-dataset.ipynb
- 1.0-db-EDA.ipynb (after train exploration)
- 4.0-db-text-features.ipynb
- 5.0-db-lgb.ipynb
To install dependencies:

```
conda create --name <env> --file requirements.txt
```
To download the data, activate the environment and place your Kaggle API credentials as described in the Kaggle API documentation. Then run:

```
snakemake data --cores 1
```
This will create all data folders and download the competition data into `data/raw`.
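The `data` rule itself is defined in the Snakefile. For orientation, such a rule could look roughly like the sketch below, assuming the Kaggle CLI is installed; `<competition>` is a placeholder, not the real competition slug:

```
rule data:
    output:
        directory("data/raw")
    shell:
        """
        mkdir -p data/raw data/interim data/processed data/external
        kaggle competitions download -c <competition> -p data/raw
        unzip -o 'data/raw/*.zip' -d data/raw
        """
```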
```
├── Snakefile          <- Snakefile with commands like `snakemake data` or `snakemake train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- Documentation for the project
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `conda list -e > requirements.txt`
│
└── src                <- Source code for use in this project.
    ├── __init__.py    <- Makes src a Python module
    │
    └── utils          <- Useful functions and classes for the whole project
```