Final project for course: How to Win a Data Science Competition: Learn from Top Kagglers.
Kaggle competition.
Best public result: 0.969141, best private result: 0.975202.
The biggest problem was the gap between the leaderboard and my validation scores: my validation was too optimistic, possibly because of target leakage in my mean-encoded features.
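For context, the leakage risk with mean encoding comes from using a row's own target value when computing the mean for its category. An out-of-fold scheme avoids this; below is a minimal sketch (the function and column names are illustrative, not taken from this project's code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_mean_encode(df, cat_col, target_col, n_splits=5, seed=0):
    """Out-of-fold mean encoding: each row's encoding is computed
    from the target values of the *other* folds only, so the row's
    own target never leaks into its feature."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    global_mean = df[target_col].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        # category means computed on the training folds only
        fold_means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[val_idx] = (
            df.iloc[val_idx][cat_col].map(fold_means).to_numpy()
        )
    # categories unseen in a training fold fall back to the global mean
    return encoded.fillna(global_mean)
```

Even out-of-fold encoding can leak on time-series data if folds mix past and future, which may be part of what happened here.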
To reproduce my final submission, run the following notebooks in order:
- 2.0-db-creating-dataset.ipynb
- 1.0-db-EDA.ipynb
- 4.0-db-text-features.ipynb
- 5.0-db-lgb.ipynb
Some comments:
- The notebook you should run first is numbered 2, which makes sense logically: we start solving the task with an investigation, and in notebook 1 I also explore some results of notebook 2.
- The baseline model is not needed to produce the final solution.
- The XGBoost model did not make it into the final stacking because it took too long to train.
- You can also read the `src` files; the `TimeSeriesGroupSplit` class helped me a lot.
- You can also look at my stacking scheme in notebook 8.0-db-stacking.ipynb, but it didn't help.
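The class itself lives in `src`; as a rough illustration of the idea, a splitter like this yields expanding-window folds over an ordered group column (e.g. a month index), so each validation fold lies strictly after its training data. This is a sketch of the concept, not the project's actual implementation:

```python
import numpy as np

class TimeSeriesGroupSplit:
    """Expanding-window CV over ordered groups (e.g. month numbers):
    each fold trains on the earliest groups and validates on the next
    block of groups, so validation is always in the 'future'."""

    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y=None, groups=None):
        uniq = np.sort(np.unique(groups))
        # reserve one equally sized block of groups per validation fold
        fold_size = len(uniq) // (self.n_splits + 1)
        fold_sizes = np.full(self.n_splits, fold_size)
        test_starts = len(uniq) - fold_sizes.cumsum()[::-1]
        for start, size in zip(test_starts, fold_sizes):
            train_groups = uniq[:start]
            test_groups = uniq[start:start + size]
            train_idx = np.where(np.isin(groups, train_groups))[0]
            test_idx = np.where(np.isin(groups, test_groups))[0]
            yield train_idx, test_idx

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits
```

Such a splitter plugs into scikit-learn's `cross_val_score`-style APIs via the `groups` argument.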
To follow the logic, read the notebooks in this order:
- 1.0-db-EDA.ipynb (before train exploration)
- 2.0-db-creating-dataset.ipynb
- 1.0-db-EDA.ipynb (after train exploration)
- 4.0-db-text-features.ipynb
- 5.0-db-lgb.ipynb
To install dependencies:

```
conda create --name <env> --file requirements.txt
```
To download the data, activate the environment and place your Kaggle API credentials as described in the Kaggle API documentation. Then run:

```
snakemake data --cores 1
```
This will create all data folders and download the competition data into `data/raw`.
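The `data` rule itself is defined in the Snakefile. For orientation, such a rule could look roughly like the sketch below, assuming the Kaggle CLI is installed; `<competition>` is a placeholder, not the real competition slug:

```
rule data:
    output:
        directory("data/raw")
    shell:
        """
        mkdir -p data/raw data/interim data/processed data/external
        kaggle competitions download -c <competition> -p data/raw
        unzip -o 'data/raw/*.zip' -d data/raw
        """
```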
```
├── Snakefile          <- Snakefile with commands like `snakemake data` or `snakemake train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- Documentation for the project
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `conda list -e > requirements.txt`
│
└── src                <- Source code for use in this project.
    ├── __init__.py    <- Makes src a Python module
    │
    └── utils          <- Useful functions and classes for the whole project
```