Predict Future Sales

Final project for the course How to Win a Data Science Competition: Learn from Top Kagglers.

Kaggle competition.

To reviewers

Best public leaderboard score: 0.969141; best private leaderboard score: 0.975202.

The biggest problem is the weak correlation between the leaderboard and my validation: my validation scores were too optimistic. There may be target-leakage problems in my mean-encoded features.
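
For context, one standard remedy for this kind of leakage is an expanding-mean encoding, where each row is encoded using only the targets of earlier rows. A minimal, hypothetical sketch of the idea (the column names are assumptions, not necessarily the features used in this project):

    import pandas as pd

    # Toy frame, assumed sorted by time; item_cnt_month plays the role of the monthly target.
    df = pd.DataFrame({
        "item_id":        [1, 1, 1, 2, 2],
        "item_cnt_month": [3.0, 1.0, 2.0, 5.0, 4.0],
    })

    # Expanding mean: the average target over *previous* rows of the same item,
    # so a row's encoding never includes its own target value.
    cumsum = df.groupby("item_id")["item_cnt_month"].cumsum() - df["item_cnt_month"]
    cumcnt = df.groupby("item_id").cumcount()
    df["item_target_enc"] = cumsum / cumcnt  # NaN for each item's first row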

Running order

To reproduce my final submission, run the notebooks in this order:

  1. 2.0-db-creating-dataset.ipynb
  2. 1.0-db-EDA.ipynb
  3. 4.0-db-text-features.ipynb
  4. 5.0-db-lgb.ipynb

Some comments:

  • The first notebook to run is number 2: in the logical sense we start solving the task with an investigation, and in notebook 1 I also explore some results of notebook 2.
  • The baseline model is not needed to reproduce the final solution.
  • The XGBoost model didn't make it into the final stacking because it took too long to train.
  • You can also read the files in src; the TimeSeriesGroupSplit class helped me a lot (a sketch of the idea follows this list).
  • You can also look at my stacking scheme in 8.0-db-stacking.ipynb, but it didn't help.
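
For orientation, here is a minimal sketch of the idea behind such a splitter, not the repository's actual implementation (see src for the real class): rows sharing a group label (e.g. date_block_num) stay together, and each fold validates on a single later block while training only on earlier blocks.

    import numpy as np

    class TimeSeriesGroupSplit:
        """Sklearn-style splitter: train on earlier groups, validate on a later one."""

        def __init__(self, n_splits=3):
            self.n_splits = n_splits

        def split(self, X, y=None, groups=None):
            groups = np.asarray(groups)
            unique_groups = np.sort(np.unique(groups))
            # The last n_splits groups each serve as a validation block, one per fold.
            for test_group in unique_groups[-self.n_splits:]:
                yield (np.where(groups < test_group)[0],
                       np.where(groups == test_group)[0])

        def get_n_splits(self, X=None, y=None, groups=None):
            return self.n_splits

A class like this can be passed wherever scikit-learn accepts a CV splitter, e.g. cross_val_score(model, X, y, groups=df["date_block_num"], cv=TimeSeriesGroupSplit()).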

Reading order

To follow the logic of the solution, read the notebooks in this order:

  1. 1.0-db-EDA.ipynb (the part before the train exploration)
  2. 2.0-db-creating-dataset.ipynb
  3. 1.0-db-EDA.ipynb (the part after the train exploration)
  4. 4.0-db-text-features.ipynb
  5. 5.0-db-lgb.ipynb

Development

To install dependencies:

conda create --name <env> --file requirements.txt

To download the data, activate the environment and place your Kaggle API config (kaggle.json) as described in the Kaggle API documentation. After that, run:

snakemake data --cores 1

This will create all data folders and download the competition data into data/raw.
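
For orientation, the data rule in the Snakefile could look roughly like the sketch below. This is not the project's actual rule; the competition slug and the unzip step are assumptions:

    rule data:
        output:
            directory("data/raw")
        shell:
            """
            mkdir -p data/raw data/interim data/processed data/external
            kaggle competitions download -c competitive-data-science-predict-future-sales -p data/raw
            unzip -o data/raw/competitive-data-science-predict-future-sales.zip -d data/raw
            """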

Project structure


├── Snakefile           <- Snakefile with commands like `snakemake data` or `snakemake train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- Documentation for the project
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `conda list -e > requirements.txt`
│
└── src                <- Source code for use in this project.
    ├── __init__.py    <- Makes src a Python module
    │
    └── utils          <- Useful functions and classes used across the project
