This is the source code for Home Depot's data science challenge on Kaggle [+]. Our solution achieves an RMSE of 0.456 and placed in the top 3.5% of the leaderboard.
Team:
- Arman Akbarian
- Ali Narimani
- Hamid Omid
The feature engineering consists of four parts:

- basic features: built in ``xgb.py`` before merging the other features
- extended features: built in ``feat_eng_extended.py``
- extended tf-idf: built in ``feat_eng_tfidf.py`` (followed by ``model_xgb.py``)
- basic tf-idf: done in the pipeline in ``xgb.py`` (see the sketch after this list)

Text preprocessing is done by ``preprocessing.py``.
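For orientation, here is a minimal sketch of the kind of basic tf-idf feature built in ``xgb.py``. This is illustrative only: the column names ``search_term`` and ``product_title`` are the Kaggle field names, and everything else is an assumption rather than the exact code in this repo::

    # Illustrative sketch only -- not the exact feature code in xgb.py / feat_eng_tfidf.py.
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    def tfidf_cosine_feature(df):
        """Cosine similarity between tf-idf vectors of the query and the product title."""
        vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
        vec.fit(pd.concat([df['search_term'], df['product_title']]))
        q = vec.transform(df['search_term'])       # sparse (n_rows, vocab_size)
        t = vec.transform(df['product_title'])
        num = np.asarray(q.multiply(t).sum(axis=1)).ravel()
        den = np.sqrt(np.asarray(q.multiply(q).sum(axis=1)).ravel()
                      * np.asarray(t.multiply(t).sum(axis=1)).ravel()) + 1e-12
        return num / den                           # one numeric feature per row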
Misc Files Description:
- ``spell_corrector.py``: builds ``spell_corr.py``, a python dictionary for correcting spellings in the search query
- ``data_trimming.py``: a pre-cleaning step needed by ``spell_corrector.py`` and ``Synonyms.py``
- ``data_selection.py``: feature selection after the extended features are built
- ``nlp_utils.py``: an NLP utility function and wrapper classes for quick prototyping
- ``ngrams.py``: builds n-grams
- ``Synonyms.py``: finds synonyms of the words in the search query (see the sketch below)
- ``project_params.py``: sets global variables for the project, I/O directories, etc.
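As a rough idea of what ``Synonyms.py`` does, a WordNet-based lookup could look like the following. This is an assumption inferred from the WordNet dependency noted further down, not the script's actual code::

    # Rough sketch of a WordNet synonym lookup; the real logic lives in Synonyms.py.
    from nltk.corpus import wordnet

    def get_synonyms(word):
        """Return the set of WordNet lemma names across all synsets of `word`."""
        syns = set()
        for synset in wordnet.synsets(word):
            for lemma in synset.lemmas():
                syns.add(lemma.name().replace('_', ' '))
        return syns

    # e.g. get_synonyms('couch') includes 'sofa' and 'lounge'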
In ``project_params.py`` edit the following to the correct paths (see the sketch below):

- ``input_root_dir``: the path to the original .csv files on your system
- ``output_root_dir``: a path with a few GB of disk space available for I/O
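A minimal ``project_params.py`` could then look like this; only the two variable names above come from the project, and the paths are placeholders::

    # project_params.py -- sketch of the expected layout; exact contents may differ.
    input_root_dir = '/path/to/homedepot/csv'   # directory with the original Kaggle .csv files
    output_root_dir = '/path/to/scratch'        # a few GB of free space for intermediate I/O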
To perform all of the following setup steps at once, you can simply run:
``make install``
To install the required packages, execute:
``pip install -r requirements.txt``
The project needs the following python packages:

- numpy
- pandas
- scikit-learn
- nltk
You may also need to install the nltk data:

- in python: ``>>> nltk.download()``
- or via the command line: ``python -m nltk.downloader all``
- or the makefile will take care of everything: ``make install``

(In particular, ``nlp_utils.py`` uses the WordNet data.)
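If you would rather not download all of the nltk data, fetching just the WordNet corpus should cover that dependency (assumption: other scripts may still want additional corpora)::

    # Targeted download of the WordNet corpus used by nlp_utils.py
    import nltk
    nltk.download('wordnet')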
The data can be downloaded from the competition's homepage [+].
Run ``make testutils`` to check that things work.
Note: this documentation and repo are under construction; I will hopefully clean up the repo and modify all of the scripts so that the pipeline runs end to end on a single machine.