This is the source code for Home Depot's data science challenge on Kaggle [+]. Our solution achieves an RMSE of 0.456 and placed in the top 3.5% of the leaderboard.
Team:
- Arman Akbarian
- Ali Narimani
- Hamid Omid
The feature engineering consists of four parts:

- basic features: built in ``xgb.py`` before merging the other features
- extended features: built in ``feat_eng_extended.py``
- extended tf-idf: built in ``feat_eng_tfidf.py`` (followed by ``model_xgb.py``)
- basic tf-idf: done in the pipeline in ``xgb.py`` (see the sketch after this list)

Text preprocessing is done by ``preprocessing.py``.
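For orientation, here is a minimal sketch of the kind of basic tf-idf feature built in ``xgb.py``. This is illustrative only: the column names ``search_term`` and ``product_title`` are the Kaggle field names, and everything else is an assumption rather than the exact code in this repo::

    # Illustrative sketch only -- not the exact feature code in xgb.py / feat_eng_tfidf.py.
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    def tfidf_cosine_feature(df):
        """Cosine similarity between tf-idf vectors of the query and the product title."""
        vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
        vec.fit(pd.concat([df['search_term'], df['product_title']]))
        q = vec.transform(df['search_term'])       # sparse (n_rows, vocab_size)
        t = vec.transform(df['product_title'])
        num = np.asarray(q.multiply(t).sum(axis=1)).ravel()
        den = np.sqrt(np.asarray(q.multiply(q).sum(axis=1)).ravel()
                      * np.asarray(t.multiply(t).sum(axis=1)).ravel()) + 1e-12
        return num / den                           # one numeric feature per row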
Misc Files Description:
- ``spell_corrector.py``: builds ``spell_corr.py``, a python dictionary for correcting spellings in the search query
- ``data_trimming.py``: a pre-cleaning step needed by ``spell_corrector.py`` and ``Synonyms.py``
- ``data_selection.py``: feature selection after the extended features are built
- ``nlp_utils.py``: an NLP utility function and wrapper classes for quick prototyping
- ``ngrams.py``: builds n-grams
- ``Synonyms.py``: finds synonyms of the words in the search query (see the sketch below)
- ``project_params.py``: sets global variables for the project, I/O directories, etc.
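As a rough idea of what ``Synonyms.py`` does, a WordNet-based lookup could look like the following. This is an assumption inferred from the WordNet dependency noted further down, not the script's actual code::

    # Rough sketch of a WordNet synonym lookup; the real logic lives in Synonyms.py.
    from nltk.corpus import wordnet

    def get_synonyms(word):
        """Return the set of WordNet lemma names across all synsets of `word`."""
        syns = set()
        for synset in wordnet.synsets(word):
            for lemma in synset.lemmas():
                syns.add(lemma.name().replace('_', ' '))
        return syns

    # e.g. get_synonyms('couch') includes 'sofa' and 'lounge'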
In ``project_params.py`` edit the following to the correct paths (see the sketch below):

- ``input_root_dir``: the path to the original .csv files on your system
- ``output_root_dir``: a path with a few GB of disk space available for I/O
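A minimal ``project_params.py`` could then look like this; only the two variable names above come from the project, and the paths are placeholders::

    # project_params.py -- sketch of the expected layout; exact contents may differ.
    input_root_dir = '/path/to/homedepot/csv'   # directory with the original Kaggle .csv files
    output_root_dir = '/path/to/scratch'        # a few GB of free space for intermediate I/O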
To perform all of the following setup steps at once, you can simply run:
``make install``
To install the required packages, execute:
``pip install -r requirements.txt``
The project needs the following python packages:

- numpy
- pandas
- scikit-learn
- nltk
You may also need to install the nltk data:

- in python: ``>>> nltk.download()``
- or via the command line: ``python -m nltk.downloader all``
- or the makefile will take care of everything: ``make install``

(In particular, ``nlp_utils.py`` uses the WordNet data.)
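If you would rather not download all of the nltk data, fetching just the WordNet corpus should cover that dependency (assumption: other scripts may still want additional corpora)::

    # Targeted download of the WordNet corpus used by nlp_utils.py
    import nltk
    nltk.download('wordnet')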
The data can be downloaded from the competition's homepage [+].
Run ``make testutils`` to check that things work.
Note: this documentation and repo are under construction; I will hopefully clean up the repo and modify all of the scripts so that the pipeline runs end to end on a single machine.