This project was made for the AMMI Ghana Bootcamp Kaggle competition.
Given the following features:
- country (String) The country that the wine is from
- province (String) The province or state that the wine is from
- region_1 (String) The wine-growing area in a province or state (e.g. Napa)
- region_2 (String) A more specific region within the wine-growing area (e.g. Rutherford inside the Napa Valley); this value is sometimes blank
- winery (String) The winery that made the wine
- variety (String) The type of grapes used to make the wine (e.g. Pinot Noir)
- designation (String) The vineyard within the winery where the grapes for the wine were grown
- taster_name (String) The name of the taster
- taster_twitter_handle (String) The Twitter account name of the taster
- description (String) A few sentences from a sommelier describing the wine's taste, smell, look, feel, etc.
- points (Numeric) Number of points WineEnthusiast rated the wine on a scale of 1-100
We need to predict the price (Numeric): the cost of a bottle of wine.
Install the dependencies with:
pip3 install -r requirements.txt
For the models to handle the categorical features, some preprocessing was applied:
- country, region_2, province, taster_name and variety were encoded as one-hot vectors
- title, region_1 and designation were vectorized using CountVectorizer from sklearn
- taster_twitter_handle was ignored because the information it adds is redundant (see visualisation.ipynb)
- Finally, the description feature was encoded using Word2Vec, by summing the word vectors of all the words in each training example's description
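The summing step for the description feature can be sketched as follows. This is a minimal illustration, not the project's code: the `EMBEDDINGS` table, its values, and the `embed_description` helper are hypothetical, and the lowercase whitespace tokenization is an assumption. In the project the vectors would come from a trained Word2Vec model rather than a hand-written dictionary.

```python
from typing import Dict, List

# Hypothetical toy embedding table. In the project these vectors would come
# from a trained Word2Vec model; the words and values here are made up.
EMBEDDINGS: Dict[str, List[float]] = {
    "ripe": [0.5, 0.0, 1.0],
    "cherry": [1.0, 2.0, 0.0],
    "oak": [0.0, 1.0, 0.5],
}
DIM = 3


def embed_description(
    text: str, embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    """Sum the vectors of all known words in `text`; unknown words are skipped."""
    total = [0.0] * dim
    # Assumed tokenization: lowercase, split on whitespace.
    for token in text.lower().split():
        vector = embeddings.get(token)
        if vector is None:
            continue  # an out-of-vocabulary word contributes nothing
        total = [t + v for t, v in zip(total, vector)]
    return total


# "and" is not in the table, so only three words contribute to the sum.
features = embed_description("Ripe cherry and oak", EMBEDDINGS, DIM)
```

Summing gives every description a fixed-length vector regardless of how many words it contains, which is what lets it sit alongside the one-hot and count features. Note that a sum, unlike a mean, grows with the length of the description.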
Models used:
- Linear regression
- Decision trees
- Random Forests
- Neural networks
Techniques used:
- K-fold cross-validation
- Word embeddings
- Grid search hyper-parameter optimization
- One-hot encoding
- CountVectorizer
- PCA
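
As an illustration of how K-fold cross-validation and grid search fit together, here is a small sketch using scikit-learn's `GridSearchCV` with a decision tree regressor. The data is synthetic, and the parameter grid, the number of folds, and the mean squared error scoring are assumptions made for the example; the settings actually used in the project are not stated here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 200 rows, 5 numeric features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Hypothetical grid; the values searched in the project are not stated.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}

# 5-fold cross-validation: each candidate is trained on 4 folds and
# scored on the remaining one, rotating through all 5 folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=cv,
    scoring="neg_mean_squared_error",  # assumed metric for this sketch
)
search.fit(X, y)

best_params = search.best_params_
best_mse = -search.best_score_  # flip the sign back to a plain MSE
```

Every combination in the grid (six here) is scored by its average error across the folds, and `best_params` holds the combination with the lowest average.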