The main objective of this project is to present a report based on a dataset of the main features of diamonds and their prices. In this first approach, the data is analyzed to see how the variables relate to one another.
git clone
The exploratory analysis is done in a Jupyter notebook named data_analysis_report.ipynb, based on the dataset diamonds_train.csv located in the data folder.
The following must be installed to run the exploratory analysis:
- Python
- Pandas
- Matplotlib
- Seaborn
See the following website to learn about the features of diamonds. In summary:
- Carat: The term carat actually refers to the diamond's total weight and not its size.
- Cut: The most important of the 4Cs is cut because it has the greatest influence on a diamond's sparkle.
- Color: The second most important of the 4Cs is color, which refers to a diamond's lack of color. The less color, the higher the grade.
- Clarity: Often the least important of the 4Cs because the tiny imperfections are often microscopic.
- Depth: The height of a gemstone measured from the culet to the table.
- Table: The largest facet of a gemstone.
The analysis is a statistical study of the diamonds sample dataset from Kaggle. It is structured in two parts:
- Basic analysis: A first approach to get to know the data we have.
- Exploratory analysis: A second approach that compares the price variable with the other features, with a view to building a predictive model in the future. Here the values are plotted to better understand the data.
The first approach is a basic analysis to learn what type of data the dataset contains. I classified the variables as categorical or numerical and ran a statistical analysis to find the most common values of each. I also examined the maximum and minimum of each variable to better understand the extreme values in the dataset.
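The basic analysis described above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the helper name `basic_summary`, the lowercase column names, and the tiny in-memory sample are assumptions made for the example. In practice the data frame would come from reading data/diamonds_train.csv.

```python
import pandas as pd


def basic_summary(df):
    """Split columns by type and report extremes and most common values.

    Hypothetical helper written for illustration only.
    """
    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")
    return {
        "numeric_columns": list(numeric.columns),
        "categorical_columns": list(categorical.columns),
        # Extreme values of every numerical variable
        "min": numeric.min().to_dict(),
        "max": numeric.max().to_dict(),
        # Most frequent value of every categorical variable
        "most_common": {
            col: categorical[col].mode().iloc[0] for col in categorical.columns
        },
    }


# Made-up rows standing in for the real dataset (column names are assumed)
sample = pd.DataFrame(
    {
        "carat": [0.3, 0.7, 1.0, 1.5],
        "cut": ["Ideal", "Premium", "Ideal", "Good"],
        "price": [500, 2500, 5000, 9000],
    }
)

summary = basic_summary(sample)
print(summary)
```

On a full dataset, `df.describe()` gives a similar overview of the numerical columns in a single call.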
In the second approach, I focused the analysis on the price variable and compared it with the other variables to find out whether any correlations exist between them. To do this, I built a correlation matrix plot. I then compared price with the numerical and categorical variables one by one to identify which have the strongest relationship with it. With this information, I will later predict the price of diamonds from their main features.
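A rough sketch of the price comparison described above, using pandas only. The helper names `price_correlations` and `mean_price_by`, the column names, and the sample rows are all assumptions for illustration and are not taken from the notebook.

```python
import pandas as pd


def price_correlations(df, target="price"):
    """Rank numerical variables by their Pearson correlation with the target."""
    corr = df.select_dtypes(include="number").corr()
    return corr[target].drop(target).sort_values(ascending=False)


def mean_price_by(df, category, target="price"):
    """Average target value for each level of a categorical variable."""
    return df.groupby(category)[target].mean().sort_values(ascending=False)


# Made-up rows standing in for the real dataset (column names are assumed)
sample = pd.DataFrame(
    {
        "carat": [0.3, 0.7, 1.0, 1.5],
        "depth": [62.0, 61.0, 63.0, 60.0],
        "cut": ["Ideal", "Premium", "Ideal", "Good"],
        "price": [500, 2500, 5000, 9000],
    }
)

ranked = price_correlations(sample)
by_cut = mean_price_by(sample, "cut")
print(ranked)
print(by_cut)
```

The full matrix returned by `DataFrame.corr()` can be passed to `seaborn.heatmap` to draw the correlation matrix plot.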
These analyses indicate that price is strongly correlated with the carat variable, and that combining the categorical attributes of the diamonds yields further relevant information.
Dashboards are powerful tools for communicating important information at a glance. The goal of this challenge is to build a data dashboard using our diamonds dataset that will help me perform better during the Module 3 project.
The following link leads to a single interactive interface built around a specific objective: understanding the relationships between diamond attributes (features) and price.
Please visit my public Tableau profile and take a look at the project named diamonds_project2.
Browse issues:
If you have any suggestions or questions, you can contact me via email: "[email protected]"
Licence: The Unlicense