Cement is one of the components of concrete. Concrete is produced by mixing the required ingredients in the required proportions. The strength of concrete may be influenced by:
- Ratio of cement to water
- Size of aggregate
- Texture and stiffness of the particles
In this project, we aim to predict the compressive strength of concrete from data about its mix composition and age. The data set for this project comes from the UCI ML repository.
The data set has the following attributes:
- Cement
- Blast Furnace Slag
- Fly Ash
- Water
- Superplasticizer
- Coarse Aggregate
- Fine Aggregate
- Age
- Concrete compressive strength
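As a rough sketch of how these attributes might be loaded, the snippet below reads a CSV export of the data set with pandas and separates the eight input attributes from the target. The short column names and the `load_concrete` helper are made up for this example; the headers in the original file may differ, and the file may need to be converted to CSV first.

```python
import pandas as pd

# Short column names, in the same order as the attribute list above.
# These names are an assumption made for this sketch.
COLUMNS = [
    "cement", "blast_furnace_slag", "fly_ash", "water", "superplasticizer",
    "coarse_aggregate", "fine_aggregate", "age", "strength",
]


def load_concrete(path):
    """Read a CSV export of the data set and split it into features and target."""
    # header=0 together with names= replaces the file's own header row.
    df = pd.read_csv(path, header=0, names=COLUMNS)
    X = df.drop(columns="strength")  # the eight input attributes
    y = df["strength"]               # concrete compressive strength
    return X, y
```

With a hypothetical file name, this would be used as `X, y = load_concrete("concrete.csv")`.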
You should have the following software and libraries installed:
- Python 3
- Scikit-learn
- Jupyter notebook
- Scipy
- Numpy
- Pandas
- Matplotlib
- Seaborn
Good understanding of the following algorithms is required.
We aim to predict a target variable from some given input variables. One way to do this is to fit a line through the data set, as shown.
To generalise, you draw the straight line that lies as close as possible to all the points, that is, the line that minimises the total squared distance between the points and the line. Once we have done that, we can predict the target value using that line as the hypothesis function.
The only problem is how to choose the slope and intercept of the line.
We can use calculus to find the slope and intercept that minimise the error between the line and the data points. More commonly, gradient descent is used to update the parameters step by step. See Gradient descent for more details.
We will not go into detailed mathematics because this note just provides intuition for the algorithms.
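To make the intuition concrete, here is a minimal sketch of fitting a line by gradient descent on the mean squared error, compared against NumPy's closed-form least-squares fit. The data is synthetic, and the learning rate and step count are arbitrary choices for this example, not tuned values.

```python
import numpy as np


def fit_line_gd(x, y, lr=0.01, steps=5000):
    """Fit y ~ slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        error = slope * x + intercept - y
        # Gradients of MSE = mean(error ** 2) with respect to each parameter
        grad_slope = 2.0 / n * np.dot(error, x)
        grad_intercept = 2.0 / n * error.sum()
        slope -= lr * grad_slope
        intercept -= lr * grad_intercept
    return slope, intercept


# Synthetic data: a noisy line with slope 3 and intercept 2
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=100)

slope_gd, intercept_gd = fit_line_gd(x, y)
slope_ls, intercept_ls = np.polyfit(x, y, deg=1)  # closed-form least squares

print(f"gradient descent: slope={slope_gd:.3f}, intercept={intercept_gd:.3f}")
print(f"least squares:    slope={slope_ls:.3f}, intercept={intercept_ls:.3f}")
```

Both approaches minimise the same error, so they should agree closely once gradient descent has converged.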
Sometimes a line may not fit the data because the relationship between the attributes and the target is non-linear and follows a higher-degree polynomial. In that case a straight line won't be enough, and we need to increase the degree of the attributes (which can be done using the scikit-learn library).
Check the below figure for linear regression.
As you can see in the figure, the given line can fit the data almost perfectly. Simple linear regression is fine in this case.
Now look at this figure.
The line for predicted speeds is not able to fit the ground truth well. The data is too complex for simple linear regression to make good predictions. We need higher-degree attributes.
Now look at this figure.
This hypothesis function fits the data almost perfectly. So to decide between linear and polynomial regression, you have to visualize the data. Python's seaborn library is a good aid for visualization.
For detailed mathematics behind polynomial regression, Click Here
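As an illustration of raising the degree of the attributes with scikit-learn, the sketch below fits both a plain linear regression and a degree-3 polynomial regression to synthetic cubic data. The data and the choice of degree 3 are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a cubic relationship plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(0, 0.5, size=200)

# Plain linear regression: a straight line
linear = LinearRegression().fit(X, y)

# Polynomial regression: expand x into [1, x, x^2, x^3], then fit a linear model
cubic = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(X, y)

r2_linear = linear.score(X, y)
r2_cubic = cubic.score(X, y)
print(f"R^2 linear: {r2_linear:.3f}, R^2 cubic: {r2_cubic:.3f}")
```

The polynomial model is still linear in its parameters; only the attributes have been expanded.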
First let's look at Ensemble Learning. Ensemble learning is a technique that combines the predictions from multiple weak machine learning algorithms together to make more accurate predictions. Since the model is comprised of many models, it is called an Ensemble model. Look at the image below for better understanding.
Random forest is a supervised machine learning algorithm that can be used for classification and regression. Its trees do not interact with one another; they are completely independent. Each tree acts as an independent model, which can be combined with the others to form an ensemble. Each tree uses a random sample of the original data set when generating its splits, adding randomness to reduce overfitting. For many data sets, it produces a highly accurate model. It gives estimates of which variables are important for the prediction and can therefore be used for feature selection. It also has an effective method for estimating missing data while maintaining accuracy.
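A minimal sketch of a random forest used for regression with scikit-learn is shown below, including the variable-importance estimates mentioned above. The data is synthetic, with only the first two of four attributes affecting the target, and the hyperparameters are illustrative rather than tuned.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: only the first two of the four attributes influence the target
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 4))
y = 10 * X[:, 0] + 5 * np.sin(2 * np.pi * X[:, 1]) + rng.normal(0, 0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 100 independent trees, each trained on a bootstrap sample of the training data
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

r2 = forest.score(X_test, y_test)          # R^2 on held-out data
importances = forest.feature_importances_  # one value per attribute, summing to 1
print(f"test R^2: {r2:.3f}")
print("importances:", np.round(importances, 3))
```

Attributes with importances close to zero are candidates for removal during feature selection.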
SVM is one of the most powerful machine learning algorithms available today. A simple linear regression can only fit a straight line (or a flat hyperplane) to the data points, while support vector regression (SVR) can, with a kernel, also fit non-linear relationships. In SVM, the margin is the perpendicular distance between the hyperplane and the closest points, and SVM tries to maximise this margin. In SVR, the margin acts as a no-penalty region around the hyperplane in the multi-dimensional data: errors that fall inside it are ignored.
Look at the image.
Fitting a hyperplane to this data is very difficult, especially if it is high dimensional. SVM handles this task well. SVR tries to keep as many data points as possible within the margin, thus keeping the error within the threshold.
A major difference between linear regression and SVR is that linear regression tries to minimise the error, whereas SVR tries to keep it within a threshold.
For further information visit Sklearn SVR
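The threshold idea can be sketched with scikit-learn's `SVR`, where the `epsilon` parameter sets the half-width of the no-penalty region. The example below uses synthetic data, and the values of `C` and `epsilon` are arbitrary choices for illustration. Points that lie strictly inside the region do not become support vectors, so a wider region typically leaves fewer of them.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data: a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=80)

# Same kernel and C, different widths of the no-penalty region
narrow = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X, y)
wide = SVR(kernel="rbf", C=10.0, epsilon=0.3).fit(X, y)

# support_ holds the indices of the training points used as support vectors
n_sv_narrow = len(narrow.support_)
n_sv_wide = len(wide.support_)
print(f"support vectors with epsilon=0.05: {n_sv_narrow}")
print(f"support vectors with epsilon=0.3:  {n_sv_wide}")
```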
- Linear regression is a linear model, which means it works really well on data with linear relationships. However, a linear model cannot capture non-linear features.
- In that case, you can use decision trees or random forests, which do a better job of capturing the non-linearity in the data by dividing the space into smaller sub-spaces.
- Random forests are ensemble models, which makes decision trees more robust to noisy data, whereas standard regression methods can easily be confused by noise, resulting in high error.
- Normally, support vector models perform better on sparse data than random forests. Decision trees, on the other hand, handle non-linear data well and train faster, but they have a tendency to overfit.
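One hedged way to check points like these on a given data set is to compare the models with cross-validation. The sketch below does this on synthetic data that was deliberately built to be strongly non-linear, so the numbers say nothing about the concrete data set itself. The hyperparameters are illustrative, and the SVR is wrapped in a scaling pipeline because it is sensitive to the scale of the attributes.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: a strongly non-linear effect plus a weak linear one
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 3))
y = 10 * np.cos(2 * np.pi * X[:, 0]) + 5 * X[:, 1] + rng.normal(0, 0.5, size=400)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "svr": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)),
}

# Mean R^2 over 5 cross-validation folds for each model
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

On real data the ranking can differ, so the same comparison should be rerun on the actual attributes before choosing a model.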
- DevIncept Mentor
- Google for images
- Medium blog posts
- Wikipedia
- UCI Machine Learning dataset repository