The objective of this lab is to familiarize you with linear regression. We will be using the spam email dataset based on the Enron case. You can find it in this repository and also in the original post linked at the bottom.
Deliverable: A report with the required screenshots and answers to the questions in the instructions below.
The first step in any data science/ML/AI project is to take a look at the dataset. The objective is two-fold. First, we need to know if there are any missing data points or if we have data in unexpected formats. Second, we want to know basic statistical facts about the dataset including number of rows and columns, maximum and minimum values, and the largest correlations between columns.
- Run the file `data-exploration.py` on a terminal. You can do this by typing `python data-exploration.py`. Note that you cannot submit this file through the `sbatch` command due to the `input()` functions.
- Press enter to move through the script.
- Observe the outputs.
- How many rows and columns are in the dataset?
- In a Pandas DataFrame, what is the index?
- What is the index of the 10th email message in the dataset?
- What is the difference between the index and the contents of the 'Email no.' column?
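The checks described above (size, missing values, minimum and maximum values, and the index) can be sketched with pandas as follows. This is a minimal illustration on a made-up four-row table, not the code in `data-exploration.py`; the column values, including the format of the 'Email no.' entries, are assumptions.

```python
import pandas as pd

# Synthetic stand-in for the dataset: one row per email, one column per word count.
# All values here are invented for illustration.
df = pd.DataFrame(
    {
        "Email no.": ["Email 1", "Email 2", "Email 3", "Email 4"],
        "enron": [0, 2, 0, 1],
        "deal": [1, 3, None, 0],  # one missing value on purpose
        "Prediction": [0, 1, 0, 1],
    }
)

# Number of rows and columns.
n_rows, n_cols = df.shape

# Count of missing values in each column.
missing_per_column = df.isna().sum()

# Basic statistics (count, mean, min, max, ...) for the numeric columns.
summary = df.describe()

# The index holds the row labels; 'Email no.' is an ordinary data column.
third_label = df.index[2]
third_email_no = df.loc[third_label, "Email no."]

print(n_rows, n_cols)
print(missing_per_column)
print(summary.loc[["min", "max"]])
print(third_label, third_email_no)
```

Comparing `third_label` with `third_email_no` is one way to see how the index and a data column can differ.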
The next step for data exploration is to calculate the correlations between the columns in the dataset. The correlations may provide further insights into how words are related across emails.
- Run the file `correlations.py`. This file can be run through `sbatch`.
- If you run the file more than once, make sure you comment and uncomment the appropriate lines in such a way that you load the file instead of recomputing it.
- Observe the output.
- Now, change lines 56 to 60 to save the top and bottom correlations into files.
- What are the top 5 correlations?
- What are the bottom 5 correlations?
- What, if anything, can you infer from the top correlations?
- Include the lines that you modified to save the correlations in your report (number 4 above)
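One possible way to rank pairwise correlations and write them out is sketched below. It is not the code in `correlations.py`: the column names (`word_a` to `word_d`) and the data are synthetic, and the output is kept as CSV text rather than written to a particular file.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic word counts; word_b is built from word_a so the two correlate.
rng = np.random.default_rng(0)
base = rng.poisson(3, size=200)
df = pd.DataFrame(
    {
        "word_a": base,
        "word_b": base + rng.poisson(1, size=200),
        "word_c": rng.poisson(3, size=200),
        "word_d": rng.poisson(3, size=200),
    }
)

# Pearson correlation between every pair of columns.
corr = df.corr()

# Visit each unordered pair once, which skips the diagonal and mirrored entries.
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(df.columns, 2)},
    name="correlation",
).sort_values(ascending=False)

top = pairs.head(2)      # largest correlations
bottom = pairs.tail(2)   # smallest correlations

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = top.to_csv()
print(csv_text)
```

Passing a file name to `to_csv` would write the same content to disk.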
Scatter plots allow us to visualize the relationships between any two variables in the dataset.
- Run the file `scatter-plots.py` and observe the output plot.
- You will need to modify line 36 to save the plot instead of using the show() function.
- Do you observe a relationship between the words enron and deal? Explain why or why not.
- Create at least three more scatter plots using different word combinations. Choose the word combinations based on the top correlations that you found using the `correlations.py` file.
- Answer question 1 for all your scatter plots.
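Saving a plot instead of showing it is useful when no display is available, for example in a batch job. The sketch below shows the general pattern with Matplotlib on synthetic counts. The variable names and the output file name are hypothetical, and the file is written to the system temporary directory only to keep the example self-contained.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic word counts standing in for two columns of the dataset.
rng = np.random.default_rng(1)
x = rng.poisson(2, size=100)
y = x + rng.poisson(1, size=100)

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.5)
ax.set_xlabel("word_a count")
ax.set_ylabel("word_b count")
ax.set_title("word_a vs. word_b")

# savefig writes the figure to disk; it takes the place of plt.show().
out_path = os.path.join(tempfile.gettempdir(), "scatter-word_a-word_b.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)
print(out_path)
```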
This file implements linear regression.
- Run the file `linear-regression.py`.
- Observe the output.
- You may need to change the show() function to observe the plots.
- Why are we dropping the 'Email no' and 'Prediction' columns?
- What does the value 0.2 in the test_size option of the function `train_test_split()` mean?
- What is the MSE? Is this a good or bad result? Explain why.
- Rerun the file at least two more times with different target and input columns. Answer question 3 for all your word combinations.
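The general workflow of splitting the data, fitting a model, and measuring the MSE can be sketched as below. This is an illustration on synthetic data, not the contents of `linear-regression.py`. The baseline that always predicts the training mean is an added assumption, included because an MSE is easier to judge next to a reference value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: one input column and a noisy linear target.
rng = np.random.default_rng(2)
X = rng.poisson(3, size=(500, 1)).astype(float)
y = 2.0 * X[:, 0] + rng.normal(0.0, 1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))  # sizes of the two splits

model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))

# Reference point: always predict the mean of the training targets.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_mse = mean_squared_error(y_test, baseline_pred)

print(f"model MSE: {mse:.3f}, baseline MSE: {baseline_mse:.3f}")
```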
Multi-linear regression uses the same training algorithm as linear regression but with multiple input variables.
- Run the file `multi-linear-regression.py`.
- Observe the output.
- You may need to change the show() function to observe the plots.
- The coefficient output has multiple values. Why does multilinear regression have multiple coefficients while linear regression only has one?
- Modify the code to determine if there are coefficients equal (or almost equal) to zero. Are there any coefficients equal to zero? If there are, what does it mean that a coefficient is zero?
- Modify the code to use the Lasso regression algorithm. Do you get more zero coefficients?
- Observe the output plot carefully. You should see that some of the predictions are negative. However, we know that there are no negative values in our dataset, and it would make no sense to have a negative number of words in an email. Read this webpage and replace the Linear Regression algorithm with an appropriate algorithm for this ML task.
- Provide a screenshot of a plot of your results with the new algorithm that prevents negative predictions.
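Checking for coefficients that are (almost) zero can be done by comparing their absolute values against a small tolerance. The sketch below does this for ordinary least squares and for Lasso on synthetic data. The tolerance `tol` and the Lasso strength `alpha=0.5` are arbitrary illustrative choices, not values taken from the lab scripts.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: six input columns, of which only the first two affect the target.
rng = np.random.default_rng(3)
X = rng.poisson(3, size=(400, 6)).astype(float)
y = 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 1.0, size=400)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Treat a coefficient as zero when its magnitude is below a small tolerance.
tol = 1e-6
n_zero_ols = int(np.sum(np.abs(ols.coef_) < tol))
n_zero_lasso = int(np.sum(np.abs(lasso.coef_) < tol))

print("OLS coefficients:  ", np.round(ols.coef_, 3), "zeros:", n_zero_ols)
print("Lasso coefficients:", np.round(lasso.coef_, 3), "zeros:", n_zero_lasso)
```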
Balaka Biswas here.
You will complete this lab by modifying the code from the linear regression lab. In logistic regression, we are interested in predicting the value of the 'Prediction' column in the dataset. This requires repeating the steps of the linear regression lab, this time with logistic regression.
- Copy the scatter-plots.py file and name the copy scatter-plot-classification.py
- Modify the code to display the points on the plot according to their label in the 'Prediction' column. For example, the dots that correspond to spam email can be painted red while the regular email can be painted blue.
- Do you observe a relationship between the words enron, deal, and the dot colors? Explain why or why not.
- Create at least three more scatter plots using different word combinations. Choose the word combinations based on the top correlations that you found using the `correlations.py` file.
- Answer question 1 for all your scatter plots with colored dots.
- Include screenshots of your scatter plots.
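One way to color the dots by label is to draw each class in its own `scatter` call, which also gives a legend for free. The sketch below uses synthetic data. It assumes that the label 1 marks spam and 0 marks regular email, and the output file name is hypothetical.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic labels and word counts; word_a is made more frequent for label 1.
rng = np.random.default_rng(4)
labels = rng.integers(0, 2, size=150)
x = rng.poisson(2 + 2 * labels)
y = rng.poisson(3, size=150)

fig, ax = plt.subplots()
# Assumption: 1 = spam (red), 0 = regular email (blue).
for value, color, name in [(0, "blue", "regular"), (1, "red", "spam")]:
    mask = labels == value
    ax.scatter(x[mask], y[mask], color=color, label=name, alpha=0.5)

ax.set_xlabel("word_a count")
ax.set_ylabel("word_b count")
ax.legend()

out_path = os.path.join(tempfile.gettempdir(), "scatter-classification.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)
```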
- Copy the file linear-regression.py and name the copy logistic-regression.py
- Modify the code to ensure that the 'Prediction' column is not dropped.
- Modify the header to ensure that the LogisticRegression class from scikit-learn is imported.
- Modify the code that calls the LinearRegression model and instead call the LogisticRegression model.
- Modify the performance measure line to use an appropriate performance measure. The MSE is no longer appropriate in logistic regression.
- Remove the Coefficient and Intercept lines.
- The plot should show the classification results. Usually, we use a confusion matrix. Scikit-learn has a function to create it.
- Explain what a confusion matrix is. What are the rows? What are the columns? What do you expect to see in the confusion matrix if your model makes perfect predictions? What if it is completely wrong every time?
- Include a screenshot of your confusion matrix.
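The modifications listed above can be sketched on synthetic data as follows. This is not the lab's `logistic-regression.py`. Accuracy is used here as one possible classification measure, which is a choice made for this example rather than one stated in the lab.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data: two word-count columns and a binary label.
rng = np.random.default_rng(5)
labels = rng.integers(0, 2, size=600)
X = np.column_stack(
    [rng.poisson(1 + 3 * labels), rng.poisson(2, size=600)]
).astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# confusion_matrix takes the true labels first and the predicted labels second.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
acc = accuracy_score(y_test, y_pred)

print(cm)
print(f"accuracy: {acc:.3f}")
```

Scikit-learn also provides `ConfusionMatrixDisplay` in `sklearn.metrics` for drawing the matrix as a plot.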
In this lab, we will use a decision tree to classify the emails into spam and normal emails. The main advantage of decision trees is that they tell us which features are being used to decide the classification output.
- Run the file `decision-tree.py`.
- Open the file `decision-tree.png`.
- Observe the output of the .png file.
- Read pages 176-178 in the ML book and answer the following:
  - What does "enron <= 0.5" mean in the root node?
  - What does "sample" mean?
  - What does the vector next to "value" mean?
  - What does class mean?
- What is the Gini index?
- What does a high/low Gini index mean for a feature?
- The Scikit-learn decision tree depends on the following parameters:
  - max_depth: the maximum height of the decision tree
  - min_samples_split: the minimum number of samples a node must have before it can be split
  - min_samples_leaf: the minimum number of samples a leaf node must have
  - max_leaf_nodes: the maximum number of leaf nodes
  - max_features: the maximum number of features that are evaluated for splitting at each node

  Increase the max_depth of the tree and observe the confusion matrix. Did your model improve its predictions? Explain why or why not.
- Would a higher value for max_depth increase or decrease the capacity of the model? Explain why or why not.
- How can we tell if the decision tree is overfitted? Briefly give an explanation.
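To experiment with `max_depth`, it can help to record the accuracy on the training split and on the test split side by side for several depths. The sketch below does this on data from scikit-learn's `make_classification`, which stands in for the email dataset. The depths tried and the dataset settings are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data with some label noise (flip_y).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = {}
for depth in (2, 5, None):  # None lets the tree grow without a depth limit
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    results[depth] = {
        "train_acc": clf.score(X_train, y_train),
        "test_acc": clf.score(X_test, y_test),
        "actual_depth": clf.get_depth(),
    }

for depth, row in results.items():
    print(depth, row)
```

Looking at how `train_acc` and `test_acc` move relative to each other as the depth grows is one input for the questions above.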
- Rewrite the `decision-tree.py` code but this time using random forests instead of decision trees.
- Compare the results of the decision tree and the random forests by computing their confusion matrices. Which one performs better? Can you improve the performance of the worse classifier by changing its parameters? If so, explain.
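A side-by-side comparison of the two classifiers might look like the sketch below. It uses synthetic data from `make_classification` and arbitrary parameter values (`max_depth=3`, `n_estimators=100`), so the numbers it prints say nothing about how the models behave on the email dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the email data.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    X_train, y_train
)

# Both matrices are computed on the same test split so they are comparable.
cm_tree = confusion_matrix(y_test, tree.predict(X_test), labels=[0, 1])
cm_forest = confusion_matrix(y_test, forest.predict(X_test), labels=[0, 1])

print("decision tree:\n", cm_tree)
print("random forest:\n", cm_forest)
```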