Stinsight

A Machine Learning project implemented for the Artificial Intelligence Society DataSci '17 competition. The UCI Student Performance Data Set was used to effectively predict if a student is expected to fail the course or not, using school reports and questionnaires. Different classification models were compared. The best model, Logistic Regression achieved about 70% accuracy.

Prerequisites

This project requires Python 3.4 and the following Python libraries installed:

Install

On Unix systems, the above libraries can be installed using these (or corresponding ones for your distribution) commands,

sudo apt-get install build-essential python3-dev python3-setuptools python3-numpy python3-scipy libatlas-dev libatlas3gf-base

sudo apt-get install python3-pip

sudo -H pip3 install -U scikit-learn

sudo -H pip3 install pandas

Observations and Evaluating Criterions

1. In which class of problems, does this qualify?

This is a classification problem. The reason for terming it as a classification problem being, we are asked to identify students who might end up failing the final exam.

Thus, we need to identify such students and intervene before it's too late. Or in other words, we have to classify the whole group of students into two sections - Ones who are expected to fail the course and others who are expected to pass the course.

2. Observations from Given Data

Total number of students: 395
Total number of features: 30
Target: 1 ("passed")
Number of students who passed: 265
Number of students who failed: 130
Number of columns with numeral values: 13

3. What are the features and Relevant Target?

Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason',
 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

4. Which is the best model to use?


Support Vector Machine - Linear 

Total execution time: 23.7 milliseconds
Accuracy is 69.62%


Support Vector Machine - Polynomial 

Total execution time: 478.0 milliseconds
Accuracy is 63.92%


Support Vector Machine - Radial Basis Function 

Total execution time: 8.7 milliseconds
Accuracy is 66.46%


Logistic Regression 

Total execution time: 2.7 milliseconds
Accuracy is 69.62%


Decision Tree 

Total execution time: 1.7 milliseconds
Accuracy is 56.33%


Random Forest 

Total execution time: 201.7 milliseconds
Accuracy is 69.62%


Random Forest (Optimised) 

Total execution time: 47.6 milliseconds
Accuracy is 63.29%

Based on the statistics obtained, Logistic Regression provides the best performance and also in the least time.

The other models didn't perform well in comparison with Logistic regression. Though Logistic Regression appears to be a clear winner in both measures used to decide the best model i.e. performance and training and testing time, I would've still chosen to trade off the training time for the higher performance. The reason being that failing a course in school can have some adverse effect on the mental and psychological health of the student, making accuracy a dominant factor in comparing different models.

Thus, as the predictions need to be very accurate, we would have to value the accuracy more as compared to training and testing time.

Appendix

Attributes for Data Set along with the DataSet itself can be found at UCI ML Repository

Author

Pranav Dixit - Github - Linkedin

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments

Paulo Cortez - For making the Data Set public by donating it to the UCI ML Repository.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
LICENSE.md		LICENSE.md
README.md		README.md
stinsight.py		stinsight.py
student_data.csv		student_data.csv
student_intervention.jpg		student_intervention.jpg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Stinsight

Prerequisites

Install

Observations and Evaluating Criterions

1. In which class of problems, does this qualify?

2. Observations from Given Data

3. What are the features and Relevant Target?

4. Which is the best model to use?

Appendix

Author

License

Acknowledgments

About

Releases

Packages

Languages

License

prnvdixit/Stinsight

Folders and files

Latest commit

History

Repository files navigation

Stinsight

Prerequisites

Install

Observations and Evaluating Criterions

1. In which class of problems, does this qualify?

2. Observations from Given Data

3. What are the features and Relevant Target?

4. Which is the best model to use?

Appendix

Author

License

Acknowledgments

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages