Data Mining Playground

This repository contains five mini projects covering several main topics in Data Mining. Below you can find the list of projects:

Iris Analysis

The aim of this project is to apply several preprocessing techniques that demonstrate the importance of understanding, cleaning, and adjusting a raw dataset. The facets considered include:

The importance of missing values
Non-numerical data
Normalization
PCA
Plotting

About the dataset

The Iris flower dataset is a multivariate dataset containing the petal and sepal lengths and widths of three types of iris (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray.

The scikit-learn and pandas libraries were used to apply these techniques.
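The preprocessing facets listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the missing values are injected artificially (the real dataset has none), and the choice of mean imputation, StandardScaler, and two principal components is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Load the 150x4 feature matrix as a DataFrame, plus the integer class labels.
iris = load_iris(as_frame=True)
features = iris.data.copy()
labels = iris.target.to_numpy()

# Assumption: blank out a few cells to simulate missing values.
rng = np.random.default_rng(0)
rows = rng.choice(len(features), size=10, replace=False)
features.iloc[rows, 0] = np.nan

# Fill the gaps with the column mean, then standardize every column.
imputed = SimpleImputer(strategy="mean").fit_transform(features)
scaled = StandardScaler().fit_transform(imputed)

# Project the four standardized features onto two principal components.
pca = PCA(n_components=2)
projected = pca.fit_transform(scaled)

# Non-numerical data: turn species names into one-hot columns.
species = pd.Series(iris.target_names[labels], name="species")
one_hot = pd.get_dummies(species)

print(projected.shape, pca.explained_variance_ratio_.round(3))
```

The two-column `projected` array is convenient for a 2D scatter plot colored by `labels`.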

Classification

In this project, the goal is to classify data using neural networks (NNs) with different architectures. The data consists of a large circle containing a smaller circle in 2D, generated with make_circles from the scikit-learn library. The process of finding the best NN architecture is as follows:

A NN without activation functions in its layers
A NN with linear activation functions
Employing a proper Mean Squared Error loss in the network
A single-hidden-layer NN with 16 nodes
Finding an adequate value for the learning rate (lr = 0.01) by testing various values
Designing a sufficient NN to get the best result
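The effect of the activation function in the steps above can be illustrated with a short sketch. Note that it swaps in scikit-learn's MLPClassifier instead of the TensorFlow models used in the project, purely to keep the example small; the noise, factor, and split values are assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Two concentric circles in 2D; noise and factor are assumed values.
X, y = make_circles(n_samples=1000, noise=0.03, factor=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def fit_and_score(activation):
    """Train a single-hidden-layer network (16 nodes) and return test accuracy."""
    net = MLPClassifier(
        hidden_layer_sizes=(16,),
        activation=activation,
        learning_rate_init=0.01,
        max_iter=2000,
        random_state=42,
    )
    net.fit(X_train, y_train)
    return net.score(X_test, y_test)

# "identity" keeps the whole network linear; "relu" adds a non-linearity.
linear_acc = fit_and_score("identity")
relu_acc = fit_and_score("relu")
print(f"identity: {linear_acc:.2f}  relu: {relu_acc:.2f}")
```

A network with only linear activations collapses to a linear classifier, which cannot separate one circle from the other, so its accuracy stays near chance while the ReLU network can fit the boundary.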

In the next step, the fashion_mnist dataset was loaded from TensorFlow, the library also used to implement the NNs. An NN was trained on this dataset, and the results were presented as a confusion matrix.
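As a rough sketch of how a confusion matrix summarizes predictions, the snippet below uses scikit-learn's confusion_matrix on a handful of hypothetical labels. The labels are invented for illustration and cover 3 classes, whereas fashion_mnist has 10.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Per-class recall: diagonal counts divided by the row totals.
recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(recall)
```

Off-diagonal cells show which classes are mistaken for each other, which is the main thing to look for in the fashion_mnist results.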

Association Rules

This mini-project implements association rule extraction. The process relies on three main metrics:

Support: shows the popularity of an item according to the number of times it appears in transactions
Confidence: shows the probability of buying item y if item x is bought (x -> y)
Lift

The last one is calculated by combining the first two through the equation below:

lift(x -> y) = confidence(x -> y) / support(y)

This project employs the Apriori algorithm to extract association rules, using the mlxtend library.
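To make the three metrics concrete, here is a minimal pure-Python sketch that computes them directly from a list of transactions. It does not use mlxtend, and the transactions and helper names are hypothetical, invented for illustration.

```python
# Hypothetical market-basket transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """Estimated probability of y given x: support(x and y) / support(x)."""
    return support(set(x) | set(y)) / support(x)

def lift(x, y):
    """Confidence of x -> y relative to the baseline popularity of y."""
    return confidence(x, y) / support(y)

print(support({"bread"}), confidence({"bread"}, {"milk"}), lift({"bread"}, {"milk"}))
```

In this toy data the lift of bread -> milk comes out slightly below 1, meaning that buying bread does not make milk more likely than its baseline popularity. A lift above 1 would indicate a positive association.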

Clustering

This project aims to cluster the make_blobs dataset using the K-means algorithm, with the elbow technique used to find the best value for K. It also contains some more complex clustering examples with illustrations. After that, the DBSCAN algorithm is implemented, together with estimating values for epsilon (using KNN distances) and MinPts.
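The two parameter-selection ideas mentioned above can be sketched as follows. This is an illustration under assumptions: the blob parameters, the MinPts value, and the use of a percentile in place of reading the knee off a plot are all choices made for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Assumed data: 4 blobs with a small spread.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=0)

# Elbow technique: record the within-cluster sum of squares (inertia) per K.
inertias = {}
for k in range(1, 9):
    inertias[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

# k-distance heuristic for DBSCAN. Querying the training points returns each
# point as its own first neighbor, matching min_samples, which also counts
# the point itself.
min_pts = 5  # assumed value
distances, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Assumption: a high percentile stands in for the knee of the sorted curve.
eps = float(np.percentile(k_distances, 95))
labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(inertias, eps, n_clusters)
```

Plotting `inertias` against K shows the elbow, and plotting `k_distances` shows the knee from which epsilon is usually read.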

Diabetes Classifier

In this project, a Diabetes Classifier was developed to predict whether a new given case is diabetic. The dataset used consists of more than 70,000 records of patients who filled out a questionnaire designed by the Centers for Disease Control and Prevention (CDC). It has the 22 columns listed below:

Diabetes_binary: The target column that determines whether a person has diabetes or pre-diabetes
HighBP
High Cholesterol
Cholesterol Check
BMI
Smoker
Stroke
HeartDiseaseorAttack
Physical Activity
Fruits
Veggies
Heavy Alcohol Consumption
Any Health Care
No Doctor because of Cost
General Health
Mental Health
Physical Health
Difficulty Walking
Sex
Age
Education
Income

XGBoost (Extreme Gradient Boosting) was used to implement the classifier. Before that, the following preprocessing steps were applied:

Handle Missing Values
- Impute missing continuous values with Mean
- Impute missing categorical values with the most frequent category
Replace white spaces in columns' names with '_'
Normalizing/Scaling
One-hot-encoding
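The preprocessing steps above might look roughly like this in pandas. The toy frame is hypothetical (its column names mimic the dataset but its values are invented), and min-max scaling is just one possible choice for the normalization step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
df = pd.DataFrame({
    "BMI": [22.0, np.nan, 31.0, 27.0],
    "General Health": ["good", "poor", None, "good"],
    "Diabetes_binary": [0, 1, 1, 0],
})

# Replace white space in column names with underscores.
df.columns = df.columns.str.replace(" ", "_", regex=False)

# Impute the continuous column with its mean and the categorical one with its mode.
df["BMI"] = df["BMI"].fillna(df["BMI"].mean())
df["General_Health"] = df["General_Health"].fillna(df["General_Health"].mode()[0])

# Min-max scale the continuous column into [0, 1].
bmi = df["BMI"]
df["BMI"] = (bmi - bmi.min()) / (bmi.max() - bmi.min())

# One-hot encode the categorical column.
df = pd.get_dummies(df, columns=["General_Health"])
print(df)
```

On real data, the imputation and scaling statistics would normally be computed on the training split only and then reused on the test split.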

In the next step, we design and train an XGBoost classifier with the following configuration:

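from xgboost import XGBClassifier

model = XGBClassifier(
    learning_rate=0.1,         # shrinkage applied to each new tree
    max_depth=4,               # maximum depth of each tree
    n_estimators=200,          # upper bound on the number of boosting rounds
    subsample=0.5,             # fraction of rows sampled per tree
    colsample_bytree=1,        # fraction of columns sampled per tree
    random_state=123,          # seed for reproducibility
    eval_metric='auc',
    verbosity=1,
    tree_method='gpu_hist',    # GPU histogram algorithm; XGBoost 2.0+ uses tree_method='hist' with device='cuda' instead
    early_stopping_rounds=10)  # requires an eval_set to be passed to fit()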

Finally, to obtain the best combination of hyperparameters, we run GridSearchCV over the following candidate values for each parameter:

grid_params = {
    'learning_rate': [0.02, 0.05, 0.1, 0.3],
    'max_depth': [2, 3, 4],
    'n_estimators': [100, 200, 300],
    'colsample_bytree': [0.8, 1]
}
