This project presents an in-depth analysis of a dataset related to hypothyroidism, focusing on anomaly detection using unsupervised learning techniques. The analysis includes data preprocessing, visualization, and the implementation of DBSCAN, Local Outlier Factor (LOF), and a combined method of both.
Hypothyroidism is a medical condition characterized by an underactive thyroid gland. Detecting anomalies in medical datasets is crucial for identifying unusual patterns that could indicate potential health issues. This study employs unsupervised learning techniques to detect anomalies in a hypothyroidism dataset.
The dataset contains 21 attributes (15 binary and 6 continuous) and 7,200 data objects. The attributes are anonymized and represented as `Dim_0`, `Dim_1`, etc.
- Python 3.6+
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- gower
Clone the repository and install the required packages:
git clone https://github.com/oktaykurt/Anomaly-Detection-with-DBSCAN-and-LOF.git
cd Anomaly-Detection-with-DBSCAN-and-LOF
pip install -r requirements.txt
- Ensure the dataset file (`Unsupervised_Learning_23-24_Project_Dataset.csv`) is in the project directory.
- Run the script to perform the analysis and generate visualizations.
- Loading the Dataset: The dataset is loaded from a CSV file with attributes separated by semicolons and decimals marked by commas.
- Dropping Unnecessary Columns: The last two columns, which are empty due to the way the CSV file is read, are dropped.
- Identifying and Converting Column Types: Binary columns are identified and converted to integer type, while continuous columns are converted to float type. The 'Row' column, serving as an index, is dropped.
- Handling Missing Values: The dataset is checked for missing values. No missing values were found.
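The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function name `load_and_clean`, the rule used to recognise binary columns (values limited to 0 and 1), and the tiny inline sample are all assumptions made for the example. Only the `Row` column name, the `Dim_*` naming, the semicolon separator, and the comma decimal mark come from the description.

```python
import io

import pandas as pd


def load_and_clean(source):
    """Load the semicolon-separated CSV and apply the cleaning steps listed above."""
    # Semicolon as field separator, comma as decimal mark.
    df = pd.read_csv(source, sep=";", decimal=",")

    # Drop the two trailing columns, which come out empty when the file is parsed.
    df = df.iloc[:, :-2]

    # 'Row' only serves as an index, so it is not kept as a feature.
    df = df.drop(columns=["Row"])

    # Assumption: a column counts as binary when its values are a subset of {0, 1}.
    binary_cols = [c for c in df.columns if set(df[c].dropna().unique()) <= {0, 1}]
    continuous_cols = [c for c in df.columns if c not in binary_cols]

    df[binary_cols] = df[binary_cols].astype(int)
    df[continuous_cols] = df[continuous_cols].astype(float)

    # Missing values per column, so the caller can confirm there are none.
    missing = df.isna().sum()
    return df, binary_cols, continuous_cols, missing


# Hypothetical three-row sample in the same layout (trailing ';;' yields two empty columns).
sample = io.StringIO(
    "Row;Dim_0;Dim_1;Dim_2;;\n"
    "1;0,75;1;0,0012;;\n"
    "2;0,23;0;0,0300;;\n"
    "3;0,50;1;0,0150;;\n"
)
df, binary_cols, continuous_cols, missing = load_and_clean(sample)
print(binary_cols, continuous_cols, int(missing.sum()))
```

For the real dataset, the file path would be passed in place of the `StringIO` object.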
- Histograms of Continuous Features: Visualize the distribution of continuous features.
- Box Plots of Continuous Features: Highlight the spread and potential outliers in continuous data.
- Bar Plots of Binary Features: Show the frequency of binary attributes.
- Correlation Matrix Heatmap: Visualize the correlation between features.
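The four exploratory plots above could be produced along these lines. This sketch uses plain matplotlib to stay dependency-light; the project lists seaborn, so its plotting functions may well be what the actual script uses. The function name `plot_overview` and the random demo data are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_overview(df, binary_cols, continuous_cols):
    """Draw the four exploratory plots and return them keyed by name."""
    figs = {}

    # Histograms: one panel per continuous feature.
    fig, axes = plt.subplots(1, len(continuous_cols), figsize=(4 * len(continuous_cols), 3), squeeze=False)
    for ax, col in zip(axes[0], continuous_cols):
        ax.hist(df[col], bins=30)
        ax.set_title(col)
    figs["hist"] = fig

    # Box plots: spread and potential outliers of the continuous features.
    fig, ax = plt.subplots()
    ax.boxplot(df[continuous_cols].to_numpy())
    ax.set_xticklabels(continuous_cols)
    figs["box"] = fig

    # Bar plot: how often each binary attribute equals 1.
    fig, ax = plt.subplots()
    ax.bar(binary_cols, df[binary_cols].sum().to_numpy())
    ax.set_ylabel("count of 1s")
    figs["bar"] = fig

    # Correlation heatmap over all features.
    corr = df.corr()
    fig, ax = plt.subplots()
    im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr)))
    ax.set_yticklabels(corr.columns)
    fig.colorbar(im, ax=ax)
    figs["corr"] = fig

    return figs


# Hypothetical demo data: two continuous and two binary columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Dim_0": rng.normal(size=100),
    "Dim_1": rng.normal(size=100),
    "Dim_2": rng.integers(0, 2, size=100),
    "Dim_3": rng.integers(0, 2, size=100),
})
figs = plot_overview(demo, ["Dim_2", "Dim_3"], ["Dim_0", "Dim_1"])
figs["corr"].savefig("correlation_heatmap.png")
```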
- Gower Distance: Used for handling mixed data types, combining distances of binary and continuous attributes.
- t-SNE: Dimensionality reduction technique for visualizing high-dimensional data.
- DBSCAN: Density-based clustering algorithm.
- LOF: Local Outlier Factor algorithm for anomaly detection.
- Combined DBSCAN and LOF: A method leveraging both DBSCAN and LOF to enhance anomaly detection.
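A rough sketch of how these pieces can fit together on a precomputed distance matrix is given below. Several things here are assumptions rather than the project's confirmed method: the Gower distance is hand-rolled in NumPy as a stand-in for the `gower` package listed in the requirements, all parameter values are placeholders, and the "combined" rule shown (flag a point only when both DBSCAN and LOF flag it) is one plausible reading, since the description does not say how the two results are merged.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE
from sklearn.neighbors import LocalOutlierFactor


def gower_distance(X_cont, X_bin):
    """Pairwise Gower distance: range-scaled differences plus binary mismatches."""
    X_cont = np.asarray(X_cont, dtype=float)
    X_bin = np.asarray(X_bin)
    n = X_cont.shape[0]
    total = np.zeros((n, n))
    # Continuous part: absolute difference divided by the feature range.
    ranges = X_cont.max(axis=0) - X_cont.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant features contribute zero distance
    for k in range(X_cont.shape[1]):
        col = X_cont[:, k]
        total += np.abs(col[:, None] - col[None, :]) / ranges[k]
    # Binary part: 0 when the values match, 1 when they differ.
    for k in range(X_bin.shape[1]):
        col = X_bin[:, k]
        total += (col[:, None] != col[None, :]).astype(float)
    # Average over all features so every distance lies in [0, 1].
    return total / (X_cont.shape[1] + X_bin.shape[1])


def embed_tsne(D, perplexity=30.0, seed=0):
    """2-D t-SNE embedding from a precomputed distance matrix (for plotting only)."""
    # scikit-learn requires init="random" when metric="precomputed".
    tsne = TSNE(n_components=2, metric="precomputed", init="random",
                perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(D)


def detect_anomalies(D, eps, min_samples, n_neighbors, contamination):
    """Return boolean outlier masks for DBSCAN, LOF, and their intersection."""
    # DBSCAN labels noise points as -1.
    db_labels = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed").fit_predict(D)
    dbscan_outliers = db_labels == -1
    # LOF's fit_predict returns -1 for outliers and 1 for inliers.
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination,
                             metric="precomputed")
    lof_outliers = lof.fit_predict(D) == -1
    # Assumption: the combined method keeps only points flagged by both.
    combined = dbscan_outliers & lof_outliers
    return dbscan_outliers, lof_outliers, combined


# Hypothetical demo data: 3 continuous and 4 binary features, 80 rows.
rng = np.random.default_rng(0)
X_cont = rng.normal(size=(80, 3))
X_bin = rng.integers(0, 2, size=(80, 4))

D = gower_distance(X_cont, X_bin)
dbscan_out, lof_out, combined = detect_anomalies(
    D, eps=0.2, min_samples=5, n_neighbors=20, contamination=0.05
)
embedding = embed_tsne(D, perplexity=20.0)
print(int(dbscan_out.sum()), int(lof_out.sum()), int(combined.sum()))
```

Passing one precomputed matrix to every step keeps the mixed-type distance consistent across clustering, outlier scoring, and the embedding. Note that a dense matrix for 7,200 rows holds about 52 million entries, so memory use is worth checking before running this on the full dataset.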
The analysis includes multiple figures to illustrate the findings:
- Histograms of Continuous Features
- Box Plots of Continuous Features
- Bar Plots of Binary Features
- Correlation Matrix Heatmap
- t-SNE of Gower Distance Matrix
- 3D Scatter Plot of eps, min_samples, and Silhouette Score
- DBSCAN Clustering (t-SNE)
- DBSCAN Clustering with Anomalies Highlighted (t-SNE)
- DBSCAN Pair Plot of Continuous Features Highlighting Anomalies
- Histogram of Anomaly Counts (LOF)
- LOF Clustering with Anomalies Highlighted
- LOF Pair Plot of Continuous Features Highlighting Anomalies
- LOF Values
- Comparison of Outliers Detected by DBSCAN and LOF
- Number of Anomalies Detected by Each Technique
- Combined DBSCAN and LOF Clustering with Anomalies Highlighted
- Combined DBSCAN and LOF Pair Plot of Continuous Features Highlighting Anomalies
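The 3D scatter plot of eps, min_samples, and silhouette score in the list above suggests that DBSCAN's parameters were explored over a grid. A minimal sketch of such a search follows, under stated assumptions: the function name `dbscan_grid_search`, the candidate values, and the choice to leave noise points out of the silhouette computation are all illustrative, and the demo uses Euclidean distances on synthetic blobs rather than the project's Gower matrix.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score


def dbscan_grid_search(D, eps_values, min_samples_values):
    """Score each (eps, min_samples) pair by silhouette on a precomputed distance matrix."""
    results = []
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples,
                            metric="precomputed").fit_predict(D)
            # Assumption: noise points (label -1) are excluded from the score.
            clustered = labels != -1
            n_clusters = len(set(labels[clustered]))
            # Silhouette needs at least two clusters and more points than clusters.
            if n_clusters < 2 or clustered.sum() <= n_clusters:
                continue
            score = silhouette_score(D[np.ix_(clustered, clustered)],
                                     labels[clustered], metric="precomputed")
            results.append((eps, min_samples, score))
    return results


# Hypothetical demo: two well-separated 2-D blobs with Euclidean distances.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(30, 2)),
])
D_demo = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

grid = dbscan_grid_search(D_demo, eps_values=[0.5, 1.0, 2.0], min_samples_values=[3, 5])
best_eps, best_min_samples, best_score = max(grid, key=lambda r: r[2])
print(best_eps, best_min_samples, round(best_score, 3))
```

Each tuple in the returned list corresponds to one point of the 3D scatter plot. Silhouette score favours compact, well-separated clusters, so the highest-scoring setting is not automatically the best one for anomaly detection and is worth checking against the plots.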
Both DBSCAN and LOF detect anomalies in the hypothyroidism dataset, and the combined approach draws on the strengths of both methods. Judged by visual inspection of the plots, however, LOF separates anomalies from normal data points more clearly than DBSCAN or the combined approach.
This study successfully applies DBSCAN, LOF, and a combined approach for anomaly detection in a hypothyroidism dataset. Visualizations and performance metrics indicate that these methods are effective in identifying unusual patterns. Based on visual interpretation, LOF was selected as the preferred method for anomaly detection. Future work could involve exploring other clustering algorithms and improving parameter optimization techniques.
- F. Pedregosa et al., "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
- L. Van der Maaten and G. Hinton, "Visualizing Data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.
- J. C. Gower, "A general coefficient of similarity and some of its properties," Biometrics, vol. 27, pp. 857-874, 1971.
This project is licensed under the MIT License - see the LICENSE file for details.
For more detailed information, you can read the full paper here.