From 15d4584b5b47aaaac0dfd2028c761c38158bbae1 Mon Sep 17 00:00:00 2001 From: VARUNSHIYAM <138989960+Varunshiyam@users.noreply.github.com> Date: Sun, 10 Nov 2024 16:31:15 +0530 Subject: [PATCH 1/2] fixes 873 --- ...idney-disease-prediction-98-accuracy.ipynb | 2209 +++++++++++++++++ 1 file changed, 2209 insertions(+) create mode 100644 Prediction Models/Chronic_Kidney_Disease_prediction/chronic-kidney-disease-prediction-98-accuracy.ipynb diff --git a/Prediction Models/Chronic_Kidney_Disease_prediction/chronic-kidney-disease-prediction-98-accuracy.ipynb b/Prediction Models/Chronic_Kidney_Disease_prediction/chronic-kidney-disease-prediction-98-accuracy.ipynb new file mode 100644 index 00000000..dbe0fdcf --- /dev/null +++ b/Prediction Models/Chronic_Kidney_Disease_prediction/chronic-kidney-disease-prediction-98-accuracy.ipynb @@ -0,0 +1,2209 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "614aa99b", + "metadata": { + "papermill": { + "duration": 0.073943, + "end_time": "2021-08-03T10:26:45.709931", + "exception": false, + "start_time": "2021-08-03T10:26:45.635988", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "

Chronic Kidney Disease Prediction

" + ] + }, + { + "cell_type": "markdown", + "id": "d4835a5e", + "metadata": { + "papermill": { + "duration": 0.071935, + "end_time": "2021-08-03T10:26:45.851968", + "exception": false, + "start_time": "2021-08-03T10:26:45.780033", + "status": "completed" + }, + "tags": [] + }, + "source": [ + " " + ] + }, + { + "cell_type": "markdown", + "id": "641a7e99", + "metadata": { + "papermill": { + "duration": 0.069571, + "end_time": "2021-08-03T10:26:45.993857", + "exception": false, + "start_time": "2021-08-03T10:26:45.924286", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

Table of Contents

\n", + "\n", + "* [EDA](#2.0)\n", + "* [Data Pre Processing](#3.0)\n", + "* [Feature Encoding](#4.0)\n", + "* [Model Building](#5.0)\n", + " * [Knn](#5.1)\n", + " * [Decision Tree Classifier](#5.2)\n", + " * [Random Forest Classifier](#5.3)\n", + " * [Ada Boost Classifier](#5.4)\n", + " * [Gradient Boosting Classifier](#5.5)\n", + " * [Stochastic Gradient Boosting (SGB)](#5.6)\n", + " * [XgBoost](#5.7)\n", + " * [Cat Boost Classifier](#5.8)\n", + " * [Extra Trees Classifier](#5.9)\n", + " * [LGBM Classifier](#5.10)\n", + "\n", + "* [Models Comparison](#6.0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "89764c70", + "metadata": { + "papermill": { + "duration": 2.342669, + "end_time": "2021-08-03T10:26:48.406881", + "exception": false, + "start_time": "2021-08-03T10:26:46.064212", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# necessary imports \n", + "\n", + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "import plotly.express as px\n", + "\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "\n", + "plt.style.use('fivethirtyeight')\n", + "%matplotlib inline\n", + "pd.set_option('display.max_columns', 26)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f101956a", + "metadata": { + "papermill": { + "duration": 0.134709, + "end_time": "2021-08-03T10:26:48.611821", + "exception": false, + "start_time": "2021-08-03T10:26:48.477112", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# loading data\n", + "\n", + "df= pd.read_csv('../input/ckdisease/kidney_disease.csv')\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c7a54143", + "metadata": { + "papermill": { + "duration": 0.078791, + "end_time": "2021-08-03T10:26:48.762575", + "exception": false, + "start_time": "2021-08-03T10:26:48.683784", + "status": "completed" + }, + 
"tags": [] + }, + "outputs": [], + "source": [ + "df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a2142b9e", + "metadata": { + "papermill": { + "duration": 0.081484, + "end_time": "2021-08-03T10:26:48.915444", + "exception": false, + "start_time": "2021-08-03T10:26:48.833960", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# dropping id column\n", + "df.drop('id', axis = 1, inplace = True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "07a84f5d", + "metadata": { + "papermill": { + "duration": 0.081719, + "end_time": "2021-08-03T10:26:49.068594", + "exception": false, + "start_time": "2021-08-03T10:26:48.986875", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# rename column names to make it more user-friendly\n", + "\n", + "df.columns = ['age', 'blood_pressure', 'specific_gravity', 'albumin', 'sugar', 'red_blood_cells', 'pus_cell',\n", + " 'pus_cell_clumps', 'bacteria', 'blood_glucose_random', 'blood_urea', 'serum_creatinine', 'sodium',\n", + " 'potassium', 'haemoglobin', 'packed_cell_volume', 'white_blood_cell_count', 'red_blood_cell_count',\n", + " 'hypertension', 'diabetes_mellitus', 'coronary_artery_disease', 'appetite', 'peda_edema',\n", + " 'aanemia', 'class']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "68d8ee8b", + "metadata": { + "papermill": { + "duration": 0.10622, + "end_time": "2021-08-03T10:26:49.245870", + "exception": false, + "start_time": "2021-08-03T10:26:49.139650", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "82fca848", + "metadata": { + "papermill": { + "duration": 0.118516, + "end_time": "2021-08-03T10:26:49.435620", + "exception": false, + "start_time": "2021-08-03T10:26:49.317104", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + 
"df.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "76413f53", + "metadata": { + "papermill": { + "duration": 0.094965, + "end_time": "2021-08-03T10:26:49.603561", + "exception": false, + "start_time": "2021-08-03T10:26:49.508596", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "df.info()" + ] + }, + { + "cell_type": "markdown", + "id": "369d68b7", + "metadata": { + "papermill": { + "duration": 0.072233, + "end_time": "2021-08-03T10:26:49.748407", + "exception": false, + "start_time": "2021-08-03T10:26:49.676174", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "

As we can see, 'packed_cell_volume', 'white_blood_cell_count' and 'red_blood_cell_count' are of object dtype. We need to convert them to a numerical dtype.
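
A minimal sketch of what that conversion does, using a made-up toy Series rather than the real columns: with errors='coerce', any entry that cannot be parsed as a number becomes NaN instead of raising an error, so it shows up later as a missing value.

```python
import pandas as pd

# Toy values (hypothetical): numbers stored as strings, plus a stray '\t?' and a gap.
raw = pd.Series(['44', '5.2', '\t?', None])

# errors='coerce' turns anything unparseable into NaN instead of raising.
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)         # float64
print(converted.isna().sum())  # 2
```

The coerced NaN values add to the missing-value count, so it is worth re-checking df.info() after the conversion.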

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8f28e32f", + "metadata": { + "papermill": { + "duration": 0.083074, + "end_time": "2021-08-03T10:26:49.904522", + "exception": false, + "start_time": "2021-08-03T10:26:49.821448", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# converting necessary columns to numerical type\n", + "\n", + "df['packed_cell_volume'] = pd.to_numeric(df['packed_cell_volume'], errors='coerce')\n", + "df['white_blood_cell_count'] = pd.to_numeric(df['white_blood_cell_count'], errors='coerce')\n", + "df['red_blood_cell_count'] = pd.to_numeric(df['red_blood_cell_count'], errors='coerce')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "61b30220", + "metadata": { + "papermill": { + "duration": 0.092811, + "end_time": "2021-08-03T10:26:50.070111", + "exception": false, + "start_time": "2021-08-03T10:26:49.977300", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "df.info()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c419242f", + "metadata": { + "papermill": { + "duration": 0.081228, + "end_time": "2021-08-03T10:26:50.225955", + "exception": false, + "start_time": "2021-08-03T10:26:50.144727", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# Extracting categorical and numerical columns\n", + "\n", + "cat_cols = [col for col in df.columns if df[col].dtype == 'object']\n", + "num_cols = [col for col in df.columns if df[col].dtype != 'object']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f754b17f", + "metadata": { + "papermill": { + "duration": 0.083961, + "end_time": "2021-08-03T10:26:50.382941", + "exception": false, + "start_time": "2021-08-03T10:26:50.298980", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# looking at unique values in categorical columns\n", + "\n", + "for col in cat_cols:\n", + " 
print(f\"{col} has {df[col].unique()} values\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "74a090a0", + "metadata": { + "papermill": { + "duration": 0.073034, + "end_time": "2021-08-03T10:26:50.529711", + "exception": false, + "start_time": "2021-08-03T10:26:50.456677", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "

There is some ambiguity in these columns, such as stray tabs and spaces in a few values, which we have to remove.
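
One possible way to clean such values, shown on a made-up toy column: stripping whitespace is an alternative to listing every bad value by hand, not necessarily what this notebook does.

```python
import pandas as pd

# Toy column (hypothetical) with the kind of debris described above.
col = pd.Series(['yes', 'no', '\tno', '\tyes', ' yes', None])

# .str.strip() removes leading/trailing whitespace, including tabs;
# missing values stay missing.
cleaned = col.str.strip()

print(sorted(cleaned.dropna().unique()))  # ['no', 'yes']
```

An explicit replace mapping is stricter, because it only touches values you have already inspected, while strip() also catches whitespace variants you did not anticipate.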

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9685ff4a", + "metadata": { + "papermill": { + "duration": 0.084758, + "end_time": "2021-08-03T10:26:50.689885", + "exception": false, + "start_time": "2021-08-03T10:26:50.605127", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# replace incorrect values\n", + "\n", + "df['diabetes_mellitus'].replace(to_replace = {'\\tno':'no','\\tyes':'yes',' yes':'yes'},inplace=True)\n", + "\n", + "df['coronary_artery_disease'] = df['coronary_artery_disease'].replace(to_replace = '\\tno', value='no')\n", + "\n", + "df['class'] = df['class'].replace(to_replace = {'ckd\\t': 'ckd', 'notckd': 'not ckd'})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "223ae9c3", + "metadata": { + "papermill": { + "duration": 0.085716, + "end_time": "2021-08-03T10:26:50.849530", + "exception": false, + "start_time": "2021-08-03T10:26:50.763814", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "df['class'] = df['class'].map({'ckd': 0, 'not ckd': 1})\n", + "df['class'] = pd.to_numeric(df['class'], errors='coerce')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "16e7f4b5", + "metadata": { + "papermill": { + "duration": 0.084823, + "end_time": "2021-08-03T10:26:51.008781", + "exception": false, + "start_time": "2021-08-03T10:26:50.923958", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "cols = ['diabetes_mellitus', 'coronary_artery_disease', 'class']\n", + "\n", + "for col in cols:\n", + " print(f\"{col} has {df[col].unique()} values\\n\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8730f87e", + "metadata": { + "papermill": { + "duration": 3.614873, + "end_time": "2021-08-03T10:26:54.697367", + "exception": false, + "start_time": "2021-08-03T10:26:51.082494", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# checking numerical 
features distribution\n", + "\n", + "plt.figure(figsize = (20, 15))\n", + "plotnumber = 1\n", + "\n", + "for column in num_cols:\n", + " if plotnumber <= 14:\n", + " ax = plt.subplot(3, 5, plotnumber)\n", + " sns.distplot(df[column])\n", + " plt.xlabel(column)\n", + " \n", + " plotnumber += 1\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "6b09bcfe", + "metadata": { + "papermill": { + "duration": 0.08436, + "end_time": "2021-08-03T10:26:54.859924", + "exception": false, + "start_time": "2021-08-03T10:26:54.775564", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "

Skewness is present in some of the columns.
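
The visual impression can be backed by a number. A small sketch on made-up toy data, assuming pandas' built-in skew(): values near 0 suggest a roughly symmetric column, and clearly positive values suggest a long right tail.

```python
import pandas as pd

# Toy data (hypothetical): one symmetric column, one with a long right tail.
toy = pd.DataFrame({
    'symmetric': [1, 2, 3, 4, 5, 6, 7],
    'right_tailed': [1, 1, 1, 2, 2, 3, 40],
})

# Sample skewness per numeric column.
skewness = toy.skew()
print(skewness)
```

On the real data the same call would be df[num_cols].skew(), where num_cols is the list of numerical columns built earlier.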

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a2f52c8b", + "metadata": { + "papermill": { + "duration": 1.323488, + "end_time": "2021-08-03T10:26:56.263063", + "exception": false, + "start_time": "2021-08-03T10:26:54.939575", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# looking at categorical columns\n", + "\n", + "plt.figure(figsize = (20, 15))\n", + "plotnumber = 1\n", + "\n", + "for column in cat_cols:\n", + " if plotnumber <= 11:\n", + " ax = plt.subplot(3, 4, plotnumber)\n", + " sns.countplot(df[column], palette = 'rocket')\n", + " plt.xlabel(column)\n", + " \n", + " plotnumber += 1\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "466544be", + "metadata": { + "papermill": { + "duration": 1.360798, + "end_time": "2021-08-03T10:26:57.704357", + "exception": false, + "start_time": "2021-08-03T10:26:56.343559", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# heatmap of data\n", + "\n", + "plt.figure(figsize = (15, 8))\n", + "\n", + "sns.heatmap(df.corr(), annot = True, linewidths = 2, linecolor = 'lightgrey')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc853a8e", + "metadata": { + "papermill": { + "duration": 0.108845, + "end_time": "2021-08-03T10:26:57.898066", + "exception": false, + "start_time": "2021-08-03T10:26:57.789221", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "df.columns" + ] + }, + { + "cell_type": "markdown", + "id": "be705e54", + "metadata": { + "papermill": { + "duration": 0.086169, + "end_time": "2021-08-03T10:26:58.076327", + "exception": false, + "start_time": "2021-08-03T10:26:57.990158", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

Exploratory Data Analysis (EDA)

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "967d82e4", + "metadata": { + "papermill": { + "duration": 0.094623, + "end_time": "2021-08-03T10:26:58.256821", + "exception": false, + "start_time": "2021-08-03T10:26:58.162198", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# defining functions to create plot\n", + "\n", + "def violin(col):\n", + " fig = px.violin(df, y=col, x=\"class\", color=\"class\", box=True, template = 'plotly_dark')\n", + " return fig.show()\n", + "\n", + "def kde(col):\n", + " grid = sns.FacetGrid(df, hue=\"class\", height = 6, aspect=2)\n", + " grid.map(sns.kdeplot, col)\n", + " grid.add_legend()\n", + " \n", + "def scatter(col1, col2):\n", + " fig = px.scatter(df, x=col1, y=col2, color=\"class\", template = 'plotly_dark')\n", + " return fig.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec62d7f3", + "metadata": { + "papermill": { + "duration": 1.23421, + "end_time": "2021-08-03T10:26:59.576202", + "exception": false, + "start_time": "2021-08-03T10:26:58.341992", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "violin('red_blood_cell_count')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d74513d0", + "metadata": { + "papermill": { + "duration": 0.696727, + "end_time": "2021-08-03T10:27:00.359251", + "exception": false, + "start_time": "2021-08-03T10:26:59.662524", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "kde('red_blood_cell_count')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9d6e30bd", + "metadata": { + "papermill": { + "duration": 0.161446, + "end_time": "2021-08-03T10:27:00.610032", + "exception": false, + "start_time": "2021-08-03T10:27:00.448586", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "violin('white_blood_cell_count')" + ] + }, + { + "cell_type": "code", + "execution_count": 
null, + "id": "65a86825", + "metadata": { + "papermill": { + "duration": 0.499478, + "end_time": "2021-08-03T10:27:01.198653", + "exception": false, + "start_time": "2021-08-03T10:27:00.699175", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "kde('white_blood_cell_count')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "08a6f9d4", + "metadata": { + "papermill": { + "duration": 0.16341, + "end_time": "2021-08-03T10:27:01.452634", + "exception": false, + "start_time": "2021-08-03T10:27:01.289224", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "violin('packed_cell_volume')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7d102207", + "metadata": { + "papermill": { + "duration": 0.493359, + "end_time": "2021-08-03T10:27:02.037553", + "exception": false, + "start_time": "2021-08-03T10:27:01.544194", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "kde('packed_cell_volume')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d389ceb", + "metadata": { + "papermill": { + "duration": 0.164039, + "end_time": "2021-08-03T10:27:02.294343", + "exception": false, + "start_time": "2021-08-03T10:27:02.130304", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "violin('haemoglobin')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce56153b", + "metadata": { + "papermill": { + "duration": 0.496789, + "end_time": "2021-08-03T10:27:02.884896", + "exception": false, + "start_time": "2021-08-03T10:27:02.388107", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "kde('haemoglobin')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a3484ae", + "metadata": { + "papermill": { + "duration": 0.167185, + "end_time": "2021-08-03T10:27:03.145179", + "exception": false, + "start_time": "2021-08-03T10:27:02.977994", + 
"status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "violin('albumin')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0d2760cf", + "metadata": { + "papermill": { + "duration": 0.474798, + "end_time": "2021-08-03T10:27:03.714180", + "exception": false, + "start_time": "2021-08-03T10:27:03.239382", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "kde('albumin')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "85c11e76", + "metadata": { + "papermill": { + "duration": 0.164613, + "end_time": "2021-08-03T10:27:03.974212", + "exception": false, + "start_time": "2021-08-03T10:27:03.809599", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "violin('blood_glucose_random')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bfd77dfe", + "metadata": { + "papermill": { + "duration": 0.504677, + "end_time": "2021-08-03T10:27:04.576271", + "exception": false, + "start_time": "2021-08-03T10:27:04.071594", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "kde('blood_glucose_random')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "349f1b89", + "metadata": { + "papermill": { + "duration": 0.175514, + "end_time": "2021-08-03T10:27:04.851924", + "exception": false, + "start_time": "2021-08-03T10:27:04.676410", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "violin('sodium')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c1e25cf6", + "metadata": { + "papermill": { + "duration": 0.499284, + "end_time": "2021-08-03T10:27:05.450739", + "exception": false, + "start_time": "2021-08-03T10:27:04.951455", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "kde('sodium')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bf0b05d1", + "metadata": { + "papermill": { + 
"duration": 0.174906, + "end_time": "2021-08-03T10:27:05.728795", + "exception": false, + "start_time": "2021-08-03T10:27:05.553889", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "violin('blood_urea')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "38bd2516", + "metadata": { + "papermill": { + "duration": 0.468025, + "end_time": "2021-08-03T10:27:06.299500", + "exception": false, + "start_time": "2021-08-03T10:27:05.831475", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "kde('blood_urea')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "73c8dcdf", + "metadata": { + "papermill": { + "duration": 0.174498, + "end_time": "2021-08-03T10:27:06.579223", + "exception": false, + "start_time": "2021-08-03T10:27:06.404725", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "violin('specific_gravity')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d503e187", + "metadata": { + "papermill": { + "duration": 0.49488, + "end_time": "2021-08-03T10:27:07.180112", + "exception": false, + "start_time": "2021-08-03T10:27:06.685232", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "kde('specific_gravity')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b9f5423c", + "metadata": { + "papermill": { + "duration": 0.204755, + "end_time": "2021-08-03T10:27:07.490266", + "exception": false, + "start_time": "2021-08-03T10:27:07.285511", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "scatter('haemoglobin', 'packed_cell_volume')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4f8fb5f8", + "metadata": { + "papermill": { + "duration": 0.177467, + "end_time": "2021-08-03T10:27:07.775294", + "exception": false, + "start_time": "2021-08-03T10:27:07.597827", + "status": "completed" + }, + "tags": [] + }, + "outputs": 
[], + "source": [ + "scatter('red_blood_cell_count', 'packed_cell_volume')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5caf7bb6", + "metadata": { + "papermill": { + "duration": 0.178486, + "end_time": "2021-08-03T10:27:08.063116", + "exception": false, + "start_time": "2021-08-03T10:27:07.884630", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "scatter('red_blood_cell_count', 'albumin')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "15e1ad2c", + "metadata": { + "papermill": { + "duration": 0.181918, + "end_time": "2021-08-03T10:27:08.403504", + "exception": false, + "start_time": "2021-08-03T10:27:08.221586", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "scatter('sugar', 'blood_glucose_random')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ff21213", + "metadata": { + "papermill": { + "duration": 0.181601, + "end_time": "2021-08-03T10:27:08.695981", + "exception": false, + "start_time": "2021-08-03T10:27:08.514380", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "scatter('packed_cell_volume','blood_urea')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e59f133", + "metadata": { + "papermill": { + "duration": 0.216764, + "end_time": "2021-08-03T10:27:09.021491", + "exception": false, + "start_time": "2021-08-03T10:27:08.804727", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "px.bar(df, x=\"specific_gravity\", y=\"packed_cell_volume\", color='class', barmode='group', template = 'plotly_dark', height = 400)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e3428ac1", + "metadata": { + "papermill": { + "duration": 0.1848, + "end_time": "2021-08-03T10:27:09.316691", + "exception": false, + "start_time": "2021-08-03T10:27:09.131891", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ 
+ "px.bar(df, x=\"specific_gravity\", y=\"albumin\", color='class', barmode='group', template = 'plotly_dark', height = 400)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d131cd7d", + "metadata": { + "papermill": { + "duration": 0.184964, + "end_time": "2021-08-03T10:27:09.612069", + "exception": false, + "start_time": "2021-08-03T10:27:09.427105", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "px.bar(df, x=\"blood_pressure\", y=\"packed_cell_volume\", color='class', barmode='group', template = 'plotly_dark', height = 400)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1274eabf", + "metadata": { + "papermill": { + "duration": 0.1828, + "end_time": "2021-08-03T10:27:09.906757", + "exception": false, + "start_time": "2021-08-03T10:27:09.723957", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "px.bar(df, x=\"blood_pressure\", y=\"haemoglobin\", color='class', barmode='group', template = 'plotly_dark', height = 400)" + ] + }, + { + "cell_type": "markdown", + "id": "0f407857", + "metadata": { + "papermill": { + "duration": 0.113446, + "end_time": "2021-08-03T10:27:10.130433", + "exception": false, + "start_time": "2021-08-03T10:27:10.016987", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

Data Pre Processing

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f80213d4", + "metadata": { + "papermill": { + "duration": 0.123788, + "end_time": "2021-08-03T10:27:10.366637", + "exception": false, + "start_time": "2021-08-03T10:27:10.242849", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# checking for null values\n", + "\n", + "df.isna().sum().sort_values(ascending = False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ba098c4d", + "metadata": { + "papermill": { + "duration": 0.126099, + "end_time": "2021-08-03T10:27:10.604839", + "exception": false, + "start_time": "2021-08-03T10:27:10.478740", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "df[num_cols].isnull().sum()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "53fd363f", + "metadata": { + "papermill": { + "duration": 0.124901, + "end_time": "2021-08-03T10:27:10.843945", + "exception": false, + "start_time": "2021-08-03T10:27:10.719044", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "df[cat_cols].isnull().sum()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "60a279ff", + "metadata": { + "papermill": { + "duration": 0.121529, + "end_time": "2021-08-03T10:27:11.080479", + "exception": false, + "start_time": "2021-08-03T10:27:10.958950", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# filling null values, we will use two methods, random sampling for higher null values and \n", + "# mean/mode sampling for lower null values\n", + "\n", + "def random_value_imputation(feature):\n", + " random_sample = df[feature].dropna().sample(df[feature].isna().sum())\n", + " random_sample.index = df[df[feature].isnull()].index\n", + " df.loc[df[feature].isnull(), feature] = random_sample\n", + " \n", + "def impute_mode(feature):\n", + " mode = df[feature].mode()[0]\n", + " df[feature] = 
df[feature].fillna(mode)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8977f691", + "metadata": { + "papermill": { + "duration": 0.14567, + "end_time": "2021-08-03T10:27:11.341199", + "exception": false, + "start_time": "2021-08-03T10:27:11.195529", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# filling num_cols null values using random sampling method\n", + "\n", + "for col in num_cols:\n", + " random_value_imputation(col)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6494929f", + "metadata": { + "papermill": { + "duration": 0.128198, + "end_time": "2021-08-03T10:27:11.583291", + "exception": false, + "start_time": "2021-08-03T10:27:11.455093", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "df[num_cols].isnull().sum()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "075163c6", + "metadata": { + "papermill": { + "duration": 0.13398, + "end_time": "2021-08-03T10:27:11.832599", + "exception": false, + "start_time": "2021-08-03T10:27:11.698619", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# filling \"red_blood_cells\" and \"pus_cell\" using random sampling method and rest of cat_cols using mode imputation\n", + "\n", + "random_value_imputation('red_blood_cells')\n", + "random_value_imputation('pus_cell')\n", + "\n", + "for col in cat_cols:\n", + " impute_mode(col)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6aba043e", + "metadata": { + "papermill": { + "duration": 0.128218, + "end_time": "2021-08-03T10:27:12.074909", + "exception": false, + "start_time": "2021-08-03T10:27:11.946691", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "df[cat_cols].isnull().sum()" + ] + }, + { + "cell_type": "markdown", + "id": "60e0befb", + "metadata": { + "papermill": { + "duration": 0.114518, + "end_time": "2021-08-03T10:27:12.303641", + 
"exception": false, + "start_time": "2021-08-03T10:27:12.189123", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "

All the missing values are handled now; let's do categorical feature encoding next.
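
For reference, the random-sample imputation idea used in the cells above can be sketched on a toy Series. The helper name, the seed argument, and sampling with replacement are assumptions made for this illustration; the notebook's own function samples without replacement and modifies df in place.

```python
import numpy as np
import pandas as pd

def random_sample_impute(series, seed=0):
    # Hypothetical helper: returns a filled copy instead of editing in place.
    missing = series.isna()
    # Draw one observed value per missing slot. Sampling with replacement
    # also works when there are more gaps than observed values.
    draws = series.dropna().sample(missing.sum(), replace=True, random_state=seed)
    filled = series.copy()
    filled[missing] = draws.to_numpy()
    return filled

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
filled = random_sample_impute(s)

print(filled.isna().sum())  # 0
```

Unlike mean or mode filling, this keeps the filled column's spread close to that of the observed values, at the cost of adding randomness; fixing the seed makes the result repeatable.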

" + ] + }, + { + "cell_type": "markdown", + "id": "b158cf98", + "metadata": { + "papermill": { + "duration": 0.113887, + "end_time": "2021-08-03T10:27:12.531784", + "exception": false, + "start_time": "2021-08-03T10:27:12.417897", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

Feature Encoding

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fba81f1a", + "metadata": { + "papermill": { + "duration": 0.130215, + "end_time": "2021-08-03T10:27:12.776345", + "exception": false, + "start_time": "2021-08-03T10:27:12.646130", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "for col in cat_cols:\n", + " print(f\"{col} has {df[col].nunique()} categories\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "bd9945c2", + "metadata": { + "papermill": { + "duration": 0.116643, + "end_time": "2021-08-03T10:27:13.006895", + "exception": false, + "start_time": "2021-08-03T10:27:12.890252", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "

As all of the categorical columns have 2 categories, we can use a label encoder.
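
A small sketch of what the label encoder does with a two-category column, on made-up toy values: LabelEncoder sorts the distinct labels and numbers them from 0, so for 'no'/'yes' the result is a plain 0/1 column.

```python
from sklearn.preprocessing import LabelEncoder

values = ['yes', 'no', 'no', 'yes']  # toy column (hypothetical)

le = LabelEncoder()
codes = le.fit_transform(values)

print(list(le.classes_))  # ['no', 'yes']
print(list(codes))        # [1, 0, 0, 1]
```

When one encoder object is reused across several columns in a loop, le.classes_ only reflects the last column fitted, so keep a separate encoder per column if the mapping needs to be inverted later.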

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5c8e2126", + "metadata": { + "papermill": { + "duration": 0.248224, + "end_time": "2021-08-03T10:27:13.372008", + "exception": false, + "start_time": "2021-08-03T10:27:13.123784", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from sklearn.preprocessing import LabelEncoder\n", + "\n", + "le = LabelEncoder()\n", + "\n", + "for col in cat_cols:\n", + " df[col] = le.fit_transform(df[col])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5f3d91ef", + "metadata": { + "papermill": { + "duration": 0.14916, + "end_time": "2021-08-03T10:27:13.635946", + "exception": false, + "start_time": "2021-08-03T10:27:13.486786", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "d478f5a7", + "metadata": { + "papermill": { + "duration": 0.117214, + "end_time": "2021-08-03T10:27:13.870718", + "exception": false, + "start_time": "2021-08-03T10:27:13.753504", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

Model Building

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8a66e3e6", + "metadata": { + "papermill": { + "duration": 0.127828, + "end_time": "2021-08-03T10:27:14.114725", + "exception": false, + "start_time": "2021-08-03T10:27:13.986897", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "ind_col = [col for col in df.columns if col != 'class']\n", + "dep_col = 'class'\n", + "\n", + "X = df[ind_col]\n", + "y = df[dep_col]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5cdcd5f7", + "metadata": { + "papermill": { + "duration": 0.174408, + "end_time": "2021-08-03T10:27:14.405112", + "exception": false, + "start_time": "2021-08-03T10:27:14.230704", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# splitting data intp training and test set\n", + "\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)" + ] + }, + { + "cell_type": "markdown", + "id": "19a8ca3f", + "metadata": { + "papermill": { + "duration": 0.115477, + "end_time": "2021-08-03T10:27:14.637105", + "exception": false, + "start_time": "2021-08-03T10:27:14.521628", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

KNN

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8cab3352", + "metadata": { + "papermill": { + "duration": 0.318107, + "end_time": "2021-08-03T10:27:15.071807", + "exception": false, + "start_time": "2021-08-03T10:27:14.753700", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from sklearn.neighbors import KNeighborsClassifier\n", + "from sklearn.metrics import accuracy_score, confusion_matrix, classification_report\n", + "\n", + "knn = KNeighborsClassifier()\n", + "knn.fit(X_train, y_train)\n", + "\n", + "# accuracy score, confusion matrix and classification report of knn\n", + "\n", + "knn_acc = accuracy_score(y_test, knn.predict(X_test))\n", + "\n", + "print(f\"Training Accuracy of KNN is {accuracy_score(y_train, knn.predict(X_train))}\")\n", + "print(f\"Test Accuracy of KNN is {knn_acc} \\n\")\n", + "\n", + "print(f\"Confusion Matrix :- \\n{confusion_matrix(y_test, knn.predict(X_test))}\\n\")\n", + "print(f\"Classification Report :- \\n {classification_report(y_test, knn.predict(X_test))}\")" + ] + }, + { + "cell_type": "markdown", + "id": "09deaf07", + "metadata": { + "papermill": { + "duration": 0.116247, + "end_time": "2021-08-03T10:27:15.306480", + "exception": false, + "start_time": "2021-08-03T10:27:15.190233", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

Decision Tree Classifier

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c4263757", + "metadata": { + "papermill": { + "duration": 0.175633, + "end_time": "2021-08-03T10:27:15.597852", + "exception": false, + "start_time": "2021-08-03T10:27:15.422219", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from sklearn.tree import DecisionTreeClassifier\n", + "\n", + "dtc = DecisionTreeClassifier()\n", + "dtc.fit(X_train, y_train)\n", + "\n", + "# accuracy score, confusion matrix and classification report of decision tree\n", + "\n", + "dtc_acc = accuracy_score(y_test, dtc.predict(X_test))\n", + "\n", + "print(f\"Training Accuracy of Decision Tree Classifier is {accuracy_score(y_train, dtc.predict(X_train))}\")\n", + "print(f\"Test Accuracy of Decision Tree Classifier is {dtc_acc} \\n\")\n", + "\n", + "print(f\"Confusion Matrix :- \\n{confusion_matrix(y_test, dtc.predict(X_test))}\\n\")\n", + "print(f\"Classification Report :- \\n {classification_report(y_test, dtc.predict(X_test))}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "560d2cbe", + "metadata": { + "papermill": { + "duration": 17.017514, + "end_time": "2021-08-03T10:27:32.735752", + "exception": false, + "start_time": "2021-08-03T10:27:15.718238", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# hyperparameter tuning of decision tree (min_samples_split must be at least 2)\n", + "\n", + "from sklearn.model_selection import GridSearchCV\n", + "grid_param = {\n", + " 'criterion' : ['gini', 'entropy'],\n", + " 'max_depth' : [3, 5, 7, 10],\n", + " 'splitter' : ['best', 'random'],\n", + " 'min_samples_leaf' : [1, 2, 3, 5, 7],\n", + " 'min_samples_split' : [2, 3, 5, 7],\n", + " 'max_features' : ['sqrt', 'log2']\n", + "}\n", + "\n", + "grid_search_dtc = GridSearchCV(dtc, grid_param, cv = 5, n_jobs = -1, verbose = 1)\n", + "grid_search_dtc.fit(X_train, y_train)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b909268", +
"metadata": { + "papermill": { + "duration": 0.125459, + "end_time": "2021-08-03T10:27:32.978520", + "exception": false, + "start_time": "2021-08-03T10:27:32.853061", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# best parameters and best score\n", + "\n", + "print(grid_search_dtc.best_params_)\n", + "print(grid_search_dtc.best_score_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "704ee48f", + "metadata": { + "papermill": { + "duration": 0.1406, + "end_time": "2021-08-03T10:27:33.238633", + "exception": false, + "start_time": "2021-08-03T10:27:33.098033", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# best estimator\n", + "\n", + "dtc = grid_search_dtc.best_estimator_\n", + "\n", + "# accuracy score, confusion matrix and classification report of decision tree\n", + "\n", + "dtc_acc = accuracy_score(y_test, dtc.predict(X_test))\n", + "\n", + "print(f\"Training Accuracy of Decision Tree Classifier is {accuracy_score(y_train, dtc.predict(X_train))}\")\n", + "print(f\"Test Accuracy of Decision Tree Classifier is {dtc_acc} \\n\")\n", + "\n", + "print(f\"Confusion Matrix :- \\n{confusion_matrix(y_test, dtc.predict(X_test))}\\n\")\n", + "print(f\"Classification Report :- \\n {classification_report(y_test, dtc.predict(X_test))}\")" + ] + }, + { + "cell_type": "markdown", + "id": "6cac43df", + "metadata": { + "papermill": { + "duration": 0.117918, + "end_time": "2021-08-03T10:27:33.473599", + "exception": false, + "start_time": "2021-08-03T10:27:33.355681", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

Random Forest Classifier

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8125ad61", + "metadata": { + "papermill": { + "duration": 0.471782, + "end_time": "2021-08-03T10:27:34.064632", + "exception": false, + "start_time": "2021-08-03T10:27:33.592850", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "rd_clf = RandomForestClassifier(criterion = 'entropy', max_depth = 11, max_features = 'sqrt', min_samples_leaf = 2, min_samples_split = 3, n_estimators = 130)\n", + "rd_clf.fit(X_train, y_train)\n", + "\n", + "# accuracy score, confusion matrix and classification report of random forest\n", + "\n", + "rd_clf_acc = accuracy_score(y_test, rd_clf.predict(X_test))\n", + "\n", + "print(f\"Training Accuracy of Random Forest Classifier is {accuracy_score(y_train, rd_clf.predict(X_train))}\")\n", + "print(f\"Test Accuracy of Random Forest Classifier is {rd_clf_acc} \\n\")\n", + "\n", + "print(f\"Confusion Matrix :- \\n{confusion_matrix(y_test, rd_clf.predict(X_test))}\\n\")\n", + "print(f\"Classification Report :- \\n {classification_report(y_test, rd_clf.predict(X_test))}\")" + ] + }, + { + "cell_type": "markdown", + "id": "ecb4a146", + "metadata": { + "papermill": { + "duration": 0.117503, + "end_time": "2021-08-03T10:27:34.299456", + "exception": false, + "start_time": "2021-08-03T10:27:34.181953", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

Ada Boost Classifier

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b78ff36b", + "metadata": { + "papermill": { + "duration": 0.151699, + "end_time": "2021-08-03T10:27:34.568981", + "exception": false, + "start_time": "2021-08-03T10:27:34.417282", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from sklearn.ensemble import AdaBoostClassifier\n", + "\n", + "ada = AdaBoostClassifier(dtc) # passed positionally: the keyword was renamed from base_estimator to estimator in scikit-learn 1.2\n", + "ada.fit(X_train, y_train)\n", + "\n", + "# accuracy score, confusion matrix and classification report of ada boost\n", + "\n", + "ada_acc = accuracy_score(y_test, ada.predict(X_test))\n", + "\n", + "print(f\"Training Accuracy of Ada Boost Classifier is {accuracy_score(y_train, ada.predict(X_train))}\")\n", + "print(f\"Test Accuracy of Ada Boost Classifier is {ada_acc} \\n\")\n", + "\n", + "print(f\"Confusion Matrix :- \\n{confusion_matrix(y_test, ada.predict(X_test))}\\n\")\n", + "print(f\"Classification Report :- \\n {classification_report(y_test, ada.predict(X_test))}\")" + ] + }, + { + "cell_type": "markdown", + "id": "39681949", + "metadata": { + "papermill": { + "duration": 0.116459, + "end_time": "2021-08-03T10:27:34.802069", + "exception": false, + "start_time": "2021-08-03T10:27:34.685610", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

Gradient Boosting Classifier

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a13b3203", + "metadata": { + "papermill": { + "duration": 0.273095, + "end_time": "2021-08-03T10:27:35.197368", + "exception": false, + "start_time": "2021-08-03T10:27:34.924273", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from sklearn.ensemble import GradientBoostingClassifier\n", + "\n", + "gb = GradientBoostingClassifier()\n", + "gb.fit(X_train, y_train)\n", + "\n", + "# accuracy score, confusion matrix and classification report of gradient boosting classifier\n", + "\n", + "gb_acc = accuracy_score(y_test, gb.predict(X_test))\n", + "\n", + "print(f\"Training Accuracy of Gradient Boosting Classifier is {accuracy_score(y_train, gb.predict(X_train))}\")\n", + "print(f\"Test Accuracy of Gradient Boosting Classifier is {gb_acc} \\n\")\n", + "\n", + "print(f\"Confusion Matrix :- \\n{confusion_matrix(y_test, gb.predict(X_test))}\\n\")\n", + "print(f\"Classification Report :- \\n {classification_report(y_test, gb.predict(X_test))}\")" + ] + }, + { + "cell_type": "markdown", + "id": "08ca4967", + "metadata": { + "papermill": { + "duration": 0.11849, + "end_time": "2021-08-03T10:27:35.434909", + "exception": false, + "start_time": "2021-08-03T10:27:35.316419", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

Stochastic Gradient Boosting (SGB)

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0270b3dc", + "metadata": { + "papermill": { + "duration": 0.450977, + "end_time": "2021-08-03T10:27:36.004903", + "exception": false, + "start_time": "2021-08-03T10:27:35.553926", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "sgb = GradientBoostingClassifier(max_depth = 4, subsample = 0.90, max_features = 0.75, n_estimators = 200)\n", + "sgb.fit(X_train, y_train)\n", + "\n", + "# accuracy score, confusion matrix and classification report of stochastic gradient boosting classifier\n", + "\n", + "sgb_acc = accuracy_score(y_test, sgb.predict(X_test))\n", + "\n", + "print(f\"Training Accuracy of Stochastic Gradient Boosting is {accuracy_score(y_train, sgb.predict(X_train))}\")\n", + "print(f\"Test Accuracy of Stochastic Gradient Boosting is {sgb_acc} \\n\")\n", + "\n", + "print(f\"Confusion Matrix :- \\n{confusion_matrix(y_test, sgb.predict(X_test))}\\n\")\n", + "print(f\"Classification Report :- \\n {classification_report(y_test, sgb.predict(X_test))}\")" + ] + }, + { + "cell_type": "markdown", + "id": "bad3dd1b", + "metadata": { + "papermill": { + "duration": 0.12016, + "end_time": "2021-08-03T10:27:36.244512", + "exception": false, + "start_time": "2021-08-03T10:27:36.124352", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

XgBoost

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9240394", + "metadata": { + "papermill": { + "duration": 0.295935, + "end_time": "2021-08-03T10:27:36.659899", + "exception": false, + "start_time": "2021-08-03T10:27:36.363964", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from xgboost import XGBClassifier\n", + "\n", + "xgb = XGBClassifier(objective = 'binary:logistic', learning_rate = 0.5, max_depth = 5, n_estimators = 150)\n", + "xgb.fit(X_train, y_train)\n", + "\n", + "# accuracy score, confusion matrix and classification report of xgboost\n", + "\n", + "xgb_acc = accuracy_score(y_test, xgb.predict(X_test))\n", + "\n", + "print(f\"Training Accuracy of XgBoost is {accuracy_score(y_train, xgb.predict(X_train))}\")\n", + "print(f\"Test Accuracy of XgBoost is {xgb_acc} \\n\")\n", + "\n", + "print(f\"Confusion Matrix :- \\n{confusion_matrix(y_test, xgb.predict(X_test))}\\n\")\n", + "print(f\"Classification Report :- \\n {classification_report(y_test, xgb.predict(X_test))}\")" + ] + }, + { + "cell_type": "markdown", + "id": "fb95f577", + "metadata": { + "papermill": { + "duration": 0.119363, + "end_time": "2021-08-03T10:27:36.901621", + "exception": false, + "start_time": "2021-08-03T10:27:36.782258", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

Cat Boost Classifier

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "18744595", + "metadata": { + "papermill": { + "duration": 0.502686, + "end_time": "2021-08-03T10:27:37.525643", + "exception": false, + "start_time": "2021-08-03T10:27:37.022957", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from catboost import CatBoostClassifier\n", + "\n", + "cat = CatBoostClassifier(iterations=10)\n", + "cat.fit(X_train, y_train)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a40d62d3", + "metadata": { + "papermill": { + "duration": 0.145452, + "end_time": "2021-08-03T10:27:37.791916", + "exception": false, + "start_time": "2021-08-03T10:27:37.646464", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# accuracy score, confusion matrix and classification report of cat boost\n", + "\n", + "cat_acc = accuracy_score(y_test, cat.predict(X_test))\n", + "\n", + "print(f\"Training Accuracy of Cat Boost Classifier is {accuracy_score(y_train, cat.predict(X_train))}\")\n", + "print(f\"Test Accuracy of Cat Boost Classifier is {cat_acc} \\n\")\n", + "\n", + "print(f\"Confusion Matrix :- \\n{confusion_matrix(y_test, cat.predict(X_test))}\\n\")\n", + "print(f\"Classification Report :- \\n {classification_report(y_test, cat.predict(X_test))}\")" + ] + }, + { + "cell_type": "markdown", + "id": "163e2b66", + "metadata": { + "papermill": { + "duration": 0.11928, + "end_time": "2021-08-03T10:27:38.032119", + "exception": false, + "start_time": "2021-08-03T10:27:37.912839", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

Extra Trees Classifier

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa0260e3", + "metadata": { + "papermill": { + "duration": 0.321475, + "end_time": "2021-08-03T10:27:38.474354", + "exception": false, + "start_time": "2021-08-03T10:27:38.152879", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from sklearn.ensemble import ExtraTreesClassifier\n", + "\n", + "etc = ExtraTreesClassifier()\n", + "etc.fit(X_train, y_train)\n", + "\n", + "# accuracy score, confusion matrix and classification report of extra trees classifier\n", + "\n", + "etc_acc = accuracy_score(y_test, etc.predict(X_test))\n", + "\n", + "print(f\"Training Accuracy of Extra Trees Classifier is {accuracy_score(y_train, etc.predict(X_train))}\")\n", + "print(f\"Test Accuracy of Extra Trees Classifier is {etc_acc} \\n\")\n", + "\n", + "print(f\"Confusion Matrix :- \\n{confusion_matrix(y_test, etc.predict(X_test))}\\n\")\n", + "print(f\"Classification Report :- \\n {classification_report(y_test, etc.predict(X_test))}\")" + ] + }, + { + "cell_type": "markdown", + "id": "9a1ef9c8", + "metadata": { + "papermill": { + "duration": 0.119764, + "end_time": "2021-08-03T10:27:38.714992", + "exception": false, + "start_time": "2021-08-03T10:27:38.595228", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

LGBM Classifier

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f6a932da", + "metadata": { + "papermill": { + "duration": 0.560678, + "end_time": "2021-08-03T10:27:39.394880", + "exception": false, + "start_time": "2021-08-03T10:27:38.834202", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from lightgbm import LGBMClassifier\n", + "\n", + "lgbm = LGBMClassifier(learning_rate = 1)\n", + "lgbm.fit(X_train, y_train)\n", + "\n", + "# accuracy score, confusion matrix and classification report of lgbm classifier\n", + "\n", + "lgbm_acc = accuracy_score(y_test, lgbm.predict(X_test))\n", + "\n", + "print(f\"Training Accuracy of LGBM Classifier is {accuracy_score(y_train, lgbm.predict(X_train))}\")\n", + "print(f\"Test Accuracy of LGBM Classifier is {lgbm_acc} \\n\")\n", + "\n", + "print(f\"{confusion_matrix(y_test, lgbm.predict(X_test))}\\n\")\n", + "print(classification_report(y_test, lgbm.predict(X_test)))" + ] + }, + { + "cell_type": "markdown", + "id": "1bad90b3", + "metadata": { + "papermill": { + "duration": 0.12238, + "end_time": "2021-08-03T10:27:39.641368", + "exception": false, + "start_time": "2021-08-03T10:27:39.518988", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "\n", + "

Models Comparison

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d0edf63f", + "metadata": { + "papermill": { + "duration": 0.137113, + "end_time": "2021-08-03T10:27:39.900584", + "exception": false, + "start_time": "2021-08-03T10:27:39.763471", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "models = pd.DataFrame({\n", + " 'Model' : [ 'KNN', 'Decision Tree Classifier', 'Random Forest Classifier','Ada Boost Classifier',\n", + " 'Gradient Boosting Classifier', 'Stochastic Gradient Boosting', 'XgBoost', 'Cat Boost', 'Extra Trees Classifier', 'LGBM Classifier'],\n", + " 'Score' : [knn_acc, dtc_acc, rd_clf_acc, ada_acc, gb_acc, sgb_acc, xgb_acc, cat_acc, etc_acc, lgbm_acc]\n", + "})\n", + "\n", + "\n", + "models.sort_values(by = 'Score', ascending = False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "472c52ec", + "metadata": { + "papermill": { + "duration": 0.390074, + "end_time": "2021-08-03T10:27:40.409843", + "exception": false, + "start_time": "2021-08-03T10:27:40.019769", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "px.bar(data_frame = models, x = 'Score', y = 'Model', color = 'Score', template = 'plotly_dark', \n", + " title = 'Models Comparison')" + ] + }, + { + "cell_type": "markdown", + "id": "029f4fe8", + "metadata": { + "papermill": { + "duration": 0.124031, + "end_time": "2021-08-03T10:27:40.653600", + "exception": false, + "start_time": "2021-08-03T10:27:40.529569", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "


" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + }, + "papermill": { + "default_parameters": {}, + "duration": 63.932833, + "end_time": "2021-08-03T10:27:41.688051", + "environment_variables": {}, + "exception": null, + "input_path": "__notebook__.ipynb", + "output_path": "__notebook__.ipynb", + "parameters": {}, + "start_time": "2021-08-03T10:26:37.755218", + "version": "2.3.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From e388a4bf03ad0d90413a7190bc3175d8850dd5d6 Mon Sep 17 00:00:00 2001 From: VARUNSHIYAM <138989960+Varunshiyam@users.noreply.github.com> Date: Sun, 10 Nov 2024 16:32:02 +0530 Subject: [PATCH 2/2] Create Readme.md --- .../Readme.md | 34 +++++++++++++++++++ 1 file changed, 34 insertions(+) create mode 100644 Prediction Models/Chronic_Kidney_Disease_prediction/Readme.md diff --git a/Prediction Models/Chronic_Kidney_Disease_prediction/Readme.md b/Prediction Models/Chronic_Kidney_Disease_prediction/Readme.md new file mode 100644 index 00000000..0347fdf6 --- /dev/null +++ b/Prediction Models/Chronic_Kidney_Disease_prediction/Readme.md @@ -0,0 +1,34 @@ +# Chronic Kidney Disease Prediction Model + +## Project Description + +This project aims to predict chronic kidney disease (CKD) using advanced machine learning models. Leveraging a dataset that includes patient health metrics, the project implements various algorithms to achieve accurate classification of CKD. The primary objective is to create a robust model that can assist in early detection, contributing to better patient outcomes and proactive management of the disease. 
+ +### Key Features: +- **Data Preprocessing**: Cleaning, normalization, and transformation of the dataset to prepare for effective training. +- **Model Implementation**: Application of the classifiers trained in the notebook: K-Nearest Neighbors, Decision Tree, Random Forest, AdaBoost, Gradient Boosting, XGBoost, CatBoost, Extra Trees, and LightGBM. +- **Evaluation Metrics**: Evaluation of each model using accuracy, a confusion matrix, and a classification report (precision, recall, and F1-score). +- **High Accuracy**: The project achieves up to 98% accuracy. + +### Technologies Used: +- **Python**: Primary programming language for coding and data analysis. +- **Pandas & NumPy**: For data manipulation and analysis. +- **Scikit-learn, XGBoost, CatBoost & LightGBM**: For implementing machine learning models and evaluation metrics. +- **Matplotlib, Seaborn & Plotly**: For data visualization to aid in understanding the dataset and model results. + +## Problem Statement + +Chronic kidney disease (CKD) is a global health challenge with significant morbidity and mortality. Early diagnosis is crucial for effective treatment and for slowing disease progression. However, manual analysis of patient health metrics can be time-consuming and prone to human error. This project addresses the need for an automated, accurate, and efficient system to predict CKD from patient data. By employing machine learning techniques, the system aims to: +- **Streamline Diagnosis**: Provide faster, data-driven insights for healthcare professionals. +- **Improve Accuracy**: Reduce the variability and potential inaccuracies in manual assessments. +- **Assist in Preventative Care**: Enable early intervention strategies to mitigate disease impact. + +## Project Structure + +- `data/`: Contains the dataset used for training and testing. +- `notebooks/`: Jupyter notebooks for data exploration and model development. +- `src/`: Python scripts for data processing, model training, and evaluation. +- `results/`: Includes reports, plots, and saved models. 
+- `Readme.md`: This documentation. + +
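The evaluation metrics named under Key Features can be sketched in plain Python. This is a minimal illustration, not code from the notebook, which relies on scikit-learn's `accuracy_score`, `confusion_matrix`, and `classification_report`. The function name `binary_metrics` and the toy labels are assumptions made for this example, and label `1` is treated as the positive class here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels.

    Hypothetical helper for illustration; `positive` marks the label
    counted as the positive class.
    """
    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard every ratio against an empty denominator
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, made up for illustration only (not taken from the dataset).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can hide missed positive cases, which is why the notebook also prints a confusion matrix and a classification report for each model.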