
Udacity Capstone Project: AutoML & HyperDrive Experiment

The current project uses machine learning to predict patients’ survival based on their medical data.

I create two models in Azure Machine Learning Studio: one using AutoML and one custom model whose hyperparameters are tuned using HyperDrive. I then compare the performance of both models and deploy the best-performing one as a service using Azure Container Instances (ACI).

Project Set Up and Installation

I used the provided workspace and environment, so everything was pre-installed by the Udacity course. The following scripts were used in this project:

  • automl.ipynb: for the AutoML experiment
  • hyperparameter_tuning.ipynb: for the HyperDrive experiment
  • heart_failure_clinical_records_dataset.csv: the dataset file taken from Kaggle
  • train.py: a basic script for manipulating the data used in the HyperDrive experiment; a modified version of the script provided in the first project
  • scoring_file_v_1_0_0.py: the script used to deploy the model, downloaded from within Azure Machine Learning Studio
  • env.yml: the environment file, also downloaded from within Azure Machine Learning Studio

Dataset

Overview

The dataset is taken from Kaggle and, as described in the original research article, the data comes from 299 patients with heart failure collected at the Faisalabad Institute of Cardiology and at the Allied Hospital in Faisalabad (Punjab, Pakistan) during April–December 2015. The patients consisted of 105 women and 194 men, and their ages ranged between 40 and 95 years.

The dataset contains 13 features:

| Feature | Explanation | Measurement |
| --- | --- | --- |
| age | Age of the patient | Years (40–95) |
| anaemia | Decrease of red blood cells or hemoglobin | Boolean (0 = No, 1 = Yes) |
| creatinine_phosphokinase | Level of the CPK enzyme in the blood | mcg/L |
| diabetes | Whether the patient has diabetes | Boolean (0 = No, 1 = Yes) |
| ejection_fraction | Percentage of blood leaving the heart at each contraction | Percentage |
| high_blood_pressure | Whether the patient has hypertension | Boolean (0 = No, 1 = Yes) |
| platelets | Platelets in the blood | kiloplatelets/mL |
| serum_creatinine | Level of creatinine in the blood | mg/dL |
| serum_sodium | Level of sodium in the blood | mEq/L |
| sex | Female or Male | Binary (0 = F, 1 = M) |
| smoking | Whether the patient smokes | Boolean (0 = No, 1 = Yes) |
| time | Follow-up period | Days |
| DEATH_EVENT | Whether the patient died during the follow-up period | Boolean (0 = No, 1 = Yes) |

Task

The task is to classify patients based on their odds of survival; the prediction is based on the features listed in the table above.

Access

I uploaded the data to Azure ML Studio; it is also available in my GitHub repository, and I provided the link in the notebook: https://github.com/hmza09/nd00333-capstone/blob/master/starter_file/heart_failure_clinical_records_dataset.csv
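A minimal sketch of how the dataset can be loaded and registered from the GitHub URL, assuming the Azure ML SDK v1 available in the course workspace (the dataset name "heart-failure" is my choice, not taken from the notebook):

```python
from azureml.core import Workspace, Dataset

# Reads the workspace config.json provided by the course environment
ws = Workspace.from_config()

# Raw CSV straight from the repository (same file as the uploaded copy)
url = "https://raw.githubusercontent.com/hmza09/nd00333-capstone/master/starter_file/heart_failure_clinical_records_dataset.csv"

# Create a tabular dataset and register it so both notebooks can reuse it
dataset = Dataset.Tabular.from_delimited_files(path=url)
dataset = dataset.register(workspace=ws, name="heart-failure",
                           description="Heart failure clinical records (Kaggle)")
```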

Automated ML

Below is an overview of the AutoML settings and configuration I used for the run:

"n_cross_validations": 2

This parameter sets how many cross-validations to perform, based on the same number of folds (subsets). Since a single validation split could overfit, I chose 2 folds for cross-validation; the reported metrics are therefore the average of the 2 validation folds.

"primary_metric": 'accuracy'

I chose accuracy as the primary metric as it is the default metric used for classification tasks.

"enable_early_stopping": True

This enables early termination if the score is not improving in the short term. In this experiment it could also be omitted, because experiment_timeout_minutes is already defined below.

"max_concurrent_iterations": 4

It represents the maximum number of iterations that would be executed in parallel.

"experiment_timeout_minutes": 20

This is an exit criterion and is used to define how long, in minutes, the experiment should continue to run. To help avoid experiment time out failures, I used the value of 20 minutes.

"verbosity": logging.INFO

The verbosity level for writing to the log file.

compute_target = compute_target

The Azure Machine Learning compute target to run the Automated Machine Learning experiment on.

task = 'classification'

This defines the experiment type which in this case is classification. Other options are regression and forecasting.

training_data = dataset

The training data to be used within the experiment. It should contain both training features and a label column - see next parameter.

label_column_name = 'DEATH_EVENT'

The name of the label column i.e. the target column based on which the prediction is done.

path = project_folder

The full path to the Azure Machine Learning project folder.

featurization = 'auto'

This parameter defines whether featurization step should be done automatically as in this case (auto) or not (off).

debug_log = 'automl_errors.log'

The log file to write debug information to.
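Taken together, the parameters above correspond to an AutoMLConfig roughly like the following. This is a sketch assuming SDK v1; compute_target, dataset, and project_folder are defined in earlier cells of automl.ipynb:

```python
import logging

from azureml.train.automl import AutoMLConfig

# The settings described above, collected into one dictionary
automl_settings = {
    "n_cross_validations": 2,
    "primary_metric": "accuracy",
    "enable_early_stopping": True,
    "max_concurrent_iterations": 4,
    "experiment_timeout_minutes": 20,
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(
    compute_target=compute_target,   # defined earlier in the notebook
    task="classification",
    training_data=dataset,           # registered dataset from the Access step
    label_column_name="DEATH_EVENT",
    path=project_folder,             # defined earlier in the notebook
    featurization="auto",
    debug_log="automl_errors.log",
    **automl_settings,
)
```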

Results

  • Model Run Widget

  • Metrics

  • Best Performance Model

Hyperparameter Tuning

Parameter sampler

I specified the parameter sampler as such:

ps = RandomParameterSampling(
    {
        '--C' : choice(0.001,0.01,0.1,1,10,20,50,100,200,500,1000),
        '--max_iter': choice(50,100,200,300)
    }
)

I chose discrete values with choice for both parameters, C and max_iter.

C is the regularization parameter (in scikit-learn's LogisticRegression it is the inverse of the regularization strength), while max_iter is the maximum number of iterations.

RandomParameterSampling is one of the available samplers, and I chose it because it is faster and supports early termination of low-performance runs. If budget were not an issue, we could use GridParameterSampling to exhaustively search the space, or BayesianParameterSampling to explore it based on the results of previous samples.

Early stopping policy

An early stopping policy is used to automatically terminate poorly performing runs, thus improving computational efficiency. I chose the BanditPolicy, which I specified as follows:

policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

Every 2 iterations (evaluation_interval), any run whose primary metric is not within the 10% slack (slack_factor) of the best run so far is terminated.
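The sampler and policy feed into a HyperDriveConfig roughly like the following sketch (SDK v1 assumed). The run configuration wrapping train.py is defined earlier in hyperparameter_tuning.ipynb, and the metric name and run budget shown here are assumptions, not values from the README:

```python
from azureml.train.hyperdrive import (
    BanditPolicy,
    HyperDriveConfig,
    PrimaryMetricGoal,
    RandomParameterSampling,
    choice,
)

# Discrete search space over C and max_iter, as described above
ps = RandomParameterSampling({
    "--C": choice(0.001, 0.01, 0.1, 1, 10, 20, 50, 100, 200, 500, 1000),
    "--max_iter": choice(50, 100, 200, 300),
})

# Terminate runs that fall outside a 10% slack of the best run
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

hyperdrive_config = HyperDriveConfig(
    run_config=script_run_config,    # wraps train.py; defined earlier in the notebook
    hyperparameter_sampling=ps,
    policy=policy,
    primary_metric_name="Accuracy",  # the metric train.py logs; the name is an assumption
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,               # assumed budget
    max_concurrent_runs=4,
)
```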
  • Two hyperparameters tuned in this model

  • Run Widget

Results

  • Models with different hyperparameter tuning and metrics

  • Register Model with RunID

Model Deployment

The deployment is done following the steps below:

  • Preparation of an inference configuration
  • Preparation of an entry script
  • Choosing a compute target
  • Deployment of the model
  • Testing the resulting web service
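The steps above can be sketched with SDK v1 as follows. The service and model names are my placeholders, and env.yml plus the scoring file are the downloads mentioned in the setup section:

```python
from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="best-model")  # hypothetical registered-model name

# Inference configuration: entry script + environment
env = Environment.from_conda_specification(name="deploy-env", file_path="env.yml")
inference_config = InferenceConfig(entry_script="scoring_file_v_1_0_0.py",
                                   environment=env)

# Compute target: a small ACI container
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1,
                                                auth_enabled=True)

# Deploy the model and wait for the service to come up
service = Model.deploy(ws, "heart-failure-service", [model],
                       inference_config, aci_config)
service.wait_for_deployment(show_output=True)
print(service.state, service.scoring_uri)
```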

Inference configuration

The inference configuration defines the environment used to run the deployed model. It includes two entities, which are used to run the model when it is deployed: the entry script and the environment definition (env.yml).

Entry script

The entry script is the scoring_file_v_1_0_0.py file. It loads the model when the deployed service starts, and it is also responsible for receiving data, passing it to the model, and returning a response.

Compute target

As compute target, I chose the Azure Container Instances (ACI) service, which is used for low-scale CPU-based workloads that require less than 48 GB of RAM.

The ACI Webservice class represents a machine learning model deployed as a web service endpoint on Azure Container Instances. The deployed service is created from the model, the script, and the associated files, as explained above. The resulting web service is a load-balanced HTTP endpoint with a REST API. We can send data to this API and receive the prediction returned by the model.
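Sending a request can be sketched as below. The scoring URI is a placeholder (copy service.scoring_uri from the deployed service), and the feature values are an illustrative sample; DEATH_EVENT is excluded because it is the label being predicted:

```python
import json

scoring_uri = "http://<your-aci-service>/score"  # placeholder for service.scoring_uri

# One illustrative patient record with the 12 input features
payload = json.dumps({"data": [{
    "age": 75, "anaemia": 0, "creatinine_phosphokinase": 582, "diabetes": 0,
    "ejection_fraction": 20, "high_blood_pressure": 1, "platelets": 265000.0,
    "serum_creatinine": 1.9, "serum_sodium": 130, "sex": 1, "smoking": 0,
    "time": 4,
}]})

headers = {"Content-Type": "application/json"}

# Uncomment to call the live endpoint:
# import requests
# response = requests.post(scoring_uri, data=payload, headers=headers)
# print(response.json())
```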

  • Service state of the deployed model

  • Testing the resulting web service

Screen Recording

The screen recording can be found here; it demonstrates:

  • A working model
  • Demo of the deployed model
  • Demo of a sample request sent to the endpoint and its response
