Finished tutorial 5
otavioon committed Jun 27, 2024
1 parent 096b9fc commit b461311
Showing 2 changed files with 113 additions and 11 deletions.
124 changes: 113 additions & 11 deletions notebooks/05_covid_anomaly_detection.ipynb
@@ -6,20 +6,73 @@
"source": [
"# 5. Training an Anomaly Detection Model for Covid Anomaly Detection\n",
"\n",
"## Overview\n",
"\n",
"In this tutorial, we will train an anomaly detection model using a simple [LSTM-AutoEncoder model](https://www.medrxiv.org/content/10.1101/2021.01.08.21249474v1).\n",
"Data can be obtained from [this link](https://iscteiul365-my.sharepoint.com/:u:/g/personal/oonia_iscte-iul_pt/ERZLm1ruUNpMqkSwjpqhE9wB_7loVWAC4yZWuIH2RKGOlQ?e=kD4HlI). This is a processed version of the data from the original Stanford dataset (Phase 2). The overall pre-processing pipeline is illustrated in the figure below.\n",
"\n",
"![preprocessing](stanford_data_processing.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data was acquired from different sources (Garmin, Fitbit, Apple Watch) and pre-processed into a common format. In this form, the data has two columns: the heart rate and the number of user steps in the last minute.\n",
"\n",
"The processing pipeline is then applied to the data. The pipeline is composed of the following steps:\n",
"1. Once the data is standardized, the resting heart rate is extracted (``Resting Heart Rate Extractor`` in the figure). This step takes as input `min_minutes_rest`, the number of minutes the user must be at rest for the heart rate to be considered resting: when the step count is 0 for `min_minutes_rest` consecutive minutes, the heart rate is considered resting. The result is a new dataframe with the date (`datetime` column) and the resting heart rate of the last minute (`RHR` column).\n",
"2. A smoothing step is applied to the data (`Smoother` in the figure). It takes as input `smooth_window_sample`, the number of samples used to smooth the data, and `sample_rate`, the target sample rate. It applies a moving-average filter with a window of `smooth_window_sample` samples and then downsamples the data to `sample_rate`. This produces a new dataframe with the date (`datetime` column) and the resting heart rate at the desired sampling rate (`RHR` column).\n",
"3. The third step adds labels (`Label Adder` in the figure); it is also illustrated in the figure below. It takes 3 inputs: `baseline_days`, `before_onset`, and `after_onset`. `baseline_days` is the number of days before the onset of symptoms considered as baseline (21 days in the figure below). Using the dataframe from the previous step, a new boolean column named `baseline` is added, which is True if the date is in the baseline period (21 days before onset). `before_onset` and `after_onset` are the number of days before and after symptom onset considered as the anomaly period (7 days before and 21 days after, in the figure below). A new boolean column named `anomaly` is added, which is True if the date is in the anomaly period. Finally, a `status` column is added as descriptive metadata for each date. It can be:\n",
" - `normal`: if the date is in the baseline period; \n",
" - `before onset`: if the date is in the period before the onset of the symptoms; \n",
" - `onset` if the date is the onset of the symptoms (day); \n",
" - `after onset` if the date is in the period after the onset of the symptoms, but before the recovery; \n",
" - `recovered` if the date is in the recovery period.\n",
"4. Once the labels are added, the data is normalized (`Standardizer` in the figure above). This step performs a Z-norm scaling on the data, calculated as $z = \\frac{x - \\mu}{\\sigma}$, where $x$ is the value, $\\mu$ is the mean of the column, and $\\sigma$ is the standard deviation of the column. An important note: the mean and standard deviation are calculated only over the baseline period and then applied to the entire dataset.\n",
"5. The last step creates the sequences (`Transposer` in the figure), grouping $n$ consecutive rows and turning them into columns (features). It takes `window_size` and `overlap` parameters and creates sequences of `window_size` samples with an overlap of `overlap` samples. Thus, for a dataset with 100 samples, a `window_size` of 20, and an `overlap` of 0, we get 5 sequences of 20 samples each (*i.e.*, 5 rows with 20 columns). Each element of a sequence becomes a column in the dataframe, numbered from 0 to 19: the sequences have columns `RHR-0`, `RHR-1`, ..., `RHR-19`, where the first row holds the first 20 samples, the second row the next 20 samples, and so on. This is the input format the LSTM-AutoEncoder expects. Importantly, sequences never mix anomaly and non-anomaly periods: an anomaly sample contains only anomaly time steps.\n",
"\n",
"This produces a dataframe (CSV file) for each user. In the processed dataset, we joined all users into a single file and added a `participant_id` column to identify the user, which makes the data easier to work with in the next steps."
]
},
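The two numeric steps above (the baseline-only Z-norm of step 4 and the windowing of step 5) can be sketched in pandas. This is a simplified illustration, not the tutorial's actual implementation; the function names are hypothetical, and only the `RHR-{i}` column convention comes from the text:

```python
import numpy as np
import pandas as pd

def standardize_with_baseline(df: pd.DataFrame, col: str = "RHR") -> pd.DataFrame:
    # Z-norm using mean/std computed ONLY on the baseline period (step 4),
    # then applied to the entire dataset.
    mu = df.loc[df["baseline"], col].mean()
    sigma = df.loc[df["baseline"], col].std()
    out = df.copy()
    out[col] = (out[col] - mu) / sigma
    return out

def transpose_windows(values, window_size: int, overlap: int) -> pd.DataFrame:
    # Group consecutive samples into rows of `window_size` columns (step 5).
    step = window_size - overlap
    rows = [values[i:i + window_size]
            for i in range(0, len(values) - window_size + 1, step)]
    return pd.DataFrame(rows, columns=[f"RHR-{i}" for i in range(window_size)])

# 100 samples, window of 20, no overlap -> 5 rows with 20 columns each
windows = transpose_windows(np.arange(100), window_size=20, overlap=0)
print(windows.shape)  # (5, 20)
```

Note that computing `mu` and `sigma` on the baseline only is what prevents the anomaly period from leaking into the normalization statistics.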
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![labeling](anomaly_periods.png)"
]
},
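The labeling logic of step 3 can be sketched as follows. This is a hypothetical minimal version using the parameter names from the text; the real `Label Adder` may differ in edge cases (for instance, dates before the baseline window are also labeled `normal` here):

```python
import pandas as pd

def add_labels(df, onset, baseline_days=21, before_onset=7, after_onset=21):
    # `df` has a `datetime` column; `onset` is the (normalized) symptom-onset date.
    anomaly_start = onset - pd.Timedelta(days=before_onset)
    anomaly_end = onset + pd.Timedelta(days=after_onset)
    baseline_start = anomaly_start - pd.Timedelta(days=baseline_days)
    out = df.copy()
    day = out["datetime"].dt.normalize()
    out["baseline"] = (day >= baseline_start) & (day < anomaly_start)
    out["anomaly"] = (day >= anomaly_start) & (day <= anomaly_end)

    def status(d):
        if d < anomaly_start:
            return "normal"
        if d < onset:
            return "before onset"
        if d == onset:
            return "onset"
        if d <= anomaly_end:
            return "after onset"
        return "recovered"

    out["status"] = day.map(status)
    return out
```

With the defaults, a date 25 days before onset is baseline/`normal`, 3 days before onset is `before onset` and anomalous, and a date more than 21 days after onset is `recovered`.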
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We already generated several files, with different parameters and operations of the pre-processing pipeline:\n",
"* `rhr_df`: dataframe with the resting heart rate, without normalization (step 4) or transposing (step 5). The `min_minutes_rest` is 12, `smooth_window_sample` is 400, `sample_rate` is 1 hour, `baseline_days` is 21, `before_onset` is 7, and `after_onset` is 21.\n",
"* `rhr_df_scaled`: same as `rhr_df`, but with normalization.\n",
"* `windowed_16_overlap_0_rate_10min_df`: same as `rhr_df`, but transposed (step 5) and without normalization. The `window_size` is 16, `overlap` is 0, and `sample_rate` is 10 minutes.\n",
"* `windowed_16_overlap_0_rate_10min_scaled_df`: same dataframe as `windowed_16_overlap_0_rate_10min_df`, but with normalization.\n",
"\n",
"**NOTE**: The files follow this naming convention: `windowed_{window_size}_overlap_{overlap}_rate_{sample_rate}_df.csv`. If `sample_rate` is omitted, it defaults to 1 hour.\n",
"\n",
"**NOTE**: The files may end with `fold_X`, where `X` is the fold number. This is used for cross-validation purposes."
]
},
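A small helper (hypothetical, for illustration only) showing how the naming convention above maps parameters to file names:

```python
def processed_filename(window_size, overlap, sample_rate=None, fold=None):
    # Follows `windowed_{window_size}_overlap_{overlap}_rate_{sample_rate}_df.csv`;
    # when sample_rate is omitted it defaults to 1 hour and is left out of the name.
    name = f"windowed_{window_size}_overlap_{overlap}"
    if sample_rate is not None:
        name += f"_rate_{sample_rate}"
    name += "_df"
    if fold is not None:
        name += f"_fold_{fold}"  # cross-validation fold suffix
    return name + ".csv"

print(processed_filename(16, 0, "10min"))  # windowed_16_overlap_0_rate_10min_df.csv
```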
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's import some libraries"
]
},
{
@@ -38,6 +91,13 @@
"from torchmetrics import MeanSquaredError"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load data and inspect"
]
},
{
"cell_type": "code",
"execution_count": 2,
@@ -425,6 +485,20 @@
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a [LightningDataModule](https://lightning.ai/docs/pytorch/stable/data/datamodule.html).\n",
"* The first parameter is the path to the CSV file.\n",
"* `participants`: a list of participants to include in the dataset. If nothing is passed, all participants in the CSV are included.\n",
"* `batch_size`: the batch size to use in the dataloader.\n",
"* `num_workers`: the number of workers to use in the dataloader.\n",
"* `reshape`: the shape of the input data. For the LSTM-AutoEncoder, it is `(sequence_length, num_features)`, or, in our case, `(16, 1)`.\n",
"\n",
"**NOTE**: The training data contains only rows where `baseline` is True; the test data contains only rows where `baseline` is False."
]
},
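The core logic described above (filter participants, reshape the `RHR-*` columns to `(sequence_length, num_features)`, and split on the `baseline` flag) can be sketched as plain pandas/NumPy code. The function name is hypothetical; in the notebook this logic lives inside the datamodule's dataloaders:

```python
import numpy as np
import pandas as pd

def make_splits(df: pd.DataFrame, participants=None, reshape=(16, 1)):
    # Optionally keep only the requested participants.
    if participants is not None:
        df = df[df["participant_id"].isin(participants)]
    # Collect RHR-0 .. RHR-15 in numeric order and reshape each row
    # into (sequence_length, num_features) for the LSTM-AutoEncoder.
    feats = sorted([c for c in df.columns if c.startswith("RHR-")],
                   key=lambda c: int(c.split("-")[1]))
    x = df[feats].to_numpy(dtype=np.float32).reshape(-1, *reshape)
    # Train only on baseline windows; everything else is test data.
    mask = df["baseline"].to_numpy(dtype=bool)
    return x[mask], x[~mask]  # (train, test)
```

Training on baseline data only is what makes this an anomaly-detection setup: the autoencoder learns to reconstruct "normal" windows, so anomalous windows should reconstruct poorly.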
{
"cell_type": "code",
"execution_count": 3,
@@ -452,6 +526,13 @@
"dm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create the Lightning model"
]
},
{
"cell_type": "code",
"execution_count": 4,
@@ -483,6 +564,13 @@
"model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating the Trainer"
]
},
{
"cell_type": "code",
"execution_count": 5,
@@ -515,6 +603,13 @@
"trainer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fit the model using training data from the datamodule"
]
},
{
"cell_type": "code",
"execution_count": 6,
@@ -1985,9 +2080,16 @@
"trainer.fit(model, dm)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Predicting"
]
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -2000,13 +2102,6 @@
" return losses"
]
},
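The per-window reconstruction losses computed above can be turned into anomaly flags by thresholding. A common sketch follows; the mean plus 3 standard deviations of the baseline losses is an illustrative assumption, not necessarily the threshold this tutorial uses:

```python
import numpy as np

def flag_anomalies(baseline_losses, test_losses, k=3.0):
    # Threshold = mean + k*std of the reconstruction error on baseline data;
    # windows whose error exceeds it are flagged as anomalous.
    mu, sigma = np.mean(baseline_losses), np.std(baseline_losses)
    threshold = mu + k * sigma
    return np.asarray(test_losses) > threshold, threshold

preds, thr = flag_anomalies([0.1, 0.12, 0.11], [0.1, 0.9])
print(preds)  # [False  True]
```

The choice of `k` trades off sensitivity against false alarms and would normally be tuned on validation data.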
{
"cell_type": "code",
"execution_count": 12,
@@ -2243,6 +2338,13 @@
"results_dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualizing Metrics and Confusion Matrix"
]
},
{
"cell_type": "code",
"execution_count": 18,
Binary file added notebooks/anomaly_periods.png
