Data cleaning (#166)

* Refactor data cleaning module: move it from example workflow to main directory
* Replace NAs with 0 in selected event-based features
* Add one step to drop highly correlated features

Co-authored-by: Weiyu <[email protected]>
carissalow · Nov 19, 2021 · 5bad3eb
1 parent 296960f
commit 5bad3eb

Showing 77 changed files with 1,607 additions and 1,263 deletions.
diff --git a/Snakefile b/Snakefile
@@ -394,6 +394,13 @@ if config["HEATMAP_PHONE_DATA_YIELD_PER_PARTICIPANT_PER_TIME_SEGMENT"]["PLOT"]:
 if config["HEATMAP_FEATURE_CORRELATION_MATRIX"]["PLOT"]:
     files_to_compute.append("reports/data_exploration/heatmap_feature_correlation_matrix.html")
 
+# Data Cleaning
+for provider in config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].keys():
+    if config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"][provider]["COMPUTE"]:
+        files_to_compute.extend(expand("data/processed/features/{pid}/all_sensor_features_cleaned_" + provider.lower() +".csv", pid=config["PIDS"]))
+for provider in config["ALL_CLEANING_OVERALL"]["PROVIDERS"].keys():
+    if config["ALL_CLEANING_OVERALL"]["PROVIDERS"][provider]["COMPUTE"]:
+        files_to_compute.extend(expand("data/processed/features/all_participants/all_sensor_features_cleaned_" + provider.lower() +".csv"))
 
 rule all:
     input:
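
The two loops added to the Snakefile turn the `COMPUTE` flags into concrete output paths. As a rough illustration only, the sketch below mirrors that logic in plain Python without Snakemake. The `config` dictionary and the `cleaning_targets` helper are hypothetical stand-ins, and the f-string loop replaces Snakemake's `expand()`.

```python
# Hypothetical stand-in for the relevant keys of config.yaml (values are illustrative).
config = {
    "PIDS": ["p01", "p02"],
    "ALL_CLEANING_INDIVIDUAL": {"PROVIDERS": {"RAPIDS": {"COMPUTE": True}}},
    "ALL_CLEANING_OVERALL": {"PROVIDERS": {"RAPIDS": {"COMPUTE": False}}},
}


def cleaning_targets(config):
    """Mirror the two Snakefile loops; plain Python replaces Snakemake's expand()."""
    targets = []
    # One cleaned file per participant for every enabled individual-level provider.
    for provider, settings in config["ALL_CLEANING_INDIVIDUAL"]["PROVIDERS"].items():
        if settings["COMPUTE"]:
            targets.extend(
                f"data/processed/features/{pid}/all_sensor_features_cleaned_{provider.lower()}.csv"
                for pid in config["PIDS"]
            )
    # A single study-level file for every enabled overall provider.
    for provider, settings in config["ALL_CLEANING_OVERALL"]["PROVIDERS"].items():
        if settings["COMPUTE"]:
            targets.append(
                "data/processed/features/all_participants/"
                f"all_sensor_features_cleaned_{provider.lower()}.csv"
            )
    return targets


targets = cleaning_targets(config)
```

With the flags above, only the per-participant files for `p01` and `p02` are requested, because the overall provider is switched off.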

diff --git a/config.yaml b/config.yaml
@@ -565,3 +565,43 @@ HEATMAP_FEATURE_CORRELATION_MATRIX:
   CORR_THRESHOLD: 0.1
   CORR_METHOD: "pearson" # choose from {"pearson", "kendall", "spearman"}
 
+
+########################################################################################################################
+#                                                    Data Cleaning                                                     #
+########################################################################################################################
+
+ALL_CLEANING_INDIVIDUAL:
+  PROVIDERS:
+    RAPIDS:
+      COMPUTE: False
+      IMPUTE_SELECTED_EVENT_FEATURES:
+        COMPUTE: True
+        MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
+      COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
+      COLS_VAR_THRESHOLD: True
+      ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
+      DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
+      DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
+      DROP_HIGHLY_CORRELATED_FEATURES:
+        COMPUTE: True
+        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
+        CORR_THRESHOLD: 0.95
+      SRC_SCRIPT: src/features/all_cleaning_individual/rapids/main.R
+
+ALL_CLEANING_OVERALL:
+  PROVIDERS:
+    RAPIDS:
+      COMPUTE: False
+      IMPUTE_SELECTED_EVENT_FEATURES:
+        COMPUTE: True
+        MIN_DATA_YIELDED_MINUTES_TO_IMPUTE: 0.33
+      COLS_NAN_THRESHOLD: 0.3 # set to 1 to disable
+      COLS_VAR_THRESHOLD: True
+      ROWS_NAN_THRESHOLD: 0.3 # set to 1 to disable
+      DATA_YIELD_FEATURE: RATIO_VALID_YIELDED_HOURS # RATIO_VALID_YIELDED_HOURS or RATIO_VALID_YIELDED_MINUTES
+      DATA_YIELD_RATIO_THRESHOLD: 0.5 # set to 0 to disable
+      DROP_HIGHLY_CORRELATED_FEATURES:
+        COMPUTE: True
+        MIN_OVERLAP_FOR_CORR_THRESHOLD: 0.5
+        CORR_THRESHOLD: 0.95
+      SRC_SCRIPT: src/features/all_cleaning_overall/rapids/main.R
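
The threshold keys in the two config blocks above correspond to a short sequence of row and column filters. The cleaning scripts themselves (`main.R`) are not shown in this excerpt, so the following is only a minimal Python sketch of how such thresholds might be applied. The function `clean_features`, the list-of-dicts layout with `None` as NA, the handling of rows with a missing yield, and the choice to return feature columns only are all assumptions made for illustration.

```python
from statistics import pvariance


def nan_ratio(values):
    """Fraction of missing (None) entries in a list."""
    return sum(v is None for v in values) / len(values)


def clean_features(rows, yield_col, yield_threshold,
                   cols_nan_threshold, rows_nan_threshold, drop_zero_variance=True):
    # DATA_YIELD_RATIO_THRESHOLD: discard rows whose data yield is below the threshold.
    # Assumption: rows with a missing yield are discarded as well.
    rows = [r for r in rows if r[yield_col] is not None and r[yield_col] >= yield_threshold]
    if not rows:
        return []
    features = [c for c in rows[0] if c != yield_col]

    # COLS_NAN_THRESHOLD: discard columns whose missing ratio exceeds the threshold.
    features = [c for c in features if nan_ratio([r[c] for r in rows]) <= cols_nan_threshold]

    # COLS_VAR_THRESHOLD: discard columns whose observed values never vary.
    if drop_zero_variance:
        def varies(col):
            observed = [r[col] for r in rows if r[col] is not None]
            return len(observed) > 1 and pvariance(observed) > 0
        features = [c for c in features if varies(c)]
    if not features:
        return []

    # ROWS_NAN_THRESHOLD: discard rows whose missing ratio exceeds the threshold.
    return [{c: r[c] for c in features} for r in rows
            if nan_ratio([r[c] for c in features]) <= rows_nan_threshold]


# Toy data; thresholds are chosen for this example, not taken from config.yaml.
rows = [
    {"yield": 0.9, "a": 1.0, "b": None, "c": 5.0, "d": 1.0},
    {"yield": 0.8, "a": 2.0, "b": None, "c": 5.0, "d": None},
    {"yield": 0.2, "a": 3.0, "b": 1.0, "c": 5.0, "d": 2.0},
    {"yield": 0.7, "a": 4.0, "b": 2.0, "c": 5.0, "d": 3.0},
]
cleaned = clean_features(rows, "yield", yield_threshold=0.5,
                         cols_nan_threshold=0.5, rows_nan_threshold=0.3)
```

In this toy run the third row is removed for low yield, `b` for too many missing values, `c` for zero variance, and the second row for missing half of the remaining features. Setting a NaN threshold to 1 or the yield threshold to 0 effectively disables the corresponding filter, as the config comments note.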
diff --git a/docs/workflow-examples/analysis.md → docs/analysis/complete-workflow-example.md b/docs/workflow-examples/analysis.md → docs/analysis/complete-workflow-example.md
@@ -1,8 +1,8 @@
 # Analysis Workflow Example
 
 !!! info "TL;DR"
-    - In addition to using RAPIDS to extract behavioral features and create plots, you can structure your data analysis within RAPIDS (i.e. cleaning your features and creating ML/statistical models)
-    - We include an analysis example in RAPIDS that covers raw data processing, cleaning, feature extraction, machine learning modeling, and evaluation
+    - In addition to using RAPIDS to extract behavioral features, create plots, and clean sensor features, you can structure your data analysis within RAPIDS (i.e. creating ML/statistical models and evaluating your models)
+    - We include an analysis example in RAPIDS that covers raw data processing, feature extraction, cleaning, machine learning modeling, and evaluation
     - Use this example as a guide to structure your own analysis within RAPIDS
     - RAPIDS analysis workflows are compatible with your favorite data science tools and libraries
     - RAPIDS analysis workflows are reproducible and we encourage you to publish them along with your research papers
@@ -69,12 +69,12 @@ Note you will see a lot of warning messages, you can ignore them since they happ
 ??? info "6. Feature cleaning."
     In this stage we perform four steps to clean our sensor feature file. First, we discard days with a data yield hour ratio less than or equal to 0.75, i.e. we include days with at least 18 hours of data. Second, we drop columns (features) with more than 30% of missing rows. Third, we drop columns with zero variance. Fourth, we drop rows (days) with more than 30% of missing columns (features). In this cleaning stage several parameters are created and exposed in `example_profile/example_config.yaml`. 
 
-    After this step, we kept 163 features over 11 days for the individual model of p01, 101 features over 12 days for the individual model of p02 and 109 features over 20 days for the population model. Note that the difference in the number of features between p01 and p02 is mostly due to iOS restrictions that stops researchers from collecting the same number of sensors than in Android phones. 
+    After this step, we kept 173 features over 11 days for the individual model of p01, 101 features over 12 days for the individual model of p02 and 117 features over 22 days for the population model. Note that the difference in the number of features between p01 and p02 is mostly due to iOS restrictions that stop researchers from collecting the same number of sensors as on Android phones. 
 
     Feature cleaning for the individual models is done in the `clean_sensor_features_for_individual_participants` rule and for the population model in the `clean_sensor_features_for_all_participants` rule in `rules/models.smk`.
 
 ??? info "7. Merge features and targets."
-    In this step we merge the cleaned features and target labels for our individual models in the `merge_features_and_targets_for_individual_model` rule in `rules/models.smk`. Additionally, we merge the cleaned features, target labels, and demographic features of our two participants for the population model in the `merge_features_and_targets_for_population_model` rule in `rules/models.smk`. These two merged files are the input for our individual and population models. 
+    In this step we merge the cleaned features and target labels for our individual models in the `merge_features_and_targets_for_individual_model` rule in `rules/features.smk`. Additionally, we merge the cleaned features, target labels, and demographic features of our two participants for the population model in the `merge_features_and_targets_for_population_model` rule in `rules/features.smk`. These two merged files are the input for our individual and population models. 
 
 ??? info "8. Modelling."
     This stage has three phases: model building, training and evaluation. 

diff --git a/docs/analysis/data-cleaning.md b/docs/analysis/data-cleaning.md
@@ -0,0 +1,92 @@
+Data Cleaning
+=============
+
+The goal of this module is to perform basic cleaning tasks on the behavioral features that RAPIDS computes. You might need to do further processing depending on your analysis objectives. This module can clean features at the individual level and at the study level. If you are interested in creating individual models (using each participant's features independently of the others), use [`ALL_CLEANING_INDIVIDUAL`]. If you are interested in creating population models (using everyone's data in the same model), use [`ALL_CLEANING_OVERALL`].
+
+## Clean sensor features for individual participants
+
+!!! info "File Sequence"
+    ```bash
+    - data/processed/features/{pid}/all_sensor_features.csv
+    - data/processed/features/{pid}/all_sensor_features_cleaned_{provider_key}.csv
+    ```
+
+### RAPIDS provider
+
+Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS]`:
+
+|Key&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;            | Description |
+|----------------|-----------------------------------------------------------------------------------------------------------------------------------
+|`[COMPUTE]` | Set to `True` to execute the cleaning tasks described below. You can use the parameters of each task to tweak them or deactivate them|
+|`[IMPUTE_SELECTED_EVENT_FEATURES]`     | Fill NAs with 0 only for event-based features, see table below
+|`[COLS_NAN_THRESHOLD]`                 | Discard columns with missing value ratios higher than `[COLS_NAN_THRESHOLD]`. Set to 1 to disable
+|`[COLS_VAR_THRESHOLD]`                 | Set to `True` to discard columns with zero variance
+|`[ROWS_NAN_THRESHOLD]`                 | Discard rows with missing value ratios higher than `[ROWS_NAN_THRESHOLD]`. Set to 1 to disable
+|`[DATA_YIELD_FEATURE]`                 | `RATIO_VALID_YIELDED_HOURS` or `RATIO_VALID_YIELDED_MINUTES`
+|`[DATA_YIELD_RATIO_THRESHOLD]`         | Discard rows with `ratiovalidyieldedhours` or `ratiovalidyieldedminutes` feature less than `[DATA_YIELD_RATIO_THRESHOLD]`. The feature name is determined by `[DATA_YIELD_FEATURE]` parameter. Set to 0 to disable
+|`[DROP_HIGHLY_CORRELATED_FEATURES]`    | Discard highly correlated features, see table below
+
+Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS][IMPUTE_SELECTED_EVENT_FEATURES]`:
+
+|Parameters                             | Description                                                    |
+|-------------------------------------- |----------------------------------------------------------------|
+|`[COMPUTE]`                            | Set to `True` to fill NAs with 0 for phone event-based features
+|`[MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` | Any missing (NA) value of the selected event-based features in a time segment instance with phone data yield > `[MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` will be replaced with zero. See below for an explanation. |
+
+Parameters description for `[ALL_CLEANING_INDIVIDUAL][PROVIDERS][RAPIDS][DROP_HIGHLY_CORRELATED_FEATURES]`:
+
+|Parameters                             | Description                                                    |
+|-------------------------------------- |----------------------------------------------------------------|
+|`[COMPUTE]`                            | Set to `True` to drop highly correlated features
+|`[MIN_OVERLAP_FOR_CORR_THRESHOLD]`     | Minimum ratio of observations required per pair of columns (features) to be considered as a valid correlation. 
+|`[CORR_THRESHOLD]` | The absolute values of pair-wise correlations are calculated. If two variables have a valid correlation higher than `[CORR_THRESHOLD]`, we look at the mean absolute correlation of each variable and remove the variable with the largest mean absolute correlation.
+
+The steps below clean sensor features for individual participants. Currently, only **phone sensors** are considered.
+
+??? info "1. Fill NA with 0 for the selected event features."
+    Some event features should be zero instead of NA. In this step, we fill those missing features with 0 when the `phone_data_yield_rapids_ratiovalidyieldedminutes` column is higher than the `[IMPUTE_SELECTED_EVENT_FEATURES][MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]` parameter. Plugin sensors such as Activity Recognition are not considered. You can skip this step by setting `[IMPUTE_SELECTED_EVENT_FEATURES][COMPUTE]` to `False`.
+
+    Take the phone calls sensor as an example. If there are no call records during a time segment for a participant, then either (1) the calls sensor was not working during that time segment, or (2) the calls sensor was working and the participant did not have any calls during that time segment. To differentiate these two situations, we assume the selected sensors are working when `phone_data_yield_rapids_ratiovalidyieldedminutes > [MIN_DATA_YIELDED_MINUTES_TO_IMPUTE]`.
+
+    The following phone event-based features are considered currently:
+
+      - Application foreground: countevent, countepisode, minduration, maxduration, meanduration, sumduration.
+      - Battery: all features.
+      - Calls: count, distinctcontacts, sumduration, minduration, maxduration, meanduration, modeduration.
+      - Keyboard: sessioncount, averagesessionlength, changeintextlengthlessthanminusone, changeintextlengthequaltominusone, changeintextlengthequaltoone, changeintextlengthmorethanone, maxtextlength, totalkeyboardtouches.
+      - Messages: count, distinctcontacts.
+      - Screen: sumduration, maxduration, minduration, avgduration, countepisode.
+      - WiFi: all connected and visible features.
+
+??? info "2. Discard unreliable rows."
+    Extracted features might not be reliable if the sensor only works for a short period during a time segment. In this step, we discard rows when the `phone_data_yield_rapids_ratiovalidyieldedminutes` column or the `phone_data_yield_rapids_ratiovalidyieldedhours` column is less than the `[DATA_YIELD_RATIO_THRESHOLD]` parameter. We recommend using the `phone_data_yield_rapids_ratiovalidyieldedminutes` column (set `[DATA_YIELD_FEATURE]` to `RATIO_VALID_YIELDED_MINUTES`) on time segments that are shorter than two or three hours and `phone_data_yield_rapids_ratiovalidyieldedhours` (set `[DATA_YIELD_FEATURE]` to `RATIO_VALID_YIELDED_HOURS`) for longer segments. We do not recommend skipping this step, but you can do so by setting `[DATA_YIELD_RATIO_THRESHOLD]` to 0.
+
+??? info "3. Discard columns (features) with too many missing values."
+    In this step, we discard columns with missing value ratios higher than `[COLS_NAN_THRESHOLD]`. We do not recommend skipping this step, but you can do so by setting `[COLS_NAN_THRESHOLD]` to 1.
+
+??? info "4. Discard columns (features) with zero variance."
+    In this step, we discard columns with zero variance. We do not recommend skipping this step, but you can do so by setting `[COLS_VAR_THRESHOLD]` to `False`.
+
+??? info "5. Drop highly correlated features."
+    As highly correlated features might not bring additional information and will increase the complexity of a model, we drop them in this step. The absolute values of pair-wise correlations are calculated. Each correlation between two variables is regarded as valid only if the ratio of valid value pairs (i.e. non-NA pairs) is greater than or equal to `[DROP_HIGHLY_CORRELATED_FEATURES][MIN_OVERLAP_FOR_CORR_THRESHOLD]`. If two variables have a valid correlation coefficient higher than `[DROP_HIGHLY_CORRELATED_FEATURES][CORR_THRESHOLD]`, we look at the mean absolute correlation of each variable and remove the variable with the largest mean absolute correlation. This step can be skipped by setting `[DROP_HIGHLY_CORRELATED_FEATURES][COMPUTE]` to `False`.
+
+??? info "6. Discard rows with too many missing values."
+    In this step, we discard rows with missing value ratios higher than `[ROWS_NAN_THRESHOLD]`. In other words, we discard time segments (e.g. days) that did not have enough data to be considered reliable. This step is similar to step 2 except the ratio is computed based on NA values instead of a phone data yield threshold. We do not recommend skipping this step, but you can do so by setting `[ROWS_NAN_THRESHOLD]` to 1.
+
+
+
+
+## Clean sensor features for all participants
+
+!!! info "File Sequence"
+    ```bash
+    - data/processed/features/all_participants/all_sensor_features.csv
+    - data/processed/features/all_participants/all_sensor_features_cleaned_{provider_key}.csv
+    ```
+
+
+### RAPIDS provider
+
+The parameter descriptions and steps are the same as in the [RAPIDS provider](#rapids-provider) section above for individual participants.
+
+
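
Step 5 of the new documentation (drop highly correlated features) is the most algorithmic part of the module, so a small worked example may help. The sketch below is a pure-Python approximation of that description, not the code in `main.R`, which this excerpt does not show. The helper names, the use of Pearson correlation, the order in which pairs are visited, and the tie-breaking rule are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; None if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return None
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)


def drop_highly_correlated(columns, min_overlap, corr_threshold):
    """columns maps feature name -> list of values, with None standing in for NA."""
    names = list(columns)
    n_rows = len(columns[names[0]])
    abs_corr = {}
    for a, b in combinations(names, 2):
        pairs = [(x, y) for x, y in zip(columns[a], columns[b])
                 if x is not None and y is not None]
        # A correlation is valid only if enough rows have both values present.
        if len(pairs) < 2 or len(pairs) / n_rows < min_overlap:
            continue
        r = pearson(*zip(*pairs))
        if r is not None:
            abs_corr[frozenset((a, b))] = abs(r)

    def mean_abs_corr(name):
        values = [v for pair, v in abs_corr.items() if name in pair]
        return sum(values) / len(values) if values else 0.0

    dropped = set()
    # Assumption: visit the strongest correlations first.
    for pair, r in sorted(abs_corr.items(), key=lambda item: -item[1]):
        a, b = sorted(pair)
        if r > corr_threshold and a not in dropped and b not in dropped:
            # Remove the variable with the larger mean absolute correlation
            # (ties fall to the second name, an arbitrary choice).
            dropped.add(a if mean_abs_corr(a) > mean_abs_corr(b) else b)
    return [name for name in names if name not in dropped]


features = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [1.0, 2.0, 3.0, 4.0, 6.0],      # nearly collinear with x
    "z": [2.0, 1.0, 4.0, 3.0, 5.0],
    "w": [1.0, None, None, None, 5.0],   # overlaps other columns in only 2 of 5 rows
}
kept = drop_highly_correlated(features, min_overlap=0.5, corr_threshold=0.95)
```

Here `x` and `y` correlate above 0.95, and `y` is removed because its mean absolute correlation with the other features is slightly higher. Column `w` is kept because its correlations rest on too few overlapping rows to count as valid.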
diff --git a/docs/workflow-examples/minimal.md → docs/analysis/minimal.md b/docs/workflow-examples/minimal.md → docs/analysis/minimal.md
diff --git a/docs/change-log.md b/docs/change-log.md
@@ -3,6 +3,7 @@
 - Add firststeptime and laststeptime features to FITBIT_STEPS_INTRADAY RAPIDS provider
 - Update tests for Fitbit steps intraday features
 - Add tests for phone battery features
+- Add a data cleaning module to replace NAs with 0 in selected event-based features, discard unreliable rows and columns, discard columns with zero variance, and discard highly correlated columns
 ## v1.6.0
 - Refactor PHONE_CALLS RAPIDS provider to compute features based on call episodes or events
 - Refactor PHONE_LOCATIONS DORYAB provider to compute features based on location episodes

diff --git a/docs/features/phone-keyboard.md b/docs/features/phone-keyboard.md
@@ -6,6 +6,12 @@ Sensor parameters description for `[PHONE_KEYBOARD]`:
 |----------------|-----------------------------------------------------------------------------------------------------------------------------------
 |`[CONTAINER]`| Data stream [container](../../datastreams/data-streams-introduction/) (database table, CSV file, etc.) where the keyboard data is stored
 
+## RAPIDS provider
+
+!!! info "Available time segments and platforms"
+    - Available for all time segments
+    - Available for Android only
+
 !!! info "File Sequence"
     ```bash
     - data/raw/{pid}/phone_keyboard_raw.csv

diff --git a/docs/index.md b/docs/index.md
@@ -1,12 +1,12 @@
 # Welcome to RAPIDS documentation
 
-Reproducible Analysis Pipeline for Data Streams (RAPIDS) allows you to process smartphone and wearable data to [extract](features/feature-introduction.md) and [create](features/add-new-features.md) **behavioral features** (a.k.a. digital biomarkers), [visualize](visualizations/data-quality-visualizations.md) mobile sensor data, and [structure](workflow-examples/analysis.md) your analysis into reproducible workflows.
+Reproducible Analysis Pipeline for Data Streams (RAPIDS) allows you to process smartphone and wearable data to [extract](features/feature-introduction.md) and [create](features/add-new-features.md) **behavioral features** (a.k.a. digital biomarkers), [visualize](visualizations/data-quality-visualizations.md) mobile sensor data, and [structure](analysis/complete-workflow-example.md) your analysis into reproducible workflows.
 
 RAPIDS is open source, documented, multi-platform, modular, tested, and reproducible. At the moment, we support [data streams](datastreams/data-streams-introduction) logged by smartphones, Fitbit wearables, and Empatica wearables (the latter in collaboration with the [DBDP](https://dbdp.org/)). 
 
 !!! tip "Where do I start?"
 
-    :material-power-standby: New to RAPIDS? Check our [Overview + FAQ](setup/overview/) and [minimal example](workflow-examples/minimal)
+    :material-power-standby: New to RAPIDS? Check our [Overview + FAQ](setup/overview/) and [minimal example](analysis/minimal)
 
     :material-play-speed: [Install](setup/installation), [configure](setup/configuration), and [execute](setup/execution) RAPIDS to [extract](features/feature-introduction.md) and [plot](visualizations/data-quality-visualizations.md) behavioral features
 

diff --git a/docs/setup/overview.md b/docs/setup/overview.md
@@ -23,10 +23,10 @@ Let's review some key concepts we use throughout these docs:
     - [Add your own behavioral features](../../features/add-new-features/) (we can include them in RAPIDS if you want to share them with the community)
     - [Add support for new data streams](../../datastreams/add-new-data-streams/) if yours cannot be processed by RAPIDS yet
     - Create visualizations for [data quality control](../../visualizations/data-quality-visualizations/)  and [feature inspection](../../visualizations/feature-visualizations/)
-    - [Extending RAPIDS to organize your analysis](../../workflow-examples/analysis/) and publish a code repository along with your code
+    - [Extending RAPIDS to organize your analysis](../../analysis/complete-workflow-example/) and publish a code repository along with your code
 
 !!! hint
-    - We recommend you follow the [Minimal Example](../../workflow-examples/minimal/) tutorial to get familiar with RAPIDS
+    - We recommend you follow the [Minimal Example](../../analysis/minimal/) tutorial to get familiar with RAPIDS
 
     - In order to follow any of the previous tutorials, you will have to [Install](../installation/), [Configure](../configuration/), and learn how to [Execute](../execution/) RAPIDS.