
updates README, first contest white paper; adds second contest white paper
Kelly committed May 20, 2022
1 parent db4a234 commit e86d4fb
Showing 3 changed files with 107 additions and 6 deletions.
73 changes: 68 additions & 5 deletions README.md
@@ -1,8 +1,71 @@
# NASA Cognitive State Determination :thought_balloon:
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)
![Pandas](https://img.shields.io/badge/pandas-%23150458.svg?style=for-the-badge&logo=pandas&logoColor=white)
![NumPy](https://img.shields.io/badge/numpy-%23013243.svg?style=for-the-badge&logo=numpy&logoColor=white)

In the near future, Astronauts will need to carry out autonomous operations as they venture beyond Low Earth Orbit. For many of the activities that the astronauts perform, the same training is provided to all crew members months ahead of their mission. A system that provides real-time operations support that is optimized and tailored to each astronaut's psychophysiological state at the time of the activities and which can be used during the training leading up to the mission would significantly improve human spaceflight operations.

In this challenge series, solvers were asked to determine the present cognitive state of a trainee based on biosensor data, as well as predict what their future cognitive state will be, three seconds in the future.

## Data loading

The provided data has a high frequency (one second = thousands of rows), but the labeling was done manually, and the final predictions are expected to have a frequency of one row per second. I decided to transform the data so that one second equals one row for both training and testing. I rounded all timestamps to the closest second and tested several approaches:
1. Take the first value
2. Take the last value
3. Calculate the mean value for numerical values and mode for the target
4. Calculate mean and std for numerical values and mode for target

The first approach showed the best results.
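A minimal pandas sketch of the winning variant (round timestamps to the closest second, then keep the first row per second); the toy data and column names here are hypothetical:

```python
import pandas as pd

# Toy high-frequency data: several rows per second (hypothetical columns)
df = pd.DataFrame({
    "timestamp": [0.1, 0.4, 0.9, 1.2, 1.7, 2.3],
    "Zephyr_HR": [60, 61, 62, 70, 71, 80],
})

# Round every timestamp to the closest second...
df["sec"] = df["timestamp"].round().astype(int)

# ...and take the first value within each second (approach 1 above)
per_second = df.groupby("sec", as_index=False).first()
```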

## EDA

During EDA, I noticed that the timestamps might have “holes” between neighbors: if the time delta between two neighboring rows is above one second, there is a “hole” between them, and the two neighbors belong to different “sessions.” I noticed that the sensor data might differ between sessions, and that the actual target is always constant within a session. I incorporated these findings into feature generation and into the postprocessing of my predictions.
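Assuming a per-second timestamp column, the session splitting can be sketched as: a new session starts wherever the gap to the previous row exceeds one second.

```python
import pandas as pd

# Per-second timestamps with "holes" (gaps > 1 s) after 2 and after 6
ts = pd.Series([0, 1, 2, 5, 6, 10])

# A gap above one second marks the start of a new session;
# the cumulative sum turns the gap flags into session ids
session_id = (ts.diff() > 1).cumsum()
```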


## Creating target

We have to predict the cognitive state for time `t` and `t+3`. The target for `t` is equal to the value of the `induced_state` column. The target for `t+3` is the same as the target for `t`, because the cognitive state is constant within a session. Since the data and target are identical, I decided to train a single model and use it for both the `t` and `t+3` predictions. Note: there is a separation between the `t` and `t+3` models in the code; I kept it in case the data differs in the future.


## Feature generation

I had several ideas for feature generation, and I combined them into the following groups.

1. Raw sensor data: provide the data “as is.”
2. Rolling statistics with different time windows (5 seconds, 999999 seconds), both per session and “global” (i.e., with no session separation). Rolling statistics include: mean, std, and z-score: [x - mean(x)] / std(x).
3. Shift features, i.e., the value of the sensor data a second ago, two seconds ago, etc.
4. Features based on interactions between sensor data, e.g., the value of Zephyr_HR divided by the value of Zephyr_HRV.
5. Features based on the distances between eye positions and gaze points.
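Groups 2–4 can be sketched with pandas; a `session` column (from the EDA step) and toy sensor values are assumed here:

```python
import pandas as pd

# Toy per-second data with a session id (hypothetical values)
df = pd.DataFrame({
    "session": [0, 0, 0, 0, 1, 1, 1],
    "Zephyr_HR": [60.0, 62.0, 64.0, 66.0, 80.0, 82.0, 84.0],
    "Zephyr_HRV": [50.0, 49.0, 48.0, 47.0, 30.0, 31.0, 32.0],
})
g = df.groupby("session")["Zephyr_HR"]

# Group 2: rolling statistics within each session (3-second window here)
df["hr_roll_mean"] = g.transform(lambda s: s.rolling(3, min_periods=1).mean())
df["hr_roll_std"] = g.transform(lambda s: s.rolling(3, min_periods=1).std())
df["hr_zscore"] = (df["Zephyr_HR"] - df["hr_roll_mean"]) / df["hr_roll_std"]

# Group 3: shift features (value one and two seconds ago, per session)
df["hr_lag1"] = g.shift(1)
df["hr_lag2"] = g.shift(2)

# Group 4: interaction feature, HR divided by HRV
df["hr_div_hrv"] = df["Zephyr_HR"] / df["Zephyr_HRV"]
```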


## Validation

I used stratified group k-fold for validation. Stratified = each fold has approximately the same number of samples from each class. Group = the provided `test_suite` column.


## Model

I used the LightGBM classifier and optimized its hyperparameters with Optuna. The final prediction is the average of the predictions of several LightGBM classifiers with different hyperparameters.

## Postprocessing

As mentioned in EDA, the target is the same within a single session. Therefore, I post-processed predictions by calculating the rolling average of the model’s predictions from the beginning of the session till time `t` (including time `t`) for which we’re making predictions.
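Since the average runs from the session start up to and including time `t`, this is an expanding mean per session; a sketch with hypothetical per-second probabilities:

```python
import pandas as pd

# Raw per-second model probabilities with session ids (hypothetical values)
df = pd.DataFrame({
    "session": [0, 0, 0, 1, 1],
    "proba": [0.9, 0.7, 0.8, 0.2, 0.4],
})

# Expanding mean from the start of each session up to and including time t
df["proba_smoothed"] = df.groupby("session")["proba"].transform(
    lambda s: s.expanding().mean()
)
```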

## Important features

To determine the most important features for making predictions for time `t`, I used the built-in SHAP values calculation. Then I selected the top-3 sensor features from the output.

## Libraries

You can find the list of libraries used in `requirenments.txt`.


## Steps to Build & Run

- Step 1: `cd code` & build the Docker image: `docker build -t latest .`
- Step 2: run model training
`docker run -v /path_to_data/:/work/data/ -v /path_to_save_model/:/work/model/ latest sh train.sh /work/data/data_training.zip`
- Step 3: run model prediction
`docker run -v /path_to_data/:/data/ -v /path_to_model/:/work/model/ latest sh test.sh /data/test_data_file.zip /data/file_to_save_predictions.csv`
2 changes: 1 addition & 1 deletion WHITEPAPER.md → WHITEPAPER1.md
@@ -1,4 +1,4 @@
# Cognitive State Determination Whitepaper
## Overview
This codebase is a winner of the Topcoder NASA Cognitive State Determination marathon match. As part of the final submission, the competitor was asked to complete this document. Personal details have been removed.

38 changes: 38 additions & 0 deletions WHITEPAPER2.md
@@ -0,0 +1,38 @@
# Fault Tolerance Whitepaper
## Overview
This codebase is a winner of the Topcoder NASA Cognitive State Determination - Fault Tolerance marathon match. As part of the final submission, the competitor was asked to complete this document. Personal details have been removed.

## 1. Introduction
Tell us a bit about yourself, and why you have decided to participate in the contest.
- **Handle:** fly_fly
- **Placement you achieved in the MM:** 1st
- **About you:** freelancer
- **Why you participated in the MM:** TCO points, money, fame

## 2. Solution Development
How did you solve the problem? What approaches did you try, what choices did you make, and why? What alternative approaches did you consider? Please also describe the cross-validation approach you used.
- During EDA, I noticed that the eye-distance feature has a bad influence on the model's performance. In my opinion, the loss of the left- or right-eye features may amplify the error because of the squaring in the distance calculation. In the end, I did not use the eye-distance feature, and this reduction improved the score a lot (about +10%).
- To make the model less sensitive to the loss of some features, I increased the L1 and L2 regularization and added more rolling time windows of features based on the original data (including mean, median, std, and max).
- Stratified group k-fold is also used for validation. Stratified = each fold has approximately the same number of samples from each class. Group = the provided `test_suite` column.

## 3. Open Source Resources, Frameworks and Libraries
Please specify the name of the open source resource, the URL where it can be found, and its license type:
- all libraries are open-sourced
- pandas==1.3.5
- numpy==1.22.1
- lightgbm==3.3.2
- scikit-learn==1.0.2
- tqdm==4.62.3
- scipy==1.7.3
- optuna==2.10.0

## 4. Potential Algorithm Improvements
Please specify any potential improvements that can be made to the algorithm:
- You can try other models, such as TabNet or other neural networks.
- You can try more hyperparameter tuning.


## 5. Algorithm Limitations
Please specify any potential limitations with the algorithm:
- This is not a pretrained model, so you need to train it from scratch before predicting.
- Too many missing data features may result in a worse score.
