This project focuses on distress-related action recognition under a variety of environmental and weather conditions. We hand-crafted synthetic video data using Unreal Engine 5 (UE5), simulating six distress-related human actions: injured waving, jumping, pointing, injured walking (limping), running, and holding something in pain; and six non-distress-related human actions: blowing a kiss, greeting, rumba dancing, salute, silly dancing, and sitting.
We pass this data through a convolutional neural network (CNN) for feature extraction, then use a Long Short-Term Memory (LSTM) network to analyse patterns in the resulting time series. Fully connected layers finally map the learned representations to one of the six distress-related actions, which gives us the action classification.
Dataset: Google Drive Link
| Waving | Jumping | Pointing |
|---|---|---|
| Injured Walking | Running | Holding something in pain |

| Blowing a kiss | Greeting | Rumba Dancing |
|---|---|---|
| Salute | Silly Dancing | Sitting |
- Front view
- Side view
  - Left
  - Right
- Top-down view
- Day
- Night
- Neutral (plain)
- Grassy/Rural
- Road/City
dataset/
├── holding_something_in_pain/
│ ├── holding_something_in_pain_day_rural_right_30fps.mp4
│ ├── holding_something_in_pain_night_grassy_back_30fps.mp4
│ ...
├── jumping/
│ ├── jumping_day_grassy_front_30fps.mp4
│ ...
├── injured_walking/
│ ├── injured_walking_night_plain_left_30fps.mp4
│ ...
...
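Each filename encodes the action, lighting, environment, camera view, and frame rate. Below is a minimal sketch of how such metadata could be recovered when indexing the dataset; the `parse_clip_name` helper and the exact folder/label spellings are illustrative assumptions, not part of the released code.

```python
from pathlib import Path

# Assumed action folder names, split into distress and non-distress groups.
DISTRESS_ACTIONS = [
    "waving", "jumping", "pointing",
    "injured_walking", "running", "holding_something_in_pain",
]
NON_DISTRESS_ACTIONS = [
    "blowing_a_kiss", "greeting", "rumba_dancing",
    "salute", "silly_dancing", "sitting",
]
ALL_ACTIONS = DISTRESS_ACTIONS + NON_DISTRESS_ACTIONS

def parse_clip_name(path: Path) -> dict:
    """Split e.g. 'jumping_day_grassy_front_30fps.mp4' into its metadata fields.

    Assumes the pattern <action>_<time>_<environment>_<view>_<fps>.mp4,
    where the action name itself may contain underscores.
    """
    parts = path.stem.split("_")
    fps, view, environment, time_of_day = parts[-1], parts[-2], parts[-3], parts[-4]
    action = "_".join(parts[:-4])
    return {
        "action": action,
        "time_of_day": time_of_day,
        "environment": environment,
        "view": view,
        "fps": fps,
        "label": ALL_ACTIONS.index(action),
    }

# Example: index every clip by walking the per-action folders under dataset/.
clips = [parse_clip_name(p) for p in Path("dataset").rglob("*.mp4")]
```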
The model architecture combines a convolutional neural network (CNN) with a Long Short-Term Memory (LSTM) network to classify our synthetic video data of distress-related actions. The data consists of 30 fps video segments of the animations rendered in various environments and conditions.
We pass our data through a CNN to extract spatial features from individual frames, which helps the model learn visual patterns.
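As an illustration, one way to implement this per-frame feature extraction is to run a pretrained image backbone over every frame of a clip. The sketch below is not the project's actual implementation: the ResNet-18 backbone and the `(batch, time, channels, height, width)` input layout are assumptions for the example.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FrameFeatureExtractor(nn.Module):
    """Apply a 2D CNN to each frame of a video clip independently."""

    def __init__(self):
        super().__init__()
        # Assumed backbone: ImageNet-pretrained ResNet-18 (torchvision >= 0.13 API).
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Drop the classification head; keep the 512-d pooled features.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.feature_dim = 512

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, channels, height, width)
        b, t, c, h, w = clips.shape
        frames = clips.view(b * t, c, h, w)        # fold time into the batch
        feats = self.cnn(frames).flatten(1)        # (b * t, 512)
        return feats.view(b, t, self.feature_dim)  # (batch, time, 512)
```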
We then feed these learned features into an LSTM, which is better suited to the sequential nature of video and can interpret how the features evolve across time steps.
Finally, a fully connected layer maps the learned representations to the labels of the six distress-related actions.
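Putting the pieces together, here is a hedged sketch of the temporal model and classification head. It reuses the `FrameFeatureExtractor` sketched above; the hidden size, single LSTM layer, clip length, and number of output classes are assumptions for illustration rather than the project's configuration.

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    """CNN feature extractor -> LSTM over time -> fully connected classifier."""

    def __init__(self, num_classes: int, hidden_size: int = 256, num_layers: int = 1):
        super().__init__()
        self.frames = FrameFeatureExtractor()  # per-frame CNN from the sketch above
        self.lstm = nn.LSTM(
            input_size=self.frames.feature_dim,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
        )
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        feats = self.frames(clips)      # (batch, time, 512)
        _, (h_n, _) = self.lstm(feats)  # h_n: (num_layers, batch, hidden)
        return self.fc(h_n[-1])         # logits: (batch, num_classes)

# Example usage on a dummy batch of 16-frame RGB clips.
model = ActionClassifier(num_classes=12)  # 12 classes here; 6 if only distress actions are classified
dummy = torch.randn(2, 16, 3, 224, 224)   # (batch, time, channels, height, width)
logits = model(dummy)                     # shape: (2, 12)
```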