VTT-action-recognition

This project tries to recognize actions occurring in the American television sitcom, Friends. The original code is from C3D-tensorflow, and I modified it for this project.

I also borrowed some code that implements related networks.

Requirements

  • Ubuntu 16.04
  • CUDA 9.0
  • cuDNN 7.3.1
  • Python 3.6.6
  • Python libraries listed in requirements.txt (including TensorFlow 1.12.0)

How to use

Step 1. Prepare Data

For this project, you need the following two kinds of data.

  • Frames of each Friends video

    1. Download Friends videos.

    2. Extract frames from each video at 5 fps and place them at <project_root>/data/friends/frames/S<season>_EP<episode>/<frame_number>.jpg (see the extraction sketch at the end of this step)

      e.g. <project_root>/data/friends/frames/S01_EP01/00001.jpg

  • Annotations

    1. Download the two annotation files we received from Konan Technology:

      • VTT3_2차년도_메타데이터1차_배포_20180809.zip
      • 20181024_VTT3세부_메타데이터_2차배포.zip
    2. Extract the following JSON files:

      • s<season>_ep<episode>_tag2_visual_Final_180809.json from VTT3_2차년도_메타데이터1차_배포_20180809.zip
      • s<season>_ep<episode>_tag2_visual_final.json from 20181024_VTT3세부_메타데이터_2차배포.zip

      and place the JSON files in the following directory:

      <project_root>/data/friends/annotations

    So, the directory structure will be

    data/
      friends/
        frames/
          S01_EP01/
            00001.jpg
            ...
          ...
        annotations/
          S01_EP01.json
          ...
    
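As a concrete example of the frame-extraction step, here is a minimal sketch that resamples each video to 5 fps with ffmpeg called from Python. The videos/ directory and the .mp4 extension are assumptions about where and how the source videos are stored; adjust them to your setup.

import subprocess
from pathlib import Path

VIDEO_DIR = Path("videos")                  # assumed location of the source videos
FRAME_ROOT = Path("data/friends/frames")    # layout described above

for video in sorted(VIDEO_DIR.glob("S*_EP*.mp4")):
    out_dir = FRAME_ROOT / video.stem       # e.g. data/friends/frames/S01_EP01
    out_dir.mkdir(parents=True, exist_ok=True)
    # -vf fps=5 resamples the video to 5 fps; %05d.jpg yields 00001.jpg, 00002.jpg, ...
    subprocess.run(
        ["ffmpeg", "-i", str(video), "-vf", "fps=5", str(out_dir / "%05d.jpg")],
        check=True,
    )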

Step 2. Analyze Data

By merging duplicated action classes such as Standing up, standing up, and stading up into standing, I get 32 action classes. After removing none (which indicates no action) and dropping action classes assigned to only one clip (cup, ' ' (blank), desk), I am left with the following 28 action classes (a sketch of this normalization follows the figure).

image
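The exact merge rules live in the project's analysis code; the snippet below is only a sketch of the kind of normalization and filtering described above. The SYNONYMS map and the min_clips threshold are illustrative assumptions, not the project's real table.

from collections import Counter

# Illustrative synonym map; the project's real merge table is more complete.
SYNONYMS = {
    "standing up": "standing",
    "stading up": "standing",
}

def normalize(label):
    label = label.strip().lower()
    return SYNONYMS.get(label, label)

def build_classes(clip_labels, min_clips=2):
    """clip_labels: iterable of raw action-string lists, one list per clip."""
    counts = Counter()
    for raw in clip_labels:
        labels = {normalize(l) for l in raw}
        labels.discard("none")   # 'none' marks clips with no action
        labels.discard("")       # blank labels are noise
        counts.update(labels)
    # Keep only classes annotated on at least `min_clips` clips.
    return sorted(c for c, n in counts.items() if n >= min_clips)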

Step 3. Construct dataset from the annotation files

Each ground-truth annotation contains a "start seconds" field, an "end seconds" field, and a list of action classes.

{
  "start seconds": "00:01:46;16",
  "end seconds": "00:01:51;60",
  "actions": [ "Holding something", "standing" ]
}

So, I constructed the dataset through the following steps (a code sketch follows the list).

  1. Since I extracted 5 frames per second from each video, I can convert each annotated timestamp into a frame number.

    e.g.

    {
      "start seconds": frame #531,
      "end seconds": frame #558,
      "actions": [ "Holding something", "standing" ]
    }
    
  2. As specified in the C3D paper, I split each clip into 16-frame segments that overlap by 8 frames.

    e.g.

    [
      {
        "start seconds": frame #531,
        "end seconds": frame #546,
        "actions": [ "Holding something", "standing" ]
      },
      {
        "start seconds": frame #546,
        "end seconds": frame #561,
        "actions": [ "Holding something", "standing" ]
      },
      {
        "start seconds": frame #561,
        "end seconds": frame #576,
        "actions": [ "Holding something", "standing" ]
      }
    ]
    

    If a clip is shorter than 16 frames, I take the median frame between the start and the end and pad around it to 16 frames (so the segment will contain some frames that do not belong to the ground-truth actions).

    {
      "start seconds": the median - 7,
      "end seconds": the median + 8,
      "actions": [ "Holding something", "standing" ]
    }
    
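Putting the two steps together, here is a minimal sketch of how the segments could be built. Two details are assumptions: the field after ';' in a timestamp is treated as hundredths of a second (which reproduces the frame numbers in the example above), and the window stride is 8 frames as stated in step 2, so the boundary handling may differ slightly from the worked example.

FPS = 5        # frames were extracted at 5 fps
CLIP_LEN = 16  # C3D consumes 16-frame clips
STRIDE = 8     # 16-frame windows overlapping by 8 frames

def timestamp_to_frame(ts, fps=FPS):
    """Convert an 'HH:MM:SS;ff' timestamp to a frame number (';ff' assumed to be 1/100 s)."""
    hms, frac = ts.split(";")
    h, m, s = (int(x) for x in hms.split(":"))
    return round((h * 3600 + m * 60 + s + int(frac) / 100.0) * fps)

def make_segments(start_ts, end_ts, actions):
    start = timestamp_to_frame(start_ts)
    end = timestamp_to_frame(end_ts)
    if end - start + 1 < CLIP_LEN:
        # Short clip: pad around the median frame to reach 16 frames.
        mid = (start + end) // 2
        return [{"start": mid - 7, "end": mid + 8, "actions": actions}]
    segments = []
    s = start
    while s + CLIP_LEN - 1 <= end:
        segments.append({"start": s, "end": s + CLIP_LEN - 1, "actions": actions})
        s += STRIDE
    return segments

For example, make_segments("00:01:46;16", "00:01:51;60", ["Holding something", "standing"]) maps the timestamps to frames 531 and 558 and, under these assumptions, yields 16-frame windows starting at frames 531 and 539.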

Step 4. Split dataset into train & test

$ python -m lists.train_test

I divided the dataset into training and testing sets with a balanced label distribution in mind. Since this is a multi-label classification task, I cannot split every label between train and test at exactly the same ratio, so I tried to split the labels with fewer clips more evenly (a sketch of such a split follows the figure).

image
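lists.train_test implements the actual split; the snippet below is only a sketch of one greedy way to obtain a roughly balanced multi-label split, assigning clips that contain rare labels first. The 0.2 test ratio and the rarest-label heuristic are assumptions for illustration.

from collections import Counter

def greedy_split(clips, test_ratio=0.2):
    """clips: list of (clip_id, label_set) pairs, each with at least one label.

    Returns (train_ids, test_ids)."""
    label_counts = Counter(l for _, labels in clips for l in labels)
    # Handle clips whose rarest label is least frequent first, so that
    # labels with few clips end up split close to the target ratio.
    ordered = sorted(clips, key=lambda c: min(label_counts[l] for l in c[1]))
    train, test = [], []
    test_counts = Counter()
    for clip_id, labels in ordered:
        rare = min(labels, key=lambda l: label_counts[l])
        # Send the clip to the test set while its rarest label is still
        # under-represented there; otherwise keep it for training.
        if test_counts[rare] < test_ratio * label_counts[rare]:
            test.append(clip_id)
            test_counts.update(labels)
        else:
            train.append(clip_id)
    return train, test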

Step 5. Train

$ CUDA_VISIBLE_DEVICES=0 python train.py

NOTE: Multi-gpu training is not supported yet.

  • Result

    Base model    Precision    Recall    F1 score
    C3D v1        0.5455       0.5162    0.5299
    C3D v2        0.8040       0.7992    0.8384

    image

  • Examples

    image

    image
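For reference, this is how micro-averaged precision, recall, and F1 can be computed for multi-label outputs. The 0.5 sigmoid threshold and the micro-averaging are assumptions and may not match how the table above was produced.

import numpy as np

def multilabel_prf(y_true, y_prob, threshold=0.5):
    """Micro-averaged precision / recall / F1.

    y_true: (num_clips, num_classes) binary ground-truth matrix.
    y_prob: (num_clips, num_classes) per-class scores, e.g. sigmoid outputs.
    """
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1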

Step 6. Predict

$ python -m lists.episodes
$ CUDA_VISIBLE_DEVICES=0 python predict.py

It will generate JSON files in <project root>/outputs/predictions, which are used to generate demo videos, and JSON Lines files in <project root>/outputs/integration for integration. The JSON schema follows the one defined at https://github.com/uilab-vtt/knowledge-graph-input.

Step 7. Demo

$ python demo.py

Using the files generated in Step 6, it will generate demo videos in <project root>/outputs/demo for each Friends episode.

References

  • Tran, Du, et al. "Learning spatiotemporal features with 3d convolutional networks." Proceedings of the IEEE international conference on computer vision. 2015.

Acknowledgements

This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (2017-0-01780, The technology development for event recognition/relational reasoning and learning knowledge based system for video understanding).
