By Zhi-Song Liu*, Robin Courant* and Vicky Kalogeiton
ACCV 2022 (Oral, Best Student Paper Honorable mention)
Project Page | Paper | Data
Python 3.8 | OpenCV | PyTorch 1.12.0 | CUDA 11.3
- Clone the code to your local computer.
git clone https://github.com/robincourant/FunnyNet.git
cd FunnyNet
- Create the working environment.
conda create --name funnynet -y python=3.8
conda activate funnynet
- Install the dependencies.
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
- Run the setup script to install all the dependencies.
./setup.sh
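To check that the installed versions match the requirements above, an optional quick sanity check in Python:

```python
import torch, torchvision, cv2

print(torch.__version__)          # expected: 1.12.0
print(torchvision.__version__)    # expected: 0.13.0
print(torch.version.cuda)         # expected: 11.3
print(torch.cuda.is_available())  # True if the GPU is visible
print(cv2.__version__)            # OpenCV version
```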
- Modify in `ext/TimeSformer/timesformer/models/vit_utils.py`:
  replace `from torch._six import container_abcs` with `import collections.abc as container_abcs`
- Comment out in `ext/TimeSformer/timesformer/models/resnet_helper.py` the line:
  `from torch.nn.modules.linear import _LinearWithBias`
- Download the Friends data:
gdown https://drive.google.com/drive/folders/1ZM6agmEnheiyP0IIrD3Fc7DOubjyu5eO -O ./data --folder
Note: label files are structured as follows: [season, episode, funny-label, start, end]
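A minimal sketch of reading such a label file (assuming the `.pk` files are standard pickles holding one `[season, episode, funny-label, start, end]` entry per segment; the path and file name below are hypothetical):

```python
import pickle

# Load a laughter label file (path and file name are hypothetical).
with open("FunnyNet-data/friends/laughter/labels.pk", "rb") as f:
    labels = pickle.load(f)

# Each entry is expected to follow [season, episode, funny-label, start, end].
for season, episode, funny, start, end in labels[:5]:
    print(season, episode, funny, start, end)
```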
The dataset directory is organized as follows:
FunnyNet-data/
└── tv_show_name/
├── audio/
│ ├── diff/ # `.wav` files with stereo channel difference
│ ├── embedding/ # `.pt` files with audio embedding vectors
│ ├── laughter/ # `.pickle` files with laughter timecodes
│ ├── laughter_segment/ # `.wav` files with detected laughters
│ ├── left/ # `.wav` files with the surround left channel
│ └── raw/ # `.wav` files with extracted raw audio from videos
├── laughter/ # `.pk` files with laughter labels
├── sub/ # `.pk` files with subtitles
├── episode/ # `.mkv` files with videos
├── audio_split/ # `.wav` files with audio 8 seconds windows
│ ├── test_8s/
│ ├── train_8s/
│ └── validation_8s/
├── video_split/ # `.mp4` files with video 8 seconds windows
│ ├── test_8s/
│ ├── train_8s/
│ └── validation_8s/
└── sub_split/ # `.pk` files with subtitles 8 seconds windows
├── sub_test_8s.pk
├── sub_train_8s.pk
└── sub_validation_8s.pk
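If you assemble the data yourself, a small layout check like the one below (a sketch; adjust the root path to your setup) can catch missing folders before running the processing scripts:

```python
from pathlib import Path

# Root of the dataset for one TV show (path is an example).
root = Path("FunnyNet-data/friends")

expected = [
    "audio/diff", "audio/embedding", "audio/laughter",
    "audio/laughter_segment", "audio/left", "audio/raw",
    "laughter", "sub", "episode",
    "audio_split/train_8s", "audio_split/validation_8s", "audio_split/test_8s",
    "video_split/train_8s", "video_split/validation_8s", "video_split/test_8s",
    "sub_split",
]

for rel in expected:
    print(f"{rel:<30} {'ok' if (root / rel).is_dir() else 'missing'}")
```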
Note: we cannot provide the audio and video data due to copyright issues.
Split audio, subtitles and videos into segments of n seconds (default 8 seconds):
python data_processing/mask_audio.py DATA_DIR/audio/raw DATA_DIR/audio/laughter DATA_DIR/audio/processed
python data_processing/audio_processing.py DATA_DIR/audio/raw DATA_DIR/laughter/xx.pk DATA_DIR/audio_split
python data_processing/sub_processing.py DATA_DIR/sub DATA_DIR/laughter/xx.pk DATA_DIR/sub_split
python data_processing/video_processing.py DATA_DIR/episode DATA_DIR/laughter/xx.pk DATA_DIR/video_split
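For intuition, the audio part of this step amounts to cutting each raw track into fixed-length windows. A simplified sketch using torchaudio (not the project's actual implementation; paths are hypothetical and the real scripts align windows with the laughter labels):

```python
import torchaudio

# Load one raw episode track (path is hypothetical).
waveform, sample_rate = torchaudio.load("FunnyNet-data/friends/audio/raw/s01e01.wav")

window = 8 * sample_rate  # 8-second windows, the default segment length
for i, start in enumerate(range(0, waveform.shape[1] - window + 1, window)):
    segment = waveform[:, start:start + window]
    torchaudio.save(f"FunnyNet-data/friends/audio_split/train_8s/s01e01_{i:04d}.wav",
                    segment, sample_rate)
```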
- Train the multimodal model with audio and vision (an example invocation is given below):
python funnynet/train.py model.batch_size=BATCH_SIZE xp_name=XP_NAME data.data_dir=DATA_DIR model=avf-timesformer-byol-lstm data=avf-timesformer-byol-lstm
- Test the multimodal model with audio and vision:
python funnynet/evaluate.py
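As an illustration, the training placeholders might be filled in as follows (all values here are hypothetical):

python funnynet/train.py model.batch_size=16 xp_name=friends_avf data.data_dir=/path/to/FunnyNet-data model=avf-timesformer-byol-lstm data=avf-timesformer-byol-lstm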
There are 4 scripts:
- `laughter_detection/scripts/extract_audio.py`: extracts the audio track of each video file in `episode/` and saves it in `audio/raw/`.
- `laughter_detection/scripts/detect_laughter.py`: detects laughters from audio files in `audio/raw/` and saves laughter timecodes as `.pickle` files in `audio/laughter/`.
- `laughter_detection/scripts/extract_laughter.py`: extracts from the raw audio in `audio/raw/` each laughter detected in `audio/laughter/` and saves them in `audio/laughter_segment/`.
- `laughter_detection/scripts/evaluate_laughters.py`: given directories of predicted and ground-truth laughter files (`.pickle`), compares them and computes metrics.
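To give an idea of what the evaluation compares, here is a simplified sketch (not the repository's code) that matches predicted and ground-truth laughter intervals by temporal overlap; the pickle layout (a list of `(start, end)` pairs in seconds) and the 0.5 IoU threshold are assumptions:

```python
import pickle
from pathlib import Path

def iou(a, b):
    """Temporal intersection-over-union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def load_intervals(path):
    # Assumes each .pickle file holds a list of (start, end) pairs in seconds.
    with open(path, "rb") as f:
        return pickle.load(f)

pred_dir, gt_dir = Path("predicted_laughters"), Path("ground_truth_laughters")
tp = fp = fn = 0
for gt_file in gt_dir.glob("*.pickle"):
    gt = load_intervals(gt_file)
    pred = load_intervals(pred_dir / gt_file.name)
    matched = [any(iou(p, g) >= 0.5 for g in gt) for p in pred]
    tp += sum(matched)
    fp += len(pred) - sum(matched)
    fn += sum(not any(iou(p, g) >= 0.5 for p in pred) for g in gt)

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(f"precision={precision:.3f}  recall={recall:.3f}")
```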