diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..1b20be5
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,8 @@
+*.pt
+*.pth
+*.pkl
+*.pyc
+*.txt
+__pycache__
+det_results
+.vscode
\ No newline at end of file
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..4c3670a
--- /dev/null
+++ b/README.md
@@ -0,0 +1,167 @@
+# YOWOv2
+
+## Requirements
+- We recommend using Anaconda to create a conda environment:
+```Shell
+conda create -n yowo python=3.6
+```
+
+- Then, activate the environment:
+```Shell
+conda activate yowo
+```
+
+- Install the requirements:
+```Shell
+pip install -r requirements.txt
+```
+
+## Visualization
+
+Coming soon ...
+
+## Dataset
+You can download **UCF24** from the following links:
+
+### UCF101-24
+* Google Drive
+
+Link: https://drive.google.com/file/d/1Dwh90pRi7uGkH5qLRjQIFiEmMJrAog5J/view?usp=sharing
+
+* BaiduYun Disk
+
+Link: https://pan.baidu.com/s/11GZvbV0oAzBhNDVKXsVGKg
+
+Password: hmu6
+
+### AVA
+You can follow the instructions in [AVA_Dataset](https://github.com/yjh0410/AVA_Dataset) to prepare the **AVA** dataset.
+
+## Experiment
+* UCF101-24
+
+| Model          | Clip | GFLOPs | Params  | F-mAP | V-mAP | FPS | Weight |
+|----------------|------|--------|---------|-------|-------|-----|--------|
+| YOWOv2-Nano    | 16   | 2.6    | 3.5 M   | 78.8  | 48.0  | 42  | [ckpt](https://github.com/yjh0410/YOWOv2/releases/download/yowo_v2_weight/yowo_v2_nano_ucf24.pth) |
+| YOWOv2-Tiny    | 16   | 5.8    | 10.9 M  | 80.5  | 51.3  | 50  | [ckpt](https://github.com/yjh0410/YOWOv2/releases/download/yowo_v2_weight/yowo_v2_tiny_ucf24.pth) |
+| YOWOv2-Medium  | 16   | 24.1   | 52.0 M  | 83.1  | 50.7  | 42  | [ckpt](https://github.com/yjh0410/YOWOv2/releases/download/yowo_v2_weight/yowo_v2_medium_ucf24.pth) |
+| YOWOv2-Large   | 16   | 107.1  | 109.7 M | 85.2  | 52.0  | 30  | [ckpt](https://github.com/yjh0410/YOWOv2/releases/download/yowo_v2_weight/yowo_v2_large_ucf24.pth) |
+| YOWOv2-Nano    | 32   | 4.0    | 3.5 M   | 79.4  | 49.0  | 42  | [ckpt](https://github.com/yjh0410/YOWOv2/releases/download/yowo_v2_weight/yowo_v2_nano_ucf24_k32.pth) |
+| YOWOv2-Tiny    | 32   | 9.0    | 10.9 M  | 83.0  | 51.2  | 50  | [ckpt](https://github.com/yjh0410/YOWOv2/releases/download/yowo_v2_weight/yowo_v2_tiny_ucf24_k32.pth) |
+| YOWOv2-Medium  | 32   | 27.3   | 52.0 M  | 83.7  | 52.5  | 40  | [ckpt](https://github.com/yjh0410/YOWOv2/releases/download/yowo_v2_weight/yowo_v2_medium_ucf24_k32.pth) |
+| YOWOv2-Large   | 32   | 183.9  | 109.7 M | 87.0  | 52.8  | 22  | [ckpt](https://github.com/yjh0410/YOWOv2/releases/download/yowo_v2_weight/yowo_v2_large_ucf24_k32.pth) |
+
+* AVA v2.2
+
+| Model          | Clip | mAP  | FPS | Weight |
+|----------------|------|------|-----|--------|
+| YOWOv2-Nano    | 16   | 12.6 | 40  | [ckpt](https://github.com/yjh0410/YOWOv2/releases/download/yowo_v2_weight/yowo_v2_nano_ava.pth) |
+| YOWOv2-Tiny    | 16   | 14.9 | 49  | [ckpt](https://github.com/yjh0410/YOWOv2/releases/download/yowo_v2_weight/yowo_v2_tiny_ava.pth) |
+| YOWOv2-Medium  | 16   | 18.4 | 41  | [ckpt](https://github.com/yjh0410/YOWOv2/releases/download/yowo_v2_weight/yowo_v2_medium_ava.pth) |
+| YOWOv2-Large   | 16   | 20.2 | 29  | [ckpt](https://github.com/yjh0410/YOWOv2/releases/download/yowo_v2_weight/yowo_v2_large_ava.pth) |
+| YOWOv2-Nano    | 32   |      |     |        |
+| YOWOv2-Tiny    | 32   | 15.6 | 49  | [ckpt](https://github.com/yjh0410/YOWOv2/releases/download/yowo_v2_weight/yowo_v2_tiny_ava_k32.pth) |
+| YOWOv2-Medium  | 32   | 18.4 | 40  | [ckpt](https://github.com/yjh0410/YOWOv2/releases/download/yowo_v2_weight/yowo_v2_medium_ava_k32.pth) |
+| YOWOv2-Large   | 32   | 21.7 | 22  | [ckpt](https://github.com/yjh0410/YOWOv2/releases/download/yowo_v2_weight/yowo_v2_large_ava_k32.pth) |
+
+## Train YOWOv2
+* UCF101-24
+
+```Shell
+python train.py --cuda -d ucf24 --root path/to/dataset -v yowo_v2_nano --num_workers 4 --eval_epoch 1 --max_epoch 8 --lr_epoch 2 3 4 5 --lr 0.0001 -ldr 0.5 -bs 8 -accu 16
+```
+
+Or you can just run the script:
+
+```Shell
+sh train_ucf.sh
+```
+
+* AVA
+
+```Shell
+python train.py --cuda -d ava_v2.2 --root path/to/dataset -v yowo_v2_nano --num_workers 4 --eval_epoch 1 --max_epoch 10 --lr_epoch 3 4 5 6 --lr 0.0001 -ldr 0.5 -bs 8 -accu 16 --eval
+```
+
+Or you can just run the script:
+
+```Shell
+sh train_ava.sh
+```
+
+## Test YOWOv2
+* UCF101-24
+
+For example:
+
+```Shell
+python test.py --cuda -d ucf24 -v yowo_v2_nano --weight path/to/weight -size 224 --show
+```
+
+* AVA
+
+For example:
+
+```Shell
+python test.py --cuda -d ava_v2.2 -v yowo_v2_nano --weight path/to/weight -size 224 --show
+```
+
+## Test YOWOv2 on an AVA video
+For example:
+
+```Shell
+python test_video_ava.py --cuda -d ava_v2.2 -v yowo_v2_nano --weight path/to/weight --video path/to/video --show
+```
+
+Note that `path/to/video` can point to any video on your local device, not only AVA videos.
+
+## Evaluate YOWOv2
+* UCF101-24
+
+For example:
+
+```Shell
+# Frame mAP
+python eval.py \
+        --cuda \
+        -d ucf24 \
+        -v yowo_v2_nano \
+        -bs 16 \
+        -size 224 \
+        --weight path/to/weight \
+        --cal_frame_mAP
+```
+
+```Shell
+# Video mAP
+python eval.py \
+        --cuda \
+        -d ucf24 \
+        -v yowo_v2_nano \
+        -bs 16 \
+        -size 224 \
+        --weight path/to/weight \
+        --cal_video_mAP
+```
+
+* AVA
+
+Run the following command to calculate frame mAP@0.5 IoU:
+
+```Shell
+python eval.py \
+        --cuda \
+        -d ava_v2.2 \
+        -v yowo_v2_nano \
+        -bs 16 \
+        --weight path/to/weight
+```
+
+## Demo
+```Shell
+# run demo
+python demo.py --cuda -d ucf24 -v yowo_v2_nano -size 224 --weight path/to/weight --video path/to/video --show
+```
+
+Set `-d ava_v2.2` instead to run the demo with a model trained on AVA.
+
+## References
+If you use our code, please consider citing our paper.
+
+Coming soon ...
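+If you want to call YOWOv2 from Python rather than through the scripts above, the demo pipeline boils down to roughly the sketch below. This is a minimal, untested outline distilled from `demo.py` in this repo: `SimpleNamespace` stands in for the argparse flags (and `build_model` may read further fields from it), and `path/to/frames/*.jpg` is a placeholder for your own key-frame images.
+
+```python
+import glob
+from types import SimpleNamespace
+
+import torch
+from PIL import Image
+
+from config import build_dataset_config, build_model_config
+from dataset.transforms import BaseTransform
+from models import build_model
+from utils.misc import load_weight
+
+# Same flags that demo.py builds with argparse (values here are only examples).
+args = SimpleNamespace(
+    version='yowo_v2_nano', dataset='ucf24', weight='path/to/weight',
+    img_size=224, len_clip=16, conf_thresh=0.1, nms_thresh=0.5,
+    topk=40, memory=False, cuda=torch.cuda.is_available())
+device = torch.device('cuda' if args.cuda else 'cpu')
+
+# Build the model exactly as demo.py does, then load the trained weights.
+d_cfg = build_dataset_config(args)
+m_cfg = build_model_config(args)
+model, _ = build_model(args=args, d_cfg=d_cfg, m_cfg=m_cfg, device=device,
+                       num_classes=d_cfg['valid_num_classes'], trainable=False)
+model = load_weight(model=model, path_to_ckpt=args.weight)
+model = model.to(device).eval()
+
+# One clip = len_clip consecutive PIL frames; BaseTransform resizes them to img_size.
+frame_paths = sorted(glob.glob('path/to/frames/*.jpg'))[-args.len_clip:]
+frames = [Image.open(p).convert('RGB') for p in frame_paths]
+clip, _ = BaseTransform(img_size=args.img_size)(frames)
+clip = torch.stack(clip, dim=1).unsqueeze(0).to(device)  # [B=1, 3, T, H, W]
+
+with torch.no_grad():
+    outputs = model(clip)  # ucf24: (scores, labels, boxes); ava_v2.2: multi-hot boxes
+```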
diff --git a/config/__init__.py b/config/__init__.py new file mode 100644 index 0000000..3926cf3 --- /dev/null +++ b/config/__init__.py @@ -0,0 +1,21 @@ +from .dataset_config import dataset_config +from .yowo_v2_config import yowo_v2_config + + +def build_model_config(args): + print('==============================') + print('Model Config: {} '.format(args.version.upper())) + + if 'yowo_v2_' in args.version: + m_cfg = yowo_v2_config[args.version] + + return m_cfg + + +def build_dataset_config(args): + print('==============================') + print('Dataset Config: {} '.format(args.dataset.upper())) + + d_cfg = dataset_config[args.dataset] + + return d_cfg diff --git a/config/dataset_config.py b/config/dataset_config.py new file mode 100644 index 0000000..4495325 --- /dev/null +++ b/config/dataset_config.py @@ -0,0 +1,92 @@ +# Dataset configuration + + +dataset_config = { + 'ucf24': { + # dataset + 'gt_folder': './evaluator/groundtruths_ucf_jhmdb/groundtruths_ucf/', + # input size + 'train_size': 224, + 'test_size': 224, + # transform + 'jitter': 0.2, + 'hue': 0.1, + 'saturation': 1.5, + 'exposure': 1.5, + 'sampling_rate': 1, + # cls label + 'multi_hot': False, # one hot + # optimizer + 'optimizer': 'adamw', + 'momentum': 0.9, + 'weight_decay': 5e-4, + # warmup strategy + 'warmup': 'linear', + 'warmup_factor': 0.00066667, + 'wp_iter': 500, + # class names + 'valid_num_classes': 24, + 'label_map': ( + 'Basketball', 'BasketballDunk', 'Biking', 'CliffDiving', + 'CricketBowling', 'Diving', 'Fencing', 'FloorGymnastics', + 'GolfSwing', 'HorseRiding', 'IceDancing', 'LongJump', + 'PoleVault', 'RopeClimbing', 'SalsaSpin', 'SkateBoarding', + 'Skiing', 'Skijet', 'SoccerJuggling', 'Surfing', + 'TennisSwing', 'TrampolineJumping', 'VolleyballSpiking', 'WalkingWithDog' + ), + }, + + 'ava_v2.2':{ + # dataset + 'frames_dir': 'frames/', + 'frame_list': 'frame_lists/', + 'annotation_dir': 'annotations/', + 'train_gt_box_list': 'ava_v2.2/ava_train_v2.2.csv', + 'val_gt_box_list': 'ava_v2.2/ava_val_v2.2.csv', + 'train_exclusion_file': 'ava_v2.2/ava_train_excluded_timestamps_v2.2.csv', + 'val_exclusion_file': 'ava_v2.2/ava_val_excluded_timestamps_v2.2.csv', + 'labelmap_file': 'ava_v2.2/ava_action_list_v2.2_for_activitynet_2019.pbtxt', # 'ava_v2.2/ava_action_list_v2.2.pbtxt', + 'class_ratio_file': 'config/ava_categories_ratio.json', + 'backup_dir': 'results/', + # input size + 'train_size': 224, + 'test_size': 224, + # transform + 'jitter': 0.2, + 'hue': 0.1, + 'saturation': 1.5, + 'exposure': 1.5, + 'sampling_rate': 1, + # cls label + 'multi_hot': True, # multi hot + # train config + 'optimizer': 'adamw', + 'momentum': 0.9, + 'weight_decay': 5e-4, + # warmup strategy + 'warmup': 'linear', + 'warmup_factor': 0.00066667, + 'wp_iter': 500, + # class names + 'valid_num_classes': 80, + 'label_map': ( + 'bend/bow(at the waist)', 'crawl', 'crouch/kneel', 'dance', 'fall down', # 1-5 + 'get up', 'jump/leap', 'lie/sleep', 'martial art', 'run/jog', # 6-10 + 'sit', 'stand', 'swim', 'walk', 'answer phone', # 11-15 + 'brush teeth', 'carry/hold (an object)', 'catch (an object)', 'chop', 'climb (e.g. 
a mountain)', # 16-20 + 'clink glass', 'close (e.g., a door, a box)', 'cook', 'cut', 'dig', # 21-25 + 'dress/put on clothing', 'drink', 'drive (e.g., a car, a truck)', 'eat', 'enter', # 26-30 + 'exit', 'extract', 'fishing', 'hit (an object)', 'kick (an object)', # 31-35 + 'lift/pick up', 'listen (e.g., to music)', 'open (e.g., a window, a car door)', 'paint', 'play board game', # 36-40 + 'play musical instrument', 'play with pets', 'point to (an object)', 'press','pull (an object)', # 41-45 + 'push (an object)', 'put down', 'read', 'ride (e.g., a bike, a car, a horse)', 'row boat', # 46-50 + 'sail boat', 'shoot', 'shovel', 'smoke', 'stir', # 51-55 + 'take a photo', 'text on/look at a cellphone', 'throw', 'touch (an object)', 'turn (e.g., a screwdriver)', # 56-60 + 'watch (e.g., TV)', 'work on a computer', 'write', 'fight/hit (a person)', 'give/serve (an object) to (a person)', # 61-65 + 'grab (a person)', 'hand clap', 'hand shake', 'hand wave', 'hug (a person)', # 66-70 + 'kick (a person)', 'kiss (a person)', 'lift (a person)', 'listen to (a person)', 'play with kids', # 71-75 + 'push (another person)', 'sing to (e.g., self, a person, a group)', 'take (an object) from (a person)', # 76-78 + 'talk to (e.g., self, a person, a group)', 'watch (a person)' # 79-80 + ), + } +} diff --git a/config/yowo_v2_config.py b/config/yowo_v2_config.py new file mode 100644 index 0000000..8d0e8e4 --- /dev/null +++ b/config/yowo_v2_config.py @@ -0,0 +1,84 @@ +# Model configuration + + +yowo_v2_config = { + 'yowo_v2_nano': { + # backbone + ## 2D + 'backbone_2d': 'yolo_free_nano', + 'pretrained_2d': True, + 'stride': [8, 16, 32], + ## 3D + 'backbone_3d': 'shufflenetv2', + 'model_size': '1.0x', + 'pretrained_3d': True, + 'memory_momentum': 0.9, + # head + 'head_dim': 64, + 'head_norm': 'BN', + 'head_act': 'lrelu', + 'num_cls_heads': 2, + 'num_reg_heads': 2, + 'head_depthwise': True, + }, + + 'yowo_v2_tiny': { + # backbone + ## 2D + 'backbone_2d': 'yolo_free_tiny', + 'pretrained_2d': True, + 'stride': [8, 16, 32], + ## 3D + 'backbone_3d': 'shufflenetv2', + 'model_size': '2.0x', + 'pretrained_3d': True, + 'memory_momentum': 0.9, + # head + 'head_dim': 64, + 'head_norm': 'BN', + 'head_act': 'lrelu', + 'num_cls_heads': 2, + 'num_reg_heads': 2, + 'head_depthwise': False, + }, + + 'yowo_v2_medium': { + # backbone + ## 2D + 'backbone_2d': 'yolo_free_large', + 'pretrained_2d': True, + 'stride': [8, 16, 32], + ## 3D + 'backbone_3d': 'shufflenetv2', + 'model_size': '2.0x', + 'pretrained_3d': True, + 'memory_momentum': 0.9, + # head + 'head_dim': 128, + 'head_norm': 'BN', + 'head_act': 'silu', + 'num_cls_heads': 2, + 'num_reg_heads': 2, + 'head_depthwise': False, + }, + + 'yowo_v2_large': { + # backbone + ## 2D + 'backbone_2d': 'yolo_free_large', + 'pretrained_2d': True, + 'stride': [8, 16, 32], + ## 3D + 'backbone_3d': 'resnext101', + 'pretrained_3d': True, + 'memory_momentum': 0.9, + # head + 'head_dim': 256, + 'head_norm': 'BN', + 'head_act': 'silu', + 'num_cls_heads': 2, + 'num_reg_heads': 2, + 'head_depthwise': False, + }, + +} \ No newline at end of file diff --git a/dataset/__init__.py b/dataset/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/dataset/ava.py b/dataset/ava.py new file mode 100644 index 0000000..1b61342 --- /dev/null +++ b/dataset/ava.py @@ -0,0 +1,298 @@ +#!/usr/bin/python +# encoding: utf-8 + +import os +import numpy as np + +import torch +from torch.utils.data import Dataset +from PIL import Image + +try: + import ava_helper +except: + from . 
import ava_helper + + +# Dataset for AVA +class AVA_Dataset(Dataset): + def __init__(self, + cfg, + data_root, + is_train=False, + img_size=224, + transform=None, + len_clip=16, + sampling_rate=1): + self._downsample = 4 + self.num_classes = 80 + self.data_root = data_root + self.frames_dir = os.path.join(data_root, cfg['frames_dir']) + self.frame_list = os.path.join(data_root, cfg['frame_list']) + self.annotation_dir = os.path.join(data_root, cfg['annotation_dir']) + self.labelmap_file = os.path.join(data_root, cfg['annotation_dir'], cfg['labelmap_file']) + if is_train: + self.gt_box_list = os.path.join(self.annotation_dir, cfg['train_gt_box_list']) + self.exclusion_file = os.path.join(self.annotation_dir, cfg['train_exclusion_file']) + else: + self.gt_box_list = os.path.join(self.annotation_dir, cfg['val_gt_box_list']) + self.exclusion_file = os.path.join(self.annotation_dir, cfg['val_exclusion_file']) + + self.transform = transform + self.is_train = is_train + + self.img_size = img_size + self.len_clip = len_clip + self.sampling_rate = sampling_rate + self.seq_len = self.len_clip * self.sampling_rate + + # load ava data + self._load_data() + + + def _load_data(self): + # Loading frame paths. + ( + self._image_paths, + self._video_idx_to_name, + ) = ava_helper.load_image_lists( + self.frames_dir, + self.frame_list, + self.is_train + ) + + # Loading annotations for boxes and labels. + # boxes_and_labels: {'': {: a list of [box_i, box_i_labels]} } + boxes_and_labels = ava_helper.load_boxes_and_labels( + self.gt_box_list, + self.exclusion_file, + self.is_train, + full_test_on_val=False + ) + + assert len(boxes_and_labels) == len(self._image_paths) + + # boxes_and_labels: a list of {: a list of [box_i, box_i_labels]} + boxes_and_labels = [ + boxes_and_labels[self._video_idx_to_name[i]] + for i in range(len(self._image_paths)) + ] + + # Get indices of keyframes and corresponding boxes and labels. + # _keyframe_indices: [video_idx, sec_idx, sec, frame_index] + # _keyframe_boxes_and_labels: list[list[list]], outer is video_idx, middle is sec_idx, + # inner is a list of [box_i, box_i_labels] + ( + self._keyframe_indices, + self._keyframe_boxes_and_labels, + ) = ava_helper.get_keyframe_data(boxes_and_labels) + + # Calculate the number of used boxes. + self._num_boxes_used = ava_helper.get_num_boxes_used( + self._keyframe_indices, self._keyframe_boxes_and_labels + ) + + self._max_objs = ava_helper.get_max_objs( + self._keyframe_indices, self._keyframe_boxes_and_labels + ) + + print("=== AVA dataset summary ===") + print("Train: {}".format(self.is_train)) + print("Number of videos: {}".format(len(self._image_paths))) + total_frames = sum( + len(video_img_paths) for video_img_paths in self._image_paths + ) + print("Number of frames: {}".format(total_frames)) + print("Number of key frames: {}".format(len(self))) + print("Number of boxes: {}.".format(self._num_boxes_used)) + + + def __len__(self): + return len(self._keyframe_indices) + + + def get_sequence(self, center_idx, half_len, sample_rate, num_frames): + """ + Sample frames among the corresponding clip. + + Args: + center_idx (int): center frame idx for current clip + half_len (int): half of the clip length + sample_rate (int): sampling rate for sampling frames inside of the clip + num_frames (int): number of expected sampled frames + + Returns: + seq (list): list of indexes of sampled frames in this clip. 
+ """ + # seq = list(range(center_idx - half_len, center_idx + half_len, sample_rate)) + seq = list(range(center_idx - half_len*2 + 1*sample_rate, center_idx+1*sample_rate, sample_rate)) + + for seq_idx in range(len(seq)): + if seq[seq_idx] < 0: + seq[seq_idx] = 0 + elif seq[seq_idx] >= num_frames: + seq[seq_idx] = num_frames - 1 + return seq + + + def get_frame_idx(self, latest_idx, sample_length, sample_rate, num_frames): + """ + Sample frames among the corresponding clip. But see keyframe as the latest frame, + instead of viewing it in center + """ + # seq = list(range(latest_idx - sample_length + 1, latest_idx + 1, sample_rate)) + seq = list(range(latest_idx, latest_idx - sample_length, -sample_rate)) + seq.reverse() + for seq_idx in range(len(seq)): + if seq[seq_idx] < 0: + seq[seq_idx] = 0 + elif seq[seq_idx] >= num_frames: + seq[seq_idx] = num_frames - 1 + + return seq + + + def __getitem__(self, idx): + # load a data + frame_idx, video_clip, target = self.pull_item(idx) + + return frame_idx, video_clip, target + + + def pull_item(self, idx): + # Get the frame idxs for current clip. We can use it as center or latest + video_idx, sec_idx, sec, frame_idx = self._keyframe_indices[idx] + clip_label_list = self._keyframe_boxes_and_labels[video_idx][sec_idx] + + # check label list + assert len(clip_label_list) > 0 + assert len(clip_label_list) <= self._max_objs + + # get a sequence + seq = self.get_sequence( + frame_idx, + self.seq_len // 2, + self.sampling_rate, + num_frames=len(self._image_paths[video_idx]), + ) + image_paths = [self._image_paths[video_idx][frame - 1] for frame in seq] + keyframe_info = self._image_paths[video_idx][frame_idx - 1] + + # load a video clip + video_clip = [] + for img_path in image_paths: + frame = Image.open(img_path).convert('RGB') + video_clip.append(frame) + ow, oh = frame.width, frame.height + + # Get boxes and labels for current clip. 
+ boxes = [] + labels = [] + for box_labels in clip_label_list: + bbox = box_labels[0] + label = box_labels[1] + multi_hot_label = np.zeros(1 + self.num_classes) + multi_hot_label[..., label] = 1.0 + + boxes.append(bbox) + labels.append(multi_hot_label[..., 1:].tolist()) + + boxes = np.array(boxes).reshape(-1, 4) + # renormalize bbox + boxes[..., [0, 2]] *= ow + boxes[..., [1, 3]] *= oh + labels = np.array(labels).reshape(-1, self.num_classes) + + # target: [N, 4 + C] + target = np.concatenate([boxes, labels], axis=-1) + + # transform + video_clip, target = self.transform(video_clip, target) + # List [T, 3, H, W] -> [3, T, H, W] + video_clip = torch.stack(video_clip, dim=1) + + # reformat target + target = { + 'boxes': target[:, :4].float(), # [N, 4] + 'labels': target[:, 4:].long(), # [N, C] + 'orig_size': [ow, oh], + 'video_idx': video_idx, + 'sec': sec, + + } + + return [video_idx, sec], video_clip, target + + + +if __name__ == '__main__': + import cv2 + from transforms import Augmentation, BaseTransform + + is_train = False + img_size = 224 + len_clip = 16 + dataset_config = { + 'data_root': '/mnt/share/sda1/dataset/STAD/AVA_Dataset', + 'frames_dir': 'frames/', + 'frame_list': 'frame_lists/', + 'annotation_dir': 'annotations/', + 'train_gt_box_list': 'ava_v2.2/ava_train_v2.2.csv', + 'val_gt_box_list': 'ava_v2.2/ava_val_v2.2.csv', + 'train_exclusion_file': 'ava_v2.2/ava_train_excluded_timestamps_v2.2.csv', + 'val_exclusion_file': 'ava_v2.2/ava_val_excluded_timestamps_v2.2.csv', + 'labelmap_file': 'ava_v2.2/ava_action_list_v2.2.pbtxt', + } + trans_config = { + 'jitter': 0.2, + 'hue': 0.1, + 'saturation': 1.5, + 'exposure': 1.5 + } + train_transform = Augmentation( + img_size=img_size, + jitter=trans_config['jitter'], + saturation=trans_config['saturation'], + exposure=trans_config['exposure'] + ) + val_transform = BaseTransform(img_size=img_size) + + train_dataset = AVA_Dataset( + cfg=dataset_config, + data_root=dataset_config['data_root'], + is_train=is_train, + img_size=img_size, + transform=train_transform, + len_clip=len_clip, + sampling_rate=1 + ) + + print(len(train_dataset)) + for i in range(len(train_dataset)): + frame_id, video_clip, target = train_dataset[i] + key_frame = video_clip[:, -1, :, :] + + # to numpy + key_frame = key_frame.permute(1, 2, 0).numpy() + key_frame = key_frame.astype(np.uint8) + + # to BGR + key_frame = key_frame[..., (2, 1, 0)] + H, W, C = key_frame.shape + + key_frame = key_frame.copy() + bboxes = target['boxes'] + labels = target['labels'] + + for box, cls_id in zip(bboxes, labels): + x1, y1, x2, y2 = box + x1 = int(x1 * W) + y1 = int(y1 * H) + x2 = int(x2 * W) + y2 = int(y2 * H) + key_frame = cv2.rectangle(key_frame, (x1, y1), (x2, y2), (255, 0, 0)) + + # cv2 show + cv2.imshow('key frame', key_frame) + cv2.waitKey(0) + \ No newline at end of file diff --git a/dataset/ava_helper.py b/dataset/ava_helper.py new file mode 100644 index 0000000..a20777a --- /dev/null +++ b/dataset/ava_helper.py @@ -0,0 +1,228 @@ +#!/usr/bin/env python3 +# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved. + +import logging +import os +import csv +from collections import defaultdict + + +logger = logging.getLogger(__name__) +FPS = 30 +AVA_VALID_FRAMES = range(902, 1799) + + +def make_image_key(video_id, timestamp): + """Returns a unique identifier for a video id & timestamp.""" + return "%s,%04d" % (video_id, int(timestamp)) + + +def read_exclusions(exclusions_file): + """Reads a CSV file of excluded timestamps. 
+ Args: + exclusions_file: A file object containing a csv of video-id,timestamp. + Returns: + A set of strings containing excluded image keys, e.g. "aaaaaaaaaaa,0904", + or an empty set if exclusions file is None. + """ + excluded = set() + if exclusions_file: + with open(exclusions_file, "r") as f: + reader = csv.reader(f) + for row in reader: + assert len(row) == 2, "Expected only 2 columns, got: " + row + excluded.add(make_image_key(row[0], row[1])) + return excluded + + +def load_image_lists(frames_dir, frame_list, is_train): + """ + Loading image paths from corresponding files. + + Args: + frames_dir (str): path to frames dir. + frame_list (str): path to frame list. + is_train (bool): if it is training dataset or not. + + Returns: + image_paths (list[list]): a list of items. Each item (also a list) + corresponds to one video and contains the paths of images for + this video. + video_idx_to_name (list): a list which stores video names. + """ + # frame_list_dir is /data3/ava/frame_lists/ + # contains 'train.csv' and 'val.csv' + if is_train: + list_name = "train.csv" + else: + list_name = "val.csv" + + list_filename = os.path.join(frame_list, list_name) + + image_paths = defaultdict(list) + video_name_to_idx = {} + video_idx_to_name = [] + with open(list_filename, "r") as f: + f.readline() + for line in f: + row = line.split() + # The format of each row should follow: + # original_vido_id video_id frame_id path labels. + assert len(row) == 5 + video_name = row[0] + + if video_name not in video_name_to_idx: + idx = len(video_name_to_idx) + video_name_to_idx[video_name] = idx + video_idx_to_name.append(video_name) + + data_key = video_name_to_idx[video_name] + + image_paths[data_key].append(os.path.join(frames_dir, row[3])) + + image_paths = [image_paths[i] for i in range(len(image_paths))] + + print("Finished loading image paths from: {}".format(list_filename)) + + return image_paths, video_idx_to_name + + +def load_boxes_and_labels(gt_box_list, exclusion_file, is_train=False, full_test_on_val=False): + """ + Loading boxes and labels from csv files. + + Args: + cfg (CfgNode): config. + mode (str): 'train', 'val', or 'test' mode. + Returns: + all_boxes (dict): a dict which maps from `video_name` and + `frame_sec` to a list of `box`. Each `box` is a + [`box_coord`, `box_labels`] where `box_coord` is the + coordinates of box and 'box_labels` are the corresponding + labels for the box. 
+ """ + ann_filename = gt_box_list + all_boxes = {} + count = 0 + unique_box_count = 0 + excluded_keys = read_exclusions(exclusion_file) + + with open(ann_filename, 'r') as f: + for line in f: + row = line.strip().split(',') + + video_name, frame_sec = row[0], int(row[1]) + key = "%s,%04d" % (video_name, frame_sec) + # if mode == 'train' and key in excluded_keys: + if key in excluded_keys: + print("Found {} to be excluded...".format(key)) + continue + + # Only select frame_sec % 4 = 0 samples for validation if not + # set FULL_TEST_ON_VAL (default False) + if not is_train and not full_test_on_val and frame_sec % 4 != 0: + continue + # Box with [x1, y1, x2, y2] with a range of [0, 1] as float + box_key = ",".join(row[2:6]) + box = list(map(float, row[2:6])) + label = -1 if row[6] == "" else int(row[6]) + if video_name not in all_boxes: + all_boxes[video_name] = {} + for sec in AVA_VALID_FRAMES: + all_boxes[video_name][sec] = {} + if box_key not in all_boxes[video_name][frame_sec]: + all_boxes[video_name][frame_sec][box_key] = [box, []] + unique_box_count += 1 + + all_boxes[video_name][frame_sec][box_key][1].append(label) + if label != -1: + count += 1 + + for video_name in all_boxes.keys(): + for frame_sec in all_boxes[video_name].keys(): + # Save in format of a list of [box_i, box_i_labels]. + all_boxes[video_name][frame_sec] = list( + all_boxes[video_name][frame_sec].values() + ) + + print("Finished loading annotations from: %s" % ", ".join([ann_filename])) + print("Number of unique boxes: %d" % unique_box_count) + print("Number of annotations: %d" % count) + + return all_boxes + + +def get_keyframe_data(boxes_and_labels): + """ + Getting keyframe indices, boxes and labels in the dataset. + + Args: + boxes_and_labels (list[dict]): a list which maps from video_idx to a dict. + Each dict `frame_sec` to a list of boxes and corresponding labels. + + Returns: + keyframe_indices (list): a list of indices of the keyframes. + keyframe_boxes_and_labels (list[list[list]]): a list of list which maps from + video_idx and sec_idx to a list of boxes and corresponding labels. + """ + + def sec_to_frame(sec): + """ + Convert time index (in second) to frame index. + 0: 900 + 30: 901 + """ + return (sec - 900) * FPS + + keyframe_indices = [] + keyframe_boxes_and_labels = [] + count = 0 + for video_idx in range(len(boxes_and_labels)): + sec_idx = 0 + keyframe_boxes_and_labels.append([]) + for sec in boxes_and_labels[video_idx].keys(): + if sec not in AVA_VALID_FRAMES: + continue + + if len(boxes_and_labels[video_idx][sec]) > 0: + keyframe_indices.append( + (video_idx, sec_idx, sec, sec_to_frame(sec)) + ) + keyframe_boxes_and_labels[video_idx].append( + boxes_and_labels[video_idx][sec] + ) + sec_idx += 1 + count += 1 + logger.info("%d keyframes used." % count) + + return keyframe_indices, keyframe_boxes_and_labels + + +def get_num_boxes_used(keyframe_indices, keyframe_boxes_and_labels): + """ + Get total number of used boxes. + + Args: + keyframe_indices (list): a list of indices of the keyframes. + keyframe_boxes_and_labels (list[list[list]]): a list of list which maps from + video_idx and sec_idx to a list of boxes and corresponding labels. + + Returns: + count (int): total number of used boxes. 
+ """ + + count = 0 + for video_idx, sec_idx, _, _ in keyframe_indices: + count += len(keyframe_boxes_and_labels[video_idx][sec_idx]) + return count + + +def get_max_objs(keyframe_indices, keyframe_boxes_and_labels): + # max_objs = 0 + # for video_idx, sec_idx, _, _ in keyframe_indices: + # num_boxes = len(keyframe_boxes_and_labels[video_idx][sec_idx]) + # if num_boxes > max_objs: + # max_objs = num_boxes + + # return max_objs + return 50 #### MODIFICATION FOR NOW! TODO: FIX LATER! diff --git a/dataset/transforms.py b/dataset/transforms.py new file mode 100644 index 0000000..d76ce42 --- /dev/null +++ b/dataset/transforms.py @@ -0,0 +1,175 @@ +import random +import numpy as np +import torch +import torchvision.transforms.functional as F +from PIL import Image + + +# Augmentation for Training +class Augmentation(object): + def __init__(self, img_size=224, jitter=0.2, hue=0.1, saturation=1.5, exposure=1.5): + self.img_size = img_size + self.jitter = jitter + self.hue = hue + self.saturation = saturation + self.exposure = exposure + + + def rand_scale(self, s): + scale = random.uniform(1, s) + + if random.randint(0, 1): + return scale + + return 1./scale + + + def random_distort_image(self, video_clip): + dhue = random.uniform(-self.hue, self.hue) + dsat = self.rand_scale(self.saturation) + dexp = self.rand_scale(self.exposure) + + video_clip_ = [] + for image in video_clip: + image = image.convert('HSV') + cs = list(image.split()) + cs[1] = cs[1].point(lambda i: i * dsat) + cs[2] = cs[2].point(lambda i: i * dexp) + + def change_hue(x): + x += dhue * 255 + if x > 255: + x -= 255 + if x < 0: + x += 255 + return x + + cs[0] = cs[0].point(change_hue) + image = Image.merge(image.mode, tuple(cs)) + + image = image.convert('RGB') + + video_clip_.append(image) + + return video_clip_ + + + def random_crop(self, video_clip, width, height): + dw =int(width * self.jitter) + dh =int(height * self.jitter) + + pleft = random.randint(-dw, dw) + pright = random.randint(-dw, dw) + ptop = random.randint(-dh, dh) + pbot = random.randint(-dh, dh) + + swidth = width - pleft - pright + sheight = height - ptop - pbot + + sx = float(swidth) / width + sy = float(sheight) / height + + dx = (float(pleft) / width)/sx + dy = (float(ptop) / height)/sy + + # random crop + cropped_clip = [img.crop((pleft, ptop, pleft + swidth - 1, ptop + sheight - 1)) for img in video_clip] + + return cropped_clip, dx, dy, sx, sy + + + def apply_bbox(self, target, ow, oh, dx, dy, sx, sy): + sx, sy = 1./sx, 1./sy + # apply deltas on bbox + target[..., 0] = np.minimum(0.999, np.maximum(0, target[..., 0] / ow * sx - dx)) + target[..., 1] = np.minimum(0.999, np.maximum(0, target[..., 1] / oh * sy - dy)) + target[..., 2] = np.minimum(0.999, np.maximum(0, target[..., 2] / ow * sx - dx)) + target[..., 3] = np.minimum(0.999, np.maximum(0, target[..., 3] / oh * sy - dy)) + + # refine target + refine_target = [] + for i in range(target.shape[0]): + tgt = target[i] + bw = (tgt[2] - tgt[0]) * ow + bh = (tgt[3] - tgt[1]) * oh + + if bw < 1. or bh < 1.: + continue + + refine_target.append(tgt) + + refine_target = np.array(refine_target).reshape(-1, target.shape[-1]) + + return refine_target + + + def to_tensor(self, video_clip): + return [F.to_tensor(image) * 255. 
for image in video_clip] + + + def __call__(self, video_clip, target): + # Initialize Random Variables + oh = video_clip[0].height + ow = video_clip[0].width + + # random crop + video_clip, dx, dy, sx, sy = self.random_crop(video_clip, ow, oh) + + # resize + video_clip = [img.resize([self.img_size, self.img_size]) for img in video_clip] + + # random flip + flip = random.randint(0, 1) + if flip: + video_clip = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in video_clip] + + # distort + video_clip = self.random_distort_image(video_clip) + + # process target + if target is not None: + target = self.apply_bbox(target, ow, oh, dx, dy, sx, sy) + if flip: + target[..., [0, 2]] = 1.0 - target[..., [2, 0]] + else: + target = np.array([]) + + # to tensor + video_clip = self.to_tensor(video_clip) + target = torch.as_tensor(target).float() + + return video_clip, target + + +# Transform for Testing +class BaseTransform(object): + def __init__(self, img_size=224, ): + self.img_size = img_size + + + def to_tensor(self, video_clip): + return [F.to_tensor(image) * 255. for image in video_clip] + + + def __call__(self, video_clip, target=None, normalize=True): + oh = video_clip[0].height + ow = video_clip[0].width + + # resize + video_clip = [img.resize([self.img_size, self.img_size]) for img in video_clip] + + # normalize target + if target is not None: + if normalize: + target[..., [0, 2]] /= ow + target[..., [1, 3]] /= oh + + else: + target = np.array([]) + + # to tensor + video_clip = self.to_tensor(video_clip) + target = torch.as_tensor(target).float() + + return video_clip, target + diff --git a/dataset/ucf24_demo/v_Basketball_g01_c02.mp4 b/dataset/ucf24_demo/v_Basketball_g01_c02.mp4 new file mode 100644 index 0000000..2347ac4 Binary files /dev/null and b/dataset/ucf24_demo/v_Basketball_g01_c02.mp4 differ diff --git a/dataset/ucf24_demo/v_Basketball_g07_c04.mp4 b/dataset/ucf24_demo/v_Basketball_g07_c04.mp4 new file mode 100644 index 0000000..f3376cd Binary files /dev/null and b/dataset/ucf24_demo/v_Basketball_g07_c04.mp4 differ diff --git a/dataset/ucf24_demo/v_Biking_g01_c01.mp4 b/dataset/ucf24_demo/v_Biking_g01_c01.mp4 new file mode 100644 index 0000000..5621f06 Binary files /dev/null and b/dataset/ucf24_demo/v_Biking_g01_c01.mp4 differ diff --git a/dataset/ucf24_demo/v_CliffDiving_g03_c01.mp4 b/dataset/ucf24_demo/v_CliffDiving_g03_c01.mp4 new file mode 100644 index 0000000..99b40d7 Binary files /dev/null and b/dataset/ucf24_demo/v_CliffDiving_g03_c01.mp4 differ diff --git a/dataset/ucf24_demo/v_Fencing_g01_c06.mp4 b/dataset/ucf24_demo/v_Fencing_g01_c06.mp4 new file mode 100644 index 0000000..fc915dc Binary files /dev/null and b/dataset/ucf24_demo/v_Fencing_g01_c06.mp4 differ diff --git a/dataset/ucf24_demo/v_HorseRiding_g01_c03.mp4 b/dataset/ucf24_demo/v_HorseRiding_g01_c03.mp4 new file mode 100644 index 0000000..841f3ef Binary files /dev/null and b/dataset/ucf24_demo/v_HorseRiding_g01_c03.mp4 differ diff --git a/dataset/ucf24_demo/v_IceDancing_g02_c05.mp4 b/dataset/ucf24_demo/v_IceDancing_g02_c05.mp4 new file mode 100644 index 0000000..38b254b Binary files /dev/null and b/dataset/ucf24_demo/v_IceDancing_g02_c05.mp4 differ diff --git a/dataset/ucf24_demo/v_SalsaSpin_g03_c01.mp4 b/dataset/ucf24_demo/v_SalsaSpin_g03_c01.mp4 new file mode 100644 index 0000000..e382fb1 Binary files /dev/null and b/dataset/ucf24_demo/v_SalsaSpin_g03_c01.mp4 differ diff --git a/dataset/ucf24_demo/v_SkateBoarding_g02_c01.mp4 b/dataset/ucf24_demo/v_SkateBoarding_g02_c01.mp4 new file mode 100644 index 0000000..d781e82 
Binary files /dev/null and b/dataset/ucf24_demo/v_SkateBoarding_g02_c01.mp4 differ diff --git a/dataset/ucf_jhmdb.py b/dataset/ucf_jhmdb.py new file mode 100644 index 0000000..1f724fa --- /dev/null +++ b/dataset/ucf_jhmdb.py @@ -0,0 +1,313 @@ +#!/usr/bin/python +# encoding: utf-8 + +import os +import random +import numpy as np +import glob + +import torch +from torch.utils.data import Dataset +from PIL import Image + + +# Dataset for UCF24 & JHMDB +class UCF_JHMDB_Dataset(Dataset): + def __init__(self, + data_root, + dataset='ucf24', + img_size=224, + transform=None, + is_train=False, + len_clip=16, + sampling_rate=1): + self.data_root = data_root + self.dataset = dataset + self.transform = transform + self.is_train = is_train + + self.img_size = img_size + self.len_clip = len_clip + self.sampling_rate = sampling_rate + + if self.is_train: + self.split_list = 'trainlist.txt' + else: + self.split_list = 'testlist.txt' + + # load data + with open(os.path.join(data_root, self.split_list), 'r') as file: + self.file_names = file.readlines() + self.num_samples = len(self.file_names) + + if dataset == 'ucf24': + self.num_classes = 24 + elif dataset == 'jhmdb21': + self.num_classes = 21 + + + def __len__(self): + return self.num_samples + + + def __getitem__(self, index): + # load a data + frame_idx, video_clip, target = self.pull_item(index) + + return frame_idx, video_clip, target + + + def pull_item(self, index): + """ load a data """ + assert index <= len(self), 'index range error' + image_path = self.file_names[index].rstrip() + + img_split = image_path.split('/') # ex. ['labels', 'Basketball', 'v_Basketball_g08_c01', '00070.txt'] + # image name + img_id = int(img_split[-1][:5]) + + # path to label + label_path = os.path.join(self.data_root, img_split[0], img_split[1], img_split[2], '{:05d}.txt'.format(img_id)) + + # image folder + img_folder = os.path.join(self.data_root, 'rgb-images', img_split[1], img_split[2]) + + # frame numbers + if self.dataset == 'ucf24': + max_num = len(os.listdir(img_folder)) + elif self.dataset == 'jhmdb21': + max_num = len(os.listdir(img_folder)) - 1 + + # sampling rate + if self.is_train: + d = random.randint(1, 2) + else: + d = self.sampling_rate + + # load images + video_clip = [] + for i in reversed(range(self.len_clip)): + # make it as a loop + img_id_temp = img_id - i * d + if img_id_temp < 1: + img_id_temp = 1 + elif img_id_temp > max_num: + img_id_temp = max_num + + # load a frame + if self.dataset == 'ucf24': + path_tmp = os.path.join(self.data_root, 'rgb-images', img_split[1], img_split[2] ,'{:05d}.jpg'.format(img_id_temp)) + elif self.dataset == 'jhmdb21': + path_tmp = os.path.join(self.data_root, 'rgb-images', img_split[1], img_split[2] ,'{:05d}.png'.format(img_id_temp)) + frame = Image.open(path_tmp).convert('RGB') + ow, oh = frame.width, frame.height + + video_clip.append(frame) + + frame_id = img_split[1] + '_' +img_split[2] + '_' + img_split[3] + + # load an annotation + if os.path.getsize(label_path): + target = np.loadtxt(label_path) + else: + target = None + + # [label, x1, y1, x2, y2] -> [x1, y1, x2, y2, label] + label = target[..., :1] + boxes = target[..., 1:] + target = np.concatenate([boxes, label], axis=-1).reshape(-1, 5) + + # transform + video_clip, target = self.transform(video_clip, target) + # List [T, 3, H, W] -> [3, T, H, W] + video_clip = torch.stack(video_clip, dim=1) + + # reformat target + target = { + 'boxes': target[:, :4].float(), # [N, 4] + 'labels': target[:, -1].long() - 1, # [N,] + 'orig_size': [ow, oh], + 
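+            # frame_id looks like 'Basketball_v_Basketball_g08_c01_00070.txt'; slicing
+            # off the last 10 characters ('_00070.txt') leaves the video identifier.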
'video_idx':frame_id[:-10] + } + + return frame_id, video_clip, target + + + def pull_anno(self, index): + """ load a data """ + assert index <= len(self), 'index range error' + image_path = self.file_names[index].rstrip() + + img_split = image_path.split('/') # ex. ['labels', 'Basketball', 'v_Basketball_g08_c01', '00070.txt'] + # image name + img_id = int(img_split[-1][:5]) + + # path to label + label_path = os.path.join(self.data_root, img_split[0], img_split[1], img_split[2], '{:05d}.txt'.format(img_id)) + + # load an annotation + target = np.loadtxt(label_path) + target = target.reshape(-1, 5) + + return target + + +# Video Dataset for UCF24 & JHMDB +class UCF_JHMDB_VIDEO_Dataset(Dataset): + def __init__(self, + data_root, + dataset='ucf24', + img_size=224, + transform=None, + len_clip=16, + sampling_rate=1): + self.data_root = data_root + self.dataset = dataset + self.transform = transform + + self.img_size = img_size + self.len_clip = len_clip + self.sampling_rate = sampling_rate + + if dataset == 'ucf24': + self.num_classes = 24 + elif dataset == 'jhmdb21': + self.num_classes = 21 + + + def set_video_data(self, line): + self.line = line + + # load a video + self.img_folder = os.path.join(self.data_root, 'rgb-images', self.line) + + if self.dataset == 'ucf24': + self.label_paths = sorted(glob.glob(os.path.join(self.img_folder, '*.jpg'))) + elif self.dataset == 'jhmdb21': + self.label_paths = sorted(glob.glob(os.path.join(self.img_folder, '*.png'))) + + + def __len__(self): + return len(self.label_paths) + + + def __getitem__(self, index): + return self.pull_item(index) + + + def pull_item(self, index): + image_path = self.label_paths[index] + + video_split = self.line.split('/') + video_class = video_split[0] + video_file = video_split[1] + # for windows: + # img_split = image_path.split('\\') # ex. [..., 'Basketball', 'v_Basketball_g08_c01', '00070.txt'] + # for linux + img_split = image_path.split('/') # ex. 
[..., 'Basketball', 'v_Basketball_g08_c01', '00070.txt'] + + # image name + img_id = int(img_split[-1][:5]) + max_num = len(os.listdir(self.img_folder)) + if self.dataset == 'ucf24': + img_name = os.path.join(video_class, video_file, '{:05d}.jpg'.format(img_id)) + elif self.dataset == 'jhmdb21': + img_name = os.path.join(video_class, video_file, '{:05d}.png'.format(img_id)) + + # load video clip + video_clip = [] + for i in reversed(range(self.len_clip)): + # make it as a loop + img_id_temp = img_id - i + if img_id_temp < 1: + img_id_temp = 1 + elif img_id_temp > max_num: + img_id_temp = max_num + + # load a frame + if self.dataset == 'ucf24': + path_tmp = os.path.join(self.data_root, 'rgb-images', video_class, video_file ,'{:05d}.jpg'.format(img_id_temp)) + elif self.dataset == 'jhmdb21': + path_tmp = os.path.join(self.data_root, 'rgb-images', video_class, video_file ,'{:05d}.png'.format(img_id_temp)) + frame = Image.open(path_tmp).convert('RGB') + ow, oh = frame.width, frame.height + + video_clip.append(frame) + + # transform + video_clip, _ = self.transform(video_clip, normalize=False) + # List [T, 3, H, W] -> [3, T, H, W] + video_clip = torch.stack(video_clip, dim=1) + orig_size = [ow, oh] # width, height + + target = {'orig_size': [ow, oh]} + + return img_name, video_clip, target + + + + +if __name__ == '__main__': + import cv2 + from transforms import Augmentation, BaseTransform + + data_root = 'D:/python_work/spatial-temporal_action_detection/dataset/ucf24' + dataset = 'ucf24' + is_train = True + img_size = 224 + len_clip = 16 + trans_config = { + 'jitter': 0.2, + 'hue': 0.1, + 'saturation': 1.5, + 'exposure': 1.5 + } + train_transform = Augmentation( + img_size=img_size, + jitter=trans_config['jitter'], + saturation=trans_config['saturation'], + exposure=trans_config['exposure'] + ) + val_transform = BaseTransform(img_size=img_size) + + train_dataset = UCF_JHMDB_Dataset( + data_root=data_root, + dataset=dataset, + img_size=img_size, + transform=train_transform, + is_train=is_train, + len_clip=len_clip, + sampling_rate=1 + ) + + print(len(train_dataset)) + std = trans_config['pixel_std'] + mean = trans_config['pixel_mean'] + for i in range(len(train_dataset)): + frame_id, video_clip, target = train_dataset[i] + key_frame = video_clip[:, -1, :, :] + + # to numpy + key_frame = key_frame.permute(1, 2, 0).numpy() + key_frame = key_frame.astype(np.uint8) + + # to BGR + key_frame = key_frame[..., (2, 1, 0)] + H, W, C = key_frame.shape + + key_frame = key_frame.copy() + bboxes = target['boxes'] + labels = target['labels'] + + for box, cls_id in zip(bboxes, labels): + x1, y1, x2, y2 = box + x1 = int(x1 * W) + y1 = int(y1 * H) + x2 = int(x2 * W) + y2 = int(y2 * H) + key_frame = cv2.rectangle(key_frame, (x1, y1), (x2, y2), (255, 0, 0)) + + # cv2 show + cv2.imshow('key frame', key_frame) + cv2.waitKey(0) + diff --git a/demo.py b/demo.py new file mode 100644 index 0000000..c7913f2 --- /dev/null +++ b/demo.py @@ -0,0 +1,271 @@ +import argparse +import cv2 +import os +import time +import numpy as np +import torch +import imageio +from PIL import Image + +from dataset.transforms import BaseTransform +from utils.misc import load_weight +from utils.box_ops import rescale_bboxes +from utils.vis_tools import vis_detection +from config import build_dataset_config, build_model_config +from models import build_model + + + +def parse_args(): + parser = argparse.ArgumentParser(description='YOWOv2 Demo') + + # basic + parser.add_argument('-size', '--img_size', default=224, type=int, + help='the size of 
input frame') + parser.add_argument('--show', action='store_true', default=False, + help='show the visulization results.') + parser.add_argument('--cuda', action='store_true', default=False, + help='use cuda.') + parser.add_argument('--save_folder', default='det_results/', type=str, + help='Dir to save results') + parser.add_argument('-vs', '--vis_thresh', default=0.3, type=float, + help='threshold for visualization') + parser.add_argument('--video', default='9Y_l9NsnYE0.mp4', type=str, + help='AVA video name.') + parser.add_argument('--gif', action='store_true', default=False, + help='generate gif.') + + # class label config + parser.add_argument('-d', '--dataset', default='ava_v2.2', + help='ava_v2.2') + parser.add_argument('--pose', action='store_true', default=False, + help='show 14 action pose of AVA.') + + # model + parser.add_argument('-v', '--version', default='yowo_v2_large', type=str, + help='build YOWOv2') + parser.add_argument('--weight', default=None, + type=str, help='Trained state_dict file path to open') + parser.add_argument('-ct', '--conf_thresh', default=0.1, type=float, + help='confidence threshold') + parser.add_argument('-nt', '--nms_thresh', default=0.5, type=float, + help='NMS threshold') + parser.add_argument('--topk', default=40, type=int, + help='NMS threshold') + parser.add_argument('-K', '--len_clip', default=16, type=int, + help='video clip length.') + parser.add_argument('-m', '--memory', action="store_true", default=False, + help="memory propagate.") + + return parser.parse_args() + + +def multi_hot_vis(args, frame, out_bboxes, orig_w, orig_h, class_names, act_pose=False): + # visualize detection results + for bbox in out_bboxes: + x1, y1, x2, y2 = bbox[:4] + if act_pose: + # only show 14 poses of AVA. + cls_conf = bbox[5:5+14] + else: + # show all actions of AVA. 
+ cls_conf = bbox[5:] + + # rescale bbox + x1, x2 = int(x1 * orig_w), int(x2 * orig_w) + y1, y2 = int(y1 * orig_h), int(y2 * orig_h) + + # score = obj * cls + det_conf = float(bbox[4]) + cls_scores = np.sqrt(det_conf * cls_conf) + + indices = np.where(cls_scores > args.vis_thresh) + scores = cls_scores[indices] + indices = list(indices[0]) + scores = list(scores) + + if len(scores) > 0: + # draw bbox + cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2) + # draw text + blk = np.zeros(frame.shape, np.uint8) + font = cv2.FONT_HERSHEY_SIMPLEX + coord = [] + text = [] + text_size = [] + + for _, cls_ind in enumerate(indices): + text.append("[{:.2f}] ".format(scores[_]) + str(class_names[cls_ind])) + text_size.append(cv2.getTextSize(text[-1], font, fontScale=0.5, thickness=1)[0]) + coord.append((x1+3, y1+14+20*_)) + cv2.rectangle(blk, (coord[-1][0]-1, coord[-1][1]-12), (coord[-1][0]+text_size[-1][0]+1, coord[-1][1]+text_size[-1][1]-4), (0, 255, 0), cv2.FILLED) + frame = cv2.addWeighted(frame, 1.0, blk, 0.5, 1) + for t in range(len(text)): + cv2.putText(frame, text[t], coord[t], font, 0.5, (0, 0, 0), 1) + + return frame + + +@torch.no_grad() +def detect(args, d_cfg, model, device, transform, class_names, class_colors): + # path to save + save_path = os.path.join(args.save_folder, 'demo', 'videos') + os.makedirs(save_path, exist_ok=True) + + # path to video + path_to_video = os.path.join(args.video) + + # video + video = cv2.VideoCapture(path_to_video) + fourcc = cv2.VideoWriter_fourcc(*'XVID') + save_size = (640, 480) + save_name = os.path.join(save_path, 'detection.avi') + fps = 20.0 + out = cv2.VideoWriter(save_name, fourcc, fps, save_size) + + # run + video_clip = [] + image_list = [] + while(True): + ret, frame = video.read() + + if ret: + # to RGB + frame_rgb = frame[..., (2, 1, 0)] + + # to PIL image + frame_pil = Image.fromarray(frame_rgb.astype(np.uint8)) + + # prepare + if len(video_clip) <= 0: + for _ in range(args.len_clip): + video_clip.append(frame_pil) + + video_clip.append(frame_pil) + del video_clip[0] + + # orig size + orig_h, orig_w = frame.shape[:2] + + # transform + x, _ = transform(video_clip) + # List [T, 3, H, W] -> [3, T, H, W] + x = torch.stack(x, dim=1) + x = x.unsqueeze(0).to(device) # [B, 3, T, H, W], B=1 + + t0 = time.time() + # inference + outputs = model(x) + print("inference time ", time.time() - t0, "s") + + # vis detection results + if args.dataset in ['ava_v2.2']: + batch_bboxes = outputs + # batch size = 1 + bboxes = batch_bboxes[0] + # multi hot + frame = multi_hot_vis( + args=args, + frame=frame, + out_bboxes=bboxes, + orig_w=orig_w, + orig_h=orig_h, + class_names=class_names, + act_pose=args.pose + ) + elif args.dataset in ['ucf24']: + batch_scores, batch_labels, batch_bboxes = outputs + # batch size = 1 + scores = batch_scores[0] + labels = batch_labels[0] + bboxes = batch_bboxes[0] + # rescale + bboxes = rescale_bboxes(bboxes, [orig_w, orig_h]) + # one hot + frame = vis_detection( + frame=frame, + scores=scores, + labels=labels, + bboxes=bboxes, + vis_thresh=args.vis_thresh, + class_names=class_names, + class_colors=class_colors + ) + # save + frame_resized = cv2.resize(frame, save_size) + out.write(frame_resized) + + if args.gif: + gif_resized = cv2.resize(frame, (160, 120)) + gif_resized_rgb = gif_resized[..., (2, 1, 0)] + image_list.append(gif_resized_rgb) + + if args.show: + # show + cv2.imshow('key-frame detection', frame) + cv2.waitKey(1) + + else: + break + + video.release() + out.release() + cv2.destroyAllWindows() + + # generate GIF + if 
args.gif: + save_name = os.path.join(save_path, 'detect.gif') + print('generating GIF ...') + imageio.mimsave(save_name, image_list, fps=fps) + print('GIF done: {}'.format(save_name)) + + +if __name__ == '__main__': + np.random.seed(100) + args = parse_args() + + # cuda + if args.cuda: + print('use cuda') + device = torch.device("cuda") + else: + device = torch.device("cpu") + + # config + d_cfg = build_dataset_config(args) + m_cfg = build_model_config(args) + + class_names = d_cfg['label_map'] + num_classes = d_cfg['valid_num_classes'] + + class_colors = [(np.random.randint(255), + np.random.randint(255), + np.random.randint(255)) for _ in range(num_classes)] + + # transform + basetransform = BaseTransform(img_size=args.img_size) + + # build model + model, _ = build_model( + args=args, + d_cfg=d_cfg, + m_cfg=m_cfg, + device=device, + num_classes=num_classes, + trainable=False + ) + + # load trained weight + model = load_weight(model=model, path_to_ckpt=args.weight) + + # to eval + model = model.to(device).eval() + + # run + detect(args=args, + d_cfg=d_cfg, + model=model, + device=device, + transform=basetransform, + class_names=class_names, + class_colors=class_colors) diff --git a/eval.py b/eval.py new file mode 100644 index 0000000..323c582 --- /dev/null +++ b/eval.py @@ -0,0 +1,183 @@ +import argparse +import torch +import os + +from evaluator.ucf_jhmdb_evaluator import UCF_JHMDB_Evaluator +from evaluator.ava_evaluator import AVA_Evaluator + +from dataset.transforms import BaseTransform + +from utils.misc import load_weight, CollateFunc + +from config import build_dataset_config, build_model_config +from models import build_model + + +def parse_args(): + parser = argparse.ArgumentParser(description='YOWOv2') + + # basic + parser.add_argument('-bs', '--batch_size', default=8, type=int, + help='test batch size') + parser.add_argument('-size', '--img_size', default=224, type=int, + help='the size of input frame') + parser.add_argument('--cuda', action='store_true', default=False, + help='use cuda.') + parser.add_argument('--save_path', default='./evaluator/eval_results/', + type=str, help='Trained state_dict file path to open') + + # dataset + parser.add_argument('-d', '--dataset', default='ucf24', + help='ucf24, jhmdb, ava_v2.2.') + parser.add_argument('--root', default='/mnt/share/ssd2/dataset/STAD/', + help='data root') + + # eval + parser.add_argument('--cal_frame_mAP', action='store_true', default=False, + help='calculate frame mAP.') + parser.add_argument('--cal_video_mAP', action='store_true', default=False, + help='calculate video mAP.') + + # model + parser.add_argument('-v', '--version', default='yowo_v2_large', type=str, + help='build YOWOv2') + parser.add_argument('--weight', default=None, + type=str, help='Trained state_dict file path to open') + parser.add_argument('-ct', '--conf_thresh', default=0.1, type=float, + help='confidence threshold. We suggest 0.005 for UCF24 and 0.1 for AVA.') + parser.add_argument('-nt', '--nms_thresh', default=0.5, type=float, + help='NMS threshold. 
We suggest 0.5 for UCF24 and AVA.') + parser.add_argument('--topk', default=40, type=int, + help='topk prediction candidates.') + parser.add_argument('-K', '--len_clip', default=16, type=int, + help='video clip length.') + parser.add_argument('-m', '--memory', action="store_true", default=False, + help="memory propagate.") + + return parser.parse_args() + + +def ucf_jhmdb_eval(args, d_cfg, model, transform, collate_fn): + data_dir = os.path.join(args.root, 'ucf24') + if args.cal_frame_mAP: + # Frame mAP evaluator + evaluator = UCF_JHMDB_Evaluator( + data_root=data_dir, + dataset=args.dataset, + model_name=args.version, + metric='fmap', + img_size=args.img_size, + len_clip=args.len_clip, + batch_size=args.batch_size, + conf_thresh=0.01, + iou_thresh=0.5, + transform=transform, + collate_fn=collate_fn, + gt_folder=d_cfg['gt_folder'], + save_path=args.save_path + ) + # evaluate + evaluator.evaluate_frame_map(model, show_pr_curve=True) + + elif args.cal_video_mAP: + # Video mAP evaluator + evaluator = UCF_JHMDB_Evaluator( + data_root=data_dir, + dataset=args.dataset, + model_name=args.version, + metric='vmap', + img_size=args.img_size, + len_clip=args.len_clip, + batch_size=args.batch_size, + conf_thresh=0.01, + iou_thresh=0.5, + transform=transform, + collate_fn=collate_fn, + gt_folder=d_cfg['gt_folder'], + save_path=args.save_path + ) + # evaluate + evaluator.evaluate_video_map(model) + + +def ava_eval(args, d_cfg, model, transform, collate_fn): + data_dir = os.path.join(args.root, 'AVA_Dataset') + evaluator = AVA_Evaluator( + d_cfg=d_cfg, + data_root=data_dir, + img_size=args.img_size, + len_clip=args.len_clip, + sampling_rate=d_cfg['sampling_rate'], + batch_size=args.batch_size, + transform=transform, + collate_fn=collate_fn, + full_test_on_val=False, + version='v2.2') + + mAP = evaluator.evaluate_frame_map(model) + + +if __name__ == '__main__': + args = parse_args() + # dataset + if args.dataset == 'ucf24': + num_classes = 24 + + elif args.dataset == 'jhmdb': + num_classes = 21 + + elif args.dataset == 'ava_v2.2': + num_classes = 80 + + else: + print('unknow dataset.') + exit(0) + + # cuda + if args.cuda: + print('use cuda') + device = torch.device("cuda") + else: + device = torch.device("cpu") + + # config + d_cfg = build_dataset_config(args) + m_cfg = build_model_config(args) + + + # build model + model, _ = build_model( + args=args, + d_cfg=d_cfg, + m_cfg=m_cfg, + device=device, + num_classes=num_classes, + trainable=False + ) + + # load trained weight + model = load_weight(model=model, path_to_ckpt=args.weight) + + # to eval + model = model.to(device).eval() + + # transform + basetransform = BaseTransform(img_size=args.img_size) + + # run + if args.dataset in ['ucf24', 'jhmdb21']: + ucf_jhmdb_eval( + args=args, + d_cfg=d_cfg, + model=model, + transform=basetransform, + collate_fn=CollateFunc() + ) + elif args.dataset == 'ava_v2.2': + ava_eval( + args=args, + d_cfg=d_cfg, + model=model, + transform=basetransform, + collate_fn=CollateFunc() + ) diff --git a/evaluator/__init__.py b/evaluator/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/evaluator/ava_eval_helper.py b/evaluator/ava_eval_helper.py new file mode 100644 index 0000000..6fa5ff2 --- /dev/null +++ b/evaluator/ava_eval_helper.py @@ -0,0 +1,308 @@ +# Copyright (c) Facebook, Inc. and its affiliates. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +############################################################################## +# +# Based on: +# -------------------------------------------------------- +# ActivityNet +# Copyright (c) 2015 ActivityNet +# Licensed under The MIT License +# [see https://github.com/activitynet/ActivityNet/blob/master/LICENSE for details] +# -------------------------------------------------------- + +"""Helper functions for AVA evaluation.""" + +from __future__ import ( + absolute_import, + division, + print_function, + unicode_literals, +) +import csv +import logging +import numpy as np +import pprint +import time +from collections import defaultdict + +from .ava_evaluation import ( + object_detection_evaluation, + standard_fields, +) + +logger = logging.getLogger(__name__) + + +def make_image_key(video_id, timestamp): + """Returns a unique identifier for a video id & timestamp.""" + return "%s,%04d" % (video_id, int(timestamp)) + + +def read_csv(csv_file, class_whitelist=None, load_score=False): + """Loads boxes and class labels from a CSV file in the AVA format. + CSV file format described at https://research.google.com/ava/download.html. + Args: + csv_file: A file object. + class_whitelist: If provided, boxes corresponding to (integer) class labels + not in this set are skipped. + Returns: + boxes: A dictionary mapping each unique image key (string) to a list of + boxes, given as coordinates [y1, x1, y2, x2]. + labels: A dictionary mapping each unique image key (string) to a list of + integer class lables, matching the corresponding box in `boxes`. + scores: A dictionary mapping each unique image key (string) to a list of + score values lables, matching the corresponding label in `labels`. If + scores are not provided in the csv, then they will default to 1.0. + """ + boxes = defaultdict(list) + labels = defaultdict(list) + scores = defaultdict(list) + with open(csv_file, "r") as f: + reader = csv.reader(f) + for row in reader: + assert len(row) in [7, 8], "Wrong number of columns: " + row + image_key = make_image_key(row[0], row[1]) + x1, y1, x2, y2 = [float(n) for n in row[2:6]] + action_id = int(row[6]) + if class_whitelist and action_id not in class_whitelist: + continue + score = 1.0 + if load_score: + score = float(row[7]) + boxes[image_key].append([y1, x1, y2, x2]) + labels[image_key].append(action_id) + scores[image_key].append(score) + return boxes, labels, scores + + +def read_exclusions(exclusions_file): + """Reads a CSV file of excluded timestamps. + Args: + exclusions_file: A file object containing a csv of video-id,timestamp. + Returns: + A set of strings containing excluded image keys, e.g. "aaaaaaaaaaa,0904", + or an empty set if exclusions file is None. 
+ """ + excluded = set() + if exclusions_file: + with open(exclusions_file, "r") as f: + reader = csv.reader(f) + for row in reader: + assert len(row) == 2, "Expected only 2 columns, got: " + row + excluded.add(make_image_key(row[0], row[1])) + return excluded + + +def read_labelmap(labelmap_file): + """Read label map and class ids.""" + + labelmap = [] + class_ids = set() + name = "" + class_id = "" + with open(labelmap_file, "r") as f: + for line in f: + if line.startswith(" name:"): + name = line.split('"')[1] + elif line.startswith(" id:") or line.startswith(" label_id:"): + class_id = int(line.strip().split(" ")[-1]) + labelmap.append({"id": class_id, "name": name}) + class_ids.add(class_id) + return labelmap, class_ids + + +def evaluate_ava_from_files(labelmap, groundtruth, detections, exclusions): + """Run AVA evaluation given annotation/prediction files.""" + + categories, class_whitelist = read_labelmap(labelmap) + excluded_keys = read_exclusions(exclusions) + groundtruth = read_csv(groundtruth, class_whitelist, load_score=False) + detections = read_csv(detections, class_whitelist, load_score=True) + run_evaluation(categories, groundtruth, detections, excluded_keys) + + +def evaluate_ava( + preds, + original_boxes, + metadata, + excluded_keys, + class_whitelist, + categories, + groundtruth=None, + video_idx_to_name=None, + name="latest", +): + """Run AVA evaluation given numpy arrays.""" + + eval_start = time.time() + + detections = get_ava_eval_data( + preds, + original_boxes, + metadata, + class_whitelist, + video_idx_to_name=video_idx_to_name, + ) + + logger.info("Evaluating with %d unique GT frames." % len(groundtruth[0])) + logger.info( + "Evaluating with %d unique detection frames" % len(detections[0]) + ) + + write_results(detections, "detections_%s.csv" % name) + write_results(groundtruth, "groundtruth_%s.csv" % name) + + results = run_evaluation(categories, groundtruth, detections, excluded_keys) + + logger.info("AVA eval done in %f seconds." % (time.time() - eval_start)) + return results["PascalBoxes_Precision/mAP@0.5IOU"] + + +def run_evaluation( + categories, groundtruth, detections, excluded_keys, verbose=True +): + """AVA evaluation main logic.""" + + pascal_evaluator = object_detection_evaluation.PascalDetectionEvaluator( + categories + ) + + boxes, labels, _ = groundtruth + + gt_keys = [] + pred_keys = [] + + for image_key in boxes: + if image_key in excluded_keys: + logging.info( + ( + "Found excluded timestamp in ground truth: %s. " + "It will be ignored." + ), + image_key, + ) + continue + pascal_evaluator.add_single_ground_truth_image_info( + image_key, + { + standard_fields.InputDataFields.groundtruth_boxes: np.array( + boxes[image_key], dtype=float + ), + standard_fields.InputDataFields.groundtruth_classes: np.array( + labels[image_key], dtype=int + ), + standard_fields.InputDataFields.groundtruth_difficult: np.zeros( + len(boxes[image_key]), dtype=bool + ), + }, + ) + + gt_keys.append(image_key) + + '''detections format + boxes: dict, {',': [box1, box2,...(each box_i is normalized x1y1x2y2)]} + labels: dict, {',': [cls_id(1 based), ...]} + scores: dict, {',': [score...]} + each box_i corresponds to 60 classes (classwhite list otherwise should be 80) and 60 scores + ''' + boxes, labels, scores = detections + + for image_key in boxes: + if image_key in excluded_keys: + logging.info( + ( + "Found excluded timestamp in detections: %s. " + "It will be ignored." 
+ ), + image_key, + ) + continue + pascal_evaluator.add_single_detected_image_info( + image_key, + { + standard_fields.DetectionResultFields.detection_boxes: np.array( + boxes[image_key], dtype=float + ), + standard_fields.DetectionResultFields.detection_classes: np.array( + labels[image_key], dtype=int + ), + standard_fields.DetectionResultFields.detection_scores: np.array( + scores[image_key], dtype=float + ), + }, + ) + + pred_keys.append(image_key) + + metrics = pascal_evaluator.evaluate() + + pprint.pprint(metrics, indent=2) + return metrics + + +def get_ava_eval_data( + scores, + boxes, + metadata, + class_whitelist, + verbose=False, + video_idx_to_name=None, +): + """ + Convert our data format into the data format used in official AVA + evaluation. + """ + + out_scores = defaultdict(list) + out_labels = defaultdict(list) + out_boxes = defaultdict(list) + count = 0 + for i in range(scores.shape[0]): + video_idx = int(np.round(metadata[i][0])) + sec = int(np.round(metadata[i][1])) + + video = video_idx_to_name[video_idx] + + key = video + "," + "%04d" % (sec) + batch_box = boxes[i].tolist() # [batch_idx, x1, y1, x2, y2] + # The first is batch idx. + batch_box = [batch_box[j] for j in [0, 2, 1, 4, 3]] # [batch_idx, y1, x1, y2, x2] + # here use this order is because below writing csv it use again use right order + + one_scores = scores[i].tolist() + for cls_idx, score in enumerate(one_scores): + if cls_idx + 1 in class_whitelist: + out_scores[key].append(score) + out_labels[key].append(cls_idx + 1) + out_boxes[key].append(batch_box[1:]) + count += 1 + + return out_boxes, out_labels, out_scores + + +def write_results(detections, filename): + """Write prediction results into official formats.""" + start = time.time() + + boxes, labels, scores = detections + with open(filename, "w") as f: + for key in boxes.keys(): + for box, label, score in zip(boxes[key], labels[key], scores[key]): + f.write( + "%s,%.03f,%.03f,%.03f,%.03f,%d,%.04f\n" + % (key, box[0], box[1], box[2], box[3], label, score) + ) + + logger.info("AVA results wrote to %s" % filename) + logger.info("\ttook %d seconds." % (time.time() - start)) diff --git a/evaluator/ava_evaluation/README.md b/evaluator/ava_evaluation/README.md new file mode 100644 index 0000000..3fe4626 --- /dev/null +++ b/evaluator/ava_evaluation/README.md @@ -0,0 +1 @@ +The code under this folder is from the official [ActivityNet repo](https://github.com/activitynet/ActivityNet). diff --git a/evaluator/ava_evaluation/__init__.py b/evaluator/ava_evaluation/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/evaluator/ava_evaluation/label_map_util.py b/evaluator/ava_evaluation/label_map_util.py new file mode 100644 index 0000000..bd5e62d --- /dev/null +++ b/evaluator/ava_evaluation/label_map_util.py @@ -0,0 +1,187 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================== +"""Label map utility functions.""" + +from __future__ import ( + absolute_import, + division, + print_function, + unicode_literals, +) +import logging + +# from google.protobuf import text_format +# from google3.third_party.tensorflow_models.object_detection.protos import string_int_label_map_pb2 + + +def _validate_label_map(label_map): + """Checks if a label map is valid. + + Args: + label_map: StringIntLabelMap to validate. + + Raises: + ValueError: if label map is invalid. + """ + for item in label_map.item: + if item.id < 1: + raise ValueError("Label map ids should be >= 1.") + + +def create_category_index(categories): + """Creates dictionary of COCO compatible categories keyed by category id. + + Args: + categories: a list of dicts, each of which has the following keys: + 'id': (required) an integer id uniquely identifying this category. + 'name': (required) string representing category name + e.g., 'cat', 'dog', 'pizza'. + + Returns: + category_index: a dict containing the same entries as categories, but keyed + by the 'id' field of each category. + """ + category_index = {} + for cat in categories: + category_index[cat["id"]] = cat + return category_index + + +def get_max_label_map_index(label_map): + """Get maximum index in label map. + + Args: + label_map: a StringIntLabelMapProto + + Returns: + an integer + """ + return max([item.id for item in label_map.item]) + + +def convert_label_map_to_categories( + label_map, max_num_classes, use_display_name=True +): + """Loads label map proto and returns categories list compatible with eval. + + This function loads a label map and returns a list of dicts, each of which + has the following keys: + 'id': (required) an integer id uniquely identifying this category. + 'name': (required) string representing category name + e.g., 'cat', 'dog', 'pizza'. + We only allow class into the list if its id-label_id_offset is + between 0 (inclusive) and max_num_classes (exclusive). + If there are several items mapping to the same id in the label map, + we will only keep the first one in the categories list. + + Args: + label_map: a StringIntLabelMapProto or None. If None, a default categories + list is created with max_num_classes categories. + max_num_classes: maximum number of (consecutive) label indices to include. + use_display_name: (boolean) choose whether to load 'display_name' field + as category name. If False or if the display_name field does not exist, + uses 'name' field as category names instead. + Returns: + categories: a list of dictionaries representing all possible categories. + """ + categories = [] + list_of_ids_already_added = [] + if not label_map: + label_id_offset = 1 + for class_id in range(max_num_classes): + categories.append( + { + "id": class_id + label_id_offset, + "name": "category_{}".format(class_id + label_id_offset), + } + ) + return categories + for item in label_map.item: + if not 0 < item.id <= max_num_classes: + logging.info( + "Ignore item %d since it falls outside of requested " + "label range.", + item.id, + ) + continue + if use_display_name and item.HasField("display_name"): + name = item.display_name + else: + name = item.name + if item.id not in list_of_ids_already_added: + list_of_ids_already_added.append(item.id) + categories.append({"id": item.id, "name": name}) + return categories + + +def load_labelmap(path): + """Loads label map proto. + + Args: + path: path to StringIntLabelMap proto text file. 
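+
+    Note: in this copy of the file the `text_format` and
+    `string_int_label_map_pb2` imports are commented out above, so calling
+    this helper as-is raises a NameError; the AVA evaluation path in this
+    diff appears to read its label map via `ava_eval_helper.read_labelmap`
+    instead.
+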
+ Returns: + a StringIntLabelMapProto + """ + with open(path, "r") as fid: + label_map_string = fid.read() + label_map = string_int_label_map_pb2.StringIntLabelMap() + try: + text_format.Merge(label_map_string, label_map) + except text_format.ParseError: + label_map.ParseFromString(label_map_string) + _validate_label_map(label_map) + return label_map + + +def get_label_map_dict(label_map_path, use_display_name=False): + """Reads a label map and returns a dictionary of label names to id. + + Args: + label_map_path: path to label_map. + use_display_name: whether to use the label map items' display names as keys. + + Returns: + A dictionary mapping label names to id. + """ + label_map = load_labelmap(label_map_path) + label_map_dict = {} + for item in label_map.item: + if use_display_name: + label_map_dict[item.display_name] = item.id + else: + label_map_dict[item.name] = item.id + return label_map_dict + + +def create_category_index_from_labelmap(label_map_path): + """Reads a label map and returns a category index. + + Args: + label_map_path: Path to `StringIntLabelMap` proto text file. + + Returns: + A category index, which is a dictionary that maps integer ids to dicts + containing categories, e.g. + {1: {'id': 1, 'name': 'dog'}, 2: {'id': 2, 'name': 'cat'}, ...} + """ + label_map = load_labelmap(label_map_path) + max_num_classes = max(item.id for item in label_map.item) + categories = convert_label_map_to_categories(label_map, max_num_classes) + return create_category_index(categories) + + +def create_class_agnostic_category_index(): + """Creates a category index with a single `object` class.""" + return {1: {"id": 1, "name": "object"}} diff --git a/evaluator/ava_evaluation/metrics.py b/evaluator/ava_evaluation/metrics.py new file mode 100644 index 0000000..0094216 --- /dev/null +++ b/evaluator/ava_evaluation/metrics.py @@ -0,0 +1,154 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +"""Functions for computing metrics like precision, recall, CorLoc and etc.""" +from __future__ import division +import numpy as np + + +def compute_precision_recall(scores, labels, num_gt): + """Compute precision and recall. + + Args: + scores: A float numpy array representing detection score + labels: A boolean numpy array representing true/false positive labels + num_gt: Number of ground truth instances + + Raises: + ValueError: if the input is not of the correct format + + Returns: + precision: Fraction of positive instances over detected ones. This value is + None if no ground truth labels are present. + recall: Fraction of detected positive instance over all positive instances. + This value is None if no ground truth labels are present. 
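+
+    Example (illustrative values only):
+
+        scores = np.array([0.9, 0.8, 0.3])
+        labels = np.array([True, False, True])  # two true positives
+        precision, recall = compute_precision_recall(scores, labels, num_gt=2)
+        # precision ~ [1.0, 0.5, 0.67]; recall == [0.5, 0.5, 1.0]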
+ + """ + if ( + not isinstance(labels, np.ndarray) + or labels.dtype != np.bool + or len(labels.shape) != 1 + ): + raise ValueError("labels must be single dimension bool numpy array") + + if not isinstance(scores, np.ndarray) or len(scores.shape) != 1: + raise ValueError("scores must be single dimension numpy array") + + if num_gt < np.sum(labels): + raise ValueError( + "Number of true positives must be smaller than num_gt." + ) + + if len(scores) != len(labels): + raise ValueError("scores and labels must be of the same size.") + + if num_gt == 0: + return None, None + + sorted_indices = np.argsort(scores) + sorted_indices = sorted_indices[::-1] + labels = labels.astype(int) + true_positive_labels = labels[sorted_indices] + false_positive_labels = 1 - true_positive_labels + cum_true_positives = np.cumsum(true_positive_labels) + cum_false_positives = np.cumsum(false_positive_labels) + precision = cum_true_positives.astype(float) / ( + cum_true_positives + cum_false_positives + ) + recall = cum_true_positives.astype(float) / num_gt + return precision, recall + + +def compute_average_precision(precision, recall): + """Compute Average Precision according to the definition in VOCdevkit. + + Precision is modified to ensure that it does not decrease as recall + decrease. + + Args: + precision: A float [N, 1] numpy array of precisions + recall: A float [N, 1] numpy array of recalls + + Raises: + ValueError: if the input is not of the correct format + + Returns: + average_precison: The area under the precision recall curve. NaN if + precision and recall are None. + + """ + if precision is None: + if recall is not None: + raise ValueError("If precision is None, recall must also be None") + return np.NAN + + if not isinstance(precision, np.ndarray) or not isinstance( + recall, np.ndarray + ): + raise ValueError("precision and recall must be numpy array") + if precision.dtype != np.float or recall.dtype != np.float: + raise ValueError("input must be float numpy array.") + if len(precision) != len(recall): + raise ValueError("precision and recall must be of the same size.") + if not precision.size: + return 0.0 + if np.amin(precision) < 0 or np.amax(precision) > 1: + raise ValueError("Precision must be in the range of [0, 1].") + if np.amin(recall) < 0 or np.amax(recall) > 1: + raise ValueError("recall must be in the range of [0, 1].") + if not all(recall[i] <= recall[i + 1] for i in range(len(recall) - 1)): + raise ValueError("recall must be a non-decreasing array") + + recall = np.concatenate([[0], recall, [1]]) + precision = np.concatenate([[0], precision, [0]]) + + # Preprocess precision to be a non-decreasing array + for i in range(len(precision) - 2, -1, -1): + precision[i] = np.maximum(precision[i], precision[i + 1]) + + indices = np.where(recall[1:] != recall[:-1])[0] + 1 + average_precision = np.sum( + (recall[indices] - recall[indices - 1]) * precision[indices] + ) + return average_precision + + +def compute_cor_loc( + num_gt_imgs_per_class, num_images_correctly_detected_per_class +): + """Compute CorLoc according to the definition in the following paper. + + https://www.robots.ox.ac.uk/~vgg/rg/papers/deselaers-eccv10.pdf + + Returns nans if there are no ground truth images for a class. 
+ + Args: + num_gt_imgs_per_class: 1D array, representing number of images containing + at least one object instance of a particular class + num_images_correctly_detected_per_class: 1D array, representing number of + images that are correctly detected at least one object instance of a + particular class + + Returns: + corloc_per_class: A float numpy array represents the corloc score of each + class + """ + # Divide by zero expected for classes with no gt examples. + with np.errstate(divide="ignore", invalid="ignore"): + return np.where( + num_gt_imgs_per_class == 0, + np.nan, + num_images_correctly_detected_per_class / num_gt_imgs_per_class, + ) diff --git a/evaluator/ava_evaluation/np_box_list.py b/evaluator/ava_evaluation/np_box_list.py new file mode 100644 index 0000000..18744b0 --- /dev/null +++ b/evaluator/ava_evaluation/np_box_list.py @@ -0,0 +1,143 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +"""Numpy BoxList classes and functions.""" + +from __future__ import ( + absolute_import, + division, + print_function, + unicode_literals, +) +import numpy as np + + +class BoxList(object): + """Box collection. + + BoxList represents a list of bounding boxes as numpy array, where each + bounding box is represented as a row of 4 numbers, + [y_min, x_min, y_max, x_max]. It is assumed that all bounding boxes within a + given list correspond to a single image. + + Optionally, users can add additional related fields (such as + objectness/classification scores). + """ + + def __init__(self, data): + """Constructs box collection. + + Args: + data: a numpy array of shape [N, 4] representing box coordinates + + Raises: + ValueError: if bbox data is not a numpy array + ValueError: if invalid dimensions for bbox data + """ + if not isinstance(data, np.ndarray): + raise ValueError("data must be a numpy array.") + if len(data.shape) != 2 or data.shape[1] != 4: + raise ValueError("Invalid dimensions for box data.") + if data.dtype != np.float32 and data.dtype != np.float64: + raise ValueError( + "Invalid data type for box data: float is required." + ) + if not self._is_valid_boxes(data): + raise ValueError( + "Invalid box data. data must be a numpy array of " + "N*[y_min, x_min, y_max, x_max]" + ) + self.data = {"boxes": data} + + def num_boxes(self): + """Return number of boxes held in collections.""" + return self.data["boxes"].shape[0] + + def get_extra_fields(self): + """Return all non-box fields.""" + return [k for k in self.data.keys() if k != "boxes"] + + def has_field(self, field): + return field in self.data + + def add_field(self, field, field_data): + """Add data to a specified field. + + Args: + field: a string parameter used to speficy a related field to be accessed. + field_data: a numpy array of [N, ...] representing the data associated + with the field. 
+ Raises: + ValueError: if the field is already exist or the dimension of the field + data does not matches the number of boxes. + """ + if self.has_field(field): + raise ValueError("Field " + field + "already exists") + if len(field_data.shape) < 1 or field_data.shape[0] != self.num_boxes(): + raise ValueError("Invalid dimensions for field data") + self.data[field] = field_data + + def get(self): + """Convenience function for accesssing box coordinates. + + Returns: + a numpy array of shape [N, 4] representing box corners + """ + return self.get_field("boxes") + + def get_field(self, field): + """Accesses data associated with the specified field in the box collection. + + Args: + field: a string parameter used to speficy a related field to be accessed. + + Returns: + a numpy 1-d array representing data of an associated field + + Raises: + ValueError: if invalid field + """ + if not self.has_field(field): + raise ValueError("field {} does not exist".format(field)) + return self.data[field] + + def get_coordinates(self): + """Get corner coordinates of boxes. + + Returns: + a list of 4 1-d numpy arrays [y_min, x_min, y_max, x_max] + """ + box_coordinates = self.get() + y_min = box_coordinates[:, 0] + x_min = box_coordinates[:, 1] + y_max = box_coordinates[:, 2] + x_max = box_coordinates[:, 3] + return [y_min, x_min, y_max, x_max] + + def _is_valid_boxes(self, data): + """Check whether data fullfills the format of N*[ymin, xmin, ymax, xmin]. + + Args: + data: a numpy array of shape [N, 4] representing box coordinates + + Returns: + a boolean indicating whether all ymax of boxes are equal or greater than + ymin, and all xmax of boxes are equal or greater than xmin. + """ + if data.shape[0] > 0: + for i in range(data.shape[0]): + if data[i, 0] > data[i, 2] or data[i, 1] > data[i, 3]: + return False + return True diff --git a/evaluator/ava_evaluation/np_box_list_ops.py b/evaluator/ava_evaluation/np_box_list_ops.py new file mode 100644 index 0000000..6a85b29 --- /dev/null +++ b/evaluator/ava_evaluation/np_box_list_ops.py @@ -0,0 +1,593 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +"""Bounding Box List operations for Numpy BoxLists. + +Example box operations that are supported: + * Areas: compute bounding box areas + * IOU: pairwise intersection-over-union scores +""" +from __future__ import ( + absolute_import, + division, + print_function, + unicode_literals, +) +import numpy as np + +from . import np_box_list, np_box_ops + + +class SortOrder(object): + """Enum class for sort order. + + Attributes: + ascend: ascend order. + descend: descend order. + """ + + ASCEND = 1 + DESCEND = 2 + + +def area(boxlist): + """Computes area of boxes. 
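+
+    Each area is (y_max - y_min) * (x_max - x_min), so boxes given in
+    normalized [0, 1] coordinates yield normalized areas.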
+ + Args: + boxlist: BoxList holding N boxes + + Returns: + a numpy array with shape [N*1] representing box areas + """ + y_min, x_min, y_max, x_max = boxlist.get_coordinates() + return (y_max - y_min) * (x_max - x_min) + + +def intersection(boxlist1, boxlist2): + """Compute pairwise intersection areas between boxes. + + Args: + boxlist1: BoxList holding N boxes + boxlist2: BoxList holding M boxes + + Returns: + a numpy array with shape [N*M] representing pairwise intersection area + """ + return np_box_ops.intersection(boxlist1.get(), boxlist2.get()) + + +def iou(boxlist1, boxlist2): + """Computes pairwise intersection-over-union between box collections. + + Args: + boxlist1: BoxList holding N boxes + boxlist2: BoxList holding M boxes + + Returns: + a numpy array with shape [N, M] representing pairwise iou scores. + """ + return np_box_ops.iou(boxlist1.get(), boxlist2.get()) + + +def ioa(boxlist1, boxlist2): + """Computes pairwise intersection-over-area between box collections. + + Intersection-over-area (ioa) between two boxes box1 and box2 is defined as + their intersection area over box2's area. Note that ioa is not symmetric, + that is, IOA(box1, box2) != IOA(box2, box1). + + Args: + boxlist1: BoxList holding N boxes + boxlist2: BoxList holding M boxes + + Returns: + a numpy array with shape [N, M] representing pairwise ioa scores. + """ + return np_box_ops.ioa(boxlist1.get(), boxlist2.get()) + + +def gather(boxlist, indices, fields=None): + """Gather boxes from BoxList according to indices and return new BoxList. + + By default, gather returns boxes corresponding to the input index list, as + well as all additional fields stored in the boxlist (indexing into the + first dimension). However one can optionally only gather from a + subset of fields. + + Args: + boxlist: BoxList holding N boxes + indices: a 1-d numpy array of type int_ + fields: (optional) list of fields to also gather from. If None (default), + all fields are gathered from. Pass an empty fields list to only gather + the box coordinates. + + Returns: + subboxlist: a BoxList corresponding to the subset of the input BoxList + specified by indices + + Raises: + ValueError: if specified field is not contained in boxlist or if the + indices are not of type int_ + """ + if indices.size: + if np.amax(indices) >= boxlist.num_boxes() or np.amin(indices) < 0: + raise ValueError("indices are out of valid range.") + subboxlist = np_box_list.BoxList(boxlist.get()[indices, :]) + if fields is None: + fields = boxlist.get_extra_fields() + for field in fields: + extra_field_data = boxlist.get_field(field) + subboxlist.add_field(field, extra_field_data[indices, ...]) + return subboxlist + + +def sort_by_field(boxlist, field, order=SortOrder.DESCEND): + """Sort boxes and associated fields according to a scalar field. + + A common use case is reordering the boxes according to descending scores. + + Args: + boxlist: BoxList holding N boxes. + field: A BoxList field for sorting and reordering the BoxList. + order: (Optional) 'descend' or 'ascend'. Default is descend. + + Returns: + sorted_boxlist: A sorted BoxList with the field in the specified order. + + Raises: + ValueError: if specified field does not exist or is not of single dimension. + ValueError: if the order is not either descend or ascend. 
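+
+    Example (illustrative; assumes `boxlist` already has a 'scores' field):
+
+        ranked = sort_by_field(boxlist, "scores", order=SortOrder.DESCEND)
+        best = ranked.get_field("scores")[0]  # highest score first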
+ """ + if not boxlist.has_field(field): + raise ValueError("Field " + field + " does not exist") + if len(boxlist.get_field(field).shape) != 1: + raise ValueError("Field " + field + "should be single dimension.") + if order != SortOrder.DESCEND and order != SortOrder.ASCEND: + raise ValueError("Invalid sort order") + + field_to_sort = boxlist.get_field(field) + sorted_indices = np.argsort(field_to_sort) + if order == SortOrder.DESCEND: + sorted_indices = sorted_indices[::-1] + return gather(boxlist, sorted_indices) + + +def non_max_suppression( + boxlist, max_output_size=10000, iou_threshold=1.0, score_threshold=-10.0 +): + """Non maximum suppression. + + This op greedily selects a subset of detection bounding boxes, pruning + away boxes that have high IOU (intersection over union) overlap (> thresh) + with already selected boxes. In each iteration, the detected bounding box with + highest score in the available pool is selected. + + Args: + boxlist: BoxList holding N boxes. Must contain a 'scores' field + representing detection scores. All scores belong to the same class. + max_output_size: maximum number of retained boxes + iou_threshold: intersection over union threshold. + score_threshold: minimum score threshold. Remove the boxes with scores + less than this value. Default value is set to -10. A very + low threshold to pass pretty much all the boxes, unless + the user sets a different score threshold. + + Returns: + a BoxList holding M boxes where M <= max_output_size + Raises: + ValueError: if 'scores' field does not exist + ValueError: if threshold is not in [0, 1] + ValueError: if max_output_size < 0 + """ + if not boxlist.has_field("scores"): + raise ValueError("Field scores does not exist") + if iou_threshold < 0.0 or iou_threshold > 1.0: + raise ValueError("IOU threshold must be in [0, 1]") + if max_output_size < 0: + raise ValueError("max_output_size must be bigger than 0.") + + boxlist = filter_scores_greater_than(boxlist, score_threshold) + if boxlist.num_boxes() == 0: + return boxlist + + boxlist = sort_by_field(boxlist, "scores") + + # Prevent further computation if NMS is disabled. + if iou_threshold == 1.0: + if boxlist.num_boxes() > max_output_size: + selected_indices = np.arange(max_output_size) + return gather(boxlist, selected_indices) + else: + return boxlist + + boxes = boxlist.get() + num_boxes = boxlist.num_boxes() + # is_index_valid is True only for all remaining valid boxes, + is_index_valid = np.full(num_boxes, 1, dtype=bool) + selected_indices = [] + num_output = 0 + for i in range(num_boxes): + if num_output < max_output_size: + if is_index_valid[i]: + num_output += 1 + selected_indices.append(i) + is_index_valid[i] = False + valid_indices = np.where(is_index_valid)[0] + if valid_indices.size == 0: + break + + intersect_over_union = np_box_ops.iou( + np.expand_dims(boxes[i, :], axis=0), boxes[valid_indices, :] + ) + intersect_over_union = np.squeeze(intersect_over_union, axis=0) + is_index_valid[valid_indices] = np.logical_and( + is_index_valid[valid_indices], + intersect_over_union <= iou_threshold, + ) + return gather(boxlist, np.array(selected_indices)) + + +def multi_class_non_max_suppression( + boxlist, score_thresh, iou_thresh, max_output_size +): + """Multi-class version of non maximum suppression. + + This op greedily selects a subset of detection bounding boxes, pruning + away boxes that have high IOU (intersection over union) overlap (> thresh) + with already selected boxes. 
It operates independently for each class for + which scores are provided (via the scores field of the input box_list), + pruning boxes with score less than a provided threshold prior to + applying NMS. + + Args: + boxlist: BoxList holding N boxes. Must contain a 'scores' field + representing detection scores. This scores field is a tensor that can + be 1 dimensional (in the case of a single class) or 2-dimensional, which + which case we assume that it takes the shape [num_boxes, num_classes]. + We further assume that this rank is known statically and that + scores.shape[1] is also known (i.e., the number of classes is fixed + and known at graph construction time). + score_thresh: scalar threshold for score (low scoring boxes are removed). + iou_thresh: scalar threshold for IOU (boxes that that high IOU overlap + with previously selected boxes are removed). + max_output_size: maximum number of retained boxes per class. + + Returns: + a BoxList holding M boxes with a rank-1 scores field representing + corresponding scores for each box with scores sorted in decreasing order + and a rank-1 classes field representing a class label for each box. + Raises: + ValueError: if iou_thresh is not in [0, 1] or if input boxlist does not have + a valid scores field. + """ + if not 0 <= iou_thresh <= 1.0: + raise ValueError("thresh must be between 0 and 1") + if not isinstance(boxlist, np_box_list.BoxList): + raise ValueError("boxlist must be a BoxList") + if not boxlist.has_field("scores"): + raise ValueError("input boxlist must have 'scores' field") + scores = boxlist.get_field("scores") + if len(scores.shape) == 1: + scores = np.reshape(scores, [-1, 1]) + elif len(scores.shape) == 2: + if scores.shape[1] is None: + raise ValueError( + "scores field must have statically defined second " "dimension" + ) + else: + raise ValueError("scores field must be of rank 1 or 2") + num_boxes = boxlist.num_boxes() + num_scores = scores.shape[0] + num_classes = scores.shape[1] + + if num_boxes != num_scores: + raise ValueError("Incorrect scores field length: actual vs expected.") + + selected_boxes_list = [] + for class_idx in range(num_classes): + boxlist_and_class_scores = np_box_list.BoxList(boxlist.get()) + class_scores = np.reshape(scores[0:num_scores, class_idx], [-1]) + boxlist_and_class_scores.add_field("scores", class_scores) + boxlist_filt = filter_scores_greater_than( + boxlist_and_class_scores, score_thresh + ) + nms_result = non_max_suppression( + boxlist_filt, + max_output_size=max_output_size, + iou_threshold=iou_thresh, + score_threshold=score_thresh, + ) + nms_result.add_field( + "classes", np.zeros_like(nms_result.get_field("scores")) + class_idx + ) + selected_boxes_list.append(nms_result) + selected_boxes = concatenate(selected_boxes_list) + sorted_boxes = sort_by_field(selected_boxes, "scores") + return sorted_boxes + + +def scale(boxlist, y_scale, x_scale): + """Scale box coordinates in x and y dimensions. 
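+
+    For example, scale(boxlist, 2.0, 0.5) doubles the y extents and halves
+    the x extents; any extra fields are copied over unchanged.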
+ + Args: + boxlist: BoxList holding N boxes + y_scale: float + x_scale: float + + Returns: + boxlist: BoxList holding N boxes + """ + y_min, x_min, y_max, x_max = np.array_split(boxlist.get(), 4, axis=1) + y_min = y_scale * y_min + y_max = y_scale * y_max + x_min = x_scale * x_min + x_max = x_scale * x_max + scaled_boxlist = np_box_list.BoxList( + np.hstack([y_min, x_min, y_max, x_max]) + ) + + fields = boxlist.get_extra_fields() + for field in fields: + extra_field_data = boxlist.get_field(field) + scaled_boxlist.add_field(field, extra_field_data) + + return scaled_boxlist + + +def clip_to_window(boxlist, window): + """Clip bounding boxes to a window. + + This op clips input bounding boxes (represented by bounding box + corners) to a window, optionally filtering out boxes that do not + overlap at all with the window. + + Args: + boxlist: BoxList holding M_in boxes + window: a numpy array of shape [4] representing the + [y_min, x_min, y_max, x_max] window to which the op + should clip boxes. + + Returns: + a BoxList holding M_out boxes where M_out <= M_in + """ + y_min, x_min, y_max, x_max = np.array_split(boxlist.get(), 4, axis=1) + win_y_min = window[0] + win_x_min = window[1] + win_y_max = window[2] + win_x_max = window[3] + y_min_clipped = np.fmax(np.fmin(y_min, win_y_max), win_y_min) + y_max_clipped = np.fmax(np.fmin(y_max, win_y_max), win_y_min) + x_min_clipped = np.fmax(np.fmin(x_min, win_x_max), win_x_min) + x_max_clipped = np.fmax(np.fmin(x_max, win_x_max), win_x_min) + clipped = np_box_list.BoxList( + np.hstack([y_min_clipped, x_min_clipped, y_max_clipped, x_max_clipped]) + ) + clipped = _copy_extra_fields(clipped, boxlist) + areas = area(clipped) + nonzero_area_indices = np.reshape( + np.nonzero(np.greater(areas, 0.0)), [-1] + ).astype(np.int32) + return gather(clipped, nonzero_area_indices) + + +def prune_non_overlapping_boxes(boxlist1, boxlist2, minoverlap=0.0): + """Prunes the boxes in boxlist1 that overlap less than thresh with boxlist2. + + For each box in boxlist1, we want its IOA to be more than minoverlap with + at least one of the boxes in boxlist2. If it does not, we remove it. + + Args: + boxlist1: BoxList holding N boxes. + boxlist2: BoxList holding M boxes. + minoverlap: Minimum required overlap between boxes, to count them as + overlapping. + + Returns: + A pruned boxlist with size [N', 4]. + """ + intersection_over_area = ioa(boxlist2, boxlist1) # [M, N] tensor + intersection_over_area = np.amax( + intersection_over_area, axis=0 + ) # [N] tensor + keep_bool = np.greater_equal(intersection_over_area, np.array(minoverlap)) + keep_inds = np.nonzero(keep_bool)[0] + new_boxlist1 = gather(boxlist1, keep_inds) + return new_boxlist1 + + +def prune_outside_window(boxlist, window): + """Prunes bounding boxes that fall outside a given window. + + This function prunes bounding boxes that even partially fall outside the given + window. See also ClipToWindow which only prunes bounding boxes that fall + completely outside the window, and clips any bounding boxes that partially + overflow. + + Args: + boxlist: a BoxList holding M_in boxes. + window: a numpy array of size 4, representing [ymin, xmin, ymax, xmax] + of the window. + + Returns: + pruned_corners: a tensor with shape [M_out, 4] where M_out <= M_in. + valid_indices: a tensor with shape [M_out] indexing the valid bounding boxes + in the input tensor. 
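+
+    Example (illustrative; assumes boxes are in normalized coordinates):
+
+        window = np.array([0.0, 0.0, 1.0, 1.0])
+        pruned_boxlist, valid_indices = prune_outside_window(boxlist, window)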
+ """ + + y_min, x_min, y_max, x_max = np.array_split(boxlist.get(), 4, axis=1) + win_y_min = window[0] + win_x_min = window[1] + win_y_max = window[2] + win_x_max = window[3] + coordinate_violations = np.hstack( + [ + np.less(y_min, win_y_min), + np.less(x_min, win_x_min), + np.greater(y_max, win_y_max), + np.greater(x_max, win_x_max), + ] + ) + valid_indices = np.reshape( + np.where(np.logical_not(np.max(coordinate_violations, axis=1))), [-1] + ) + return gather(boxlist, valid_indices), valid_indices + + +def concatenate(boxlists, fields=None): + """Concatenate list of BoxLists. + + This op concatenates a list of input BoxLists into a larger BoxList. It also + handles concatenation of BoxList fields as long as the field tensor shapes + are equal except for the first dimension. + + Args: + boxlists: list of BoxList objects + fields: optional list of fields to also concatenate. By default, all + fields from the first BoxList in the list are included in the + concatenation. + + Returns: + a BoxList with number of boxes equal to + sum([boxlist.num_boxes() for boxlist in BoxList]) + Raises: + ValueError: if boxlists is invalid (i.e., is not a list, is empty, or + contains non BoxList objects), or if requested fields are not contained in + all boxlists + """ + if not isinstance(boxlists, list): + raise ValueError("boxlists should be a list") + if not boxlists: + raise ValueError("boxlists should have nonzero length") + for boxlist in boxlists: + if not isinstance(boxlist, np_box_list.BoxList): + raise ValueError( + "all elements of boxlists should be BoxList objects" + ) + concatenated = np_box_list.BoxList( + np.vstack([boxlist.get() for boxlist in boxlists]) + ) + if fields is None: + fields = boxlists[0].get_extra_fields() + for field in fields: + first_field_shape = boxlists[0].get_field(field).shape + first_field_shape = first_field_shape[1:] + for boxlist in boxlists: + if not boxlist.has_field(field): + raise ValueError("boxlist must contain all requested fields") + field_shape = boxlist.get_field(field).shape + field_shape = field_shape[1:] + if field_shape != first_field_shape: + raise ValueError( + "field %s must have same shape for all boxlists " + "except for the 0th dimension." % field + ) + concatenated_field = np.concatenate( + [boxlist.get_field(field) for boxlist in boxlists], axis=0 + ) + concatenated.add_field(field, concatenated_field) + return concatenated + + +def filter_scores_greater_than(boxlist, thresh): + """Filter to keep only boxes with score exceeding a given threshold. + + This op keeps the collection of boxes whose corresponding scores are + greater than the input threshold. + + Args: + boxlist: BoxList holding N boxes. Must contain a 'scores' field + representing detection scores. 
+ thresh: scalar threshold + + Returns: + a BoxList holding M boxes where M <= N + + Raises: + ValueError: if boxlist not a BoxList object or if it does not + have a scores field + """ + if not isinstance(boxlist, np_box_list.BoxList): + raise ValueError("boxlist must be a BoxList") + if not boxlist.has_field("scores"): + raise ValueError("input boxlist must have 'scores' field") + scores = boxlist.get_field("scores") + if len(scores.shape) > 2: + raise ValueError("Scores should have rank 1 or 2") + if len(scores.shape) == 2 and scores.shape[1] != 1: + raise ValueError( + "Scores should have rank 1 or have shape " + "consistent with [None, 1]" + ) + high_score_indices = np.reshape( + np.where(np.greater(scores, thresh)), [-1] + ).astype(np.int32) + return gather(boxlist, high_score_indices) + + +def change_coordinate_frame(boxlist, window): + """Change coordinate frame of the boxlist to be relative to window's frame. + + Given a window of the form [ymin, xmin, ymax, xmax], + changes bounding box coordinates from boxlist to be relative to this window + (e.g., the min corner maps to (0,0) and the max corner maps to (1,1)). + + An example use case is data augmentation: where we are given groundtruth + boxes (boxlist) and would like to randomly crop the image to some + window (window). In this case we need to change the coordinate frame of + each groundtruth box to be relative to this new window. + + Args: + boxlist: A BoxList object holding N boxes. + window: a size 4 1-D numpy array. + + Returns: + Returns a BoxList object with N boxes. + """ + win_height = window[2] - window[0] + win_width = window[3] - window[1] + boxlist_new = scale( + np_box_list.BoxList( + boxlist.get() - [window[0], window[1], window[0], window[1]] + ), + 1.0 / win_height, + 1.0 / win_width, + ) + _copy_extra_fields(boxlist_new, boxlist) + + return boxlist_new + + +def _copy_extra_fields(boxlist_to_copy_to, boxlist_to_copy_from): + """Copies the extra fields of boxlist_to_copy_from to boxlist_to_copy_to. + + Args: + boxlist_to_copy_to: BoxList to which extra fields are copied. + boxlist_to_copy_from: BoxList from which fields are copied. + + Returns: + boxlist_to_copy_to with extra fields. + """ + for field in boxlist_to_copy_from.get_extra_fields(): + boxlist_to_copy_to.add_field( + field, boxlist_to_copy_from.get_field(field) + ) + return boxlist_to_copy_to + + +def _update_valid_indices_by_removing_high_iou_boxes( + selected_indices, is_index_valid, intersect_over_union, threshold +): + max_iou = np.max(intersect_over_union[:, selected_indices], axis=1) + return np.logical_and(is_index_valid, max_iou <= threshold) diff --git a/evaluator/ava_evaluation/np_box_mask_list.py b/evaluator/ava_evaluation/np_box_mask_list.py new file mode 100644 index 0000000..14bd8eb --- /dev/null +++ b/evaluator/ava_evaluation/np_box_mask_list.py @@ -0,0 +1,73 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================== + +"""Numpy BoxMaskList classes and functions.""" + +from __future__ import ( + absolute_import, + division, + print_function, + unicode_literals, +) +import numpy as np + +from . import np_box_list + + +class BoxMaskList(np_box_list.BoxList): + """Convenience wrapper for BoxList with masks. + + BoxMaskList extends the np_box_list.BoxList to contain masks as well. + In particular, its constructor receives both boxes and masks. Note that the + masks correspond to the full image. + """ + + def __init__(self, box_data, mask_data): + """Constructs box collection. + + Args: + box_data: a numpy array of shape [N, 4] representing box coordinates + mask_data: a numpy array of shape [N, height, width] representing masks + with values are in {0,1}. The masks correspond to the full + image. The height and the width will be equal to image height and width. + + Raises: + ValueError: if bbox data is not a numpy array + ValueError: if invalid dimensions for bbox data + ValueError: if mask data is not a numpy array + ValueError: if invalid dimension for mask data + """ + super(BoxMaskList, self).__init__(box_data) + if not isinstance(mask_data, np.ndarray): + raise ValueError("Mask data must be a numpy array.") + if len(mask_data.shape) != 3: + raise ValueError("Invalid dimensions for mask data.") + if mask_data.dtype != np.uint8: + raise ValueError( + "Invalid data type for mask data: uint8 is required." + ) + if mask_data.shape[0] != box_data.shape[0]: + raise ValueError( + "There should be the same number of boxes and masks." + ) + self.data["masks"] = mask_data + + def get_masks(self): + """Convenience function for accessing masks. + + Returns: + a numpy array of shape [N, height, width] representing masks + """ + return self.get_field("masks") diff --git a/evaluator/ava_evaluation/np_box_mask_list_ops.py b/evaluator/ava_evaluation/np_box_mask_list_ops.py new file mode 100644 index 0000000..be16a0a --- /dev/null +++ b/evaluator/ava_evaluation/np_box_mask_list_ops.py @@ -0,0 +1,428 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +"""Operations for np_box_mask_list.BoxMaskList. + +Example box operations that are supported: + * Areas: compute bounding box areas + * IOU: pairwise intersection-over-union scores +""" +from __future__ import ( + absolute_import, + division, + print_function, + unicode_literals, +) +import numpy as np + +from . import np_box_list_ops, np_box_mask_list, np_mask_ops + + +def box_list_to_box_mask_list(boxlist): + """Converts a BoxList containing 'masks' into a BoxMaskList. + + Args: + boxlist: An np_box_list.BoxList object. + + Returns: + An np_box_mask_list.BoxMaskList object. + + Raises: + ValueError: If boxlist does not contain `masks` as a field. 
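+
+    Example (illustrative; `boxes` is a float [N, 4] array and `masks` is a
+    uint8 [N, H, W] array):
+
+        boxlist = np_box_list.BoxList(boxes)
+        boxlist.add_field("masks", masks)
+        box_mask_list = box_list_to_box_mask_list(boxlist)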
+ """ + if not boxlist.has_field("masks"): + raise ValueError("boxlist does not contain mask field.") + box_mask_list = np_box_mask_list.BoxMaskList( + box_data=boxlist.get(), mask_data=boxlist.get_field("masks") + ) + extra_fields = boxlist.get_extra_fields() + for key in extra_fields: + if key != "masks": + box_mask_list.data[key] = boxlist.get_field(key) + return box_mask_list + + +def area(box_mask_list): + """Computes area of masks. + + Args: + box_mask_list: np_box_mask_list.BoxMaskList holding N boxes and masks + + Returns: + a numpy array with shape [N*1] representing mask areas + """ + return np_mask_ops.area(box_mask_list.get_masks()) + + +def intersection(box_mask_list1, box_mask_list2): + """Compute pairwise intersection areas between masks. + + Args: + box_mask_list1: BoxMaskList holding N boxes and masks + box_mask_list2: BoxMaskList holding M boxes and masks + + Returns: + a numpy array with shape [N*M] representing pairwise intersection area + """ + return np_mask_ops.intersection( + box_mask_list1.get_masks(), box_mask_list2.get_masks() + ) + + +def iou(box_mask_list1, box_mask_list2): + """Computes pairwise intersection-over-union between box and mask collections. + + Args: + box_mask_list1: BoxMaskList holding N boxes and masks + box_mask_list2: BoxMaskList holding M boxes and masks + + Returns: + a numpy array with shape [N, M] representing pairwise iou scores. + """ + return np_mask_ops.iou( + box_mask_list1.get_masks(), box_mask_list2.get_masks() + ) + + +def ioa(box_mask_list1, box_mask_list2): + """Computes pairwise intersection-over-area between box and mask collections. + + Intersection-over-area (ioa) between two masks mask1 and mask2 is defined as + their intersection area over mask2's area. Note that ioa is not symmetric, + that is, IOA(mask1, mask2) != IOA(mask2, mask1). + + Args: + box_mask_list1: np_box_mask_list.BoxMaskList holding N boxes and masks + box_mask_list2: np_box_mask_list.BoxMaskList holding M boxes and masks + + Returns: + a numpy array with shape [N, M] representing pairwise ioa scores. + """ + return np_mask_ops.ioa( + box_mask_list1.get_masks(), box_mask_list2.get_masks() + ) + + +def gather(box_mask_list, indices, fields=None): + """Gather boxes from np_box_mask_list.BoxMaskList according to indices. + + By default, gather returns boxes corresponding to the input index list, as + well as all additional fields stored in the box_mask_list (indexing into the + first dimension). However one can optionally only gather from a + subset of fields. + + Args: + box_mask_list: np_box_mask_list.BoxMaskList holding N boxes + indices: a 1-d numpy array of type int_ + fields: (optional) list of fields to also gather from. If None (default), + all fields are gathered from. Pass an empty fields list to only gather + the box coordinates. + + Returns: + subbox_mask_list: a np_box_mask_list.BoxMaskList corresponding to the subset + of the input box_mask_list specified by indices + + Raises: + ValueError: if specified field is not contained in box_mask_list or if the + indices are not of type int_ + """ + if fields is not None: + if "masks" not in fields: + fields.append("masks") + return box_list_to_box_mask_list( + np_box_list_ops.gather( + boxlist=box_mask_list, indices=indices, fields=fields + ) + ) + + +def sort_by_field( + box_mask_list, field, order=np_box_list_ops.SortOrder.DESCEND +): + """Sort boxes and associated fields according to a scalar field. + + A common use case is reordering the boxes according to descending scores. 
+ + Args: + box_mask_list: BoxMaskList holding N boxes. + field: A BoxMaskList field for sorting and reordering the BoxMaskList. + order: (Optional) 'descend' or 'ascend'. Default is descend. + + Returns: + sorted_box_mask_list: A sorted BoxMaskList with the field in the specified + order. + """ + return box_list_to_box_mask_list( + np_box_list_ops.sort_by_field( + boxlist=box_mask_list, field=field, order=order + ) + ) + + +def non_max_suppression( + box_mask_list, + max_output_size=10000, + iou_threshold=1.0, + score_threshold=-10.0, +): + """Non maximum suppression. + + This op greedily selects a subset of detection bounding boxes, pruning + away boxes that have high IOU (intersection over union) overlap (> thresh) + with already selected boxes. In each iteration, the detected bounding box with + highest score in the available pool is selected. + + Args: + box_mask_list: np_box_mask_list.BoxMaskList holding N boxes. Must contain + a 'scores' field representing detection scores. All scores belong to the + same class. + max_output_size: maximum number of retained boxes + iou_threshold: intersection over union threshold. + score_threshold: minimum score threshold. Remove the boxes with scores + less than this value. Default value is set to -10. A very + low threshold to pass pretty much all the boxes, unless + the user sets a different score threshold. + + Returns: + an np_box_mask_list.BoxMaskList holding M boxes where M <= max_output_size + + Raises: + ValueError: if 'scores' field does not exist + ValueError: if threshold is not in [0, 1] + ValueError: if max_output_size < 0 + """ + if not box_mask_list.has_field("scores"): + raise ValueError("Field scores does not exist") + if iou_threshold < 0.0 or iou_threshold > 1.0: + raise ValueError("IOU threshold must be in [0, 1]") + if max_output_size < 0: + raise ValueError("max_output_size must be bigger than 0.") + + box_mask_list = filter_scores_greater_than(box_mask_list, score_threshold) + if box_mask_list.num_boxes() == 0: + return box_mask_list + + box_mask_list = sort_by_field(box_mask_list, "scores") + + # Prevent further computation if NMS is disabled. + if iou_threshold == 1.0: + if box_mask_list.num_boxes() > max_output_size: + selected_indices = np.arange(max_output_size) + return gather(box_mask_list, selected_indices) + else: + return box_mask_list + + masks = box_mask_list.get_masks() + num_masks = box_mask_list.num_boxes() + + # is_index_valid is True only for all remaining valid boxes, + is_index_valid = np.full(num_masks, 1, dtype=bool) + selected_indices = [] + num_output = 0 + for i in range(num_masks): + if num_output < max_output_size: + if is_index_valid[i]: + num_output += 1 + selected_indices.append(i) + is_index_valid[i] = False + valid_indices = np.where(is_index_valid)[0] + if valid_indices.size == 0: + break + + intersect_over_union = np_mask_ops.iou( + np.expand_dims(masks[i], axis=0), masks[valid_indices] + ) + intersect_over_union = np.squeeze(intersect_over_union, axis=0) + is_index_valid[valid_indices] = np.logical_and( + is_index_valid[valid_indices], + intersect_over_union <= iou_threshold, + ) + return gather(box_mask_list, np.array(selected_indices)) + + +def multi_class_non_max_suppression( + box_mask_list, score_thresh, iou_thresh, max_output_size +): + """Multi-class version of non maximum suppression. + + This op greedily selects a subset of detection bounding boxes, pruning + away boxes that have high IOU (intersection over union) overlap (> thresh) + with already selected boxes. 
It operates independently for each class for + which scores are provided (via the scores field of the input box_list), + pruning boxes with score less than a provided threshold prior to + applying NMS. + + Args: + box_mask_list: np_box_mask_list.BoxMaskList holding N boxes. Must contain a + 'scores' field representing detection scores. This scores field is a + tensor that can be 1 dimensional (in the case of a single class) or + 2-dimensional, in which case we assume that it takes the + shape [num_boxes, num_classes]. We further assume that this rank is known + statically and that scores.shape[1] is also known (i.e., the number of + classes is fixed and known at graph construction time). + score_thresh: scalar threshold for score (low scoring boxes are removed). + iou_thresh: scalar threshold for IOU (boxes that that high IOU overlap + with previously selected boxes are removed). + max_output_size: maximum number of retained boxes per class. + + Returns: + a box_mask_list holding M boxes with a rank-1 scores field representing + corresponding scores for each box with scores sorted in decreasing order + and a rank-1 classes field representing a class label for each box. + Raises: + ValueError: if iou_thresh is not in [0, 1] or if input box_mask_list does + not have a valid scores field. + """ + if not 0 <= iou_thresh <= 1.0: + raise ValueError("thresh must be between 0 and 1") + if not isinstance(box_mask_list, np_box_mask_list.BoxMaskList): + raise ValueError("box_mask_list must be a box_mask_list") + if not box_mask_list.has_field("scores"): + raise ValueError("input box_mask_list must have 'scores' field") + scores = box_mask_list.get_field("scores") + if len(scores.shape) == 1: + scores = np.reshape(scores, [-1, 1]) + elif len(scores.shape) == 2: + if scores.shape[1] is None: + raise ValueError( + "scores field must have statically defined second " "dimension" + ) + else: + raise ValueError("scores field must be of rank 1 or 2") + + num_boxes = box_mask_list.num_boxes() + num_scores = scores.shape[0] + num_classes = scores.shape[1] + + if num_boxes != num_scores: + raise ValueError("Incorrect scores field length: actual vs expected.") + + selected_boxes_list = [] + for class_idx in range(num_classes): + box_mask_list_and_class_scores = np_box_mask_list.BoxMaskList( + box_data=box_mask_list.get(), mask_data=box_mask_list.get_masks() + ) + class_scores = np.reshape(scores[0:num_scores, class_idx], [-1]) + box_mask_list_and_class_scores.add_field("scores", class_scores) + box_mask_list_filt = filter_scores_greater_than( + box_mask_list_and_class_scores, score_thresh + ) + nms_result = non_max_suppression( + box_mask_list_filt, + max_output_size=max_output_size, + iou_threshold=iou_thresh, + score_threshold=score_thresh, + ) + nms_result.add_field( + "classes", np.zeros_like(nms_result.get_field("scores")) + class_idx + ) + selected_boxes_list.append(nms_result) + selected_boxes = np_box_list_ops.concatenate(selected_boxes_list) + sorted_boxes = np_box_list_ops.sort_by_field(selected_boxes, "scores") + return box_list_to_box_mask_list(boxlist=sorted_boxes) + + +def prune_non_overlapping_masks(box_mask_list1, box_mask_list2, minoverlap=0.0): + """Prunes the boxes in list1 that overlap less than thresh with list2. + + For each mask in box_mask_list1, we want its IOA to be more than minoverlap + with at least one of the masks in box_mask_list2. If it does not, we remove + it. If the masks are not full size image, we do the pruning based on boxes. 
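+
+    In this implementation, a mask in box_mask_list1 is kept if at least
+    `minoverlap` of its own area is covered by some mask in box_mask_list2.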
+ + Args: + box_mask_list1: np_box_mask_list.BoxMaskList holding N boxes and masks. + box_mask_list2: np_box_mask_list.BoxMaskList holding M boxes and masks. + minoverlap: Minimum required overlap between boxes, to count them as + overlapping. + + Returns: + A pruned box_mask_list with size [N', 4]. + """ + intersection_over_area = ioa( + box_mask_list2, box_mask_list1 + ) # [M, N] tensor + intersection_over_area = np.amax( + intersection_over_area, axis=0 + ) # [N] tensor + keep_bool = np.greater_equal(intersection_over_area, np.array(minoverlap)) + keep_inds = np.nonzero(keep_bool)[0] + new_box_mask_list1 = gather(box_mask_list1, keep_inds) + return new_box_mask_list1 + + +def concatenate(box_mask_lists, fields=None): + """Concatenate list of box_mask_lists. + + This op concatenates a list of input box_mask_lists into a larger + box_mask_list. It also + handles concatenation of box_mask_list fields as long as the field tensor + shapes are equal except for the first dimension. + + Args: + box_mask_lists: list of np_box_mask_list.BoxMaskList objects + fields: optional list of fields to also concatenate. By default, all + fields from the first BoxMaskList in the list are included in the + concatenation. + + Returns: + a box_mask_list with number of boxes equal to + sum([box_mask_list.num_boxes() for box_mask_list in box_mask_list]) + Raises: + ValueError: if box_mask_lists is invalid (i.e., is not a list, is empty, or + contains non box_mask_list objects), or if requested fields are not + contained in all box_mask_lists + """ + if fields is not None: + if "masks" not in fields: + fields.append("masks") + return box_list_to_box_mask_list( + np_box_list_ops.concatenate(boxlists=box_mask_lists, fields=fields) + ) + + +def filter_scores_greater_than(box_mask_list, thresh): + """Filter to keep only boxes and masks with score exceeding a given threshold. + + This op keeps the collection of boxes and masks whose corresponding scores are + greater than the input threshold. + + Args: + box_mask_list: BoxMaskList holding N boxes and masks. Must contain a + 'scores' field representing detection scores. + thresh: scalar threshold + + Returns: + a BoxMaskList holding M boxes and masks where M <= N + + Raises: + ValueError: if box_mask_list not a np_box_mask_list.BoxMaskList object or + if it does not have a scores field + """ + if not isinstance(box_mask_list, np_box_mask_list.BoxMaskList): + raise ValueError("box_mask_list must be a BoxMaskList") + if not box_mask_list.has_field("scores"): + raise ValueError("input box_mask_list must have 'scores' field") + scores = box_mask_list.get_field("scores") + if len(scores.shape) > 2: + raise ValueError("Scores should have rank 1 or 2") + if len(scores.shape) == 2 and scores.shape[1] != 1: + raise ValueError( + "Scores should have rank 1 or have shape " + "consistent with [None, 1]" + ) + high_score_indices = np.reshape( + np.where(np.greater(scores, thresh)), [-1] + ).astype(np.int32) + return gather(box_mask_list, high_score_indices) diff --git a/evaluator/ava_evaluation/np_box_ops.py b/evaluator/ava_evaluation/np_box_ops.py new file mode 100644 index 0000000..1d74005 --- /dev/null +++ b/evaluator/ava_evaluation/np_box_ops.py @@ -0,0 +1,108 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +"""Operations for [N, 4] numpy arrays representing bounding boxes. + +Example box operations that are supported: + * Areas: compute bounding box areas + * IOU: pairwise intersection-over-union scores +""" +from __future__ import ( + absolute_import, + division, + print_function, + unicode_literals, +) +import numpy as np + + +def area(boxes): + """Computes area of boxes. + + Args: + boxes: Numpy array with shape [N, 4] holding N boxes + + Returns: + a numpy array with shape [N*1] representing box areas + """ + return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) + + +def intersection(boxes1, boxes2): + """Compute pairwise intersection areas between boxes. + + Args: + boxes1: a numpy array with shape [N, 4] holding N boxes + boxes2: a numpy array with shape [M, 4] holding M boxes + + Returns: + a numpy array with shape [N*M] representing pairwise intersection area + """ + [y_min1, x_min1, y_max1, x_max1] = np.split(boxes1, 4, axis=1) + [y_min2, x_min2, y_max2, x_max2] = np.split(boxes2, 4, axis=1) + + all_pairs_min_ymax = np.minimum(y_max1, np.transpose(y_max2)) + all_pairs_max_ymin = np.maximum(y_min1, np.transpose(y_min2)) + intersect_heights = np.maximum( + np.zeros(all_pairs_max_ymin.shape), + all_pairs_min_ymax - all_pairs_max_ymin, + ) + all_pairs_min_xmax = np.minimum(x_max1, np.transpose(x_max2)) + all_pairs_max_xmin = np.maximum(x_min1, np.transpose(x_min2)) + intersect_widths = np.maximum( + np.zeros(all_pairs_max_xmin.shape), + all_pairs_min_xmax - all_pairs_max_xmin, + ) + return intersect_heights * intersect_widths + + +def iou(boxes1, boxes2): + """Computes pairwise intersection-over-union between box collections. + + Args: + boxes1: a numpy array with shape [N, 4] holding N boxes. + boxes2: a numpy array with shape [M, 4] holding N boxes. + + Returns: + a numpy array with shape [N, M] representing pairwise iou scores. + """ + intersect = intersection(boxes1, boxes2) + area1 = area(boxes1) + area2 = area(boxes2) + union = ( + np.expand_dims(area1, axis=1) + + np.expand_dims(area2, axis=0) + - intersect + ) + return intersect / union + + +def ioa(boxes1, boxes2): + """Computes pairwise intersection-over-area between box collections. + + Intersection-over-area (ioa) between two boxes box1 and box2 is defined as + their intersection area over box2's area. Note that ioa is not symmetric, + that is, IOA(box1, box2) != IOA(box2, box1). + + Args: + boxes1: a numpy array with shape [N, 4] holding N boxes. + boxes2: a numpy array with shape [M, 4] holding N boxes. + + Returns: + a numpy array with shape [N, M] representing pairwise ioa scores. + """ + intersect = intersection(boxes1, boxes2) + areas = np.expand_dims(area(boxes2), axis=0) + return intersect / areas diff --git a/evaluator/ava_evaluation/np_mask_ops.py b/evaluator/ava_evaluation/np_mask_ops.py new file mode 100644 index 0000000..4f3ec21 --- /dev/null +++ b/evaluator/ava_evaluation/np_mask_ops.py @@ -0,0 +1,130 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +"""Operations for [N, height, width] numpy arrays representing masks. + +Example mask operations that are supported: + * Areas: compute mask areas + * IOU: pairwise intersection-over-union scores +""" +from __future__ import ( + absolute_import, + division, + print_function, + unicode_literals, +) +import numpy as np + +EPSILON = 1e-7 + + +def area(masks): + """Computes area of masks. + + Args: + masks: Numpy array with shape [N, height, width] holding N masks. Masks + values are of type np.uint8 and values are in {0,1}. + + Returns: + a numpy array with shape [N*1] representing mask areas. + + Raises: + ValueError: If masks.dtype is not np.uint8 + """ + if masks.dtype != np.uint8: + raise ValueError("Masks type should be np.uint8") + return np.sum(masks, axis=(1, 2), dtype=np.float32) + + +def intersection(masks1, masks2): + """Compute pairwise intersection areas between masks. + + Args: + masks1: a numpy array with shape [N, height, width] holding N masks. Masks + values are of type np.uint8 and values are in {0,1}. + masks2: a numpy array with shape [M, height, width] holding M masks. Masks + values are of type np.uint8 and values are in {0,1}. + + Returns: + a numpy array with shape [N*M] representing pairwise intersection area. + + Raises: + ValueError: If masks1 and masks2 are not of type np.uint8. + """ + if masks1.dtype != np.uint8 or masks2.dtype != np.uint8: + raise ValueError("masks1 and masks2 should be of type np.uint8") + n = masks1.shape[0] + m = masks2.shape[0] + answer = np.zeros([n, m], dtype=np.float32) + for i in np.arange(n): + for j in np.arange(m): + answer[i, j] = np.sum( + np.minimum(masks1[i], masks2[j]), dtype=np.float32 + ) + return answer + + +def iou(masks1, masks2): + """Computes pairwise intersection-over-union between mask collections. + + Args: + masks1: a numpy array with shape [N, height, width] holding N masks. Masks + values are of type np.uint8 and values are in {0,1}. + masks2: a numpy array with shape [M, height, width] holding N masks. Masks + values are of type np.uint8 and values are in {0,1}. + + Returns: + a numpy array with shape [N, M] representing pairwise iou scores. + + Raises: + ValueError: If masks1 and masks2 are not of type np.uint8. + """ + if masks1.dtype != np.uint8 or masks2.dtype != np.uint8: + raise ValueError("masks1 and masks2 should be of type np.uint8") + intersect = intersection(masks1, masks2) + area1 = area(masks1) + area2 = area(masks2) + union = ( + np.expand_dims(area1, axis=1) + + np.expand_dims(area2, axis=0) + - intersect + ) + return intersect / np.maximum(union, EPSILON) + + +def ioa(masks1, masks2): + """Computes pairwise intersection-over-area between box collections. + + Intersection-over-area (ioa) between two masks, mask1 and mask2 is defined as + their intersection area over mask2's area. Note that ioa is not symmetric, + that is, IOA(mask1, mask2) != IOA(mask2, mask1). 
+ + Args: + masks1: a numpy array with shape [N, height, width] holding N masks. Masks + values are of type np.uint8 and values are in {0,1}. + masks2: a numpy array with shape [M, height, width] holding N masks. Masks + values are of type np.uint8 and values are in {0,1}. + + Returns: + a numpy array with shape [N, M] representing pairwise ioa scores. + + Raises: + ValueError: If masks1 and masks2 are not of type np.uint8. + """ + if masks1.dtype != np.uint8 or masks2.dtype != np.uint8: + raise ValueError("masks1 and masks2 should be of type np.uint8") + intersect = intersection(masks1, masks2) + areas = np.expand_dims(area(masks2), axis=0) + return intersect / (areas + EPSILON) diff --git a/evaluator/ava_evaluation/object_detection_evaluation.py b/evaluator/ava_evaluation/object_detection_evaluation.py new file mode 100644 index 0000000..7a54021 --- /dev/null +++ b/evaluator/ava_evaluation/object_detection_evaluation.py @@ -0,0 +1,824 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== +"""object_detection_evaluation module. + +ObjectDetectionEvaluation is a class which manages ground truth information of a +object detection dataset, and computes frequently used detection metrics such as +Precision, Recall, CorLoc of the provided detection results. +It supports the following operations: +1) Add ground truth information of images sequentially. +2) Add detection result of images sequentially. +3) Evaluate detection metrics on already inserted detection results. +4) Write evaluation result into a pickle file for future processing or + visualization. + +Note: This module operates on numpy boxes and box lists. +""" + +from __future__ import ( + absolute_import, + division, + print_function, + unicode_literals, +) +import collections +import logging +import numpy as np +from abc import ABCMeta, abstractmethod + +from . import label_map_util, metrics, per_image_evaluation, standard_fields + + +class DetectionEvaluator(object): + """Interface for object detection evalution classes. + + Example usage of the Evaluator: + ------------------------------ + evaluator = DetectionEvaluator(categories) + + # Detections and groundtruth for image 1. + evaluator.add_single_groundtruth_image_info(...) + evaluator.add_single_detected_image_info(...) + + # Detections and groundtruth for image 2. + evaluator.add_single_groundtruth_image_info(...) + evaluator.add_single_detected_image_info(...) + + metrics_dict = evaluator.evaluate() + """ + + __metaclass__ = ABCMeta + + def __init__(self, categories): + """Constructor. + + Args: + categories: A list of dicts, each of which has the following keys - + 'id': (required) an integer id uniquely identifying this category. + 'name': (required) string representing category name e.g., 'cat', 'dog'. 
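+
+        Example (illustrative only; the category names below are hypothetical
+        and any concrete evaluator subclass, e.g. PascalDetectionEvaluator
+        defined later in this module, can be used):
+          categories = [{'id': 1, 'name': 'stand'}, {'id': 2, 'name': 'walk'}]
+          evaluator = PascalDetectionEvaluator(categories)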
+ """ + self._categories = categories + + @abstractmethod + def add_single_ground_truth_image_info(self, image_id, groundtruth_dict): + """Adds groundtruth for a single image to be used for evaluation. + + Args: + image_id: A unique string/integer identifier for the image. + groundtruth_dict: A dictionary of groundtruth numpy arrays required + for evaluations. + """ + pass + + @abstractmethod + def add_single_detected_image_info(self, image_id, detections_dict): + """Adds detections for a single image to be used for evaluation. + + Args: + image_id: A unique string/integer identifier for the image. + detections_dict: A dictionary of detection numpy arrays required + for evaluation. + """ + pass + + @abstractmethod + def evaluate(self): + """Evaluates detections and returns a dictionary of metrics.""" + pass + + @abstractmethod + def clear(self): + """Clears the state to prepare for a fresh evaluation.""" + pass + + +class ObjectDetectionEvaluator(DetectionEvaluator): + """A class to evaluate detections.""" + + def __init__( + self, + categories, + matching_iou_threshold=0.5, + evaluate_corlocs=False, + metric_prefix=None, + use_weighted_mean_ap=False, + evaluate_masks=False, + ): + """Constructor. + + Args: + categories: A list of dicts, each of which has the following keys - + 'id': (required) an integer id uniquely identifying this category. + 'name': (required) string representing category name e.g., 'cat', 'dog'. + matching_iou_threshold: IOU threshold to use for matching groundtruth + boxes to detection boxes. + evaluate_corlocs: (optional) boolean which determines if corloc scores + are to be returned or not. + metric_prefix: (optional) string prefix for metric name; if None, no + prefix is used. + use_weighted_mean_ap: (optional) boolean which determines if the mean + average precision is computed directly from the scores and tp_fp_labels + of all classes. + evaluate_masks: If False, evaluation will be performed based on boxes. + If True, mask evaluation will be performed instead. + + Raises: + ValueError: If the category ids are not 1-indexed. + """ + super(ObjectDetectionEvaluator, self).__init__(categories) + self._num_classes = max([cat["id"] for cat in categories]) + if min(cat["id"] for cat in categories) < 1: + raise ValueError("Classes should be 1-indexed.") + self._matching_iou_threshold = matching_iou_threshold + self._use_weighted_mean_ap = use_weighted_mean_ap + self._label_id_offset = 1 + self._evaluate_masks = evaluate_masks + self._evaluation = ObjectDetectionEvaluation( + num_groundtruth_classes=self._num_classes, + matching_iou_threshold=self._matching_iou_threshold, + use_weighted_mean_ap=self._use_weighted_mean_ap, + label_id_offset=self._label_id_offset, + ) + self._image_ids = set([]) + self._evaluate_corlocs = evaluate_corlocs + self._metric_prefix = (metric_prefix + "_") if metric_prefix else "" + + def add_single_ground_truth_image_info(self, image_id, groundtruth_dict): + """Adds groundtruth for a single image to be used for evaluation. + + Args: + image_id: A unique string/integer identifier for the image. + groundtruth_dict: A dictionary containing - + standard_fields.InputDataFields.groundtruth_boxes: float32 numpy array + of shape [num_boxes, 4] containing `num_boxes` groundtruth boxes of + the format [ymin, xmin, ymax, xmax] in absolute image coordinates. + standard_fields.InputDataFields.groundtruth_classes: integer numpy array + of shape [num_boxes] containing 1-indexed groundtruth classes for the + boxes. 
+ standard_fields.InputDataFields.groundtruth_difficult: Optional length + M numpy boolean array denoting whether a ground truth box is a + difficult instance or not. This field is optional to support the case + that no boxes are difficult. + standard_fields.InputDataFields.groundtruth_instance_masks: Optional + numpy array of shape [num_boxes, height, width] with values in {0, 1}. + + Raises: + ValueError: On adding groundtruth for an image more than once. Will also + raise error if instance masks are not in groundtruth dictionary. + """ + if image_id in self._image_ids: + raise ValueError("Image with id {} already added.".format(image_id)) + + groundtruth_classes = ( + groundtruth_dict[ + standard_fields.InputDataFields.groundtruth_classes + ] + - self._label_id_offset + ) + # If the key is not present in the groundtruth_dict or the array is empty + # (unless there are no annotations for the groundtruth on this image) + # use values from the dictionary or insert None otherwise. + if standard_fields.InputDataFields.groundtruth_difficult in groundtruth_dict.keys() and ( + groundtruth_dict[ + standard_fields.InputDataFields.groundtruth_difficult + ].size + or not groundtruth_classes.size + ): + groundtruth_difficult = groundtruth_dict[ + standard_fields.InputDataFields.groundtruth_difficult + ] + else: + groundtruth_difficult = None + if not len(self._image_ids) % 1000: + logging.warn( + "image %s does not have groundtruth difficult flag specified", + image_id, + ) + groundtruth_masks = None + if self._evaluate_masks: + if ( + standard_fields.InputDataFields.groundtruth_instance_masks + not in groundtruth_dict + ): + raise ValueError( + "Instance masks not in groundtruth dictionary." + ) + groundtruth_masks = groundtruth_dict[ + standard_fields.InputDataFields.groundtruth_instance_masks + ] + self._evaluation.add_single_ground_truth_image_info( + image_key=image_id, + groundtruth_boxes=groundtruth_dict[ + standard_fields.InputDataFields.groundtruth_boxes + ], + groundtruth_class_labels=groundtruth_classes, + groundtruth_is_difficult_list=groundtruth_difficult, + groundtruth_masks=groundtruth_masks, + ) + self._image_ids.update([image_id]) + + def add_single_detected_image_info(self, image_id, detections_dict): + """Adds detections for a single image to be used for evaluation. + + Args: + image_id: A unique string/integer identifier for the image. + detections_dict: A dictionary containing - + standard_fields.DetectionResultFields.detection_boxes: float32 numpy + array of shape [num_boxes, 4] containing `num_boxes` detection boxes + of the format [ymin, xmin, ymax, xmax] in absolute image coordinates. + standard_fields.DetectionResultFields.detection_scores: float32 numpy + array of shape [num_boxes] containing detection scores for the boxes. + standard_fields.DetectionResultFields.detection_classes: integer numpy + array of shape [num_boxes] containing 1-indexed detection classes for + the boxes. + standard_fields.DetectionResultFields.detection_masks: uint8 numpy + array of shape [num_boxes, height, width] containing `num_boxes` masks + of values ranging between 0 and 1. + + Raises: + ValueError: If detection masks are not in detections dictionary. 
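+
+        Example (an illustrative sketch; `evaluator` is an already constructed
+        ObjectDetectionEvaluator and the box, score and class values below are
+        made-up placeholders):
+          detections_dict = {
+              standard_fields.DetectionResultFields.detection_boxes:
+                  np.array([[10., 10., 50., 40.]], dtype=np.float32),
+              standard_fields.DetectionResultFields.detection_scores:
+                  np.array([0.9], dtype=np.float32),
+              standard_fields.DetectionResultFields.detection_classes:
+                  np.array([1], dtype=np.int32),
+          }
+          evaluator.add_single_detected_image_info('frame_0001', detections_dict)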
+ """ + detection_classes = ( + detections_dict[ + standard_fields.DetectionResultFields.detection_classes + ] + - self._label_id_offset + ) + detection_masks = None + if self._evaluate_masks: + if ( + standard_fields.DetectionResultFields.detection_masks + not in detections_dict + ): + raise ValueError( + "Detection masks not in detections dictionary." + ) + detection_masks = detections_dict[ + standard_fields.DetectionResultFields.detection_masks + ] + self._evaluation.add_single_detected_image_info( + image_key=image_id, + detected_boxes=detections_dict[ + standard_fields.DetectionResultFields.detection_boxes + ], + detected_scores=detections_dict[ + standard_fields.DetectionResultFields.detection_scores + ], + detected_class_labels=detection_classes, + detected_masks=detection_masks, + ) + + def evaluate(self): + """Compute evaluation result. + + Returns: + A dictionary of metrics with the following fields - + + 1. summary_metrics: + 'Precision/mAP@IOU': mean average precision at + the specified IOU threshold. + + 2. per_category_ap: category specific results with keys of the form + 'PerformanceByCategory/mAP@IOU/category'. + """ + ( + per_class_ap, + mean_ap, + _, + _, + per_class_corloc, + mean_corloc, + ) = self._evaluation.evaluate() + pascal_metrics = { + self._metric_prefix + + "Precision/mAP@{}IOU".format( + self._matching_iou_threshold + ): mean_ap + } + if self._evaluate_corlocs: + pascal_metrics[ + self._metric_prefix + + "Precision/meanCorLoc@{}IOU".format( + self._matching_iou_threshold + ) + ] = mean_corloc + category_index = label_map_util.create_category_index(self._categories) + for idx in range(per_class_ap.size): + if idx + self._label_id_offset in category_index: + display_name = ( + self._metric_prefix + + "PerformanceByCategory/AP@{}IOU/{}".format( + self._matching_iou_threshold, + category_index[idx + self._label_id_offset]["name"], + ) + ) + pascal_metrics[display_name] = per_class_ap[idx] + + # Optionally add CorLoc metrics.classes + if self._evaluate_corlocs: + display_name = ( + self._metric_prefix + + "PerformanceByCategory/CorLoc@{}IOU/{}".format( + self._matching_iou_threshold, + category_index[idx + self._label_id_offset]["name"], + ) + ) + pascal_metrics[display_name] = per_class_corloc[idx] + + return pascal_metrics + + def clear(self): + """Clears the state to prepare for a fresh evaluation.""" + self._evaluation = ObjectDetectionEvaluation( + num_groundtruth_classes=self._num_classes, + matching_iou_threshold=self._matching_iou_threshold, + use_weighted_mean_ap=self._use_weighted_mean_ap, + label_id_offset=self._label_id_offset, + ) + self._image_ids.clear() + + +class PascalDetectionEvaluator(ObjectDetectionEvaluator): + """A class to evaluate detections using PASCAL metrics.""" + + def __init__(self, categories, matching_iou_threshold=0.5): + super(PascalDetectionEvaluator, self).__init__( + categories, + matching_iou_threshold=matching_iou_threshold, + evaluate_corlocs=False, + metric_prefix="PascalBoxes", + use_weighted_mean_ap=False, + ) + + +class WeightedPascalDetectionEvaluator(ObjectDetectionEvaluator): + """A class to evaluate detections using weighted PASCAL metrics. + + Weighted PASCAL metrics computes the mean average precision as the average + precision given the scores and tp_fp_labels of all classes. In comparison, + PASCAL metrics computes the mean average precision as the mean of the + per-class average precisions. + + This definition is very similar to the mean of the per-class average + precisions weighted by class frequency. 
However, they are typically not the + same as the average precision is not a linear function of the scores and + tp_fp_labels. + """ + + def __init__(self, categories, matching_iou_threshold=0.5): + super(WeightedPascalDetectionEvaluator, self).__init__( + categories, + matching_iou_threshold=matching_iou_threshold, + evaluate_corlocs=False, + metric_prefix="WeightedPascalBoxes", + use_weighted_mean_ap=True, + ) + + +class PascalInstanceSegmentationEvaluator(ObjectDetectionEvaluator): + """A class to evaluate instance masks using PASCAL metrics.""" + + def __init__(self, categories, matching_iou_threshold=0.5): + super(PascalInstanceSegmentationEvaluator, self).__init__( + categories, + matching_iou_threshold=matching_iou_threshold, + evaluate_corlocs=False, + metric_prefix="PascalMasks", + use_weighted_mean_ap=False, + evaluate_masks=True, + ) + + +class WeightedPascalInstanceSegmentationEvaluator(ObjectDetectionEvaluator): + """A class to evaluate instance masks using weighted PASCAL metrics. + + Weighted PASCAL metrics computes the mean average precision as the average + precision given the scores and tp_fp_labels of all classes. In comparison, + PASCAL metrics computes the mean average precision as the mean of the + per-class average precisions. + + This definition is very similar to the mean of the per-class average + precisions weighted by class frequency. However, they are typically not the + same as the average precision is not a linear function of the scores and + tp_fp_labels. + """ + + def __init__(self, categories, matching_iou_threshold=0.5): + super(WeightedPascalInstanceSegmentationEvaluator, self).__init__( + categories, + matching_iou_threshold=matching_iou_threshold, + evaluate_corlocs=False, + metric_prefix="WeightedPascalMasks", + use_weighted_mean_ap=True, + evaluate_masks=True, + ) + + +class OpenImagesDetectionEvaluator(ObjectDetectionEvaluator): + """A class to evaluate detections using Open Images V2 metrics. + + Open Images V2 introduce group_of type of bounding boxes and this metric + handles those boxes appropriately. + """ + + def __init__( + self, categories, matching_iou_threshold=0.5, evaluate_corlocs=False + ): + """Constructor. + + Args: + categories: A list of dicts, each of which has the following keys - + 'id': (required) an integer id uniquely identifying this category. + 'name': (required) string representing category name e.g., 'cat', 'dog'. + matching_iou_threshold: IOU threshold to use for matching groundtruth + boxes to detection boxes. + evaluate_corlocs: if True, additionally evaluates and returns CorLoc. + """ + super(OpenImagesDetectionEvaluator, self).__init__( + categories, + matching_iou_threshold, + evaluate_corlocs, + metric_prefix="OpenImagesV2", + ) + + def add_single_ground_truth_image_info(self, image_id, groundtruth_dict): + """Adds groundtruth for a single image to be used for evaluation. + + Args: + image_id: A unique string/integer identifier for the image. + groundtruth_dict: A dictionary containing - + standard_fields.InputDataFields.groundtruth_boxes: float32 numpy array + of shape [num_boxes, 4] containing `num_boxes` groundtruth boxes of + the format [ymin, xmin, ymax, xmax] in absolute image coordinates. + standard_fields.InputDataFields.groundtruth_classes: integer numpy array + of shape [num_boxes] containing 1-indexed groundtruth classes for the + boxes. + standard_fields.InputDataFields.groundtruth_group_of: Optional length + M numpy boolean array denoting whether a groundtruth box contains a + group of instances. 
+ + Raises: + ValueError: On adding groundtruth for an image more than once. + """ + if image_id in self._image_ids: + raise ValueError("Image with id {} already added.".format(image_id)) + + groundtruth_classes = ( + groundtruth_dict[ + standard_fields.InputDataFields.groundtruth_classes + ] + - self._label_id_offset + ) + # If the key is not present in the groundtruth_dict or the array is empty + # (unless there are no annotations for the groundtruth on this image) + # use values from the dictionary or insert None otherwise. + if standard_fields.InputDataFields.groundtruth_group_of in groundtruth_dict.keys() and ( + groundtruth_dict[ + standard_fields.InputDataFields.groundtruth_group_of + ].size + or not groundtruth_classes.size + ): + groundtruth_group_of = groundtruth_dict[ + standard_fields.InputDataFields.groundtruth_group_of + ] + else: + groundtruth_group_of = None + if not len(self._image_ids) % 1000: + logging.warn( + "image %s does not have groundtruth group_of flag specified", + image_id, + ) + self._evaluation.add_single_ground_truth_image_info( + image_id, + groundtruth_dict[standard_fields.InputDataFields.groundtruth_boxes], + groundtruth_classes, + groundtruth_is_difficult_list=None, + groundtruth_is_group_of_list=groundtruth_group_of, + ) + self._image_ids.update([image_id]) + + +ObjectDetectionEvalMetrics = collections.namedtuple( + "ObjectDetectionEvalMetrics", + [ + "average_precisions", + "mean_ap", + "precisions", + "recalls", + "corlocs", + "mean_corloc", + ], +) + + +class ObjectDetectionEvaluation(object): + """Internal implementation of Pascal object detection metrics.""" + + def __init__( + self, + num_groundtruth_classes, + matching_iou_threshold=0.5, + nms_iou_threshold=1.0, + nms_max_output_boxes=10000, + use_weighted_mean_ap=False, + label_id_offset=0, + ): + if num_groundtruth_classes < 1: + raise ValueError( + "Need at least 1 groundtruth class for evaluation." + ) + + self.per_image_eval = per_image_evaluation.PerImageEvaluation( + num_groundtruth_classes=num_groundtruth_classes, + matching_iou_threshold=matching_iou_threshold, + ) + self.num_class = num_groundtruth_classes + self.use_weighted_mean_ap = use_weighted_mean_ap + self.label_id_offset = label_id_offset + + self.groundtruth_boxes = {} + self.groundtruth_class_labels = {} + self.groundtruth_masks = {} + self.groundtruth_is_difficult_list = {} + self.groundtruth_is_group_of_list = {} + self.num_gt_instances_per_class = np.zeros(self.num_class, dtype=int) + self.num_gt_imgs_per_class = np.zeros(self.num_class, dtype=int) + + self._initialize_detections() + + def _initialize_detections(self): + self.detection_keys = set() + self.scores_per_class = [[] for _ in range(self.num_class)] + self.tp_fp_labels_per_class = [[] for _ in range(self.num_class)] + self.num_images_correctly_detected_per_class = np.zeros(self.num_class) + self.average_precision_per_class = np.empty(self.num_class, dtype=float) + self.average_precision_per_class.fill(np.nan) + self.precisions_per_class = [] + self.recalls_per_class = [] + self.corloc_per_class = np.ones(self.num_class, dtype=float) + + def clear_detections(self): + self._initialize_detections() + + def add_single_ground_truth_image_info( + self, + image_key, + groundtruth_boxes, + groundtruth_class_labels, + groundtruth_is_difficult_list=None, + groundtruth_is_group_of_list=None, + groundtruth_masks=None, + ): + """Adds groundtruth for a single image to be used for evaluation. + + Args: + image_key: A unique string/integer identifier for the image. 
+ groundtruth_boxes: float32 numpy array of shape [num_boxes, 4] + containing `num_boxes` groundtruth boxes of the format + [ymin, xmin, ymax, xmax] in absolute image coordinates. + groundtruth_class_labels: integer numpy array of shape [num_boxes] + containing 0-indexed groundtruth classes for the boxes. + groundtruth_is_difficult_list: A length M numpy boolean array denoting + whether a ground truth box is a difficult instance or not. To support + the case that no boxes are difficult, it is by default set as None. + groundtruth_is_group_of_list: A length M numpy boolean array denoting + whether a ground truth box is a group-of box or not. To support + the case that no boxes are groups-of, it is by default set as None. + groundtruth_masks: uint8 numpy array of shape + [num_boxes, height, width] containing `num_boxes` groundtruth masks. + The mask values range from 0 to 1. + """ + if image_key in self.groundtruth_boxes: + logging.warn( + "image %s has already been added to the ground truth database.", + image_key, + ) + return + + self.groundtruth_boxes[image_key] = groundtruth_boxes + self.groundtruth_class_labels[image_key] = groundtruth_class_labels + self.groundtruth_masks[image_key] = groundtruth_masks + if groundtruth_is_difficult_list is None: + num_boxes = groundtruth_boxes.shape[0] + groundtruth_is_difficult_list = np.zeros(num_boxes, dtype=bool) + self.groundtruth_is_difficult_list[ + image_key + ] = groundtruth_is_difficult_list.astype(dtype=bool) + if groundtruth_is_group_of_list is None: + num_boxes = groundtruth_boxes.shape[0] + groundtruth_is_group_of_list = np.zeros(num_boxes, dtype=bool) + self.groundtruth_is_group_of_list[ + image_key + ] = groundtruth_is_group_of_list.astype(dtype=bool) + + self._update_ground_truth_statistics( + groundtruth_class_labels, + groundtruth_is_difficult_list.astype(dtype=bool), + groundtruth_is_group_of_list.astype(dtype=bool), + ) + + def add_single_detected_image_info( + self, + image_key, + detected_boxes, + detected_scores, + detected_class_labels, + detected_masks=None, + ): + """Adds detections for a single image to be used for evaluation. + + Args: + image_key: A unique string/integer identifier for the image. + detected_boxes: float32 numpy array of shape [num_boxes, 4] + containing `num_boxes` detection boxes of the format + [ymin, xmin, ymax, xmax] in absolute image coordinates. + detected_scores: float32 numpy array of shape [num_boxes] containing + detection scores for the boxes. + detected_class_labels: integer numpy array of shape [num_boxes] containing + 0-indexed detection classes for the boxes. + detected_masks: np.uint8 numpy array of shape [num_boxes, height, width] + containing `num_boxes` detection masks with values ranging + between 0 and 1. + + Raises: + ValueError: if the number of boxes, scores and class labels differ in + length. + """ + if len(detected_boxes) != len(detected_scores) or len( + detected_boxes + ) != len(detected_class_labels): + raise ValueError( + "detected_boxes, detected_scores and " + "detected_class_labels should all have same lengths. 
Got" + "[%d, %d, %d]" % len(detected_boxes), + len(detected_scores), + len(detected_class_labels), + ) + + if image_key in self.detection_keys: + logging.warn( + "image %s has already been added to the detection result database", + image_key, + ) + return + + self.detection_keys.add(image_key) + if image_key in self.groundtruth_boxes: + groundtruth_boxes = self.groundtruth_boxes[image_key] + groundtruth_class_labels = self.groundtruth_class_labels[image_key] + # Masks are popped instead of look up. The reason is that we do not want + # to keep all masks in memory which can cause memory overflow. + groundtruth_masks = self.groundtruth_masks.pop(image_key) + groundtruth_is_difficult_list = self.groundtruth_is_difficult_list[ + image_key + ] + groundtruth_is_group_of_list = self.groundtruth_is_group_of_list[ + image_key + ] + else: + groundtruth_boxes = np.empty(shape=[0, 4], dtype=float) + groundtruth_class_labels = np.array([], dtype=int) + if detected_masks is None: + groundtruth_masks = None + else: + groundtruth_masks = np.empty(shape=[0, 1, 1], dtype=float) + groundtruth_is_difficult_list = np.array([], dtype=bool) + groundtruth_is_group_of_list = np.array([], dtype=bool) + ( + scores, + tp_fp_labels, + ) = self.per_image_eval.compute_object_detection_metrics( + detected_boxes=detected_boxes, + detected_scores=detected_scores, + detected_class_labels=detected_class_labels, + groundtruth_boxes=groundtruth_boxes, + groundtruth_class_labels=groundtruth_class_labels, + groundtruth_is_difficult_list=groundtruth_is_difficult_list, + groundtruth_is_group_of_list=groundtruth_is_group_of_list, + detected_masks=detected_masks, + groundtruth_masks=groundtruth_masks, + ) + + for i in range(self.num_class): + if scores[i].shape[0] > 0: + self.scores_per_class[i].append(scores[i]) + self.tp_fp_labels_per_class[i].append(tp_fp_labels[i]) + + def _update_ground_truth_statistics( + self, + groundtruth_class_labels, + groundtruth_is_difficult_list, + groundtruth_is_group_of_list, + ): + """Update grouth truth statitistics. + + 1. Difficult boxes are ignored when counting the number of ground truth + instances as done in Pascal VOC devkit. + 2. Difficult boxes are treated as normal boxes when computing CorLoc related + statitistics. + + Args: + groundtruth_class_labels: An integer numpy array of length M, + representing M class labels of object instances in ground truth + groundtruth_is_difficult_list: A boolean numpy array of length M denoting + whether a ground truth box is a difficult instance or not + groundtruth_is_group_of_list: A boolean numpy array of length M denoting + whether a ground truth box is a group-of box or not + """ + for class_index in range(self.num_class): + num_gt_instances = np.sum( + groundtruth_class_labels[ + ~groundtruth_is_difficult_list + & ~groundtruth_is_group_of_list + ] + == class_index + ) + self.num_gt_instances_per_class[class_index] += num_gt_instances + if np.any(groundtruth_class_labels == class_index): + self.num_gt_imgs_per_class[class_index] += 1 + + def evaluate(self): + """Compute evaluation result. + + Returns: + A named tuple with the following fields - + average_precision: float numpy array of average precision for + each class. 
+ mean_ap: mean average precision of all classes, float scalar + precisions: List of precisions, each precision is a float numpy + array + recalls: List of recalls, each recall is a float numpy array + corloc: numpy float array + mean_corloc: Mean CorLoc score for each class, float scalar + """ + if (self.num_gt_instances_per_class == 0).any(): + logging.info( + "The following classes have no ground truth examples: %s", + np.squeeze(np.argwhere(self.num_gt_instances_per_class == 0)) + + self.label_id_offset, + ) + + if self.use_weighted_mean_ap: + all_scores = np.array([], dtype=float) + all_tp_fp_labels = np.array([], dtype=bool) + + for class_index in range(self.num_class): + if self.num_gt_instances_per_class[class_index] == 0: + continue + if not self.scores_per_class[class_index]: + scores = np.array([], dtype=float) + tp_fp_labels = np.array([], dtype=bool) + else: + scores = np.concatenate(self.scores_per_class[class_index]) + tp_fp_labels = np.concatenate( + self.tp_fp_labels_per_class[class_index] + ) + if self.use_weighted_mean_ap: + all_scores = np.append(all_scores, scores) + all_tp_fp_labels = np.append(all_tp_fp_labels, tp_fp_labels) + precision, recall = metrics.compute_precision_recall( + scores, + tp_fp_labels, + self.num_gt_instances_per_class[class_index], + ) + self.precisions_per_class.append(precision) + self.recalls_per_class.append(recall) + average_precision = metrics.compute_average_precision( + precision, recall + ) + self.average_precision_per_class[class_index] = average_precision + + self.corloc_per_class = metrics.compute_cor_loc( + self.num_gt_imgs_per_class, + self.num_images_correctly_detected_per_class, + ) + + if self.use_weighted_mean_ap: + num_gt_instances = np.sum(self.num_gt_instances_per_class) + precision, recall = metrics.compute_precision_recall( + all_scores, all_tp_fp_labels, num_gt_instances + ) + mean_ap = metrics.compute_average_precision(precision, recall) + else: + mean_ap = np.nanmean(self.average_precision_per_class) + mean_corloc = np.nanmean(self.corloc_per_class) + return ObjectDetectionEvalMetrics( + self.average_precision_per_class, + mean_ap, + self.precisions_per_class, + self.recalls_per_class, + self.corloc_per_class, + mean_corloc, + ) diff --git a/evaluator/ava_evaluation/per_image_evaluation.py b/evaluator/ava_evaluation/per_image_evaluation.py new file mode 100644 index 0000000..ea0f4fe --- /dev/null +++ b/evaluator/ava_evaluation/per_image_evaluation.py @@ -0,0 +1,453 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== +"""Evaluate Object Detection result on a single image. + +Annotate each detected result as true positives or false positive according to +a predefined IOU ratio. Non Maximum Supression is used by default. Multi class +detection is supported by default. +Based on the settings, per image evaluation is either performed on boxes or +on object masks. 
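+
+Example (illustrative; the class count is an arbitrary example value and the
+inputs are assumed to be numpy arrays shaped as documented in
+PerImageEvaluation.compute_object_detection_metrics below):
+  evaluator = PerImageEvaluation(num_groundtruth_classes=80,
+                                 matching_iou_threshold=0.5)
+  scores, tp_fp_labels = evaluator.compute_object_detection_metrics(
+      detected_boxes, detected_scores, detected_class_labels,
+      groundtruth_boxes, groundtruth_class_labels,
+      groundtruth_is_difficult_list, groundtruth_is_group_of_list)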
+""" +from __future__ import ( + absolute_import, + division, + print_function, + unicode_literals, +) +import numpy as np + +from . import ( + np_box_list, + np_box_list_ops, + np_box_mask_list, + np_box_mask_list_ops, +) + + +class PerImageEvaluation(object): + """Evaluate detection result of a single image.""" + + def __init__(self, num_groundtruth_classes, matching_iou_threshold=0.5): + """Initialized PerImageEvaluation by evaluation parameters. + + Args: + num_groundtruth_classes: Number of ground truth object classes + matching_iou_threshold: A ratio of area intersection to union, which is + the threshold to consider whether a detection is true positive or not + """ + self.matching_iou_threshold = matching_iou_threshold + self.num_groundtruth_classes = num_groundtruth_classes + + def compute_object_detection_metrics( + self, + detected_boxes, + detected_scores, + detected_class_labels, + groundtruth_boxes, + groundtruth_class_labels, + groundtruth_is_difficult_list, + groundtruth_is_group_of_list, + detected_masks=None, + groundtruth_masks=None, + ): + """Evaluates detections as being tp, fp or ignored from a single image. + + The evaluation is done in two stages: + 1. All detections are matched to non group-of boxes; true positives are + determined and detections matched to difficult boxes are ignored. + 2. Detections that are determined as false positives are matched against + group-of boxes and ignored if matched. + + Args: + detected_boxes: A float numpy array of shape [N, 4], representing N + regions of detected object regions. + Each row is of the format [y_min, x_min, y_max, x_max] + detected_scores: A float numpy array of shape [N, 1], representing + the confidence scores of the detected N object instances. + detected_class_labels: A integer numpy array of shape [N, 1], repreneting + the class labels of the detected N object instances. + groundtruth_boxes: A float numpy array of shape [M, 4], representing M + regions of object instances in ground truth + groundtruth_class_labels: An integer numpy array of shape [M, 1], + representing M class labels of object instances in ground truth + groundtruth_is_difficult_list: A boolean numpy array of length M denoting + whether a ground truth box is a difficult instance or not + groundtruth_is_group_of_list: A boolean numpy array of length M denoting + whether a ground truth box has group-of tag + detected_masks: (optional) A uint8 numpy array of shape + [N, height, width]. If not None, the metrics will be computed based + on masks. + groundtruth_masks: (optional) A uint8 numpy array of shape + [M, height, width]. + + Returns: + scores: A list of C float numpy arrays. Each numpy array is of + shape [K, 1], representing K scores detected with object class + label c + tp_fp_labels: A list of C boolean numpy arrays. 
Each numpy array + is of shape [K, 1], representing K True/False positive label of + object instances detected with class label c + """ + ( + detected_boxes, + detected_scores, + detected_class_labels, + detected_masks, + ) = self._remove_invalid_boxes( + detected_boxes, + detected_scores, + detected_class_labels, + detected_masks, + ) + scores, tp_fp_labels = self._compute_tp_fp( + detected_boxes=detected_boxes, + detected_scores=detected_scores, + detected_class_labels=detected_class_labels, + groundtruth_boxes=groundtruth_boxes, + groundtruth_class_labels=groundtruth_class_labels, + groundtruth_is_difficult_list=groundtruth_is_difficult_list, + groundtruth_is_group_of_list=groundtruth_is_group_of_list, + detected_masks=detected_masks, + groundtruth_masks=groundtruth_masks, + ) + + return scores, tp_fp_labels + + def _compute_tp_fp( + self, + detected_boxes, + detected_scores, + detected_class_labels, + groundtruth_boxes, + groundtruth_class_labels, + groundtruth_is_difficult_list, + groundtruth_is_group_of_list, + detected_masks=None, + groundtruth_masks=None, + ): + """Labels true/false positives of detections of an image across all classes. + + Args: + detected_boxes: A float numpy array of shape [N, 4], representing N + regions of detected object regions. + Each row is of the format [y_min, x_min, y_max, x_max] + detected_scores: A float numpy array of shape [N, 1], representing + the confidence scores of the detected N object instances. + detected_class_labels: A integer numpy array of shape [N, 1], repreneting + the class labels of the detected N object instances. + groundtruth_boxes: A float numpy array of shape [M, 4], representing M + regions of object instances in ground truth + groundtruth_class_labels: An integer numpy array of shape [M, 1], + representing M class labels of object instances in ground truth + groundtruth_is_difficult_list: A boolean numpy array of length M denoting + whether a ground truth box is a difficult instance or not + groundtruth_is_group_of_list: A boolean numpy array of length M denoting + whether a ground truth box has group-of tag + detected_masks: (optional) A np.uint8 numpy array of shape + [N, height, width]. If not None, the scores will be computed based + on masks. + groundtruth_masks: (optional) A np.uint8 numpy array of shape + [M, height, width]. + + Returns: + result_scores: A list of float numpy arrays. Each numpy array is of + shape [K, 1], representing K scores detected with object class + label c + result_tp_fp_labels: A list of boolean numpy array. Each numpy array is of + shape [K, 1], representing K True/False positive label of object + instances detected with class label c + + Raises: + ValueError: If detected masks is not None but groundtruth masks are None, + or the other way around. + """ + if detected_masks is not None and groundtruth_masks is None: + raise ValueError( + "Detected masks is available but groundtruth masks is not." + ) + if detected_masks is None and groundtruth_masks is not None: + raise ValueError( + "Groundtruth masks is available but detected masks is not." 
+ ) + + result_scores = [] + result_tp_fp_labels = [] + for i in range(self.num_groundtruth_classes): + groundtruth_is_difficult_list_at_ith_class = groundtruth_is_difficult_list[ + groundtruth_class_labels == i + ] + groundtruth_is_group_of_list_at_ith_class = groundtruth_is_group_of_list[ + groundtruth_class_labels == i + ] + ( + gt_boxes_at_ith_class, + gt_masks_at_ith_class, + detected_boxes_at_ith_class, + detected_scores_at_ith_class, + detected_masks_at_ith_class, + ) = self._get_ith_class_arrays( + detected_boxes, + detected_scores, + detected_masks, + detected_class_labels, + groundtruth_boxes, + groundtruth_masks, + groundtruth_class_labels, + i, + ) + scores, tp_fp_labels = self._compute_tp_fp_for_single_class( + detected_boxes=detected_boxes_at_ith_class, + detected_scores=detected_scores_at_ith_class, + groundtruth_boxes=gt_boxes_at_ith_class, + groundtruth_is_difficult_list=groundtruth_is_difficult_list_at_ith_class, + groundtruth_is_group_of_list=groundtruth_is_group_of_list_at_ith_class, + detected_masks=detected_masks_at_ith_class, + groundtruth_masks=gt_masks_at_ith_class, + ) + result_scores.append(scores) + result_tp_fp_labels.append(tp_fp_labels) + return result_scores, result_tp_fp_labels + + def _get_overlaps_and_scores_box_mode( + self, + detected_boxes, + detected_scores, + groundtruth_boxes, + groundtruth_is_group_of_list, + ): + """Computes overlaps and scores between detected and groudntruth boxes. + + Args: + detected_boxes: A numpy array of shape [N, 4] representing detected box + coordinates + detected_scores: A 1-d numpy array of length N representing classification + score + groundtruth_boxes: A numpy array of shape [M, 4] representing ground truth + box coordinates + groundtruth_is_group_of_list: A boolean numpy array of length M denoting + whether a ground truth box has group-of tag. If a groundtruth box + is group-of box, every detection matching this box is ignored. + + Returns: + iou: A float numpy array of size [num_detected_boxes, num_gt_boxes]. If + gt_non_group_of_boxlist.num_boxes() == 0 it will be None. + ioa: A float numpy array of size [num_detected_boxes, num_gt_boxes]. If + gt_group_of_boxlist.num_boxes() == 0 it will be None. + scores: The score of the detected boxlist. + num_boxes: Number of non-maximum suppressed detected boxes. + """ + detected_boxlist = np_box_list.BoxList(detected_boxes) + detected_boxlist.add_field("scores", detected_scores) + gt_non_group_of_boxlist = np_box_list.BoxList( + groundtruth_boxes[~groundtruth_is_group_of_list] + ) + iou = np_box_list_ops.iou(detected_boxlist, gt_non_group_of_boxlist) + scores = detected_boxlist.get_field("scores") + num_boxes = detected_boxlist.num_boxes() + return iou, None, scores, num_boxes + + def _compute_tp_fp_for_single_class( + self, + detected_boxes, + detected_scores, + groundtruth_boxes, + groundtruth_is_difficult_list, + groundtruth_is_group_of_list, + detected_masks=None, + groundtruth_masks=None, + ): + """Labels boxes detected with the same class from the same image as tp/fp. + + Args: + detected_boxes: A numpy array of shape [N, 4] representing detected box + coordinates + detected_scores: A 1-d numpy array of length N representing classification + score + groundtruth_boxes: A numpy array of shape [M, 4] representing ground truth + box coordinates + groundtruth_is_difficult_list: A boolean numpy array of length M denoting + whether a ground truth box is a difficult instance or not. If a + groundtruth box is difficult, every detection matching this box + is ignored. 
+ groundtruth_is_group_of_list: A boolean numpy array of length M denoting + whether a ground truth box has group-of tag. If a groundtruth box + is group-of box, every detection matching this box is ignored. + detected_masks: (optional) A uint8 numpy array of shape + [N, height, width]. If not None, the scores will be computed based + on masks. + groundtruth_masks: (optional) A uint8 numpy array of shape + [M, height, width]. + + Returns: + Two arrays of the same size, containing all boxes that were evaluated as + being true positives or false positives; if a box matched to a difficult + box or to a group-of box, it is ignored. + + scores: A numpy array representing the detection scores. + tp_fp_labels: a boolean numpy array indicating whether a detection is a + true positive. + """ + if detected_boxes.size == 0: + return np.array([], dtype=float), np.array([], dtype=bool) + + ( + iou, + _, + scores, + num_detected_boxes, + ) = self._get_overlaps_and_scores_box_mode( + detected_boxes=detected_boxes, + detected_scores=detected_scores, + groundtruth_boxes=groundtruth_boxes, + groundtruth_is_group_of_list=groundtruth_is_group_of_list, + ) + + if groundtruth_boxes.size == 0: + return scores, np.zeros(num_detected_boxes, dtype=bool) + + tp_fp_labels = np.zeros(num_detected_boxes, dtype=bool) + is_matched_to_difficult_box = np.zeros(num_detected_boxes, dtype=bool) + is_matched_to_group_of_box = np.zeros(num_detected_boxes, dtype=bool) + + # The evaluation is done in two stages: + # 1. All detections are matched to non group-of boxes; true positives are + # determined and detections matched to difficult boxes are ignored. + # 2. Detections that are determined as false positives are matched against + # group-of boxes and ignored if matched. + + # Tp-fp evaluation for non-group of boxes (if any). + if iou.shape[1] > 0: + groundtruth_nongroup_of_is_difficult_list = groundtruth_is_difficult_list[ + ~groundtruth_is_group_of_list + ] + max_overlap_gt_ids = np.argmax(iou, axis=1) + is_gt_box_detected = np.zeros(iou.shape[1], dtype=bool) + for i in range(num_detected_boxes): + gt_id = max_overlap_gt_ids[i] + if iou[i, gt_id] >= self.matching_iou_threshold: + if not groundtruth_nongroup_of_is_difficult_list[gt_id]: + if not is_gt_box_detected[gt_id]: + tp_fp_labels[i] = True + is_gt_box_detected[gt_id] = True + else: + is_matched_to_difficult_box[i] = True + + return ( + scores[~is_matched_to_difficult_box & ~is_matched_to_group_of_box], + tp_fp_labels[ + ~is_matched_to_difficult_box & ~is_matched_to_group_of_box + ], + ) + + def _get_ith_class_arrays( + self, + detected_boxes, + detected_scores, + detected_masks, + detected_class_labels, + groundtruth_boxes, + groundtruth_masks, + groundtruth_class_labels, + class_index, + ): + """Returns numpy arrays belonging to class with index `class_index`. + + Args: + detected_boxes: A numpy array containing detected boxes. + detected_scores: A numpy array containing detected scores. + detected_masks: A numpy array containing detected masks. + detected_class_labels: A numpy array containing detected class labels. + groundtruth_boxes: A numpy array containing groundtruth boxes. + groundtruth_masks: A numpy array containing groundtruth masks. + groundtruth_class_labels: A numpy array containing groundtruth class + labels. + class_index: An integer index. + + Returns: + gt_boxes_at_ith_class: A numpy array containing groundtruth boxes labeled + as ith class. + gt_masks_at_ith_class: A numpy array containing groundtruth masks labeled + as ith class. 
+ detected_boxes_at_ith_class: A numpy array containing detected boxes + corresponding to the ith class. + detected_scores_at_ith_class: A numpy array containing detected scores + corresponding to the ith class. + detected_masks_at_ith_class: A numpy array containing detected masks + corresponding to the ith class. + """ + selected_groundtruth = groundtruth_class_labels == class_index + gt_boxes_at_ith_class = groundtruth_boxes[selected_groundtruth] + if groundtruth_masks is not None: + gt_masks_at_ith_class = groundtruth_masks[selected_groundtruth] + else: + gt_masks_at_ith_class = None + selected_detections = detected_class_labels == class_index + detected_boxes_at_ith_class = detected_boxes[selected_detections] + detected_scores_at_ith_class = detected_scores[selected_detections] + if detected_masks is not None: + detected_masks_at_ith_class = detected_masks[selected_detections] + else: + detected_masks_at_ith_class = None + return ( + gt_boxes_at_ith_class, + gt_masks_at_ith_class, + detected_boxes_at_ith_class, + detected_scores_at_ith_class, + detected_masks_at_ith_class, + ) + + def _remove_invalid_boxes( + self, + detected_boxes, + detected_scores, + detected_class_labels, + detected_masks=None, + ): + """Removes entries with invalid boxes. + + A box is invalid if either its xmax is smaller than its xmin, or its ymax + is smaller than its ymin. + + Args: + detected_boxes: A float numpy array of size [num_boxes, 4] containing box + coordinates in [ymin, xmin, ymax, xmax] format. + detected_scores: A float numpy array of size [num_boxes]. + detected_class_labels: A int32 numpy array of size [num_boxes]. + detected_masks: A uint8 numpy array of size [num_boxes, height, width]. + + Returns: + valid_detected_boxes: A float numpy array of size [num_valid_boxes, 4] + containing box coordinates in [ymin, xmin, ymax, xmax] format. + valid_detected_scores: A float numpy array of size [num_valid_boxes]. + valid_detected_class_labels: A int32 numpy array of size + [num_valid_boxes]. + valid_detected_masks: A uint8 numpy array of size + [num_valid_boxes, height, width]. + """ + valid_indices = np.logical_and( + detected_boxes[:, 0] < detected_boxes[:, 2], + detected_boxes[:, 1] < detected_boxes[:, 3], + ) + detected_boxes = detected_boxes[valid_indices] + detected_scores = detected_scores[valid_indices] + detected_class_labels = detected_class_labels[valid_indices] + if detected_masks is not None: + detected_masks = detected_masks[valid_indices] + return [ + detected_boxes, + detected_scores, + detected_class_labels, + detected_masks, + ] diff --git a/evaluator/ava_evaluation/standard_fields.py b/evaluator/ava_evaluation/standard_fields.py new file mode 100644 index 0000000..ea30a05 --- /dev/null +++ b/evaluator/ava_evaluation/standard_fields.py @@ -0,0 +1,225 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================== + +"""Contains classes specifying naming conventions used for object detection. + + +Specifies: + InputDataFields: standard fields used by reader/preprocessor/batcher. + DetectionResultFields: standard fields returned by object detector. + BoxListFields: standard field used by BoxList + TfExampleFields: standard fields for tf-example data format (go/tf-example). +""" + + +from __future__ import ( + absolute_import, + division, + print_function, + unicode_literals, +) + + +class InputDataFields(object): + """Names for the input tensors. + + Holds the standard data field names to use for identifying input tensors. This + should be used by the decoder to identify keys for the returned tensor_dict + containing input tensors. And it should be used by the model to identify the + tensors it needs. + + Attributes: + image: image. + original_image: image in the original input size. + key: unique key corresponding to image. + source_id: source of the original image. + filename: original filename of the dataset (without common path). + groundtruth_image_classes: image-level class labels. + groundtruth_boxes: coordinates of the ground truth boxes in the image. + groundtruth_classes: box-level class labels. + groundtruth_label_types: box-level label types (e.g. explicit negative). + groundtruth_is_crowd: [DEPRECATED, use groundtruth_group_of instead] + is the groundtruth a single object or a crowd. + groundtruth_area: area of a groundtruth segment. + groundtruth_difficult: is a `difficult` object + groundtruth_group_of: is a `group_of` objects, e.g. multiple objects of the + same class, forming a connected group, where instances are heavily + occluding each other. + proposal_boxes: coordinates of object proposal boxes. + proposal_objectness: objectness score of each proposal. + groundtruth_instance_masks: ground truth instance masks. + groundtruth_instance_boundaries: ground truth instance boundaries. + groundtruth_instance_classes: instance mask-level class labels. + groundtruth_keypoints: ground truth keypoints. + groundtruth_keypoint_visibilities: ground truth keypoint visibilities. + groundtruth_label_scores: groundtruth label scores. + groundtruth_weights: groundtruth weight factor for bounding boxes. + num_groundtruth_boxes: number of groundtruth boxes. + true_image_shapes: true shapes of images in the resized images, as resized + images can be padded with zeros. 
+ """ + + image = "image" + original_image = "original_image" + key = "key" + source_id = "source_id" + filename = "filename" + groundtruth_image_classes = "groundtruth_image_classes" + groundtruth_boxes = "groundtruth_boxes" + groundtruth_classes = "groundtruth_classes" + groundtruth_label_types = "groundtruth_label_types" + groundtruth_is_crowd = "groundtruth_is_crowd" + groundtruth_area = "groundtruth_area" + groundtruth_difficult = "groundtruth_difficult" + groundtruth_group_of = "groundtruth_group_of" + proposal_boxes = "proposal_boxes" + proposal_objectness = "proposal_objectness" + groundtruth_instance_masks = "groundtruth_instance_masks" + groundtruth_instance_boundaries = "groundtruth_instance_boundaries" + groundtruth_instance_classes = "groundtruth_instance_classes" + groundtruth_keypoints = "groundtruth_keypoints" + groundtruth_keypoint_visibilities = "groundtruth_keypoint_visibilities" + groundtruth_label_scores = "groundtruth_label_scores" + groundtruth_weights = "groundtruth_weights" + num_groundtruth_boxes = "num_groundtruth_boxes" + true_image_shape = "true_image_shape" + + +class DetectionResultFields(object): + """Naming conventions for storing the output of the detector. + + Attributes: + source_id: source of the original image. + key: unique key corresponding to image. + detection_boxes: coordinates of the detection boxes in the image. + detection_scores: detection scores for the detection boxes in the image. + detection_classes: detection-level class labels. + detection_masks: contains a segmentation mask for each detection box. + detection_boundaries: contains an object boundary for each detection box. + detection_keypoints: contains detection keypoints for each detection box. + num_detections: number of detections in the batch. + """ + + source_id = "source_id" + key = "key" + detection_boxes = "detection_boxes" + detection_scores = "detection_scores" + detection_classes = "detection_classes" + detection_masks = "detection_masks" + detection_boundaries = "detection_boundaries" + detection_keypoints = "detection_keypoints" + num_detections = "num_detections" + + +class BoxListFields(object): + """Naming conventions for BoxLists. + + Attributes: + boxes: bounding box coordinates. + classes: classes per bounding box. + scores: scores per bounding box. + weights: sample weights per bounding box. + objectness: objectness score per bounding box. + masks: masks per bounding box. + boundaries: boundaries per bounding box. + keypoints: keypoints per bounding box. + keypoint_heatmaps: keypoint heatmaps per bounding box. + """ + + boxes = "boxes" + classes = "classes" + scores = "scores" + weights = "weights" + objectness = "objectness" + masks = "masks" + boundaries = "boundaries" + keypoints = "keypoints" + keypoint_heatmaps = "keypoint_heatmaps" + + +class TfExampleFields(object): + """TF-example proto feature names for object detection. + + Holds the standard feature names to load from an Example proto for object + detection. + + Attributes: + image_encoded: JPEG encoded string + image_format: image format, e.g. "JPEG" + filename: filename + channels: number of channels of image + colorspace: colorspace, e.g. "RGB" + height: height of image in pixels, e.g. 462 + width: width of image in pixels, e.g. 581 + source_id: original source of the image + object_class_text: labels in text format, e.g. ["person", "cat"] + object_class_label: labels in numbers, e.g. [16, 8] + object_bbox_xmin: xmin coordinates of groundtruth box, e.g. 
10, 30 + object_bbox_xmax: xmax coordinates of groundtruth box, e.g. 50, 40 + object_bbox_ymin: ymin coordinates of groundtruth box, e.g. 40, 50 + object_bbox_ymax: ymax coordinates of groundtruth box, e.g. 80, 70 + object_view: viewpoint of object, e.g. ["frontal", "left"] + object_truncated: is object truncated, e.g. [true, false] + object_occluded: is object occluded, e.g. [true, false] + object_difficult: is object difficult, e.g. [true, false] + object_group_of: is object a single object or a group of objects + object_depiction: is object a depiction + object_is_crowd: [DEPRECATED, use object_group_of instead] + is the object a single object or a crowd + object_segment_area: the area of the segment. + object_weight: a weight factor for the object's bounding box. + instance_masks: instance segmentation masks. + instance_boundaries: instance boundaries. + instance_classes: Classes for each instance segmentation mask. + detection_class_label: class label in numbers. + detection_bbox_ymin: ymin coordinates of a detection box. + detection_bbox_xmin: xmin coordinates of a detection box. + detection_bbox_ymax: ymax coordinates of a detection box. + detection_bbox_xmax: xmax coordinates of a detection box. + detection_score: detection score for the class label and box. + """ + + image_encoded = "image/encoded" + image_format = "image/format" # format is reserved keyword + filename = "image/filename" + channels = "image/channels" + colorspace = "image/colorspace" + height = "image/height" + width = "image/width" + source_id = "image/source_id" + object_class_text = "image/object/class/text" + object_class_label = "image/object/class/label" + object_bbox_ymin = "image/object/bbox/ymin" + object_bbox_xmin = "image/object/bbox/xmin" + object_bbox_ymax = "image/object/bbox/ymax" + object_bbox_xmax = "image/object/bbox/xmax" + object_view = "image/object/view" + object_truncated = "image/object/truncated" + object_occluded = "image/object/occluded" + object_difficult = "image/object/difficult" + object_group_of = "image/object/group_of" + object_depiction = "image/object/depiction" + object_is_crowd = "image/object/is_crowd" + object_segment_area = "image/object/segment/area" + object_weight = "image/object/weight" + instance_masks = "image/segmentation/object" + instance_boundaries = "image/boundaries/object" + instance_classes = "image/segmentation/object/class" + detection_class_label = "image/detection/label" + detection_bbox_ymin = "image/detection/bbox/ymin" + detection_bbox_xmin = "image/detection/bbox/xmin" + detection_bbox_ymax = "image/detection/bbox/ymax" + detection_bbox_xmax = "image/detection/bbox/xmax" + detection_score = "image/detection/score" diff --git a/evaluator/ava_evaluator.py b/evaluator/ava_evaluator.py new file mode 100644 index 0000000..11b59fa --- /dev/null +++ b/evaluator/ava_evaluator.py @@ -0,0 +1,249 @@ +import time +import numpy as np +import os +from collections import defaultdict +import torch +import json + +from dataset.ava import AVA_Dataset + +from .ava_eval_helper import ( + run_evaluation, + read_csv, + read_exclusions, + read_labelmap, + write_results +) + + + +class AVA_Evaluator(object): + def __init__(self, + d_cfg, + data_root, + img_size, + len_clip, + sampling_rate, + batch_size, + transform, + collate_fn, + full_test_on_val=False, + version='v2.2'): + self.all_preds = [] + self.full_ava_test = full_test_on_val + self.data_root = data_root + self.backup_dir = d_cfg['backup_dir'] + self.annotation_dir = os.path.join(data_root, d_cfg['annotation_dir']) 
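+ # d_cfg is the dataset config dict; it is expected to provide the AVA file layout used
+ # below: 'backup_dir', 'annotation_dir', 'labelmap_file', 'frames_dir', 'frame_list',
+ # 'val_exclusion_file' and 'val_gt_box_list' (most paths are joined with data_root or annotation_dir).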
+ self.labelmap_file = os.path.join(self.annotation_dir, d_cfg['labelmap_file']) + self.frames_dir = os.path.join(data_root, d_cfg['frames_dir']) + self.frame_list = os.path.join(data_root, d_cfg['frame_list']) + self.exclusion_file = os.path.join(self.annotation_dir, d_cfg['val_exclusion_file']) + self.gt_box_list = os.path.join(self.annotation_dir, d_cfg['val_gt_box_list']) + + # load data + self.excluded_keys = read_exclusions(self.exclusion_file) + self.categories, self.class_whitelist = read_labelmap(self.labelmap_file) + self.full_groundtruth = read_csv(self.gt_box_list, self.class_whitelist) + self.mini_groundtruth = self.get_ava_mini_groundtruth(self.full_groundtruth) + _, self.video_idx_to_name = self.load_image_lists(self.frames_dir, self.frame_list, is_train=False) + + # create output_json file + os.makedirs(self.backup_dir, exist_ok=True) + self.backup_dir = os.path.join(self.backup_dir, 'ava_{}'.format(version)) + os.makedirs(self.backup_dir, exist_ok=True) + self.output_json = os.path.join(self.backup_dir, 'ava_detections.json') + + # dataset + self.testset = AVA_Dataset( + cfg=d_cfg, + data_root=data_root, + is_train=False, + img_size=img_size, + transform=transform, + len_clip=len_clip, + sampling_rate=sampling_rate + ) + self.num_classes = self.testset.num_classes + + # dataloader + self.testloader = torch.utils.data.DataLoader( + dataset=self.testset, + batch_size=batch_size, + shuffle=False, + collate_fn=collate_fn, + num_workers=4, + drop_last=False, + pin_memory=True + ) + + + def get_ava_mini_groundtruth(self, full_groundtruth): + """ + Get the groundtruth annotations corresponding the "subset" of AVA val set. + We define the subset to be the frames such that (second % 4 == 0). + We optionally use subset for faster evaluation during training + (in order to track training progress). + Args: + full_groundtruth(dict): list of groundtruth. + """ + ret = [defaultdict(list), defaultdict(list), defaultdict(list)] + + for i in range(3): + for key in full_groundtruth[i].keys(): + if int(key.split(",")[1]) % 4 == 0: + ret[i][key] = full_groundtruth[i][key] + return ret + + + def load_image_lists(self, frames_dir, frame_list, is_train): + """ + Loading image paths from corresponding files. + + Args: + frames_dir (str): path to frames dir. + frame_list (str): path to frame list. + is_train (bool): if it is training dataset or not. + + Returns: + image_paths (list[list]): a list of items. Each item (also a list) + corresponds to one video and contains the paths of images for + this video. + video_idx_to_name (list): a list which stores video names. + """ + # frame_list_dir is /data3/ava/frame_lists/ + # contains 'train.csv' and 'val.csv' + if is_train: + list_name = "train.csv" + else: + list_name = "val.csv" + + list_filename = os.path.join(frame_list, list_name) + + image_paths = defaultdict(list) + video_name_to_idx = {} + video_idx_to_name = [] + with open(list_filename, "r") as f: + f.readline() + for line in f: + row = line.split() + # The format of each row should follow: + # original_vido_id video_id frame_id path labels. 
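+ # Each subsequent data row is expected to look like (hypothetical example):
+ # 'vid0001 0 1 vid0001/vid0001_000001.jpg ""'
+ # where row[0] is the video name and row[3] is the frame path relative to frames_dir.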
+ assert len(row) == 5 + video_name = row[0] + + if video_name not in video_name_to_idx: + idx = len(video_name_to_idx) + video_name_to_idx[video_name] = idx + video_idx_to_name.append(video_name) + + data_key = video_name_to_idx[video_name] + + image_paths[data_key].append(os.path.join(frames_dir, row[3])) + + image_paths = [image_paths[i] for i in range(len(image_paths))] + + print("Finished loading image paths from: {}".format(list_filename)) + + return image_paths, video_idx_to_name + + + def update_stats(self, preds): + self.all_preds.extend(preds) + + + def get_ava_eval_data(self): + out_scores = defaultdict(list) + out_labels = defaultdict(list) + out_boxes = defaultdict(list) + count = 0 + + # each pred is [[x1, y1, x2, y2], cls_out, [video_idx, src]] + for i in range(len(self.all_preds)): + pred = self.all_preds[i] + assert len(pred) == 3 + video_idx = int(np.round(pred[-1][0])) + sec = int(np.round(pred[-1][1])) + box = pred[0] + scores = pred[1] + assert len(scores) == 80 + + video = self.video_idx_to_name[video_idx] + key = video + ',' + "%04d" % (sec) + box = [box[1], box[0], box[3], box[2]] # turn to y1,x1,y2,x2 + + for cls_idx, score in enumerate(scores): + if cls_idx + 1 in self.class_whitelist: + out_scores[key].append(score) + out_labels[key].append(cls_idx + 1) + out_boxes[key].append(box) + count += 1 + + return out_boxes, out_labels, out_scores + + + def calculate_mAP(self, epoch): + eval_start = time.time() + detections = self.get_ava_eval_data() + if self.full_ava_test: + groundtruth = self.full_groundtruth + else: + groundtruth = self.mini_groundtruth + + print("Evaluating with %d unique GT frames." % len(groundtruth[0])) + print("Evaluating with %d unique detection frames" % len(detections[0])) + + write_results(detections, os.path.join(self.backup_dir, "detections_{}.csv".format(epoch))) + write_results(groundtruth, os.path.join(self.backup_dir, "groundtruth_{}.csv".format(epoch))) + results = run_evaluation(self.categories, groundtruth, detections, self.excluded_keys) + with open(self.output_json, 'w') as fp: + json.dump(results, fp) + print("Save eval results in {}".format(self.output_json)) + + print("AVA eval done in %f seconds." 
% (time.time() - eval_start)) + + return results["PascalBoxes_Precision/mAP@0.5IOU"] + + + def evaluate_frame_map(self, model, epoch=1): + model.eval() + epoch_size = len(self.testloader) + + for iter_i, (_, batch_video_clip, batch_target) in enumerate(self.testloader): + + # to device + batch_video_clip = batch_video_clip.to(model.device) + + with torch.no_grad(): + # inference + batch_bboxes = model(batch_video_clip) + + # process batch + preds_list = [] + for bi in range(len(batch_bboxes)): + bboxes = batch_bboxes[bi] + target = batch_target[bi] + + # video info + video_idx = target['video_idx'] + sec = target['sec'] + + for bbox in bboxes: + x1, y1, x2, y2 = bbox[:4] + det_conf = float(bbox[4]) + cls_out = [np.sqrt(det_conf * cls_conf) for cls_conf in bbox[5:]] + + preds_list.append([[x1,y1,x2,y2], cls_out, [video_idx, sec]]) + + self.update_stats(preds_list) + if iter_i % 100 == 0: + log_info = "[%d / %d]" % (iter_i, epoch_size) + print(log_info, flush=True) + + mAP = self.calculate_mAP(epoch) + print("mAP: {}".format(mAP)) + + # clear + del self.all_preds + self.all_preds = [] + + return mAP diff --git a/evaluator/cal_frame_mAP.py b/evaluator/cal_frame_mAP.py new file mode 100644 index 0000000..4fdffdb --- /dev/null +++ b/evaluator/cal_frame_mAP.py @@ -0,0 +1,1008 @@ +########################################################################################### +# # +# This sample shows how to evaluate object detections applying the following metrics: # +# * Precision x Recall curve ----> used by VOC PASCAL 2012) # +# * Average Precision (AP) ----> used by VOC PASCAL 2012) # +# # +# Developed by: Rafael Padilla (rafael.padilla@smt.ufrj.br) # +# SMT - Signal Multimedia and Telecommunications Lab # +# COPPE - Universidade Federal do Rio de Janeiro # +# Last modification: Oct 9th 2018 # +########################################################################################### + +import argparse +import glob +import os +import shutil +import sys +import cv2 +from enum import Enum + +from collections import Counter +import matplotlib.pyplot as plt +import numpy as np + + +class MethodAveragePrecision(Enum): + """ + Class representing if the coordinates are relative to the + image size or are absolute values. + + Developed by: Rafael Padilla + Last modification: Apr 28 2018 + """ + EveryPointInterpolation = 1 + ElevenPointInterpolation = 2 + + +class CoordinatesType(Enum): + """ + Class representing if the coordinates are relative to the + image size or are absolute values. + + Developed by: Rafael Padilla + Last modification: Apr 28 2018 + """ + Relative = 1 + Absolute = 2 + + +class BBType(Enum): + """ + Class representing if the bounding box is groundtruth or not. + + Developed by: Rafael Padilla + Last modification: May 24 2018 + """ + GroundTruth = 1 + Detected = 2 + + +class BBFormat(Enum): + """ + Class representing the format of a bounding box. + It can be (X,Y,width,height) => XYWH + or (X1,Y1,X2,Y2) => XYX2Y2 + + Developed by: Rafael Padilla + Last modification: May 24 2018 + """ + XYWH = 1 + XYX2Y2 = 2 + + +# size => (width, height) of the image +# box => (X1, X2, Y1, Y2) of the bounding box +def convertToRelativeValues(size, box): + dw = 1. / (size[0]) + dh = 1. 
/ (size[1]) + cx = (box[1] + box[0]) / 2.0 + cy = (box[3] + box[2]) / 2.0 + w = box[1] - box[0] + h = box[3] - box[2] + x = cx * dw + y = cy * dh + w = w * dw + h = h * dh + # x,y => (bounding_box_center)/width_of_the_image + # w => bounding_box_width / width_of_the_image + # h => bounding_box_height / height_of_the_image + return (x, y, w, h) + + +# size => (width, height) of the image +# box => (centerX, centerY, w, h) of the bounding box relative to the image +def convertToAbsoluteValues(size, box): + # w_box = round(size[0] * box[2]) + # h_box = round(size[1] * box[3]) + xIn = round(((2 * float(box[0]) - float(box[2])) * size[0] / 2)) + yIn = round(((2 * float(box[1]) - float(box[3])) * size[1] / 2)) + xEnd = xIn + round(float(box[2]) * size[0]) + yEnd = yIn + round(float(box[3]) * size[1]) + if xIn < 0: + xIn = 0 + if yIn < 0: + yIn = 0 + if xEnd >= size[0]: + xEnd = size[0] - 1 + if yEnd >= size[1]: + yEnd = size[1] - 1 + return (xIn, yIn, xEnd, yEnd) + + +def add_bb_into_image(image, bb, color=(255, 0, 0), thickness=2, label=None): + r = int(color[0]) + g = int(color[1]) + b = int(color[2]) + + font = cv2.FONT_HERSHEY_SIMPLEX + fontScale = 0.5 + fontThickness = 1 + + x1, y1, x2, y2 = bb.getAbsoluteBoundingBox(BBFormat.XYX2Y2) + x1 = int(x1) + y1 = int(y1) + x2 = int(x2) + y2 = int(y2) + cv2.rectangle(image, (x1, y1), (x2, y2), (b, g, r), thickness) + # Add label + if label is not None: + # Get size of the text box + (tw, th) = cv2.getTextSize(label, font, fontScale, fontThickness)[0] + # Top-left coord of the textbox + (xin_bb, yin_bb) = (x1 + thickness, y1 - th + int(12.5 * fontScale)) + # Checking position of the text top-left (outside or inside the bb) + if yin_bb - th <= 0: # if outside the image + yin_bb = y1 + th # put it inside the bb + r_Xin = x1 - int(thickness / 2) + r_Yin = y1 - th - int(thickness / 2) + # Draw filled rectangle to put the text in it + cv2.rectangle(image, (r_Xin, r_Yin - thickness), + (r_Xin + tw + thickness * 3, r_Yin + th + int(12.5 * fontScale)), (b, g, r), + -1) + cv2.putText(image, label, (xin_bb, yin_bb), font, fontScale, (0, 0, 0), fontThickness, + cv2.LINE_AA) + return image + + +# BoundingBox +class BoundingBox: + def __init__(self, + imageName, + classId, + x, + y, + w, + h, + typeCoordinates=None, + imgSize=None, + bbType=None, + classConfidence=None, + format=None): + """Constructor. + Args: + imageName: String representing the image name. + classId: String value representing class id. + x: Float value representing the X upper-left coordinate of the bounding box. + y: Float value representing the Y upper-left coordinate of the bounding box. + w: Float value representing the width bounding box. + h: Float value representing the height bounding box. + typeCoordinates: (optional) Enum (Relative or Absolute) represents if the bounding box + coordinates (x,y,w,h) are absolute or relative to size of the image. Default:'Absolute'. + imgSize: (optional) 2D vector (width, height)=>(int, int) represents the size of the + image of the bounding box. If typeCoordinates is 'Relative', imgSize is required. + bbType: (optional) Enum (Groundtruth or Detection) identifies if the bounding box + represents a ground truth or a detection. If it is a detection, the classConfidence has + to be informed. + classConfidence: (optional) Float value representing the confidence of the detected + class. If detectionType is Detection, classConfidence needs to be informed. 
+ format: (optional) Enum (BBFormat.XYWH or BBFormat.XYX2Y2) indicating the format of the + coordinates of the bounding boxes. BBFormat.XYWH: + BBFormat.XYX2Y2: . + """ + self._imageName = imageName + self._typeCoordinates = typeCoordinates + if typeCoordinates == CoordinatesType.Relative and imgSize is None: + raise IOError( + 'Parameter \'imgSize\' is required. It is necessary to inform the image size.') + if bbType == BBType.Detected and classConfidence is None: + raise IOError( + 'For bbType=\'Detection\', it is necessary to inform the classConfidence value.') + # if classConfidence != None and (classConfidence < 0 or classConfidence > 1): + # raise IOError('classConfidence value must be a real value between 0 and 1. Value: %f' % + # classConfidence) + + self._classConfidence = classConfidence + self._bbType = bbType + self._classId = classId + self._format = format + + # If relative coordinates, convert to absolute values + # For relative coords: (x,y,w,h)=(X_center/img_width , Y_center/img_height) + if (typeCoordinates == CoordinatesType.Relative): + (self._x, self._y, self._w, self._h) = convertToAbsoluteValues(imgSize, (x, y, w, h)) + self._width_img = imgSize[0] + self._height_img = imgSize[1] + if format == BBFormat.XYWH: + self._x2 = self._w + self._y2 = self._h + self._w = self._x2 - self._x + self._h = self._y2 - self._y + else: + raise IOError( + 'For relative coordinates, the format must be XYWH (x,y,width,height)') + # For absolute coords: (x,y,w,h)=real bb coords + else: + self._x = x + self._y = y + if format == BBFormat.XYWH: + self._w = w + self._h = h + self._x2 = self._x + self._w + self._y2 = self._y + self._h + else: # format == BBFormat.XYX2Y2: . + self._x2 = w + self._y2 = h + self._w = self._x2 - self._x + self._h = self._y2 - self._y + if imgSize is None: + self._width_img = None + self._height_img = None + else: + self._width_img = imgSize[0] + self._height_img = imgSize[1] + + def getAbsoluteBoundingBox(self, format=None): + if format == BBFormat.XYWH: + return (self._x, self._y, self._w, self._h) + elif format == BBFormat.XYX2Y2: + return (self._x, self._y, self._x2, self._y2) + + def getRelativeBoundingBox(self, imgSize=None): + if imgSize is None and self._width_img is None and self._height_img is None: + raise IOError( + 'Parameter \'imgSize\' is required. 
It is necessary to inform the image size.') + if imgSize is None: + return convertToRelativeValues((imgSize[0], imgSize[1]), + (self._x, self._y, self._w, self._h)) + else: + return convertToRelativeValues((self._width_img, self._height_img), + (self._x, self._y, self._w, self._h)) + + def getImageName(self): + return self._imageName + + def getConfidence(self): + return self._classConfidence + + def getFormat(self): + return self._format + + def getClassId(self): + return self._classId + + def getImageSize(self): + return (self._width_img, self._height_img) + + def getCoordinatesType(self): + return self._typeCoordinates + + def getBBType(self): + return self._bbType + + @staticmethod + def compare(det1, det2): + det1BB = det1.getAbsoluteBoundingBox(format=BBFormat.XYWH) + det1ImgSize = det1.getImageSize() + det2BB = det2.getAbsoluteBoundingBox(format=BBFormat.XYWH) + det2ImgSize = det2.getImageSize() + + if det1.getClassId() == det2.getClassId() and \ + det1.classConfidence == det2.classConfidenc() and \ + det1BB[0] == det2BB[0] and \ + det1BB[1] == det2BB[1] and \ + det1BB[2] == det2BB[2] and \ + det1BB[3] == det2BB[3] and \ + det1ImgSize[0] == det1ImgSize[0] and \ + det2ImgSize[1] == det2ImgSize[1]: + return True + return False + + @staticmethod + def clone(boundingBox): + absBB = boundingBox.getAbsoluteBoundingBox(format=BBFormat.XYWH) + # return (self._x,self._y,self._x2,self._y2) + newBoundingBox = BoundingBox( + boundingBox.getImageName(), + boundingBox.getClassId(), + absBB[0], + absBB[1], + absBB[2], + absBB[3], + typeCoordinates=boundingBox.getCoordinatesType(), + imgSize=boundingBox.getImageSize(), + bbType=boundingBox.getBBType(), + classConfidence=boundingBox.getConfidence(), + format=BBFormat.XYWH) + return newBoundingBox + + +#BoundingBoxes +class BoundingBoxes: + def __init__(self): + self._boundingBoxes = [] + + def addBoundingBox(self, bb): + self._boundingBoxes.append(bb) + + def removeBoundingBox(self, _boundingBox): + for d in self._boundingBoxes: + if BoundingBox.compare(d, _boundingBox): + del self._boundingBoxes[d] + return + + def removeAllBoundingBoxes(self): + self._boundingBoxes = [] + + def getBoundingBoxes(self): + return self._boundingBoxes + + def getBoundingBoxByClass(self, classId): + boundingBoxes = [] + for d in self._boundingBoxes: + if d.getClassId() == classId: # get only specified bounding box type + boundingBoxes.append(d) + return boundingBoxes + + def getClasses(self): + classes = [] + for d in self._boundingBoxes: + c = d.getClassId() + if c not in classes: + classes.append(c) + return classes + + def getBoundingBoxesByType(self, bbType): + # get only specified bb type + return [d for d in self._boundingBoxes if d.getBBType() == bbType] + + def getBoundingBoxesByImageName(self, imageName): + # get only specified bb type + return [d for d in self._boundingBoxes if d.getImageName() == imageName] + + def count(self, bbType=None): + if bbType is None: # Return all bounding boxes + return len(self._boundingBoxes) + count = 0 + for d in self._boundingBoxes: + if d.getBBType() == bbType: # get only specified bb type + count += 1 + return count + + def clone(self): + newBoundingBoxes = BoundingBoxes() + for d in self._boundingBoxes: + det = BoundingBox.clone(d) + newBoundingBoxes.addBoundingBox(det) + return newBoundingBoxes + + def drawAllBoundingBoxes(self, image, imageName): + bbxes = self.getBoundingBoxesByImageName(imageName) + for bb in bbxes: + if bb.getBBType() == BBType.GroundTruth: # if ground truth + image = add_bb_into_image(image, bb, 
color=(0, 255, 0)) # green + else: # if detection + image = add_bb_into_image(image, bb, color=(255, 0, 0)) # red + return image + + +########################################################################################### +# # +# Evaluator class: Implements the most popular metrics for object detection # +# # +# Developed by: Rafael Padilla (rafael.padilla@smt.ufrj.br) # +# SMT - Signal Multimedia and Telecommunications Lab # +# COPPE - Universidade Federal do Rio de Janeiro # +# Last modification: Oct 9th 2018 # +########################################################################################### + + +class Evaluator: + def __init__(self, dataset='ucf24'): + self.dataset = dataset + + + def GetPascalVOCMetrics(self, + boundingboxes, + IOUThreshold=0.5, + method=None): + """Get the metrics used by the VOC Pascal 2012 challenge. + Get + Args: + boundingboxes: Object of the class BoundingBoxes representing ground truth and detected + bounding boxes; + IOUThreshold: IOU threshold indicating which detections will be considered TP or FP + (default value = 0.5); + method (default = EveryPointInterpolation): It can be calculated as the implementation + in the official PASCAL VOC toolkit (EveryPointInterpolation), or applying the 11-point + interpolatio as described in the paper "The PASCAL Visual Object Classes(VOC) Challenge" + or EveryPointInterpolation" (ElevenPointInterpolation); + Returns: + A list of dictionaries. Each dictionary contains information and metrics of each class. + The keys of each dictionary are: + dict['class']: class representing the current dictionary; + dict['precision']: array with the precision values; + dict['recall']: array with the recall values; + dict['AP']: average precision; + dict['interpolated precision']: interpolated precision values; + dict['interpolated recall']: interpolated recall values; + dict['total positives']: total number of ground truth positives; + dict['total TP']: total number of True Positive detections; + dict['total FP']: total number of False Negative detections; + """ + ret = [] # list containing metrics (precision, recall, average precision) of each class + # List with all ground truths (Ex: [imageName,class,confidence=1, (bb coordinates XYX2Y2)]) + groundTruths = [] + # List with all detections (Ex: [imageName,class,confidence,(bb coordinates XYX2Y2)]) + detections = [] + # Get all classes + classes = [] + # Loop through all bounding boxes and separate them into GTs and detections + for bb in boundingboxes.getBoundingBoxes(): + # [imageName, class, confidence, (bb coordinates XYX2Y2)] + if bb.getBBType() == BBType.GroundTruth: + groundTruths.append([ + bb.getImageName(), + bb.getClassId(), 1, + bb.getAbsoluteBoundingBox(BBFormat.XYX2Y2) + ]) + else: + detections.append([ + bb.getImageName(), + bb.getClassId(), + bb.getConfidence(), + bb.getAbsoluteBoundingBox(BBFormat.XYX2Y2) + ]) + # get class + if bb.getClassId() not in classes: + classes.append(bb.getClassId()) + classes = sorted(classes) + # Precision x Recall is obtained individually by each class + # Loop through by classes + for c in classes: + # Get only detection of class c + dects = [] + [dects.append(d) for d in detections if d[1] == c] + # Get only ground truths of class c + gts = [] + [gts.append(g) for g in groundTruths if g[1] == c] + npos = len(gts) + # sort detections by decreasing confidence + dects = sorted(dects, key=lambda conf: conf[2], reverse=True) + TP = np.zeros(len(dects)) + FP = np.zeros(len(dects)) + # create dictionary with amount of gts for each 
image + det = Counter([cc[0] for cc in gts]) + for key, val in det.items(): + det[key] = np.zeros(val) + # print("Evaluating class: %s (%d detections)" % (str(c), len(dects))) + # Loop through detections + for d in range(len(dects)): + # print('dect %s => %s' % (dects[d][0], dects[d][3],)) + # Find ground truth image + gt = [gt for gt in gts if gt[0] == dects[d][0]] + iouMax = sys.float_info.min + for j in range(len(gt)): + # print('Ground truth gt => %s' % (gt[j][3],)) + iou = Evaluator.iou(dects[d][3], gt[j][3]) + if iou > iouMax: + iouMax = iou + jmax = j + # Assign detection as true positive/don't care/false positive + if iouMax >= IOUThreshold: + if det[dects[d][0]][jmax] == 0: + TP[d] = 1 # count as true positive + det[dects[d][0]][jmax] = 1 # flag as already 'seen' + # print("TP") + else: + FP[d] = 1 # count as false positive + # print("FP") + # - A detected "cat" is overlaped with a GT "cat" with IOU >= IOUThreshold. + else: + FP[d] = 1 # count as false positive + # print("FP") + # compute precision, recall and average precision + acc_FP = np.cumsum(FP) + acc_TP = np.cumsum(TP) + rec = acc_TP / npos + prec = np.divide(acc_TP, (acc_FP + acc_TP)) + # Depending on the method, call the right implementation + if method == MethodAveragePrecision.EveryPointInterpolation: + [ap, mpre, mrec, ii] = Evaluator.CalculateAveragePrecision(rec, prec) + else: + [ap, mpre, mrec, _] = Evaluator.ElevenPointInterpolatedAP(rec, prec) + # add class result in the dictionary to be returned + r = { + 'class': c, + 'precision': prec, + 'recall': rec, + 'AP': ap, + 'interpolated precision': mpre, + 'interpolated recall': mrec, + 'total positives': npos, + 'total TP': np.sum(TP), + 'total FP': np.sum(FP) + } + ret.append(r) + return ret + + + def PlotPrecisionRecallCurve(self, + boundingBoxes, + IOUThreshold=0.5, + method=None, + showAP=False, + showInterpolatedPrecision=False, + savePath=None, + showGraphic=True): + """PlotPrecisionRecallCurve + Plot the Precision x Recall curve for a given class. + Args: + boundingBoxes: Object of the class BoundingBoxes representing ground truth and detected + bounding boxes; + IOUThreshold (optional): IOU threshold indicating which detections will be considered + TP or FP (default value = 0.5); + method (default = EveryPointInterpolation): It can be calculated as the implementation + in the official PASCAL VOC toolkit (EveryPointInterpolation), or applying the 11-point + interpolatio as described in the paper "The PASCAL Visual Object Classes(VOC) Challenge" + or EveryPointInterpolation" (ElevenPointInterpolation). + showAP (optional): if True, the average precision value will be shown in the title of + the graph (default = False); + showInterpolatedPrecision (optional): if True, it will show in the plot the interpolated + precision (default = False); + savePath (optional): if informed, the plot will be saved as an image in this path + (ex: /home/mywork/ap.png) (default = None); + showGraphic (optional): if True, the plot will be shown (default = True) + Returns: + A list of dictionaries. Each dictionary contains information and metrics of each class. 
+ The keys of each dictionary are: + dict['class']: class representing the current dictionary; + dict['precision']: array with the precision values; + dict['recall']: array with the recall values; + dict['AP']: average precision; + dict['interpolated precision']: interpolated precision values; + dict['interpolated recall']: interpolated recall values; + dict['total positives']: total number of ground truth positives; + dict['total TP']: total number of True Positive detections; + dict['total FP']: total number of False Negative detections; + """ + results = self.GetPascalVOCMetrics(boundingBoxes, IOUThreshold, method=MethodAveragePrecision.EveryPointInterpolation) + result = None + # Each resut represents a class + for result in results: + if result is None: + raise IOError('Error: Class %d could not be found.' % classId) + + classId = result['class'] + precision = result['precision'] + recall = result['recall'] + average_precision = result['AP'] + mpre = result['interpolated precision'] + mrec = result['interpolated recall'] + npos = result['total positives'] + total_tp = result['total TP'] + total_fp = result['total FP'] + + plt.close() + if showInterpolatedPrecision: + if method == MethodAveragePrecision.EveryPointInterpolation: + plt.plot(mrec, mpre, '--r', label='Interpolated precision (every point)') + elif method == MethodAveragePrecision.ElevenPointInterpolation: + # Uncomment the line below if you want to plot the area + # plt.plot(mrec, mpre, 'or', label='11-point interpolated precision') + # Remove duplicates, getting only the highest precision of each recall value + nrec = [] + nprec = [] + for idx in range(len(mrec)): + r = mrec[idx] + if r not in nrec: + idxEq = np.argwhere(mrec == r) + nrec.append(r) + nprec.append(max([mpre[int(id)] for id in idxEq])) + plt.plot(nrec, nprec, 'or', label='11-point interpolated precision') + plt.plot(recall, precision, label='Precision') + plt.xlabel('recall') + plt.ylabel('precision') + if showAP: + ap_str = "{0:.2f}%".format(average_precision * 100) + # ap_str = "{0:.4f}%".format(average_precision * 100) + plt.title('Precision x Recall curve \nClass: %s, AP: %s' % (str(classId), ap_str)) + else: + plt.title('Precision x Recall curve \nClass: %s' % str(classId)) + plt.legend(shadow=True) + plt.grid() + + if savePath is not None: + os.makedirs(savePath, exist_ok=True) + savePath_ = os.path.join(savePath, self.dataset) + os.makedirs(savePath_, exist_ok=True) + + # save fig + plt.savefig(os.path.join(savePath_, classId + '.png')) + + if showGraphic is True: + plt.show() + # plt.waitforbuttonpress() + plt.pause(0.05) + + return results + + + @staticmethod + def CalculateAveragePrecision(rec, prec): + mrec = [] + mrec.append(0) + [mrec.append(e) for e in rec] + mrec.append(1) + mpre = [] + mpre.append(0) + [mpre.append(e) for e in prec] + mpre.append(0) + for i in range(len(mpre) - 1, 0, -1): + mpre[i - 1] = max(mpre[i - 1], mpre[i]) + ii = [] + for i in range(len(mrec) - 1): + if mrec[1:][i] != mrec[0:-1][i]: + ii.append(i + 1) + ap = 0 + for i in ii: + ap = ap + np.sum((mrec[i] - mrec[i - 1]) * mpre[i]) + # return [ap, mpre[1:len(mpre)-1], mrec[1:len(mpre)-1], ii] + return [ap, mpre[0:len(mpre) - 1], mrec[0:len(mpre) - 1], ii] + + + @staticmethod + def ElevenPointInterpolatedAP(rec, prec): + """ 11-point interpolated average precision """ + + # def CalculateAveragePrecision2(rec, prec): + mrec = [] + # mrec.append(0) + [mrec.append(e) for e in rec] + # mrec.append(1) + mpre = [] + # mpre.append(0) + [mpre.append(e) for e in prec] + # 
mpre.append(0) + recallValues = np.linspace(0, 1, 11) + recallValues = list(recallValues[::-1]) + rhoInterp = [] + recallValid = [] + # For each recallValues (0, 0.1, 0.2, ... , 1) + for r in recallValues: + # Obtain all recall values higher or equal than r + argGreaterRecalls = np.argwhere(mrec[:] >= r) + pmax = 0 + # If there are recalls above r + if argGreaterRecalls.size != 0: + pmax = max(mpre[argGreaterRecalls.min():]) + recallValid.append(r) + rhoInterp.append(pmax) + # By definition AP = sum(max(precision whose recall is above r))/11 + ap = sum(rhoInterp) / 11 + # Generating values for the plot + rvals = [] + rvals.append(recallValid[0]) + [rvals.append(e) for e in recallValid] + rvals.append(0) + pvals = [] + pvals.append(0) + [pvals.append(e) for e in rhoInterp] + pvals.append(0) + # rhoInterp = rhoInterp[::-1] + cc = [] + for i in range(len(rvals)): + p = (rvals[i], pvals[i - 1]) + if p not in cc: + cc.append(p) + p = (rvals[i], pvals[i]) + if p not in cc: + cc.append(p) + recallValues = [i[0] for i in cc] + rhoInterp = [i[1] for i in cc] + return [ap, rhoInterp, recallValues, None] + + + @staticmethod + def _getAllIOUs(reference, detections): + """ For each detections, calculate IOU with reference """ + + ret = [] + bbReference = reference.getAbsoluteBoundingBox(BBFormat.XYX2Y2) + # img = np.zeros((200,200,3), np.uint8) + for d in detections: + bb = d.getAbsoluteBoundingBox(BBFormat.XYX2Y2) + iou = Evaluator.iou(bbReference, bb) + # Show blank image with the bounding boxes + # img = add_bb_into_image(img, d, color=(255,0,0), thickness=2, label=None) + # img = add_bb_into_image(img, reference, color=(0,255,0), thickness=2, label=None) + ret.append((iou, reference, d)) # iou, reference, detection + # cv2.imshow("comparing",img) + # cv2.waitKey(0) + # cv2.destroyWindow("comparing") + return sorted(ret, key=lambda i: i[0], reverse=True) # sort by iou (from highest to lowest) + + @staticmethod + def iou(boxA, boxB): + # if boxes dont intersect + if Evaluator._boxesIntersect(boxA, boxB) is False: + return 0 + interArea = Evaluator._getIntersectionArea(boxA, boxB) + union = Evaluator._getUnionAreas(boxA, boxB, interArea=interArea) + # intersection over union + iou = interArea / union + assert iou >= 0 + return iou + + + @staticmethod + def _boxesIntersect(boxA, boxB): + """ + boxA = (Ax1,Ay1,Ax2,Ay2) + boxB = (Bx1,By1,Bx2,By2) + """ + + if boxA[0] > boxB[2]: + return False # boxA is right of boxB + if boxB[0] > boxA[2]: + return False # boxA is left of boxB + if boxA[3] < boxB[1]: + return False # boxA is above boxB + if boxA[1] > boxB[3]: + return False # boxA is below boxB + return True + + + @staticmethod + def _getIntersectionArea(boxA, boxB): + xA = max(boxA[0], boxB[0]) + yA = max(boxA[1], boxB[1]) + xB = min(boxA[2], boxB[2]) + yB = min(boxA[3], boxB[3]) + # intersection area + return (xB - xA + 1) * (yB - yA + 1) + + + @staticmethod + def _getUnionAreas(boxA, boxB, interArea=None): + area_A = Evaluator._getArea(boxA) + area_B = Evaluator._getArea(boxB) + if interArea is None: + interArea = Evaluator._getIntersectionArea(boxA, boxB) + return float(area_A + area_B - interArea) + + + @staticmethod + def _getArea(box): + return (box[2] - box[0] + 1) * (box[3] - box[1] + 1) + + + +# Validate formats +def ValidateFormats(argFormat, argName, errors): + if argFormat == 'xywh': + return BBFormat.XYWH + elif argFormat == 'xyrb': + return BBFormat.XYX2Y2 + elif argFormat is None: + return BBFormat.XYWH # default when nothing is passed + else: + errors.append( + 'argument %s: invalid 
value. It must be either \'xywh\' or \'xyrb\'' % argName) + + +# Validate mandatory args +def ValidateMandatoryArgs(arg, argName, errors): + if arg is None: + errors.append('argument %s: required argument' % argName) + else: + return True + + +def ValidateImageSize(arg, argName, argInformed, errors): + errorMsg = 'argument %s: required argument if %s is relative' % (argName, argInformed) + ret = None + if arg is None: + errors.append(errorMsg) + else: + arg = arg.replace('(', '').replace(')', '') + args = arg.split(',') + if len(args) != 2: + errors.append( + '%s. It must be in the format \'width,height\' (e.g. \'600,400\')' % errorMsg) + else: + if not args[0].isdigit() or not args[1].isdigit(): + errors.append( + '%s. It must be in INdiaTEGER the format \'width,height\' (e.g. \'600,400\')' % + errorMsg) + else: + ret = (int(args[0]), int(args[1])) + return ret + + +# Validate coordinate types +def ValidateCoordinatesTypes(arg, argName, errors): + if arg == 'abs': + return CoordinatesType.Absolute + elif arg == 'rel': + return CoordinatesType.Relative + elif arg is None: + return CoordinatesType.Absolute # default when nothing is passed + errors.append('argument %s: invalid value. It must be either \'rel\' or \'abs\'' % argName) + + +def getBoundingBoxes(directory, + isGT, + bbFormat, + coordType, + allBoundingBoxes=None, + allClasses=None, + imgSize=(0, 0)): + """Read txt files containing bounding boxes (ground truth and detections).""" + print(directory) + if allBoundingBoxes is None: + allBoundingBoxes = BoundingBoxes() + if allClasses is None: + allClasses = [] + # Read ground truths + os.chdir(directory) + files = glob.glob("*.txt") + files.sort() + # print(files) + # Read GT detections from txt file + # Each line of the files in the groundtruths folder represents a ground truth bounding box + # (bounding boxes that a detector should detect) + # Each value of each line is "class_id, x, y, width, height" respectively + # Class_id represents the class of the bounding box + # x, y represents the most top-left coordinates of the bounding box + # x2, y2 represents the most bottom-right coordinates of the bounding box + for f in files: + nameOfImage = f.replace(".txt", "") + fh1 = open(f, "r") + for line in fh1: + line = line.replace("\n", "") + if line.replace(' ', '') == '': + continue + splitLine = line.split(" ") + if isGT: + # idClass = int(splitLine[0]) #class + idClass = (splitLine[0]) # class + x = float(splitLine[1]) + y = float(splitLine[2]) + w = float(splitLine[3]) + h = float(splitLine[4]) + bb = BoundingBox( + nameOfImage, + idClass, + x, + y, + w, + h, + coordType, + imgSize, + BBType.GroundTruth, + format=bbFormat) + else: + # idClass = int(splitLine[0]) #class + idClass = (splitLine[0]) # class + confidence = float(splitLine[1]) + x = float(splitLine[2]) + y = float(splitLine[3]) + w = float(splitLine[4]) + h = float(splitLine[5]) + bb = BoundingBox( + nameOfImage, + idClass, + x, + y, + w, + h, + coordType, + imgSize, + BBType.Detected, + confidence, + format=bbFormat) + allBoundingBoxes.addBoundingBox(bb) + if idClass not in allClasses: + allClasses.append(idClass) + fh1.close() + return allBoundingBoxes, allClasses + + +def evaluate_frameAP(gtFolder, detFolder, threshold = 0.5, savePath = None, datatset = 'ucf24', show_pr_curve=False): + # Get current path to set default folders + #VERSION = '0.1 (beta)' + gtFormat = 'xyrb' + detFormat = 'xyrb' + gtCoordinates = 'abs' + detCoordinates = 'abs' + + gtFolder = os.path.join(os.path.abspath('.'), gtFolder) + detFolder = 
os.path.join(os.path.abspath('.'), detFolder) + savePath = os.path.join(os.path.abspath('.'), savePath) + + iouThreshold = threshold + + # Arguments validation + errors = [] + # Validate formats + gtFormat = ValidateFormats(gtFormat, 'gtFormat', errors) + detFormat = ValidateFormats(detFormat, '-detformat', errors) + + # Coordinates types + gtCoordType = ValidateCoordinatesTypes(gtCoordinates, '-gtCoordinates', errors) + detCoordType = ValidateCoordinatesTypes(detCoordinates, '-detCoordinates', errors) + imgSize = (0, 0) + + # # Create directory to save results + # shutil.rmtree(savePath, ignore_errors=True) # Clear folder + # exit() + # if savePath is not None: + # os.makedirs(savePath) + # Show plot during execution + # showPlot = args.showPlot + + # print('iouThreshold= %f' % iouThreshold) + # #print('savePath = %s' % savePath) + # print('gtFormat = %s' % gtFormat) + # print('detFormat = %s' % detFormat) + # print('gtFolder = %s' % gtFolder) + # print('detFolder = %s' % detFolder) + # print('gtCoordType = %s' % gtCoordType) + # print('detCoordType = %s' % detCoordType) + #print('showPlot %s' % showPlot) + + # Get groundtruth boxes + allBoundingBoxes, allClasses = getBoundingBoxes( + gtFolder, True, gtFormat, gtCoordType, imgSize=imgSize) + # Get detected boxes + allBoundingBoxes, allClasses = getBoundingBoxes( + detFolder, False, detFormat, detCoordType, allBoundingBoxes, allClasses, imgSize=imgSize) + allClasses.sort() + + evaluator = Evaluator(dataset=datatset) + acc_AP = 0 + validClasses = 0 + + # Plot Precision x Recall curve + detections = evaluator.PlotPrecisionRecallCurve( + allBoundingBoxes, # Object containing all bounding boxes (ground truths and detections) + IOUThreshold=iouThreshold, # IOU threshold + method=MethodAveragePrecision.EveryPointInterpolation, + showAP=True, # Show Average Precision in the title of the plot + showInterpolatedPrecision=show_pr_curve, # plot the interpolated precision curve + savePath=savePath, + showGraphic=False) + + # f = open(os.path.join(savePath, 'results.txt'), 'w') + # f.write('Object Detection Metrics\n') + # f.write('https://github.com/rafaelpadilla/Object-Detection-Metrics\n\n\n') + # f.write('Average Precision (AP), Precision and Recall per class:') + + # each detection is a class and store AP and mAP results in AP_res list + AP_res = [] + for metricsPerClass in detections: + + # Get metric values per each class + cl = metricsPerClass['class'] + ap = metricsPerClass['AP'] + precision = metricsPerClass['precision'] + recall = metricsPerClass['recall'] + totalPositives = metricsPerClass['total positives'] + total_TP = metricsPerClass['total TP'] + total_FP = metricsPerClass['total FP'] + + if totalPositives > 0: + validClasses = validClasses + 1 + acc_AP = acc_AP + ap + prec = ['%.2f' % p for p in precision] + rec = ['%.2f' % r for r in recall] + ap_str = "{0:.2f}%".format(ap * 100) + # ap_str = "{0:.4f}%".format(ap * 100) + #print('AP: %s (%s)' % (ap_str, cl)) + # f.write('\n\nClass: %s' % cl) + # f.write('\nAP: %s' % ap_str) + # f.write('\nPrecision: %s' % prec) + # f.write('\nRecall: %s' % rec) + AP_res.append('AP: %s (%s)' % (ap_str, cl)) + mAP = acc_AP / validClasses + mAP_str = "{0:.2f}%".format(mAP * 100) + #print('mAP: %s' % mAP_str) + AP_res.append('mAP: %s' % mAP_str) + # f.write('\n\n\nmAP: %s' % mAP_str) + + return AP_res + + +if __name__ == '__main__': + evaluate_frameAP('groundtruths_ucf', 'detection_test') diff --git a/evaluator/cal_video_mAP.py b/evaluator/cal_video_mAP.py new file mode 100644 index 0000000..49ed4ba --- 
/dev/null +++ b/evaluator/cal_video_mAP.py @@ -0,0 +1,245 @@ +# -*- coding:utf-8 -*- +import numpy as np +import os +from .utils import bbox_iou, nms_3d, iou3d, iou3dt, iou3d, iou3dt, voc_ap + + +def compute_score_one_class(bbox1, bbox2, w_iou=1.0, w_scores=1.0, w_scores_mul=0.5): + # bbx: + n_bbox1 = bbox1.shape[0] + n_bbox2 = bbox2.shape[0] + # for saving all possible scores between each two bbxes in successive frames + scores = np.zeros([n_bbox1, n_bbox2], dtype=np.float32) + for i in range(n_bbox1): + box1 = bbox1[i, :4] + for j in range(n_bbox2): + box2 = bbox2[j, :4] + bbox_iou_frames = bbox_iou(box1, box2, x1y1x2y2=True) + sum_score_frames = bbox1[i, 4] + bbox2[j, 4] + mul_score_frames = bbox1[i, 4] * bbox2[j, 4] + scores[i, j] = w_iou * bbox_iou_frames + w_scores * sum_score_frames + w_scores_mul * mul_score_frames + + return scores + + +def link_bbxes_between_frames(bbox_list, w_iou=1.0, w_scores=1.0, w_scores_mul=0.5): + # bbx_list: list of bounding boxes + # check no empty detections + ind_notempty = [] + nfr = len(bbox_list) + for i in range(nfr): + if np.array(bbox_list[i]).size: + ind_notempty.append(i) + # no detections at all + if not ind_notempty: + return [] + # miss some frames + elif len(ind_notempty)!=nfr: + for i in range(nfr): + if not np.array(bbox_list[i]).size: + # copy the nearest detections to fill in the missing frames + ind_dis = np.abs(np.array(ind_notempty) - i) + nn = np.argmin(ind_dis) + bbox_list[i] = bbox_list[ind_notempty[nn]] + + + detect = bbox_list + nframes = len(detect) + res = [] + + isempty_vertex = np.zeros([nframes,], dtype=np.bool) + edge_scores = [compute_score_one_class(detect[i], detect[i+1], w_iou=w_iou, w_scores=w_scores, w_scores_mul=w_scores_mul) for i in range(nframes-1)] + copy_edge_scores = edge_scores + + while not np.any(isempty_vertex): + # initialize + scores = [np.zeros([d.shape[0],], dtype=np.float32) for d in detect] + index = [np.nan*np.ones([d.shape[0],], dtype=np.float32) for d in detect] + # viterbi + # from the second last frame back + for i in range(nframes-2, -1, -1): + edge_score = edge_scores[i] + scores[i+1] + # find the maximum score for each bbox in the i-th frame and the corresponding index + scores[i] = np.max(edge_score, axis=1) + index[i] = np.argmax(edge_score, axis=1) + # decode + idx = -np.ones([nframes], dtype=np.int32) + idx[0] = np.argmax(scores[0]) + for i in range(0, nframes-1): + idx[i+1] = index[i][idx[i]] + # remove covered boxes and build output structures + this = np.empty((nframes, 6), dtype=np.float32) + this[:, 0] = 1 + np.arange(nframes) + for i in range(nframes): + j = idx[i] + iouscore = 0 + if i < nframes-1: + iouscore = copy_edge_scores[i][j, idx[i+1]] - bbox_list[i][j, 4] - bbox_list[i+1][idx[i+1], 4] + + if i < nframes-1: edge_scores[i] = np.delete(edge_scores[i], j, 0) + if i > 0: edge_scores[i-1] = np.delete(edge_scores[i-1], j, 1) + this[i, 1:5] = detect[i][j, :4] + this[i, 5] = detect[i][j, 4] + detect[i] = np.delete(detect[i], j, 0) + isempty_vertex[i] = (detect[i].size==0) # it is true when there is no detection in any frame + res.append( this ) + if len(res) == 3: + break + + return res + + +def link_video_one_class(vid_det, bNMS3d = False): + ''' + linking for one class in a video (in full length) + vid_det: a list of [frame_index, [bbox cls_score]] + gtlen: the mean length of gt in training set + return a list of tube [array[frame_index, x1,y1,x2,y2, cls_score]] + ''' + # list of bbox information [[bbox in frame 1], [bbox in frame 2], ...] 
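+ # Per-frame detections are first linked into tubes by link_bbxes_between_frames (a Viterbi-style
+ # pass over frame-to-frame IoU/score edges); when bNMS3d is set, heavily overlapping tubes are
+ # then suppressed with 3D NMS, using each tube's mean detection score as its tube score.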
+ vdets = [vid_det[i][1] for i in range(len(vid_det))] + vres = link_bbxes_between_frames(vdets) + if len(vres) != 0: + if bNMS3d: + tube = [b[:, :5] for b in vres] + # compute score for each tube + tube_scores = [np.mean(b[:, 5]) for b in vres] + dets = [(tube[t], tube_scores[t]) for t in range(len(tube))] + # nms for tubes + keep = nms_3d(dets, 0.3) # bug for nms3dt + if np.array(keep).size: + vres_keep = [vres[k] for k in keep] + # max subarray with penalization -|Lc-L|/Lc + vres = vres_keep + + return vres + + +def video_ap_one_class(gt, pred_videos, iou_thresh = 0.2, bTemporal = False): + ''' + gt: [ video_index, array[frame_index, x1,y1,x2,y2] ] + pred_videos: [ video_index, [ [frame_index, [[x1,y1,x2,y2, score]] ] ] ] + ''' + # link for prediction + pred = [] + for pred_v in pred_videos: + video_index = pred_v[0] + pred_link_v = link_video_one_class(pred_v[1], True) # [array] + for tube in pred_link_v: + pred.append((video_index, tube)) + + # sort tubes according to scores (descending order) + argsort_scores = np.argsort(-np.array([np.mean(b[:, 5]) for _, b in pred])) + pr = np.empty((len(pred)+1, 2), dtype=np.float32) # precision, recall + pr[0,0] = 1.0 + pr[0,1] = 0.0 + fn = len(gt) #sum([len(a[1]) for a in gt]) + fp = 0 + tp = 0 + + gt_v_index = [g[0] for g in gt] + for i, k in enumerate(argsort_scores): + # if i % 100 == 0: + # print ("%6.2f%% boxes processed, %d positives found, %d remain" %(100*float(i)/argsort_scores.size, tp, fn)) + video_index, boxes = pred[k] + ispositive = False + if video_index in gt_v_index: + gt_this_index, gt_this = [], [] + for j, g in enumerate(gt): + if g[0] == video_index: + gt_this.append(g[1]) + gt_this_index.append(j) + if len(gt_this) > 0: + if bTemporal: + iou = np.array([iou3dt(np.array(g), boxes[:, :5]) for g in gt_this]) + else: + if boxes.shape[0] > gt_this[0].shape[0]: + # in case some frame don't have gt + iou = np.array([iou3d(g, boxes[int(g[0,0]-1):int(g[-1,0]),:5]) for g in gt_this]) + elif boxes.shape[0] < gt_this[0].shape[0]: + # in flow case + iou = np.array([iou3d(g[int(boxes[0,0]-1):int(boxes[-1,0]),:], boxes[:,:5]) for g in gt_this]) + else: # equal size + iou = np.array([iou3d(g, boxes[:,:5]) for g in gt_this]) + + if iou.size > 0: # on ucf101 if invalid annotation ....
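+ # Greedy matching: the current (highest-scoring remaining) tube is assigned to its
+ # best-overlapping GT tube; a matched GT is deleted below so it cannot be reused, and
+ # tubes that match nothing (or fall below iou_thresh) are counted as false positives.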
+ argmax = np.argmax(iou) + if iou[argmax] >= iou_thresh: + ispositive = True + del gt[gt_this_index[argmax]] + if ispositive: + tp += 1 + fn -= 1 + else: + fp += 1 + pr[i+1,0] = float(tp)/float(tp+fp) + pr[i+1,1] = float(tp)/float(tp+fn + 0.00001) + ap = voc_ap(pr) + + return ap + + +def gt_to_videts(gt_v): + # return [label, video_index, [[frame_index, x1,y1,x2,y2], [], []] ] + keys = list(gt_v.keys()) + keys.sort() + res = [] + for i in range(len(keys)): + # annotation of the video: tubes and gt_classes + v_annot = gt_v[keys[i]] + for j in range(len(v_annot['tubes'])): + res.append([v_annot['gt_classes'], i+1, v_annot['tubes'][j]]) + return res + + +def evaluate_videoAP(gt_videos, all_boxes, num_classes, iou_thresh = 0.2, bTemporal = False): + ''' + gt_videos: {vname:{tubes: [[frame_index, x1,y1,x2,y2]], gt_classes: vlabel}} + all_boxes: {imgname:{cls_ind:array[x1,y1,x2,y2, cls_score]}} + ''' + def imagebox_to_videts(img_boxes, num_classes): + # image names + keys = list(all_boxes.keys()) + keys.sort() + res = [] + # without 'background' + for cls_ind in range(num_classes): + v_cnt = 1 + frame_index = 1 + v_dets = [] + cls_ind += 1 + # get the directory path of images + preVideo = os.path.dirname(keys[0]) + for i in range(len(keys)): + curVideo = os.path.dirname(keys[i]) + img_cls_dets = img_boxes[keys[i]][cls_ind] + v_dets.append([frame_index, img_cls_dets]) + frame_index += 1 + if preVideo!=curVideo: + preVideo = curVideo + frame_index = 1 + # tmp_dets = v_dets[-1] + del v_dets[-1] + res.append([cls_ind, v_cnt, v_dets]) + v_cnt += 1 + v_dets = [] + # v_dets.append(tmp_dets) + v_dets.append([frame_index, img_cls_dets]) + frame_index += 1 + # the last video + # print('num of videos:{}'.format(v_cnt)) + res.append([cls_ind, v_cnt, v_dets]) + return res + + gt_videos_format = gt_to_videts(gt_videos) + pred_videos_format = imagebox_to_videts(all_boxes, num_classes) + ap_all = [] + for cls_ind in range(num_classes): + cls_ind += 1 + # [ video_index, [[frame_index, x1,y1,x2,y2]] ] + gt = [g[1:] for g in gt_videos_format if g[0]==cls_ind] + pred_cls = [p[1:] for p in pred_videos_format if p[0]==cls_ind] + ap = video_ap_one_class(gt, pred_cls, iou_thresh, bTemporal) + ap_all.append(round(ap * 100, 2)) + + return ap_all diff --git a/evaluator/groundtruths_ucf_jhmdb/groundtruths_jhmdb.zip b/evaluator/groundtruths_ucf_jhmdb/groundtruths_jhmdb.zip new file mode 100644 index 0000000..b58eaa9 Binary files /dev/null and b/evaluator/groundtruths_ucf_jhmdb/groundtruths_jhmdb.zip differ diff --git a/evaluator/groundtruths_ucf_jhmdb/groundtruths_ucf.zip b/evaluator/groundtruths_ucf_jhmdb/groundtruths_ucf.zip new file mode 100644 index 0000000..cd07ea6 Binary files /dev/null and b/evaluator/groundtruths_ucf_jhmdb/groundtruths_ucf.zip differ diff --git a/evaluator/ucf_jhmdb_evaluator.py b/evaluator/ucf_jhmdb_evaluator.py new file mode 100644 index 0000000..a7b2533 --- /dev/null +++ b/evaluator/ucf_jhmdb_evaluator.py @@ -0,0 +1,256 @@ +import os +import torch +import numpy as np +from scipy.io import loadmat + +from dataset.ucf_jhmdb import UCF_JHMDB_Dataset, UCF_JHMDB_VIDEO_Dataset +from utils.box_ops import rescale_bboxes + +from .cal_frame_mAP import evaluate_frameAP +from .cal_video_mAP import evaluate_videoAP + + +class UCF_JHMDB_Evaluator(object): + def __init__(self, + data_root=None, + dataset='ucf24', + model_name='yowo', + metric='fmap', + img_size=224, + len_clip=1, + batch_size=1, + conf_thresh=0.01, + iou_thresh=0.5, + transform=None, + collate_fn=None, + gt_folder=None, + save_path=None): + 
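+ # metric='fmap' builds the frame-level test set and reports frame mAP (evaluate_frame_map);
+ # metric='vmap' builds the video-level test set and reports video mAP (evaluate_video_map)
+ # at several spatio-temporal IoU thresholds.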
self.data_root = data_root + self.dataset = dataset + self.model_name = model_name + self.img_size = img_size + self.len_clip = len_clip + self.batch_size = batch_size + self.conf_thresh = conf_thresh + self.iou_thresh = iou_thresh + self.collate_fn = collate_fn + + self.gt_folder = gt_folder + self.save_path = save_path + + self.gt_file = os.path.join(data_root, 'splitfiles/finalAnnots.mat') + self.testlist = os.path.join(data_root, 'splitfiles/testlist01.txt') + + # dataset + if metric == 'fmap': + self.testset = UCF_JHMDB_Dataset( + data_root=data_root, + dataset=dataset, + img_size=img_size, + transform=transform, + is_train=False, + len_clip=len_clip, + sampling_rate=1) + self.num_classes = self.testset.num_classes + elif metric == 'vmap': + self.testset = UCF_JHMDB_VIDEO_Dataset( + data_root=data_root, + dataset=dataset, + img_size=img_size, + transform=transform, + len_clip=len_clip, + sampling_rate=1) + self.num_classes = self.testset.num_classes + + + def evaluate_frame_map(self, model, epoch=1, show_pr_curve=False): + print("Metric: Frame mAP") + # dataloader + self.testloader = torch.utils.data.DataLoader( + dataset=self.testset, + batch_size=self.batch_size, + shuffle=False, + collate_fn=self.collate_fn, + num_workers=4, + drop_last=False, + pin_memory=True + ) + + epoch_size = len(self.testloader) + + # inference + for iter_i, (batch_frame_id, batch_video_clip, batch_target) in enumerate(self.testloader): + # to device + batch_video_clip = batch_video_clip.to(model.device) + + with torch.no_grad(): + # inference + batch_scores, batch_labels, batch_bboxes = model(batch_video_clip) + + # process batch + for bi in range(len(batch_scores)): + frame_id = batch_frame_id[bi] + scores = batch_scores[bi] + labels = batch_labels[bi] + bboxes = batch_bboxes[bi] + target = batch_target[bi] + + # rescale bbox + orig_size = target['orig_size'] + bboxes = rescale_bboxes(bboxes, orig_size) + + if not os.path.exists('results'): + os.mkdir('results') + + if self.dataset == 'ucf24': + detection_path = os.path.join('results', 'ucf_detections', self.model_name, 'detections_' + str(epoch), frame_id) + current_dir = os.path.join('results', 'ucf_detections', self.model_name, 'detections_' + str(epoch)) + if not os.path.exists('results/ucf_detections/'): + os.mkdir('results/ucf_detections/') + if not os.path.exists('results/ucf_detections/'+self.model_name): + os.mkdir('results/ucf_detections/'+self.model_name) + if not os.path.exists(current_dir): + os.mkdir(current_dir) + else: + detection_path = os.path.join('results', 'jhmdb_detections', self.model_name, 'detections_' + str(epoch), frame_id) + current_dir = os.path.join('results', 'jhmdb_detections', self.model_name, 'detections_' + str(epoch)) + if not os.path.exists('results/jhmdb_detections/'): + os.mkdir('results/jhmdb_detections/') + if not os.path.exists('results/jhmdb_detections/'+self.model_name): + os.mkdir('results/jhmdb_detections/'+self.model_name) + if not os.path.exists(current_dir): + os.mkdir(current_dir) + + with open(detection_path, 'w+') as f_detect: + for score, label, bbox in zip(scores, labels, bboxes): + x1 = round(bbox[0]) + y1 = round(bbox[1]) + x2 = round(bbox[2]) + y2 = round(bbox[3]) + cls_id = int(label) + 1 + + f_detect.write( + str(cls_id) + ' ' + str(score) + ' ' \ + + str(x1) + ' ' + str(y1) + ' ' + str(x2) + ' ' + str(y2) + '\n') + + if iter_i % 100 == 0: + log_info = "[%d / %d]" % (iter_i, epoch_size) + print(log_info, flush=True) + + print('calculating Frame mAP ...') + metric_list = 
evaluate_frameAP(self.gt_folder, current_dir, self.iou_thresh, + self.save_path, self.dataset, show_pr_curve) + for metric in metric_list: + print(metric) + + + def evaluate_video_map(self, model): + print("Metric: Video mAP") + video_testlist = [] + with open(self.testlist, 'r') as file: + lines = file.readlines() + for line in lines: + line = line.rstrip() + video_testlist.append(line) + + detected_boxes = {} + gt_videos = {} + + gt_data = loadmat(self.gt_file)['annot'] + n_videos = gt_data.shape[1] + print('loading gt tubes ...') + for i in range(n_videos): + video_name = gt_data[0][i][1][0] + if video_name in video_testlist: + n_tubes = len(gt_data[0][i][2][0]) + v_annotation = {} + all_gt_boxes = [] + for j in range(n_tubes): + gt_one_tube = [] + tube_start_frame = gt_data[0][i][2][0][j][1][0][0] + tube_end_frame = gt_data[0][i][2][0][j][0][0][0] + tube_class = gt_data[0][i][2][0][j][2][0][0] + tube_data = gt_data[0][i][2][0][j][3] + tube_length = tube_end_frame - tube_start_frame + 1 + + for k in range(tube_length): + gt_boxes = [] + gt_boxes.append(int(tube_start_frame+k)) + gt_boxes.append(float(tube_data[k][0])) + gt_boxes.append(float(tube_data[k][1])) + gt_boxes.append(float(tube_data[k][0]) + float(tube_data[k][2])) + gt_boxes.append(float(tube_data[k][1]) + float(tube_data[k][3])) + gt_one_tube.append(gt_boxes) + all_gt_boxes.append(gt_one_tube) + + v_annotation['gt_classes'] = tube_class + v_annotation['tubes'] = np.array(all_gt_boxes) + gt_videos[video_name] = v_annotation + + # inference + print('inference ...') + for i, line in enumerate(lines): + line = line.rstrip() + if i % 50 == 0: + print('Video: [%d / %d] - %s' % (i, len(lines), line)) + + # set video + self.testset.set_video_data(line) + + # dataloader + self.testloader = torch.utils.data.DataLoader( + dataset=self.testset, + batch_size=self.batch_size, + shuffle=False, + collate_fn=self.collate_fn, + num_workers=4, + drop_last=False, + pin_memory=True + ) + + for iter_i, (batch_img_name, batch_video_clip, batch_target) in enumerate(self.testloader): + # to device + batch_video_clip = batch_video_clip.to(model.device) + + with torch.no_grad(): + # inference + batch_scores, batch_labels, batch_bboxes = model(batch_video_clip) + + # process batch + for bi in range(len(batch_scores)): + img_name = batch_img_name[bi] + scores = batch_scores[bi] + labels = batch_labels[bi] + bboxes = batch_bboxes[bi] + target = batch_target[bi] + + # rescale bbox + orig_size = target['orig_size'] + bboxes = rescale_bboxes(bboxes, orig_size) + + img_annotation = {} + for cls_idx in range(self.num_classes): + inds = np.where(labels == cls_idx)[0] + c_bboxes = bboxes[inds] + c_scores = scores[inds] + # [n_box, 5] + boxes = np.concatenate([c_bboxes, c_scores[..., None]], axis=-1) + img_annotation[cls_idx+1] = boxes + detected_boxes[img_name] = img_annotation + + # delete testloader + del self.testloader + + iou_list = [0.05, 0.1, 0.2, 0.3, 0.5, 0.75] + print('calculating video mAP ...') + for iou_th in iou_list: + per_ap = evaluate_videoAP(gt_videos, detected_boxes, self.num_classes, iou_th, True) + video_mAP = sum(per_ap) / len(per_ap) + print('-------------------------------') + print('V-mAP @ {} IoU:'.format(iou_th)) + print('--Per AP: ', per_ap) + print('--mAP: ', round(video_mAP, 2)) + + +if __name__ == "__main__": + pass \ No newline at end of file diff --git a/evaluator/utils.py b/evaluator/utils.py new file mode 100644 index 0000000..9a19515 --- /dev/null +++ b/evaluator/utils.py @@ -0,0 +1,116 @@ +import numpy as np + + +def 
bbox_iou(box1, box2, x1y1x2y2=True): + if x1y1x2y2: + mx = min(box1[0], box2[0]) + Mx = max(box1[2], box2[2]) + my = min(box1[1], box2[1]) + My = max(box1[3], box2[3]) + w1 = box1[2] - box1[0] + h1 = box1[3] - box1[1] + w2 = box2[2] - box2[0] + h2 = box2[3] - box2[1] + else: + mx = min(float(box1[0]-box1[2]/2.0), float(box2[0]-box2[2]/2.0)) + Mx = max(float(box1[0]+box1[2]/2.0), float(box2[0]+box2[2]/2.0)) + my = min(float(box1[1]-box1[3]/2.0), float(box2[1]-box2[3]/2.0)) + My = max(float(box1[1]+box1[3]/2.0), float(box2[1]+box2[3]/2.0)) + w1 = box1[2] + h1 = box1[3] + w2 = box2[2] + h2 = box2[3] + uw = Mx - mx + uh = My - my + cw = w1 + w2 - uw + ch = h1 + h2 - uh + carea = 0 + if cw <= 0 or ch <= 0: + return 0.0 + + area1 = w1 * h1 + area2 = w2 * h2 + carea = cw * ch + uarea = area1 + area2 - carea + return carea/uarea + + +def area2d(b): + return (b[:,2]-b[:,0]+1)*(b[:,3]-b[:,1]+1) + + +def overlap2d(b1, b2): + xmin = np.maximum( b1[:,0], b2[:,0] ) + xmax = np.minimum( b1[:,2]+1, b2[:,2]+1) + width = np.maximum(0, xmax-xmin) + ymin = np.maximum( b1[:,1], b2[:,1] ) + ymax = np.minimum( b1[:,3]+1, b2[:,3]+1) + height = np.maximum(0, ymax-ymin) + return width*height + + +def nms_3d(detections, overlap=0.5): + # detections: [(tube1, score1), (tube2, score2)] + if len(detections) == 0: + return np.array([], dtype=np.int32) + I = np.argsort([d[1] for d in detections]) + indices = np.zeros(I.size, dtype=np.int32) + counter = 0 + while I.size>0: + i = I[-1] + indices[counter] = i + counter += 1 + ious = np.array([ iou3dt(detections[ii][0],detections[i][0]) for ii in I[:-1] ]) + I = I[np.where(ious<=overlap)[0]] + return indices[:counter] + + +def iou3d(b1, b2): + assert b1.shape[0] == b2.shape[0] + assert np.all(b1[:,0] == b2[:,0]) + o = overlap2d(b1[:,1:5],b2[:,1:5]) + return np.mean( o/(area2d(b1[:,1:5])+area2d(b2[:,1:5])-o) ) + + +def iou3dt(b1, b2): + tmin = max(b1[0,0], b2[0,0]) + tmax = min(b1[-1,0], b2[-1,0]) + if tmax <= tmin: return 0.0 + temporal_inter = tmax-tmin+1 + temporal_union = max(b1[-1,0], b2[-1,0]) - min(b1[0,0], b2[0,0]) + 1 + return iou3d( b1[np.where(b1[:,0]==tmin)[0][0]:np.where(b1[:,0]==tmax)[0][0]+1,:] , b2[np.where(b2[:,0]==tmin)[0][0]:np.where(b2[:,0]==tmax)[0][0]+1,:] ) * temporal_inter / temporal_union + + +def voc_ap(pr, use_07_metric=False): + """ ap = voc_ap(rec, prec, [use_07_metric]) + Compute VOC AP given precision and recall. + If use_07_metric is true, uses the + VOC 07 11 point method (default:False). + """ + rec, prec = pr[:,1], pr[:,0] + if use_07_metric: + # 11 point metric + ap = 0. + for t in np.arange(0., 1.1, 0.1): + if np.sum(rec >= t) == 0: + p = 0 + else: + p = np.max(prec[rec >= t]) + ap = ap + p / 11. 
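+        # (11-point metric: average of the max precision at recall thresholds 0.0, 0.1, ..., 1.0)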
+ else: + # correct AP calculation + # first append sentinel values at the end + mrec = np.concatenate(([0.], rec, [1.])) + mpre = np.concatenate(([0.], prec, [0.])) + + # compute the precision envelope + for i in range(mpre.size - 1, 0, -1): + mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i]) + + # to calculate area under PR curve, look for points + # where X axis (recall) changes value + i = np.where(mrec[1:] != mrec[:-1])[0] + + # and sum (\Delta recall) * prec + ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1]) + return ap diff --git a/models/__init__.py b/models/__init__.py new file mode 100644 index 0000000..61990f6 --- /dev/null +++ b/models/__init__.py @@ -0,0 +1,24 @@ +from .yowo.build import build_yowo + + +def build_model(args, + d_cfg, + m_cfg, + device, + num_classes=80, + trainable=False, + resume=None): + # build action detector + if 'yowo_v2_' in args.version: + model, criterion = build_yowo( + args=args, + d_cfg=d_cfg, + m_cfg=m_cfg, + device=device, + num_classes=num_classes, + trainable=trainable, + resume=resume + ) + + return model, criterion + diff --git a/models/backbone/__init__.py b/models/backbone/__init__.py new file mode 100644 index 0000000..f3a5153 --- /dev/null +++ b/models/backbone/__init__.py @@ -0,0 +1,13 @@ +from .backbone_2d.backbone_2d import Backbone2D +from .backbone_3d.backbone_3d import Backbone3D + + +def build_backbone_2d(cfg, pretrained=False): + backbone = Backbone2D(cfg, pretrained) + return backbone, backbone.feat_dims + + +def build_backbone_3d(cfg, pretrained=False): + backbone = Backbone3D(cfg, pretrained) + return backbone, backbone.feat_dim + diff --git a/models/backbone/backbone_2d/__init__.py b/models/backbone/backbone_2d/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/models/backbone/backbone_2d/backbone_2d.py b/models/backbone/backbone_2d/backbone_2d.py new file mode 100644 index 0000000..a1bd29a --- /dev/null +++ b/models/backbone/backbone_2d/backbone_2d.py @@ -0,0 +1,26 @@ +import torch.nn as nn +from .cnn_2d import build_2d_cnn + + +class Backbone2D(nn.Module): + def __init__(self, cfg, pretrained=False): + super().__init__() + self.cfg = cfg + + self.backbone, self.feat_dims = build_2d_cnn(cfg, pretrained) + + + def forward(self, x): + """ + Input: + x: (Tensor) -> [B, C, H, W] + Output: + y: (List) -> [ + (Tensor) -> [B, C1, H1, W1], + (Tensor) -> [B, C2, H2, W2], + (Tensor) -> [B, C3, H3, W3] + ] + """ + feat = self.backbone(x) + + return feat diff --git a/models/backbone/backbone_2d/cnn_2d/__init__.py b/models/backbone/backbone_2d/cnn_2d/__init__.py new file mode 100644 index 0000000..6a154b5 --- /dev/null +++ b/models/backbone/backbone_2d/cnn_2d/__init__.py @@ -0,0 +1,18 @@ +# import 2D backbone +from .yolo_free.yolo_free import build_yolo_free + + +def build_2d_cnn(cfg, pretrained=False): + print('==============================') + print('2D Backbone: {}'.format(cfg['backbone_2d'].upper())) + print('--pretrained: {}'.format(pretrained)) + + if cfg['backbone_2d'] in ['yolo_free_nano', 'yolo_free_tiny', \ + 'yolo_free_large', 'yolo_free_huge']: + model, feat_dims = build_yolo_free(cfg['backbone_2d'], pretrained) + + else: + print('Unknown 2D Backbone ...') + exit() + + return model, feat_dims diff --git a/models/backbone/backbone_2d/cnn_2d/yolo_free/__init__.py b/models/backbone/backbone_2d/cnn_2d/yolo_free/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free.py b/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free.py new file mode 
100644 index 0000000..7512815 --- /dev/null +++ b/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free.py @@ -0,0 +1,222 @@ +import torch +import numpy as np +import torch.nn as nn +import torch.nn.functional as F +from torch.hub import load_state_dict_from_url + +try: + from .yolo_free_backbone import build_backbone + from .yolo_free_neck import build_neck + from .yolo_free_fpn import build_fpn + from .yolo_free_head import build_head +except: + from yolo_free_backbone import build_backbone + from yolo_free_neck import build_neck + from yolo_free_fpn import build_fpn + from yolo_free_head import build_head + + +__all__ = ['build_yolo_free'] + + +model_urls = { + 'yolo_free_nano': 'https://github.com/yjh0410/FreeYOLO/releases/download/weight/yolo_free_nano_coco.pth', + 'yolo_free_tiny': 'https://github.com/yjh0410/FreeYOLO/releases/download/weight/yolo_free_tiny_coco.pth', + 'yolo_free_large': 'https://github.com/yjh0410/FreeYOLO/releases/download/weight/yolo_free_large_coco.pth', +} + + +yolo_free_config = { + 'yolo_free_nano': { + # model + 'backbone': 'shufflenetv2_1.0x', + 'pretrained': True, + 'stride': [8, 16, 32], # P3, P4, P5 + 'anchor_size': None, + # neck + 'neck': 'sppf', + 'neck_dim': 232, + 'expand_ratio': 0.5, + 'pooling_size': 5, + 'neck_act': 'lrelu', + 'neck_norm': 'BN', + 'neck_depthwise': True, + # fpn + 'fpn': 'pafpn_elan', + 'fpn_size': 'nano', + 'fpn_dim': [116, 232, 232], + 'fpn_norm': 'BN', + 'fpn_act': 'lrelu', + 'fpn_depthwise': True, + # head + 'head': 'decoupled_head', + 'head_dim': 64, + 'head_norm': 'BN', + 'head_act': 'lrelu', + 'num_cls_head': 2, + 'num_reg_head': 2, + 'head_depthwise': True, + }, + + 'yolo_free_tiny': { + # model + 'backbone': 'elannet_tiny', + 'pretrained': True, + 'stride': [8, 16, 32], # P3, P4, P5 + # neck + 'neck': 'spp_block_csp', + 'neck_dim': 256, + 'expand_ratio': 0.5, + 'pooling_size': [5, 9, 13], + 'neck_act': 'lrelu', + 'neck_norm': 'BN', + 'neck_depthwise': False, + # fpn + 'fpn': 'pafpn_elan', + 'fpn_size': 'tiny', # 'tiny', 'large', 'huge + 'fpn_dim': [128, 256, 256], + 'fpn_norm': 'BN', + 'fpn_act': 'lrelu', + 'fpn_depthwise': False, + # head + 'head': 'decoupled_head', + 'head_dim': 64, + 'head_norm': 'BN', + 'head_act': 'lrelu', + 'num_cls_head': 2, + 'num_reg_head': 2, + 'head_depthwise': False, + }, + + 'yolo_free_large': { + # model + 'backbone': 'elannet_large', + 'pretrained': True, + 'stride': [8, 16, 32], # P3, P4, P5 + # neck + 'neck': 'spp_block_csp', + 'neck_dim': 512, + 'expand_ratio': 0.5, + 'pooling_size': [5, 9, 13], + 'neck_act': 'silu', + 'neck_norm': 'BN', + 'neck_depthwise': False, + # fpn + 'fpn': 'pafpn_elan', + 'fpn_size': 'large', # 'tiny', 'large', 'huge + 'fpn_dim': [512, 1024, 512], + 'fpn_norm': 'BN', + 'fpn_act': 'silu', + 'fpn_depthwise': False, + # head + 'head': 'decoupled_head', + 'head_dim': 256, + 'head_norm': 'BN', + 'head_act': 'silu', + 'num_cls_head': 2, + 'num_reg_head': 2, + 'head_depthwise': False, + }, + +} + + +# Anchor-free YOLO +class FreeYOLO(nn.Module): + def __init__(self, cfg): + super(FreeYOLO, self).__init__() + # --------- Basic Config ----------- + self.cfg = cfg + + # --------- Network Parameters ---------- + ## backbone + self.backbone, bk_dim = build_backbone(self.cfg['backbone']) + + ## neck + self.neck = build_neck(cfg=self.cfg, in_dim=bk_dim[-1], out_dim=self.cfg['neck_dim']) + + ## fpn + self.fpn = build_fpn(cfg=self.cfg, in_dims=self.cfg['fpn_dim'], out_dim=self.cfg['head_dim']) + + ## non-shared heads + self.non_shared_heads = nn.ModuleList( + [build_head(cfg) 
+ for _ in range(len(cfg['stride'])) + ]) + + def forward(self, x): + # backbone + feats = self.backbone(x) + + # neck + feats['layer4'] = self.neck(feats['layer4']) + + # fpn + pyramid_feats = [feats['layer2'], feats['layer3'], feats['layer4']] + pyramid_feats = self.fpn(pyramid_feats) + + # non-shared heads + all_cls_feats = [] + all_reg_feats = [] + for feat, head in zip(pyramid_feats, self.non_shared_heads): + # [B, C, H, W] + cls_feat, reg_feat = head(feat) + + all_cls_feats.append(cls_feat) + all_reg_feats.append(reg_feat) + + return all_cls_feats, all_reg_feats + + +# build FreeYOLO +def build_yolo_free(model_name='yolo_free_large', pretrained=False): + # model config + cfg = yolo_free_config[model_name] + + # FreeYOLO + model = FreeYOLO(cfg) + feat_dims = [model.cfg['head_dim']] * 3 + + # Load COCO pretrained weight + if pretrained: + url = model_urls[model_name] + + # check + if url is None: + print('No 2D pretrained weight ...') + return model, feat_dims + else: + print('Loading 2D backbone pretrained weight: {}'.format(model_name.upper())) + + # state dict + checkpoint = load_state_dict_from_url(url, map_location='cpu') + checkpoint_state_dict = checkpoint.pop('model') + + # model state dict + model_state_dict = model.state_dict() + # check + for k in list(checkpoint_state_dict.keys()): + if k in model_state_dict: + shape_model = tuple(model_state_dict[k].shape) + shape_checkpoint = tuple(checkpoint_state_dict[k].shape) + if shape_model != shape_checkpoint: + # print(k) + checkpoint_state_dict.pop(k) + else: + checkpoint_state_dict.pop(k) + # print(k) + + model.load_state_dict(checkpoint_state_dict, strict=False) + + return model, feat_dims + + +if __name__ == '__main__': + model, fpn_dim = build_yolo_free(model_name='yolo_free_nano', pretrained=True) + model.eval() + + x = torch.randn(2, 3, 64, 64) + cls_feats, reg_feats = model(x) + + for cls_feat, reg_feat in zip(cls_feats, reg_feats): + print(cls_feat.shape, reg_feat.shape) diff --git a/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free_backbone.py b/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free_backbone.py new file mode 100644 index 0000000..ee3b79c --- /dev/null +++ b/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free_backbone.py @@ -0,0 +1,445 @@ +import torch +import torch.nn as nn +import torch.nn.functional as F + + +__all__ = ['build_backbone'] + +# ====================== ELAN-Net ========================== +# ELANNet +def get_activation(act_type=None): + if act_type is None: + return nn.Identity() + elif act_type == 'relu': + return nn.ReLU(inplace=True) + elif act_type == 'lrelu': + return nn.LeakyReLU(0.1, inplace=True) + elif act_type == 'mish': + return nn.Mish(inplace=True) + elif act_type == 'silu': + return nn.SiLU(inplace=True) + + +def get_norm(in_dim, norm_type=None): + if norm_type is None: + return nn.Identity() + elif norm_type == 'BN': + return nn.BatchNorm2d(in_dim) + elif norm_type == 'GN': + return nn.GroupNorm(32, in_dim) + elif norm_type == 'IN': + return nn.InstanceNorm2d(in_dim) + + +class Conv(nn.Module): + def __init__(self, + c1, # in channels + c2, # out channels + k=1, # kernel size + p=0, # padding + s=1, # padding + d=1, # dilation + act_type='silu', + norm_type='BN', # activation + depthwise=False): + super(Conv, self).__init__() + convs = [] + add_bias = False if norm_type else True + if depthwise: + # depthwise conv + convs.append(nn.Conv2d(c1, c1, kernel_size=k, stride=s, padding=p, dilation=d, groups=c1, bias=add_bias)) + convs.append(get_norm(c1, norm_type)) + 
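+            # depthwise 3x3 conv (groups=c1) above; the 1x1 pointwise conv below completes the depthwise-separable pair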
convs.append(get_activation(act_type)) + + # pointwise conv + convs.append(nn.Conv2d(c1, c2, kernel_size=1, stride=s, padding=0, dilation=d, groups=1, bias=add_bias)) + convs.append(get_norm(c2, norm_type)) + convs.append(get_activation(act_type)) + + else: + convs.append(nn.Conv2d(c1, c2, kernel_size=k, stride=s, padding=p, dilation=d, groups=1, bias=add_bias)) + convs.append(get_norm(c2, norm_type)) + convs.append(get_activation(act_type)) + + self.convs = nn.Sequential(*convs) + + + def forward(self, x): + return self.convs(x) + + +class ELANBlock(nn.Module): + """ + ELAN BLock of YOLOv7's backbone + """ + def __init__(self, in_dim, out_dim, expand_ratio=0.5, model_size='large', act_type='silu', depthwise=False): + super(ELANBlock, self).__init__() + inter_dim = int(in_dim * expand_ratio) + if model_size == 'tiny': + depth = 1 + elif model_size == 'large': + depth = 2 + elif model_size == 'huge': + depth = 3 + self.cv1 = Conv(in_dim, inter_dim, k=1, act_type=act_type) + self.cv2 = Conv(in_dim, inter_dim, k=1, act_type=act_type) + self.cv3 = nn.Sequential(*[ + Conv(inter_dim, inter_dim, k=3, p=1, act_type=act_type, depthwise=depthwise) + for _ in range(depth) + ]) + self.cv4 = nn.Sequential(*[ + Conv(inter_dim, inter_dim, k=3, p=1, act_type=act_type, depthwise=depthwise) + for _ in range(depth) + ]) + + self.out = Conv(inter_dim*4, out_dim, k=1) + + + + def forward(self, x): + """ + Input: + x: [B, C, H, W] + Output: + out: [B, 2C, H, W] + """ + x1 = self.cv1(x) + x2 = self.cv2(x) + x3 = self.cv3(x2) + x4 = self.cv4(x3) + + # [B, C, H, W] -> [B, 2C, H, W] + out = self.out(torch.cat([x1, x2, x3, x4], dim=1)) + + return out + + +class DownSample(nn.Module): + def __init__(self, in_dim, act_type='silu', norm_type='BN'): + super().__init__() + inter_dim = in_dim // 2 + self.mp = nn.MaxPool2d((2, 2), 2) + self.cv1 = Conv(in_dim, inter_dim, k=1, act_type=act_type, norm_type=norm_type) + self.cv2 = nn.Sequential( + Conv(in_dim, inter_dim, k=1, act_type=act_type, norm_type=norm_type), + Conv(inter_dim, inter_dim, k=3, p=1, s=2, act_type=act_type, norm_type=norm_type) + ) + + def forward(self, x): + """ + Input: + x: [B, C, H, W] + Output: + out: [B, C, H//2, W//2] + """ + # [B, C, H, W] -> [B, C//2, H//2, W//2] + x1 = self.cv1(self.mp(x)) + x2 = self.cv2(x) + + # [B, C, H//2, W//2] + out = torch.cat([x1, x2], dim=1) + + return out + + +# ELANNet-Tiny +class ELANNet_Tiny(nn.Module): + """ + ELAN-Net of YOLOv7-Tiny. 
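+    Returns a dict of feature maps: 'layer2' (P3, stride 8), 'layer3' (P4, stride 16), 'layer4' (P5, stride 32).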
+ """ + def __init__(self, depthwise=False): + super(ELANNet_Tiny, self).__init__() + + # tiny backbone + self.layer_1 = Conv(3, 32, k=3, p=1, s=2, act_type='lrelu', depthwise=depthwise) # P1/2 + + self.layer_2 = nn.Sequential( + Conv(32, 64, k=3, p=1, s=2, act_type='lrelu', depthwise=depthwise), + ELANBlock(in_dim=64, out_dim=64, expand_ratio=0.5, + model_size='tiny', act_type='lrelu', depthwise=depthwise) # P2/4 + ) + self.layer_3 = nn.Sequential( + nn.MaxPool2d((2, 2), 2), + ELANBlock(in_dim=64, out_dim=128, expand_ratio=0.5, + model_size='tiny', act_type='lrelu', depthwise=depthwise) # P3/8 + ) + self.layer_4 = nn.Sequential( + nn.MaxPool2d((2, 2), 2), + ELANBlock(in_dim=128, out_dim=256, expand_ratio=0.5, + model_size='tiny', act_type='lrelu', depthwise=depthwise) # P4/16 + ) + self.layer_5 = nn.Sequential( + nn.MaxPool2d((2, 2), 2), + ELANBlock(in_dim=256, out_dim=512, expand_ratio=0.5, + model_size='tiny', act_type='lrelu', depthwise=depthwise) # P5/32 + ) + + + def forward(self, x): + c1 = self.layer_1(x) + c2 = self.layer_2(c1) + c3 = self.layer_3(c2) + c4 = self.layer_4(c3) + c5 = self.layer_5(c4) + + outputs = { + 'layer2': c3, + 'layer3': c4, + 'layer4': c5 + } + return outputs + + +# ELANNet-Large +class ELANNet_Large(nn.Module): + """ + ELAN-Net of YOLOv7. + """ + def __init__(self, depthwise=False): + super(ELANNet_Large, self).__init__() + + # large backbone + self.layer_1 = nn.Sequential( + Conv(3, 32, k=3, p=1, act_type='silu', depthwise=depthwise), + Conv(32, 64, k=3, p=1, s=2, act_type='silu', depthwise=depthwise), + Conv(64, 64, k=3, p=1, act_type='silu', depthwise=depthwise) # P1/2 + ) + self.layer_2 = nn.Sequential( + Conv(64, 128, k=3, p=1, s=2, act_type='silu', depthwise=depthwise), + ELANBlock(in_dim=128, out_dim=256, expand_ratio=0.5, + model_size='large',act_type='silu', depthwise=depthwise) # P2/4 + ) + self.layer_3 = nn.Sequential( + DownSample(in_dim=256, act_type='silu'), + ELANBlock(in_dim=256, out_dim=512, expand_ratio=0.5, + model_size='large',act_type='silu', depthwise=depthwise) # P3/8 + ) + self.layer_4 = nn.Sequential( + DownSample(in_dim=512, act_type='silu'), + ELANBlock(in_dim=512, out_dim=1024, expand_ratio=0.5, + model_size='large',act_type='silu', depthwise=depthwise) # P4/16 + ) + self.layer_5 = nn.Sequential( + DownSample(in_dim=1024, act_type='silu'), + ELANBlock(in_dim=1024, out_dim=1024, expand_ratio=0.25, + model_size='large',act_type='silu', depthwise=depthwise) # P5/32 + ) + + + def forward(self, x): + c1 = self.layer_1(x) + c2 = self.layer_2(c1) + c3 = self.layer_3(c2) + c4 = self.layer_4(c3) + c5 = self.layer_5(c4) + + outputs = { + 'layer2': c3, + 'layer3': c4, + 'layer4': c5 + } + return outputs + + +## build ELAN-Net +def build_elannet(model_name='elannet_large'): + # model + if model_name == 'elannet_large': + backbone = ELANNet_Large() + feat_dims = [512, 1024, 1024] + elif model_name == 'elannet_tiny': + backbone = ELANNet_Tiny() + feat_dims = [128, 256, 512] + + return backbone, feat_dims + + +# ====================== ShuffleNet-v2 ========================== +# ShuffleNet-v2 +def channel_shuffle(x, groups): + # type: (torch.Tensor, int) -> torch.Tensor + batchsize, num_channels, height, width = x.data.size() + channels_per_group = num_channels // groups + + # reshape + x = x.view(batchsize, groups, + channels_per_group, height, width) + + x = torch.transpose(x, 1, 2).contiguous() + + # flatten + x = x.view(batchsize, -1, height, width) + + return x + + +class ShuffleV2Block(nn.Module): + def __init__(self, inp, oup, stride): + 
super(ShuffleV2Block, self).__init__() + + if not (1 <= stride <= 3): + raise ValueError('illegal stride value') + self.stride = stride + + branch_features = oup // 2 + assert (self.stride != 1) or (inp == branch_features << 1) + + if self.stride > 1: + self.branch1 = nn.Sequential( + self.depthwise_conv(inp, inp, kernel_size=3, stride=self.stride, padding=1), + nn.BatchNorm2d(inp), + nn.Conv2d(inp, branch_features, kernel_size=1, stride=1, padding=0, bias=False), + nn.BatchNorm2d(branch_features), + nn.ReLU(inplace=True), + ) + else: + self.branch1 = nn.Sequential() + + self.branch2 = nn.Sequential( + nn.Conv2d(inp if (self.stride > 1) else branch_features, + branch_features, kernel_size=1, stride=1, padding=0, bias=False), + nn.BatchNorm2d(branch_features), + nn.ReLU(inplace=True), + self.depthwise_conv(branch_features, branch_features, kernel_size=3, stride=self.stride, padding=1), + nn.BatchNorm2d(branch_features), + nn.Conv2d(branch_features, branch_features, kernel_size=1, stride=1, padding=0, bias=False), + nn.BatchNorm2d(branch_features), + nn.ReLU(inplace=True), + ) + + @staticmethod + def depthwise_conv(i, o, kernel_size, stride=1, padding=0, bias=False): + return nn.Conv2d(i, o, kernel_size, stride, padding, bias=bias, groups=i) + + def forward(self, x): + if self.stride == 1: + x1, x2 = x.chunk(2, dim=1) + out = torch.cat((x1, self.branch2(x2)), dim=1) + else: + out = torch.cat((self.branch1(x), self.branch2(x)), dim=1) + + out = channel_shuffle(out, 2) + + return out + + +class ShuffleNetV2(nn.Module): + def __init__(self, + model_size='1.0x', + out_stages=(2, 3, 4), + with_last_conv=False, + kernal_size=3): + super(ShuffleNetV2, self).__init__() + print('model size is ', model_size) + + self.stage_repeats = [4, 8, 4] + self.model_size = model_size + self.out_stages = out_stages + self.with_last_conv = with_last_conv + self.kernal_size = kernal_size + if model_size == '0.5x': + self._stage_out_channels = [24, 48, 96, 192] + elif model_size == '1.0x': + self._stage_out_channels = [24, 116, 232, 464] + elif model_size == '1.5x': + self._stage_out_channels = [24, 176, 352, 704] + elif model_size == '2.0x': + self._stage_out_channels = [24, 244, 488, 976] + else: + raise NotImplementedError + + # building first layer + input_channels = 3 + output_channels = self._stage_out_channels[0] + self.conv1 = nn.Sequential( + nn.Conv2d(input_channels, output_channels, 3, 2, 1, bias=False), + nn.BatchNorm2d(output_channels), + nn.ReLU(inplace=True), + ) + input_channels = output_channels + + self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1) + + stage_names = ['stage{}'.format(i) for i in [2, 3, 4]] + for name, repeats, output_channels in zip( + stage_names, self.stage_repeats, self._stage_out_channels[1:]): + seq = [ShuffleV2Block(input_channels, output_channels, 2)] + for i in range(repeats - 1): + seq.append(ShuffleV2Block(output_channels, output_channels, 1)) + setattr(self, name, nn.Sequential(*seq)) + input_channels = output_channels + + self._initialize_weights() + + + def _initialize_weights(self): + print('init weights...') + for name, m in self.named_modules(): + if isinstance(m, nn.Conv2d): + if 'first' in name: + nn.init.normal_(m.weight, 0, 0.01) + else: + nn.init.normal_(m.weight, 0, 1.0 / m.weight.shape[1]) + if m.bias is not None: + nn.init.constant_(m.bias, 0) + elif isinstance(m, nn.BatchNorm2d): + nn.init.constant_(m.weight, 1) + if m.bias is not None: + nn.init.constant_(m.bias, 0.0001) + nn.init.constant_(m.running_mean, 0) + elif isinstance(m, 
nn.BatchNorm1d): + nn.init.constant_(m.weight, 1) + if m.bias is not None: + nn.init.constant_(m.bias, 0.0001) + nn.init.constant_(m.running_mean, 0) + elif isinstance(m, nn.Linear): + nn.init.normal_(m.weight, 0, 0.01) + if m.bias is not None: + nn.init.constant_(m.bias, 0) + + + def forward(self, x): + x = self.conv1(x) + x = self.maxpool(x) + output = {} + for i in range(2, 5): + stage = getattr(self, 'stage{}'.format(i)) + x = stage(x) + if i in self.out_stages: + output['layer{}'.format(i)] = x + + return output + + +## build ShuffleNet-v2 +def build_shufflenetv2(model_size='1.0x'): + """Constructs a shufflenetv2 model. + + Args: + pretrained (bool): If True, returns a model pre-trained on ImageNet + """ + backbone = ShuffleNetV2(model_size=model_size) + feat_dims = backbone._stage_out_channels[1:] + + return backbone, feat_dims + + +# build backbone +def build_backbone(model_name='elannet_large'): + if model_name in ['elannet_nano', 'elannet_tiny', 'elannet_large', 'elannet_huge']: + return build_elannet(model_name) + + elif model_name in ['shufflenetv2_0.5x', 'shufflenetv2_1.0x']: + return build_shufflenetv2(model_size=model_name[-4:]) + + +if __name__ == '__main__': + import time + model, feats = build_backbone(model_name='shufflenetv2_1.0x') + x = torch.randn(1, 3, 224, 224) + t0 = time.time() + outputs = model(x) + t1 = time.time() + print('Time: ', t1 - t0) + for k in outputs.keys(): + print(outputs[k].shape) diff --git a/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free_basic.py b/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free_basic.py new file mode 100644 index 0000000..6cd708b --- /dev/null +++ b/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free_basic.py @@ -0,0 +1,164 @@ +import math +import torch +import torch.nn as nn +import torch.nn.functional as F + + +class SiLU(nn.Module): + """export-friendly version of nn.SiLU()""" + + @staticmethod + def forward(x): + return x * torch.sigmoid(x) + + +def get_conv2d(c1, c2, k, p, s, d, g, padding_mode='ZERO', bias=False): + if padding_mode == 'ZERO': + conv = nn.Conv2d(c1, c2, k, stride=s, padding=p, dilation=d, groups=g, bias=bias) + elif padding_mode == 'SAME': + conv = Conv2dSamePadding(c1, c2, k, stride=s, padding=p, dilation=d, groups=g, bias=bias) + + return conv + + +def get_activation(act_type=None): + if act_type == 'relu': + return nn.ReLU(inplace=True) + elif act_type == 'lrelu': + return nn.LeakyReLU(0.1, inplace=True) + elif act_type == 'mish': + return nn.Mish(inplace=True) + elif act_type == 'silu': + return nn.SiLU(inplace=True) + + +def get_norm(norm_type, dim): + if norm_type == 'BN': + return nn.BatchNorm2d(dim) + elif norm_type == 'GN': + return nn.GroupNorm(num_groups=32, num_channels=dim) + + +# Conv2d with "SAME" padding +class Conv2dSamePadding(nn.Conv2d): + """ + A wrapper around :class:`torch.nn.Conv2d` to support "SAME" padding mode and more features. + """ + + def __init__(self, *args, **kwargs): + """ + Extra keyword arguments supported in addition to those in `torch.nn.Conv2d`: + + Args: + norm (nn.Module, optional): a normalization layer + activation (callable(Tensor) -> Tensor): a callable activation function + + It assumes that norm layer is used before activation. 
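+        With padding="SAME", the needed padding is computed in forward() so that each output spatial size is ceil(input_size / stride), i.e. TensorFlow-style "SAME" padding.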
+ """ + + # parse padding mode + self.padding_method = kwargs.pop("padding", None) + if self.padding_method is None: + if len(args) >= 5: + self.padding_method = args[4] + else: + self.padding_method = 0 # default padding number + + if isinstance(self.padding_method, str): + if self.padding_method.upper() == "SAME": + # If the padding mode is `SAME`, it will be manually padded + super().__init__(*args, **kwargs, padding=0) + # stride + if isinstance(self.stride, int): + self.stride = [self.stride] * 2 + elif len(self.stride) == 1: + self.stride = [self.stride[0]] * 2 + # kernel size + if isinstance(self.kernel_size, int): + self.kernel_size = [self.kernel_size] * 2 + elif len(self.kernel_size) == 1: + self.kernel_size = [self.kernel_size[0]] * 2 + # dilation + if isinstance(self.dilation, int): + self.dilation = [self.dilation] * 2 + elif len(self.dilation) == 1: + self.dilation = [self.dilation[0]] * 2 + else: + raise ValueError("Unknown padding method: {}".format(self.padding_method)) + else: + super().__init__(*args, **kwargs, padding=self.padding_method) + + def forward(self, x): + if isinstance(self.padding_method, str): + if self.padding_method.upper() == "SAME": + input_h, input_w = x.shape[-2:] + stride_h, stride_w = self.stride + kernel_size_h, kernel_size_w = self.kernel_size + dilation_h, dilation_w = self.dilation + + output_h = math.ceil(input_h / stride_h) + output_w = math.ceil(input_w / stride_w) + + padding_needed_h = max( + 0, (output_h - 1) * stride_h + (kernel_size_h - 1) * dilation_h + 1 - input_h + ) + padding_needed_w = max( + 0, (output_w - 1) * stride_w + (kernel_size_w - 1) * dilation_w + 1 - input_w + ) + + left = padding_needed_w // 2 + right = padding_needed_w - left + top = padding_needed_h // 2 + bottom = padding_needed_h - top + + x = F.pad(x, [left, right, top, bottom]) + else: + raise ValueError("Unknown padding method: {}".format(self.padding_method)) + + x = super().forward(x) + + return x + + +# Basic conv layer +class Conv(nn.Module): + def __init__(self, + c1, # in channels + c2, # out channels + k=1, # kernel size + p=0, # padding + s=1, # padding + d=1, # dilation + act_type='', # activation + norm_type='', # normalization + padding_mode='ZERO', # padding mode: "ZERO" or "SAME" + depthwise=False): + super(Conv, self).__init__() + convs = [] + add_bias = False if norm_type else True + if depthwise: + convs.append(get_conv2d(c1, c1, k=k, p=p, s=s, d=d, g=c1, padding_mode=padding_mode, bias=add_bias)) + # depthwise conv + if norm_type: + convs.append(get_norm(norm_type, c1)) + if act_type: + convs.append(get_activation(act_type)) + # pointwise conv + convs.append(get_conv2d(c1, c2, k=1, p=0, s=1, d=d, g=1, bias=add_bias)) + if norm_type: + convs.append(get_norm(norm_type, c2)) + if act_type: + convs.append(get_activation(act_type)) + + else: + convs.append(get_conv2d(c1, c2, k=k, p=p, s=s, d=d, g=1, padding_mode=padding_mode, bias=add_bias)) + if norm_type: + convs.append(get_norm(norm_type, c2)) + if act_type: + convs.append(get_activation(act_type)) + + self.convs = nn.Sequential(*convs) + + + def forward(self, x): + return self.convs(x) diff --git a/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free_fpn.py b/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free_fpn.py new file mode 100644 index 0000000..1fd8a8b --- /dev/null +++ b/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free_fpn.py @@ -0,0 +1,252 @@ +import torch +import torch.nn as nn +import torch.nn.functional as F + +try: + from yolo_free_basic import Conv +except: + from 
.yolo_free_basic import Conv + + +class ELANBlock(nn.Module): + """ + ELAN BLock of YOLOv7's head + """ + def __init__(self, in_dim, out_dim, fpn_size='large', depthwise=False, act_type='silu', norm_type='BN'): + super(ELANBlock, self).__init__() + if fpn_size == 'tiny' or fpn_size =='nano': + e1, e2 = 0.25, 1.0 + width = 2 + depth = 1 + elif fpn_size == 'large': + e1, e2 = 0.5, 0.5 + width = 4 + depth = 1 + elif fpn_size == 'huge': + e1, e2 = 0.5, 0.5 + width = 4 + depth = 2 + inter_dim = int(in_dim * e1) + inter_dim2 = int(inter_dim * e2) + self.cv1 = Conv(in_dim, inter_dim, k=1, act_type=act_type, norm_type=norm_type) + self.cv2 = Conv(in_dim, inter_dim, k=1, act_type=act_type, norm_type=norm_type) + self.cv3 = nn.ModuleList() + for idx in range(width): + if idx == 0: + cvs = [Conv(inter_dim, inter_dim2, k=3, p=1, act_type=act_type, norm_type=norm_type, depthwise=depthwise)] + else: + cvs = [Conv(inter_dim2, inter_dim2, k=3, p=1, act_type=act_type, norm_type=norm_type, depthwise=depthwise)] + # deeper + if depth > 1: + for _ in range(1, depth): + cvs.append(Conv(inter_dim2, inter_dim2, k=3, p=1, act_type=act_type, norm_type=norm_type, depthwise=depthwise)) + self.cv3.append(nn.Sequential(*cvs)) + else: + self.cv3.append(cvs[0]) + + self.out = Conv(inter_dim*2+inter_dim2*len(self.cv3), out_dim, k=1, act_type=act_type, norm_type=norm_type) + + + def forward(self, x): + """ + Input: + x: [B, C_in, H, W] + Output: + out: [B, C_out, H, W] + """ + x1 = self.cv1(x) + x2 = self.cv2(x) + inter_outs = [x1, x2] + for m in self.cv3: + y1 = inter_outs[-1] + y2 = m(y1) + inter_outs.append(y2) + + # [B, C_in, H, W] -> [B, C_out, H, W] + out = self.out(torch.cat(inter_outs, dim=1)) + + return out + + +class DownSample(nn.Module): + def __init__(self, in_dim, depthwise=False, act_type='silu', norm_type='BN'): + super().__init__() + inter_dim = in_dim + self.mp = nn.MaxPool2d((2, 2), 2) + self.cv1 = Conv(in_dim, inter_dim, k=1, act_type=act_type, norm_type=norm_type) + self.cv2 = nn.Sequential( + Conv(in_dim, inter_dim, k=1, act_type=act_type, norm_type=norm_type), + Conv(inter_dim, inter_dim, k=3, p=1, s=2, act_type=act_type, norm_type=norm_type, depthwise=depthwise) + ) + + def forward(self, x): + """ + Input: + x: [B, C, H, W] + Output: + out: [B, 2C, H//2, W//2] + """ + # [B, C, H, W] -> [B, C//2, H//2, W//2] + x1 = self.cv1(self.mp(x)) + x2 = self.cv2(x) + + # [B, C, H//2, W//2] + out = torch.cat([x1, x2], dim=1) + + return out + + +# PaFPN-ELAN +class PaFPNELAN(nn.Module): + def __init__(self, + in_dims=[512, 1024, 1024], + out_dim=256, + fpn_size='large', + depthwise=False, + norm_type='BN', + act_type='silu'): + super(PaFPNELAN, self).__init__() + self.in_dims = in_dims + self.out_dim = out_dim + c3, c4, c5 = in_dims + if fpn_size == 'tiny': + width = 0.5 + elif fpn_size == 'nano': + assert depthwise + width = 0.5 + elif fpn_size == 'large': + width = 1.0 + elif fpn_size == 'huge': + width = 1.25 + + # top dwon + ## P5 -> P4 + self.cv1 = Conv(c5, int(256 * width), k=1, norm_type=norm_type, act_type=act_type) + self.cv2 = Conv(c4, int(256 * width), k=1, norm_type=norm_type, act_type=act_type) + self.head_elan_1 = ELANBlock(in_dim=int(256 * width) + int(256 * width), + out_dim=int(256 * width), + fpn_size=fpn_size, + depthwise=depthwise, + norm_type=norm_type, + act_type=act_type) + + # P4 -> P3 + self.cv3 = Conv(int(256 * width), int(128 * width), k=1, norm_type=norm_type, act_type=act_type) + self.cv4 = Conv(c3, int(128 * width), k=1, norm_type=norm_type, act_type=act_type) + self.head_elan_2 = 
ELANBlock(in_dim=int(128 * width) + int(128 * width), + out_dim=int(128 * width), # 128 + fpn_size=fpn_size, + depthwise=depthwise, + norm_type=norm_type, + act_type=act_type) + + # bottom up + # P3 -> P4 + if fpn_size == 'large' or fpn_size == 'huge': + self.mp1 = DownSample(int(128 * width), act_type=act_type, + norm_type=norm_type, depthwise=depthwise) + elif fpn_size == 'tiny': + self.mp1 = Conv(int(128 * width), int(256 * width), k=3, p=1, s=2, + act_type=act_type, norm_type=norm_type, depthwise=depthwise) + elif fpn_size == 'nano': + self.mp1 = nn.Sequential( + nn.MaxPool2d((2, 2), 2), + Conv(int(128 * width), int(256 * width), k=1, act_type=act_type, norm_type=norm_type) + ) + self.head_elan_3 = ELANBlock(in_dim=int(256 * width) + int(256 * width), + out_dim=int(256 * width), # 256 + fpn_size=fpn_size, + depthwise=depthwise, + norm_type=norm_type, + act_type=act_type) + + # P4 -> P5 + if fpn_size == 'large' or fpn_size == 'huge': + self.mp2 = DownSample(int(256 * width), act_type=act_type, + norm_type=norm_type, depthwise=depthwise) + elif fpn_size == 'tiny': + self.mp2 = Conv(int(256 * width), int(512 * width), k=3, p=1, s=2, + act_type=act_type, norm_type=norm_type, depthwise=depthwise) + elif fpn_size == 'nano': + self.mp2 = nn.Sequential( + nn.MaxPool2d((2, 2), 2), + Conv(int(256 * width), int(512 * width), k=1, act_type=act_type, norm_type=norm_type) + ) + self.head_elan_4 = ELANBlock(in_dim=int(512 * width) + c5, + out_dim=int(512 * width), # 512 + fpn_size=fpn_size, + depthwise=depthwise, + norm_type=norm_type, + act_type=act_type) + + self.head_conv_1 = Conv(int(128 * width), int(256 * width), k=3, p=1, + act_type=act_type, norm_type=norm_type, depthwise=depthwise) + self.head_conv_2 = Conv(int(256 * width), int(512 * width), k=3, p=1, + act_type=act_type, norm_type=norm_type, depthwise=depthwise) + self.head_conv_3 = Conv(int(512 * width), int(1024 * width), k=3, p=1, + act_type=act_type, norm_type=norm_type, depthwise=depthwise) + # output proj layers + if self.out_dim is not None: + self.out_layers = nn.ModuleList([ + Conv(in_dim, self.out_dim, k=1, + norm_type=norm_type, act_type=act_type) + for in_dim in [int(256 * width), int(512 * width), int(1024 * width)] + ]) + + + def forward(self, features): + c3, c4, c5 = features + + # Top down + ## P5 -> P4 + c6 = self.cv1(c5) + c7 = F.interpolate(c6, scale_factor=2.0) + c8 = torch.cat([c7, self.cv2(c4)], dim=1) + c9 = self.head_elan_1(c8) + ## P4 -> P3 + c10 = self.cv3(c9) + c11 = F.interpolate(c10, scale_factor=2.0) + c12 = torch.cat([c11, self.cv4(c3)], dim=1) + c13 = self.head_elan_2(c12) + + # Bottom up + # p3 -> P4 + c14 = self.mp1(c13) + c15 = torch.cat([c14, c9], dim=1) + c16 = self.head_elan_3(c15) + # P4 -> P5 + c17 = self.mp2(c16) + c18 = torch.cat([c17, c5], dim=1) + c19 = self.head_elan_4(c18) + + c20 = self.head_conv_1(c13) + c21 = self.head_conv_2(c16) + c22 = self.head_conv_3(c19) + + out_feats = [c20, c21, c22] # [P3, P4, P5] + + # output proj layers + if self.out_dim is not None: + out_feats_proj = [] + for feat, layer in zip(out_feats, self.out_layers): + out_feats_proj.append(layer(feat)) + return out_feats_proj + + return out_feats + + +def build_fpn(cfg, in_dims, out_dim): + model = cfg['fpn'] + print('==============================') + print('FPN: {}'.format(model)) + # build neck + if model == 'pafpn_elan': + fpn_net = PaFPNELAN(in_dims=in_dims, + out_dim=out_dim, + fpn_size=cfg['fpn_size'], + depthwise=cfg['fpn_depthwise'], + norm_type=cfg['fpn_norm'], + act_type=cfg['fpn_act']) + + + return fpn_net 
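+
+
+if __name__ == '__main__':
+    # Illustrative smoke test added for this write-up, not part of the original file.
+    # It builds PaFPN-ELAN with the 'large' FPN settings from yolo_free_config and feeds
+    # dummy P3/P4/P5 features; the batch size and 224x224-derived shapes are assumptions.
+    cfg = {
+        'fpn': 'pafpn_elan',
+        'fpn_size': 'large',
+        'fpn_dim': [512, 1024, 512],
+        'fpn_norm': 'BN',
+        'fpn_act': 'silu',
+        'fpn_depthwise': False,
+    }
+    fpn = build_fpn(cfg, in_dims=cfg['fpn_dim'], out_dim=256)
+    pyramid_feats = [
+        torch.randn(1, 512, 28, 28),    # P3 (stride 8)
+        torch.randn(1, 1024, 14, 14),   # P4 (stride 16)
+        torch.randn(1, 512, 7, 7),      # P5 (stride 32)
+    ]
+    outs = fpn(pyramid_feats)
+    for out in outs:
+        # every level is projected to out_dim channels:
+        # [1, 256, 28, 28], [1, 256, 14, 14], [1, 256, 7, 7]
+        print(out.shape)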
diff --git a/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free_head.py b/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free_head.py new file mode 100644 index 0000000..34028d5 --- /dev/null +++ b/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free_head.py @@ -0,0 +1,51 @@ +import torch +import torch.nn as nn + +try: + from yolo_free_basic import Conv +except: + from .yolo_free_basic import Conv + + +class DecoupledHead(nn.Module): + def __init__(self, cfg): + super().__init__() + + print('==============================') + print('Head: Decoupled Head') + self.num_cls_head=cfg['num_cls_head'] + self.num_reg_head=cfg['num_reg_head'] + self.act_type=cfg['head_act'] + self.norm_type=cfg['head_norm'] + self.head_dim = cfg['head_dim'] + + self.cls_feats = nn.Sequential(*[Conv(self.head_dim, + self.head_dim, + k=3, p=1, s=1, + act_type=self.act_type, + norm_type=self.norm_type, + depthwise=cfg['head_depthwise']) for _ in range(self.num_cls_head)]) + self.reg_feats = nn.Sequential(*[Conv(self.head_dim, + self.head_dim, + k=3, p=1, s=1, + act_type=self.act_type, + norm_type=self.norm_type, + depthwise=cfg['head_depthwise']) for _ in range(self.num_reg_head)]) + + + def forward(self, x): + """ + in_feats: (Tensor) [B, C, H, W] + """ + cls_feats = self.cls_feats(x) + reg_feats = self.reg_feats(x) + + return cls_feats, reg_feats + + +# build detection head +def build_head(cfg): + head = DecoupledHead(cfg) + + return head + \ No newline at end of file diff --git a/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free_neck.py b/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free_neck.py new file mode 100644 index 0000000..7988623 --- /dev/null +++ b/models/backbone/backbone_2d/cnn_2d/yolo_free/yolo_free_neck.py @@ -0,0 +1,164 @@ +import torch +import torch.nn as nn + +try: + from yolo_free_basic import Conv +except: + from .yolo_free_basic import Conv + + +# Spatial Pyramid Pooling +class SPP(nn.Module): + """ + Spatial Pyramid Pooling + """ + def __init__(self, in_dim, out_dim, expand_ratio=0.5, pooling_size=[5, 9, 13], norm_type='BN', act_type='relu'): + super(SPP, self).__init__() + inter_dim = int(in_dim * expand_ratio) + self.cv1 = Conv(in_dim, inter_dim, k=1, act_type=act_type, norm_type=norm_type) + self.m = nn.ModuleList( + [ + nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) + for k in pooling_size + ] + ) + + self.cv2 = Conv(inter_dim*(len(pooling_size) + 1), out_dim, k=1, act_type=act_type, norm_type=norm_type) + + def forward(self, x): + x = self.cv1(x) + x = torch.cat([x] + [m(x) for m in self.m], dim=1) + x = self.cv2(x) + + return x + + +# SPP block with CSP module +class SPPBlock(nn.Module): + """ + Spatial Pyramid Pooling Block + """ + def __init__(self, + in_dim, + out_dim, + expand_ratio=0.5, + pooling_size=[5, 9, 13], + act_type='lrelu', + norm_type='BN', + depthwise=False + ): + super(SPPBlockCSP, self).__init__() + inter_dim = int(in_dim * expand_ratio) + self.cv1 = Conv(in_dim, inter_dim, k=1, act_type=act_type, norm_type=norm_type) + self.cv2 = nn.Sequential( + SPP(inter_dim, + inter_dim, + expand_ratio=1.0, + pooling_size=pooling_size, + act_type=act_type, + norm_type=norm_type), + ) + self.cv3 = Conv(inter_dim * 2, out_dim, k=1, act_type=act_type, norm_type=norm_type) + + + def forward(self, x): + x1 = self.cv1(x) + x2 = self.cv2(x) + y = self.cv3(torch.cat([x1, x2], dim=1)) + + return y + + +# SPP block with CSP module +class SPPBlockCSP(nn.Module): + """ + CSP Spatial Pyramid Pooling Block + """ + def __init__(self, + in_dim, + out_dim, + 
expand_ratio=0.5, + pooling_size=[5, 9, 13], + act_type='lrelu', + norm_type='BN', + depthwise=False + ): + super(SPPBlockCSP, self).__init__() + inter_dim = int(in_dim * expand_ratio) + self.cv1 = Conv(in_dim, inter_dim, k=1, act_type=act_type, norm_type=norm_type) + self.cv2 = Conv(in_dim, inter_dim, k=1, act_type=act_type, norm_type=norm_type) + self.m = nn.Sequential( + Conv(inter_dim, inter_dim, k=3, p=1, + act_type=act_type, norm_type=norm_type, + depthwise=depthwise), + SPP(inter_dim, + inter_dim, + expand_ratio=1.0, + pooling_size=pooling_size, + act_type=act_type, + norm_type=norm_type), + Conv(inter_dim, inter_dim, k=3, p=1, + act_type=act_type, norm_type=norm_type, + depthwise=depthwise) + ) + self.cv3 = Conv(inter_dim * 2, out_dim, k=1, act_type=act_type, norm_type=norm_type) + + + def forward(self, x): + x1 = self.cv1(x) + x2 = self.cv2(x) + x3 = self.m(x2) + y = self.cv3(torch.cat([x1, x3], dim=1)) + + return y + + +# Spatial Pyramid Pooling - Fast (SPPF) layer for YOLOv5 by Glenn Jocher +class SPPF(nn.Module): + def __init__(self, in_dim, out_dim, k=5): # equivalent to SPP(k=(5, 9, 13)) + super().__init__() + inter_dim = in_dim // 2 # hidden channels + self.cv1 = Conv(in_dim, inter_dim, k=1) + self.cv2 = Conv(inter_dim * 4, out_dim, k=1) + self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) + + def forward(self, x): + x = self.cv1(x) + y1 = self.m(x) + y2 = self.m(y1) + + return self.cv2(torch.cat((x, y1, y2, self.m(y2)), 1)) + + +def build_neck(cfg, in_dim, out_dim): + model = cfg['neck'] + # build neck + if model == 'spp_block': + neck = SPPBlock( + in_dim, out_dim, + expand_ratio=cfg['expand_ratio'], + pooling_size=cfg['pooling_size'], + act_type=cfg['neck_act'], + norm_type=cfg['neck_norm'], + depthwise=cfg['neck_depthwise'] + ) + + elif model == 'spp_block_csp': + neck = SPPBlockCSP( + in_dim, out_dim, + expand_ratio=cfg['expand_ratio'], + pooling_size=cfg['pooling_size'], + act_type=cfg['neck_act'], + norm_type=cfg['neck_norm'], + depthwise=cfg['neck_depthwise'] + ) + + elif model == 'sppf': + neck = SPPF(in_dim, out_dim, k=cfg['pooling_size']) + + + return neck + + +if __name__ == '__main__': + pass diff --git a/models/backbone/backbone_3d/__init__.py b/models/backbone/backbone_3d/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/models/backbone/backbone_3d/backbone_3d.py b/models/backbone/backbone_3d/backbone_3d.py new file mode 100644 index 0000000..7b2e489 --- /dev/null +++ b/models/backbone/backbone_3d/backbone_3d.py @@ -0,0 +1,68 @@ +import torch.nn as nn +import torch.nn.functional as F + +from .cnn_3d import build_3d_cnn + + +class Conv(nn.Module): + def __init__(self, in_dim, out_dim, k=3, p=1, s=1, depthwise=False): + super().__init__() + if depthwise: + self.convs = nn.Sequential( + nn.Conv2d(in_dim, in_dim, kernel_size=k, padding=p, stride=s, groups=in_dim, bias=False), + nn.BatchNorm2d(out_dim), + nn.ReLU(inplace=True), + nn.Conv2d(in_dim, out_dim, kernel_size=1, groups=in_dim, bias=False), + nn.BatchNorm2d(out_dim), + nn.ReLU(inplace=True), + ) + else: + self.convs = nn.Sequential( + nn.Conv2d(in_dim, out_dim, kernel_size=k, padding=p, stride=s, bias=False), + nn.BatchNorm2d(out_dim), + nn.ReLU(inplace=True) + ) + + def forward(self, x): + return self.convs(x) + + +class ConvBlocks(nn.Module): + def __init__(self, in_dim, out_dim, nblocks=1, depthwise=False): + super().__init__() + assert in_dim == out_dim + + conv_block = [] + for _ in range(nblocks): + conv_block.append( + Conv(in_dim, out_dim, k=3, p=1, s=1, 
depthwise=depthwise) + ) + self.conv_block = nn.Sequential(*conv_block) + + def forward(self, x): + return self.conv_block(x) + + +class Backbone3D(nn.Module): + def __init__(self, cfg, pretrained=False): + super().__init__() + self.cfg = cfg + + # 3D CNN + self.backbone, self.feat_dim = build_3d_cnn(cfg, pretrained) + + + def forward(self, x): + """ + Input: + x: (Tensor) -> [B, C, T, H, W] + Output: + y: (List) -> [ + (Tensor) -> [B, C1, H1, W1], + (Tensor) -> [B, C2, H2, W2], + (Tensor) -> [B, C3, H3, W3] + ] + """ + feat = self.backbone(x) + + return feat diff --git a/models/backbone/backbone_3d/cnn_3d/__init__.py b/models/backbone/backbone_3d/cnn_3d/__init__.py new file mode 100644 index 0000000..6148780 --- /dev/null +++ b/models/backbone/backbone_3d/cnn_3d/__init__.py @@ -0,0 +1,30 @@ +from .resnet import build_resnet_3d +from .resnext import build_resnext_3d +from .shufflnetv2 import build_shufflenetv2_3d + + +def build_3d_cnn(cfg, pretrained=False): + print('==============================') + print('3D Backbone: {}'.format(cfg['backbone_3d'].upper())) + print('--pretrained: {}'.format(pretrained)) + + if 'resnet' in cfg['backbone_3d']: + model, feat_dims = build_resnet_3d( + model_name=cfg['backbone_3d'], + pretrained=pretrained + ) + elif 'resnext' in cfg['backbone_3d']: + model, feat_dims = build_resnext_3d( + model_name=cfg['backbone_3d'], + pretrained=pretrained + ) + elif 'shufflenetv2' in cfg['backbone_3d']: + model, feat_dims = build_shufflenetv2_3d( + model_size=cfg['model_size'], + pretrained=pretrained + ) + else: + print('Unknown Backbone ...') + exit() + + return model, feat_dims diff --git a/models/backbone/backbone_3d/cnn_3d/resnet.py b/models/backbone/backbone_3d/cnn_3d/resnet.py new file mode 100644 index 0000000..81d28a6 --- /dev/null +++ b/models/backbone/backbone_3d/cnn_3d/resnet.py @@ -0,0 +1,309 @@ +import torch +import torch.nn as nn +import torch.nn.functional as F +from torch.autograd import Variable +from torch.hub import load_state_dict_from_url +from functools import partial + +__all__ = [ + 'ResNet', 'resnet10', 'resnet18', 'resnet34', 'resnet50', 'resnet101', + 'resnet152', 'resnet200' +] + + +model_urls = { + "resnet18": "https://github.com/yjh0410/YOWOF/releases/download/yowof-weight/resnet-18-kinetics.pth", + "resnet34": "https://github.com/yjh0410/YOWOF/releases/download/yowof-weight/resnet-34-kinetics.pth", + "resnet50": "https://github.com/yjh0410/YOWOF/releases/download/yowof-weight/resnet-50-kinetics.pth", + "resnet101": "https://github.com/yjh0410/YOWOF/releases/download/yowof-weight/resnet-101-kinetics.pth" +} + + + +def conv3x3x3(in_planes, out_planes, stride=1): + # 3x3x3 convolution with padding + return nn.Conv3d( + in_planes, + out_planes, + kernel_size=3, + stride=stride, + padding=1, + bias=False) + + +def downsample_basic_block(x, planes, stride): + out = F.avg_pool3d(x, kernel_size=1, stride=stride) + zero_pads = torch.Tensor( + out.size(0), planes - out.size(1), out.size(2), out.size(3), + out.size(4)).zero_() + + if isinstance(out.data, torch.cuda.FloatTensor): + zero_pads = zero_pads.cuda() + zero_pads = zero_pads.to(out.data.device) + out = Variable(torch.cat([out.data, zero_pads], dim=1)) + + return out + + +class BasicBlock(nn.Module): + expansion = 1 + + def __init__(self, inplanes, planes, stride=1, downsample=None): + super(BasicBlock, self).__init__() + self.conv1 = conv3x3x3(inplanes, planes, stride) + self.bn1 = nn.BatchNorm3d(planes) + self.relu = nn.ReLU(inplace=True) + self.conv2 = conv3x3x3(planes, planes) + self.bn2 
= nn.BatchNorm3d(planes) + self.downsample = downsample + self.stride = stride + + def forward(self, x): + residual = x + + out = self.conv1(x) + out = self.bn1(out) + out = self.relu(out) + + out = self.conv2(out) + out = self.bn2(out) + + if self.downsample is not None: + residual = self.downsample(x) + + out += residual + out = self.relu(out) + + return out + + +class Bottleneck(nn.Module): + expansion = 4 + + def __init__(self, inplanes, planes, stride=1, downsample=None): + super(Bottleneck, self).__init__() + self.conv1 = nn.Conv3d(inplanes, planes, kernel_size=1, bias=False) + self.bn1 = nn.BatchNorm3d(planes) + self.conv2 = nn.Conv3d( + planes, planes, kernel_size=3, stride=stride, padding=1, bias=False) + self.bn2 = nn.BatchNorm3d(planes) + self.conv3 = nn.Conv3d(planes, planes * 4, kernel_size=1, bias=False) + self.bn3 = nn.BatchNorm3d(planes * 4) + self.relu = nn.ReLU(inplace=True) + self.downsample = downsample + self.stride = stride + + def forward(self, x): + residual = x + + out = self.conv1(x) + out = self.bn1(out) + out = self.relu(out) + + out = self.conv2(out) + out = self.bn2(out) + out = self.relu(out) + + out = self.conv3(out) + out = self.bn3(out) + + if self.downsample is not None: + residual = self.downsample(x) + + out += residual + out = self.relu(out) + + return out + + +class ResNet(nn.Module): + + def __init__(self, + block, + layers, + shortcut_type='B'): + self.inplanes = 64 + super(ResNet, self).__init__() + self.conv1 = nn.Conv3d( + 3, + 64, + kernel_size=7, + stride=(1, 2, 2), + padding=(3, 3, 3), + bias=False) + self.bn1 = nn.BatchNorm3d(64) + self.relu = nn.ReLU(inplace=True) + self.maxpool = nn.MaxPool3d(kernel_size=(3, 3, 3), stride=2, padding=1) + self.layer1 = self._make_layer(block, 64, layers[0], shortcut_type) + self.layer2 = self._make_layer( + block, 128, layers[1], shortcut_type, stride=2) + self.layer3 = self._make_layer( + block, 256, layers[2], shortcut_type, stride=2) + self.layer4 = self._make_layer( + block, 512, layers[3], shortcut_type, stride=2) + # self.avgpool = nn.AvgPool3d((2, 1, 1), stride=1) + + for m in self.modules(): + if isinstance(m, nn.Conv3d): + m.weight = nn.init.kaiming_normal_(m.weight, mode='fan_out') + elif isinstance(m, nn.BatchNorm3d): + m.weight.data.fill_(1) + m.bias.data.zero_() + + def _make_layer(self, block, planes, blocks, shortcut_type, stride=1): + downsample = None + if stride != 1 or self.inplanes != planes * block.expansion: + if shortcut_type == 'A': + downsample = partial( + downsample_basic_block, + planes=planes * block.expansion, + stride=stride) + else: + downsample = nn.Sequential( + nn.Conv3d( + self.inplanes, + planes * block.expansion, + kernel_size=1, + stride=stride, + bias=False), nn.BatchNorm3d(planes * block.expansion)) + + layers = [] + layers.append(block(self.inplanes, planes, stride, downsample)) + self.inplanes = planes * block.expansion + for i in range(1, blocks): + layers.append(block(self.inplanes, planes)) + + return nn.Sequential(*layers) + + def forward(self, x): + c1 = self.conv1(x) + c1 = self.bn1(c1) + c1 = self.relu(c1) + c2 = self.maxpool(c1) + + c2 = self.layer1(c2) + c3 = self.layer2(c2) + c4 = self.layer3(c3) + c5 = self.layer4(c4) + + if c5.size(2) > 1: + c5 = torch.mean(c5, dim=2, keepdim=True) + + return c5.squeeze(2) + + +def load_weight(model, arch): + print('Loading pretrained weight ...') + url = model_urls[arch] + # check + if url is None: + print('No pretrained weight for 3D CNN: {}'.format(arch.upper())) + return model + + print('Loading 3D backbone 
pretrained weight: {}'.format(arch.upper())) + # checkpoint state dict + checkpoint = load_state_dict_from_url(url=url, map_location="cpu", check_hash=True) + checkpoint_state_dict = checkpoint.pop('state_dict') + + # model state dict + model_state_dict = model.state_dict() + # reformat checkpoint_state_dict: + new_state_dict = {} + for k in checkpoint_state_dict.keys(): + v = checkpoint_state_dict[k] + new_state_dict[k[7:]] = v + + # check + for k in list(new_state_dict.keys()): + if k in model_state_dict: + shape_model = tuple(model_state_dict[k].shape) + shape_checkpoint = tuple(new_state_dict[k].shape) + if shape_model != shape_checkpoint: + new_state_dict.pop(k) + # print(k) + else: + new_state_dict.pop(k) + # print(k) + + model.load_state_dict(new_state_dict) + + return model + + +def resnet18(pretrained=False, **kwargs): + """Constructs a 3D ResNet-18 model.""" + + model = ResNet(BasicBlock, [2, 2, 2, 2], **kwargs) + + if pretrained: + model = load_weight(model, 'resnet18') + + return model + + +def resnet34(pretrained=False, **kwargs): + """Constructs a 3D ResNet-34 model.""" + + model = ResNet(BasicBlock, [3, 4, 6, 3], **kwargs) + + if pretrained: + model = load_weight(model, 'resnet34') + + return model + + +def resnet50(pretrained=False, **kwargs): + """Constructs a 3D ResNet-50 model. """ + + model = ResNet(Bottleneck, [3, 4, 6, 3], **kwargs) + + if pretrained: + model = load_weight(model, 'resnet50') + + return model + + +def resnet101(pretrained=False, **kwargs): + """Constructs a 3D ResNet-101 model.""" + + model = ResNet(Bottleneck, [3, 4, 23, 3], **kwargs) + + if pretrained: + model = load_weight(model, 'resnet101') + + return model + + +# build 3D resnet +def build_resnet_3d(model_name='resnet18', pretrained=False): + if model_name == 'resnet18': + model = resnet18(pretrained=pretrained, shortcut_type='A') + feats = 512 + + elif model_name == 'resnet50': + model = resnet50(pretrained=pretrained, shortcut_type='B') + feats = 2048 + + elif model_name == 'resnet101': + model = resnet101(pretrained=pretrained, shortcut_type='b') + feats = 2048 + + return model, feats + + +if __name__ == '__main__': + import time + model, feats = build_resnet_3d(model_name='resnet18', pretrained=True) + if torch.cuda.is_available(): + device = torch.device("cuda") + else: + device = torch.device("cpu") + model = model.to(device) + + x = torch.randn(1, 3, 16, 64, 64).to(device) + # star time + t0 = time.time() + out = model(x) + print('time', time.time() - t0) + + print(out.shape) diff --git a/models/backbone/backbone_3d/cnn_3d/resnext.py b/models/backbone/backbone_3d/cnn_3d/resnext.py new file mode 100644 index 0000000..72ec711 --- /dev/null +++ b/models/backbone/backbone_3d/cnn_3d/resnext.py @@ -0,0 +1,272 @@ +import torch +import torch.nn as nn +import torch.nn.functional as F +from torch.autograd import Variable +from torch.hub import load_state_dict_from_url +from functools import partial + +__all__ = ['resnext50', 'resnext101', 'resnet152'] + + +model_urls = { + "resnext50": "https://github.com/yjh0410/PyTorch_YOWO/releases/download/yowo-weight/resnext-50-kinetics.pth", + "resnext101": "https://github.com/yjh0410/PyTorch_YOWO/releases/download/yowo-weight/resnext-101-kinetics.pth", + "resnext152": "https://github.com/yjh0410/PyTorch_YOWO/releases/download/yowo-weight/resnext-152-kinetics.pth" +} + + + +def downsample_basic_block(x, planes, stride): + out = F.avg_pool3d(x, kernel_size=1, stride=stride) + zero_pads = torch.Tensor( + out.size(0), planes - out.size(1), out.size(2), 
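
As an aside on the backbone builders above: a minimal usage sketch of `build_3d_cnn`, with a hypothetical config dict whose keys are inferred from the dispatch code (the real YOWOv2 configs live in the repo's config package, not here).

```python
from models.backbone.backbone_3d.cnn_3d import build_3d_cnn

# Illustrative config only: keys mirror what build_3d_cnn reads above.
cfg = {
    'backbone_3d': 'resnet18',   # or a resnet50/101, resnext, or shufflenetv2 variant
    'model_size': '1.0x',        # only consulted by the shufflenetv2 branch
}

model, feat_dim = build_3d_cnn(cfg, pretrained=False)
print(feat_dim)                  # 512 for resnet18, 2048 for the deeper variants
```
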
out.size(3), + out.size(4)).zero_() + + if isinstance(out.data, torch.cuda.FloatTensor): + zero_pads = zero_pads.cuda() + zero_pads = zero_pads.to(out.data.device) + out = Variable(torch.cat([out.data, zero_pads], dim=1)) + + return out + + +class ResNeXtBottleneck(nn.Module): + expansion = 2 + + def __init__(self, inplanes, planes, cardinality, stride=1, + downsample=None): + super(ResNeXtBottleneck, self).__init__() + mid_planes = cardinality * int(planes / 32) + self.conv1 = nn.Conv3d(inplanes, mid_planes, kernel_size=1, bias=False) + self.bn1 = nn.BatchNorm3d(mid_planes) + self.conv2 = nn.Conv3d( + mid_planes, + mid_planes, + kernel_size=3, + stride=stride, + padding=1, + groups=cardinality, + bias=False) + self.bn2 = nn.BatchNorm3d(mid_planes) + self.conv3 = nn.Conv3d( + mid_planes, planes * self.expansion, kernel_size=1, bias=False) + self.bn3 = nn.BatchNorm3d(planes * self.expansion) + self.relu = nn.ReLU(inplace=True) + self.downsample = downsample + self.stride = stride + + def forward(self, x): + residual = x + + out = self.conv1(x) + out = self.bn1(out) + out = self.relu(out) + + out = self.conv2(out) + out = self.bn2(out) + out = self.relu(out) + + out = self.conv3(out) + out = self.bn3(out) + + if self.downsample is not None: + residual = self.downsample(x) + + out += residual + out = self.relu(out) + + return out + + +class ResNeXt(nn.Module): + + def __init__(self, + block, + layers, + shortcut_type='B', + cardinality=32): + self.inplanes = 64 + super(ResNeXt, self).__init__() + self.conv1 = nn.Conv3d( + 3, + 64, + kernel_size=7, + stride=(1, 2, 2), + padding=(3, 3, 3), + bias=False) + self.bn1 = nn.BatchNorm3d(64) + self.relu = nn.ReLU(inplace=True) + + self.maxpool = nn.MaxPool3d(kernel_size=(3, 3, 3), stride=2, padding=1) + + self.layer1 = self._make_layer(block, 128, layers[0], shortcut_type, + cardinality) + + self.layer2 = self._make_layer( + block, 256, layers[1], shortcut_type, cardinality, stride=2) + + self.layer3 = self._make_layer( + block, 512, layers[2], shortcut_type, cardinality, stride=2) + + self.layer4 = self._make_layer( + block, 1024, layers[3], shortcut_type, cardinality, stride=2) + + for m in self.modules(): + if isinstance(m, nn.Conv3d): + m.weight = nn.init.kaiming_normal_(m.weight, mode='fan_out') + elif isinstance(m, nn.BatchNorm3d): + m.weight.data.fill_(1) + m.bias.data.zero_() + + def _make_layer(self, + block, + planes, + blocks, + shortcut_type, + cardinality, + stride=1): + downsample = None + if stride != 1 or self.inplanes != planes * block.expansion: + if shortcut_type == 'A': + downsample = partial( + downsample_basic_block, + planes=planes * block.expansion, + stride=stride) + else: + downsample = nn.Sequential( + nn.Conv3d( + self.inplanes, + planes * block.expansion, + kernel_size=1, + stride=stride, + bias=False), nn.BatchNorm3d(planes * block.expansion)) + + layers = [] + layers.append( + block(self.inplanes, planes, cardinality, stride, downsample)) + self.inplanes = planes * block.expansion + for i in range(1, blocks): + layers.append(block(self.inplanes, planes, cardinality)) + + return nn.Sequential(*layers) + + def forward(self, x): + c1 = self.conv1(x) + c1 = self.bn1(c1) + c1 = self.relu(c1) + c2 = self.maxpool(c1) + + c2 = self.layer1(c2) + c3 = self.layer2(c2) + c4 = self.layer3(c3) + c5 = self.layer4(c4) + + if c5.size(2) > 1: + c5 = torch.mean(c5, dim=2, keepdim=True) + + return c5.squeeze(2) + + +def load_weight(model, arch): + print('Loading pretrained weight ...') + url = model_urls[arch] + # check + if url is None: + 
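
A quick shape sketch of the temporal collapse both 3D backbones perform at the end of `forward()` (the input size and channel count below are assumptions for illustration, not values from the repo):

```python
import torch

# After layer4 the clip feature still has a small temporal extent, so forward()
# averages over dim=2 and squeezes it away, leaving a 2D map that can be fused
# with the key-frame features.
c5 = torch.randn(1, 2048, 2, 7, 7)          # [B, C, T', H', W'] after layer4
if c5.size(2) > 1:
    c5 = torch.mean(c5, dim=2, keepdim=True)
out = c5.squeeze(2)
print(out.shape)                             # torch.Size([1, 2048, 7, 7])
```
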
print('No pretrained weight for 3D CNN: {}'.format(arch.upper())) + return model + + print('Loading 3D backbone pretrained weight: {}'.format(arch.upper())) + # checkpoint state dict + checkpoint = load_state_dict_from_url(url=url, map_location="cpu", check_hash=True) + checkpoint_state_dict = checkpoint.pop('state_dict') + + # model state dict + model_state_dict = model.state_dict() + # reformat checkpoint_state_dict: + new_state_dict = {} + for k in checkpoint_state_dict.keys(): + v = checkpoint_state_dict[k] + new_state_dict[k[7:]] = v + + # check + for k in list(new_state_dict.keys()): + if k in model_state_dict: + shape_model = tuple(model_state_dict[k].shape) + shape_checkpoint = tuple(new_state_dict[k].shape) + if shape_model != shape_checkpoint: + new_state_dict.pop(k) + # print(k) + else: + new_state_dict.pop(k) + # print(k) + + model.load_state_dict(new_state_dict) + + return model + + +def resnext50(pretrained=False, **kwargs): + """Constructs a ResNet-50 model. + """ + model = ResNeXt(ResNeXtBottleneck, [3, 4, 6, 3], **kwargs) + + if pretrained: + model = load_weight(model, 'resnext50') + + return model + + +def resnext101(pretrained=False, **kwargs): + """Constructs a ResNet-101 model. + """ + model = ResNeXt(ResNeXtBottleneck, [3, 4, 23, 3], **kwargs) + + if pretrained: + model = load_weight(model, 'resnext101') + + return model + + +def resnext152(pretrained=False, **kwargs): + """Constructs a ResNet-101 model. + """ + model = ResNeXt(ResNeXtBottleneck, [3, 8, 36, 3], **kwargs) + + if pretrained: + model = load_weight(model, 'resnext152') + + return model + + +# build 3D resnet +def build_resnext_3d(model_name='resnext101', pretrained=True): + if model_name == 'resnext50': + model = resnext50(pretrained=pretrained) + feats = 2048 + + elif model_name == 'resnext101': + model = resnext101(pretrained=pretrained) + feats = 2048 + + elif model_name == 'resnext152': + model = resnext152(pretrained=pretrained) + feats = 2048 + + return model, feats + + +if __name__ == '__main__': + import time + model, feats = build_resnext_3d(model_name='resnext50', pretrained=False) + if torch.cuda.is_available(): + device = torch.device("cuda") + else: + device = torch.device("cpu") + model = model.to(device) + + x = torch.randn(1, 3, 16, 64, 64).to(device) + # star time + t0 = time.time() + out = model(x) + print('time', time.time() - t0) + print(out.shape) diff --git a/models/backbone/backbone_3d/cnn_3d/shufflnetv2.py b/models/backbone/backbone_3d/cnn_3d/shufflnetv2.py new file mode 100644 index 0000000..9fa6823 --- /dev/null +++ b/models/backbone/backbone_3d/cnn_3d/shufflnetv2.py @@ -0,0 +1,236 @@ +'''ShuffleNetV2 in PyTorch. + +See the paper "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design" for more details. 
+''' + +import torch +import torch.nn as nn +from torch.hub import load_state_dict_from_url + + +__all__ = ['resnext50', 'resnext101', 'resnet152'] + + +model_urls = { + "0.25x": "https://github.com/yjh0410/PyTorch_YOWO/releases/download/yowo-weight/kinetics_shufflenetv2_0.25x_RGB_16_best.pth", + "1.0x": "https://github.com/yjh0410/PyTorch_YOWO/releases/download/yowo-weight/kinetics_shufflenetv2_1.0x_RGB_16_best.pth", + "1.5x": "https://github.com/yjh0410/PyTorch_YOWO/releases/download/yowo-weight/kinetics_shufflenetv2_1.5x_RGB_16_best.pth", + "2.0x": "https://github.com/yjh0410/PyTorch_YOWO/releases/download/yowo-weight/kinetics_shufflenetv2_2.0x_RGB_16_best.pth", +} + + +# basic component +def conv_bn(inp, oup, stride): + return nn.Sequential( + nn.Conv3d(inp, oup, kernel_size=3, stride=stride, padding=(1,1,1), bias=False), + nn.BatchNorm3d(oup), + nn.ReLU(inplace=True) + ) + + +def conv_1x1x1_bn(inp, oup): + return nn.Sequential( + nn.Conv3d(inp, oup, 1, 1, 0, bias=False), + nn.BatchNorm3d(oup), + nn.ReLU(inplace=True) + ) + + +def channel_shuffle(x, groups): + '''Channel shuffle: [N,C,H,W] -> [N,g,C/g,H,W] -> [N,C/g,g,H,w] -> [N,C,H,W]''' + batchsize, num_channels, depth, height, width = x.data.size() + channels_per_group = num_channels // groups + # reshape + x = x.view(batchsize, groups, + channels_per_group, depth, height, width) + #permute + x = x.permute(0,2,1,3,4,5).contiguous() + # flatten + x = x.view(batchsize, num_channels, depth, height, width) + return x + + +class InvertedResidual(nn.Module): + def __init__(self, inp, oup, stride): + super(InvertedResidual, self).__init__() + self.stride = stride + assert stride in [1, 2] + + oup_inc = oup//2 + + if self.stride == 1: + self.banch2 = nn.Sequential( + # pw + nn.Conv3d(oup_inc, oup_inc, 1, 1, 0, bias=False), + nn.BatchNorm3d(oup_inc), + nn.ReLU(inplace=True), + # dw + nn.Conv3d(oup_inc, oup_inc, 3, stride, 1, groups=oup_inc, bias=False), + nn.BatchNorm3d(oup_inc), + # pw-linear + nn.Conv3d(oup_inc, oup_inc, 1, 1, 0, bias=False), + nn.BatchNorm3d(oup_inc), + nn.ReLU(inplace=True) + ) + + else: + self.banch1 = nn.Sequential( + # dw + nn.Conv3d(inp, inp, 3, stride, 1, groups=inp, bias=False), + nn.BatchNorm3d(inp), + # pw-linear + nn.Conv3d(inp, oup_inc, 1, 1, 0, bias=False), + nn.BatchNorm3d(oup_inc), + nn.ReLU(inplace=True) + ) + self.banch2 = nn.Sequential( + # pw + nn.Conv3d(inp, oup_inc, 1, 1, 0, bias=False), + nn.BatchNorm3d(oup_inc), + nn.ReLU(inplace=True), + # dw + nn.Conv3d(oup_inc, oup_inc, 3, stride, 1, groups=oup_inc, bias=False), + nn.BatchNorm3d(oup_inc), + # pw-linear + nn.Conv3d(oup_inc, oup_inc, 1, 1, 0, bias=False), + nn.BatchNorm3d(oup_inc), + nn.ReLU(inplace=True) + ) + + + @staticmethod + def _concat(x, out): + # concatenate along channel axis + return torch.cat((x, out), 1) + + + def forward(self, x): + if self.stride == 1: + x1 = x[:, :(x.shape[1]//2), :, :, :] + x2 = x[:, (x.shape[1]//2):, :, :, :] + out = self._concat(x1, self.banch2(x2)) + elif self.stride == 2: + out = self._concat(self.banch1(x), self.banch2(x)) + + return channel_shuffle(out, 2) + + +# ShuffleNet-v2 +class ShuffleNetV2(nn.Module): + def __init__(self, width_mult='1.0x', num_classes=600): + super(ShuffleNetV2, self).__init__() + + self.stage_repeats = [4, 8, 4] + # index 0 is invalid and should never be called. + # only used for indexing convenience. 
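
A toy example of what `channel_shuffle` above does (the tensor below is fabricated purely to make the reordering visible):

```python
import torch

# Four channels [0, 1, 2, 3] with groups=2 come out interleaved as [0, 2, 1, 3],
# which is how channel_shuffle mixes the two branches after each InvertedResidual.
x = torch.arange(4.).view(1, 4, 1, 1, 1)               # [N, C, D, H, W]
n, c, d, h, w = x.shape
g = 2
y = x.view(n, g, c // g, d, h, w).permute(0, 2, 1, 3, 4, 5).contiguous()
y = y.view(n, c, d, h, w)
print(y.flatten().tolist())                             # [0.0, 2.0, 1.0, 3.0]
```
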
+ if width_mult == '0.25x': + self.stage_out_channels = [-1, 24, 32, 64, 128] + elif width_mult == '0.5x': + self.stage_out_channels = [-1, 24, 48, 96, 192] + elif width_mult == '1.0x': + self.stage_out_channels = [-1, 24, 116, 232, 464] + elif width_mult == '1.5x': + self.stage_out_channels = [-1, 24, 176, 352, 704] + elif width_mult == '2.0x': + self.stage_out_channels = [-1, 24, 224, 488, 976] + + # building first layer + input_channel = self.stage_out_channels[1] + self.conv1 = conv_bn(3, input_channel, stride=(1,2,2)) + self.maxpool = nn.MaxPool3d(kernel_size=3, stride=2, padding=1) + + self.features = [] + # building inverted residual blocks + for idxstage in range(len(self.stage_repeats)): + numrepeat = self.stage_repeats[idxstage] + output_channel = self.stage_out_channels[idxstage+2] + for i in range(numrepeat): + stride = 2 if i == 0 else 1 + self.features.append(InvertedResidual(input_channel, output_channel, stride)) + input_channel = output_channel + + # make it nn.Sequential + self.features = nn.Sequential(*self.features) + + # # building last several layers + # self.conv_last = conv_1x1x1_bn(input_channel, self.stage_out_channels[-1]) + # self.avgpool = nn.AvgPool3d((2, 1, 1), stride=1) + + + def forward(self, x): + x = self.conv1(x) + x = self.maxpool(x) + x = self.features(x) + # out = self.conv_last(out) + + if x.size(2) > 1: + x = torch.mean(x, dim=2, keepdim=True) + + return x.squeeze(2) + + +def load_weight(model, arch): + print('Loading pretrained weight ...') + url = model_urls[arch] + # check + if url is None: + print('No pretrained weight for 3D CNN: {}'.format(arch.upper())) + return model + + print('Loading 3D backbone pretrained weight: {}'.format(arch.upper())) + # checkpoint state dict + checkpoint = load_state_dict_from_url(url=url, map_location="cpu", check_hash=True) + checkpoint_state_dict = checkpoint.pop('state_dict') + + # model state dict + model_state_dict = model.state_dict() + # reformat checkpoint_state_dict: + new_state_dict = {} + for k in checkpoint_state_dict.keys(): + v = checkpoint_state_dict[k] + new_state_dict[k[7:]] = v + + # check + for k in list(new_state_dict.keys()): + if k in model_state_dict: + shape_model = tuple(model_state_dict[k].shape) + shape_checkpoint = tuple(new_state_dict[k].shape) + if shape_model != shape_checkpoint: + new_state_dict.pop(k) + print(k) + else: + new_state_dict.pop(k) + print(k) + + model.load_state_dict(new_state_dict) + + return model + + +# build 3D shufflenet_v2 +def build_shufflenetv2_3d(model_size='0.25x', pretrained=False): + model = ShuffleNetV2(model_size) + feats = model.stage_out_channels[-1] + + if pretrained: + model = load_weight(model, model_size) + + return model, feats + + +if __name__ == '__main__': + import time + model, feat = build_shufflenetv2_3d(model_size='1.0x', pretrained=True) + if torch.cuda.is_available(): + device = torch.device("cuda") + else: + device = torch.device("cpu") + model = model.to(device) + + # [B, C, T, H, W] + x = torch.randn(1, 3, 16, 64, 64).to(device) + # star time + t0 = time.time() + out = model(x) + print('time', time.time() - t0) + print(out.shape) diff --git a/models/basic/__init__.py b/models/basic/__init__.py new file mode 100644 index 0000000..14f8665 --- /dev/null +++ b/models/basic/__init__.py @@ -0,0 +1 @@ +from .conv import Conv2d \ No newline at end of file diff --git a/models/basic/conv.py b/models/basic/conv.py new file mode 100644 index 0000000..00e986c --- /dev/null +++ b/models/basic/conv.py @@ -0,0 +1,127 @@ +import torch.nn as nn + + +def 
get_activation(act_type=None): + if act_type == 'relu': + return nn.ReLU(inplace=True) + elif act_type == 'lrelu': + return nn.LeakyReLU(0.1, inplace=True) + elif act_type == 'mish': + return nn.Mish(inplace=True) + elif act_type == 'silu': + return nn.SiLU(inplace=True) + + +# 2D Conv +def get_conv2d(c1, c2, k, p, s, d, g, bias=False): + conv = nn.Conv2d(c1, c2, k, stride=s, padding=p, dilation=d, groups=g, bias=bias) + return conv + + +def get_norm2d(norm_type, dim): + if norm_type == 'BN': + return nn.BatchNorm2d(dim) + elif norm_type == 'IN': + return nn.InstanceNorm2d(dim) + + +class Conv2d(nn.Module): + def __init__(self, + c1, # in channels + c2, # out channels + k=1, # kernel size + p=0, # padding + s=1, # padding + d=1, # dilation + g=1, + act_type='', # activation + norm_type='', # normalization + depthwise=False): + super(Conv2d, self).__init__() + convs = [] + add_bias = False if norm_type else True + if depthwise: + assert c1 == c2, "In depthwise conv, the in_dim (c1) should be equal to out_dim (c2)." + convs.append(get_conv2d(c1, c2, k=k, p=p, s=s, d=d, g=c1, bias=add_bias)) + # depthwise conv + if norm_type: + convs.append(get_norm2d(norm_type, c2)) + if act_type: + convs.append(get_activation(act_type)) + # pointwise conv + convs.append(get_conv2d(c1, c2, k=1, p=0, s=1, d=d, g=1, bias=add_bias)) + if norm_type: + convs.append(get_norm2d(norm_type, c2)) + if act_type: + convs.append(get_activation(act_type)) + + else: + convs.append(get_conv2d(c1, c2, k=k, p=p, s=s, d=d, g=g, bias=add_bias)) + if norm_type: + convs.append(get_norm2d(norm_type, c2)) + if act_type: + convs.append(get_activation(act_type)) + + self.convs = nn.Sequential(*convs) + + + def forward(self, x): + return self.convs(x) + + +# 3D Conv +def get_conv3d(c1, c2, k, p, s, d, g, bias=False): + conv = nn.Conv3d(c1, c2, k, stride=s, padding=p, dilation=d, groups=g, bias=bias) + return conv + + +def get_norm3d(norm_type, dim): + if norm_type == 'BN': + return nn.BatchNorm3d(dim) + elif norm_type == 'IN': + return nn.InstanceNorm3d(dim) + + +class Conv3d(nn.Module): + def __init__(self, + c1, # in channels + c2, # out channels + k=1, # kernel size + p=0, # padding + s=1, # padding + d=1, # dilation + g=1, + act_type='', # activation + norm_type='', # normalization + depthwise=False): + super(Conv3d, self).__init__() + convs = [] + add_bias = False if norm_type else True + if depthwise: + assert c1 == c2, "In depthwise conv, the in_dim (c1) should be equal to out_dim (c2)." 
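
A short usage sketch of the `Conv2d` wrapper defined above (channel sizes and the activation/norm choice are arbitrary examples):

```python
import torch
from models.basic.conv import Conv2d

# With depthwise=True the wrapper expands into a 3x3 grouped (g=c1) conv plus a
# 1x1 pointwise conv, each followed by the chosen norm/activation, so c1 must
# equal c2.
m = Conv2d(64, 64, k=3, p=1, s=1, act_type='silu', norm_type='BN', depthwise=True)
x = torch.randn(2, 64, 28, 28)
print(m(x).shape)        # torch.Size([2, 64, 28, 28])
```
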
+ convs.append(get_conv3d(c1, c2, k=k, p=p, s=s, d=d, g=c1, bias=add_bias)) + # depthwise conv + if norm_type: + convs.append(get_norm3d(norm_type, c2)) + if act_type: + convs.append(get_activation(act_type)) + # pointwise conv + convs.append(get_conv3d(c1, c2, k=1, p=0, s=1, d=d, g=1, bias=add_bias)) + if norm_type: + convs.append(get_norm3d(norm_type, c2)) + if act_type: + convs.append(get_activation(act_type)) + + else: + convs.append(get_conv3d(c1, c2, k=k, p=p, s=s, d=d, g=g, bias=add_bias)) + if norm_type: + convs.append(get_norm3d(norm_type, c2)) + if act_type: + convs.append(get_activation(act_type)) + + self.convs = nn.Sequential(*convs) + + + def forward(self, x): + return self.convs(x) + diff --git a/models/yowo/build.py b/models/yowo/build.py new file mode 100644 index 0000000..86cd8f2 --- /dev/null +++ b/models/yowo/build.py @@ -0,0 +1,55 @@ +import torch +from .yowo import YOWO +from .loss import build_criterion + + +# build YOWO detector +def build_yowo(args, + d_cfg, + m_cfg, + device, + num_classes=80, + trainable=False, + resume=None): + print('==============================') + print('Build {} ...'.format(args.version.upper())) + + # build YOWO + model = YOWO( + cfg = m_cfg, + device = device, + num_classes = num_classes, + conf_thresh = args.conf_thresh, + nms_thresh = args.nms_thresh, + topk = args.topk, + trainable = trainable, + multi_hot = d_cfg['multi_hot'], + ) + + if trainable: + # Freeze backbone + if args.freeze_backbone_2d: + print('Freeze 2D Backbone ...') + for m in model.backbone_2d.parameters(): + m.requires_grad = False + if args.freeze_backbone_3d: + print('Freeze 3D Backbone ...') + for m in model.backbone_3d.parameters(): + m.requires_grad = False + + # keep training + if resume is not None: + print('keep training: ', resume) + checkpoint = torch.load(resume, map_location='cpu') + # checkpoint state dict + checkpoint_state_dict = checkpoint.pop("model") + model.load_state_dict(checkpoint_state_dict) + + # build criterion + criterion = build_criterion( + args, d_cfg['train_size'], num_classes, d_cfg['multi_hot']) + + else: + criterion = None + + return model, criterion diff --git a/models/yowo/encoder.py b/models/yowo/encoder.py new file mode 100644 index 0000000..5924850 --- /dev/null +++ b/models/yowo/encoder.py @@ -0,0 +1,147 @@ +import torch +import torch.nn as nn +from ..basic.conv import Conv2d + + +# Channel Self Attetion Module +class CSAM(nn.Module): + """ Channel attention module """ + def __init__(self): + super(CSAM, self).__init__() + self.gamma = nn.Parameter(torch.zeros(1)) + self.softmax = nn.Softmax(dim=-1) + + + def forward(self, x): + """ + inputs : + x : input feature maps( B x C x H x W ) + returns : + out : attention value + input feature + attention: B x C x C + """ + B, C, H, W = x.size() + # query / key / value + query = x.view(B, C, -1) + key = x.view(B, C, -1).permute(0, 2, 1) + value = x.view(B, C, -1) + + # attention matrix + energy = torch.bmm(query, key) + energy_new = torch.max(energy, -1, keepdim=True)[0].expand_as(energy) - energy + attention = self.softmax(energy_new) + + # attention + out = torch.bmm(attention, value) + out = out.view(B, C, H, W) + + # output + out = self.gamma * out + x + + return out + + +# Spatial Self Attetion Module +class SSAM(nn.Module): + """ Channel attention module """ + def __init__(self): + super(SSAM, self).__init__() + self.gamma = nn.Parameter(torch.zeros(1)) + self.softmax = nn.Softmax(dim=-1) + + + def forward(self, x): + """ + inputs : + x : input feature maps( B x C x H x W ) + 
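
A tiny stand-in for the backbone-freezing logic in `build_yowo` above (the two-layer `nn.Sequential` is a hypothetical substitute for the real model; only the pattern is the point):

```python
import torch.nn as nn

# Setting requires_grad=False on a submodule's parameters keeps it fixed while
# the rest of the network still receives gradients. model[0] plays the role of
# model.backbone_2d here.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.Conv2d(8, 8, 3))
for p in model[0].parameters():
    p.requires_grad = False

print([n for n, p in model.named_parameters() if p.requires_grad])  # ['1.weight', '1.bias']
```
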
returns : + out : attention value + input feature + attention: B x C x C + """ + B, C, H, W = x.size() + # query / key / value + query = x.view(B, C, -1).permute(0, 2, 1) # [B, N, C] + key = x.view(B, C, -1) # [B, C, N] + value = x.view(B, C, -1).permute(0, 2, 1) # [B, N, C] + + # attention matrix + energy = torch.bmm(query, key) + energy_new = torch.max(energy, -1, keepdim=True)[0].expand_as(energy) - energy + attention = self.softmax(energy_new) + + # attention + out = torch.bmm(attention, value) + out = out.permute(0, 2, 1).contiguous().view(B, C, H, W) + + # output + out = self.gamma * out + x + + return out + + +# Channel Encoder +class ChannelEncoder(nn.Module): + def __init__(self, in_dim, out_dim, act_type='', norm_type=''): + super().__init__() + self.fuse_convs = nn.Sequential( + Conv2d(in_dim, out_dim, k=1, act_type=act_type, norm_type=norm_type), + Conv2d(out_dim, out_dim, k=3, p=1, act_type=act_type, norm_type=norm_type), + CSAM(), + Conv2d(out_dim, out_dim, k=3, p=1, act_type=act_type, norm_type=norm_type), + nn.Dropout(0.1, inplace=False), + nn.Conv2d(out_dim, out_dim, kernel_size=1) + ) + + def forward(self, x1, x2): + """ + x: [B, C, H, W] + """ + x = torch.cat([x1, x2], dim=1) + # [B, CN, H, W] -> [B, C, H, W] + x = self.fuse_convs(x) + + return x + + +# Spatial Encoder +class SpatialEncoder(nn.Module): + def __init__(self, in_dim, out_dim, act_type='', norm_type=''): + super().__init__() + self.fuse_convs = nn.Sequential( + Conv2d(in_dim, out_dim, k=1, act_type=act_type, norm_type=norm_type), + Conv2d(out_dim, out_dim, k=3, p=1, act_type=act_type, norm_type=norm_type), + SSAM(), + Conv2d(out_dim, out_dim, k=3, p=1, act_type=act_type, norm_type=norm_type), + nn.Dropout(0.1, inplace=False), + nn.Conv2d(out_dim, out_dim, kernel_size=1) + ) + + def forward(self, x): + """ + x: [B, C, H, W] + """ + x = self.fuse_convs(x) + + return x + + +def build_channel_encoder(cfg, in_dim, out_dim): + encoder = ChannelEncoder( + in_dim=in_dim, + out_dim=out_dim, + act_type=cfg['head_act'], + norm_type=cfg['head_norm'] + ) + + return encoder + + +def build_spatial_encoder(cfg, in_dim, out_dim): + encoder = SpatialEncoder( + in_dim=in_dim, + out_dim=out_dim, + act_type=cfg['head_act'], + norm_type=cfg['head_norm'] + ) + + return encoder diff --git a/models/yowo/head.py b/models/yowo/head.py new file mode 100644 index 0000000..893a6c6 --- /dev/null +++ b/models/yowo/head.py @@ -0,0 +1,47 @@ +import torch +import torch.nn as nn + +from ..basic.conv import Conv2d + + +class DecoupledHead(nn.Module): + def __init__(self, cfg): + super().__init__() + + print('==============================') + print('Head: Decoupled Head') + self.num_cls_heads = cfg['num_cls_heads'] + self.num_reg_heads = cfg['num_reg_heads'] + self.act_type = cfg['head_act'] + self.norm_type = cfg['head_norm'] + self.head_dim = cfg['head_dim'] + self.depthwise = cfg['head_depthwise'] + + self.cls_head = nn.Sequential(*[ + Conv2d(self.head_dim, + self.head_dim, + k=3, p=1, s=1, + act_type=self.act_type, + norm_type=self.norm_type, + depthwise=self.depthwise) + for _ in range(self.num_cls_heads)]) + self.reg_head = nn.Sequential(*[ + Conv2d(self.head_dim, + self.head_dim, + k=3, p=1, s=1, + act_type=self.act_type, + norm_type=self.norm_type, + depthwise=self.depthwise) + for _ in range(self.num_reg_heads)]) + + + def forward(self, cls_feat, reg_feat): + cls_feats = self.cls_head(cls_feat) + reg_feats = self.reg_head(reg_feat) + + return cls_feats, reg_feats + + +def build_head(cfg): + return DecoupledHead(cfg) + \ No newline at 
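
A hedged sketch of the channel encoder built above, with made-up channel sizes: it receives the 2D key-frame feature and the upsampled 3D clip feature, concatenates them along channels (so `in_dim = C_2d + C_3d`) and squeezes the result to the shared head dimension.

```python
import torch
from models.yowo.encoder import build_channel_encoder

cfg = {'head_act': 'silu', 'head_norm': 'BN'}        # keys read by build_channel_encoder
enc = build_channel_encoder(cfg, in_dim=256 + 512, out_dim=128)

feat_2d = torch.randn(1, 256, 28, 28)                # key-frame feature at one FPN level
feat_3d = torch.randn(1, 512, 28, 28)                # upsampled clip feature, same H, W
print(enc(feat_2d, feat_3d).shape)                   # torch.Size([1, 128, 28, 28])
```
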
end of file diff --git a/models/yowo/loss.py b/models/yowo/loss.py new file mode 100644 index 0000000..7cc5a6b --- /dev/null +++ b/models/yowo/loss.py @@ -0,0 +1,173 @@ +import torch +import torch.nn as nn +import torch.nn.functional as F +from .matcher import SimOTA +from utils.box_ops import get_ious +from utils.distributed_utils import get_world_size, is_dist_avail_and_initialized + + +class SigmoidFocalLoss(object): + def __init__(self, alpha=0.25, gamma=2.0, reduction='none'): + self.alpha = alpha + self.gamma = gamma + self.reduction = reduction + + def __call__(self, logits, targets): + p = torch.sigmoid(logits) + ce_loss = F.binary_cross_entropy_with_logits(input=logits, + target=targets, + reduction="none") + p_t = p * targets + (1.0 - p) * (1.0 - targets) + loss = ce_loss * ((1.0 - p_t) ** self.gamma) + + if self.alpha >= 0: + alpha_t = self.alpha * targets + (1.0 - self.alpha) * (1.0 - targets) + loss = alpha_t * loss + + if self.reduction == "mean": + loss = loss.mean() + + elif self.reduction == "sum": + loss = loss.sum() + + return loss + + +class Criterion(object): + def __init__(self, args, img_size, num_classes=80, multi_hot=False): + self.num_classes = num_classes + self.img_size = img_size + self.loss_conf_weight = args.loss_conf_weight + self.loss_cls_weight = args.loss_cls_weight + self.loss_reg_weight = args.loss_reg_weight + self.focal_loss = args.focal_loss + self.multi_hot = multi_hot + + # loss + self.obj_lossf = nn.BCEWithLogitsLoss(reduction='none') + self.cls_lossf = nn.BCEWithLogitsLoss(reduction='none') + + # matcher + self.matcher = SimOTA( + num_classes=num_classes, + center_sampling_radius=args.center_sampling_radius, + topk_candidate=args.topk_candicate + ) + + def __call__(self, outputs, targets): + """ + outputs['pred_conf']: List(Tensor) [B, M, 1] + outputs['pred_cls']: List(Tensor) [B, M, C] + outputs['pred_box']: List(Tensor) [B, M, 4] + outputs['strides']: List(Int) [8, 16, 32] output stride + targets: (List) [dict{'boxes': [...], + 'labels': [...], + 'orig_size': ...}, ...] 
+ """ + bs = outputs['pred_cls'][0].shape[0] + device = outputs['pred_cls'][0].device + fpn_strides = outputs['strides'] + anchors = outputs['anchors'] + # preds: [B, M, C] + conf_preds = torch.cat(outputs['pred_conf'], dim=1) + cls_preds = torch.cat(outputs['pred_cls'], dim=1) + box_preds = torch.cat(outputs['pred_box'], dim=1) + + # label assignment + cls_targets = [] + box_targets = [] + conf_targets = [] + fg_masks = [] + + for batch_idx in range(bs): + tgt_labels = targets[batch_idx]["labels"].to(device) + tgt_bboxes = targets[batch_idx]["boxes"].to(device) + + # denormalize tgt_bbox + tgt_bboxes *= self.img_size + + # check target + if len(tgt_labels) == 0 or tgt_bboxes.max().item() == 0.: + num_anchors = sum([ab.shape[0] for ab in anchors]) + # There is no valid gt + cls_target = conf_preds.new_zeros((0, self.num_classes)) + box_target = conf_preds.new_zeros((0, 4)) + conf_target = conf_preds.new_zeros((num_anchors, 1)) + fg_mask = conf_preds.new_zeros(num_anchors).bool() + else: + ( + gt_matched_classes, + fg_mask, + pred_ious_this_matching, + matched_gt_inds, + num_fg_img, + ) = self.matcher( + fpn_strides = fpn_strides, + anchors = anchors, + pred_conf = conf_preds[batch_idx], + pred_cls = cls_preds[batch_idx], + pred_box = box_preds[batch_idx], + tgt_labels = tgt_labels, + tgt_bboxes = tgt_bboxes, + ) + + conf_target = fg_mask.unsqueeze(-1) + box_target = tgt_bboxes[matched_gt_inds] + if self.multi_hot: + cls_target = gt_matched_classes.float() + else: + cls_target = F.one_hot(gt_matched_classes.long(), self.num_classes) + cls_target = cls_target * pred_ious_this_matching.unsqueeze(-1) + + cls_targets.append(cls_target) + box_targets.append(box_target) + conf_targets.append(conf_target) + fg_masks.append(fg_mask) + + cls_targets = torch.cat(cls_targets, 0) + box_targets = torch.cat(box_targets, 0) + conf_targets = torch.cat(conf_targets, 0) + fg_masks = torch.cat(fg_masks, 0) + num_foregrounds = fg_masks.sum() + + if is_dist_avail_and_initialized(): + torch.distributed.all_reduce(num_foregrounds) + num_foregrounds = (num_foregrounds / get_world_size()).clamp(1.0) + + # conf loss + loss_conf = self.obj_lossf(conf_preds.view(-1, 1), conf_targets.float()) + loss_conf = loss_conf.sum() / num_foregrounds + + # cls loss + matched_cls_preds = cls_preds.view(-1, self.num_classes)[fg_masks] + loss_cls = self.cls_lossf(matched_cls_preds, cls_targets) + loss_cls = loss_cls.sum() / num_foregrounds + + # box loss + matched_box_preds = box_preds.view(-1, 4)[fg_masks] + ious = get_ious(matched_box_preds, + box_targets, + box_mode="xyxy", + iou_type='giou') + loss_box = (1.0 - ious).sum() / num_foregrounds + + # total loss + losses = self.loss_conf_weight * loss_conf + \ + self.loss_cls_weight * loss_cls + \ + self.loss_reg_weight * loss_box + + loss_dict = dict( + loss_conf = loss_conf, + loss_cls = loss_cls, + loss_box = loss_box, + losses = losses + ) + + return loss_dict + + +def build_criterion(args, img_size, num_classes, multi_hot=False): + criterion = Criterion(args, img_size, num_classes, multi_hot) + + return criterion + \ No newline at end of file diff --git a/models/yowo/matcher.py b/models/yowo/matcher.py new file mode 100644 index 0000000..5fb7e06 --- /dev/null +++ b/models/yowo/matcher.py @@ -0,0 +1,200 @@ +import torch +import torch.nn.functional as F +from utils.box_ops import * + + + +# SimOTA +class SimOTA(object): + def __init__(self, num_classes, center_sampling_radius, topk_candidate): + self.num_classes = num_classes + self.center_sampling_radius = center_sampling_radius 
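
Some toy numbers for the `SigmoidFocalLoss` defined in `loss.py` above (the logits are fabricated so that the sigmoids land near 0.9 and 0.1):

```python
import torch
import torch.nn.functional as F

# With alpha=0.25, gamma=2 an easy positive (p ~ 0.9) is down-weighted by
# (1 - p_t)^2 far more strongly than a hard positive (p ~ 0.1).
alpha, gamma = 0.25, 2.0
logits = torch.tensor([2.1972, -2.1972])            # sigmoid -> ~0.9, ~0.1
targets = torch.tensor([1.0, 1.0])

p = torch.sigmoid(logits)
ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
p_t = p * targets + (1.0 - p) * (1.0 - targets)
loss = alpha * (1.0 - p_t) ** gamma * ce             # alpha_t = alpha for positives
print(loss)                                          # approx. [0.0003, 0.4663]
```
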
+ self.topk_candidate = topk_candidate + + + @torch.no_grad() + def __call__(self, + fpn_strides, + anchors, + pred_conf, + pred_cls, + pred_box, + tgt_labels, + tgt_bboxes): + # [M,] + strides = torch.cat([torch.ones_like(anchor_i[:, 0]) * stride_i + for stride_i, anchor_i in zip(fpn_strides, anchors)], dim=-1) + # List[F, M, 2] -> [M, 2] + anchors = torch.cat(anchors, dim=0) + num_anchor = anchors.shape[0] + num_gt = len(tgt_labels) + + fg_mask, is_in_boxes_and_center = \ + self.get_in_boxes_info( + tgt_bboxes, + anchors, + strides, + num_anchor, + num_gt + ) + + conf_preds_ = pred_conf[fg_mask] # [Mp, 1] + cls_preds_ = pred_cls[fg_mask] # [Mp, C] + box_preds_ = pred_box[fg_mask] # [Mp, 4] + num_in_boxes_anchor = box_preds_.shape[0] + + # [N, Mp] + pair_wise_ious, _ = box_iou(tgt_bboxes, box_preds_) + pair_wise_ious_loss = -torch.log(pair_wise_ious + 1e-8) + + if len(tgt_labels.shape) == 1: + gt_cls = F.one_hot(tgt_labels.long(), self.num_classes) + elif len(tgt_labels.shape) == 2: + gt_cls = tgt_labels + + # [N, C] -> [N, Mp, C] + gt_cls = gt_cls.float().unsqueeze(1).repeat(1, num_in_boxes_anchor, 1) + + with torch.cuda.amp.autocast(enabled=False): + score_preds_ = torch.sqrt( + cls_preds_.float().unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_() + * conf_preds_.float().unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_() + ) # [N, Mp, C] + pair_wise_cls_loss = F.binary_cross_entropy( + score_preds_, gt_cls, reduction="none" + ).sum(-1) # [N, Mp] + del score_preds_ + + cost = ( + pair_wise_cls_loss + + 3.0 * pair_wise_ious_loss + + 100000.0 * (~is_in_boxes_and_center) + ) # [N, Mp] + + ( + num_fg, + gt_matched_classes, # [num_fg,] + pred_ious_this_matching, # [num_fg,] + matched_gt_inds, # [num_fg,] + ) = self.dynamic_k_matching( + cost, + pair_wise_ious, + tgt_labels, + num_gt, + fg_mask + ) + del pair_wise_cls_loss, cost, pair_wise_ious, pair_wise_ious_loss + + return ( + gt_matched_classes, + fg_mask, + pred_ious_this_matching, + matched_gt_inds, + num_fg, + ) + + + def get_in_boxes_info( + self, + gt_bboxes, # [N, 4] + anchors, # [M, 2] + strides, # [M,] + num_anchors, # M + num_gt, # N + ): + # anchor center + x_centers = anchors[:, 0] + y_centers = anchors[:, 1] + + # [M,] -> [1, M] -> [N, M] + x_centers = x_centers.unsqueeze(0).repeat(num_gt, 1) + y_centers = y_centers.unsqueeze(0).repeat(num_gt, 1) + + # [N,] -> [N, 1] -> [N, M] + gt_bboxes_l = gt_bboxes[:, 0].unsqueeze(1).repeat(1, num_anchors) # x1 + gt_bboxes_t = gt_bboxes[:, 1].unsqueeze(1).repeat(1, num_anchors) # y1 + gt_bboxes_r = gt_bboxes[:, 2].unsqueeze(1).repeat(1, num_anchors) # x2 + gt_bboxes_b = gt_bboxes[:, 3].unsqueeze(1).repeat(1, num_anchors) # y2 + + b_l = x_centers - gt_bboxes_l + b_r = gt_bboxes_r - x_centers + b_t = y_centers - gt_bboxes_t + b_b = gt_bboxes_b - y_centers + bbox_deltas = torch.stack([b_l, b_t, b_r, b_b], 2) + + is_in_boxes = bbox_deltas.min(dim=-1).values > 0.0 + is_in_boxes_all = is_in_boxes.sum(dim=0) > 0 + # in fixed center + center_radius = self.center_sampling_radius + + # [N, 2] + gt_centers = (gt_bboxes[:, :2] + gt_bboxes[:, 2:]) * 0.5 + + # [1, M] + center_radius_ = center_radius * strides.unsqueeze(0) + + gt_bboxes_l = gt_centers[:, 0].unsqueeze(1).repeat(1, num_anchors) - center_radius_ # x1 + gt_bboxes_t = gt_centers[:, 1].unsqueeze(1).repeat(1, num_anchors) - center_radius_ # y1 + gt_bboxes_r = gt_centers[:, 0].unsqueeze(1).repeat(1, num_anchors) + center_radius_ # x2 + gt_bboxes_b = gt_centers[:, 1].unsqueeze(1).repeat(1, num_anchors) + center_radius_ # y2 + + c_l = x_centers - gt_bboxes_l + 
c_r = gt_bboxes_r - x_centers + c_t = y_centers - gt_bboxes_t + c_b = gt_bboxes_b - y_centers + center_deltas = torch.stack([c_l, c_t, c_r, c_b], 2) + is_in_centers = center_deltas.min(dim=-1).values > 0.0 + is_in_centers_all = is_in_centers.sum(dim=0) > 0 + + # in boxes and in centers + is_in_boxes_anchor = is_in_boxes_all | is_in_centers_all + + is_in_boxes_and_center = ( + is_in_boxes[:, is_in_boxes_anchor] & is_in_centers[:, is_in_boxes_anchor] + ) + return is_in_boxes_anchor, is_in_boxes_and_center + + + def dynamic_k_matching( + self, + cost, + pair_wise_ious, + gt_classes, + num_gt, + fg_mask + ): + # Dynamic K + # --------------------------------------------------------------- + matching_matrix = torch.zeros_like(cost, dtype=torch.uint8) + + ious_in_boxes_matrix = pair_wise_ious + n_candidate_k = min(self.topk_candidate, ious_in_boxes_matrix.size(1)) + topk_ious, _ = torch.topk(ious_in_boxes_matrix, n_candidate_k, dim=1) + dynamic_ks = torch.clamp(topk_ious.sum(1).int(), min=1) + dynamic_ks = dynamic_ks.tolist() + for gt_idx in range(num_gt): + _, pos_idx = torch.topk( + cost[gt_idx], k=dynamic_ks[gt_idx], largest=False + ) + matching_matrix[gt_idx][pos_idx] = 1 + + del topk_ious, dynamic_ks, pos_idx + + anchor_matching_gt = matching_matrix.sum(0) + if (anchor_matching_gt > 1).sum() > 0: + _, cost_argmin = torch.min(cost[:, anchor_matching_gt > 1], dim=0) + matching_matrix[:, anchor_matching_gt > 1] *= 0 + matching_matrix[cost_argmin, anchor_matching_gt > 1] = 1 + fg_mask_inboxes = matching_matrix.sum(0) > 0 + num_fg = fg_mask_inboxes.sum().item() + + fg_mask[fg_mask.clone()] = fg_mask_inboxes + + matched_gt_inds = matching_matrix[:, fg_mask_inboxes].argmax(0) + gt_matched_classes = gt_classes[matched_gt_inds] + + pred_ious_this_matching = (matching_matrix * pair_wise_ious).sum(0)[ + fg_mask_inboxes + ] + return num_fg, gt_matched_classes, pred_ious_this_matching, matched_gt_inds + \ No newline at end of file diff --git a/models/yowo/yowo.py b/models/yowo/yowo.py new file mode 100644 index 0000000..d233c6c --- /dev/null +++ b/models/yowo/yowo.py @@ -0,0 +1,421 @@ +import numpy as np +import torch +import torch.nn as nn +import torch.nn.functional as F + +from ..backbone import build_backbone_2d +from ..backbone import build_backbone_3d +from .encoder import build_channel_encoder +from .head import build_head + +from utils.nms import multiclass_nms + + +# You Only Watch Once +class YOWO(nn.Module): + def __init__(self, + cfg, + device, + num_classes = 20, + conf_thresh = 0.05, + nms_thresh = 0.6, + topk = 50, + trainable = False, + multi_hot = False): + super(YOWO, self).__init__() + self.cfg = cfg + self.device = device + self.stride = cfg['stride'] + self.num_classes = num_classes + self.trainable = trainable + self.conf_thresh = conf_thresh + self.nms_thresh = nms_thresh + self.topk = topk + self.multi_hot = multi_hot + + # ------------------ Network --------------------- + ## 2D backbone + self.backbone_2d, bk_dim_2d = build_backbone_2d( + cfg, pretrained=cfg['pretrained_2d'] and trainable) + + ## 3D backbone + self.backbone_3d, bk_dim_3d = build_backbone_3d( + cfg, pretrained=cfg['pretrained_3d'] and trainable) + + ## cls channel encoder + self.cls_channel_encoders = nn.ModuleList( + [build_channel_encoder(cfg, bk_dim_2d[i]+bk_dim_3d, cfg['head_dim']) + for i in range(len(cfg['stride']))]) + + ## reg channel & spatial encoder + self.reg_channel_encoders = nn.ModuleList( + [build_channel_encoder(cfg, bk_dim_2d[i]+bk_dim_3d, cfg['head_dim']) + for i in range(len(cfg['stride']))]) 
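
A toy illustration of the dynamic-k rule used in `dynamic_k_matching` above (the IoU values are invented; the candidate count 10 is just an example, the real value comes from the matcher's `topk_candidate`):

```python
import torch

# Each GT keeps k = clamp(int(sum of its top candidate IoUs), min=1) of its
# lowest-cost anchors.
pair_wise_ious = torch.tensor([
    [0.82, 0.75, 0.60, 0.10, 0.05],   # GT 0: strong overlaps -> k = 2
    [0.15, 0.08, 0.05, 0.02, 0.01],   # GT 1: weak overlaps   -> clamped to k = 1
])
n_candidate_k = min(10, pair_wise_ious.size(1))
topk_ious, _ = torch.topk(pair_wise_ious, n_candidate_k, dim=1)
dynamic_ks = torch.clamp(topk_ious.sum(1).int(), min=1)
print(dynamic_ks.tolist())            # [2, 1]
```
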
+ + ## head + self.heads = nn.ModuleList( + [build_head(cfg) for _ in range(len(cfg['stride']))] + ) + + ## pred + head_dim = cfg['head_dim'] + self.conf_preds = nn.ModuleList( + [nn.Conv2d(head_dim, 1, kernel_size=1) + for _ in range(len(cfg['stride'])) + ]) + self.cls_preds = nn.ModuleList( + [nn.Conv2d(head_dim, self.num_classes, kernel_size=1) + for _ in range(len(cfg['stride'])) + ]) + self.reg_preds = nn.ModuleList( + [nn.Conv2d(head_dim, 4, kernel_size=1) + for _ in range(len(cfg['stride'])) + ]) + + # init yowo + self.init_yowo() + + + def init_yowo(self): + # Init yolo + for m in self.modules(): + if isinstance(m, nn.BatchNorm2d): + m.eps = 1e-3 + m.momentum = 0.03 + + # Init bias + init_prob = 0.01 + bias_value = -torch.log(torch.tensor((1. - init_prob) / init_prob)) + # obj pred + for conf_pred in self.conf_preds: + b = conf_pred.bias.view(1, -1) + b.data.fill_(bias_value.item()) + conf_pred.bias = torch.nn.Parameter(b.view(-1), requires_grad=True) + # cls pred + for cls_pred in self.cls_preds: + b = cls_pred.bias.view(1, -1) + b.data.fill_(bias_value.item()) + cls_pred.bias = torch.nn.Parameter(b.view(-1), requires_grad=True) + + + def generate_anchors(self, fmp_size, stride): + """ + fmp_size: (List) [H, W] + """ + # generate grid cells + fmp_h, fmp_w = fmp_size + anchor_y, anchor_x = torch.meshgrid([torch.arange(fmp_h), torch.arange(fmp_w)]) + # [H, W, 2] -> [HW, 2] + anchor_xy = torch.stack([anchor_x, anchor_y], dim=-1).float().view(-1, 2) + 0.5 + anchor_xy *= stride + anchors = anchor_xy.to(self.device) + + return anchors + + + def decode_boxes(self, anchors, pred_reg, stride): + """ + anchors: (List[Tensor]) [1, M, 2] or [M, 2] + pred_reg: (List[Tensor]) [B, M, 4] or [B, M, 4] + """ + # center of bbox + pred_ctr_xy = anchors + pred_reg[..., :2] * stride + # size of bbox + pred_box_wh = pred_reg[..., 2:].exp() * stride + + pred_x1y1 = pred_ctr_xy - 0.5 * pred_box_wh + pred_x2y2 = pred_ctr_xy + 0.5 * pred_box_wh + pred_box = torch.cat([pred_x1y1, pred_x2y2], dim=-1) + + return pred_box + + + def post_process_one_hot(self, conf_preds, cls_preds, reg_preds, anchors): + """ + Input: + conf_preds: (Tensor) [H x W, 1] + cls_preds: (Tensor) [H x W, C] + reg_preds: (Tensor) [H x W, 4] + """ + + all_scores = [] + all_labels = [] + all_bboxes = [] + + for level, (conf_pred_i, cls_pred_i, reg_pred_i, anchors_i) in enumerate(zip(conf_preds, cls_preds, reg_preds, anchors)): + # (H x W x C,) + scores_i = (torch.sqrt(conf_pred_i.sigmoid() * cls_pred_i.sigmoid())).flatten() + + # Keep top k top scoring indices only. 
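
A worked example of `generate_anchors` plus `decode_boxes` above at stride 8 (the regression offsets are made-up numbers):

```python
import torch

# The anchor for grid cell (x=3, y=2) sits at ((3 + 0.5) * 8, (2 + 0.5) * 8) = (28, 20);
# the regression output moves the center by reg_xy * stride and sets the size
# to exp(reg_wh) * stride.
stride = 8
anchors = torch.tensor([[28.0, 20.0]])
pred_reg = torch.tensor([[0.25, -0.5, 1.0, 1.5]])

ctr = anchors + pred_reg[..., :2] * stride        # (30.0, 16.0)
wh = pred_reg[..., 2:].exp() * stride             # (~21.75, ~35.85)
box = torch.cat([ctr - 0.5 * wh, ctr + 0.5 * wh], dim=-1)
print(box)                                        # approx. [[19.13, -1.93, 40.87, 33.93]]
```
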
+ num_topk = min(self.topk, reg_pred_i.size(0)) + + # torch.sort is actually faster than .topk (at least on GPUs) + predicted_prob, topk_idxs = scores_i.sort(descending=True) + topk_scores = predicted_prob[:num_topk] + topk_idxs = topk_idxs[:num_topk] + + # filter out the proposals with low confidence score + keep_idxs = topk_scores > self.conf_thresh + scores = topk_scores[keep_idxs] + topk_idxs = topk_idxs[keep_idxs] + + anchor_idxs = torch.div(topk_idxs, self.num_classes, rounding_mode='floor') + labels = topk_idxs % self.num_classes + + reg_pred_i = reg_pred_i[anchor_idxs] + anchors_i = anchors_i[anchor_idxs] + + # decode box: [M, 4] + bboxes = self.decode_boxes(anchors_i, reg_pred_i, self.stride[level]) + + all_scores.append(scores) + all_labels.append(labels) + all_bboxes.append(bboxes) + + scores = torch.cat(all_scores) + labels = torch.cat(all_labels) + bboxes = torch.cat(all_bboxes) + + # to cpu + scores = scores.cpu().numpy() + labels = labels.cpu().numpy() + bboxes = bboxes.cpu().numpy() + + # nms + scores, labels, bboxes = multiclass_nms( + scores, labels, bboxes, self.nms_thresh, self.num_classes, False) + + return scores, labels, bboxes + + + def post_process_multi_hot(self, conf_preds, cls_preds, reg_preds, anchors): + """ + Input: + cls_pred: (Tensor) [H x W, C] + reg_pred: (Tensor) [H x W, 4] + """ + all_conf_preds = [] + all_cls_preds = [] + all_box_preds = [] + for level, (conf_pred_i, cls_pred_i, reg_pred_i, anchors_i) in enumerate(zip(conf_preds, cls_preds, reg_preds, anchors)): + # decode box + box_pred_i = self.decode_boxes(anchors_i, reg_pred_i, self.stride[level]) + + # conf pred + conf_pred_i = torch.sigmoid(conf_pred_i.squeeze(-1)) # [M,] + + # cls_pred + cls_pred_i = torch.sigmoid(cls_pred_i) # [M, C] + + # topk + topk_conf_pred_i, topk_inds = torch.topk(conf_pred_i, self.topk) + topk_cls_pred_i = cls_pred_i[topk_inds] + topk_box_pred_i = box_pred_i[topk_inds] + + # threshold + keep = topk_conf_pred_i.gt(self.conf_thresh) + topk_conf_pred_i = topk_conf_pred_i[keep] + topk_cls_pred_i = topk_cls_pred_i[keep] + topk_box_pred_i = topk_box_pred_i[keep] + + all_conf_preds.append(topk_conf_pred_i) + all_cls_preds.append(topk_cls_pred_i) + all_box_preds.append(topk_box_pred_i) + + # concatenate + conf_preds = torch.cat(all_conf_preds, dim=0) # [M,] + cls_preds = torch.cat(all_cls_preds, dim=0) # [M, C] + box_preds = torch.cat(all_box_preds, dim=0) # [M, 4] + + # to cpu + scores = conf_preds.cpu().numpy() + labels = cls_preds.cpu().numpy() + bboxes = box_preds.cpu().numpy() + + # nms + scores, labels, bboxes = multiclass_nms( + scores, labels, bboxes, self.nms_thresh, self.num_classes, True) + + # [M, 5 + C] + out_boxes = np.concatenate([bboxes, scores[..., None], labels], axis=-1) + + return out_boxes + + + @torch.no_grad() + def inference(self, video_clips): + """ + Input: + video_clips: (Tensor) -> [B, 3, T, H, W]. 
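
A toy version of the index arithmetic in `post_process_one_hot` above: the fused scores form one flattened `[H*W * C]` vector, so a flat top-k index decodes to `(anchor = idx // C, class = idx % C)`. The scores below are fabricated.

```python
import torch

C = 24                                   # e.g. the UCF24 class count
scores = torch.zeros(49 * C)             # one 7x7 level, flattened
scores[10 * C + 5] = 0.9                 # anchor 10, class 5
scores[33 * C + 2] = 0.7                 # anchor 33, class 2

_, topk_idxs = scores.sort(descending=True)
topk_idxs = topk_idxs[:2]
anchor_idxs = torch.div(topk_idxs, C, rounding_mode='floor')
labels = topk_idxs % C
print(anchor_idxs.tolist(), labels.tolist())   # [10, 33] [5, 2]
```
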
+ return: + """ + B, _, _, img_h, img_w = video_clips.shape + + # key frame + key_frame = video_clips[:, :, -1, :, :] + # 3D backbone + feat_3d = self.backbone_3d(video_clips) + + # 2D backbone + cls_feats, reg_feats = self.backbone_2d(key_frame) + + # non-shared heads + all_conf_preds = [] + all_cls_preds = [] + all_reg_preds = [] + all_anchors = [] + for level, (cls_feat, reg_feat) in enumerate(zip(cls_feats, reg_feats)): + # upsample + feat_3d_up = F.interpolate(feat_3d, scale_factor=2 ** (2 - level)) + + # encoder + cls_feat = self.cls_channel_encoders[level](cls_feat, feat_3d_up) + reg_feat = self.reg_channel_encoders[level](reg_feat, feat_3d_up) + + # head + cls_feat, reg_feat = self.heads[level](cls_feat, reg_feat) + + # pred + conf_pred = self.conf_preds[level](reg_feat) + cls_pred = self.cls_preds[level](cls_feat) + reg_pred = self.reg_preds[level](reg_feat) + + # generate anchors + fmp_size = conf_pred.shape[-2:] + anchors = self.generate_anchors(fmp_size, self.stride[level]) + + # [B, C, H, W] -> [B, H, W, C] -> [B, M, C], M = HW + conf_pred = conf_pred.permute(0, 2, 3, 1).contiguous().view(B, -1, 1) + cls_pred = cls_pred.permute(0, 2, 3, 1).contiguous().view(B, -1, self.num_classes) + reg_pred = reg_pred.permute(0, 2, 3, 1).contiguous().view(B, -1, 4) + + all_conf_preds.append(conf_pred) + all_cls_preds.append(cls_pred) + all_reg_preds.append(reg_pred) + all_anchors.append(anchors) + + # batch process + if self.multi_hot: + batch_bboxes = [] + for batch_idx in range(video_clips.size(0)): + cur_conf_preds = [] + cur_cls_preds = [] + cur_reg_preds = [] + for conf_preds, cls_preds, reg_preds in zip(all_conf_preds, all_cls_preds, all_reg_preds): + # [B, M, C] -> [M, C] + cur_conf_preds.append(conf_preds[batch_idx]) + cur_cls_preds.append(cls_preds[batch_idx]) + cur_reg_preds.append(reg_preds[batch_idx]) + + # post-process + out_boxes = self.post_process_multi_hot( + cur_conf_preds, cur_cls_preds, cur_reg_preds, all_anchors) + + # normalize bbox + out_boxes[..., :4] /= max(img_h, img_w) + out_boxes[..., :4] = out_boxes[..., :4].clip(0., 1.) + + batch_bboxes.append(out_boxes) + + return batch_bboxes + + else: + batch_scores = [] + batch_labels = [] + batch_bboxes = [] + for batch_idx in range(conf_pred.size(0)): + # [B, M, C] -> [M, C] + cur_conf_preds = [] + cur_cls_preds = [] + cur_reg_preds = [] + for conf_preds, cls_preds, reg_preds in zip(all_conf_preds, all_cls_preds, all_reg_preds): + # [B, M, C] -> [M, C] + cur_conf_preds.append(conf_preds[batch_idx]) + cur_cls_preds.append(cls_preds[batch_idx]) + cur_reg_preds.append(reg_preds[batch_idx]) + + # post-process + scores, labels, bboxes = self.post_process_one_hot( + cur_conf_preds, cur_cls_preds, cur_reg_preds, all_anchors) + + # normalize bbox + bboxes /= max(img_h, img_w) + bboxes = bboxes.clip(0., 1.) + + batch_scores.append(scores) + batch_labels.append(labels) + batch_bboxes.append(bboxes) + + return batch_scores, batch_labels, batch_bboxes + + + def forward(self, video_clips): + """ + Input: + video_clips: (Tensor) -> [B, 3, T, H, W]. 
+ return: + outputs: (Dict) -> { + 'pred_conf': (Tensor) [B, M, 1] + 'pred_cls': (Tensor) [B, M, C] + 'pred_reg': (Tensor) [B, M, 4] + 'anchors': (Tensor) [M, 2] + 'stride': (Int) + } + """ + if not self.trainable: + return self.inference(video_clips) + else: + # key frame + key_frame = video_clips[:, :, -1, :, :] + # 3D backbone + feat_3d = self.backbone_3d(video_clips) + + # 2D backbone + cls_feats, reg_feats = self.backbone_2d(key_frame) + + # non-shared heads + all_conf_preds = [] + all_cls_preds = [] + all_box_preds = [] + all_anchors = [] + for level, (cls_feat, reg_feat) in enumerate(zip(cls_feats, reg_feats)): + # upsample + feat_3d_up = F.interpolate(feat_3d, scale_factor=2 ** (2 - level)) + + # encoder + cls_feat = self.cls_channel_encoders[level](cls_feat, feat_3d_up) + reg_feat = self.reg_channel_encoders[level](reg_feat, feat_3d_up) + + # head + cls_feat, reg_feat = self.heads[level](cls_feat, reg_feat) + + # pred + conf_pred = self.conf_preds[level](reg_feat) + cls_pred = self.cls_preds[level](cls_feat) + reg_pred = self.reg_preds[level](reg_feat) + + # generate anchors + fmp_size = conf_pred.shape[-2:] + anchors = self.generate_anchors(fmp_size, self.stride[level]) + + # [B, C, H, W] -> [B, H, W, C] -> [B, M, C] + conf_pred = conf_pred.permute(0, 2, 3, 1).contiguous().flatten(1, 2) + cls_pred = cls_pred.permute(0, 2, 3, 1).contiguous().flatten(1, 2) + reg_pred = reg_pred.permute(0, 2, 3, 1).contiguous().flatten(1, 2) + + # decode box: [M, 4] + box_pred = self.decode_boxes(anchors, reg_pred, self.stride[level]) + + all_conf_preds.append(conf_pred) + all_cls_preds.append(cls_pred) + all_box_preds.append(box_pred) + all_anchors.append(anchors) + + # output dict + outputs = {"pred_conf": all_conf_preds, # List(Tensor) [B, M, 1] + "pred_cls": all_cls_preds, # List(Tensor) [B, M, C] + "pred_box": all_box_preds, # List(Tensor) [B, M, 4] + "anchors": all_anchors, # List(Tensor) [B, M, 2] + "strides": self.stride} # List(Int) + + return outputs diff --git a/test.py b/test.py new file mode 100644 index 0000000..6908f17 --- /dev/null +++ b/test.py @@ -0,0 +1,287 @@ +import argparse +import cv2 +import os +import time +import numpy as np +import torch + +from dataset.ucf_jhmdb import UCF_JHMDB_Dataset +from dataset.ava import AVA_Dataset +from dataset.transforms import BaseTransform + +from utils.misc import load_weight +from utils.box_ops import rescale_bboxes +from utils.vis_tools import convert_tensor_to_cv2img, vis_detection + +from config import build_dataset_config, build_model_config +from models import build_model + + +def parse_args(): + parser = argparse.ArgumentParser(description='YOWOv2') + + # basic + parser.add_argument('-size', '--img_size', default=224, type=int, + help='the size of input frame') + parser.add_argument('--show', action='store_true', default=False, + help='show the visulization results.') + parser.add_argument('--cuda', action='store_true', default=False, + help='use cuda.') + parser.add_argument('--save', action='store_true', default=False, + help='save detection results.') + parser.add_argument('--save_folder', default='./det_results/', type=str, + help='Dir to save results') + parser.add_argument('-vs', '--vis_thresh', default=0.4, type=float, + help='threshold for visualization') + parser.add_argument('-sid', '--start_index', default=0, type=int, + help='start index to test.') + + # model + parser.add_argument('-v', '--version', default='yowo_v2_large', type=str, + help='build YOWOv2') + parser.add_argument('--weight', default=None, + type=str, 
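
A sketch, with fabricated shapes, of what the training-time `forward()` above hands to the criterion: three FPN levels at strides 8/16/32 on a 224x224 input give 784 + 196 + 49 = 1029 anchors, and `Criterion.__call__` concatenates the per-level predictions into single `[B, M, *]` tensors before label assignment.

```python
import torch

B, C = 2, 24
sizes = [(28, 28), (14, 14), (7, 7)]     # 224 / stride for strides 8, 16, 32
outputs = {
    'pred_conf': [torch.randn(B, h * w, 1) for h, w in sizes],
    'pred_cls':  [torch.randn(B, h * w, C) for h, w in sizes],
    'pred_box':  [torch.randn(B, h * w, 4) for h, w in sizes],
    'anchors':   [torch.randn(h * w, 2) for h, w in sizes],
    'strides':   [8, 16, 32],
}
conf_preds = torch.cat(outputs['pred_conf'], dim=1)
print(conf_preds.shape)                  # torch.Size([2, 1029, 1])
```
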
help='Trained state_dict file path to open') + parser.add_argument('-ct', '--conf_thresh', default=0.1, type=float, + help='confidence threshold') + parser.add_argument('-nt', '--nms_thresh', default=0.5, type=float, + help='NMS threshold') + parser.add_argument('--topk', default=40, type=int, + help='NMS threshold') + parser.add_argument('-K', '--len_clip', default=16, type=int, + help='video clip length.') + + # dataset + parser.add_argument('-d', '--dataset', default='ucf24', + help='ucf24, ava.') + parser.add_argument('--root', default='/mnt/share/ssd2/dataset/STAD/', + help='data root') + + return parser.parse_args() + + +@torch.no_grad() +def inference_ucf24_jhmdb21(args, model, device, dataset, class_names=None, class_colors=None): + # path to save + if args.save: + save_path = os.path.join( + args.save_folder, args.dataset, + args.version, 'video_clips') + os.makedirs(save_path, exist_ok=True) + + # inference + for index in range(args.start_index, len(dataset)): + print('Video clip {:d}/{:d}....'.format(index+1, len(dataset))) + frame_id, video_clip, target = dataset[index] + orig_size = target['orig_size'] # width, height + + # prepare + video_clip = video_clip.unsqueeze(0).to(device) # [B, 3, T, H, W], B=1 + + t0 = time.time() + # inference + batch_scores, batch_labels, batch_bboxes = model(video_clip) + print("inference time ", time.time() - t0, "s") + + # batch size = 1 + scores = batch_scores[0] + labels = batch_labels[0] + bboxes = batch_bboxes[0] + + # rescale + bboxes = rescale_bboxes(bboxes, orig_size) + + # vis results of key-frame + key_frame_tensor = video_clip[0, :, -1, :, :] + key_frame = convert_tensor_to_cv2img(key_frame_tensor) + + # resize key_frame to orig size + key_frame = cv2.resize(key_frame, orig_size) + + # visualize detection + vis_results = vis_detection( + frame=key_frame, + scores=scores, + labels=labels, + bboxes=bboxes, + vis_thresh=args.vis_thresh, + class_names=class_names, + class_colors=class_colors + ) + + if args.show: + cv2.imshow('key-frame detection', vis_results) + cv2.waitKey(0) + + if args.save: + # save result + cv2.imwrite(os.path.join(save_path, + '{:0>5}.jpg'.format(index)), vis_results) + + +@torch.no_grad() +def inference_ava(args, model, device, dataset, class_names=None, class_colors=None): + # path to save + if args.save: + save_path = os.path.join( + args.save_folder, args.dataset, + args.version, 'video_clips') + os.makedirs(save_path, exist_ok=True) + + # inference + for index in range(args.start_index, len(dataset)): + print('Video clip {:d}/{:d}....'.format(index+1, len(dataset))) + frame_id, video_clip, target = dataset[index] + orig_size = target['orig_size'] # width, height + + # prepare + video_clip = video_clip.unsqueeze(0).to(device) # [B, 3, T, H, W], B=1 + + t0 = time.time() + # inference + batch_bboxes = model(video_clip) + print("inference time ", time.time() - t0, "s") + + # vis results of key-frame + key_frame_tensor = video_clip[0, :, -1, :, :] + key_frame = convert_tensor_to_cv2img(key_frame_tensor) + + # resize key_frame to orig size + key_frame = cv2.resize(key_frame, orig_size) + + # batch size = 1 + bboxes = batch_bboxes[0] + + # visualize detection results + for bbox in bboxes: + x1, y1, x2, y2 = bbox[:4] + det_conf = float(bbox[4]) + cls_scores = np.sqrt(det_conf * bbox[5:]) + + # rescale bbox + x1, x2 = int(x1 * orig_size[0]), int(x2 * orig_size[0]) + y1, y2 = int(y1 * orig_size[1]), int(y2 * orig_size[1]) + + # thresh + indices = np.where(cls_scores > args.vis_thresh) + scores = cls_scores[indices] + 
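
A self-contained sketch of the per-box loop in `inference_ava` above: each row of the multi-hot output is `[x1, y1, x2, y2, conf, s_1..s_C]` with coordinates already normalized, and per-class scores are fused as `sqrt(conf * s_c)` before thresholding. The detection row and threshold below are made up.

```python
import numpy as np

vis_thresh = 0.4
out_boxes = np.array([[0.10, 0.20, 0.45, 0.90, 0.88] + [0.05] * 79 + [0.70]])

for bbox in out_boxes:
    x1, y1, x2, y2 = bbox[:4]
    det_conf = float(bbox[4])
    cls_scores = np.sqrt(det_conf * bbox[5:])
    keep = np.where(cls_scores > vis_thresh)[0]
    print(keep, cls_scores[keep])        # [79] [0.785...]: only the last class survives
```
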
indices = list(indices[0]) + scores = list(scores) + + if len(scores) > 0: + # draw bbox + cv2.rectangle(key_frame, (x1, y1), (x2, y2), (0, 255, 0), 2) + # draw text + blk = np.zeros(key_frame.shape, np.uint8) + font = cv2.FONT_HERSHEY_SIMPLEX + coord = [] + text = [] + text_size = [] + # scores, indices = [list(a) for a in zip(*sorted(zip(scores,indices), reverse=True))] # if you want, you can sort according to confidence level + for _, cls_ind in enumerate(indices): + text.append("[{:.2f}] ".format(scores[_]) + str(class_names[cls_ind])) + text_size.append(cv2.getTextSize(text[-1], font, fontScale=0.25, thickness=1)[0]) + coord.append((x1+3, y1+7+10*_)) + cv2.rectangle(blk, (coord[-1][0]-1, coord[-1][1]-6), (coord[-1][0]+text_size[-1][0]+1, coord[-1][1]+text_size[-1][1]-4), (0, 255, 0), cv2.FILLED) + key_frame = cv2.addWeighted(key_frame, 1.0, blk, 0.25, 1) + for t in range(len(text)): + cv2.putText(key_frame, text[t], coord[t], font, 0.25, (0, 0, 0), 1) + + if args.show: + cv2.imshow('key-frame detection', key_frame) + cv2.waitKey(0) + + if args.save: + # save result + cv2.imwrite(os.path.join(save_path, + '{:0>5}.jpg'.format(index)), key_frame) + + +if __name__ == '__main__': + args = parse_args() + # cuda + if args.cuda: + print('use cuda') + device = torch.device("cuda") + else: + device = torch.device("cpu") + + # config + d_cfg = build_dataset_config(args) + m_cfg = build_model_config(args) + + # transform + basetransform = BaseTransform(img_size=args.img_size) + + # dataset + if args.dataset in ['ucf24', 'jhmdb21']: + data_dir = os.path.join(args.root, 'ucf24') + dataset = UCF_JHMDB_Dataset( + data_root=data_dir, + dataset=args.dataset, + img_size=args.img_size, + transform=basetransform, + is_train=False, + len_clip=args.len_clip, + sampling_rate=d_cfg['sampling_rate'] + ) + class_names = d_cfg['label_map'] + num_classes = dataset.num_classes + + elif args.dataset == 'ava_v2.2': + data_dir = os.path.join(args.root, 'AVA_Dataset') + dataset = AVA_Dataset( + cfg=d_cfg, + data_root=data_dir, + is_train=False, + img_size=args.img_size, + transform=basetransform, + len_clip=args.len_clip, + sampling_rate=d_cfg['sampling_rate'] + ) + class_names = d_cfg['label_map'] + num_classes = dataset.num_classes + + else: + print('unknow dataset !! 
Only support ucf24 & jhmdb21 & ava_v2.2 and coco !!') + exit(0) + + np.random.seed(100) + class_colors = [(np.random.randint(255), + np.random.randint(255), + np.random.randint(255)) for _ in range(num_classes)] + + # build model + model, _ = build_model( + args=args, + d_cfg=d_cfg, + m_cfg=m_cfg, + device=device, + num_classes=num_classes, + trainable=False + ) + + # load trained weight + model = load_weight(model=model, path_to_ckpt=args.weight) + + # to eval + model = model.to(device).eval() + + # run + if args.dataset in ['ucf24', 'jhmdb21']: + inference_ucf24_jhmdb21( + args=args, + model=model, + device=device, + dataset=dataset, + class_names=class_names, + class_colors=class_colors + ) + elif args.dataset in ['ava_v2.2']: + inference_ava( + args=args, + model=model, + device=device, + dataset=dataset, + class_names=class_names, + class_colors=class_colors + ) diff --git a/test_video_ava.py b/test_video_ava.py new file mode 100644 index 0000000..c05837a --- /dev/null +++ b/test_video_ava.py @@ -0,0 +1,187 @@ +import argparse +import cv2 +import os +import time +import numpy as np +import torch +from PIL import Image + +from dataset.transforms import BaseTransform +from utils.misc import load_weight +from config import build_dataset_config, build_model_config +from models import build_model + + + +def parse_args(): + parser = argparse.ArgumentParser(description='YOWOv2') + + # basic + parser.add_argument('-size', '--img_size', default=224, type=int, + help='the size of input frame') + parser.add_argument('--show', action='store_true', default=False, + help='show the visulization results.') + parser.add_argument('--cuda', action='store_true', default=False, + help='use cuda.') + parser.add_argument('--save_folder', default='det_results/', type=str, + help='Dir to save results') + parser.add_argument('-vs', '--vis_thresh', default=0.35, type=float, + help='threshold for visualization') + parser.add_argument('--video', default='9Y_l9NsnYE0.mp4', type=str, + help='AVA video name.') + parser.add_argument('-d', '--dataset', default='ava_v2.2', + help='ava_v2.2') + + # model + parser.add_argument('-v', '--version', default='yowo_v2_large', type=str, + help='build YOWOv2') + parser.add_argument('--weight', default='weight/', + type=str, help='Trained state_dict file path to open') + parser.add_argument('--topk', default=40, type=int, + help='NMS threshold') + + return parser.parse_args() + + +@torch.no_grad() +def run(args, d_cfg, model, device, transform, class_names): + # path to save + save_path = os.path.join(args.save_folder, 'ava_video') + os.makedirs(save_path, exist_ok=True) + + # path to video + path_to_video = os.path.join(d_cfg['data_root'], 'videos_15min', args.video) + + # video + video = cv2.VideoCapture(path_to_video) + fourcc = cv2.VideoWriter_fourcc(*'XVID') + save_size = (640, 480) + save_name = os.path.join(save_path, 'detection.avi') + fps = 15.0 + out = cv2.VideoWriter(save_name, fourcc, fps, save_size) + + video_clip = [] + while(True): + ret, frame = video.read() + + if ret: + # to PIL image + frame_pil = Image.fromarray(frame.astype(np.uint8)) + + # prepare + if len(video_clip) <= 0: + for _ in range(d_cfg['len_clip']): + video_clip.append(frame_pil) + + video_clip.append(frame_pil) + del video_clip[0] + + # orig size + orig_h, orig_w = frame.shape[:2] + + # transform + x, _ = transform(video_clip) + # List [T, 3, H, W] -> [3, T, H, W] + x = torch.stack(x, dim=1) + x = x.unsqueeze(0).to(device) # [B, 3, T, H, W], B=1 + + t0 = time.time() + # inference + batch_bboxes = 
model(x) + print("inference time ", time.time() - t0, "s") + + # batch size = 1 + bboxes = batch_bboxes[0] + + # visualize detection results + for bbox in bboxes: + x1, y1, x2, y2 = bbox[:4] + det_conf = float(bbox[4]) + cls_out = [det_conf * cls_conf.cpu().numpy() for cls_conf in bbox[5]] + + # rescale bbox + x1, x2 = int(x1 * orig_w), int(x2 * orig_w) + y1, y2 = int(y1 * orig_h), int(y2 * orig_h) + + cls_scores = np.array(cls_out) + indices = np.where(cls_scores > 0.4) + scores = cls_scores[indices] + indices = list(indices[0]) + scores = list(scores) + + cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2) + + if len(scores) > 0: + blk = np.zeros(frame.shape, np.uint8) + font = cv2.FONT_HERSHEY_SIMPLEX + coord = [] + text = [] + text_size = [] + + for _, cls_ind in enumerate(indices): + text.append("[{:.2f}] ".format(scores[_]) + str(class_names[cls_ind])) + text_size.append(cv2.getTextSize(text[-1], font, fontScale=0.25, thickness=1)[0]) + coord.append((x1+3, y1+7+10*_)) + cv2.rectangle(blk, (coord[-1][0]-1, coord[-1][1]-6), (coord[-1][0]+text_size[-1][0]+1, coord[-1][1]+text_size[-1][1]-4), (0, 255, 0), cv2.FILLED) + frame = cv2.addWeighted(frame, 1.0, blk, 0.25, 1) + for t in range(len(text)): + cv2.putText(frame, text[t], coord[t], font, 0.25, (0, 0, 0), 1) + + # save + out.write(frame) + + if args.show: + # show + cv2.imshow('key-frame detection', frame) + cv2.waitKey(1) + + else: + break + + video.release() + out.release() + cv2.destroyAllWindows() + + +if __name__ == '__main__': + args = parse_args() + # cuda + if args.cuda: + print('use cuda') + device = torch.device("cuda") + else: + device = torch.device("cpu") + + # config + d_cfg = build_dataset_config(args) + m_cfg = build_model_config(args) + + class_names = d_cfg['label_map'] + num_classes = 80 + + # transform + basetransform = BaseTransform( + img_size=d_cfg['test_size'], + pixel_mean=d_cfg['pixel_mean'], + pixel_std=d_cfg['pixel_std'] + ) + + # build model + model = build_model( + args=args, + d_cfg=d_cfg, + m_cfg=m_cfg, + device=device, + num_classes=num_classes, + trainable=False + ) + + # load trained weight + model = load_weight(model=model, path_to_ckpt=args.weight) + + # to eval + model = model.to(device).eval() + + # run + run(args=args, d_cfg=d_cfg, model=model, device=device, + transform=basetransform, class_names=class_names) diff --git a/train.py b/train.py new file mode 100644 index 0000000..addab57 --- /dev/null +++ b/train.py @@ -0,0 +1,329 @@ +import cv2 +cv2.setNumThreads(0) +cv2.ocl.setUseOpenCL(False) + +import os +import time +import argparse +from copy import deepcopy +import torch +import torch.backends.cudnn as cudnn +import torch.distributed as dist +from torch.nn.parallel import DistributedDataParallel as DDP + +from utils import distributed_utils +from utils.com_flops_params import FLOPs_and_Params +from utils.misc import CollateFunc, build_dataset, build_dataloader +from utils.solver.optimizer import build_optimizer +from utils.solver.warmup_schedule import build_warmup + +from config import build_dataset_config, build_model_config +from models import build_model + + +GLOBAL_SEED = 42 + + +def parse_args(): + parser = argparse.ArgumentParser(description='YOWOv2') + # CUDA + parser.add_argument('--cuda', action='store_true', default=False, + help='use cuda.') + + # Visualization + parser.add_argument('--tfboard', action='store_true', default=False, + help='use tensorboard') + parser.add_argument('--save_folder', default='./weights/', type=str, + help='path to save weight') + 
parser.add_argument('--vis_data', action='store_true', default=False,
+                        help='visualize the training data.')
+
+    # Evaluation
+    parser.add_argument('--eval', action='store_true', default=False,
+                        help='do evaluation during training.')
+    parser.add_argument('--eval_epoch', default=1, type=int,
+                        help='evaluate the model on the val dataset every "eval_epoch" epochs.')
+    parser.add_argument('--save_dir', default='inference_results/',
+                        type=str, help='save inference results.')
+    parser.add_argument('--eval_first', action='store_true', default=False,
+                        help='evaluate model before training.')
+
+    # Batchsize
+    parser.add_argument('-bs', '--batch_size', default=16, type=int,
+                        help='batch size on a single GPU.')
+    parser.add_argument('-tbs', '--test_batch_size', default=16, type=int,
+                        help='test batch size on a single GPU.')
+    parser.add_argument('-accu', '--accumulate', default=1, type=int,
+                        help='number of gradient accumulation steps.')
+    parser.add_argument('-lr', '--base_lr', default=0.0001, type=float,
+                        help='base lr.')
+    parser.add_argument('-ldr', '--lr_decay_ratio', default=0.5, type=float,
+                        help='lr decay ratio.')
+
+    # Epoch
+    parser.add_argument('--max_epoch', default=10, type=int,
+                        help='max epoch.')
+    parser.add_argument('--lr_epoch', nargs='+', default=[2,3,4], type=int,
+                        help='epochs at which the lr is decayed.')
+
+    # Model
+    parser.add_argument('-v', '--version', default='yowo_v2_tiny', type=str,
+                        help='build YOWOv2')
+    parser.add_argument('-r', '--resume', default=None, type=str,
+                        help='checkpoint to resume training from.')
+    parser.add_argument('-ct', '--conf_thresh', default=0.1, type=float,
+                        help='confidence threshold. We suggest 0.005 for UCF24 and 0.1 for AVA.')
+    parser.add_argument('-nt', '--nms_thresh', default=0.5, type=float,
+                        help='NMS threshold. We suggest 0.5 for UCF24 and AVA.')
+    parser.add_argument('--topk', default=40, type=int,
+                        help='topk prediction candidates.')
+    parser.add_argument('-K', '--len_clip', default=16, type=int,
+                        help='video clip length.')
+    parser.add_argument('--freeze_backbone_2d', action="store_true", default=False,
+                        help="freeze 2D backbone.")
+    parser.add_argument('--freeze_backbone_3d', action="store_true", default=False,
+                        help="freeze 3D backbone.")
+    parser.add_argument('-m', '--memory', action="store_true", default=False,
+                        help="memory propagate.")
+
+    # Dataset
+    parser.add_argument('-d', '--dataset', default='ucf24',
+                        help='ucf24, ava_v2.2')
+    parser.add_argument('--root', default='/mnt/share/ssd2/dataset/STAD/',
+                        help='data root')
+    parser.add_argument('--num_workers', default=4, type=int,
+                        help='Number of workers used in dataloading')
+
+    # Matcher
+    parser.add_argument('--center_sampling_radius', default=2.5, type=float,
+                        help='center sampling radius of the label-assignment matcher.')
+    parser.add_argument('--topk_candicate', default=10, type=int,
+                        help='number of topk candidates kept by the label-assignment matcher.')
+
+    # Loss
+    parser.add_argument('--loss_conf_weight', default=1, type=float,
+                        help='conf loss weight factor.')
+    parser.add_argument('--loss_cls_weight', default=1, type=float,
+                        help='cls loss weight factor.')
+    parser.add_argument('--loss_reg_weight', default=5, type=float,
+                        help='reg loss weight factor.')
+    parser.add_argument('-fl', '--focal_loss', action="store_true", default=False,
+                        help="use focal loss for classification.")
+
+    # DDP train
+    parser.add_argument('-dist', '--distributed', action='store_true', default=False,
+                        help='distributed training')
+    parser.add_argument('--dist_url', default='env://',
+                        help='url used to set up distributed training')
+    parser.add_argument('--world_size', default=1, type=int,
+                        help='number of
distributed processes') + parser.add_argument('--sybn', action='store_true', default=False, + help='use sybn.') + + return parser.parse_args() + + +def train(): + args = parse_args() + print("Setting Arguments.. : ", args) + print("----------------------------------------------------------") + + # dist + print('World size: {}'.format(distributed_utils.get_world_size())) + if args.distributed: + distributed_utils.init_distributed_mode(args) + print("git:\n {}\n".format(distributed_utils.get_sha())) + + # path to save model + path_to_save = os.path.join(args.save_folder, args.dataset, args.version) + os.makedirs(path_to_save, exist_ok=True) + + # cuda + if args.cuda: + print('use cuda') + cudnn.benchmark = True + device = torch.device("cuda") + else: + device = torch.device("cpu") + + # config + d_cfg = build_dataset_config(args) + m_cfg = build_model_config(args) + + # dataset and evaluator + dataset, evaluator, num_classes = build_dataset(d_cfg, args, is_train=True) + + # dataloader + batch_size = args.batch_size * distributed_utils.get_world_size() + dataloader = build_dataloader(args, dataset, batch_size, CollateFunc(), is_train=True) + + # build model + model, criterion = build_model( + args=args, + d_cfg=d_cfg, + m_cfg=m_cfg, + device=device, + num_classes=num_classes, + trainable=True, + resume=args.resume + ) + model = model.to(device).train() + + # SyncBatchNorm + if args.sybn and args.distributed: + print('use SyncBatchNorm ...') + model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model) + + # DDP + model_without_ddp = model + if args.distributed: + model = DDP(model, device_ids=[args.gpu]) + model_without_ddp = model.module + + # Compute FLOPs and Params + if distributed_utils.is_main_process(): + model_copy = deepcopy(model_without_ddp) + FLOPs_and_Params( + model=model_copy, + img_size=d_cfg['test_size'], + len_clip=args.len_clip, + device=device) + del model_copy + + # optimizer + base_lr = args.base_lr + accumulate = args.accumulate + optimizer, start_epoch = build_optimizer(d_cfg, model_without_ddp, base_lr, args.resume) + + # lr scheduler + lr_scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, args.lr_epoch, args.lr_decay_ratio) + + # warmup scheduler + warmup_scheduler = build_warmup(d_cfg, base_lr=base_lr) + + # training configuration + max_epoch = args.max_epoch + epoch_size = len(dataloader) + warmup = True + + # eval before training + if args.eval_first and distributed_utils.is_main_process(): + # to check whether the evaluator can work + eval_one_epoch(args, model_without_ddp, evaluator, 0, path_to_save) + + # start to train + t0 = time.time() + for epoch in range(start_epoch, max_epoch): + if args.distributed: + dataloader.batch_sampler.sampler.set_epoch(epoch) + + # train one epoch + for iter_i, (frame_ids, video_clips, targets) in enumerate(dataloader): + ni = iter_i + epoch * epoch_size + + # warmup + if ni < d_cfg['wp_iter'] and warmup: + warmup_scheduler.warmup(ni, optimizer) + + elif ni == d_cfg['wp_iter'] and warmup: + # warmup is over + print('Warmup is over') + warmup = False + warmup_scheduler.set_lr(optimizer, lr=base_lr, base_lr=base_lr) + + # to device + video_clips = video_clips.to(device) + + # inference + outputs = model(video_clips) + + # loss + loss_dict = criterion(outputs, targets) + losses = loss_dict['losses'] + + # reduce + loss_dict_reduced = distributed_utils.reduce_dict(loss_dict) + + # check loss + if torch.isnan(losses): + print('loss is NAN !!') + continue + + # Backward + losses /= accumulate + losses.backward() + + # 
Optimize + if ni % accumulate == 0: + optimizer.step() + optimizer.zero_grad() + + # Display + if distributed_utils.is_main_process() and iter_i % 10 == 0: + t1 = time.time() + cur_lr = [param_group['lr'] for param_group in optimizer.param_groups] + print_log(cur_lr, epoch, max_epoch, iter_i, epoch_size,loss_dict_reduced, t1-t0, accumulate) + + t0 = time.time() + + lr_scheduler.step() + + # evaluation + if epoch % args.eval_epoch == 0 or (epoch + 1) == max_epoch: + eval_one_epoch(args, model_without_ddp, evaluator, epoch, path_to_save) + + +def eval_one_epoch(args, model_eval, evaluator, epoch, path_to_save): + # check evaluator + if distributed_utils.is_main_process(): + if evaluator is None: + print('No evaluator ... save model and go on training.') + + else: + print('eval ...') + # set eval mode + model_eval.trainable = False + model_eval.eval() + + # evaluate + evaluator.evaluate_frame_map(model_eval, epoch + 1) + + # set train mode. + model_eval.trainable = True + model_eval.train() + + # save model + print('Saving state, epoch:', epoch + 1) + weight_name = '{}_epoch_{}.pth'.format(args.version, epoch+1) + checkpoint_path = os.path.join(path_to_save, weight_name) + torch.save({'model': model_eval.state_dict(), + 'epoch': epoch, + 'args': args}, + checkpoint_path) + + if args.distributed: + # wait for all processes to synchronize + dist.barrier() + + +def print_log(lr, epoch, max_epoch, iter_i, epoch_size, loss_dict, time, accumulate): + # basic infor + log = '[Epoch: {}/{}]'.format(epoch+1, max_epoch) + log += '[Iter: {}/{}]'.format(iter_i, epoch_size) + log += '[lr: {:.6f}]'.format(lr[0]) + # loss infor + for k in loss_dict.keys(): + if k == 'losses': + log += '[{}: {:.2f}]'.format(k, loss_dict[k] * accumulate) + else: + log += '[{}: {:.2f}]'.format(k, loss_dict[k]) + + # other infor + log += '[time: {:.2f}]'.format(time) + + # print log infor + print(log, flush=True) + + +if __name__ == '__main__': + train() diff --git a/train_ava.sh b/train_ava.sh new file mode 100644 index 0000000..6862b00 --- /dev/null +++ b/train_ava.sh @@ -0,0 +1,16 @@ +# Train YOWOv2 on AVA dataset +python train.py \ + --cuda \ + -d ava_v2.2 \ + -v yowo_v2_nano \ + --root /mnt/share/sda1/dataset/STAD/ \ + --num_workers 4 \ + --eval_epoch 1 \ + --eval \ + --max_epoch 9 \ + --lr_epoch 3 4 5 6 \ + -lr 0.0001 \ + -ldr 0.5 \ + -bs 8 \ + -accu 16 \ + -K 32 diff --git a/train_ucf.sh b/train_ucf.sh new file mode 100644 index 0000000..55000f9 --- /dev/null +++ b/train_ucf.sh @@ -0,0 +1,16 @@ +# Train YOWOv2 on UCF24 dataset +python train.py \ + --cuda \ + -d ucf24 \ + -v yowo_v2_large \ + --root /mnt/share/ssd2/dataset/STAD/ \ + --num_workers 4 \ + --eval_epoch 1 \ + --max_epoch 7 \ + --lr_epoch 2 3 4 5 \ + -lr 0.0001 \ + -ldr 0.5 \ + -bs 8 \ + -accu 16 \ + -K 32 + # --eval \ diff --git a/utils/__init__.py b/utils/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/utils/box_ops.py b/utils/box_ops.py new file mode 100644 index 0000000..92c882f --- /dev/null +++ b/utils/box_ops.py @@ -0,0 +1,92 @@ +import numpy as np +import torch +from torchvision.ops.boxes import box_area + + +def get_ious(bboxes1, + bboxes2, + box_mode="xyxy", + iou_type="iou"): + """ + Compute iou loss of type ['iou', 'giou', 'linear_iou'] + + Args: + inputs (tensor): pred values + targets (tensor): target values + weight (tensor): loss weight + box_mode (str): 'xyxy' or 'ltrb', 'ltrb' is currently supported. 
+ loss_type (str): 'giou' or 'iou' or 'linear_iou' + reduction (str): reduction manner + + Returns: + loss (tensor): computed iou loss. + """ + if box_mode == "ltrb": + bboxes1 = torch.cat((-bboxes1[..., :2], bboxes1[..., 2:]), dim=-1) + bboxes2 = torch.cat((-bboxes2[..., :2], bboxes2[..., 2:]), dim=-1) + elif box_mode != "xyxy": + raise NotImplementedError + + eps = torch.finfo(torch.float32).eps + + bboxes1_area = (bboxes1[..., 2] - bboxes1[..., 0]).clamp_(min=0) \ + * (bboxes1[..., 3] - bboxes1[..., 1]).clamp_(min=0) + bboxes2_area = (bboxes2[..., 2] - bboxes2[..., 0]).clamp_(min=0) \ + * (bboxes2[..., 3] - bboxes2[..., 1]).clamp_(min=0) + + w_intersect = (torch.min(bboxes1[..., 2], bboxes2[..., 2]) + - torch.max(bboxes1[..., 0], bboxes2[..., 0])).clamp_(min=0) + h_intersect = (torch.min(bboxes1[..., 3], bboxes2[..., 3]) + - torch.max(bboxes1[..., 1], bboxes2[..., 1])).clamp_(min=0) + + area_intersect = w_intersect * h_intersect + area_union = bboxes2_area + bboxes1_area - area_intersect + ious = area_intersect / area_union.clamp(min=eps) + + if iou_type == "iou": + return ious + elif iou_type == "giou": + g_w_intersect = torch.max(bboxes1[..., 2], bboxes2[..., 2]) \ + - torch.min(bboxes1[..., 0], bboxes2[..., 0]) + g_h_intersect = torch.max(bboxes1[..., 3], bboxes2[..., 3]) \ + - torch.min(bboxes1[..., 1], bboxes2[..., 1]) + ac_uion = g_w_intersect * g_h_intersect + gious = ious - (ac_uion - area_union) / ac_uion.clamp(min=eps) + return gious + else: + raise NotImplementedError + + +# modified from torchvision to also return the union +def box_iou(boxes1, boxes2): + area1 = box_area(boxes1) + area2 = box_area(boxes2) + + lt = torch.max(boxes1[:, None, :2], boxes2[:, :2]) # [N,M,2] + rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:]) # [N,M,2] + + wh = (rb - lt).clamp(min=0) # [N,M,2] + inter = wh[:, :, 0] * wh[:, :, 1] # [N,M] + + union = area1[:, None] + area2 - inter + + iou = inter / union + return iou, union + + +def rescale_bboxes(bboxes, orig_size): + orig_w, orig_h = orig_size[0], orig_size[1] + bboxes[..., [0, 2]] = np.clip( + bboxes[..., [0, 2]] * orig_w, a_min=0., a_max=orig_w + ) + bboxes[..., [1, 3]] = np.clip( + bboxes[..., [1, 3]] * orig_h, a_min=0., a_max=orig_h + ) + + return bboxes + + + +if __name__ == '__main__': + box1 = torch.tensor([[10, 10, 20, 20]]) + box2 = torch.tensor([[15, 15, 25, 25]]) diff --git a/utils/com_flops_params.py b/utils/com_flops_params.py new file mode 100644 index 0000000..2502aa7 --- /dev/null +++ b/utils/com_flops_params.py @@ -0,0 +1,25 @@ +import torch +from thop import profile + + +def FLOPs_and_Params(model, img_size, len_clip, device): + # generate init video clip + video_clip = torch.randn(1, 3, len_clip, img_size, img_size).to(device) + + # set eval mode + model.trainable = False + model.eval() + + print('==============================') + flops, params = profile(model, inputs=(video_clip, )) + print('==============================') + print('FLOPs : {:.2f} G'.format(flops * 2.0 / 1e9)) + print('Params : {:.2f} M'.format(params / 1e6)) + + # set train mode. 
+ model.trainable = True + model.train() + + +if __name__ == "__main__": + pass diff --git a/utils/distributed_utils.py b/utils/distributed_utils.py new file mode 100644 index 0000000..82cc36e --- /dev/null +++ b/utils/distributed_utils.py @@ -0,0 +1,166 @@ +# from github: https://github.com/ruinmessi/ASFF/blob/master/utils/distributed_util.py + +import torch +import torch.distributed as dist +import os +import subprocess +import pickle + + +def all_gather(data): + """ + Run all_gather on arbitrary picklable data (not necessarily tensors) + Args: + data: any picklable object + Returns: + list[data]: list of data gathered from each rank + """ + world_size = get_world_size() + if world_size == 1: + return [data] + + # serialized to a Tensor + buffer = pickle.dumps(data) + storage = torch.ByteStorage.from_buffer(buffer) + tensor = torch.ByteTensor(storage).to("cuda") + + # obtain Tensor size of each rank + local_size = torch.tensor([tensor.numel()], device="cuda") + size_list = [torch.tensor([0], device="cuda") for _ in range(world_size)] + dist.all_gather(size_list, local_size) + size_list = [int(size.item()) for size in size_list] + max_size = max(size_list) + + # receiving Tensor from all ranks + # we pad the tensor because torch all_gather does not support + # gathering tensors of different shapes + tensor_list = [] + for _ in size_list: + tensor_list.append(torch.empty((max_size,), dtype=torch.uint8, device="cuda")) + if local_size != max_size: + padding = torch.empty(size=(max_size - local_size,), dtype=torch.uint8, device="cuda") + tensor = torch.cat((tensor, padding), dim=0) + dist.all_gather(tensor_list, tensor) + + data_list = [] + for size, tensor in zip(size_list, tensor_list): + buffer = tensor.cpu().numpy().tobytes()[:size] + data_list.append(pickle.loads(buffer)) + + return data_list + + +def reduce_dict(input_dict, average=True): + """ + Args: + input_dict (dict): all the values will be reduced + average (bool): whether to do average or sum + Reduce the values in the dictionary from all processes so that all processes + have the averaged results. Returns a dict with the same fields as + input_dict, after reduction. 
+ """ + world_size = get_world_size() + if world_size < 2: + return input_dict + with torch.no_grad(): + names = [] + values = [] + # sort the keys so that they are consistent across processes + for k in sorted(input_dict.keys()): + names.append(k) + values.append(input_dict[k]) + values = torch.stack(values, dim=0) + dist.all_reduce(values) + if average: + values /= world_size + reduced_dict = {k: v for k, v in zip(names, values)} + return reduced_dict + + +def get_sha(): + cwd = os.path.dirname(os.path.abspath(__file__)) + + def _run(command): + return subprocess.check_output(command, cwd=cwd).decode('ascii').strip() + sha = 'N/A' + diff = "clean" + branch = 'N/A' + try: + sha = _run(['git', 'rev-parse', 'HEAD']) + subprocess.check_output(['git', 'diff'], cwd=cwd) + diff = _run(['git', 'diff-index', 'HEAD']) + diff = "has uncommited changes" if diff else "clean" + branch = _run(['git', 'rev-parse', '--abbrev-ref', 'HEAD']) + except Exception: + pass + message = f"sha: {sha}, status: {diff}, branch: {branch}" + return message + + +def setup_for_distributed(is_master): + """ + This function disables printing when not in master process + """ + import builtins as __builtin__ + builtin_print = __builtin__.print + + def print(*args, **kwargs): + force = kwargs.pop('force', False) + if is_master or force: + builtin_print(*args, **kwargs) + + __builtin__.print = print + + +def is_dist_avail_and_initialized(): + if not dist.is_available(): + return False + if not dist.is_initialized(): + return False + return True + + +def get_world_size(): + if not is_dist_avail_and_initialized(): + return 1 + return dist.get_world_size() + + +def get_rank(): + if not is_dist_avail_and_initialized(): + return 0 + return dist.get_rank() + + +def is_main_process(): + return get_rank() == 0 + + +def save_on_master(*args, **kwargs): + if is_main_process(): + torch.save(*args, **kwargs) + + +def init_distributed_mode(args): + if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ: + args.rank = int(os.environ["RANK"]) + args.world_size = int(os.environ['WORLD_SIZE']) + args.gpu = int(os.environ['LOCAL_RANK']) + elif 'SLURM_PROCID' in os.environ: + args.rank = int(os.environ['SLURM_PROCID']) + args.gpu = args.rank % torch.cuda.device_count() + else: + print('Not using distributed mode') + args.distributed = False + return + + args.distributed = True + + torch.cuda.set_device(args.gpu) + args.dist_backend = 'nccl' + print('| distributed init (rank {}): {}'.format( + args.rank, args.dist_url), flush=True) + torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url, + world_size=args.world_size, rank=args.rank) + torch.distributed.barrier() + setup_for_distributed(args.rank == 0) diff --git a/utils/misc.py b/utils/misc.py new file mode 100644 index 0000000..2d168be --- /dev/null +++ b/utils/misc.py @@ -0,0 +1,190 @@ +import os + +import torch +import torch.nn as nn + +from dataset.ucf_jhmdb import UCF_JHMDB_Dataset +from dataset.ava import AVA_Dataset +from dataset.transforms import Augmentation, BaseTransform + +from evaluator.ucf_jhmdb_evaluator import UCF_JHMDB_Evaluator +from evaluator.ava_evaluator import AVA_Evaluator + + +def build_dataset(d_cfg, args, is_train=False): + """ + d_cfg: dataset config + """ + # transform + augmentation = Augmentation( + img_size=d_cfg['train_size'], + jitter=d_cfg['jitter'], + hue=d_cfg['hue'], + saturation=d_cfg['saturation'], + exposure=d_cfg['exposure'] + ) + basetransform = BaseTransform( + img_size=d_cfg['test_size'], + ) + + # dataset + if 
args.dataset in ['ucf24', 'jhmdb21']: + data_dir = os.path.join(args.root, 'ucf24') + + # dataset + dataset = UCF_JHMDB_Dataset( + data_root=data_dir, + dataset=args.dataset, + img_size=d_cfg['train_size'], + transform=augmentation, + is_train=is_train, + len_clip=args.len_clip, + sampling_rate=d_cfg['sampling_rate'] + ) + num_classes = dataset.num_classes + + # evaluator + evaluator = UCF_JHMDB_Evaluator( + data_root=data_dir, + dataset=args.dataset, + model_name=args.version, + metric='fmap', + img_size=d_cfg['test_size'], + len_clip=args.len_clip, + batch_size=args.test_batch_size, + conf_thresh=0.01, + iou_thresh=0.5, + gt_folder=d_cfg['gt_folder'], + save_path='./evaluator/eval_results/', + transform=basetransform, + collate_fn=CollateFunc() + ) + + elif args.dataset == 'ava_v2.2': + data_dir = os.path.join(args.root, 'AVA_Dataset') + + # dataset + dataset = AVA_Dataset( + cfg=d_cfg, + data_root=data_dir, + is_train=True, + img_size=d_cfg['train_size'], + transform=augmentation, + len_clip=args.len_clip, + sampling_rate=d_cfg['sampling_rate'] + ) + num_classes = 80 + + # evaluator + evaluator = AVA_Evaluator( + d_cfg=d_cfg, + data_root=data_dir, + img_size=d_cfg['test_size'], + len_clip=args.len_clip, + sampling_rate=d_cfg['sampling_rate'], + batch_size=args.test_batch_size, + transform=basetransform, + collate_fn=CollateFunc(), + full_test_on_val=False, + version='v2.2' + ) + + else: + print('unknow dataset !! Only support ucf24 & jhmdb21 & ava_v2.2 !!') + exit(0) + + print('==============================') + print('Training model on:', args.dataset) + print('The dataset size:', len(dataset)) + + if not args.eval: + # no evaluator during training stage + evaluator = None + + return dataset, evaluator, num_classes + + +def build_dataloader(args, dataset, batch_size, collate_fn=None, is_train=False): + if is_train: + # distributed + if args.distributed: + sampler = torch.utils.data.distributed.DistributedSampler(dataset) + else: + sampler = torch.utils.data.RandomSampler(dataset) + + batch_sampler_train = torch.utils.data.BatchSampler(sampler, + batch_size, + drop_last=True) + # train dataloader + dataloader = torch.utils.data.DataLoader( + dataset=dataset, + batch_sampler=batch_sampler_train, + collate_fn=collate_fn, + num_workers=args.num_workers, + pin_memory=True + ) + else: + # test dataloader + dataloader = torch.utils.data.DataLoader( + dataset=dataset, + shuffle=False, + collate_fn=collate_fn, + num_workers=args.num_workers, + drop_last=False, + pin_memory=True + ) + + return dataloader + + +def load_weight(model, path_to_ckpt=None): + if path_to_ckpt is None: + print('No trained weight ..') + return model + + checkpoint = torch.load(path_to_ckpt, map_location='cpu') + # checkpoint state dict + checkpoint_state_dict = checkpoint.pop("model") + # model state dict + model_state_dict = model.state_dict() + # check + for k in list(checkpoint_state_dict.keys()): + if k in model_state_dict: + shape_model = tuple(model_state_dict[k].shape) + shape_checkpoint = tuple(checkpoint_state_dict[k].shape) + if shape_model != shape_checkpoint: + checkpoint_state_dict.pop(k) + else: + checkpoint_state_dict.pop(k) + print(k) + + model.load_state_dict(checkpoint_state_dict) + print('Finished loading model!') + + return model + + +def is_parallel(model): + # Returns True if model is of type DP or DDP + return type(model) in (nn.parallel.DataParallel, nn.parallel.DistributedDataParallel) + + +class CollateFunc(object): + def __call__(self, batch): + batch_frame_id = [] + batch_key_target = [] + 
batch_video_clips = []
+
+        for sample in batch:
+            key_frame_id = sample[0]
+            video_clip = sample[1]
+            key_target = sample[2]
+
+            batch_frame_id.append(key_frame_id)
+            batch_video_clips.append(video_clip)
+            batch_key_target.append(key_target)
+
+        # List of [3, T, H, W] -> [B, 3, T, H, W]
+        batch_video_clips = torch.stack(batch_video_clips)
+
+        return batch_frame_id, batch_video_clips, batch_key_target
diff --git a/utils/nms.py b/utils/nms.py
new file mode 100644
index 0000000..85e751a
--- /dev/null
+++ b/utils/nms.py
@@ -0,0 +1,71 @@
+import numpy as np
+
+
+def nms(bboxes, scores, nms_thresh):
+    """Pure Python NMS."""
+    x1 = bboxes[:, 0]  # xmin
+    y1 = bboxes[:, 1]  # ymin
+    x2 = bboxes[:, 2]  # xmax
+    y2 = bboxes[:, 3]  # ymax
+
+    areas = (x2 - x1) * (y2 - y1)
+    order = scores.argsort()[::-1]
+
+    keep = []
+    while order.size > 0:
+        i = order[0]
+        keep.append(i)
+        # compute iou
+        xx1 = np.maximum(x1[i], x1[order[1:]])
+        yy1 = np.maximum(y1[i], y1[order[1:]])
+        xx2 = np.minimum(x2[i], x2[order[1:]])
+        yy2 = np.minimum(y2[i], y2[order[1:]])
+
+        w = np.maximum(1e-10, xx2 - xx1)
+        h = np.maximum(1e-10, yy2 - yy1)
+        inter = w * h
+
+        iou = inter / (areas[i] + areas[order[1:]] - inter + 1e-14)
+        # keep only the boxes whose IoU with the current box is below the threshold
+        inds = np.where(iou <= nms_thresh)[0]
+        order = order[inds + 1]
+
+    return keep
+
+
+def multiclass_nms_class_agnostic(scores, labels, bboxes, nms_thresh):
+    # nms
+    keep = nms(bboxes, scores, nms_thresh)
+
+    scores = scores[keep]
+    labels = labels[keep]
+    bboxes = bboxes[keep]
+
+    return scores, labels, bboxes
+
+
+def multiclass_nms_class_aware(scores, labels, bboxes, nms_thresh, num_classes):
+    # nms
+    keep = np.zeros(len(bboxes), dtype=np.int32)
+    for i in range(num_classes):
+        inds = np.where(labels == i)[0]
+        if len(inds) == 0:
+            continue
+        c_bboxes = bboxes[inds]
+        c_scores = scores[inds]
+        c_keep = nms(c_bboxes, c_scores, nms_thresh)
+        keep[inds[c_keep]] = 1
+
+    keep = np.where(keep > 0)
+    scores = scores[keep]
+    labels = labels[keep]
+    bboxes = bboxes[keep]
+
+    return scores, labels, bboxes
+
+
+def multiclass_nms(scores, labels, bboxes, nms_thresh, num_classes, class_agnostic=False):
+    if class_agnostic:
+        return multiclass_nms_class_agnostic(scores, labels, bboxes, nms_thresh)
+    else:
+        return multiclass_nms_class_aware(scores, labels, bboxes, nms_thresh, num_classes)
diff --git a/utils/solver/__init__.py b/utils/solver/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/utils/solver/optimizer.py b/utils/solver/optimizer.py
new file mode 100644
index 0000000..22c2600
--- /dev/null
+++ b/utils/solver/optimizer.py
@@ -0,0 +1,40 @@
+import torch
+from torch import optim
+
+
+def build_optimizer(cfg, model, base_lr=0.0, resume=None):
+    print('==============================')
+    print('Optimizer: {}'.format(cfg['optimizer']))
+    print('--momentum: {}'.format(cfg['momentum']))
+    print('--weight_decay: {}'.format(cfg['weight_decay']))
+
+    if cfg['optimizer'] == 'sgd':
+        optimizer = optim.SGD(
+            model.parameters(),
+            lr=base_lr,
+            momentum=cfg['momentum'],
+            weight_decay=cfg['weight_decay'])
+
+    elif cfg['optimizer'] == 'adam':
+        optimizer = optim.Adam(
+            model.parameters(),
+            lr=base_lr,
+            weight_decay=cfg['weight_decay'])
+
+    elif cfg['optimizer'] == 'adamw':
+        optimizer = optim.AdamW(
+            model.parameters(),
+            lr=base_lr,
+            weight_decay=cfg['weight_decay'])
+
+    start_epoch = 0
+    if resume is not None:
+        print('keep training: ', resume)
+        checkpoint = torch.load(resume)
+        # checkpoint state dict
+        checkpoint_state_dict = checkpoint.pop("optimizer")
+        optimizer.load_state_dict(checkpoint_state_dict)
+        start_epoch = checkpoint.pop("epoch")
+
+    return optimizer, start_epoch
diff --git a/utils/solver/warmup_schedule.py b/utils/solver/warmup_schedule.py
new file mode 100644
index 0000000..c88398a
--- /dev/null
+++ b/utils/solver/warmup_schedule.py
@@ -0,0 +1,58 @@
+
+# Build warmup scheduler
+
+
+def build_warmup(cfg, base_lr=0.01):
+    print('==============================')
+    print('WarmUpScheduler: {}'.format(cfg['warmup']))
+    print('--base_lr: {}'.format(base_lr))
+    print('--warmup_factor: {}'.format(cfg['warmup_factor']))
+    print('--wp_iter: {}'.format(cfg['wp_iter']))
+
+    warmup_scheduler = WarmUpScheduler(
+        name=cfg['warmup'],
+        base_lr=base_lr,
+        wp_iter=cfg['wp_iter'],
+        warmup_factor=cfg['warmup_factor']
+    )
+
+    return warmup_scheduler
+
+
+# Basic Warmup Scheduler
+class WarmUpScheduler(object):
+    def __init__(self,
+                 name='linear',
+                 base_lr=0.01,
+                 wp_iter=500,
+                 warmup_factor=0.00066667):
+        self.name = name
+        self.base_lr = base_lr
+        self.wp_iter = wp_iter
+        self.warmup_factor = warmup_factor
+
+
+    def set_lr(self, optimizer, lr, base_lr):
+        # scale each param group's lr by its initial_lr / base_lr ratio
+        for param_group in optimizer.param_groups:
+            init_lr = param_group['initial_lr']
+            ratio = init_lr / base_lr
+            param_group['lr'] = lr * ratio
+
+
+    def warmup(self, iter, optimizer):
+        # warmup
+        assert iter < self.wp_iter
+        if self.name == 'exp':
+            tmp_lr = self.base_lr * pow(iter / self.wp_iter, 4)
+            self.set_lr(optimizer, tmp_lr, self.base_lr)
+
+        elif self.name == 'linear':
+            alpha = iter / self.wp_iter
+            warmup_factor = self.warmup_factor * (1 - alpha) + alpha
+            tmp_lr = self.base_lr * warmup_factor
+            self.set_lr(optimizer, tmp_lr, self.base_lr)
+
+
+    def __call__(self, iter, optimizer):
+        self.warmup(iter, optimizer)
+
\ No newline at end of file
diff --git a/utils/vis_tools.py b/utils/vis_tools.py
new file mode 100644
index 0000000..f5c48b6
--- /dev/null
+++ b/utils/vis_tools.py
@@ -0,0 +1,87 @@
+import cv2
+import numpy as np
+
+
+def vis_targets(video_clips, targets):
+    """
+        video_clips: (Tensor) -> [B, C, T, H, W]
+        targets: List[Dict] -> [{'boxes': (Tensor) [N, 4],
+                                 'labels': (Tensor) [N,]},
+                                 ...],
+    """
+    batch_size = len(video_clips)
+
+    for batch_index in range(batch_size):
+        video_clip = video_clips[batch_index]
+        target = targets[batch_index]
+
+        # video_clip: [C, T, H, W] -> key frame: [C, H, W]
+        key_frame = video_clip[:, -1, :, :]
+        tgt_bboxes = target['boxes']
+        tgt_labels = target['labels']
+
+        key_frame = convert_tensor_to_cv2img(key_frame)
+        # cv2 image is [H, W, C]
+        height, width = key_frame.shape[:2]
+
+        for box, label in zip(tgt_bboxes, tgt_labels):
+            x1, y1, x2, y2 = box
+            label = int(label)
+
+            x1 *= width
+            y1 *= height
+            x2 *= width
+            y2 *= height
+
+            # draw bbox
+            cv2.rectangle(key_frame,
+                          (int(x1), int(y1)),
+                          (int(x2), int(y2)),
+                          (255, 0, 0), 2)
+        cv2.imshow('groundtruth', key_frame)
+        cv2.waitKey(0)
+
+
+def convert_tensor_to_cv2img(img_tensor):
+    """ convert torch.Tensor to cv2 image """
+    # to numpy
+    img_tensor = img_tensor.permute(1, 2, 0).cpu().numpy()
+    # to cv2 img Mat
+    cv2_img = img_tensor.astype(np.uint8)
+    # RGB -> BGR
+    cv2_img = cv2_img.copy()[..., (2, 1, 0)]
+
+    return cv2_img
+
+
+def plot_bbox_labels(img, bbox, label=None, cls_color=None, text_scale=0.4):
+    x1, y1, x2, y2 = bbox
+    x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
+    # plot bbox
+    cv2.rectangle(img, (x1, y1), (x2, y2), cls_color, 2)
+
+    if label is not None:
+        t_size = cv2.getTextSize(label, 0, fontScale=1, thickness=2)[0]
+        # plot title bbox
+        cv2.rectangle(img, (x1, y1-t_size[1]), (int(x1 + t_size[0] *
text_scale), y1), cls_color, -1) + # put the test on the title bbox + cv2.putText(img, label, (int(x1), int(y1 - 5)), 0, text_scale, (0, 0, 0), 1, lineType=cv2.LINE_AA) + + return img + + +def vis_detection(frame, scores, labels, bboxes, vis_thresh, class_names, class_colors): + ts = 0.4 + for i, bbox in enumerate(bboxes): + if scores[i] > vis_thresh: + label = int(labels[i]) + cls_color = class_colors[label] + + if len(class_names) > 1: + mess = '%s: %.2f' % (class_names[label], scores[i]) + else: + cls_color = [255, 0, 0] + mess = None + # visualize bbox + frame = plot_bbox_labels(frame, bbox, mess, cls_color, text_scale=ts) + + return frame + \ No newline at end of file diff --git a/utils/weight_init.py b/utils/weight_init.py new file mode 100644 index 0000000..5fe88be --- /dev/null +++ b/utils/weight_init.py @@ -0,0 +1,110 @@ +#!/usr/bin/env python3 +# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved. +import math + +import torch.nn as nn + + +def constant_init(module, val, bias=0): + nn.init.constant_(module.weight, val) + if hasattr(module, 'bias') and module.bias is not None: + nn.init.constant_(module.bias, bias) + + +def xavier_init(module, gain=1, bias=0, distribution='normal'): + assert distribution in ['uniform', 'normal'] + if distribution == 'uniform': + nn.init.xavier_uniform_(module.weight, gain=gain) + else: + nn.init.xavier_normal_(module.weight, gain=gain) + if hasattr(module, 'bias') and module.bias is not None: + nn.init.constant_(module.bias, bias) + + +def normal_init(module, mean=0, std=1, bias=0): + nn.init.normal_(module.weight, mean, std) + if hasattr(module, 'bias') and module.bias is not None: + nn.init.constant_(module.bias, bias) + + +def uniform_init(module, a=0, b=1, bias=0): + nn.init.uniform_(module.weight, a, b) + if hasattr(module, 'bias') and module.bias is not None: + nn.init.constant_(module.bias, bias) + + +def kaiming_init(module, + a=0, + mode='fan_out', + nonlinearity='relu', + bias=0, + distribution='normal'): + assert distribution in ['uniform', 'normal'] + if distribution == 'uniform': + nn.init.kaiming_uniform_(module.weight, + a=a, + mode=mode, + nonlinearity=nonlinearity) + else: + nn.init.kaiming_normal_(module.weight, + a=a, + mode=mode, + nonlinearity=nonlinearity) + if hasattr(module, 'bias') and module.bias is not None: + nn.init.constant_(module.bias, bias) + + +def caffe2_xavier_init(module, bias=0): + # `XavierFill` in Caffe2 corresponds to `kaiming_uniform_` in PyTorch + # Acknowledgment to FAIR's internal code + kaiming_init(module, + a=1, + mode='fan_in', + nonlinearity='leaky_relu', + bias=bias, + distribution='uniform') + + +def c2_xavier_fill(module: nn.Module): + """ + Initialize `module.weight` using the "XavierFill" implemented in Caffe2. + Also initializes `module.bias` to 0. + + Args: + module (torch.nn.Module): module to initialize. + """ + # Caffe2 implementation of XavierFill in fact + # corresponds to kaiming_uniform_ in PyTorch + nn.init.kaiming_uniform_(module.weight, a=1) + if module.bias is not None: + nn.init.constant_(module.bias, 0) + + +def c2_msra_fill(module: nn.Module): + """ + Initialize `module.weight` using the "MSRAFill" implemented in Caffe2. + Also initializes `module.bias` to 0. + + Args: + module (torch.nn.Module): module to initialize. 
+ """ + nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu") + if module.bias is not None: + nn.init.constant_(module.bias, 0) + + +def init_weights(m: nn.Module, zero_init_final_gamma=False): + """Performs ResNet-style weight initialization.""" + if isinstance(m, nn.Conv2d): + # Note that there is no bias due to BN + fan_out = m.kernel_size[0] * m.kernel_size[1] * m.out_channels + m.weight.data.normal_(mean=0.0, std=math.sqrt(2.0 / fan_out)) + elif isinstance(m, nn.BatchNorm2d): + zero_init_gamma = ( + hasattr(m, "final_bn") and m.final_bn and zero_init_final_gamma + ) + m.weight.data.fill_(0.0 if zero_init_gamma else 1.0) + m.bias.data.zero_() + elif isinstance(m, nn.Linear): + m.weight.data.normal_(mean=0.0, std=0.01) + m.bias.data.zero_()
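+
+
+if __name__ == "__main__":
+    # Minimal usage sketch: the init helpers above are meant to be applied with
+    # `nn.Module.apply`. The tiny Conv-BN-Linear stack below is only an
+    # illustrative toy module and is not part of the YOWOv2 model code.
+    import torch
+
+    toy_net = nn.Sequential(
+        nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False),
+        nn.BatchNorm2d(8),
+        nn.Flatten(),
+        nn.Linear(8 * 4 * 4, 10),
+    )
+    toy_net.apply(init_weights)
+
+    # sanity check: forward a dummy 4x4 input and print the output shape
+    out = toy_net(torch.randn(1, 3, 4, 4))
+    print(out.shape)  # torch.Size([1, 10])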