Commit

Hw1 (#1)
* Fill in the blanked-out code

I haven't tested it yet, but I wrote most of the logical flow the homework requires.
I wrote the draw function and the main logic. However, I didn't understand the 4th requirement.

* Change indentation unit to tabs

In the previous commit, two-space and four-space indentation were mixed,
so I converted everything to tabs.

* Create layers, not yet weights

I created the YOLO-v2-tiny model, but the pretrained weights are delivered in pickle format and I haven't figured out how to use them yet. I'll load them soon.
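For reference, a minimal sketch of loading such a pickle, assuming — as the references/architecture.txt dump later in this commit shows — it holds a list of per-layer OrderedDicts of numpy arrays; the file name follows the path used later in __init__.py:

import pickle

import numpy as np

# Load the pretrained weights: a list with one OrderedDict per conv layer.
with open("./y2t_weights.pickle", "rb") as f:
    weights = pickle.load(f)

# Inspect what each layer provides (kernel, biases, batch-norm stats, ...).
for i, layer in enumerate(weights):
    for name, array in layer.items():
        print("conv{}[{}]: {}".format(i, name, np.shape(array)))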

* Refactor duplicated code into a function

* Fix bugs in opening the video file and in the resizing function

* Set batch size to 1 because we only run inference

* (Incomplete) Consumes too much (more than 100 GB) memory

* Fix a typo

* Fix memory allocation error

I meant to build filters from 16 up to 1024, but I accidentally went from 16 up to pow(16, 9) ~ 6e10 ~ 60G.
That was why my code could not allocate enough memory.

Plus, I fixed the professor's incorrect code: sess.run evaluated only the last layer.
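The fix, roughly: pass all layer outputs to sess.run as a list rather than only the final tensor — sess.run accepts a list of fetches. A sketch, where out_tensors and input_tensor are illustrative names:

# Before (only the last layer is evaluated):
#   result = sess.run(out_tensors[-1], feed_dict={input_tensor: image})
# After (every layer's output is evaluated, so intermediates can be saved too):
results = sess.run(out_tensors, feed_dict={input_tensor: image})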

* Print throughput with fixed precision

* Add some information about the yolov2tiny architecture

* (Incomplete) Find the bottleneck in non-max-suppression

To find it, I wrote a tracer and attached it to some functions.

* Fix wrongly indented code in the nms function

* Reshape bounding boxes from (416, 416) to the original video resolution

* Add parameters

* Show attributes

* Store every layer of the first frame into the intermediate folder

* WIP: Make room for biases; not implemented yet because I'm still figuring out how to use the weights

* Change video codec to mp4v

* WIP: Create layers hierarchically

YOLO-v2-tiny consists of nine composite layers,
and each composite layer consists of smaller layers such as conv, batch_norm, bias, maxpool, and leakyReLU.
Therefore, I mimicked this hierarchy.
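A sketch of one such composite layer built from tf.nn primitives, assuming the per-layer weight dict layout shown later in references/architecture.txt; the ordering conv -> batch_norm -> bias -> leakyReLU -> maxpool is one plausible arrangement, and the final 1x1 conv layer would use only kernel and biases:

import tensorflow as tf

def composite_layer(x, w, pool=True, pool_stride=2):
    # conv: kernel from the pickle, stride 1, SAME padding
    x = tf.nn.conv2d(x, tf.Variable(w["kernel"]),
                     strides=[1, 1, 1, 1], padding="SAME")
    # batch_norm: moving statistics and gamma from the pickle
    x = tf.nn.batch_normalization(x,
                                  mean=tf.Variable(w["moving_mean"]),
                                  variance=tf.Variable(w["moving_variance"]),
                                  offset=None,
                                  scale=tf.Variable(w["gamma"]),
                                  variance_epsilon=1e-5)
    x = tf.nn.bias_add(x, tf.Variable(w["biases"]))
    x = tf.nn.leaky_relu(x, alpha=0.1)
    if pool:
        x = tf.nn.max_pool2d(x, ksize=2, strides=pool_stride, padding="SAME")
    return x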

* Infer objects correctly

* Add explicit bias layers right after each conv layer.
* Load weights by using tf.Variable and the corresponding layers in tf.nn.
* Use the left-upper and right-bottom coordinates in the draw function, since the coordinates are already reshaped in the restore_shape function.
* Remove unused comments and debug lines.
* Use the original image in the draw function.

* Measure inference time, end-to-end time, FPS, and total time

* Update for LaTeX

* Add report template

* Limit yolo to at most 70% of GPU VRAM

In a small-VRAM environment, the allow_growth option alone is not enough to prevent out-of-memory errors.
So, based on some references, I forced it not to take more than 70% of VRAM.
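A sketch of the session configuration, using the TF1-style options this assignment relies on:

import tensorflow as tf

# Cap this process at 70% of GPU memory, and still grow allocations lazily.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7,
                            allow_growth=True)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))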

* Clarify which values we save

* Update yolov2tiny.py

Delete out_chan and the default value of stride.
For consistency, how about renaming maxpool to just max_pool2d as well? (Commented out for now.)

* Update yolov2tiny.py

* Update __init__.py

Add a start time for the end-to-end time, beg_start, and rename the previous beg to beg_infer.
Does renaming beg in obj_detection affect the measure function? (I'm not sure.)

* Update yolov2tiny.py

Put the n_... values back into postprocessing.

* Create consider.txt

* Update yolov2tiny.py

Confirm that tf.nn.max_pool2d works well.

* Update consider.txt

* Update __init__.py

Move the part that saves the first frame's intermediate results (tensors) down, to measure the necessary time.

* Update consider.txt

* Update __init__.py

Add printing of the total time.

* Add some details

* Add comment about inference FPS

* Write detailed info on why I chose tf.nn functions

* Upload whole-model visualization

I visualized the whole tf graph by using tf.train.Saver.
The only catch is that it is too verbose to see the main logic,
but I decided to save the visualized graph just in case.
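One common way to get a browsable graph is to write it out for TensorBoard; a sketch of that route, not necessarily the exact steps used here (sess is assumed to hold the built graph, and the log directory is illustrative):

import tensorflow as tf

# Dump the current session's graph so TensorBoard can render it,
# then run: tensorboard --logdir ./references/graph
writer = tf.summary.FileWriter("./references/graph", sess.graph)
writer.close()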

* Add GPU benchmark

* Add CPU benchmark

* Add first draft of the report

* Update report.tex

Some changes.

* Update report.tex

* Change code location and table

* Edit figures

Co-authored-by: jehoon315 <[email protected]>
snowphone and jehoon315 authored Apr 20, 2020
1 parent 4c5f23c commit 94de1fc
Showing 15 changed files with 8,679 additions and 250 deletions.
6 changes: 6 additions & 0 deletions .gitignore
100644 → 100755
@@ -127,3 +127,9 @@ dmypy.json

# Pyre type checker
.pyre/


# LaTeX
**/*.aux
**/*-eps-converted-to.pdf
**/*.log
225 changes: 156 additions & 69 deletions __init__.py
@@ -1,84 +1,171 @@
import os
import sys

# --- Removed by this commit: the original skeleton ---

import numpy as np
import cv2 as cv2
import time
import yolov2tiny

def open_video_with_opencv(in_video_path, out_video_path):
    #
    # This function takes input and output video paths and opens them.
    #
    # Your code from here. You may clear the comments.
    #
    print('open_video_with_opencv is not yet implemented')
    sys.exit()

    # Open an object of input video using cv2.VideoCapture.

    # Open an object of output video using cv2.VideoWriter.

    # Return the video objects and anything you want for further process.

def resize_input(im):
    imsz = cv2.resize(im, (416, 416))
    imsz = imsz / 255.
    imsz = imsz[:,:,::-1]
    return np.asarray(imsz, dtype=np.float32)

def video_object_detection(in_video_path, out_video_path, proc="cpu"):
    #
    # This function runs the inference for each frame and creates the output video.
    #
    # Your code from here. You may clear the comments.
    #
    print('video_object_detection is not yet implemented')
    sys.exit()

    # Open video using open_video_with_opencv.

    # Check if video is opened. Otherwise, exit.

    # Create an instance of the YOLO_V2_TINY class. Pass the dimension of
    # the input, a path to weight file, and which device you will use as arguments.

    # Start the main loop. For each frame of the video, the loop must do the following:
    # 1. Do the inference.
    # 2. Run postprocessing using the inference result, and accumulate the frames
    #    through the video writer object. The coordinates from postprocessing are
    #    calculated according to the resized input; you must adjust them to fit
    #    the original video.
    # 3. Measure the end-to-end time and the time spent only for inferencing.
    # 4. Save the intermediate values for the first layer.
    # Note that your input must be adjusted to fit into the algorithm,
    # including resizing the frame and changing the dimension.

    # Check the inference performance: end-to-end elapsed time and inferencing time.
    # Check how many frames are processed per second, respectively.

    # Release the opened videos.

# --- Added by this commit: the implemented version ---

from datetime import datetime
from functools import reduce, wraps
from typing import List, Tuple

import cv2
import numpy as np

import yolov2tiny


def measure(func):
    """Measure how long a function takes."""
    @wraps(func)
    def impl(*args, **kargs):
        beg = datetime.now()
        ret = func(*args, **kargs)
        elapsed = (datetime.now() - beg).total_seconds()
        print("{}: {}s".format(func.__name__, elapsed))
        return ret

    return impl


def open_video_with_opencv(
        in_video_path: str,
        out_video_path: str) -> (cv2.VideoCapture, cv2.VideoWriter):

    reader = cv2.VideoCapture(in_video_path)
    if not reader.isOpened():
        raise Exception("Failed to open '{}'".format(in_video_path))

    fps = reader.get(cv2.CAP_PROP_FPS)
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    width = int(reader.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(reader.get(cv2.CAP_PROP_FRAME_HEIGHT))

    writer = cv2.VideoWriter(out_video_path, fourcc, fps, (width, height))
    if not writer.isOpened():
        raise Exception(
            "Failed to create video named '{}'".format(out_video_path))

    return reader, writer


def resize_input(im: np.ndarray) -> np.ndarray:
    imsz = cv2.resize(im, (416, 416), interpolation=cv2.INTER_AREA)
    imsz = imsz / 255.
    imsz = imsz[:, :, ::-1]
    imsz = np.asarray(imsz, dtype=np.float32)
    return imsz.reshape((1, *imsz.shape))


color_t = Tuple[float, float, float]
coord_t = Tuple[int, int]
proposal_t = Tuple[str, coord_t, coord_t, color_t]


def restore_shape(proposals: List[proposal_t], restore_width: int,
                  restore_height: int) -> List[proposal_t]:
    """
    Read the proposal list and rescale the proposal coordinates to the
    original video's resolution.
    """
    def reshape(record: proposal_t) -> proposal_t:
        """
        Take a record and rescale its coordinates to the original ratio.
        cf) lu means left-upper and rb means right-bottom.
        """
        calc_coord = lambda x, new_d: np.clip(int(x / 416 * new_d), 0, new_d)
        name, (lux, luy), (rbx, rby), color = record
        lux, rbx = map(lambda x: calc_coord(x, restore_width), [lux, rbx])
        luy, rby = map(lambda y: calc_coord(y, restore_height), [luy, rby])
        return (name, (lux, luy), (rbx, rby), color)

    return [reshape(it) for it in proposals]


def draw(image: np.ndarray, proposals: List[proposal_t]) -> np.ndarray:
    '''
    Draw bounding boxes onto the image and return it.
    proposals contains a list of (best_class_name, lefttop, rightbottom, color).
    '''
    for name, lefttop, rightbottom, color in proposals:
        cv2.rectangle(image, lefttop, rightbottom, color, 2)
        cv2.putText(image, name, (lefttop[0], max(0, lefttop[1] - 10)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

    return image


def store_tensors(tensors: List[np.ndarray]):
    os.makedirs("intermediate", exist_ok=True)
    for i, tensor in enumerate(tensors):
        path = os.path.join("intermediate", "layer_{}.npy".format(i))
        np.save(path, tensor)


@measure
def video_object_detection(in_video_path: str,
                           out_video_path: str,
                           proc="cpu"):
    """
    Read a video file, scan each frame, and draw detected objects using the
    pretrained yolo_v2_tiny model. Finally, store the drawn frames in
    'out_video_path'.
    """
    reader, writer = open_video_with_opencv(in_video_path, out_video_path)
    yolo = yolov2tiny.YOLO_V2_TINY((416, 416, 3), "./y2t_weights.pickle", proc)

    width = int(reader.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(reader.get(cv2.CAP_PROP_FRAME_HEIGHT))

    acc, first_time = [], True
    while reader.isOpened():
        okay, original_image = reader.read()
        if not okay:
            break
        beg_start = datetime.now()
        image = resize_input(original_image)

        beg_infer = datetime.now()
        batched_tensors_list = yolo.inference(image)
        inference_time = (datetime.now() - beg_infer).total_seconds()

        tensor = batched_tensors_list[-1][0]

        proposals = yolov2tiny.postprocessing(tensor)
        proposals = restore_shape(proposals, width, height)
        out_image = draw(original_image, proposals)
        writer.write(out_image)

        end_to_end_time = (datetime.now() - beg_start).total_seconds()
        acc.append((inference_time, end_to_end_time))
        print("#{} inference: {:.3f}\tend-to-end: {:.3f}".format(
            len(acc), inference_time, end_to_end_time))

        if first_time:
            # Drop the batch dimension before saving each layer's output.
            store_tensors([t[0] for t in batched_tensors_list])
            first_time = False

    reader.release()
    writer.release()

    inference_sum, end_to_end_sum = reduce(
        lambda x, y: (x[0] + y[0], x[1] + y[1]), acc)
    size = len(acc)
    print("Total inference: {:.3f}s\ttotal end-to-end: {:.3f}s".format(
        inference_sum, end_to_end_sum))
    print("Average inference: {:.3f}s\taverage end-to-end: {:.3f}s".format(
        inference_sum / size, end_to_end_sum / size))
    print("Throughput: {:.3f}fps".format(size / end_to_end_sum))

def main():
    if len(sys.argv) < 3:
        print(
            "Usage: python3 __init__.py [in_video.mp4] [out_video.mp4] ([cpu|gpu])"
        )
        sys.exit()

    in_video_path = sys.argv[1]
    out_video_path = sys.argv[2]

    if len(sys.argv) == 4:
        proc = sys.argv[3]
    else:
        proc = "cpu"

    video_object_detection(in_video_path, out_video_path, proc)


if __name__ == "__main__":
    main()
Binary file modified proj1.pdf
Binary file not shown.
1 change: 1 addition & 0 deletions references/Model.onnx.svg
54 changes: 54 additions & 0 deletions references/architecture.txt
@@ -0,0 +1,54 @@
Type: <class 'list'>
Length: 9
Element type: <class 'collections.OrderedDict'>
conv0
conv0[kernel]: (3, 3, 3, 16)
conv0[biases]: (16,)
conv0[moving_variance]: (16,)
conv0[gamma]: (16,)
conv0[moving_mean]: (16,)
conv1
conv1[kernel]: (3, 3, 16, 32)
conv1[biases]: (32,)
conv1[moving_variance]: (32,)
conv1[gamma]: (32,)
conv1[moving_mean]: (32,)
conv2
conv2[kernel]: (3, 3, 32, 64)
conv2[biases]: (64,)
conv2[moving_variance]: (64,)
conv2[gamma]: (64,)
conv2[moving_mean]: (64,)
conv3
conv3[kernel]: (3, 3, 64, 128)
conv3[biases]: (128,)
conv3[moving_variance]: (128,)
conv3[gamma]: (128,)
conv3[moving_mean]: (128,)
conv4
conv4[kernel]: (3, 3, 128, 256)
conv4[biases]: (256,)
conv4[moving_variance]: (256,)
conv4[gamma]: (256,)
conv4[moving_mean]: (256,)
conv5
conv5[kernel]: (3, 3, 256, 512)
conv5[biases]: (512,)
conv5[moving_variance]: (512,)
conv5[gamma]: (512,)
conv5[moving_mean]: (512,)
conv6
conv6[kernel]: (3, 3, 512, 1024)
conv6[biases]: (1024,)
conv6[moving_variance]: (1024,)
conv6[gamma]: (1024,)
conv6[moving_mean]: (1024,)
conv7
conv7[kernel]: (3, 3, 1024, 1024)
conv7[biases]: (1024,)
conv7[moving_variance]: (1024,)
conv7[gamma]: (1024,)
conv7[moving_mean]: (1024,)
conv8
conv8[kernel]: (1, 1, 1024, 125)
conv8[biases]: (125,)
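What these shapes imply, as a quick check: five stride-2 maxpools (after conv0 through conv4; the pool after conv5 keeps stride 1) shrink the 416x416 input by a factor of 32 to a 13x13 grid, and conv8's 125 output channels match the standard YOLOv2 VOC head of 5 anchor boxes x (20 classes + 5 box values). A sketch of the arithmetic:

size = 416
for _ in range(5):       # stride-2 maxpools after conv0..conv4
    size //= 2
print(size)              # 13: the final grid
print(5 * (20 + 5))      # 125: output channels of conv8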
1 change: 1 addition & 0 deletions references/yolov2-tiny.cfg.svg
62 changes: 62 additions & 0 deletions report/consider.txt
@@ -0,0 +1,62 @@
Write a report of one or two pages. Your report must include

1. how you implemented it (if there are any other important points, add them too)

1) open video with openCV

Video reading and writing use openCV. The output video attributes are copied from the input via the reader.get method to stay as close to the original as possible; however, since each OS supports different codecs by default, mp4v is used as the fourcc.

2) yolov2tiny tensor graph building

GPU memory fraction 0.7 -> added because memory blew up. The allow_growth option alone was not enough.
Implemented the 40 layers of YOLOv2-tiny using TensorFlow.
Since the goal of this homework is to set the weight parameters directly from the given weight ndarrays and then run inference, we used the functions in the tf.nn module, which allow assigning weight parameters manually, instead of the tf.contrib module, which is more convenient but does not allow changing the weights.
While creating each layer, we applied the weights appropriate to it. From the provided pickle file, kernel was used to initialize the conv_2d layer, biases the bias_add layer, and (moving_variance, gamma, moving_mean) the batch_normalization layer.

3) obj detection

On every loop iteration, the input frame is resized to fit yolov2tiny and used as the input for inference. The inferred tensor is passed to the postprocessing function to extract bounding boxes. In this step, bounding boxes with confidence below the threshold are discarded, and non-max-suppression keeps only the single bounding box with the highest confidence for each object.
The boxes are resized back to the original size, combined with the input frame, and the output frame is stored.
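A minimal NumPy sketch of the suppression step described above — confidence thresholding, then greedily keeping the highest-confidence box per object by IoU; illustrative only, not the actual postprocessing code:

import numpy as np

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def nms(boxes, scores, conf_thres=0.5, iou_thres=0.45):
    # Drop low-confidence boxes, then suppress overlapping ones.
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_thres]
    keep = []
    while order:
        best, rest = order[0], order[1:]
        keep.append(best)
        order = [i for i in rest if iou(boxes[best], boxes[i]) < iou_thres]
    return keep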

2. execution time and how many FPS processed (end-to-end, only for inference)

Frames #1-#2 take noticeably longer; it would be good to explain why.
Why? Due to tensor initialization? Or caching?

I suspected it was because we save the first frame's tensors, but moving that code made no difference. (There was a slight difference, but it was not the main cause.)

#1 #2 #3 ...
CPU : 0.157 0.083 0.078 ...
GPU : 1.352 0.104 0.011 ...

total / Inference(frame) / end-to-end(frame) / FPS
CPU : 43.591 / 0.058 / 0.096 / 10.392
GPU : 24.778 / 0.016 / 0.055 / 18.282

Total: I thought this was the last value printed, but that one measures the function's own runtime. The total here is end_to_end_sum (total / 453 = avg. end-to-end).

We should probably run the measurements several times and average them; the values above are from a single run.

I'm not sure whether inference FPS is needed as well. Lecture #5 mentioned "FPS measurement exclusively for DNN computation".
If we later try to improve FPS, the resizing part will stay the same anyway, so inference FPS also seems like the clearer metric.

-> When analyzing the results, how about comparing the inference FPS of the two modes and writing something like: inference benefited from GPU acceleration, but postprocessing is implemented sequentially on the CPU only, so it became the bottleneck in GPU mode?

3. comparison of the execution time on CPU and GPU, with analysis

I guessed the times excluding inference would be similar -> correct, they are similar.
end-to-end - inference
CPU : 0.038
GPU : 0.039

GPU improvement over CPU
Inference : 3.625x
Total : 1.760x


The purpose of the report is to show your understanding. Please keep the answers short and clear.

Video frame size : 540 x 540
Video fps = 30
Video length = 15s
Video frame number : 453
