Commit

Hw1 (#1)
* Fill in the blanked-out code

I haven't tested it yet, but I wrote most of the logical flow the homework requires.
I wrote the draw function and the main logic. However, I didn't understand the 4th requirement.

* Change indentation unit to tabs

In the previous commit, two-space and four-space indentation were mixed,
so I converted everything to tabs.

* Create layers, not yet weights

I created the YOLO-v2-tiny model, but the pretrained weights are delivered in pickle format and I haven't figured out how to use them yet. I'll load them soon.
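For reference, a minimal sketch of loading such a pickle, assuming — as the references/architecture.txt dump later in this commit shows — it holds a list of per-layer OrderedDicts of numpy arrays; the file name follows the path used later in __init__.py:

import pickle

import numpy as np

# Load the pretrained weights: a list with one OrderedDict per conv layer.
with open("./y2t_weights.pickle", "rb") as f:
    weights = pickle.load(f)

# Inspect what each layer provides (kernel, biases, batch-norm stats, ...).
for i, layer in enumerate(weights):
    for name, array in layer.items():
        print("conv{}[{}]: {}".format(i, name, np.shape(array)))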

* Refactor duplicated code into a function

* Fix bugs in opening the video file and in the resizing function

* Set batch size to 1 because we only run inference

* (Incomplete) Consumes too much (more than 100 GB) memory

* Fix a typo

* Fix memory allocation error

I meant to build filters from 16 up to 1024, but I accidentally went from 16 up to pow(16, 9) ~ 6e10 ~ 60G.
That was why my code could not allocate enough memory.

Plus, I fixed the professor's incorrect code: sess.run evaluated only the last layer.
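The fix, roughly: pass all layer outputs to sess.run as a list rather than only the final tensor — sess.run accepts a list of fetches. A sketch, where out_tensors and input_tensor are illustrative names:

# Before (only the last layer is evaluated):
#   result = sess.run(out_tensors[-1], feed_dict={input_tensor: image})
# After (every layer's output is evaluated, so intermediates can be saved too):
results = sess.run(out_tensors, feed_dict={input_tensor: image})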

* Print throughput with fixed precision

* Add some information about the yolov2tiny architecture

* (Incomplete) Find the bottleneck in non-max-suppression

To find it, I wrote a tracer and attached it to some functions.

* Fix wrongly indented code in the nms function

* Reshape bounding boxes from (416, 416) to the original video resolution

* Add parameters

* Show attributes

* Store every layer of the first frame into the intermediate folder

* WIP: Make room for biases; not implemented yet because I'm still figuring out how to use the weights

* Change video codec to mp4v

* WIP: Create layers hierarchically

YOLO-v2-tiny consists of nine composite layers,
and each composite layer consists of smaller layers such as conv, batch_norm, bias, maxpool, and leakyReLU.
Therefore, I mimicked this hierarchy.
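A sketch of one such composite layer built from tf.nn primitives, assuming the per-layer weight dict layout shown later in references/architecture.txt; the ordering conv -> batch_norm -> bias -> leakyReLU -> maxpool is one plausible arrangement, and the final 1x1 conv layer would use only kernel and biases:

import tensorflow as tf

def composite_layer(x, w, pool=True, pool_stride=2):
    # conv: kernel from the pickle, stride 1, SAME padding
    x = tf.nn.conv2d(x, tf.Variable(w["kernel"]),
                     strides=[1, 1, 1, 1], padding="SAME")
    # batch_norm: moving statistics and gamma from the pickle
    x = tf.nn.batch_normalization(x,
                                  mean=tf.Variable(w["moving_mean"]),
                                  variance=tf.Variable(w["moving_variance"]),
                                  offset=None,
                                  scale=tf.Variable(w["gamma"]),
                                  variance_epsilon=1e-5)
    x = tf.nn.bias_add(x, tf.Variable(w["biases"]))
    x = tf.nn.leaky_relu(x, alpha=0.1)
    if pool:
        x = tf.nn.max_pool2d(x, ksize=2, strides=pool_stride, padding="SAME")
    return x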

* Infer objects correctly

* Add explicit bias layers right after each conv layer.
* Load weights by using tf.Variable and the corresponding layers in tf.nn.
* Use the left-upper and right-bottom coordinates in the draw function, since the coordinates are already reshaped in the restore_shape function.
* Remove unused comments and debug lines.
* Use the original image in the draw function.

* Measure inference time, end-to-end time, FPS, and total time

* Update for LaTeX

* Add report template

* Limit yolo to at most 70% of GPU VRAM

In a small-VRAM environment, the allow_growth option alone is not enough to prevent out-of-memory errors.
So, based on some references, I forced it not to take more than 70% of VRAM.
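A sketch of the session configuration, using the TF1-style options this assignment relies on:

import tensorflow as tf

# Cap this process at 70% of GPU memory, and still grow allocations lazily.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7,
                            allow_growth=True)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))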

* Clarify which values we save

* Update yolov2tiny.py

Delete out_chan and the default value of stride.
For consistency, how about renaming maxpool to just max_pool2d as well? (Commented out for now.)

* Update yolov2tiny.py

* Update __init__.py

Add a start time for the end-to-end time, beg_start, and rename the previous beg to beg_infer.
Does renaming beg in obj_detection affect the measure function? (I'm not sure.)

* Update yolov2tiny.py

Put the n_... values back into postprocessing.

* Create consider.txt

* Update yolov2tiny.py

Confirm that tf.nn.max_pool2d works well.

* Update consider.txt

* Update __init__.py

Move the part that saves the first frame's intermediate results (tensors) down, to measure the necessary time.

* Update consider.txt

* Update __init__.py

Add printing of the total time.

* Add some details

* Add comment about inference FPS

* Write detailed info on why I chose tf.nn functions

* Upload whole-model visualization

I visualized the whole tf graph by using tf.train.Saver.
The only catch is that it is too verbose to see the main logic,
but I decided to save the visualized graph just in case.
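One common way to get a browsable graph is to write it out for TensorBoard; a sketch of that route, not necessarily the exact steps used here (sess is assumed to hold the built graph, and the log directory is illustrative):

import tensorflow as tf

# Dump the current session's graph so TensorBoard can render it,
# then run: tensorboard --logdir ./references/graph
writer = tf.summary.FileWriter("./references/graph", sess.graph)
writer.close()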

* Add GPU benchmark

* Add CPU benchmark

* Add first draft of the report

* Update report.tex

Some changes.

* Update report.tex

* Change code location and table

* Edit figures

Co-authored-by: jehoon315 <[email protected]>
snowphone and jehoon315 authored Apr 20, 2020
1 parent 4c5f23c commit 94de1fc
Showing 15 changed files with 8,679 additions and 250 deletions.
6 changes: 6 additions & 0 deletions .gitignore
100644 → 100755
@@ -127,3 +127,9 @@ dmypy.json

# Pyre type checker
.pyre/


# LaTeX
**/*.aux
**/*-eps-converted-to.pdf
**/*.log
225 changes: 156 additions & 69 deletions __init__.py
@@ -1,84 +1,171 @@
import os
import sys

# --- Removed by this commit: the original skeleton ---

import numpy as np
import cv2 as cv2
import time
import yolov2tiny

def open_video_with_opencv(in_video_path, out_video_path):
    #
    # This function takes input and output video paths and opens them.
    #
    # Your code from here. You may clear the comments.
    #
    print('open_video_with_opencv is not yet implemented')
    sys.exit()

    # Open an object of input video using cv2.VideoCapture.

    # Open an object of output video using cv2.VideoWriter.

    # Return the video objects and anything you want for further process.

def resize_input(im):
    imsz = cv2.resize(im, (416, 416))
    imsz = imsz / 255.
    imsz = imsz[:,:,::-1]
    return np.asarray(imsz, dtype=np.float32)

def video_object_detection(in_video_path, out_video_path, proc="cpu"):
    #
    # This function runs the inference for each frame and creates the output video.
    #
    # Your code from here. You may clear the comments.
    #
    print('video_object_detection is not yet implemented')
    sys.exit()

    # Open video using open_video_with_opencv.

    # Check if video is opened. Otherwise, exit.

    # Create an instance of the YOLO_V2_TINY class. Pass the dimension of
    # the input, a path to weight file, and which device you will use as arguments.

    # Start the main loop. For each frame of the video, the loop must do the following:
    # 1. Do the inference.
    # 2. Run postprocessing using the inference result, and accumulate the frames
    #    through the video writer object. The coordinates from postprocessing are
    #    calculated according to the resized input; you must adjust them to fit
    #    the original video.
    # 3. Measure the end-to-end time and the time spent only for inferencing.
    # 4. Save the intermediate values for the first layer.
    # Note that your input must be adjusted to fit into the algorithm,
    # including resizing the frame and changing the dimension.

    # Check the inference performance: end-to-end elapsed time and inferencing time.
    # Check how many frames are processed per second, respectively.

    # Release the opened videos.

# --- Added by this commit: the implemented version ---

from datetime import datetime
from functools import reduce, wraps
from typing import List, Tuple

import cv2
import numpy as np

import yolov2tiny


def measure(func):
    """Measure how long a function takes."""
    @wraps(func)
    def impl(*args, **kargs):
        beg = datetime.now()
        ret = func(*args, **kargs)
        elapsed = (datetime.now() - beg).total_seconds()
        print("{}: {}s".format(func.__name__, elapsed))
        return ret

    return impl


def open_video_with_opencv(
        in_video_path: str,
        out_video_path: str) -> (cv2.VideoCapture, cv2.VideoWriter):

    reader = cv2.VideoCapture(in_video_path)
    if not reader.isOpened():
        raise Exception("Failed to open '{}'".format(in_video_path))

    fps = reader.get(cv2.CAP_PROP_FPS)
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    width = int(reader.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(reader.get(cv2.CAP_PROP_FRAME_HEIGHT))

    writer = cv2.VideoWriter(out_video_path, fourcc, fps, (width, height))
    if not writer.isOpened():
        raise Exception(
            "Failed to create video named '{}'".format(out_video_path))

    return reader, writer


def resize_input(im: np.ndarray) -> np.ndarray:
    imsz = cv2.resize(im, (416, 416), interpolation=cv2.INTER_AREA)
    imsz = imsz / 255.
    imsz = imsz[:, :, ::-1]
    imsz = np.asarray(imsz, dtype=np.float32)
    return imsz.reshape((1, *imsz.shape))


color_t = Tuple[float, float, float]
coord_t = Tuple[int, int]
proposal_t = Tuple[str, coord_t, coord_t, color_t]


def restore_shape(proposals: List[proposal_t], restore_width: int,
                  restore_height: int) -> List[proposal_t]:
    """
    Read the proposal list and rescale the proposal coordinates to the
    original video's resolution.
    """
    def reshape(record: proposal_t) -> proposal_t:
        """
        Take a record and rescale its coordinates to the original ratio.
        cf) lu means left-upper and rb means right-bottom.
        """
        calc_coord = lambda x, new_d: np.clip(int(x / 416 * new_d), 0, new_d)
        name, (lux, luy), (rbx, rby), color = record
        lux, rbx = map(lambda x: calc_coord(x, restore_width), [lux, rbx])
        luy, rby = map(lambda y: calc_coord(y, restore_height), [luy, rby])
        return (name, (lux, luy), (rbx, rby), color)

    return [reshape(it) for it in proposals]


def draw(image: np.ndarray, proposals: List[proposal_t]) -> np.ndarray:
    '''
    Draw bounding boxes onto the image and return it.
    proposals contains a list of (best_class_name, lefttop, rightbottom, color).
    '''
    for name, lefttop, rightbottom, color in proposals:
        cv2.rectangle(image, lefttop, rightbottom, color, 2)
        cv2.putText(image, name, (lefttop[0], max(0, lefttop[1] - 10)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

    return image


def store_tensors(tensors: List[np.ndarray]):
    os.makedirs("intermediate", exist_ok=True)
    for i, tensor in enumerate(tensors):
        path = os.path.join("intermediate", "layer_{}.npy".format(i))
        np.save(path, tensor)


@measure
def video_object_detection(in_video_path: str,
                           out_video_path: str,
                           proc="cpu"):
    """
    Read a video file, scan each frame, and draw detected objects using the
    pretrained yolo_v2_tiny model. Finally, store the drawn frames in
    'out_video_path'.
    """
    reader, writer = open_video_with_opencv(in_video_path, out_video_path)
    yolo = yolov2tiny.YOLO_V2_TINY((416, 416, 3), "./y2t_weights.pickle", proc)

    width = int(reader.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(reader.get(cv2.CAP_PROP_FRAME_HEIGHT))

    acc, first_time = [], True
    while reader.isOpened():
        okay, original_image = reader.read()
        if not okay:
            break
        beg_start = datetime.now()
        image = resize_input(original_image)

        beg_infer = datetime.now()
        batched_tensors_list = yolo.inference(image)
        inference_time = (datetime.now() - beg_infer).total_seconds()

        tensor = batched_tensors_list[-1][0]

        proposals = yolov2tiny.postprocessing(tensor)
        proposals = restore_shape(proposals, width, height)
        out_image = draw(original_image, proposals)
        writer.write(out_image)

        end_to_end_time = (datetime.now() - beg_start).total_seconds()
        acc.append((inference_time, end_to_end_time))
        print("#{} inference: {:.3f}\tend-to-end: {:.3f}".format(
            len(acc), inference_time, end_to_end_time))

        if first_time:
            # Drop the batch dimension before saving each layer's output.
            store_tensors([t[0] for t in batched_tensors_list])
            first_time = False

    reader.release()
    writer.release()

    inference_sum, end_to_end_sum = reduce(
        lambda x, y: (x[0] + y[0], x[1] + y[1]), acc)
    size = len(acc)
    print("Total inference: {:.3f}s\ttotal end-to-end: {:.3f}s".format(
        inference_sum, end_to_end_sum))
    print("Average inference: {:.3f}s\taverage end-to-end: {:.3f}s".format(
        inference_sum / size, end_to_end_sum / size))
    print("Throughput: {:.3f}fps".format(size / end_to_end_sum))

def main():
    if len(sys.argv) < 3:
        print(
            "Usage: python3 __init__.py [in_video.mp4] [out_video.mp4] ([cpu|gpu])"
        )
        sys.exit()

    in_video_path = sys.argv[1]
    out_video_path = sys.argv[2]

    if len(sys.argv) == 4:
        proc = sys.argv[3]
    else:
        proc = "cpu"

    video_object_detection(in_video_path, out_video_path, proc)


if __name__ == "__main__":
    main()
Binary file modified proj1.pdf
Binary file not shown.
1 change: 1 addition & 0 deletions references/Model.onnx.svg
54 changes: 54 additions & 0 deletions references/architecture.txt
@@ -0,0 +1,54 @@
Type: <class 'list'>
Length: 9
Element type: <class 'collections.OrderedDict'>
conv0
conv0[kernel]: (3, 3, 3, 16)
conv0[biases]: (16,)
conv0[moving_variance]: (16,)
conv0[gamma]: (16,)
conv0[moving_mean]: (16,)
conv1
conv1[kernel]: (3, 3, 16, 32)
conv1[biases]: (32,)
conv1[moving_variance]: (32,)
conv1[gamma]: (32,)
conv1[moving_mean]: (32,)
conv2
conv2[kernel]: (3, 3, 32, 64)
conv2[biases]: (64,)
conv2[moving_variance]: (64,)
conv2[gamma]: (64,)
conv2[moving_mean]: (64,)
conv3
conv3[kernel]: (3, 3, 64, 128)
conv3[biases]: (128,)
conv3[moving_variance]: (128,)
conv3[gamma]: (128,)
conv3[moving_mean]: (128,)
conv4
conv4[kernel]: (3, 3, 128, 256)
conv4[biases]: (256,)
conv4[moving_variance]: (256,)
conv4[gamma]: (256,)
conv4[moving_mean]: (256,)
conv5
conv5[kernel]: (3, 3, 256, 512)
conv5[biases]: (512,)
conv5[moving_variance]: (512,)
conv5[gamma]: (512,)
conv5[moving_mean]: (512,)
conv6
conv6[kernel]: (3, 3, 512, 1024)
conv6[biases]: (1024,)
conv6[moving_variance]: (1024,)
conv6[gamma]: (1024,)
conv6[moving_mean]: (1024,)
conv7
conv7[kernel]: (3, 3, 1024, 1024)
conv7[biases]: (1024,)
conv7[moving_variance]: (1024,)
conv7[gamma]: (1024,)
conv7[moving_mean]: (1024,)
conv8
conv8[kernel]: (1, 1, 1024, 125)
conv8[biases]: (125,)
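What these shapes imply, as a quick check: five stride-2 maxpools (after conv0 through conv4; the pool after conv5 keeps stride 1) shrink the 416x416 input by a factor of 32 to a 13x13 grid, and conv8's 125 output channels match the standard YOLOv2 VOC head of 5 anchor boxes x (20 classes + 5 box values). A sketch of the arithmetic:

size = 416
for _ in range(5):       # stride-2 maxpools after conv0..conv4
    size //= 2
print(size)              # 13: the final grid
print(5 * (20 + 5))      # 125: output channels of conv8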
1 change: 1 addition & 0 deletions references/yolov2-tiny.cfg.svg
62 changes: 62 additions & 0 deletions report/consider.txt
@@ -0,0 +1,62 @@
Write a report of one or two pages. Your report must include

1. how you implemented it (if there are any other important points, add them too)

1) open video with openCV

Video reading and writing use openCV. The output video attributes are copied from the input via the reader.get method to stay as close to the original as possible; however, since each OS supports different codecs by default, mp4v is used as the fourcc.

2) yolov2tiny tensor graph building

GPU memory fraction 0.7 -> added because memory blew up. The allow_growth option alone was not enough.
Implemented the 40 layers of YOLOv2-tiny using TensorFlow.
Since the goal of this homework is to set the weight parameters directly from the given weight ndarrays and then run inference, we used the functions in the tf.nn module, which allow assigning weight parameters manually, instead of the tf.contrib module, which is more convenient but does not allow changing the weights.
While creating each layer, we applied the weights appropriate to it. From the provided pickle file, kernel was used to initialize the conv_2d layer, biases the bias_add layer, and (moving_variance, gamma, moving_mean) the batch_normalization layer.

3) obj detection

On every loop iteration, the input frame is resized to fit yolov2tiny and used as the input for inference. The inferred tensor is passed to the postprocessing function to extract bounding boxes. In this step, bounding boxes with confidence below the threshold are discarded, and non-max-suppression keeps only the single bounding box with the highest confidence for each object.
The boxes are resized back to the original size, combined with the input frame, and the output frame is stored.
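A minimal NumPy sketch of the suppression step described above — confidence thresholding, then greedily keeping the highest-confidence box per object by IoU; illustrative only, not the actual postprocessing code:

import numpy as np

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def nms(boxes, scores, conf_thres=0.5, iou_thres=0.45):
    # Drop low-confidence boxes, then suppress overlapping ones.
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_thres]
    keep = []
    while order:
        best, rest = order[0], order[1:]
        keep.append(best)
        order = [i for i in rest if iou(boxes[best], boxes[i]) < iou_thres]
    return keep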

2. execution time and how many FPS processed (end-to-end, only for inference)

Frames #1-#2 take noticeably longer; it would be good to explain why.
Why? Due to tensor initialization? Or caching?

I suspected it was because we save the first frame's tensors, but moving that code made no difference. (There was a slight difference, but it was not the main cause.)

#1 #2 #3 ...
CPU : 0.157 0.083 0.078 ...
GPU : 1.352 0.104 0.011 ...

total / Inference(frame) / end-to-end(frame) / FPS
CPU : 43.591 / 0.058 / 0.096 / 10.392
GPU : 24.778 / 0.016 / 0.055 / 18.282

Total: I thought this was the last value printed, but that one measures the function's own runtime. The total here is end_to_end_sum (total / 453 = avg. end-to-end).

We should probably run the measurements several times and average them; the values above are from a single run.

I'm not sure whether inference FPS is needed as well. Lecture #5 mentioned "FPS measurement exclusively for DNN computation".
If we later try to improve FPS, the resizing part will stay the same anyway, so inference FPS also seems like the clearer metric.

-> When analyzing the results, how about comparing the inference FPS of the two modes and writing something like: inference benefited from GPU acceleration, but postprocessing is implemented sequentially on the CPU only, so it became the bottleneck in GPU mode?

3. comparison of the execution time on CPU and GPU, with analysis

I guessed the times excluding inference would be similar -> correct, they are similar.
end-to-end - inference
CPU : 0.038
GPU : 0.039

GPU improvement over CPU
Inference : 3.625x
Total : 1.760x


The purpose of the report is to show your understanding. Please keep the answers short and clear.

Video frame size : 540 x 540
Video fps = 30
Video length = 15s
Video frame number : 453
