From 130c07f40375c9a5a89e78a25672ebc7fadbcfdf Mon Sep 17 00:00:00 2001
From: chenxinfeng
Date: Sat, 6 Apr 2024 03:10:00 +0800
Subject: [PATCH] add toCUDA

---
 README.md            | 61 ++++++++++++++++++++++++++++++++++++++------
 README_CN.md         | 50 ++++++++++++++++++++++++++++++++++--
 ffmpegcv/__init__.py |  8 ++++++
 3 files changed, 109 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 5798de8..6225fda 100644
--- a/README.md
+++ b/README.md
@@ -8,16 +8,18 @@ English Version | [中文版本](./README_CN.md) | [Resume 开发者简历 陈昕枫](https://gitee.com/lilab/chenxinfeng-cv/blob/master/README.md)
 
-The ffmpegcv provide Video Reader and Video Witer with ffmpeg backbone, which are faster and powerful than cv2.
+The ffmpegcv provides a Video Reader and Video Writer with an ffmpeg backbone, which are faster and more powerful than cv2. Integrating ffmpegcv into your deep-learning pipeline is very smooth.
 
 - The ffmpegcv is api **compatible** to open-cv.
 - The ffmpegcv can use **GPU accelerate** encoding and decoding*.
-- The ffmpegcv support much more video **codecs** v.s. open-cv.
-- The ffmpegcv support **RGB** & BGR & GRAY format as you like.
-- The ffmpegcv support **Stream reading** (IP Camera) in low latency.
-- The ffmpegcv can support ROI operations.You can **crop**, **resize** and **pad** the ROI.
+- The ffmpegcv supports many more video **codecs** than open-cv.
+- The ffmpegcv supports **RGB** & BGR & GRAY formats as you like.
+- The ffmpegcv supports fp32 CHW & HWC shortcuts to CUDA.
+- The ffmpegcv supports **Stream reading** (IP Camera) in low latency.
+- The ffmpegcv supports ROI operations. You can **crop**, **resize** and **pad** the ROI.
+- The ffmpegcv supports a shortcut for the CUDA memory copy.
 
-In all, ffmpegcv is just similar to opencv api. But is has more codecs and does't require opencv installed.
+In all, ffmpegcv is similar to the opencv api, but it has more codecs and doesn't require opencv to be installed. It's great for a deep-learning pipeline.
 
 ## Functions:
 - `VideoWriter`: Write a video file.
@@ -28,6 +30,7 @@ In all, ffmpegcv is just similar to opencv api. But is has more codecs and does'
 - `VideoCaptureStream`: Read a RTP/RTSP/RTMP/HTTP stream.
 - `VideoCaptureStreamRT`: Read a RTSP stream (IP Camera) in real time low latency as possible.
 - `noblock`: Read/Write a video file in background using mulitprocssing.
+- `toCUDA`: Convert a video/stream to CHW/HWC float32 frames on a CUDA device, >2x faster.
 
 ## Install
 You need to download ffmpeg before you can use ffmpegcv.
@@ -38,7 +41,8 @@ You need to download ffmpeg before you can use ffmpegcv.
     #1D. CONDA: conda install ffmpeg=6.0.0   #don't use the default 4.x.x version
 
     #2. python
-    pip install ffmpegcv
+    pip install ffmpegcv                                        #stable version
+    pip install git+https://github.com/chenxinfeng4/ffmpegcv    #latest version
 ```
 
 ## When should choose `ffmpegcv` other than `opencv`:
@@ -46,7 +50,7 @@ You need to download ffmpeg before you can use ffmpegcv.
 - The `opencv` packages too much image processing toolbox. You just want a simple video/camero IO with GPU accessible.
 - The `opencv` didn't support `h264`/`h265` and other video writers.
 - You want to **crop**, **resize** and **pad** the video/camero ROI.
-
+- You are interested in a deep-learning pipeline.
 
 ## Basic example
 Read a video by CPU, and rewrite it by GPU.
 ```python
@@ -67,6 +71,17 @@ cap = ffmpegcv.VideoCaptureCAM(0)
 cap = ffmpegcv.VideoCaptureCAM("Integrated Camera")
 ```
 
+Deep-learning pipeline.
+```python
+# video -> crop -> resize -> RGB -> CUDA:CHW float32 -> model
+cap = ffmpegcv.toCUDA(
+    ffmpegcv.VideoCaptureNV(file, pix_fmt='nv12', resize=(W,H)),
+    tensor_format='CHW')
+
+for frame_CHW_cuda in cap:
+    frame_CHW_cuda = (frame_CHW_cuda - mean) / std
+    result = model(frame_CHW_cuda)
+```
 ## Cross platform
 The ffmpegcv is based on Python+FFmpeg, it can cross platform among `Windows, Linux, Mac, X86, Arm`systems.
@@ -172,6 +187,36 @@ cap = ffmpegcv.VideoCapture(file, resize=(640, 480), resize_keepratio=True)
 cap = ffmpegcv.VideoCapture(file, crop_xywh=(0, 0, 640, 480), resize=(512, 512))
 ```
 
+## toCUDA device
+---
+The ffmpegcv can translate the video/stream from HWC-uint8 on the CPU to CHW-float32 on the CUDA device. It significantly reduces your CPU load, and is >2x faster than a manual conversion.
+
+Prepare your environment. A CUDA environment is required, and so is the `pycuda` package. The `pytorch` package is optional.
+> nvcc --version     # check that the NVIDIA CUDA compiler is installed
+> pip install pycuda # install the pycuda package
+
+```python
+# Read a video file to the CUDA device, the original way
+cap = ffmpegcv.VideoCaptureNV(file, pix_fmt='rgb24')
+ret, frame_HWC_CPU = cap.read()
+frame_CHW_CUDA = torch.from_numpy(frame_HWC_CPU).permute(2, 0, 1).cuda().contiguous().float()  # 120fps, 1200% CPU load
+
+# speed up
+cap = toCUDA(ffmpegcv.VideoCapture(file, pix_fmt='yuv420p'))  #required: yuv420p for the cpu codec
+cap = toCUDA(ffmpegcv.VideoCaptureNV(file, pix_fmt='nv12'))   #required: nv12 for the gpu codec
+
+ret, frame_CHW_pycuda = cap.read()            #380fps, 200% CPU load, [pycuda array]
+ret, frame_CHW_pycudamem = cap.read_cudamem() #same, as [pycuda mem_alloc]
+ret, frame_CHW_CUDA = cap.read_torch()        #same, as [pytorch tensor]
+
+frame_CHW_pycuda[:] = (frame_CHW_pycuda - mean) / std  #normalize
+```
+
+Why is `toCUDA` faster in your deep-learning pipeline?
+> 1. ffmpeg uses the cpu to convert the video pix_fmt from the original YUV to RGB24, which is slow. The ffmpegcv uses cuda to accelerate the pix_fmt conversion.
+> 2. Using `yuv420p` or `nv12` saves cpu load and reduces the memory copy from CPU to GPU.
+> 3. ffmpeg stores the image in HWC format. The ffmpegcv can use both HWC & CHW formats to accelerate the video reading.
+
 ## Video Writer
 ---
 ```python
diff --git a/README_CN.md b/README_CN.md
index 34a4b38..a100ee3 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -8,7 +8,7 @@
 [English Version](./README.md) | 中文版本 | [Resume 开发者简历 陈昕枫](https://gitee.com/lilab/chenxinfeng-cv/blob/master/README.md)
 
-ffmpegcv提供了基于ffmpeg的视频读取器和视频编写器,比cv2更快和更强大。
+ffmpegcv提供了基于ffmpeg的视频读取器和视频编写器,比cv2更快和更强大。适合深度学习的视频处理。
 
 - ffmpegcv与open-cv具有**兼容**的API。
 - ffmpegcv可以使用**GPU加速**编码和解码。
@@ -17,6 +18,7 @@ ffmpegcv提供了基于ffmpeg的视频读取器和视频编写器,比cv2更快
 - ffmpegcv支持网络**流视频读取** (网线监控相机)。
 - ffmpegcv支持ROI(感兴趣区域)操作,可以对ROI进行**裁剪**、**调整大小**和**填充**。
+- ffmpegcv支持导出图像帧到CUDA设备。
 总之,ffmpegcv与opencv的API非常相似。但它具有更多的编码器,并且不需要安装opencv。
 
 ## 功能:
 - `VideoWriter`:写入视频文件。
@@ -27,6 +28,7 @@ ffmpegcv提供了基于ffmpeg的视频读取器和视频编写器,比cv2更快
 - `VideoCaptureStream`:读取RTP/RTSP/RTMP/HTTP流。
 - `VideoCaptureStreamRT`: 读取RTSP流 (网线监控相机),实时、低延迟。
 - `noblock`:在后台读取视频文件(更快),使用多进程。
+- `toCUDA`:将图像帧导出到CUDA设备,以 CHW/HWC-float32 格式存储,超过2倍性能提升。
 
 ## 安装
 在使用ffmpegcv之前,您需要下载`ffmpeg`。
@@ -37,7 +39,8 @@ ffmpegcv提供了基于ffmpeg的视频读取器和视频编写器,比cv2更快
     #1D. CONDA: conda install ffmpeg=6.0.0
 
     #2. python
-    pip install ffmpegcv
+    pip install ffmpegcv                                        #stable version
+    pip install git+https://github.com/chenxinfeng4/ffmpegcv    #latest version
 ```
 
 ## 何时选择 `ffmpegcv` 而不是 `opencv`:
@@ -67,6 +70,18 @@ cap = ffmpegcv.VideoCaptureCAM(0)
 cap = ffmpegcv.VideoCaptureCAM("Integrated Camera")
 ```
 
+深度学习流水线
+```python
+# video -> crop -> resize -> RGB -> CUDA:CHW float32 -> model
+cap = ffmpegcv.toCUDA(
+    ffmpegcv.VideoCaptureNV(file, pix_fmt='nv12', resize=(W,H)),
+    tensor_format='CHW')
+
+for frame_CHW_cuda in cap:
+    frame_CHW_cuda = (frame_CHW_cuda - mean) / std
+    result = model(frame_CHW_cuda)
+```
+
 ## GPU加速
 - 仅支持NVIDIA显卡,在 x86_64 上测试。
 - 原生支持**Windows**, **Linux**, **Anaconda**。
@@ -167,6 +182,37 @@ cap = ffmpegcv.VideoCapture(file, resize=(640, 480), resize_keepratio=True)
 cap = ffmpegcv.VideoCapture(file, crop_xywh=(0, 0, 640, 480), resize=(512, 512))
 ```
 
+## toCUDA 将图像帧快速导出到CUDA设备
+---
+ffmpegcv 可以将 CPU 上的 HWC-uint8 视频/流转换为 CUDA 设备上的 CHW-float32。它可以显著降低 CPU 负载,并且比手动转换快 2 倍以上。
+
+准备环境:需要具备 cuda 环境,并安装 pycuda 包;pytorch 包是可选的。
+> nvcc --version     # 检查是否已安装 NVIDIA CUDA 编译器
+> pip install pycuda # 安装 pycuda
+
+```python
+# 读取视频到CUDA设备,加速前
+cap = ffmpegcv.VideoCaptureNV(file, pix_fmt='rgb24')
+ret, frame_HWC_CPU = cap.read()
+frame_CHW_CUDA = torch.from_numpy(frame_HWC_CPU).permute(2, 0, 1).cuda().contiguous().float()  # 120fps, 1200% CPU 使用率
+
+# 加速后
+cap = toCUDA(ffmpegcv.VideoCapture(file, pix_fmt='yuv420p'))  #必须设置, yuv420p 针对 cpu 解码器
+cap = toCUDA(ffmpegcv.VideoCaptureNV(file, pix_fmt='nv12'))   #必须设置, nv12 针对 gpu 解码器
+
+ret, frame_CHW_pycuda = cap.read()            #380fps, 200% CPU 使用率, [pycuda array]
+ret, frame_CHW_pycudamem = cap.read_cudamem() #同上, [pycuda mem_alloc]
+ret, frame_CHW_CUDA = cap.read_torch()        #同上, [pytorch tensor]
+
+frame_CHW_pycuda[:] = (frame_CHW_pycuda - mean) / std  #归一化
+```
+
+为什么在深度学习流水线中使用 toCUDA 会更快?
+
+> 1. ffmpeg 使用 CPU 将视频像素格式从原始 YUV 转换为 RGB24,这个过程很慢。`toCUDA` 使用 cuda 加速像素格式转换。
+> 2. 使用 yuv420p 或 nv12 可以节省 CPU 负载并减少从 CPU 到 GPU 的内存复制。
+> 3. ffmpeg 将图像存储为 HWC 格式。ffmpegcv 可以使用 HWC 和 CHW 格式来加速视频读取。
+
 ## 视频写入器
 ---
 ```python
diff --git a/ffmpegcv/__init__.py b/ffmpegcv/__init__.py
index daeba43..0596308 100644
--- a/ffmpegcv/__init__.py
+++ b/ffmpegcv/__init__.py
@@ -517,3 +517,11 @@ def VideoCapturePannels(
     )
 
 VideoReaderPannels = VideoCapturePannels
+
+
+def toCUDA(vid: FFmpegReader, gpu: int = 0, tensor_format: str = 'chw') -> FFmpegReader:
+    """
+    Convert frames to a float32 CUDA tensor in 'chw' or 'hwc' format.
+    """
+    from ffmpegcv.ffmpeg_reader_cuda import FFmpegReaderCUDA
+    return FFmpegReaderCUDA(vid, gpu, tensor_format)
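
Note on the conversion this patch advertises: `toCUDA` changes both the memory layout and the dtype of each frame (HWC uint8 to CHW float32). A minimal CPU-side sketch of that same layout/dtype change with numpy, so the effect is easy to inspect; the helper name `to_chw_float32` is illustrative only and not part of the ffmpegcv API:

```python
import numpy as np

def to_chw_float32(frame_hwc: np.ndarray) -> np.ndarray:
    """CPU reference for the transform toCUDA performs on-device:
    HWC uint8 -> CHW float32, contiguous in memory."""
    return np.ascontiguousarray(frame_hwc.transpose(2, 0, 1).astype(np.float32))

# a small synthetic 4x6 RGB frame stands in for a decoded video frame
frame = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)
chw = to_chw_float32(frame)
print(chw.shape, chw.dtype)  # (3, 4, 6) float32
```

On the GPU the same transform is fused with the YUV-to-RGB conversion, which is why the patched reader avoids the separate `permute(...).cuda().contiguous().float()` chain shown in the README example.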
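
The patch recommends `pix_fmt='yuv420p'` (CPU decode) or `'nv12'` (GPU decode) because 4:2:0 formats carry 1.5 bytes per pixel (a full-resolution Y plane plus quarter-resolution chroma) versus 3 bytes for `rgb24`, which halves the CPU-to-GPU copy. A quick sketch of that arithmetic; the helper `frame_bytes` is hypothetical, not an ffmpegcv function:

```python
def frame_bytes(width: int, height: int, pix_fmt: str) -> int:
    """Bytes per decoded frame for the pixel formats discussed above.
    yuv420p/nv12: 1.5 bytes/pixel; rgb24: 3 bytes/pixel."""
    if pix_fmt in ('yuv420p', 'nv12'):
        return width * height * 3 // 2
    if pix_fmt == 'rgb24':
        return width * height * 3
    raise ValueError(f'unsupported pix_fmt: {pix_fmt}')

w, h = 1920, 1080
ratio = frame_bytes(w, h, 'rgb24') / frame_bytes(w, h, 'nv12')
print(ratio)  # 2.0
```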