Is Hardware-Accelerated Video Decoding Better for CPU and Memory Efficiency? #1869
Asked by vutung1671997 in Help (unanswered)
🧠 Context
I’m building a real-time RTSP video streaming application using PyAV and evaluating the performance of software decoding vs. hardware-accelerated decoding, specifically looking at CPU and RAM usage per Full HD stream.
Software Decoder (Direct Stream Setup)
✅ All flags and parameters take effect as expected.
⚠️ CPU usage is moderate due to software decoding.
✅ RAM usage is low, around 80 MB per Full HD RTSP stream.
Hardware Decoder (Custom Context Setup)
✅ CPU usage is lower thanks to hardware acceleration.
❌ RAM usage is much higher, around 200–300 MB per Full HD RTSP stream.
❌ Decoding flags (e.g. thread_type, skip_frame) appear to be ignored.
| Method | CPU Usage | RAM Usage (Per Stream) | Flags Effective |
|---|---|---|---|
| Software Decoder (Direct) | Moderate | ~80 MB | ✅ Yes |
| Hardware Decoder (Custom) | Low | ~200–300 MB | ❌ No |
🤔 Questions
Why are the decoding flags (e.g., thread_type, skip_frame, flags2, flags) ignored or ineffective when using a manually created hardware decoder context?
Is there a proper way to apply these parameters for hardware-accelerated codecs in PyAV?
Is the increased memory usage a known limitation of hardware decoding, or is there a workaround to reduce the footprint?
📦 Current Setup
PyAV Version: 14.3.0
Python Version: 3.12.0
🖥 Hardware Specifications
CPU: Intel Core i5-13500
GPU: NVIDIA T1000 8GB
RAM: 32 GB
OS: Windows 11
Hardware Acceleration: cuda