[WIP] Use vt for manually decoding frames. Fixes #533 #535

felipejfc · 2022-11-26T19:14:06Z

Two main changes

1 - Use VideoToolbox to manually decode each frame instead of submitting it directly to AVSampleBufferDisplayLayer; I'm not proud of this change, but it was needed to fix #533. There may be some way to fix the issue without needing this change, but I still didn't manage to do it.

2 - Latency and smoothness changes
2.1 - Use Direct Submit in VideoDecodeRenderer (reduces latency)
2.2 - Use PTS information correctly per frame instead of using the DisplayImmediately flag in each sampleBuffer. Together with the change above, I was able to replicate smooth low latency stream as I get into the Nvidia Shield. I think using the flag messed with frame time and caused jittering.

Right now, I'm breaking the "Smooth Stream" option that we added some months ago, but wanted to create the PR either way for us to discuss options @cgutman

use pts from moonlight server to schedule frame display use decompression callback unused frameRef field to propagate frameType information Use obj-c cb for decode session Revert to direct decode, use PTS correctly

cgutman

These changes are looking pretty good.

I think this has the potential to improve the frame pacing option too. Now that we have access to the decoded samples, we can keep a queue of those and submit those in our display link callback. Since we were queuing compressed video samples before, there was the potential that we could miss a frame deadline if the decoder couldn't finish decoding that frame by the time it was due to be displayed.

cgutman · 2022-11-27T21:36:19Z

Limelight/Stream/VideoDecoderRenderer.m

@@ -378,4 +373,65 @@ - (int)submitDecodeBuffer:(unsigned char *)data length:(int)length bufferType:(i
    return DR_OK;
 }

+- (OSStatus) decodeFrameWithSampleBuffer:(CMSampleBufferRef)sampleBuffer frameType:(int)frameType{
+    VTDecodeFrameFlags flags = kVTDecodeFrame_EnableAsynchronousDecompression;


Does async decompression result in improved performance vs synchronous?

Honestly, I didn't compare sync/async here because I couldn't find a reliable way to measure performance other than my gut feeling. Any ideas?

Not every frame is decompressed asynchronously, this is the correct setting to increase speed.

Limelight/Stream/VideoDecoderRenderer.m

cgutman · 2022-11-27T21:37:43Z

Limelight/Stream/VideoDecoderRenderer.m

+
+        OSStatus res = CMVideoFormatDescriptionCreateForImageBuffer(kCFAllocatorDefault, imageBuffer, &formatDescriptionRef);
+        if (res != noErr){
+            NSLog(@"Failed to create video format description from imageBuffer");


Please change the NSLog() calls to our Log() function instead.

should we also call LiRequestIdrFrame() here?

Limelight/Stream/VideoDecoderRenderer.m

cgutman · 2022-11-27T21:40:48Z

Limelight/Stream/VideoDecoderRenderer.m

+        } else {
+            // I-frame
+            CFDictionarySetValue(dict, kCMSampleAttachmentKey_NotSync, kCFBooleanFalse);
+            CFDictionarySetValue(dict, kCMSampleAttachmentKey_DependsOnOthers, kCFBooleanFalse);


These attributes should be set on the H.264/HEVC CMSampleBuffer that we pass to the VTDecompressionSession rather than the CMSampleBuffer we pass to AVBufferSampleDisplayLayer (which is now just raw YUV data).

makes total sense! will change it

Limelight/Stream/VideoDecoderRenderer.m

felipejfc · 2022-11-28T12:46:47Z

Plot Twist: As part of tackling the improvements you suggested @cgutman, I hit the reason for the original issue #533. Still unsure about the root cause, but it's these lines:

https://github.com/moonlight-stream/moonlight-ios/blob/master/Limelight/Stream/VideoDecoderRenderer.m#L346-L360

I ran some tests, and we don't even need to set any value in the dict; only getting it will cause the decoder to go nuts:

   CFArrayRef attachments = CMSampleBufferGetSampleAttachmentsArray(sampleBuffer, YES);
    CFMutableDictionaryRef dict = (CFMutableDictionaryRef)CFArrayGetValueAtIndex(attachments, 0);

This is enough to make the decoder fail when the sample buffers contain HDR data. I imagine it must be due to some OS bug. If I remove the whole block, the decoder will work just fine.
Interestingly enough, these lines will also break the manual VTDecompression flow (error kVTVideoDecoderMalfunctionErr | -12911); this is how I figured I should try commenting on them in the original solution.

Given that:

Do you think the solution to decouple decoding still makes sense? Or should we move forward with removing these lines only? Honestly, I'm not sure if they make any difference. From my tests, I see none when I change them.
I can send another PR with only this change and using the PTS information; Since we can't get the reference to the dict this will be needed as we can't set DisplayImmediately flag. WDYT?

cgutman · 2022-11-30T00:29:39Z

Do you think the solution to decouple decoding still makes sense? Or should we move forward with removing these lines only? Honestly, I'm not sure if they make any difference. From my tests, I see none when I change them.

Removing those lines should be fine.

I can send another PR with only this change and using the PTS information; Since we can't get the reference to the dict this will be needed as we can't set DisplayImmediately flag. WDYT?

Yep, let's do that to fix #533 ASAP and we can see if using a VTDecompressionSession improves things further vs the current pure AVSampleBufferDisplayLayer solution.

Do you see a frame pacing regression using the PTS info with the pacing option enabled? If so, we can just use this solution for HDR streaming only for now.

felipejfc · 2022-12-02T02:35:01Z

@cgutman as part of the changes, I wanted to test different ways of decoding and rendering, using VT to manually decode and update a CALayer with the resulting image, continue to pass the encoded buffer directly to AVSampleBufferDisplayLayer, and then measure the latency of each approach.
Any ideas on how I could benchmark these solutions reliably?

cgutman · 2022-12-02T03:38:23Z

I suppose you could use a phone in slow motion mode.

For now though, let's try to get HDR on the new Apple TV, then we can fine tune things later.

Can you send your basic PR with just the HDR fix?

felipejfc · 2022-12-02T12:34:17Z

@cgutman I isolated the changes to fix HDR here #536. Will keep on with researching different methods of drawing to improve latency -- in the newer ATV 4K it's still more noticeable than M1 Macs and Nvidia Shield

Starlank · 2022-12-02T13:12:38Z

Going to build and test the lowest latency option on my Apple TV 4K 2021 with MoCA setup and report back!

Starlank · 2022-12-02T19:19:03Z

@felipejfc do the latency improvements only apply to the 2022 Apple TV 4K?

felipejfc · 2022-12-04T20:22:52Z

@Starlank the changes here and in the other PR should not improve latency; but should improve stream "smoothness" using the low latency pacing mode. I'm currently studying latency improvements locally.

Alanko5 · 2022-12-16T10:55:02Z

Limelight/Stream/VideoDecoderRenderer.m

+
+- (void) setupDecompressionSession {
+    if (decompressionSession != NULL){
+        VTDecompressionSessionInvalidate(decompressionSession);


it is also necessary to do the following:
VTDecompressionSessionWaitForAsynchronousFrames(decompressionSession);

Alanko5 · 2022-12-16T10:55:48Z

Limelight/Stream/VideoDecoderRenderer.m

+        }
+
+        // Enqueue the next frame
+        [self->displayLayer enqueueSampleBuffer:sampleBuffer];


I would add a flush before the [self->displayLayer enqueueSampleBuffer:sampleBuffer] function.

if (![self->displayLayer isReadyForMoreMediaData]) { [self->displayLayer flush]; }

Sometimes it happens that not all data is played, and the buffer fills up, so the playback stops.

Alanko5 · 2022-12-16T10:56:28Z

Limelight/Stream/VideoDecoderRenderer.m

+        }
+
+        CMSampleBufferRef sampleBuffer;
+        CMSampleTimingInfo sampleTiming = {kCMTimeInvalid, presentationTimestamp, presentationDuration};


I recommend using CACurrentMediaTime() for timing info.
In this way, the frames will be played immediately.

I have the impression if I set it to display immediately, more jittering is generated, maybe because presentationDuration gets messed up?

set the duration if you know the fps.

but try it without setting the duration, I don't notice the jittering in another project.

Alanko5 · 2022-12-16T10:58:15Z

Limelight/Stream/VideoDecoderRenderer.m

+        decompressionSession = nil;
+    }
+
+    int status = VTDecompressionSessionCreate(kCFAllocatorDefault,


For initialize VTDecompressionSession need set parameters.
If I understand it correctly, the sender can send data from various sources.
Therefore, it is necessary to prepare for the fact that the data can be of various types.
For example:

let imageBufferAttributes = [ //kCVPixelBufferPixelFormatTypeKey: NSNumber(value: kCVPixelFormatType_......), /// if needed kCVPixelBufferIOSurfacePropertiesKey: [:] as AnyObject, kCVPixelBufferOpenGLESCompatibilityKey: NSNumber(booleanLiteral: true), kCVPixelBufferMetalCompatibilityKey: NSNumber(booleanLiteral: true), kCVPixelBufferOpenGLCompatibilityKey: NSNumber(booleanLiteral: true) ]

Alanko5 · 2022-12-16T11:00:47Z

If you like SDR change to HDR:

let pixelTransferProperties = [kVTPixelTransferPropertyKey_DestinationColorPrimaries: kCVImageBufferColorPrimaries_ITU_R_2020,
                                           kVTPixelTransferPropertyKey_DestinationTransferFunction: kCVImageBufferTransferFunction_SMPTE_ST_2084_PQ,
                                           kVTPixelTransferPropertyKey_DestinationYCbCrMatrix: kCVImageBufferYCbCrMatrix_ITU_R_2020]

VTSessionSetProperty(decompressionSession,
                                 key: kVTDecompressionPropertyKey_PixelTransferProperties,
                                 value: pixelTransferProperties as CFDictionary)

Do not forget that on tvOS it is necessary to switch the TV to HDR mode.

felipejfc · 2022-12-16T13:22:38Z

Thanks for the review @Alanko5. I have doubts regarding the manual decompression approach though, as I wasn't able to reduce video latency.
The way that reduced it the most was using kCVPixelBufferIOSurfacePropertiesKey property so that the image received in the decompression callback has a backing IOSurface and setting the displayLayer contents directly (ditching the SampleBufferDisplayLayer basically)
Btw given how much latency the AppleTV, even newest model has, when compared to ipads or iphones I think that it's some hardware related latency between the ATV and the display (monitor/TV)

felipejfc · 2022-12-16T13:23:20Z

If you like SDR change to HDR:

let pixelTransferProperties = [kVTPixelTransferPropertyKey_DestinationColorPrimaries: kCVImageBufferColorPrimaries_ITU_R_2020,
                                           kVTPixelTransferPropertyKey_DestinationTransferFunction: kCVImageBufferTransferFunction_SMPTE_ST_2084_PQ,
                                           kVTPixelTransferPropertyKey_DestinationYCbCrMatrix: kCVImageBufferYCbCrMatrix_ITU_R_2020]

VTSessionSetProperty(decompressionSession,
                                 key: kVTDecompressionPropertyKey_PixelTransferProperties,
                                 value: pixelTransferProperties as CFDictionary)

Do not forget that on tvOS it is necessary to switch the TV to HDR mode.

Is this SDR->HDR mapping?

Alanko5 · 2022-12-16T14:26:15Z

If you like SDR change to HDR:

let pixelTransferProperties = [kVTPixelTransferPropertyKey_DestinationColorPrimaries: kCVImageBufferColorPrimaries_ITU_R_2020,
                                           kVTPixelTransferPropertyKey_DestinationTransferFunction: kCVImageBufferTransferFunction_SMPTE_ST_2084_PQ,
                                           kVTPixelTransferPropertyKey_DestinationYCbCrMatrix: kCVImageBufferYCbCrMatrix_ITU_R_2020]

VTSessionSetProperty(decompressionSession,
                                 key: kVTDecompressionPropertyKey_PixelTransferProperties,
                                 value: pixelTransferProperties as CFDictionary)

Do not forget that on tvOS it is necessary to switch the TV to HDR mode.

Is this SDR->HDR mapping?

Yes, apple mentions it somewhere in the documentation.
It does not generate an HDR image, but it improves SDR.

Alanko5 · 2022-12-16T14:28:55Z

Thanks for the review @Alanko5. I have doubts regarding the manual decompression approach though, as I wasn't able to reduce video latency.
The way that reduced it the most was using kCVPixelBufferIOSurfacePropertiesKey property so that the image received in the decompression callback has a backing IOSurface and setting the displayLayer contents directly (ditching the SampleBufferDisplayLayer basically)
Btw given how much latency the AppleTV, even newest model has, when compared to ipads or iphones I think that it's some hardware related latency between the ATV and the display (monitor/TV)

I don't think it's caused by HW.
tvOS is a different system than iOS.
I think that it is enough to find some setting that will only be enabled.
But I may be wrong.

Alanko5 · 2022-12-16T14:59:07Z

Thanks for the review @Alanko5. I have doubts regarding the manual decompression approach though, as I wasn't able to reduce video latency.
The way that reduced it the most was using kCVPixelBufferIOSurfacePropertiesKey property so that the image received in the decompression callback has a backing IOSurface and setting the displayLayer contents directly (ditching the SampleBufferDisplayLayer basically)
Btw given how much latency the AppleTV, even newest model has, when compared to ipads or iphones I think that it's some hardware related latency between the ATV and the display (monitor/TV)

What latency are we talking about?
Now I tried to measure the decoding time.
H265 decoding took 0.001sec.
Where do you see this delay?
Maybe I don't fully understand the problem.

felipejfc · 2022-12-16T15:02:04Z

Thanks for the review @Alanko5. I have doubts regarding the manual decompression approach though, as I wasn't able to reduce video latency.
The way that reduced it the most was using kCVPixelBufferIOSurfacePropertiesKey property so that the image received in the decompression callback has a backing IOSurface and setting the displayLayer contents directly (ditching the SampleBufferDisplayLayer basically)
Btw given how much latency the AppleTV, even newest model has, when compared to ipads or iphones I think that it's some hardware related latency between the ATV and the display (monitor/TV)

What latency are we talking about? Now I tried to measure the decoding time. H265 decoding took 0.001sec. Where do you see this delay? Maybe I don't fully understand the problem.

There's streaming delay witn ATV4K when compared to streaming with an iphone/ipad or nvidia shield. I compared them using a stopwatch application and slow-mo iPhone camera to compare PC screen time with streaming screen time

Alanko5 · 2022-12-16T15:28:16Z

I understand.
Measure how long it takes you to decode.
According to my measurements, it is 0.001sec.
If you think the delay is causing the decoding, switch to H264.
There, the delay is even smaller.

In my opinion, the timing will help you solve the problem.

Do you not use WiFi when measuring? :-)

felipejfc · 2022-12-16T17:10:51Z

I understand. Measure how long it takes you to decode. According to my measurements, it is 0.001sec. If you think the delay is causing the decoding, switch to H264. There, the delay is even smaller.

In my opinion, the timing will help you solve the problem.

Do you not use WiFi when measuring? :-)

For 4k HEVC I think I was getting 8ms time to decode each frame. 10~11 ms total time to receive the whole frame, pack it together and decode

Alanko5 · 2022-12-16T17:44:18Z

Did you measure the decompression time of the Key and non-Key frames?
Can you set the server to send fewer keyframes? (for example one per two seconds)

What is the total delay of the image that you measured with the camera?

What version of apple tv do you have?

How do you create a VTDecompressionSession?
I mean, what parameters are you setting?

felipejfc · 2022-12-16T17:58:59Z

Code is in this branch https://github.com/felipejfc/moonlight-ios/tree/ds_queue_surface

I have the latest 4K Apple TV(2022) with the iPhone 12 pro processor.

What is the total delay of the image that you measured with the camera?
The streamed image would be always 25~50 hundredths behind the original image; when testing with m1 MacBook, or nvidia shield, most of the time the images would be in sync.

Did you measure the decompression time of the Key and non-Key frames?
I measured all frames and they all took this same amnt of time.

Can you set the server to send fewer keyframes?

Pretty sure gamestream won't allow me to do it

Alanko5 · 2022-12-16T18:38:33Z

According to what you write, the problem is not in decoding.
I think that by improving the decoding you can gain a maximum of 5ms.
The first thing I would look for is a network or rendering delay.
Because a delay of 250~500ms is huge!

Well, you can try as follows:

It is necessary that you set this value (As I wrote above):
kCVPixelBufferMetalCompatibilityKey

Your destinationImageBufferAttributes:

NSDictionary *pixelAttributes = @{
        (id)kCVPixelBufferMetalCompatibilityKey : (id)kCFBooleanTrue,
        (id)kCVPixelBufferIOSurfaceCoreAnimationCompatibilityKey : (id)kCFBooleanTrue,
        (id)kCVPixelBufferIOSurfacePropertiesKey : @{},
    };

I think that during rendering it would help if the layer could use Metal.

NSDictionary *videoDecoderSpec = @{
         (id) kCMFormatDescriptionExtension_FullRangeVideo : FORMAT_DESC_FullRangeVideo,
         (id) kCVImageBufferChromaLocationBottomFieldKey: kCVImageBufferChromaLocation_Left,
         (id) kCVImageBufferChromaLocationTopFieldKey: kCVImageBufferChromaLocation_Left,
         (id) kCVImageBufferPixelAspectRatioKey: FORMAT_DESC_AspectRatio,
         (id) kCVImageBufferColorPrimariesKey: FORMAT_DESC_ColorPrimaries,
         (id) kCVImageBufferTransferFunctionKey: FORMAT_DESC_TransferFunction,
         (id) kCVImageBufferYCbCrMatrixKey: FORMAT_DESC_YCbCrMatrix
};

felipejfc · 2022-12-16T18:54:24Z

According to what you write, the problem is not in decoding. I think that by improving the decoding you can gain a maximum of 5ms. The first thing I would look for is a network or rendering delay. Because a delay of 250~500ms is huge!

Well, you can try as follows:

It is necessary that you set this value (As I wrote above): kCVPixelBufferMetalCompatibilityKey

Your destinationImageBufferAttributes:
NSDictionary *pixelAttributes = @{
        (id)kCVPixelBufferMetalCompatibilityKey : (id)kCFBooleanTrue,
        (id)kCVPixelBufferIOSurfaceCoreAnimationCompatibilityKey : (id)kCFBooleanTrue,
        (id)kCVPixelBufferIOSurfacePropertiesKey : @{},
    };
I think that during rendering it would help if the layer could use Metal.
NSDictionary *videoDecoderSpec = @{
         (id) kCMFormatDescriptionExtension_FullRangeVideo : FORMAT_DESC_FullRangeVideo,
         (id) kCVImageBufferChromaLocationBottomFieldKey: kCVImageBufferChromaLocation_Left,
         (id) kCVImageBufferChromaLocationTopFieldKey: kCVImageBufferChromaLocation_Left,
         (id) kCVImageBufferPixelAspectRatioKey: FORMAT_DESC_AspectRatio,
         (id) kCVImageBufferColorPrimariesKey: FORMAT_DESC_ColorPrimaries,
         (id) kCVImageBufferTransferFunctionKey: FORMAT_DESC_TransferFunction,
         (id) kCVImageBufferYCbCrMatrixKey: FORMAT_DESC_YCbCrMatrix
};

Sorry, I misspelled it. It's actually 25~50ms delay! I will try your changes anyways when I get home; travelling right now so that's only next week

felipejfc force-pushed the direct_submit branch from df3e5c1 to a8af6ad Compare November 26, 2022 19:19

Use vt for manually decoding frames. Fixes moonlight-stream#533

31560a0

use pts from moonlight server to schedule frame display use decompression callback unused frameRef field to propagate frameType information Use obj-c cb for decode session Revert to direct decode, use PTS correctly

felipejfc force-pushed the direct_submit branch from a8af6ad to 31560a0 Compare November 27, 2022 00:05

cgutman requested changes Nov 27, 2022

View reviewed changes

felipejfc added 2 commits November 27, 2022 23:03

Improve rendering code

c004cd1

Fix decoding

df707d2

Alanko5 reviewed Dec 16, 2022

View reviewed changes

rgov mentioned this pull request Jul 26, 2024

Input lag of around 100-150ms (might be related to AppleTV, not Moonlight....) #643

Open

[WIP] Use vt for manually decoding frames. Fixes #533 #535

Are you sure you want to change the base?

[WIP] Use vt for manually decoding frames. Fixes #533 #535

Conversation

felipejfc commented Nov 26, 2022

cgutman left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

felipejfc commented Nov 28, 2022 • edited Loading

cgutman commented Nov 30, 2022

felipejfc commented Dec 2, 2022

cgutman commented Dec 2, 2022 • edited Loading

felipejfc commented Dec 2, 2022

Starlank commented Dec 2, 2022

Starlank commented Dec 2, 2022

felipejfc commented Dec 4, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Alanko5 commented Dec 16, 2022

felipejfc commented Dec 16, 2022

felipejfc commented Dec 16, 2022

Alanko5 commented Dec 16, 2022

Alanko5 commented Dec 16, 2022

Alanko5 commented Dec 16, 2022

felipejfc commented Dec 16, 2022

Alanko5 commented Dec 16, 2022 • edited Loading

felipejfc commented Dec 16, 2022 • edited Loading

Alanko5 commented Dec 16, 2022

felipejfc commented Dec 16, 2022

Alanko5 commented Dec 16, 2022 • edited Loading

felipejfc commented Dec 16, 2022

felipejfc commented Nov 28, 2022 •

edited

Loading

cgutman commented Dec 2, 2022 •

edited

Loading

Alanko5 commented Dec 16, 2022 •

edited

Loading

felipejfc commented Dec 16, 2022 •

edited

Loading

Alanko5 commented Dec 16, 2022 •

edited

Loading