
Improve memory locality for higher performance and a reduced memory working set #212

Open
padenot opened this issue May 3, 2021 · 12 comments
Labels
extension Interface changes that extend without breaking.

Comments

@padenot
Collaborator

padenot commented May 3, 2021

We can add a second parameter to decode/encode and to the output callbacks (for audio, video, and images): on decode/encode it would be an output buffer or handle, and on the output callbacks it would be the input buffer or handle (mirrored between encode and decode). This lets authors reuse memory and avoid lots of memory traffic. GCs are amazing, but it's always better not to GC: reuse allocations and keep using the same bit of the address space, especially in very demanding multimedia scenarios.

This is really compelling whenever the input (resp. the output, for decode) is regular memory, since it keeps the working set pretty tight (it would help with w3c/webtransport#231, for example), but it is still nice when some buffers are backed by, say, GPU memory, because you avoid malloc/free pairs for the input/output buffers of each decode or encode. I'm not sure whether GPU-memory-backed surfaces can work the same way.

Essentially, the proposal is to "pass through" the input to the output so that it can be reused. This reduces allocator use, but more importantly greatly improves memory locality.

When this new parameter isn't used, the implementation allocates as usual. Authors therefore don't have to guess the buffer size: it's allocated once and can then be resized, and the buffer is released naturally and immediately because nothing references it. In short, this is not a breaking change, but it allows developers who want to control their memory allocation to do so (and I believe this is important for quite a few domains).

When the provided buffer is too small, the browser can allocate a new buffer, or use more complex tricks (realloc, or noticing that the ArrayBuffer's backing storage was in fact big enough, etc.).

When the provided buffer is way too big, the implementation can decide to swap it for a buffer that has a more appropriate size.
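
Concretely, a rough sketch of the intended usage for audio decoding. The second arguments to decode() and to the output callback are the new parts of this proposal (they don't exist in the shipped API), and nextChunk(), use(), and config stand in for application code:

let inputBuf = new ArrayBuffer(16 * 1024);    // reused for encoded input
const outputBuf = new ArrayBuffer(64 * 1024); // reused for decoded output

const decoder = new AudioDecoder({
  output: (data, input) => {
    use(data);        // hypothetical consumer of the decoded audio
    data.close();
    inputBuf = input; // the chunk's backing memory, handed back for reuse
  },
  error: console.error,
});
decoder.configure(config);

// Proposed second argument: decode into pre-allocated memory instead of
// having the implementation allocate a fresh output buffer every time.
decoder.decode(nextChunk(inputBuf), outputBuf);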

@aboba
Collaborator

aboba commented May 3, 2021

@padenot This sounds interesting. Having some WebIDL might help in walking through the implications for various use cases (and related APIs, particularly those based on WHATWG Streams).

A use case that comes to mind is media capture of a track, followed by WebCodecs encode and then sending (e.g. via WebTransport, RTCDataChannel, etc.).

There is work underway to support zero-copy capture. My understanding is that this will (in concert with MediaCapture Transform) provide a VideoFrame in a GPU buffer, which can then be provided to WebCodecs for encoding, yielding an EncodedVideoChunk that can then be sent via a transport.

My understanding is that the GPU buffers are allocated and written to within the Media Capture implementation, so that WebCodecs encode can only take a reference to the GPU buffer as input. So I'm wondering how the new parameters you are suggesting would interface with Streams-based APIs such as MediaCapture Transform and WebTransport.
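
For concreteness, a minimal sketch of that pipeline with the current API shapes (capture, encode, send over WebTransport datagrams). The URL is a placeholder; error handling and datagram size limits are ignored, and an async context is assumed:

const transport = new WebTransport('https://example.com/media'); // placeholder
await transport.ready;
const writer = transport.datagrams.writable.getWriter();

const encoder = new VideoEncoder({
  output: (chunk) => {
    // Today each encoded payload is copied into a fresh allocation before
    // sending; this per-chunk traffic is what buffer reuse could avoid.
    const buf = new Uint8Array(chunk.byteLength);
    chunk.copyTo(buf);
    writer.write(buf);
  },
  error: console.error,
});
encoder.configure({ codec: 'vp8', width: 640, height: 480 });

const track = (await navigator.mediaDevices.getUserMedia({ video: true }))
  .getVideoTracks()[0];
const reader = new MediaStreamTrackProcessor({ track }).readable.getReader();
for (let r = await reader.read(); !r.done; r = await reader.read()) {
  encoder.encode(r.value);
  r.value.close(); // release the (possibly GPU-backed) capture buffer promptly
}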

@chcunningham
Collaborator

Agree, sounds interesting! I think I follow, but some quick pseudo code would help me be sure.

Re: VideoFrame, one thing to note is that Chrome does do internal pooling of video frame resources for both GPU- and CPU-backed frames. For instance, following this link you can see the pooling done for a number of the internal decoders. Similarly, here you can see the CPU frame pool being used for WebCodecs VideoFrames created via the planes constructor. When all references to the underlying frames are released, they go back to the pool.

We do similar pooling with audio decoders and the backing memory for AudioData (confusingly called AudioBuffer in our internal code).

Re: capture, this too uses pooling internally. See here for an overview of the IPC API that dishes out buffers to observers (render processes), which then release them at some later point.

@aboba
Collaborator

aboba commented May 4, 2021

@chcunningham Thanks for the summary. The argument seems to be that in the absence of GC, the pool may become fragmented, so that explicit locality can be beneficial. For simple use cases (e.g. capture -> encode) it seems like this could be avoided by a good pooling implementation (e.g. a GPU buffer can be reused soon after being freed). But perhaps for more complex scenarios memory leaks are more likely and locality benefits will suffer until GC cleans up the mess.

@chcunningham
Collaborator

chcunningham commented May 4, 2021

Sounds good. I look forward to exploring it more.

Another thing to mention: for VideoFrames coming out of a VideoDecoder or a MediaStreamTrackProcessor, waiting for GC will actually stall the decode/camera stream. In these cases, authors really must call close() as soon as they're done with a frame (and we warn them if they instead wait for GC).
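
Concretely (sink() standing in for whatever consumes the frame):

const decoder = new VideoDecoder({
  output: (frame) => {
    sink(frame);   // hypothetical consumer
    frame.close(); // return the backing resource to the pool immediately;
                   // waiting for GC here would stall the decoder
  },
  error: console.error,
});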

@padenot
Collaborator Author

padenot commented May 4, 2021

In terms of IDL, it would look something like this. I agree that what Chrome does is sensible (though I'm always impressed at how much modern allocators make pools less useful than they used to be for regular memory), but this is also about decoder inputs and encoder outputs. I'm a bit wary of imposing pooling on developers, though.

@@ -5,9 +5,9 @@
   readonly attribute CodecState state;
   readonly attribute long decodeQueueSize;
 
   undefined configure(AudioDecoderConfig config);
-  undefined decode(EncodedAudioChunk chunk);
+  undefined decode(EncodedAudioChunk chunk, optional ArrayBuffer output);
   Promise<undefined> flush();
   undefined reset();
   undefined close();
 
@@ -18,9 +18,9 @@
   required AudioFrameOutputCallback output;
   required WebCodecsErrorCallback error;
 };
 
-callback AudioFrameOutputCallback = undefined(AudioFrame output);
+callback AudioFrameOutputCallback = undefined(AudioFrame output, BufferSource input);
 
 
 [Exposed=(Window,DedicatedWorker)]
 interface VideoDecoder {
@@ -29,9 +29,9 @@
   readonly attribute CodecState state;
   readonly attribute long decodeQueueSize;
 
   undefined configure(VideoDecoderConfig config);
-  undefined decode(EncodedVideoChunk chunk);
+  undefined decode(EncodedVideoChunk chunk, optional VideoFrame output);
   Promise<undefined> flush();
   undefined reset();
   undefined close();
 
@@ -42,9 +42,9 @@
   required VideoFrameOutputCallback output;
   required WebCodecsErrorCallback error;
 };
 
-callback VideoFrameOutputCallback = undefined(VideoFrame output);
+callback VideoFrameOutputCallback = undefined(VideoFrame output, BufferSource input);
 
 
 [Exposed=(Window,DedicatedWorker)]
 interface AudioEncoder {
@@ -53,9 +53,9 @@
   readonly attribute CodecState state;
   readonly attribute long encodeQueueSize;
 
   undefined configure(AudioEncoderConfig config);
-  undefined encode(AudioFrame frame);
+  undefined encode(AudioFrame frame, optional ArrayBuffer output);
   Promise<undefined> flush();
   undefined reset();
   undefined close();
 
@@ -66,9 +66,9 @@
   required EncodedAudioChunkOutputCallback output;
   required WebCodecsErrorCallback error;
 };
 
-callback EncodedAudioChunkOutputCallback = undefined(EncodedAudioChunk output);
+callback EncodedAudioChunkOutputCallback = undefined(EncodedAudioChunk output, ArrayBuffer input);
 
 
 [Exposed=(Window,DedicatedWorker)]
 interface VideoEncoder {
@@ -77,9 +77,9 @@
   readonly attribute CodecState state;
   readonly attribute long encodeQueueSize;
 
   undefined configure(VideoEncoderConfig config);
-  undefined encode(VideoFrame frame, optional VideoEncoderEncodeOptions options = {});
+  undefined encode(VideoFrame frame, optional VideoEncoderEncodeOptions options = {}, optional BufferSource input);
   Promise<undefined> flush();
   undefined reset();
   undefined close();
 
@@ -90,9 +90,9 @@
   required EncodedVideoChunkOutputCallback output;
   required WebCodecsErrorCallback error;
 };
 
-callback EncodedVideoChunkOutputCallback = undefined(EncodedVideoChunk output, VideoDecoderConfig? output_config);
+callback EncodedVideoChunkOutputCallback = undefined(EncodedVideoChunk output, VideoDecoderConfig? output_config, optional VideoFrame input);
 
 
 dictionary AudioDecoderSupport {
   boolean supported;
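
And the quick pseudo-code requested above, using the extended VideoDecoder signatures from this diff (readChunk() and render() stand in for application code; both extra arguments are the proposal, not the shipped API):

const encodedPool = [new ArrayBuffer(256 * 1024)]; // encoded-input buffers
const freeFrames = [];                              // output frames to recycle

const decoder = new VideoDecoder({
  output: (frame, input) => {
    render(frame);           // hypothetical consumer
    freeFrames.push(frame);  // offer this frame's resources back to decode()
    encodedPool.push(input); // the chunk's backing memory, free to refill
  },
  error: console.error,
});
decoder.configure(config);

// readChunk() demuxes the next EncodedVideoChunk into a recycled buffer; the
// second argument offers a previously-output frame for the decoder to reuse.
decoder.decode(readChunk(encodedPool.pop()), freeFrames.pop());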

@chcunningham
Collaborator

Triage note: marking 'extension', as the proposed interface changes extend existing methods/callbacks with extra arguments (not breaking).

@sandersdan
Contributor

Regarding the IDL proposal, I don't think this would be straightforward to implement. Specifically, during VideoFrame encode(), the resources could be in use by a different system (e.g. in use for a copyTo() operation, or being displayed in a <video> via MediaStreamTrackGenerator). We could not return the underlying memory in that case.

What we could reliably do is implement a release callback that is provided at VideoFrame construction time, which would be called after all uses are finished.
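
A hypothetical shape for that (the release member below is illustrative only; nothing like it exists today):

const frame = new VideoFrame(buffer, {
  format: 'I420',
  codedWidth: 1280,
  codedHeight: 720,
  timestamp: 0,
  // Hypothetical: called once every use of the backing resource (encode,
  // copyTo(), display, clones, ...) has finished, so `buffer` can be reused.
  release: () => pool.push(buffer),
});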

@padenot
Collaborator Author

padenot commented May 31, 2023

As you note, this doesn't always work as-is. This proposal, however, always works when detaching inputs, and I made it in response to my comment in #104 (comment); we can think about extending it. Another related issue is #287.

The general idea was to allow authors to write efficient code, but if we find a way to make it extra useful in all cases, I'm all for it.

As usual, we need to be careful about not exposing GC, but that seems doable here. As long as we can specify when the callback is called, this is fine to do: essentially, we can't rely on GC to call a callback.

@sandersdan
Contributor

This proposal, however, always works when detaching inputs

I don't think I understand what you mean by this. Detaching the WebCodecs wrapper doesn't tell us anything about other uses of the backing resource, for example in the case that a frame has been clone()d.

@padenot
Collaborator Author

padenot commented Jun 6, 2023

You can't detach an input if it doesn't have a refcount of 1, can you? #104 (comment) is about detaching, as in the ES operation on an ArrayBuffer (or other backing storage), after which the memory isn't accessible anymore. At that point, the engine is the only owner of the piece of memory (or surface, etc.). The interface change then gives you back a reference to this memory (or other resource) when it is no longer in use by the engine.

Are you referring to a copy-on-write scheme or something of that nature? Gecko makes extensive use of that kind of thing internally, maybe we could expose it, but it's harder to reason about.

@sandersdan
Contributor

You can absolutely detach a VideoFrame where the ref count on the media resource is greater than one:

let frame2 = frame1.clone();
encoder.encode(frame1, {transfer: [frame1]});  // detach frame1
// frame1 is detached but Chrome cannot return its memory until frame2 is closed

This can also happen through an async operation, or other internal operations (e.g. a frame held while being displayed in a <video> via MediaStreamTrackGenerator):

frame.copyTo(buf);  // start async operation
encoder.encode(frame, {transfer: [frame]});  // detach frame
// frame is detached but Chrome cannot return its memory until the copy completes

VideoFrame is more like ArrayBufferView than ArrayBuffer. (More precisely it follows the model of ImageBitmap but exposes a lot more state and explicit cloning.)

@padenot
Collaborator Author

padenot commented Jun 7, 2023

It feels like we're talking past each other; the fact that the term "detach" can have multiple meanings doesn't help.

https://tc39.es/ecma262/#sec-detacharraybuffer is what I'm talking about. After this operation, the implementation owns the memory, and this opens the door to the optimizations that I've been proposing. This works on an ArrayBuffer, so this would work on the internals of a VideoFrame / AudioData, namely the media resource.

As you note, this only works in some specific cases. The goal of this proposal is to give developers who need more performance access to fast paths: if they can ensure that frames aren't cloned, being copied, etc., so as to keep a refcount of 1, they can write their code in a way that is efficient in terms of memory access, allocator usage, or surface usage.

Additionally, in the copy case it's also possible to make this work, because after the second line of your second example executes, all references are controlled by the implementation.
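
Putting those conditions together, the fast path would look like this (reusing the transfer option from your example; the memory hand-back is the proposed behavior, not something implementations do today):

const frame = new VideoFrame(buf, init); // the only reference to the resource
// No clone(), no pending copyTo(): after the transfer the implementation is
// the sole owner, so it could hand the memory back via the output callback.
encoder.encode(frame, { transfer: [frame] });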
