Novel Method for Video Generation Using Feature-Space Optical Flow Mapping? #25
You can also apply Bernoulli's principle from fluid mechanics to the pixels of a video and their optical flow: larger areas tend to have smaller overall motion than smaller areas do. I think we can use this concept in tandem with object-detection methods as part of an iterative quality-check process. Essentially, if the areas recognized as larger structures (a head, a body, etc.) have a larger magnitude of optical flow than areas recognized as smaller structures (a mouth, eyes, etc.), we can assume the flow is turbulent and incoherent, the resulting video will not give us the result we're looking for, and we would need to iterate again.
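A minimal sketch of that quality check, assuming we already have boolean region masks from an object detector (the mask names and the simple mean-magnitude comparison are illustrative, not a fixed recipe):

```python
import numpy as np

def region_flow_magnitude(flow, mask):
    """Mean optical-flow magnitude over a masked region.
    flow: (H, W, 2) per-pixel (dx, dy) vectors; mask: (H, W) boolean."""
    return float(np.linalg.norm(flow[mask], axis=-1).mean())

def is_flow_coherent(flow, large_masks, small_masks):
    """The heuristic above: large structures (head, body) should show
    less average motion than small ones (mouth, eyes). If the large
    regions move more, treat the flow as turbulent/incoherent and flag
    the frame for another generation pass."""
    large = np.mean([region_flow_magnitude(flow, m) for m in large_masks])
    small = np.mean([region_flow_magnitude(flow, m) for m in small_masks])
    return large <= small
```

A generation loop would call `is_flow_coherent` on each candidate frame's flow and regenerate when it returns `False`.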
This isn't something that would be possible for me to do, due both to hardware limitations and my lack of experience with these things, but after doing a lot of research trying to achieve temporal coherence with existing tools, I had an idea that may (or may not) be feasible.
I haven't seen any papers regarding this idea, so I figure I might as well share it.
Ideally, you would be able to take a small number of target (style) frames and input video frames, and then use those to generate your whole scene.
Background
This approach is reminiscent of the traditional technique of "keyframe animation" used by animators.
However, this approach involves using machine learning to learn a mapping from the input video to a stylized version based on keyframes. The process of figuring out what the style of the in-between frames should be is akin to the process animators go through when creating in-between frames.
The optical flow provides information about how objects in the scene are moving from frame to frame, analogous to how an animator needs to understand the motion of characters or objects between keyframes.
The training of a CNN to learn the mapping from input optical flow to stylized optical flow is similar to how an animator learns to draw the motion of characters or objects over time.
The idea of extending this concept by computing the optical flow in feature space, at multiple levels of abstraction, could be seen as a high-tech version of an animator understanding the motion of a scene at different levels of detail.
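That multi-level idea can be pictured with a simple image pyramid. This is my hedged stand-in for true feature-space levels (which would come from a network encoder, not block averaging); it only illustrates "the same frame at several levels of abstraction":

```python
import numpy as np

def downsample(frame, factor=2):
    """2x block-average downsample: a crude stand-in for moving one
    level up an encoder's feature hierarchy."""
    h = frame.shape[0] // factor * factor
    w = frame.shape[1] // factor * factor
    f = frame[:h, :w]
    return f.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def pyramid(frame, levels=3):
    """Multi-level representation of a frame, finest first. The idea
    above would compute optical flow at each level, from pixel detail
    up to coarse structure."""
    out = [frame]
    for _ in range(levels - 1):
        out.append(downsample(out[-1]))
    return out
```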
A Bit of Extra Detail I Came Up With
Super Simple Outline:
Essentially, you'd take the optical flow between target keyframes and the optical flow between the corresponding input frames, use those to build some sort of mapping between the two, and then use that mapping to help inform the creation of new target frames, leading to a full video after iteration.
You could (albeit SUUUUPER SIMPLY) think of this as:
[&] = Generic placeholder for operands.
StyleFlow(a->x, x<-a) = optical flow between frame a and x, plus the backward flow from x to a, where a is the style keyframe we have access to and x is the style frame we need (or, in the correspondence-calculation case, the next keyframe).
InputFlow(b->y, y<-b) = optical flow between frame b and y, plus the backward flow from y to b, where b is the input frame at the same time as a, and y is the input frame corresponding to x, which may be unknown.
MapRatio = the machine-learned correlation between the two flows. Think of it as a coefficient.
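The notation above can be sketched numerically. As a stand-in for the machine-learned correlation, this uses a simple per-pixel ratio between the two flow fields; the function names and the eps stabilizer are my assumptions, not part of the original idea:

```python
import numpy as np

def estimate_map_ratio(style_flow, input_flow, eps=1e-6):
    """MapRatio between a keyframe pair where both flows are known.
    Here it is a naive per-pixel, per-channel ratio; in the idea above,
    this correlation would instead be learned by a CNN.
    style_flow, input_flow: (H, W, 2) flow fields."""
    return style_flow / (input_flow + eps)

def predict_style_flow(map_ratio, input_flow):
    """StyleFlow(a->x) ~= MapRatio * InputFlow(b->y): predict the flow
    the missing stylized frame x should satisfy, given the input-side
    flow at the same point in time."""
    return map_ratio * input_flow
```

Between known keyframe pairs you'd fit MapRatio; for an unknown frame x you'd predict the style-side flow from the input-side flow and use it to constrain the new frame.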
We would then need to solve for the 'unknown' frame, which I see as being almost like a differential-equation problem.
In a differential equation, you're given the derivative of a function (analogous to the optical flow, which shows how the video changes from frame to frame) and you need to find the function that satisfies this derivative (analogous to generating the sequence of frames that results in the given optical flow).
Where M is the mapped correlation and x is the new frame we're trying to create, the relationship to satisfy is StyleFlow(a->x, x<-a) = M * InputFlow(b->y, y<-b), so define the residual:
E(x) = StyleFlow(a->x, x<-a) - M * InputFlow(b->y, y<-b)
Find the root of the equation E(x) = 0 using numerical methods like Newton-Raphson:
1. Start from an initial guess for x.
2. Evaluate E(x) at the current x.
3. Evaluate the derivative E'(x) of E(x) with respect to x.
4. Update x = x - E(x) / E'(x).
5. Repeat until E(x) approaches zero or reaches a satisfactory tolerance level.
Continue the iterations until a solution is found that minimizes the difference between the two sides of the equation.
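The Newton-Raphson steps above can be sketched for a scalar unknown. Treating one value as a stand-in for the whole frame is my simplification; a real frame is high-dimensional, so in practice you'd apply this element-wise or minimize ||E(x)||^2 by gradient descent instead:

```python
def newton_solve(E, dE, x0, tol=1e-8, max_iter=100):
    """Newton-Raphson on the residual E(x): evaluate E at the current x,
    divide by its derivative, update, and repeat until |E(x)| falls
    under the tolerance."""
    x = x0
    for _ in range(max_iter):
        e = E(x)
        if abs(e) < tol:
            break
        x = x - e / dE(x)
    return x
```

For example, solving E(x) = x^2 - 2 = 0 from x0 = 1 converges to sqrt(2) in a handful of iterations.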
Anyways, hopefully someone finds this interesting, and maybe it has at least some degree of validity to it!