
Temporal Coherence in Generated Videos #1

Open
Trentonom0r3 opened this issue May 18, 2023 · 6 comments
@Trentonom0r3 (Owner)

Describe the solution you'd like
A robust temporal coherence function that uses optical flow, among other methods, to keep generated frames consistent over time.

Describe alternatives you've considered
I've tried EBSynth, but I need something more robust. I'm wondering whether datamoshing techniques, layered with optical flow and other options like temporal smoothing, could help achieve our goals.

Additional context
Suggestions from Aerial_1 on Reddit:
"It's been figured out back in the GAN days, then applied to disco diffusion, and then finally stable warp diffusion, although locked behind a patreon paywall.
There are also extensions for the A1111 webui, like Temporal Kit, but it's mostly based on EBSynth and doesn't do the true temporal warping that I have in mind with these other links.

Now I'm tearing my hair out here because the core principle afaik is a quite simple loop:

do img2img

warp result based on optical flow vectors

blend result with next frame (customizable ratio)

repeat img2img with this new blended result

And I haven't found an easy interface to run this with my own flow vectors (made from 3D software)

Stable Warp Fusion, I think, does this and adds a ton of bells and whistles (documented in the author's Patreon), most of which just overlap with what the webui does anyway.
For creative animation workflows, it feels like the entire SD community is bottlenecked by this one notebook behind a paywall. I think the author of it is rightfully collecting monetary support for how much work he's putting into it, but I figured the base warp method should be easy enough to implement anywhere."
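
A rough sketch of the loop described above, in Python (illustrative only: `img2img`, `estimate_flow`, and `warp` are placeholders for whatever diffusion backend, flow estimator, and warping routine you plug in, not an existing API):

```python
# Sketch of the warp-and-blend img2img loop quoted above.
# img2img, estimate_flow, and warp are caller-supplied placeholders.

def stylize_sequence(frames, img2img, estimate_flow, warp, blend=0.5):
    """frames: list of input video frames (e.g. HxWx3 float arrays in [0, 1])."""
    stylized = img2img(frames[0])                        # 1. img2img on the first frame
    outputs = [stylized]
    for prev_frame, next_frame in zip(frames[:-1], frames[1:]):
        flow = estimate_flow(prev_frame, next_frame)     # 2. flow between source frames
        warped = warp(stylized, flow)                    # 3. warp previous result forward
        init = blend * warped + (1.0 - blend) * next_frame  # 4. blend (customizable ratio)
        stylized = img2img(init)                         # 5. repeat img2img on the blend
        outputs.append(stylized)
    return outputs
```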

@Trentonom0r3 Trentonom0r3 self-assigned this May 18, 2023
@Trentonom0r3 (Owner, Author)

Another option:
Create a custom ControlNet that uses optical flow from the input image to inform the generation.
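
One hypothetical way to prototype that idea would be to encode the flow field as an image and feed it to ControlNet as the conditioning/hint image. The HSV encoding below is just the standard flow visualization, not an existing ControlNet preprocessor; whether it makes a useful conditioning signal is untested:

```python
import cv2
import numpy as np

def flow_to_hint_image(flow: np.ndarray) -> np.ndarray:
    """Encode a dense optical flow field (HxWx2, float32) as an RGB image.

    Hue encodes flow direction, value encodes flow magnitude -- the usual
    HSV flow visualization. Using this as a ControlNet hint image is an
    untested assumption, not an existing preprocessor.
    """
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])        # magnitude, angle (radians)
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = (ang * 180 / np.pi / 2).astype(np.uint8)        # hue: direction (0-180)
    hsv[..., 1] = 255                                             # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
```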

@aulerius

Just discovered this implementation: https://github.com/volotat/SD-CN-Animation
The commonality between all successful methods seems to be using RAFT for optical flow estimation and warping according to it.
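
For reference, a minimal sketch of RAFT flow estimation with torchvision plus a grid_sample-based warp (my own assumptions about a typical setup, not code from that repo):

```python
import torch
import torch.nn.functional as F
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

# Assumes frames are (1, 3, H, W) float tensors in [0, 1] with H and W divisible by 8.
weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()

def estimate_flow(frame1, frame2):
    """Return dense flow (1, 2, H, W) from frame1 to frame2."""
    img1, img2 = weights.transforms()(frame1, frame2)  # normalization RAFT expects
    with torch.no_grad():
        return model(img1, img2)[-1]                   # RAFT refines iteratively; keep the last estimate

def warp(image, flow):
    """Backward-warp `image` (1, C, H, W) by `flow` (1, 2, H, W) using grid_sample."""
    _, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys)).float().unsqueeze(0) + flow   # absolute sample coords (x, y)
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0              # normalize x to [-1, 1]
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0              # normalize y to [-1, 1]
    return F.grid_sample(image, grid.permute(0, 2, 3, 1), align_corners=True)
```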

@Trentonom0r3 (Owner, Author)

> Just discovered this implementation: https://github.com/volotat/SD-CN-Animation The commonality between all successful methods seems to be using RAFT for optical flow estimation and warping according to it.

Now the choice seems to be to either adjust that Repo to have an endpoint, or try and create my own ControlNet preprocessor and model for the main ControlNet Extension.

@aulerius

> Just discovered this implementation: https://github.com/volotat/SD-CN-Animation The commonality between all successful methods seems to be using RAFT for optical flow estimation and warping according to it.
>
> Now the choice seems to be to either adjust that Repo to have an endpoint, or try and create my own ControlNet preprocessor and model for the main ControlNet Extension.

What do you mean by ControlNet? To adapt motion vector warping into a ControlNet-like implementation?

It also seems an API is planned by the developer of that project.

@melMass commented Jun 14, 2023

There is also this :) https://github.com/thu-ml/controlvideo

@Trentonom0r3 (Owner, Author)

AHHA! I've figured it out... Kind of.

EBSynth makes things quick and easy, but it often fails with more complex motion and runs into incoherence due to the nature of keyframes. Even when generating keyframes as a grid, there are often still temporal coherence issues to some degree.

The following method is quite a bit more involved, but I believe that, if utilized properly, it could produce some incredible results! (I have some newer tests to share that I'll upload later.)

At its core, it feels like a variation of old-school roto-animation, but with a lot of newer bells and whistles.

  • Background
    The After Effects Content Aware Fill tool inherently makes use of PatchMatch & optical flow to guide how it runs.
    Typically you would use this to remove things such as stains, random people/cars, stunt wires, etc., but I discovered that by using your SD-generated keyframe(s) as reference image(s), you can get a simple EBSynth-like output straight from AE.

Roto-animation is the process of drawing/painting over live-action frames to create an animated/drawn version of the original footage.
(It's also used as a guide for motion, actions, etc.)

AE has a great rotoscope tool and provides access to Mocha. These tools, combined with Content Aware Fill, are highly valuable.

Here is what I call the Iterative Roto-Fill Process, or IRFP for short:

These steps can be done using EBSynth as well, but I find performing them in AE a bit more streamlined and integrated.

Essentially you need to rotoscope all important areas of your input, and split your input into sections.

For example, say you have an input of a person talking and moving their hands.
You'd create, at a bare minimum, a mask/roto for the head, the hands, the torso, and/or the legs.

For further refinement, you can mask out smaller areas of the face and break it into chunks: nose, eyes, mouth, forehead, etc.

For each smaller patch, you'll perform content aware fill over that area.

For larger areas with minimal motion (such as the torso), you can use a single keyframe and get great results.

For areas with greater motion, you'd create at least 4-5 keyframes (I've found that 7-8 gives a great result) and perform Content Aware Fill.

You repeat this process for each divided section of your input, and then, on your completed fills, iterate through the areas where you find inconsistencies. If it's an area with larger motion, you probably need more keyframes.
If it's an area such as the forehead or cheeks, where there's slight motion but nothing on the level of mouth or eye movement, a keyframe or two can be enough to enhance the coherence.

By iterating through the patches like this, you have a lot more control over how the final output looks, and can more easily fix inconsistencies.

For further refinement, using Mocha AE to track and break up the patches more accurately can lead to an even more coherent result.

After a final pass, you can use the facial tracking data from your input to warp your stylized video even further.

This is still a workflow in progress, but each new method discovered is leading to better and better results.

I'll be posting an example I made using this method later tonight!
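
Not a replacement for the AE workflow above, but the final compositing step can be sketched in code: blend each region's filled result back over a base frame using its matte as a weight (purely illustrative; the mattes and per-region outputs would come from the roto and Content Aware Fill passes described above):

```python
import numpy as np

def composite_regions(base_frame, region_outputs, region_mattes):
    """Blend per-region results over a base frame.

    base_frame:     HxWx3 float array (e.g. a full-frame fill)
    region_outputs: list of HxWx3 float arrays, one per roto'd region
    region_mattes:  list of HxW float arrays in [0, 1] (soft mattes from the roto pass)
    """
    out = np.asarray(base_frame, dtype=np.float64).copy()
    for result, matte in zip(region_outputs, region_mattes):
        alpha = matte[..., None]                 # broadcast the matte over RGB channels
        out = alpha * result + (1.0 - alpha) * out
    return out
```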
