From 6d9d7479d7e637d109cec5758167f44c234831a3 Mon Sep 17 00:00:00 2001
From: Willi Menapace
Date: Fri, 23 Feb 2024 00:36:08 +0100
Subject: [PATCH] Update

---
 gen2_pikalab_floor33.html     | 10 +++++-----
 imagen_video.html             | 10 +++++-----
 index.html                    | 14 +++++++-------
 make_a_video.html             | 10 +++++-----
 our_samples.html              | 10 +++++-----
 our_samples_3d.html           |  8 ++++----
 our_samples_diversity.html    |  8 ++++----
 our_samples_hierarchical.html | 10 +++++-----
 pyoco.html                    | 10 +++++-----
 stories.html                  |  8 ++++----
 video_ldm.html                | 10 +++++-----
 11 files changed, 54 insertions(+), 54 deletions(-)

diff --git a/gen2_pikalab_floor33.html b/gen2_pikalab_floor33.html
index b5ea112..0bc7756 100644
--- a/gen2_pikalab_floor33.html
+++ b/gen2_pikalab_floor33.html
@@ -44,7 +44,7 @@

-
+
Paper Overview
@@ -81,9 +81,9 @@

Comparison to Gen-2, Floor33 and PikaLab

-

We compare results produced by Snap Video against the publicly accessible Gen-2, Floor33, and PikaLab video generators on a selection of 65 prompts from the EvalCrafter benchmark that elicit dynamic scenes.

+

We compare results produced by Snap Video against the publicly accessible Gen-2, Floor33, and PikaLab video generators on a selection of 65 prompts from the EvalCrafter benchmark that elicit dynamic scenes.

-

When evaluated in a user study, our method shows increased photorealism with respect to PikaLab and Floor33, achieves significantly better video-text alignment, and outperforms the baselines on all motion metrics. Results are expressed as the percentage of votes in favor of our method.

+

When evaluated in a user study, our method shows increased photorealism with respect to PikaLab and Floor33, achieves significantly better video-text alignment, and outperforms the baselines on all motion metrics. Results are expressed as the percentage of votes in favor of our method.

@@ -121,7 +121,7 @@

Comparison to Gen-2, Floor33 and PikaLab

-

Hover the cursor on the video to reveal the prompt. Note that some prompts issued to Gen-2 triggered input prompt filtering, so no output was generated for them.

+

Hover the cursor on the video to reveal the prompt. Note that some prompts issued to Gen-2 triggered input prompt filtering, so no output was generated for them.

@@ -4229,7 +4229,7 @@

PikaLab

-
+
diff --git a/imagen_video.html b/imagen_video.html
index 888300c..d788a3f 100644
--- a/imagen_video.html
+++ b/imagen_video.html
@@ -44,7 +44,7 @@

-
+
Paper Overview
@@ -81,7 +81,7 @@

Comparison to Imagen Video

-

We compare Snap Video against publicly available samples released by the authors and perform a user study evaluating photorealism, video-text alignment, and motion quantity and quality. While the public samples may have been chosen to showcase the method's strengths, our method features improved photorealism, video-text alignment, and motion quality. Results are expressed as the percentage of votes in favor of our method.

+

We compare Snap Video against publicly available samples released by the authors and perform a user study evaluating photorealism, video-text alignment, and motion quantity and quality. While the public samples may have been chosen to showcase the method's strengths, our method features improved photorealism, video-text alignment, and motion quality. Results are expressed as the percentage of votes in favor of our method.

@@ -104,8 +104,8 @@

Comparison to Imagen Video

-

We compare results produced by our method (left) with those produced by Imagen Video (right).

-

Hover the cursor on the video to reveal the prompt.

+

We compare results produced by our method (left) with those produced by Imagen Video (right).

+

Hover the cursor on the video to reveal the prompt.

@@ -1038,7 +1038,7 @@

Imagen Video

-
+
diff --git a/index.html b/index.html
index 2259a3f..11be0af 100644
--- a/index.html
+++ b/index.html
@@ -66,7 +66,7 @@

-
+
Paper Overview
@@ -158,13 +158,13 @@

The Snap Video Model

-

The widely adopted U-Net architecture is required to fully process each video frame. This increases computational overhead compared to purely text-to-image models, posing a practical limit on model scalability. In addition, extending U-Net-based architectures to naturally support spatial and temporal dimensions requires volumetric attention operations, which have prohibitive computational demands.

+

The widely adopted U-Net architecture is required to fully process each video frame. This increases computational overhead compared to purely text-to-image models, posing a practical limit on model scalability. In addition, extending U-Net-based architectures to naturally support spatial and temporal dimensions requires volumetric attention operations, which have prohibitive computational demands.

-

Inspired by FITs, we propose to leverage redundant information between frames and introduce a scalable transformer architecture that treats spatial and temporal dimensions as a single, compressed, 1D latent vector. This highly compressed representation allows us to perform spatiotemporal computation jointly and enables modeling of complex motions.

+

Inspired by FITs, we propose to leverage redundant information between frames and introduce a scalable transformer architecture that treats spatial and temporal dimensions as a single, compressed, 1D latent vector. This highly compressed representation allows us to perform spatiotemporal computation jointly and enables modeling of complex motions.
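A minimal sketch of the read-compute-write pattern that a FIT-style block follows is shown below: all spatio-temporal patch tokens are compressed into a short sequence of learned latent tokens, computation runs on that compressed sequence, and the result is written back to the patch tokens. The module names, dimensions, and token counts are illustrative assumptions, not the released Snap Video architecture.

# Minimal sketch of a FIT-style read-compute-write block (assumed shapes and
# hyperparameters; not the released implementation).
import torch
import torch.nn as nn

class FITBlock(nn.Module):
    def __init__(self, dim=512, num_latents=256, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, heads, batch_first=True)     # latents attend to patches
        self.compute = nn.MultiheadAttention(dim, heads, batch_first=True)  # latents attend to latents
        self.write = nn.MultiheadAttention(dim, heads, batch_first=True)    # patches attend to latents

    def forward(self, patch_tokens):
        # patch_tokens: (batch, T*H*W / patch_size^2, dim) -- every frame is
        # flattened into one joint spatio-temporal sequence.
        b = patch_tokens.shape[0]
        z = self.latents.unsqueeze(0).expand(b, -1, -1)
        z = z + self.read(z, patch_tokens, patch_tokens)[0]        # compress
        z = z + self.compute(z, z, z)[0]                           # joint spatio-temporal computation
        return patch_tokens + self.write(patch_tokens, z, z)[0]    # decompress

# 8 frames of 16x16 patch tokens processed jointly: no frame-by-frame pass and
# no quadratic attention over the full patch sequence.
out = FITBlock()(torch.randn(1, 8 * 16 * 16, 512))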

-

Thanks to joint spatiotemporal video modeling, Snap Video can synthesize temporally coherent videos with large motion (left) while retaining the semantic control capabilities typical of large-scale text-to-video generators (right).

+

Thanks to joint spatiotemporal video modeling, Snap Video can synthesize temporally coherent videos with large motion (left) while retaining the semantic control capabilities typical of large-scale text-to-video generators (right).

-

Hover the cursor on the video to reveal the prompt.

+

Hover the cursor on the video to reveal the prompt.

@@ -238,14 +238,14 @@

The Snap Video Model

Acknowledgements

-

We would like to thank Oleksii Popov, Artem Sinitsyn, Anton Kuzmenko, Vitalii Kravchuk, Vadym Hrebennyk, Grygorii Kozhemiak, Tetiana Shcherbakova, Svitlana Harkusha, Oleksandr Yurchak, Andrii Buniakov, Maryna Marienko, Maksym Garkusha, Brett Krong, and Anastasiia Bondarchuk for their help in the realization of video presentations, stories, and graphical assets; Colin Eles, Dhritiman Sagar, Vitalii Osykov, and Eric Hu for their supporting technical activities; and Maryna Diakonova for her assistance with annotation tasks.

+

We would like to thank Oleksii Popov, Artem Sinitsyn, Anton Kuzmenko, Vitalii Kravchuk, Vadym Hrebennyk, Grygorii Kozhemiak, Tetiana Shcherbakova, Svitlana Harkusha, Oleksandr Yurchak, Andrii Buniakov, Maryna Marienko, Maksym Garkusha, Brett Krong, and Anastasiia Bondarchuk for their help in the realization of video presentations, stories, and graphical assets; Colin Eles, Dhritiman Sagar, Vitalii Osykov, and Eric Hu for their supporting technical activities; and Maryna Diakonova for her assistance with annotation tasks.

-
+
diff --git a/make_a_video.html b/make_a_video.html
index 5d0ad9b..3c168bd 100644
--- a/make_a_video.html
+++ b/make_a_video.html
@@ -44,7 +44,7 @@

-
+
Paper Overview
@@ -81,7 +81,7 @@

Comparison to Make-A-Video

-

We compare Snap Video against publicly available samples released by the authors and perform a user study evaluating photorealism, video-text alignment, and motion quantity and quality. While the public samples may have been chosen to showcase the method's strengths, our method outperforms the baseline on all metrics. Results are expressed as the percentage of votes in favor of our method.

+

We compare Snap Video against publicly available samples released by the authors and perform a user study evaluating photorealism, video-text alignment, and motion quantity and quality. While the public samples may have been chosen to showcase the method's strengths, our method outperforms the baseline on all metrics. Results are expressed as the percentage of votes in favor of our method.

@@ -104,8 +104,8 @@

Comparison to Make-A-Video

-

We compare results produced by our method (left) with those produced by Make-A-Video (right).

-

Hover the cursor on the video to reveal the prompt.

+

We compare results produced by our method (left) with those produced by Make-A-Video (right).

+

Hover the cursor on the video to reveal the prompt.

@@ -308,7 +308,7 @@

Make-A-Video

-
+
diff --git a/our_samples.html b/our_samples.html
index b738b52..4e47c91 100644
--- a/our_samples.html
+++ b/our_samples.html
@@ -44,7 +44,7 @@

-
+
Paper Overview
@@ -81,11 +81,11 @@

Snap Video Samples

-

We show a collection of samples produced by our model on a set of gathered prompts.

+

We show a collection of samples produced by our model on a set of gathered prompts.

-

Snap Video can synthesize a large number of different concepts. Most importantly, thanks to joint spatiotemporal modeling, it can produce videos with challenging motion, including large camera movement, POV videos, and videos of fast-moving objects. Notably, the method maintains temporal consistency and avoids video flickering artifacts.

+

Snap Video can synthesize a large number of different concepts. Most importantly, thanks to joint spatiotemporal modeling, it can produce videos with challenging motion, including large camera movement, POV videos, and videos of fast-moving objects. Notably, the method maintains temporal consistency and avoids video flickering artifacts.

-

Hover the cursor on the video to reveal the prompt.

+

Hover the cursor on the video to reveal the prompt.

@@ -1378,7 +1378,7 @@

Snap Video Samples

-
+
diff --git a/our_samples_3d.html b/our_samples_3d.html
index 783fd58..22d7a4d 100644
--- a/our_samples_3d.html
+++ b/our_samples_3d.html
@@ -44,7 +44,7 @@

-
+
Paper Overview
@@ -81,10 +81,10 @@

Novel View Generation

-

We show a collection of samples obtained from Snap Video with prompts eliciting circular camera movement around different object categories. We find that the model is capable of generating plausible novel views of objects, suggesting that the model possesses an understanding of the 3D object geometry.

+

We show a collection of samples obtained from Snap Video with prompts eliciting circular camera movement around different object categories. We find that the model is capable of generating plausible novel views of objects, suggesting that the model possesses an understanding of the 3D object geometry.

-

Hover the cursor on the video to reveal the prompt.

+

Hover the cursor on the video to reveal the prompt.

@@ -1295,7 +1295,7 @@

Novel View Generation

-
+
diff --git a/our_samples_diversity.html b/our_samples_diversity.html
index 3ae3bf4..a02fbd0 100644
--- a/our_samples_diversity.html
+++ b/our_samples_diversity.html
@@ -44,7 +44,7 @@

-
+
Paper Overview
@@ -81,10 +81,10 @@

Samples Diversity

-

To show Snap Video's capability to produce varied outputs, we select a set of prompts and sample three videos from each, showing the results in each row. Our model is capable of producing diverse outputs for each prompt.

+

To show Snap Video's capability to produce varied outputs, we select a set of prompts and sample three videos from each, showing the results in each row. Our model is capable of producing diverse outputs for each prompt.

-

Hover the cursor on the video to reveal the prompt.

+

Hover the cursor on the video to reveal the prompt.

@@ -955,7 +955,7 @@

Samples Diversity

-
+
diff --git a/our_samples_hierarchical.html b/our_samples_hierarchical.html
index f5fc8c0..109e02c 100644
--- a/our_samples_hierarchical.html
+++ b/our_samples_hierarchical.html
@@ -44,7 +44,7 @@

-
+
Paper Overview
@@ -81,11 +81,11 @@

Hierarchical Generation

-

We devise a hierarchical generation strategy to increase video duration and framerate, adopting the reconstruction guidance method of "Video Diffusion Models" to condition the video generator on previously generated frames. We define a hierarchy of progressively increasing framerates and start by autoregressively generating a video of the desired length at the lowest framerate, at each step using the last generated frame as conditioning. Subsequently, for each successive framerate in the hierarchy, we autoregressively generate a video of the same length, conditioning the model on all frames that have already been generated at the lower framerates.

+

We devise a hierarchical generation strategy to increase video duration and framerate, adopting the reconstruction guidance method of "Video Diffusion Models" to condition the video generator on previously generated frames. We define a hierarchy of progressively increasing framerates and start by autoregressively generating a video of the desired length at the lowest framerate, at each step using the last generated frame as conditioning. Subsequently, for each successive framerate in the hierarchy, we autoregressively generate a video of the same length, conditioning the model on all frames that have already been generated at the lower framerates.
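The sketch below illustrates this sampling schedule. It assumes a hypothetical generate_chunk(frame_indices, cond_frames) callable standing in for one run of the diffusion sampler with reconstruction guidance on the conditioning frames; the chunk length, framerate hierarchy, and interface are illustrative assumptions rather than the released code.

# Illustrative sketch of the hierarchical sampling schedule (assumed interface).
from typing import Callable, Dict, Sequence

def hierarchical_sample(
    generate_chunk: Callable[[Sequence[int], Dict[int, object]], Dict[int, object]],
    num_frames: int = 32,                        # total frames at the target framerate
    target_fps: int = 12,
    fps_hierarchy: Sequence[int] = (3, 6, 12),   # each entry must divide target_fps
    chunk_len: int = 16,                         # frames produced per sampler call (assumed)
) -> Dict[int, object]:
    video: Dict[int, object] = {}  # frame index at target_fps -> generated frame

    # Stage 1: lowest framerate, autoregressive in time; each chunk is
    # conditioned on the last frame generated so far (if any).
    stride = target_fps // fps_hierarchy[0]
    idxs = list(range(0, num_frames, stride))
    for start in range(0, len(idxs), chunk_len):
        cond = {max(video): video[max(video)]} if video else {}
        video.update(generate_chunk(idxs[start:start + chunk_len], cond))

    # Later stages: each higher framerate fills in the missing frames while
    # conditioning on everything already generated at lower framerates.
    for fps in fps_hierarchy[1:]:
        stride = target_fps // fps
        missing = [i for i in range(0, num_frames, stride) if i not in video]
        for start in range(0, len(missing), chunk_len):
            video.update(generate_chunk(missing[start:start + chunk_len], dict(video)))
    return video

# Dummy usage with a stand-in sampler: fills all 32 frame slots at 12 fps.
frames = hierarchical_sample(lambda idxs, cond: {i: f"frame_{i}" for i in idxs})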

-

We show a selection of 32-frame videos sampled at 12 fps.

+

We show a selection of 32-frame videos sampled at 12 fps.

-

Hover the cursor on the video to reveal the prompt.

+

Hover the cursor on the video to reveal the prompt.

@@ -177,7 +177,7 @@

Hierarchical Generation

-
+
diff --git a/pyoco.html b/pyoco.html
index 7b3089e..57d43db 100644
--- a/pyoco.html
+++ b/pyoco.html
@@ -44,7 +44,7 @@

-
+
Paper Overview
@@ -81,7 +81,7 @@

Comparison to PYoCo

-

We compare Snap Video against publicly available samples released by the authors and perform a user study evaluating photorealism, video-text alignment, and motion quantity and quality. While the public samples may have been chosen to showcase the method's strengths, our method shows improved photorealism, video-text alignment, and motion quantity and quality. Results are expressed as the percentage of votes in favor of our method.

+

We compare Snap Video against publicly available samples released by the authors and perform a user study evaluating photorealism, video-text alignment, and motion quantity and quality. While the public samples may have been chosen to showcase the method's strengths, our method shows improved photorealism, video-text alignment, and motion quantity and quality. Results are expressed as the percentage of votes in favor of our method.

@@ -104,8 +104,8 @@

Comparison to PYoCo

-

We compare results produced by our method (left) with those produced by PYoCo (right).

-

Hover the cursor on the video to reveal the prompt.

+

We compare results produced by our method (left) with those produced by PYoCo (right).

+

Hover the cursor on the video to reveal the prompt.

@@ -710,7 +710,7 @@

PYoCo

-
+
diff --git a/stories.html b/stories.html
index 9e88126..599f13f 100644
--- a/stories.html
+++ b/stories.html
@@ -44,7 +44,7 @@

-
+
Paper Overview
@@ -81,9 +81,9 @@

Our Stories

-

Snap Video can assist designers in the generation of long stories. We make use of an LLM to generate a story plot, video prompts for different scenes, and scripts for the audio narration. We generate all video assets using our model, tuning the video prompts to obtain the desired visuals, and synthesize the audio narration.

+

Snap Video can assist designers in the generation of long stories. We make use of an LLM to generate a story plot, video prompts for different scenes, and scripts for the audio narration. We generate all video assets using our model, tuning the video prompts to obtain the desired visuals, and synthesize the audio narration.

-

Postproduction software is used to assemble the final video. The generated video assets are trimmed and composed into a sequence to form the video track, to which text overlays are added. Background music is inserted and the synthesized audio narration is aligned to the video content to generate the final result.

+

Postproduction software is used to assemble the final video. The generated video assets are trimmed and composed into a sequence to form the video track, to which text overlays are added. Background music is inserted and the synthesized audio narration is aligned to the video content to generate the final result.

@@ -130,7 +130,7 @@

Breakfast Burrito

-
+
diff --git a/video_ldm.html b/video_ldm.html
index 7842a78..d3975c1 100644
--- a/video_ldm.html
+++ b/video_ldm.html
@@ -44,7 +44,7 @@

-
+
Paper Overview
@@ -81,7 +81,7 @@

Comparison to Video LDM

-

We compare Snap Video against publicly available samples released by the authors and perform a user study evaluating photorealism, video-text alignment, and motion quantity and quality. While the public samples may have been chosen to showcase the method's strengths, our method shows improved photorealism, video-text alignment, and motion quantity and quality. Results are expressed as the percentage of votes in favor of our method.

+

We compare Snap Video against publicly available samples released by the authors and perform a user study evaluating photorealism, video-text alignment, and motion quantity and quality. While the public samples may have been chosen to showcase the method's strengths, our method shows improved photorealism, video-text alignment, and motion quantity and quality. Results are expressed as the percentage of votes in favor of our method.

@@ -104,8 +104,8 @@

Comparison to Video LDM

-

We compare results produced by our method (left) with those produced by Video LDM (right).

-

Hover the cursor on the video to reveal the prompt.

+

We compare results produced by our method (left) with those produced by Video LDM (right).

+

Hover the cursor on the video to reveal the prompt.

@@ -1306,7 +1306,7 @@

Video LDM

-