In April 2024, we launched Open-Sora-Plan v1.0.0, featuring a simple and efficient design along with remarkable performance in text-to-video generation. Its model and data have already been adopted as a foundation in numerous research projects.
Today, we are excited to present Open-Sora-Plan v1.1.0, which significantly improves video generation quality and duration.
Compared to the previous version, Open-Sora-Plan v1.1.0 includes the following improvements:
- Better compressed visual representations. We optimized the CausalVideoVAE architecture, which now offers stronger performance and higher inference efficiency.
- Higher-quality, longer video generation. We used higher-quality visual data together with captions generated by ShareGPT4Video, enabling the model to better understand the workings of the world.
Along with performance improvements, Open-Sora-Plan v1.1.0 maintains the minimalist design and data efficiency of v1.0.0. Remarkably, we found that v1.1.0 exhibits similar performance to the Sora base model, indicating that our version's evolution aligns with the scaling law demonstrated by Sora.
We open-source Open-Sora-Plan to facilitate the future development of video generation in the community. Code, data, and models will be made publicly available.
- Demo: Hugging Face demo here.
- Code: All training scripts and sample scripts.
- Model: Both the Diffusion Model and the CausalVideoVAE here.
- Data: Both raw videos and captions here.
| Generated 65×512×512 (2.7s) | Edited 65×512×512 (2.7s) |
|---|---|
As the number of frames increases, the encoder overhead of CausalVideoVAE gradually rises. When training with 257 frames, 80GB of VRAM is insufficient for the VAE to encode the video. Therefore, we reduced the number of CausalConv3D layers, retaining only the last two stages of CausalConv3D in the encoder. This change significantly lowers the overhead while maintaining nearly the same performance. Note that we only modified the encoder; the decoder still retains all CausalConv3D layers, as training the Diffusion Model does not require the decoder.
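The idea can be illustrated with a small sketch: keep cheap per-frame 2D convolutions in the early, high-resolution encoder stages and reserve CausalConv3D for the last two, low-resolution stages. The classes and names below are hypothetical simplifications for illustration, not the released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution padded only on the past side of the time axis, so frame t
    never sees future frames (minimal illustrative version)."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.time_pad = kernel_size - 1
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              padding=(0, kernel_size // 2, kernel_size // 2))

    def forward(self, x):                                  # x: (B, C, T, H, W)
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))       # pad past frames only
        return self.conv(x)

def build_encoder_stages(channels=(128, 256, 512, 512), num_causal_stages=2):
    """Early (high-resolution) stages use per-frame 2D convs, applied frame by
    frame in practice; only the last `num_causal_stages` low-resolution stages
    keep the expensive CausalConv3d."""
    stages = nn.ModuleList()
    for i, ch in enumerate(channels):
        if i >= len(channels) - num_causal_stages:
            stages.append(CausalConv3d(ch, ch))
        else:
            stages.append(nn.Conv2d(ch, ch, kernel_size=3, padding=1))
    return stages
```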
We compare the computational overhead of the two versions by testing the forward inference of the encoder on the H100.
| Version | 129×256×256 (Peak Mem. / Speed) | 257×256×256 (Peak Mem. / Speed) | 513×256×256 (Peak Mem. / Speed) |
|---|---|---|---|
| v1.0.0 | 22G / 2.9 it/s | OOM | OOM |
| v1.1.0 | 18G / 4.9 it/s | 34G / 2.5 it/s | 61G / 1.2 it/s |
In v1.0.0, our temporal module had only TemporalAvgPool. TemporalAvgPool discards high-frequency information in the video, such as details and edges. To address this issue, we improved this module in v1.1.0. As shown in the figure below, we introduced a convolution branch with learnable weights, allowing the different branches to decouple different features: when we omit CausalConv3D and keep only TemporalAvgPool, the reconstruction is very blurry; conversely, when we omit TemporalAvgPool, the reconstruction becomes very sharp.
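In code, the improved block can be thought of as blending a strided CausalConv3D branch (which preserves details and edges) with a TemporalAvgPool branch (which preserves smooth, low-frequency content) through a learnable mixed factor. The following is a minimal sketch under that reading; the names and exact layer configuration are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDownBlock(nn.Module):
    """Downsample time 2x by mixing a causal 3D conv branch (high frequencies)
    with a temporal average-pooling branch (low frequencies) via a learnable
    mixed factor. Illustrative sketch; assumes an even number of frames."""
    def __init__(self, channels, mixed_factor_init=0.5):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=3,
                              stride=(2, 1, 1), padding=(0, 1, 1))
        self.pool = nn.AvgPool3d(kernel_size=(2, 1, 1), stride=(2, 1, 1))
        self.mixed_factor = nn.Parameter(torch.tensor(mixed_factor_init))

    def forward(self, x):                                  # x: (B, C, T, H, W)
        alpha = torch.sigmoid(self.mixed_factor)           # weight of the pooling branch
        x_pad = F.pad(x, (0, 0, 0, 0, 2, 0))               # causal padding in time
        return alpha * self.pool(x) + (1 - alpha) * self.conv(x_pad)
```

Under this formulation, a large sigmoid(mixed factor) means the output leans on the smooth pooling branch, which is consistent with the blur/sharpness observations above.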
| Setting | SSIM↑ | LPIPS↓ | PSNR↑ |
|---|---|---|---|
| Base | 0.850 | 0.091 | 28.047 |
| + Frames | 0.868 | 0.070 | 28.829 |
| + Reset mixed factor | 0.873 | 0.070 | 29.140 |
Similar to v1.0.0, we initialized from the Latent Diffusion VAE and used tail initialization. For CausalVideoVAE, we trained for 100k steps in the first stage with a video shape of 9×256×256. Subsequently, we increased the frame count from 9 to 25 and found that this significantly improved the model's performance. It is important to clarify that the mixed factor was enabled during both the first and second stages; by the end of training, α = sigmoid(mixed factor) reached 0.88, indicating the model's tendency to retain low-frequency information. In the third stage, we reinitialized the mixed factor to 0.5 (sigmoid(0.5) ≈ 0.6225), which further enhanced the model's capabilities.
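Assuming a block like the sketch above and a plain PyTorch state dict, the third-stage reset amounts to overwriting the learned mixed-factor entries before resuming training. The file names below are placeholders.

```python
import torch

# Hypothetical: push every learned mixed factor back to 0.5
# (sigmoid(0.5) ≈ 0.6225) so the convolution branch regains influence.
state = torch.load("vae_stage2.ckpt", map_location="cpu")   # assumed plain state dict
for name in state:
    if name.endswith("mixed_factor"):
        state[name] = torch.tensor(0.5)
torch.save(state, "vae_stage3_init.ckpt")
```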
We found that using GAN loss helps retain high-frequency information and alleviates grid artifacts. Additionally, we observed that switching from 2D GAN to 3D GAN provides further improvements.
GAN Loss/Step | SSIM↑ | LPIPS↓ | PSNR↑ |
---|---|---|---|
2D/80k | 0.879 | 0.068 | 29.480 |
3D/80k | 0.882 | 0.067 | 29.890 |
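Conceptually, the 2D→3D switch replaces a per-frame PatchGAN-style discriminator with one built from 3D convolutions, so real/fake decisions are made over spatio-temporal patches and temporal artifacts are also penalized. Below is a generic sketch of such a 3D discriminator, not the discriminator shipped with this project.

```python
import torch.nn as nn

def patch_discriminator_3d(in_ch=3, base_ch=64, num_layers=3):
    """PatchGAN-style discriminator using Conv3d, so each output logit scores a
    small spatio-temporal patch of the reconstructed video. Illustrative only."""
    layers = [nn.Conv3d(in_ch, base_ch, kernel_size=4, stride=2, padding=1),
              nn.LeakyReLU(0.2, inplace=True)]
    ch = base_ch
    for _ in range(num_layers - 1):
        layers += [nn.Conv3d(ch, ch * 2, kernel_size=4, stride=2, padding=1),
                   nn.BatchNorm3d(ch * 2),
                   nn.LeakyReLU(0.2, inplace=True)]
        ch *= 2
    layers.append(nn.Conv3d(ch, 1, kernel_size=4, stride=1, padding=1))  # patch logits
    return nn.Sequential(*layers)
```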
Even with the lighter encoder, encoding very long videos in a single pass still exhausts GPU memory, so inference must be tiled. We therefore introduced a method called temporal rollback tiled convolution, a tiling approach specifically designed for CausalVideoVAE: all windows except the first discard their first frame, because the first frame of a window is treated as an image, while the remaining frames are treated as video frames.
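A sketch of the tiling logic, assuming a generic `vae.encode` that maps a clip of 1 + 4k frames to 1 + k latent frames (the function below is illustrative, not the project's implementation): each new window starts on the last frame of the previous one, and its first, image-like latent is dropped so the stitched latents line up.

```python
import torch

def temporal_rollback_encode(vae, video, window=65):
    """Encode a long video in overlapping temporal windows (illustrative sketch).

    `video` is (B, C, T, H, W). Every window after the first starts on the last
    frame of the previous window; since that first frame is encoded as an image,
    its latent is discarded to avoid duplicating it in the stitched result.
    """
    latents = []
    T = video.shape[2]
    start, first = 0, True
    while first or start < T - 1:
        clip = video[:, :, start:start + window]
        z = vae.encode(clip)                    # (B, C', 1 + (clip_len - 1) // 4, h, w)
        if not first:
            z = z[:, :, 1:]                     # drop the image-frame latent
        latents.append(z)
        first = False
        start += window - 1                     # roll back by one frame
    return torch.cat(latents, dim=2)
```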
We tested the speed on the H100 with a window size of 65×256×256.
| Version | 129×256×256 (Peak Mem. / Speed) | 257×256×256 (Peak Mem. / Speed) | 513×256×256 (Peak Mem. / Speed) |
|---|---|---|---|
| 4×8×8 | 10G / 1.3 s/it | 10G / 2.6 s/it | 10G / 5.3 s/it |
Since Open-Sora-Plan supports joint training of images and videos, our data collection is divided into two parts: images and videos. Images do not need to originate from videos; they are independent datasets. We spent approximately 32×240 H100 hours generating image and video captions, and all of this is open source!
We obtained 11 million image-text pairs from Pixart-Alpha, with captions generated by LLaVA. Additionally, we utilized the high-quality OCR dataset Anytext-3M, which pairs each image with corresponding OCR characters. However, these captions were insufficient to describe the entire image, so we used InternVL-1.5 for supplementary descriptions. Since T5 only supports English, we filtered for English data, which constitutes about half of the complete dataset. Furthermore, we selected high-quality images from Laion-5B to enhance human-like generation quality. The selection criteria included high resolution, high aesthetic scores, and watermark-free images containing people.
Here, we are open-sourcing the prompts used for InternVL-1.5:
- For Anytext-3M: `Combine this rough caption: "{}", analyze the image in a comprehensive and detailed manner. "{}" can be recognized in the image.`
- For Human-160k: `Analyze the image in a comprehensive and detailed manner.`
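For the Anytext data, the template presumably receives the rough caption in the first slot and the recognized OCR text in the second before being sent to InternVL-1.5 together with the image. A small formatting example (the caption and OCR string below are made up):

```python
ANYTEXT_TEMPLATE = (
    'Combine this rough caption: "{}", analyze the image in a comprehensive '
    'and detailed manner. "{}" can be recognized in the image.'
)

# Hypothetical sample values; in practice they come from the Anytext-3M metadata.
rough_caption = "a storefront sign at night"
ocr_text = "OPEN 24 HOURS"

prompt = ANYTEXT_TEMPLATE.format(rough_caption, ocr_text)
# `prompt` is then passed to InternVL-1.5 alongside the image.
```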
Name | Image Source | Text Captioner | Num pairs |
---|---|---|---|
SAM-11M | SAM | LLaVA | 11,185,255 |
Anytext-3M-en | Anytext | InternVL-1.5 | 1,886,137 |
Human-160k | Laion | InternVL-1.5 | 162,094 |
In v1.0.0, we sampled one frame from each video to generate captions. However, as video length increased, a single frame could not adequately describe the entire video's content or temporal movements. Therefore, we used a video captioner to generate captions for the entire video clip. Specifically, we used ShareGPT4Video, which effectively covers temporal information and describes the entire video content. The v1.1.0 video dataset comprises approximately 3k hours, compared to only 300 hours in v1.0.0. As before, we have open-sourced all text annotations and videos (both under the CC0 license), which can be found here.
| Name | Hours | Num frames | Num pairs |
|---|---|---|---|
| Mixkit | 42.0h | 65 | 54,735 |
| | | 513 | 1,997 |
| Pixabay | 353.3h | 65 | 601,513 |
| | | 513 | 51,483 |
| Pexel | 2561.9h | 65 | 3,832,666 |
| | | 513 | 271,782 |
Similar to our previous work, we employed a multi-stage cascaded training method. Below is our training card:
We initially believed that the diffusion model's performance would keep improving with longer training. Surprisingly, however, the logs showed that videos generated at 50k steps were of higher quality than those at 70-100k steps. Indeed, extensive sampling revealed that checkpoints at 40-60k steps outperformed those at 80-100k steps. Quantitatively, 50k steps correspond to approximately 2 epochs of training. It is currently unclear whether this is due to overfitting on a small dataset or the limited capacity of the 2+1D model.
In the second stage, we used Huawei Ascend computing power for training. This stage's training and inference were fully supported by Huawei. We conducted sequence parallel training and inference on a large-scale cluster, distributing one sample across eight ranks. Models trained on Huawei Ascend can also be loaded into GPUs and generate videos of the same quality.
In the third stage, we further increased the frame count to 513 frames, approximately 21 seconds at 24 FPS. However, this stage presents several challenges, such as ensuring temporal consistency in the 2+1D model over long durations and whether the current amount of data is sufficient. We are still training the model for this stage and continuously monitoring its progress.
| Name | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|
| Training Video Size | 65×512×512 | 221×512×512 | 513×512×512 |
| Compute (#Num × #Hours) | 80 H100 × 72 | 512 Ascend × 72 | Under Training |
| Checkpoint | HF | HF | Under Training |
| Log | wandb | - | - |
| Training Data | ~3k hours videos + 13M images (shared across stages) | | |
The recently proposed ReVideo achieves accurate video editing by modifying the first frame and applying motion control within the edited area. Although it achieves excellent editing performance, the editing length is limited by its base model, SVD. Open-Sora-Plan, as a foundation model for long-video generation, can compensate for this limitation. We are currently collaborating with the ReVideo team to use Open-Sora-Plan as the base model for long-video editing. Some preliminary results are shown here.
The initial version still needs improvement in several aspects. In the future, we will continue to explore integration with ReVideo to develop improved long-video editing models.
Despite the promising results of v1.1.0, there remains a gap between our model and Sora. Here, we present some failure cases and discuss them.
Despite the significant performance improvement of VAE in v1.1.0 over v1.0.0, we still encounter failures in challenging cases, such as sand dunes and leaves. The video on the left shows the reconstructed video downsampled by a factor of 4 in time, while the video on the right is downsampled by a factor of 2. Both exhibit jitter when reconstructing fine-grained features. This indicates that reducing temporal downsampling alone cannot fully resolve the jitter issue.
On the left is a video generated by v1.1.0 showing a puppy in the snow. In this video, the puppy's head exhibits semantic distortion, indicating that the model struggles to correctly identify which head belongs to which dog. On the right is a video generated by Sora's base model. We observe that Sora's early base model also experienced semantic distortion issues. This suggests that we may achieve better results by scaling up the model and increasing the amount of training data.
Prompt: A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in.
Ours | Sora Base×1 | Sora Base×4 | Sora Base×32 |
---|---|---|---|
The primary difference between videos and images lies in their dynamic nature, where objects undergo a series of changes across consecutive frames. However, the videos generated by v1.1.0 still contain many instances of limited dynamics. Upon reviewing a large number of training videos, we found that while web-crawled videos have high visual quality, they are often filled with meaningless close-up shots. These close-ups typically show minimal movement or are even static. On the left, we present a generated video of a bird, while on the right is a training video we found, which is almost static. There are many similar videos in the dataset from stock footage sites.
Prompt: This close-up shot of a Victoria crowned pigeon showcases its striking blue plumage and red chest. Its crest is made of delicate, lacy feathers, while its eye is a striking red color. The bird's head is tilted slightly to the side, giving the impression of it looking regal and majestic. The background is blurred, drawing attention to the bird's striking appearance.
Ours | Raw video |
---|---|
We found that using negative prompts can significantly improve video quality, even though we did not explicitly tag the training data with different labels. On the left is a video sampled using a negative prompt, while on the right is a video generated without a negative prompt. This suggests that we may need to incorporate more prior knowledge into the training data. For example, when a video has a watermark, we should note "watermark" in the corresponding caption. When a video's bitrate is too low, we should add more tags to distinguish it from high-quality videos, such as "low quality" or "blurry." We believe that explicitly injecting these priors can help the model differentiate between the vast amounts of pretraining data (low quality) and the smaller amounts of fine-tuning data (high quality), thereby generating higher quality videos.
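Mechanically, the negative prompt simply takes the place of the empty unconditional prompt in classifier-free guidance, so the sampler is pushed away from the undesirable concepts. A generic sketch of one guided denoising step (the call signature is illustrative, modeled on common diffusion APIs rather than taken from this codebase):

```python
def guided_noise_pred(model, latents, t, cond_emb, neg_emb, guidance_scale=7.5):
    """Classifier-free guidance step where the unconditional branch is driven by
    a negative-prompt embedding instead of an empty prompt (generic sketch)."""
    noise_cond = model(latents, t, encoder_hidden_states=cond_emb)
    noise_neg = model(latents, t, encoder_hidden_states=neg_emb)
    # Move away from the negative prompt, toward the positive prompt.
    return noise_neg + guidance_scale * (noise_cond - noise_neg)
```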
Prompt: A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in. Negative Prompt: distorted, discontinuous, ugly, blurry, low resolution, motionless, static, low quality
With Negative Prompt | Without Negative Prompt |
---|---|
In our future work, we will focus on two main areas: (1) data scaling and (2) model design. Once we have a robust baseline model, we will extend it to handle variable durations and conditional control models.
As mentioned earlier, our dataset is entirely sourced from stock footage websites. Although these videos are of high quality, many consist of close-up shots of specific areas, resulting in slow motion in the videos. We believe this is one of the main reasons for the limited dynamics observed. Therefore, we will continue to collect datasets from diverse sources to address this issue.
In v1.1.0, our dataset comprises only ~3k hours of video. We are actively collecting more data and anticipate that the video dataset for the next version will reach ~100k hours. We welcome recommendations from the open-source community for additional datasets.
In our internal testing, we found that even without temporal downsampling, the jitter in reconstructing fine-grained features cannot be completely resolved. Therefore, we need to reconsider how to mitigate video jitter as much as possible while still supporting both images and videos. We will introduce a more powerful CausalVideoVAE in the next version.
In v1.1.0, we found that 2+1D models can generate higher-quality videos in short durations. However, for long videos, they tend to exhibit discontinuities and inconsistencies. Therefore, we will explore more possibilities in model architecture to address this issue.