In the `tile_decode` implementation, tiling is performed only along the temporal dimension (t), while the spatial dimensions (h, w) are not tiled. This differs from implementations such as CogVideoX in diffusers, which tile along both the spatial and temporal dimensions.
Why does this implementation tile only along the temporal dimension and leave the spatial dimensions untouched?
Would it be possible to extend this implementation to support spatial tiling as well?
Q1: Why does our implementation only perform tiling along the temporal dimension?
By taking advantage of causal convolution, we can tile in the time domain without any loss: the results of direct inference and tiled inference are exactly the same. In addition, temporal tiling alone reduced memory usage enough to meet our training needs, so we did not need to tile further. Note: see our report for details.
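The lossless property can be illustrated with a minimal sketch. It is a toy, single-layer, 1-D example in plain Python, not the repository's `tile_decode`; the function names `causal_conv1d` and `tiled_causal_conv1d` are hypothetical. The assumption is that each tile is given the last `k - 1` input frames of the previous tile as left context, which is everything a causal kernel of length `k` can see.

```python
def causal_conv1d(frames, kernel, history=None):
    """Causal convolution over time: output[t] uses only frames[t-k+1 .. t]."""
    k = len(kernel)
    # Left context: zeros at the start of the clip, or the cached tail
    # of the previous tile when called from the tiled version.
    if history is None:
        history = [0.0] * (k - 1)
    padded = list(history) + list(frames)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]


def tiled_causal_conv1d(frames, kernel, tile_size):
    """Process the clip tile by tile, carrying k-1 past frames between tiles."""
    k = len(kernel)
    history = [0.0] * (k - 1)
    out = []
    for start in range(0, len(frames), tile_size):
        tile = list(frames[start:start + tile_size])
        out.extend(causal_conv1d(tile, kernel, history))
        # Keep the last k-1 input frames as context for the next tile.
        history = (history + tile)[-(k - 1):] if k > 1 else []
    return out


frames = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0, 9.0]
kernel = [0.25, 0.25, 0.5]
full = causal_conv1d(frames, kernel)
tiled = tiled_causal_conv1d(frames, kernel, tile_size=4)
print(full == tiled)
```

Because every output frame depends only on past input frames, the tiled pass performs the same arithmetic as the full pass, so the two results match exactly. In a multi-layer network, each causal layer would presumably need its own cached context in the same way.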
Q2: Can it extend to spatial tiling?
Yes, it can also be extended to spatial tiling, similar to methods like CogVideoX. However, spatial tiling may disrupt the latent space, which poses unknown risks for diffusion training and can degrade video reconstruction quality. Since temporal tiling already meets our memory requirements for video generation pretraining, we chose not to implement spatial tiling.
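One reason spatial tiling is not lossless can be shown with a toy sketch. It assumes a spatial convolution with symmetric zero padding and splits a single row into independent, non-overlapping tiles. This is not the CogVideoX or diffusers implementation, and the function names are hypothetical.

```python
def same_conv1d(row, kernel):
    """'Same' convolution with symmetric zero padding (odd kernel length)."""
    r = len(kernel) // 2
    padded = [0.0] * r + list(row) + [0.0] * r
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(row))
    ]


def naive_tiled_same_conv1d(row, kernel, tile_size):
    """Convolve each tile independently; each tile sees zeros past its edges."""
    out = []
    for start in range(0, len(row), tile_size):
        out.extend(same_conv1d(row[start:start + tile_size], kernel))
    return out


row = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernel = [1.0, 1.0, 1.0]
full = same_conv1d(row, kernel)
tiled = naive_tiled_same_conv1d(row, kernel, tile_size=4)
# Indices where tiled and full outputs disagree: the two pixels at the seam.
print([i for i, (a, b) in enumerate(zip(full, tiled)) if a != b])  # [3, 4]
```

A symmetric kernel needs neighbours on both sides, so pixels next to a tile seam see padding instead of the real neighbouring pixels. Spatial tiling schemes typically overlap tiles and blend the seams to hide this, but the output is then an approximation of the untiled result rather than an exact match.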