varying depth values #30
The project mainly relies on Stable Video Diffusion.
My understanding of the paper is that they use a sliding context window of around 1.5 s for inference, so it makes sense that values would drift over periods longer than a couple of seconds. I doubt there is a simple fix, but I'd love to hear it if people have ideas.
Hi, thank you for your feedback. Due to memory restrictions, the maximum processing length at one time is 110 frames. Videos longer than 110 frames are processed in overlapping segments. Temporal consistency within the same segment is very good, I think. As for temporal consistency across segments, our designed inference strategy (including noise initialization and latent interpolation) works for most cases, but it's hard to always guarantee consistency across segments due to the limited temporal context. Best,
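To make the segment strategy concrete, below is a minimal sketch of overlapped-segment inference, assuming a hypothetical `predict_depth_segment` helper and a 25-frame overlap; it blends the overlap region in depth space with a linear ramp, which is a simplification of the noise initialization and latent-space interpolation the repository actually uses.

```python
import numpy as np

MAX_LEN = 110  # max frames per segment, per the comment above
OVERLAP = 25   # hypothetical overlap between consecutive segments

def predict_depth_segment(frames: np.ndarray) -> np.ndarray:
    """Stand-in for the model's per-segment inference:
    maps (T, H, W, 3) frames to (T, H, W) depth."""
    raise NotImplementedError

def predict_long_video(frames: np.ndarray) -> np.ndarray:
    """Process a long video in overlapping segments, blending each
    overlap region with a linear ramp so consecutive segments agree."""
    n, h, w = frames.shape[:3]
    depth = np.zeros((n, h, w), dtype=np.float32)
    stride = MAX_LEN - OVERLAP
    for start in range(0, n, stride):
        end = min(start + MAX_LEN, n)
        seg = predict_depth_segment(frames[start:end])
        if start == 0:
            depth[:end] = seg
        else:
            ov = min(OVERLAP, end - start)
            # Ramp from 0 (keep previous segment) to 1 (use new segment).
            ramp = np.linspace(0.0, 1.0, ov, dtype=np.float32)[:, None, None]
            depth[start:start + ov] = ((1 - ramp) * depth[start:start + ov]
                                       + ramp * seg[:ov])
            depth[start + ov:end] = seg[ov:]
        if end == n:
            break
    return depth
```

Where the segment boundaries fall relative to the scene content is exactly the "where to segment the video" knob mentioned further down in the thread.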
Hi Wenbo, thank you for the explanation.
Hi, the noise initialization for overlapped segments has been included in the code. For the failure case, you may try a different random seed (the default is 42) by adding the argument "--seed xxx". I'm not sure whether this will help. What definitely has an influence is where the video is segmented; you may tune this for the failure case.
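For context, a hedged sketch of what a seed argument typically controls in a diffusion pipeline; the exact wiring in the repository may differ (e.g. a `torch.Generator` passed to the pipeline), but the default of 42 matches the `--seed` default mentioned above.

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed the RNGs that influence diffusion sampling.
    42 matches the --seed default mentioned above."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```

Changing the seed changes the initial noise of the diffusion process, which is why a failure case can sometimes look different after a re-run with another seed.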
Thank you, I'll run more tests following your suggestions.
Hi Wenbo,
Hi, we have now released version v1.0.1 with improved quality and speed. The issue of "over-saturated" depth estimation is greatly alleviated. You may give it a try and check the latest results.
Hi, thanks for your valuable comments. I think it may produce slightly better results if the normalization is performed globally. But I'm not sure, since we found the predicted values almost always fall within [0, 1], even without post-normalization. Glad to hear your comments.
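To illustrate the distinction being discussed, here is a minimal sketch of per-segment versus global min-max normalization over hypothetical per-segment depth arrays; per-segment normalization is one way a static background can end up with different values in different segments.

```python
import numpy as np

def normalize_per_segment(segments: list[np.ndarray]) -> list[np.ndarray]:
    """Min-max normalize each segment independently; each segment gets
    its own min/max, so background values can shift between segments."""
    return [(s - s.min()) / (s.max() - s.min() + 1e-8) for s in segments]

def normalize_globally(segments: list[np.ndarray]) -> list[np.ndarray]:
    """Min-max normalize with one min/max over all segments, so a static
    background keeps the same value across the whole video."""
    lo = min(float(s.min()) for s in segments)
    hi = max(float(s.max()) for s in segments)
    return [(s - lo) / (hi - lo + 1e-8) for s in segments]
```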
Hi, thanks for your amazing work!
I've been testing your code with long videos (300 to 800 frames), and I often get varying background values over time.
For example, with this video (383 frames), I get different background values from frame 231 to frame 315 using these parameters:
Output frame rate: 24
Inference steps: 25
Guidance scale: 1.2
Dataset: kitti
Is this expected with longer videos?
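A hedged diagnostic sketch for quantifying this kind of drift: track the mean predicted depth of a fixed patch assumed to be static background (the patch location below is a placeholder) and compare it across frames, e.g. frame 231 versus frame 315.

```python
import numpy as np

def background_series(depth: np.ndarray,
                      patch=(slice(0, 50), slice(0, 50))) -> np.ndarray:
    """Mean depth of a fixed (assumed static) background patch per frame.
    depth: (T, H, W) array of per-frame depth predictions."""
    return depth[:, patch[0], patch[1]].reshape(len(depth), -1).mean(axis=1)

# Usage (hypothetical): quantify the jump reported between frames 231 and 315.
# series = background_series(depth)
# print(series[231], series[315], abs(series[315] - series[231]))
```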