Awesome-Unified-Multimodal-Models
[arXiv 2025] MovieAgent: Automated Movie Generation via Multi-Agent CoT Planning
[CVPR 2025] A Hierarchical Movie Level Dataset for Long Video Generation
[ECCV 2024] DragAnything: Motion Control for Anything using Entity Representation
[NeurIPS 2023] DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models
[ICCV 2023] DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models
[IJCV 2024] TransDETR: End-to-end Video Text Spotting with Transformer
[IJCV 2025] Paragraph-to-Image Generation with Information-Enriched Diffusion Model
[NeurIPS 2021] BOVText: A Large-Scale, Multidimensional Multilingual Dataset for Video Text Spotting