diff --git a/index.html b/index.html
index 08e3278..06246f2 100644
--- a/index.html
+++ b/index.html
@@ -3,10 +3,10 @@
+ content="EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing">
- MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
+ EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing
-

MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

+

EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing

- University of California, Santa Cruz
+ 1University of California, Santa Cruz,
+ 2Microsoft,
+ 3University of Michigan, Ann Arbor
@@ -104,11 +120,13 @@

MiniGPT-5: Interleaved Vision-and-Langu
- +

-

Figure 1. MiniGPT-5 is a unified model for interleaved vision-and-language
- comprehension and generation. Besides the original multimodal comprehension and text generation abilities,
- MiniGPT-5 can provide appropriate, coherent multimodal outputs.

+

Figure 1. Editing Pipeline with EditRoom. EditRoom is a unified language-guided 3D scene layout
+ editing framework that can automatically execute all layout editing types from natural language
+ commands. It consists of a command parameterizer for natural language comprehension and
+ a scene editor for editing execution. Given a source scene and natural language commands, it
+ generates a coherent and appropriate target scene.

@@ -123,19 +141,17 @@

Abstract

- Large Language Models (LLMs) have garnered significant attention for their advancements
- in natural language processing, demonstrating unparalleled prowess in text comprehension
- and generation. Yet, the simultaneous generation of images with coherent textual narratives
- remains an evolving frontier. In response, we introduce an innovative interleaved
- vision-and-language generation technique anchored by the concept of "generative vokens",
- acting as the bridge for harmonized image-text outputs.
- Our approach is characterized by a distinctive two-staged training strategy focusing on
- description-free multimodal generation, where the training requires no comprehensive
- descriptions of images. To bolster model integrity, classifier-free guidance is incorporated,
- enhancing the effectiveness of vokens on image generation.
- Our model, MiniGPT-5, exhibits substantial improvement over the baseline Divter model
- on the MMDialog dataset and consistently delivers superior or comparable multimodal outputs
- in human evaluations on the VIST dataset, highlighting its efficacy across diverse benchmarks.
+ Given the steep learning curve of professional 3D software and the time-consuming process of
+ managing large 3D assets, language-guided 3D scene editing has significant potential in fields
+ such as virtual reality, augmented reality, and gaming. However, recent approaches to
+ language-guided 3D scene editing either require manual interventions or focus only on
+ appearance modifications without supporting comprehensive scene layout changes. In response,
+ we propose EditRoom, a unified framework capable of executing a variety of layout edits through
+ natural language commands, without requiring manual intervention. Specifically, EditRoom
+ leverages Large Language Models (LLMs) for command planning and generates target scenes using
+ a diffusion-based method, enabling six types of edits: rotate, translate, scale, replace, add,
+ and remove. To address the lack of data for language-guided 3D scene editing, we have developed
+ an automatic pipeline to augment existing 3D scene synthesis datasets and introduced
+ EditRoom-DB, a large-scale dataset with 83k editing pairs, for training and evaluation. Our
+ experiments demonstrate that our approach consistently outperforms other baselines across all
+ metrics, indicating higher accuracy and coherence in language-guided scene layout editing.

@@ -149,7 +165,7 @@

Abstract

-

Interleaved Vision-and-Language Generation via LLMs

+

Unified Scene Layout Editing

@@ -157,14 +173,17 @@

-  • We leverage the pretrained multimodal large language model (MiniGPT-4) and text-to-image generation model (Stable Diffusion 2.1) to create a unified multimodal generation pipeline.
-  • We added vokens into LLM's vocabulary and align the voken features with stable diffusion conditional features.
-  • Text Generation Loss help model learn voken positions while Conditional Latent Denoising Loss guide the model to predicate appropriate features
+  • We leverage the pretrained multimodal large language model (GPT-4o) as the command parameterizer and a graph diffusion-based method as the scene editor to create a unified scene layout editing pipeline.
+  • The large language model converts natural language commands into breakdown commands, given the source scene information.
+  • The scene editor takes the breakdown commands and the source scene as input; it first generates an abstract target scene graph and then estimates the accurate target scene layout (see the sketch below).
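
To make the interface between these two stages concrete, here is a minimal Python sketch of how a command parameterizer could turn a free-form instruction plus a serialized source scene into atomic breakdown commands. Every name in it (BreakdownCommand, call_llm, parameterize_command) is a hypothetical illustration rather than the paper's actual code, and a real system would call GPT-4o where the canned stub is used here.

import json
from dataclasses import dataclass
from typing import List


@dataclass
class BreakdownCommand:
    op: str       # one of: rotate, translate, scale, replace, add, remove
    target: str   # object the edit applies to, e.g. "coffee_table_02"
    params: dict  # op-specific parameters, e.g. {"offset_m": [0.5, 0.0, 0.0]}


def call_llm(prompt: str) -> str:
    # Placeholder for a GPT-4o call; returns a canned JSON answer so the
    # sketch runs without network access.
    return json.dumps([{"op": "translate", "target": "coffee_table_02",
                        "params": {"offset_m": [0.5, 0.0, 0.0]}}])


def parameterize_command(instruction: str, source_scene_json: str) -> List[BreakdownCommand]:
    # Ask the LLM to break a natural-language edit into atomic commands,
    # conditioned on a serialized description of the source scene.
    prompt = (
        "Source scene:\n" + source_scene_json + "\n"
        "Instruction: " + instruction + "\n"
        "Return a JSON list of atomic edits (rotate/translate/scale/replace/add/remove)."
    )
    return [BreakdownCommand(**item) for item in json.loads(call_llm(prompt))]


if __name__ == "__main__":
    scene = json.dumps({"objects": [{"id": "coffee_table_02", "class": "table"}]})
    for cmd in parameterize_command("move the coffee table to the right", scene):
        print(cmd)

The resulting breakdown commands would then be passed, together with the source scene, to the scene editor described in Figure 2.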
- +

-

Figure 2. MiniGPT-5 pipeline.

+

Figure 2. The Scene Editor aims to provide accurate, coherent editing results according to the given source scene and language commands.
+ It consists of two graph transformer-based conditional diffusion models. One diffusion model generates the semantic target scene graph.
+ The other diffusion model estimates accurate pose and size information for each object in the generated target scene graph.
+ All diffusion processes are conditioned on the source scene and the breakdown command.

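
As a rough sketch of how these two diffusion stages could be composed, the snippet below shows one possible sampling interface; the GraphDiffusion and LayoutDiffusion classes, the SceneGraph container, and the 7-number layout encoding are illustrative assumptions, not the released models.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneGraph:
    nodes: List[str]                   # object categories, e.g. ["bed", "nightstand"]
    edges: List[Tuple[int, int, str]]  # (i, j, spatial relation) triples
    layout: Dict[int, List[float]] = field(default_factory=dict)  # node index -> [x, y, z, yaw, sx, sy, sz]


class GraphDiffusion:
    # Stand-in for the graph transformer-based diffusion over semantic scene graphs.
    def sample(self, source: SceneGraph, command: str) -> SceneGraph:
        # A real model would denoise a target graph conditioned on (source, command);
        # the placeholder simply copies the source graph.
        return SceneGraph(nodes=list(source.nodes), edges=list(source.edges))


class LayoutDiffusion:
    # Stand-in for the diffusion model that fills in per-object pose and size.
    def sample(self, target: SceneGraph, source: SceneGraph, command: str) -> SceneGraph:
        target.layout = {i: [0.0] * 7 for i in range(len(target.nodes))}  # placeholder values
        return target


def scene_editor(source: SceneGraph, breakdown_command: str) -> SceneGraph:
    graph_model, layout_model = GraphDiffusion(), LayoutDiffusion()
    target_graph = graph_model.sample(source, breakdown_command)          # stage 1: semantics
    return layout_model.sample(target_graph, source, breakdown_command)   # stage 2: geometry
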
@@ -183,12 +202,16 @@

- Qualitative examples from MiniGPT-5 and baselines on the CC3M, VIST, and MMDialog datasets. From the comparisons, we can find the MiniGPT-5 and SD 2 have similar results on single-image generation. When we evaluate with multi-step multimodal prompts, MiniGPT-5 can produce more coherent and high-quality images.
+ Qualitative examples from EditRoom and baselines on single- and multi-operation editing. The comparisons show that EditRoom provides more accurate and coherent editing results than the other baselines, and that it generalizes to multi-operation editing tasks without being trained on such data.

- + +

+

Figure 3. Comparison with other baselines on single-operation editing.

+

+

-

Figure 3. Comparison with other baselines.

+

Figure 4. Comparison with other baselines on multi-operation editing.

@@ -197,11 +220,14 @@

BibTeX

-
@misc{zheng2023minigpt5,
-      title={MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens}, 
-      author={Kaizhi Zheng and Xuehai He and Xin Eric Wang},
-      year={2023},
-      journal={arXiv preprint arXiv:2310.02239}
+    
@misc{zheng2024editroomllmparameterizedgraphdiffusion,
+      title={EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing}, 
+      author={Kaizhi Zheng and Xiaotong Chen and Xuehai He and Jing Gu and Linjie Li and Zhengyuan Yang and Kevin Lin and Jianfeng Wang and Lijuan Wang and Xin Eric Wang},
+      year={2024},
+      eprint={2410.12836},
+      archivePrefix={arXiv},
+      primaryClass={cs.GR},
+      url={https://arxiv.org/abs/2410.12836}, 
     }
     
diff --git a/static/images/compare-arxiv.png b/static/images/compare-arxiv.png
deleted file mode 100644
index 238f2f3..0000000
Binary files a/static/images/compare-arxiv.png and /dev/null differ
diff --git a/static/images/structure.png b/static/images/structure.png
deleted file mode 100644
index 8ff45b8..0000000
Binary files a/static/images/structure.png and /dev/null differ
diff --git a/static/images/teaser.png b/static/images/teaser.png
deleted file mode 100644
index 4df9a44..0000000
Binary files a/static/images/teaser.png and /dev/null differ