Stars
Clean, minimal, accessible reproduction of DeepSeek R1-Zero
Universal Actions for Enhanced Embodied Foundation Models
Official code release for ConceptGraphs
[Embodied-AI-Survey-2024] Paper list and projects for Embodied AI
PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators
A curated list of papers for generalist agents
Reading list for memory-augmented multimodal research, including multimodal context modeling, memory in vision and robotics, and external memory/knowledge-augmented MLLMs.
LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.
Cosmos is a world model development platform consisting of world foundation models, tokenizers, and a video processing pipeline to accelerate the development of Physical AI at Robotics & AV labs. C…
A generative world for general-purpose robotics & embodied AI learning.
[NeurIPS 2024 Best Paper][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ult…
Official Implementation of the paper: "Verbalized Representation Learning for Interpretable Few-Shot Generalization"
A paper list of recent works on token compression for ViT and VLM
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
A Simple yet Effective Pathway to Empowering LLaVA to Understand and Interact with 3D World
Awesome-LLM-3D: a curated list of resources on Multi-modal Large Language Models in the 3D world
LaTeX template files for dissertations and theses formatted according to UCLA graduate division's requirements
[RSS 2023] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
A modular high-level library to train embodied AI agents across a variety of tasks and environments.
[ICLR 2025] Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
Embodied Agent Interface (EAI): Benchmarking LLMs for Embodied Decision Making (NeurIPS D&B 2024 Oral)
[ICLR 2025] Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Official Implementation of "Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining"
Official implementation for FlexAttention for Efficient High-Resolution Vision-Language Models
A simple pip-installable Python tool to generate your own HTML citation world map from your Google Scholar ID.