AdaSeg4MR: Adaptive Speech-Guided Instance Segmentation for Mixed Reality
This project develops a mixed reality (MR) pipeline that integrates real-time instance segmentation with speech-guided natural language interaction. The goal is a more intuitive, immersive experience for users interacting with blended virtual and real-world environments.
- Advanced Instance Segmentation: Leverage state-of-the-art models such as the Segment Anything Model (SAM) and FastSAM for accurate, adaptable object recognition under varying conditions (a FastSAM inference sketch follows this list).
- Natural Language Interaction: Implement a natural language processing (NLP) module that interprets voice commands for object tracking and image segmentation, enabling real-time responses and seamless interaction (a command-parsing sketch follows this list).
- Intuitive User Interface: Design a user-friendly mixed reality interface for visualizing segmented objects and interacting with them using natural language.
- Real-Time Synchronization: Optimize synchronization between the NLP and computer vision models to keep interaction real-time and latency low (a hand-off sketch follows this list).
- Environmental Robustness: Address challenges posed by variable lighting, object occlusion, and rapid object movement in real-time MR environments.
- User-Guided Visual Search: Support user-guided visual searches within an egocentric perception framework for more immersive and intuitive interactions.
- Novel Computer Vision Approaches: Explore innovative real-time instance segmentation implementations, such as visual Retrieval-Augmented Generation (RAG) and multimodal prompting.
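To make the segmentation objective concrete, below is a minimal FastSAM inference sketch using the Ultralytics API. The checkpoint filename, input source, and threshold values are illustrative assumptions, not project-verified settings.

```python
# Minimal FastSAM inference sketch (assumes the Ultralytics package and a
# local "FastSAM-s.pt" checkpoint; weights path and thresholds are
# illustrative placeholders).
from ultralytics import FastSAM

model = FastSAM("FastSAM-s.pt")  # hypothetical checkpoint path

# Run "segment everything" on a single frame; retina_masks keeps mask
# resolution close to the input image, which matters for MR overlays.
results = model(
    "frame.jpg",          # could also be a numpy array from the MR camera
    device="cuda:0",      # e.g. one of the Tesla T4s in this project
    retina_masks=True,
    imgsz=1024,
    conf=0.4,
    iou=0.9,
)

for result in results:
    if result.masks is not None:
        print(f"{len(result.masks)} instance masks in this frame")
```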
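For the natural language interaction objective, the sketch below shows one way the NLP module might map an utterance to a structured segmentation request. The command schema, verb list, and parsing rules are hypothetical; the project could equally back this step with LangChain or an LLM.

```python
# Illustrative command schema: parse an utterance such as "track the red cup"
# into a structured request the vision side can act on. The verbs, schema,
# and keyword rules here are hypothetical placeholders.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SegmentationCommand:
    action: str   # e.g. "segment", "track", "highlight"
    target: str   # free-text object description, e.g. "red cup"

KNOWN_ACTIONS = ("segment", "track", "highlight")

def parse_command(utterance: str) -> Optional[SegmentationCommand]:
    """Small rule-based parser: first known action verb wins, and the
    remainder of the utterance becomes the target description."""
    words = utterance.lower().strip().split()
    for i, word in enumerate(words):
        if word in KNOWN_ACTIONS:
            target = " ".join(words[i + 1:]).removeprefix("the ").strip()
            return SegmentationCommand(action=word, target=target)
    return None  # let an LLM-based fallback handle unmatched utterances

print(parse_command("please track the red cup"))
# SegmentationCommand(action='track', target='red cup')
```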
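For the synchronization objective, one common low-latency pattern is sketched below: an ASR/NLP thread produces commands while the vision thread consumes the newest camera frame, dropping stale frames so segmentation stays aligned with "now". The thread layout and queue sizes are assumptions, not the project's confirmed design.

```python
# Sketch of the NLP/vision hand-off: one ASR/NLP thread producing commands,
# one vision thread consuming them alongside camera frames. A size-1 frame
# queue drops stale frames to minimize end-to-end latency.
import queue
import threading

command_q: "queue.Queue[str]" = queue.Queue()
frame_q: "queue.Queue[object]" = queue.Queue(maxsize=1)  # latest frame only

def publish_frame(frame) -> None:
    """Camera callback: replace any stale frame instead of queueing it."""
    try:
        frame_q.put_nowait(frame)
    except queue.Full:
        try:
            frame_q.get_nowait()  # discard the stale frame
        except queue.Empty:
            pass
        frame_q.put_nowait(frame)

def vision_loop() -> None:
    active_target = None
    while True:
        frame = frame_q.get()          # newest frame available
        while not command_q.empty():   # apply any pending voice commands
            active_target = command_q.get_nowait()
        # run_segmentation(frame, active_target)  # placeholder model call

threading.Thread(target=vision_loop, daemon=True).start()
```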
Due to limited computing resources, the project employs a distributed computing strategy. This plan may be adjusted if additional resources are secured.
- Segmentation Models (YOLOv11-Seg, GroundedSAM, POSE): Deployed on Tesla T4 GPUs for high-performance image segmentation. Because these models share hardware, careful management of thread switching and inter-process communication is crucial (an IPC sketch follows this list).
- Unity Rendering and Local Inference: Handled by an RTX 3060 GPU.
- LangChain and RAG tasks: Processed by an Intel i7 CPU.
- Concurrency: Thread switching is employed to schedule concurrent tasks efficiently across the shared CPU and GPUs.
- Vector Database: Hosted on Pinecone for scalability and efficiency (a usage sketch follows this list).
- Visual Language Model, Toolchain Agent, Automatic Speech Recognition (ASR), and Text-to-Speech (TTS): Powered by Groq LPUs for high-throughput, low-latency language processing. These services operate independently, minimizing thread switching (a speech-path sketch follows this list).
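The document does not specify the transport between the T4 segmentation workers and the Unity/RTX 3060 host, so the request-reply pair below uses ZeroMQ purely as one plausible illustration. The endpoint, port, and JSON message format are all assumptions.

```python
# Hypothetical IPC sketch between a T4 segmentation worker and the Unity
# host. ZeroMQ is a stand-in choice; endpoint and message shape are assumed.
import json
import zmq

def segmentation_server(endpoint: str = "tcp://*:5555") -> None:
    """Runs near the T4 GPUs: receives a frame id, replies with masks."""
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind(endpoint)
    while True:
        request = json.loads(sock.recv().decode())
        # masks = run_segmentation(request["frame_id"])  # placeholder
        sock.send(json.dumps({"frame_id": request["frame_id"],
                              "masks": []}).encode())

def unity_side_request(frame_id: int,
                       endpoint: str = "tcp://t4-host:5555") -> dict:
    """Runs on the Unity/RTX 3060 host: asks a T4 worker for masks."""
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REQ)
    sock.connect(endpoint)
    sock.send(json.dumps({"frame_id": frame_id}).encode())
    return json.loads(sock.recv().decode())
```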
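For the Pinecone-backed visual-RAG side, the sketch below stores embeddings of segmented objects and retrieves the closest matches for a spoken query. The index name, embedding dimension, and vector values are assumptions for illustration; only the client calls follow the Pinecone v3+ Python SDK.

```python
# Minimal Pinecone sketch: store one embedding per segmented object and
# query the nearest matches. Index name and 512-d vectors are assumptions.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")   # supply via env var in practice
index = pc.Index("adaseg4mr-objects")   # hypothetical index name

# Upsert one embedding per segmented object instance.
index.upsert(vectors=[
    {"id": "obj-42", "values": [0.1] * 512, "metadata": {"label": "red cup"}},
])

# Retrieve the nearest stored objects for a query embedding
# (e.g. a text embedding of the user's spoken description).
res = index.query(vector=[0.1] * 512, top_k=3, include_metadata=True)
for m in res.matches:
    print(m.id, m.score, m.metadata)
```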
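Finally, a sketch of the Groq-hosted speech path using the official `groq` Python SDK: transcribe an utterance, then ask an LLM to extract a structured command. The model names (`whisper-large-v3`, `llama-3.1-8b-instant`) are assumptions; substitute whatever the deployment actually provisions.

```python
# Speech path on Groq: ASR followed by command interpretation.
# Model names below are assumed, not confirmed by the project.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

# 1) Automatic speech recognition on a recorded utterance.
with open("utterance.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        file=audio,
        model="whisper-large-v3",
    )

# 2) Turn the transcript into a structured segmentation command.
completion = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[
        {"role": "system",
         "content": "Extract the action and target object from the command."},
        {"role": "user", "content": transcript.text},
    ],
)
print(completion.choices[0].message.content)
```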
- Instance Segmentation Models: Segment Anything Model (SAM), FastSAM, YOLOv11-Seg, GroundedSAM, POSE
- Natural Language Processing: LangChain, RAG
- Vector Database: Pinecone
- Hardware: Tesla T4 GPUs, RTX 3060 GPU, Intel i7 CPU, Groq LPUs
- Mixed Reality Platform: Unity