

Embodied MultiModal Agent (EMMA)

Hello, and thanks for your interest in our work! We are the EMMA team from Heriot-Watt University, one of the finalists in the Amazon Alexa Prize SimBot Challenge, a university competition organised by Amazon to advance the state of the art in conversational Embodied AI. We created this GitHub organisation to gather all the code and artefacts that we are releasing to the public as part of our effort to create the first vision-and-language (V+L) model for Embodied AI tasks.

Organisation Structure

How do all the repositories fit together?

There are multiple repos, which can make things confusing at first, so here is how all the pieces connect at a high level.

  • Datasets efficiently merges all annotations from all the datasets without duplicating images.
  • Perception extracts visual features from every single image.
  • Policy takes each instance and the visual features, creates the pretraining and fine-tuning datasets, and trains the EMMA models. It also evaluates checkpoints on the image-based tasks.
  • Experience Hub brings all the models together for inference: taking a request and an observation, and predicting the next action for the environment.
  • Arena Evaluation sends observations and instructions from the Alexa Arena to the Experience Hub, and returns the predicted actions for execution in the environment.

Here are some more details on how it all works.

We have multiple image-based datasets, where every example from the raw data can be turned into a single instance consisting of an image and the annotations in its metadata. Since we mix every dataset together for training, all the data must share a single structure. Additionally, because each dataset contributes different tasks, many datasets introduce new tasks and annotations on the same images as other datasets: for example, COCO provides image captions, while Visual Genome provides scene graphs as well as region descriptions for many of the same images.
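
To make that overlap concrete, here is a minimal sketch of per-image merging; the field names are made up for illustration, not the actual schema used by Datasets:

```python
from collections import defaultdict


def merge_annotations(examples: list[dict]) -> dict[str, dict]:
    """Group annotations from every dataset by image, so that e.g. a COCO
    caption and a Visual Genome scene graph for the same image end up in
    one merged instance."""
    merged: dict[str, dict] = defaultdict(dict)
    for example in examples:
        image_id = example["image_id"]  # hypothetical key shared across datasets
        merged[image_id].update(
            {key: value for key, value in example.items() if key != "image_id"}
        )
    return dict(merged)
```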

The purpose of Datasets is to convert every dataset from its raw released form into a single structure, to merge multiple datasets together while removing duplicates, to store all this information in a lightweight format that allows quick querying, and to do all of that as fast as possible. We merge all annotations from multiple datasets without duplicating images, validate the fields using Pydantic, and store everything in a deliberately simple SQLite database called a "DatasetDb" (or "Db"). Importantly, Datasets only handles the metadata for a given image: everything except the actual image, while keeping a reference to the original image.
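
As an illustration of that approach, here is a minimal sketch using Pydantic (v2) and SQLite; the model fields and table layout are assumptions for the example, not the actual DatasetDb schema:

```python
import sqlite3
from typing import Optional

from pydantic import BaseModel


class DatasetInstance(BaseModel):
    """Metadata for one image: merged annotations plus a reference to the image."""

    image_path: str                      # reference only; pixels are not stored
    caption: Optional[str] = None        # e.g. from COCO
    region_descriptions: list[str] = []  # e.g. from Visual Genome


def write_db(instances: list[DatasetInstance], db_path: str = "instances.db") -> None:
    """Validate instances with Pydantic and store them as JSON rows in SQLite."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS dataset (id INTEGER PRIMARY KEY, data TEXT)"
        )
        conn.executemany(
            "INSERT INTO dataset (data) VALUES (?)",
            [(instance.model_dump_json(),) for instance in instances],
        )
```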

The other part of providing input to the model involves extracting visual features for every image. As mentioned in the paper, we use a feature extractor that is trained separately, so to train the model efficiently we extract the visual features for every image up front. Perception does exactly that: it extracts the visual features with VinVL for every image in every dataset.
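
In spirit, the precomputation looks something like the sketch below, where `extractor` stands in for the VinVL-based model; the actual Perception API will differ:

```python
from pathlib import Path

import torch


def cache_features(image_paths: list[Path], extractor, out_dir: Path) -> None:
    """Run the separately trained feature extractor once per image and cache
    the output, so the vision backbone never has to run during training."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for image_path in image_paths:
        features = extractor(image_path)  # e.g. region features + bounding boxes
        torch.save(features, out_dir / f"{image_path.stem}.pt")
```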

Policy holds the main EMMA model. Using the DatasetDbs and the extracted features, we build the pretraining and fine-tuning datasets and train the models to perform the various tasks. Policy also contains the evaluation for each of the image-based tasks.
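
Continuing the sketches above, training data can then pair each Db instance with its cached features, for example via a PyTorch `Dataset`; again, the names and schema here are assumptions, not the actual Policy code:

```python
import json
import sqlite3
from pathlib import Path

import torch
from torch.utils.data import Dataset


class EmmaTaskDataset(Dataset):
    """Pairs each stored instance with its precomputed visual features."""

    def __init__(self, db_path: str, features_dir: Path) -> None:
        self._conn = sqlite3.connect(db_path)
        self._features_dir = features_dir
        self._ids = [row[0] for row in self._conn.execute("SELECT id FROM dataset")]

    def __len__(self) -> int:
        return len(self._ids)

    def __getitem__(self, index: int):
        (data,) = self._conn.execute(
            "SELECT data FROM dataset WHERE id = ?", (self._ids[index],)
        ).fetchone()
        instance = json.loads(data)
        stem = Path(instance["image_path"]).stem
        features = torch.load(self._features_dir / f"{stem}.pt")
        return instance, features
```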

The Experience Hub contains all logic for integrating the models together for inference: taking in a request from a user, processing the observation with Perception, predicting the next action with Policy, and returning the action to the environment.
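
Conceptually, one inference turn reduces to something like the following; the function and attribute names are placeholders rather than the Experience Hub's actual interface:

```python
def predict_next_action(instruction: str, observation, perception, policy) -> str:
    """One turn: featurise the current observation with Perception, then let
    Policy predict the next environment action conditioned on the instruction."""
    visual_features = perception.extract(observation)
    return policy.generate(instruction=instruction, visual_features=visual_features)
```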

Arena Evaluation is how we send requests and observations from the Alexa Arena to the Experience Hub, and how we return the predicted actions to the Alexa Arena for execution in the environment.
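
A rough sketch of that loop is below; the endpoint, payload, and `arena` interface are assumptions for illustration, not the actual Arena Evaluation API:

```python
import requests


def run_episode(arena, hub_url: str = "http://localhost:8000") -> None:
    """Relay each Arena observation and instruction to the Experience Hub,
    then execute the predicted action back in the Arena."""
    observation, instruction, done = arena.reset()
    while not done:
        response = requests.post(
            f"{hub_url}/predict",  # hypothetical endpoint
            json={"instruction": instruction, "observation": observation},
        )
        action = response.json()["action"]
        observation, instruction, done = arena.step(action)
```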

Download stuff

We provide the pretrained/fine-tuned checkpoints, dataset features, and DatasetDbs required to reproduce our setup. All of the material is available on Hugging Face! 🤗

Checkpoints

The checkpoints can be accessed from this link: https://huggingface.co/emma-heriot-watt/emma_models

Dataset Features

The features for pretraining and downstream evaluation are available here: https://huggingface.co/emma-heriot-watt/emma_features

Dataset Dbs

The DatasetDbs for pretraining and downstream evaluation are available here: https://huggingface.co/emma-heriot-watt/emma_datasets
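
If you want to fetch these programmatically, one option is the huggingface_hub library. The repo IDs below come from the links above; the rest is standard huggingface_hub usage:

```python
from huggingface_hub import snapshot_download

# Download the full contents of each EMMA repository from the Hugging Face Hub.
checkpoints_dir = snapshot_download(repo_id="emma-heriot-watt/emma_models")
features_dir = snapshot_download(repo_id="emma-heriot-watt/emma_features")
dbs_dir = snapshot_download(repo_id="emma-heriot-watt/emma_datasets")
```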

Results

Downstream Image-Based Tasks

| Model    | COCO Captioning (CIDEr) | VQAv2 | RefCOCOg | NLVR2 |
|----------|-------------------------|-------|----------|-------|
| VL-T5    | 116.5                   | 70.3  | 71.3     | 73.6  |
| VL-BART  | 116.6                   | 71.3  | 22.4     | 70.3  |
| UniTAB   | 119.8                   | 71.0  | 84.5     | –     |
| OFA-base | 138.2                   | 78.1  | 82.3     | –     |
| EMMA     | 122.3                   | 73.2  | 80.3     | 70.3  |

Dialog-Guided Task Completion

| Model        | MSR (↑) | NRA (↓) | QA |
|--------------|---------|---------|----|
| Leaderboard  |         |         |    |
| GauchoAI     | 36.47   | –       | –  |
| SEAGULL      | 30.98   | –       | –  |
| Kingfisher   | 22.37   | –       | –  |
| Baseline     |         |         |    |
| NS           | 19.32   | 11.73   |    |
| NS           | 22.80   | 12.73   | ✓  |
| VL           | 18.19   | 11.82   |    |
| VL           | 34.20   | 18.82   | ✓  |
| EMMA         |         |         |    |
| EMMA-modular | 33.76   | 18.91   |    |
| EMMA-modular | 33.95   | 19.05   | CR |
| EMMA-modular | 35.16   | 18.92   | ✓  |
| EMMA-unified | 33.26   | 18.79   |    |
| EMMA-unified | 33.59   | 18.89   | CR |
| EMMA-unified | 36.81   | 18.69   | ✓  |
