The goal of localization is to achieve "situation understanding" as depicted above. In essence, the task is to predict the agent's current location given a 3D scene context and a situation description.
Currently, the location is decomposed into two parts: rotation (represented as a quaternion in xyzw order) and position (represented as a translation in xyz, in meters). Starting from the origin, the agent is transformed as follows (see the sketch after this list):
- Rotate the agent as described by `rotation`.
- Translate the agent as described by `position`.
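As a concrete illustration, here is a minimal sketch of this rotate-then-translate convention (assuming `numpy` and `scipy` are available; the `agent_to_scene` helper is made up for illustration and is not part of the dataset tooling):

```python
# Minimal sketch of the pose convention: rotate first, then translate.
# scipy's Rotation.from_quat expects scalar-last (x, y, z, w), matching the xyzw convention above.
import numpy as np
from scipy.spatial.transform import Rotation as R

def agent_to_scene(rotation_xyzw, position_xyz, points_in_agent_frame):
    """Map points from the agent's local frame into the scene frame."""
    rot = R.from_quat(rotation_xyzw)
    return rot.apply(points_in_agent_frame) + np.asarray(position_xyz)

# Example with the ground-truth pose from the annotation shown below:
origin_in_scene = agent_to_scene(
    rotation_xyzw=[0.0, 0.0, 0.0998, 0.9950],
    position_xyz=[-0.9651, -1.2418, 0.0],
    points_in_agent_frame=np.array([[0.0, 0.0, 0.0]]),
)
print(origin_in_scene)  # the agent's own origin lands exactly at `position`
```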
The data for the SQA3D localization task can be downloaded here. The format of the data is:
- Run the following code:

  ```python
  import json

  a = json.load(open("v1_balanced_localization_train_scannetv2.json", "r"))
  print("#annotations: ", len(a["annotations"]))
  print(a["annotations"][0])
  ```
- The expected output is

  ```json
  {
      "scene_id": "scene0380_00",
      "situation": "I am facing a window and there is a desk on my right and a chair behind me.",
      "alternative_situation": [
          "I stand looking out of the window in thought and a radiator is right in front of me.",
          "I am looking outside through the window behind the desk."
      ],
      "position": {
          "x": -0.9651003385573296,
          "y": -1.2417634435553606,
          "z": 0
      },
      "rotation": {
          "_w": 0.9950041652780182,
          "_x": 0,
          "_y": 0,
          "_z": 0.09983341664682724
      }
  }
  ```
- Explanation
  - `scene_id`: the ScanNetv2 scene id of the current localization problem.
  - `situation`: the situation description.
  - `alternative_situation`: alternative situation descriptions; they might be ambiguous, so use them at your own risk.
  - `position`: ground truth position, in meters.
  - `rotation`: ground truth rotation, represented as a quaternion (see the parsing sketch after this list).
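For reference, the sketch below (the `annotation_pose` helper name is made up, not part of the official codebase) reads these fields from one annotation and converts the quaternion into a heading angle about the z-axis, which is the component the evaluation below uses; for a pure z-axis rotation, yaw = 2 · atan2(q_z, q_w).

```python
# A minimal sketch (assumed helper, not part of the official codebase) that reads
# one annotation and converts its quaternion into a z-axis heading angle.
import math

def annotation_pose(ann):
    pos, rot = ann["position"], ann["rotation"]
    x, y = pos["x"], pos["y"]                      # x-y position in meters
    yaw = 2.0 * math.atan2(rot["_z"], rot["_w"])   # heading in radians (pure z-axis rotation)
    return x, y, yaw

# With the example annotation above:
# annotation_pose(a["annotations"][0]) -> (-0.965..., -1.242..., ~0.2 rad, i.e. ~11.5°)
```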
Please note that due to the coordinate system used by the annotations, we strongly suggest using the original 3D point cloud as the 3D scene context input. You may also consider egocentric videos, but please consider adding per-frame camera poses as input, and the coordinate system used by the cameras should be aligned with the point cloud.
We evaluate the predictions of `position` and `rotation` separately. Given a ground truth `<situation description, position, rotation>` and a list of predictions `<[position_1, position_2, ...], [rotation_1, rotation_2, ...]>`, we first find the predicted position and orientation closest to the ground truth, then compute the accuracy based on the threshold. Specifically, we have the following metrics:
- [email protected]: A predicted position is accepted if it is within 0.5 meters of the ground truth. Note that only the projected distance on the x-y plane is considered.
- [email protected]: Similar to [email protected], but with a threshold of 1.0 meter instead.
- Acc@15°: A predicted rotation is accepted if it is within 15 degrees of the ground truth. Note that only the rotation about the z-axis is considered.
- Acc@30°: Similar to Acc@15°, but with a threshold of 30 degrees instead.
Please refer to the evaluation script for more details.
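For illustration only, here is a minimal sketch (assumed function names, not the official evaluation script) that mirrors the metric definitions above: position error is measured on the x-y plane, rotation error as a heading difference about the z-axis, and a sample is accepted if any of its predictions falls within the threshold.

```python
# Sketch of the localization metrics described above (not the official script).
import math

def heading(q):
    """Yaw angle (radians) of a quaternion given as (x, y, z, w), assuming a pure z-axis rotation."""
    x, y, z, w = q
    return 2.0 * math.atan2(z, w)

def angle_diff(a, b):
    """Absolute angular difference in degrees, wrapped to [0, 180]."""
    d = abs(math.degrees(a - b)) % 360.0
    return min(d, 360.0 - d)

def position_correct(gt_pos, pred_positions, threshold_m):
    """True if any predicted position is within threshold_m of the ground truth (x-y plane only)."""
    gx, gy = gt_pos[0], gt_pos[1]
    return any(math.hypot(px - gx, py - gy) <= threshold_m for px, py, *_ in pred_positions)

def rotation_correct(gt_quat, pred_quats, threshold_deg):
    """True if any predicted rotation is within threshold_deg of the ground truth (z-axis only)."""
    gt_yaw = heading(gt_quat)
    return any(angle_diff(heading(q), gt_yaw) <= threshold_deg for q in pred_quats)

# Example with the ground truth from the annotation above and two hypothetical predictions:
gt_position = (-0.9651, -1.2418, 0.0)
gt_rotation = (0.0, 0.0, 0.0998, 0.9950)                      # (x, y, z, w)
pred_positions = [(-0.7, -1.1, 0.0), (2.0, 3.0, 0.0)]
pred_rotations = [(0.0, 0.0, 0.17, 0.985), (0.0, 0.0, 0.7, 0.7)]
print(position_correct(gt_position, pred_positions, 0.5))     # [email protected] -> True
print(rotation_correct(gt_rotation, pred_rotations, 15.0))    # Acc@15° -> True
```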