Project website: https://openvla-oft.github.io/
Paper: https://arxiv.org/abs/2502.19645
Summary video: https://youtu.be/T3Zkkr_NTSA
Inference:
- 1 GPU with ~16 GB VRAM for LIBERO sim benchmark tasks
- 1 GPU with ~18 GB VRAM for ALOHA robot tasks
Training:
- 1-8 GPUs with 27-80 GB of VRAM, depending on the desired training setup (with the default bfloat16 data type). See this FAQ on our project website for details.
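If you want a quick sanity check of your hardware before downloading checkpoints, the optional snippet below (a minimal sketch, not part of the repository's scripts) uses standard PyTorch calls to print the total VRAM of the first visible GPU so you can compare it against the numbers above:

import torch

# Optional sanity check: report the total VRAM of GPU 0 and compare it against the
# requirements listed above (~16-18 GB for inference, 27-80 GB for training).
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024**3
    print(f"GPU 0: {props.name}, {total_gib:.1f} GiB total VRAM")
else:
    print("No CUDA-capable GPU detected.")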
First, set up a conda environment (see instructions in SETUP.md).
Then, run the Python script below to download a pretrained OpenVLA-OFT checkpoint and run inference to generate an action chunk:
import pickle
from experiments.robot.libero.run_libero_eval import GenerateConfig
from experiments.robot.openvla_utils import get_action_head, get_processor, get_proprio_projector, get_vla, get_vla_action
from prismatic.vla.constants import NUM_ACTIONS_CHUNK, PROPRIO_DIM
# Instantiate config (see class GenerateConfig in experiments/robot/libero/run_libero_eval.py for definitions)
cfg = GenerateConfig(
    pretrained_checkpoint="moojink/openvla-7b-oft-finetuned-libero-spatial",
    use_l1_regression=True,
    use_diffusion=False,
    use_film=False,
    num_images_in_input=2,
    use_proprio=True,
    load_in_8bit=False,
    load_in_4bit=False,
    center_crop=True,
    num_open_loop_steps=NUM_ACTIONS_CHUNK,
    unnorm_key="libero_spatial_no_noops",
)
# Load OpenVLA-OFT policy and inputs processor
vla = get_vla(cfg)
processor = get_processor(cfg)
# Load MLP action head to generate continuous actions (via L1 regression)
action_head = get_action_head(cfg, llm_dim=vla.llm_dim)
# Load proprio projector to map proprio to language embedding space
proprio_projector = get_proprio_projector(cfg, llm_dim=vla.llm_dim, proprio_dim=PROPRIO_DIM)
# Load sample observation:
#   observation (dict): {
#     "full_image": primary third-person image,
#     "wrist_image": wrist-mounted camera image,
#     "state": robot proprioceptive state,
#     "task_description": task description,
#   }
with open("experiments/robot/libero/sample_libero_spatial_observation.pkl", "rb") as file:
    observation = pickle.load(file)
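# (Illustrative only) Instead of loading the sample pickle, you can build the observation
# dict from your own sensors; the keys must match the structure shown above. The variable
# names here are placeholders, not repo code:
# observation = {
#     "full_image": my_third_person_image,    # RGB frame from the primary camera
#     "wrist_image": my_wrist_image,          # RGB frame from the wrist camera
#     "state": my_proprio_state,              # proprioceptive state vector
#     "task_description": "put the bowl on the plate",
# }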
# Generate robot action chunk (sequence of future actions)
actions = get_vla_action(cfg, vla, processor, observation, observation["task_description"], action_head, proprio_projector)
print("Generated action chunk:")
for act in actions:
    print(act)
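The script above simply prints each action in the chunk. On a robot or in simulation, the chunk is typically played back open-loop before the policy is queried again. The function below is a minimal sketch of that control loop, assuming a Gym-style environment object with a 4-tuple step() API (not something provided by the snippet above):

# Minimal sketch (not from the repo): execute one predicted action chunk open-loop.
def execute_chunk(env, actions):
    """Step a Gym-style `env` through each action in the chunk; stop early if the episode ends."""
    last_obs = None
    for act in actions:
        last_obs, reward, done, info = env.step(act)  # assumes Gym's 4-tuple step() API
        if done:
            break
    return last_obs

After the chunk is exhausted, you would build a new observation dict from the latest sensor readings and call get_vla_action() again; the LIBERO and ALOHA evaluation scripts referenced below implement this rollout loop for you.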
See SETUP.md for instructions on setting up the conda environment.
See LIBERO.md for fine-tuning/evaluating on LIBERO simulation benchmark task suites.
See ALOHA.md for fine-tuning/evaluating on real-world ALOHA robot tasks.
If you run into any issues, please open a new GitHub issue. If you do not receive a response within 2 business days, please email Moo Jin Kim ([email protected]) to bring the issue to his attention.
If you use our code in your work, please cite our paper:
@article{kim2025fine,
    title={Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success},
    author={Kim, Moo Jin and Finn, Chelsea and Liang, Percy},
    journal={arXiv preprint arXiv:2502.19645},
    year={2025}
}