This repository contains the code for VLAGen, a simulation-based data generation and filtering pipeline designed to autonomously generalize Vision-Language-Action (VLA) models to new objects.
VLAGen addresses the limitations of existing VLA models by generating diverse, high-quality training trajectories all in simulation (by using high temperature during inference). Then, the trajectories are evaluated with a vision-language model (GPT-4V) and filtered by removing low-action time steps to mitigate "catastrophic idling." The generated data is then used to fine-tune or preference-optimize OpenVLA.
This repository is a fork of OpenVLA. This project incorporates code from SimplerEnv-OpenVLA, available at: https://github.com/DelinQu/SimplerEnv-OpenVLA
Licensed under the MIT License (c) 2024 simpler-env.
- Automated Data Generation: Generates robotic manipulation trajectories in SIMPLER/SAPIEN simulation using OpenVLA (model deployed with high temperature settings to generate diverse trajectories).
- GPT-4V Trajectory Scoring: Employs GPT-4V to automatically score trajectories based on their success in completing the task.
(Above)Data pipeline generates and ranks the trajectories for picking a Fanta can (out-of-distribution) with distractors in the background.
- Magnitude-Based Filtering: Filters out low-action trajectories to mitigate catastrophic idling behavior.
- Fine-tuning and Preference Optimization: Supports both fine-tuning and preference optimization (using KTO) of OpenVLA using the generated data. KTO for reference: https://arxiv.org/abs/2402.01306
- SIMPLER Environment Integration: Leverages the SIMPLER benchmark for real-to-sim evaluation and data generation.
- Scalable and Efficient: Provides a scalable and efficient solution for training robotic models without relying on extensive human-collected datasets.
This project uses bash scripts for data generation and evaluation located in the scripts_run_eval
directory. These scripts interact with the simpler_env
environment and the openvla
model. Specific scripts are provided for various tasks and variations (e.g., openvla_drawer_variant_agg.sh
, openvla_move_near_visual_matching.sh
). Refer to the individual script descriptions for detailed usage instructions and parameters.
Data Generation: To generate training data, use the bash scripts passing in --policy-model openvla_generate_data
as an argument. This will leverage OpenVLA to generate trajectories that are then scored by GPT-4V.
Model Fine-tuning: The vla-scripts
directory contains Python scripts for fine-tuning and training OpenVLA models (finetune.py
, finetune_KTO.py
, train.py
). These scripts can be used to fine-tune OpenVLA models using the data generated by the bash scripts.
Model Evaluation: The scripts_run_eval
directory contains bash scripts for evaluating the performance of OpenVLA models on various manipulation tasks. These scripts control the SIMPLER environment, run the OpenVLA policy, and log the results.
- Clone the repository:
git clone <repository_url> cd <repository_name>
- Install ManiSkill2 real-to-sim environments:
cd ManiSkill2_real2sim pip install -e .
- Install OpenVLA requirements: Refer to the
README.md
for complete installation instructions (this might involve installing PyTorch, transformers, and other dependencies). Note that specific version constraints might be necessary to ensure compatibility. - Obtain an OpenAI API key: An OpenAI key is required for GPT-4V scoring.
- Python: The primary programming language for the project.
- PyTorch: Deep learning framework for model training and inference.
- Transformers (Hugging Face): Library for loading and utilizing pre-trained OpenVLA models.
- PEFT (Hugging Face): Library for parameter-efficient fine-tuning, enabling LoRA (Low-Rank Adaptation).
- BitsAndBytes: Enables 4-bit quantization of the OpenVLA model for memory-efficient fine-tuning.
- SIMPLER: A simulation benchmark used for generating and evaluating robotic manipulation policies.
- OpenVLA: A pre-trained vision-language-action model. https://openvla.github.io/
- GPT-4V: A vision-language model used to score the generated trajectories.
- Draccus: A Python library for configuration management and structured data handling.
- Make: Used for streamlining common development tasks.
- Bash: Used to orchestrate the data generation and evaluation processes.
The project's dependencies are specified in pyproject.toml
. Use pip install -e .
to install all required packages from this file.
Contributions are welcome! Please open an issue or submit a pull request.
MIT License. See the LICENSE
file for details.