HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
Siddhant Bansal, Michael Wray, and Dima Damen
Given an image from an egocentric video, the goal is to refer to the hands and the objects they are interacting with. For example, here we wish to refer to the left and right hands along with the two objects (jar and lid) that the hands are interacting with.
1. Prepare the code and the environment
Clone our repository, create a Python environment, and activate it with the following commands:
git clone https://github.com/Sid2697/HOI-Ref
cd HOI-Ref
conda env create -f environment.yml
conda activate hoiref
2. Prepare the pretrained LLM weights
VLM4HOI is based on Llama 2 Chat 7B. Download the corresponding LLM weights from the following Hugging Face space by cloning the repository using git-lfs.
Llama 2 Chat 7B |
---|
Download |
Then, set the variable llama_model in the model config file to the LLM weight path.
- Set the LLM path here at Line 14.
3. Prepare the pre-trained VLM4HOI checkpoints
Download the pre-trained VLM4HOI checkpoints from this dropbox link.
Set the path to the pre-trained checkpoint in the evaluation config file eval_configs/vlm4hoi_benchmark_evaluation.yaml at Line 9.
Run
python demo.py --cfg-path eval_configs/vlm4hoi_benchmark_evaluation.yaml --gpu-id 0
To save GPU memory, the LLM is loaded in 8 bit by default, with a beam search width of 1.
This configuration requires about 11.5 GB of GPU memory for the 7B LLM.
For more powerful GPUs, you can run the model in 16 bit by setting low_resource to False in vlm4hoi_benchmark_evaluation.yaml.
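To get a rough feel for the 8-bit versus 16-bit trade-off, the following minimal sketch estimates the memory taken by the LLM weights alone. The function name and the nominal count of 7e9 parameters are assumptions made for illustration. The result is only a lower bound: the rest of the pipeline also needs GPU memory, so actual usage is higher than this estimate.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / (1024 ** 3)


if __name__ == "__main__":
    # Nominal parameter count of a "7B" model (an assumption, not an exact figure).
    n_params = 7e9
    for bits in (8, 16):
        gib = weight_memory_gib(n_params, bits)
        print(f"{bits}-bit weights: ~{gib:.1f} GiB")
```

Moving from 8 bit to 16 bit doubles the weight footprint, which is why the 16-bit setting is suggested only for GPUs with more memory.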
Before going ahead, make sure you have downloaded the HOI-QA dataset and extracted all the required frames. Refer to this HOI-QA README for downloading and preparing the dataset.
In train_configs/vlm4hoi_finetune.yaml, set the following paths:
llama_model checkpoint path here: "/path/to/llama_checkpoint"
ckpt here: "/path/to/pretrained_checkpoint"
output_dir here: "/path/to/output/directory"
For ckpt, you may load from our pre-trained model checkpoints downloaded earlier.
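If you prefer to set these paths from a script rather than by hand, a minimal sketch follows. It assumes PyYAML is installed and that the keys llama_model, ckpt, and output_dir already exist somewhere in the config; the helper names set_key_everywhere and update_config are hypothetical and not part of this repository. Because the nesting of the config is not assumed, the helper searches for each key wherever it occurs.

```python
from typing import Any


def set_key_everywhere(node: Any, key: str, value: Any) -> int:
    """Set `key` to `value` wherever it already appears in nested dicts/lists.

    Returns the number of occurrences that were updated.
    """
    updated = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value
                updated += 1
            else:
                updated += set_key_everywhere(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            updated += set_key_everywhere(item, key, value)
    return updated


def update_config(cfg_path: str, overrides: dict) -> None:
    """Rewrite a YAML config in place with the given key overrides."""
    import yaml  # PyYAML; assumed to be available in the environment

    with open(cfg_path) as f:
        cfg = yaml.safe_load(f)
    for key, value in overrides.items():
        if set_key_everywhere(cfg, key, value) == 0:
            raise KeyError(f"{key!r} not found in {cfg_path}")
    with open(cfg_path, "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)


# Example call (placeholder paths, to be replaced with your own):
# update_config(
#     "train_configs/vlm4hoi_finetune.yaml",
#     {
#         "llama_model": "/path/to/llama_checkpoint",
#         "ckpt": "/path/to/pretrained_checkpoint",
#         "output_dir": "/path/to/output/directory",
#     },
# )
```

Note that rewriting the file through PyYAML drops any comments in the original config, so editing by hand may be preferable if you want to keep them.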
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/vlm4hoi_finetune.yaml
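In the command above, NUM_GPU is a placeholder for the number of GPUs to train on. As a small sketch, the command can also be assembled from Python; the helper name build_train_command is hypothetical, and the arguments simply mirror the command shown above.

```python
import shlex


def build_train_command(
    num_gpus: int,
    cfg_path: str = "train_configs/vlm4hoi_finetune.yaml",
) -> list:
    """Return the torchrun invocation as an argument list."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "torchrun",
        "--nproc-per-node", str(num_gpus),
        "train.py",
        "--cfg-path", cfg_path,
    ]


if __name__ == "__main__":
    # Print the command for a two-GPU run so it can be inspected or copied.
    print(shlex.join(build_train_command(2)))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root. If PyTorch is available, torch.cuda.device_count() is one way to obtain the number of visible GPUs.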
To evaluate VLM4HOI on the HOI-QA dataset, run the following command:
python -m eval_scripts.eval_hoiqa --cfg-path eval_configs/vlm4hoi_benchmark_evaluation.yaml --pred_json /path/to/save/the/predictions.json
Once this script finishes, you will have all the predictions saved to /path/to/save/the/predictions.json. Run the following script to get the final numbers (as reported in the paper):
python -m eval_scripts.evaluate --pred_json /path/to/save/the/predictions.json --hoi_pred_json /path/to/save/the/predictions_hoi.json
Running this script will print all the numbers as reported in the paper.
This repository is built upon MiniGPT-v2!
If you're using VLM4HOI or the HOI-QA dataset in your research or applications, please cite the paper using this BibTeX:
@article{bansal2024hoiref,
title={HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision},
author={Bansal, Siddhant and Wray, Michael and Damen, Dima},
journal={arXiv preprint arXiv:2404.09933},
year={2024}
}
This repository is under the BSD 3-Clause License. Much of the code is based on MiniGPT-v2 with BSD 3-Clause License here, which is in turn based on Lavis with BSD 3-Clause License here.