UMBRAE aligns brain activations with image features for zero-shot captioning and grounding. We further explore the impact of different image feature spaces on fine-grained multimodal decoding, drawing inspiration from prior MLLM research that uses diverse features. We also introduce a metric to effectively evaluate detailed open-vocabulary brain-based descriptions.
- [2025/07/10] Code and pretrained models are released.
- [2025/06/26] VINDEX is accepted to ICCV 2025 in Hawaii.
- [2025/05/21] Both the project page and the arXiv paper are available.
We investigate four types of image feature spaces:
- (a) Single Encoder (SE): uses features from a single pre-trained vision encoder (e.g., CLIP) for brain alignment, as commonly done in MLLMs.
- (b) Mixture of Encoders (ME): integrates features from multiple task-specific vision experts, such as CLIP and DINO.
- (c) Aggregated Feature (AF): combines dense features from different layers (shallow, middle, and deep) to capture complementary image characteristics.
- (d) Nested Features (NF): uses a variable-length set of visual tokens, downscaled with multiple downsampling factors to produce a hierarchical, nested representation that encodes visual content from coarse to fine-grained detail.
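As a rough illustration of how these four feature spaces differ in shape, here is a minimal PyTorch sketch using random tensors in place of real encoder outputs. The token counts and dimensions (576 tokens of size 1024 for CLIP/DINO, three 1152-dimensional SigLIP layer groups giving feat_dim 3456, and the 1/9/36/144/576 nested scales) are assumptions based on the configurations used later in this README, not the exact implementation.
# Illustrative sketch only: random tensors stand in for real encoder outputs.
import torch
import torch.nn.functional as F

B = 2                                    # batch size
clip_feat = torch.randn(B, 576, 1024)    # (a) SE: tokens from a single encoder (e.g., CLIP, 24x24 grid)
dino_feat = torch.randn(B, 576, 1024)    # a second task-specific expert (e.g., DINO)

# (b) ME: combine features from multiple vision experts (concatenation shown here).
me_target = torch.cat([clip_feat, dino_feat], dim=-1)            # (B, 576, 2048)

# (c) AF: aggregate dense features from shallow/middle/deep layer groups.
layer_feats = [torch.randn(B, 576, 1152) for _ in range(3)]      # e.g., SigLIP layer groups
af_target = torch.cat(layer_feats, dim=-1)                       # (B, 576, 3456), cf. feat_dim=3456 below

# (d) NF: pool the token grid at several scales, from coarse to fine.
grid = clip_feat.transpose(1, 2).reshape(B, 1024, 24, 24)
nf_targets = {s * s: F.adaptive_avg_pool2d(grid, s).flatten(2).transpose(1, 2)
              for s in (1, 3, 6, 12, 24)}                        # 1 / 9 / 36 / 144 / 576 tokens
for num_tokens, feat in nf_targets.items():
    print(num_tokens, tuple(feat.shape))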
VINDEX uses the same environment as UMBRAE. If you have already set it up, use it directly; otherwise, install it with the following commands:
conda create -n vindex python=3.10
conda activate vindex
pip install -r requirements.txt
The training and inference scripts automatically download the dataset if the designated path is empty, but this process can be quite slow. If so, you can use the following script to download all data in advance. Please fill out the NSD Data Access form and agree to the Terms and Conditions first.
bash download_data.sh
Download checkpoints from Hugging Face:
bash download_checkpoint.sh
src
├── llava // Single Encoder (SE)
│ ├── inference_nsd_image.py // image inference
│ ├── inference.py // brain inference
│ └── model_weights // model weights
├── mmvp // Mixture of Encoders (ME)
│ ├── inference_nsd_image.py
│ ├── inference.py
│ └── model_weights
├── denseconnector // Aggregated Feature (AF)
│ ├── inference_nsd_image.py
│ ├── inference.py
│ └── model_weights
├── matryoshka-mm // Nested Features (NF)
│ ├── inference_nsd_image.py
│ ├── inference.py
│ └── model_weights
├── model
│ ├── mm_utils.py
│ ├── model.py
│ ├── multimodal_projector
│ ├── perceiver.py
│ └── utils.py
├── shikra
│ ├── inference.py
│ └── model_weights
├── train_brainx.py // cross-subject training and adaptation
└── train.py // single-subject training
Inference requires access to the backend MLLM. Please ensure that configuration parameters such as model_name, hidden_size, and mm_projector_type match the chosen MLLM; if they do, inference should run without issues.
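For example, a small sanity check along these lines can catch mismatched flags early. The name-to-size table below is inferred from the example commands in this README (7B models use hidden_size 4096, 13B models use 5120); the helper itself is hypothetical rather than part of the codebase.
# Hypothetical helper (not part of the repo): fail fast on mismatched MLLM flags.
EXPECTED_HIDDEN_SIZE = {
    'llava-v1.5-7b': 4096, 'llava-v1.5-13b': 5120,
    'llava-next-7b': 4096, 'llava-next-13b': 5120,
    'dc-llava-v1.5-7b': 4096, 'dc-llava-v1.5-13b': 5120,
    'm3-llava-v1.5-7b': 4096, 'm3-llava-next-7b': 4096,
}

def check_mllm_args(model_name: str, hidden_size: int) -> None:
    expected = EXPECTED_HIDDEN_SIZE.get(model_name)
    if expected is None:
        print(f'[warn] no reference entry for {model_name}; skipping check')
    elif hidden_size != expected:
        raise ValueError(f'{model_name} expects hidden_size={expected}, got {hidden_size}')

check_mllm_args('llava-v1.5-7b', 4096)   # passes silently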
For SE,
ckpt_name='clip_224' model_name='llava-v1.5-7b' # 'llava-v1.5-13b', 'llava-next-7b', 'llava-next-13b'
hidden_size=4096 # 5120
python src/llava/inference.py --fmri_encoder 'brainxs' --subj 1 \
--data_path 'nsd_data' --prompt 'Describe this image in detail.' \
--img_size 224 --feat_dim 1024 --hidden_size ${hidden_size} --mm_projector_type 'mlp2x_gelu' \
--mllm_path "src/llava/model_weights/${model_name}" \
--brainx_path "train_logs/specific_sub/sub01_dim1024/${ckpt_name}/last.pth" \
--save_path "train_logs/evaluation/specific_sub/sub01_${ckpt_name}_dim1024_${model_name}"
For ME,
python src/mmvp/inference.py --fmri_encoder 'brainxs' --subj 1 \
--data_path 'nsd_data' --prompt 'Describe this image in detail.' \
--brainx_path 'train_logs/specific_sub/sub01_dim1024' \
--img_size 224 --feat_dim 1024 \
--save_path 'train_logs/evaluation/specific_sub/sub01_dim1024/sub01_clip_dino224_dim1024_mmvp'
For AF,
vision_tower='siglip' img_size=384 feat_dim=3456 mm_hidden_size=3456 ckpt_name='siglip_384/epoch500'
mllm_name='dc-llava-v1.5-7b' hidden_size=4096 # mllm_name='dc-llava-v1.5-13b' hidden_size=5120
python src/denseconnector/inference.py --subj 1 --fmri_encoder 'brainxs' --img_size $img_size \
--data_path 'nsd_data' --feat_dim $feat_dim --hidden_size $hidden_size \
--prompt 'Describe this image in detail.' --mm_hidden_size=$mm_hidden_size \
--mllm_path "/home/wx258/project/vindex/src/denseconnector/model_weights/${mllm_name}" \
--brainx_path "train_logs/specific_sub/sub01_dim${feat_dim}/${ckpt_name}/last.pth" \
--save_path "train_logs/evaluation/specific_sub/sub01_${vision_tower}_${img_size}_dim${feat_dim}_${mllm_name}"
For NF,
# mllm_name: [m3-llava-v1.5-7b, m3-llava-next-7b, llava-next-vicuna-7b-m3, llava-v1.5-vicuna-7b-m3]
# matryoshka_vis_token_scale: [1, 9, 36, 144, 576]
img_size=224 feat_dim=1024 hidden_size=4096 matryoshka_vis_token_scale=1
ckpt_name='clip_224' model_name='m3-llava-next-7b'
python src/matryoshka-mm/inference.py --subj 1 --img_size $img_size --feat_dim $feat_dim \
--data_path 'nsd_data' --hidden_size $hidden_size --fmri_encoder 'brainxs' \
--prompt 'Describe this image in detail.' \
--brainx_path "train_logs/specific_sub/sub01_dim1024/${ckpt_name}/last.pth" \
--mllm_path "src/matryoshka-mm/model_weights/${model_name}" --matryoshka_vis_token_scale=$matryoshka_vis_token_scale \
--save_path "train_logs/evaluation/specific_sub/sub01_${ckpt_name}_dim${feat_dim}_${model_name}_ts${matryoshka_vis_token_scale}"
Note: <image> should be included in the prompt when using Shikra, but must be excluded when using LLaVA variants.
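A minimal sketch of that convention follows; the exact placement of the placeholder inside the Shikra prompt is an assumption, so check the Shikra inference script for the real template.
# Hypothetical helper reflecting the note above; the real prompt templates live in the
# respective inference scripts.
def build_prompt(question: str, backend: str) -> str:
    if backend == 'shikra':
        return f'<image> {question}'   # Shikra: keep an explicit <image> placeholder
    return question                    # LLaVA variants: no <image> token in the prompt

print(build_prompt('Describe this image in detail.', 'shikra'))
print(build_prompt('Describe this image in detail.', 'llava-v1.5-7b'))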
VINDEX is model-agnostic: the training procedure is independent of the base MLLM and depends only on the vision encoder (e.g., CLIP, DINO, or SigLIP). The training process largely follows UMBRAE. Because the LLaVA variants share a consistent architecture, a trained brain encoder can be plugged into different MLLMs without additional modification.
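To illustrate the point, here is a toy sketch. The shapes (576 brain-predicted tokens of dimension 1024) and the LLaVA-style mlp2x_gelu projector are assumptions drawn from the commands in this README; the real modules live under src/model/ and in the respective MLLM weights.
# Toy illustration: the brain encoder's output lives in the vision-encoder feature space,
# so only the MLLM-side projector / LLM changes when swapping backends.
import torch
import torch.nn as nn

def mlp2x_gelu(in_dim: int, hidden_size: int) -> nn.Module:
    # LLaVA-style two-layer MLP projector: vision feature dim -> LLM hidden size.
    return nn.Sequential(nn.Linear(in_dim, hidden_size), nn.GELU(),
                         nn.Linear(hidden_size, hidden_size))

brain_tokens = torch.randn(2, 576, 1024)   # stand-in for brain-predicted CLIP-space tokens

for name, hidden_size in [('llava-v1.5-7b', 4096), ('llava-v1.5-13b', 5120)]:
    projector = mlp2x_gelu(1024, hidden_size)          # in practice, loaded from that MLLM's weights
    print(name, tuple(projector(brain_tokens).shape))  # (2, 576, hidden_size)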
For cross-subject training using CLIP or DINO as the vision encoder:
# cross-subject training and adaptation
vision_tower='clip' img_size=224 num_epochs=300 # vision_tower: clip/dino, img_size: 224/336
accelerate launch --num_processes=1 --num_machines=1 --gpu_ids='0' src/train_brainx.py \
--data_path 'nsd_data' --fmri_encoder 'brainx2' --batch_size 128 --num_epochs $num_epochs \
--vision_tower $vision_tower --img_size $img_size --subj 1 2 5 7 \
--model_save_path "train_logs/cross_sub/${vision_tower}_${img_size}/epoch${num_epochs}"
For single-subject training using CLIP or DINO as the vision encoder:
# single-subject training
vision_tower='clip' img_size=224 num_epochs=300
accelerate launch --num_processes=1 --num_machines=1 --gpu_ids='0' src/train.py \
--data_path 'nsd_data' --fmri_encoder 'brainxs' --img_size $img_size \
--vision_tower $vision_tower --subj 1 --feat_dim 1024 --batch_size 64 --num_epochs $num_epochs \
--model_save_path "train_logs/specific_sub/sub01_dim1024/${vision_tower}_${img_size}/epoch${num_epochs}"
For single-subject training using SigLIP as the vision encoder:
vision_tower='siglip' img_size=384 feat_dim=3456 batch_size=32 num_epochs=500
accelerate launch --num_processes=1 --num_machines=1 --gpu_ids='0' src/train_dc.py \
--data_path 'nsd_data' --fmri_encoder 'brainxs' --img_size $img_size \
--vision_tower $vision_tower --subj 1 --feat_dim $feat_dim --batch_size $batch_size --num_epochs $num_epochs \
--model_save_path "train_logs/specific_sub/sub01_dim${feat_dim}/${vision_tower}_${img_size}/epoch${num_epochs}"
Caution
Training AF (SigLIP feature alignment) is not recommended — it is resource-intensive and typically leads to poor results, with the brain encoder often producing garbled or empty outputs.
- Release inference scripts and pretrained checkpoints.
- Update training scripts.
- Update benchmark and evaluation scripts.
The code is built upon UMBRAE. The processed data come from MindEye. We use the pretrained models LLaVA, MMVP, M3, and DC as the MLLMs. Thanks for these awesome works.
The following highlights a series of our works on multimodal brain decoding and benchmarking:
- DREAM: mirrors pathways in the human visual system for stimulus reconstruction.
- UMBRAE: interprets brain activations into multimodal explanations with task-specific MLLM prompting.
- BASIC: provides interpretable and multigranular benchmarking for brain visual decoding.
@inproceedings{xia2025vindex,
title = {Exploring The Visual Feature Space for Multimodal Neural Decoding},
author = {Xia, Weihao and Öztireli, Cengiz},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2025},
}