Extremely low zero-shot performance (0% acc on both val and test) on RefCOCOg #30

Open
yiranyyu opened this issue Oct 26, 2022 · 1 comment

yiranyyu commented Oct 26, 2022

I downloaded the model weights pre-trained on VG & COCO and the pre-processed features following the instructions in the README, then tested the zero-shot grounding performance of VL-T5 on the RefCOCOg dataset following the guidance. However, the accuracy on both the val and test splits is zero, which really confuses me.

Then I tested the few-shot performance of VL-T5 and got a reasonable result (44.53% acc on the val split with four samples). I was wondering whether the weights left unused when initializing RefCOCOModel from the pre-trained checkpoint (see the log below) are what causes such a big gap between the zero-shot and few-shot performance.

Commands to Reproduce the Results

cd VL-T5/

# modify scripts/RefCOCOg_VLT5.sh: set the `lr` param to 0 and `epochs` to 1
vim scripts/RefCOCOg_VLT5.sh

# modify line 304 of src/refcoco.py from `>` to `>=` so the zero-acc checkpoint is saved for testing (see the sketch after these commands)
vim src/refcoco.py

# run the training script (still inside VL-T5/)
bash scripts/RefCOCOg_VLT5.sh 4
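
For reference, a minimal sketch of why that one-character change matters (variable names are illustrative assumptions, not the repo's exact code): with a strict `>`, a run whose every epoch scores 0% never beats the initial best score, so no checkpoint is written at all.

```python
# Hypothetical reconstruction of the checkpoint guard near line 304 of
# src/refcoco.py; names are illustrative, not copied from the repo.
best_valid = 0.0   # the trainer's initial best validation score
valid_score = 0.0  # what a zero-shot (lr=0) epoch produces here

if valid_score > best_valid:    # original guard: never true when both are 0.0,
    print("save checkpoint")    # so the 0% model is never dumped

if valid_score >= best_valid:   # edited guard: ties count as "best",
    best_valid = valid_score    # so the zero-shot weights still get saved
    print("save checkpoint")    # and can be loaded for testing afterwards
```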

Logs and Other Information

Log

Building Model at GPU 0
Building Model at GPU 3
Building Model at GPU 1
Building Model at GPU 2
Some weights of VLT5RefCOCO were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.visual_embedding.feat_embedding.0.weight', 'encoder.visual_embedding.feat_embedding.0.bias', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.bias', 'encoder.visual_embedding.obj_order_embedding.weight', 'encoder.visual_embedding.img_order_embedding.weight', 'encoder.visual_embedding.layer_norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model Launching at GPU 3
Model Launching at GPU 1
Model Launching at GPU 2
Model loaded from  snap/pretrain/VLT5/Epoch30.pth
_IncompatibleKeys(missing_keys=[], unexpected_keys=['encoder.visual_embedding.feat_embedding.1.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.1.weight'])
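
A quick way to see exactly which weights are being dropped is a hypothetical inspection script like the one below (it assumes Epoch30.pth stores a plain state_dict; unwrap it first if it is nested). The `.1.` index in the unexpected keys suggests the pretrained checkpoint's Sequential visual-embedding modules carry an extra submodule (likely a LayerNorm) that the downstream model's config does not build, so those weights are silently discarded.

```python
# Hypothetical inspection script, not part of the repo.
import torch

# Assumption: the checkpoint is a flat {parameter_name: tensor} state_dict.
state_dict = torch.load("snap/pretrain/VLT5/Epoch30.pth", map_location="cpu")

# List every visual-embedding parameter stored in the checkpoint, so the
# ".0." (Linear) vs ".1." (extra submodule) layout is visible at a glance.
for key, tensor in state_dict.items():
    if "feat_embedding" in key or "absolute_vis_pos_embedding" in key:
        print(key, tuple(tensor.shape))
```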


Script

Content of scripts/RefCOCOg_VLT5.sh (only the lr and epochs params were changed):

# The name of experiment
name=VLT5

output=snap/refcocog/$name

PYTHONPATH=$PYTHONPATH:./src \
python -m torch.distributed.launch \
    --nproc_per_node=$1 \
    src/refcoco.py \
        --distributed --multiGPU \
        --train train \
        --valid val \
        --test test \
        --optim adamw \
        --warmup_ratio 0.1 \
        --clip_grad_norm 5 \
        --lr 0e-5 \
        --epochs 1 \
        --num_workers 4 \
        --backbone 't5-base' \
        --output $output ${@:2} \
        --load snap/pretrain/VLT5/Epoch30 \
        --batch_size 90
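
Since the lr=0 trick relies on the optimizer making no updates, a sanity check (hypothetical, not part of the repo) is to compare the input checkpoint with the one the run saves: with AdamW at lr=0 every update, including the decoupled weight decay, is scaled by zero, so the shared weights should be bit-identical. The BEST.pth filename is an assumption about how the trainer names its saved checkpoint.

```python
# Hypothetical sanity check: confirm one epoch at lr=0 left the pretrained
# weights untouched, i.e. the evaluated model really is zero-shot.
import torch

before = torch.load("snap/pretrain/VLT5/Epoch30.pth", map_location="cpu")
after = torch.load("snap/refcocog/VLT5/BEST.pth", map_location="cpu")  # assumed name

# Only compare parameters present in both checkpoints (the downstream model
# also has newly initialized embeddings that never existed in the pretrained file).
changed = [k for k in before if k in after and not torch.equal(before[k], after[k])]
print("weights changed during the lr=0 run:", changed or "none")
```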

Platform

OS: Ubuntu
GPU: A100

yiranyyu (Author) commented:

Update:

It seems the unexpected_keys warning is not the reason for the low performance. The message disappears when I use the model further pre-trained on VCR, but the val and test performance is still low (nearly 0.6% on both splits). We then tried constraining the decoding to generate only vis_extra_id_ tokens, which resulted in about 1% accuracy on test.
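
For context, one hedged way to implement that kind of constrained decoding is Hugging Face's prefix_allowed_tokens_fn hook. In the sketch below, `model`, `tokenizer`, and `input_ids` stand in for the objects built by src/refcoco.py, the "<vis_extra_id_" prefix is an assumption about how the region tokens are named, and the real VL-T5 generate call also takes visual features, omitted here.

```python
# Hypothetical sketch of constraining generation to region tokens only.
def make_vis_token_filter(tokenizer):
    # Collect the ids of every region token, plus EOS so decoding can stop.
    allowed = [
        tok_id for tok, tok_id in tokenizer.get_vocab().items()
        if tok.startswith("<vis_extra_id_")
    ]
    allowed.append(tokenizer.eos_token_id)

    def restrict(batch_id, generated_ids):
        # Called at every decoding step; only these token ids may be produced.
        return allowed

    return restrict

output_ids = model.generate(
    input_ids=input_ids,
    prefix_allowed_tokens_fn=make_vis_token_filter(tokenizer),
    max_length=2,  # decoder start token + one region token
)
```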
