
Can large-scale pretraining achieve truly open-vocabulary detection? #484

Open
wangzishuo029 opened this issue Sep 7, 2024 · 1 comment


@wangzishuo029

Recent works like YOLO-World and GroundingDINO are mainly pretrained on large-scale datasets such as Objects365 and GoldG. Unlike open-vocabulary detection (OVD) methods such as CORA and F-VLM, they do not use a CLIP image encoder as the backbone. And while the vocabulary of YOLO-World's pretraining data is larger, it is still finite. So can YOLO-World detect objects beyond its pretraining data? Is it a truly open-vocabulary detector?
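For context on what "open vocabulary" means mechanically: detectors in this family classify region features by similarity against text embeddings of a user-supplied vocabulary rather than against a fixed label head, so nothing in the architecture caps the label set. Whether that generalizes to unseen words is exactly the question here. A minimal sketch of the mechanism, with random stand-in embeddings (a real system would use a CLIP-style text encoder and the detector's region features; all names below are illustrative):

```python
import numpy as np

def classify_regions(region_feats, text_embeds, temperature=0.05):
    """Score each region against each vocabulary entry by cosine
    similarity, then softmax over the vocabulary axis."""
    r = region_feats / np.linalg.norm(region_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    logits = (r @ t.T) / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# The vocabulary is free text chosen at inference time, not a fixed label set.
vocab = ["person", "giraffe", "unicycle"]
text_embeds = rng.normal(size=(len(vocab), 512))
# Simulate one region whose feature lies near the "giraffe" text embedding.
region_feats = text_embeds[1] + 0.1 * rng.normal(size=512)
probs = classify_regions(region_feats[None, :], text_embeds)
print(vocab[int(probs.argmax())])  # giraffe
```

The sketch makes the limitation concrete: swapping in a new word only helps if the text encoder places that word's embedding near the region features the detector produces for it, and that alignment was learned from the (finite) pretraining vocabulary.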

@YonghaoHe

No, the performance is limited.
