Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery
In this paper, we study the problem of Generalized Category Discovery (GCD), which aims to cluster unlabeled data from both known and unknown categories using the knowledge of labeled data from known categories. Current GCD methods rely solely on visual cues, neglecting the multi-modality perceptive nature of human cognition in discovering novel visual categories. To address this, we propose a two-phase TextGCD framework that accomplishes multi-modality GCD by exploiting powerful Visual-Language Models. TextGCD mainly consists of a retrieval-based text generation (RTG) phase and a cross-modality co-teaching (CCT) phase. First, RTG constructs a visual lexicon using category tags from diverse datasets and attributes from Large Language Models, and generates descriptive texts for images in a retrieval manner. Second, CCT leverages disparities between the textual and visual modalities to foster mutual learning, thereby enhancing visual GCD. In addition, we design an adaptive class aligning strategy to ensure the alignment of category perceptions between modalities, as well as a soft-voting mechanism to integrate multi-modality cues. Experiments on eight datasets show the clear superiority of our approach over state-of-the-art methods. Notably, our approach outperforms the best competitor by 7.7% and 10.8% in All accuracy on ImageNet-1k and CUB, respectively.
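For intuition only, below is a minimal sketch of the retrieval idea behind RTG, assuming OpenAI's CLIP package; the lexicon entries, image path, and top-k value are illustrative placeholders, not the repo's actual pipeline:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder visual lexicon; in TextGCD it is built from category tags of
# diverse datasets plus attributes queried from an LLM.
lexicon = ["a photo of a scarlet tanager", "a photo of a sports car", "a photo of a tabby cat"]
text_tokens = clip.tokenize(lexicon).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image path

with torch.no_grad():
    img_feat = model.encode_image(image).float()
    txt_feat = model.encode_text(text_tokens).float()
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = (img_feat @ txt_feat.T).squeeze(0)  # cosine similarity between the image and each lexicon entry

top_idx = sims.topk(k=2).indices.tolist()
description = ", ".join(lexicon[i] for i in top_idx)  # retrieved descriptive text for this image
print(description)
```

In the full framework, the retrieved texts feed a textual branch that is co-trained with the visual branch, and their category predictions are fused by the soft-voting mechanism described above.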
pip install -r requirements.txt
Set paths to datasets and desired log directories in config.py
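For illustration, a hypothetical sketch of what such entries might look like (the variable names below are assumptions; use the names actually defined in config.py):

```python
# config.py -- illustrative only; keep the variable names the repo actually expects.
cifar_10_root = '/path/to/datasets/cifar10'
cifar_100_root = '/path/to/datasets/cifar100'
cub_root = '/path/to/datasets/CUB_200_2011'
cars_root = '/path/to/datasets/stanford_cars'
pets_root = '/path/to/datasets/oxford_pets'
flowers_root = '/path/to/datasets/flowers102'
imagenet_root = '/path/to/datasets/imagenet'

exp_root = '/path/to/logs'  # where training logs and checkpoints are written
```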
We use fine-grained benchmarks in this paper, including CUB, Stanford Cars, Oxford Pets, and Flowers 102.
We also use generic object recognition datasets, including CIFAR-10, CIFAR-100, and ImageNet.
Train the model:
mkdir retrieved_text
bash scripts/train_${DATASET_NAME}.sh
Our results:
Datasets | Paper (3 runs) | Current GitHub (5 runs) |
---|---|---|
CIFAR-100 | All 85.7 / Old 86.3 / New 84.6 | seed0: All 0.8548 / Old 0.8509 / New 0.8626<br>seed1: All 0.8408 / Old 0.8557 / New 0.8110<br>seed2: All 0.8531 / Old 0.8630 / New 0.8333<br>seed3: All 0.8638 / Old 0.8629 / New 0.8656<br>seed4: All 0.8604 / Old 0.8642 / New 0.8526<br>average: All 85.46±0.88 / Old 85.93±0.58 / New 84.50±2.28 |
CIFAR-10 | All 98.2 / Old 98.0 / New 98.6 | seed0: All 0.9850 / Old 0.9789 / New 0.9881<br>seed1: All 0.9848 / Old 0.9814 / New 0.9864<br>seed2: All 0.9850 / Old 0.9792 / New 0.9878<br>seed3: All 0.9849 / Old 0.9785 / New 0.9881<br>seed4: All 0.9849 / Old 0.9768 / New 0.9889<br>average: All 98.49±0.01 / Old 97.90±0.17 / New 98.79±0.09 |
CUB | All 76.6 / Old 80.6 / New 74.7 | seed0: All 0.7776 / Old 0.7872 / New 0.7728<br>seed1: All 0.7669 / Old 0.7585 / New 0.7711<br>seed2: All 0.7791 / Old 0.8099 / New 0.7638<br>seed3: All 0.7667 / Old 0.7772 / New 0.7614<br>seed4: All 0.7729 / Old 0.7632 / New 0.7778<br>average: All 77.26±0.58 / Old 77.92±2.06 / New 76.94±0.67 |
Stanford Cars | All 86.9 / Old 87.4 / New 86.7 | seed0: All 0.8556 / Old 0.9195 / New 0.8248<br>seed1: All 0.8547 / Old 0.9220 / New 0.8221<br>seed2: All 0.8652 / Old 0.9180 / New 0.8397<br>seed3: All 0.8548 / Old 0.9100 / New 0.8281<br>seed4: All 0.8752 / Old 0.9180 / New 0.8545<br>average: All 86.11±0.90 / Old 91.75±0.45 / New 83.38±1.34 |
Oxford Pets | All 95.5 / Old 93.9 / New 96.4 | seed0: All 0.9368 / Old 0.9332 / New 0.9387<br>seed1: All 0.9434 / Old 0.9109 / New 0.9604<br>seed2: All 0.9529 / Old 0.9396 / New 0.9599<br>seed3: All 0.9459 / Old 0.9290 / New 0.9549<br>seed4: All 0.9496 / Old 0.9321 / New 0.9588<br>average: All 94.57±0.62 / Old 92.90±1.08 / New 95.45±0.91 |
Flowers 102 | All 87.2 / Old 90.7 / New 85.4 | seed0: All 0.8418 / Old 0.9412 / New 0.7922<br>seed1: All 0.9059 / Old 0.9255 / New 0.8961<br>seed2: All 0.8850 / Old 0.9137 / New 0.8706<br>seed3: All 0.8810 / Old 0.9373 / New 0.8529<br>seed4: All 0.8850 / Old 0.9451 / New 0.8549<br>average: All 87.97±2.33 / Old 93.26±1.28 / New 85.33±3.83 |
If you find this repo useful for your research, please consider citing our paper:
@article{zheng2024textual,
title={Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery},
author={Zheng, Haiyang and Pu, Nan and Li, Wenjing and Sebe, Nicu and Zhong, Zhun},
journal={arXiv preprint arXiv:2403.07369},
year={2024}
}
Our lexicon is based on the LENS project; we thank the authors for the resources they make available. Our codebase also heavily relies on SimGCD. Thanks for their excellent work!
This project is licensed under the MIT License - see the LICENSE file for details.