Yandex Cup 2024, Music Information Retrieval, 1st place main track.

Pipeline

The model takes in (50, 84) image, resizes it to (224, 224), applies mean/std normalization, 2d backbone and arcface head.

To train a model: python3 train.py --config config_1.py,

NB: takes ~10 hours to saturate validation metric on 4090 GPU assuming there is no data loading bottleneck (almost any CPU would do)

model	approx score LB
from scratch, patch transformer	0.30
whisper-base, take encoder, downsample pos encoding	0.35
2d cnns, e.g. convnext_tiny	0.45
add resize from orig resolution to (224, 224)	0.55
transformers, e.g. maxvit (works best)	0.64
+5 models ensemble	0.665
+query expansion	0.680