Code and model for the paper "ConLID: Supervised Contrastive Learning for Low-Resource Language Identification" (arXiv, 2025).
TL;DR: We introduce ConLID, a model trained on the GlotLID-C dataset using Supervised Contrastive Learning. It supports 2,099 languages and is especially effective for low-resource languages.
```bash
git clone https://github.com/epfl-nlp/ConLID.git
cd ConLID
# set the environment variables as in `.env_example`
source setup.sh
```
Download the model
```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="epfl-nlp/ConLID", local_dir="checkpoint")
```
Use the model
```python
from model import ConLID

model = ConLID.from_pretrained(dir='checkpoint')

# print the supported labels
print(model.get_labels())
## ['aai_Latn', 'aak_Latn', 'aau_Latn', 'aaz_Latn', 'aba_Latn', ...]

# prediction
model.predict("The cat climbed onto the roof to enjoy the warm sunlight peacefully!")
## (['eng_Latn'], [0.970989465713501])

model.predict("The cat climbed onto the roof to enjoy the warm sunlight peacefully!", k=3)
## (['eng_Latn', 'sco_Latn', 'jam_Latn'], [0.970989465713501, 0.006496887654066086, 0.00487488554790616])
```
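There is no dedicated batch interface in the snippet above; a minimal sketch for scoring several sentences, assuming only the single-string `predict(text, k=...)` API shown here (the example sentences are ours, a plain loop rather than an official feature):

```python
# Minimal sketch: top-3 language candidates for several sentences.
# Assumes only the single-string `predict(text, k=...)` API shown above.
sentences = [
    "Le chat est monté sur le toit.",     # French example input
    "Die Katze kletterte auf das Dach.",  # German example input
]
for text in sentences:
    labels, probs = model.predict(text, k=3)
    print(text, list(zip(labels, probs)))
```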
Download the training dataset to `data/glotlid/`
```bash
huggingface-cli download cis-lmu/glotlid-corpus --repo-type dataset --local-dir data/glotlid
```
Run the data preprocessing pipeline
```bash
bash scripts/preprocess_dataset.sh
```
Run training
```bash
bash scripts/train_lid_ce.sh    # Trains the LID-CE model
bash scripts/train_lid_scl.sh   # Trains the LID-SCL model
bash scripts/train_conlid_s.sh  # Trains the ConLID-S model
```
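The LID-SCL and ConLID-S models are trained with a supervised contrastive objective; the repo's exact loss lives in the training scripts above. For orientation, here is only a minimal PyTorch sketch of the standard supervised contrastive loss (Khosla et al., 2020) that the name refers to, not necessarily the exact variant used in this repo:

```python
import torch
import torch.nn.functional as F

def supcon_loss(features: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """Sketch of the standard supervised contrastive loss (Khosla et al., 2020).

    features: (N, D) embeddings for a batch; labels: (N,) integer class ids.
    """
    features = F.normalize(features, dim=1)            # cosine-similarity space
    sim = features @ features.T / temperature          # (N, N) scaled similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, float("-inf"))    # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Positives: other in-batch samples with the same label.
    pos_mask = (labels[None, :] == labels[:, None]) & ~self_mask
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0                             # anchors with >= 1 positive
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(1)
    return (-pos_log_prob[valid] / pos_counts[valid]).mean()
```

Note that anchors with no in-batch positive are skipped, which is why class-balanced batching matters for rare (low-resource) labels.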
TODO:
- Release the inference code
- Release the training code
- Release the evaluation code
- Optimize the inference using parallel tokenization
If you find this project useful, please cite us:
```bibtex
@article{foroutan2025conlid,
  title={ConLID: Supervised Contrastive Learning for Low-Resource Language Identification},
  author={Negar Foroutan and Jakhongir Saydaliev and Ye Eun Kim and Antoine Bosselut},
  journal={arXiv preprint arXiv:2506.15304},
  year={2025}
}
```