epfl-nlp/ConLID


ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

Model Overview

License: MIT

Code and model for the paper "ConLID: Supervised Contrastive Learning for Low-Resource Language Identification" (arXiv, 2025).

TL;DR: We introduce ConLID, a model trained on the GlotLID-C dataset using Supervised Contrastive Learning. It supports 2,099 languages and is especially effective for low-resource languages.

🛠️ Setup

git clone https://github.com/epfl-nlp/ConLID.git
cd ConLID
# set the environment variables as in `.env_example`
source setup.sh

🤖 Usage

Download the model

from huggingface_hub import snapshot_download

snapshot_download(repo_id="epfl-nlp/ConLID", local_dir="checkpoint")

Use the model

from model import ConLID
model = ConLID.from_pretrained(dir='checkpoint')

# print the supported labels
print(model.get_labels())
## ['aai_Latn', 'aak_Latn', 'aau_Latn', 'aaz_Latn', 'aba_Latn', ...]

# prediction
model.predict("The cat climbed onto the roof to enjoy the warm sunlight peacefully!")
## (['eng_Latn'], [0.970989465713501])

model.predict("The cat climbed onto the roof to enjoy the warm sunlight peacefully!", k=3)
## (['eng_Latn', 'sco_Latn', 'jam_Latn'], [0.970989465713501, 0.006496887654066086, 0.00487488554790616])
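
A common follow-up is to reject low-confidence predictions. The sketch below assumes, based only on the outputs shown above, that `predict` returns a `(labels, probabilities)` tuple ordered from most to least likely. The helper name `label_or_undetermined`, the 0.5 threshold, and the `und` fallback are illustrative choices and not part of ConLID. It runs on the example output copied from above, so no checkpoint is needed to try it.

```python
from typing import List, Tuple

# Assumed shape of a ConLID prediction: (labels, probabilities), best first.
Prediction = Tuple[List[str], List[float]]


def label_or_undetermined(
    prediction: Prediction,
    threshold: float = 0.5,   # hypothetical cut-off, tune for your data
    fallback: str = "und",    # hypothetical marker for "undetermined"
) -> str:
    """Return the top label, or `fallback` if the top probability is too low."""
    labels, probs = prediction
    if not labels or probs[0] < threshold:
        return fallback
    return labels[0]


# Output of model.predict(..., k=3) copied from the example above.
example: Prediction = (
    ["eng_Latn", "sco_Latn", "jam_Latn"],
    [0.970989465713501, 0.006496887654066086, 0.00487488554790616],
)

print(label_or_undetermined(example))                  # eng_Latn
print(label_or_undetermined(example, threshold=0.99))  # und
```

With a loaded model, the same helper could be applied as `label_or_undetermined(model.predict(text))`, provided the return shape matches the assumption above.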

💪🏻 Training

Download the training dataset under data/glotlid/

huggingface-cli download cis-lmu/glotlid-corpus --repo-type dataset --local-dir data/glotlid

Run the data preprocessing pipeline

bash scripts/preprocess_dataset.sh

Run training

bash scripts/train_lid_ce.sh    # Trains the LID-CE model
bash scripts/train_lid_scl.sh   # Trains the LID-SCL model
bash scripts/train_conlid_s.sh  # Trains the ConLID-S model
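
For readers unfamiliar with supervised contrastive learning, the following is a minimal pure-Python sketch of the generic supervised contrastive (SupCon) loss from the literature. It is not the repository's training code: the function names, the temperature of 0.1, and the toy embeddings are all assumptions made for illustration. The loss pulls together embeddings that share a language label and pushes apart those that do not.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[str],
    temperature: float = 0.1,  # hypothetical value, not taken from the repo
) -> float:
    """Generic SupCon loss, averaged over anchors that have a positive."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    losses = []
    for i in range(n):
        others = [a for a in range(n) if a != i]
        positives = [p for p in others if labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in others
        }
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        # Mean negative log-probability of the positives for this anchor.
        losses.append(
            -sum(sims[p] - log_denom for p in positives) / len(positives)
        )
    return sum(losses) / len(losses) if losses else 0.0


# Toy 2-D embeddings: two tight clusters.
toy = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
aligned = supcon_loss(toy, ["eng_Latn", "eng_Latn", "fra_Latn", "fra_Latn"])
shuffled = supcon_loss(toy, ["eng_Latn", "fra_Latn", "eng_Latn", "fra_Latn"])
print(aligned < shuffled)  # True
```

When the labels agree with the clusters the loss is close to zero; when they cut across the clusters it is large, which is the signal a contrastive objective uses to reshape the embedding space.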

🎯 TODO

  • Release the inference code
  • Release the training code
  • Release the evaluation code
  • Optimize the inference using parallel tokenization

⭐️ Citation

If you find this project useful, please cite us:

@article{foroutan2025conlid,
  title={ConLID: Supervised Contrastive Learning for Low-Resource Language Identification},
  author={Negar Foroutan and Jakhongir Saydaliev and Ye Eun Kim and Antoine Bosselut},
  journal={arXiv preprint arXiv:2506.15304},
  year={2025}
}
