A comprehensive toolkit for applying active learning techniques to natural language generation tasks. This repository contains implementations of various active learning strategies specifically designed for text generation models, helping to reduce annotation costs while maximizing model performance.
- Multiple Active Learning Strategies: Implementation of strategies like HUDS, HADAS, FAC-LOC, IDDS, and more
- Flexible Model Support: Compatible with various language models (Qwen, Llama, etc.)
- Comprehensive Evaluation: Supports multiple evaluation metrics including ROUGE, BLEU, BERTScore, AlignScore, etc.
- Interactive Visualization: Streamlit dashboard for exploring results and comparing strategies
- Hydra Configuration: Easily configurable experiments through Hydra's YAML-based configuration system
- PEFT Integration: Efficient fine-tuning using Parameter-Efficient Fine-Tuning methods
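To give a flavour of what the lexical-overlap metrics listed above compute, here is a minimal pure-Python sketch of ROUGE-1 F1 (unigram overlap). The function name `rouge1_f1` is illustrative only and not part of the ATGen API; the toolkit's actual metric implementations live in `src/atgen/metrics/`.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall,
    using clipped counts (each reference unigram matches at most
    as many times as it occurs in the reference)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Production metrics add stemming, tokenization choices, and multi-reference handling, which is why the toolkit delegates to dedicated metric libraries rather than a sketch like this.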
- Python 3.10+
- CUDA-compatible GPU (for model training)
- Dependencies listed in `requirements.txt`
```bash
pip install atgen
```

Or install the latest version directly from GitHub:

```bash
pip install git+https://github.com/Aktsvigun/atgen.git
```

For development (e.g., adding a new AL / subset-selection strategy) or if you want to modify the code:
```bash
# Clone the repository
git clone https://github.com/Aktsvigun/atgen.git
cd atgen

# Install in editable mode
pip install -e .
```

This installs the package in editable mode, so changes to the code take effect immediately without reinstalling.
Experiments can be launched using the `run-al` command:
```bash
CUDA_VISIBLE_DEVICES=0 HYDRA_CONFIG_NAME=base run-al
```

Parameters:
- `CUDA_VISIBLE_DEVICES`: which GPU to use
- `HYDRA_CONFIG_NAME`: name of the configuration file (e.g., `base`, `custom`, `test`)
Additional parameters can be overridden via the command line following Hydra's syntax:
```bash
CUDA_VISIBLE_DEVICES=0 HYDRA_CONFIG_NAME=base run-al al.strategy=huds model.checkpoint=Qwen/Qwen2.5-7B
```

Launch the Streamlit application to explore and visualize your experiments:
```bash
streamlit run Welcome.py
```

Navigate to http://localhost:8501 in your web browser to access the dashboard.
- `configs/`: Configuration files for experiments
  - `al/`: Active learning strategy configurations
  - `data/`: Dataset configurations
  - `labeller/`: Labeller configurations
- `src/atgen/`: Main package
  - `strategies/`: Implementation of active learning strategies
  - `metrics/`: Code for evaluation metrics
  - `utils/`: Utility functions
  - `run_scripts/`: Scripts for running experiments
  - `labellers/`: Labelling mechanisms
  - `visualize/`: Visualization tools
- `pages/`: Streamlit application pages
- `outputs/`: Experimental results storage
- `cache/`: Cached computations to speed up repeated runs
- `huds`: Hypothetical Document Scoring
- `hadas`: Harmonic Diversity Scoring
- `random`: Random sampling baseline
- `fac-loc`: Facility Location strategy
- `idds`: Improved Diverse Density Scoring
- And more...
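The `random` baseline above is the simplest strategy to picture: each round, draw a batch of unlabelled examples uniformly at random, send them to the labeller, and retrain. The sketch below illustrates that loop with a hypothetical `random_al_loop` helper (not the package's actual interface):

```python
import random

def random_al_loop(pool, label_fn, query_size=2, rounds=2, seed=0):
    """Random-sampling AL baseline: each round, sample `query_size`
    unlabelled examples uniformly and 'annotate' them with label_fn."""
    rng = random.Random(seed)
    unlabelled = list(pool)
    labelled = []
    for _ in range(rounds):
        batch = rng.sample(unlabelled, min(query_size, len(unlabelled)))
        for x in batch:
            unlabelled.remove(x)
            labelled.append((x, label_fn(x)))
        # A real loop would retrain the generation model on `labelled`
        # here and re-score the remaining pool before the next round.
    return labelled, unlabelled
```

Non-random strategies replace the uniform `rng.sample` call with a scoring step (uncertainty, diversity, facility location, etc.) over the unlabelled pool.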
The toolkit comes pre-configured for several datasets including summarization, question answering, and other generative tasks. Custom datasets can be added by creating new configuration files.
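A new dataset configuration would be a small YAML file under `configs/data/`. The exact schema is defined by the existing files in that directory; every field name in the sketch below is hypothetical, so copy an existing config rather than this example:

```yaml
# configs/data/my_dataset.yaml -- hypothetical sketch; field names are
# illustrative only, mirror an existing file under configs/data/
name: my_summarization_corpus   # hypothetical dataset identifier
text_column: document           # input field
label_column: summary           # target field
```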
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE.md file for details.
If you use this toolkit in your research, please cite:
```bibtex
@inproceedings{tsvigun-etal-2025-atgen,
    title = "{ATG}en: A Framework for Active Text Generation",
    author = "Tsvigun, Akim and
      Vasilev, Daniil and
      Tsvigun, Ivan and
      Lysenko, Ivan and
      Bektleuov, Talgat and
      Medvedev, Aleksandr and
      Vinogradova, Uliana and
      Severin, Nikita and
      Mozikov, Mikhail and
      Savchenko, Andrey and
      Makarov, Ilya and
      Rostislav, Grigorev and
      Kuleev, Ramil and
      Zhdanov, Fedor and
      Shelmanov, Artem",
    editor = "Mishra, Pushkar and
      Muresan, Smaranda and
      Yu, Tao",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-demo.63/",
    doi = "10.18653/v1/2025.acl-demo.63",
    pages = "653--665",
    ISBN = "979-8-89176-253-4",
}
```