This repository is our implementation for SemEval'25 Task 5 - LLMs4Subjects. Here is the webpage of the task: https://sites.google.com/view/llms4subjects/home?authuser=0
The main idea of our system is to leverage a range of few-shot prompts and LLMs and to ensemble the results; no fine-tuning is required. We handle the target vocabulary by mapping the LLM's keywords onto it using embeddings. Important libraries for our system are DVC and vLLM. Our system also connects to a local Weaviate vector storage as well as a local Text Embedding Service through HuggingFace TEI. You will need to set these up before you can run the code. Our work results from a project at the German National Library (DNB) aimed at finding and testing methods for subject indexing of digital publications in the German language.
- Clone this repository.
- Get the data from llms4subjects: `git submodule update --init`
- Set up a local Text Embedding Service with HuggingFace TEI and a local Weaviate vector storage (see instructions with docker-compose below).
- Install the requirements from `requirements.txt`.
- Reproduce the pipeline: `dvc repro`
Please find our full system description in our submitted paper: [TBA]
The subject tagging system mainly consists of five stages:

- `complete`: Generate free keywords with an LLM
- `map`: Map keywords to the target vocab
- `summarize`: aggregate suggestions from all ensemble component models
- `rank`: use LLM to generate relevance scores for all suggestions
- `combine`: compute final ranking score from LLM-rating (`rank`) and ensemble vote (`summarize`)

The stages `complete` and `map` are executed for all models and prompts specified in the `params.yaml`.
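As a purely hypothetical sketch of what such an ensemble specification could look like (the actual key names and values in this repository's `params.yaml` may differ):

```yaml
# Hypothetical sketch only -- consult the params.yaml shipped with this
# repository for the actual structure and key names.
models:
  - meta-llama/Meta-Llama-3-8B-Instruct   # example HuggingFace model id (placeholder)
  - mistralai/Mistral-7B-Instruct-v0.2    # example HuggingFace model id (placeholder)
prompts:
  - prompt0   # name of a few-shot prompt template (placeholder)
  - prompt1
```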
flowchart TD
node1["complete@model0-prompt0"]
node2["map@model0-prompt0"]
node1-->node2
node3["..."]
node4["..."]
node3-->node4
node5["complete@modelM-promptP"]
node6["map@modelM-promptP"]
node5-->node6
node9["combine"]
node10["rank"]
node11["summarize"]
node2-->node11
node4-->node11
node6-->node11
node10-->node9
node11-->node9
node11-->node10
On a dev sample of 1000 documents that was not used in fine-tuning our system, we can report the following results:
| Ensemble Strategy | Precision | Recall | F1 | PR-AUC |
|---|---|---|---|---|
| top-20-ensemble | 0.488 | 0.459 | 0.420 | 0.411 |
| one-model-all-prompts | 0.481 | 0.407 | 0.393 | 0.344 |
| one-prompt-all-models | 0.492 | 0.414 | 0.407 | 0.375 |
| one-model-one-prompt | 0.461 | 0.385 | 0.380 | 0.235 |
Precision, Recall, and F1 are computed as document averages (macro avg) and refer to the optimal calibration of the system, as marked with a cross on the precision-recall curves:
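In other words, each score is first computed per document and then averaged over the N documents of the dev sample, e.g. for F1:

$$\mathrm{F1}_{\text{macro}} = \frac{1}{N}\sum_{d=1}^{N}\mathrm{F1}(d)$$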
Some of the LLMs used in this repository are gated. You need to request access to these models through HuggingFace and provide your personal access token as an environment variable before you can run our code, like this:
export HF_TOKEN=hf_YOUR_PERSONAL_ACCESS_TOKEN
You may also want to specify the download directory where HuggingFace models are stored on your workstation; this can be configured in the `params.yaml` section `general.vllm_engineargs.download_dir`. In the same section, `general.vllm_engineargs`, you will find other settings for the execution of vLLM. In particular, set `tensor_parallel_size` to the number of available GPU devices.
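For orientation, a minimal sketch of how this section of `params.yaml` might look; the values shown are placeholders to adapt to your setup, and the key names under `vllm_engineargs` correspond to vLLM engine arguments:

```yaml
# Sketch of the relevant section of params.yaml; values are placeholders.
general:
  vllm_engineargs:
    download_dir: /path/to/hf-models   # where HuggingFace model weights are stored
    tensor_parallel_size: 2            # set to the number of available GPU devices
    # further vLLM engine arguments (e.g. gpu_memory_utilization) can be added here
```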
Note: Make sure you have enough GPU memory to run all the models specified in the `params.yaml` file.
The stage that maps keyword suggestions to the target vocabulary employs a vector search across the vocabulary. To facilitate fast HNSW search, we store the vocabulary's text embeddings in a Weaviate vector storage that you will need to set up locally. This process also needs to generate text embeddings; for this purpose we start a HuggingFace Text-Embedding-Inference service (TEI). You can launch Weaviate as well as the TEI using docker compose and the provided `docker-compose.yaml` file. Simply run `docker compose up` in this directory and your services are all set up.

Should you wish to change the ports on which the two services are served on your machine, you will need to modify the `docker-compose.yaml` as well as the `params.yaml`.
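For orientation only, a minimal sketch of what such a `docker-compose.yaml` can look like; the file provided in this repository is authoritative, and the image tags, ports, and embedding model below are placeholders:

```yaml
# Illustrative sketch -- use the docker-compose.yaml provided in this repository.
services:
  weaviate:
    image: semitechnologies/weaviate:1.24.1          # placeholder version tag
    ports:
      - "8080:8080"                                  # Weaviate HTTP API
    environment:
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: /var/lib/weaviate
    volumes:
      - weaviate_data:/var/lib/weaviate
  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.5  # placeholder tag
    command: --model-id intfloat/multilingual-e5-large            # placeholder embedding model
    ports:
      - "8081:80"                                    # TEI listens on port 80 inside the container
volumes:
  weaviate_data:
```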
We used DVC to synchronize the various stages of our data processing pipeline. You can find installation instructions here.
Two files are crucial for using our code to perform subject tagging:

- `dvc.yaml` - This file contains the stages, i.e. the steps to run in your experiments. For each stage, you can specify which `cmd` it calls, the parameters, dependencies, and output files (which can be tracked with DVC, too); see the sketch after this list. Unless you want to change our procedure or some of the hard-coded files, it doesn't need to be changed. For most stages, you can find a corresponding `.py` or `.r` script that you can also run independently of the DVC pipeline.
- `params.yaml` - This file contains all the parameters, like prompt specifications, LLMs to prompt, and more. You can adapt the parameters in this file and reproduce the pipeline (see next subsection) or modify them temporarily for experiments (see the subsection after next).
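As a rough orientation, a stage in a `dvc.yaml` follows the layout sketched below; the script name, parameter section, and file paths are placeholders, not the actual entries of our pipeline:

```yaml
# Illustrative stage declaration; names and paths are placeholders.
stages:
  complete:
    cmd: python complete.py           # script invoked by this stage (placeholder name)
    deps:
      - complete.py
      - data/dev-sample.jsonl         # placeholder input file
    params:
      - completion                    # section of params.yaml read by the stage (placeholder)
    outs:
      - results/completions.jsonl     # placeholder output file tracked by DVC
```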
Using the command below, you can reproduce the entire pipeline with the parameters as specified in the `params.yaml`. Make sure to run the command in the main directory containing the `dvc.yaml`.
dvc repro dvc.yaml
View DVC's [documentation](https://dvc.org/doc/command-reference/repro#repro) to learn about possible options for this command. You can also reproduce individual stages (e.g. `complete`) by running:
dvc repro -s STAGE_NAME
If you want to run a multitude of experiments with different hyperparameters (and possibly queue those experiments), you could use:
dvc exp run
See the [documentation](https://dvc.org/doc/command-reference/exp/run#exp-run) for how to run experiments in a queue, name an experiment, and modify the parameters.
Metrics are computed in a separate DVC pipeline contained in the subfolder `eval-pipeline`. Currently, these are only reproducible with our internal tools, but we hope to share our evaluation tooling, too, as soon as possible.