This repository contains the code for the paper BETag: Behavior-enhanced Item Tagging with Finetuned Large Language Models.
We plan to make this code open source. However, as we are in the process of applying for a patent related to this work, the repository is temporarily unavailable.
We are committed to completing the process as quickly as possible and will make the repository publicly accessible once the patent process is finalized. Thank you for your understanding and patience.
- Install PyTorch (version >= 2.0) with the appropriate CUDA version for your system.
- Install dependencies using the following command:

```
pip install -e .
```

- Alternatively, you can manually check and install dependencies listed in `pyproject.toml`.
Base tags serve as the foundational representation of products and can be any relevant tags. We provide a script (`base_tags_generation.py`) for generating base tags using an LLM API.

The base tags must be organized in the following format for subsequent BE-finetuning and BETag Generation:

```
Mapping[PID, list[str]]
```
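For concreteness, a base-tag file is simply this mapping serialized as JSON. The sketch below writes a hypothetical example to the path used later in this README; the PIDs and tags are illustrative only.

```python
import json

# Hypothetical PIDs and tags, illustrating the Mapping[PID, list[str]] structure.
base_tags = {
    "B000001": ["laboratory glassware", "borosilicate", "500 ml"],
    "B000002": ["digital microscope", "usb", "1000x magnification"],
}

with open("dataset/amazon.scientific/base_tags.json", "w") as f:
    json.dump(base_tags, f, indent=2, ensure_ascii=False)
```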
To finetune the model on your own dataset, you need:
- Training interaction sequences: A list of interaction sequences (e.g., `list[list[PID]]`); see the sketch after this list.
- Base Tags: A mapping of product IDs (PIDs) to lists of tags (`Mapping[PID, list[Tag]]`), in the same format as described above.
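As an illustration, the interaction file is a JSON list of per-user PID sequences. The PIDs below are hypothetical; the path matches the example configuration shown in the next step.

```python
import json

# Hypothetical interaction sequences: each inner list is one user's ordered
# sequence of product IDs (list[list[PID]]).
inters = [
    ["B000001", "B000002", "B000003"],
    ["B000002", "B000004"],
]

with open("dataset/amazon.scientific/inters.train.json", "w") as f:
    json.dump(inters, f)
```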
- Prepare the dataset and specify paths to your data in a dotenv configuration file. For example:
```
inters_path = dataset/amazon.scientific/inters.train.json
base_tags_path = dataset/amazon.scientific/base_tags.json
...
```
- Run the finetuning script:
```
python beft.py --env path/to/the/.env
```
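The dotenv file only carries paths to your data. Below is a minimal sketch of how such a file could be read; it assumes the python-dotenv package, and the actual loading logic inside `beft.py` may differ.

```python
import json
from dotenv import dotenv_values  # assumes the python-dotenv package

# Read key/value pairs (e.g., inters_path, base_tags_path) from the dotenv file.
config = dotenv_values("path/to/the/.env")

with open(config["inters_path"]) as f:
    inters = json.load(f)        # list[list[PID]]
with open(config["base_tags_path"]) as f:
    base_tags = json.load(f)     # Mapping[PID, list[str]]
```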
- Preprocessed datasets used in the paper are available here.
- Default environment configurations can be found in the `envs.default` directory.
- Finetuned checkpoints are available on Google Drive.
For BETag Generation, interactions are not required. You only need:
- Base Tags: Use the same base tags as in BE-finetuning.
- Checkpoint: Path to the finetuned LLM checkpoint.
- Configure the dotenv file with required paths.
- Run the generation script:
```
python begen.py --env path/to/the/.env
```
The output directory will contain the following files:
- `generation_config.json`: Contains the generation configuration.
- `raw_predict.json`: The raw output of the LLM.
- `raw_betags.json`: Parsed BETags in the format `Mapping[PID, list[list[str]]]`.
  - For each product, the generated tags for each beam are stored separately.
  - Beams are sorted by score, from highest to lowest. The base tags are included as the first beam, resulting in M+1 beams.
  - You can select the top-M beams for each product:

    ```python
    betags = {pid: beams[:TOP_M+1] for pid, beams in raw_betags.items()}
    ```
- You can obtain weighted tags or select the top-K tags via the snippet below (an end-to-end sketch follows this list):

  ```python
  from collections import Counter

  betags = {pid: Counter(sum(beams, [])).most_common(TOP_K) for pid, beams in betags.items()}
  ```
- Generated BETags are available here.
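Putting these steps together, here is a minimal end-to-end sketch of the post-processing described above. The output directory path and the `TOP_M`/`TOP_K` values are illustrative; only `raw_betags.json` and its format come from the generation step.

```python
import json
from collections import Counter

TOP_M = 5   # number of generated beams to keep per product (illustrative)
TOP_K = 10  # number of tags to keep per product (illustrative)

# Hypothetical output directory; use the one configured for begen.py.
with open("output/amazon.scientific/raw_betags.json") as f:
    raw_betags = json.load(f)  # Mapping[PID, list[list[str]]]

# Keep the base-tag beam plus the top-M generated beams (beams are sorted by score).
betags = {pid: beams[:TOP_M + 1] for pid, beams in raw_betags.items()}

# Flatten the kept beams and keep the K most frequent tags as (tag, weight) pairs.
betags = {pid: Counter(sum(beams, [])).most_common(TOP_K) for pid, beams in betags.items()}
```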
The Amazon dataset used in this work is taken from Recformer.
TODO