This is our TensorFlow implementation for the paper:
Yunqin Zhu, Chao Wang, Qi Zhang, and Hui Xiong. 2024. Graph Signal Diffusion Model for Collaborative Filtering. In SIGIR ’24, July 14–18, 2024, Washington, DC, USA. ACM, New York, NY, USA, 11 pages.
See the arXiv version at https://arxiv.org/abs/2311.08744. If you have any questions, please feel free to contact Yunqin Zhu (haaasined at gmail dot com) or create an issue.
GiffCF is a diffusion model variant tailored to collaborative filtering with implicit feedback: it smooths and sharpens graph signals by simulating the heat equation on a graph. Its hierarchical generative model leverages the adjacency structure of the item-item similarity graph and enables high-quality reconstruction of user-item interactions, improving upon previous diffusion recommender models and graph signal processing techniques.
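For intuition, here is a minimal NumPy sketch of heat-equation smoothing on a toy graph (our own illustration; the dense Laplacian and the `heat_smooth` helper are hypothetical, not the model code):

```python
import numpy as np

# Hypothetical illustration: heat-equation smoothing of a graph signal via a
# spectral filter exp(-t * L). GiffCF's actual filters follow the paper.
def heat_smooth(x, eigvals, eigvecs, t):
    coeffs = eigvecs.T @ x          # project onto the spectral basis
    coeffs *= np.exp(-t * eigvals)  # attenuate high graph frequencies
    return eigvecs @ coeffs         # project back to the vertex domain

# Toy 4-node path graph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
L = np.eye(4) - A / np.sqrt(np.outer(d, d))  # normalized Laplacian
eigvals, eigvecs = np.linalg.eigh(L)

x = np.array([1.0, 0.0, 0.0, 0.0])              # an impulse "interaction" signal
print(heat_smooth(x, eigvals, eigvecs, t=1.0))  # mass diffuses to neighbors
```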
The code has been tested under Python 3.11.5 on RTX 3080 Ti and RTX 3090 GPUs. The following packages are required:
```
numpy==1.24.3
scipy==1.13.0
tabulate==0.9.0
tensorflow==2.14.0
tomli==2.0.1
```
To download the code and install the dependencies:
```bash
git clone https://github.com/HasiNed/GiffCF
cd GiffCF
pip install -r requirements.txt
```
Run `python main.py -h` to show the help message:
```
usage: main.py [-h] [-c CONFIG_PATH] [-p PARAMETERS] [-v VERBOSE] {train,evaluate}

positional arguments:
  {train,evaluate}      whether to train or evaluate a GiffCF model

options:
  -h, --help            show this help message and exit
  -c CONFIG_PATH        path to config file in TOML format
  -p PARAMETERS         dict overriding config parameters
  -v VERBOSE, --verbose VERBOSE
```
We support both TOML files and command-line key-value pairs for configuration. You can start by overwriting the default config files in `./configs` or passing a Python dictionary of parameters to the script.
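The `-p` string is a Python dict literal that overrides config parameters. The actual parsing and merging logic lives in `main.py`; the sketch below (with hypothetical helpers `load_config` and `deep_update`) shows one way such a merge could work using the `tomli` and `ast` modules:

```python
import ast
import tomli

def deep_update(base, new):
    """Recursively merge dict `new` into dict `base`."""
    for key, value in new.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            base[key] = value

def load_config(config_path, overrides=None):
    with open(config_path, 'rb') as f:  # tomli expects a binary file object
        config = tomli.load(f)
    if overrides:                       # e.g. "{'model': {'T': 2}}"
        deep_update(config, ast.literal_eval(overrides))
    return config

config = load_config('./configs/movielens.toml',
                     "{'dataset': {'path': './datasets/ml-1m_clean'}}")
```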
We borrow the MovieLens-1M, Yelp, and Amazon-Book datasets from DiffRec, with the same training/validation/test splits. The datasets are preprocessed and stored in `./datasets`, each containing the following files:

- `train_list.npy`, `valid_list.npy`, `test_list.npy`: lists of user-item interactions in the format `(user_id, item_id)`, for training, validation, and testing, respectively.
- `cache/eigen.npy`: truncated eigendecomposition of the normalized Laplacian matrix. Technically, you can remove this file and let GiffCF compute and cache it on the fly, as sketched below. (Due to the limitation of repository size, we do not include this file for Amazon-Book.)
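For reference, such an eigendecomposition could be computed roughly as follows (a sketch under our own assumptions: the `compute_eigen` helper, the co-occurrence graph construction, and the number of eigenpairs `k` are illustrative, and the codebase may build the item-item graph differently):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Hypothetical sketch: build an item-item similarity graph from the training
# interactions and take a truncated eigendecomposition of its normalized
# Laplacian. The actual graph construction in the codebase may differ.
def compute_eigen(train_list, n_users, n_items, k=200):
    users, items = train_list[:, 0], train_list[:, 1]
    R = sp.csr_matrix((np.ones(len(users)), (users, items)),
                      shape=(n_users, n_items))
    A = (R.T @ R).tocsr()                   # item-item co-occurrence graph
    A.setdiag(0)
    d = np.asarray(A.sum(axis=1)).ravel()
    d_inv_sqrt = np.zeros_like(d)
    np.power(d, -0.5, out=d_inv_sqrt, where=d > 0)
    D = sp.diags(d_inv_sqrt)
    L = sp.eye(n_items) - D @ A @ D         # normalized Laplacian
    # The smallest eigenvalues correspond to the smoothest graph frequencies.
    return eigsh(L, k=k, which='SM')

train_list = np.load('./datasets/ml-1m_clean/train_list.npy')
n_users, n_items = train_list.max(axis=0) + 1
eigvals, eigvecs = compute_eigen(train_list, n_users, n_items)
```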
By default, our script selects the dataset by its name (`movielens`, `yelp`, or `amazon`) and loads it from the pre-defined paths. You can also specify the dataset path in the TOML file:

```toml
[dataset]
path = './datasets/ml-1m_clean'
```

or in the command line: `-p "{'dataset': {'path': './datasets/ml-1m_clean'}}"`.
We provide two sets of checkpoints for all three datasets:

- `./checkpoints/legacy/*.keras`: used in the original paper.
- `./checkpoints/*.keras`: trained with the same hyperparameters but an updated version of the codebase, which yields slightly better results.
The default config files for inference are located in `./configs/inference`. For example, to run inference on the Amazon-Book dataset:

```bash
python main.py evaluate -c ./configs/inference/amazon.toml
```
To reproduce the exact results in the paper, you can specify the checkpoint path like:

```bash
python main.py evaluate -c ./configs/inference/amazon.toml -p "{'model': {'ckpt_path': './checkpoints/legacy/amazon.keras'}}"
```
The recommendation performance on the test set should exactly match the following table:
| Checkpoint | Recall@10 | Recall@20 | NDCG@10 | NDCG@20 | MRR@10 | MRR@20 |
|---|---|---|---|---|---|---|
| legacy/movielens | 0.1251 | 0.1947 | 0.0962 | 0.1221 | 0.1577 | 0.1704 |
| legacy/yelp | 0.0649 | 0.1063 | 0.0397 | 0.0523 | 0.0476 | 0.0527 |
| legacy/amazon | 0.1121 | 0.1528 | 0.0733 | 0.0858 | 0.0795 | 0.0842 |
| movielens | 0.1308 | 0.2024 | 0.1002 | 0.1272 | 0.1632 | 0.1759 |
| yelp | 0.0653 | 0.1063 | 0.0399 | 0.0524 | 0.0479 | 0.0530 |
| amazon | 0.1122 | 0.1534 | 0.0734 | 0.0861 | 0.0796 | 0.0844 |
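For reference, the metrics in this table can be computed per user as in the following self-contained NumPy sketch (our own illustration, not the repository's evaluation code):

```python
import numpy as np

# Illustrative per-user ranking metrics; the repo's evaluation may differ in details.
def metrics_at_k(ranked_items, relevant, k):
    """ranked_items: item ids sorted by predicted score (descending).
    relevant: set of ground-truth test items for this user."""
    hits = np.array([i in relevant for i in ranked_items[:k]], dtype=float)
    recall = hits.sum() / len(relevant)
    # NDCG: discounted cumulative gain, normalized by the ideal ranking.
    dcg = (hits / np.log2(np.arange(2, len(hits) + 2))).sum()
    idcg = (1.0 / np.log2(np.arange(2, min(len(relevant), k) + 2))).sum()
    ndcg = dcg / idcg
    # MRR: reciprocal rank of the first relevant item.
    first_hit = np.flatnonzero(hits)
    mrr = 1.0 / (first_hit[0] + 1) if first_hit.size else 0.0
    return recall, ndcg, mrr

print(metrics_at_k([5, 2, 9, 7], relevant={2, 7}, k=4))  # (1.0, ~0.65, 0.5)
```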
The default config files for training are located in `./configs` and produce the same results as the last three rows of the table above. For example, to train GiffCF on the MovieLens-1M dataset:

```bash
python main.py train -c ./configs/movielens.toml
```
To train on the Yelp dataset with a customized diffusion schedule:

```bash
python main.py train -c ./configs/yelp.toml -p "{'model': {'T': 2, 'alpha': 1.0, 'noise_scale': 0.1}}"
```
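Loosely, `T` sets the number of diffusion steps, while `alpha` and `noise_scale` shape the smoothing schedule and the scale of the injected Gaussian noise. The sketch below is our own toy simplification of such a forward process, not the exact equations from the paper:

```python
import numpy as np

# Toy simplification of a smoothing-based forward process with T steps,
# filter strength alpha, and Gaussian noise scaled by noise_scale.
# The exact schedule and graph filters in GiffCF follow the paper, not this.
def forward_process(x0, smooth, T=2, alpha=1.0, noise_scale=0.1):
    xs = [x0]
    for t in range(1, T + 1):
        s = alpha * t / T                    # smoothing level at step t
        x_t = (1 - s) * x0 + s * smooth(x0)  # interpolate toward the smooth signal
        x_t += noise_scale * np.random.randn(*x0.shape)
        xs.append(x_t)
    return xs

# Usage with a trivial "smoother" that replaces the signal by its mean.
xs = forward_process(np.random.rand(8), smooth=lambda x: np.full_like(x, x.mean()))
```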
Note: While our denoiser architecture is designed to be expressive, it may suffer from common training issues such as local minima or overfitting. If the model underperforms on your own dataset, we recommend trying a different random seed or learning-rate schedule. How to design a more robust training algorithm for GiffCF remains an open question, and we welcome any suggestions or contributions. Good luck!
If you use our code, please kindly cite:
```bibtex
@inproceedings{zhuGraphSignalDiffusion2024,
  title={Graph Signal Diffusion Model for Collaborative Filtering},
  author={Zhu, Yunqin and Wang, Chao and Zhang, Qi and Xiong, Hui},
  booktitle={SIGIR},
  year={2024}
}
```