AHCKA: Efficient and Effective Attributed Hypergraph Clustering via K-Nearest Neighbor Augmentation (SIGMOD 2023)
If you use our code or data, please cite:
@article{LiYS23,
author = {Yiran Li and
Renchi Yang and
Jieming Shi},
title = {Efficient and Effective Attributed Hypergraph Clustering via K-Nearest
Neighbor Augmentation},
journal = {Proc. {ACM} Manag. Data},
volume = {1},
number = {2},
pages = {116:1--116:23},
year = {2023}
}
This repository contains the implementation of the AHCKA algorithm for attributed hypergraph clustering, together with the eight datasets used in our experiments.
Required Python libraries: numpy, scipy, scikit-learn, scann
Install each with: pip install {library_name}
Processed Amazon and MAG-PM datasets, with constructed ScaNN indices, can be downloaded at: https://zenodo.org/records/10035761
Original Amazon and MAG data can be accessed via the following links.
Amazon: https://nijianmo.github.io/amazon/index.html
MAG (OAG v2): https://www.aminer.org/oag2019
To run the AHCKA clustering algorithm on a dataset, specify the type of dataset with the command-line parameter --data and its name with --dataset. Datasets supported by our implementation include the following:
Type of dataset | --data | --dataset |
---|---|---|
Co-authorship hypergraph in .pickle files | coauthorship | cora, dblp |
Co-citation hypergraph in .pickle files | cocitation | cora, citeseer |
Attributed hypergraph stored in .npz file | npz | query, 20news, amazon, magpm |
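The supported combinations in the table above can be captured in a small lookup, which is handy when scripting runs over several datasets. This is only a sketch: `SUPPORTED`, `check_combination`, and `build_command` are hypothetical helper names and are not part of the repository.

```python
# Hypothetical helper (not part of the repository): encodes the
# --data / --dataset combinations listed in the table above.
SUPPORTED = {
    "coauthorship": {"cora", "dblp"},
    "cocitation": {"cora", "citeseer"},
    "npz": {"query", "20news", "amazon", "magpm"},
}


def check_combination(data: str, dataset: str) -> bool:
    """Return True if the dataset name is listed for the given dataset type."""
    return dataset in SUPPORTED.get(data, set())


def build_command(data: str, dataset: str) -> list:
    """Build the argument list for one AHCKA run, rejecting unlisted combinations."""
    if not check_combination(data, dataset):
        raise ValueError(f"unsupported combination: --data {data} --dataset {dataset}")
    return ["python", "AHCKA.py", "--dataset", dataset, "--data", data]


print(build_command("coauthorship", "dblp"))
```

The returned list can be passed to `subprocess.run` as-is if you want to launch the run from Python.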
Other parameters are optional:
Parameter | Default | Description |
---|---|---|
--knn_k | 10 | |
--alpha | 0.2 | |
--beta | 0.5 | |
--tmax | 200 | |
--interval | 5 | |
--scale | - | Apply settings for large-scale data: approximate KNN with ScaNN; simplified initialization ( |
--rd_init | - | Initialize cluster labels at random |
--seeds | 0 | Random seed; takes effect only when --rd_init is set |
--verbose | - | Produce verbose command-line output |
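For readers who want to see how the options above fit together, here is a minimal `argparse` sketch that mirrors the two tables. It is an illustration, not the parser used in AHCKA.py: the argument types (int or float) are inferred from the listed defaults, options whose default is shown as `-` are assumed to be on/off switches, and `--data` and `--dataset` are assumed to be required.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the README tables (assumed types, not the repo's code)."""
    p = argparse.ArgumentParser(description="AHCKA options (illustrative sketch)")
    # Dataset selection; assumed required here.
    p.add_argument("--data", required=True, choices=["coauthorship", "cocitation", "npz"])
    p.add_argument("--dataset", required=True)
    # Optional parameters with the defaults listed in the table.
    p.add_argument("--knn_k", type=int, default=10)
    p.add_argument("--alpha", type=float, default=0.2)
    p.add_argument("--beta", type=float, default=0.5)
    p.add_argument("--tmax", type=int, default=200)
    p.add_argument("--interval", type=int, default=5)
    p.add_argument("--seeds", type=int, default=0)
    # Options shown with default "-" are assumed to be boolean switches.
    p.add_argument("--scale", action="store_true")
    p.add_argument("--rd_init", action="store_true")
    p.add_argument("--verbose", action="store_true")
    return p


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(
    ["--dataset", "amazon", "--data", "npz", "--scale", "--beta", "0.4", "--interval", "1"]
)
print(args.beta, args.interval, args.scale, args.knn_k)
```

Options that are not passed keep the defaults from the table, so `args.knn_k` is 10 in this example.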
To reproduce the results in our paper, please use the following commands for corresponding datasets.
python AHCKA.py --dataset query --data npz
python AHCKA.py --dataset cora --data coauthorship
python AHCKA.py --dataset cora --data cocitation
python AHCKA.py --dataset citeseer --data cocitation
python AHCKA.py --dataset 20news --data npz
python AHCKA.py --dataset dblp --data coauthorship
Note: the following tests require downloading optional large-scale datasets with constructed ScaNN indices.
python AHCKA.py --dataset amazon --data npz --scale --beta 0.4 --interval 1
python AHCKA.py --dataset magpm --data npz --scale --beta 0.4
Sample output of AHCKA on the Cora co-authorship dataset:
Acc=0.651 F1=0.608 NMI=0.462 ARI=0.406 Time=0.470s RAM=221MB
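When collecting results from several runs, a result line of this form can be parsed into a dictionary. The sketch below assumes every run prints `key=value` pairs exactly as in the sample above; `parse_result_line` is a hypothetical helper, and the unit suffixes (`s`, `MB`) are simply dropped.

```python
import re


def parse_result_line(line: str) -> dict:
    """Extract key=value pairs from a result line, dropping unit suffixes such as 's' or 'MB'."""
    metrics = {}
    # Each pair looks like Name=number, optionally followed by a unit suffix.
    for key, value in re.findall(r"(\w+)=(\d+(?:\.\d+)?)", line):
        metrics[key] = float(value)
    return metrics


sample = "Acc=0.651 F1=0.608 NMI=0.462 ARI=0.406 Time=0.470s RAM=221MB"
result = parse_result_line(sample)
print(result["Acc"], result["NMI"], result["RAM"])
```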