[SIGMOD2025] An Efficient and Exact Algorithm for Locally h-Clique Densest Subgraph Discovery
Detecting locally non-overlapping, near-clique densest subgraphs is essential for community search in social networks. Our work introduces an efficient and exact algorithm to tackle this challenge by identifying the top-k non-overlapping, locally h-clique densest subgraphs (LhCDS).
Key highlights of our approach:
- Iterative Propose-Prune-and-Verify (IPPV) Pipeline (see the control-flow sketch after this list):
- Propose: Generate candidates using convex programming without missing any dense substructures.
- Prune: Handle overlapping cliques with an advanced graph decomposition technique.
- Verify: Use basic and fast verification methods to confirm LhCDS candidates, leveraging a flow network to enhance speed.
- Generalization: Our approach extends to broader densest subgraph detection problems while ensuring exactness and low computational complexity.
- Extensive experiments on real datasets demonstrate the effectiveness and efficiency of our IPPV algorithm.
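For orientation, the sketch below shows only the overall control flow of the IPPV pipeline. Every name in it (Subgraph, propose_candidates, prune_overlaps, verify_lhcds) is a hypothetical placeholder, not the repository's API; the real Propose, Prune, and Verify stages rely on convex programming, graph decomposition, and a flow network as described above.

```cpp
// Illustrative skeleton of the Iterative Propose-Prune-and-Verify (IPPV) loop.
// All names below are hypothetical placeholders, not the repository's API.
#include <vector>

struct Subgraph {
    std::vector<int> vertices;  // vertex ids of a candidate subgraph
    double density = 0.0;       // its h-clique density
};

// Placeholder stage stubs: the real stages use convex programming (Propose),
// graph decomposition (Prune), and flow-based verification (Verify).
std::vector<Subgraph> propose_candidates() { return {}; }
std::vector<Subgraph> prune_overlaps(const std::vector<Subgraph>& cands,
                                     const std::vector<Subgraph>& /*accepted*/) { return cands; }
bool verify_lhcds(const Subgraph&) { return false; }

// Iterate Propose -> Prune -> Verify until k locally h-clique densest
// subgraphs have been confirmed or no candidates remain.
std::vector<Subgraph> ippv(int k) {
    std::vector<Subgraph> results;
    while ((int)results.size() < k) {
        auto candidates = propose_candidates();             // Propose
        if (candidates.empty()) break;                       // nothing left to examine
        candidates = prune_overlaps(candidates, results);    // Prune
        for (const auto& c : candidates) {                   // Verify
            if ((int)results.size() >= k) break;
            if (verify_lhcds(c)) results.push_back(c);
        }
    }
    return results;
}

int main() { return (int)ippv(20).size(); }
```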
The figure on the left shows the relationships between a subset of characters in “Harry Potter”. The top-1 and top-2 L3CDSes are the blue and green subgraphs, respectively. The top-1 L3CDS is the Weasley family, and the top-2 L3CDS is the Death Eaters organization, which indicates the potential of LhCDS discovery for mining diverse dense communities.
The input data format is: the first row is nodes_num edges_num, and each following row src_node dst_node represents an edge. Node ids start from 0, e.g.
23133 93439
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
0 9
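As a small illustration of this format (not the repository's loader), the sketch below reads such a file into a 0-indexed adjacency list; the file path reuses the dataset from the usage example further down and is otherwise arbitrary.

```cpp
// Minimal example reader for the edge-list format described above.
// Illustrative sketch only, not the repository's actual input code.
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    std::ifstream in("./dataset/CA-CondMat.txt");  // any file in the format above
    if (!in) { std::cerr << "cannot open input file\n"; return 1; }

    long long n = 0, m = 0;
    in >> n >> m;                                  // first row: nodes_num edges_num

    std::vector<std::vector<int>> adj(n);          // nodes are 0-indexed
    for (long long i = 0; i < m; ++i) {
        int u, v;
        in >> u >> v;                              // each following row: src_node dst_node
        adj[u].push_back(v);
        adj[v].push_back(u);                       // assume an undirected graph
    }
    std::cout << "loaded " << n << " nodes and " << m << " edges\n";
    return 0;
}
```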
Name | Abbr. | \|V\| | \|E\| | # 3-cliques | # 5-cliques |
---|---|---|---|---|---|
soc-hamsterster | HA | 2,426 | 16,630 | 53,251 | 298,013 |
CA-GrQc | GQ | 5,242 | 14,484 | 48,260 | 2,215,500 |
fb-pages-politician | PP | 5,908 | 41,706 | 174,632 | 2,002,250 |
fb-pages-company | PC | 14,113 | 52,126 | 56,005 | 207,829 |
web-webbase-2001 | WB | 16,062 | 25,593 | 21,115 | 382,674 |
CA-CondMat | CM | 23,133 | 93,439 | 173,361 | 511,088 |
soc-epinions | EP | 26,588 | 100,120 | 159,700 | 521,106 |
Email-Enron | EN | 36,692 | 183,831 | 727,044 | 5,809,356 |
loc-gowalla | GW | 196,591 | 950,327 | 2,273,138 | 14,570,875 |
DBLP | DB | 317,080 | 1,049,866 | 2,224,385 | 262,663,639 |
Amazon | AM | 334,863 | 925,872 | 667,129 | 61,551 |
soc-youtube | YT | 495,957 | 1,936,748 | 2,443,886 | 5,306,643 |
soc-lastfm | LF | 1,191,805 | 4,519,330 | 3,946,207 | 10,404,656 |
soc-flixster | FX | 2,523,386 | 7,918,801 | 7,897,122 | 96,315,278 |
soc-wiki-talk | WT | 2,394,385 | 4,659,565 | 9,203,519 | 382,777,822 |
All the datasets can be downloaded from the Network Repository.
mkdir build
cd build/
cmake ..
make
Examples:
./LhCDScvx --args -g ./dataset/CA-CondMat.txt -h 3 -t 10 -k 20 -p 0
Here h means h-clique, t is the number of iterations, k means top-k LhCDS, and p is the pattern.
The following are the regular subgraph models we support.
// Select which pattern instances to enumerate according to the -p argument.
if (p == 0)
{
    h_cliques = get_kcliques(adj, v2cliques, k, selected);         // h-clique pattern
}
else if (p == 1)
{
    h_cliques = get_kloops(adj, v2cliques, k, selected);           // k-loop pattern
}
else if (p == 2)
{
    h_cliques = get_doubletriangle(adj, v2cliques, k, selected);   // double-triangle pattern
}
else if (p == 3)
{
    h_cliques = get_tritriangle(adj, v2cliques, k, selected);      // tri-triangle pattern
}
else if (p == 4)
{
    h_cliques = get_kstar(adj, v2cliques, k - 1, selected);        // k-star pattern (uses k - 1)
}
else if (p == 5)
{
    h_cliques = get_ctriangle(adj, v2cliques, k - 1, selected);    // c-triangle pattern (uses k - 1)
}
// Assign each enumerated pattern instance a sequential index.
for (int i = 0; i < h_cliques.size(); i++) {
    h_cliques_no.push_back(i);
}
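For a concrete feel of what the pattern-extraction step enumerates, below is a minimal, self-contained 3-clique (triangle) listing example on a tiny hard-coded graph. It only illustrates the h = 3 case and is not the repository's get_kcliques, which handles general h and the additional bookkeeping (e.g., v2cliques) used by the algorithm.

```cpp
// Illustrative triangle (3-clique) enumeration on a small undirected graph.
// Not the repository's get_kcliques; this only shows the basic idea for h = 3.
#include <array>
#include <iostream>
#include <utility>
#include <vector>

int main() {
    int n = 5;
    // Edge list with u < v for every edge.
    std::vector<std::pair<int, int>> edges = {{0, 1}, {0, 2}, {1, 2}, {1, 3}, {2, 3}, {3, 4}};

    std::vector<std::vector<bool>> has(n, std::vector<bool>(n, false));
    std::vector<std::vector<int>> adj(n);
    for (auto [u, v] : edges) {
        has[u][v] = has[v][u] = true;
        adj[u].push_back(v);
        adj[v].push_back(u);
    }

    std::vector<std::array<int, 3>> triangles;
    for (auto [u, v] : edges) {
        // For each edge (u, v) with u < v, look for a common neighbor w with
        // w > v, so every triangle is reported exactly once.
        for (int w : adj[u])
            if (w > v && has[v][w])
                triangles.push_back({u, v, w});
    }

    std::cout << "found " << triangles.size() << " triangles\n";
    for (const auto& t : triangles)
        std::cout << t[0] << " " << t[1] << " " << t[2] << "\n";
    return 0;
}
```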
Execute the following shell script to obtain the results of this experiment (note: exact running times may vary depending on the hardware).
k_list=(5 10 15 20)
h_list=(3 4 5)
v_list=(1 0)
t_list=(20)
p=0
data_list=("fb-pages-company" "soc-hamsterster" "soc-epinions" "Email-Enron" "loc-gowalla" "CA-CondMat" "CA-GrQc" "Amazon")
for dataset in ${data_list[@]}
do
if [ -z "$dataset" ]; then
echo "Error: The variable 'dataset' is not set."
exit 1
fi
# Create the directory if it does not exist
mkdir -p "./output/$dataset"
for k in ${k_list[@]}
do
for h in ${h_list[@]}
do
for t in ${t_list[@]}
do
for v in ${v_list[@]}
do
/usr/bin/time -v ./LhCDScvx -g ./dataset/${dataset}.txt -h $h -k $k -t $t -p $p -v $v >& ./output/$dataset/${dataset}_LhCDScvx_k=${k}_h=${h}_t=${t}_p=${p}_v=${v}.txt
done
done
done
done
done
The code of LDSflow comes from the author of "Locally Densest Subgraph Discovery".
@article{xu2024efficient,
author = {Xiaojia Xu and Haoyu Liu and Xiaowei Lv and Yongcai Wang and Deying Li},
title = {An Efficient and Exact Algorithm for Locally h-Clique Densest Subgraph Discovery},
journal = {Proc. ACM Manag. Data},
volume = {2},
number = {N6},
articleno = {225},
year = {2024},
month = {December},
numpages = {26},
doi = {10.1145/3698800}
}