GitHub - NoakLiu/GraphSnapShot: GraphSnapShot: Caching Local Structure for Fast Graph Learning

GraphSnapShot

Up to 30% training acceleration and 73% memory reduction for lossless graph ML training with GraphSnapShot!

GraphSnapShot is a framework for caching local structure for fast graph learning, it can achieve fast storage, retrieval and computation for graph learning at large scale. It can quickly store and update the local topology of graph structure, just like take snapshots of the graphs.

3 kinds of system design

1. proto - implemented by torch
2. dglsampler-simple - implemented with baseline of MultiLayerSampler in dgl
3. dglsampler - implemented with baseline of MultiLayerSampler in dgl

For dglsampler re-design 3 system design strategies

FBL: full batch load
OTF: partial cache refresh (on the fly) snapshot
FCR: full cache refresh snapshot

In detailed:

1. All methods in OTF and FCR has two modes: independent cache and shared cache
2. OTF has 4 methods, which are the combination of (full refresh, partial refresh) x (full fetch, partial fetch)

Deployment:

FBL implementation is same as the MultiLayerSampler implemented in dgl.

To deploy GraphSnapShot, Samplers in SSDReS_Sampler by cd SSDReS_Samplers, and then find the following file

NeighborSampler_OTF_struct.py
NeighborSampler_OTF_nodes.py
NeighborSampler_FCR_struct.py
NeighborSampler_FCR_nodes.py

The sampler code can be found at

vim ~/anaconda3/envs/dglsampler/lib/python3.9/site-packages/dgl/sampling/neighbor.py

Add samplers code in SSDReS_Sampler into the neighbor_sampler.py in dgl as in the path above and save the changes.

vim ~/anaconda3/envs/dglsampler/lib/python3.9/site-packages/dgl/dataloading/neighbor_sampler.py

Then you can deploy OTF and FCR samplers at node-level and struct-level from neighbor_sampler and create objects of those samplers.

FBL in execution

https://docs.dgl.ai/en/0.8.x/_modules/dgl/sampling/neighbor.html#sample_neighbors
https://docs.dgl.ai/en/0.9.x/generated/dgl.dataloading.NeighborSampler.html

FCR in execution

FCR_recording.mp4

OTF in execution

OTF_recording.mp4

FCR-SC, OTF-SC, FBL comparison (Note: SC is short for shared cache)

cached_comparison_recording_.mp4

Two types of samplers:

node-level: split graph into graph_static and graph_dynamic, enhance the capability for CPU-GPU co-utilization.
structure-level: reduce the inefficiency of resample whole k-hop structure for each node, use static-presample and dynamic-resample for structure retrieval acceleration.

Downsteam Task:

MultiLayer GCN - ogbn_arxiv / ogbn_products (homo)
MultiLayer SGC - ogbn_arxiv / ogbn_products (homo)
MultiLayer GraphSAGE - ogbn_arxiv / ogbn_products (homo)
MultiLayer RelGraphConv - ogbn_mag (hete)

Datasets:

ogbn_arxiv - node classification (homo)
ogbn_products - node classification (homo)
ogbn_mag - node classification (hete)

Feature	OGBN-ARXIV	OGBN-PRODUCTS	OGBN-MAG
Dataset Type	Citation Network	Product Purchase Network	Microsoft Academic Graph
Number of Nodes	17,735	24,019	132,534
Number of Edges	116,624	123,006	1,116,428
Feature Dimension	128	100	50
Number of Classes	40	89	112
Number of Train Nodes	9,500	12,000	41,351
Number of Validation Nodes	3,500	2,000	10,000
Number of Test Nodes	4,735	10,000	80,183
Supervised Task	Node Classification	Node Classification	Node Classification

Design of FBL

Design of OTF

Design of FCR

An Example of Memory Reduction for Original Graph

Graph(num_nodes=169343, num_edges=1166243,
      ndata_schemes={'year': Scheme(shape=(1,), dtype=torch.int64), 'feat': Scheme(shape=(128,), dtype=torch.float32)}
      edata_schemes={})

Dense Filter -->Dense Graph (degree>30)

Graph(num_nodes=5982, num_edges=65847,
      ndata_schemes={'year': Scheme(shape=(1,), dtype=torch.int64), 'feat': Scheme(shape=(128,), dtype=torch.float32), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})

Cached Graph (cached for FCRSampler)

Graph(num_nodes=5982, num_edges=30048,
      ndata_schemes={'year': Scheme(shape=(1,), dtype=torch.int64), 'feat': Scheme(shape=(128,), dtype=torch.float32), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})

Overall, achieve a memory reduction from num_edges=65847 to num_nodes=5982, num_edges=30048, edges memory reduction by by 45.6%

Dense Graph GraphSnapShot Cache for SSDReS_Samplers

Methods

For sparse graphs, FBL method will be directedly deployed
For dense graphs, SSDReS methods will be deployed

SSDReS Samplers

dgl samplers
- hete
  - FCR_hete
  - FCR_SC_hete
  - OTF((PR, FR)x(PF, FF))_hete
  - OTF((PR, FR)x(PF, FF))_SC_hete
- homo
  - FCR
  - FCR_SC
  - OTF((PR, FR)x(PF, FF))
  - OTF((PR, FR)x(PF, FF))_SC
dgl samplers simple
- hete
  - FCR_hete
  - FCR_SC_hete
  - OTF_hete
  - OTF_SC_hete
- homo
  - FCR
  - FCR_SC
  - OTF
  - OTF_SC

Deployment Sequence

For homograph
- 1. python div_graph_by_deg_homo.py --> dense graph, sparse graph
- 1. deploy homo SSDReS samplers such as FCR, FCR-SC, OTF((PR, FR)x(PF, FF)), OTF((PR, FR)x(PF, FF))-SC on dense graph
- 1. deploy FBL on sparse graph
For hetegraph
- 1. python div_graph_by_deg_hete.py --> dense graph, sparse graph
- 1. deploy homo SSDReS samplers such as FCR_hete, FCR-SC_hete, OTF((PR, FR)x(PF, FF))_hete, OTF((PR, FR)x(PF, FF))-SC_hete on dense graph
- 1. deploy FBL on sparse graph

Figures for Runtime Memory reduction

Figures for Memory reduction

Figures for GPU reduction

Analysis

The key point of GraphSnapShot is to cache the local structure instead of whole graph input for memory reduction and sampling efficiency.

Deployment on homo-graphs Import

from dgl.dataloading import (
    DataLoader,
    MultiLayerFullNeighborSampler,
    NeighborSampler,
    MultiLayerNeighborSampler,
    BlockSampler,
    NeighborSampler_FCR_struct,
    NeighborSampler_FCR_struct_shared_cache,
    NeighborSampler_OTF_struct_FSCRFCF,
    NeighborSampler_OTF_struct_FSCRFCF_shared_cache,
    NeighborSampler_OTF_struct_PCFFSCR_shared_cache,
    NeighborSampler_OTF_struct_PCFFSCR,
    NeighborSampler_OTF_struct_PCFPSCR_SC,
    NeighborSampler_OTF_struct_PCFPSCR,
    NeighborSampler_OTF_struct_PSCRFCF_SC,
    NeighborSampler_OTF_struct_PSCRFCF,
    # NeighborSampler_OTF_struct,
    # NeighborSampler_OTF_struct_shared_cache

)

Method Explanation - homo

NeighborSampler_FCR_struct: Fully Cache Refresh, with each hop has unique cached frontier
NeighborSampler_FCR_struct_shared_cache: Fully Cache Refresh with Shared Cache, with all hop has shared cached frontier
NeighborSampler_OTF_struct_FSCRFCF: 
NeighborSampler_OTF_struct_FSCRFCF_shared_cache: 
NeighborSampler_OTF_struct_PCFFSCR: 
NeighborSampler_OTF_struct_PCFFSCR_shared_cache: 
NeighborSampler_OTF_struct_PCFPSCR: 
NeighborSampler_OTF_struct_PCFPSCR_SC: 
NeighborSampler_OTF_struct_PSCRFCF: 
NeighborSampler_OTF_struct_PSCRFCF_SC:

Deployment

# FBL
sampler = NeighborSampler(
    [5, 5, 5],  # fanout for [layer-0, layer-1, layer-2]
    prefetch_node_feats=["feat"],
    prefetch_labels=["label"],
    fused=fused_sampling,
)

# FCR
sampler = NeighborSampler_FCR_struct(
    g=g,
    fanouts=[5,5,5],  # fanout for [layer-0, layer-1, layer-2] [2,2,2]
    alpha=1.5, T=50,
    prefetch_node_feats=["feat"],
    prefetch_labels=["label"],
    fused=fused_sampling,
)


# FCR shared cache
sampler = NeighborSampler_FCR_struct_shared_cache(
    g=g,
    fanouts=[5,5,5],  # fanout for [layer-0, layer-1, layer-2] [2,2,2]
    alpha=1.5, T=50,
    prefetch_node_feats=["feat"],
    prefetch_labels=["label"],
    fused=fused_sampling,
)    

# OTF
sampler = NeighborSampler_OTF_struct_FSCRFCF(
    g=g,
    fanouts=[5,5,5],  # fanout for [layer-0, layer-1, layer-2] [4,4,4]
    amp_rate=2, refresh_rate=0.3, T=50, #3, 0.4
    prefetch_node_feats=["feat"],
    prefetch_labels=["label"],
    fused=fused_sampling,
)

# OTF shared cache
sampler = NeighborSampler_OTF_struct_FSCRFCF_shared_cache(
    g=g,
    fanouts=[5,5,5],  # fanout for [layer-0, layer-1, layer-2] [2,2,2]
    # alpha=2, beta=1, gamma=0.15, T=119,
    amp_rate=2, refresh_rate=0.3, T=50,
    prefetch_node_feats=["feat"],
    prefetch_labels=["label"],
    fused=fused_sampling,
)

# OTF FSCR FCF shared cache
sampler = NeighborSampler_OTF_struct_PCFFSCR_shared_cache(
    g=g,
    fanouts=[5,5,5],
    amp_rate=2,fetch_rate=0.3,T_fetch=10
)

# OTF FSCR FCF
sampler = NeighborSampler_OTF_struct_PCFFSCR(
    g=g,
    fanouts=[5,5,5],
    amp_rate=2,fetch_rate=0.3,T_fetch=10
)

# PCF PSCR SC
sampler = NeighborSampler_OTF_struct_PCFPSCR_SC(
    g=g,
    fanouts=[5,5,5],
    amp_rate=2,refresh_rate=0.3,T=10
)

# PCF PSCR
sampler = NeighborSampler_OTF_struct_PCFPSCR(
    g=g,
    fanouts=[5,5,5],
    amp_rate=2,refresh_rate=0.3,T=50
)

# PSCR FCF SC
sampler = NeighborSampler_OTF_struct_PSCRFCF_SC(
    g=g,
    fanouts=[5,5,5],
    amp_rate=2, refresh_rate=0.3, T=50
)

Deployment on hete-graphs Import

from dgl.dataloading import(
    MultiLayerNeighborSampler,
    DataLoader,
    MultiLayerFullNeighborSampler,
    NeighborSampler_FCR_struct_hete,
    NeighborSampler_FCR_struct_shared_cache_hete,
    NeighborSampler_OTF_refresh_struct_hete,
    NeighborSampler_OTF_refresh_struct_shared_cache_hete,
    NeighborSampler_OTF_fetch_struct_hete,
    NeighborSampler_OTF_fetch_struct_shared_cache_hete,
    NeighborSampler_OTF_struct_PCFPSCR_hete,
    NeighborSampler_OTF_struct_PCFPSCR_shared_cache_hete,
    NeighborSampler_OTF_struct_PSCRFCF_hete,
    NeighborSampler_OTF_struct_PSCRFCF_shared_cache_hete,

    # modify_NeighborSampler_OTF_refresh_struct_shared_cache_hete
    # FCR_hete,
)

Method Explanation

NeighborSampler_FCR_struct_hete: 
NeighborSampler_FCR_struct_shared_cache_hete
NeighborSampler_OTF_refresh_struct_hete
NeighborSampler_OTF_refresh_struct_shared_cache_hete
NeighborSampler_OTF_fetch_struct_hete
NeighborSampler_OTF_fetch_struct_shared_cache_hete
NeighborSampler_OTF_struct_PCFPSCR_hete
NeighborSampler_OTF_struct_PCFPSCR_shared_cache_hete
NeighborSampler_OTF_struct_PSCRFCF_hete
NeighborSampler_OTF_struct_PSCRFCF_shared_cache_hete

Deployment

sampler = NeighborSampler_FCR_struct_hete(
    g=g,
    fanouts=[25,20],  # fanout for [layer-0, layer-1, layer-2] [4,4,4]
    alpha=2, T=10, # 800
    hete_label="paper",
    # prefetch_node_feats=["feat"],
    # prefetch_labels=["label"],
    # fused=True,
)

sampler = NeighborSampler_FCR_struct_shared_cache_hete(
    g=g,
    fanouts=[25,20],  # fanout for [layer-0, layer-1, layer-2] [4,4,4]
    alpha=2, T=10, # 800
    hete_label="paper",
)

sampler = NeighborSampler_OTF_refresh_struct_hete(
    g=g,
    fanouts=[25,20],
    alpha=2,
    T=20,
    refresh_rate=0.4,
    hete_label="paper"
)

sampler = NeighborSampler_OTF_refresh_struct_shared_cache_hete(
    g=g,
    fanouts=[25,20],
    alpha=2,
    T=20,
    refresh_rate=0.4,
    hete_label="paper"
)

sampler = NeighborSampler_OTF_fetch_struct_hete(
    g=g,
    fanouts=[25,20],
    amp_rate=1.5,
    T_refresh=20,
    T_fetch=3,
    hete_label="paper"
)

sampler = NeighborSampler_OTF_fetch_struct_shared_cache_hete(
    g=g,
    fanouts=[25,20],
    amp_rate=1.5,
    fetch_rate=0.4,
    T_fetch=3,
    T_refresh=20,
    hete_label="paper"
)

sampler = modify_NeighborSampler_OTF_refresh_struct_shared_cache_hete(
    g=g,
    fanouts=[25,20],
    amp_rate=1.5,
    fetch_rate=0.4,
    T_fetch=3,
    T_refresh=20,
)

sampler = NeighborSampler_OTF_struct_PCFPSCR_hete(
    g=g,
    fanouts=[25,20],
    amp_rate=1.5,
    refresh_rate=0.4,
    T=50,
    hete_label="paper",
)

sampler = NeighborSampler_OTF_struct_PCFPSCR_shared_cache_hete(
    g=g,
    fanouts=[25,20],
    amp_rate=1.5,
    refresh_rate=0.4,
    T=50,
    hete_label="paper",
)

sampler = NeighborSampler_OTF_struct_PSCRFCF_hete(
    g=g,
    fanouts=[25,20],
    amp_rate=1.5,
    refresh_rate=0.4,
    T=50,
    hete_label="paper",
)

sampler = NeighborSampler_OTF_struct_PSCRFCF_shared_cache_hete(
    g=g,
    fanouts=[25,20],
    amp_rate=1.5,
    refresh_rate=0.4,
    T=50,
    hete_label="paper",
)

Citation

If you find GraphSnapShot useful or relevant to your project and research, please kindly cite our paper:

@article{liu2024graphsnapshotgraphmachinelearning,
      title={GraphSnapShot: Graph Machine Learning Acceleration with Fast Storage and Retrieval}, 
      author={Dong Liu and Roger Waleffe and Meng Jiang and Shivaram Venkataraman},
      year={2024},
      eprint={2406.17918},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2406.17918}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 241 Commits
.idea		.idea
BufferManager_Simulation		BufferManager_Simulation
Disk_Mem_Simulation		Disk_Mem_Simulation
IOCostOptimizer_Simulation		IOCostOptimizer_Simulation
SSDReS_Samplers		SSDReS_Samplers
__pycache__		__pycache__
assets		assets
downstream_tasks		downstream_tasks
examples		examples
memory_detection		memory_detection
pe_tests		pe_tests
recordings		recordings
results		results
sampler_tests		sampler_tests
.DS_Store		.DS_Store
.gitignore		.gitignore
GraphSnapShot.pdf		GraphSnapShot.pdf
README.md		README.md
ogbn-arxiv_degree_distributions.png		ogbn-arxiv_degree_distributions.png
ogbn-mag_degree_distributions_with_thresholds.png		ogbn-mag_degree_distributions_with_thresholds.png
ogbn-products_degree_distributions.png		ogbn-products_degree_distributions.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GraphSnapShot

Methods

SSDReS Samplers

Deployment Sequence

Figures for Runtime Memory reduction

Figures for Memory reduction

Figures for GPU reduction

Analysis

Citation

About

Releases

Packages

Languages

NoakLiu/GraphSnapShot

Folders and files

Latest commit

History

Repository files navigation

GraphSnapShot

Methods

SSDReS Samplers

Deployment Sequence

Figures for Runtime Memory reduction

Figures for Memory reduction

Figures for GPU reduction

Analysis

Citation

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages