
πŸ“¦ Deep Learning Module

This repository hosts the Deep Learning modules used to train Graph Representation Self-Supervised Learning (GRSSL) methods and compute graph embeddings.

The general deep learning modules rely on the PyTorch [3] and DGL [4] libraries.

The data used to (pre)train the graph representation methods come from Netzschleuder, hosted in the graph-tool Python library [5].

Table of Contents

  0. 🪩 Features
  1. ⚡️ Quick start
    1. Installation
    2. Train a model
  2. 🌻 Graph Neural Networks
    1. Self-Supervised models
    2. Encoding Layers
  3. 🧳 Graph Datasets
  4. ⚙️ Implementation
  5. 🫧 Usage
    1. Configurations
    2. Results
    3. Visualisation
  6. 🌵 Folder Structure
  7. 📚 References
    1. Bibliography
    2. Acknowledgements

The Graph Neural Networks section covers the theoretical grounding of the methods whose implementations are hosted in this repository. Details regarding the Graph Datasets used and the Implementation are described afterwards. Detailed instructions on how to use the code and/or customise it to one's needs are given in the Usage section.

πŸͺ© 0. Features

This repository can be used to:

  • train GRL models;
  • evaluate performances of the models on selected tasks;
  • visualise the embeddings produced by the models.

⚑️ 1. Quick start

1.a Installation

```sh
pip install -r requirements.txt
conda install graph-tool
```

1.b Train a model

```sh
python train.py --c <config_file_path>.json
```

You can then visualise the learnt graph representations with:

```sh
python viz.py --c <config_file_path>.json --n
```

> **Note:** You can also launch the companion notebook 🔗 for a step-by-step tutorial.


🌻 2. Graph Neural Networks

πŸ”

2.a Self-Supervised models

Self-supervised learning is an attractive training paradigm in the big data era. It aims at mitigating the over-dependence of DL models on labeled data by devising training procedures from pretext tasks that don't require labels.

This repository hosts the implementation of two state-of-the-art self-supervised GRL models: GraphMAE [1] and PGCL [2]. These two models stem from the two main subdomains of SSL: predictive and contrastive learning.

On the one hand, the Graph Masked Auto-Encoder (GraphMAE) is a generative method trained to reconstruct node features that are masked (hidden) during training.

On the other hand, Prototypical Graph Contrastive Learning (PGCL) is a contrastive learning model, notably devised to alleviate the sampling bias that harms contrastive methods.
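To make the distinction concrete, here is a toy sketch of the masking idea behind generative methods like GraphMAE (not the paper's exact scheme; a zero vector stands in for GraphMAE's learnable [MASK] token):

```python
import torch

feat = torch.randn(100, 32)   # node features: 100 nodes of dimension 32
mask = torch.rand(100) < 0.5  # sample ~50% of the nodes to hide
corrupted = feat.clone()
corrupted[mask] = 0.0         # zero stand-in for a learnable [MASK] token
# an encoder/decoder pair is then trained to reconstruct feat[mask]
```

And a minimal InfoNCE-style objective illustrating the contrastive side (PGCL's actual prototype-based loss differs):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.5):
    """Toy contrastive loss: matching rows of z1/z2 (two views of the same graphs) are positives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau              # pairwise cosine similarities
    targets = torch.arange(z1.size(0))      # positive pairs lie on the diagonal
    return F.cross_entropy(logits, targets)
```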

2.b Encoding Layers

Most GRL models (including GraphMAE and PGCL) rely on Graph Neural Networks (GNNs): adaptations of neural networks designed to handle graph-structured data. GNN layers are the backbone of these models and can often be used interchangeably.

GNNs implement message-passing schemes that iteratively aggregate and combine the information of nodes and their neighbours. They ultimately produce node states over a graph with the same topology as the input graph.

Three of the most popular GNN instances are:

  • Graph Attention layer (GAT)
  • Graph Convolutional Network (GCN)
  • Graph Isomorphism Network (GIN)

A pooling layer can then be used to obtain a whole-graph representation from the node states, as sketched below.
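As a minimal sketch of this encode-then-pool pattern (using DGL's built-in GraphConv and AvgPooling, not this repository's actual Encoder class):

```python
import torch
import torch.nn as nn
from dgl.nn import GraphConv, AvgPooling

class ToyGNNEncoder(nn.Module):
    """Two message-passing layers followed by mean pooling."""
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1 = GraphConv(in_dim, hidden_dim)
        self.conv2 = GraphConv(hidden_dim, out_dim)
        self.pool = AvgPooling()  # node states -> one vector per graph

    def forward(self, g, feat):
        h = torch.relu(self.conv1(g, feat))  # aggregate + combine neighbour information
        h = torch.relu(self.conv2(g, h))     # node states keep the input graph's topology
        return self.pool(g, h)               # whole-graph representation
```

(When trying this on raw graphs, adding self-loops with `dgl.add_self_loop` avoids GraphConv's zero-in-degree error.)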

🧳 3. Graph Datasets

πŸ”

The main goal of this repository is to enable the implementation of self-supervised methods. Those are meant to be general enough to tackle various tasks while still achieving high performance.

With this in mind, a dataset containing graphs from various domains and of different sizes is designed to train the models. The models are then evaluated on standard graph classification benchmarks.

3.a Training

The training dataset is based on the Netzschleuder catalogue, which references graphs hosted by the graph-tool library [5]. The training dataset is obtained by setting a few constraints on the types of graphs (number of edges, balance, etc.). Thanks to the graph-tool implementation, the data can easily be loaded within Python pipelines and used to train the models.
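For instance, a single catalogue entry can be fetched through graph-tool's Netzschleuder interface (the dataset name here is only an example; the repository's actual loading and filtering logic lives in DataLoader/):

```python
import graph_tool.all as gt

g = gt.collection.ns["dolphins"]         # download + parse a Netzschleuder entry
print(g.num_vertices(), g.num_edges())   # size statistics, e.g. for filtering
```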

3.b Evaluation

The evaluation of the trained models is performed on the most common benchmark in the literature: graph classification on a subset of datasets from the TUDataset [6].

βš™οΈ 4. Implementation

πŸ”

The proposed implementation is flexible and provides ways to implement new models based on general modules. The Deep Learning pipeline is composed of 4 main components meant to be customisable and adaptable to different frameworks. Each module can be modified or instantiated differently to produce different models.

These four components load graph datasets, implement models and their backbone encoders, and train the models (a hypothetical wiring sketch follows the list):

  • GraphDataset (see code)
  • GRSSLModel (see code)
    • Encoder (see code)
  • Trainer (see code)
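A hypothetical sketch of how these components chain together (import paths and constructor signatures are illustrative only; see the linked source files for the actual APIs):

```python
# Hypothetical import paths and signatures, inferred from the folder structure.
from DataLoader.data_loader import GraphDataset
from Models.encoders import Encoder
from Models.model_grssl import GRSSLModel
from Trainers.trainer import Trainer

dataset = GraphDataset(config["data"])        # load the training graphs
encoder = Encoder(config["encoder"])          # backbone GNN (e.g. GAT/GCN/GIN)
model = GRSSLModel(encoder, config["model"])  # self-supervised wrapper (e.g. GraphMAE)
trainer = Trainer(model, dataset, config["training"])
trainer.train()                               # run the self-supervised pretext task
```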

🫧 5. Usage

πŸ”

The whole pipeline is illustrated in a notebook chaining the different steps of this project.

Models can be trained using the following command line:

```sh
python train.py --c <config_file_path>.json
```

The training and results can be replicated by running the .sh script:

```sh
./repro_training.sh <config_file>
```

This script will train 5 models instantiated from the same configuration file but with different seeds. Each model is stored before training as model_untrained.pth, and the checkpoint from the epoch achieving the lowest loss as model_best.pth.

The performances of each of the 5 models are then assessed (both before and after training) on a selected set of TU datasets. The results are stored in a .csv file, together with the models.

5.a Configurations

🫧

The configurations of the models are provided as .json files containing all the required parameters and hyperparameters, from the loading of the training dataset to the model's architecture details and the training scheme to follow.

See example configuration files in this folder πŸ”—.
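For orientation, a hypothetical configuration skeleton might look as follows (all key names are illustrative, not the real schema; the actual keys are defined by Configs/configs_parser.py and the example files linked above):

```python
import json

# Hypothetical skeleton only; see Configs/config_files/ for the real schema.
config = {
    "name": "graphmae_run",
    "data": {"min_edges": 100, "max_edges": 100_000},            # dataset constraints
    "encoder": {"type": "GIN", "hidden_dim": 256, "layers": 2},  # backbone GNN
    "training": {"epochs": 60, "lr": 1e-3, "seed": 0},           # optimisation scheme
}
with open("my_config.json", "w") as f:
    json.dump(config, f, indent=2)  # ready to pass to train.py via --c
```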

5.b Results

🫧

Results obtained through this pipeline are described below.

Expected runtimes

First note that all computations were undertaken on a MacBook Pro (M1, 2020).

The expected runtimes for the training are shown in the following table:

| Model type | Configuration | Average training time | Av. epoch time |
| --- | --- | --- | --- |
| GraphMAE | config link | 02:23:14 | 00:01:54 |
| PGCL | config link | 07:01:16 | 00:04:49 |

Classification performances

The performances of the models are assessed on several graph datasets selected from the TU datasets. These datasets are widely used in the literature and provide a common basis for model comparison. The performances are computed by appending an SVC classifier on top of the representations computed by the different models.

The outcomes of the performance assessment are compiled in the table below. The average micro-F1 score over 10-fold classification is reported for each dataset and each model; a hedged sketch of the evaluation protocol follows.
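A minimal sketch of this protocol, with random arrays standing in for the model's embeddings and the dataset labels:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# placeholders: in practice, one vector per graph from a trained GRSSL encoder
embeddings = np.random.randn(200, 64)
labels = np.random.randint(0, 2, size=200)  # binary graph classes

scores = cross_val_score(SVC(), embeddings, labels, cv=10, scoring="f1_micro")
print(f"micro-F1: {scores.mean():.3f} (+/- {scores.std():.3f})")
```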

Instructions to run tests

The evaluation procedure is implemented in Utils/performance_assessment.py. It can be run with the following command line:

```sh
python Utils/performance_assessment.py \
    -s <PATH_TO_SAVED_MODEL_FOLDER> \
    -r <RUN_ID_0> <RUN_ID_1> ... \
    -d <DATASET_0> <DATASET_1> ...
```

Several run IDs and datasets can be given at once.

This script will store the results in a .csv file at the level of <PATH_TO_SAVED_MODEL_FOLDER>.

| Model (config) | Custom (Training) | REDDIT-BINARY | COLLAB | IMDB-BINARY | IMDB-MULTI | PROTEINS | DD |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NodesEdges [ config ] | 84 (±1) | 83 (±3) | 68 (±2) | 71 (±2) | 48 (±4) | 73 (±4) | 75 (±4) |
| TradDegs [ config ] | 91.9 (±1.4) | 82.7 (±2.4) | 80.8 (±1.0) | 73.1 (±3.0) | 50.5 (±1.7) | 74.9 (±3.4) | 75.8 (±3.0) |
| GraphMAE [ config ] | 94.8 (±0.3) | 90.0 (±0.5) | 79.1 (±0.4) | 73.3 (±1.0) | 50.5 (±0.4) | 74.3 (±0.8) | 75.8 (±0.4) |
| PGCL [ config ] | 95.1 (±1.7) | 91.4 (±0.7) | 74.5 (±0.7) | 71.3 (±0.6) | 48.6 (±0.8) | 70.5 (±0.9) | 69.8 (±1.5) |
Compare with performances BEFORE training (scores of the untrained models, with the relative change with respect to the trained models):

| Model (config) | Custom (Training) | REDDIT-BINARY | COLLAB | IMDB-BINARY | IMDB-MULTI | PROTEINS | DD |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GraphMAE [ config ] | 94.0 (±0.2) ↓ -0.8% | 87.6 (±0.3) ↓ -2.7% | 78.2 (±0.3) ↓ -1.7% | 73.0 (±1.0) = -0.4% | 50.5 (±0.6) = +0.1% | 74.0 (±0.7) = -0.4% | 76.2 (±0.6) = +0.5% |
| PGCL [ config ] | 95.0 (±0.7) = -0.1% | 89.5 (±1.3) ↓ -2.1% | 69.8 (±0.3) ↓ -6.3% | 71.1 (±0.4) ↓ -0.3% | 48.1 (±0.3) = -1.0% | 69.7 (±0.5) = -1.1% | 66.1 (±0.8) ↓ -5.3% |

5.c Visualisation

🫧

The class GRVisualiser (see code) is implemented to ease the visualisation and exploration of Graph Representations computed with the models.

The visualisation module makes use of the matplotlib [7] and seaborn [8] libraries. The interactive module is implemented with plotly [9].

Usage (and possible visualisation outputs) is also showcased in the demonstration notebook. Visualisations can be obtained by running the following command line:

```sh
python viz.py -m <path_to_model> -d <dataset_name>
```

The following options can be given to the program:

  • -m or --model: the path to the model
  • -d or --dataset: the dataset's name
  • -s or --save_path: save the figure at this path, if given
  • -f or --force_save: save the figure under an automatically generated name (based on the model and dataset) in the illustrations/ folder
  • -i or --interactive: output an interactive plot (served by default at http://0.0.0.0:8060/)

Then, the program can be executed with the following command line:

```sh
python viz.py -m <model_path> -r <reducer> -d <dataset_id> -f -i
```

For example, the following command visualises the embeddings of the "REDDIT-BINARY" dataset generated with the best saved PGCL model (which needs to be saved at ./saved/best_models/PGCL/), reduced with TSNE, saves the figure, and outputs the interactive plot:

```sh
python viz.py -d "REDDIT-BINARY" \
    -m './saved/best_models/PGCL/' \
    -s 'illustrations/pgcl_rdtb_tsne.png' \
    -i
```
Gallery

|  | TradDegs | PGCL | GraphMAE |
| --- | --- | --- | --- |
| Netzschleuder catalogue | Degs-Netz | PGCL-Netz | GraphMAE-Netz |
| REDDIT-BINARY | Degs-RDB | PGCL-RDB | GrMAE-RDB |

Interactive visualisation demonstration: GIF of an interactive plotly demo (zoom, inspect instances, pan, ...).

🌡 6. Folder Structure

πŸ”

```
DL_module/
├── __init__.py
├── Configs/
│   ├── __init__.py
│   ├── config_files/
│   │   └── <CONFIG_NAME>.json
│   ├── configs_parser.py
│   └── inspector.py # deprecated
├── DataLoader/
│   ├── data_loader.py
│   ├── data_util.py
│   └── test.ipynb # deprecated
├── illustrations/
│   └── ...
├── Logger/
│   ├── __init__.py
│   ├── logger.py
│   ├── logger_config.json
│   └── visualization.py
├── Models/
│   ├── __init__.py
│   ├── encoders.py
│   ├── from_pretrained.py
│   ├── model_grssl.py
│   ├── model_util.py
│   ├── README.md # deprecated ?
│   └── test.ipynb # deprecated
├── PGCL_pipe_tests.ipynb # deprecated
├── pipe_tests.ipynb
├── README.md
├── repro_training.sh
├── requirements.txt
├── saved/
│   ├── best_models/
│   │   ├── <MODEL_NAME>/
│   │   │   ├── config.json
│   │   │   └── model_best.pth
│   │   └── trad_degs/
│   │       └── config.json
│   ├── log/
│   │   └── <MODEL_NAME>/
│   │       └── <RUN_ID>/
│   │           └── info.log
│   ├── models/
│   │   └── <MODEL_NAME>/
│   │       └── <RUN_ID>/
│   │           ├── config.json
│   │           ├── model_untrained.pth
│   │           └── model_best.pth
│   └── repro/
│       ├── log/
│       │   └── ...
│       └── models/
│           └── ...
├── showcase_pretrained.ipynb
├── tests.ipynb # deprecated
├── train.py
├── Trainers/
│   ├── loss.py
│   ├── trainer.py
│   └── trainer_util.py
├── Utils/
│   ├── misc.py
│   ├── performance_assessment.py
│   └── tasks.py
├── Visualisers/
│   ├── visualiser.py
│   └── viz_util.py
└── viz.py
```

πŸ“š 7. References

πŸ”

Bibliography

[1] (^back to: 0.; 2.a) [ paper | code ]
Hou, Z., Liu, X., Dong, Y., Wang, C., & Tang, J. (2022). GraphMAE: Self-Supervised Masked Graph Autoencoders. arXiv preprint arXiv:2205.10803.

[2] (^back to: 0.; 2.a) [ paper | code ] Lin, S., Liu, C., Zhou, P., Hu, Z. Y., Wang, S., Zhao, R., ... & Liang, X. (2022). Prototypical Graph Contrastive Learning. IEEE Transactions on Neural Networks and Learning Systems.

[3] (^back to: 1) [ paper | code ]
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … Chintala, S. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32 (pp. 8024–8035). Curran Associates, Inc.

[4] (^back to: 1) [ paper | code ]
Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, & Zheng Zhang (2019). Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks. arXiv preprint arXiv:1909.01315.

[5] (^back to: 1, 3.a) [ website | figshare ]
Tiago P. Peixoto. (2014). The graph-tool python library. figshare.

[6] (^back to: 3.b) [ website | paper ]
Christopher Morris, Nils M. Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, & Marion Neumann (2020). TUDataset: A collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020).

[7] (^back to: 5.c) [ website ]
Hunter, J. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95.

[8] (^back to: 5.c) [ website ]
Michael L. Waskom (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60), 3021.

[9] (^back to: 5.c) [ website ]
Plotly Technologies Inc. (2015). Collaborative data science. Montreal, QC: Plotly Technologies Inc. Retrieved from https://plot.ly.

Acknowledgements

  • The general organisation of this repository, as well as some general implementations, is taken from the project pytorch-template.

  • The code for the different GRSSL methods consists of slight adjustments of the original implementations to fit the general framework used here.


[ πŸͺ© | ⚑️ | 🌻 | 🧳 |Β βš™οΈ | 🫧 | 🌡 |Β πŸ“š ]