GraphAware

A framework designed to make scikit-learn classifiers "graph-aware". With GraphAware, classical machine learning algorithms like logistic regression or XGBoost can exploit both node features and the graph structure, similar to popular Graph Neural Networks.

Table of Contents

  • Overview
  • Installation
  • Usage
  • Automated Hyperparameter search
  • Examples
  • Contact
  • Funding
  • Competing interests
Overview

Machine learning algorithms like XGBoost have shown promising results in many applications such as disease diagnosis. However, they cannot exploit the connections in graph-structured data like citation networks (e.g., Cora) or protein-protein interaction networks (PPI). Although various graph learning algorithms like Graph Neural Networks (GNNs) have been proposed, there is still a demand for new frameworks due to the long training times and the high number of trainable parameters in GNNs.

We propose GraphAware, a new framework to analyze graph-structured data with classical machine learning algorithms. GraphAware trains separate machine learning classifiers on feature sets generated from aggregated neighborhoods of different orders and combines their outputs for the final prediction.

We showed that the accuracy (Cora, CiteSeer, PubMed) or micro F1-score (PPI) of GraphAware (Cora: 0.831, CiteSeer: 0.720, PubMed: 0.802, PPI: 0.984) is higher than or at least comparable to the best-performing GNN, GAT (Cora: 0.831, CiteSeer: 0.708, PubMed: 0.790, PPI: 0.991). Furthermore, the training time required for GraphAware is much shorter (Cora: 1.09 s, CiteSeer: 2.56 s, PubMed: 1.77 s, PPI: 149.88 s) compared to GAT (Cora: 4.91 s, CiteSeer: 5.05 s, PubMed: 5.63 s, PPI: 338.72 s). GraphAware is compatible with popular Python packages like sklearn and XGBoost and is open-sourced at https://github.com/danielwalke/GraphAware.
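The core idea can be illustrated with a few lines of plain scikit-learn code. The following is a conceptual sketch only, not the library's implementation; it uses a dense NumPy adjacency matrix and mean aggregation purely to illustrate training one classifier per neighborhood order and combining the predicted probabilities.

# Conceptual sketch of the GraphAware idea (NOT the library's implementation):
# aggregate neighborhoods of different orders, train one classifier per order,
# and combine the predicted probabilities for the final prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression

def mean_aggregate(X, adj, hops):
    # Row-normalize the (dense, illustrative) adjacency matrix and propagate `hops` times.
    P = adj / adj.sum(axis=1, keepdims=True).clip(min=1)
    H = X.copy()
    for _ in range(hops):
        H = P @ H
    return H

def graph_aware_predict(X, adj, y, train_idx, test_idx, hops_list=(0, 2)):
    probas = []
    for k in hops_list:
        H = mean_aggregate(X, adj, k)          # features aggregated over k-hop neighborhoods
        clf = LogisticRegression(max_iter=1000).fit(H[train_idx], y[train_idx])
        probas.append(clf.predict_proba(H[test_idx]))
    return np.mean(probas, axis=0)             # combine the per-order outputs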
Installation

  1. Clone the project:
    git clone https://github.com/danielwalke/GraphAware
  2. Navigate into the project directory:
    cd GraphAware/
  3. Install the requirements:
    pip install -r requirements.txt
  4. Import the framework in Python or Jupyter notebooks:
    from EnsembleFramework import Framework
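Because the framework is imported as a local module from the cloned repository rather than as an installed package, step 4 assumes that Python is started inside the GraphAware/ directory (or that this directory is on sys.path). A quick sanity check:

# Run this from the cloned GraphAware/ directory; if you start Python elsewhere,
# add the repository to the module search path first (the path below is a placeholder).
import sys
sys.path.append("/path/to/GraphAware")  # only needed when running outside the repository

from EnsembleFramework import Framework
print(Framework)  # should print the Framework class without raising an ImportError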
Usage

  1. Import the Framework, torch, and sklearn:
from EnsembleFramework import Framework
import sklearn
import torch
  2. Define a list describing the orders of neighborhoods you want to incorporate, e.g.:
hops = [0, 2]
  3. Define an aggregation function, e.g.:
def user_function(kwargs):
    # Subtract the mean of the neighbors' features from the node's own features
    mean = kwargs["mean_neighbors"]
    original_features = kwargs["original_features"]
    return original_features - mean
  4. Define a classifier for each order of neighborhood, e.g.:
clfs = [sklearn.ensemble.RandomForestClassifier(max_leaf_nodes=50, n_estimators=1000, random_state=42),
        sklearn.ensemble.RandomForestClassifier(max_leaf_nodes=50, n_estimators=1000, random_state=42)]
  5. Optional: You can set influence score configurations if you want to weight the contribution of individual neighbors based on their cosine similarity (use_pseudo_attention) or cut off neighbors that are too dissimilar (cosine_eps):
attention_configs = [{'inter_layer_normalize': True, 'use_pseudo_attention': True, 'cosine_eps': .01, 'dropout_attn': None} for i in hops]
  6. Initialize the framework, e.g.:
framework = Framework(hops_list=hops,
                      clfs=clfs,
                      attention_configs=[None for i in hops],
                      handle_nan=0.0,
                      gpu_idx=None,
                      user_functions=[user_function for i in hops]
)

Note that hops_list, clfs, attention_configs, and user_functions need to have the same length. Detailed documentation on how the individual parameters have to be set can be found in EnsembleFramework.py.

  7. Fit the framework on the training data (see also the end-to-end sketch after this list), e.g.: X_train - features of the train set, edge_index - edge index in COO format for the train graph, y_train - labels of the train set, train_mask - boolean mask if you want to train under transductive settings, otherwise you can set it to None

framework.fit(X_train=features_train, edge_index=edge_index_train, y_train=labels_train, train_mask=torch.ones(features_train.shape[0]).type(torch.bool))
  8. Get the final prediction probabilities: features_test - features of the test set, edge_index_test - edge index in COO format for the test graph, test_mask - boolean mask if you want to evaluate under transductive settings, otherwise you can set it to None
predict_proba = framework.predict_proba(features_test, edge_index_test, torch.ones(features_test.shape[0]).type(torch.bool))
  9. Evaluate the predictions, e.g.: y_test - labels of the test set
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, predict_proba[:,1])
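
Putting the steps above together, the following is a minimal end-to-end sketch for a transductive node classification task. It loads Cora via torch_geometric (as in the hyperparameter search section below); the chosen classifier, hops, and aggregation function are illustrative rather than recommended settings, and the final accuracy line assumes that predict_proba returns probabilities only for the nodes selected by the mask.

# Minimal end-to-end sketch (illustrative settings, transductive Cora example)
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from torch_geometric.datasets import Planetoid
from EnsembleFramework import Framework

data = Planetoid(root="./data", name="Cora")[0]
hops = [0, 2]

def user_function(kwargs):
    # Same aggregation as in step 3: node features minus the mean of the neighbors' features
    return kwargs["original_features"] - kwargs["mean_neighbors"]

framework = Framework(hops_list=hops,
                      clfs=[RandomForestClassifier(max_leaf_nodes=50, n_estimators=1000, random_state=42)
                            for _ in hops],
                      attention_configs=[None for _ in hops],
                      handle_nan=0.0,
                      gpu_idx=None,
                      user_functions=[user_function for _ in hops])

# Transductive setting: pass the full graph and select train/test nodes via boolean masks.
framework.fit(X_train=data.x, edge_index=data.edge_index, y_train=data.y, train_mask=data.train_mask)
proba = framework.predict_proba(data.x, data.edge_index, data.test_mask)

# Assumes the returned probabilities correspond to the nodes selected by test_mask.
preds = np.asarray(proba).argmax(axis=1)
print("test accuracy:", accuracy_score(data.y[data.test_mask].numpy(), preds))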

Automated Hyperparameter search

You can define your own hyperparameters or use the automated hyperparameter search.

  1. Import hyperopt, the preferred evaluation metric and your model, e.g.:
from sklearn.linear_model import LogisticRegression
from hyperopt import hp
from AutoTune2 import AutoSearch
from sklearn.metrics import accuracy_score
  2. Define the dataset, e.g.:
from torch_geometric.datasets import Planetoid
cora_dataset = Planetoid(root="./data", name="Cora")
X = cora_dataset[0].x
y = cora_dataset[0].y
edge_index = cora_dataset[0].edge_index

test = cora_dataset[0].test_mask
train = cora_dataset[0].train_mask
val = cora_dataset[0].val_mask

dataset = dict({})
dataset["X"] = X  ## feature set
dataset["y"] = y  ## labels
dataset["test"] = test  ## boolean test mask
dataset["train"] = train  ## boolean train mask
dataset["val"] = val  ## boolean val mask
dataset["edge_index"] = edge_index  ## edge index in COO format
  3. Define the classifiers you want to evaluate (a sketch for searching over multiple classifiers follows after this list):
clfs = [LogisticRegression]
  4. Define the hyperparameter space, e.g.:
  • Discrete values here:
lr_choices = {
    'penalty': ["l2"],
    'max_iter': [2**i for i in range(6, 15)],
}
  • Numeric or continuous values here:
space_lr = {
    **{key: hp.choice(key, value) for key, value in lr_choices.items()},
    'tol': hp.loguniform('tol', -11, -3),
    'C': hp.uniform('C', 0.0, 10)
}
clfs_space = dict({})
clfs_space["LogisticRegression"] = space_lr
  5. Define the aggregation functions you want to evaluate, e.g.:
def mean_user_function(kwargs):
    return kwargs["original_features"] + kwargs["mean_neighbors"]

def sum_user_function(kwargs):
    return kwargs["original_features"] + kwargs["summed_neighbors"]

user_functions = [mean_user_function, sum_user_function]
  6. Define the lists of neighborhood orders you want to evaluate, e.g.:
hops_lists = [[0,3,8], [0,3]]
  7. Define the list of influence score configurations you want to evaluate, e.g.:
attention_configs = [None,
                     {'inter_layer_normalize': False,
                      'use_pseudo_attention': True,
                      'cosine_eps': .01,
                      'dropout_attn': None},
                     {'inter_layer_normalize': True,
                      'use_pseudo_attention': True,
                      'cosine_eps': .01,
                      'dropout_attn': None},
                     {'inter_layer_normalize': True,
                      'use_pseudo_attention': True,
                      'cosine_eps': .001,
                      'dropout_attn': None}]
  8. Initialize the searcher: max_evals - maximum number of evaluations per provided classifier, neighborhood-order list, influence score configuration, and aggregation function; pred_metric - desired evaluation metric; parallelism - number of parallel executions
searcher = AutoSearch(dataset, max_evals=500, pred_metric=accuracy_score, parallelism=50)
  9. Start the search:
store = searcher.search(clfs, clfs_space, hops=hops_lists, user_functions=user_functions,
                        attention_configs=attention_configs)
  10. Evaluate the search results:
print(store)
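
The same pattern extends to multiple classifiers: add the class to clfs and register a matching hyperparameter space under the class name in clfs_space, as in the LogisticRegression example above. The sketch below is illustrative only; the RandomForestClassifier search space is an arbitrary assumption, not a tuned recommendation.

# Illustrative sketch: searching over a second classifier as well.
# The RandomForestClassifier space below is an arbitrary example, not a recommendation.
from sklearn.ensemble import RandomForestClassifier
from hyperopt import hp

rf_choices = {
    'n_estimators': [100, 300, 1000],
    'max_leaf_nodes': [10, 50, 200],
    'max_features': ["sqrt", "log2"],
}
space_rf = {key: hp.choice(key, value) for key, value in rf_choices.items()}

clfs = [LogisticRegression, RandomForestClassifier]
clfs_space["RandomForestClassifier"] = space_rf

store = searcher.search(clfs, clfs_space, hops=hops_lists, user_functions=user_functions,
                        attention_configs=attention_configs)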

Examples

You can find examples for the usage under transductive settings here and for inductive settings here.

Contact

If you have any questions, struggle with the repository, want new features, or want to cooperate, do not hesitate to contact me: [email protected]

Funding

We thank the German Research Foundation (DFG) for funding this study under the project ‘Optimizing graph databases focusing on data processing and integration of machine learning for large clinical and biological datasets’ [project number 463414123; grant numbers HE 8077/2-1, SA 465/53-1].

Competing interests

The authors declare that they have no competing interests.
