GraphAware

A framework designed to make scikit-learn classifiers "graph-aware". With GraphAware, classical machine learning algorithms like logistic regression or XGBoost can exploit both node features and the graph structure, similar to popular Graph Neural Networks.

Table of Contents

  • Overview
  • Installation
  • Usage
  • Automated Hyperparameter search
  • Examples
  • Contact
  • Funding
  • Competing interests
Overview

Machine learning algorithms like XGBoost have shown promising results in many applications such as disease diagnosis. However, they cannot exploit the connections in graph-structured data like citation networks (e.g., Cora) or protein-protein interaction networks (PPI). Although various graph learning algorithms like Graph Neural Networks (GNNs) have been proposed, there is still a demand for new frameworks due to the long training times and the high number of trainable parameters in GNNs.

We propose GraphAware, a new framework to analyze graph-structured data with classical machine learning algorithms. GraphAware trains separate machine learning classifiers on feature sets generated from aggregated neighborhoods of different orders and combines their outputs for the final prediction.

We showed that the accuracy (Cora, CiteSeer, PubMed) or micro F1-score (PPI) of GraphAware (Cora: 0.831, CiteSeer: 0.720, PubMed: 0.802, PPI: 0.984) is higher than or at least comparable to the best-performing GNN, GAT (Cora: 0.831, CiteSeer: 0.708, PubMed: 0.790, PPI: 0.991). Furthermore, the training time required for GraphAware is much shorter (Cora: 1.09 s, CiteSeer: 2.56 s, PubMed: 1.77 s, PPI: 149.88 s) compared to GAT (Cora: 4.91 s, CiteSeer: 5.05 s, PubMed: 5.63 s, PPI: 338.72 s). GraphAware is compatible with popular Python packages like sklearn and XGBoost and is open-sourced at https://github.com/danielwalke/GraphAware.
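The core idea can be illustrated with a few lines of plain scikit-learn code. The following is a conceptual sketch only, not the library's implementation; it uses a dense NumPy adjacency matrix and mean aggregation purely to illustrate training one classifier per neighborhood order and combining the predicted probabilities.

# Conceptual sketch of the GraphAware idea (NOT the library's implementation):
# aggregate neighborhoods of different orders, train one classifier per order,
# and combine the predicted probabilities for the final prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression

def mean_aggregate(X, adj, hops):
    # Row-normalize the (dense, illustrative) adjacency matrix and propagate `hops` times.
    P = adj / adj.sum(axis=1, keepdims=True).clip(min=1)
    H = X.copy()
    for _ in range(hops):
        H = P @ H
    return H

def graph_aware_predict(X, adj, y, train_idx, test_idx, hops_list=(0, 2)):
    probas = []
    for k in hops_list:
        H = mean_aggregate(X, adj, k)          # features aggregated over k-hop neighborhoods
        clf = LogisticRegression(max_iter=1000).fit(H[train_idx], y[train_idx])
        probas.append(clf.predict_proba(H[test_idx]))
    return np.mean(probas, axis=0)             # combine the per-order outputs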
Installation

  1. Clone the project:
    git clone https://github.com/danielwalke/GraphAware
  2. Navigate into the project directory:
    cd GraphAware/
  3. Install the requirements:
    pip install -r requirements.txt
  4. Import the framework in Python or Jupyter notebooks:
    from EnsembleFramework import Framework
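Because the framework is imported as a local module from the cloned repository rather than as an installed package, step 4 assumes that Python is started inside the GraphAware/ directory (or that this directory is on sys.path). A quick sanity check:

# Run this from the cloned GraphAware/ directory; if you start Python elsewhere,
# add the repository to the module search path first (the path below is a placeholder).
import sys
sys.path.append("/path/to/GraphAware")  # only needed when running outside the repository

from EnsembleFramework import Framework
print(Framework)  # should print the Framework class without raising an ImportError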
Usage

  1. Import the Framework, torch, and sklearn:
from EnsembleFramework import Framework
import sklearn
import torch
  2. Define a list describing the orders of neighborhoods you want to incorporate, e.g.:
hops = [0, 2]
  3. Define an aggregation function, e.g.:
def user_function(kwargs):
    # Subtract the mean of the neighbors' features from the node's own features
    mean = kwargs["mean_neighbors"]
    original_features = kwargs["original_features"]
    return original_features - mean
  4. Define a classifier for each order of neighborhood, e.g.:
clfs = [sklearn.ensemble.RandomForestClassifier(max_leaf_nodes=50, n_estimators=1000, random_state=42),
        sklearn.ensemble.RandomForestClassifier(max_leaf_nodes=50, n_estimators=1000, random_state=42)]
  5. Optional: You can set influence score configurations if you want to weight the contribution of individual neighbors based on their cosine similarity (use_pseudo_attention) or cut off neighbors that are too dissimilar (cosine_eps):
attention_configs = [{'inter_layer_normalize': True, 'use_pseudo_attention': True, 'cosine_eps': .01, 'dropout_attn': None} for i in hops]
  6. Initialize the framework, e.g.:
framework = Framework(hops_list=hops,
                      clfs=clfs,
                      attention_configs=[None for i in hops],
                      handle_nan=0.0,
                      gpu_idx=None,
                      user_functions=[user_function for i in hops]
)

Note that hops_list, clfs, attention_configs, and user_functions need to have the same length. Detailed documentation on how the individual parameters have to be set can be found in EnsembleFramework.py.

  7. Fit the framework on the training data (see also the end-to-end sketch after this list), e.g.: X_train - features of the train set, edge_index - edge index in COO format for the train graph, y_train - labels of the train set, train_mask - boolean mask if you want to train under transductive settings, otherwise you can set it to None

framework.fit(X_train=features_train, edge_index=edge_index_train, y_train=labels_train, train_mask=torch.ones(features_train.shape[0]).type(torch.bool))
  8. Get the final prediction probabilities: features_test - features of the test set, edge_index_test - edge index in COO format for the test graph, test_mask - boolean mask if you want to evaluate under transductive settings, otherwise you can set it to None
predict_proba = framework.predict_proba(features_test, edge_index_test, torch.ones(features_test.shape[0]).type(torch.bool))
  9. Evaluate the predictions, e.g.: y_test - labels of the test set
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, predict_proba[:,1])
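
Putting the steps above together, the following is a minimal end-to-end sketch for a transductive node classification task. It loads Cora via torch_geometric (as in the hyperparameter search section below); the chosen classifier, hops, and aggregation function are illustrative rather than recommended settings, and the final accuracy line assumes that predict_proba returns probabilities only for the nodes selected by the mask.

# Minimal end-to-end sketch (illustrative settings, transductive Cora example)
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from torch_geometric.datasets import Planetoid
from EnsembleFramework import Framework

data = Planetoid(root="./data", name="Cora")[0]
hops = [0, 2]

def user_function(kwargs):
    # Same aggregation as in step 3: node features minus the mean of the neighbors' features
    return kwargs["original_features"] - kwargs["mean_neighbors"]

framework = Framework(hops_list=hops,
                      clfs=[RandomForestClassifier(max_leaf_nodes=50, n_estimators=1000, random_state=42)
                            for _ in hops],
                      attention_configs=[None for _ in hops],
                      handle_nan=0.0,
                      gpu_idx=None,
                      user_functions=[user_function for _ in hops])

# Transductive setting: pass the full graph and select train/test nodes via boolean masks.
framework.fit(X_train=data.x, edge_index=data.edge_index, y_train=data.y, train_mask=data.train_mask)
proba = framework.predict_proba(data.x, data.edge_index, data.test_mask)

# Assumes the returned probabilities correspond to the nodes selected by test_mask.
preds = np.asarray(proba).argmax(axis=1)
print("test accuracy:", accuracy_score(data.y[data.test_mask].numpy(), preds))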

Automated Hyperparameter search

You can define your own hyperparameters or use the automated hyperparameter search.

  1. Import hyperopt, the preferred evaluation metric and your model, e.g.:
from sklearn.linear_model import LogisticRegression
from hyperopt import hp
from AutoTune2 import AutoSearch
from sklearn.metrics import accuracy_score
  2. Define the dataset, e.g.:
from torch_geometric.datasets import Planetoid
cora_dataset = Planetoid(root="./data", name="Cora")
X = cora_dataset[0].x
y = cora_dataset[0].y
edge_index = cora_dataset[0].edge_index

test = cora_dataset[0].test_mask
train = cora_dataset[0].train_mask
val = cora_dataset[0].val_mask

dataset = dict({})
dataset["X"] = X  ## feature set
dataset["y"] = y  ## labels
dataset["test"] = test  ## boolean test mask
dataset["train"] = train  ## boolean train mask
dataset["val"] = val  ## boolean val mask
dataset["edge_index"] = edge_index  ## edge index in COO format
  3. Define the classifiers you want to evaluate (a sketch for searching over multiple classifiers follows after this list):
clfs = [LogisticRegression]
  4. Define the hyperparameter space, e.g.:
  • Discrete values here:
lr_choices = {
    'penalty': ["l2"],
    'max_iter': [2**i for i in range(6, 15)],
}
  • Numeric or continuous values here:
space_lr = {
    **{key: hp.choice(key, value) for key, value in lr_choices.items()},
    'tol': hp.loguniform('tol', -11, -3),
    'C': hp.uniform('C', 0.0, 10)
}
clfs_space = dict({})
clfs_space["LogisticRegression"] = space_lr
  5. Define the aggregation functions you want to evaluate, e.g.:
def mean_user_function(kwargs):
    return kwargs["original_features"] + kwargs["mean_neighbors"]

def sum_user_function(kwargs):
    return kwargs["original_features"] + kwargs["summed_neighbors"]

user_functions = [mean_user_function, sum_user_function]
  6. Define the lists of neighborhood orders you want to evaluate, e.g.:
hops_lists = [[0,3,8], [0,3]]
  7. Define the list of influence score configurations you want to evaluate, e.g.:
attention_configs = [None,
                     {'inter_layer_normalize': False,
                      'use_pseudo_attention': True,
                      'cosine_eps': .01,
                      'dropout_attn': None},
                     {'inter_layer_normalize': True,
                      'use_pseudo_attention': True,
                      'cosine_eps': .01,
                      'dropout_attn': None},
                     {'inter_layer_normalize': True,
                      'use_pseudo_attention': True,
                      'cosine_eps': .001,
                      'dropout_attn': None}]
  8. Initialize the searcher: max_evals - maximum number of evaluations per provided classifier, neighborhood-order list, influence score configuration, and aggregation function; pred_metric - desired evaluation metric; parallelism - number of parallel executions
searcher = AutoSearch(dataset, max_evals=500, pred_metric=accuracy_score, parallelism=50)
  9. Start the search:
store = searcher.search(clfs, clfs_space, hops=hops_lists, user_functions=user_functions,
                        attention_configs=attention_configs)
  10. Evaluate the search results:
print(store)
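
The same pattern extends to multiple classifiers: add the class to clfs and register a matching hyperparameter space under the class name in clfs_space, as in the LogisticRegression example above. The sketch below is illustrative only; the RandomForestClassifier search space is an arbitrary assumption, not a tuned recommendation.

# Illustrative sketch: searching over a second classifier as well.
# The RandomForestClassifier space below is an arbitrary example, not a recommendation.
from sklearn.ensemble import RandomForestClassifier
from hyperopt import hp

rf_choices = {
    'n_estimators': [100, 300, 1000],
    'max_leaf_nodes': [10, 50, 200],
    'max_features': ["sqrt", "log2"],
}
space_rf = {key: hp.choice(key, value) for key, value in rf_choices.items()}

clfs = [LogisticRegression, RandomForestClassifier]
clfs_space["RandomForestClassifier"] = space_rf

store = searcher.search(clfs, clfs_space, hops=hops_lists, user_functions=user_functions,
                        attention_configs=attention_configs)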

Examples

You can find examples for the usage under transductive settings here and for inductive settings here.

Contact

If you have any questions, struggle with the repository, want new features, or want to cooperate, do not hesitate to contact me: [email protected]

Funding

We thank the German Research Foundation (DFG) for funding this study under the project ‘Optimizing graph databases focusing on data processing and integration of machine learning for large clinical and biological datasets’ [project number 463414123; grant numbers HE 8077/2-1, SA 465/53-1].

Competing interests

The authors declare that they have no competing interests.
