A framework designed to make scikit-learn classifiers "graph-aware". With GraphAware, classical machine learning algorithms such as logistic regression or XGBoost can exploit both node features and graph structure, similar to the popular graph neural networks.
content |
---|
Overview |
Installation |
Usage |
Automated Hyperparameter search |
Examples |
Contact |
Funding |
Competing interests |
- Clone the project
git clone https://github.com/danielwalke/GraphAware
- Navigate into the project
cd GraphAware/
- Install requirements
pip install -r requirements.txt
- Import the framework in python or jupyter notebooks
from EnsembleFramework import Framework
- Import the Framework, import torch and import sklearn:
from EnsembleFramework import Framework
import sklearn
import torch
- Define a list describing the order of neighborhoods you want to incorporate, e.g.:
hops = [0, 2]
- Define an aggregation function, e.g.:
def user_function(kwargs):
    # Difference between a node's own features and the mean of its neighbors' features
    mean = kwargs["mean_neighbors"]
    original_features = kwargs["original_features"]
    return original_features - mean
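To make the inputs of such an aggregation function concrete, the following is a minimal pure-Python sketch on a three-node toy graph. It assumes that "mean_neighbors" holds, for each node, the average feature vector of its direct neighbors, and that the edge index is a pair of source and target lists (COO format). The helper names `mean_neighbors` and `toy_user_function` are hypothetical, and the framework itself operates on tensors rather than lists; how it computes the neighbor statistics internally is not shown here.

```python
def mean_neighbors(features, edge_index):
    """Average the feature vectors of each node's incoming neighbors.

    edge_index is in COO format: the k-th edge runs from
    edge_index[0][k] (source) to edge_index[1][k] (target).
    """
    sources, targets = edge_index
    dim = len(features[0])
    sums = [[0.0] * dim for _ in features]
    counts = [0] * len(features)
    for src, dst in zip(sources, targets):
        for j in range(dim):
            sums[dst][j] += features[src][j]
        counts[dst] += 1
    # Nodes without neighbors keep a zero vector (an assumption of this sketch).
    return [[s / c if c else 0.0 for s in row] for row, c in zip(sums, counts)]


def toy_user_function(kwargs):
    """List-based analogue of: original_features - mean_neighbors."""
    original = kwargs["original_features"]
    mean = kwargs["mean_neighbors"]
    return [[o - m for o, m in zip(o_row, m_row)]
            for o_row, m_row in zip(original, mean)]


features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
# Undirected path 0 - 1 - 2, stored as directed edges in both directions.
edge_index = [[0, 1, 1, 2],
              [1, 0, 2, 1]]

means = mean_neighbors(features, edge_index)
combined = toy_user_function({"original_features": features,
                              "mean_neighbors": means})
print(means)     # neighbor averages per node
print(combined)  # own features minus neighbor average
```

In this toy graph every node ends up with the same neighbor average, so the combined features only reflect how far each node deviates from its neighborhood.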
- Define a classifier that is used to analyze each order of neighborhood, e.g.:
clfs=[sklearn.ensemble.RandomForestClassifier(max_leaf_nodes=50, n_estimators=1000, random_state=42), sklearn.ensemble.RandomForestClassifier(max_leaf_nodes=50, n_estimators=1000, random_state=42)]
- Optional: Set influence score configurations if you want to weight the contribution of individual neighbors by their cosine similarity (use_pseudo_attention) or cut off neighbors that are too dissimilar (cosine_eps):
attention_configs = [{'inter_layer_normalize': True, 'use_pseudo_attention': True, 'cosine_eps': .01, 'dropout_attn': None} for i in hops]
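The idea behind similarity-based weighting with a cut-off can be illustrated with a small pure-Python sketch. It is an illustration of the concept only, not the framework's implementation: the function names are hypothetical, and the choice to keep neighbors whose similarity is at least `cosine_eps` and to normalize by the sum of the kept weights is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weighted_neighbor_mean(node, neighbors, cosine_eps):
    """Average neighbor features, weighted by similarity to the node.

    Neighbors with a similarity below cosine_eps are dropped
    (assumed threshold rule for this sketch).
    """
    weights = [cosine_similarity(node, n) for n in neighbors]
    kept = [(w, n) for w, n in zip(weights, neighbors) if w >= cosine_eps]
    total = sum(w for w, _ in kept)
    if total == 0:
        return [0.0] * len(node)
    return [sum(w * n[j] for w, n in kept) / total for j in range(len(node))]


node = [1.0, 0.0]
neighbors = [[2.0, 0.0],   # points in the same direction as the node
             [0.0, 3.0]]   # orthogonal to the node, similarity 0
result = weighted_neighbor_mean(node, neighbors, cosine_eps=0.01)
print(result)
```

Here the orthogonal neighbor falls below the threshold and is ignored, so only the similar neighbor contributes to the aggregate.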
- Initialize the framework, e.g.:
framework = Framework(hops_list=hops,
                      clfs=clfs,
                      attention_configs=[None for i in hops],
                      handle_nan=0.0,
                      gpu_idx=None,
                      user_functions=[user_function for i in hops])
Note that hops_list, clfs, attention_configs, and user_functions need to have the same length. Detailed documentation on how to set the individual parameters can be found in EnsembleFramework.py.
- Fit the data, e.g.: X_train - features of the train set, edge_index - edge index in COO format for the train graph, y_train - labels for the train set, train_mask - boolean mask if you want to train under transductive settings; otherwise you can set it to None
framework.fit(X_train=features_train, edge_index=edge_index_train,y_train=labels_train, train_mask=torch.ones(features_train.shape[0]).type(torch.bool))
- Get the final prediction probabilities: features_test - features of the test set, edge_index_test - edge index in COO format for the test graph, test_mask - boolean mask if you want to predict under transductive settings; otherwise you can set it to None
predict_proba = framework.predict_proba(features_test, edge_index_test, torch.ones(features_test.shape[0]).type(torch.bool))
- Evaluations, e.g.: y_test- labels for the test set
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, predict_proba[:,1])
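Selecting column 1 of the predicted probabilities passes the positive-class score to the metric, so this form applies to binary labels. As a reminder of what the score measures, ROC AUC equals the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below computes that definition directly in pure Python on made-up toy values; `pairwise_auc` is a hypothetical helper, not scikit-learn's implementation.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of correctly ordered positive/negative pairs."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ordered pair
    return wins / (len(positives) * len(negatives))


# Toy labels and positive-class scores (illustrative values only).
y_toy = [0, 0, 1, 1]
positive_class_scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_toy, positive_class_scores)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly in this toy example, which gives an AUC of 0.75.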
You can define your own hyperparameters or use the automated hyperparameter search.
- Import hyperopt, the preferred evaluation metric and your model, e.g.:
from sklearn.linear_model import LogisticRegression
from hyperopt import hp
from AutoTune2 import AutoSearch
from sklearn.metrics import accuracy_score
- Define the dataset, e.g.:
from torch_geometric.datasets import Planetoid
cora_dataset = Planetoid(root="./data", name="Cora")
data = cora_dataset[0]  # Cora consists of a single graph
X = data.x
y = data.y
test = data.test_mask
train = data.train_mask
val = data.val_mask
edge_index = data.edge_index  # edge index in COO format
dataset = dict({})
dataset["X"] = X ## feature set
dataset["y"] = y ## labels
dataset["test"] = test ## boolean test mask
dataset["train"] = train ## boolean train mask
dataset["val"] = val ## boolean val mask
dataset["edge_index"] = edge_index ## edge index in COO format
- Define the classifiers you want to evaluate:
clfs = [LogisticRegression]
- Define the hyperparameter space, e.g.:
- Discrete values here:
lr_choices = {
    'penalty': ["l2"],
    'max_iter': [2**i for i in range(6, 15)],
}
- Numeric or continuous values here:
space_lr = {
    **{key: hp.choice(key, value) for key, value in lr_choices.items()},
    'tol': hp.loguniform('tol', -11, -3),
    'C': hp.uniform('C', 0.0, 10)
}
clfs_space = dict({})
clfs_space["LogisticRegression"] = space_lr
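When reading this search space, note that the bounds of hp.loguniform are given on the log scale: the sampled value is exp(u) with u drawn uniformly between the two bounds. The following standard-library sketch shows which concrete ranges the space above covers. It does not call hyperopt; the random draws only mimic the distribution, and the variable names are illustrative.

```python
import math
import random

# Bounds passed to hp.loguniform('tol', -11, -3) are exponents, not tolerances.
low, high = -11, -3
tol_min = math.exp(low)    # smallest tolerance that can be sampled
tol_max = math.exp(high)   # largest tolerance that can be sampled

# Mimic the distribution: exponentiate a uniform draw on the log scale.
rng = random.Random(0)
samples = [math.exp(rng.uniform(low, high)) for _ in range(5)]

# Discrete candidates for max_iter: powers of two from 2**6 to 2**14.
max_iter_choices = [2**i for i in range(6, 15)]

print(f"tol range: {tol_min:.3e} to {tol_max:.3e}")
print("max_iter choices:", max_iter_choices)
```

The tolerance therefore ranges from roughly 1.7e-05 to 5.0e-02, and max_iter takes one of nine values between 64 and 16384.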
- Define the aggregation functions you want to evaluate, e.g.:
def mean_user_function(kwargs):
    return kwargs["original_features"] + kwargs["mean_neighbors"]
def sum_user_function(kwargs):
    return kwargs["original_features"] + kwargs["summed_neighbors"]
user_functions = [mean_user_function, sum_user_function]
- Define the lists of order of neighborhoods you want to evaluate, e.g.:
hops_lists = [[0,3,8], [0,3]]
- Define the list of influence score configurations you want to evaluate, e.g.:
attention_configs = [None,{'inter_layer_normalize': False,
'use_pseudo_attention':True,
'cosine_eps':.01,
'dropout_attn': None},
{'inter_layer_normalize': True,
'use_pseudo_attention':True,
'cosine_eps':.01,
'dropout_attn': None},
{'inter_layer_normalize': True,
'use_pseudo_attention':True,
'cosine_eps':.001,
'dropout_attn': None}]
- Initialize the searcher: max_evals - maximum number of evaluations per classifier, aggregation list, influence score configuration, and aggregation function provided; pred_metric - desired evaluation metric; parallelism - number of parallel executions
searcher = AutoSearch(dataset, max_evals=500, pred_metric = accuracy_score, parallelism=50)
- Start the search:
store = searcher.search(clfs, clfs_space, hops=hops_lists, user_functions= user_functions,
attention_configs = attention_configs)
- Evaluate the search results:
print(store)
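Before launching a search it can help to estimate its size. The sketch below counts the configurations defined in the steps above, assuming the searcher tries every combination of classifier, hop list, aggregation function, and influence score configuration and spends up to max_evals evaluations on each. That reading follows the description of max_evals, but the actual search loop is not shown here, so the result is a rough upper bound rather than a guaranteed cost. The string labels are placeholders for the real objects.

```python
from itertools import product

# Placeholder labels standing in for the objects defined in the steps above.
clf_names = ["LogisticRegression"]
hops_lists = [[0, 3, 8], [0, 3]]
user_function_names = ["mean_user_function", "sum_user_function"]
attention_config_labels = [
    "None",
    "no_inter_layer_normalize_eps_0.01",
    "inter_layer_normalize_eps_0.01",
    "inter_layer_normalize_eps_0.001",
]
max_evals = 500

# Assumption: every combination is searched separately.
combinations = list(product(clf_names, hops_lists,
                            user_function_names, attention_config_labels))
upper_bound_evals = len(combinations) * max_evals

print("combinations:", len(combinations))
print("upper bound on evaluations:", upper_bound_evals)
```

Under this assumption the example setup covers 16 combinations and at most 8000 evaluations, which is worth keeping in mind when choosing max_evals and parallelism.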
You can find examples for the usage under transductive settings here and for inductive settings here.
If you have any questions, run into problems with the repository, want new features, or want to cooperate, do not hesitate to contact me: [email protected]
We thank the German Research Foundation (DFG) for funding this study under the project ‘Optimizing graph databases focusing on data processing and integration of machine learning for large clinical and biological datasets’ [project number 463414123; grant numbers HE 8077/2-1, SA 465/53-1].
The authors declare that they have no competing interests.