Datasets Core Concepts

giovanni-stilo edited this page Jan 9, 2024 · 46 revisions

Classes Involved in Dataset generation

Figure 1: Overview of the classes involved in the datasets creation and management.

DataInstance

DataInstance is the base class that the other instance types extend. A DataInstance object contains an id, the data payload, a label, and a reference to the dataset that holds it.

GraphInstance

GraphInstance extends the DataInstance class by also adding node_features, edge_features, edge_weights, and graph_features. All of these must be passed at initialization time.

  • data is an $n \times n$ matrix where $n$ is the number of nodes (i.e., it is the binary adjacency matrix);
  • node_features is an $n \times d$ matrix where $d$ is the number of node features;
  • edge_features is an $e \times f$ matrix where $e$ is the number of edges and $f$ is the number of edge features;
  • edge_weights is an $n \times n$ matrix containing the weight of each edge (i.e., it is the weighted adjacency matrix);
  • graph_features is a vector of size $1 \times g$ that contains the "global" attributes of the graph (e.g., the diameter), where $g$ is the number of graph features.

Notes:

  • The number of features must be constant, even when some features are not used by every node, edge, or graph.
  • To better understand how the features work, please also see #features-maps.
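The shapes listed above can be sketched with plain NumPy arrays. This is only an illustration of the expected dimensions: it does not call the GraphInstance constructor, the variable values are made up, and a small directed graph is assumed so that the number of edges $e$ equals the number of non-zero entries in the adjacency matrix.

```python
import numpy as np

# Illustrative sizes (assumptions): 4 nodes, 2 node features,
# 3 edge features, 1 graph feature.
n, d, f, g = 4, 2, 3, 1

# Binary adjacency matrix of a small directed path 0 -> 1 -> 2 -> 3.
data = np.zeros((n, n), dtype=int)
for source, target in [(0, 1), (1, 2), (2, 3)]:
    data[source, target] = 1

# In this directed example, each non-zero entry is exactly one edge.
e = int(np.count_nonzero(data))

node_features = np.zeros((n, d))      # one row per node
edge_features = np.zeros((e, f))      # one row per edge
edge_weights = data * 0.5             # weighted adjacency matrix, same n x n shape
graph_features = np.array([[float(n)]])  # 1 x g vector of "global" attributes

shapes = {
    "data": data.shape,
    "node_features": node_features.shape,
    "edge_features": edge_features.shape,
    "edge_weights": edge_weights.shape,
    "graph_features": graph_features.shape,
}
print(shapes)
```

Every instance of the same dataset would keep the same $d$, $f$, and $g$, in line with the first note above; only $n$ and $e$ change from graph to graph.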

Dataset

The Dataset class defines all the fundamental mechanisms necessary to Generate, Manipulate, Write and Load a dataset. In our philosophy, a Dataset is always generated: it can be generated by i) reading files (e.g., from the disk) or by ii) a defined process. Once the Dataset is generated (together with its manipulations), it is stored on disk in the custom format and reloaded when necessary. Thus, a specific dataset is identified by its complete configuration snippet, including all its generation parameters and manipulators (see the next sections). For example, consider the process-generated dataset TreeCycle. The dataset obtained by the following configuration snippet:

{
    "class": "src.dataset.dataset_base.Dataset",
    "parameters": {
        "generator": {
            "class": "src.dataset.generators.treecycles_rand.TreeCyclesRand", 
            "parameters": { "num_instances": 128, "num_nodes_per_instance": 32, "ratio_nodes_in_cycles": 0.2}
        }
    } 
}

is different from the one obtained by the following snippet because, in the latter case, the NodeCentrality manipulator adds node features to the dataset:

{
    "class": "src.dataset.dataset_base.Dataset",
    "parameters": {
        "generator": {
            "class": "src.dataset.generators.treecycles_rand.TreeCyclesRand", 
            "parameters": { "num_instances": 128, "num_nodes_per_instance": 32, "ratio_nodes_in_cycles": 0.2}
        },
        "manipulators": [{ 
            "class": "src.dataset.manipulators.centralities.NodeCentrality",
            "parameters": {} }] 
    } 
}
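To make the "identified by its configuration" idea concrete, here is a minimal sketch that derives a key from each of the two snippets above. It assumes that a dataset could be keyed by a hash of its canonical JSON configuration; the helper `config_key` is hypothetical and is not claimed to be the mechanism the framework actually uses.

```python
import hashlib
import json


def config_key(config: dict) -> str:
    """Hypothetical helper: hash a configuration in a key-order-independent way."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


generator = {
    "class": "src.dataset.generators.treecycles_rand.TreeCyclesRand",
    "parameters": {
        "num_instances": 128,
        "num_nodes_per_instance": 32,
        "ratio_nodes_in_cycles": 0.2,
    },
}

# First snippet: generator only.
plain = {
    "class": "src.dataset.dataset_base.Dataset",
    "parameters": {"generator": generator},
}

# Second snippet: same generator plus the NodeCentrality manipulator.
with_centrality = {
    "class": "src.dataset.dataset_base.Dataset",
    "parameters": {
        "generator": generator,
        "manipulators": [
            {
                "class": "src.dataset.manipulators.centralities.NodeCentrality",
                "parameters": {},
            }
        ],
    },
}

# The two configurations yield different keys, so they denote different datasets.
print(config_key(plain) != config_key(with_centrality))
```

Under this assumption, changing any generation parameter or adding a manipulator produces a new key, while reordering the JSON keys does not.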

Dataset Generation/Load Workflow

The dataset workflow is defined by the following steps:

  1. if the dataset is already available (previously generated, manipulated, and stored):
    1. load the dataset in memory as it is;
  2. if the dataset is not available:
    1. use the defined Generator to generate the dataset;
    2. apply all the manipulators (there can be more than one) on the freshly generated dataset;
    3. store the dataset on disk for future use;
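The steps above can be sketched as a small load-or-generate routine. This is a simplified illustration, not the Dataset class itself: the function `load_or_generate`, the use of pickle as the storage format, and the toy generator and manipulator are all assumptions made for the example.

```python
import os
import pickle
import tempfile


def load_or_generate(path, generate, manipulators=()):
    """Hypothetical sketch of the workflow: load if stored, else build and store."""
    # Step 1: the dataset is already available, load it as it is.
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)

    # Step 2.1: use the generator to create the dataset.
    dataset = generate()
    # Step 2.2: apply every manipulator, in order, to the fresh dataset.
    for manipulate in manipulators:
        dataset = manipulate(dataset)
    # Step 2.3: store the result for future use.
    with open(path, "wb") as handle:
        pickle.dump(dataset, handle)
    return dataset


calls = {"generate": 0}


def toy_generator():
    # Stand-in generator: three "instances" identified only by an id.
    calls["generate"] += 1
    return [{"id": i} for i in range(3)]


def toy_manipulator(dataset):
    # Stand-in manipulator: attach one extra attribute to each instance.
    return [{**instance, "flag": True} for instance in dataset]


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "dataset.pkl")
    first = load_or_generate(target, toy_generator, [toy_manipulator])
    second = load_or_generate(target, toy_generator, [toy_manipulator])

# The generator runs only once; the second call reads the stored copy.
print(calls["generate"], first == second)
```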

Features' Maps

The node_features, edge_features, and graph_features maps record the meaning of each feature together with its index in the corresponding matrix/vector.

For example, suppose we have two features for each node: the first feature represents the degree centrality of the node, and the second represents the node's betweenness centrality. Then, the map that we need to create must be:

{"degree":0, "betweenness":1}

The purpose of the maps is to preserve the semantics of the features. This can be crucial when, e.g., an explainer needs to locate the feature it is interested in. For example, in which position is the degree of the node stored?
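A minimal sketch of how such a map answers that question, using a NumPy matrix as the node_features payload. The feature values are illustrative, and the lookup shown here is plain dictionary indexing rather than a specific framework API.

```python
import numpy as np

# Map from feature name to column index, as in the example above.
node_features_map = {"degree": 0, "betweenness": 1}

# Illustrative n x d matrix for a 3-node graph (values are made up).
node_features = np.array([
    [0.5, 0.0],
    [1.0, 1.0],
    [0.5, 0.0],
])

# Look up the column by name instead of hard-coding its position.
degree_column = node_features[:, node_features_map["degree"]]
betweenness_column = node_features[:, node_features_map["betweenness"]]

print(degree_column.tolist())
print(betweenness_column.tolist())
```

Consumers that go through the map keep working even if the column order changes, as long as the map is updated along with the matrix.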

The maps must be initialised once and for all at the Dataset level. The map initialization is typically done in the init method of the generator (see Generating-your-own-dataset/#dealing-with-attributed-graph).

Generator

Please take a look at the Generating-your-own-dataset page.

Manipulator

The manipulators are automatically applied to the dataset once it has been created by the Generator. To create a custom manipulator, extend BaseManipulator and override one (or more) of the following methods. Each overridden method must return a map from each attribute's name to its (scalar) value for the input instance. The BaseManipulator implementation takes care of adding the new attributes both to the instance and to the dataset's maps.

    def node_info(self, instance):
        return {}
    
    def graph_info(self, instance):
        return {}
    
    def edge_info(self, instance):
        return {}
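As an illustration, a custom manipulator that overrides only graph_info might look like the sketch below. To keep the example self-contained, `BaseManipulator` is replaced here by a stand-in class with the three no-op methods shown above, and the instance is a simple object exposing only a `data` adjacency matrix; the class `GraphSizeManipulator` and its attribute names are hypothetical.

```python
from types import SimpleNamespace

import numpy as np


class BaseManipulator:
    """Stand-in for the framework's base class (assumption for this sketch)."""

    def node_info(self, instance):
        return {}

    def graph_info(self, instance):
        return {}

    def edge_info(self, instance):
        return {}


class GraphSizeManipulator(BaseManipulator):
    """Hypothetical manipulator adding two scalar graph-level attributes."""

    def graph_info(self, instance):
        adjacency = np.asarray(instance.data)
        return {
            "num_nodes": int(adjacency.shape[0]),
            "num_nonzero_entries": int(np.count_nonzero(adjacency)),
        }


# Stand-in instance: adjacency matrix of the undirected path 0 - 1 - 2.
instance = SimpleNamespace(data=np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]))

manipulator = GraphSizeManipulator()
info = manipulator.graph_info(instance)
print(info)
```

The methods that are not overridden keep returning an empty map, so this manipulator would contribute graph-level attributes only.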