GraphFormatConvertor cause a little bug on dataset loader #212

1511878618 · 2022-09-19T07:12:04Z


Using construct_graph to build a networkx graph, we can see that the one-hot encoding is included.
However, when I convert to PyTorch Geometric data, it appears to be lost.

from graphein.protein.graphs import construct_graph
from graphein.protein.features.nodes.amino_acid import amino_acid_one_hot
from graphein.protein.config import ProteinGraphConfig

config = ProteinGraphConfig(**{"node_metadata_functions": [amino_acid_one_hot]})
g = construct_graph(config=config, pdb_path="graphein/pdb/AF-P51608-F1-model_v3.pdb")
for n, k in g.nodes(data=True):
    print(k.keys())
    break

/opt/conda/lib/python3.7/site-packages/rich/live.py:229: UserWarning: install "ipywidgets" for Jupyter support
  warnings.warn('install "ipywidgets" for Jupyter support')

/opt/conda/lib/python3.7/site-packages/biopandas/pdb/pandas_pdb.py:681: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: 
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  idxs["end_idx"] = ends.line_idx.values

dict_keys(['chain_id', 'residue_name', 'residue_number', 'atom_type', 'element_symbol', 'coords', 'b_factor', 'amino_acid_one_hot'])

As you can see, amino_acid_one_hot is present in the node attributes when I add it to node_metadata_functions in the config.

However, when I load data through this pipeline and call ds[0], there is no amino_acid_one_hot among the attributes of ds[0]:

from graphein.ml.conversion import GraphFormatConvertor
from graphein.ml import ProteinGraphDataset
import os 
from graphein.protein.features.nodes.amino_acid import amino_acid_one_hot
from graphein.protein.config import ProteinGraphConfig

params_to_change = {"granularity": "centroids", "node_metadata_functions": [amino_acid_one_hot]}

config = ProteinGraphConfig(**params_to_change)

local_dir = "graphein/pdb/"
pdb_paths = [os.path.join(local_dir, pdb_path) for pdb_path in os.listdir(local_dir) if pdb_path.endswith(".pdb")]

ds = ProteinGraphDataset(
    root="graphein/pdb/test_without_list",
    pdb_paths=pdb_paths,
    graphein_config=config,
)
ds[0]

Processing...
100%|██████████| 3/3 [00:00<00:00,  8.88it/s]
100%|██████████| 1/1 [00:00<00:00,  1.20it/s]
Done!

Data(edge_index=[2, 485], node_id=[486], coords=[1], name=[1], dist_mat=[1], num_nodes=486)

When I add a convertor and explicitly include "amino_acid_one_hot" in its list of columns, it works.

However, this way I have to know the attribute name produced by the amino_acid_one_hot function and add it to the convertor's column list myself, which is inconvenient.

import os  # needed for os.path.join and os.listdir below

from graphein.ml.conversion import GraphFormatConvertor
from graphein.ml import ProteinGraphDataset
from graphein.protein.features.nodes.amino_acid import amino_acid_one_hot
from graphein.protein.config import ProteinGraphConfig

params_to_change = {"granularity": "centroids", "node_metadata_functions": [amino_acid_one_hot]}

config = ProteinGraphConfig(**params_to_change)

local_dir = "graphein/pdb/"
pdb_paths = [os.path.join(local_dir, pdb_path) for pdb_path in os.listdir(local_dir) if pdb_path.endswith(".pdb")]

ds = ProteinGraphDataset(
    root="graphein/pdb/test_with_list",
    pdb_paths=pdb_paths,
    graphein_config=config,
    graph_format_convertor=GraphFormatConvertor(
        src_format="nx",
        dst_format="pyg",
        columns=[
            "edge_index",
            "coords",
            "dist_mat",
            "name",
            "node_id",
            "amino_acid_one_hot",
        ],
    ),
)
ds[0]

Data(edge_index=[2, 485], node_id=[486], coords=[1], amino_acid_one_hot=[486], name=[1], dist_mat=[1], num_nodes=486)

ds[0]["amino_acid_one_hot"][0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

OK, I guess the problem comes from the conversion function, since it only adds the columns in the list. Is there a way to have it add all columns to the PyG Data object? It is hard to figure out what is happening with this kind of bug if you don't know how to configure GraphFormatConvertor. Alternatively, documentation explaining this behaviour would solve the problem.

Happy to discuss.
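
One way to avoid typing feature names by hand is to derive the column list from the graph itself. The sketch below is not part of graphein: `collect_node_keys` and `build_columns` are hypothetical helpers, and `sample_nodes` is made-up data standing in for what `g.nodes(data=True)` yields. The base column names are copied from the working example above, and string-valued attributes are skipped on the assumption that they are the ones most likely to cause trouble downstream.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Column names taken from the working example in this post.
BASE_COLUMNS = ["edge_index", "coords", "dist_mat", "name", "node_id"]

NodeData = Tuple[Any, Dict[str, Any]]


def collect_node_keys(nodes: Iterable[NodeData]) -> List[str]:
    """Return every node attribute name, in first-seen order."""
    seen: Dict[str, None] = {}
    for _, attrs in nodes:
        for key in attrs:
            seen.setdefault(key, None)
    return list(seen)


def build_columns(nodes: Iterable[NodeData], base: List[str] = BASE_COLUMNS) -> List[str]:
    """Base columns plus every non-string node attribute not already listed."""
    nodes = list(nodes)
    columns = list(base)
    for key in collect_node_keys(nodes):
        # Hypothetical rule: skip attributes that hold strings on any node.
        is_string = any(isinstance(attrs.get(key), str) for _, attrs in nodes)
        if key not in columns and not is_string:
            columns.append(key)
    return columns


# Made-up stand-in for `g.nodes(data=True)`.
sample_nodes = [
    ("A:MET:1", {
        "chain_id": "A",
        "residue_name": "MET",
        "residue_number": 1,
        "coords": [0.0, 0.0, 0.0],
        "b_factor": 50.0,
        "amino_acid_one_hot": [0] * 10 + [1] + [0] * 9,
    }),
]

columns = build_columns(sample_nodes)
print(columns)
```

The resulting list could then be passed as `columns=` to `GraphFormatConvertor`, as in the working example. Whether every collected attribute converts cleanly is not something this sketch checks.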


1511878618 · 2022-09-19T07:16:29Z

Author

@a-r-j, also, there are some small differences between the code in ProteinGraphDataset and in InMemoryProteinGraphDataset, so the graph label attributes are set differently. Has a PR already fixed this?


a-r-j · 2022-09-19T07:30:08Z

Maintainer

Hi @1511878618 ! So, there are a few reasons for this particular implementation.

1. To maintain consistency with other GDL frameworks (e.g. DGL, Jraph). DGL doesn't support certain datatypes (e.g. strings) as attributes on its data objects. Therefore, we thought it best for users to specify exactly which attributes they wish to include in the processed Data object, to avoid errors.
2. I think explicit is always better than implicit for readability.
3. There can be a lot of metadata that users may not wish to keep, e.g. distance matrices can be quite large.

You do raise a good point about the documentation. I think this should be made clearer.

With respect to your point about the InMemoryProteinGraphDataset - yes the changes implemented in ProteinGraphDataset have not been propagated yet. Would you be willing to make a PR?
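
The behaviour described in this reply, where only explicitly listed attributes reach the converted object, can be pictured with a small stand-alone sketch. This is not graphein's actual code: `select_columns` is a hypothetical function, and the plain dictionary with made-up values stands in for a graph's attributes.

```python
from typing import Any, Dict, Iterable


def select_columns(attrs: Dict[str, Any], columns: Iterable[str]) -> Dict[str, Any]:
    """Keep only the attributes named in `columns`; silently skip the rest."""
    wanted = list(columns)
    return {name: attrs[name] for name in wanted if name in attrs}


# Made-up attributes for a two-node graph.
graph_attrs = {
    "node_id": ["A:MET:1", "A:VAL:2"],
    "coords": [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]],
    "dist_mat": [[0.0, 1.5], [1.5, 0.0]],
    "amino_acid_one_hot": [[1, 0], [0, 1]],
}

# A list that does not mention the one-hot feature drops it without an error.
default_view = select_columns(graph_attrs, ["node_id", "coords"])

# Naming the feature explicitly carries it through.
explicit_view = select_columns(
    graph_attrs, ["node_id", "coords", "amino_acid_one_hot"]
)

print(sorted(default_view))
print(sorted(explicit_view))
```

An unlisted feature is dropped without any warning, which matches the symptom reported in the original post. Large items such as `dist_mat` also stay out unless requested.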

1 reply

1511878618 Sep 19, 2022
Author

Ok, I'll do a PR later
