GraphFormatConvertor cause a little bug on dataset loader #212
-
Just use:
from graphein.protein.graphs import construct_graph
from graphein.protein.features.nodes.amino_acid import amino_acid_one_hot
from graphein.protein.config import ProteinGraphConfig
config = ProteinGraphConfig(**{"node_metadata_functions": [amino_acid_one_hot]})
g = construct_graph(config=config, pdb_path="graphein/pdb/AF-P51608-F1-model_v3.pdb")
for n, k in g.nodes(data=True):
    print(k.keys())
    break

/opt/conda/lib/python3.7/site-packages/rich/live.py:229: UserWarning: install "ipywidgets" for Jupyter support
  warnings.warn('install "ipywidgets" for Jupyter support')
/opt/conda/lib/python3.7/site-packages/biopandas/pdb/pandas_pdb.py:681: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  idxs["end_idx"] = ends.line_idx.values
As you can see, when I add … Loading data from this pipeline, and when I call …
from graphein.ml.conversion import GraphFormatConvertor
from graphein.ml import ProteinGraphDataset
import os
from graphein.protein.features.nodes.amino_acid import amino_acid_one_hot
from graphein.protein.config import ProteinGraphConfig
params_to_change = {"granularity": "centroids", "node_metadata_functions": [amino_acid_one_hot]}
config = ProteinGraphConfig(**params_to_change)
local_dir = "graphein/pdb/"
pdb_paths = [os.path.join(local_dir, pdb_path) for pdb_path in os.listdir(local_dir) if pdb_path.endswith(".pdb")]
ds = ProteinGraphDataset(
    root="graphein/pdb/test_without_list",
    pdb_paths=pdb_paths,
    graphein_config=config,
)
ds[0]
While I add … However, in this way, I need to specify the name which comes from …
import os  # needed for os.path.join and os.listdir below
from graphein.ml.conversion import GraphFormatConvertor
from graphein.ml import ProteinGraphDataset
from graphein.protein.features.nodes.amino_acid import amino_acid_one_hot
from graphein.protein.config import ProteinGraphConfig
params_to_change = {"granularity": "centroids", "node_metadata_functions": [amino_acid_one_hot]}
config = ProteinGraphConfig(**params_to_change)
local_dir = "graphein/pdb/"
pdb_paths = [os.path.join(local_dir, pdb_path) for pdb_path in os.listdir(local_dir) if pdb_path.endswith(".pdb")]
ds = ProteinGraphDataset(
    root="graphein/pdb/test_with_list",
    pdb_paths=pdb_paths,
    graphein_config=config,
    graph_format_convertor=GraphFormatConvertor(
        src_format="nx",
        dst_format="pyg",
        columns=[
            "edge_index",
            "coords",
            "dist_mat",
            "name",
            "node_id",
            "amino_acid_one_hot",
        ],
    ),
)
ds[0]
ds[0]["amino_acid_one_hot"][0]
OK, I guess the problem comes from … I wish to discuss this.
Replies: 2 comments 1 reply
-
@a-r-j, also, there are some small differences between the code of … and in …, so the graph label attribute setting will be different. I'm wondering whether some PR has already fixed this?
-
Hi @1511878618! So, there are a few reasons for this particular implementation:
1. To maintain consistency with other GDL frameworks (e.g. DGL, Jraph, etc.). DGL doesn't support certain datatypes (e.g. strings) as attributes on its data objects. Therefore, we thought it best for users to specify exactly which attributes they wish to include in the processed Data object to avoid errors.
2. I think explicit is always better than implicit for readability.
3. There can be a lot of metadata that users may not wish to keep. E.g. distance matrices can be quite large.
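To illustrate the reasoning above, here is a minimal sketch in plain Python of what explicit column selection implies. It is not Graphein's actual implementation: `select_columns` is a hypothetical helper, and the node dictionaries are made-up toy data standing in for node attributes on a NetworkX graph. The point is only that an attribute which is not named in the list does not make it into the converted object.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical stand-in for the attribute dict attached to each graph node.
NodeAttrs = Dict[str, Any]


def select_columns(
    nodes: Iterable[NodeAttrs], columns: List[str]
) -> Dict[str, List[Any]]:
    """Collect only the explicitly requested attributes, one list per column.

    Illustrative helper, not part of Graphein.
    """
    selected: Dict[str, List[Any]] = {name: [] for name in columns}
    for attrs in nodes:
        for name in columns:
            if name in attrs:
                selected[name].append(attrs[name])
    return selected


# Toy node attributes (made up for this example).
nodes = [
    {"node_id": "A:MET:1", "coords": [0.0, 1.0, 2.0], "amino_acid_one_hot": [1, 0]},
    {"node_id": "A:ALA:2", "coords": [1.0, 2.0, 3.0], "amino_acid_one_hot": [0, 1]},
]

# The feature is dropped unless it is listed explicitly.
without_feature = select_columns(nodes, ["node_id", "coords"])
with_feature = select_columns(nodes, ["node_id", "coords", "amino_acid_one_hot"])

print(sorted(without_feature))
print(sorted(with_feature))
```

Under this assumption, the behaviour in the original post follows: a feature added via `node_metadata_functions` exists on the source graph, but it only survives conversion when its name is also included in `columns`.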
You do raise a good point about the documentation. I think this should be made clearer.
With respect to your point about the
InMemoryPr…