Improve the accuracy on ogbn-papers100M with GAT model, add undirected graph on ogbn-products #9467

Open · wants to merge 90 commits into master

Conversation

@guanxingithub (Contributor) commented Jun 27, 2024:

Improve the test accuracy to >60% on ogbn-papers100M; consolidate repetitive examples.

guanxingithub and others added 27 commits June 3, 2024 19:51
2. Add undirected graph on ogbn-products with Sage model
3. Add undirected graph on ogbn-products with GAT model
@puririshi98 (Contributor):

@akihironitta @rusty1s anything else needed to merge?

@akihironitta (Member) left a comment:

This is great! 🚀


inference_start = time.time()
val_acc = test(val_loader)
test_acc = test(test_loader)
Member:

I don't understand why we need to evaluate the model on the test set every epoch.

Contributor:

Agreed. @guanxingithub, please make the test evaluation happen only at the end of all training epochs.

Contributor Author:

All evaluations are used for data analysis after training is done. With all of this statistical data collected during training, we can do further performance and accuracy analysis.

Contributor Author:

If you agree, I can put all of this extra testing and evaluation under a verbose mode. From the user's point of view, they wouldn't need to run these extra evaluations, but anyone who wants to do performance and accuracy analysis can turn on verbose mode to collect raw performance and accuracy data for further statistical analysis.

Contributor:

No, you should simply not be running the test set during training; all other metrics are fine. Please adjust accordingly.
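
A rough sketch of the requested structure, for illustration only (train, train_loader, and num_epochs are assumed names; test, val_loader, and test_loader are taken from the excerpt above, so this is not the PR's exact code):

best_val = 0.0
for epoch in range(1, num_epochs + 1):
    loss = train(train_loader)
    val_acc = test(val_loader)        # per-epoch validation is fine
    best_val = max(best_val, val_acc)
    print(f'Epoch {epoch:02d}: loss={loss:.4f}, val_acc={val_acc:.4f}')

test_acc = test(test_loader)          # the test set is evaluated exactly once, after training
print(f'Best val acc: {best_val:.4f}, final test acc: {test_acc:.4f}')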

Contributor Author:

If we keep the evaluation results on the test set for each epoch, that will be very helpful for further analysis. I did the same evaluation on the test set with other tools; if we can do the same evaluation in PyG, it will be very helpful for comparing accuracy between PyG and other tools apples to apples. Thanks.

Member:

"that will be very helpful for further analysis"

What kind of analysis are you referring to specifically?

Contributor Author:

Performance and accuracy analysis across platforms, across different tools, and across different versions of those tools.

Contributor Author:

For example, for platform 1 and platform 2, we collect all of this performance and accuracy data and then analyze the data from both platforms to test hypotheses such as whether the performance is the same or different and whether the accuracy differs, from a statistical point of view. If there is a significant difference, we investigate what could cause it. This builds the foundation for further investigation and improvement. We have done a lot of this kind of analysis not only in PyG but also in other tools.

Comment on lines +239 to +242
if val_acc > best_val:
    best_val = val_acc
if test_acc > best_test:
    best_test = test_acc
Member:

Related to my above comment: I don't understand why it's ok to pick the best metric evaluated on the test set across all epochs.

Contributor:

Well, this was just for tracking, it seems. There is no checkpointing being done, but the test set should only be run once at the end anyway. @guanxingithub, please remove the test accuracy tracking; the test set should only be run once at the end and reported then.
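
One common pattern for making the single final test number correspond to the best validation epoch is to checkpoint the weights at the best validation accuracy and run the test set once from that checkpoint. This is only a sketch with assumed names (model, train, test, num_epochs, and the loaders), not what the PR currently does:

import copy

best_val, best_state = 0.0, None
for epoch in range(1, num_epochs + 1):
    train(train_loader)
    val_acc = test(val_loader)
    if val_acc > best_val:
        best_val = val_acc
        best_state = copy.deepcopy(model.state_dict())   # keep the best-val weights

model.load_state_dict(best_state)     # restore the best-val weights
test_acc = test(test_loader)          # single test-set evaluation at the very end
print(f'Best val acc: {best_val:.4f}, test acc at best val: {test_acc:.4f}')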

Contributor Author:

The best metric evaluated on the test set is one of the indices used for statistical accuracy analysis. With all of the information collected during training, we can compare the performance and accuracy of a particular model across different platforms and across different versions of the tools used to run the training.

Contributor:

@guanxingithub, in general in machine learning it is not proper to run the test set before training/validation has finished. Tracking the best validation accuracy across multiple epochs for perflab is fine, but there should only be one final test accuracy for them to track; running the test set during training is sort of "cheating". Please address my and Akihiro's concerns in order to get this merged.

Contributor Author:

We don't train on the test set; we just evaluate accuracy on the test set after the model has been trained on the training set.

Member:

Hmmm, still not sure why it makes sense to evaluate the model on the test set during training. Just to make sure, I'm suggesting deleting L233 and L241-L242.

best_test = test_acc
times.append(time.time() - train_start)

if verbose:
Member:

nit: The concern I previously had was having too many redundant/meaningless print statements in the example, but now the script looks pretty minimal. I think it's ok to just remove this verbose flag.

Contributor Author:

The redundant/meaningless prints were intended to collect raw performance and accuracy data so that we can do further performance and accuracy analysis after running the scripts on different platforms with different versions of the tools.

Contributor:

It's fine now. Just remove the verbose flag and let all prints come out (except the test accuracy), and please only run the test set once at the end.

@akihironitta (Member) left a comment:

Let's also make sure the filenames are updated in the entire repo, e.g.,

$ git grep ogbn_products_sage.py
README.md:- **[SAGEConv](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.conv.SAGEConv.html)** from Hamilton *et al.*: [Inductive Representation Learning on Large Graphs](https://arxiv.org/abs/1706.02216) (NIPS 2017) \[[**Example1**](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/reddit.py), [**Example2**](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/ogbn_products_sage.py), [**Example3**](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/graph_sage_unsup.py), [**Example4**](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/graph_sage_unsup_ppi.py)\]
README.md:- **[NeighborLoader](https://pytorch-geometric.readthedocs.io/en/latest/modules/loader.html#torch_geometric.loader.NeighborLoader)** from Hamilton *et al.*: [Inductive Representation Learning on Large Graphs](https://arxiv.org/abs/1706.02216) (NIPS 2017) \[[**Example1**](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/reddit.py), [**Example2**](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/ogbn_products_sage.py), [**Example3**](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/ogbn_products_gat.py), [**Example4**](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/hetero/to_hetero_mag.py)\]
torch_geometric/loader/neighbor_sampler.py:        `examples/ogbn_products_sage.py
torch_geometric/loader/neighbor_sampler.py:        ogbn_products_sage.py>`_.

examples/ogbn_train.py (review comment outdated, resolved)
Co-authored-by: Akihiro Nitta <[email protected]>