Potential Bugs in the hybrid_node.py Example Script #293

Open

vladislavalerievich opened this issue Feb 15, 2025 · 1 comment

vladislavalerievich commented Feb 15, 2025

Several possible issues have been observed in the example script examples/hybrid_node.py that may lead to unexpected behavior or unfair comparisons. The following points outline the bugs along with the problematic code lines and suggested improvements:

  1. LightGBM Test Metrics Printing:
    The script prints outdated test metrics (e.g., using metrics from the GNN model when evaluating LightGBM predictions).
    Bug:

     pred = model.predict(tf_test).numpy()
     print(f"Test: {test_metrics}")

    Improvement:

     pred = model.predict(tf_test).numpy()
     test_metrics = task.evaluate(pred)
     print(f"LightGBM Test metrics: {test_metrics}")
  2. State Dict Path Formatting:
    Bug:
    The state dict file path is defined with placeholders in a plain string:

    STATE_DICT_PTH = "results/{args.dataset}_{args.task}_state_dict.pth"

    Improvement:
    Use an f-string so that the placeholders are replaced with actual values:

    STATE_DICT_PTH = f"results/{args.dataset}_{args.task}_state_dict.pth"
  3. Sample Size Usage for GNN Model:
    Issue:
    The GNN model does not use the sample_size parameter during training. As a result, while the LightGBM model is trained on a subsampled dataset (the first sample_size rows), the GNN model is trained on the full training set. This discrepancy can lead to an unfair comparison between the two models (one possible fix is sketched after this list).

  4. Entity Table Overwriting:
    Bug:
    The script reassigns the entity_table variable in a loop when creating loaders for each split. For example:

    for split in ["train", "val", "test"]:
        table = task.get_table(split)
        table_input = get_node_train_table_input(table=table, task=task)
        entity_table = table_input.nodes[0]  # This gets overwritten each iteration
        ...

    Improvement:
    Instead of overwriting, maintain a mapping for each split and reference the correct table. For instance:

    entity_table_mapping: Dict[str, str] = {}
    for split in ["train", "val", "test"]:
        table = task.get_table(split)
        table_input = get_node_train_table_input(table=table, task=task)
        entity_table_mapping[split] = table_input.nodes[0]
        ...
    # Later reference the appropriate entity table, e.g., using task.entity_table = entity_table_mapping["train"]
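
Regarding point 3, one possible fix is sketched below: restrict the GNN's training table to the same first sample_size rows that LightGBM is trained on before building the loader. This is only a sketch; the args.sample_size guard and the in-place truncation of table.df are assumptions about the script's structure, not its actual code:

    for split in ["train", "val", "test"]:
        table = task.get_table(split)
        if split == "train" and args.sample_size is not None:
            # Keep only the first sample_size rows, mirroring the LightGBM
            # subsample, so both models see the same training data.
            table.df = table.df.head(args.sample_size)
        table_input = get_node_train_table_input(table=table, task=task)
        entity_table_mapping[split] = table_input.nodes[0]
        ...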

Addressing these issues may help improve the robustness of the example script and ensure a fair comparison between the GNN and LightGBM models.


@vladislavalerievich vladislavalerievich changed the title from "Potential Bugs in the Example Script: Entity Table Overwrite, State Dict Path, and Sample Size Usage" to "Potential Bugs in the hybrid_node.py Example Script" on Feb 15, 2025
@rishabh-ranjan
Collaborator

Thanks for pointing this out @vladislavalerievich! Since you have proposed the improvements, can you make a quick PR with these changes? We will be happy to merge to main!
