
WISE result #493

Open
LiuJinzhe-Keepgoing opened this issue Mar 20, 2025 · 1 comment
Labels
question Further information is requested

Comments

@LiuJinzhe-Keepgoing

LiuJinzhe-Keepgoing commented Mar 20, 2025

I tried to reproduce the results in Table 2 of the WISE paper, and I ran into several problems when sequentially editing with ROME and MEMIT:

  1. I found that when T=1, ROME's accuracy in the table is 0.85. However, in my own experiments, rewrite_acc is 1.0 every time, i.e. the accuracy of a single edit is 100%. Is this correct? Why is the T=1 accuracy reported in the WISE paper so much lower?
        "post": {
            "rewrite_acc": [
                1.0
            ],
            "locality": {
                "Relation_Specificity_acc": [
                    0.0,
                    0.0
                ]
            },
            "portability": {
                "reasoning_acc": [
                    0.2
                ]
            },
            "fluency": {
                "ngram_entropy": 5.174247202938084
            }
        }

  2. I use the following code to summarize (summary_metrics) the per-edit evaluation results produced by sequential_edit. My code:
import json
import os

import numpy as np


def sequential_edit_summary_metrics(all_metrics):
    # Average the per-edit evaluation dicts produced by a sequential-editing run.
    if isinstance(all_metrics, dict):
        all_metrics = [all_metrics, ]

    # Dump the raw per-edit metrics for later inspection.
    logs_dir = './logs'
    if not os.path.exists(logs_dir):
        os.makedirs(logs_dir)
    output_file = os.path.join(logs_dir, 'results.json')
    with open(output_file, 'w', encoding="utf-8") as f:
        json.dump(all_metrics, f, ensure_ascii=False, indent=4)

    mean_metrics = dict()
    for eval in ["pre", "post"]:
        mean_metrics[eval] = dict()
        # Flat metrics: average directly across all edits.
        for key in ["rewrite_acc", "rephrase_acc", "rewrite_ppl"]:
            if key in all_metrics[0][eval].keys():
                mean_metrics[eval][key] = np.mean([metric[eval][key] for metric in all_metrics])
        # Nested metrics: average each *_acc sub-key over the edits that contain it.
        for key in ["locality", "portability"]:
            if key in all_metrics[0][eval].keys() and all_metrics[0][eval][key] != {}:
                mean_metrics[eval][key] = dict()
                for lkey in get_all_acc_keys(all_metrics):
                    metrics = [np.mean(metric[eval][key][lkey]) for metric in all_metrics if lkey in metric[eval][key].keys()]
                    if len(metrics) > 0:
                        mean_metrics[eval][key][lkey] = np.mean(metrics)
    # mean_metrics["time"] = np.mean([metric["time"] for metric in all_metrics])
    print("Metrics Summary: ", mean_metrics)

    return mean_metrics
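
For reference, the function above relies on the get_all_acc_keys helper from EasyEdit (from run_knowedit_llama2.py, if I remember correctly). A minimal stand-in, under the assumption that it simply collects every key ending in "acc" from the nested metric dicts, would be:

def get_all_acc_keys(dict_list):
    # Assumption: gather every key ending in "acc" anywhere in the nested per-edit dicts.
    all_keys = set()

    def recurse(d):
        for k, v in d.items():
            if k.endswith("acc"):
                all_keys.add(k)
            if isinstance(v, dict):
                recurse(v)

    for d in dict_list:
        recurse(d)
    return all_keys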

The mean result is as follows:

     "post": {
        "rewrite_acc": 1.0,
        "locality": {
            "Relation_Specificity_acc": 0.0
        },
        "portability": {
            "reasoning_acc": 0.2
        }
    }

But I noticed that the fluency metric is missing from the summary. Should I simply add "fluency" to the for key in ["locality", "portability"] loop? (See the sketch below.)
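
A minimal sketch of how fluency could be folded in, assuming get_all_acc_keys only collects keys ending in "acc" (so ngram_entropy would otherwise be skipped) and that the fluency dict contains only scalar values; the snippet would sit inside the existing for eval in ["pre", "post"] loop:

# Sketch (assumption: fluency holds only scalars such as ngram_entropy, whose key
# does not end in "acc" and so would not be picked up by get_all_acc_keys).
if "fluency" in all_metrics[0][eval]:
    mean_metrics[eval]["fluency"] = {
        fkey: np.mean([metric[eval]["fluency"][fkey]
                       for metric in all_metrics
                       if fkey in metric[eval].get("fluency", {})])
        for fkey in all_metrics[0][eval]["fluency"]
    }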

  3. In the results of editing ROME_zsre_llama-2-7b-HF with sequential_edit=true, locality is about 0.25, which is very different from the 0.75 shown in Table 2 of WISE. Is there something I am misunderstanding or omitting? In addition, how should portability be calculated when it has multiple sub-values? (One possible aggregation is sketched after the JSON below.)
    My mean metrics computed with the code above:
      "post": {
        "rewrite_acc": 0.9888888888888889,
        "locality": {
            "Relation_Specificity_acc": 0.25625
        },
        "portability": {
            "reasoning_acc": 0.51,
            "Subject_Aliasing_acc": 0.3333333333333333,
            "Logical_Generalization_acc": 0.48809523809523814
        }
    }
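
I am not sure how Table 2 aggregates these sub-categories; as a sketch, an unweighted average over the sub-metric means (and likewise for locality) would be:

# Sketch: unweighted average over the locality / portability sub-metrics in mean_metrics.
# (Assumption: this is how the single Loc. / Por. numbers in Table 2 are obtained.)
loc_overall = np.mean(list(mean_metrics["post"]["locality"].values()))
por_overall = np.mean(list(mean_metrics["post"]["portability"].values()))
print("Loc.:", loc_overall, "Por.:", por_overall)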
  4. My understanding is that the metric Rel. (a.k.a. edit success rate [10]) corresponds to rewrite_acc,
    and Loc. (localization success rate [55]) corresponds to the average of all the numbers under locality.
    How should I calculate Gen. (generalization success rate [55])?
    I edited the ZsRE dataset with run_knowedit_llama2.py and then evaluated it with edit_evaluation, but I did not get the corresponding metric. What should I do? (A small check is sketched below.)
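
If I understand correctly, Gen. is measured on rephrased prompts, which the summary code above would pick up as rephrase_acc when it is present. A minimal check, assuming that field name, would be:

# Sketch: check whether the per-edit results contain rephrase_acc (generalization).
# Assumption: if it is missing, the rephrased ZsRE prompts were probably not passed
# to the editing/evaluation step, so Gen. was never computed.
if all("rephrase_acc" in m["post"] for m in all_metrics):
    gen = np.mean([np.mean(m["post"]["rephrase_acc"]) for m in all_metrics])
    print("Gen. (mean rephrase_acc):", gen)
else:
    print("rephrase_acc is missing from the post metrics; Gen. was not evaluated")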

  5. In addition, I would like to know whether there is anywhere I can obtain the evaluation results for num={1, 10, 100, 500, 1000} sequential edits with ROME, MEMIT, and WISE on the ZsRE and WikiData_counterfact datasets with the Llama-2-7b-hf model.

Thank you, and I look forward to your advice.

@zxlzr added the question label Mar 20, 2025
@zxlzr
Contributor

zxlzr commented Mar 20, 2025

Thank you for your interest in EasyEdit and WISE. We will arrange for a team member to respond to your questions as soon as possible.
