
[Question]: The evaluation code of scbench does not match the provided dataset. #103

Open
rainstorm12 opened this issue Dec 26, 2024 · 5 comments
Labels: question (Further information is requested)
rainstorm12 commented Dec 26, 2024

Describe the bug

When I tested the "scbench_kv" task provided by SCBench, I encountered the following error in compute_scores.py during evaluation.

[rank0]:   File "/myfile/MInference-main/scbench/compute_scores.py", line 365, in get_score_one
[rank0]:     assert task_name in NAME_TO_SCORE_GETTER, f"Invalid task name: {task_name}"
[rank0]: AssertionError: Invalid task name: scbench_kv

I found that the task names registered in compute_scores.py are as follows; they do not match the task names of the SCBench dataset.

def get_score_one(pred: str, label: str, task_name: str, model_name: str) -> float:
    """
    Computes the score for one prediction.
    Returns one float (zero and one for boolean values).
    """
    NAME_TO_SCORE_GETTER = {
        # Retrieve
        "kv_retrieval": get_score_one_kv_retrieval,
        "kv_retrieval_prefix": get_score_one_kv_retrieval,
        "kv_retrieval_both": get_score_one_kv_retrieval,
        "passkey": get_score_one_passkey,
        "number_string": get_score_one_number_string,
        # Code
        "code_run": get_score_one_code_run,
        "code_debug": get_score_one_code_debug,
        # Longbook
        "longdialogue_qa_eng": get_score_one_longdialogue_qa_eng,
        "longbook_qa_eng": get_score_one_longbook_qa_eng,
        "longbook_sum_eng": get_score_one_longbook_sum_eng,
        "longbook_choice_eng": get_score_one_longbook_choice_eng,
        "longbook_qa_chn": get_score_one_longbook_qa_chn,
        # Math
        "math_find": get_score_one_math_find,
        "math_calc": get_score_one_math_calc,
        # multi-turn nativ
        "multi_turn_summary": get_score_one_longbook_sum_eng,
        "multi_turn_vt": string_match_all,
        "multi_turn_many_shot": get_score_one_longdialogue_qa_eng,
        "multi_turn_kv_compressible": get_score_one_kv_retrieval,
    }
    assert task_name in NAME_TO_SCORE_GETTER, f"Invalid task name: {task_name}"
    score = NAME_TO_SCORE_GETTER[task_name](pred, label, model_name)
    return float(score)
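As a stopgap before pulling the upstream fix, one could alias the SCBench dataset names onto the scorers that already exist — a minimal sketch, with hypothetical alias pairs that are not the actual mapping in the fixed compute_scores.py:

```python
# Hypothetical aliases from SCBench dataset names to the keys that
# NAME_TO_SCORE_GETTER already knows; the real mapping is in the fix (#101).
SCBENCH_ALIASES = {
    "scbench_kv": "kv_retrieval",
    "scbench_passkey": "passkey",
}

def resolve_task_name(task_name: str) -> str:
    """Translate an SCBench task name into a registered scorer key,
    passing through names that are already registered."""
    return SCBENCH_ALIASES.get(task_name, task_name)
```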
@rainstorm12 rainstorm12 added the bug Something isn't working label Dec 26, 2024
@iofu728 iofu728 self-assigned this Dec 26, 2024
@iofu728 iofu728 added question Further information is requested and removed bug Something isn't working labels Dec 26, 2024
@iofu728 iofu728 changed the title [Bug]: The evaluation code of scbench does not match the provided dataset. [Question]: The evaluation code of scbench does not match the provided dataset. Dec 26, 2024
iofu728 (Contributor) commented Dec 26, 2024

Hi @rainstorm12, thank you for pointing out this issue.

We have already fixed it in #101.

Please fetch the updated code and let us know if you encounter any further problems!

git clone https://github.com/microsoft/MInference
pip install -e .

rainstorm12 (Author) commented
Thank you very much for your help! That solved my problem!
However, when I try a multi-task run, I hit a new problem. My test.sh file is as follows:

python run_scbench.py \
    --task scbench_repoqa_and_kv \
    --model_name_or_path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --data_dir ./data \
    --output_dir ./results \
    --rewrite \
    --attn_type minference \
    --kv_type dense \
    --use_chat_template \
    --trust_remote_code

and the error is as follows:

==== Evaluation scbench_repoqa_and_kv====
# examples: 88
Num eval examples: -1
Verbose: False
Max new tokens: {'scbench_repoqa': 1024, 'scbench_kv': 80}
Num of turns: 5
0it [00:00, ?it/s]# tokens before: 67598
# tokens after: 67598
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/myfile/MInference/scbench/run_scbench.py", line 397, in <module>
    pred = get_pred(
  File "/myfile/MInference/scbench/run_scbench.py", line 125, in get_pred
    outputs = model.test(
  File "/myfile/MInference/scbench/eval_utils.py", line 1246, in test
    max_length_per_turn = max_length[example["task"][idx]]
KeyError: 'multi_turn_kv'

The key 'multi_turn_kv' is not present in the Max new tokens dict. When I run the scbench_summary_with_needles task, I get a similar error:

==== Evaluation scbench_summary_with_needles====
# examples: 70
Num eval examples: -1
Verbose: False
Max new tokens: {'scbench_summary': 800, 'scbench_passkey': 15}
Num of turns: 5
0it [00:00, ?it/s]# tokens before: 98057
# tokens after: 97962
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/myfile/MInference/scbench/run_scbench.py", line 397, in <module>
    pred = get_pred(
  File "/myfile/MInference/scbench/run_scbench.py", line 125, in get_pred
    outputs = model.test(
  File "/myfile/MInference/scbench/eval_utils.py", line 1246, in test
    max_length_per_turn = max_length[example["task"][idx]]
KeyError: 'multi_turn_passkey'
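Both tracebacks have the same failure mode: `example["task"][idx]` yields a key (`multi_turn_kv`, `multi_turn_passkey`) that the `max_length` dict does not contain. A defensive lookup — a hypothetical helper to illustrate the mismatch, not the upstream code — would surface the missing key instead of raising a bare `KeyError`:

```python
def max_tokens_for_turn(max_length: dict, task_key: str, default: int = 512) -> int:
    """Return the per-turn token budget for task_key, falling back to a
    default (and warning) when the dataset's task key is absent."""
    if task_key not in max_length:
        print(f"warning: task {task_key!r} not in max-new-tokens config "
              f"(known: {sorted(max_length)}); using default={default}")
        return default
    return max_length[task_key]
```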

iofu728 (Contributor) commented Dec 30, 2024

Hi @rainstorm12,

This issue is due to an update in the SCBench HF dataset. You need to add download_mode='force_redownload'.

data = load_dataset("microsoft/SCBench", dataset, split="test", download_mode='force_redownload')

Let me know if it works!

rainstorm12 (Author) commented Dec 30, 2024

Thank you for your reply.
However, my server's network is not very stable, and downloading the data directly from the code times out. That's why I had previously downloaded your dataset from the Hugging Face website instead.

from datasets import load_dataset
data_name = "scbench_kv"
data = load_dataset("./SCBench", data_name, split="test")

Before you updated the dataset, I was able to load the data this way. After downloading the updated dataset, I encountered the following error.

    builder_config = self.builder_configs.get(config_name)
    if builder_config is None and self.BUILDER_CONFIGS:
        raise ValueError(
            f"BuilderConfig '{config_name}' not found. Available: {list(self.builder_configs.keys())}"
        )

ValueError: BuilderConfig 'scbench_kv' not found. Available: ['default']

iofu728 (Contributor) commented Dec 31, 2024

Hi @rainstorm12,

Thank you for your feedback! However, I didn’t encounter any issues when running your code locally.

In [1]: from datasets import load_dataset
   ...: data_name = "scbench_kv"
   ...: data = load_dataset("./SCBench", data_name, split="test")
Generating test split: 100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1545.74 examples/s]

Please check the following:

  1. Ensure git lfs is enabled and the git clone process was completed successfully.
  2. Verify your datasets version. I’m using version 3.2.0.
~ ll SCBench/*
-rw-r--r-- 1 aiscuser aiscuser  14K Dec 30 22:11 SCBench/README.md

SCBench/data:
total 649M
-rw-r--r-- 1 aiscuser aiscuser 178K Dec 30 22:11 comparison.png
-rw-r--r-- 1 aiscuser aiscuser 337K Dec 30 22:11 framework.png
-rw-r--r-- 1 aiscuser aiscuser 299K Dec 30 22:11 overview.png
-rw-r--r-- 1 aiscuser aiscuser 6.9K Dec 30 22:11 readme.md
-rw-r--r-- 1 aiscuser aiscuser 645K Dec 30 22:11 results.png
-rw-r--r-- 1 aiscuser aiscuser  46M Dec 30 22:11 scbench_choice_eng.jsonl
-rw-r--r-- 1 aiscuser aiscuser  21M Dec 30 22:11 scbench_kv.jsonl
-rw-r--r-- 1 aiscuser aiscuser 4.7M Dec 30 22:11 scbench_many_shot.jsonl
-rw-r--r-- 1 aiscuser aiscuser  14M Dec 30 22:11 scbench_mf.jsonl
-rw-r--r-- 1 aiscuser aiscuser  17M Dec 30 22:11 scbench_prefix_suffix.jsonl
-rw-r--r-- 1 aiscuser aiscuser 344M Dec 30 22:12 scbench_qa_chn.jsonl
-rw-r--r-- 1 aiscuser aiscuser  57M Dec 30 22:11 scbench_qa_eng.jsonl
-rw-r--r-- 1 aiscuser aiscuser  25M Dec 30 22:11 scbench_repoqa_and_kv.jsonl
-rw-r--r-- 1 aiscuser aiscuser  25M Dec 30 22:11 scbench_repoqa.jsonl
-rw-r--r-- 1 aiscuser aiscuser  28M Dec 30 22:11 scbench_summary.jsonl
-rw-r--r-- 1 aiscuser aiscuser  28M Dec 30 22:11 scbench_summary_with_needles.jsonl
-rw-r--r-- 1 aiscuser aiscuser  42M Dec 30 22:11 scbench_vt.jsonl

SCBench/scbench_choice_eng:
total 28M
-rw-r--r-- 1 aiscuser aiscuser 28M Dec 30 22:11 test-00000-of-00001.parquet

SCBench/scbench_kv:
total 18M
-rw-r--r-- 1 aiscuser aiscuser 18M Dec 30 22:11 test-00000-of-00001.parquet

SCBench/scbench_many_shot:
total 100K
-rw-r--r-- 1 aiscuser aiscuser 98K Dec 30 22:11 test-00000-of-00001.parquet

SCBench/scbench_mf:
total 3.6M
-rw-r--r-- 1 aiscuser aiscuser 3.6M Dec 30 22:11 test-00000-of-00001.parquet

SCBench/scbench_prefix_suffix:
total 16M
-rw-r--r-- 1 aiscuser aiscuser 16M Dec 30 22:11 test-00000-of-00001.parquet

SCBench/scbench_qa_chn:
total 111M
-rw-r--r-- 1 aiscuser aiscuser 111M Dec 30 22:11 test-00000-of-00001.parquet

SCBench/scbench_qa_eng:
total 34M
-rw-r--r-- 1 aiscuser aiscuser 34M Dec 30 22:11 test-00000-of-00001.parquet

SCBench/scbench_repoqa:
total 4.3M
-rw-r--r-- 1 aiscuser aiscuser 4.3M Dec 30 22:11 test-00000-of-00001.parquet

SCBench/scbench_repoqa_and_kv:
total 8.2M
-rw-r--r-- 1 aiscuser aiscuser 8.2M Dec 30 22:11 test-00000-of-00001.parquet

SCBench/scbench_summary:
total 14M
-rw-r--r-- 1 aiscuser aiscuser 14M Dec 30 22:11 test-00000-of-00001.parquet

SCBench/scbench_summary_with_needles:
total 14M
-rw-r--r-- 1 aiscuser aiscuser 14M Dec 30 22:11 test-00000-of-00001.parquet

SCBench/scbench_vt:
total 2.1M
-rw-r--r-- 1 aiscuser aiscuser 2.1M Dec 30 22:11 test-00000-of-00001.parquet
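One quick way to check point 1 — that git-lfs actually materialized the shards rather than leaving pointer stubs — is to inspect a file's leading bytes: an un-pulled LFS file starts with an ASCII `version https://git-lfs.github.com/spec/` header, while a real parquet file starts with the magic bytes `PAR1`. A small hypothetical checker:

```python
def classify_lfs_file(path: str) -> str:
    """Return 'lfs-pointer' for an un-pulled git-lfs stub, 'parquet' for a
    file with the parquet magic, and 'unknown' otherwise."""
    with open(path, "rb") as f:
        head = f.read(64)
    if head.startswith(b"version https://git-lfs.github.com/spec/"):
        return "lfs-pointer"
    if head.startswith(b"PAR1"):
        return "parquet"
    return "unknown"
```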
