
Some Questions About the Evaluation. #7

Closed
LimOkii opened this issue Dec 8, 2024 · 9 comments

Comments

@LimOkii

LimOkii commented Dec 8, 2024

Thanks for your work on tool use!
However, I have some questions about the results.

I've noticed that you adopted StableToolBench for your benchmark, and in the code you provided you use the same solvable_queries as StableToolBench when checking whether the Final Answer has solved the user's query. However, I'm quite puzzled that, although some of your evaluation settings are the same as StableToolBench's, the difference in the results is very significant.

For example, the results of ToolLLaMa (CoT & DFSDT) reported in StableToolBench:

[screenshot: results table from StableToolBench]

The results of ToolLLaMa (CoT & DFSDT) reported in StepTool:

[screenshot: results table from StepTool]

I've run a large number of experiments on StableToolBench and found that, even with the caching system, many APIs that used to be accessible have become inaccessible. Moreover, in the cache provided by StableToolBench, some query keys that appeared in its experiments were never saved to the cache. So even though StableToolBench proposes a more stable version of the benchmark, more APIs will become inaccessible over time, and it is actually very difficult to match the reported results on the same test set when reproducing the experiments later.

@chenhengzh

chenhengzh commented Dec 14, 2024

I could not find the results you listed in StableToolBench. Here are the results I found in the StableToolBench paper, which are significantly higher than the ones you provided:

[screenshot: results table from the StableToolBench paper]

However, I also reproduced the experiments on ToolLlama (DFS). The results I obtained were consistently lower than those reported in StepTool and StableToolBench, with a difference of around 10 points in all three tests for I1, as shown in the table below:

          my reproduction   StepTool
I1 ins    44.3 ± 1.4        57.0 ± 1.0
I1 cat    49.7 ± 1.2        52.3 ± 1.5
I1 tool   42.8 ± 3.9        57.5 ± 1.2

I'm not sure if the difference is caused by a higher number of inaccessible APIs or by the use of a different model in ToolEval. Could you provide more information about your experiments, such as the model used in your ToolEval? @yuyq18
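(For clarity on the notation: each entry above is a mean ± standard deviation over repeated evaluations of the same tasks. Below is a minimal sketch of how such numbers can be computed; the pass/fail lists are made-up placeholders, not real ToolEval outputs.)

```python
import statistics

# Made-up placeholder judgments: 1 = query judged solved, 0 = not solved.
runs = [
    [1, 0, 1, 1, 0],  # evaluation run 1
    [1, 0, 0, 1, 0],  # evaluation run 2
    [1, 1, 1, 1, 0],  # evaluation run 3
]

pass_rates = [100 * sum(run) / len(run) for run in runs]
mean = statistics.mean(pass_rates)
std = statistics.stdev(pass_rates)
print(f"pass rate: {mean:.1f} ± {std:.1f}")  # e.g. "pass rate: 60.0 ± 20.0"
```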

@LimOkii
Author

LimOkii commented Dec 15, 2024

(quoting @chenhengzh's comment above)

Thank you very much for sharing. In fact, StableToolBench has since updated its evaluation results. The paper you downloaded contains the results of the earlier experiments, for which the evaluation model was GPT-4-turbo-preview.
The StableToolBench authors subsequently switched to GPT-4-turbo-2024-04-09 as the evaluation model and updated the reported results; all the metrics dropped significantly.
Perhaps you can find the latest version of the paper on arXiv.

@chenhengzh

Thank you! I missed the update. I noticed that the evaluation model has a significant impact on the results, so I’m more curious about the experimental details from the authors.

@yuyq18
Owner

yuyq18 commented Dec 15, 2024

Thank you very much for the thorough discussion, and I truly appreciate @chenhengzh's efforts in reproducing our results. To address the discrepancy in ToolLlama's performance and the evaluation results, I would like to explain three main factors that could have contributed to the differences:

  1. Code Bugs: During our experiments with StableToolBench in August-September 2024, we encountered several bugs in both the StableToolBench and ToolBench repositories. Two significant bugs have already been reported in the ToolBench issue tracker (Issue 304 and Issue 305). In addition, there were smaller bugs related to error handling, which I fixed myself (e.g., the code that processes error responses in StableToolBench). The bug-fixed version is provided in my StepTool repository under the stabletoolbench directory. These issues could have led to degraded performance of ToolLlama during the evaluation.

  2. Caching System and API Inaccessibility: As described in the StableToolBench paper and code, the instability of API responses is addressed by a caching system. A critical feature of StableToolBench is its ability to generate a "fake" response using GPT-4 when an API is either inaccessible or produces an error (see the sketch after this list). While I believe fake responses may not fully represent real API interactions, they still provide a way to measure the tool-usage capability of different models under the same experimental conditions. If needed, we can share our tool response cache files to help with reproducing the results.

  3. Experimental Details: For our experiments, we used GPT-4-turbo-2024-04-09 as the evaluation model. The main experiments were conducted in late August and early September 2024, and the reported results are the average of three evaluations for each case. It's also worth noting that from June to August 2024, the RapidAPI server provided by ToolBench experienced downtimes, which left many real APIs unresponsive. After the server was fixed, we re-ran our experiments, as stated in the ToolBench repository ("2024.8 Update: We have updated the RapidAPI server with a new IP...").
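For reference, here is a minimal sketch of the cache-plus-fallback behavior described in point 2. The function names, cache layout, and the way the simulated response is produced are assumptions for illustration, not the actual StableToolBench implementation:

```python
import hashlib
import json

# Illustrative sketch only: the cache layout and helper signatures are assumptions.
CACHE = {}  # in practice this would be persisted to disk across runs


def cache_key(tool_name: str, api_name: str, arguments: dict) -> str:
    """Build a deterministic key for an API call so repeated runs hit the cache."""
    payload = json.dumps(
        {"tool": tool_name, "api": api_name, "args": arguments}, sort_keys=True
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def get_tool_response(tool_name, api_name, arguments, call_real_api, simulate_with_llm):
    """Return a cached response if available; otherwise call the real API and
    fall back to an LLM-simulated ("fake") response when the API is down or errors."""
    key = cache_key(tool_name, api_name, arguments)
    if key in CACHE:
        return CACHE[key]

    try:
        response = call_real_api(tool_name, api_name, arguments)  # dict, assumed
    except Exception:
        response = None

    # If the real API is inaccessible or returns an error, simulate a response so
    # that different models are still evaluated under the same conditions.
    if response is None or response.get("error"):
        response = simulate_with_llm(tool_name, api_name, arguments)

    CACHE[key] = response
    return response
```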

We hope this helps clarify the discrepancies. If you need further details or have additional questions, please feel free to reach out.

@chenhengzh

Thank you for your detailed response!

  1. I’m interested in the bugs of ToolBench and StableToolBench. Could you please provide more detailed information about them?
  2. I'm not familiar with the cache file. Is your file significantly different from the one in StableToolBench? If so, it would be greatly appreciated if you could share it.

By the way, I have some concerns about the reward assignment, as mentioned in the issue "Questions About StepTool Data Generation and Annotation". I hope you can help clarify this for me.

However, I still have some slight concerns regarding the current approach to reward assignment, particularly in cases where the final result is incorrect. For example, if an incorrect final answer is caused by a particular intermediate step, evaluating that step based solely on its contribution to the final answer could give it an undeservedly high reward. In reality, that step is a poor one, since it directly leads to an incorrect final answer, so this approach may not fully reflect its true impact. Perhaps it would be more appropriate to assess a step's contribution to addressing the task query itself, rather than its influence on the final answer.

I really appreciate your patience in responding to my questions. I have learned a lot from the discussion.

@yuyq18
Owner

yuyq18 commented Dec 15, 2024

Thank you for the follow-up questions. I'm happy to clarify further.

  1. Details on ToolBench Bugs: As mentioned in my previous response, the issues I linked contain detailed bug descriptions and the relevant code locations. You can check the specifics there:
    Issue 304
    Issue 305

  2. Cache System: I recommend referring to the StableToolBench paper and its code for a more in-depth explanation of the caching system. I’ll also work on organizing and uploading my version of the cache files as soon as possible.

  3. Reward Assignment Concern: As I mentioned earlier, our motivation is to avoid relying solely on the final reward to assess the quality of all intermediate steps; instead, we aim to distinguish between different steps and optimize them more precisely. Thank you very much for raising this scenario, i.e., how to identify which step is actually the poor one (the one that directly leads to an incorrect final answer) in trajectories with a poor final reward, which our current method cannot detect (the toy example below makes this concrete). As noted in the conclusion of our paper, our reward design is a simple yet effective version, and we envision further refinement and improvement in future work.
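To make the scenario concrete, here is a toy illustration with purely invented numbers; it is not the StepTool reward implementation, only a sketch of how judging a step solely by its influence on an incorrect final answer can over-reward the step that caused the error:

```python
# Toy numbers only -- not the StepTool reward implementation.
# A 3-step trajectory where step 2 fabricates an intermediate result that the
# final (incorrect) answer is built on.
contribution_to_final_answer = [0.2, 0.9, 0.4]  # step 2 shaped the final answer the most
contribution_to_task_query = [0.2, 0.1, 0.4]    # but it barely helps solve the actual task

# Rewarding steps by their contribution to the final answer gives step 2 the
# highest reward even though it is the step that made the answer wrong:
print(max(range(3), key=lambda i: contribution_to_final_answer[i]))  # -> 1 (step 2)

# Judging contribution to the task query itself, as suggested above, does not:
print(max(range(3), key=lambda i: contribution_to_task_query[i]))    # -> 2 (step 3)
```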

@chenhengzh

  1. Thank you for your contribution to the community. You mentioned that you also fixed some small bugs that could have led to degraded performance. Could you provide the names of the files you changed, so that we can learn from them?
  2. I understand the basic workings of the cache system. What I want to ask is whether the difference in your version lies in the cache recorded during your runs, or whether you made specific improvements to it.

@yuyq18
Owner

yuyq18 commented Dec 15, 2024

Thank you for your continued interest.

  1. Small Bug Fixes: Due to the time that has elapsed and the lack of detailed version management, it's difficult for me to list every single file change individually. Additionally, I have been quite occupied with several other ongoing research projects lately. I recommend comparing the version of StableToolBench we provide with the original using diff (a small sketch follows this list).

  2. Cache Responses: Regarding the cache, what I intended to convey was that StableToolBench doesn't provide certain cached responses, and as time progresses, some APIs that were functional earlier may become inaccessible or produce errors. For example, an API that responded correctly in September may show errors by October. To ensure a more accurate reproduction of our results (i.e., to guarantee that the API responses are consistent), we can provide our cached responses.
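For reference, a minimal Python sketch of such a comparison; the two directory paths are placeholders for your local checkouts of the original StableToolBench and the copy shipped in the StepTool repo:

```python
import filecmp

# Placeholder paths -- point these at your own checkouts.
ORIGINAL = "path/to/original/StableToolBench"
PATCHED = "path/to/StepTool/stabletoolbench"


def report_diffs(cmp: filecmp.dircmp) -> None:
    """Recursively print files that differ between the two directory trees."""
    for name in cmp.diff_files:
        print(f"differs: {cmp.left}/{name}  vs  {cmp.right}/{name}")
    for sub in cmp.subdirs.values():
        report_diffs(sub)


report_diffs(filecmp.dircmp(ORIGINAL, PATCHED))
```

A plain `diff -ru` between the two directories gives the same information at the line level.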

I hope this clarifies your concerns.

@yuyq18 yuyq18 closed this as completed Dec 15, 2024
@chenhengzh

Thank you for your patient response. It has resolved my concerns. Wishing you success in your research!
