Some Questions About the Evaluation. #7
I did not find the results you listed in StableToolBench. Here are the results I found in the StableToolBench paper, which are significantly higher than those you reported. However, I also reproduced the ToolLlama (DFS) experiments; the results I obtained were consistently lower than those reported in StepTool and StableToolBench, with a difference of around 10 points across all three I1 tests, as shown in the table below:
I'm not sure whether the difference is caused by a higher number of inaccessible APIs or by the use of a different model in ToolEval. Could the authors provide more information about the experiments, such as the model used in ToolEval? @yuyq18
Thank you very much for sharing. In fact, StableToolBench has since updated its evaluation results; the paper you downloaded contains the results of earlier experiments, for which the evaluation model was GPT-4-turbo-preview.
Thank you! I missed the update. I noticed that the evaluation model has a significant impact on the results, so I'm even more curious about the authors' experimental details.
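For context on why the judge model matters, here is a minimal, hypothetical sketch of a pass-rate judgement call; the prompt wording and function name are placeholders rather than the actual ToolEval implementation, and only the model name comes from the discussion above:

```python
# Hypothetical sketch of a "is the query solved?" judgement call (not the real ToolEval code).
# Swapping the judge model can shift pass rates noticeably, which is why the exact model matters.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_is_solved(query: str, final_answer: str,
                    judge_model: str = "gpt-4-turbo-preview") -> bool:
    """Ask the judge model whether the final answer solves the user's query."""
    prompt = (
        "Query:\n" + query + "\n\n"
        "Final answer:\n" + final_answer + "\n\n"
        "Does the final answer solve the query? Reply with 'Solved' or 'Unsolved'."
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = resp.choices[0].message.content.lower()
    return "solved" in verdict and "unsolved" not in verdict
```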
Thank you very much for the thorough discussion, and I truly appreciate @chenhengzh's efforts in reproducing our results. To address the discrepancy in ToolLlama's performance and the evaluation results, I would like to explain three main factors that could have contributed to the differences:
We hope this helps clarify the discrepancies. If you need further details or have additional questions, please feel free to reach out.
Thank you for your detailed response!
By the way, I have some concerns about the reward assignment mentioned in the issue "Questions About StepTool Data Generation and Annotation". I hope you can help clarify this for me.
I really appreciate your patience in responding to my questions. I have learned a lot from the discussion.
Thank you for the follow-up questions. I'm happy to clarify further.
Thank you for your continued interest.
I hope this clarifies your concerns.
Thank you for your patient response. It has resolved my concerns. Wishing you success in your research!
Thank you for your work on tool use!
However, I have some questions about the results.
I've noticed that you adopted StableToolBench as your benchmark, and that when checking whether the Final Answer solves the user's query, the code you provided uses the same solvable_queries as StableToolBench. However, I'm quite puzzled: although some of your evaluation settings are identical to StableToolBench's, the difference in results is very significant.
For example, the results of ToolLLaMA (CoT & DFSDT) reported in StableToolBench..
The results of ToolLLaMA (CoT & DFSDT) reported in StepTool..
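As a side note on the solvable_queries check mentioned above, here is a minimal sketch of how a pass rate restricted to solvable queries could be computed; the file names and result layout are assumptions for illustration, not the exact StableToolBench evaluation script:

```python
# Minimal sketch of a pass-rate computation restricted to solvable queries.
# File names and the judgement-file layout are illustrative assumptions.
import json

def pass_rate(solvable_path: str, judgements_path: str) -> float:
    """Fraction of solvable queries whose final answer was judged 'Solved'."""
    with open(solvable_path) as f:
        solvable_ids = {item["query_id"] for item in json.load(f)}
    with open(judgements_path) as f:
        judgements = json.load(f)  # e.g. {"1234": "Solved", "5678": "Unsolved", ...}

    # Queries missing from the judgement file count as unsolved.
    judged = [judgements.get(str(qid), "Unsolved") for qid in solvable_ids]
    return sum(j == "Solved" for j in judged) / len(solvable_ids)

# Example (hypothetical file names):
# pass_rate("solvable_queries/G1_instruction.json", "toolllama_dfs_judgements.json")
```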
I've conducted a large number of experiments on StableToolBench. I've found that, even with the caching system, many APIs that used to be accessible have now become inaccessible. Moreover, some query keys that appear in StableToolBench's experiments were never saved in the cache it provides. So even though StableToolBench proposes a stable version of the benchmark, more APIs become inaccessible over time, and it is actually very difficult to match, let alone improve on, the original results when rerunning on the same test set later for reproduction.
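To make the failure mode concrete, below is a rough sketch of how a cache-backed tool call behaves when a query key is missing from the cache and the live endpoint has since become unreachable; the cache layout, function names, and live-API stub are assumptions, not StableToolBench's actual server code:

```python
# Rough sketch of a cache-backed tool call; names, cache layout, and the
# live-API stub are illustrative assumptions.
import hashlib
import json
import os

CACHE_DIR = "tool_response_cache"  # hypothetical cache directory

def call_real_rapidapi(tool: str, api: str, arguments: dict) -> dict:
    # Placeholder for the live RapidAPI request; here it simulates an unreachable API.
    raise ConnectionError("live API call not implemented in this sketch")

def cache_key(tool: str, api: str, arguments: dict) -> str:
    payload = json.dumps({"tool": tool, "api": api, "args": arguments}, sort_keys=True)
    return hashlib.md5(payload.encode()).hexdigest()

def call_api(tool: str, api: str, arguments: dict) -> dict:
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, f"{cache_key(tool, api, arguments)}.json")
    if os.path.exists(path):                      # 1) cache hit: stable and reproducible
        with open(path) as f:
            return json.load(f)
    try:                                          # 2) cache miss: fall back to the live API
        response = call_real_rapidapi(tool, api, arguments)
    except Exception:                             # 3) API now inaccessible: the step fails
        return {"error": "API unreachable and response not cached"}
    with open(path, "w") as f:                    # cache the response for future runs
        json.dump(response, f)
    return response
```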