improve mcp_eval notebook #1901

Merged
26 changes: 26 additions & 0 deletions examples/evaluation/use-cases/mcp_eval_notebook.ipynb
@@ -450,6 +450,8 @@
"id": "ee1f655b",
"metadata": {},
"source": [
"Note that the 4.1 model was configured never to use its tools to answer the query, so it never called the MCP server. The o4-mini model wasn't explicitly instructed to use its tools, but it wasn't forbidden from doing so either, and it called the MCP server 3 times. We can see that the 4.1 model performed worse than the o4-mini model. Also notable: the one example that the o4-mini model failed was one where the MCP tool was not used.\n",
"\n",
"We can also check a detailed analysis of the outputs from each model for manual inspection and further analysis."
]
},
@@ -806,6 +808,30 @@
" print(item.sample.output[0].content)"
]
},
{
"cell_type": "markdown",
"id": "0936def6",
"metadata": {},
"source": [
"## How can we improve?\n",
"\n",
"If we add the phrase \"Always use your tools since they are the way to get the right answer in this task.\" to the system message of the o4-mini model, what do you think will happen? (try it out)\n",
"\n",
"<br><br><br>\n",
"\n",
"\n",
"If you guessed that the model would now call the MCP tool every time and get every answer correct, you are right!"
]
},
{
"cell_type": "markdown",
"id": "cf797a91",
"metadata": {},
"source": [
"![Evaluation Data Tab](../../../images/mcp_eval_improved_output.png)\n",
"![Evaluation Data Tab](../../../images/mcp_eval_improved_data.png)"
]
},
{
"cell_type": "markdown",
"id": "924619e0",
Binary file added images/mcp_eval_improved_data.png
Binary file added images/mcp_eval_improved_output.png