
chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39 #341


Open · wants to merge 74 commits into main from chore/issue-307-proposal-2
Conversation

@himanshusinghs (Collaborator) commented Jul 7, 2025

Motivation and Goal

  • Motivation: Consolidate MCP server tools to accommodate client soft limits, while mitigating the risk of confusing LLMs with potentially ambiguous tool schemas and descriptions.
  • Goal: Benchmark current tool understanding by LLMs to establish a baseline and prevent regression during consolidation.

Design Brief

  • Method: Provide LLMs (Gemini, Claude, ChatGPT, etc.) with prompts and current MCP server tool schemas.
  • Evaluation: Record actual tool calls made by LLMs and compare them against expected calls and parameters.
  • Reporting: Generate a readable summary of test runs highlighting the prompts, models, and the achieved accuracy (a rough sketch of the data shapes involved follows this list).
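To make the method concrete: each accuracy test pairs a prompt with the tool calls we expect the LLM to make; the framework replays the prompt against each configured model, records the actual tool calls, and scores them against the expectation. A minimal sketch of the shapes involved, reusing the ExpectedToolCall / AccuracyTestConfig names that appear in the tests reviewed below (the exact fields in the SDK may differ):

// Sketch only: field names are assumptions based on the test snippets in this PR.
interface ExpectedToolCall {
    toolName: string;
    parameters: Record<string, unknown>;
}

interface AccuracyTestConfig {
    prompt: string;
    expectedToolCalls: ExpectedToolCall[];
}

// Hypothetical entry: this prompt should lead the model to call the
// collection-indexes tool against the mflix.movies namespace.
const example: AccuracyTestConfig = {
    prompt: "How many indexes do I have in 'mflix.movies' namespace?",
    expectedToolCalls: [
        {
            toolName: "collection-indexes",
            parameters: { database: "mflix", collection: "movies" },
        },
    ],
};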

Detailed Design

Refer to the doc titled “MCP Tools Accuracy Testing”.

Current State

  • Framework implemented and integrated to test the MCP server and MongoDB tools.
  • Accuracy tests for core MongoDB tool calls are written.
  • The scoring algorithm is implemented and unit-tested (a rough sketch of the idea follows this list).
  • Supports multiple LLM providers and models.
  • Snapshots are stored on disk by default, with the option to store them in a MongoDB deployment as well.
  • On each successful test run, a summary is generated highlighting the prompt, model, and accuracy of tool calls.
  • A GitHub workflow was added to trigger the test runs on a label and via manual dispatch. It also attaches the summary when triggered for a PR.
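For a flavour of the scoring idea mentioned above: the generated report buckets responses at 0%, 75%, and 100% accuracy, which hints at partial credit when the expected tool is called with only partially matching parameters. A minimal sketch under that assumption, using hypothetical scoreToolCalls / ToolCall names rather than the actual scorer in tests/accuracy/sdk:

// Hypothetical partial-credit scorer, not the implementation in tests/accuracy/sdk:
// full credit for an exact match, a fraction when the expected tool was called with
// only some matching parameters, zero when the expected tool was never called.
type ToolCall = { toolName: string; parameters: Record<string, unknown> };

function scoreToolCalls(expected: ToolCall[], actual: ToolCall[]): number {
    if (expected.length === 0) {
        // Nothing expected: any extra call counts as a miss.
        return actual.length === 0 ? 1 : 0;
    }
    const perCallScores = expected.map((exp) => {
        const match = actual.find((act) => act.toolName === exp.toolName);
        if (!match) {
            return 0;
        }
        const keys = Object.keys(exp.parameters);
        if (keys.length === 0) {
            return 1;
        }
        const matching = keys.filter(
            (key) => JSON.stringify(match.parameters[key]) === JSON.stringify(exp.parameters[key])
        ).length;
        return matching / keys.length;
    });
    return perCallScores.reduce((sum, score) => sum + score, 0) / perCallScores.length;
}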

For reviewers

  • Please start by reviewing the test cases / prompts themselves.
  • Once done with the prompts, move on to the tool-calling accuracy scorer. I have added some docs and tests to help understand how it works.
  • Then review the rest of the accuracy SDK in the folder tests/accuracy/sdk. Start with describe-accuracy-test.ts, as this is where all the different parts come together, and dive further into the specific implementation of each part afterwards.

Apologies for the big chunk to be reviewed here, but I did not see a way around it.

@coveralls (Collaborator) commented Jul 7, 2025

Pull Request Test Coverage Report for Build 16325068133

Details

  • 0 of 295 (0.0%) changed or added relevant lines in 3 files are covered.
  • 3 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-4.4%) to 72.966%

Changes Missing Coverage                           Covered Lines   Changed/Added Lines   %
scripts/accuracy/update-accuracy-run-status.ts     0               18                    0.0%
vitest.config.ts                                   0               24                    0.0%
scripts/accuracy/generate-test-summary.ts          0               253                   0.0%

Files with Coverage Reduction                      New Missed Lines   %
src/common/atlas/apiClient.ts                      3                  73.9%

Totals Coverage Status
Change from base Build 16299138293: -4.4%
Covered Lines: 2825
Relevant Lines: 3905

💛 - Coveralls

@himanshusinghs force-pushed the chore/issue-307-proposal-2 branch 4 times, most recently from 58bc8a5 to b557e02 on July 10, 2025 08:53
@himanshusinghs force-pushed the chore/issue-307-proposal-2 branch from 7791e20 to 79cd26e on July 10, 2025 11:37
@himanshusinghs changed the title from "chore(tests): accuracy tests for MongoDB tools exposed by MCP server" to "chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39" on Jul 10, 2025
@himanshusinghs marked this pull request as ready for review on July 10, 2025 11:39
@himanshusinghs requested a review from a team as a code owner on July 10, 2025 11:39
@himanshusinghs force-pushed the chore/issue-307-proposal-2 branch from 79cd26e to 6ccaa11 on July 10, 2025 15:43
@nirinchev (Collaborator) left a comment

Still going through it - leaving some comments related to storage as I move on to the actual testing framework.

import { MongoDBBasedResultStorage } from "./mongodb-storage.js";
import { AccuracyResultStorage } from "./result-storage.js";

export function getAccuracyResultStorage(): AccuracyResultStorage {
Collaborator

Can this be moved to result-storage.ts?

Collaborator Author

That makes sense. I will do it right away.

Collaborator Author

Ahh, doing this would create a circular dependency: disk-storage depending on result-storage, which depends on disk-storage. I checked, and although the resolution works fine, I am doubtful of its reliability. I think it's best to keep it untangled like this.
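For context, the cycle being described would look roughly like this if the factory moved into result-storage.ts. Only MongoDBBasedResultStorage, AccuracyResultStorage, and getAccuracyResultStorage appear in the diff; DiskBasedResultStorage and the environment-variable check are assumptions for illustration:

// result-storage.ts — if the factory lived here, this module would need to
// import the concrete storage implementations...
import { DiskBasedResultStorage } from "./disk-storage.js"; // assumed class name
import { MongoDBBasedResultStorage } from "./mongodb-storage.js";

export interface AccuracyResultStorage {
    // ...methods for saving and loading accuracy runs
}

export function getAccuracyResultStorage(): AccuracyResultStorage {
    // Hypothetical selection logic; the real factory may key off a different setting.
    return process.env.ACCURACY_STORAGE === "mongodb"
        ? new MongoDBBasedResultStorage()
        : new DiskBasedResultStorage();
}

// ...while disk-storage.ts already imports AccuracyResultStorage from result-storage.ts,
// closing the disk-storage -> result-storage -> disk-storage cycle.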


@kmruiz (Collaborator) left a comment

Looks fine on my side.

@nirinchev (Collaborator) left a comment

Some comments so far - haven't gone through all of it, but the comments I left for some of the accuracy tests are applicable to the rest - review the function definitions and make sure those that are clearly consolidable are merged and functions that are used just once are inlined.

Additionally, I'm seeing LLM response and conversation messages being empty for each prompt in the generated report - is this expected?

[screenshot of the generated report]

Finally, the naming of the new files doesn't follow the conventions of the repo - we tend to use camelCase names. Obviously, not enforced, but would be great to be consistent there.

callsCollectionIndexes("How many indexes do I have in 'mflix.movies' namespace?"),
callsCollectionIndexes("List all the indexes in movies collection in mflix database"),
callsCollectionIndexes(
    `Is the following query: ${JSON.stringify({ runtime: { $lt: 100 } })} on the namespace 'mflix.movies' indexed?`
),
Collaborator

Should we expect the model to call explain for this prompt instead?

Collaborator Author

I think the prompt is borderline between calling collection-indexes and explain, but I would expect collection-indexes to be called because of the word "indexed" in the prompt.

Collaborator

Honestly, this is likely not a problem for this PR, but I feel looking at the explain plan is the only surefire way to check if a query is covered by an index. I guess for a simple query like this the model can look at the indexes and try and guess, but if it was { runtime: { $lt: 100 }, title: { $startsWith: "foo" } }, just listing the indexes is likely not enough. We might want to tweak the explain description though to hint to models to prefer using that if the user is asking about a query.

Collaborator

I do agree that this would be more of a use case for the explain plan. However, if the models we are testing do call list-indexes, we should either improve our prompting on the tools so LLMs prefer other tools, or just keep it like this.

I believe this PR should show the current behaviour we already have, not fix what is not entirely correct.

Collaborator Author

Cool, then I will include this as part of the improvements we can make to the tool schemas, and that's when I will move this prompt over to the explain test. Does that sound good to you both?

Comment on lines +21 to +26
function callsDropCollection(prompt: string, expectedToolCalls: ExpectedToolCall[]): AccuracyTestConfig {
return {
prompt: prompt,
expectedToolCalls,
};
}
Collaborator

This function doesn't make much sense to me - it's just creating an object with its arguments.

Collaborator

I don't really like having empty functions, but this follows the convention in all the tests, and callsDropCollection is likely a better name than a random JSON object. A nit would be to always include the drop-collection call in expectedToolCalls and leave expectedToolCalls as an optional vararg parameter for additional tool calls (see the sketch below).

I'm personally fine with the current scenario because the tests are already easy to understand and maintain, and I don't think it will scale poorly with the number of tests.
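For illustration, the vararg variant suggested above might look like the sketch below; the drop-collection tool name and the mflix/movies parameters are assumed for the example:

// Sketch of the nit above: the drop-collection call is always included, and any
// additional expected tool calls are passed as an optional rest parameter.
function callsDropCollection(prompt: string, ...additionalToolCalls: ExpectedToolCall[]): AccuracyTestConfig {
    return {
        prompt,
        expectedToolCalls: [
            {
                toolName: "drop-collection",
                parameters: { database: "mflix", collection: "movies" },
            },
            ...additionalToolCalls,
        ],
    };
}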


function onlyCallsDropCollection(prompt: string): AccuracyTestConfig {
return {
prompt: prompt,
Collaborator

Suggested change
prompt: prompt,
prompt,

};
}

function callsDropDatabase(prompt: string, expectedToolCalls: ExpectedToolCall[]): AccuracyTestConfig {
Collaborator

Similarly - trivial function that serves no purpose.


function onlyCallsDropDatabase(prompt: string): AccuracyTestConfig {
return {
prompt: prompt,
Collaborator

Suggested change
prompt: prompt,
prompt,

Comment on lines +21 to +47
const callsExplainWithFind = (prompt: string) =>
callsExplain(prompt, {
name: "find",
arguments: {
filter: { release_year: 2020 },
},
});

const callsExplainWithAggregate = (prompt: string) =>
callsExplain(prompt, {
name: "aggregate",
arguments: {
pipeline: [
{
$match: { release_year: 2020 },
},
],
},
});

const callsExplainWithCount = (prompt: string) =>
callsExplain(prompt, {
name: "count",
arguments: {
query: { release_year: 2020 },
},
});
Collaborator

These are one-off functions with a bunch of hardcoded data - we don't need to explicitly specify them and should instead just inline their implementation in the describe call.

Collaborator Author

So for most of these tests, the idea was to use verbose function names so that what we expect is clear at a glance. Do you see them more as noise than as help in understanding the test suite quickly?

Collaborator

I think they live in this limbo where they have some configurable arguments, but also some hardcoded ones, making it really awkward. My suggestion would be to either go all in on the functions or only use them when they're being called with different arguments. An example of the former would be to make these functions just constants instead by incorporating the prompt:

const testSimpleCount: AccuracyTestConfig = callsExplain("Will counting documents, where release_year is 2020, from 'mflix.movies' namespace perform a collection scan?", {
        name: "count",
        arguments: {
            query: { release_year: 2020 },
        },
    });

const testMultistageAggregation = ...

describeAccuracyTests([
  testSimpleCount,
  testSimpleAggregation,
  testMultistageAggregation,
  testSimpleFind,
]);

Alternatively, if we want to keep the functions for only reusable code, we could do something like:

function callsExplain(prompt: string, methodName: string, methodArguments: Record<string, unknown>): AccuracyTestConfig {
  return {
    prompt,
    expectedToolCalls: [{ 
      toolName: "explain",
      parameters: {
        database: "mflix",
        collection: "movies",
        method: [{ name: methodName, arguments: methodArguments  }]
      }
    }]
  };
}

describeAccuracyTests([
  callsExplain(
    `Will fetching documents, where release_year is 2020, from 'mflix.movies' namespace perform a collection scan?`, 
    "find", 
    { query: { release_year: 2020 }}),
  callsExplain(...)

I'm fine either way - the only thing that I don't like is that we have a function that takes the prompt as an argument, but hardcodes the expected query, meaning that invoking it with any other prompt would not be correct.

Collaborator

I'm fine either way - the only thing that I don't like is that we have a function that takes the prompt as an argument, but hardcodes the expected query, meaning that invoking it with any other prompt would not be correct.

That is what we would expect from the test, right? These tests ensure that specific prompts behave as we expect: we have multiple prompts that should behave the same; if they don't, we likely broke something when we changed our tools.

I don't have strong opinions on the aesthetics as long as the test is easy to change and does what it has to do. What I like about this approach is that it's not easy to change the expected outcome, so it's not easy to monkey-patch.

Collaborator Author

the only thing that I don't like is that we have a function that takes the prompt as an argument, but hardcodes the expected query, meaning that invoking it with any other prompt would not be correct

So the functions are called only with a prompt because all those prompts are expected to end up with the same tool calls and the same parameters. Wherever the parameters or tool calls differ, we have different functions (the find test suite, for example). If we instead put plain objects in the list, we end up duplicating expectedToolCalls for no good reason, and for some prompts the expectedToolCalls list is big.

I think there are valid cases where a function definition is not needed at all, such as a suite with only a few prompts where the tool call list is not big enough to become repetitive noise, but there are other suites where it's better to reduce the repetition by making it configurable via functions. Perhaps we should keep the test-writing style open to both methods and not stick to one?

- well defined Model types
- move getAllAvailableModels() inside the test setup
Contributor

📊 Accuracy Test Results

📈 Summary

Metric                          Value
Commit SHA                      39695d472865fa93cf7b06b7c5838c0fd5340b05
Run ID                          9c05b8d4-8485-46aa-af55-21f4f668eff8
Status                          done
Total Prompts Evaluated         51
Models Tested                   1
Average Accuracy                73.5%
Responses with 0% Accuracy      12
Responses with 75% Accuracy     6
Responses with 100% Accuracy    33

📎 Download Full HTML Report - Look for the accuracy-test-summary artifact for detailed results.

Report generated on: 7/16/2025, 4:39:30 PM
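For reference, the aggregate numbers in a summary like this can be derived from per-prompt scores roughly as follows. The names are hypothetical; this is not the code in scripts/accuracy/generate-test-summary.ts:

// Hypothetical aggregation over per-prompt accuracy scores in the 0..1 range.
interface PromptResult {
    prompt: string;
    model: string;
    accuracy: number; // e.g. 0, 0.75, 1
}

function summarize(results: PromptResult[]) {
    const total = results.length;
    const average = total === 0 ? 0 : results.reduce((sum, r) => sum + r.accuracy, 0) / total;
    const countAt = (value: number) => results.filter((r) => r.accuracy === value).length;
    return {
        totalPromptsEvaluated: total,
        modelsTested: new Set(results.map((r) => r.model)).size,
        averageAccuracy: `${(average * 100).toFixed(1)}%`,
        responsesWithZeroAccuracy: countAt(0),
        responsesWith75PercentAccuracy: countAt(0.75),
        responsesWith100PercentAccuracy: countAt(1),
    };
}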
