
chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39 #341


Open · wants to merge 74 commits into main from chore/issue-307-proposal-2
Conversation

@himanshusinghs (Collaborator) commented Jul 7, 2025

Motivation and Goal

  • Motivation: Consolidate MCP server tools to accommodate client soft limits, while mitigating the risk of confusing LLMs with potentially ambiguous tool schemas and descriptions.
  • Goal: Benchmark current tool understanding by LLMs to establish a baseline and prevent regression during consolidation.

Design Brief

  • Method: Provide LLMs (Gemini, Claude, ChatGPT, etc.) with prompts and current MCP server tool schemas.
  • Evaluation: Record actual tool calls made by LLMs and compare them against expected calls and parameters.
  • Reporting: Generate a readable summary of test runs highlighting the prompts, models, and the achieved accuracy (a rough sketch of the data shapes involved follows this list).
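To make the method concrete: each accuracy test pairs a prompt with the tool calls we expect the LLM to make; the framework replays the prompt against each configured model, records the actual tool calls, and scores them against the expectation. A minimal sketch of the shapes involved, reusing the ExpectedToolCall / AccuracyTestConfig names that appear in the tests reviewed below (the exact fields in the SDK may differ):

// Sketch only: field names are assumptions based on the test snippets in this PR.
interface ExpectedToolCall {
    toolName: string;
    parameters: Record<string, unknown>;
}

interface AccuracyTestConfig {
    prompt: string;
    expectedToolCalls: ExpectedToolCall[];
}

// Hypothetical entry: this prompt should lead the model to call the
// collection-indexes tool against the mflix.movies namespace.
const example: AccuracyTestConfig = {
    prompt: "How many indexes do I have in 'mflix.movies' namespace?",
    expectedToolCalls: [
        {
            toolName: "collection-indexes",
            parameters: { database: "mflix", collection: "movies" },
        },
    ],
};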

Detailed Design

Refer to the doc titled “MCP Tools Accuracy Testing”.

Current State

  • Framework implemented and integrated to test the MCP server and MongoDB tools.
  • Accuracy tests for core MongoDB tool calls are written.
  • The scoring algorithm is implemented and unit-tested (a rough sketch of the idea follows this list).
  • Supports multiple LLM providers and models.
  • Snapshots are stored on disk by default, with the option to store them in a MongoDB deployment as well.
  • On each successful test run, a summary is generated highlighting the prompt, model, and accuracy of tool calls.
  • A GitHub workflow was added to trigger the test runs on a label and via manual dispatch. It also attaches the summary when triggered for a PR.
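For a flavour of the scoring idea mentioned above: the generated report buckets responses at 0%, 75%, and 100% accuracy, which hints at partial credit when the expected tool is called with only partially matching parameters. A minimal sketch under that assumption, using hypothetical scoreToolCalls / ToolCall names rather than the actual scorer in tests/accuracy/sdk:

// Hypothetical partial-credit scorer, not the implementation in tests/accuracy/sdk:
// full credit for an exact match, a fraction when the expected tool was called with
// only some matching parameters, zero when the expected tool was never called.
type ToolCall = { toolName: string; parameters: Record<string, unknown> };

function scoreToolCalls(expected: ToolCall[], actual: ToolCall[]): number {
    if (expected.length === 0) {
        // Nothing expected: any extra call counts as a miss.
        return actual.length === 0 ? 1 : 0;
    }
    const perCallScores = expected.map((exp) => {
        const match = actual.find((act) => act.toolName === exp.toolName);
        if (!match) {
            return 0;
        }
        const keys = Object.keys(exp.parameters);
        if (keys.length === 0) {
            return 1;
        }
        const matching = keys.filter(
            (key) => JSON.stringify(match.parameters[key]) === JSON.stringify(exp.parameters[key])
        ).length;
        return matching / keys.length;
    });
    return perCallScores.reduce((sum, score) => sum + score, 0) / perCallScores.length;
}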

For reviewers

  • Please start by reviewing the test cases / prompts themselves.
  • Once done with the prompts, move on to the tool-calling accuracy scorer. I have added some docs and tests to help understand how it works.
  • Then review the rest of the accuracy SDK in the folder tests/accuracy/sdk. Start with describe-accuracy-test.ts, as this is where all the different parts come together, and dive further into the specific implementation of each part afterwards.

Apologies for the big chunk to be reviewed here, but I did not see a way around it.

@coveralls (Collaborator) commented Jul 7, 2025

Pull Request Test Coverage Report for Build 16325068133

Details

  • 0 of 295 (0.0%) changed or added relevant lines in 3 files are covered.
  • 3 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-4.4%) to 72.966%

Changes Missing Coverage                           Covered Lines   Changed/Added Lines   %
scripts/accuracy/update-accuracy-run-status.ts     0               18                    0.0%
vitest.config.ts                                   0               24                    0.0%
scripts/accuracy/generate-test-summary.ts          0               253                   0.0%

Files with Coverage Reduction                      New Missed Lines   %
src/common/atlas/apiClient.ts                      3                  73.9%

Totals Coverage Status
Change from base Build 16299138293: -4.4%
Covered Lines: 2825
Relevant Lines: 3905

💛 - Coveralls

@himanshusinghs force-pushed the chore/issue-307-proposal-2 branch 4 times, most recently from 58bc8a5 to b557e02 on July 10, 2025 08:53
@himanshusinghs force-pushed the chore/issue-307-proposal-2 branch from 7791e20 to 79cd26e on July 10, 2025 11:37
@himanshusinghs changed the title from "chore(tests): accuracy tests for MongoDB tools exposed by MCP server" to "chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39" on Jul 10, 2025
@himanshusinghs marked this pull request as ready for review on July 10, 2025 11:39
@himanshusinghs requested a review from a team as a code owner on July 10, 2025 11:39
@himanshusinghs force-pushed the chore/issue-307-proposal-2 branch from 79cd26e to 6ccaa11 on July 10, 2025 15:43
@nirinchev (Collaborator) left a comment

Still going through it - leaving some comments related to storage as I move on to the actual testing framework.

import { MongoDBBasedResultStorage } from "./mongodb-storage.js";
import { AccuracyResultStorage } from "./result-storage.js";

export function getAccuracyResultStorage(): AccuracyResultStorage {
Collaborator

Can this be moved to result-storage.ts?

Collaborator Author

That makes sense. I will do it right away.

Collaborator Author

Ahh, doing this would create a circular dependency: disk-storage depending on result-storage, which depends on disk-storage. I checked, and although the resolution works fine, I am doubtful of its reliability. I think it's best to keep it untangled like this.
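For context, the cycle being described would look roughly like this if the factory moved into result-storage.ts. Only MongoDBBasedResultStorage, AccuracyResultStorage, and getAccuracyResultStorage appear in the diff; DiskBasedResultStorage and the environment-variable check are assumptions for illustration:

// result-storage.ts — if the factory lived here, this module would need to
// import the concrete storage implementations...
import { DiskBasedResultStorage } from "./disk-storage.js"; // assumed class name
import { MongoDBBasedResultStorage } from "./mongodb-storage.js";

export interface AccuracyResultStorage {
    // ...methods for saving and loading accuracy runs
}

export function getAccuracyResultStorage(): AccuracyResultStorage {
    // Hypothetical selection logic; the real factory may key off a different setting.
    return process.env.ACCURACY_STORAGE === "mongodb"
        ? new MongoDBBasedResultStorage()
        : new DiskBasedResultStorage();
}

// ...while disk-storage.ts already imports AccuracyResultStorage from result-storage.ts,
// closing the disk-storage -> result-storage -> disk-storage cycle.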


@kmruiz (Collaborator) left a comment

Looks fine on my side.

@nirinchev (Collaborator) left a comment

Some comments so far - haven't gone through all of it, but the comments I left for some of the accuracy tests are applicable to the rest - review the function definitions and make sure those that are clearly consolidable are merged and functions that are used just once are inlined.

Additionally, I'm seeing LLM response and conversation messages being empty for each prompt in the generated report - is this expected?

[screenshot of the generated report]

Finally, the naming of the new files doesn't follow the conventions of the repo - we tend to use camelCase names. Obviously, not enforced, but would be great to be consistent there.

callsCollectionIndexes("How many indexes do I have in 'mflix.movies' namespace?"),
callsCollectionIndexes("List all the indexes in movies collection in mflix database"),
callsCollectionIndexes(
    `Is the following query: ${JSON.stringify({ runtime: { $lt: 100 } })} on the namespace 'mflix.movies' indexed?`
),
Collaborator

Should we expect the model to call explain for this prompt instead?

Collaborator Author

I think the prompt is borderline between calling collection-indexes and explain, but I would expect collection-indexes to be called because of the word "indexed" in the prompt.

Collaborator

Honestly, this is likely not a problem for this PR, but I feel looking at the explain plan is the only surefire way to check if a query is covered by an index. I guess for a simple query like this the model can look at the indexes and try and guess, but if it was { runtime: { $lt: 100 }, title: { $startsWith: "foo" } }, just listing the indexes is likely not enough. We might want to tweak the explain description though to hint to models to prefer using that if the user is asking about a query.

Collaborator

I do agree that this would be more of a use case for the explain plan. However, if the models we are testing do call list-indexes, we should either improve our prompting on the tools so LLMs prefer other tools, or just keep it like this.

I believe this PR should show the current behaviour we already have, not fix what is not entirely correct.

Collaborator Author

Cool, then I will include this as part of the improvements we can make to the tool schemas, and that's when I will move this prompt over to the explain test. Does that sound good to you both?

Comment on lines +21 to +26
function callsDropCollection(prompt: string, expectedToolCalls: ExpectedToolCall[]): AccuracyTestConfig {
return {
prompt: prompt,
expectedToolCalls,
};
}
Collaborator

This function doesn't make much sense to me - it's just creating an object with its arguments.

Collaborator

I don't really like having empty functions, but this follows the convention in all the tests, and callsDropCollection is likely a better name than a random JSON object. A nit would be to always include the drop-collection call in expectedToolCalls and leave expectedToolCalls as an optional vararg parameter for additional tool calls (see the sketch below).

I'm personally fine with the current scenario because the tests are already easy to understand and maintain, and I don't think it will scale poorly with the number of tests.
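For illustration, the vararg variant suggested above might look like the sketch below; the drop-collection tool name and the mflix/movies parameters are assumed for the example:

// Sketch of the nit above: the drop-collection call is always included, and any
// additional expected tool calls are passed as an optional rest parameter.
function callsDropCollection(prompt: string, ...additionalToolCalls: ExpectedToolCall[]): AccuracyTestConfig {
    return {
        prompt,
        expectedToolCalls: [
            {
                toolName: "drop-collection",
                parameters: { database: "mflix", collection: "movies" },
            },
            ...additionalToolCalls,
        ],
    };
}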


function onlyCallsDropCollection(prompt: string): AccuracyTestConfig {
return {
prompt: prompt,
Collaborator

Suggested change
prompt: prompt,
prompt,

};
}

function callsDropDatabase(prompt: string, expectedToolCalls: ExpectedToolCall[]): AccuracyTestConfig {
Collaborator

Similarly - trivial function that serves no purpose.


function onlyCallsDropDatabase(prompt: string): AccuracyTestConfig {
return {
prompt: prompt,
Collaborator

Suggested change
prompt: prompt,
prompt,

Comment on lines +21 to +47
const callsExplainWithFind = (prompt: string) =>
callsExplain(prompt, {
name: "find",
arguments: {
filter: { release_year: 2020 },
},
});

const callsExplainWithAggregate = (prompt: string) =>
callsExplain(prompt, {
name: "aggregate",
arguments: {
pipeline: [
{
$match: { release_year: 2020 },
},
],
},
});

const callsExplainWithCount = (prompt: string) =>
callsExplain(prompt, {
name: "count",
arguments: {
query: { release_year: 2020 },
},
});
Collaborator

These are one-off functions with a bunch of hardcoded data - we don't need to explicitly specify them and should instead just inline their implementation in the describe call.

Collaborator Author

So for most of these tests, the idea was to use verbose function names so that what we expect is clear at a glance. Do you see them more as noise than as help in understanding the test suite quickly?

Collaborator

I think they live in this limbo where they have some configurable arguments, but also some hardcoded ones, making it really awkward. My suggestion would be to either go all in on the functions or only use them when they're being called with different arguments. An example of the former would be to make these functions just constants instead by incorporating the prompt:

const testSimpleCount: AccuracyTestConfig = callsExplain("Will counting documents, where release_year is 2020, from 'mflix.movies' namespace perform a collection scan?", {
        name: "count",
        arguments: {
            query: { release_year: 2020 },
        },
    });

const testMultistageAggregation = ...

describeAccuracyTests([
  testSimpleCount,
  testSimpleAggregation,
  testMultistageAggregation,
  testSimpleFind,
]);

Alternatively, if we want to keep the functions for only reusable code, we could do something like:

function callsExplain(prompt: string, methodName: string, methodArguments: Record<string, unknown>): AccuracyTestConfig {
  return {
    prompt,
    expectedToolCalls: [{ 
      toolName: "explain",
      parameters: {
        database: "mflix",
        collection: "movies",
        method: [{ name: methodName, arguments: methodArguments  }]
      }
    }]
  };
}

describeAccuracyTests([
  callsExplain(
    `Will fetching documents, where release_year is 2020, from 'mflix.movies' namespace perform a collection scan?`, 
    "find", 
    { query: { release_year: 2020 }}),
  callsExplain(...)

I'm fine either way - the only thing that I don't like is that we have a function that takes the prompt as an argument, but hardcodes the expected query, meaning that invoking it with any other prompt would not be correct.

Collaborator

I'm fine either way - the only thing that I don't like is that we have a function that takes the prompt as an argument, but hardcodes the expected query, meaning that invoking it with any other prompt would not be correct.

That is what we would expect from the test, right? These tests ensure that specific prompts behave as we expect: we have multiple prompts that should behave the same; if they don't, we likely broke something when we changed our tools.

I don't have strong opinions on the aesthetics as long as the test is easy to change and does what it has to do. What I like about this approach is that it's not easy to change the expected outcome, so it's not easy to monkey-patch.

Collaborator Author

the only thing that I don't like is that we have a function that takes the prompt as an argument, but hardcodes the expected query, meaning that invoking it with any other prompt would not be correct

So the functions are called only with a prompt because all those prompts are expected to end up with the same tool calls and the same parameters. Wherever the parameters or tool calls differ, we have different functions (the find test suite, for example). If we instead put plain objects in the list, we end up duplicating expectedToolCalls for no good reason, and for some prompts the expectedToolCalls list is big.

I think there are valid cases where a function definition is not needed at all, such as a suite with only a few prompts where the tool call list is not big enough to become repetitive noise, but there are other suites where it's better to reduce the repetition by making it configurable via functions. Perhaps we should keep the test-writing style open to both methods and not stick to one?

- well defined Model types
- move getAllAvailableModels() inside the test setup
Contributor

📊 Accuracy Test Results

📈 Summary

Metric                          Value
Commit SHA                      39695d472865fa93cf7b06b7c5838c0fd5340b05
Run ID                          9c05b8d4-8485-46aa-af55-21f4f668eff8
Status                          done
Total Prompts Evaluated         51
Models Tested                   1
Average Accuracy                73.5%
Responses with 0% Accuracy      12
Responses with 75% Accuracy     6
Responses with 100% Accuracy    33

📎 Download Full HTML Report - Look for the accuracy-test-summary artifact for detailed results.

Report generated on: 7/16/2025, 4:39:30 PM
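For reference, the aggregate numbers in a summary like this can be derived from per-prompt scores roughly as follows. The names are hypothetical; this is not the code in scripts/accuracy/generate-test-summary.ts:

// Hypothetical aggregation over per-prompt accuracy scores in the 0..1 range.
interface PromptResult {
    prompt: string;
    model: string;
    accuracy: number; // e.g. 0, 0.75, 1
}

function summarize(results: PromptResult[]) {
    const total = results.length;
    const average = total === 0 ? 0 : results.reduce((sum, r) => sum + r.accuracy, 0) / total;
    const countAt = (value: number) => results.filter((r) => r.accuracy === value).length;
    return {
        totalPromptsEvaluated: total,
        modelsTested: new Set(results.map((r) => r.model)).size,
        averageAccuracy: `${(average * 100).toFixed(1)}%`,
        responsesWithZeroAccuracy: countAt(0),
        responsesWith75PercentAccuracy: countAt(0.75),
        responsesWith100PercentAccuracy: countAt(1),
    };
}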
