chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39 #341
base: main
Conversation
Pull Request Test Coverage Report for Build 16325068133
💛 - Coveralls
force-pushed from 58bc8a5 to b557e02
force-pushed from 7791e20 to 79cd26e
force-pushed from 79cd26e to 6ccaa11
force-pushed from f666014 to 1cc93f2
Still going through it - leaving some comments related to storage as I move on to the actual testing framework.
```ts
import { MongoDBBasedResultStorage } from "./mongodb-storage.js";
import { AccuracyResultStorage } from "./result-storage.js";

export function getAccuracyResultStorage(): AccuracyResultStorage {
```
Can this be moved to `result-storage.ts`?
That makes sense. I will do it right away.
Ahh, doing this would create a circular dependency: disk-storage depending on result-storage, which in turn depends on disk-storage. I checked, and although the resolution works fine, I am doubtful of its reliability. I think it's best to keep it untangled like this.
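For illustration, here is a minimal sketch of the cycle being described, assuming the factory moved next to the `AccuracyResultStorage` interface and a disk-backed implementation lives in `disk-storage.ts` (file, class, and method names beyond those visible in the diff above are assumptions):

```ts
// result-storage.ts (hypothetical layout if the factory moved here)
import { DiskBasedResultStorage } from "./disk-storage.js"; // the factory now needs a concrete class

export interface AccuracyResultStorage {
    saveResult(result: unknown): Promise<void>; // illustrative method, not the real interface
}

export function getAccuracyResultStorage(): AccuracyResultStorage {
    return new DiskBasedResultStorage();
}

// disk-storage.ts
import { AccuracyResultStorage } from "./result-storage.js"; // ...which already imports this file back

export class DiskBasedResultStorage implements AccuracyResultStorage {
    async saveResult(_result: unknown): Promise<void> {
        // write the accuracy result to disk
    }
}
```

ESM import resolution generally tolerates this cycle because the interface is erased at compile time and the class is only referenced when the factory is actually called, not during module evaluation, which matches the observation that it "works fine" while still feeling fragile.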
force-pushed from 778165b to f6b2e7d
Looks fine on my side.
Some comments so far - I haven't gone through all of it, but the comments I left on some of the accuracy tests apply to the rest: review the function definitions and make sure those that are clearly consolidable are merged and functions that are used just once are inlined.
Additionally, I'm seeing the LLM response and conversation messages being empty for each prompt in the generated report - is this expected?

Finally, the naming of the new files doesn't follow the conventions of the repo - we tend to use camelCase names. Obviously not enforced, but it would be great to be consistent there.
callsCollectionIndexes("How many indexes do I have in 'mflix.movies' namespace?"), | ||
callsCollectionIndexes("List all the indexes in movies collection in mflix database"), | ||
callsCollectionIndexes( | ||
`Is the following query: ${JSON.stringify({ runtime: { $lt: 100 } })} on the namespace 'mflix.movies' indexed?` |
Should we expect the model to call explain for this prompt instead?
I think the prompt is very borderline between calling collection-indexes and explain, but I would expect collection-indexes to be called because of the word "indexed" in the prompt.
Honestly, this is likely not a problem for this PR, but I feel looking at the explain plan is the only surefire way to check whether a query is covered by an index. I guess for a simple query like this the model can look at the indexes and try to guess, but if it were `{ runtime: { $lt: 100 }, title: { $startsWith: "foo" } }`, just listing the indexes is likely not enough. We might want to tweak the explain description, though, to hint to models to prefer using that tool when the user is asking about a query.
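As a rough illustration of why the explain plan is the more reliable signal, here is a sketch using the Node.js driver directly rather than the MCP tool; the connection string, the regex stand-in for the prefix match, and the namespace reuse from this thread are assumptions:

```ts
import { MongoClient } from "mongodb";

// Checks whether the compound predicate is served by an index by inspecting the
// winning plan: an IXSCAN stage means an index was used, COLLSCAN means it was not.
async function isQueryIndexed(uri: string): Promise<boolean> {
    const client = new MongoClient(uri);
    try {
        const movies = client.db("mflix").collection("movies");
        const plan = await movies
            .find({ runtime: { $lt: 100 }, title: { $regex: "^foo" } })
            .explain("queryPlanner");
        return JSON.stringify(plan.queryPlanner.winningPlan).includes("IXSCAN");
    } finally {
        await client.close();
    }
}
```

Merely listing indexes cannot tell you whether the planner will actually pick one for a given multi-field predicate, which is the point being made above.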
I do agree that this is more of a use case for the explain plan. However, if the models that we are testing do call list-indexes, we should either improve our prompting on the tools so LLMs prefer other tools, or just keep it like this.
I believe this PR should capture the behaviour we currently have, not fix what is not entirely correct.
Cool, then I will include this as part of the improvements we can make to the tool schemas, and that's when I will move this prompt over to the explain test. Does that sound good to you both?
```ts
function callsDropCollection(prompt: string, expectedToolCalls: ExpectedToolCall[]): AccuracyTestConfig {
    return {
        prompt: prompt,
        expectedToolCalls,
    };
}
```
This function doesn't make much sense to me - it's just creating an object with its arguments.
I don't really like having empty functions, but this follows the convention in all the tests, and callsDropCollection is likely a better name than a random JSON object. A nit would be to always concat the drop-collection call into the expectedToolCalls and leave expectedToolCalls as an optional vararg parameter for additional tool calls.
I'm personally fine with the current scenario because the tests are already easy to understand and maintain, and I don't think this will scale poorly with the number of tests.
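A sketch of that nit, reusing the `AccuracyTestConfig`/`ExpectedToolCall` shapes from the snippets in this thread; the toolName string and the parameter values are assumptions:

```ts
// Always include the drop-collection call itself; extra expected calls are optional.
function callsDropCollection(
    prompt: string,
    ...additionalToolCalls: ExpectedToolCall[]
): AccuracyTestConfig {
    return {
        prompt,
        expectedToolCalls: [
            {
                toolName: "drop-collection",
                parameters: { database: "mflix", collection: "movies" },
            },
            ...additionalToolCalls,
        ],
    };
}
```

With this shape, `onlyCallsDropCollection(prompt)` could collapse into `callsDropCollection(prompt)` with no extra arguments.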
```ts
function onlyCallsDropCollection(prompt: string): AccuracyTestConfig {
    return {
        prompt: prompt,
```
```diff
- prompt: prompt,
+ prompt,
```
```ts
    };
}

function callsDropDatabase(prompt: string, expectedToolCalls: ExpectedToolCall[]): AccuracyTestConfig {
```
Similarly - trivial function that serves no purpose.
```ts
function onlyCallsDropDatabase(prompt: string): AccuracyTestConfig {
    return {
        prompt: prompt,
```
```diff
- prompt: prompt,
+ prompt,
```
```ts
const callsExplainWithFind = (prompt: string) =>
    callsExplain(prompt, {
        name: "find",
        arguments: {
            filter: { release_year: 2020 },
        },
    });

const callsExplainWithAggregate = (prompt: string) =>
    callsExplain(prompt, {
        name: "aggregate",
        arguments: {
            pipeline: [
                {
                    $match: { release_year: 2020 },
                },
            ],
        },
    });

const callsExplainWithCount = (prompt: string) =>
    callsExplain(prompt, {
        name: "count",
        arguments: {
            query: { release_year: 2020 },
        },
    });
```
These are one-off functions with a bunch of hardcoded data - we don't need to explicitly specify them and should instead just inline their implementation in the describe call.
So for most of these tests, the idea was for the function names to be verbose about what we expect, so that the intent is clear at a glance. Do you see them more as noise than as a help in understanding the test suite quickly?
I think they live in this limbo where they have some configurable arguments but also some hardcoded ones, which makes it really awkward. My suggestion would be to either go all in on the functions or only use them when they're being called with different arguments. An example of the former would be to make these functions just constants instead, by incorporating the prompt:

```ts
const testSimpleCount: AccuracyTestConfig = callsExplain(
    "Will counting documents, where release_year is 2020, from 'mflix.movies' namespace perform a collection scan?",
    {
        name: "count",
        arguments: {
            query: { release_year: 2020 },
        },
    }
);

const testMultistageAggregation = ...

describeAccuracyTests([
    testSimpleCount,
    testSimpleAggregation,
    testMultistageAggregation,
    testSimpleFind,
]);
```
Alternatively, if we want to keep the functions only for reusable code, we could do something like:

```ts
function callsExplain(prompt: string, methodName: string, methodArguments: Record<string, unknown>): AccuracyTestConfig {
    return {
        prompt,
        expectedToolCalls: [
            {
                toolName: "explain",
                parameters: {
                    database: "mflix",
                    collection: "movies",
                    method: [{ name: methodName, arguments: methodArguments }],
                },
            },
        ],
    };
}

describeAccuracyTests([
    callsExplain(
        `Will fetching documents, where release_year is 2020, from 'mflix.movies' namespace perform a collection scan?`,
        "find",
        { query: { release_year: 2020 } }
    ),
    callsExplain(...)
```
I'm fine either way - the only thing that I don't like is that we have a function that takes the prompt as an argument, but hardcodes the expected query, meaning that invoking it with any other prompt would not be correct.
> I'm fine either way - the only thing that I don't like is that we have a function that takes the prompt as an argument, but hardcodes the expected query, meaning that invoking it with any other prompt would not be correct.

That is what we would expect from the test, right? These tests ensure that specific prompts behave as we expect: we have multiple prompts that should behave the same, and if they don't, we likely broke something when we changed our tools.
I don't have strong opinions on the aesthetics if the test is easy to change and does what it has to do. What I like about this approach is that it's not easy to change the expected outcome, so it's not easy to monkey-patch.
> the only thing that I don't like is that we have a function that takes the prompt as an argument, but hardcodes the expected query, meaning that invoking it with any other prompt would not be correct

So the functions are called only with a prompt because all those prompts are expected to end up with the same tool calls and the same parameters. Wherever the parameters or tool calls differ, we have different functions (the find test suite, for example). If we instead put plain objects in the list, we end up duplicating expectedToolCalls for no good reason, and for some prompts the expectedToolCalls list is big.
I think there are valid cases where a function definition is not needed at all, such as a suite with only a few prompts where the tool call list is not big enough to become repetitive noise, but there are suites where it's better to reduce the repetitive noise by making it configurable via functions. Perhaps we should keep the test-writing style open to both methods and not stick to one?
- well-defined Model types
- move getAllAvailableModels() inside the test setup
📊 Accuracy Test Results
📈 Summary
📎 Download Full HTML Report - look for the report generated on: 7/16/2025, 4:39:30 PM
Motivation and Goal
Design Brief
Detailed Design
Refer to the doc titled "MCP Tools Accuracy Testing".
Current State
For reviewers
The code to review lives in `tests/accuracy/sdk`. Start with `describe-accuracy-test.ts`, as this is where all the different parts come together, and dive further into the specific implementation of each part afterwards.
Apologies for the big chunk to be reviewed here, but I did not see a way around it.