MCPval
By Nate Nowack
We’re making a Prefect MCP server. That’s great and fun!
bad MCP clients
You wanna know what isn’t great and fun? Most MCP clients really suck!
Why?
- no awareness of resources
- no support for elicitation
- no support for sampling
With the exception of the one or two decent MCP clients (Claude Code and perhaps Goose), bad clients force MCP server authors to abuse MCP's primitives just to provide tangible value to users of MCP clients such as ChatGPT, Claude Desktop, and Cursor.
what to do?
well how do you assess the capability of an MCP server when clients are still so bad, ideally in a way we won’t need to completely redo once the clients are better?
well..
evals are surprisingly often all you need
— Greg Brockman (@gdb) December 9, 2023
ok so.. evals for MCP?
what capabilities do clients have when connected to an MCP server?
is not the same question as:
what is literally exposed by my MCP server?
clients might have other capabilities, like using a more general interface (e.g. terminal a la Claude Code) where the user might prefer more sensitive operations to happen
so, if we want to evaluate that an MCP client can do a thing on behalf of a user, we just need to set up an initial condition, and let the client loop with its tools/MCP servers until it achieves the desired outcome (perhaps asserting that this happened a particular way)
by extension, if you restrict the set of tools to only your MCP server, you can evaluate that your MCP server enables clients in general to have a particular capability on behalf of a user.
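to make the shape of that loop concrete, here is a minimal, self-contained sketch. nothing in it is a real MCP or Prefect API: `run_eval`, `ToolCall`, `EvalResult`, and `scripted_policy` are hypothetical names, the "client" is a scripted stand-in for an LLM, and the single tool is a fake. the point is only the structure: an initial condition, a restricted tool set, a loop, and assertions on both the outcome and how it was reached.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Dict, List, Union


@dataclass
class ToolCall:
    """A request from the client to call one tool (hypothetical type)."""
    name: str
    args: dict


@dataclass
class EvalResult:
    output: str
    tool_calls: List[str] = field(default_factory=list)


Tool = Callable[[dict], Awaitable[str]]
# A policy stands in for the LLM: it sees the prompt and prior observations,
# and returns either a ToolCall or a final answer (a plain string).
Policy = Callable[[str, List[str]], Union[ToolCall, str]]


async def run_eval(
    policy: Policy, tools: Dict[str, Tool], prompt: str, max_steps: int = 5
) -> EvalResult:
    """Let the client loop over ONLY the given tools until it answers."""
    observations: List[str] = []
    calls: List[str] = []
    for _ in range(max_steps):
        step = policy(prompt, observations)
        if isinstance(step, str):
            return EvalResult(output=step, tool_calls=calls)
        if step.name not in tools:
            # Tools outside the restricted set are simply unavailable.
            observations.append(f"error: unknown tool {step.name!r}")
            continue
        calls.append(step.name)
        observations.append(await tools[step.name](step.args))
    raise RuntimeError("client did not reach an answer within max_steps")


# Initial condition: a fake tool that reports a known failure.
async def get_flow_runs(args: dict) -> str:
    return "Upstream API responded with 503 Service Unavailable"


# Scripted stand-in for a model: look up the run once, then answer.
def scripted_policy(prompt: str, observations: List[str]) -> Union[ToolCall, str]:
    if not observations:
        return ToolCall("get_flow_runs", {"name": "example-run"})
    return f"Direct cause: {observations[-1]}"


result = asyncio.run(
    run_eval(scripted_policy, {"get_flow_runs": get_flow_runs}, "why did it fail?")
)
```

in a real eval the scripted policy would be an actual model-backed client and the tool dict would be whatever your MCP server exposes, but the two assertions stay the same: did the output contain the right answer, and did `result.tool_calls` include the tool you expected.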
```python
from collections.abc import Awaitable, Callable
from unittest.mock import AsyncMock

import pytest
from prefect import flow, get_run_logger
from prefect.client.orchestration import PrefectClient
from prefect.client.schemas.objects import FlowRun
from pydantic_ai import Agent  # assumption: the agent API used below matches pydantic-ai


@pytest.fixture
async def failed_flow_run(prefect_client: PrefectClient) -> FlowRun:
    # Initial condition: a flow run that fails with a known, recognizable error.
    @flow
    def flaky_api_flow() -> None:
        logger = get_run_logger()
        logger.info("Starting upstream API call")
        logger.warning("Received 503 from upstream API")
        raise RuntimeError("Upstream API responded with 503 Service Unavailable")

    # return_state=True hands back the failed state instead of raising.
    state = flaky_api_flow(return_state=True)
    return await prefect_client.read_flow_run(state.state_details.flow_run_id)


async def test_agent_identifies_flow_failure_reason(
    simple_agent: Agent,
    failed_flow_run: FlowRun,
    tool_call_spy: AsyncMock,
    evaluate_response: Callable[[str, str], Awaitable[None]],
) -> None:
    prompt = (
        "The Prefect flow run named "
        f"{failed_flow_run.name!r} failed. Explain the direct cause of the failure "
        "based on runtime information. Keep the answer concise."
    )

    # Let the agent loop with its tools until it produces an answer.
    async with simple_agent:
        result = await simple_agent.run(prompt)

    # Outcome check: does the answer name the real cause?
    await evaluate_response(
        "Does the agent identify the direct cause of the failure as 'Upstream API responded with 503 Service Unavailable'?",
        result.output,
    )

    # Agent must at least use get_flow_runs to get the actual error details
    tool_call_spy.assert_tool_was_called("get_flow_runs")
```
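the test above leans on a `tool_call_spy` fixture whose implementation isn't shown here. one plausible shape, sketched purely as an assumption, is an `AsyncMock` subclass that gets awaited with the tool name around every tool call and exposes an `assert_tool_was_called` helper. the `ToolCallSpy` class, its `called_tools` method, and the `spy(tool_name, arguments)` calling convention are all hypothetical; only `AsyncMock` and `await_args_list` come from the standard library.

```python
import asyncio
from typing import List
from unittest.mock import AsyncMock


class ToolCallSpy(AsyncMock):
    """Hypothetical spy, awaited as spy(tool_name, arguments) per tool call."""

    def called_tools(self) -> List[str]:
        # await_args_list has one entry per awaited call;
        # by our (assumed) convention the first positional arg is the tool name.
        return [call.args[0] for call in self.await_args_list if call.args]

    def assert_tool_was_called(self, tool_name: str) -> None:
        called = self.called_tools()
        assert tool_name in called, (
            f"expected tool {tool_name!r} to be called, saw {called}"
        )


async def demo() -> ToolCallSpy:
    spy = ToolCallSpy()
    # In a real harness this await would sit inside whatever hook
    # wraps the agent's tool invocations.
    await spy("get_flow_runs", {"limit": 5})
    return spy


spy = asyncio.run(demo())
spy.assert_tool_was_called("get_flow_runs")
```

asserting on the tool name rather than on exact arguments keeps the eval loose about how the client got there, while still catching an answer that was guessed without looking anything up.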
we will have more to share on this in the coming weeks! but we are thinking about how MCP clients (in general) should be enabled by MCP servers. CLIs are pretty great as agent tools already, so how should they play with MCP servers? where should mutations happen?