Revise WolframLanguageEvaluator and context tool descriptions #189
Open
mtirard wants to merge 1 commit into
Motivation
This follows recent mailing list feedback that agents in Claude Desktop reach for `wolframscript` via shell rather than the `WolframLanguageEvaluator` tool when computing in Wolfram Language. The MCP tool's description doesn't communicate what makes it preferable to a shell-spawned `wolframscript` invocation: a persistent kernel that avoids per-call startup cost.

Reviewing the four affected tool descriptions, several coupled issues compound that core gap:

1. `WolframLanguageEvaluator`'s key advantage, a persistent kernel that avoids `wolframscript`'s per-call startup cost, isn't stated in the description. The current text describes the mechanism ("Evaluates Wolfram Language code... in a Wolfram Language kernel") but not what makes it preferable.
2. The cross-tool nag "Always use the Wolfram context tool before using this tool" is unconditional. Unconditional "always" directives lose their force when applied to situations they don't fit, and dilute the credibility of other directives in the same prompt. The intent ("look things up first") is valuable; the unconditional mandate undermines it.
3. The three context tools have nearly identical descriptions, each with an "Always use at the start of new conversations" mandate that doesn't condition on the situation. An agent gets effectively the same signal from all three and can't disambiguate.
4. `WolframLanguageEvaluator`'s "read access to local files" understates the default `Method -> "Session"` capability, as already flagged by the TODO at `Kernel/Tools/WolframLanguageEvaluator.wl:33`.

These issues are coupled: addressing (2) requires the context tools to self-promote properly, which is what (3) is about. This PR addresses them together.
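To make the persistence point concrete, here is a minimal sketch of two successive evaluations, assuming both are sent to the same kernel session as this PR describes for `WolframLanguageEvaluator`. The symbol names `square` and `cached` are hypothetical and only serve the illustration.

```wolfram
(* First evaluation: define a helper and store a value in the session. *)
square[x_] := x^2;
cached = square[12];

(* Second, separate evaluation: both definitions are still available
   because the kernel was not restarted in between. *)
cached + square[3]
```

In a shared session the second evaluation gives 153. In a freshly spawned `wolframscript` process neither symbol would be defined, so the same input would come back unevaluated as `cached + square[3]`, on top of the kernel startup cost paid for that call.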
Design principles
Four cross-cutting principles guided the redesign. These are arguments, not measurements, and I'm happy to discuss any of them:

1. Positive framing. Rules of the form "do not X" require the agent to recognize and suppress X, which is fragile. Rules of the form "do Y, because Z" give a positive directive plus reasoning the agent can apply contextually. Throughout the new descriptions, directives are positive and come with the reasoning behind them.
2. No cross-tool mandates. Tool descriptions should sell their own tool, not other tools. Cross-tool coupling (Tool A's description pointing the agent at Tool B first) doesn't scale and undermines its own credibility when applied unconditionally. The "look things up first" intent is preserved, but relocated to the context tool descriptions, where it belongs.
3. Situational triggers over blanket mandates. Concrete situational conditions ("when verifying documented behavior", "when code isn't behaving as expected") give the agent something it can match against the current task. Blanket mandates force the agent to either over-apply them or ignore them, and both are failure modes.
4. Disambiguation through positive recommendation. `WolframContext` is the broadest of the three context tools, so a naive agent under uncertainty will default to it, doubling latency and result volume. Rather than deprecating it, the new description positions it as a fallback by positively recommending the specific tools when the domain is clear.

Changes per file
`Kernel/Tools/WolframLanguageEvaluator.wl`

- New opening: "Evaluates Wolfram Language code in a live, persistent kernel session. Definitions, variables, and loaded packages survive across calls." This states the actual differentiator from a fresh `wolframscript` subprocess; the original description never did.
- Replaced "Do not ask permission to evaluate code" with reasoning-based framing: "The user installed this MCP server deliberately — they want Wolfram Language used where it fits (computation, symbolic math, data lookups, plotting, etc.) to get results. When a request fits, evaluate code and return the result." The intent is for the agent to calibrate contextually (lean in when WL fits, avoid forcing WL into unrelated requests) rather than apply a bare rule.
- Resolved the TODO at line 33: "read access to local files" → "Read and write local files directly from code (e.g. with `Import`, `Export`)". The original understated the default `Method -> "Session"` capability. The examples also softly steer toward in-code file operations.
- Removed "Always use the Wolfram context tool before using this tool...". The intent is preserved, relocated to the context tool descriptions themselves, where it can be properly conditioned.
- The `\[FreeformPrompt]` block is unchanged. It is out of scope for this PR.

`Kernel/Tools/Context.wl`

All three context tool descriptions are rewritten with situational triggers and disambiguation:

- `WolframLanguageContext` triggers focus on programming (function lookup, behavior verification, symbol discovery).
- `WolframAlphaContext` triggers focus on factual lookups (real-world data, entity resolution, knowledge cross-reference).
- `WolframContext` leads with cross-domain queries (chemistry, physics, finance, geography) and explicitly recommends the specific tools when the domain is clear.

Testing
Smoke-tested locally with Claude Desktop on representative scenarios. Behavior matches expectations: agents reach for `WolframLanguageEvaluator` directly for computation requests without narrating first, and the context tools differentiate appropriately by query type.
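For reviewers who want to exercise the revised file-access wording, a round trip along the following lines may be a useful manual check. This is only a sketch, not the scenario used in the smoke test above: the file name is hypothetical, and it assumes the kernel is allowed to write to the system temporary directory.

```wolfram
(* Hypothetical file in the system temporary directory. *)
path = FileNameJoin[{$TemporaryDirectory, "evaluator-roundtrip.csv"}];

(* Write a small table to disk from code, then read it back. *)
Export[path, {{1, 2}, {3, 4}}];
data = Import[path];

(* Remove the temporary file so the check leaves nothing behind. *)
DeleteFile[path];

data
```

If both reading and writing are permitted, `data` should hold the same table that was exported. If the write is blocked, `Export` is the call that would be expected to fail.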