Revise WolframLanguageEvaluator and context tool descriptions#189

Open
mtirard wants to merge 1 commit into WolframResearch:main from mtirard:feature/tool-description-revision

@mtirard mtirard commented May 28, 2026

Motivation

This PR follows recent mailing list feedback that agents in Claude Desktop reach for wolframscript via shell rather than the WolframLanguageEvaluator tool when computing in Wolfram Language. The MCP tool's description doesn't communicate what makes it preferable to a shell-spawned wolframscript invocation: a persistent kernel that avoids per-call startup cost.

A review of the four affected tool descriptions turned up several coupled issues that compound that core gap:

  1. WolframLanguageEvaluator's key advantage — a persistent kernel that avoids wolframscript's per-call startup cost — isn't stated in the description. The current text describes the mechanism ("Evaluates Wolfram Language code... in a Wolfram Language kernel") but not what makes it preferable.

  2. The cross-tool nag "Always use the Wolfram context tool before using this tool" is unconditional. Unconditional "always" directives lose their force when applied to situations they don't fit, and dilute the credibility of other directives in the same prompt. The intent — "look things up first" — is valuable; the unconditional mandate undermines it.

  3. The three context tools have nearly identical descriptions with an "Always use at the start of new conversations" mandate that doesn't condition on situation. An agent gets effectively the same signal from all three and can't disambiguate.

  4. WolframLanguageEvaluator's "read access to local files" understates the default Method -> "Session" capability — already flagged by the TODO at Kernel/Tools/WolframLanguageEvaluator.wl:33.

These issues are coupled: addressing (2) requires the context tools to self-promote properly, which is what (3) is about. This PR addresses them together.

Design principles

Four cross-cutting principles guided the redesign. These are arguments, not measurements — happy to discuss any of them:

  • Positive framing. Rules of the form "do not X" require the agent to recognize and suppress X, which is fragile. Rules of the form "do Y, because Z" give a positive directive plus reasoning the agent can apply contextually. Throughout the new descriptions, directives are positive, with reasoning anchored in why they hold.

  • No cross-tool mandates. Tool descriptions should sell their own tool, not other tools. Cross-tool coupling (Tool A's description pointing the agent at Tool B first) doesn't scale and undermines its own credibility when applied unconditionally. The "look things up first" intent is preserved, but relocated to where it belongs.

  • Situational triggers over blanket mandates. Concrete situational conditions ("when verifying documented behavior", "when code isn't behaving as expected") give the agent something matchable against the current task. Blanket mandates require the agent to either over-apply them or ignore them — both failure modes.

  • Disambiguation through positive recommendation. WolframContext is the broadest of the three context tools — a naive agent under uncertainty will default to it, doubling latency and result volume. Rather than deprecating it, the new description positions it as a fallback by positively recommending the specific tools when the domain is clear.

Changes per file

Kernel/Tools/WolframLanguageEvaluator.wl

  • New opening: "Evaluates Wolfram Language code in a live, persistent kernel session. Definitions, variables, and loaded packages survive across calls." States the actual differentiator from a fresh wolframscript subprocess; the original description never did.

  • Replaced "Do not ask permission to evaluate code" with reasoning-based framing: "The user installed this MCP server deliberately — they want Wolfram Language used where it fits (computation, symbolic math, data lookups, plotting, etc.) to get results. When a request fits, evaluate code and return the result." The intent is for the agent to contextually calibrate (lean in when WL fits, avoid forcing WL into unrelated requests) rather than apply a bare rule.

  • Resolved the TODO at line 33: "read access to local files" → "Read and write local files directly from code (e.g. with Import, Export)". The original understated the default Method -> "Session" capability. The examples also softly steer toward in-code file ops.

  • Removed "Always use the Wolfram context tool before using this tool...". The intent is preserved — relocated to the context tool descriptions themselves where it can be properly conditioned.

  • The \[FreeformPrompt] block is unchanged. Out of scope for this PR.
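
For readers who want the persistent-session point made concrete, here is a toy Python analogy. It is not the MCP server's implementation and does not call a Wolfram kernel: a Python namespace stands in for the kernel, and the names `PersistentSession` and `evaluate_fresh` are hypothetical. It only shows why "definitions, variables, and loaded packages survive across calls" matters compared with starting from scratch each time.

```python
# Toy analogy only: a Python namespace stands in for a Wolfram kernel.
# Nothing here reflects how the MCP server is actually implemented.


class PersistentSession:
    """Keeps one namespace alive across calls, like a long-lived kernel."""

    def __init__(self):
        self.namespace = {}

    def evaluate(self, code):
        # Definitions made by earlier calls are still visible here.
        exec(code, self.namespace)
        return self.namespace.get("result")


def evaluate_fresh(code):
    """Starts from an empty namespace on every call, like a new subprocess."""
    namespace = {}
    exec(code, namespace)
    return namespace.get("result")


session = PersistentSession()
session.evaluate("def f(x): return x**2 + 1")            # call 1: define a function
session.evaluate("data = [f(n) for n in range(1, 6)]")   # call 2: build data with it
persistent_total = session.evaluate("result = sum(data)")  # call 3: reuse both

try:
    evaluate_fresh("result = sum(data)")  # no earlier state to draw on
    fresh_outcome = "ok"
except NameError:
    fresh_outcome = "NameError"
```

In the persistent case the third call can build on the first two; in the fresh case every call would have to resend its definitions, on top of paying the startup cost the description now mentions.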

Kernel/Tools/Context.wl

All three context tool descriptions rewritten with situational triggers and disambiguation:

  • The "Always use at the start of new conversations or if the topic changes" mandate is removed in favor of per-tool triggers an agent can match against current state.
  • The "up to 250 words" / "as much detail as possible" guidance is removed — let the agent decide.
  • WolframLanguageContext triggers focus on programming (function lookup, behavior verification, symbol discovery).
  • WolframAlphaContext triggers focus on factual lookups (real-world data, entity resolution, knowledge cross-reference).
  • WolframContext leads with cross-domain queries (chemistry, physics, finance, geography) and explicitly recommends the specific tools when the domain is clear.
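
The intended disambiguation can be summarized as a small lookup with a fallback. This is an illustration only: in practice the agent makes this choice from the tool descriptions, no such routing function exists in the PR, and the task labels (`"programming"`, `"factual"`) are hypothetical. Only the three tool names come from the description above.

```python
# Illustration only: the agent, not code, picks a tool from the descriptions.
# The task labels used as keys are hypothetical.


def pick_context_tool(task_kind):
    """Return the context tool the revised descriptions would favor."""
    specific_tools = {
        "programming": "WolframLanguageContext",  # function lookup, behavior checks
        "factual": "WolframAlphaContext",         # real-world data, entity resolution
    }
    # WolframContext is the broad fallback for cross-domain or unclear queries.
    return specific_tools.get(task_kind, "WolframContext")


choices = [pick_context_tool(kind) for kind in ("programming", "factual", "cross-domain")]
```

The point of the fallback position is that the broad tool is reached only when neither specific tool clearly fits, rather than by default under uncertainty.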

Testing

Smoke-tested locally with Claude Desktop on representative scenarios. Behavior matches expectations — agents reach for WolframLanguageEvaluator directly for computation requests without narrating first, and the context tools differentiate appropriately by query type.
