
ContextChef (3): Pruner — Decoupling Tool Registration from Routing

Chinese version: ContextChef (3): Pruner — Fully Decoupling Tool Registration from Routing

Tool hallucinations are usually blamed on “too many tools, model picks the wrong one.” That diagnosis is correct. The solution it suggests usually isn’t.

The intuitive fix is: only inject tools relevant to the current task each turn, hide the rest, and the hallucinations will stop. Manus tried this approach and ultimately abandoned it.

Two Problems with Dynamic Tool Addition/Removal

In Context Engineering for AI Agents, Manus explained why they follow “Mask, Don’t Remove”: never add or remove tool definitions mid-conversation. There are two reasons.

First, tool definitions are serialized near the front of the context, right after the system prompt. Any change to the tool list invalidates the KV-cache for every token that follows — including the entire conversation history. Rebuilding the cache every turn means full re-prefill every turn. For agents with an input/output ratio of 100:1, the cost is extreme.
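The asymmetry is easy to quantify with a back-of-envelope calculation. The prices below are illustrative assumptions, not any specific provider's rates; the point is only the ratio between cached and uncached prefill:

```typescript
// Back-of-envelope cost of busting the KV-cache every turn.
// Illustrative rates (assumptions, not real provider pricing):
// uncached input $3.00 / MTok, cached input reads $0.30 / MTok.
const UNCACHED_PER_MTOK = 3.0;
const CACHED_PER_MTOK = 0.3;

function prefillCost(contextTokens: number, cacheHit: boolean): number {
  const rate = cacheHit ? CACHED_PER_MTOK : UNCACHED_PER_MTOK;
  return (contextTokens / 1_000_000) * rate;
}

// A 50k-token context, prefilled across 20 agent turns:
const turns = 20;
const ctx = 50_000;
const stableToolList = turns * prefillCost(ctx, true);    // cache hit every turn
const shiftingToolList = turns * prefillCost(ctx, false); // cache invalidated every turn

console.log(stableToolList.toFixed(2));  // "0.30"
console.log(shiftingToolList.toFixed(2)); // "3.00"
```

Under these assumed rates, a shifting tool list makes the same conversation an order of magnitude more expensive — and that multiplier grows with context length and turn count.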

Second, the model references prior history when generating. If an earlier action in history referenced tool A, but tool A no longer exists in the current context, the model gets confused — it may violate the schema or fabricate a parameter structure. Tools aren’t state. They’re contracts. Contracts can’t be unilaterally modified mid-session.

Design Angle: Register Once, Choose Your Routing Strategy

Pruner’s design angle is to separate two concerns: tool registration is one-time; routing strategy is a per-turn decision. You register all tools at initialization with semantic tags; before each compile(), you choose which routing strategy determines the subset actually exposed to the model. Decoupling these two things delivers several concrete benefits:

Strategies can be swapped without touching registration. The same registered tools work with pruneByTask() today and the two-layer architecture tomorrow. Just change the routing call — registration stays untouched. This is genuinely useful during product iteration: start with the simplest approach, upgrade when you hit the wall, no need to refactor the registration layer.
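A minimal sketch of that split, in TypeScript. `pruneByTask()` comes from the article; the `registerTool()` name and the exact signatures are my assumptions for illustration, not ContextChef's real API:

```typescript
// Register once, route per turn. registerTool() is a hypothetical
// name for illustration; pruneByTask() is named in the article.
type Tool = { name: string; description: string; tags: string[] };

class Pruner {
  private tools: Tool[] = [];

  // Registration happens once, at initialization, with semantic tags.
  registerTool(tool: Tool): void {
    this.tools.push(tool);
  }

  // Routing is a per-turn decision: pick the subset the model sees.
  pruneByTask(taskTags: string[]): Tool[] {
    return this.tools.filter((t) => t.tags.some((tag) => taskTags.includes(tag)));
  }
}

const pruner = new Pruner();
pruner.registerTool({ name: "read_file", description: "Read a file", tags: ["fs"] });
pruner.registerTool({ name: "send_email", description: "Send an email", tags: ["comms"] });

// Today: flat pruning. Tomorrow: swap this one call for a two-layer
// router without touching the registerTool() calls above.
const exposed = pruner.pruneByTask(["fs"]);
console.log(exposed.map((t) => t.name)); // ["read_file"]
```

The registration layer never learns which strategy consumes it — that is the whole point of the decoupling.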

Tool descriptions are declarative; matching logic is transparent. Semantic filtering matches tool description fields against your task description. You write good descriptions; Pruner handles the matching. No hardcoded if/else routing tables, no manual “which tools go with which task” mappings to maintain.
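One naive way such matching could work, shown only to make the contract concrete — good descriptions in, a relevant subset out. A real implementation would typically use embeddings rather than this token-overlap toy:

```typescript
// Toy description-vs-task matcher: score by token overlap.
// This is NOT ContextChef's actual matching logic, just an
// illustration of description-driven (declarative) routing.
function tokenize(s: string): Set<string> {
  return new Set(s.toLowerCase().match(/[a-z]+/g) ?? []);
}

function relevance(description: string, task: string): number {
  const d = tokenize(description);
  const t = tokenize(task);
  let overlap = 0;
  for (const w of t) if (d.has(w)) overlap++;
  return overlap / Math.max(t.size, 1);
}

const tools = [
  { name: "query_db", description: "Run a SQL query against the analytics database" },
  { name: "resize_image", description: "Resize or crop an image file" },
];

const task = "pull last week's signups from the analytics database";
const ranked = tools
  .map((tool) => ({ tool, score: relevance(tool.description, task) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked[0].tool.name); // "query_db"
```

Note that the routing table never appears anywhere: the tool that describes itself best wins.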

Two-layer stability comes from structure, not per-turn manual control. Once the namespace tool list is established, it never changes. You don’t need to decide “which tools to expose this turn” in every agent loop iteration — that decision is made at architecture time, with zero runtime overhead.

Two Paths

With the design angle clear, the trade-offs between the two paths become obvious:

Flat pruning: The tool list can change each turn. Best for agents that don’t care about cache costs and have clear task boundaries. Lowest onboarding cost — tag tools at registration, call pruneByTask() before each turn. For agents with fewer than 20 tools, this is usually enough.

Two-layer architecture: Core tools are injected as stable namespace groups; the list never changes. Long-tail tools are registered as an XML directory and loaded on demand via load_toolkit. A stable tool list means optimal cache hit rates and no model confusion from a shifting tool set. The right choice when you have more than 20 tools or care about cache hit rates.
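A structural sketch of the two layers. The `load_toolkit` name comes from the article; the namespace layout, the XML directory format, and all tool names here are hypothetical:

```typescript
// Two-layer layout (illustrative structure, not ContextChef's real one):
// core tools live in stable namespace groups; long-tail tools appear
// only as an XML directory until explicitly loaded via load_toolkit.
type Tool = { name: string; description: string };

const coreNamespaces: Record<string, Tool[]> = {
  fs: [{ name: "fs_read", description: "Read a file" }],
  shell: [{ name: "shell_exec", description: "Run a shell command" }],
};

const longTail: Record<string, Tool[]> = {
  billing: [{ name: "billing_refund", description: "Issue a refund" }],
};

// The model sees this directory instead of full schemas for long-tail tools.
function toolkitDirectory(): string {
  return Object.entries(longTail)
    .map(([ns, tools]) => `<toolkit name="${ns}" tools="${tools.length}"/>`)
    .join("\n");
}

// load_toolkit exchanges a directory entry for its full tool schemas.
function load_toolkit(ns: string): Tool[] {
  if (!(ns in longTail)) throw new Error(`unknown toolkit: ${ns}`);
  return longTail[ns];
}

console.log(toolkitDirectory()); // <toolkit name="billing" tools="1"/>
console.log(load_toolkit("billing").map((t) => t.name)); // ["billing_refund"]
```

The core namespaces and the directory are fixed at architecture time, so every turn prefills an identical prefix — which is exactly what keeps the cache hot.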

The extra load_toolkit round-trip is a real cost, but what it buys is structural stability — the model can’t hallucinate a tool name that doesn’t exist in its schema, because it only ever sees the tool sets it has explicitly loaded.

Anthropic’s team put it plainly: “If a human engineer can’t definitively say which tool should be used in a given situation, an AI agent can’t be expected to do better.” This isn’t a model intelligence problem. It’s an information density problem. Give the model fewer but more precise choices, and its performance typically improves.


Next: Offloader/VFS — when a tool returns 15,000 characters of logs, what do you do?