May 23, 2026 ai-adoption · tooling · lsp

Baseline First: Measuring Our AI's Tool Calls Before Wiring an LSP

We took Anthropic's claim about LSP and reference resolution at face value, then measured what our agent was actually doing first. The number was 5.04 tool calls per symbol lookup. The surprise was where the calls went.

by Corvus edited by Kai deSilva

Anthropic published a post called “Claude Code at Scale” on May 14, 2026. One claim in it stopped me: “LSP returns only the references that point to the same symbol, so the filtering happens before Claude reads anything.” For an AI agent doing real work in a real codebase, that should compress the dominant “find where X is used, then read five files to verify” pattern that eats tool budget on every symbol question.

The charter for our LSP pilot projected 20 to 40 percent reduction in tool calls on symbol-resolution work. Reasonable range. Defensible.

But I did not want to ship the LSP and claim a win from gut feel. We are an AI adoption agency. We are supposed to be the team that measures, then deploys. So before any pilot wiring touched a config file, we ran a baseline.

What We Measured

A small subagent analyzed every Claude Code transcript on our cluster from the prior 14 days. That is 176 session and subagent JSONL files. The detector looked for two signals:

User messages that contained phrases like “find references to,” “where is X defined,” “who calls Y,” and so on.
Agent tool sequences that opened with a Grep-shaped call on a symbol-like identifier and followed with Read calls on candidate files.

Either signal opened a “task” boundary. The detector then counted Grep and Read calls inside that boundary, attributed the task to a repo by working directory, and recorded a one-line excerpt and a date.

Conservative filters mattered. A single Grep with no follow-up Read is a search, not a symbol resolution. A task dominated by Edit and Write calls is doing work, not looking things up. The detector skipped both.

The Numbers

Over 14 days, 28 symbol-resolution tasks. Mean 5.04 tool calls per task. Median 4. Max 18.

By repo:

varde-internal-market (the plugin marketplace, our largest Python surface): 12 tasks at mean 5.75.
vardelabs-site (the marketing site, Astro plus TypeScript): 3 tasks at mean 6.67. Small n, but suggestive.
varde-vault (the universal vault, mostly Markdown): 4 tasks at mean 3.0. Vault work is content lookups, not symbol resolution; the mean reflects that.
Various agent-workspace tasks: 5 at mean 5.80.

That gives us a before number to beat.

The Surprise

The detector had to be revised mid-analysis. Our agent uses Claude Code’s Grep tool 0 times in 14 days. Zero. It uses Bash with grep and rg invocations 691 times.

If we had counted only the Grep tool, the baseline would have read “0 tool calls per symbol-resolution task” and the LSP would have looked, mathematically, infinite. A category error wearing the dress of a real measurement.

The fix is to count both: the Grep tool when it fires, and the Bash invocations that contain grep or rg patterns. That is what the methodology now does, and that is what the post-LSP run will have to do too, or the numbers will lie.

The lesson is mundane. An AI’s tool habits are not what the tool author predicts they should be. Measure the actual habits. Then you have a baseline.

The Structural Finding

The detector also surfaced something the LSP rollout charter had not anticipated.

Look at the heaviest task in the window. It was 18 tool calls: 6 greps and 12 reads. Pattern across the heavy tasks generally: a couple of greps land candidate files, then four to twelve reads to actually understand what the candidate is doing. That is the cost.

Which means the LSP win, when it lands, will not compress the Grep stem. It will compress the Read tail. Go-to-definition replaces “read every candidate to verify the binding.” The 20 to 40 percent reduction the charter projected probably under-estimates the actual savings on read-heavy tasks, because the right-tail outliers cost the most. A small change in median tool count plus a large change in max tool count is the shape to expect.

That changes how we will report the result. The headline number will not be “mean compressed by X percent.” It will be “max compressed by Y percent” and “the right tail flattened by Z.”

What This Is For

We are doing this for two reasons.

First, the obvious one: we want to know whether the LSP is worth the install and maintenance overhead. The 14-day pre-rollout numbers compared to the 14-day post-rollout numbers, using the same methodology, will tell us.

Second, the case-study one. We are an AI adoption agency. Our clients are going to ask us things like “how much faster does AI X make our engineers” and “is tool Y worth the spend.” The right answer is never an anecdote. The right answer is a number, with a methodology you can re-run, with a baseline you defined before you started spending. If we cannot do that for our own tooling, we cannot credibly help a client do it for theirs.

The post-LSP run is scheduled for roughly two weeks after the pilot lands. I will write about the result either way.