How we measure

What the reference task behind the headline numbers actually is, what was compared, and why these are reference figures rather than per-session guarantees.

The short version: the headline numbers on this site come from one reference task. We ran a knowledge-heavy edit on a real codebase twice, once with Pathrule delivering the team's context and once without it, and recorded what the assistant did each time. This page tells you what that task was, what was compared, and where the numbers do and do not apply.

We would rather you understand one honest comparison than trust a round number with no story behind it.

The numbers we cite

These figures appear across the site, including on the context layer explainer and in the machine-readable summary at /llms.txt. They all trace to the same comparison.

Metric	With Pathrule	Without Pathrule	Effect
Input tokens	a small fraction	a large context window	about 85% fewer
Wall-clock time	seconds	minutes	5 to 8 times faster
Tool calls	a few	many	about 5 times fewer
Files read	one	several to many	about 10 times fewer
Cost per task	low	higher	about 80% lower

The token and cost reductions are linked: fewer files read and fewer tool calls mean a smaller input window, and the size of the input window drives most of the cost.
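The table states some effects as percentages and others as multiples. The two convert directly: a fraction r fewer is the same as 1/(1 - r) times fewer. The sketch below does that conversion, then shows why a cost reduction can track the input reduction yet land slightly below it when output tokens stay the same. The token counts and prices in it are made-up placeholders, not measured values and not any provider's rates.

```python
def reduction_to_multiple(fraction_fewer: float) -> float:
    """Convert 'X% fewer' (given as a fraction, e.g. 0.85) to 'N times fewer'."""
    if not 0.0 <= fraction_fewer < 1.0:
        raise ValueError("fraction_fewer must be in [0, 1)")
    return 1.0 / (1.0 - fraction_fewer)


def multiple_to_reduction(times_fewer: float) -> float:
    """Convert 'N times fewer' to the equivalent fraction fewer."""
    if times_fewer < 1.0:
        raise ValueError("times_fewer must be >= 1")
    return 1.0 - 1.0 / times_fewer


def task_cost(input_tokens: float, output_tokens: float,
              price_in: float, price_out: float) -> float:
    """Simple linear cost model: each token type billed at its own rate."""
    return input_tokens * price_in + output_tokens * price_out


# HYPOTHETICAL values, chosen only to make the arithmetic visible.
BASELINE_INPUT = 100_000   # input tokens without path-scoped context
OUTPUT_TOKENS = 1_000      # same small edit written in both runs
PRICE_IN = 1.0             # arbitrary cost units per input token
PRICE_OUT = 5.0            # arbitrary cost units per output token

reduced_input = round(BASELINE_INPUT * (1 - 0.85))  # "about 85% fewer"
cost_before = task_cost(BASELINE_INPUT, OUTPUT_TOKENS, PRICE_IN, PRICE_OUT)
cost_after = task_cost(reduced_input, OUTPUT_TOKENS, PRICE_IN, PRICE_OUT)
cost_reduction = 1 - cost_after / cost_before

print(f"85% fewer is about {reduction_to_multiple(0.85):.1f} times fewer")
print(f"10 times fewer is {multiple_to_reduction(10):.0%} fewer")
print(f"5 times fewer is {multiple_to_reduction(5):.0%} fewer")
print(f"cost reduction with these placeholders: {cost_reduction:.0%}")
```

With these placeholder values the input falls by 85% and the cost by roughly 81%, because the output tokens are billed in both runs. The real ratio depends on the actual token mix and pricing.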

What the reference task is

A developer asks an AI coding assistant to make a change to an existing codebase. The change is small in lines but loaded in context: the right answer depends on a decision the team made earlier that is not written in the source files.

The concrete shape is the coupon example we use elsewhere on the site. On this codebase a discount attaches to the line item, never to the order total. One engineer learned that the hard way; it is a team convention, not something the code states. A capable assistant reading the repo cannot derive it, because it is not there to derive.

That is the whole point of the task. It is chosen to expose the gap between what a code scan can find and what a team already knows.

What was compared

Two runs of the same task, same assistant, same prompt.

Without Pathrule. The assistant opens the relevant module, reads the schema, follows imports into related code, and reasons its way toward a plausible change. It often lands on the wrong attachment point, because the convention it needs is not in any file it can read. A reviewer catches it later.
With Pathrule. The path-scoped slice of the team's memories and rules arrives at hook time, before the first tool call. The convention is already in context, so the assistant makes the correct change on the first pass and reads far fewer files to get there.

The difference is not that one assistant is smarter. It is that one of them started from what the team already learned instead of rediscovering it from scratch. For how that delivery works, see How hooks work.

An open, reproducible benchmark

The reference task above is an illustration. The delivery efficiency is also measured in the open, in a separate suite you can run yourself, at github.com/pathrule/benchmarks. It checks out a real, pinned open-source repository (Fastify), seeds it with team-style knowledge, and asks each assistant the same ten prompts twice: once with the whole knowledge base dumped into one instruction file, once with Pathrule's path-scoped delivery. Scoring is mechanical (expected facts and required actions), every run is appended to disk, and the report deliberately publishes the cells where Pathrule costs more.
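To make "mechanical" scoring concrete, here is a minimal sketch of one way such a check could work: case-insensitive substring matching for expected facts, and set membership for required actions. The function name, the data shapes, the action labels and the matching rule are all assumptions for illustration; the actual fixtures and scoring rules are the ones in the repository.

```python
from dataclasses import dataclass


@dataclass
class Score:
    facts: float    # share of expected facts found in the response
    actions: float  # share of required actions the assistant performed


def score_run(response: str, actions_taken: set[str],
              expected_facts: list[str], required_actions: list[str]) -> Score:
    """Score one run without human judgment (hypothetical scoring rule)."""
    text = response.lower()
    facts_hit = sum(1 for fact in expected_facts if fact.lower() in text)
    actions_hit = sum(1 for action in required_actions if action in actions_taken)
    return Score(
        # An empty expectation list counts as fully satisfied.
        facts=facts_hit / len(expected_facts) if expected_facts else 1.0,
        actions=actions_hit / len(required_actions) if required_actions else 1.0,
    )


# Hypothetical run, loosely modeled on the coupon convention.
result = score_run(
    response="The discount attaches to the line item, never to the order total.",
    actions_taken={"edit_line_item"},
    expected_facts=["line item", "order total"],
    required_actions=["edit_line_item", "add_test"],
)
print(result)
```

In this made-up run both expected facts appear in the response, but only one of the two required actions was taken, so the fact score and the action score diverge. That is the kind of split the Facts and Actions columns report separately.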

The primary metric is total footprint, the tokens the model processes per turn, because that is the context load path-scoped delivery is meant to reduce. On the hard tier, three runs per cell:

Client	Total footprint	Facts	Actions
Claude Opus 4.8	about 52% lower	unchanged	unchanged
OpenAI Codex GPT-5.5	about 41% lower	1.6 pp lower	33.3 pp higher

The Codex result is mixed and is reported as such: a lower token footprint and a higher share of required actions followed, but a small drop in fact accuracy. Billable non-cached tokens are shown alongside total footprint, since prompt caching makes a static dump cheaper to bill than its raw size suggests.
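Two terms here are easy to misread. A percentage-point (pp) change is a plain difference between two rates, not a relative change. And total footprint and non-cached tokens can move in different directions, because a static dump is largely served from cache after the first turn. The sketch below illustrates both with made-up numbers; the rates and token counts are assumptions, not values from the report.

```python
def pp_change(before_rate: float, after_rate: float) -> float:
    """Difference between two rates, in percentage points."""
    return (after_rate - before_rate) * 100


def relative_change(before: float, after: float) -> float:
    """Relative change, e.g. -0.5 means 'about 50% lower'."""
    return (after - before) / before


# HYPOTHETICAL accuracy rates: 60% -> 70% is +10 pp, but about +17% relative.
print(f"{pp_change(0.60, 0.70):+.1f} pp")
print(f"{relative_change(0.60, 0.70):+.0%} relative")

# HYPOTHETICAL per-turn token counts for the two delivery styles.
dump = {"cached": 18_000, "non_cached": 2_000}    # static file, mostly cache hits
scoped = {"cached": 3_000, "non_cached": 6_000}   # smaller, but changes per path

dump_total = dump["cached"] + dump["non_cached"]
scoped_total = scoped["cached"] + scoped["non_cached"]

print(f"total footprint: {relative_change(dump_total, scoped_total):+.0%}")
print(f"non-cached tokens: {relative_change(dump['non_cached'], scoped['non_cached']):+.0%}")
```

In this invented case the total footprint drops by more than half while the non-cached count rises, which is why reporting the two side by side gives a fuller picture than either alone.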

These cells measure path-scoped native compilation. Semantic embedding ranking (managed in Pathrule Cloud, or your own embedding key in the open-source edition) is an additive layer and is measured separately, not in these cells. Fixtures, methodology and raw runs are all in the repository.

What the numbers do not mean

These are reference figures from one comparison. They are not a per-session guarantee, and we do not present them as an average across all work.

They are task-dependent. The reference task is deliberately knowledge-heavy. A change where the answer is already in the file you have open will not show an 85% token reduction, because there was little context to save in the first place.
They are codebase-dependent. The savings scale with how much tribal knowledge your team has captured and how much the assistant would otherwise have to scan to re-derive it.
They are model-dependent and tool-dependent. Different assistants read files and call tools at different rates, so the absolute numbers move with the tool you run.
They describe input, not output. The reductions are about the context the assistant has to take in, not about the size of the change it writes.

If a single number has to stand in for the set, we use "about 85% fewer input tokens" and pair it with "on a reference task". That phrasing is intentional and consistent.

Why a code scan does not close the gap

The slow run is not a tooling failure. The assistant did the reasonable thing: it read the code and reasoned about it. The miss is structural. The fact it needed is a team decision, not a property of the source. No amount of additional scanning surfaces a decision that the files do not record.

This is the part the numbers are really measuring: the cost of rediscovering, every session, knowledge the team already has. Pathrule carries that knowledge to the path where it applies so the assistant starts from it. See How knowledge compounds for what that looks like over many sessions.

What we will and will not publish

We publish what we measured and why it matters: the task, the comparison, the observed figures, and their limits.

We do not publish how the system decides which knowledge to surface. The ranking that picks the path-scoped slice is part of the product, not part of this page. If you want to understand the behavior rather than the internals, How retrieval finds the right knowledge describes what you can observe.

When you run your own knowledge-heavy task with and without Pathrule, you should expect the same shape of result: fewer files read, fewer tool calls, a smaller input window, and a change that reflects what your team already knows. The exact multiples will be yours, not ours.