Almost every team working with agents has been through this: you write a spec that looks reasonable, the agent runs, and what comes back is not quite what you asked for. The code is not broken; it is subtly wrong. One extra field, a flow nobody approved, an architectural decision made along the way. Then the rework starts: review, flag, ask for adjustments, hope it lands this time.

The reflex is to blame the model, but most of the time the model is not the problem. A spec for an agent is not a description; it is a constraint: the smallest set of boundaries that lets the agent execute without reinterpretation. Most “the agent did weird stuff” failures are spec failures, not model failures. This post is about what a spec must contain, what it must exclude, and how to know it is ready before you run the agent.

Why the narrative spec fails

The narrative spec is the most common format, and the most dangerous. It describes the feature in paragraphs, tells the user story, explains the business context, and somewhere in the middle says what needs to happen. With a human reader, it works. A senior developer reads it, fills the gaps with judgment, and moves on. An agent reads everything as signal and tries to honor everything equally.

The problem is that prose and constraints compete for the same space, and the agent cannot tell them apart. When you write “some visual feedback would be nice” as a passing thought, the agent may treat it as a requirement. When you write “ideally without adding dependencies” and mean it as a hard rule, the agent may treat it as an optional preference. There is no reliable way to tell the agent “this is mandatory, this is decoration” without marking it explicitly.

The other problem is that narrative hides the actual decision. Many specs look complete until you try to answer the question: what concrete decision is this spec asking for? If the answer requires rereading three paragraphs and inferring, the spec is already ambiguous for an agent.

What a minimum spec must contain

Here is the core, without ceremony. If any item is missing, the spec is very likely to produce rework.

Concrete decision. One imperative sentence stating what the agent will deliver by the end of the run. Not “improve the login flow” but “add email verification to new user signup using the configured email provider”. The difference is that the second version does not admit two reasonable readings.

Verifiable acceptance criteria. Each AC needs a mechanical way to check it was met. Given/When/Then works, SHALL works, plain text you can verify manually works. What does not work is an AC like “the flow should feel smooth”. Smooth for whom, measured how, verified by which test?

Explicit non-goals. This is the part most people skip. A non-goal is what the spec is consciously leaving out. “Do not add new endpoints”, “do not touch the users schema”, “do not change the logout flow” are non-goals. Without this layer, the agent decides the scope on its own, and almost always decides for more than you wanted.

Target files. The files that will be touched, named. src/auth/email-verification.ts, src/auth/routes.ts, tests/auth/email-verification.test.ts. Not “the auth module”, not “wherever it makes sense”. If the file list is not obvious, that is a sign the spec is still too abstract and needs another pass.

Critical technical constraints. Anything that will trigger review if ignored: dependency version, the error pattern the project follows, a type contract, a performance bound, a compatibility requirement. This is not the place to repeat the full style guide, which already lives in the project constitution. It is the place for constraints specific to this task.
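The five parts above can be sketched as a small data structure with a presence check. This is a minimal illustration, not a prescribed format: the Spec class and the missing_parts helper are hypothetical names introduced here, and the field names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Spec:
    """Hypothetical container for the five parts of a minimum spec."""
    decision: str = ""
    acceptance_criteria: list[str] = field(default_factory=list)
    non_goals: list[str] = field(default_factory=list)
    target_files: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)


def missing_parts(spec: Spec) -> list[str]:
    """Return the names of the parts that are empty or blank."""
    missing = []
    if not spec.decision.strip():
        missing.append("decision")
    for name in ("acceptance_criteria", "non_goals", "target_files", "constraints"):
        items = getattr(spec, name)
        # A list of blank strings counts as missing too.
        if not any(item.strip() for item in items):
            missing.append(name)
    return missing


draft = Spec(
    decision="Add email verification to new user signup using the configured email provider.",
    target_files=["src/auth/email-verification.ts", "src/auth/routes.ts"],
)
print(missing_parts(draft))  # ['acceptance_criteria', 'non_goals', 'constraints']
```

A check like this only catches omissions. It cannot tell whether a field that is present is also precise.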

What to leave out

What stays out of the spec matters as much as what goes in. A lean spec avoids:

  • Discussion history. Debates about alternatives turn into noise. If an alternative was ruled out and the reason matters, it becomes a short ADR (architecture decision record). If it does not matter, cut it.
  • Long product rationale. The agent does not need business motivation to execute. It needs the decision. Long motivation dilutes the signal.
  • Tentative code samples. A snippet “just to give an idea” almost always ends up copied verbatim by the agent, bugs included.
  • Generic project instructions. Naming conventions, commit style, test rules already live in the project constitution. Repeating them bloats the prompt and increases the odds of a silent contradiction with the canonical document.
  • Decorative preferences. “It would be nice if”, “ideally”, “if possible”. Either it is a requirement, or it does not belong in the spec.

The operational rule is this: every line in the spec should change what the agent does. If it does not change anything, it does not belong. This is the same logic as the context budget from the previous post, applied to the spec artifact instead of the full prompt.
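Decorative preferences are the easiest item on that list to catch mechanically. The sketch below flags the hedging phrases quoted above so each one can be promoted to a requirement or cut. The phrase list and the find_hedges helper are assumptions for illustration, and a real draft would likely need a longer list.

```python
import re

# Hedging phrases taken from the examples in this post; extend as needed.
HEDGE_PATTERN = re.compile(
    r"\b(would be nice|ideally|if possible)\b",
    re.IGNORECASE,
)


def find_hedges(spec_text: str) -> list[tuple[int, str]]:
    """Return (line_number, phrase) for every hedging phrase found."""
    hits = []
    for number, line in enumerate(spec_text.splitlines(), start=1):
        for match in HEDGE_PATTERN.finditer(line):
            hits.append((number, match.group(0).lower()))
    return hits


draft = """Add email verification to signup.
Some visual feedback would be nice.
Ideally without adding dependencies.
No new dependencies."""

for number, phrase in find_hedges(draft):
    print(f"line {number}: '{phrase}' -> make it a requirement or cut it")
```

On this draft the check flags lines 2 and 3 and leaves line 4 alone, because “No new dependencies.” is already stated as a constraint.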

The operational test: two agents, same output

There is a mental test that beats any long checklist. Before running, ask:

Would two different agents, given the same spec and the same context, produce functionally equivalent output?

If the answer is no, the spec is underspecified. Some decision still lives in the author’s head. Instead of running the agent, it is worth five minutes making that decision explicit.

This test catches what checklists miss. A spec can have every field filled in and still fail the test because the fields are vague. A spec can have far less than the full template and pass, because each field is precise.

“Functionally equivalent” here means observable behavior is the same. Different internal names, a different directory layout, and a different internal parameter order are all acceptable. What is not acceptable is one agent treating a field as required and another treating it as optional.
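The required-versus-optional disagreement can be made concrete by running two implementations on the same inputs and listing where they differ. Everything in this sketch is hypothetical: the two signup functions stand in for what two agents might return from the same underspecified spec, and display_name is an invented field.

```python
from typing import Callable

Payload = dict[str, str]


def signup_by_agent_a(payload: Payload) -> int:
    """Hypothetical output of agent A: display_name is required."""
    if not payload.get("email") or not payload.get("display_name"):
        return 400
    return 201


def signup_by_agent_b(payload: Payload) -> int:
    """Hypothetical output of agent B: display_name is optional."""
    if not payload.get("email"):
        return 400
    return 201


def behavioral_diffs(
    first: Callable[[Payload], int],
    second: Callable[[Payload], int],
    cases: list[Payload],
) -> list[Payload]:
    """Return the inputs on which the two implementations disagree."""
    return [case for case in cases if first(case) != second(case)]


cases = [
    {"email": "a@example.com", "display_name": "Ana"},
    {"email": "a@example.com"},
    {},
]
print(behavioral_diffs(signup_by_agent_a, signup_by_agent_b, cases))
# [{'email': 'a@example.com'}]
```

Agreement on a finite set of cases does not prove equivalence. A comparison like this can only surface disagreements, and each one points at a decision the spec left open.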

Common spec failures

Four patterns show up often enough to deserve names. All of them are fixable in minutes once you recognize the pattern.

Narrative spec instead of constraint spec. The spec tells a story instead of setting boundaries. Fix: extract the decision into an imperative sentence at the top, move the rest to optional context or drop it.

AC without verifiability. “Smooth experience”, “expected behavior”, “works well”. Fix: for each AC, write the concrete check. If the check does not exist, the AC does not exist.

Implicit non-goals. The author knows what is out of scope but did not write it down. Fix: list three to five non-goals before sending. Usually the first one that comes to mind is the most important.

Decision hidden in prose. The spec looks complete, but the actual decision is diluted across two paragraphs. Fix: if you cannot point to the sentence that contains the decision, the decision is not in the spec.
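Scope drift from implicit non-goals can also be caught after a run by comparing the files the agent touched against the named target files. The sketch below assumes the list of changed paths is already available (for example from git diff --name-only). The out_of_scope helper and the src/users/schema.ts path are hypothetical.

```python
def out_of_scope(changed_files: list[str], target_files: list[str]) -> list[str]:
    """Return changed paths that are not in the spec's target file list."""
    # Tolerate a trailing note in parentheses, e.g. "src/auth/routes.ts (add confirmation route)".
    allowed = {entry.split(" (", 1)[0].strip() for entry in target_files}
    return sorted(path for path in changed_files if path.strip() not in allowed)


target_files = [
    "src/auth/email-verification.ts (new)",
    "src/auth/routes.ts (add confirmation route)",
    "tests/auth/email-verification.test.ts (new)",
]
changed_files = [
    "src/auth/routes.ts",
    "src/auth/email-verification.ts",
    "src/users/schema.ts",  # hypothetical file the spec never named
]
print(out_of_scope(changed_files, target_files))  # ['src/users/schema.ts']
```

Any path in the result is either a non-goal the agent crossed or a target file the spec forgot to name. Both cases are worth fixing in the spec before the next run.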

A before-and-after example

Imagine the task is to add email verification to signup. A typical spec would look like this:

We need to improve signup security. Today users can create an account
with any email, including throwaway domains or typos, which has been
causing support issues. It would be good to have email verification,
ideally using something we already have configured. The flow should
feel intuitive and not hurt conversion. If possible, also consider the
OAuth case. Worth evaluating whether to block features before
verification.

Looks reasonable. It is terrible for an agent. The decision is hidden in “it would be good to have”, there are three optional-looking items with no clear markers, two triggers for the agent to expand scope (“consider OAuth”, “evaluate blocking”), no verifiable AC, and no target files.

The same task rewritten as a constraint:

decision: >
  Add email verification to new user signup using the email provider
  already configured at src/email/provider.ts.

acceptance_criteria:
  - Given a signup with a valid email, when the account is created,
    then a verification email with a token is sent.
  - Given a valid verification token, when the confirmation endpoint
    is called, then the user is marked as verified.
  - Given an expired or invalid token, when the confirmation endpoint
    is called, then the response is 400 with a generic error.
  - Existing users keep working without verification.

non_goals:
  - Do not change the OAuth flow.
  - Do not block features for unverified users.
  - Do not add resend verification in this story.
  - Do not touch logout or password reset.

target_files:
  - src/auth/email-verification.ts (new)
  - src/auth/routes.ts (add confirmation route)
  - src/users/user.model.ts (add verified_at field)
  - tests/auth/email-verification.test.ts (new)

constraints:
  - Tokens expire in 24h.
  - Error messages follow the pattern in src/errors/http-errors.ts.
  - No new dependencies.

The second version is shorter, duller to read, and much less ambiguous. Two different agents would deliver equivalent results. The first version would ship two different features depending on the model’s mood.

Note that this format does not depend on any product-specific DSL. It is structured text. It can live as YAML, markdown, a field in a story tool, whatever fits your flow.
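To show what “verifiable” means in practice, the acceptance criterion about expired or invalid tokens can be written as a mechanical check. This is a sketch under stated assumptions: confirm is a hypothetical in-memory stand-in for the confirmation endpoint, not the project's real code, and the 200 status for a valid token is an assumption because the spec only fixes the 400 case.

```python
from datetime import datetime, timedelta, timezone

TOKEN_TTL = timedelta(hours=24)  # from the spec's constraints


def confirm(token: str, issued: dict[str, datetime], now: datetime) -> int:
    """Hypothetical stand-in for the confirmation endpoint; returns an HTTP status."""
    issued_at = issued.get(token)
    if issued_at is None or now - issued_at > TOKEN_TTL:
        return 400  # invalid or expired: same generic error either way
    return 200  # assumed success status; the spec does not name one


def test_expired_or_invalid_token_returns_400() -> None:
    now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
    issued = {
        "fresh": now - timedelta(hours=1),
        "stale": now - timedelta(hours=25),
    }
    assert confirm("stale", issued, now) == 400    # expired
    assert confirm("unknown", issued, now) == 400  # never issued
    assert confirm("fresh", issued, now) == 200    # still valid


test_expired_or_invalid_token_returns_400()
print("acceptance check passed")
```

The check does not inspect the error body, so the “generic error” half of the criterion would still need its own assertion against the project's error pattern.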

Specs are an editing skill, not a writing skill

Writing a long spec is easy. Writing a lean spec is hard for the same reason a short legal brief is hard: every word has to earn its place. In practice, a good spec usually goes through two or three passes, and each pass removes content.

The first pass tends to mix everything together. The second pass splits decision from context and marks the non-goals. The third pass cuts half of what does not change behavior. When you keep removing and the two-agent test still passes, the spec is ready.

Five minutes editing a spec saves hours reviewing a PR. The return on investment is lopsided in your favor.

Closing

Context and spec are two different dimensions of the same problem. Context answers “what does the agent see”. Spec answers “what does the agent need to decide”. Together, they are the foundation of working with agents. An inflated prompt with an ambiguous spec is the combo that produces the most rework. Minimal context with a constraint-style spec is the combo that pays off the most.

If you take one thing from this post: next time you are about to run an agent, before you hit enter, read the spec and ask out loud “would two agents produce the same output?” If the answer is not a clear yes, edit the spec first. It is the highest return per minute you can get in any agent workflow.
