TL;DR
Acceptance criteria are the difference between AI coding tools that ship production software and AI coding tools that produce demo-grade throwaways. Well-written criteria — specifically, Gherkin-style Given/When/Then tagged as happy path, edge case, and failure state — constrain the LLM's output so it handles negative paths, respects invariants, and produces testable code. This guide covers the full playbook: when to use Gherkin versus bullet lists, the 12 categories of edge cases LLMs skip by default, the exact prompt that generates production-grade criteria from a user story, and common failure modes in AI-generated acceptance criteria. For the category context, see our pillar guide on AI product planning.
Why acceptance criteria matter more with AI than without
When a human engineer implements a user story, they fill gaps using judgment. "As a user, I want to reset my password" — a human knows they need token expiry, rate limiting, and email validation even if the story doesn't say so. They have years of shipping production software in their head.
An LLM does not have that. An LLM interprets prompts literally. Without explicit acceptance criteria, it will:
- Skip password complexity requirements unless asked
- Forget email confirmation unless asked
- Leave error states unhandled unless asked
- Ignore rate limits unless asked
- Miss the forgot-password-for-disabled-accounts case every time
This is not a failure of the LLM. It is the LLM doing exactly what was asked. The prompt defines the scope; criteria are the part of the prompt that defines scope completeness.
Rule of thumb: the gap between a good engineer's output and an LLM's output on the same story is roughly equal to the missing acceptance criteria.
Gherkin vs bullet lists: when to use each
Two formats dominate real acceptance criteria:
Gherkin (Given/When/Then)
Given the user is logged in with a verified email
When they submit a feature request with valid fields
Then the system saves the request with status "pending_review"
And the user sees a confirmation toast with the request ID
And the user receives an email confirmation within 60 seconds
Use Gherkin when:
- The criterion describes user-facing behaviour or a system interaction
- QA will automate the test
- The feature has sequential steps or preconditions
Bullet checklist
- Password must include at least one uppercase letter
- Password must include at least one number or symbol
- Password must be 12+ characters
- Password must not match the last 5 stored hashes
- Validation errors show inline below the password field
Use bullet lists when:
- The criterion describes invariants or rules (not a sequence)
- Multiple small constraints apply independently
- Each item is effectively a separate truth that can pass or fail alone
Most stories need both. The Gherkin scenarios describe the flow; the bullet list describes the constraints that must hold throughout the flow.
The three categories of criteria every story needs
Tag every acceptance criterion explicitly:
| Tag | Purpose | Example |
|---|---|---|
| Happy path | The expected successful flow | User with valid data completes action → system responds with success |
| Edge case | Boundary conditions and unusual but valid inputs | Empty state, max limits, concurrent actions, stale session |
| Failure state | Error handling, invalid input, system failures | Unauthenticated access, invalid input, server error, rate limit |
Every story must have at least one of each. This is the single highest-leverage rule in acceptance-criteria writing, and it is the rule LLMs skip by default.
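The "at least one of each" rule is easy to check mechanically. The sketch below assumes criteria are stored as dictionaries with a `category` key, using the tag identifiers from the prompt later in this guide; `REQUIRED_TAGS` and `missing_tags` are hypothetical names.

```python
from collections import Counter

# Tag identifiers borrowed from the prompt's output format (an assumption here).
REQUIRED_TAGS = ("happyPath", "edgeCase", "failureState")


def missing_tags(criteria: list) -> list:
    """Return the required tags that no criterion in the story carries."""
    seen = Counter(c["category"] for c in criteria)
    return [tag for tag in REQUIRED_TAGS if seen[tag] == 0]
```

A story whose criteria are all tagged `happyPath` returns `["edgeCase", "failureState"]`, which is the signal to regenerate or write the missing scenarios by hand.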
The 12 edge-case categories LLMs forget
LLMs generating acceptance criteria without constraint skip these categories almost every time. Paste them as explicit required coverage when you prompt:
- Empty state — what happens when the list/table/result set is empty?
- Single-item state — the list has exactly one item; do labels pluralize correctly?
- Maximum limits — 10,000 items, 4 MB file, 32-character field; what is rejected, and what degrades?
- Minimum limits — 0 characters, 1-byte file, $0 price; are these allowed? What happens if so?
- Concurrent action — two users edit the same record simultaneously; who wins?
- Stale session — token expired mid-action; what does the user see?
- Network failure mid-transaction — request sent, no response received; can the user retry safely?
- Partial success — 3 of 5 items saved, 2 failed; what does the user see?
- Race condition — user double-clicks submit; does the backend dedupe idempotently?
- Cascading delete — user deletes a parent entity; what happens to children?
- Permission change mid-session — user demoted from admin to member while on /admin page; what happens?
- Very long input — name field receives 10,000 characters; does validation kick in before the DB rejects?
A story covering all 12 has 30–40 acceptance criteria. That sounds like a lot. It is the difference between a story that ships and a story that generates a production incident.
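To make one of these categories concrete, here is a sketch of what the race-condition criterion implies for the backend: a double-clicked submit must not create two records. It assumes the client sends the same idempotency key with both clicks; `FeatureRequestStore` is a hypothetical in-memory stand-in for a real database with a unique constraint on that key.

```python
import threading


class FeatureRequestStore:
    """In-memory stand-in for a backend that dedupes repeated submits."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_key = {}
        self._next_id = 1

    def submit(self, idempotency_key: str, payload: dict):
        """Return (record, created). A repeated key returns the first record."""
        with self._lock:  # serialise concurrent submits
            existing = self._by_key.get(idempotency_key)
            if existing is not None:
                return existing, False
            record = {"id": self._next_id, "status": "pending_review", **payload}
            self._next_id += 1
            self._by_key[idempotency_key] = record
            return record, True
```

The matching criterion would read along the lines of: Given the user has submitted a request, When the same submit arrives again with the same key, Then no second record is created and the original request ID is returned.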
The prompt that generates production-grade criteria
Copy this directly. Feed in one user story at a time. Keep the rules section verbatim; removing rules tends to degrade output quality.
Generate acceptance criteria for this user story:
<PASTE USER STORY>
Additional context:
- Product: <1-line product description>
- User roles that exist in the system: <list>
- Relevant system constraints (rate limits, compliance requirements): <list>
Rules (do not skip any):
1. Output at least 3 criteria for happy path, 4 for edge cases, 4 for failure states.
2. Use Gherkin format: "Given <preconditions> When <action> Then <result>".
3. Tag each criterion: happyPath | edgeCase | failureState.
4. Edge cases MUST cover: empty state, max limits, concurrent actions, stale session.
5. Failure states MUST cover: unauthenticated access, invalid input, server error, rate limit (if applicable).
6. Use specific values, not placeholders. "10,000 items" not "a large number".
7. Every criterion must be independently verifiable. "It should be fast" is banned.
8. If a criterion references a time limit, use a specific number (e.g. "within 60 seconds").
9. If a criterion references a field, use its exact name (not "the field").
10. Output as JSON array:
[{ "category": "happyPath|edgeCase|failureState", "criterion": "Given... When... Then..." }]
With this scaffold, GPT-5 or Claude Opus produces usable criteria in one shot. Fast models (Gemini Flash, GPT-4o mini) need 2–3 iterations; they skip edge cases even with rules.
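Since some models skip rules, it helps to check the output before accepting it. The following is a sketch that enforces rules 1, 2, 3, and 10 of the prompt above on the returned JSON. `check_output`, `MINIMUMS`, and the `GHERKIN` pattern are hypothetical names, and the regular expression is a loose shape check rather than a Gherkin parser.

```python
import json
import re

# Minimum counts from rule 1 of the prompt.
MINIMUMS = {"happyPath": 3, "edgeCase": 4, "failureState": 4}
# Loose shape check: Given ... When ... Then ..., in that order.
GHERKIN = re.compile(r"^\s*Given\b.+\bWhen\b.+\bThen\b.+", re.DOTALL)


def check_output(raw_json: str) -> list:
    """Return a list of problems in the model output; empty means it passes."""
    try:
        items = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(items, list):
        return ["top-level value is not a JSON array"]

    problems = []
    counts = dict.fromkeys(MINIMUMS, 0)
    for i, item in enumerate(items):
        category = item.get("category") if isinstance(item, dict) else None
        criterion = item.get("criterion") if isinstance(item, dict) else None
        if not isinstance(category, str) or category not in MINIMUMS:
            problems.append(f"item {i}: unknown category {category!r}")
        else:
            counts[category] += 1
        if not isinstance(criterion, str) or not GHERKIN.match(criterion):
            problems.append(f"item {i}: criterion is not Given/When/Then")
    for category, minimum in MINIMUMS.items():
        if counts[category] < minimum:
            problems.append(
                f"{category}: {counts[category]} criteria, need at least {minimum}"
            )
    return problems
```

If the returned list is not empty, paste it back to the model as feedback and regenerate. That turns the extra iterations smaller models need into a loop you can automate.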
Worked example: "As a user, I want to reset my password"
Before rules: typical LLM output is 3 criteria covering the happy path.
After rules: the output below is what a reasoning model produces when prompted with the rules above.
# Happy path
Given the user is on the login page
When they click "Forgot password" and enter a registered email
Then the system sends a password reset email within 60 seconds
And the user sees a confirmation screen saying "Check your email"
Given the user clicks the reset link in their email within 15 minutes
When they enter a new password meeting complexity requirements
Then the password is updated and they are redirected to the login page
And they receive a confirmation email that their password was changed
Given the user has just reset their password
When they log in with the new password
Then login succeeds and they are redirected to the dashboard
# Edge cases
Given the user enters an email that is not registered
When they submit the forgot-password form
Then the system displays the same confirmation message as for valid emails
And no email is sent (prevents email enumeration)
Given the user has already requested a reset within the last 2 minutes
When they submit another forgot-password request
Then the system rate-limits them with a message "Please wait 2 minutes before requesting another reset"
Given the reset link is older than 15 minutes
When the user clicks it
Then the system displays "This link has expired" and prompts for a fresh request
Given the user has an account but it has been disabled by admin
When they request a password reset
Then the same generic confirmation displays but no email is sent
And an alert is logged for the security team
# Failure states
Given the user is not on a verified HTTPS connection
When they submit the forgot-password form
Then the system rejects the request with a security error
Given the email service is down
When the system tries to send the reset email
Then the operation is queued for retry with exponential backoff up to 24 hours
And the user receives a notification if the email fails after final retry
Given the user enters a new password that fails complexity validation
When they submit
Then inline errors display beside each failing requirement
And the old password remains active
Given the user's session token is tampered with during the reset flow
When they attempt to set a new password
Then the system rejects the request and logs a security event
That is 11 scenarios carrying 17 Then/And outcomes, all independently verifiable, covering the full matrix. This is what QA can automate against and what a Cursor/Bolt session can implement in one pass without re-prompting.
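To show how directly such criteria translate into automated checks, here is a sketch covering three of the scenarios above: registered email, unregistered email, and disabled account. `ResetService` and its fields are hypothetical stand-ins for a real reset endpoint, and the account states are an assumption about how your system models disabled users.

```python
from dataclasses import dataclass, field

CONFIRMATION = "Check your email"


@dataclass
class ResetService:
    accounts: dict  # email -> "active" or "disabled" (hypothetical states)
    sent_emails: list = field(default_factory=list)
    security_log: list = field(default_factory=list)

    def request_reset(self, email: str) -> str:
        status = self.accounts.get(email)
        if status == "active":
            self.sent_emails.append(email)
        elif status == "disabled":
            self.security_log.append(f"reset requested for disabled account: {email}")
        # Same message on every branch, so the response never reveals
        # whether the email is registered (prevents email enumeration).
        return CONFIRMATION


# Each assertion mirrors one Then/And line from the scenarios.
service = ResetService({"ana@example.com": "active", "bo@example.com": "disabled"})
replies = [
    service.request_reset(e)
    for e in ("ana@example.com", "nobody@example.com", "bo@example.com")
]
assert replies == [CONFIRMATION] * 3             # identical confirmation each time
assert service.sent_emails == ["ana@example.com"]  # only the active account gets mail
assert len(service.security_log) == 1            # disabled account raises an alert
```

Each scenario maps to one or two assertions with no interpretation needed, which is the practical meaning of "independently verifiable".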
Acceptance criteria for non-technical PMs
BDD and Gherkin look intimidating to PMs who haven't worked with QA. The good news: you don't need to know the framework. You need to know the pattern.
Given (a starting situation), When (a user or system action), Then (an observable outcome).
That's it. If a PM can describe the situation, the action, and the result in plain sentences, they can write Gherkin. No tools, no syntax, no training.
The biggest adjustment for PMs is the mandatory failure-state requirement. PMs are trained to describe "what success looks like". Gherkin demands you also describe what failure looks like — and for every story, not just the critical ones. Start by asking "what happens when the thing goes wrong?" for every story you review. After a month, it becomes reflex.
Common AI-generated acceptance-criteria failures
Watch for and reject:
- Vagueness. "System handles errors gracefully" is not a criterion. Demand specific behaviour.
- Success-only output. If the model returns only happy paths, the rules prompt above was not followed. Regenerate.
- Placeholder values. "Within X seconds" instead of "within 60 seconds". Always demand real numbers.
- Criteria that reference non-existent fields. If the criterion says "the profile image field" but the story doesn't mention profile images, the model hallucinated scope. Delete.
- Overlapping criteria. Two criteria saying the same thing in different words. Merge or drop.
- Criteria disguised as tests. "The login test passes" is not an acceptance criterion. Describe the behaviour, not the test.
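Some of these failures can be caught with a simple heuristic pass before human review. The sketch below flags vague wording, placeholder values, and criteria phrased as tests. The word lists and the names `CHECKS` and `lint` are assumptions to tune for your own team; it will not catch hallucinated scope or overlapping criteria, which still need a reader.

```python
import re

# (label, pattern) pairs; each pattern is a deliberately simple heuristic.
CHECKS = [
    ("vague wording",
     re.compile(r"\b(gracefully|properly|appropriately|fast|user-friendly)\b", re.I)),
    ("placeholder value",
     re.compile(r"\b(within|after|up to|at least|at most)\s+[XYN]\b")),
    ("describes a test, not behaviour",
     re.compile(r"\btests?\s+pass(es)?\b", re.I)),
]


def lint(criterion: str) -> list:
    """Return the label of every heuristic the criterion trips."""
    return [label for label, pattern in CHECKS if pattern.search(criterion)]
```

For example, `lint("System handles errors gracefully")` flags vague wording, while a criterion such as "Then the system sends a password reset email within 60 seconds" passes clean.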
How this fits the rest of the spec
Acceptance criteria are Stage 5 of the 7-stage pipeline we describe in How to Generate an App Spec from a Prompt. Each stage's output feeds the next:
- Personas (Stage 2) define user roles that appear in Given clauses
- User stories (Stage 4) define the action in When clauses
- Acceptance criteria (Stage 5 — this post) define the complete Then matrix
- Schema (Stage 6) receives new constraint fields that criteria imply (e.g. rate-limit tables)
- Pages (Stage 7) must render the error states that failure-state criteria require
In VibeMap, editing a single acceptance criterion flags the page that needs the error state, the schema that needs the tracking field, and the test case that needs updating. In a freeform ChatGPT workflow, those propagations are manual.
Fast path — try it on one story
The free VibeMap User Story Generator outputs INVEST-format stories and is the input to the acceptance-criteria stage. Run a feature through it, then use the prompt template above to generate Gherkin criteria for the resulting stories.
For the end-to-end pipeline with automatic persistence, linked artefacts, and direct Linear/Cursor export, VibeMap generates user stories and their acceptance criteria, linked, in one flow.
Related reading
- AI Product Planning: The Complete Guide — pillar covering the full pipeline.
- The Anatomy of a Good AI-Generated User Story — INVEST deep-dive; the stories that acceptance criteria attach to.
- How to Generate User Stories from a Prompt — Stage 4 of the pipeline.
- How to Generate an App Spec from a Prompt — full pipeline with acceptance criteria in context.
Sources & further reading
- Dan North, Introducing BDD — origin of Behaviour-Driven Development and the Given/When/Then pattern.
- Cucumber, Gherkin Reference — canonical Gherkin syntax.
- Atlassian, Acceptance Criteria for User Stories: 5 Examples — industry examples across BDD and checklist formats.
- Ken Schwaber and Jeff Sutherland, The Scrum Guide — definition of the "Definition of Done" (the guide itself does not define acceptance criteria).



