Author: VibeMap

TL;DR

Acceptance criteria are the difference between AI coding tools that ship production software and AI coding tools that produce demo-grade throwaways. Well-written criteria (specifically, Gherkin-style Given/When/Then tagged as happy path, edge case, and failure state) constrain the LLM's output so it handles negative paths, respects invariants, and produces testable code. This guide covers the full playbook: when to use Gherkin versus bullet lists, the 12 categories of edge cases LLMs skip by default, the exact prompt that generates production-grade criteria from a user story, and common failure modes in AI-generated acceptance criteria. For the category context, see our pillar guide on AI product planning.

Why acceptance criteria matter more with AI than without

When a human engineer implements a user story, they fill the gaps using judgment. Take "As a user, I want to reset my password." A human knows they need token expiry, rate limiting, and email validation even if the story never says so, because they have years of shipping production software in their head.

An LLM does not have that. An LLM interprets prompts literally. Without explicit acceptance criteria, it will:

Skip password complexity requirements unless asked
Forget email confirmation unless asked
Leave error states unhandled unless asked
Ignore rate limits unless asked
Miss the forgot-password-for-disabled-accounts case every time

This is not a failure of the LLM. It is the LLM doing exactly what was asked. The prompt defines the scope; criteria are the part of the prompt that defines scope completeness.

Rule of thumb: the gap between a good engineer's output and an LLM's output on the same story is roughly equal to the missing acceptance criteria.

Gherkin vs bullet lists: when to use each

Two formats dominate real acceptance criteria:

Gherkin (Given/When/Then)

Given the user is logged in with a verified email
When they submit a feature request with valid fields
Then the system saves the request with status "pending_review"
And the user sees a confirmation toast with the request ID
And the user receives an email confirmation within 60 seconds

Use Gherkin when:

The criterion describes user-facing behaviour or a system interaction
QA will automate the test
The feature has sequential steps or preconditions
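Part of the reason Gherkin suits automated QA is that each clause maps onto one phase of a test: Given is setup, When is the action, Then and And are assertions. The sketch below shows that mapping for the scenario above. `FeatureRequestService` and its fields are hypothetical stand-ins invented for illustration, not part of any real product API, and the 60-second delivery window is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureRequestService:
    """Hypothetical in-memory stand-in for the system under test."""

    requests: dict[int, dict] = field(default_factory=dict)
    toasts: list[str] = field(default_factory=list)
    outbox: list[str] = field(default_factory=list)

    def submit(self, user: dict, title: str) -> int:
        if not (user["logged_in"] and user["email_verified"]):
            raise PermissionError("verified login required")
        request_id = len(self.requests) + 1
        self.requests[request_id] = {"title": title, "status": "pending_review"}
        self.toasts.append(f"Request {request_id} received")
        self.outbox.append(user["email"])  # confirmation email queued
        return request_id


def test_submit_feature_request() -> None:
    # Given the user is logged in with a verified email
    service = FeatureRequestService()
    user = {"email": "pm@example.com", "logged_in": True, "email_verified": True}

    # When they submit a feature request with valid fields
    request_id = service.submit(user, title="Dark mode")

    # Then the system saves the request with status "pending_review"
    assert service.requests[request_id]["status"] == "pending_review"
    # And the user sees a confirmation toast with the request ID
    assert str(request_id) in service.toasts[-1]
    # And a confirmation email is queued for the user
    assert service.outbox == ["pm@example.com"]
```

Each comment is a line of the scenario, so a failing assertion points straight back at the criterion it belongs to.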

Bullet checklist

- Password must include at least one uppercase letter
- Password must include at least one number or symbol
- Password must be 12+ characters
- Password must not match the last 5 stored hashes
- Validation errors show inline below the password field

Use bullet lists when:

The criterion describes invariants or rules (not a sequence)
Multiple small constraints apply independently
Each item is effectively a separate truth that can pass or fail alone

Most stories need both. The Gherkin scenarios describe the flow; the bullet list describes the constraints that must hold throughout the flow.
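Because each checklist item can pass or fail alone, the password checklist above translates naturally into a set of independent checks. The following is a minimal sketch under stated assumptions: the function names are illustrative, the password history is assumed to be ordered oldest to newest, and SHA-256 is used only as a placeholder fingerprint (a real system would compare against salted, slow hashes).

```python
import hashlib
import re

MIN_LENGTH = 12
HISTORY_DEPTH = 5


def fingerprint(password: str) -> str:
    # Placeholder for the sketch only; production systems should use a
    # salted, slow password hash rather than a bare SHA-256 digest.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def check_password(password: str, previous_hashes: list[str]) -> dict[str, bool]:
    """Evaluate every checklist rule independently and report pass/fail per rule."""
    recent = previous_hashes[-HISTORY_DEPTH:]  # assumes oldest-to-newest order
    return {
        "has_uppercase": re.search(r"[A-Z]", password) is not None,
        "has_number_or_symbol": re.search(r"\d|[^\w\s]|_", password) is not None,
        "min_length": len(password) >= MIN_LENGTH,
        "not_recently_used": fingerprint(password) not in recent,
    }


def failing_rules(password: str, previous_hashes: list[str]) -> list[str]:
    """Names of the rules that failed, suitable for inline errors below the field."""
    results = check_password(password, previous_hashes)
    return [name for name, passed in results.items() if not passed]
```

Returning one result per rule, instead of a single true/false, is what lets the UI show an inline error for each failing requirement.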

The three categories of criteria every story needs

Tag every acceptance criterion explicitly:

Tag	Purpose	Example
Happy path	The expected successful flow	User with valid data completes action → system responds with success
Edge case	Boundary conditions and unusual but valid inputs	Empty state, max limits, concurrent actions, stale session
Failure state	Error handling, invalid input, system failures	Unauthenticated access, invalid input, server error, rate limit

Every story must have at least one of each. This is the single highest-leverage rule in acceptance-criteria writing, and it is the rule LLMs skip by default.

The 12 edge-case categories LLMs forget

LLMs generating acceptance criteria without constraint skip these categories almost every time. Paste them as explicit required coverage when you prompt:

Empty state — what happens when the list/table/result set is empty?
Single-item state — the list has exactly one item; do labels pluralize correctly?
Maximum limits — 10,000 items, 4MB file, 32-character field; what rejects / what degrades?
Minimum limits — 0 characters, 1-byte file, $0 price; are these allowed? What happens if so?
Concurrent action — two users edit the same record simultaneously; who wins?
Stale session — token expired mid-action; what does the user see?
Network failure mid-transaction — request sent, no response received; can user retry safely?
Partial success — 3 of 5 items saved, 2 failed; what does the user see?
Race condition — user double-clicks submit; does backend idempotently dedupe?
Cascading delete — user deletes a parent entity; what happens to children?
Permission change mid-session — user demoted from admin to member while on /admin page; what happens?
Very long input — name field receives 10,000 characters; does validation kick in before the DB rejects?

A story covering all 12 has 30–40 acceptance criteria. That sounds like a lot. It is the difference between a story that ships and a story that generates a production incident.
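To make one of these categories concrete, the race-condition item (a double-clicked submit) is commonly handled with an idempotency key: the client sends the same key with both clicks and the backend returns the first result instead of creating a second record. The sketch below is one possible shape, not a prescribed implementation. `SubmissionService` is a hypothetical in-memory class; a real backend would keep the keys in a shared store with an atomic insert, which this single-process example does not model.

```python
import uuid


class SubmissionService:
    """Hypothetical in-memory service used only to illustrate idempotent dedupe."""

    def __init__(self) -> None:
        self.records: dict[str, dict] = {}
        self._by_key: dict[str, str] = {}  # idempotency key -> record ID

    def submit(self, idempotency_key: str, payload: dict) -> str:
        # A repeated key means the same logical submit arrived twice
        # (for example a double-click), so return the original record.
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            return existing
        record_id = str(uuid.uuid4())
        self.records[record_id] = payload
        self._by_key[idempotency_key] = record_id
        return record_id
```

A matching criterion might read: Given the user double-clicks submit, When both requests carry the same idempotency key, Then exactly one record exists and both responses return its ID.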

The prompt that generates production-grade criteria

Copy this directly. Feed in one user story at a time. Keep the rules section verbatim — removing any rule degrades output quality measurably.

Generate acceptance criteria for this user story:

<PASTE USER STORY>

Additional context:
- Product: <1-line product description>
- User roles that exist in the system: <list>
- Relevant system constraints (rate limits, compliance requirements): <list>

Rules (do not skip any):
1. Output at least 3 criteria for happy path, 4 for edge cases, 4 for failure states.
2. Use Gherkin format: "Given <preconditions> When <action> Then <result>".
3. Tag each criterion: happyPath | edgeCase | failureState.
4. Edge cases MUST cover: empty state, max limits, concurrent actions, stale session.
5. Failure states MUST cover: unauthenticated access, invalid input, server error, rate limit (if applicable).
6. Use specific values, not placeholders. "10,000 items" not "a large number".
7. Every criterion must be independently verifiable. "It should be fast" is banned.
8. If a criterion references a time limit, use a specific number (e.g. "within 60 seconds").
9. If a criterion references a field, use its exact name (not "the field").
10. Output as JSON array:
[{ "category": "happyPath|edgeCase|failureState", "criterion": "Given... When... Then..." }]

With this scaffold, GPT-5 or Claude Opus produces usable criteria in one shot. Fast models (Gemini Flash, GPT-4o mini) need 2–3 iterations; they skip edge cases even with rules.
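Because the prompt asks for a JSON array with fixed tags and minimum counts, the response can be checked mechanically before anyone reads it. The following is a sketch of such a check covering rules 1, 2, 3, and 10 only; the function name and the error wording are illustrative assumptions, and the regular expression is a loose shape test, not a Gherkin parser.

```python
import json
import re

# Minimum counts from rule 1 of the prompt.
MINIMUMS = {"happyPath": 3, "edgeCase": 4, "failureState": 4}
# Loose shape check: Given ... When ... Then ... in that order.
GHERKIN = re.compile(r"^\s*Given\b.+\bWhen\b.+\bThen\b.+", re.DOTALL)


def validate_output(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the structure looks right."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(items, list):
        return ["top-level value is not a JSON array"]

    problems: list[str] = []
    counts = {tag: 0 for tag in MINIMUMS}
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {index}: not an object")
            continue
        category = item.get("category")
        criterion = item.get("criterion")
        if not isinstance(category, str) or category not in MINIMUMS:
            problems.append(f"item {index}: unknown category {category!r}")
        else:
            counts[category] += 1
        if not isinstance(criterion, str) or not GHERKIN.match(criterion):
            problems.append(f"item {index}: criterion is not Given/When/Then")
    for tag, minimum in MINIMUMS.items():
        if counts[tag] < minimum:
            problems.append(f"{tag}: {counts[tag]} criteria, need at least {minimum}")
    return problems
```

A non-empty result is a reasonable trigger for one of the extra iterations that faster models tend to need.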

Worked example: "As a user, I want to reset my password"

Before rules: typical LLM output is 3 criteria covering the happy path.

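After rules: the output below is representative of what a reasoning model produces when prompted with the rules above, shown as plain Gherkin grouped by tag rather than as the JSON array the prompt requests.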

# Happy path

Given the user is on the login page
When they click "Forgot password" and enter a registered email
Then the system sends a password reset email within 60 seconds
And the user sees a confirmation screen saying "Check your email"

Given the user clicks the reset link in their email within 15 minutes
When they enter a new password meeting complexity requirements
Then the password is updated and they are redirected to the login page
And they receive a confirmation email that their password was changed

Given the user has just reset their password
When they log in with the new password
Then login succeeds and they are redirected to the dashboard

# Edge cases

Given the user enters an email that is not registered
When they submit the forgot-password form
Then the system displays the same confirmation message as for valid emails
And no email is sent (prevents email enumeration)

Given the user has already requested a reset within the last 2 minutes
When they submit another forgot-password request
Then the system rate-limits them with a message "Please wait 2 minutes before requesting another reset"

Given the reset link is older than 15 minutes
When the user clicks it
Then the system displays "This link has expired" and prompts for a fresh request

Given the user has an account but it has been disabled by admin
When they request a password reset
Then the same generic confirmation displays but no email is sent
And an alert is logged for the security team

# Failure states

Given the user is not on a verified HTTPS connection
When they submit the forgot-password form
Then the system rejects the request with a security error

Given the email service is down
When the system tries to send the reset email
Then the operation is queued for retry with exponential backoff up to 24 hours
And the user receives a notification if the email fails after final retry

Given the user enters a new password that fails complexity validation
When they submit
Then inline errors display beside each failing requirement
And the old password remains active

Given the user's session token is tampered with during the reset flow
When they attempt to set a new password
Then the system rejects the request and logs a security event

That is 11 scenarios with 17 Then/And outcomes, all independently verifiable, covering all three tags. This is what QA can automate against and what a Cursor/Bolt session can implement in one pass without re-prompting.
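To show how directly criteria like these translate into code, here is a minimal sketch of the request and redeem steps covering four of the scenarios above: the generic confirmation for unregistered emails, the 2-minute rate limit, the 15-minute link expiry, and the disabled-account case. Everything in it is an assumption made for illustration: `ResetService`, its in-memory state, and passing `now` explicitly so the time-based rules are easy to test. Email delivery, HTTPS checks, and password validation are left out.

```python
import secrets
from datetime import datetime, timedelta

TOKEN_TTL = timedelta(minutes=15)
REQUEST_COOLDOWN = timedelta(minutes=2)
GENERIC_MESSAGE = "Check your email"
WAIT_MESSAGE = "Please wait 2 minutes before requesting another reset"
EXPIRED_MESSAGE = "This link has expired"


class ResetService:
    """Hypothetical in-memory sketch of the forgot-password flow."""

    def __init__(self, accounts: dict[str, bool]) -> None:
        self.accounts = accounts  # email -> True if the account is enabled
        self.tokens: dict[str, tuple[str, datetime]] = {}
        self.last_request: dict[str, datetime] = {}
        self.sent: list[str] = []  # emails a reset message was sent to
        self.security_log: list[str] = []

    def request_reset(self, email: str, now: datetime) -> str:
        # The cooldown applies to every address, registered or not, so the
        # response never reveals whether an account exists.
        previous = self.last_request.get(email)
        if previous is not None and now - previous < REQUEST_COOLDOWN:
            return WAIT_MESSAGE
        self.last_request[email] = now
        if email not in self.accounts:
            return GENERIC_MESSAGE  # same message, no email sent
        if not self.accounts[email]:
            self.security_log.append(f"reset requested for disabled account {email}")
            return GENERIC_MESSAGE
        token = secrets.token_urlsafe(32)
        self.tokens[token] = (email, now)
        self.sent.append(email)
        return GENERIC_MESSAGE

    def redeem(self, token: str, now: datetime) -> str:
        # Tokens are single-use: popping removes them whether or not they are valid.
        entry = self.tokens.pop(token, None)
        if entry is None or now - entry[1] > TOKEN_TTL:
            return EXPIRED_MESSAGE
        return "ok"
```

Each branch corresponds to one scenario, which is the practical payoff of specific criteria: the numbers and messages in the code come straight from the Then clauses.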

Acceptance criteria for non-technical PMs

BDD and Gherkin look intimidating to PMs who haven't worked with QA. The good news: you don't need to know the framework. You need to know the pattern.

Given (a starting situation), When (a user or system action), Then (an observable outcome).

That's it. If a PM can describe the situation, the action, and the result in plain sentences, they can write Gherkin. No tools, no syntax, no training.

The biggest adjustment for PMs is the mandatory failure-state requirement. PMs are trained to describe "what success looks like". The tagging rule in this guide asks you to describe what failure looks like too, and for every story, not just the critical ones. Start by asking "what happens when the thing goes wrong?" for every story you review. After a month, it becomes reflex.

Common AI-generated acceptance-criteria failures

Watch for and reject:

Vagueness. "System handles errors gracefully" is not a criterion. Demand specific behaviour.
Success-only output. If the model returns only happy paths, the rules prompt above was not followed. Regenerate.
Placeholder values. "Within X seconds" instead of "within 60 seconds". Always demand real numbers.
Criteria that reference non-existent fields. If the criterion says "the profile image field" but the story doesn't mention profile images, the model hallucinated scope. Delete.
Overlapping criteria. Two criteria saying the same thing in different words. Merge or drop.
Criteria disguised as tests. "The login test passes" is not an acceptance criterion. Describe the behaviour, not the test.
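Several of these failures are mechanical enough to catch with a simple lint pass before human review. The sketch below flags vague wording, placeholder values, test-speak, and near-verbatim duplicates. The word lists and patterns are illustrative assumptions, not an exhaustive rule set, and the duplicate check only catches criteria that differ in case or punctuation; paraphrased overlaps and hallucinated fields still need a human reader.

```python
import re

VAGUE = re.compile(r"\b(gracefully|fast|quickly|user-friendly|appropriately)\b", re.IGNORECASE)
# A lone capital X, Y, Z, or N after a quantity word, e.g. "within X seconds".
PLACEHOLDER = re.compile(r"\b(?i:within|after|up to|at least|at most)\s+[XYZN]\b")
TEST_TALK = re.compile(r"\btest (passes|succeeds)\b", re.IGNORECASE)


def normalise(text: str) -> str:
    """Lowercase and strip punctuation so trivial rewordings compare equal."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def lint_criteria(criteria: list[str]) -> list[tuple[int, str]]:
    """Return (index, finding) pairs for criteria that should be rejected or merged."""
    findings: list[tuple[int, str]] = []
    seen: dict[str, int] = {}
    for index, text in enumerate(criteria):
        if VAGUE.search(text):
            findings.append((index, "vague wording"))
        if PLACEHOLDER.search(text):
            findings.append((index, "placeholder value"))
        if TEST_TALK.search(text):
            findings.append((index, "describes a test, not behaviour"))
        key = normalise(text)
        if key in seen:
            findings.append((index, f"duplicate of criterion {seen[key]}"))
        else:
            seen[key] = index
    return findings
```

An empty result does not mean the criteria are good, only that they avoid the failures a pattern match can see.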

How this fits the rest of the spec

Acceptance criteria are Stage 5 of the 7-stage pipeline we describe in How to Generate an App Spec from a Prompt. Each stage's output feeds the next:

Personas (Stage 2) define user roles that appear in Given clauses
User stories (Stage 4) define the action in When clauses
Acceptance criteria (Stage 5 — this post) define the complete Then matrix
Schema (Stage 6) receives new constraint fields that criteria imply (e.g. rate-limit tables)
Pages (Stage 7) must render the error states that failure-state criteria require

In VibeMap, editing a single acceptance criterion flags the page that needs the error state, the schema that needs the tracking field, and the test case that needs updating. In a freeform ChatGPT workflow, those propagations are manual.

Fast path — try it on one story

The free VibeMap User Story Generator outputs INVEST-format stories and is the input to the acceptance-criteria stage. Run a feature through it, then use the prompt template above to generate Gherkin criteria for the resulting stories.

For the end-to-end pipeline with automatic persistence, linked artefacts, and direct Linear/Cursor export:

Generate user stories AND their acceptance criteria, linked, in one flow.


Sources & further reading

Dan North, Introducing BDD — origin of Behaviour-Driven Development and the Given/When/Then pattern.
Cucumber, Gherkin Reference — canonical Gherkin syntax.
Atlassian, Acceptance Criteria for User Stories: 5 Examples — industry examples across BDD and checklist formats.
Ken Schwaber and Jeff Sutherland, The Scrum Guide. Defines the "Definition of Done"; acceptance criteria are a complementary practice that the guide itself does not define.

AI Acceptance Criteria: The Gherkin Playbook for Stopping AI Output Chaos
