Property-Based Testing for Specs

The Missing Discipline in AI-Driven Delivery

AI Accelerates What Specs Fail to Say

AI-assisted development amplifies requirements entropy because most specifications encode examples, not understanding.

For decades, teams (including mine) have relied on narrative, example-based specifications to communicate user requirements. I personally lean heavily on BDD scenarios. I would write a few representative examples, convince myself they captured the behavior, and trust that shared mental models between the team and the users would fill in the rest. This approach already struggled with ambiguity, defaults, and edge cases before the AI era, but those weaknesses were partially masked by human ingenuity and slower feedback loops.

AI changes the dynamics completely. When large parts of implementation are generated or modified automatically, the cost of underspecification explodes. AI agents execute faithfully against what is written, not what was meant. When BDD scenarios describe behavior narratively rather than structurally, AI fills the gaps plausibly but inconsistently. Different contexts, prompts, or refactors yield different interpretations of the same “spec.”

The core issue is not that using BDD scenarios is wrong, nor that AI is unreliable. The issue is that specs, as commonly written, do not encode the decision structure of the system. They sample behavior without defining the space of possible behavior. They assert outcomes without making explicit which conditions matter, which defaults apply, and which combinations are forbidden or irrelevant.

As a result, teams cannot answer a fundamental question after a change request:

“Have we specified all the cases that must change?”

There is no structural notion of completeness, no way to distinguish missing BDD scenarios from acceptable omissions, and no mechanism to detect contradictions across scenarios that “sound right” in isolation. The specification becomes a collection of stories rather than a checkable knowledge artifact.

This problem existed long before AI. What AI does is remove the friction that once limited its impact.

When specs encode examples instead of understanding, AI turns a long-standing weakness into a systemic risk.

Speed Without Understanding Becomes Unbounded Risk

Once specifications lose authority, speed and entropy increase while control collapses.

When specs lack a notion of completeness, teams lose the ability to reason about change. After a requirement update, no one can say with confidence whether all affected behavior has been specified. Reviews devolve into judgment calls, and “looks reasonable” replaces evidence. The faster delivery becomes, the more dangerous this uncertainty gets.

The first consequence is hidden, silent rework. AI and humans alike produce implementations that are locally correct but globally inconsistent. Defaults drift, edge cases diverge, and refactorings introduce regressions that no test ever catches because no scenario ever asserted the invariant that was violated. The organization appears productive while comprehension and the once-shared mental models quietly erode.

The second consequence is unbounded change risk. Without a decision structure, there is no way to bound the blast radius of a change request. Every modification potentially affects unseen combinations of conditions. Over time, teams respond by becoming conservative: avoiding refactors, duplicating logic, or relying on institutional memory to navigate landmines the specs no longer map.

The third consequence is that at some point AI acceleration plateaus or reverses. Early gains from automation give way to increasing coordination cost, review friction, and downstream correction. AI executes faster than teams can reason, but reasoning is exactly what was never encoded in the first place. Velocity rises, but learning does not compound.

At this point, specifications stop functioning as a contract. They become narrative artifacts that require experienced interpreters to explain what they “really mean.” New team members ramp slower. Senior engineers become bottlenecks. The organization starts depending on heroics instead of artifacts.

Without bounded change and checkable understanding, AI turns speed into a liability rather than a competitive advantage.

The Core Move: Treat the Specification as the System Under Test

Apply Property-Based Testing to the specification itself, and treat the specification as the system under test.

Instead of asking whether individual example scenarios look right, we test properties that must hold for all rule-relevant decision situations described by a change request. This shifts the role of the specification from storytelling to knowledge encoding. Correctness is no longer inferred from plausibility, but evaluated against explicit invariants.

Practically, this means redefining specs as a decision system. Teams externalize the dimensions of variation that actually matter (such as roles, operations, states, inputs, and views) and make them explicit as condition axes. Axes capture only those dimensions whose variation can change the outcome; irrelevant detail is deliberately excluded. Business rules then constrain which combinations of those axes are valid.
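
As a minimal sketch of what this might look like, the fragment below encodes condition axes and business rules in Python. The domain (roles, operations, document states) and every name in it are illustrative assumptions for the example, not taken from any particular system:

    from itertools import product

    # Illustrative condition axes: only dimensions whose variation can change the outcome.
    AXES = {
        "role":      ["admin", "member", "guest"],
        "operation": ["read", "edit", "delete"],
        "state":     ["draft", "published", "archived"],
    }

    # Business rules constrain which axis combinations are valid decision situations.
    def is_valid(ctx):
        if ctx["role"] == "guest" and ctx["operation"] != "read":
            return False                      # guests may only read
        if ctx["state"] == "archived" and ctx["operation"] != "read":
            return False                      # archived documents are read-only
        return True

    # The decision space: every rule-valid combination of axis values.
    decision_space = [
        dict(zip(AXES, values))
        for values in product(*AXES.values())
        if is_valid(dict(zip(AXES, values)))
    ]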

Each BDD scenario becomes a concrete instantiation of a decision, rather than an isolated narrative. In decision-table terms, scenarios are rows, expressed in prose, often with some columns left implicit. The structure is already there; this approach simply makes it explicit and checkable.
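
To make that correspondence concrete, here is one hedged illustration of a prose scenario and the decision row it normalizes into, using the axes sketched above; the scenario text and the "outcome" column are invented for the example:

    # A BDD scenario as it might appear in a feature file:
    #
    #   Scenario: A member edits a published document
    #     Given a member is signed in
    #     And a document is published
    #     When the member edits the document
    #     Then the edit is accepted
    #
    # The same scenario as a decision row: one explicit value per axis, plus the
    # asserted outcome. Columns the prose left implicit must be made explicit here.
    scenarios = [
        {"role": "member", "operation": "edit", "state": "published", "outcome": "allowed"},
    ]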

From this structure, we generate coverage obligations: the set of decision combinations that must be exemplified for the specification to be complete. These obligations are not written by hand. They are generated using Property-Based Testing techniques such as pairwise combinations, boundary conditions, and rule-constrained sampling, excluding invalid or equivalent combinations. This replaces guesswork with systematic exploration.
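
Continuing the sketch, one way to generate such obligations mechanically is pairwise coverage restricted to rule-valid combinations; the helper below is an illustration rather than a reference implementation:

    from itertools import combinations

    # Coverage obligations: every pair of axis values that can occur in a rule-valid
    # decision context must be exemplified by at least one scenario.
    def pairwise_obligations(axes, decision_space):
        obligations = set()
        for a, b in combinations(axes, 2):
            for va in axes[a]:
                for vb in axes[b]:
                    # Demand the pair only if some rule-valid context realizes it;
                    # unreachable pairs are excluded rather than demanded.
                    if any(ctx[a] == va and ctx[b] == vb for ctx in decision_space):
                        obligations.add(((a, va), (b, vb)))
        return obligations

    obligations = pairwise_obligations(AXES, decision_space)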

Correctness is then defined by a small set of spec-level properties, not by individual rules:

  • every required decision obligation is exemplified by at least one scenario
  • no two scenarios contradict each other under semantically equivalent axis assignments
  • defaults are explicit and stable across contexts
  • new concepts do not leak outside their declared scope

These are meta-properties. They do not describe system behavior; they define what it means for the specification itself to be trustworthy.
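
Continuing the sketch, the first two properties can be checked mechanically over the normalized decision rows; the helper names and the simple equality-based contradiction check are assumptions of the example:

    def uncovered(obligations, scenarios):
        # Coverage property: every obligation is exemplified by at least one scenario row.
        return {
            ob for ob in obligations
            if not any(all(row[axis] == value for axis, value in ob) for row in scenarios)
        }

    def contradictions(scenarios, axes):
        # Non-contradiction property: identical axis assignments must assert identical outcomes.
        seen, conflicts = {}, []
        for row in scenarios:
            key = tuple(row[a] for a in axes)
            if key in seen and seen[key] != row["outcome"]:
                conflicts.append((key, seen[key], row["outcome"]))
            seen.setdefault(key, row["outcome"])
        return conflicts

    missing = uncovered(obligations, scenarios)     # each element is a missing obligation
    conflicts = contradictions(scenarios, AXES)     # each element is a contradiction witness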

In this approach, the system under test is the specification, not the runtime implementation. Large Language Models (LLMs) are used deliberately and narrowly within this loop. They enumerate axes, generate obligation candidates, normalize scenarios into decision rows, and surface gaps or contradictions. They are not asked to judge correctness heuristically, but to operate within an explicit, mechanically checkable structure.

The result is a spec that can be falsified. Missing scenarios appear as counterexamples. Contradictions surface mechanically. When something fails, shrinking produces the smallest decision context that explains why.
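
As one hedged illustration of falsification in practice, a property-based testing library such as Hypothesis can draw rule-valid decision contexts and assert that the specification exemplifies each of them; a failing case is a concrete missing scenario, and shrinking reports a (near-)minimal context that explains the gap. The test below is a sketch under the assumptions of the earlier fragments:

    from hypothesis import given, strategies as st

    # Property over the specification itself: every rule-valid decision context
    # is exemplified by at least one scenario row.
    @given(st.sampled_from(decision_space))
    def test_every_valid_decision_is_exemplified(ctx):
        assert any(
            all(row[axis] == ctx[axis] for axis in AXES) for row in scenarios
        ), f"No scenario exemplifies decision context: {ctx}"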

When correctness is defined structurally, AI stops guessing and starts enforcing understanding.

Compounding Learning or Compounding Entropy

You must decide whether AI will compound learning or compound entropy.

If you act now by treating specifications as testable systems, change becomes bounded and auditable. For every change request, teams can generate the set of decision obligations that must be covered and use AI to check whether the spec already covers them. “Did we update everything that should change?” stops being a debate and becomes a verifiable question with evidence.

As a result, AI becomes a learning multiplier rather than a speed hack. Defaults, scope boundaries, and invariants are preserved across refactors and automated edits because they are enforced at the spec level, before code exists. Over time, understanding compounds: fewer scenarios are missing, rework shrinks, and new team members rely on artifacts rather than oral history. The specification regains authority as a contract.

This path has a clear execution cost. Teams must make decision structure explicit and accept that some ambiguity will be surfaced rather than hidden. However, that cost is paid once per change request, while the benefit compounds across every AI-assisted iteration that follows.

If you do nothing, AI will continue to ship faster while specification debt grows invisibly. Missing BDD scenarios will not fail tests because they were never written. Contradictions will remain latent because each scenario sounds locally correct. Defaults will drift across contexts without being noticed until downstream behavior diverges.

Over time, teams adapt by relying on senior engineers, informal conventions, and repeated manual reviews to compensate for what the specs no longer guarantee. AI acceleration plateaus as coordination cost rises. Delivery becomes dependent on heroics, and change risk remains structurally unbounded.

The choice is not between speed and rigor. It is between bounded change with compounding learning and unbounded change with accelerating entropy.

AI forces this decision now because ambiguity that once slowed teams down now scales at machine speed.

Next Step

Decide now whether AI in your organization will scale learning or scale entropy by mandating that every material change request is validated against property-based specification invariants before AI-assisted delivery proceeds.

Dimitar Bakardzhiev
