Harness engineering for AI-assisted teams

AI produces code. A harness makes it

A model alone does not know your architecture, conventions, or definition of done. Our open-source harnesses surround it with stack-specific context, disciplined workflows, independent review, tests, and hard gates, so teams can move faster without letting the codebase drift.

Get the Engine Diagnostic · Explore the editions

guide before generation · verify after generation · gate before release

.ruler/instructions.md

# priority beats prompt pressure
P0  safety + approvals
P1  engineer, not autocomplete
P2  repo conventions
P3  spec + test-first changes
P4  independent verification

# review fleet / current diff
architect ........ APPROVE
code ............. APPROVE
qa ............... BLOCK
  missing authz failure test

→ status: not done

What harness engineering means

Governance is an outcome. The harness is the engineering system.

A coding-agent harness makes the team's implicit standards explicit: architecture boundaries, coding conventions, test strategy, security constraints, and the definition of done. It guides work before code exists, senses problems after generation, and improves as recurring failures become new controls.

This follows the guide, sensor, and steering-loop framing described in Birgitta Böckeler's harness engineering article on MartinFowler.com.

01 / GUIDES

Make the desired path explicit.

Give every agent the same repository context, architecture rules, workflows, and acceptance standards before it starts generating code.

02 / SENSORS

Detect drift while it is cheap.

Use tests, linters, type checks, structural analysis, and independent AI review to find violations before human review becomes the first feedback loop.

03 / STEERING

Turn repeated failures into controls.

When the same issue appears twice, improve the guide, sensor, or gate. The harness becomes organizational memory instead of another static rules file.
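The "appears twice" rule can be pictured as a small tally over findings. This is a minimal sketch, not the harness's own lesson-capture mechanism: the `Finding` shape, the `recurringRules` name, and the example rule id are all hypothetical.

```typescript
// Hypothetical record of one finding from a test, linter, or reviewer.
interface Finding {
  rule: string; // e.g. "missing-authz-failure-test"
  changeId: string; // the diff or pull request where it surfaced
}

// Rules seen in at least `threshold` distinct changes are candidates
// for a new guide, sensor, or gate.
function recurringRules(findings: Finding[], threshold = 2): string[] {
  const changesByRule = new Map<string, Set<string>>();
  for (const { rule, changeId } of findings) {
    const changes = changesByRule.get(rule) ?? new Set<string>();
    changes.add(changeId);
    changesByRule.set(rule, changes);
  }
  return Array.from(changesByRule.entries())
    .filter(([, changes]) => changes.size >= threshold)
    .map(([rule]) => rule)
    .sort();
}
```

Counting distinct changes rather than raw occurrences is a design choice in this sketch: the same finding reported three times inside one diff is still one failure, not a pattern.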

The harness family

One architecture. Multiple editions.

The safety model, workflow discipline, review pattern, distribution CLI, and measurement loop stay consistent. The stack knowledge changes.

Shared core

The harness engineering system travels across editions.

priority-ordered profile · approval gates · spec and TDD workflows · independent reviewers · deterministic controls · multi-agent-tool fan-out · versioned updates · eval baselines

Reference edition

@tierone/llm-harness-fullstack

Fullstack

For monorepos combining a NestJS API, React web app, and shared contracts. This is the most documented edition and the source of the evidence shown below.

44 skills · 7 review agents

View fullstack source

Active development

@tierone/llm-harness-react

React

For frontend repositories that need React architecture, state, routing, forms, accessibility, performance, Vitest, and Playwright guidance without backend context.

38 skills · 7 review agents

View React source

Active development

@tierone/llm-harness-nest

NestJS

For backend repositories that need clean architecture, authorization, persistence, transactions, Node.js operations, and API verification without frontend guidance.

26 skills · 7 review agents

View NestJS source

More planned

The architecture is extensible by design.

New editions can add a stack's conventions, skills, review rubrics, and eval cases while keeping the same control model.

Choose by repository shape, not by agent vendor. Every edition can target the same supported agent tools through ruler.

The system design

Guides before. Sensors after. Gates underneath.

Every edition combines feedforward guidance, computational and inferential feedback, and deterministic release controls. The exact skills and review rubrics change with the stack; the architecture does not.

01 / GUIDES

Steer the work before code exists.

A compact, priority-ordered operating profile routes the agent to deeper guidance only when the work needs it.

Stack-specific skills loaded only when the work needs them
Repo conventions turn tribal knowledge into shared instructions
Spec and TDD workflows move decisions earlier
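To illustrate loading skills only when the work needs them, here is a deliberately naive sketch. A keyword matcher stands in for whatever routing the agent actually performs, and the skill names, trigger words, and `route` function are hypothetical, not taken from any edition.

```typescript
// Hypothetical skill entry: a name plus words that suggest it is relevant.
interface Skill {
  name: string;
  triggers: string[];
}

// Illustrative catalog only; real editions define their own skills.
const catalog: Skill[] = [
  { name: "react-forms", triggers: ["form", "validation"] },
  { name: "nest-persistence", triggers: ["repository", "transaction", "migration"] },
  { name: "authorization", triggers: ["authz", "permission", "role"] },
];

// Return only the skills whose triggers appear in the prompt,
// so unrelated guidance never enters the context window.
function route(prompt: string, skills: Skill[] = catalog): string[] {
  const text = prompt.toLowerCase();
  return skills
    .filter((skill) => skill.triggers.some((trigger) => text.includes(trigger)))
    .map((skill) => skill.name);
}
```

Under these assumptions, `route("Add a migration for the orders table")` loads only `nest-persistence`, and a prompt that matches nothing loads no skills at all.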

02 / SENSORS

Review in fresh context.

Seven one-shot agents inspect the artifact, not the implementer's confidence. Each owns one concern and returns a binding verdict.

Architecture and specification readiness
Design principles, coverage, edge cases, and security
Live acceptance verification and lesson capture
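A binding verdict means one BLOCK is enough to stop the change. The sketch below mirrors the review-fleet readout near the top of this page; the type names and the treatment of an empty review list are assumptions made for illustration.

```typescript
type Verdict = "APPROVE" | "BLOCK";

interface Review {
  reviewer: string; // e.g. "architect", "code", "qa"
  verdict: Verdict;
  reason?: string;
}

// A change is done only when every reviewer approves.
// Assumption: no reviews at all counts as "not done".
function changeStatus(reviews: Review[]): "done" | "not done" {
  if (reviews.length === 0) return "not done";
  return reviews.every((review) => review.verdict === "APPROVE") ? "done" : "not done";
}

// The example diff from the hero panel: two approvals, one block.
const currentDiff: Review[] = [
  { reviewer: "architect", verdict: "APPROVE" },
  { reviewer: "code", verdict: "APPROVE" },
  { reviewer: "qa", verdict: "BLOCK", reason: "missing authz failure test" },
];

const status = changeStatus(currentDiff); // "not done"
```

There is no majority vote here on purpose: each reviewer owns one concern, so an approval from the architect says nothing about test coverage.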

03 / GATES

Enforce what advice cannot guarantee.

CI, pre-commit hooks, and agent permission rules hold the line when a model misses an instruction.

Typecheck, lint, unit, integration, and end-to-end checks
Denied pushes to main
Human approval for deploys, publishes, and database writes
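The gates listed above are deterministic: the same action always gets the same answer, whatever the model remembers. In practice that logic lives in CI, hooks, and each agent tool's permission settings. The function below is only a sketch of the decision table, with hypothetical `Action` and `Decision` types.

```typescript
type Action =
  | { kind: "push"; branch: string }
  | { kind: "deploy" }
  | { kind: "publish" }
  | { kind: "database-write" }
  | { kind: "read" };

type Decision = "allow" | "deny" | "needs-human-approval";

// Pure function of the action: no prompt, no model, no context window.
function gate(action: Action): Decision {
  switch (action.kind) {
    case "push":
      // Pushes to main are denied outright; other branches pass.
      return action.branch === "main" ? "deny" : "allow";
    case "deploy":
    case "publish":
    case "database-write":
      return "needs-human-approval";
    case "read":
      return "allow";
  }
}
```

Because `Action` is a closed union, adding a new kind of action forces a compile error until the gate decides what to do with it.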

prompt → route → spec → test → implement → independent review → executed verification

Engineering outcomes

More predictable delivery without pretending the model is deterministic.

The model remains probabilistic. The harness narrows the acceptable solution space, catches drift earlier, and makes quality visible before code ships.

More predictable delivery

Shared context, workflows, and acceptance criteria reduce output variance. Human review starts with a smaller, more consistent, and more reviewable diff.

Architecture that resists drift

Clean Architecture boundaries, dependency rules, cohesion, and separation of concerns are taught before implementation and checked after it.

Readable, maintainable code

Explicit naming, small changes, tests, DRY, SOLID, and KISS become review criteria. Reviewers reject duplication, accidental complexity, and brute-force fixes.

Confidence through evidence

Type checks, tests, linters, structural assertions, fresh-context review, and eval baselines reveal whether the system still meets its engineering bar.

The same source can target Claude Code, Copilot, Codex, Cursor, and Windsurf, so changing agent tools does not mean rebuilding the engineering system.

Measured, not believed

The reference edition tests its own claims.

These committed results come from @tierone/llm-harness-fullstack. Each focused edition uses the same eval machinery and publishes its own baselines as it matures.

Read the eval methodology

Routing recall 0.98

Sonnet-class baseline for loading the right skills from real and paraphrased prompts.

Adherence 100%

Twenty-one cases, repeated three times, under the full operating profile.

Mutation kill rate 6 / 6

Seeded regressions caught, proving the eval suite notices deleted or weakened gates.

Context finding 0.33

Adherence at roughly 90k filler tokens. This is the reason deterministic gates are part of the design.

The weak number matters. Instruction-following degrades under context pressure. The harness does not hide that failure mode; it designs around it with CI, permission denies, and approval gates that do not depend on model memory.
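For readers who want to see what numbers like these measure, here is one plausible way to compute them. The exact definitions belong to the eval methodology, not to this page, so treat every function below as an assumption: recall pooled across cases, adherence as the pass fraction over all repeated runs, and kill rate as caught regressions over seeded ones.

```typescript
interface RoutingCase {
  expected: string[]; // skills the prompt should load
  loaded: string[]; // skills that were actually loaded
}

// Recall: of all expected skills across cases, the fraction that was loaded.
function routingRecall(cases: RoutingCase[]): number {
  let expected = 0;
  let hit = 0;
  for (const c of cases) {
    expected += c.expected.length;
    hit += c.expected.filter((skill) => c.loaded.includes(skill)).length;
  }
  return expected === 0 ? 1 : hit / expected; // nothing expected, nothing missed
}

// Adherence: each inner array holds the pass/fail results of one case's repeats.
function adherenceRate(runsPerCase: boolean[][]): number {
  let total = 0;
  let passed = 0;
  for (const runs of runsPerCase) {
    for (const ok of runs) {
      total += 1;
      if (ok) passed += 1;
    }
  }
  return total === 0 ? 0 : passed / total; // no runs is treated as no evidence
}

// Kill rate: seeded regressions that the eval suite flagged.
function mutationKillRate(killed: number, seeded: number): number {
  if (seeded <= 0) throw new Error("at least one seeded regression is required");
  return killed / seeded;
}
```

With twenty-one cases repeated three times, `adherenceRate` averages over sixty-three runs, and `mutationKillRate(6, 6)` returns 1.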

Adopt without faith

One repo. Thirty days. Measured.

Do not roll this across the organization because the architecture sounds right. Pilot it where the result can be observed.

Pick an active repository, choose the edition that matches its shape, capture the baseline, instrument the pilot, and make a scale-or-stop decision from the delta.

Read the full adoption playbook

Pick the pilot

One active repo, three to five engineers, and a baseline window captured before the harness changes anything.

Install and customize

Fill in repo conventions, copy the deterministic gates, and generate each agent tool's native config.

Run the 30-day pilot

Measure cycle time, review rounds, caught findings, escaped defects, gate events, and engineer sentiment.

Scale, narrow, or stop

The decision is based on observed tradeoffs. A failed pilot costs about two developer-days, not an organization-wide migration.
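The scale-or-stop decision rests on the delta between the baseline window and the pilot. A minimal sketch of that comparison follows; the three metric names are a hypothetical subset of what the pilot measures, and the output is a relative change, not a verdict.

```typescript
// Hypothetical subset of the pilot metrics, reduced to plain numbers.
type Metric = "cycleTimeHours" | "reviewRounds" | "escapedDefects";
type Snapshot = Record<Metric, number>;

const METRICS: Metric[] = ["cycleTimeHours", "reviewRounds", "escapedDefects"];

// Relative change per metric: negative means the pilot value is lower.
// A zero baseline has no meaningful ratio, so it is reported as null.
function relativeChange(baseline: Snapshot, pilot: Snapshot): Record<Metric, number | null> {
  const result = {} as Record<Metric, number | null>;
  for (const metric of METRICS) {
    const before = baseline[metric];
    result[metric] = before === 0 ? null : (pilot[metric] - before) / before;
  }
  return result;
}
```

Engineer sentiment and gate events do not reduce to one ratio this cleanly, which is one reason the decision stays with people reading the tradeoffs rather than with a threshold.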

The obvious objections

Answered straight.

A rules file is one layer. The harness adds fresh-context review agents, deterministic enforcement, multi-tool generation, measured adherence, committed baselines, and a three-way-merge update path that preserves your local conventions.

The overhead is real and scales with risk. Small changes can take a declared fast path. Larger changes pay for specification, review, and live acceptance because design mistakes are cheaper before implementation than after a pull request reaches a human.

Baselines are keyed by model. Re-run the evals, inspect the behavioral diff, and decide whether the new model clears the same bar. The change is visible before the team switches, not discovered later through an incident.
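Checking a new model against the same bar could look roughly like this. The baseline figures reuse the routing recall and adherence numbers reported above. The model key, the tolerance, and the function name are placeholders, not the harness's actual baseline format.

```typescript
interface EvalScores {
  routingRecall: number;
  adherence: number;
}

// Hypothetical committed baselines, keyed by model identifier.
const baselines: Record<string, EvalScores> = {
  "current-model": { routingRecall: 0.98, adherence: 1.0 },
};

// List the metrics where the candidate falls below the baseline bar.
// An empty list means the candidate clears it within the tolerance.
function regressionsAgainst(
  baselineModel: string,
  candidate: EvalScores,
  tolerance = 0.01,
): string[] {
  const bar = baselines[baselineModel];
  if (bar === undefined) throw new Error(`no baseline for ${baselineModel}`);
  const failed: string[] = [];
  if (candidate.routingRecall < bar.routingRecall - tolerance) failed.push("routingRecall");
  if (candidate.adherence < bar.adherence - tolerance) failed.push("adherence");
  return failed;
}
```

A non-empty result is a prompt to inspect the behavioral diff, not an automatic rejection: a small drop within run-to-run noise may be acceptable, which is what the tolerance stands in for.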

No runtime dependency remains. The installer copies plain Markdown into your repository. You can inspect it, edit it, stop updating it, or remove it. Your governance does not disappear if a package or agent vendor does.

Yes. The editions are MIT-licensed, and their source, tests, and evolving eval evidence are public. TierOne helps when you need the repo conventions, control design, pilot instrumentation, and operating habits fitted to a real engineering organization.

Context is a budget. A React-only change should not load NestJS persistence guidance, and a backend repository should not carry routing or browser-accessibility instructions. The editions share the control architecture while keeping stack guidance focused.

Governance is one responsibility of the harness, not the whole practice. Harness engineering also shapes maintainability, architecture fitness, functional behavior, delivery workflow, feedback timing, and how the team improves the system when agents fail in repeatable ways.

Code is cheap. Operators are made.

The model produces code. Your engineering system decides whether it ships.

Engineering is what humans do. Code is what tools produce.

Get the Engine Diagnostic · Book a Sprint conversation