March 2026

The Harness Is the Product.
The Model Is a Commodity.

A structured comparison of the higher-order abstractions that sit on top of AI coding agents. Not which model is best — which system around the model is best, and how they compose.

The same Claude 4 model produces radically different output depending on whether it's running bare, inside a role-specialist workflow, or as part of a governed multi-agent organization. The operator and the system around the model are the moat.

The AI coding ecosystem exploded in early 2026. Hundreds of skills, thousands of prompts, dozens of marketplaces. But nobody is systematically comparing the underlying abstractions — the frameworks, philosophies, and architectures that shape how agents plan, execute, review, remember, and coordinate. This Atlas is the map.

The Layer Model (L0–L4)

Every project operates at a specific layer. Comparing across layers is noise. Comparing within layers is signal. Higher layers compose on top of lower layers.

L0
Company / Multi-Agent OS
Org charts, budgets, governance, heartbeat coordination across dozens of agents
Paperclip
Paperclip (~28K★) is the standout. Node.js server + React dashboard + embedded PostgreSQL. Heartbeat protocol where agents check in on schedule, receive tasks with full company context. Per-agent budgets (warn at 80%, pause at 100%), immutable audit logs, board-level governance. If you have one agent, you don't need L0. If you have twenty — you need Paperclip.
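The budget mechanics described above (warn at 80%, pause at 100%) reduce to a simple threshold check. The sketch below is a hypothetical illustration of that policy — the `AgentBudget` class and its names are invented for this example, not Paperclip's actual API:

```python
from dataclasses import dataclass

@dataclass
class AgentBudget:
    """Hypothetical per-agent budget, illustrating Paperclip-style thresholds."""
    limit_usd: float
    spent_usd: float = 0.0

    def record(self, cost_usd: float) -> str:
        # Returns the governance action implied by the new spend level:
        # "ok" below 80%, "warn" from 80%, "pause" once the limit is hit.
        self.spent_usd += cost_usd
        ratio = self.spent_usd / self.limit_usd
        if ratio >= 1.0:
            return "pause"
        if ratio >= 0.8:
            return "warn"
        return "ok"

budget = AgentBudget(limit_usd=100.0)
print(budget.record(50.0))  # → ok    (50% spent)
print(budget.record(35.0))  # → warn  (85% spent)
print(budget.record(20.0))  # → pause (105% spent)
```

The point of the design is that the warning fires with budget still left, so a human (or the L0 governance layer) can intervene before the hard pause.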
L1
Orchestration / Worktree
Parallel agents in isolated workspaces, PR automation, CI-fix loops
Agent Orchestrator, Claude Squad, 1code
Agent Orchestrator (~5K★) plans tasks, spawns parallel agents in git worktrees, handles CI-fix and merge conflicts. Claude Squad (~8K★) is a terminal multiplexer for multiple runtimes. 1code (~3K★) is API-first: submit prompt → get PR. Subtask (~2K★) extends Claude Code's subagent model with worktree isolation. The pattern: L1 handles logistics, L2 handles craft.
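The worktree-isolation pattern these L1 tools share can be shown with plain `git worktree` commands driven from Python. This is a minimal sketch of the idea — the throwaway repo, task names, and `agent/<task>` branch convention are illustrative, not any one tool's layout:

```python
import subprocess
import tempfile
from pathlib import Path

def run(*args, cwd):
    # Helper: run a git command, fail loudly on error.
    subprocess.run(args, cwd=cwd, check=True, capture_output=True)

# Set up a throwaway repo standing in for the project under orchestration.
repo = Path(tempfile.mkdtemp()) / "project"
repo.mkdir()
run("git", "init", "-b", "main", cwd=repo)
(repo / "README.md").write_text("demo\n")
run("git", "add", ".", cwd=repo)
run("git", "-c", "user.email=demo@example.com", "-c", "user.name=demo",
    "commit", "-m", "init", cwd=repo)

# Give each agent its own worktree on its own branch, so parallel
# edits never collide in a shared working directory.
tasks = ["fix-ci", "add-auth"]
worktrees = []
for task in tasks:
    wt = repo.parent / f"agent-{task}"
    run("git", "worktree", "add", "-b", f"agent/{task}", str(wt), cwd=repo)
    worktrees.append(wt)

print([wt.name for wt in worktrees])  # → ['agent-fix-ci', 'agent-add-auth']
```

Each worktree is a full checkout sharing one object store, which is why the pattern is cheap enough to spawn per-task: agents commit on their own branches, and the orchestrator merges (or discards) the results.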
L2
Workflow / Spec
Opinionated dev lifecycle: plan → build → review → ship → learn
gstack, Compound, Superpowers, BMAD
The richest layer. gstack (~19K★) is role-specialist mode switching with first-party browser. Compound Engineering (~10K★) is the compounding loop that builds institutional knowledge. Superpowers (~32K★) enforces SDLC with the largest community. BMAD (~5K★) is full agile simulation. GSD, Spec Kit, OpenSpec drive spec-driven development. Most comparisons live here.
L3
Memory / Review / Governance
Persistent knowledge, structured review, CI-integrated checks
Beads, Continue
Beads (~5K★) provides persistent structured memory backed by Dolt (git-like versioned database) plus a distributed graph issue tracker. Continue (~25K★) adds source-controlled AI checks and PR review workflows via GitHub Actions. These supplement any L2 workflow — memory that travels with the codebase, not the session.
L4
Base Runtime
The host agent itself — the substrate everything runs on
Claude Code, Codex CLI, Gemini CLI
The control group. Claude Code, Codex CLI, Gemini CLI, Kiro, OpenHands, Cline, Roo Code, Aider. Runtime capabilities constrain what overlays can do — if Claude Code ships native browser support, gstack's advantage changes. We track L4 to understand the substrate, not to compare base runtimes.

Four Philosophical Schools

These frameworks embody different theories of what makes AI-assisted engineering succeed. The strongest setups blend all four.

🎯 Role Specialization

gstack — ~19K★

A generalist agent produces mediocre output. The fix: explicit cognitive gears. Tell the model what kind of brain to use right now. Planning ≠ review ≠ shipping.

Best for: Sharp solo execution
🔄 Compounding Loops

Compound Engineering — ~10K★

Most codebases get harder over time. Invert this. Each unit of work makes the next easier. The critical fourth step: extract, classify, document learnings.

Best for: Long-lived codebases
🏢 Multi-Agent Orchestration

Paperclip — ~28K★

If one agent is an employee, the missing layer is the company. Org charts, budgets, governance, heartbeat coordination. You manage goals, not tabs.

Best for: Scaling past 10 agents
📋 Spec-Driven Development

Spec Kit, OpenSpec, GSD

The agent's biggest bottleneck is context rot — drift between what the human wants and what the agent builds. Formal specs as the single source of truth.

Best for: Alignment over speed

L2 Framework Comparison

The primary comparison layer. These workflow frameworks compete for your daily engineering loop.

Dimension | gstack | Compound | Superpowers | BMAD | GSD
Stars | ~19K | ~10K | ~32K | ~5K | ~4K
Philosophy | Role-specialist mode switching | Compounding loop | Enforced SDLC + subagents | Full agile simulation | Spec-driven context engineering
Planning | CEO taste + eng rigor | Parallel research → structured plan | Plan-first with auto-review | 4-phase agile | Spec files drive all work
Review | Paranoid staff eng, fix-first | 14+ parallel specialist agents | Auto-review with TDD | Gate checks between phases | Spec validation
Browser/QA | ★★★★★ First-party Chromium | ★★☆☆☆ Bolt-on | None | None | None
Memory | ★★☆☆☆ Retro + docs | ★★★★★ Compound step | ★★★☆☆ Pattern docs | ★★★☆☆ Sprint artifacts | ★★☆☆☆ Spec accumulation
Shipping | ★★★★★ One-command | ★★★☆☆ Disciplined | ★★★☆☆ Subagent impl | ★★★☆☆ Story-based | ★★☆☆☆ Manual
Portability | ★☆☆☆☆ Claude Code only | ★★★★★ 10+ platforms | ★★☆☆☆ Claude primary | ★★★☆☆ Multi-tool | ★★★☆☆ Multi-tool

Jarvis 21-Layer Capability Model

A comprehensive capability taxonomy for evaluating agent systems. We audited a real deployment — Feral Bots on OpenClaw — against all 21 layers.

Overall score: 125/210

A powerful artisanal operator stack, not yet a cleanly governed platform.

Strong (8–10): 4 layers

  • Execution: 8/10
  • Memory: 8/10
  • Identity/Persona: 8/10
  • Knowledge/Research: 8/10

Decent (5–7): 12 layers

  • Dispatch: 7/10
  • Comms: 7/10
  • Rollback: 7/10
  • Planning: 7/10
  • Context Mgmt: 7/10
  • UI/Command: 7/10
  • Filing: 6/10
  • Coordination: 6/10
  • Observability: 6/10
  • Self-Modification: 6/10
  • Health/Heartbeat: 5/10
  • Rate Limiting: 5/10

Weak (2–4): 5 layers

  • Task System: 4/10
  • Evaluation: 4/10
  • Simulation/Sandbox: 4/10
  • Goal/Priority: 3/10
  • Auth/Secrets: 2/10
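The 125/210 headline is just the sum of the 21 per-layer scores against a 10-point ceiling each. A quick check, using the audit numbers above:

```python
# Per-layer scores from the Feral Bots audit.
scores = {
    "Execution": 8, "Memory": 8, "Identity/Persona": 8, "Knowledge/Research": 8,
    "Dispatch": 7, "Comms": 7, "Rollback": 7, "Planning": 7,
    "Context Mgmt": 7, "UI/Command": 7, "Filing": 6, "Coordination": 6,
    "Observability": 6, "Self-Modification": 6, "Health/Heartbeat": 5,
    "Rate Limiting": 5, "Task System": 4, "Evaluation": 4,
    "Simulation/Sandbox": 4, "Goal/Priority": 3, "Auth/Secrets": 2,
}
total, ceiling = sum(scores.values()), 10 * len(scores)
print(f"{total}/{ceiling}")  # → 125/210

# Band sizes match the Strong/Decent/Weak grouping.
strong = [k for k, v in scores.items() if v >= 8]
decent = [k for k, v in scores.items() if 5 <= v <= 7]
weak = [k for k, v in scores.items() if v <= 4]
print(len(strong), len(decent), len(weak))  # → 4 12 5
```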

Composability Guide

The best setup isn't one framework — it's the right combination. Layer, don't replace. Pick one primary loop. Compose at the edges.

gstack + Compound

Browser QA + Institutional Memory

gstack's visual QA catches what compound can't see. Compound's memory ensures QA patterns persist. Use gstack's /browse, /qa daily, then /ce:compound to capture root causes.

BMAD + gstack

Agile Planning + Sharp Execution

BMAD's phased planning (PRD → architecture → stories) for requirements, then gstack's /review, /qa, /ship for implementation quality and release automation.

Paperclip + gstack

Company OS + Role Specialists

Paperclip manages org chart, goals, budgets, heartbeats. Each agent runs gstack skills: Product Lead does /plan-ceo-review, QA agent does /qa, Release Eng does /ship.

Superpowers + Compound

Best Discovery + Deepest Research

Superpowers is better at figuring out what to build. Compound is better at researching how. The compounding step ensures neither system's insights are lost.

Compound + Beads

Learning Loop + Persistent Memory

Compound's solution docs + Beads' database-backed persistence. If you switch workflows, your institutional knowledge comes with you. Framework-independent memory.

gstack + Compound + Paperclip

The Full Stack

Role-specialist execution (L2) + institutional memory (L2/L3) + company governance (L0). Each layer operates independently. The composition is the product.

Choose Your Stack

Answer honestly — the right choice depends on your situation, not what sounds coolest.

How many agents are you running?
One agent (solo session): Skip L0 and L1. Focus on L2 workflow + maybe L3 memory. See questions below.
2–5 agents: Add L1 orchestration with Claude Squad (terminal multiplexer), Agent Orchestrator (full automation), or Subtask (lightweight worktrees).
10+ agents: You need Paperclip (L0). It's currently the only mainstream company OS. Each agent can run any L2 workflow internally.
"I build the wrong thing" — a requirements problem
Light process: GSD or OpenSpec — spec files as context anchors, minimal overhead
Medium process: Spec Kit — GitHub's formal spec toolkit
Heavy process: BMAD — full agile simulation with phased workflows
"The AI produces mediocre output" — a focus problem
gstack: Role-specialist mode switching. Tell the model what kind of brain to use right now. 12 distinct cognitive modes.
"I keep solving the same problems" — a memory problem
Compound Engineering: Explicit compounding loop. Six subagents extract, classify, document learnings after every cycle.
+ Beads: Add database-backed persistence if you want memory independent of your workflow framework.
"I can't see if the UI is broken" — a visual QA problem
gstack: First-party persistent Chromium daemon. 100ms latency, cookie import, health scores, AI-slop detection. Nothing else comes close.
"I don't want to think about it" — just tell me
Solo dev shipping a web product:
1. Install gstack for daily work (planning, review, browser QA, shipping)
2. Add Compound's /ce:compound after significant PRs to build knowledge
3. Ignore everything else until you hit a bottleneck
Team:
1. BMAD for process structure (PRDs, architecture, sprints)
2. gstack for execution quality (review, QA, shipping)
3. Claude Squad or Agent Orchestrator for parallelism
4. Graduate to Paperclip at 10+ agents

The 100:1 Challenge

What Jensen Huang's vision of 100 AI agents per human actually requires — and what the Atlas reveals about closing the gap.

75K human employees
7.5M AI agents
100:1 agents per human

This isn't science fiction. It's an engineering problem. The Atlas reveals exactly what's missing:

  1. L0 governance that scales. Paperclip's org-chart model is the closest working prototype. But current L0 solutions are fragile: over-governance creates bottlenecks, under-governance creates chaos.
  2. Self-selecting L2 workflows. Today, humans choose whether agents run gstack or Compound. At 100:1, agents need to pick their own workflow based on task type.
  3. Trustworthy task state. The Feral Bots audit exposed this: you can't coordinate 100 agents if your task system lies. Stale queues poison coordination at scale.
  4. Real evaluation loops. Not "did CI pass?" but "was this the right thing to build?" The eval gap is the most underinvested layer across the entire ecosystem.
  5. Zero-trust execution. At 1 agent, you can trust the operator. At 100, you need tiered permissions and sandboxed environments by default.

Key Findings

#1

The harness is the product

Raw model capability is table stakes. The frameworks, memory systems, and governance layers create differentiated value.

#2

Compare within layers

gstack vs Compound is a fair fight (both L2). gstack vs Paperclip is apples vs chainsaws (L2 vs L0). The layer model prevents bad comparisons.

#3

Mix frameworks across layers

No single system wins every dimension. The best setups compose role specialization + compounding + orchestration + specs. Composability is the meta-skill.

#4

Memory is underrated

Compound's institutional memory and Beads' persistent knowledge are force multipliers that most practitioners skip. Each PR should make the next easier.

#5

Browser QA is gstack's moat

First-party persistent Chromium daemon, 100ms latency, cookie import, health scores, AI-slop detection. Nothing else in L2 comes close.

#6

Portability belongs to Compound

10+ target platforms vs Claude Code–only for gstack. Cross-tool portability is increasingly strategic as the runtime landscape fragments.

#7

L0 is exploding

Paperclip hit ~28K stars in ~2 weeks. Massive demand for company-level agent orchestration. This layer will be the most competitive in 2026.

#8

Auth, eval, sandbox: weak everywhere

The Jarvis 21 reveals gaps nobody talks about. These layers are weak across the entire ecosystem — not just individual deployments.