A structured comparison of the higher-order abstractions that sit on top of AI coding agents. Not which model is best — which system around the model is best, and how they compose.
The same Claude 4 model produces radically different output depending on whether it's running bare, inside a role-specialist workflow, or as part of a governed multi-agent organization. The operator and the system around the model are the moat.
The AI coding ecosystem exploded in early 2026. Hundreds of skills, thousands of prompts, dozens of marketplaces. But nobody is systematically comparing the underlying abstractions — the frameworks, philosophies, and architectures that shape how agents plan, execute, review, remember, and coordinate. This Atlas is the map.
This project is opinionated, but it should still be legible. Here’s what is being compared, what is deliberately out of scope, and how to interpret the claims.
In scope: open-source L0–L3 frameworks with real code, real architectural opinions, and evidence of real-world use. Individual prompts/skills and closed-source tools are out of scope. L4 runtimes are tracked as the control group, not reviewed as the main subject.
How claims are checked: repo-first. The Atlas reads source code, README/changelog, issue threads, and documented usage. Ratings are editorial judgments from repo evidence — not benchmark scores, not sponsored placements.
Freshness: this snapshot was verified on March 17, 2026. Stars, versions, and capabilities drift quickly. If you catch drift, open a PR with the date you verified it. Read full methodology ↗
| Term | Definition |
|---|---|
| Runtime (L4) | The host tool the model runs inside — Claude Code, Codex CLI, Gemini CLI, Cursor, and so on. |
| Workflow framework (L2) | An opinionated process layer for planning, review, shipping, QA, and learning. |
| Orchestrator (L1) | A system that runs multiple agents in parallel in isolated workspaces, branches, or sandboxes. |
| Company OS (L0) | A governance layer for many agents: goals, budgets, approvals, org charts, and heartbeat coordination. |
| Compounding | Capturing the learnings from finished work so the next cycle gets easier instead of harder. |
| Worktree | A git feature that lets agents work on separate branches without touching your main checkout. |
| Star count (★) | A rough proxy for attention, not quality. Treat star counts as point-in-time context, not proof. |
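The worktree mechanic above is plain git, not framework magic. A minimal sketch of how an orchestrator might carve out one isolated checkout per agent (the repo path, branch names, and committer identity here are illustrative, not from any specific framework):

```shell
# Illustrative sketch: one scratch repo, one isolated worktree per agent.
git init -q scratch-repo
git -C scratch-repo -c user.name=bot -c user.email=bot@example.com \
    commit --allow-empty -q -m "init"

# Each agent gets its own checkout on its own branch; the main
# checkout in scratch-repo/ is never touched.
git -C scratch-repo worktree add -q ../agent-qa   -b agent/qa
git -C scratch-repo worktree add -q ../agent-impl -b agent/impl
git -C scratch-repo worktree list   # main checkout plus both agent worktrees
```

When an agent's branch is merged, `git worktree remove ../agent-qa` reclaims the directory without disturbing the main checkout.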
Every project operates at a specific layer. Comparing across layers is noise. Comparing within layers is signal. Higher layers compose on top of lower layers.
These frameworks embody different theories of what makes AI-assisted engineering succeed. The strongest setups blend all four.
A generalist agent produces mediocre output. The fix: explicit cognitive gears. Tell the model what kind of brain to use right now. Planning ≠ review ≠ shipping.
Most codebases get harder over time. Invert this. Each unit of work makes the next easier. The critical fourth step: extract, classify, document learnings.
If one agent is an employee, the missing layer is the company. Org charts, budgets, governance, heartbeat coordination. You manage goals, not tabs.
The agent's biggest bottleneck is context rot — drift between what the human wants and what the agent builds. Formal specs as the single source of truth.
The primary comparison layer. These workflow frameworks compete for your daily engineering loop.
| Dimension | gstack | Compound | Superpowers | BMAD | GSD |
|---|---|---|---|---|---|
| Repo | garrytan/gstack ↗ | EveryInc/compound-engineering-plugin ↗ | obra/superpowers ↗ | bmad-code-org/BMAD-METHOD ↗ | gsd-build/get-shit-done ↗ |
| Stars | ~19K | ~10K | ~32K | ~5K | ~4K |
| Philosophy | Role-specialist mode switching | Compounding loop | Enforced SDLC + subagents | Full agile simulation | Spec-driven context engineering |
| Planning | CEO taste + eng rigor | Parallel research → structured plan | Plan-first with auto-review | 4-phase agile | Spec files drive all work |
| Review | Paranoid staff eng, fix-first | 14+ parallel specialist agents | Auto-review with TDD | Gate checks between phases | Spec validation |
| Browser/QA | ★★★★★ First-party Chromium | ★★☆☆☆ Bolt-on | None | None | None |
| Memory | ★★☆☆☆ Retro + docs | ★★★★★ Compound step | ★★★☆☆ Pattern docs | ★★★☆☆ Sprint artifacts | ★★☆☆☆ Spec accumulation |
| Shipping | ★★★★★ One-command | ★★★☆☆ Disciplined | ★★★☆☆ Subagent impl | ★★★☆☆ Story-based | ★★☆☆☆ Manual |
| Portability | ★☆☆☆☆ Claude Code only | ★★★★★ 10+ platforms | ★★☆☆☆ Claude primary | ★★★☆☆ Multi-tool | ★★★☆☆ Multi-tool |
Primary sources: gstack · Compound Engineering · Superpowers · BMAD · GSD.
Interpretation note: the ★ rows are human ratings based on repo evidence and cross-project comparison within the L2 layer. They are not automated benchmarks. For collection rules and freshness policy, see METHODOLOGY.md ↗.
A comprehensive capability taxonomy for evaluating agent systems. We audited a real deployment — Feral Bots on OpenClaw — against all 21 layers.
A powerful artisanal operator stack, not yet a cleanly governed platform.
Full writeup: feral-bots-jarvis-21-audit.md ↗. The 125/210 score is a human rubric across 21 layers, not an automated benchmark or live telemetry feed.
The best setup isn't one framework — it's the right combination. Layer, don't replace. Pick one primary loop. Compose at the edges.
gstack's visual QA catches what Compound can't see. Compound's memory ensures QA patterns persist. Use gstack's /browse, /qa daily, then /ce:compound to capture root causes.
BMAD's phased planning (PRD → architecture → stories) for requirements, then gstack's /review, /qa, /ship for implementation quality and release automation.
Paperclip manages org chart, goals, budgets, heartbeats. Each agent runs gstack skills: Product Lead does /plan-ceo-review, QA agent does /qa, Release Eng does /ship.
Superpowers is better at figuring out what to build. Compound is better at researching how. The compounding step ensures neither system's insights are lost.
Compound's solution docs + Beads' database-backed persistence. If you switch workflows, your institutional knowledge comes with you. Framework-independent memory.
Role-specialist execution (L2) + institutional memory (L2/L3) + company governance (L0). Each layer operates independently. The composition is the product.
Answer honestly — the right choice depends on your situation, not what sounds coolest.
What Jensen Huang's vision of 100 AI agents per human actually requires — and what the Atlas reveals about closing the gap.
This isn't science fiction. It's an engineering problem. The Atlas reveals exactly what's missing:
Raw model capability is table stakes. The frameworks, memory systems, and governance layers create differentiated value.
gstack vs Compound is a fair fight (both L2). gstack vs Paperclip is apples vs chainsaws (L2 vs L0). The layer model prevents bad comparisons.
No single system wins every dimension. The best setups compose role specialization + compounding + orchestration + specs. Composability is the meta-skill.
Compound's institutional memory and Beads' persistent knowledge are force multipliers that most practitioners skip. Each PR should make the next easier.
First-party persistent Chromium daemon, 100ms latency, cookie import, health scores, AI-slop detection. Nothing else in L2 comes close.
10+ target platforms vs Claude Code–only for gstack. Cross-tool portability is increasingly strategic as the runtime landscape fragments.
Paperclip hit ~28K stars in ~2 weeks. Massive demand for company-level agent orchestration. This layer will be the most competitive in 2026.
The Jarvis 21 reveals gaps nobody talks about. These layers are weak across the entire ecosystem — not just individual deployments.