A structured comparison of the higher-order abstractions that sit on top of AI coding agents. Not which model is best — which system around the model is best, and how they compose.
The same Claude 4 model produces radically different output depending on whether it's running bare, inside a role-specialist workflow, or as part of a governed multi-agent organization. The operator and the system around the model are the moat.
The AI coding ecosystem exploded in early 2026. Hundreds of skills, thousands of prompts, dozens of marketplaces. But nobody is systematically comparing the underlying abstractions — the frameworks, philosophies, and architectures that shape how agents plan, execute, review, remember, and coordinate. This Atlas is the map.
This project is opinionated, but it should still be legible. Here’s what is being compared, what is deliberately out of scope, and how to interpret the claims.
In scope: open-source L0–L3 frameworks with real code, real architectural opinions, and evidence of real-world use. Individual prompts/skills and closed-source tools are out of scope. L4 runtimes are tracked as the control group, not reviewed as the main subject.
How claims are checked: repo-first. The Atlas reads source code, README/changelog, issue threads, and documented usage. Ratings are editorial judgments from repo evidence — not benchmark scores, not sponsored placements.
Freshness: this snapshot was verified on March 17, 2026. Stars, versions, and capabilities drift quickly. If you catch drift, open a PR with the date you verified it. Read full methodology ↗
| Term | Definition |
|---|---|
| Runtime (L4) | The host tool the model runs inside — Claude Code, Codex CLI, Gemini CLI, Cursor, and so on. |
| Workflow framework (L2) | An opinionated process layer for planning, review, shipping, QA, and learning. |
| Orchestrator (L1) | A system that runs multiple agents in parallel in isolated workspaces, branches, or sandboxes. |
| Company OS (L0) | A governance layer for many agents: goals, budgets, approvals, org charts, and heartbeat coordination. |
| Compounding | Capturing the learnings from finished work so the next cycle gets easier instead of harder. |
| Worktree | A git feature that lets agents work on separate branches without touching your main checkout. |
| Star count (★) | A rough proxy for attention, not quality. Treat star counts as point-in-time context, not proof. |
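The worktree mechanic above is plain git, not framework magic. A minimal sketch of how an orchestrator might carve out one isolated checkout per agent (the repo path, branch names, and committer identity here are illustrative, not from any specific framework):

```shell
# Illustrative sketch: one scratch repo, one isolated worktree per agent.
git init -q scratch-repo
git -C scratch-repo -c user.name=bot -c user.email=bot@example.com \
    commit --allow-empty -q -m "init"

# Each agent gets its own checkout on its own branch; the main
# checkout in scratch-repo/ is never touched.
git -C scratch-repo worktree add -q ../agent-qa   -b agent/qa
git -C scratch-repo worktree add -q ../agent-impl -b agent/impl
git -C scratch-repo worktree list   # main checkout plus both agent worktrees
```

When an agent's branch is merged, `git worktree remove ../agent-qa` reclaims the directory without disturbing the main checkout.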
Every project operates at a specific layer. Comparing across layers is noise. Comparing within layers is signal. Higher layers compose on top of lower layers.
These frameworks embody different theories of what makes AI-assisted engineering succeed. The strongest setups blend all four.
A generalist agent produces mediocre output. The fix: explicit cognitive gears. Tell the model what kind of brain to use right now. Planning ≠ review ≠ shipping.
Most codebases get harder over time. Invert this. Each unit of work makes the next easier. The critical fourth step: extract, classify, document learnings.
If one agent is an employee, the missing layer is the company. Org charts, budgets, governance, heartbeat coordination. You manage goals, not tabs.
The agent's biggest bottleneck is context rot — drift between what the human wants and what the agent builds. Formal specs as the single source of truth.
The primary comparison layer. These workflow frameworks compete for your daily engineering loop.
| Dimension | gstack | Compound | Superpowers | BMAD | GSD |
|---|---|---|---|---|---|
| Repo | garrytan/gstack ↗ | EveryInc/compound-engineering-plugin ↗ | obra/superpowers ↗ | bmad-code-org/BMAD-METHOD ↗ | gsd-build/get-shit-done ↗ |
| Stars | ~19K | ~10K | ~32K | ~5K | ~4K |
| Philosophy | Role-specialist mode switching | Compounding loop | Enforced SDLC + subagents | Full agile simulation | Spec-driven context engineering |
| Planning | CEO taste + eng rigor | Parallel research → structured plan | Plan-first with auto-review | 4-phase agile | Spec files drive all work |
| Review | Paranoid staff eng, fix-first | 14+ parallel specialist agents | Auto-review with TDD | Gate checks between phases | Spec validation |
| Browser/QA | ★★★★★ First-party Chromium | ★★☆☆☆ Bolt-on | None | None | None |
| Memory | ★★☆☆☆ Retro + docs | ★★★★★ Compound step | ★★★☆☆ Pattern docs | ★★★☆☆ Sprint artifacts | ★★☆☆☆ Spec accumulation |
| Shipping | ★★★★★ One-command | ★★★☆☆ Disciplined | ★★★☆☆ Subagent impl | ★★★☆☆ Story-based | ★★☆☆☆ Manual |
| Portability | ★☆☆☆☆ Claude Code only | ★★★★★ 10+ platforms | ★★☆☆☆ Claude primary | ★★★☆☆ Multi-tool | ★★★☆☆ Multi-tool |
Primary sources: gstack · Compound Engineering · Superpowers · BMAD · GSD.
Interpretation note: the ★ rows are human ratings based on repo evidence and cross-project comparison within the L2 layer. They are not automated benchmarks. For collection rules and freshness policy, see METHODOLOGY.md ↗.
A comprehensive capability taxonomy for evaluating agent systems. We audited a real deployment — Feral Bots on OpenClaw — against all 21 layers.
A powerful artisanal operator stack, not yet a cleanly governed platform.
Full writeup: feral-bots-jarvis-21-audit.md ↗. The 125/210 score is a human rubric across 21 layers, not an automated benchmark or live telemetry feed.
The best setup isn't one framework — it's the right combination. Layer, don't replace. Pick one primary loop. Compose at the edges.
gstack's visual QA catches what Compound can't see. Compound's memory ensures QA patterns persist. Use gstack's /browse, /qa daily, then /ce:compound to capture root causes.
BMAD's phased planning (PRD → architecture → stories) for requirements, then gstack's /review, /qa, /ship for implementation quality and release automation.
Paperclip manages org chart, goals, budgets, heartbeats. Each agent runs gstack skills: Product Lead does /plan-ceo-review, QA agent does /qa, Release Eng does /ship.
Superpowers is better at figuring out what to build. Compound is better at researching how. The compounding step ensures neither system's insights are lost.
Compound's solution docs + Beads' database-backed persistence. If you switch workflows, your institutional knowledge comes with you. Framework-independent memory.
Role-specialist execution (L2) + institutional memory (L2/L3) + company governance (L0). Each layer operates independently. The composition is the product.
Answer honestly — the right choice depends on your situation, not what sounds coolest.
What Jensen Huang's vision of 100 AI agents per human actually requires — and what the Atlas reveals about closing the gap.
This isn't science fiction. It's an engineering problem. The Atlas reveals exactly what's missing:
Raw model capability is table stakes. The frameworks, memory systems, and governance layers create differentiated value.
gstack vs Compound is a fair fight (both L2). gstack vs Paperclip is apples vs chainsaws (L2 vs L0). The layer model prevents bad comparisons.
No single system wins every dimension. The best setups compose role specialization + compounding + orchestration + specs. Composability is the meta-skill.
Compound's institutional memory and Beads' persistent knowledge are force multipliers that most practitioners skip. Each PR should make the next easier.
First-party persistent Chromium daemon, 100ms latency, cookie import, health scores, AI-slop detection. Nothing else in L2 comes close.
10+ target platforms vs Claude Code–only for gstack. Cross-tool portability is increasingly strategic as the runtime landscape fragments.
Paperclip hit ~28K stars in ~2 weeks. Massive demand for company-level agent orchestration. This layer will be the most competitive in 2026.
The Jarvis 21 reveals gaps nobody talks about. These layers are weak across the entire ecosystem — not just individual deployments.