A structured comparison of the higher-order abstractions that sit on top of AI coding agents. Not which model is best — which system around the model is best, and how they compose.
The same Claude 4 model produces radically different output depending on whether it's running bare, inside a role-specialist workflow, or as part of a governed multi-agent organization. The operator and the system around the model are the moat.
The AI coding ecosystem exploded in early 2026. Hundreds of skills, thousands of prompts, dozens of marketplaces. But nobody is systematically comparing the underlying abstractions — the frameworks, philosophies, and architectures that shape how agents plan, execute, review, remember, and coordinate. This Atlas is the map.
Every project operates at a specific layer. Comparing across layers is noise. Comparing within layers is signal. Higher layers compose on top of lower layers.
These frameworks embody different theories of what makes AI-assisted engineering succeed. The strongest setups blend all four.
A generalist agent produces mediocre output. The fix: explicit cognitive gears. Tell the model what kind of brain to use right now. Planning ≠ review ≠ shipping.
Most codebases get harder to work in over time. Invert this: each unit of work should make the next one easier. The critical fourth step: extract, classify, and document learnings.
If one agent is an employee, the missing layer is the company. Org charts, budgets, governance, heartbeat coordination. You manage goals, not tabs.
The agent's biggest bottleneck is context rot — drift between what the human wants and what the agent builds. Formal specs as the single source of truth.
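The company-layer philosophy above (org charts, budgets, heartbeat coordination) can be sketched as a minimal heartbeat loop. Everything here is an illustrative assumption — the `Agent` class, `tick`, and `heartbeat` are hypothetical names, not any framework's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    goal: str
    budget: int                      # remaining work units this agent may spend
    log: list = field(default_factory=list)

    def tick(self) -> str:
        """One heartbeat: spend a unit of budget advancing the goal, or report idle."""
        if self.budget <= 0:
            return f"{self.name}: idle (budget exhausted)"
        self.budget -= 1
        status = f"{self.name}: advanced '{self.goal}' ({self.budget} left)"
        self.log.append(status)
        return status


def heartbeat(agents: list[Agent]) -> list[str]:
    """The 'company' layer: poll every agent on a cadence, collect statuses for the operator."""
    return [agent.tick() for agent in agents]
```

The operator manages goals and budgets, not tabs: each heartbeat returns a status digest, and an exhausted budget surfaces as idleness rather than runaway work.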
The primary comparison layer. These workflow frameworks compete for your daily engineering loop.
| Dimension | gstack | Compound | Superpowers | BMAD | GSD |
|---|---|---|---|---|---|
| GitHub stars | ~19K | ~10K | ~32K | ~5K | ~4K |
| Philosophy | Role-specialist mode switching | Compounding loop | Enforced SDLC + subagents | Full agile simulation | Spec-driven context engineering |
| Planning | CEO taste + eng rigor | Parallel research → structured plan | Plan-first with auto-review | 4-phase agile | Spec files drive all work |
| Review | Paranoid staff eng, fix-first | 14+ parallel specialist agents | Auto-review with TDD | Gate checks between phases | Spec validation |
| Browser/QA | ★★★★★ First-party Chromium | ★★☆☆☆ Bolt-on | None | None | None |
| Memory | ★★☆☆☆ Retro + docs | ★★★★★ Compound step | ★★★☆☆ Pattern docs | ★★★☆☆ Sprint artifacts | ★★☆☆☆ Spec accumulation |
| Shipping | ★★★★★ One-command | ★★★☆☆ Disciplined | ★★★☆☆ Subagent impl | ★★★☆☆ Story-based | ★★☆☆☆ Manual |
| Portability | ★☆☆☆☆ Claude Code only | ★★★★★ 10+ platforms | ★★☆☆☆ Claude primary | ★★★☆☆ Multi-tool | ★★★☆☆ Multi-tool |
A comprehensive capability taxonomy for evaluating agent systems. We audited a real deployment — Feral Bots on OpenClaw — against all 21 layers.
The verdict: a powerful artisanal operator stack, not yet a cleanly governed platform.
The best setup isn't one framework — it's the right combination. Layer, don't replace. Pick one primary loop. Compose at the edges.
gstack's visual QA catches what Compound can't see. Compound's memory ensures QA patterns persist. Use gstack's /browse and /qa daily, then /ce:compound to capture root causes.
BMAD's phased planning (PRD → architecture → stories) for requirements, then gstack's /review, /qa, /ship for implementation quality and release automation.
Paperclip manages org chart, goals, budgets, heartbeats. Each agent runs gstack skills: Product Lead does /plan-ceo-review, QA agent does /qa, Release Eng does /ship.
Superpowers is better at figuring out what to build. Compound is better at researching how. The compounding step ensures neither system's insights are lost.
Compound's solution docs + Beads' database-backed persistence. If you switch workflows, your institutional knowledge comes with you. Framework-independent memory.
Role-specialist execution (L2) + institutional memory (L2/L3) + company governance (L0). Each layer operates independently. The composition is the product.
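The framework-independent memory recipe above can be sketched as a minimal database-backed learnings store. All names here (`LearningStore`, the schema, `record`/`recall`) are illustrative assumptions, not Beads' or Compound's actual API:

```python
import sqlite3


class LearningStore:
    """SQLite-backed institutional memory that outlives any single workflow framework."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        # One row per captured learning, classified so future work can query it.
        self.db.execute(
            """CREATE TABLE IF NOT EXISTS learnings (
                   id INTEGER PRIMARY KEY,
                   category TEXT NOT NULL,
                   summary TEXT NOT NULL,
                   detail TEXT NOT NULL
               )"""
        )

    def record(self, category: str, summary: str, detail: str) -> None:
        """Capture a learning (e.g. a root cause) at the end of a unit of work."""
        with self.db:
            self.db.execute(
                "INSERT INTO learnings (category, summary, detail) VALUES (?, ?, ?)",
                (category, summary, detail),
            )

    def recall(self, category: str) -> list[tuple[str, str]]:
        """Retrieve prior learnings by category before starting new work."""
        rows = self.db.execute(
            "SELECT summary, detail FROM learnings WHERE category = ? ORDER BY id",
            (category,),
        )
        return rows.fetchall()
```

Because the store is a plain database file rather than a framework's private state, switching your daily loop from one framework to another leaves the accumulated knowledge intact.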
Answer honestly — the right choice depends on your situation, not what sounds coolest.
What Jensen Huang's vision of 100 AI agents per human actually requires — and what the Atlas reveals about closing the gap.
This isn't science fiction. It's an engineering problem. The Atlas reveals exactly what's missing:
Raw model capability is table stakes. The frameworks, memory systems, and governance layers create differentiated value.
gstack vs Compound is a fair fight (both L2). gstack vs Paperclip is apples vs chainsaws (L2 vs L0). The layer model prevents bad comparisons.
No single system wins every dimension. The best setups compose role specialization + compounding + orchestration + specs. Composability is the meta-skill.
Compound's institutional memory and Beads' persistent knowledge are force multipliers that most practitioners skip. Each PR should make the next easier.
First-party persistent Chromium daemon, 100ms latency, cookie import, health scores, AI-slop detection. Nothing else in L2 comes close.
Compound targets 10+ platforms; gstack is Claude Code–only. Cross-tool portability is increasingly strategic as the runtime landscape fragments.
Paperclip hit ~28K stars in ~2 weeks. Massive demand for company-level agent orchestration. This layer will be the most competitive in 2026.
The Jarvis 21 reveals gaps nobody talks about. These layers are weak across the entire ecosystem — not just individual deployments.