In a week when both Anthropic and OpenAI shipped multi-agent tools, Anthropic is also showing off some of its more ambitious AI coding experiments. As usual, the results come with caveats.
On Thursday, Anthropic researcher Nicholas Carlini published a blog post describing an experiment in which 16 instances of the company’s Claude Opus 4.6 AI model were set loose on a shared codebase with minimal guidance. Their task: build a C compiler from scratch.
Over two weeks and nearly 2,000 Claude Code sessions, at a cost of about $20,000 in API fees, the AI agents reportedly produced a 100,000-line compiler written in Rust. The compiler can build a bootable Linux 6.9 kernel on x86, ARM, and RISC-V architectures.
Carlini, a research scientist on Anthropic’s Safeguards team with prior stints at Google Brain and DeepMind, used a new Claude Opus 4.6 feature called “agent teams.” In practice, each Claude instance ran inside its own Docker container, cloned a shared Git repository, claimed a task by writing a lock file, and then pushed its completed code back upstream. No central orchestration agent directed the work. Each instance independently identified whichever problem it judged most pressing and started on it. When merge conflicts arose, the instances resolved them on their own.
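To make the lock-file idea concrete, here is a minimal sketch of how several workers can claim tasks without a coordinator. It is an illustration only, not Carlini's implementation: the experiment coordinated through a shared Git repository, whereas this sketch assumes a local directory and relies on the operating system's atomic "create only if absent" file open. The function names, the `.lock` suffix, and the task names are all hypothetical.

```python
import os
import tempfile


def try_claim(lock_dir, task, agent_id):
    """Return True if this agent claimed the task, False if it was already taken."""
    path = os.path.join(lock_dir, task + ".lock")
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so at most
        # one caller can create the lock file for a given task.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(agent_id + "\n")  # record who holds the claim
    return True


def release(lock_dir, task):
    """Remove the lock file so another agent can claim the task."""
    os.remove(os.path.join(lock_dir, task + ".lock"))


def claim_first_free(lock_dir, tasks, agent_id):
    """Walk the task list and claim the first task nobody else holds."""
    for task in tasks:
        if try_claim(lock_dir, task, agent_id):
            return task
    return None


if __name__ == "__main__":
    lock_dir = tempfile.mkdtemp()
    tasks = ["parse_structs", "codegen_riscv"]  # hypothetical task names
    print(claim_first_free(lock_dir, tasks, "agent-1"))
    print(claim_first_free(lock_dir, tasks, "agent-2"))
    print(claim_first_free(lock_dir, tasks, "agent-3"))
```

The design point is that the claim itself is the synchronization: whichever worker creates the file first owns the task, and everyone else moves on to the next one. A Git-based variant would need an equivalent "first writer wins" rule, such as a rejected push when two agents try to add the same lock.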
The resulting compiler, now available on GitHub, can compile a range of major open source projects, including PostgreSQL, SQLite, Redis, FFmpeg, and QEMU. It achieved a 99 percent pass rate on the GCC torture test suite and, in Carlini’s words, passed “the developer’s ultimate litmus test” by compiling and running Doom.
It is worth noting that a C compiler is a near-ideal task for semi-autonomous AI coding. The specification is decades old and well defined, comprehensive test suites already exist, and there is a known-good reference compiler to check against. Most real-world software projects have none of these advantages. The hard part of most development is not writing code that passes tests but figuring out what the tests should be in the first place.
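Checking new code against a known-good reference is often called differential testing, and a toy version shows why it removes so much guesswork. The sketch below is not taken from the project. It assumes a hypothetical `reference_div` that follows C's rule for integer division (truncate toward zero) and a hypothetical `candidate_div` with a plausible bug (it rounds toward negative infinity instead). The harness needs no hand-written expected values: any disagreement with the reference is a failure.

```python
def reference_div(pair):
    """Known-good stand-in: integer division that truncates toward zero, as in C."""
    a, b = pair
    quotient = abs(a) // abs(b)  # raises ZeroDivisionError when b == 0
    return quotient if (a >= 0) == (b >= 0) else -quotient


def candidate_div(pair):
    """Implementation under test: floors the result, which is wrong for mixed signs."""
    a, b = pair
    return a // b


def outcome(fn, item):
    """Run fn and capture either its result or the type of error it raised."""
    try:
        return ("ok", fn(item))
    except Exception as exc:
        return ("error", type(exc).__name__)


def differential_test(candidate, reference, inputs):
    """Return (input, expected, actual) for every input where the two disagree."""
    mismatches = []
    for item in inputs:
        expected = outcome(reference, item)
        actual = outcome(candidate, item)
        if actual != expected:
            mismatches.append((item, expected, actual))
    return mismatches


if __name__ == "__main__":
    pairs = [(7, 2), (-7, 2), (7, -2), (6, 3), (1, 0)]
    for item, expected, actual in differential_test(candidate_div, reference_div, pairs):
        print(item, "expected", expected, "got", actual)
```

Here the two implementations agree on same-sign inputs and on division by zero (both raise the same error), and disagree only where the signs differ. A project without a trusted reference has to write each expected value by hand, which is the harder problem the paragraph above describes.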