Amid the industry push toward deploying autonomous AI agents, Anthropic and OpenAI have both recently unveiled multi-agent tools, and Anthropic has been particularly keen to highlight its ambitious AI coding initiatives. As is common with AI advancements, however, there are important caveats to bear in mind.
Anthropic researcher Nicholas Carlini revealed in a blog post on Thursday that he had supervised an experiment in which 16 instances of the company’s Claude Opus 4.6 AI model were tasked with collaboratively writing a C compiler from scratch, working in a shared codebase with only minimal human oversight.
Over the course of two weeks and approximately 2,000 Claude Code sessions, at an API cost of roughly $20,000, the agents produced a 100,000-line compiler written in Rust. The compiler can build a bootable Linux 6.9 kernel targeting the x86, ARM, and RISC-V architectures.
Carlini, a member of Anthropic’s Safeguards team who previously worked at Google Brain and DeepMind, used a Claude Opus 4.6 feature called “agent teams.” Under this setup, each instance ran independently in its own Docker container. The agents shared a Git repository, claimed tasks through lock files, and pushed updates autonomously. There was no central orchestration: each agent picked tasks based on what was needed at the moment and resolved merge conflicts on its own.
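Carlini’s post does not publish the coordination harness itself, but the lock-file scheme he describes can be sketched in a few lines. The Python sketch below is illustrative only: the locks/ directory, the claim_task helper, and the task naming are assumptions, not Anthropic’s actual code. The idea is that an agent atomically creates a lock file for a task, commits it, and pushes; Git rejects the second, conflicting push, so at most one agent’s claim of a given task ever lands in the shared repository.

```python
import os
import subprocess

REPO = "."  # each agent works in its own clone of the shared repository

def git(*args: str) -> None:
    """Run a git command in the agent's checkout, raising on failure."""
    subprocess.run(["git", *args], cwd=REPO, check=True,
                   capture_output=True, text=True)

def claim_task(task_id: str, agent_id: str) -> bool:
    """Try to claim a task by committing and pushing a lock file.

    The push is the arbiter: if another agent's claim landed first,
    our push is rejected as non-fast-forward and we back off.
    (Hypothetical layout: one lock file per task under locks/.)
    """
    lock_path = os.path.join(REPO, "locks", f"{task_id}.lock")
    os.makedirs(os.path.dirname(lock_path), exist_ok=True)
    try:
        # O_CREAT | O_EXCL fails if the lock file already exists locally.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        with os.fdopen(fd, "w") as f:
            f.write(agent_id + "\n")
    except FileExistsError:
        return False  # task already claimed in our local checkout
    git("add", lock_path)
    git("commit", "-m", f"{agent_id} claims {task_id}")
    try:
        git("push")
        return True
    except subprocess.CalledProcessError:
        # Lost the race: drop our claim commit and sync with the winner.
        git("reset", "--hard", "HEAD~1")
        git("pull", "--rebase")
        return False
```

An agent’s outer loop would simply iterate over open tasks, calling claim_task until one succeeds, do the work, and push the result; conflict resolution falls out of ordinary Git semantics rather than a central scheduler.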
The resulting compiler, made publicly available on GitHub, can compile several significant open-source projects such as PostgreSQL, SQLite, Redis, FFmpeg, and QEMU. It successfully cleared 99% of the GCC torture test suite and notably compiled and executed the game Doom, which Carlini described as “the developer’s ultimate litmus test.”
It's worth acknowledging that a C compiler is an almost ideal target for semi-autonomous AI coding: the language specification is well established, comprehensive test suites already exist, and a reliable reference compiler is available for validation. Most real-world software projects offer none of these conditions. Typically, the hard part is not producing code that passes the tests, but deciding what the tests should be in the first place.
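To make the reference-compiler point concrete, here is a minimal sketch of the differential-testing loop such a setup enables. The compiler name rustcc and the invocation are hypothetical placeholders: each deterministic C test program is built with both GCC and the candidate compiler, and the outputs of the two binaries are compared, so the reference compiler itself serves as the correctness oracle.

```python
import os
import subprocess
import sys
import tempfile

def output_of(compiler: str, source: str) -> str:
    """Compile a C source file with the given compiler and return
    the stdout of the resulting binary."""
    with tempfile.TemporaryDirectory() as tmp:
        binary = os.path.join(tmp, "a.out")
        subprocess.run([compiler, source, "-o", binary], check=True)
        return subprocess.run([binary], check=True,
                              capture_output=True, text=True).stdout

def differential_test(source: str, candidate: str = "./rustcc") -> bool:
    """GCC defines 'correct' here; 'rustcc' is a placeholder name
    for the candidate compiler under test."""
    return output_of("gcc", source) == output_of(candidate, source)

if __name__ == "__main__":
    ok = differential_test(sys.argv[1])
    print("match" if ok else "MISMATCH")
    sys.exit(0 if ok else 1)
```

Most real-world projects have no such ready-made oracle, which is precisely the gap described above.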