On Friday, OpenAI engineer Michael Bolin published a detailed technical breakdown of how the company's Codex CLI coding agent works internally. The post gives developers a rare look inside an AI coding tool that can write code, run tests, and fix bugs under human supervision, and it complements our December article by digging into how OpenAI implements its "agentic loop."
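For context, an agentic loop of the kind Bolin describes generally alternates between model calls and tool execution until the task is done. The sketch below is a simplified, hypothetical illustration of that pattern, not OpenAI's actual implementation; the function and tool names are invented.

```python
# Hypothetical sketch of an agentic loop: the model proposes an action,
# the harness executes it, and the result is fed back until the model
# signals completion. All names here are invented for illustration.

def run_agent(task, model, tools, max_turns=10):
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = model(history)          # model sees the full history each turn
        history.append(reply)
        if reply.get("tool") is None:   # no tool call -> the agent is done
            return reply["content"]
        # Execute the requested tool (e.g., run a shell command or the tests)
        result = tools[reply["tool"]](reply["args"])
        history.append({"role": "tool", "content": result})
    return "max turns exceeded"

# Tiny mock model: first asks to run the tests, then declares success.
def mock_model(history):
    if not any(m.get("role") == "tool" for m in history):
        return {"role": "assistant", "tool": "run_tests", "args": {}, "content": ""}
    return {"role": "assistant", "tool": None, "content": "All tests pass."}

tools = {"run_tests": lambda args: "2 passed, 0 failed"}
print(run_agent("fix the failing test", mock_model, tools))  # -> All tests pass.
```

The key design point is that the loop, not the model, owns execution: the model only proposes actions, and a human (or harness policy) can intervene at every iteration.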
AI coding agents are currently having a surge in popularity akin to the "ChatGPT moment." Tools like Claude Code (with Opus 4.5) and Codex (with GPT-5.2) have matured enough to be genuinely useful for quickly coding prototypes, building interfaces, and generating boilerplate. OpenAI's publication of the design philosophy behind Codex comes just as these AI tools are becoming practical for everyday work.
Despite their growing utility, these tools remain imperfect and continue to stir debate among software developers. OpenAI told Ars Technica that it uses Codex as a tool in its own development process, but in our hands-on experience, these tools excel at executing simple tasks with remarkable speed yet become fragile on tasks outside their training data, which is why they require human oversight, particularly in production environments. A project's initial scaffolding often comes together with seemingly magical speed; what follows is detailed debugging and working around the agent's inherent limitations.
In his post, Bolin does not shy away from these engineering hurdles, covering the inefficiency of quadratic prompt growth, performance bottlenecks caused by cache misses, and inconsistencies in the enumeration of MCP tools, all of which the team had to address.
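To see why prompt growth is quadratic, consider a generic agent that appends a roughly fixed-size message each turn and resends the full history to the model on every turn: the total tokens processed across n turns then scale as O(n²). The snippet below is a back-of-the-envelope illustration of that arithmetic, not a description of Codex's internals.

```python
# Generic illustration of quadratic prompt growth in a naive agent loop:
# one new ~fixed-size message per turn, full history resent every turn.

def total_tokens_processed(turns, tokens_per_message=100):
    total = 0
    history_tokens = 0
    for _ in range(turns):
        history_tokens += tokens_per_message  # one new message per turn
        total += history_tokens               # entire history resent to the model
    return total

print(total_tokens_processed(10))  # 100 * (1+2+...+10) = 5500
print(total_tokens_processed(20))  # 100 * (1+2+...+20) = 21000
```

Doubling the number of turns roughly quadruples the total tokens processed, which is why long agent sessions lean so heavily on prompt caching and history compaction.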
This level of technical insight is somewhat atypical for OpenAI, which has not provided similar expositions for other products like ChatGPT. But as we noted from our December interview with OpenAI, Codex is treated distinctly, in part because programming tasks map so well onto the capabilities of large language models.