Fourth Issue (July 2025)
- martinprause
- Jul 28

The AI Paradox: Machines That Can Beat Math Olympiads Still Can't Pass Simple Intelligence Tests
The July 2025 issue of "Beyond the Box" reveals a striking paradox at the heart of artificial intelligence development: while AI systems are achieving unprecedented feats, from solving International Mathematical Olympiad problems to displacing millions of workers, they simultaneously fail at tasks that any human finds trivially easy.
The Great AI Divide
Perhaps no example better illustrates this divide than the contrast between two recent AI milestones. Google DeepMind's Gemini system just achieved gold medal performance at the International Mathematical Olympiad, solving five out of six problems that challenge the world's brightest young minds. Yet on the ARC-AGI-2 benchmark—a test of simple pattern recognition that any human can ace—the most advanced AI systems struggle to exceed 16% accuracy. Pure language models score exactly zero.
This gap reveals something fundamental about current AI: impressive specialized performance doesn't translate to general intelligence. As Martin Prause notes in his editor's letter, "We have seen that AI systems know 40% more than they can express, while failing spectacularly at simple puzzles humans solve instantly."
The Physical Frontier
While AI struggles with abstract reasoning, significant progress is emerging in physical applications. The development of "robot metabolism" at Columbia University represents a conceptual breakthrough. Researchers created modular robots that can grow, heal, and adapt by consuming parts from their environment or other robots—much like biological organisms.
These Truss Link modules can combine to form increasingly complex structures. A single module can only crawl forward and backward, but three links form a triangle that navigates in two dimensions, while six can create a 3D tetrahedron capable of climbing obstacles. When damaged, they can reconnect and even help other robots transform—a rudimentary form of self-repair that could revolutionize disaster response and space exploration.
Similarly, Large Action Models (LAMs) like ORGANA and RT-2 are bridging the gap between digital intelligence and physical action. ORGANA automates complex chemistry experiments, saving human researchers over 80% of their time. RT-2 demonstrates chain-of-thought reasoning while manipulating objects, successfully handling items it never encountered during training.
The Workforce Revolution
Perhaps the most immediate impact comes from the deployment of AI agents in the workplace. SoftBank's CEO Masayoshi Son plans to deploy one billion AI agents by the end of 2025, with each agent costing just 27 cents per month. Goldman Sachs is testing an autonomous coder named Devin that could soon join its 12,000 human developers as a full team member.
The economics are compelling: Son estimates that 1,000 AI agents can replace one human employee at a fraction of the cost, while being four times more productive. This isn't about simple automation—these are fully autonomous digital workers that can code, negotiate, make decisions, and manage other AI agents without human oversight.
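The arithmetic behind Son's estimate is easy to check. A minimal sketch, using the 27-cents-per-agent figure cited above; the human employee's monthly cost is a hypothetical assumption added purely for comparison:

```python
# Back-of-the-envelope check of the claim that 1,000 AI agents
# can replace one human employee at a fraction of the cost.
AGENT_COST_PER_MONTH = 0.27    # dollars per agent per month (figure cited in the article)
AGENTS_PER_EMPLOYEE = 1_000    # agents per replaced employee, per Son's estimate
HUMAN_COST_PER_MONTH = 5_000.0 # hypothetical fully loaded monthly cost (assumption)

agent_fleet_cost = AGENT_COST_PER_MONTH * AGENTS_PER_EMPLOYEE
cost_ratio = agent_fleet_cost / HUMAN_COST_PER_MONTH

print(f"1,000 agents: ${agent_fleet_cost:.2f}/month, "
      f"{cost_ratio:.1%} of a ${HUMAN_COST_PER_MONTH:,.0f}/month employee")
# 1,000 agents come to $270/month, about 5% of the assumed human cost.
```

Even if the assumed salary figure is off by a factor of two in either direction, the fleet still costs an order of magnitude less than the employee it replaces, which is what makes the deployment math so aggressive.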
The Trust Problem
Yet as AI becomes more integrated into critical systems, new vulnerabilities emerge. The Amazon Q security breach serves as a stark warning: a hacker gained administrative access to this popular AI coding assistant through a simple GitHub pull request, potentially affecting nearly a million developers.
The incident exposes how traditional security measures struggle with AI-powered development tools. With 42% of developers now saying their codebases are predominantly AI-generated, and only 29% feeling confident in detecting AI-created vulnerabilities, we're creating a massive attack surface for malicious actors.
Thinking Machines
Some of the most promising advances address AI's fundamental limitations. Energy-Based Transformers introduce a form of deliberative thinking, allowing models to verify their answers before responding, much like humans pause to think through complex problems. These systems showed 29% performance improvements while scaling 35% more efficiently than traditional transformers.
This approach to "System 2 thinking" could lead to AI assistants that catch their own mistakes, medical diagnostic systems that double-check their reasoning, and educational tools that can explain their thought process clearly.