Eleventh Issue (December 2025)
- martinprause

December 2025: The Month AI Learned to See, Think, and Act
From Meta's SAM 3 teaching machines to see the world through human eyes, to DeepMind's SIMA 2 reasoning its way through virtual worlds, to Anthropic's robot dog experiment demonstrating how AI can bridge software and hardware—this edition of Beyond the Box captures a technological inflection point.
But alongside these capability leaps come sobering findings: AI systems that quietly drift in their beliefs during conversations, hidden biases that standard tools miss, and fundamental questions about the race toward superintelligence. Here's what decision-makers need to know.
Visual AI Reaches Human-Like Understanding
SAM 3
Meta's Segment Anything Model 3 represents a fundamental shift in computer vision. Previous systems required carefully curated lists of object categories ('dogs,' 'cars,' 'tables') and struggled when encountering anything outside their predefined vocabulary. SAM 3 breaks this limitation entirely. The system can now detect, segment, and track any visual concept expressed in simple natural language. Tell it to find 'striped cats,' 'wooden bowls,' or 'yellow school buses,' and it delivers. Show it an example image, and it finds all similar objects. For video, it maintains consistent object identities even when subjects temporarily disappear behind obstacles.
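For readers who want to picture the shift, here is a minimal sketch of what prompt-driven, open-vocabulary segmentation looks like from a developer's point of view. The `PromptableSegmenter` class and its `segment` method are purely illustrative stand-ins for the concept (query by text phrase or by exemplar image), not Meta's actual SAM 3 API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Illustrative only: mirrors the *idea* of promptable segmentation
# (text phrase or exemplar image as the query); not Meta's SAM 3 API.

@dataclass
class Mask:
    label: str                    # the concept that was asked for
    score: float                  # model confidence
    polygon: List[Tuple[int, int]]  # simplified mask outline

class PromptableSegmenter:
    """Stand-in for an open-vocabulary segmentation model."""

    def segment(self, image, text: Optional[str] = None,
                exemplar=None) -> List[Mask]:
        # A real model would return one mask per matching instance,
        # whether the query came as a phrase or an example crop.
        query = text if text is not None else "objects like the exemplar"
        return [Mask(label=query, score=0.9, polygon=[(0, 0), (10, 0), (10, 10)])]

if __name__ == "__main__":
    model = PromptableSegmenter()
    # Query by plain language: every striped cat in the frame.
    for m in model.segment(image=None, text="striped cat"):
        print(m.label, m.score)
```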
The business implications are immediate: automated quality control in manufacturing, inventory management in retail, agricultural monitoring, and medical image analysis all become more accessible when AI can understand visual concepts as flexibly as humans describe them. Annotation costs, historically a major bottleneck, could fall dramatically as SAM 3's hybrid human-AI data engine generates training data at scale.
Teaching Machines Human Perception
A breakthrough published in Nature tackles a different but equally important problem: making AI vision systems organize knowledge the way humans do. Current systems make errors that strike us as absurd: confusing a chihuahua with a blueberry muffin, or deeming a green lizard more similar to a houseplant than to a crocodile.
Researchers created a 'surrogate teacher', an AI model tuned to judge visual similarity like humans do, and used it to train other vision systems. The results: models that organize concepts hierarchically (poodles are dogs, dogs are animals, animals differ from vehicles) and show up to 73x better alignment with human perception on certain measures. These human-aligned models also became technically better—outperforming unaligned counterparts on 32 of 40 benchmark tasks and improving accuracy by nearly 10 percentage points on challenging test sets. Human-like thinking, it turns out, produces more robust AI.
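A rough sketch of the underlying idea, assuming a teacher model that embeds images in a human-aligned similarity space: fine-tune the student so its pairwise similarities match the teacher's. The loss below is illustrative, not the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the 'surrogate teacher' idea: nudge a student vision
# model so its pairwise image similarities match those of a teacher tuned on
# human similarity judgments. The exact loss and weighting are assumptions.

def pairwise_cosine(embeddings: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every pair of embeddings in a batch."""
    z = F.normalize(embeddings, dim=-1)
    return z @ z.T  # (batch, batch) similarity matrix

def alignment_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between student and teacher similarity structure."""
    return F.mse_loss(pairwise_cosine(student_emb), pairwise_cosine(teacher_emb))

if __name__ == "__main__":
    student = torch.randn(8, 64)   # student embeddings for a batch of images
    teacher = torch.randn(8, 32)   # human-aligned teacher embeddings
    print(alignment_loss(student, teacher).item())
    # During fine-tuning this would be combined with the usual task loss:
    # total_loss = task_loss + lambda_align * alignment_loss(...)
```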
AI Enters the Physical World
SIMA 2
Google DeepMind's SIMA 2 represents a qualitative leap in embodied AI. The original SIMA could follow basic instructions across video games ('open the door,' 'climb the ladder'). SIMA 2 does something fundamentally different: it reasons about goals, explains its thinking, and learns from experience. Ask it to 'build a shelter before nightfall,' and SIMA 2 breaks the problem into components: What materials are needed? Where might they be found? What sequence of actions makes sense? It can articulate this reasoning to users, transforming the interaction from issuing commands to genuine collaboration.
Perhaps most significant is SIMA 2's capacity for transfer learning and self-improvement. Skills learned in one game apply to others: 'mining' in one world translates to 'harvesting' in another. After initial training on human demonstrations, the agent can learn autonomously through trial and error, each generation becoming more capable than the last.
Project Fetch
Anthropic's 'Project Fetch' experiment provides a concrete demonstration of AI bridging the digital-physical divide. Eight employees with no robotics background were divided into two teams and asked to program a robot dog to fetch a beach ball. One team had access to Claude; the other didn't. The results were striking: the AI-assisted team completed tasks in roughly half the time, saving over 100 minutes on sensor connectivity alone. Only Team Claude made meaningful progress toward full autonomous retrieval, writing approximately nine times more code and exploring multiple approaches in parallel.
The strategic implication: if AI substantially lowers the barrier to robotics programming, small businesses might deploy robots for tasks that previously required specialized contractors. The cost of automation, historically driven by software integration rather than hardware, could decline meaningfully.
The Reliability Challenge
When AI Beliefs Drift
A comprehensive study from Princeton, Stanford, Carnegie Mellon, and other institutions documented a troubling phenomenon: the longer AI systems engage in conversation, debate, or extended reading, the more their expressed 'beliefs' shift. This isn't the result of adversarial attacks; it emerges from ordinary use.
The numbers are substantial. In structured debates, GPT-5 shifted its stated beliefs in nearly 55% of interactions. With information-based persuasion, that figure climbed to 73%. Different models showed different vulnerabilities: Claude-4-Sonnet proved less reactive to direct persuasion but more susceptible to drift during extended reading tasks. Most intriguingly, models appeared to absorb narrative framing rather than specific facts. When researchers masked topic-relevant sentences, the drift persisted, suggesting these systems respond to stories and tone rather than isolated evidence.
The business risk is clear: if a customer service bot or research assistant subtly shifts its positions based on recent interactions, brand voice may evolve without intentional updates, compliance outputs may become unpredictable, and product behavior may change without any new deployment.
Unmasking Hidden Bias
A new method called FiSCo (Fine-grained Semantic Comparison) addresses a different reliability concern: detecting subtle demographic bias that existing tools miss. Traditional approaches struggle because AI systems naturally vary their responses (ask the same question twice and you get different answers), making it difficult to distinguish random variation from systematic discrimination. FiSCo breaks responses into discrete claims and uses rigorous statistical testing to determine when inter-group differences exceed natural variation. Applied to current models, it revealed that larger, more advanced models exhibit less measurable bias overall, with Claude 3 Sonnet demonstrating the lowest bias levels among systems tested. Racial bias emerged as the most prominent category across all models. For organizations deploying AI, FiSCo offers a path toward demonstrable fairness, with quantifiable metrics relevant to regulatory frameworks including the EU AI Act.
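To make the statistical idea concrete, here is a simplified sketch: score agreement between responses within and across demographic groups, then test whether the cross-group gap exceeds natural variation. The claim extraction and similarity scoring below are crude stubs; FiSCo's actual pipeline relies on much more careful semantic comparison.

```python
import numpy as np
from scipy import stats

def claim_similarity(resp_a: str, resp_b: str) -> float:
    """Stub: overlap of sentence-level 'claims' between two responses.
    A real pipeline would use an LLM or entailment model here."""
    a = set(s.strip() for s in resp_a.lower().split(".") if s.strip())
    b = set(s.strip() for s in resp_b.lower().split(".") if s.strip())
    return len(a & b) / max(len(a | b), 1)

def group_fairness_test(group_a: list, group_b: list, alpha: float = 0.05) -> dict:
    """Compare within-group vs. between-group response agreement."""
    within = [claim_similarity(x, y)
              for i, x in enumerate(group_a) for y in group_a[i + 1:]]
    between = [claim_similarity(x, y) for x in group_a for y in group_b]
    t_stat, p_value = stats.ttest_ind(within, between, equal_var=False)
    return {
        "within_mean": float(np.mean(within)),
        "between_mean": float(np.mean(between)),
        "p_value": float(p_value),
        # Flag only when between-group agreement is significantly lower.
        "flagged": p_value < alpha and np.mean(between) < np.mean(within),
    }
```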
Efficiency and the Future of Computing
Intelligence Per Watt
As AI becomes woven into everyday life, a Stanford-led study proposes a simple but powerful idea: measure not just how smart a model is, but how efficiently it turns power into useful answers. Their 'Intelligence per Watt' metric challenges assumptions about where AI needs to run.
The surprising discovery: small, local models running on laptops can now solve 88.7% of single-turn queries (everyday tasks like writing, summarizing, and general knowledge questions). From 2023 to 2025, 'intelligence per watt' improved 5.3x, driven by both smarter architectures and better consumer hardware.
With intelligent routing that decides which queries stay local and which escalate to the cloud, simulations showed energy reductions of over 80%. At global scale, this approach could save terawatt-hours annually, offering a path to sustainable AI without endless new data centers.
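A back-of-the-envelope sketch of the framing: divide useful output by power draw, and route queries so only the hard ones leave the device. The accuracy and wattage figures below are illustrative placeholders, not numbers from the study.

```python
# Illustrative 'intelligence per watt' arithmetic plus a naive local/cloud router.
# All figures are placeholder assumptions, not measurements from the Stanford study.

def intelligence_per_watt(accuracy: float, avg_power_watts: float) -> float:
    """Useful-answer rate per watt of draw while answering."""
    return accuracy / avg_power_watts

laptop_model = {"accuracy": 0.85, "power_w": 30}    # small on-device model
cloud_model = {"accuracy": 0.95, "power_w": 700}    # large hosted model

for name, m in [("local", laptop_model), ("cloud", cloud_model)]:
    print(name, round(intelligence_per_watt(m["accuracy"], m["power_w"]), 4))

def route(query_difficulty: float, local_threshold: float = 0.7) -> str:
    """Keep easy queries on-device; escalate only the hard ones to the cloud."""
    return "local" if query_difficulty <= local_threshold else "cloud"

print(route(0.4), route(0.9))  # -> local cloud
```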
Looking Ahead
December 2025's research paints a picture of AI at an inflection point. The technology is becoming more capable (seeing like humans, reasoning through complex tasks, bridging into physical systems) while simultaneously revealing new challenges around reliability, consistency, and governance.
For decision-makers, the message is clear: AI capability is accelerating faster than our frameworks for understanding it. The organizations that thrive will be those that embrace both the opportunity and the responsibility, investing in capability while building the governance structures to deploy it safely.



