Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark

Researchers at the University of California, Berkeley’s Center for Responsible, Decentralized Intelligence (RDI) have launched the Agents’ Last Exam (ALE), a benchmark designed to evaluate whether artificial intelligence can perform real-world, long-horizon professional tasks. Unlike traditional benchmarks, ALE simulates authentic workflows from 55 industries, such as 3D modeling in Siemens NX or neuroimaging analysis in FSLeyes, using a strict framework called Generalist Computer-Use Agent (GCUA).

ALE’s evaluation structure addresses past flaws in AI testing, such as automated graders rejecting correct solutions or models exploiting hidden data. The benchmark uses a multi-layered approach with five layers: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate). Models must interact with virtual machines and desktop software to complete tasks. Only 6.8% of tasks rely on LLM-as-a-judge grading, which can be unpredictable; most are checked with deterministic, code-based verification against expert references.

In its initial leaderboard, OpenAI’s GPT-5.5, accessed via the Codex harness, achieved the highest pass rate at 24.0%, surpassing Anthropic’s Claude Fable 5, which scored 22.0%. Separate figures are given for other harnesses, such as Ale Claw (45.8% with GPT-5.5) and Claude Code (40.5% with Claude Fable 5), but every model struggled with the benchmark’s complexity.

ALE currently comprises 1,490 tasks aligned with the U.S. federal occupational taxonomy, with a goal of scaling to 5,000 to reflect real-world demands. The benchmark’s creators emphasize its authenticity, with tasks sourced from professional workflows in fields like visual effects compositing in Adobe After Effects. ALE categorizes tasks into Near-Term, Full-Spectrum, and Last-Exam tiers, exposing gaps in current AI capabilities. The results highlight that even advanced models like GPT-5.5 and Claude Fable 5 face significant challenges in executing economically valuable, multi-step professional tasks.
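
To make the grading approach described above more concrete, here is a minimal Python sketch of deterministic, code-based verification against reference values, plus a pass-rate calculation. It is an illustration only and not ALE's actual grader: the `Task` record, its field names, the tolerance rule, and the four toy tasks are all assumptions invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Task:
    """Hypothetical task record; the fields are illustrative, not ALE's schema."""
    task_id: str
    expected: Dict[str, float]  # expert reference values (assumed numeric)
    tolerance: float = 1e-6     # allowed absolute deviation per value


def verify(task: Task, produced: Dict[str, float]) -> bool:
    """Deterministic check: each reference key must be present and within tolerance."""
    for key, ref in task.expected.items():
        if key not in produced:
            return False
        if abs(produced[key] - ref) > task.tolerance:
            return False
    return True


def pass_rate(tasks: List[Task], outputs: Dict[str, Dict[str, float]]) -> float:
    """Fraction of tasks that pass; a task with no output counts as a failure."""
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if verify(t, outputs.get(t.task_id, {})))
    return passed / len(tasks)


# Toy data, invented for illustration only.
tasks = [
    Task("mesh-volume", {"volume_mm3": 1250.0}, tolerance=0.5),
    Task("roi-mean", {"mean_intensity": 0.731}, tolerance=0.001),
    Task("frame-count", {"frames": 240.0}),
    Task("export-size", {"width": 1920.0, "height": 1080.0}),
]
outputs = {
    "mesh-volume": {"volume_mm3": 1250.3},    # within tolerance: pass
    "roi-mean": {"mean_intensity": 0.74},     # off by 0.009: fail
    "frame-count": {"frames": 240.0},         # exact match: pass
    # "export-size" produced no output: fail
}

rate = pass_rate(tasks, outputs)
print(f"pass rate: {rate:.1%}")  # prints "pass rate: 50.0%"
```

Because the same inputs always produce the same verdict, a check of this kind avoids the run-to-run variation that LLM-as-a-judge grading can introduce. The trade-off is that the task author has to define in advance what counts as a correct result.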
