Artificial Intelligence

Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark

Researchers from the University of California, Berkeley’s Center for Responsible, Decentralized Intelligence (RDI) introduced the Agents’ Last Exam (ALE), a benchmark testing AI’s ability to execute real-world professional workflows. OpenAI’s GPT-5.5 outperformed Anthropic’s Claude Fable 5, with a 24.0% pass rate against 22.0%. ALE’s rigorous framework exposes the limitations of current AI models across 55 industries.

Researchers at the University of California, Berkeley’s Center for Responsible, Decentralized Intelligence (RDI) have launched the Agents’ Last Exam (ALE), a benchmark designed to evaluate whether artificial intelligence can perform real-world, long-horizon professional tasks. Unlike traditional benchmarks, ALE simulates authentic workflows from 55 industries, such as 3D modeling in Siemens NX or neuroimaging analysis in FSLeyes, using a strict framework called Generalist Computer-Use Agent (GCUA).

ALE’s evaluation structure addresses past flaws in AI testing, such as automated graders rejecting correct solutions or models exploiting hidden data. The benchmark uses a multi-layered approach with five components: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate). Models must interact with virtual machines and desktop software. Only 6.8% of tasks rely on LLM-as-a-judge grading, which can be unpredictable; most use deterministic, code-based verification against expert references.

In the initial leaderboard, OpenAI’s GPT-5.5, accessed via the Codex harness, achieved the highest pass rate at 24.0%, surpassing Anthropic’s Claude Fable 5, which scored 22.0%. Other harnesses, such as Ale Claw (45.8% with GPT-5.5) and Claude Code (40.5% with Claude Fable 5), are also listed, but all models struggled with the benchmark’s complexity.

ALE currently comprises 1,490 tasks aligned with the U.S. federal occupational taxonomy, and its creators aim to scale it to 5,000. They emphasize the benchmark’s authenticity: tasks are sourced from professional workflows in fields such as visual effects compositing in Adobe After Effects. ALE sorts tasks into Near-Term, Full-Spectrum, and Last-Exam tiers, exposing gaps in current AI capabilities. The results show that even advanced models like GPT-5.5 and Claude Fable 5 face significant challenges in executing economically valuable, multi-step professional tasks.
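To make the grading distinction concrete, the following is a minimal Python sketch of deterministic, code-based verification against an expert reference, followed by a pass-rate calculation. It is an illustration only, not ALE’s actual grader: the `TaskResult` type, the `verify_exact` and `pass_rate` helpers, the field names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one agent attempt; not an ALE data structure."""
    task_id: str
    agent_output: dict
    expert_reference: dict


def verify_exact(result: TaskResult, tolerance: float = 1e-6) -> bool:
    """Deterministic check: every reference field must match the agent output.

    Float fields are compared within a tolerance, everything else exactly.
    The same inputs always yield the same verdict, unlike a model-based judge.
    """
    for key, expected in result.expert_reference.items():
        actual = result.agent_output.get(key)
        if isinstance(expected, float):
            if not isinstance(actual, (int, float)) or abs(actual - expected) > tolerance:
                return False
        elif actual != expected:
            return False
    return True


def pass_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks whose output passes verification."""
    if not results:
        return 0.0
    passed = sum(verify_exact(r) for r in results)
    return 100.0 * passed / len(results)


# Made-up sample data, for illustration only.
sample = [
    TaskResult("t1", {"volume_mm3": 1250.0, "format": "stp"},
               {"volume_mm3": 1250.0, "format": "stp"}),
    TaskResult("t2", {"volume_mm3": 1249.0, "format": "stp"},
               {"volume_mm3": 1250.0, "format": "stp"}),
    TaskResult("t3", {"format": "nii"},
               {"voxel_count": 512, "format": "nii"}),
    TaskResult("t4", {"voxel_count": 512, "format": "nii"},
               {"voxel_count": 512, "format": "nii"}),
]
print(f"pass rate: {pass_rate(sample):.1f}%")
```

In this toy data two of the four attempts match their references, so the script reports a 50.0% pass rate. A real harness would compare far richer artifacts (files, application state), but the principle of a reproducible pass/fail verdict is the same.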

This content was automatically generated and/or translated by AI. It may contain inaccuracies. Please refer to the original sources for verification.
