Benchmarking Professional Reasoning: Gemini 3 and GPT-5.2 Lead in Complex AI Agent Performance

A new benchmark titled APEX-Agents has been released to evaluate how effectively AI agents can execute long-horizon, cross-application tasks typically performed by investment bankers and corporate lawyers. The study reveals that while high-reasoning models like Gemini 3 Flash and GPT-5.2 are making significant strides, the complexity of realistic professional environments still presents a formidable challenge for autonomous systems.

The landscape of artificial intelligence is undergoing a fundamental shift from passive conversational models to autonomous "agents" capable of executing complex, multi-step workflows. As these systems move closer to integration within high-stakes professional environments, the need for rigorous, domain-specific evaluation has never been more critical. Researchers Bertie Vidgen, Austin Mann, and Abby Fennelly have addressed this gap with the introduction of the AI Productivity Index for Agents (APEX-Agents), a new benchmark designed to measure the efficacy of AI in tasks traditionally reserved for investment banking analysts, management consultants, and corporate lawyers.

Transitioning from Chatbots to Autonomous Agents

For the past several years, the gold standard for Large Language Model (LLM) performance has relied on benchmarks that measure static reasoning or general knowledge retrieval. However, as the industry pivots toward agentic workflows—where AI is expected to navigate file systems, use software tools, and execute sequences of actions over extended periods—traditional metrics have proven insufficient. The APEX-Agents benchmark represents a departure from these simple chat interfaces, focusing instead on "long-horizon" tasks that require persistent state management and cross-application coordination.

The research team identified three core pillars of professional services: investment banking, consulting, and law. These fields were selected because they demand high levels of precision, the ability to synthesize data from multiple sources (such as spreadsheets, PDFs, and internal databases), and the capacity to follow complex, multi-stage instructions. By shifting the focus to these high-value cognitive sectors, APEX-Agents provides a more accurate reflection of AI’s potential impact on the modern workforce than general-purpose benchmarks.

The Methodology of Professional Reasoning

The complexity of APEX-Agents lies in its insistence on "long-horizon" execution. Unlike a standard prompt that asks for a summary of a document, a task within this benchmark might require an agent to ingest an acquisition agreement, cross-reference specific clauses with a financial model in Excel, and then draft a formal memorandum in a word processor. These tasks are not performed in a vacuum; the benchmark provides a realistic work environment complete with structured file systems and professional-grade software tools.
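To make the format concrete, a single task of this kind could be sketched as a structured specification like the one below. The field names and values are hypothetical, invented for illustration rather than drawn from the published APEX-Agents task files.

```python
# Hypothetical sketch of a long-horizon, cross-application task specification;
# names and values are illustrative, not taken from the actual benchmark.
example_task = {
    "task_id": "law_ma_review_042",
    "role": "corporate_lawyer",
    "instructions": (
        "Review the acquisition agreement, reconcile the earn-out clause "
        "with the financial model, and draft a one-page memorandum."
    ),
    "workspace_files": [
        "agreements/acquisition_agreement.pdf",
        "models/earnout_model.xlsx",
    ],
    "available_tools": ["pdf_reader", "spreadsheet", "word_processor"],
    "rubric": [
        "Cites the correct earn-out clause of the agreement",
        "Memorandum figures match the spreadsheet model",
        "Memorandum saved in the required folder and format",
    ],
}
```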

To ensure the reliability of the results, the researchers used the Pass@1 metric. This rigorous scoring method requires the agent to complete the task correctly on its first attempt, mirroring the expectations of a professional environment, where constant supervision or multiple retries would negate the productivity gains of using an AI. The benchmark consists of 480 discrete tasks, each accompanied by a detailed rubric and "gold outputs": the human-verified correct answers used to grade the AI’s performance.
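The metric itself is easy to state in code: under the assumption that each of the 480 tasks is attempted exactly once and graded pass/fail against its rubric, Pass@1 is simply the share of first attempts that pass. The sketch below is illustrative only, not the researchers' grading code.

```python
def pass_at_1(first_attempt_passed: dict[str, bool]) -> float:
    """Share of tasks whose single (first) attempt satisfied the rubric."""
    return sum(first_attempt_passed.values()) / len(first_attempt_passed)

# Illustrative numbers: roughly 115 of 480 first attempts passing
# corresponds to the reported 24.0%.
results = {f"task_{i:03d}": i < 115 for i in range(480)}
print(f"Pass@1 = {pass_at_1(results):.1%}")  # Pass@1 = 24.0%
```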

Performance Analysis: A New Hierarchy of Intelligence

The results of the initial testing phase reveal a clear hierarchy among the industry's most advanced models. According to the data, Gemini 3 Flash (Thinking=High) emerged as the top performer, achieving a Pass@1 score of 24.0%. It was followed closely by GPT-5.2 (Thinking=High) and Claude Opus 4.5 (Thinking=High), with Gemini 3 Pro (Thinking=High) rounding out the top tier of the leaderboard. These results are particularly notable because they highlight the efficacy of "thinking" or "reasoning" modes: configurations in which the model is given additional compute time to work through internal logic before generating an external action.

The success of the "Thinking=High" variants across all major model families (Google's Gemini, OpenAI's GPT, and Anthropic's Claude) suggests that the ability to plan and self-correct internally is the primary differentiator in professional-grade tasks. However, even the highest score of 24.0% serves as a sobering reminder of the current state of the art. While these models are making significant strides in reasoning, more than three-quarters of the professional tasks in the benchmark remained beyond their reach, often due to failures in tool integration or the accumulation of small errors over long sequences of actions.

Open-Sourcing Professional Standards: Archipelago

A significant contribution of this research is the commitment to transparency and reproducibility. Along with the benchmark itself, Vidgen, Mann, and Fennelly have open-sourced Archipelago, infrastructure designed specifically for agent execution and evaluation. Archipelago allows other researchers to plug in different models and test them against the same professional scenarios, providing a standardized "sandbox" that mimics a real-world workstation.
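Archipelago's exact interface is not detailed here, but harnesses of this kind typically follow a simple loop: the model proposes an action, the sandboxed environment executes it, and the resulting observation is fed back to the model. The sketch below illustrates that loop with hypothetical names; it is not Archipelago's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepResult:
    action: str        # e.g. "open_spreadsheet", "draft_memo"
    succeeded: bool
    observation: str   # what the sandboxed workstation returned

def run_task(agent: Callable[[str], str],
             execute: Callable[[str], StepResult],
             task_prompt: str,
             max_steps: int = 50) -> list[StepResult]:
    """Drive one agent through one sandboxed task, recording every step."""
    trace: list[StepResult] = []
    context = task_prompt
    for _ in range(max_steps):
        action = agent(context)      # the model proposes its next action
        result = execute(action)     # the sandbox applies it to files and tools
        trace.append(result)
        if action == "submit" or not result.succeeded:
            break                    # stop on final submission or a hard failure
        context += "\n" + result.observation
    return trace
```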

By releasing the prompts, rubrics, gold outputs, and metadata, the researchers have created a public utility for the AI community. This open-source approach is intended to prevent "benchmark saturation," where models are inadvertently trained on the test data. The granularity of the APEX-Agents metadata, which tracks not just whether a model succeeded but where in the sequence of actions it failed, provides developers with a roadmap for improving agentic persistence and tool-use accuracy.
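Step-level traces of that kind lend themselves to compact per-task records. A hypothetical example of what such metadata might look like follows; the field names are invented for illustration and are not taken from the published APEX-Agents files.

```python
# Hypothetical per-task record; field names are illustrative only.
failure_record = {
    "task_id": "ib_merger_model_017",
    "passed": False,
    "failed_at_step": 23,                      # where in the action sequence it broke
    "failed_action": "paste_range_into_memo",  # the action that went wrong
    "rubric_items_met": 9,
    "rubric_items_total": 14,
}
```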

The Implications for Corporate Labor

The implications of the APEX-Agents findings for the professional services sector are multifaceted. On one hand, the ability of Gemini 3 and GPT-5.2 to navigate complex legal and financial files marks a significant milestone in AI capabilities. On the other hand, the low absolute success rates suggest that AI is currently better suited as a sophisticated assistant rather than a full replacement for human analysts. The "brittleness" of agents—their tendency to fail when faced with unexpected software behavior or ambiguous instructions—remains the primary hurdle to widespread adoption.

For firms in investment banking and law, the benchmark provides a framework for "Build vs. Buy" decisions. It suggests that while general-purpose models are becoming more capable, the gap between general reasoning and domain-specific execution is still vast. Organizations may need to invest heavily in specialized "wrappers" or fine-tuning to bring these models up to the 90% or 95% accuracy levels required for autonomous client-facing work.

Future Directions: The Path to 100%

Looking ahead, the researchers indicate that the next frontier for APEX-Agents will involve expanding the diversity of professional tools and increasing the length of the tasks even further. As AI companies release models with even larger context windows and more sophisticated internal reasoning chains, the benchmark will serve as a persistent "stress test" for the industry. The goal is to move the needle from the current 24.0% success rate toward a level of reliability that matches human output.

Ultimately, APEX-Agents establishes a new baseline for what it means for an AI to be "productive." It moves the conversation beyond the novelty of a chat interface and into the practicalities of professional labor. As autonomous agents continue to evolve, the metrics provided by Vidgen, Mann, and Fennelly will likely remain a critical yardstick for measuring the transition from AI that talks to AI that works.

James Lawson

Investigative science and tech reporter covering AI, the space industry, and quantum breakthroughs

University College London (UCL) • United Kingdom

Readers' Questions Answered

Q: What is the APEX-Agents benchmark?
A: The APEX-Agents benchmark, or AI Productivity Index for Agents, evaluates AI agents' performance on complex, long-horizon, cross-application tasks in professional services like investment banking, management consulting, and corporate law.[1][2] It features 480 tasks across 33 simulated 'worlds' with diverse files and tools (e.g., Calendar, Mail, Spreadsheets), using Pass@1 as the primary metric for single-run success based on expert rubrics.[1] Developed by Mercor, it highlights gaps in agent consistency and capability compared to human professionals.[2][3]
Q: How does GPT-5 compare to Gemini 3 in professional reasoning?
A: Gemini 3 Pro outperforms GPT-5 on professional reasoning benchmarks like GPQA Diamond (91.9% vs. GPT-5's 85.7% with thinking), establishing it as the leader in pure reasoning and scientific tasks.[1][2] GPT-5 remains competitive, particularly with tools (89.4%) and adaptive reasoning, but trails on novel, complex problems such as Humanity's Last Exam, where Gemini 3 scores 37.5%.[1] Later models such as GPT-5.2 slightly edge Gemini 3 Pro on GPQA Diamond at 92.4%, though this comparison concerns GPT-5 specifically.[4]
Q: Can AI agents execute long-horizon cross-application tasks?
A: Yes, AI agents can execute long-horizon cross-application tasks, as demonstrated by frameworks like MUSE, which achieves state-of-the-art performance on the TAC benchmark, involving tasks that exceed 40-100 action steps across multiple applications, using a lightweight Gemini-2.5 Flash model.[1] Benchmarks such as APEX-Agents specifically evaluate AI agents on long-horizon, cross-application professional services tasks,[8] while SWE-Bench Pro tests capabilities in complex software engineering scenarios requiring cross-file reasoning.[3] Ongoing advancements, including exponential improvements in task length completion (doubling every 7 months), indicate growing proficiency, though substantial gaps remain in real-world reliability.[2][4]
