AI GameStore: A New Framework for AGI Evaluation

As artificial intelligence models saturate traditional benchmarks, researchers are shifting focus to a more dynamic proving ground: the vast, evolving library of games designed for human enjoyment. The proposed AI GameStore framework aims to measure true general intelligence by testing models across the "Multiverse of Human Games" found on platforms like Steam and the Apple App Store.

The quest for Artificial General Intelligence (AGI) has reached a critical bottleneck as traditional evaluation methods fail to keep pace with rapidly improving model capabilities. To address this, researchers including José Hernández-Orallo, Joshua B. Tenenbaum, and Samuel J. Gershman have introduced the AI GameStore, a scalable framework that tests machine intelligence against the "Multiverse of Human Games." By leveraging titles from platforms like Steam and the Apple App Store, this open-ended platform provides a more rigorous and dynamic environment than static benchmarks, measuring an agent’s ability to learn and adapt across diverse human experiences.

The Saturation of Conventional AI Benchmarks

Conventional AI benchmarks typically assess only narrow capabilities and quickly become saturated as developers optimize for specific test parameters. Traditional evaluations are often static, meaning models can eventually "memorize" or overfit to the data, leading to inflated scores that do not reflect true general intelligence. This phenomenon creates a false sense of progress where models appear highly capable in controlled settings but fail when faced with novel, real-world complexity.

Static environments lack the open-endedness required to simulate the unpredictable nature of human life. When a benchmark remains unchanged for years, it ceases to be a measure of generalization and becomes a measure of specialized optimization. The research team argues that for a machine to demonstrate AGI, it must show proficiency in environments it has never seen before, requiring a transition from fixed datasets to a vast, evolving library of interactive challenges.

What is the AI GameStore and how does it work?

The AI GameStore is a scalable, open-ended platform for evaluating machine general intelligence using human games: games designed by humans, for humans. It works by employing Large Language Models (LLMs) with humans in the loop to synthesize representative games, automatically sourcing titles from popular platforms like the Apple App Store and Steam and adapting them into standardized, containerized variants.

The system operates by identifying diverse game mechanics and translating them into environments that Vision-Language Models (VLMs) can interact with. By using a "human-in-the-loop" approach, the researchers ensure that the generated games maintain the qualitative essence of human entertainment while remaining computationally accessible for AI agents. This process allows for the rapid generation of representative human games, creating a testing ground that is as broad as the human imagination itself.
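
To make this concrete, the interaction loop can be pictured as a thin, uniform wrapper around each containerized game. The sketch below is a plain-Python illustration under assumed names (GameContainer, StepResult, and play_episode are hypothetical, not the authors' published API): the agent receives rendered frames, returns text actions, and accumulates the game's native score over a short play episode.

    # Minimal sketch of a uniform wrapper around one containerized game.
    # Names (GameContainer, StepResult, play_episode) are illustrative
    # assumptions, not the paper's actual interface.
    from dataclasses import dataclass

    @dataclass
    class StepResult:
        frame: bytes   # rendered screenshot the VLM "sees"
        score: float   # the game's native score signal
        done: bool     # whether the episode has ended

    class GameContainer:
        """One standardized, containerized game behind an observe/act interface."""
        def reset(self) -> bytes:
            raise NotImplementedError

        def step(self, action: str) -> StepResult:
            raise NotImplementedError

    def play_episode(game: GameContainer, agent, max_steps: int = 500) -> float:
        """Run one short play episode: the agent sees frames, emits text actions."""
        frame = game.reset()
        total = 0.0
        for _ in range(max_steps):
            result = game.step(agent.act(frame))  # e.g. "tap 120,340"
            frame, total = result.frame, result.score
            if result.done:
                break
        return total

Keeping the interface this narrow is what would let the library scale: any newly synthesized or adapted game only has to implement reset and step to join the benchmark.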

Defining the Multiverse of Human Games

The "Multiverse of Human Games" represents the total sum of all games created by people to be enjoyed by people, serving as a proxy for human cognitive diversity. Unlike synthetic tasks created specifically for AI training, these games are built around human intuition, social logic, and physical common sense. This makes them a uniquely effective tool for measuring how well an AI can navigate a world designed for human general intelligence.

In their proof of concept, the researchers selected 100 games based on top charts from Steam and the Apple App Store. These games were not simplified for the AI; rather, they were standardized into containerized environments to ensure reproducibility. By testing frontier VLMs on these titles, the study creates a direct comparison between machine performance and the "human average," revealing exactly where current architectures fall short of human-level reasoning.
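
Because every game has its own scoring scale, such a comparison only becomes meaningful after normalizing against human play. Here is a minimal sketch assuming the simplest normalization, a ratio of agent score to the human-average score; the study's exact protocol may differ:

    def human_normalized(agent_score: float, human_avg: float) -> float:
        """Express an agent's score as a fraction of the human average.

        A value of 1.0 means human-average performance; the study reports
        frontier VLMs landing below 0.10 on most of the 100 games.
        """
        if human_avg <= 0:
            raise ValueError("human average must be positive")
        return agent_score / human_avg

    # Hypothetical numbers: an agent scoring 45 where humans average 900
    # reaches 0.05, i.e. 5% of the human average.
    print(human_normalized(45.0, 900.0))  # 0.05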

How does AI GameStore compare to ARC-AGI for measuring intelligence?

The AI GameStore evaluates AI on a broad "Multiverse of Human Games" drawn from real-world platforms, providing scalable and diverse benchmarks beyond static tests. In contrast, ARC-AGI focuses on abstract reasoning tasks, while the AI GameStore tests practical skills like world-model learning, memory, and planning in dynamic game environments. Frontier models score under 10% of the human average on most AI GameStore games, highlighting broader gaps than ARC-AGI might reveal.

While ARC-AGI (the Abstraction and Reasoning Corpus) is highly regarded for its focus on fluid intelligence, the AI GameStore offers a more expansive "common sense" test. Games require more than just pattern recognition; they require:

  • Long-term memory to track objects and objectives across levels.
  • Spatial reasoning to navigate 2D and 3D environments.
  • Strategic planning to manage resources and anticipate opponent moves.
  • Causal inference to understand how different game mechanics interact.

This multi-dimensional approach ensures that the AGI being measured is functional and grounded in reality, rather than purely mathematical.

Why do current AI models struggle with games requiring world-model learning?

Current AI models struggle with games requiring world-model learning because they lack robust abilities to build internal representations of game physics, object persistence, and environmental dynamics. Evaluations on AI GameStore games show these models falter especially in tasks demanding memory retention across episodes and multi-step planning. Because they perform far below human scores, these failures point to fundamental limitations on the path to human-like general intelligence.

Researchers found that even the most advanced Vision-Language Models, evaluated over short play episodes, achieved less than 10% of the human average score on the majority of the 100 test games. The core issue lies in the gap between predicting the next token and understanding cause and effect. In a gaming environment, an agent must predict how its actions will change the state of the world. Without a sophisticated world model, the AI cannot effectively plan its moves, leading to "hallucinated" strategies that fail when the game’s physics or logic do not match the AI's internal statistical correlations.
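
One way to picture this gap is the contrast between a reactive policy, which maps the current frame straight to an action, and an agent that consults a learned world model before committing to a move. The sketch below is purely illustrative; neither class is from the paper, and the predictor and value function are assumed to be supplied:

    # Illustrative contrast only; neither class is the paper's implementation.

    class ReactivePolicy:
        """Maps the current frame directly to an action, with no lookahead.

        Roughly how today's VLM agents behave, and why their plans can
        "hallucinate" consequences the game never actually delivers.
        """
        def __init__(self, policy_fn):
            self.policy_fn = policy_fn  # frame -> action

        def act(self, frame):
            return self.policy_fn(frame)

    class WorldModelAgent:
        """Plans by predicting how each candidate action changes the world."""
        def __init__(self, predict, value, actions):
            self.predict = predict   # (state, action) -> predicted next state
            self.value = value       # state -> how promising that state looks
            self.actions = actions   # available actions

        def act(self, state):
            # One-step lookahead: simulate every action, take the best outcome.
            return max(self.actions,
                       key=lambda a: self.value(self.predict(state, a)))

Even this one-step lookahead depends entirely on the predictor being faithful to the game's actual dynamics, which is precisely the capability the study finds lacking.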

The Future of AGI Testing and Scalability

The future of AGI evaluation lies in moving toward a reality where machines are judged by their ability to learn any task, not just pre-defined ones. The AI GameStore provides a roadmap for this shift by offering a platform that can scale alongside AI development. As models improve, the "store" can be updated with more complex games, ensuring that the benchmark remains a "moving target" that resists saturation and continues to drive innovation in machine learning.

The implications of this research extend beyond gaming. By using human entertainment as a scientific yardstick, the team has identified specific architectural weaknesses in current models, particularly regarding physical intuition and long-range planning. Addressing these gaps is essential for the development of AGI that can safely and effectively operate in the physical world. Moving forward, the team aims to expand the AI GameStore to include even more diverse genres, further challenging the boundaries of what machines can understand and achieve.

James Lawson

Investigative science and tech reporter focusing on AI, space industry and quantum breakthroughs

University College London (UCL) • United Kingdom

