
Why games may not be the best benchmark for AI

Human teams competing against OpenAI Five, OpenAI's Dota-playing artificial intelligence.
Image Credit: OpenAI



In 2019, San Francisco-based AI research lab OpenAI held a tournament to tout the prowess of OpenAI Five, a system designed to play the multiplayer battle arena game Dota 2. OpenAI Five defeated a team of professional players — twice. And when made publicly available, OpenAI Five managed to win against 99.4% of people who played against it online.

OpenAI has invested heavily in games for research, developing libraries like CoinRun and Neural MMO, a simulator that plops AI in the middle of an RPG-like world. But that approach is changing. According to a spokesperson, OpenAI hasn’t been using games as benchmarks “as much anymore” as the lab shifts its focus to other domains, including natural language processing.

OpenAI’s deemphasis of games like Dota 2, platformers, and hide-and-seek reflects a split among experts over the value of games in AI research. While some believe that games can lead to new insights and spawn AI systems with commercial applications, others think that AI created to play games is pigeonholed by design.

“I do think that games have a tendency to get people very excited, because they can relate — because people played [games like Dota 2] and they know that they were hard for them,” Richard Socher, the founder of You.com and former Salesforce chief AI scientist, told VentureBeat. “But it’s a little bit like when you’re excited your computer can multiply up to very large numbers. [These systems are] ultimately not that intelligent … [They haven’t] really created value in the world outside of playing that game.”


Most AI applied to games falls into the category of reinforcement learning, where a system is given a set of actions that it can apply to its environment. The system — which usually starts knowing nothing about the environment — receives rewards based on how its actions bring it closer to a goal. As the system gradually receives feedback from the environment, it learns sequences of actions that can maximize its rewards.
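As a rough sketch of that loop, consider tabular Q-learning on a toy corridor world. Everything here (the environment, the reward, the hyperparameters) is an illustrative assumption rather than a description of any system mentioned in this article; real game-playing systems replace the table with deep neural networks.

```python
import random

# Toy corridor: the agent starts in cell 0 and earns a reward only
# for reaching the rightmost cell. Names and values are illustrative.
N_STATES = 6            # cells 0..5; cell 5 is the goal
ACTIONS = [-1, +1]      # step left or step right
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
EPISODES = 500

# Q-table: estimated future reward for each (state, action) pair.
# The agent starts knowing nothing about the environment.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(EPISODES):
    state = 0
    while state != N_STATES - 1:
        # Epsilon-greedy: usually exploit the best-known action,
        # occasionally explore a random one.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        # Nudge the estimate toward the observed reward plus the
        # discounted value of the best next action.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# After training, the greedy policy steps right from every cell.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)])
```

Systems like OpenAI Five scale this same trial-and-error pattern up with deep neural networks and years’ worth of simulated play.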

Socher notes that, unlike the real world, games provide a theoretically infinite amount of data to train AI systems. For example, to develop OpenAI Five, OpenAI had the system play the equivalent of 180 years’ worth of games every day for weeks. DeepMind’s AlphaStar, a system that can beat top players at the strategy game StarCraft II, learned from hundreds of thousands of examples of matches released by the game’s publisher, Activision Blizzard. And one version of an Uber-designed, Atari-game-playing system called Go-Explore took 58 hours of continuous play to achieve the top score in Montezuma’s Revenge.

“[Games] have progressed the research world with some interesting new [ideas,] but the trouble is, a lot of times, people believe that things that are hard for humans are hard for computers,” Socher said. “Once AI could solve chess, it didn’t really become smarter than people — it just got good at chess. That’s the fallacy that we’ve seen now … [these types of] algorithms aren’t generally intelligent, they can just play certain games very well.”

A brief history of games in AI

Games have been conceived as AI benchmarks for decades. As Digital Trends’ Luke Dormehl writes, the American mathematician Claude Shannon argued in 1949 that games like computer chess presented a worthy challenge for “intelligent” software. Games distill problems into actions, states, and rewards and yet require reasoning to excel at, Shannon argued, while possessing a structure in keeping with the manner in which computers solve problems.

In 1996, IBM famously set Deep Blue loose on chess, and it became the first program to defeat a reigning world champion, Garry Kasparov, in a game played under regular time controls. (It went on to win a full match against Kasparov in a 1997 rematch.) Leveraging 30 top-of-the-line microprocessors, Deep Blue evaluated 200 million board positions every second and drew on a memory bank of hundreds of thousands of previous, master-level chess games.

In 2011, IBM’s Watson AI faced off against former Jeopardy! champions Ken Jennings and Brad Rutter in a multipart televised special of the game show. With access to 200 million pages of content, including the full text of Wikipedia as of 2011, Watson handily beat the pair, winning $1 million in prizes.

In 2013, DeepMind demoed a system that could play Pong, Breakout, Space Invaders, Seaquest, Beamrider, Enduro, and Q*bert at “superhuman” levels. Three years later, DeepMind’s AlphaGo won four of five games of Go in a match against Lee Sedol, one of the highest-ranked players in the world. In 2017, an improved version of the system — AlphaZero — defeated world-champion programs at chess, a Japanese variant of chess called shogi, and Go. And in 2020, DeepMind demoed MuZero, which picks up the rules of games like chess as it plays.

Labs have more recently developed AI that can play games of imperfect information, like poker, with high skill. In contrast to perfect information games such as chess and shogi, imperfect information games hide some information from players during play (e.g., another player’s hand in poker). Two years ago, Pluribus, developed by Facebook and Carnegie Mellon, became one of the first systems to beat professionals at six-player no-limit Texas Hold’em. DeepMind’s Player of Games also shows strong performance on the strategy game Scotland Yard as well as perfect information games including chess.

An imperfect measure

Some researchers argue that systems like Player of Games, which can reason about others’ goals and motivations, could pave the way for AI that can successfully work with others. Tasks like route planning around congestion, contract negotiations, and even interacting with customers all involve compromise and consideration of how people’s preferences coincide and conflict, as in games.

“Throughout human societies, people engage in a wide range of activities with a diversity of other people,” the researchers behind an AI benchmark for Hanabi write. “With such complex … interactions playing a pivotal role in human lives, it is essential for artificially intelligent agents to be capable of cooperating effectively with other agents, particularly humans.”

Beyond Hanabi and board games like Diplomacy, Microsoft’s Minecraft — which has straightforward goals like acquiring enough food to not starve — has been proposed as a training ground for this type of collaborative AI. Researchers at DeepMind and the University of California, Berkeley, recently launched a competition called BASALT where the goal of an AI system must be communicated through demonstrations, preferences, or some other form of human feedback.


Above: The Diplomacy game board.

Image Credit: DeepMind

“Video games … have provided an extremely valuable sandbox for researchers looking to teach agents to complete complex tasks,” Luca Weihs, a research scientist at the Allen Institute for Artificial Intelligence, told VentureBeat. “This owes in large part to the extensive visual diversity across games, variety of strategies required for success, and fast simulation speed enabling large-scale experimentation.”

But despite their convenience from a research perspective, Weihs believes that games are a flawed AI benchmark because of their abstractness and relative simplicity. He notes that even the best game-playing systems, like AlphaStar, generally struggle to reason about the states of other AI systems, don’t adapt well to new environments, and can’t easily solve problems they haven’t seen before — particularly problems that must be solved over long time horizons.

For example, a reinforcement learning model that can play StarCraft II at an expert level won’t be able to play a game with similar mechanics at any level of competency. Even slight changes to the original game degrade the model’s performance. OpenAI Five mastered only 16 of Dota 2’s more than 100 playable characters, and non-champion players found strategies to reliably beat the system within days of it being made public.

Mike Cook, an AI researcher and game designer at Queen Mary University of London, agrees that games “aren’t that special” as a benchmark for AI. What really matters about games, he says, is the role they have in society and culture. But he believes that researchers are running out of both low-hanging fruit and cultural touchstones for the non-gaming public.

“Chess and Go were obvious targets [for AI benchmarks] because of their historical importance both in computer science and in wider human culture as games ‘clever people’ play,” Cook told VentureBeat via email. “From there, where do you go? Well, you need games that (1) have a clear benchmark you can say you’ve beaten, (2) are understood or at least vaguely known by people who are not gamers, and (3) feel meaningful to beat … Playing chess is playing chess; the computer clearly doesn’t have an ‘edge’ there because the game is played in the mind. But if we tried to get an AI to play Call of Duty [or Quake III Arena] — games which fulfill the first two criteria — it might not feel like a meaningful win because people expect computers to have fast reactions.”

Innovation through play

Others disagree. Nvidia — which has a vested interest in gaming hardware — stands behind the idea that games remain an important area of AI research, in particular for reinforcement learning. Bryan Catanzaro, VP of applied deep learning research at Nvidia, describes games as “clearly defined sandboxes” with rules and objectives that the real world lacks.

“Teaching AI agents to navigate them helps us work towards building generally useful agents that can help us solve problems in the real world,” Catanzaro told VentureBeat via email. “Also, they’re just a lot of fun to work with.”

Microsoft, too, believes in the power of gaming as a platform for AI development, pointing to efforts like the ongoing Project Paidia. A joint initiative between Microsoft Research Cambridge and Microsoft-owned game studio Ninja Theory, Project Paidia aims to drive research in reinforcement learning by enabling systems to learn to collaborate with video game players.

Game engine vendor Unity is engaged in work in a similar vein. Its ML-Agents Toolkit plugin allows AI to acquire new skills and behaviors via reinforcement learning, in which the only feedback a system receives in a given virtual environment is a reward signal indicating whether its behavior was correct. In partnership with Google, Unity created Obstacle Tower, a video game designed to challenge a system’s ability to navigate obstacles including puzzles, complicated layouts, and dangerous enemies.


Above: DeepMind’s XLand learning environment.

Image Credit: TechTalks

Recently, Microsoft’s Project Paidia turned to “designer-centered” reinforcement learning, with the goal of developing a tunable system (e.g., a robot) that learns to behave in lifelike ways without a developer having to hardcode every natural behavior. Project Paidia has also uncovered techniques for helping AI systems collaborate with each other in the multiplayer combat game Bleeding Edge.

“With projects like this, we’re showing how AI is shifting from competitive applications to being used to empower players to achieve more,” Microsoft principal researcher Sam Devlin said in an interview.

In one of the more promising projects to date, DeepMind created an engine — XLand — that can generate environments in which researchers can train AI systems on a number of tasks. Each new task is generated according to a system’s training history, in a way that helps distribute the system’s skills across challenges like “capture the flag” and “hide and seek.” After over a month of training, DeepMind claims, systems in XLand demonstrate humanlike behaviors such as teamwork and object permanence, awareness of the basics of their bodies and the passage of time, and knowledge of the high-level structure of the games they encounter.

Moving beyond games

Games have informed the development of AI deployed in the real world. For example, reinforcement learning tools are used not only in robotic control, software testing and security, industrial machines, chipset design, drug design, self-driving cars, and video compression, but also in systems that determine which videos and ads are shown to users online. Similarly, search algorithms — which allow AI systems to find their way in video games — support automatic route planning in navigation systems.
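To make that connection concrete, here is a minimal sketch of the kind of search used both for in-game pathfinding and for route planning: A* over a small grid. The grid, the uniform step cost, and the Manhattan-distance heuristic are illustrative assumptions, not details of any particular navigation system.

```python
import heapq

def astar(grid, start, goal):
    """Shortest path on a grid of 0s (free) and 1s (walls) using A*.

    Manhattan distance is an admissible heuristic for 4-directional
    movement with unit step costs.
    """
    rows, cols = len(grid), len(grid[0])

    def h(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    # Each entry: (estimated total cost f, cost so far g, cell, path).
    frontier = [(h(start), 0, start, [start])]
    visited = set()
    while frontier:
        _, g, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return path
        if cell in visited:
            continue
        visited.add(cell)
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                step = (nr, nc)
                heapq.heappush(frontier, (g + 1 + h(step), g + 1, step, path + [step]))
    return None  # goal unreachable

# A small map with a wall the path must route around.
grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))
```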

Further demonstrating the potential usefulness of games, Go-Explore was used to improve the training of a robotic arm in the real world. Researchers from the University of Eastern Finland and Aalto University also claim to have successfully “transferred” skills learned by an AI in a video game — Doom — to a real-world robot.

Some of DeepMind’s top scientists recently published a paper hypothesizing that reward maximization via reinforcement learning is enough to eventually reach artificial general intelligence (AGI), or AI systems that can accomplish any task. “[Systems like AlphaZero are] a stepping stone for us all the way to general AI,” DeepMind CEO Demis Hassabis told VentureBeat in a 2018 interview. “The reason we test ourselves and all these games is … that [they’re] a very convenient proving ground for us to develop our algorithms. … Ultimately, [we’re developing algorithms that can be] translate[d] into the real world to work on really challenging problems … and help experts in those areas.”

Setting aside the fact that not every expert believes that AGI is achievable, researchers — while acknowledging games’ contribution to the field of AI — are looking at games with an increasingly skeptical eye. In an interview with The Verge, Francois Chollet, a software engineer at Google and a well-known figure in the AI community, says that the motivation to pursue blockbuster games as training benchmarks boils down to public relations plays.

“If the public wasn’t interested in these flashy ‘milestones’ that are so easy to misrepresent as steps toward superhuman general AI, researchers would be doing something else,” he said. “I don’t really see it as scientific research because it doesn’t teach us anything we didn’t already know … If the question was, ‘Can we play X at a superhuman level?,’ the answer is definitely, ‘Yes, as long as you can generate a sufficiently dense sample of training situations and feed them into a sufficiently expressive deep learning model.’ We’ve known this for some time.”

Meanwhile, experts like Noam Brown, a research scientist at Meta (formerly Facebook), aren’t convinced that even state-of-the-art game-like environments like XLand achieve what their creators set out to accomplish. AI systems trained in XLand have to stumble upon an interesting area by chance and then be encouraged to revisit that area until it’s no longer “interesting” — unlike humans.

Part of the problem is the mechanism used to reward the AI. “Sparse” rewards pay out only when a system accomplishes a specified goal, which risks leaving it with no learning signal at all for long stretches. “Dense” rewards guide a system along the way to a task, but can produce a rigid system that doesn’t generalize to new scenarios.
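As a rough illustration of the tradeoff, consider a hypothetical navigation task where a distance-to-goal measurement is assumed to come from the environment. The two schemes might look like this:

```python
# Hypothetical reward functions for a navigation task. The task and
# the dist_to_goal signal are illustrative assumptions, not details
# of any system discussed in this article.

def sparse_reward(reached_goal: bool) -> float:
    # All-or-nothing: no signal until the task is complete, so a
    # system may wander for a long time without learning anything.
    return 1.0 if reached_goal else 0.0

def dense_reward(prev_dist_to_goal: float, dist_to_goal: float) -> float:
    # Shaped: any step that moves the system closer pays out. This
    # guides learning but hard-codes the designer's assumption that
    # "closer is always better," a proxy that may not hold in new
    # scenarios.
    return prev_dist_to_goal - dist_to_goal
```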

Newer research from Caltech and UC Berkeley illustrates the problem. It finds that, as a reinforcement learning model trained to play the Atari game Riverraid scales up, it becomes more likely to seek a “proxy” or false reward rather than the true reward. According to the coauthors, reward designers will likely need to take greater care to specify reward functions accurately as larger models become more common.


Above: DeepMind’s AlphaStar competing against a human player.

Image Credit: DeepMind

“Just because a game is complex doesn’t necessarily mean it’s difficult for an AI,” Brown told VentureBeat in an interview. “Video games are not necessarily more difficult than board or card games. For example, [Valve’s] Counter-Strike is a popular 3D real-time game that involves cooperation, competition, long-term planning, and partial observability. On paper, this sounds like a really difficult game for an AI to play, but bots have been able to beat human players in Counter-Strike since the ’90s. That’s because you can do quite well in Counter-Strike if you have fast reflexes and accurate aim, which are both things that machines excel at. Whether or not a game is a good benchmark depends on whether the techniques needed to play it well are more difficult for machines than for humans, such as communication skills and fast adaptation.”

IBM, for all its work in games (including the more recent Project Debater), says it’s moving away from “benchmark-focused” AI development in favor of alternative approaches. Ruchir Puri, chief scientist for IBM Research, places the blame on games’ “narrow,” “nuanced” task focus and the increasing difficulty of understanding and applying benchmarks to these “evolving” systems.

“Games have helped drive significant innovation in AI, from chess to Go and beyond. That said, the strategy of creating AI with game benchmarks in mind at this stage of AI [creation], where AI is graduating to impact enterprises by being infused into business and consumer processes, is somewhat narrow-focused,” Puri told VentureBeat in an interview. “Instead of focusing on an AI ‘beating’ a specific benchmark, we should instead measure a system on the diversity and range of tasks it can perform, coupled with its ability to demonstrate more human-like reasoning and understanding.”

Potential answers

Cook’s solution is games that pose a more relevant, general challenge to AI than, say, soccer or Pong. He suggests the Jackbox Party Pack series, which requires a mix of creativity, bluffing, intuition, and humor. As a piece in Time earlier this month underlined, the prospect of AI that understands what humans find funny — and that can generate its own genuinely funny material — is a holy grail for a subset of AI researchers, because it could demonstrate a theory of mind.

“That’s a really challenging problem [and would] advance the field greatly … but it’s not a widely tackled issue [and] it’s way hard,” Cook said.

Brown also believes that interesting algorithmic lessons could be learned from the right game, like simulations or those that require a complex use of language. To that end, Meta in January released the NetHack Learning Environment (NHLE), a research tool based on the game NetHack, which tasks players with descending dungeon levels to retrieve a magical amulet. Levels in NetHack are procedurally generated, meaning every game is different, and success often hinges on consulting sources like the official NetHack Guidebook, the NetHack Wiki, online videos, and forum discussions.

“Certain games are still important AI benchmarks, but it depends on the game,” Brown told VentureBeat. “Now that AIs are able to beat human players at games like Go and poker, nobody would be surprised if an AI system beat expert humans in a game like gin rummy. But there are some games that are still incredibly difficult for AI algorithms and that will require fundamentally new techniques.”

Chollet proposes a game-like benchmark called ARC, which covers a set of reasoning tasks where each task is explained via a small sequence of demonstrations. An AI system has to learn to accomplish the task from these few demonstrations. While ARC is solvable by humans without verbal explanations or prior training, it remains unapproachable by most AI techniques that have been tried so far.
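A toy version of that few-shot setup, with a deliberately tiny and hypothetical rule space, might look like the following. Real ARC tasks use 2-D colored grids and a far richer space of transformations; this sketch only illustrates the shape of the problem: infer a rule from a handful of demonstrations, then apply it to a new input.

```python
# Each task is a handful of input->output demonstrations; the solver
# must find a rule that explains all of them. The candidate rules are
# an illustrative stand-in, not part of the real ARC benchmark.

CANDIDATE_RULES = {
    "reverse": lambda seq: seq[::-1],
    "double": lambda seq: [x * 2 for x in seq],
    "sort": lambda seq: sorted(seq),
}

def solve(demos, test_input):
    """Return the first rule consistent with every demonstration."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(inp) == out for inp, out in demos):
            return name, rule(test_input)
    return None  # no known rule fits the demonstrations

demos = [([3, 1, 2], [1, 2, 3]), ([5, 4], [4, 5])]
print(solve(demos, [9, 7, 8]))  # ('sort', [7, 8, 9])
```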

NHLE and ARC — and benchmarks like them — could help to solve another problem with games in AI: their compute inefficiency. NHLE can train reinforcement learning agents 15 times faster than even decade-old Atari benchmarks because it only renders symbols instead of pixels and uses simplistic physics. That could lead to substantial cost savings, considering that DeepMind reportedly spent $35 million training the latest version of AlphaGo.

“My hunch is, if you [forced an AI system] to use language in a complex way, [it] couldn’t just try every illegal combination,” Socher said. “Games that include self-deception or language could be interesting to see if [the AI system] could bluff properly, but also through language.”
