On Tuesday, researchers from Google and Tel Aviv University unveiled GameNGen, a new AI model that can interactively simulate the classic 1993 first-person shooter Doom in real time, using AI image generation techniques borrowed from Stable Diffusion. It's a neural network system that acts as a limited game engine, and could open up new possibilities for real-time video game synthesis in the future.
For example, instead of using traditional rendering techniques to draw video frames, future games might use an AI engine to “imagine” or hallucinate graphics in real time as a predictive task.
“There's tremendous potential here,” app developer Nick Dobos wrote in response to the news. “Why hand-write complex rules for software when an AI can do the thinking for you, pixel by pixel?”
Using a single Tensor Processing Unit (TPU), a type of specialized GPU-like processor optimized for machine learning tasks, GameNGen is reportedly capable of generating new frames of Doom gameplay at over 20 frames per second.
In tests, 10 human raters struggled to reliably distinguish short clips of actual Doom gameplay (1.6 seconds and 3.2 seconds long) from GameNGen's output, correctly identifying the real footage only 58 percent or 60 percent of the time, the researchers said.
GameNGen in action, simulating Doom interactively using an image synthesis model.
Real-time video game synthesis using something called “neural rendering” isn't an entirely new idea: In an interview in March, Nvidia CEO Jensen Huang predicted, perhaps boldly, that within five to 10 years, most video game graphics will be generated in real time by AI.
GameNGen also builds on previous work in the field, including World Models in 2018, GameGAN in 2020, and Google's own Genie in March, all of which are cited in the GameNGen paper. And earlier this year, a group of university researchers trained an AI model (called “DIAMOND”) that uses a diffusion model to simulate vintage Atari video games.
Also leaning in a similar direction is ongoing research into “world models” or “world simulators,” the category of AI video synthesis model often associated with products such as Runway's Gen-3 Alpha and OpenAI's Sora. For example, when Sora debuted, OpenAI released demo footage showing the model simulating Minecraft.
Diffusion is key
In a preprint research paper titled “Diffusion Models Are Real-Time Game Engines,” authors Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter explain how GameNGen works: Their system uses a modified version of Stable Diffusion 1.4, an image synthesis diffusion model released in 2022 that is widely used to create AI-generated imagery.
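For the curious, the base model GameNGen starts from is the same publicly available Stable Diffusion 1.4 checkpoint that anyone can load, for example through Hugging Face's diffusers library. The snippet below is only an illustrative sketch, not the researchers' code; per the paper, the text conditioning gets replaced with embeddings of player actions and conditioning on previous frames is added, so mainly the denoising network and the image autoencoder carry over.

    # Illustrative only: not GameNGen's actual code. Loads the public
    # Stable Diffusion 1.4 checkpoint via Hugging Face's diffusers library.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        torch_dtype=torch.float16,
    )

    # GameNGen does not keep the text-to-image setup; per the paper, the text
    # conditioning is swapped for embeddings of the player's actions, and the
    # model is additionally conditioned on previous frames. The reusable parts
    # are the denoising U-Net and the image autoencoder:
    unet = pipe.unet   # denoising network that gets fine-tuned on gameplay
    vae = pipe.vae     # autoencoder mapping frames to and from latent space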
“The answer to the question 'can it run DOOM?' turns out to be yes for a diffusion model,” wrote Tanishq Mathew Abraham, director of research at Stability AI, who was not involved in the research project.
GameNGen architecture diagram provided by Google.
The diffusion model is trained on extensive footage of Doom gameplay and then, steered by the player's input, predicts the next game state from previous game states.
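As a rough sketch of what a single prediction step could look like in Python (the model interface here, including predict_next_frame, denoise, and decode, is a hypothetical placeholder rather than anything from the GameNGen codebase):

    # A hedged sketch of one prediction step; the model object and its
    # denoise/decode methods are hypothetical placeholders.
    import torch

    def predict_next_frame(model, past_frames, past_actions, denoise_steps=4):
        """Generate the next game frame conditioned on recent frames and actions.

        past_frames:  tensor of shape (context, channels, height, width)
        past_actions: tensor of shape (context,) holding integer action IDs
        """
        # Start from pure noise in the autoencoder's latent space and refine it
        # step by step, with the gameplay history supplied as conditioning.
        # Keeping the number of denoising steps small keeps generation fast.
        latent = torch.randn(model.latent_shape)
        for step in reversed(range(denoise_steps)):
            latent = model.denoise(latent, step,
                                   frames=past_frames, actions=past_actions)
        return model.decode(latent)  # decode the latent back into a visible frame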
Developing GameNGen involved a two-stage training process: First, the researchers trained a reinforcement learning agent to play Doom and recorded its gameplay sessions to create an automatically generated training dataset (the aforementioned footage). They then used that data to train a custom Stable Diffusion model.
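In outline, that pipeline might look something like the following hypothetical sketch, where the environment, agent, and model interfaces are placeholders rather than the team's actual tooling:

    # Hypothetical outline of the two-stage process described above; the
    # env, agent, and diffusion_model objects are placeholder interfaces.

    def record_gameplay(env, agent, num_episodes):
        """Stage 1: a trained RL agent plays Doom while frames and actions are logged."""
        dataset = []
        for _ in range(num_episodes):
            frame, done, episode = env.reset(), False, []
            while not done:
                action = agent.act(frame)          # the agent decides what to "press"
                episode.append((frame, action))    # log the (frame, action) pair
                frame, done = env.step(action)     # advance the real game engine
            dataset.append(episode)
        return dataset

    def train_game_model(diffusion_model, dataset, context=32):  # window size is illustrative
        """Stage 2: fine-tune the diffusion model so that, given a window of past
        frames and actions, it predicts the frame that actually came next."""
        for episode in dataset:
            for t in range(context, len(episode)):
                past_frames  = [frame for frame, _ in episode[t - context:t]]
                past_actions = [action for _, action in episode[t - context:t]]
                next_frame   = episode[t][0]       # ground-truth frame to predict
                diffusion_model.training_step(past_frames, past_actions, next_frame)
        return diffusion_model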
However, using Stable Diffusion does introduce some graphical glitches: “The pre-trained autoencoder in Stable Diffusion v1.4, which compresses 8×8 pixel patches into four latent channels, produces meaningful artifacts when predicting game frames, affecting fine details, especially the bottom bar HUD,” the researchers write in the paper.
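Some quick arithmetic shows why small on-screen elements suffer most. The 320×200 frame size below is classic Doom's native resolution, used purely as an example:

    # Back-of-the-envelope arithmetic (illustrative; 320x200 is classic Doom's
    # native resolution, used here only as an example frame size).
    height, width, color_channels = 200, 320, 3
    latent_h, latent_w, latent_c = height // 8, width // 8, 4   # 8x8 patches -> 4 channels

    pixel_values  = height * width * color_channels     # 192,000 numbers per frame
    latent_values = latent_h * latent_w * latent_c       #   4,000 numbers per frame

    print(latent_h, latent_w, latent_c)                  # 25 40 4
    print(pixel_values // latent_values)                 # 48, i.e. roughly 48x compression

    # Each 8x8 block of pixels must be reconstructed from just four numbers,
    # so tiny, high-detail elements like the HUD's digits are the first to smear.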
GameNGen in action, simulating Doom interactively using an image synthesis model.
The challenges don't end there: keeping an image visually sharp and consistent over time (often called “temporal consistency” in the AI video field) can be tricky. “Interactive world simulation is much more than very fast video generation,” the GameNGen researchers write in their paper. “The requirement to be conditional on a stream of input actions available only during generation breaks several assumptions of existing diffusion model architectures.” The system must repeatedly generate new frames based on its own previous frames (a process known as “autoregression”), which can lead to instability and a rapid degradation in the quality of the generated world over time.
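In code terms, that feedback loop might be sketched roughly like this, with generate_frame and the context window size as illustrative placeholders rather than GameNGen's actual interface:

    # Hedged sketch of the autoregressive loop; generate_frame and the window
    # size are illustrative placeholders, not the researchers' implementation.
    from collections import deque

    def play(model, initial_frames, get_player_action, num_steps, context_len=64):
        """Roll the simulated game forward one frame at a time."""
        frames  = deque(initial_frames, maxlen=context_len)
        actions = deque([0] * len(initial_frames), maxlen=context_len)  # no-op actions for warm-up frames
        for _ in range(num_steps):
            actions.append(get_player_action())           # read the live input stream
            new_frame = model.generate_frame(list(frames), list(actions))
            frames.append(new_frame)                       # the output becomes part of the next
            yield new_frame                                # step's context, so errors can compound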