What Happens Next?
Next Scene Prediction with a Unified Video Model

From creating videos of the present to predicting videos of future scenes.

Xinjie Li¹ Zhimin Chen² Rui Zhao² Florian Schiffers² Zhenyu Liao² Vimal Bhat²
¹Pennsylvania State University, USA
²Amazon, USA

Core Capabilities

Distinguishing between Text-to-Video Generation and Next Scene Prediction.

"A cat wearing sunglasses and working as a lifeguard at a pool."

"A boat sailing leisurely along the Seine River with the Eiffel Tower in background, pixel art."

"A teddy bear is swimming in the ocean."

Model Comparison

Each comparison shows the preceding scene description followed by outputs from Ours, Omni, Wan, and LTX.

Prompt: "Just finished setting up camera and lighting"
Prompt: "Just finished making the artwork"
Prompt: "Dark clouds begin to gather over the forest"
Prompt: "A glass sitting near the edge of the table begins to wobble"

Ablation Studies

Visualizing the impact of Pre-Training (PT), Supervised Fine-Tuning (SFT), and Reinforcement Learning (RL).
Prompt 1: Just finished setting up camera and lighting
Prompt 2: Just finished making the artwork

Stages shown for each prompt: Base → PT (Pre-Training Stage) → SFT (Supervised Fine-Tuning, refined) → RL (Reinforcement Learning, optimal)


Text-to-Video Comparison

Each comparison shows the caption followed by outputs from Ours and LTX.

Prompt: "A teddy bear is swimming in the ocean."
Prompt: "A beautiful coastal beach in spring, waves lapping on sand by Hokusai, in the style of Ukiyo."