What Happens Next?
Next Scene Prediction with a Unified Video Model

From creating videos of the present to predicting videos of future scenes.

Xinjie Li¹, Zhimin Chen², Rui Zhao², Florian Schiffers², Zhenyu Liao², Vimal Bhat²

¹ Pennsylvania State University, USA

² Amazon, USA

Core Capabilities

Distinguishing between Text-to-Video Generation and Next Scene Prediction.

"A cat wearing sunglasses and working as a lifeguard at a pool."

"A boat sailing leisurely along the Seine River with the Eiffel Tower in background, pixel art."

"A teddy bear is swimming in the ocean."

Model Comparison

Columns: Preceding Scene Description, Ours, Omni, Wan, LTX

Prompt "Just finished setting up camera and lighting"

Prompt "Just finished making the artwork"

Prompt "Dark clouds begin to gather over the forest"

Prompt "A glass sitting near the edge of the table begins to wobble"

Ablation Studies

Visualizing the impact of Pre-Training (PT), Supervised Fine-Tuning (SFT), and Reinforcement Learning (RL).
Prompt 1: Just finished setting up camera and lighting
Prompt 2: Just finished making the artwork

Panels shown for each prompt:
Base: PT (Pre-Training Stage)
Refined: SFT (Supervised Fine-Tuning)
Optimal: RL (Reinforcement Learning)

Text-to-Video Comparison

Columns: Caption, Ours, LTX

Prompt "A teddy bear is swimming in the ocean."

Prompt "A beautiful coastal beach in spring, waves lapping on sand by Hokusai, in the style of Ukiyo."