From creating videos of the present to predicting videos of future scenes.
Distinguishing between Text-to-Video Generation and Next Scene Prediction.
"A cat wearing sunglasses and working as a lifeguard at a pool."
"A boat sailing leisurely along the Seine River with the Eiffel Tower in background, pixel art."
"A teddy bear is swimming in the ocean."
Visualizing the impact of Pre-Training (PT), Supervised Fine-Tuning (SFT), and Reinforcement Learning (RL).
Prompt 1: Just finished setting up camera and lighting
Prompt 2: Just finished making the artwork
Panels shown for each prompt: Pre-Training Stage, Supervised Fine-Tuning, Reinforcement Learning.