Progressive Autoregressive Video Diffusion Models

Desai Xie1,2, Zhan Xu2, Yicong Hong2, Hao Tan2, Difan Liu2, Feng Liu2, Arie Kaufman1, Yang Zhou2
1Stony Brook University    2Adobe Research   


Autoregressive long video generation at 60 seconds with progressive noise levels.

Abstract

Current frontier video diffusion models have demonstrated remarkable results in generating high-quality videos. However, they can only generate short video clips, typically around 10 seconds or 240 frames, due to computation limits during training. Existing methods naively achieve autoregressive long video generation by directly placing the ending of the previous clip at the front of the attention window as conditioning, which leads to abrupt scene changes, unnatural motion, and error accumulation. In this work, we introduce a more natural formulation of autoregressive long video generation by revisiting the noise level assumption in video diffusion models. Our key idea is to (1) assign per-frame, progressively increasing noise levels to the frames rather than a single shared noise level, and (2) denoise and shift the frames in small intervals rather than all at once. This allows for smoother attention correspondence among frames with adjacent noise levels, larger overlaps between the attention windows, and better propagation of information from the earlier to the later frames. Video diffusion models equipped with our progressive noise schedule can autoregressively generate long videos with much improved fidelity compared to the baselines and minimal quality degradation over time. We present the first results on text-conditioned 60-second (1440-frame) long video generation at a quality close to that of frontier models.

Motivation

Current frontier video diffusion models can only generate short video clips (e.g., 10 seconds or 240 frames) due to the expensive O(N^2) cost of long-sequence modeling in DiTs. To enable long video generation, the straightforward solution is to apply video diffusion models autoregressively.

How to generate longer videos?

  • Increasing the video length at inference time results in poor video quality.
  • Naïve autoregressive video extension leads to drifting after 3-4 extensions.


A new way to noise/denoise?


Instead of a shared noise level across all frames, assign a progressively increasing noise level to each frame!
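
A minimal sketch of this idea in PyTorch, assuming a rectified-flow-style forward process x_t = (1 - t) x_0 + t ε; the tensor shapes and variable names are illustrative, not our exact implementation:

```python
import torch

F = 30                               # number of latent frames in the attention window
latents = torch.randn(F, 4, 64, 64)  # stand-in for clean video latents (F, C, H, W)

# Per-frame, progressively increasing noise levels:
# frame 0 is nearly clean, frame F-1 is almost pure noise.
t = torch.linspace(1.0 / F, 1.0, F)

noise = torch.randn_like(latents)
t_ = t.view(F, 1, 1, 1)              # broadcast each frame's level over C, H, W
noisy_latents = (1.0 - t_) * latents + t_ * noise
```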


Method


The replacement methods (left) vs. our PA-VDM (right).

Given the previously generated F=30 frames:

Traditional replacement methods (a sketch follows this list):

  • Directly place the clean condition frames at the beginning of the noisy frames.
  • Repeat every 30 denoising steps.
  • 😔 Severe drifting at 20s, small overlap between attention windows, unnatural and limited motion.
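
A minimal sketch of this baseline, assuming a placeholder denoiser model(x, step) that takes a shared integer noise level; the clean-frame replacement and all names are illustrative simplifications:

```python
import torch

def extend_clip(model, cond_frames, new_frames=30, num_steps=30):
    """One autoregressive extension with replacement-based conditioning."""
    c = cond_frames.shape[0]
    # Fresh noise for the new frames, prefixed by the condition frames.
    x = torch.cat([cond_frames,
                   torch.randn(new_frames, *cond_frames.shape[1:])], dim=0)
    for step in reversed(range(1, num_steps + 1)):
        x = model(x, step)        # one shared noise level for the whole window
        # Replacement conditioning: overwrite the condition slots with the
        # known frames (in practice re-noised to the current level; kept
        # clean here for brevity).
        x[:c] = cond_frames
    return x[c:]                  # only the newly generated frames
```
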
Progressive Autoregressive Video Diffusion Models (PA-VDM, sketched after this list):

  • Assign per-frame, progressively increasing noise levels.
  • Shift the frames by 1 and repeat after every denoising step.
  • 😁 Minimal drifting up to 60s, maximum overlap, natural motion, and better information propagation from the earlier clean frames to the later uncertain frames.
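
A minimal sketch of the progressive shifting loop, assuming a placeholder denoiser model(x, t) that accepts per-frame noise levels and performs one denoising step (taking each frame from t[i] down to t[i] - 1/F, so the front frame exits fully denoised); all names are illustrative:

```python
import torch

def generate_long_video(model, x, t, num_new_frames=1440):
    """x: (F, C, H, W) window noised at increasing per-frame levels t."""
    frames = []
    for _ in range(num_new_frames):
        # One denoising step: each frame moves down one noise level,
        # so the front frame (lowest level) comes out fully denoised.
        x = model(x, t)
        frames.append(x[0])                    # pop the clean front frame
        fresh = torch.randn(1, *x.shape[1:])   # pure noise enters at the tail
        x = torch.cat([x[1:], fresh], dim=0)   # shift the window by 1 frame
    return torch.stack(frames)
```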