Diffusion models
Setup
- $q(x_0)$: true sampling distribution of observed data
- $p_\theta(x_0)$: learned sampling distribution
- $x_1, \dots, x_T$: noisy latent variables at each step
Forward process
In the forward process, noise is continuously added to the data across multiple steps, converging toward white noise $x_T$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$

The fixed Gaussian forward process means we can directly calculate/sample from $q(x_t \mid x_0)$ without calculating all the intermediate steps (similar to autoregressive models):

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

Where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, and $\epsilon \sim \mathcal{N}(0, I)$ is white noise. Again, this is 100% analogous to the forecasting equation for an AR(1).
In practice, $\alpha_t$ is chosen to be close to 1 (i.e. $\beta_t$ is small), as this yields the best results.
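As a concrete sketch of the closed-form forward sample, here is a minimal NumPy version assuming the standard DDPM setup; the linear $\beta$ schedule and the constants `1e-4`/`0.02` are illustrative choices, not something specified in these notes:

```python
import numpy as np

# Assumed linear noise schedule: beta_t small, so alpha_t = 1 - beta_t stays close to 1.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) in one shot: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps, eps

x0 = np.ones(4)              # toy "data" point
xT, _ = q_sample(x0, T - 1)  # at t = T-1, abar is near 0, so x_T is essentially white noise
```

No loop over intermediate steps is needed, which is exactly the AR(1)-style forecasting shortcut mentioned above.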
Denoising
In the denoising or reverse diffusion process, noise is progressively removed over multiple steps, modeling the inverse process:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

Where $\Sigma_\theta(x_t, t)$ is commonly fixed to $\sigma_t^2 I$ rather than learned.
$\mu_\theta$ is parameterized using a denoising network, $\epsilon_\theta(x_t, t)$ (of which many exist, including the popular U-Net):

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right)$$

This model is trained using the objective function:

$$L = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right]$$
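A minimal sketch of the training objective and one reverse (ancestral sampling) step, assuming the standard $\epsilon$-prediction parameterization. The `eps_model` function here is a hypothetical stand-in for the denoising network $\epsilon_\theta$ (in practice a U-Net or similar), and the schedule constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule, for illustration
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x_t, t):
    """Hypothetical stand-in for eps_theta(x_t, t); a real model would be a U-Net etc."""
    return np.zeros_like(x_t)

def training_loss(x0):
    """L = E_{t, x0, eps} || eps - eps_theta(x_t, t) ||^2 for one random (t, eps) draw."""
    t = rng.integers(0, T)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

def reverse_step(x_t, t):
    """One step x_{t-1} ~ p_theta(x_{t-1} | x_t), with fixed variance sigma_t^2 = beta_t."""
    eps_hat = eps_model(x_t, t)
    mu = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    z = rng.standard_normal(x_t.shape) if t > 0 else 0.0  # no noise on the final step
    return mu + np.sqrt(betas[t]) * z
```

Starting from $x_T \sim \mathcal{N}(0, I)$ and applying `reverse_step` for $t = T-1, \dots, 0$ yields a sample from the learned distribution.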
Intuition
One intuitive way of thinking about diffusion models, which I came up with while watching Advancing Diffusion Models for Text Generation: think of diffusion modeling as picking up a handful of sand (sampled random noise) and throwing the grains against a canvas (the manifold of the data distribution you care about, e.g. images). If you throw the sand in a particularly skillful way, the patterns that land on the canvas will look like recognizable images. Training a diffusion model is essentially learning this motion of skillfully throwing the sand so as to end up with an interesting pattern on the canvas. This is obviously somewhat complicated, and it is a bit of a miracle that this even works / is possible!