Introduction to Generative Models


Denise Lanzieri






slides at denise-lanzieri-csl.github.io/lecture_cs_2024/

Disclaimer

    If some mathematical symbols or equations are not displayed correctly in this presentation:

    1. Please use Firefox for the best compatibility.

    2. Alternatively, you can adjust the math rendering settings in your browser.
      For example, in browsers supporting MathJax, you can follow these steps:
      • Right-click on any mathematical expression.
      • Navigate to Math Settings > Math Renderer.
      • Select a different rendering option (e.g., Common HTML).

    The image below demonstrates how to adjust these settings:

Math Settings Example

Supervised vs Unsupervised


Supervised Learning
  • Data: $(x, y)$ with $x$ as data and $y$ as labels

  • Goal: Learn the function $f$ that maps $x \to y$

  • Applications: Classification, regression, object detection, etc.
Unsupervised Learning
  • Data: $x$ only, no labels

  • Goal: Learn the underlying structure of the data

  • Applications: Clustering, dimensionality reduction, feature learning, density estimation, etc.

    Unsupervised learning is a vast and very exciting area of research for two reasons:

  • It is relatively cost-effective since it doesn’t require labels. $\rightarrow$ This allows us to learn from a large amount of data at once, using much more data than if we had to annotate or label it manually.

  • Although unsupervised learning is still a relatively unsolved area of research with many open problems remaining, it holds immense potential. $\rightarrow$ By successfully learning and representing the underlying structure in data, it can significantly advance our understanding, bringing us closer to the ultimate goal of comprehending the structure of the visual world.

Generative Models

    Generative modeling involves modeling data distributions $P_{model}(X)$:

    • We get examples $X$ distributed according to some unknown distribution $P_{data}(X)$

    • Goal: Learn a model $P_{model}(X)$ which we can sample from, such that $P_{model}(X)$ is as similar as possible to $P_{data}(X)$

Example: We have a massive collection of images of galaxies from telescopes. Can the network learn to generate new galaxies? i.e. Generate new samples from the distribution of galaxy images

Why Study Generative Models?

  • They test our ability to represent and manipulate high-dimensional probability distributions.

  • They can be incorporated into reinforcement learning in various ways.

  • They can be trained with missing data and provide predictions for inputs with missing information (e.g., the case of missing data in semi-supervised learning).

  • They handle multi-modal outputs effectively. For many tasks, a single input may correspond to several different correct answers, each of which is acceptable.

  • Some tasks intrinsically require realistic generation of samples from a specific distribution: single-image super-resolution, tasks aimed at creating art, image-to-image translation applications, etc.
(Ledig, et al., 2016)

Variational Autoencoders (VAEs)

Some Background First: Autoencoders

  • Specific type of a neural network designed to learn a lower-dimensional feature representation of unlabeled training data:

    1. Encode the input $\mathbf{x}$ into a compressed and meaningful representation (smaller set of features $\mathbf{z}$)

    2. Decode it back such that the reconstructed input $\hat{\mathbf{x}}$ is as similar as possible to the original one

  • To reconstruct the input data, an $L_2$ loss function is often used,
      e.g. the pixels of the reconstructed data should match the pixels of the original input $\Longrightarrow$ Autoencoders do not use labels for training.
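    A minimal sketch of this encode/decode pipeline (a sketch, assuming PyTorch, flattened 784-dimensional inputs such as MNIST, and illustrative layer sizes):

```python
import torch
import torch.nn as nn

# Encoder compresses x to a low-dimensional code z; decoder reconstructs x_hat.
# Layer sizes are illustrative, assuming flattened 784-dim inputs (e.g. MNIST).
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(64, 784)            # a batch of unlabeled inputs
z = encoder(x)                     # compressed representation
x_hat = decoder(z)                 # reconstruction
loss = ((x_hat - x) ** 2).mean()   # L2 reconstruction loss -- no labels needed
loss.backward()
```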

Some Background First: Autoencoders

  • Autoencoders can reconstruct data and learn features to initialize a supervised model.

  • However, we cannot generate new images from an autoencoder because we don't know the distribution of the latent space $z$.

  • How can we make an autoencoder a generative model?
      $\rightarrow$ By following Variational Bayes (VB)!

    • VAE: Generative models that attempt to describe data generation through a probabilistic distribution.

Latent Variables

  • Latent variables $z$: Hidden variables capturing underlying data characteristics.

  • Example: Generating digits

    • $z$: Represents attributes such as digit identity, stroke width, orientation, abstract stylistic properties, etc.
      $\rightarrow$ Attributes may be correlated

    • We won't determine what each dimension of $z$ encodes, nor explicitly define the dependencies or latent structure between those dimensions.

Generative process:

  • Consider a vector of latent variables $z$ in a high-dimensional space $\mathcal{Z}$, which can be sampled according to a probability density function $P(z)$.

  • Let $f(z; \theta)$ represent a set of deterministic functions parameterized by the vector $\theta$, defined over the space $\mathcal{\Theta}$, such that: \begin{equation} f:\mathcal{Z}\times \mathcal{\Theta} \rightarrow \mathcal{X} \end{equation}

  • Goal: Optimize $\theta$ such that when $z$ is sampled from $P(z)$, with high probability, $f(z; \theta)$ produces outputs similar to the $X$’s in our dataset.

      Example: $f(z; \theta)$ generates realistic digits based on $z$.

  • Formally, we aim to maximize the likelihood of training data: \begin{equation} P(X) = \int P(X|z; \theta)P(z) dz \end{equation}

Key Concepts of VAEs

\begin{equation} P(X) = \int \underbrace{P(X|z; \theta)}_{\mathrm{ Gaussian} \\ \mathrm{distribution}}P(z) dz \end{equation}
\begin{equation} P(X|z; \theta)=\mathcal{N}(X| \underbrace{f(z;\theta)}_{\mathrm{mean}}, \underbrace{\sigma^2 I}_{\mathrm{covariance}}) \end{equation}
$\rightarrow$ If we assume a Gaussian distribution, we can use gradient descent to increase $P(X)$ by adjusting $f(z; \theta)$ to get closer to $X$ for some $z$!

Two problems:
  1. How do we define the latent variables $z$? What information do they represent?

  2. How can we deal with the integral over $z$?

VAEs: How to define the latent variables $z$?

  • VAE approach : Draw $z$ from a simple Normal distribution $\rightarrow$ $\mathcal{N}(0,I)$
  • Any distribution in $d$ dimensions can be generated by taking a set of $d$ variables that are normally distributed and mapping them through a sufficiently complicated function (e.g., a neural network).
  • Example: Remember that \begin{equation} P(X|z;\theta)=\mathcal{N}(X|f(z;\theta),\sigma^2 I) \end{equation} Let $f(z;\theta)$ be a neural network that uses its first $N$ layers to map the $z$ values sampled from the normal distribution to latent values with the desired distributions.

VAEs: How to deal with the integral over $z$?

    \begin{equation} P(X) = \underbrace{ \int P(X|z; \theta)P(z) dz}_{\mathrm{\color{#6699CC}{Intractable} \ to \ compute} \\ \mathrm{for \ every } \ z} \end{equation}
    \begin{equation} P(X) \approx \underbrace{\frac{1}{N}\sum_{i}P(X|z_i)}_{N \ \mathrm{must \ be\ \color{#6699CC}{very} \ \color{#6699CC}{large}} \\ \mathrm{to \ estimate \ P(X) \ accurately}}, \quad z_i \sim P(z) \end{equation}
  • However, for most values of $ z$, $ P(X|z)$ will be nearly zero.

  • VAE's idea: Introduce a new function $ Q(z|X)$, from which we can sample values of $ z$ that are likely to have produced $ X$.
      $\Longrightarrow$ This allows us to compute $ \mathbb{E}_{z \sim Q} P(X|z)$ more easily!
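    A toy 1-D numerical sketch of why naive sampling from the prior fails (the decoder $f$, the noise level, and all numbers here are hypothetical choices for illustration):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D model: decoder f(z) = 10*z, noise sigma = 0.1; estimate P(X=5).
rng = np.random.default_rng(0)
sigma, x_obs = 0.1, 5.0
f = lambda z: 10.0 * z

# Naive Monte Carlo with z ~ P(z) = N(0,1): f(z) is almost never near x_obs,
# so P(X|z) is ~0 for most samples and the estimate is hopelessly noisy.
z = rng.standard_normal(100)
print(norm.pdf(x_obs, loc=f(z), scale=sigma).mean())

# Sampling z near the values that could have produced x_obs (here z ~ 0.5)
# concentrates on the informative region -- the role played by Q(z|X).
# (A proper estimate would reweight by P(z)/Q(z); this only shows where the mass is.)
zq = 0.5 + 0.01 * rng.standard_normal(100)
print(norm.pdf(x_obs, loc=f(zq), scale=sigma).mean())
```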

How can $Q(z)$ help to optimize $P(X)$?

  • Some background first: Kullback-Leibler divergence
      $\rightarrow$ Kullback-Leibler divergence between $P(z|X)$ and an arbitrary $Q(z)$ is defined as: \begin{equation} \mathcal{D}_{KL}[Q(z)\|P(z|X)]= \mathbb{E}_{z \sim Q}[\log{Q(z)}-\log{P(z|X)}] \end{equation}
  • By applying Bayes' theorem: \begin{equation} P(X|z)P(z)=P(z|X)P(X) \end{equation}
    we obtain: \begin{equation} \mathcal{D}_{KL}[Q(z)\|P(z|X)]= \mathbb{E}_{z \sim Q}[\log{Q(z)}-\log{P(X|z)}-\log{P(z)}+\log{P(X)}] \end{equation}
  • $P(X)$ comes out of the expectation: \begin{equation} \mathcal{D}_{KL}[Q(z)\|P(z|X)]= \mathbb{E}_{z \sim Q}[\log{Q(z)}-\log{P(X|z)}-\log{P(z)}]+\underbrace{\log{P(X)}}_{\mathrm{does \ not \ depend \ on \ } z} \end{equation}

How can $Q(z)$ help to optimize $P(X)$?

  • $P(X)$ is factored out of the expectation: \begin{equation} \mathcal{D}_{KL}[Q(z)\|P(z|X)]= \mathbb{E}_{z \sim Q}[\log{Q(z)}-\log{P(X|z)}-\log{P(z)}]+\underbrace{\log{P(X)}}_{\mathrm{does \ not \ depend \ on \ } z} \end{equation}
  • Negating both sides and rearranging, we obtain: \begin{equation} \log{P(X)}-\mathcal{D}_{KL}[Q(z)\|P(z|X)]= \mathbb{E}_{z \sim Q}[\log{P(X|z)}]+\underbrace{\mathbb{E}_{z \sim Q}[\log{P(z)}-\log{Q(z)}]}_{-\mathcal{D}_{KL}[Q(z)\|P(z)]} \end{equation}
  • Since we are interested in inferring $ P(X)$, it makes sense to construct a distribution $ Q(z)$ that depends on $ X$ and minimizes the divergence $ \mathcal{D}_{KL}[Q(z) \| P(z|X)]$, leading to: \begin{equation} \log{P(X)}-\mathcal{D}_{KL}[Q(z|X)\|P(z|X)]= \mathbb{E}_{z \sim Q}[\log{P(X|z)}]- \mathcal{D}_{KL}[Q(z|X)\|P(z)] \end{equation}
    • $\Longrightarrow$ VAE Objective Function

VAE Objective Function

    \begin{equation} \underbrace{\log{P(X)}}_{\mathrm{quantity \ we \ want \ to \ maximize}} -\underbrace{\mathcal{D}_{KL}[Q(z|X)\|P(z|X)]}_{\mathrm{error \ term}}= \mathbb{E}_{z \sim Q}[\log{\color{#996699}{P(X|z)}}]- \underbrace{\mathcal{D}_{KL}[\color{#6699CC}{Q(z|X)}\|P(z)]}_{\mathrm{make \ approximate \ posterior} \\ \mathrm{distribution \ close \ to \ prior}} \end{equation}
    • Encoder-decoder structure:

      • Encoder $\color{#6699CC}{Q(z|X)}$: Maps $X$ to $z$.

      • Decoder $\color{#996699}{P(X|z)}$: Reconstructs the input data $X$ from $z$.

Optimization

  • How can we perform stochastic gradient descent on the right-hand side of: \begin{equation} \log{P(X)} - \mathcal{D}_{KL}[Q(z|X) \| P(z|X)] = \mathbb{E}_{z \sim Q}[\log{P(X|z)}] - \mathcal{D}_{KL}[Q(z|X) \| P(z)] \ ? \end{equation}
    • We usually choose $ Q(z|X) = \mathcal{N}(z|\color{#6699CC}\mu(X;\theta), \color{#996699}\Sigma(X;\theta))$, where:
      • $ \color{#6699CC}\mu$ and $ \color{#996699}\Sigma$ are functions of parameters $ \theta$, which will be learned from data.

    • $ \mathcal{D}_{KL}[Q(z|X) \| P(z)]$ becomes the KL-divergence between two multivariate Gaussian distributions: \begin{equation} \mathcal{D}_{KL}\left[\mathcal{N}(\mu_0, \Sigma_0) \| \mathcal{N}(\mu_1, \Sigma_1)\right] = \frac{1}{2} \left( \mathrm{tr}(\Sigma_1^{-1} \Sigma_0) + (\mu_1 - \mu_0)^\top \Sigma_1^{-1} (\mu_1 - \mu_0) - k + \log\frac{\det\Sigma_1}{\det\Sigma_0} \right) \end{equation}
    • Remember that $ P(z) = \mathcal{N}(z|0, I)$, so we obtain: \begin{equation} \mathcal{D}_{KL}\left[\mathcal{N}(\mu(X), \Sigma(X)) \| \mathcal{N}(0, I)\right] = \frac{1}{2} \left( \mathrm{tr}(\Sigma(X)) + \mu(X)^\top \mu(X) - k - \log \det(\Sigma(X)) \right) \end{equation}
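    This closed form is simple to implement; a sketch for the common diagonal-covariance case (NumPy, with illustrative values):

```python
import numpy as np

def kl_q_prior(mu, sigma_diag):
    """D_KL[ N(mu, diag(sigma_diag)) || N(0, I) ], i.e. the formula above:
    0.5 * ( tr(Sigma) + mu^T mu - k - log det(Sigma) )."""
    k = mu.shape[-1]
    return 0.5 * (sigma_diag.sum() + mu @ mu - k - np.log(sigma_diag).sum())

# Illustrative encoder outputs for a 4-dimensional latent space.
mu = np.array([0.5, -0.2, 0.0, 1.0])          # mu(X)
sigma_diag = np.array([0.9, 1.1, 1.0, 0.5])   # diagonal of Sigma(X)
print(kl_q_prior(mu, sigma_diag))             # 0 iff mu = 0 and Sigma = I
```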

Optimization

  • How can we perform stochastic gradient descent on the right-hand side of: \begin{equation} \log{P(X)}-\mathcal{D}_{KL}[Q(z|X)\|P(z|X)]= \mathbb{E}_{z \sim Q}[\log{P(X|z)}]- \mathcal{D}_{KL}[Q(z|X)\|P(z)] \ ? \end{equation}
  • To optimize the function, we will do stochastic gradient descent over different values of $X$ sampled from a dataset D: \begin{equation} \mathbb{E}_{X \sim D} \left[ \log P(X) - \mathcal{D}_{KL}\left[ Q(z|X) \| P(z|X) \right] \right] = \mathbb{E}_{X \sim D} \left[ \mathbb{E}_{z \sim Q(z|X)} \left[ \log P(X|z) \right] - \mathcal{D}_{KL}\left[ Q(z|X) \| P(z) \right] \right] \end{equation}
  • Taking the gradient and moving the gradient symbol inside the expectation, we obtain: \begin{equation} \nabla \mathbb{E}_{X \sim D} \left[ \mathbb{E}_{z \sim Q(z|X)} \left[ \log P(X|z) \right] - \mathcal{D}_{KL}\left[ Q(z|X) \| P(z) \right] \right] = \mathbb{E}_{X \sim D} \left[ \mathbb{E}_{z \sim Q(z|X)} \left[ \nabla \log P(X|z) \right] - \nabla \mathcal{D}_{KL}\left[ Q(z|X) \| P(z) \right] \right]. \end{equation}
  • Estimating $\mathbb{E}_{z \sim Q}[\log{P(X|z)}]$ through sampling is computationally expensive:

      $\rightarrow$ Instead, we approximate $\mathbb{E}_{z \sim Q}[\log{P(X|z)}]$ with $\log{P(X|z)}$ for a single sample $z \sim Q(z|X)$.

  • The final gradient is: \begin{equation} \nabla \mathbb{E}_{X \sim D} \left[ \mathbb{E}_{z \sim Q(z|X)} \left[ \log P(X|z) \right] - \mathcal{D}_{KL}\left[ Q(z|X) \| P(z) \right] \right] \approx \nabla \left[ \log{P(X|z)}-\mathcal{D}_{KL}[Q(z|X)\|P(z)] \right] \end{equation} $\Longrightarrow$ Averaging the gradient of this expression over arbitrarily many samples of $X$ and $z$ converges to the gradient of the full equation.

Reparameterization Trick

  • Sampling $z$ from $Q(z|X)$ is a non-continuous operation:

    • Sampling is non-differentiable $\Longrightarrow$ We cannot backpropagate the error.

  • Solution: Reparameterize as: \begin{equation} \mathbb{E}_{X \sim D} \left[ \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \left[ \log P(X | z = \mu(X) + \Sigma^{1/2}(X) \cdot \epsilon) \right] - \mathcal{D}_{KL}\left[ Q(z|X) \| P(z) \right] \right]. \end{equation} where $\mu(X)$ and $\Sigma(X)$ are the mean and covariance of $Q(z|X)$.

    • i.e. We sample from $\mathcal{N}(\mu(X), \Sigma(X))$ by first sampling $\epsilon \sim \mathcal{N}(0, I)$, then computing $ z = \mu(X) + \Sigma^{1/2}(X) \cdot \epsilon$

      $\Longrightarrow$ Enables gradient-based optimization.
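    A minimal sketch of the trick (PyTorch; the diagonal covariance parameterized by log-variances is an assumption, though a common one):

```python
import torch

# Encoder outputs for a batch of 64 items with a 32-dim latent space
# (zeros here are placeholders for real network outputs).
mu = torch.zeros(64, 32, requires_grad=True)
log_var = torch.zeros(64, 32, requires_grad=True)

# Instead of sampling z ~ N(mu, Sigma) directly (non-differentiable),
# isolate the randomness in eps ~ N(0, I) and compute a deterministic,
# differentiable transform: z = mu + Sigma^{1/2} * eps.
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * log_var) * eps   # Sigma^{1/2} = exp(log_var / 2)

z.sum().backward()                        # gradients now reach mu and log_var
print(mu.grad is not None, log_var.grad is not None)
```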

Summary

  • VAEs are powerful generative models that utilize a latent variable framework.
  • Practical applications include data synthesis, compression, and more.
  • VAEs can be viewed as autoencoders consisting of two neural networks.
  • VAEs define two distributions:
    • Encoder: $Q_{\theta}(z|X)$
    • Decoder: $P_{\varphi}(X|z)$

    Here, $\theta$ and $\varphi$ are the parameters of the neural networks.
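    Putting the pieces together, a compact sketch of one VAE training step (PyTorch; single linear layers stand in for the two networks, and a Gaussian decoder with fixed $\sigma$ is assumed, so $-\log P(X|z)$ reduces to an $L_2$ term up to constants):

```python
import torch
import torch.nn as nn

enc = nn.Linear(784, 2 * 32)   # encoder Q(z|X): outputs mu and log-variance
dec = nn.Linear(32, 784)       # decoder P(X|z): outputs the mean f(z)

x = torch.rand(64, 784)                                     # a data batch
mu, log_var = enc(x).chunk(2, dim=-1)
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)    # reparameterization
x_hat = dec(z)

recon = ((x_hat - x) ** 2).sum(dim=-1)                        # -log P(X|z) up to constants
kl = 0.5 * (log_var.exp() + mu**2 - 1 - log_var).sum(dim=-1)  # D_KL[Q(z|X) || N(0,I)]
loss = (recon + kl).mean()                                    # negative ELBO over the batch
loss.backward()
```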

Generative Adversarial Nets (GANs)

Generative Adversarial Networks (GANs)

  • Generative: $\rightarrow$ Generative Models:
    Learn the distribution from which the data originates, similar to Variational Autoencoders (VAEs).
  • Adversarial: $\rightarrow$ Adversarial Training:
    GANs consist of two competing networks (adversaries) that try to outperform each other.
  • Networks: $\rightarrow$ Neural Networks:
    GANs are composed of two neural networks whose goal is to generate data matching an unlabeled data distribution.

What can GANs do?

  • Generate examples for image datasets
  • Perform image-to-image translation
  • Synthesize images from text descriptions
  • Single Image Super Resolution
  • Generate photographs of human faces
  • Create realistic photographs
  • Predict video sequences
  • Generate 3D objects
  • And much more... (Check out this blog for more papers and impressive applications.)

Generate Examples for Image Datasets

    In the original paper "Generative Adversarial Networks" by Ian Goodfellow et al., GANs were used to generate new examples for:
    • The MNIST handwritten digit dataset
    • The CIFAR-10 small object photograph dataset
    • The Toronto Face Database
(Goodfellow, et al. 2014)

Progress of GANs

Progress of GANs over the years
(Brundage, et al. 2020)

How to Train a GAN?


  • Which network should we train first?
    • The Discriminator!

  • We can think of the Discriminator as a binary classifier with two classes: $\rightarrow$ Real and Fake.

  • But what data do we use to train it?
      - Real class data $\rightarrow$ the training dataset.
      - Fake class data $\rightarrow$ samples produced by the generator.

  • What's next? Train the Generator!

GAN training scheme

But how?
  • Step 1: Train the Discriminator using the current output of the Generator.
  • Step 2: Train the Generator to fool the Discriminator.
What is our training objective?
  • We aim to generate images from the Generator that are incorrectly classified by the Discriminator!

What is our training objective?

  • Let the discriminator be $D(\textbf{x}, \theta_d)$
    $\rightarrow$ a neural network with parameters $\theta_d$ that outputs a single scalar.
  • Let the generator be $G(\textbf{z}, \theta_g)$
    $\rightarrow$ a differentiable function represented by a neural network with parameters $\theta_g$.
  • $p_d$: actual data distribution
  • $p_g$: generator’s distribution
  • $p_z(\textbf{z})$: chosen prior in latent vector space
  • $D(\textbf{x})$: represents the probability that $\textbf{x}$ came from the data rather than $p_g$

What should the discriminator do?

If $\textbf{x} \sim p_d$, what should happen to the value of $D(\textbf{x})$?
  • $D(\textbf{x})$ should be maximized:
    $\rightarrow$ $\log{D(\textbf{x})}$ maximized
    $\rightarrow$ $\mathbb{E}_{\textbf{x} \sim p_d}[\log{D(\textbf{x})}]$ maximized
If $\textbf{x} = G(\textbf{z})$, so if $\textbf{x} \sim p_g$, what should happen to the value of $D(\textbf{x})$?
  • $D(\textbf{x})$ should be minimized:
    $\rightarrow$ $\log{D(\textbf{x})}$ minimized
    $\rightarrow$ $\log{(1 - D(\textbf{x}))}$ maximized
    $\rightarrow$ $\mathbb{E}_{\textbf{x} \sim p_g}[\log{(1-D(\textbf{x}))}]$ maximized
    $\rightarrow$ $\mathbb{E}_{\textbf{z} \sim p_z}[\log{(1 - D(G(\textbf{z})))}]$ maximized

Overall, the discriminator wants this sum to be maximized: \begin{equation} V(D,G) = \mathbb{E}_{\textbf{x} \sim p_d}\underbrace{[\log(D(\textbf{x}))]}_{\text{chance of real data} \\ \text{ being called real}} + \mathbb{E}_{\textbf{z} \sim p_z}\underbrace{[\log(1 - D(G(\textbf{z})))]}_{\text{chance of fake data}\\ \text{being called fake}} \end{equation}

What is our training objective?

  • Let the discriminator be $D(\textbf{x}, \theta_d)$
    $\rightarrow$ a neural network with parameters $\theta_d$ that outputs a single scalar.
  • Let the generator be $G(\textbf{z}, \theta_g)$
    $\rightarrow$ a differentiable function represented by a neural network with parameters $\theta_g$.
  • $p_d$: actual data distribution
  • $p_g$: generator’s distribution
  • $p_z(\textbf{z})$: chosen prior in the latent vector space
  • $D(\textbf{x})$: represents the probability that $\textbf{x}$ came from the data rather than $p_g$

What should the generator do?

If $\textbf{x} = G(\textbf{z})$, so $\textbf{x} \sim p_g$, what should happen to the value of $D(\textbf{x})$?
  • $D(\textbf{x})$ should be maximized:
    $\rightarrow$ $\log{D(\textbf{x})}$ should be maximized
    $\rightarrow$ $\log{(1 - D(\textbf{x}))}$ should be minimized
    $\rightarrow$ $\mathbb{E}_{\textbf{x} \sim p_g}[\log{(1-D(\textbf{x}))}]$ should be minimized
    $\rightarrow$ $\mathbb{E}_{\textbf{z} \sim p_z}[\log{(1 - D(G(\textbf{z})))}]$ should be minimized

Since the generator has no control over $\mathbb{E}_{\textbf{x} \sim p_d}[\log(D(\textbf{x}))]$, overall, the generator wants this sum to be minimized: \begin{equation} V(D,G) = \mathbb{E}_{\textbf{x} \sim p_d}\underbrace{[\log(D(\textbf{x}))]}_{\text{chance of real data } \\ \text{being called real}} + \mathbb{E}_{\textbf{z} \sim p_z}\underbrace{[\log(1-D(G(\textbf{z})))]}_{\text{chance of fake data} \\ \text{ being called fake}} \end{equation}

What is our training objective?

  • The discriminator maximizes the sum: \begin{equation} V(D, G) = \mathbb{E}_{\textbf{x} \sim p_d}\underbrace{[\log(D(\textbf{x}))]}_{\text{chance of real data } \\ \text{being called real}} + \mathbb{E}_{\textbf{z} \sim p_z}\underbrace{[\log(1 - D(G(\textbf{z})))]}_{\text{chance of fake data} \\ \text{ being called fake}} \end{equation}
  • The generator minimizes the sum: \begin{equation} V(D, G) = \mathbb{E}_{\textbf{x} \sim p_d}\underbrace{[\log(D(\textbf{x}))]}_{\text{chance of real data } \\ \text{being called real}} + \mathbb{E}_{\textbf{z} \sim p_z}\underbrace{[\log(1 - D(G(\textbf{z})))]}_{\text{chance of fake data} \\ \text{ being called fake}} \end{equation}
  • The original GAN formulation is a min-max game: \begin{equation} \min_{G}\max_{D}V(D, G) = \mathbb{E}_{\textbf{x} \sim p_d}[\log(D(\textbf{x}))] + \mathbb{E}_{\textbf{z} \sim p_z}[\log(1 - D(G(\textbf{z})))] \end{equation}
  • $\rightarrow$ The discriminator aims for $D(\textbf{x}) = 1$ (classifying real data as real) and $D(G(\textbf{z})) = 0$ (classifying fake data as fake).
    $\rightarrow$ The generator aims for $D(G(\textbf{z})) = 1$ (making generated data indistinguishable from real data).
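    A minimal sketch of this alternating min-max training (PyTorch; architectures and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())                      # discriminator
G = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))  # generator
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

x_real = torch.rand(64, 784)   # stand-in for a batch of real data
z = torch.randn(64, 32)        # z ~ p_z, the latent prior

# Step 1 -- discriminator: maximize log D(x) + log(1 - D(G(z))).
# detach() keeps this update from flowing into the generator.
d_loss = -(torch.log(D(x_real)) + torch.log(1 - D(G(z).detach()))).mean()
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Step 2 -- generator: minimize log(1 - D(G(z))).
# (In practice one often maximizes log D(G(z)) instead -- the
# non-saturating loss suggested in the original paper.)
g_loss = torch.log(1 - D(G(z))).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```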

What Does the Optimal Discriminator Look Like?

  • Objective:

    \begin{equation} \min_{G}\max_{D}V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_d}[\log(D(\textbf{x}))] + \mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D(G(\textbf{z})))] \end{equation}
  • When $G$ is fixed, the optimal discriminator $D$ is obtained by maximizing this formula: \begin{equation} f := \int_X \left[ p_d(\textbf{x}) \log D(\textbf{x}) +p_g(\textbf{x})\log(1 - D(\textbf{x})) \right] dx \end{equation}
To find the critical points, we take the derivative of $f$ with respect to $D(\textbf{x})$: \begin{align*} \frac{\partial f}{\partial D(\textbf{x})} &= \frac{p_d(\textbf{x})}{D(\textbf{x})} - \frac{p_g(\textbf{x})}{1 - D(\textbf{x})} = 0 \\ \frac{p_d(\textbf{x})}{D(\textbf{x})} &= \frac{p_g(\textbf{x})}{1 - D(\textbf{x})} \\ (1 - D(\textbf{x})) p_d(\textbf{x}) &= D(\textbf{x})p_g(\textbf{x}) \\ D(\textbf{x}) &= \frac{p_d(\textbf{x})}{p_g(\textbf{x}) + p_d(\textbf{x})} \end{align*}
We compute the second derivative of $ V(G, D)$ with respect to $ D(\textbf{x})$: \begin{equation} \frac{\partial^2 V}{\partial D(\textbf{x})^2} = -\underbrace{\frac{p_d(\textbf{x})}{D(\textbf{x})^2}}_{>0} - \underbrace{\frac{p_g(\textbf{x})}{(1 - D(\textbf{x}))^2}}_{>0} < 0. \end{equation} This confirms that $V(G, D)$ is $\textbf{concave}$ with respect to $D(\textbf{x})$.

What Does the Optimal Generator Look Like?

  • Objective:

    \begin{equation} \min_{G}\max_{D}V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_d}[\log(D(\textbf{x}))] + \mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D(G(\textbf{z})))] \end{equation}
  • By substituting the optimal discriminator back into the objective function, the optimal generator $G$ is obtained by minimizing the following formula: \begin{equation} \mathbb{E}_{\mathbf{x} \sim p_d} \log D(\mathbf{x}) + \mathbb{E}_{\mathbf{x} \sim p_g} \log(1 - D(\mathbf{x})) = \mathbb{E}_{\mathbf{x} \sim p_d} \left[ \log \frac{p_d(\mathbf{x})}{p_d(\mathbf{x}) + p_g(\mathbf{x})} \right] + \mathbb{E}_{\mathbf{x} \sim p_g} \left[ \log \frac{p_g(\mathbf{x})}{p_d(\mathbf{x}) + p_g(\mathbf{x})} \right] \end{equation}
  • We introduce a mixture distribution $p_m$ as the average of the real and generated data distributions: \begin{equation} p_m(\mathbf{x}) = \frac{1}{2} \left( p_d(\mathbf{x}) + p_g(\mathbf{x}) \right) \end{equation}
  • Now, the loss function can be expressed as: \begin{equation} \mathbb{E}_{\mathbf{x} \sim p_d} \left[ \log \frac{p_d(\mathbf{x})}{2p_m(\mathbf{x})} \right] + \mathbb{E}_{\mathbf{x} \sim p_g} \left[ \log \frac{p_g(\mathbf{x})}{2p_m(\mathbf{x})} \right] \end{equation}

Recognizing Jensen-Shannon Divergence

\begin{equation} \mathbb{E}_{\mathbf{x} \sim p_d} \left[ \log \frac{p_d(\mathbf{x})}{2p_m(\mathbf{x})} \right] + \mathbb{E}_{\mathbf{x} \sim p_g} \left[ \log \frac{p_g(\mathbf{x})}{2p_m(\mathbf{x})} \right] \end{equation}
  • Recalling that the Kullback-Leibler divergence between two distributions $ p$ and $ q$ is given by: \begin{equation} \mathcal{D}_{KL}(p \parallel q) = \mathbb{E}_{x \sim p} \left[ \log \frac{p(x)}{q(x)} \right] \end{equation}
  • The new form of the loss function: \begin{equation} -\log{2}+ \mathbb{E}_{\mathbf{x} \sim p_d} \left[ \log \frac{p_d(\mathbf{x})}{p_m(\mathbf{x})} \right] -\log{2} + \mathbb{E}_{\mathbf{x} \sim p_g} \left[ \log \frac{p_g(\mathbf{x})}{p_m(\mathbf{x})} \right]= -2\log{2}+ \mathcal{D}_{KL}(p_d \parallel p_m) + \mathcal{D}_{KL}(p_g \parallel p_m) \end{equation}
  • This expression closely resembles the definition of the Jensen-Shannon Divergence ($\mathcal{D}_{JS}$) between two probability distributions $ p_d$ and $ p_g$: \begin{equation} \mathcal{D}_{JS}(p_d \parallel p_g) = \frac{1}{2} \left( \mathcal{D}_{KL}(p_d \parallel p_m) + \mathcal{D}_{KL}(p_g \parallel p_m) \right) \end{equation}
  • The final expression becomes: \begin{equation} L = 2 \cdot \mathcal{D}_{JS}(p_d \parallel p_g) - \log 4 \end{equation}
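    A small numerical check of this identity on toy discrete distributions (a sketch, not part of the original derivation):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """D_JS(p || q) = 0.5 * ( D_KL(p || m) + D_KL(q || m) ), m = (p + q) / 2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * (kl(p, m) + kl(q, m))

p = np.array([0.25, 0.25, 0.25, 0.25])
q = np.array([0.10, 0.20, 0.30, 0.40])
print(js_divergence(p, p))                  # ~0: D_JS vanishes iff p_d = p_g ...
print(2 * js_divergence(p, q) - np.log(4))  # ... where the loss attains -log 4
```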

What is the optimal generator?

  • The final expression becomes: \begin{equation} 2 \cdot \mathcal{D}_{JS}(\color{#6699CC}{p_d} \parallel \color{#996699}{p_g}) - \log 4 \end{equation}


  • If we substitute the optimal discriminator we found back into the objective function, we see that the optimal generator minimizes the $\mathcal{D}_{JS}$ between the real and the generated distributions.


      $\rightarrow$ In other words, it tries to make the two distributions similar.

Min-Max Optimization

  • Both the discriminator and the generator need to be trained simultaneously.

    • If the discriminator is undertrained, it provides suboptimal feedback to the generator.


    • If the discriminator is overtrained, its output saturates and the generator receives vanishing gradients, leaving no local feedback for marginal improvements.

Diffusion models

Denoising Diffusion Probabilistic Models

  • The original paper "Denoising Diffusion Probabilistic Models", published by Jonathan Ho and collaborators in 2020, showcased the capabilities of diffusion models, achieving results comparable to GANs.

    • The first generation of denoising diffusion models achieved performance similar to what GANs could do, despite GANs benefiting from years of additional development.

    • Diffusion models accomplished this with a relatively limited development timeline.
(Ho, et al. 2020)

Example

Prompt:
A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

Diffusion models

The idea: Transform simple noise distributions into complex data distributions
  • We take a picture:

    • Add a bit of Gaussian noise to it.

    • If we repeat this process...

    • And again...

    • Eventually, we get an unrecognizable picture of pure noise.

Diffusion Models


    What if we could figure out how to undo this process?

  • Start with a noisy image, gradually remove the noise, and end up with a coherent image.
  • $\Longrightarrow$ This is the basic idea behind diffusion models

  • The training of the Diffusion Model can be divided into two steps:

    1. Forward or Diffusion Process (fixed) $\rightarrow$ Add noise to the image

    2. Reverse Diffusion Process (generative) $\rightarrow$ Remove noise from the image

Forward Diffusion Process

  • We start by sampling from our target distribution (the training set) $ x_0$. The forward diffusion process gradually adds noise to the image over $ T$ time steps: \begin{equation} x_0 \to x_1 \to x_2 \to \dots \to x_T \end{equation}
  • For continuous data, each transition is parameterized by a Gaussian conditional distribution: \begin{equation} q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}) \end{equation} where $ \sqrt{1 - \beta_t} \mathbf{x}_{t-1}$ is the mean and $ \beta_t \mathbf{I}$ is the covariance. Here, $ \mathbf{I}$ is the identity matrix and $ \beta_t \in (0, 1)$ is the noise level.
  • The $ \beta_t$ parameter increases over time: \begin{equation} \beta_1 < \beta_2 < \beta_3 < \dots < \beta_T \end{equation} As $ t$ approaches $ T$, the data $ x_t$ becomes increasingly noisy, with the mean $ \sqrt{1 - \beta_t} \mathbf{x}_{t-1}$ deviating more from the previous data $ x_{t-1}$, and the variance $ \beta_t \mathbf{I}$ growing.

Forward Diffusion Process

  • The forward process $ q$ takes the form of a Markov chain, where the distribution at a particular time step depends only on the sample from the immediately previous time step: \begin{equation} q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_{t-2}, \dots, \mathbf{x}_0) = q(\mathbf{x}_t | \mathbf{x}_{t-1}) \end{equation}
  • Therefore, the joint distribution of the forward process is: \begin{equation} q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^T q(\mathbf{x}_t \mid \mathbf{x}_{t-1}), \end{equation} where $ \mathbf{x}_{1:T}$ represents all latent states from $ \mathbf{x}_1$ to $ \mathbf{x}_T$, and $ \mathbf{x}_0$ is the initial state, i.e., the original data.
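    A minimal sketch of the forward process (NumPy; the linear $\beta_t$ schedule is one common, illustrative choice):

```python
import numpy as np

# Repeatedly apply q(x_t | x_{t-1}) = N( sqrt(1 - beta_t) x_{t-1}, beta_t I ).
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # beta_1 < ... < beta_T

x = rng.uniform(size=784)                # stand-in for a data sample x_0
for beta_t in betas:
    x = np.sqrt(1 - beta_t) * x + np.sqrt(beta_t) * rng.standard_normal(x.shape)

# After T steps, x_T is approximately distributed as N(0, I).
print(x.mean(), x.std())
```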

Generative Backward Process

  • The generative backward process is the reverse process, gradually generating data out of noise: \begin{equation} \mathbf{x}_T \rightarrow \mathbf{x}_{T-1} \rightarrow \cdots \rightarrow \mathbf{x}_0. \end{equation}
  • Since the forward process uses Gaussian distributions, the backward process does as well. A machine learning model with parameters $\theta$ is used to model the backward process: \begin{equation} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mathbf{\mu}_\theta(\mathbf{x}_t; t), \mathbf{\Sigma}_\theta(\mathbf{x}_t; t)), \end{equation} where $\mathbf{\mu}_\theta(\mathbf{x}_t; t) \in \mathbb{R}^d$ and $\mathbf{\Sigma}_\theta(\mathbf{x}_t; t) \in \mathbb{R}^{d \times d}$ represent the mean and covariance at time step $t$, respectively.
  • The covariance is typically assumed to be isotropic and diagonal: \begin{equation} \mathbf{\Sigma}_\theta(\mathbf{x}_t; t) = \sigma_t^2 \mathbf{I}. \end{equation}

  • $\rightarrow$ The goal of the diffusion model is to learn the parameters - the mean and variance - of this Gaussian distribution.

Generative Backward Process

  • The generative process is also a Markov chain, satisfying the Markov property: \begin{equation} q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_{t+1}, \ldots, \mathbf{x}_T) = q(\mathbf{x}_{t-1} \mid \mathbf{x}_t). \end{equation}
  • The joint distribution of the generative process is: \begin{equation} p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t), \end{equation} where $\mathbf{x}_{0:T}$ represents all states from $\mathbf{x}_T$ to $\mathbf{x}_0$.
  • Typically, the noise $p(\mathbf{x}_T)$ is modeled as a standard Gaussian: \begin{equation} p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0}, \mathbf{I}). \end{equation}
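    A minimal ancestral-sampling sketch of the backward process (NumPy; `mu_theta` is a hypothetical placeholder for the trained network, and $\sigma_t^2 = \beta_t$ is one common, illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
sigmas = np.sqrt(betas)              # sigma_t^2 = beta_t (illustrative)

def mu_theta(x_t, t):
    return x_t                       # placeholder: a real model predicts the mean

# Start from pure noise x_T ~ N(0, I) and denoise step by step.
x = rng.standard_normal(784)
for t in range(T - 1, -1, -1):
    noise = rng.standard_normal(x.shape) if t > 0 else 0.0  # no noise at the last step
    x = mu_theta(x, t) + sigmas[t] * noise                  # x_{t-1} ~ p(x_{t-1} | x_t)
```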

Objective function

  • VAEs use a latent variable $\mathbf{z}$ to model observed data $\mathbf{x}$. Similarly, diffusion models use latent variables $\{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T\}$ for observed data $\mathbf{x}_0$. The likelihood of data in diffusion models is: \begin{equation} p_\theta(x_0) = \int p_\theta(x_0, x_{1:T}) \, dx_{1:T}. \end{equation}

  • As in variational inference, where the Evidence Lower Bound (ELBO) approximates the likelihood, diffusion models maximize the variational lower bound: \begin{equation} \mathcal{L} := \mathbb{E}_{q(x_{1:T}|x_0)} \big[ \log p_\theta(x_0|x_{1:T}) \big] - \mathcal{D}_{KL} \big( q(x_{1:T}|x_0) \| p_\theta(x_{1:T}) \big) \leq \log p_\theta(x_0). \end{equation}

  • After some mathematical rearrangement, we obtain the variational lower bound for diffusion models (see Appendix A of Ho et al. (2020) for the full derivation): \begin{equation} \mathcal{L} = - \mathcal{D}_{KL} \big( q(x_T|x_0) \| p_\theta(x_T) \big) \quad - \sum_{t=2}^T \mathcal{D}_{KL} \big( q(x_{t-1}|x_t, x_0) \| p_\theta(x_{t-1}|x_t) \big) \quad + \mathbb{E}_{q(x_{1:T}|x_0)} \big[ \log p_\theta(x_0|x_1) \big]. \end{equation}

Objective function

\begin{equation} \mathcal{L} =\underbrace{\color{#996699}{\mathbb{E}_{q(x_{1:T}|x_0)} \big[ \log p_\theta(x_0|x_1) \big]} - \color{#669900}{\mathcal{D}_{KL} \big( q(x_T|x_0) \| p_\theta(x_T) \big)}}_{\text{original ELBO equation}} - \color{#FFAA7F}{\sum_{t=2}^T \mathcal{D}_{KL} \big( q(x_{t-1}|x_t, x_0) \| p_\theta(x_{t-1}|x_t) \big)}. \end{equation}
  • Reconstruction: How well we can reconstruct our data point from the least noisy version of our data

  • Prior matching: Moving the posterior $q(x_T|x_0)$ closer to the prior $ p_\theta(x_T)$ on the final noisy step

  • Denoising: Moving the backward distribution $p_\theta(x_{t-1}|x_t)$ to become closer to the ground truth denoising distribution $q(x_{t-1}|x_t, x_0)$


  • $ \color{#FFAA7F}{q(x_{t-1} \mid x_t, x_0)}$ is tractable and can be calculated exactly without any approximation. Defining $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$: \begin{equation} q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1}; \mu_t, \Sigma_t \mathbf{I}) \end{equation} \begin{equation} \mu_t = \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}}\, (1 - \alpha_t)\, x_0}{1 - \bar{\alpha}_t}, \quad \Sigma_t = \frac{(1 - \alpha_t) (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \end{equation}
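    These posterior parameters are straightforward to compute from the noise schedule; a sketch (NumPy, reusing an illustrative linear $\beta_t$ schedule):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # betas[t-1] is beta_t (t is 1-based)
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{s<=t} alpha_s

def posterior_params(x_t, x_0, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from the formula above."""
    a_t, ab_t = alphas[t - 1], alpha_bars[t - 1]
    ab_prev = alpha_bars[t - 2] if t > 1 else 1.0  # alpha_bar_0 = 1 by convention
    mu = (np.sqrt(a_t) * (1 - ab_prev) * x_t
          + np.sqrt(ab_prev) * (1 - a_t) * x_0) / (1 - ab_t)
    var = (1 - a_t) * (1 - ab_prev) / (1 - ab_t)
    return mu, var   # the targets that the denoising KL terms pull p_theta towards
```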

References