Introduction to Generative Models


Denise Lanzieri






slides at denise-lanzieri-csl.github.io/lecture_cs_2024/

Disclaimer

    If some mathematical symbols or equations are not displayed correctly in this presentation:

    1. Please use Firefox for the best compatibility.

    2. Alternatively, you can adjust the math rendering settings in your browser.
      For example, in browsers supporting MathJax, you can follow these steps:
      • Right-click on any mathematical expression.
      • Navigate to Math Settings > Math Renderer.
      • Select a different rendering option (e.g., Common HTML).

    The image below demonstrates how to adjust these settings:

Math Settings Example

Supervised vs Unsupervised


Supervised Learning
  • Data: $(x, y)$ with $x$ as data and $y$ as labels

  • Goal: Learn the function $f$ that maps $x \to y$

  • Applications: Classification, regression, object detection, etc.
Unsupervised Learning
  • Data: $x$ only, no labels

  • Goal: Learn the underlying structure of the data

  • Applications: Clustering, dimensionality reduction, feature learning, density estimation, etc.

    Unsupervised learning is a vast and very exciting area of research for two reasons:

  • It is relatively cost-effective since it doesn’t require labels. $\rightarrow$ This allows us to learn from a large amount of data at once, using much more data than if we had to annotate or label it manually.

  • Although unsupervised learning is still a relatively unsolved area of research with many open problems remaining, it holds immense potential. $\rightarrow$ By successfully learning and representing the underlying structure in data, it can significantly advance our understanding, bringing us closer to the ultimate goal of comprehending the structure of the visual world.

Generative Models

    Generative modeling involves modeling data distributions $P_{model}(X)$:

    • We get examples $X$ distributed according to some unknown distribution $P_{data}(X)$

    • Goal: Learn a model $P_{model}(X)$ which we can sample from, such that $P_{model}(X)$ is as similar as possible to $P_{data}(X)$

Example: We have a massive collection of images of galaxies from telescopes. Can the network learn to generate new galaxies? i.e. Generate new samples from the distribution of galaxy images

Why Study Generative Models?

  • They test our ability to represent and manipulate high-dimensional probability distributions.

  • They can be incorporated into reinforcement learning in various ways.

  • They can be trained with missing data and provide predictions for inputs with missing information (e.g., the case of missing data in semi-supervised learning).

  • They handle multi-modal outputs effectively. For many tasks, a single input may correspond to several different correct answers, each of which is acceptable.

  • Some tasks intrinsically require realistic generation of samples from a specific distribution: single-image super-resolution, tasks aimed at creating art, image-to-image translation applications, etc.
(Ledig, et al., 2016)

Variational Autoencoders (VAEs)

Some Background First: Autoencoders

  • Specific type of a neural network designed to learn a lower-dimensional feature representation of unlabeled training data:

    1. Encode the input $\mathbf{x}$ into a compressed and meaningful representation (smaller set of features $\mathbf{z}$)

    2. Decode it back such that the reconstructed input $\hat{\mathbf{x}}$ is as similar as possible to the original one

  • To reconstruct the input data, an $L_2$ loss function is often used,
      e.g. the pixels of the reconstructed data should match the pixels of the original input $\Longrightarrow$ Autoencoders do not use labels for training.
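    A minimal sketch of this encode/decode pipeline (a sketch, assuming PyTorch, flattened 784-dimensional inputs such as MNIST, and illustrative layer sizes):

```python
import torch
import torch.nn as nn

# Encoder compresses x to a low-dimensional code z; decoder reconstructs x_hat.
# Layer sizes are illustrative, assuming flattened 784-dim inputs (e.g. MNIST).
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(64, 784)            # a batch of unlabeled inputs
z = encoder(x)                     # compressed representation
x_hat = decoder(z)                 # reconstruction
loss = ((x_hat - x) ** 2).mean()   # L2 reconstruction loss -- no labels needed
loss.backward()
```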

Some Background First: Autoencoders

  • Autoencoders can reconstruct data and learn features to initialize a supervised model.

  • However, we cannot generate new images from an autoencoder because we don't know the distribution of the latent space $z$.

  • How can we make an autoencoder a generative model?
      $\rightarrow$ By following Variational Bayes (VB)!

    • VAE: Generative models that attempt to describe data generation through a probabilistic distribution.

Latent Variables

  • Latent variables $z$: Hidden variables capturing underlying data characteristics.

  • Example: Generating digits

    • $z$: Represents attributes such as digit identity, stroke width, orientation, abstract stylistic properties, etc.
      $\rightarrow$ Attributes may be correlated

    • We won't determine what each dimension of $z$ encodes, nor explicitly define the dependencies or latent structure between those dimensions.

Generative process:

  • Consider a vector of latent variables $z$ in a high-dimensional space $\mathcal{Z}$, which can be sampled according to a probability density function $P(z)$.

  • Let $f(z; \theta)$ represent a set of deterministic functions parameterized by the vector $\theta$, defined over the space $\mathcal{\Theta}$, such that: \begin{equation} f:\mathcal{Z}\times \mathcal{\Theta} \rightarrow \mathcal{X} \end{equation}

  • Goal: Optimize $\theta$ such that when $z$ is sampled from $P(z)$, with high probability, $f(z; \theta)$ produces outputs similar to the $X$’s in our dataset.

      Example: $f(z; \theta)$ generates realistic digits based on $z$.

  • Formally, we aim to maximize the likelihood of training data: \begin{equation} P(X) = \int P(X|z; \theta)P(z) dz \end{equation}

Key Concepts of VAEs

\begin{equation} P(X) = \int \underbrace{P(X|z; \theta)}_{\mathrm{ Gaussian} \\ \mathrm{distribution}}P(z) dz \end{equation}
\begin{equation} P(X|z; \theta)=\mathcal{N}(X| \underbrace{f(z;\theta)}_{\mathrm{mean}}, \underbrace{\sigma^2 I}_{\mathrm{covariance}}) \end{equation}
$\rightarrow$ If we assume a Gaussian distribution, we can use gradient descent to increase $P(X)$ by adjusting $f(z; \theta)$ to get closer to $X$ for some $z$!

Two problems:
  1. How do we define the latent variables $z$? What information do they represent?

  2. How can we deal with the integral over $z$?

VAEs: How to define the latent variables $z$?

  • VAE approach : Draw $z$ from a simple Normal distribution $\rightarrow$ $\mathcal{N}(0,I)$
  • Any distribution in $d$ dimensions can be generated by taking a set of $d$ variables that are normally distributed and mapping them through a sufficiently complicated function (e.g., a neural network).
  • Example: Remember that \begin{equation} P(X|z;\theta)=\mathcal{N}(X|f(z;\theta),\sigma^2 I) \end{equation} Let $f(z;\theta)$ be a neural network that uses its first $N$ layers to map the $z$ values sampled from the normal distribution to latent values with the desired distributions.

VAEs: How to deal with the integral over $z$?

    \begin{equation} P(X) = \underbrace{ \int P(X|z; \theta)P(z) dz}_{\mathrm{\color{#6699CC}{Intractable} \ to \ compute} \\ \mathrm{for \ every } \ z} \end{equation}
    \begin{equation} P(X) \approx \underbrace{\frac{1}{N}\sum_{i}P(X|z_i)}_{N \ \mathrm{must \ be\ \color{#6699CC}{very} \ \color{#6699CC}{large}} \\ \mathrm{to \ estimate \ P(X) \ accurately}}, \quad z_i \sim P(z) \end{equation}
  • However, for most values of $ z$, $ P(X|z)$ will be nearly zero.

  • VAE's idea: Introduce a new function $ Q(z|X)$, from which we can sample values of $ z$ that are likely to have produced $ X$.
      $\Longrightarrow$ This allows us to compute $ \mathbb{E}_{z \sim Q} P(X|z)$ more easily!
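    A toy 1-D numerical sketch of why naive sampling from the prior fails (the decoder $f$, the noise level, and all numbers here are hypothetical choices for illustration):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D model: decoder f(z) = 10*z, noise sigma = 0.1; estimate P(X=5).
rng = np.random.default_rng(0)
sigma, x_obs = 0.1, 5.0
f = lambda z: 10.0 * z

# Naive Monte Carlo with z ~ P(z) = N(0,1): f(z) is almost never near x_obs,
# so P(X|z) is ~0 for most samples and the estimate is hopelessly noisy.
z = rng.standard_normal(100)
print(norm.pdf(x_obs, loc=f(z), scale=sigma).mean())

# Sampling z near the values that could have produced x_obs (here z ~ 0.5)
# concentrates on the informative region -- the role played by Q(z|X).
# (A proper estimate would reweight by P(z)/Q(z); this only shows where the mass is.)
zq = 0.5 + 0.01 * rng.standard_normal(100)
print(norm.pdf(x_obs, loc=f(zq), scale=sigma).mean())
```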

How can $Q(z)$ help to optimize $P(X)$?

  • Some background first: Kullback-Leibler divergence
      $\rightarrow$ Kullback-Leibler divergence between $P(z|X)$ and an arbitrary $Q(z)$ is defined as: \begin{equation} \mathcal{D}_{KL}[Q(z)\|P(z|X)]= \mathbb{E}_{z \sim Q}[\log{Q(z)}-\log{P(z|X)}] \end{equation}
  • By applying Bayes' theorem: \begin{equation} P(X|z)P(z)=P(z|X)P(X) \end{equation}
    we obtain: \begin{equation} \mathcal{D}_{KL}[Q(z)\|P(z|X)]= \mathbb{E}_{z \sim Q}[\log{Q(z)}-\log{P(X|z)}-\log{P(z)}+\log{P(X)}] \end{equation}
  • $P(X)$ comes out of the expectation: \begin{equation} \mathcal{D}_{KL}[Q(z)\|P(z|X)]= \mathbb{E}_{z \sim Q}[\log{Q(z)}-\log{P(X|z)}-\log{P(z)}]+\underbrace{\log{P(X)}}_{\mathrm{does \ not \ depend \ on \ } z} \end{equation}

How can $Q(z)$ help to optimize $P(X)$?

  • $P(X)$ is factored out of the expectation: \begin{equation} \mathcal{D}_{KL}[Q(z)\|P(z|X)]= \mathbb{E}_{z \sim Q}[\log{Q(z)}-\log{P(X|z)}-\log{P(z)}]+\underbrace{\log{P(X)}}_{\mathrm{does \ not \ depend \ on \ } z} \end{equation}
  • Negating both sides and rearranging, we obtain: \begin{equation} \log{P(X)}-\mathcal{D}_{KL}[Q(z)\|P(z|X)]= \mathbb{E}_{z \sim Q}[\log{P(X|z)}]+\underbrace{\mathbb{E}_{z \sim Q}[\log{P(z)}-\log{Q(z)}]}_{-\mathcal{D}_{KL}[Q(z)\|P(z)]} \end{equation}
  • Since we are interested in inferring $ P(X)$, it makes sense to construct a distribution $ Q(z)$ that depends on $ X$ and minimizes the divergence $ \mathcal{D}_{KL}[Q(z) \| P(z|X)]$, leading to: \begin{equation} \log{P(X)}-\mathcal{D}_{KL}[Q(z|X)\|P(z|X)]= \mathbb{E}_{z \sim Q}[\log{P(X|z)}]- \mathcal{D}_{KL}[Q(z|X)\|P(z)] \end{equation}
    • $\Longrightarrow$ VAE Objective Function

VAE Objective Function

    \begin{equation} \underbrace{\log{P(X)}}_{\mathrm{quantity \ we \ want \ to \ maximize}} -\underbrace{\mathcal{D}_{KL}[Q(z|X)\|P(z|X)]}_{\mathrm{error \ term}}= \mathbb{E}_{z \sim Q}[\log{\color{#996699}{P(X|z)}}]- \underbrace{\mathcal{D}_{KL}[\color{#6699CC}{Q(z|X)}\|P(z)]}_{\mathrm{make \ approximate \ posterior} \\ \mathrm{distribution \ close \ to \ prior}} \end{equation}
    • Encoder-decoder structure:

      • Encoder $\color{#6699CC}{Q(z|X)}$: Maps $X$ to $z$.

      • Decoder $\color{#996699}{P(X|z)}$: Reconstructs the input data $X$ from $z$.

Optimization

  • How can we perform stochastic gradient descent on the right-hand side of: \begin{equation} \log{P(X)} - \mathcal{D}_{KL}[Q(z|X) \| P(z|X)] = \mathbb{E}_{z \sim Q}[\log{P(X|z)}] - \mathcal{D}_{KL}[Q(z|X) \| P(z)] \ ? \end{equation}
    • We usually choose $ Q(z|X) = \mathcal{N}(z|\color{#6699CC}\mu(X;\theta), \color{#996699}\Sigma(X;\theta))$, where:
      • $ \color{#6699CC}\mu$ and $ \color{#996699}\Sigma$ are functions of parameters $ \theta$, which will be learned from data.

    • $ \mathcal{D}_{KL}[Q(z|X) \| P(z)]$ becomes the KL-divergence between two multivariate Gaussian distributions: \begin{equation} \mathcal{D}_{KL}\left[\mathcal{N}(\mu_0, \Sigma_0) \| \mathcal{N}(\mu_1, \Sigma_1)\right] = \frac{1}{2} \left( \mathrm{tr}(\Sigma_1^{-1} \Sigma_0) + (\mu_1 - \mu_0)^\top \Sigma_1^{-1} (\mu_1 - \mu_0) - k + \log\frac{\det\Sigma_1}{\det\Sigma_0} \right) \end{equation}
    • Remember that $ P(z) = \mathcal{N}(z|0, I)$, so we obtain: \begin{equation} \mathcal{D}_{KL}\left[\mathcal{N}(\mu(X), \Sigma(X)) \| \mathcal{N}(0, I)\right] = \frac{1}{2} \left( \mathrm{tr}(\Sigma(X)) + \mu(X)^\top \mu(X) - k - \log \det(\Sigma(X)) \right) \end{equation}
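    This closed form is simple to implement; a sketch for the common diagonal-covariance case (NumPy, with illustrative values):

```python
import numpy as np

def kl_q_prior(mu, sigma_diag):
    """D_KL[ N(mu, diag(sigma_diag)) || N(0, I) ], i.e. the formula above:
    0.5 * ( tr(Sigma) + mu^T mu - k - log det(Sigma) )."""
    k = mu.shape[-1]
    return 0.5 * (sigma_diag.sum() + mu @ mu - k - np.log(sigma_diag).sum())

# Illustrative encoder outputs for a 4-dimensional latent space.
mu = np.array([0.5, -0.2, 0.0, 1.0])          # mu(X)
sigma_diag = np.array([0.9, 1.1, 1.0, 0.5])   # diagonal of Sigma(X)
print(kl_q_prior(mu, sigma_diag))             # 0 iff mu = 0 and Sigma = I
```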

Optimization

  • How can we perform stochastic gradient descent on the right-hand side of: \begin{equation} \log{P(X)}-\mathcal{D}_{KL}[Q(z|X)\|P(z|X)]= \mathbb{E}_{z \sim Q}[\log{P(X|z)}]- \mathcal{D}_{KL}[Q(z|X)\|P(z)] \ ? \end{equation}
  • To optimize the function, we will do stochastic gradient descent over different values of $X$ sampled from a dataset D: \begin{equation} \mathbb{E}_{X \sim D} \left[ \log P(X) - \mathcal{D}_{KL}\left[ Q(z|X) \| P(z|X) \right] \right] = \mathbb{E}_{X \sim D} \left[ \mathbb{E}_{z \sim Q(z|X)} \left[ \log P(X|z) \right] - \mathcal{D}_{KL}\left[ Q(z|X) \| P(z) \right] \right] \end{equation}
  • Taking the gradient and moving the gradient symbol inside the expectation, we obtain: \begin{equation} \nabla \mathbb{E}_{X \sim D} \left[ \mathbb{E}_{z \sim Q(z|X)} \left[ \log P(X|z) \right] - \mathcal{D}_{KL}\left[ Q(z|X) \| P(z) \right] \right] = \mathbb{E}_{X \sim D} \left[ \mathbb{E}_{z \sim Q(z|X)} \left[ \nabla \log P(X|z) \right] - \nabla \mathcal{D}_{KL}\left[ Q(z|X) \| P(z) \right] \right]. \end{equation}
  • Estimating $\mathbb{E}_{z \sim Q}[\log{P(X|z)}]$ through sampling is computationally expensive:

      $\rightarrow$ Instead, we approximate $\mathbb{E}_{z \sim Q}[\log{P(X|z)}]$ with $\log{P(X|z)}$ for a single sample $z \sim Q(z|X)$.

  • The final gradient is: \begin{equation} \nabla \mathbb{E}_{X \sim D} \left[ \mathbb{E}_{z \sim Q(z|X)} \left[ \log P(X|z) \right] - \mathcal{D}_{KL}\left[ Q(z|X) \| P(z) \right] \right] \approx \nabla \left[ \log{P(X|z)}-\mathcal{D}_{KL}[Q(z|X)\|P(z)] \right] \end{equation} $\Longrightarrow$ Averaging the gradient of this expression over arbitrarily many samples of $X$ and $z$ converges to the gradient of the full equation.

Reparameterization Trick

  • Sampling $z$ from $Q(z|X)$ is a non-continuous operation:

    • Sampling is non-differentiable $\Longrightarrow$ We cannot backpropagate the error.

  • Solution: Reparameterize as: \begin{equation} \mathbb{E}_{X \sim D} \left[ \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \left[ \log P(X | z = \mu(X) + \Sigma^{1/2}(X) \cdot \epsilon) \right] - \mathcal{D}_{KL}\left[ Q(z|X) \| P(z) \right] \right]. \end{equation} where $\mu(X)$ and $\Sigma(X)$ are the mean and covariance of $Q(z|X)$.

    • i.e. We sample from $\mathcal{N}(\mu(X), \Sigma(X))$ by first sampling $\epsilon \sim \mathcal{N}(0, I)$, then computing $ z = \mu(X) + \Sigma^{1/2}(X) \cdot \epsilon$

      $\Longrightarrow$ Enables gradient-based optimization.
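    A minimal sketch of the trick (PyTorch; the diagonal covariance parameterized by log-variances is an assumption, though a common one):

```python
import torch

# Encoder outputs for a batch of 64 items with a 32-dim latent space
# (zeros here are placeholders for real network outputs).
mu = torch.zeros(64, 32, requires_grad=True)
log_var = torch.zeros(64, 32, requires_grad=True)

# Instead of sampling z ~ N(mu, Sigma) directly (non-differentiable),
# isolate the randomness in eps ~ N(0, I) and compute a deterministic,
# differentiable transform: z = mu + Sigma^{1/2} * eps.
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * log_var) * eps   # Sigma^{1/2} = exp(log_var / 2)

z.sum().backward()                        # gradients now reach mu and log_var
print(mu.grad is not None, log_var.grad is not None)
```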

Summary

  • VAEs are powerful generative models that utilize a latent variable framework.
  • Practical applications include data synthesis, compression, and more.
  • VAEs can be viewed as autoencoders consisting of two neural networks.
  • VAEs define two distributions:
    • Encoder: $Q_{\theta}(z|X)$
    • Decoder: $P_{\varphi}(X|z)$

    Here, $\theta$ and $\varphi$ are the parameters of the neural networks.
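    Putting the pieces together, a compact sketch of one VAE training step (PyTorch; single linear layers stand in for the two networks, and a Gaussian decoder with fixed $\sigma$ is assumed, so $-\log P(X|z)$ reduces to an $L_2$ term up to constants):

```python
import torch
import torch.nn as nn

enc = nn.Linear(784, 2 * 32)   # encoder Q(z|X): outputs mu and log-variance
dec = nn.Linear(32, 784)       # decoder P(X|z): outputs the mean f(z)

x = torch.rand(64, 784)                                     # a data batch
mu, log_var = enc(x).chunk(2, dim=-1)
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)    # reparameterization
x_hat = dec(z)

recon = ((x_hat - x) ** 2).sum(dim=-1)                        # -log P(X|z) up to constants
kl = 0.5 * (log_var.exp() + mu**2 - 1 - log_var).sum(dim=-1)  # D_KL[Q(z|X) || N(0,I)]
loss = (recon + kl).mean()                                    # negative ELBO over the batch
loss.backward()
```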

Generative Adversarial Nets (GANs)

Generative Adversarial Networks (GANs)

  • Generative: $\rightarrow$ Generative Models:
    Learn the distribution from which the data originates, similar to Variational Autoencoders (VAEs).
  • Adversarial: $\rightarrow$ Adversarial Training:
    GANs consist of two competing networks (adversaries) that try to outperform each other.
  • Networks: $\rightarrow$ Neural Networks:
    GANs are composed of two neural networks whose goal is to generate data matching an unlabeled data distribution.

What can GANs do?

  • Generate examples for image datasets
  • Perform image-to-image translation
  • Synthesize images from text descriptions
  • Single Image Super Resolution
  • Generate photographs of human faces
  • Create realistic photographs
  • Predict video sequences
  • Generate 3D objects
  • And much more... (Check out this blog for more papers and impressive applications.)

Generate Examples for Image Datasets

    In the original paper "Generative Adversarial Networks" by Ian Goodfellow et al., GANs were used to generate new examples for:
    • The MNIST handwritten digit dataset
    • The CIFAR-10 small object photograph dataset
    • The Toronto Face Database
(Goodfellow, et al. 2014)

Progress of GANs

Progress of GANs over the years
(Brundage, et al. 2020)

How to Train a GAN?


  • Which network should we train first?
    • The Discriminator!

  • We can think of the Discriminator as a binary classifier with two classes: $\rightarrow$ Real and Fake.

  • But what data do we use to train it?
      - Real class data $\rightarrow$ the training dataset.
      - Fake class data $\rightarrow$ samples produced by the generator.

  • What's next? Train the Generator!

GAN training scheme

But how?
  • Step 1: Train the Discriminator using the current output of the Generator.
  • Step 2: Train the Generator to fool the Discriminator.
What is our training objective?
  • We aim to generate images from the Generator that are incorrectly classified by the Discriminator!

What is our training objective?

  • Let the discriminator be $D(\textbf{x}, \theta_d)$
    $\rightarrow$ a neural network with parameters $\theta_d$ that outputs a single scalar.
  • Let the generator be $G(\textbf{z}, \theta_g)$
    $\rightarrow$ a differentiable function represented by a neural network with parameters $\theta_g$.
  • $p_d$: actual data distribution
  • $p_g$: generator’s distribution
  • $p_z(\textbf{z})$: chosen prior in latent vector space
  • $D(\textbf{x})$: represents the probability that $\textbf{x}$ came from the data rather than $p_g$

What should the discriminator do?

If $\textbf{x} \sim p_d$, what should happen to the value of $D(\textbf{x})$?
  • $D(\textbf{x})$ should be maximized:
    $\rightarrow$ $\log{D(\textbf{x})}$ maximized
    $\rightarrow$ $\mathbb{E}_{\textbf{x} \sim p_d}[\log{D(\textbf{x})}]$ maximized
If $\textbf{x} = G(\textbf{z})$, so if $\textbf{x} \sim p_g$, what should happen to the value of $D(\textbf{x})$?
  • $D(\textbf{x})$ should be minimized:
    $\rightarrow$ $\log{D(\textbf{x})}$ minimized
    $\rightarrow$ $\log{(1 - D(\textbf{x}))}$ maximized
    $\rightarrow$ $\mathbb{E}_{\textbf{x} \sim p_g}[\log{(1-D(\textbf{x}))}]$ maximized
    $\rightarrow$ $\mathbb{E}_{\textbf{z} \sim p_z}[\log{(1 - D(G(\textbf{z})))}]$ maximized

Overall, the discriminator wants this sum to be maximized: \begin{equation} V(D,G) = \mathbb{E}_{\textbf{x} \sim p_d}\underbrace{[\log(D(\textbf{x}))]}_{\text{chance of real data} \\ \text{ being called real}} + \mathbb{E}_{\textbf{z} \sim p_z}\underbrace{[\log(1 - D(G(\textbf{z})))]}_{\text{chance of fake data}\\ \text{being called fake}} \end{equation}

What is our training objective?

  • Let the discriminator be $D(\textbf{x}, \theta_d)$
    $\rightarrow$ a neural network with parameters $\theta_d$ that outputs a single scalar.
  • Let the generator be $G(\textbf{z}, \theta_g)$
    $\rightarrow$ a differentiable function represented by a neural network with parameters $\theta_g$.
  • $p_d$: actual data distribution
  • $p_g$: generator’s distribution
  • $p_z(\textbf{z})$: chosen prior in the latent vector space
  • $D(\textbf{x})$: represents the probability that $\textbf{x}$ came from the data rather than $p_g$

What should the generator do?

If $\textbf{x} = G(\textbf{z})$, so $\textbf{x} \sim p_g$, what should happen to the value of $D(\textbf{x})$?
  • $D(\textbf{x})$ should be maximized:
    $\rightarrow$ $\log{D(\textbf{x})}$ should be maximized
    $\rightarrow$ $\log{(1 - D(\textbf{x}))}$ should be minimized
    $\rightarrow$ $\mathbb{E}_{\textbf{x} \sim p_g}[\log{(1-D(\textbf{x}))}]$ should be minimized
    $\rightarrow$ $\mathbb{E}_{\textbf{z} \sim p_z}[\log{(1 - D(G(\textbf{z})))}]$ should be minimized

Since the generator has no control over $\mathbb{E}_{\textbf{x} \sim p_d}[\log(D(\textbf{x}))]$, overall, the generator wants this sum to be minimized: \begin{equation} V(D,G) = \mathbb{E}_{\textbf{x} \sim p_d}\underbrace{[\log(D(\textbf{x}))]}_{\text{chance of real data } \\ \text{being called real}} + \mathbb{E}_{\textbf{z} \sim p_z}\underbrace{[\log(1-D(G(\textbf{z})))]}_{\text{chance of fake data} \\ \text{ being called fake}} \end{equation}

What is our training objective?

  • The discriminator maximizes the sum: \begin{equation} V(D, G) = \mathbb{E}_{\textbf{x} \sim p_d}\underbrace{[\log(D(\textbf{x}))]}_{\text{chance of real data } \\ \text{being called real}} + \mathbb{E}_{\textbf{z} \sim p_z}\underbrace{[\log(1 - D(G(\textbf{z})))]}_{\text{chance of fake data} \\ \text{ being called fake}} \end{equation}
  • The generator minimizes the sum: \begin{equation} V(D, G) = \mathbb{E}_{\textbf{x} \sim p_d}\underbrace{[\log(D(\textbf{x}))]}_{\text{chance of real data } \\ \text{being called real}} + \mathbb{E}_{\textbf{z} \sim p_z}\underbrace{[\log(1 - D(G(\textbf{z})))]}_{\text{chance of fake data} \\ \text{ being called fake}} \end{equation}
  • The original GAN formulation is a min-max game: \begin{equation} \min_{G}\max_{D}V(D, G) = \mathbb{E}_{\textbf{x} \sim p_d}[\log(D(\textbf{x}))] + \mathbb{E}_{\textbf{z} \sim p_z}[\log(1 - D(G(\textbf{z})))] \end{equation}
  • $\rightarrow$ The discriminator aims for $D(\textbf{x}) = 1$ (classifying real data as real) and $D(G(\textbf{z})) = 0$ (classifying fake data as fake).
    $\rightarrow$ The generator aims for $D(G(\textbf{z})) = 1$ (making generated data indistinguishable from real data).
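    A minimal sketch of this alternating min-max training (PyTorch; architectures and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())                      # discriminator
G = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))  # generator
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

x_real = torch.rand(64, 784)   # stand-in for a batch of real data
z = torch.randn(64, 32)        # z ~ p_z, the latent prior

# Step 1 -- discriminator: maximize log D(x) + log(1 - D(G(z))).
# detach() keeps this update from flowing into the generator.
d_loss = -(torch.log(D(x_real)) + torch.log(1 - D(G(z).detach()))).mean()
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Step 2 -- generator: minimize log(1 - D(G(z))).
# (In practice one often maximizes log D(G(z)) instead -- the
# non-saturating loss suggested in the original paper.)
g_loss = torch.log(1 - D(G(z))).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```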

What Does the Optimal Discriminator Look Like?

  • Objective:

    \begin{equation} \min_{G}\max_{D}V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_d}[\log(D(\textbf{x}))] + \mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D(G(\textbf{z})))] \end{equation}
  • When $G$ is fixed, the optimal discriminator $D$ is obtained by maximizing this formula: \begin{equation} f := \int_X \left[ p_d(\textbf{x}) \log D(\textbf{x}) +p_g(\textbf{x})\log(1 - D(\textbf{x})) \right] dx \end{equation}
To find the critical points, we take the derivative of $f$ with respect to $D(\textbf{x})$: \begin{align*} \frac{\partial f}{\partial D(\textbf{x})} &= \frac{p_d(\textbf{x})}{D(\textbf{x})} - \frac{p_g(\textbf{x})}{1 - D(\textbf{x})} = 0 \\ \frac{p_d(\textbf{x})}{D(\textbf{x})} &= \frac{p_g(\textbf{x})}{1 - D(\textbf{x})} \\ (1 - D(\textbf{x})) p_d(\textbf{x}) &= D(\textbf{x})p_g(\textbf{x}) \\ D(\textbf{x}) &= \frac{p_d(\textbf{x})}{p_g(\textbf{x}) + p_d(\textbf{x})} \end{align*}
We compute the second derivative of $ V(G, D)$ with respect to $ D(\textbf{x})$: \begin{equation} \frac{\partial^2 V}{\partial D(\textbf{x})^2} = -\underbrace{\frac{p_d(\textbf{x})}{D(\textbf{x})^2}}_{>0} - \underbrace{\frac{p_g(\textbf{x})}{(1 - D(\textbf{x}))^2}}_{>0} < 0. \end{equation} This confirms that $V(G, D)$ is $\textbf{concave}$ with respect to $D(\textbf{x})$.

What Does the Optimal Generator Look Like?

  • Objective:

    \begin{equation} \min_{G}\max_{D}V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_d}[\log(D(\textbf{x}))] + \mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D(G(\textbf{z})))] \end{equation}
  • By substituting the optimal discriminator back into the objective function, the optimal generator $G$ is obtained by minimizing the following formula: \begin{equation} \mathbb{E}_{\mathbf{x} \sim p_d} \log D(\mathbf{x}) + \mathbb{E}_{\mathbf{x} \sim p_g} \log(1 - D(\mathbf{x})) = \mathbb{E}_{\mathbf{x} \sim p_d} \left[ \log \frac{p_d(\mathbf{x})}{p_d(\mathbf{x}) + p_g(\mathbf{x})} \right] + \mathbb{E}_{\mathbf{x} \sim p_g} \left[ \log \frac{p_g(\mathbf{x})}{p_d(\mathbf{x}) + p_g(\mathbf{x})} \right] \end{equation}
  • We introduce a mixture distribution $p_m$ as the average of the real and generated data distributions: \begin{equation} p_m(\mathbf{x}) = \frac{1}{2} \left( p_d(\mathbf{x}) + p_g(\mathbf{x}) \right) \end{equation}
  • Now, the loss function can be expressed as: \begin{equation} \mathbb{E}_{\mathbf{x} \sim p_d} \left[ \log \frac{p_d(\mathbf{x})}{2p_m(\mathbf{x})} \right] + \mathbb{E}_{\mathbf{x} \sim p_g} \left[ \log \frac{p_g(\mathbf{x})}{2p_m(\mathbf{x})} \right] \end{equation}

Recognizing Jensen-Shannon Divergence

\begin{equation} \mathbb{E}_{\mathbf{x} \sim p_d} \left[ \log \frac{p_d(\mathbf{x})}{2p_m(\mathbf{x})} \right] + \mathbb{E}_{\mathbf{x} \sim p_g} \left[ \log \frac{p_g(\mathbf{x})}{2p_m(\mathbf{x})} \right] \end{equation}
  • Recalling that the Kullback-Leibler divergence between two distributions $ p$ and $ q$ is given by: \begin{equation} \mathcal{D}_{KL}(p \parallel q) = \mathbb{E}_{x \sim p} \left[ \log \frac{p(x)}{q(x)} \right] \end{equation}
  • The new form of the loss function: \begin{equation} -\log{2}+ \mathbb{E}_{\mathbf{x} \sim p_d} \left[ \log \frac{p_d(\mathbf{x})}{p_m(\mathbf{x})} \right] -\log{2} + \mathbb{E}_{\mathbf{x} \sim p_g} \left[ \log \frac{p_g(\mathbf{x})}{p_m(\mathbf{x})} \right]= -2\log{2}+ \mathcal{D}_{KL}(p_d \parallel p_m) + \mathcal{D}_{KL}(p_g \parallel p_m) \end{equation}
  • This expression closely resembles the definition of the Jensen-Shannon Divergence ($\mathcal{D}_{JS}$) between two probability distributions $ p_d$ and $ p_g$: \begin{equation} \mathcal{D}_{JS}(p_d \parallel p_g) = \frac{1}{2} \left( \mathcal{D}_{KL}(p_d \parallel p_m) + \mathcal{D}_{KL}(p_g \parallel p_m) \right) \end{equation}
  • The final expression becomes: \begin{equation} L = 2 \cdot \mathcal{D}_{JS}(p_d \parallel p_g) - \log 4 \end{equation}
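    A small numerical check of this identity on toy discrete distributions (a sketch, not part of the original derivation):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """D_JS(p || q) = 0.5 * ( D_KL(p || m) + D_KL(q || m) ), m = (p + q) / 2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * (kl(p, m) + kl(q, m))

p = np.array([0.25, 0.25, 0.25, 0.25])
q = np.array([0.10, 0.20, 0.30, 0.40])
print(js_divergence(p, p))                  # ~0: D_JS vanishes iff p_d = p_g ...
print(2 * js_divergence(p, q) - np.log(4))  # ... where the loss attains -log 4
```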

What is the optimal generator?

  • The final expression becomes: \begin{equation} 2 \cdot \mathcal{D}_{JS}(\color{#6699CC}{p_d} \parallel \color{#996699}{p_g}) - \log 4 \end{equation}


  • If we substitute the optimal discriminator we found back into the objective function, we see that the optimal generator minimizes the $\mathcal{D}_{JS}$ between the real and the generated distributions.


      $\rightarrow$ In other words, it tries to make the two distributions similar.

Min-Max Optimization

  • Both the discriminator and the generator need to be trained simultaneously.

    • If the discriminator is undertrained, it provides suboptimal feedback to the generator.


    • If the discriminator is overtrained, its output saturates and the generator receives vanishing gradients, leaving no local feedback for marginal improvements.

Diffusion models

Denoising Diffusion Probabilistic Models

  • The original paper "Denoising Diffusion Probabilistic Models", published by Jonathan Ho and collaborators in 2020, showcased the capabilities of diffusion models, achieving results comparable to GANs.

    • The first generation of denoising diffusion models achieved performance similar to what GANs could do, despite GANs benefiting from years of additional development.

    • Diffusion models accomplished this with a relatively limited development timeline.
(Ho, et al. 2020)

Example

Prompt:
A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

Diffusion models

The idea: Transform simple noise distributions into complex data distributions
  • We take a picture:

    • Add a bit of Gaussian noise to it.

    • If we repeat this process...

    • And again...

    • Eventually, we get an unrecognizable picture of pure noise.

Diffusion Models


    What if we could figure out how to undo this process?

  • Start with a noisy image, gradually remove the noise, and end up with a coherent image.
  • $\Longrightarrow$ This is the basic idea behind diffusion models

  • The training of the Diffusion Model can be divided into two steps:

    1. Forward or Diffusion Process (fixed) $\rightarrow$ Add noise to the image

    2. Reverse Diffusion Process (generative) $\rightarrow$ Remove noise from the image

Forward Diffusion Process

  • We start by sampling from our target distribution (the training set) $ x_0$. The forward diffusion process gradually adds noise to the image over $ T$ time steps: \begin{equation} x_0 \to x_1 \to x_2 \to \dots \to x_T \end{equation}
  • For continuous data, each transition is parameterized by a Gaussian conditional distribution: \begin{equation} q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}) \end{equation} where $ \sqrt{1 - \beta_t} \mathbf{x}_{t-1}$ is the mean and $ \beta_t \mathbf{I}$ is the covariance. Here, $ \mathbf{I}$ is the identity matrix and $ \beta_t \in (0, 1)$ is the noise level.
  • The $ \beta_t$ parameter increases over time: \begin{equation} \beta_1 < \beta_2 < \beta_3 < \dots < \beta_T \end{equation} As $ t$ approaches $ T$, the data $ x_t$ becomes increasingly noisy, with the mean $ \sqrt{1 - \beta_t} \mathbf{x}_{t-1}$ deviating more from the previous data $ x_{t-1}$, and the variance $ \beta_t \mathbf{I}$ growing.

Forward Diffusion Process

  • The forward process $ q$ takes the form of a Markov chain, where the distribution at a particular time step depends only on the sample from the immediately previous time step: \begin{equation} q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_{t-2}, \dots, \mathbf{x}_0) = q(\mathbf{x}_t | \mathbf{x}_{t-1}) \end{equation}
  • Therefore, the joint distribution of the forward process is: \begin{equation} q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^T q(\mathbf{x}_t \mid \mathbf{x}_{t-1}), \end{equation} where $ \mathbf{x}_{1:T}$ represents all latent states from $ \mathbf{x}_1$ to $ \mathbf{x}_T$, and $ \mathbf{x}_0$ is the initial state, i.e., the original data.
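    A minimal sketch of the forward process (NumPy; the linear $\beta_t$ schedule is one common, illustrative choice):

```python
import numpy as np

# Repeatedly apply q(x_t | x_{t-1}) = N( sqrt(1 - beta_t) x_{t-1}, beta_t I ).
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # beta_1 < ... < beta_T

x = rng.uniform(size=784)                # stand-in for a data sample x_0
for beta_t in betas:
    x = np.sqrt(1 - beta_t) * x + np.sqrt(beta_t) * rng.standard_normal(x.shape)

# After T steps, x_T is approximately distributed as N(0, I).
print(x.mean(), x.std())
```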

Generative Backward Process

  • The generative backward process is the reverse process, gradually generating data out of noise: \begin{equation} \mathbf{x}_T \rightarrow \mathbf{x}_{T-1} \rightarrow \cdots \rightarrow \mathbf{x}_0. \end{equation}
  • Since the forward process uses Gaussian distributions, the backward process does as well. A machine learning model with parameters $\theta$ is used to model the backward process: \begin{equation} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mathbf{\mu}_\theta(\mathbf{x}_t; t), \mathbf{\Sigma}_\theta(\mathbf{x}_t; t)), \end{equation} where $\mathbf{\mu}_\theta(\mathbf{x}_t; t) \in \mathbb{R}^d$ and $\mathbf{\Sigma}_\theta(\mathbf{x}_t; t) \in \mathbb{R}^{d \times d}$ represent the mean and covariance at time step $t$, respectively.
  • The covariance is typically assumed to be isotropic and diagonal: \begin{equation} \mathbf{\Sigma}_\theta(\mathbf{x}_t; t) = \sigma_t^2 \mathbf{I}. \end{equation}

  • $\rightarrow$ The goal of the diffusion model is to learn the parameters - the mean and variance - of this Gaussian distribution.

Generative Backward Process

  • The generative process is also a Markov chain, satisfying the Markov property: \begin{equation} q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_{t+1}, \ldots, \mathbf{x}_T) = q(\mathbf{x}_{t-1} \mid \mathbf{x}_t). \end{equation}
  • The joint distribution of the generative process is: \begin{equation} p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t), \end{equation} where $\mathbf{x}_{0:T}$ represents all states from $\mathbf{x}_T$ to $\mathbf{x}_0$.
  • Typically, the noise $p(\mathbf{x}_T)$ is modeled as a standard Gaussian: \begin{equation} p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0}, \mathbf{I}). \end{equation}
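    A minimal ancestral-sampling sketch of the backward process (NumPy; `mu_theta` is a hypothetical placeholder for the trained network, and $\sigma_t^2 = \beta_t$ is one common, illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
sigmas = np.sqrt(betas)              # sigma_t^2 = beta_t (illustrative)

def mu_theta(x_t, t):
    return x_t                       # placeholder: a real model predicts the mean

# Start from pure noise x_T ~ N(0, I) and denoise step by step.
x = rng.standard_normal(784)
for t in range(T - 1, -1, -1):
    noise = rng.standard_normal(x.shape) if t > 0 else 0.0  # no noise at the last step
    x = mu_theta(x, t) + sigmas[t] * noise                  # x_{t-1} ~ p(x_{t-1} | x_t)
```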

Objective function

  • VAEs use a latent variable $\mathbf{z}$ to model observed data $\mathbf{x}$. Similarly, diffusion models use latent variables $\{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T\}$ for observed data $\mathbf{x}_0$. The likelihood of data in diffusion models is: \begin{equation} p_\theta(x_0) = \int p_\theta(x_0, x_{1:T}) \, dx_{1:T}. \end{equation}

  • As in variational inference, where the Evidence Lower Bound (ELBO) approximates the likelihood, diffusion models maximize the variational lower bound: \begin{equation} \mathcal{L} := \mathbb{E}_{q(x_{1:T}|x_0)} \big[ \log p_\theta(x_0|x_{1:T}) \big] - \mathcal{D}_{KL} \big( q(x_{1:T}|x_0) \| p_\theta(x_{1:T}) \big) \leq \log p_\theta(x_0). \end{equation}

  • After some mathematical rearrangement, we obtain the variational lower bound for diffusion models (see Appendix A of Ho et al. (2020) for the full derivation): \begin{equation} \mathcal{L} = - \mathcal{D}_{KL} \big( q(x_T|x_0) \| p_\theta(x_T) \big) \quad - \sum_{t=2}^T \mathcal{D}_{KL} \big( q(x_{t-1}|x_t, x_0) \| p_\theta(x_{t-1}|x_t) \big) \quad + \mathbb{E}_{q(x_{1:T}|x_0)} \big[ \log p_\theta(x_0|x_1) \big]. \end{equation}

Objective function

\begin{equation} \mathcal{L} =\underbrace{\color{#996699}{\mathbb{E}_{q(x_{1:T}|x_0)} \big[ \log p_\theta(x_0|x_1) \big]} - \color{#669900}{\mathcal{D}_{KL} \big( q(x_T|x_0) \| p_\theta(x_T) \big)}}_{\text{original ELBO equation}} - \color{#FFAA7F}{\sum_{t=2}^T \mathcal{D}_{KL} \big( q(x_{t-1}|x_t, x_0) \| p_\theta(x_{t-1}|x_t) \big)}. \end{equation}
  • Reconstruction: How well we can reconstruct our data point from the least noisy version of our data

  • Prior matching: Moving the posterior $q(x_T|x_0)$ closer to the prior $ p_\theta(x_T)$ on the final noisy step

  • Denoising: Moving the backward distribution $p_\theta(x_{t-1}|x_t)$ to become closer to the ground truth denoising distribution $q(x_{t-1}|x_t, x_0)$


  • $ \color{#FFAA7F}{q(x_{t-1} \mid x_t, x_0)}$ is tractable and can be calculated exactly without any approximation. Defining $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$: \begin{equation} q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1}; \mu_t, \Sigma_t \mathbf{I}) \end{equation} \begin{equation} \mu_t = \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}}\, (1 - \alpha_t)\, x_0}{1 - \bar{\alpha}_t}, \quad \Sigma_t = \frac{(1 - \alpha_t) (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \end{equation}
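    These posterior parameters are straightforward to compute from the noise schedule; a sketch (NumPy, reusing an illustrative linear $\beta_t$ schedule):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # betas[t-1] is beta_t (t is 1-based)
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{s<=t} alpha_s

def posterior_params(x_t, x_0, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from the formula above."""
    a_t, ab_t = alphas[t - 1], alpha_bars[t - 1]
    ab_prev = alpha_bars[t - 2] if t > 1 else 1.0  # alpha_bar_0 = 1 by convention
    mu = (np.sqrt(a_t) * (1 - ab_prev) * x_t
          + np.sqrt(ab_prev) * (1 - a_t) * x_0) / (1 - ab_t)
    var = (1 - a_t) * (1 - ab_prev) / (1 - ab_t)
    return mu, var   # the targets that the denoising KL terms pull p_theta towards
```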

References