This post is mainly based on

Motivation

Consider the Bayesian inference problem where we assume the model $\operatorname{P}(x=D|z)$, where $z = (z_1, z_2, \ldots, z_D)$ are the latent parameters we want to estimate. We observe the data $x$ as $D$; with a slight abuse of notation, $D$ also denotes the dimension of the latent variables. By Bayes’ theorem,

\[\operatorname{P}(z|x=D) = \frac{\operatorname{P}(x=D|z) \operatorname{P}(z)}{\operatorname{P}(x=D)}\]

The problem here is that the marginal $\operatorname{P}(x=D)$ is intractable:

\[\operatorname{P}(x) = \int_{z_1, z_2, \ldots, z_D} \operatorname{P}(x,z) \, \mathrm{d}z_1 \, \mathrm{d}z_2 \cdots \mathrm{d}z_D\]

To solve this intractable $\operatorname{P}(x=D)$ problem, Variational Inference (VI) uses a surrogate distribution $q(z)$ to approximate the target distribution $\operatorname{P}(z|x=D)$. The similarity between the surrogate and the target distribution is measured by the KL-Divergence. We will see that by minimizing the KL-Divergence, we can avoid integrating over the intractable $\operatorname{P}(x=D)$. Instead, we will maximize a quantity called the Evidence Lower Bound (ELBO).

The term Variational in VI comes from the Calculus of Variations. The surrogate $q(z)$ belongs to a family of simple distributions $\mathcal{Q}$, and changing $q(z)$ changes $\operatorname{KL}(q(z) || \operatorname{P}(z|x=D))$. A mapping from a function $q(z)$ to a real number, such as $q(z) \mapsto \operatorname{KL}(q(z) || \operatorname{P}(z|x=D))$, is called a functional, and the Calculus of Variations studies how to find maxima and minima of functionals.

Evidence Lower Bound

We want to approximate $p(z|D)$ by $q(z)$, so the optimization problem is as follows:

\[q^*(z) = \operatorname{argmin}_{q(z) \in \mathcal{Q}} \operatorname{KL}( q(z) || p(z|D) )\]

By definition of KL-divergence,

\[\begin{align} \operatorname{KL}( q(z) || p(z|D) ) &= \int_{z} q(z) \log \frac{q(z)}{ p(z|D) } \mathrm{d}z \\ &= \int_{z} q(z) \log \frac{q(z) p(D)}{ p(z, D) } \mathrm{d}z \\ &= \int_{z} q(z) \log \frac{q(z)}{p(z, D)} \mathrm{d}z + \int_{z} q(z) \log p(D) \mathrm{d}z \\ &= \mathbb{E}_{z \sim q(z)}\left[ \log \frac{q(z)}{p(z,D)} \right] + \mathbb{E}_{z \sim q(z)}[ \log p(D) ]\\ &= -\mathbb{E}_{z \sim q(z)}\left[ \log \frac{p(z,D)}{q(z)} \right] + \log p(D)\\ &= -\mathcal{L}(q) + \log p(D) \end{align}\]

Note that we write the integral over $\int_{z_1, z_2, \ldots, z_D}$ as $\int_z$. The second expectation reduces to $\log p(D)$ because $\log p(D)$ is a constant with respect to $z$ and $q(z)$ integrates to one. Another way to view this: $\log p(D) = \log \int_z p(D|z)p(z) \mathrm{d}z$ depends only on our choice of model and prior; changing $q(z)$ does not affect $\log p(D)$. This is our first important finding: we can avoid integrating over the intractable $\operatorname{P}(x=D)$.

We define the Evidence Lower Bound (ELBO) as $\mathcal{L}(q) = \mathbb{E}_{z \sim q(z)}\left[ \log \frac{p(z,D)}{q(z)} \right]$. We can compute the ELBO (a minimal sketch follows this list) since we know:

  • the joint $p(z,D)$
  • the surrogate $q(z)$
  • how to sample from the surrogate $q(z)$
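
To make this concrete, here is a minimal Monte Carlo sketch of the ELBO (my own toy example, not taken from the referenced material). The model, the observed value `x_obs`, the surrogate parameters, and the helper names `log_joint` and `elbo_estimate` are all hypothetical choices: prior $z \sim N(0, 1)$, likelihood $x | z \sim N(z, 1)$, and a Gaussian surrogate $q(z) = N(m, s^2)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model (hypothetical, for illustration only):
#   prior      z ~ N(0, 1)
#   likelihood x | z ~ N(z, 1)
# A single observed data point plays the role of D in the post.
x_obs = 1.5

def log_joint(z):
    """log p(z, D) = log p(D | z) + log p(z)."""
    return norm.logpdf(x_obs, loc=z, scale=1.0) + norm.logpdf(z, loc=0.0, scale=1.0)

def elbo_estimate(m, s, n_samples=100_000):
    """Monte Carlo estimate of L(q) = E_{z ~ q}[log p(z, D) - log q(z)]."""
    z = rng.normal(m, s, size=n_samples)   # sample from the surrogate q(z) = N(m, s^2)
    return np.mean(log_joint(z) - norm.logpdf(z, loc=m, scale=s))

# The ELBO never exceeds the true log evidence log p(D) = log N(x_obs; 0, 2).
log_evidence = norm.logpdf(x_obs, loc=0.0, scale=np.sqrt(2.0))
print("ELBO estimate:", elbo_estimate(m=0.5, s=1.0))
print("log p(D)     :", log_evidence)
```

Because this toy model is conjugate, the exact evidence is available, and the Monte Carlo ELBO stays below it for any choice of $m$ and $s$.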

The name ELBO comes from:

  • We call the log of the marginal likelihood of the data, $\log p(D)$, the “evidence”
  • $p(D) \in [0,1]$, therefore $\log p(D)$ is a non-positive quantity
  • The KL-divergence is non-negative, therefore $-\mathcal{L}(q) + \log p(D) \geq 0$

Hence $\mathcal{L}(q) \leq \log p(D)$, i.e. $\mathcal{L}(q)$ is a lower bound on the “evidence”. Our second important finding is that the two optimization problems below are equivalent:

\[\operatorname{argmin}_{q(z) \in \mathcal{Q}} \operatorname{KL}( q(z) || p(z|D) ) = \operatorname{argmax}_{q(z) \in \mathcal{Q}} \mathcal{L}(q)\]
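
As a quick sanity check on this equivalence, the same toy conjugate Gaussian model from the sketch above has everything in closed form: the posterior is $N(x/2, 1/2)$, the evidence is $p(D) = N(x; 0, 2)$, and the ELBO of a Gaussian surrogate $q(z) = N(m, s^2)$ can be written out analytically. The numbers below are illustrative only; the point is that $\operatorname{KL}(q(z) || p(z|D))$ and $\log p(D) - \mathcal{L}(q)$ coincide for any $(m, s)$, so minimizing one is the same as maximizing the other.

```python
import numpy as np

# Same hypothetical model as in the previous sketch:
#   prior z ~ N(0, 1), likelihood x | z ~ N(z, 1), surrogate q(z) = N(m, s^2).
# Exact posterior: p(z|D) = N(x/2, 1/2); exact evidence: p(D) = N(x; 0, 2).
x, m, s = 1.5, 0.5, 1.0

# Closed-form ELBO: E_q[log p(x|z)] + E_q[log p(z)] + entropy of q.
elbo = (-0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
        - 0.5 * np.log(2 * np.pi) - 0.5 * (m ** 2 + s ** 2)
        + 0.5 * np.log(2 * np.pi * np.e * s ** 2))

# Closed-form evidence and KL between the Gaussian surrogate and the Gaussian posterior.
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / (2 * 2.0)
mu_post, var_post = x / 2.0, 0.5
kl = (np.log(np.sqrt(var_post) / s)
      + (s ** 2 + (m - mu_post) ** 2) / (2 * var_post) - 0.5)

# The two quantities agree for any (m, s), so argmin KL and argmax ELBO coincide.
print("KL(q || p(z|D))    :", kl)
print("log p(D) - ELBO(q) :", log_evidence - elbo)
```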

If you want to build more intuition about the ELBO, the target and the surrogate, check out this interactive tool by machine-learning-and-simulation.

Mean Field Approach

Now we have a new optimization problem $\operatorname{argmax}_{q(z) \in \mathcal{Q}} \mathcal{L}(q)$, but we still don’t know how to find this $q$. The Mean Field Approach describes an assumption which we can use to simplify the optimization problem: each latent variable is independent of the others,

\[q(z) = \prod_{d=1}^{D} q_d(z_d)\]

We will use some shorthand notation: $z_i$ is the $i$-th latent variable and $z_{-i}$ denotes all other latent variables excluding the $i$-th; for example, $q_{-i} = \prod_{j \neq i} q_j$. Substituting $q(z)$ into the ELBO, our goal is to isolate an individual coordinate $i$ so that we can solve the optimization problem one dimension at a time.

\[\begin{align} \mathcal{L}(q) &= \mathbb{E}_{z \sim q(z)}\left[ \log \frac{p(z,D)}{q(z)} \right] \\ &= \int_z q(z) \log \frac{p(z,D)}{q(z)} \mathrm{d}z \\ &= \int_z q(z) \left( \log p(z,D) - \log q(z) \right) \mathrm{d}z \\ &= \int_z \prod_{j=1}^{D} q_j(z_j) \left( \log p(z,D) - \log q(z) \right) \mathrm{d}z \\ &= \int_{z_i} q_i(z_i) \int_{z_{-i}} q_{-i}(z_{-i}) \left( \log p(z,D) - \log q(z) \right) \mathrm{d}z_{-i} \mathrm{d}z_i \\ &= \int_{z_i} q_i(z_i) \int_{z_{-i}} q_{-i}(z_{-i}) \left( \log p(z,D) - \log q_i(z_i) - \log q_{-i}(z_{-i}) \right) \mathrm{d}z_{-i} \mathrm{d}z_i \\ \end{align}\]

The second-to-last equality is obtained by splitting the integral into two parts: over $z_i$ and over $z_{-i}$. The last equality is obtained by splitting $\log q(z)$ into $\log q_i(z_i) + \log q_{-i}(z_{-i})$. In the last equation, observe that:

  • the $\log p(z,D) - \log q_i(z_i)$ term depends on $z_{i}$
  • the $- \log q_{-i}(z_{-i})$ term does not depend on $z_{i}$

Therefore, we can convert the optimization problem over all latent variables into just optimizing over one coordinate $q_i(z_i)$.

\[\mathcal{L}(q) = \int_{z_i} q_i(z_i) \mathbb{E}_{z_{-i}} \left[ \log p(z,D) - \log q_i(z_i) \right] \mathrm{d}z_i - \mathbb{E}_{z_{-i}}[ \log q_{-i}(z_{-i})]\]

Back to the optimization problem: you can perform coordinate ascent over $q_1, q_2, …, q_D$: fix all factors except $q_i$ and then optimize for \(q^*_i = \operatorname{argmax}_{q_i} \mathcal{L}(q)\). You can ignore the second term \(\mathbb{E}_{z_{-i}}[ \log q_{-i}(z_{-i})]\) since it is constant with respect to $q_i$.

Coordinate Ascent Variational Inference (CAVI)

We described above how to perform coordinate ascent under the Mean Field assumption. However, we still do not know how to solve the optimization problem

\[q^*_i = \operatorname{argmax}_{q_i} \mathcal{L}(q)\]

As previously mentioned, $\mathcal{L}(q)$ is a functional, and solving this requires tools from the Calculus of Variations. It is also worth mentioning that you need to solve for each $q^*_i$ on a case-by-case basis (if you make a small change to the model, you may need to redo all the derivations). Generally, CAVI should be more efficient than gradient-based optimization like ADVI since it directly finds an analytical solution for each coordinate.
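
For reference, the per-coordinate optimum has a well-known general form (this is the standard result from the variational inference literature, e.g. Bishop’s PRML, Chapter 10; it is not derived in this post):

\[q^*_i(z_i) \propto \exp\left( \mathbb{E}_{z_{-i} \sim q_{-i}}\left[ \log p(z, D) \right] \right)\]

Evaluating this expectation and recognizing which distribution it corresponds to is exactly the part that has to be redone by hand for every model.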

We will not go into the details of CAVI since its solution is problem specific and does not generalize. If you are interested in the derivation, check out this example by machine-learning-and-simulation. The author solves a Normal with unknown mean and variance. We will just compare the CAVI solution to the analytical solution (with a conjugate prior on $\mu | \sigma^2$ and $\sigma^2$). Note that the author uses the precision $\tau$, but we will use the variance $\sigma^2$ here ($\tau = 1/\sigma^2$).

\[X | \mu, \sigma^2 \sim N(\mu, \sigma^2)\]

For the analytical solution, the posterior is a Normal-Inverse-$\chi^2$ distribution: $p(\mu, \sigma^2 | y) \propto p(y | \mu, \sigma^2) \, p(\mu | \sigma^2) \, p(\sigma^2)$. For the CAVI solution, the posterior is approximated by a product of a normal and an inverse-gamma: $q(\mu, \sigma^2) = q(\mu) \, q(\sigma^2)$. Note that $\mu$ and $\sigma^2$ are still coupled, because the parameters of each factor depend on expectations taken under the other (check the example for details).
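
To make the comparison concrete, below is a minimal CAVI sketch for this model. It uses the precision parameterization $\tau = 1/\sigma^2$ with hypothetical priors $\mu | \tau \sim N(\mu_0, (\lambda_0 \tau)^{-1})$ and $\tau \sim \operatorname{Gamma}(a_0, b_0)$; the update formulas are the standard textbook ones for this model (e.g. Bishop’s PRML, §10.1.3), so they may differ in small details from the referenced derivation. The hyperparameters and the synthetic data are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from N(mu=2, sigma=1.5) -- hypothetical values, for illustration only.
x = rng.normal(loc=2.0, scale=1.5, size=200)
N, x_bar = len(x), x.mean()

# Hypothetical priors: mu | tau ~ N(mu0, (lam0 * tau)^-1), tau ~ Gamma(a0, b0).
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# Mean-field surrogate: q(mu) = N(mu_N, 1/lam_N), q(tau) = Gamma(a_N, b_N).
E_tau = a0 / b0                         # initial guess for E[tau]
for _ in range(50):                     # CAVI: alternate the two coordinate updates
    # Update q(mu) with q(tau) held fixed.
    mu_N = (lam0 * mu0 + N * x_bar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # Update q(tau) with q(mu) held fixed (uses E[mu] = mu_N, Var[mu] = 1/lam_N).
    a_N = a0 + (N + 1) / 2
    b_N = b0 + 0.5 * (np.sum((x - mu_N) ** 2) + N / lam_N
                      + lam0 * ((mu_N - mu0) ** 2 + 1 / lam_N))
    E_tau = a_N / b_N

print("q(mu)  mean:", mu_N, " variance:", 1.0 / lam_N)
print("q(tau) mean:", E_tau, " implied sigma^2 = 1/E[tau]:", 1.0 / E_tau)
```

With such weak priors, $q(\mu)$ concentrates near the sample mean and $1/\mathbb{E}[\tau]$ lands near the sample variance, roughly matching what the analytical Normal-Inverse-$\chi^2$ posterior gives.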