The Metropolis-Hastings (MH) algorithm takes an arbitrary Markov chain and adjusts it with an “accept-reject” mechanism so that the resulting Markov chain has the target distribution as its stationary distribution. The Metropolis algorithm requires the proposal distribution to be symmetric; the Metropolis-Hastings algorithm is the general case in which this symmetry requirement is removed.

Algorithm

Consider the target distribution $p(x) = \frac{f(x)}{C} \propto f(x)$, where $C$ is the normalizing constant, and a proposal distribution $g(x' | x_t)$.

  • Initialize the chain to $x_1$
  • Sample from the proposal distribution: $x' \sim g(x | x_t)$
    • e.g., $g(x | x_t) = \mathcal{N}(x_t, \sigma^2)$
  • Accept the proposal with probability $A(x_t \rightarrow x')$
    • $A(x_t \rightarrow x') = \operatorname{min}(1, \frac{f(x')}{f(x_t)} \frac{g(x_t | x')}{g(x' | x_t)})$
    • $u \sim \operatorname{Unif}(0,1)$
    • If $u < A(x_t \rightarrow x')$, $x_{t+1} = x'$; else, $x_{t+1} = x_t$
    • $A(x_t \rightarrow x_{t+1})$ is chosen s.t. the stationary distribution of the Markov chain is $f(x)$ (a runnable sketch follows below)
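As a concrete reference for these steps, here is a minimal NumPy sketch of random-walk Metropolis-Hastings with a symmetric Gaussian proposal (so the $g$-ratio in $A$ cancels); the function name, signature, and the example target are illustrative choices, not from any library:

```python
import numpy as np

def metropolis_hastings(f, x1, n_samples, sigma=1.0, seed=0):
    """Sample from p(x) ∝ f(x) using a Gaussian random-walk proposal."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x = x1                                 # initialize the chain to x_1
    for t in range(n_samples):
        x_prop = rng.normal(x, sigma)      # x' ~ g(x | x_t) = N(x_t, sigma^2)
        # symmetric proposal, so g(x_t | x') / g(x' | x_t) = 1
        A = min(1.0, f(x_prop) / f(x))     # acceptance probability
        if rng.uniform() < A:              # u ~ Unif(0, 1)
            x = x_prop                     # accept: x_{t+1} = x'
        samples[t] = x                     # on rejection, x_{t+1} = x_t
    return samples

# e.g., sample from an unnormalized standard normal, f(x) = exp(-x^2 / 2)
samples = metropolis_hastings(lambda x: np.exp(-x**2 / 2), x1=0.0, n_samples=10_000)
```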

Intuitions

Consider a symmetric proposal distribution $g$; then $g(x_t | x') = g(x' | x_t)$ and $A(x_t \rightarrow x') = \operatorname{min}(1, \frac{f(x')}{f(x_t)}) = \operatorname{min}(1, \frac{p(x')}{p(x_t)})$.

  • If $p(x') \geq p(x_t)$, jump to state $x'$ with probability $1$
  • If $p(x') < p(x_t)$, jump to state $x'$ with probability $\frac{p(x')}{p(x_t)}$

Therefore, the Markov chain is more likely to jump from a low-probability region to a high-probability region of the distribution and less likely to jump in the opposite direction, which fits the intuition that more samples should be drawn from regions where $p(x)$ is high.
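For a quick numerical feel (the state probabilities are made up): if $p(x_t) = 0.3$ and $p(x') = 0.9$, the uphill jump is always accepted, while the reverse, downhill jump is accepted only a third of the time:

```python
p_curr, p_prop = 0.3, 0.9          # made-up state probabilities
A_up = min(1, p_prop / p_curr)     # 1.0: uphill jumps are always accepted
A_down = min(1, p_curr / p_prop)   # ≈ 0.33: downhill jumps are sometimes accepted
```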

Detailed balance condition

How do we know that $A(x_t \rightarrow x_{t+1})$ guarantees the stationary distribution of the Markov chain is $f(x)$? We need a tool called the detailed balance condition.

Detailed balance condition
$\forall x,y$: if $p(x)T(x \rightarrow y) = p(y)T(y \rightarrow x)$, then $p$ is a stationary distribution of the Markov chain.
Proof:
Summing both sides of the detailed balance condition over $x$: \(\sum_x p(x)T(x \rightarrow y) = p(y) \sum_x T(y \rightarrow x) = p(y)\) where the second equality holds because the transition probabilities out of $y$ sum to $1$. In matrix form this reads $a^\top T = a^\top$, where $T$ is the transition matrix with entries $T_{xy} = T(x \rightarrow y)$ and $a$ is the distribution vector with entries $p(x)$.
Hence $a$ is a stationary distribution of the Markov chain with transition matrix $T$.
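A tiny numerical illustration of this lemma (the target values are made up): a chain whose every row equals $p$ itself, i.e., $T(x \rightarrow y) = p(y)$, trivially satisfies detailed balance, and its stationarity can be checked directly:

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])   # target distribution over 3 states (made up)
T = np.tile(p, (3, 1))          # T(x -> y) = p(y): detailed balance holds since
                                # p(x)T(x -> y) = p(x)p(y) = p(y)T(y -> x)
M = p[:, None] * T              # M[x, y] = p(x) T(x -> y)
assert np.allclose(M, M.T)      # detailed balance condition
assert np.allclose(p @ T, p)    # stationarity: sum_x p(x)T(x -> y) = p(y)
```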

Note that the transition probability in the Metropolis-Hastings algorithm is $T(x_t \rightarrow x') = g(x' | x_t)A(x_t \rightarrow x')$: to jump from $x_t$ to $x'$, you must first propose $x'$ and then accept the proposed jump. The claim is that under MH’s “accept-reject” mechanism, the transition probability under $g$ is adjusted such that the resulting Markov chain has stationary distribution $f(x)$. To confirm this, we need to check whether the detailed balance condition holds under MH’s $A(x \rightarrow y)$.

Proof: Substituting $T(x \rightarrow y) = g(y|x)A(x \rightarrow y)$ and $p \propto f$ into the detailed balance condition \(p(x)T(x \rightarrow y) = p(y)T(y \rightarrow x)\) gives \(f(x)g(y|x)A(x \rightarrow y) = f(y)g(x|y)A(y \rightarrow x)\) i.e., \(\frac{A(x \rightarrow y)}{A(y \rightarrow x)} = \frac{f(y)}{f(x)}\frac{g(x | y)}{g(y | x)}\)

Let $r_f = \frac{f(y)}{f(x)}$ and $r_g = \frac{g(x | y)}{g(y | x)}$. Then $\frac{A(x \rightarrow y)}{A(y \rightarrow x)} = r_f \cdot r_g$. We need to check that this ratio holds under MH’s definition of the acceptance probability $A(x \rightarrow y)$.

Recall that MH’s $A(x \rightarrow y) = \operatorname{min}(1, r_f \cdot r_g)$; reversing the direction inverts both ratios, so $A(y \rightarrow x) = \operatorname{min}(1, \frac{1}{r_f \cdot r_g})$.
If $r_f \cdot r_g < 1$, then $A(x \rightarrow y) = r_f \cdot r_g$ and $A(y \rightarrow x) = 1$, so the ratio holds.
If $r_f \cdot r_g \geq 1$, then $A(x \rightarrow y) = 1$ and $A(y \rightarrow x) = \frac{1}{r_f \cdot r_g}$, so the ratio holds.
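As a quick numerical sanity check of this case analysis (the test values are arbitrary), the identity $\frac{A(x \rightarrow y)}{A(y \rightarrow x)} = r_f \cdot r_g$ can be verified for a batch of random ratios:

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.uniform(0.1, 10.0, size=1000)   # random values of r = r_f * r_g
A_xy = np.minimum(1.0, r)               # A(x -> y) = min(1, r)
A_yx = np.minimum(1.0, 1.0 / r)         # A(y -> x) = min(1, 1 / r)
assert np.allclose(A_xy / A_yx, r)      # the required ratio holds in both cases
```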

Therefore, the detailed balance condition is satisfied and the stationary distribution of the Markov chain is $f(x)$. If you are interested in a guided walkthrough, check ritvikmath’s explanation.
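The same conclusion can be checked end-to-end on a small discrete example: build $T(x \rightarrow y) = g(y | x)A(x \rightarrow y)$ from an arbitrary non-symmetric proposal and MH’s acceptance rule, then verify detailed balance and stationarity; the target and proposal numbers below are made up:

```python
import numpy as np

f = np.array([2.0, 3.0, 5.0])      # unnormalized target f(x); p = f / f.sum()
g = np.array([[0.2, 0.5, 0.3],     # arbitrary non-symmetric proposal, g[x, y] = g(y | x),
              [0.4, 0.1, 0.5],     # each row sums to 1
              [0.3, 0.3, 0.4]])

# MH acceptance: A(x -> y) = min(1, f(y)g(x | y) / (f(x)g(y | x)))
A = np.minimum(1.0, (f[None, :] * g.T) / (f[:, None] * g))
T = g * A                          # T(x -> y) = g(y | x) A(x -> y) for x != y
T[np.diag_indices(3)] = 0.0
T[np.diag_indices(3)] = 1.0 - T.sum(axis=1)   # rejected proposals stay put

p = f / f.sum()
assert np.allclose(p[:, None] * T, (p[:, None] * T).T)   # detailed balance
assert np.allclose(p @ T, p)                             # stationary distribution is p
```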

The proposal distribution

The Metropolis-Hastings algorithm is not widely used in practice, since choosing a “good” proposal distribution in a high-dimensional space is hard.

Problem 1: appropriate jump size
Consider the proposal distribution $g(x | x_t) = \mathcal{N}(x_t, \sigma^2)$.

  • If $\sigma \uparrow$ (jumps too large)
    • MH will propose many jumps into the tails of the target distribution $f(x)$, which are likely to be rejected
    • This results in stagnation of the chain (the same $x$ is sampled repeatedly) and a low effective sample size
  • If $\sigma \downarrow$ (jumps too small)
    • Proposed jumps are accepted often but move very little, which also results in high auto-correlation between samples and a low effective sample size (both failure modes are illustrated in the sketch below)
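A small sketch of both failure modes (the 1-D standard normal target and the specific $\sigma$ values are illustrative): very small $\sigma$ accepts nearly everything but barely moves, while very large $\sigma$ rejects nearly everything:

```python
import numpy as np

def acceptance_rate(sigma, n=20_000, seed=0):
    """Fraction of accepted moves for a N(x_t, sigma^2) random walk on a N(0, 1) target."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.exp(-x**2 / 2)        # unnormalized standard normal target
    x, accepted = 0.0, 0
    for _ in range(n):
        x_prop = rng.normal(x, sigma)
        if rng.uniform() < min(1.0, f(x_prop) / f(x)):
            x, accepted = x_prop, accepted + 1
    return accepted / n

for sigma in (0.05, 2.4, 50.0):
    print(f"sigma={sigma:>5}: acceptance rate ≈ {acceptance_rate(sigma):.2f}")
```

Neither extreme is useful: near-certain acceptance with tiny steps and near-certain rejection both leave consecutive samples highly correlated.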

Problem 2: appropriate jump size in high dimensional distributions
MCMC is often used to estimate the posterior of a probabilistic model, which can contain hundreds or thousands of parameters. These parameters may live on very different scales in the posterior distribution. In the context of an inference problem, the model can be very sensitive to a few parameters but insensitive to others. Therefore, finding a single proposal distribution that works well in all dimensions is very hard, as the sketch below illustrates.
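A sketch of this failure mode in two dimensions (the target, a zero-mean Gaussian whose coordinates have standard deviations $1$ and $0.01$, is made up for illustration): an isotropic proposal must shrink to suit the narrow coordinate, so the wide coordinate mixes very slowly, while a step size suited to the wide coordinate is almost always rejected:

```python
import numpy as np

rng = np.random.default_rng(0)
scales = np.array([1.0, 0.01])    # per-coordinate std of the target (very different scales)

def log_f(x):
    return -0.5 * np.sum((x / scales) ** 2)   # log of the unnormalized Gaussian target

def run(sigma, n=20_000):
    """Random-walk MH with an isotropic N(0, sigma^2 I) proposal."""
    x, xs = np.zeros(2), np.empty((n, 2))
    for t in range(n):
        prop = x + rng.normal(0.0, sigma, size=2)
        if np.log(rng.uniform()) < log_f(prop) - log_f(x):   # accept test in log space
            x = prop
        xs[t] = x
    return xs

# sigma tuned to the narrow coordinate -> wide coordinate explores slowly;
# sigma tuned to the wide coordinate -> almost every proposal is rejected
for sigma in (0.01, 1.0):
    xs = run(sigma)
    print(f"sigma={sigma}: per-coordinate sample std = {xs.std(axis=0)}")
```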

Summary

Despite its disadvantages, the Metropolis-Hastings algorithm is an important algorithm in the MCMC world, because it builds the theoretical foundation for other, more efficient MCMC methods. For example, Gibbs sampling can be reduced to a special case of the MH algorithm (one whose proposals are always accepted) and therefore enjoys all of the MH algorithm’s theoretical guarantees.

This review gives a more rigorous description of the Metropolis-Hastings algorithm and of the convergence of MCMC.