This post is mainly based on Michael Betancourt’s review, A Conceptual Introduction to Hamiltonian Monte Carlo.

Hamiltonian Monte Carlo (HMC)

Random-walk Metropolis-Hastings is inefficient in high-dimensional spaces due to its low acceptance rate. In high dimensions, the “positive density” of a distribution concentrates in an increasingly thin and narrow region. This means only a few directions lead to areas with “positive density”, while an exponentially growing number of directions lead into “0-density” regions. This renders random-walk Metropolis-Hastings increasingly inefficient as dimension grows. To efficiently sample from a high-dimensional distribution, we must take the geometry of the distribution into consideration and strategically avoid proposals outside the “positive density” region.

Hamiltonian Monte Carlo was originally developed in 1987, but it was not until the late 2000s that we started to understand why it can efficiently explore high-dimensional spaces. Theoretical results show that for a target distribution of dimension $D$, the computational cost of HMC per independent sample is roughly $O(D^{5/4})$. In comparison, the computational cost of random-walk Metropolis is $O(D^2)$. HMC is built upon differential geometry and achieves efficient exploration by orbiting different energy levels of a constructed vector field (a Hamiltonian system). I imagine this introduction of HMC looks confusing, but we will see how the different parts fit together by the end of this post.

Algorithm

Given a momentum distribution $p(\cdot)$, a step size $\epsilon$, and an integration time $T$ (the number of steps taken along the Hamiltonian trajectory), start from a random position $q$ and repeat:

  • Sample momentum $p \sim p(\cdot)$
  • Propose a new state using Hamiltonian dynamics
    • Compute the Hamiltonian trajectory with the “leapfrog” integrator for $T$ steps with step size $\epsilon$
    • Record the last step’s position $q'$ and momentum $p'$
    • The Metropolis-Hastings proposal is $q'$
  • Accept the proposal with probability $A[(p,q) \rightarrow (p',q')] = \min\{1, \exp(H(q,p) - H(q',p'))\}$ (a minimal code sketch follows below)
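To make the loop concrete, here is a minimal sketch in Python/NumPy. It assumes an identity mass matrix (so the momentum distribution is a standard Gaussian) and hypothetical user-supplied `log_prob` and `grad_log_prob` functions; production implementations add details such as momentum negation and divergence handling.

```python
import numpy as np

def leapfrog(q, p, grad_log_prob, epsilon, T):
    """Simulate the Hamiltonian trajectory with the leapfrog integrator."""
    q, p = q.copy(), p.copy()
    p += 0.5 * epsilon * grad_log_prob(q)        # initial half step for momentum
    for _ in range(T - 1):
        q += epsilon * p                         # full step for position
        p += epsilon * grad_log_prob(q)          # full step for momentum
    q += epsilon * p
    p += 0.5 * epsilon * grad_log_prob(q)        # final half step for momentum
    return q, p

def hmc_step(q, log_prob, grad_log_prob, epsilon, T, rng):
    """One Hamiltonian Markov transition with an identity mass matrix."""
    p = rng.standard_normal(q.shape)             # sample momentum p ~ N(0, I)
    q_new, p_new = leapfrog(q, p, grad_log_prob, epsilon, T)
    # Total energy H(q, p) = -log pi(q) + |p|^2 / 2
    h_old = -log_prob(q) + 0.5 * np.dot(p, p)
    h_new = -log_prob(q_new) + 0.5 * np.dot(p_new, p_new)
    # Metropolis correction: accept with probability min(1, exp(H - H'))
    if rng.uniform() < np.exp(h_old - h_new):
        return q_new
    return q
```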

Figure: HMC sampling. After adding a random momentum (gray arrow), the trajectory is simulated with the “leapfrog” integrator.

Conversion to Hamiltonian system

Recall that the goal of HMC is to exploit information about the geometry of the “positive density” areas and generate proposals in those areas. HMC achieves this by constructing a Hamiltonian system in which the target distribution serves as a gravitational field and proposals are generated following the law of conservation of energy.

The intuition is: think of the gravitational field as an inverted target distribution. Consider a particle starting from some random position $q$. The particle tends to fall into the “bottom” of the gravitational field (i.e., the high density area of the target distribution). For each sample, we give the particle a random kick. If the kick force is “good”, the particle will settle into an orbit in the gravitational field. Sampling the position of this particle is equivalent to sampling from the target distribution. If the kick is too hard, the particle will probably have too much energy, making it possible to escape the gravitational pull (i.e., leave the high density area). HMC compares a modified version of the total energy between the previous particle and the current particle and rejects the proposal if the current particle’s total energy is too high.

Advantage

  • Generates samples in high density areas
    • The proposal tends to drift into high density areas under the gravitational pull
    • In case the total energy is too high, the proposal will likely be rejected, reducing the risk of leaving the high density area
  • Stochastic exploration between different energy levels
    • The random momentum added in each iteration lifts the particle onto different orbits
    • For a multimodal distribution, the random momentum could kick the particle out of the gravitational pull of one mode and into capture by a different mode
  • Deterministic exploration of the same energy level
    • One orbit could carry the particle to different areas of the target distribution
    • This reduces correlation between consecutive samples

Figure: Stochastic exploration. HMC/NUTS can efficiently explore a multimodal distribution.

Hamiltonian Markov transition

Now we have a game plan for HMC, but we are still missing key implementation details:

  • Problem 1: “Construct” the gravitational field
  • Problem 2: Find a total energy function $H(q, p)$ that obeys the law of conservation of energy
  • Problem 3: Solve the orbits given the gravitational field and momentum

Problem 1
Since we are in a high-dimensional space, constructing a global gravitational field is impossible (curse of dimensionality). Observe that the orbit can be solved step by step, so we only need local information about the gravitational field. This local information can be exposed by the gradient of the target probability density function. Using differential geometry, we can convert the gravitational field into a vector field aligned with the “contours” of points with the same potential energy.
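Concretely, in the standard construction the potential energy is the negative log density of the target $\pi(q)$, so the local “gravitational pull” is exactly the gradient of the log density:

$$U(q) = -\log \pi(q), \qquad -\nabla U(q) = \nabla \log \pi(q)$$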

Problem 2
I did not find a good way to simplify the explanation. Please refer to [Chapter 3.2: Phase Space and Hamilton’s Equations] of Michael Betancourt’s review.
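That said, the standard choice of total energy (the Euclidean-Gaussian kinetic energy, used by Stan with mass matrix $M$) is worth writing down:

$$H(q, p) = -\log \pi(q) + \frac{1}{2} p^{\top} M^{-1} p$$

Hamilton’s equations, $\frac{dq}{dt} = \frac{\partial H}{\partial p}$ and $\frac{dp}{dt} = -\frac{\partial H}{\partial q}$, then conserve $H$ exactly along the trajectory.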

Problem 3
The orbits can be estimated by solving the ordinary differential equations we derived in Problem 2. However, given a fixed step size $\epsilon$, the estimation error of most ODE solvers accumulates over the steps $T$. If we integrate for a short time $T$, we limit the exploration of the Hamiltonian trajectories; if we integrate for a long time $T$, the orbit is no longer accurate. Some smart people figured out an extremely powerful family of numerical solvers, known as symplectic integrators, which use the geometry of phase space itself to construct the orbit. The approximation error of the orbit is bounded for symplectic integrators, and the numerical trajectories oscillate around the exact level set, instead of drifting off, even as the integration time $T$ increases.
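The symplectic integrator used by HMC in practice is the “leapfrog” integrator from the algorithm sketch above. Assuming an identity mass matrix, one leapfrog step of size $\epsilon$ alternates half-step momentum updates with a full-step position update:

$$p_{t+\epsilon/2} = p_t + \frac{\epsilon}{2} \nabla \log \pi(q_t), \qquad q_{t+\epsilon} = q_t + \epsilon \, p_{t+\epsilon/2}, \qquad p_{t+\epsilon} = p_{t+\epsilon/2} + \frac{\epsilon}{2} \nabla \log \pi(q_{t+\epsilon})$$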

HMC’s hyperparameters

HMC requires 3 hyperparameters: a momentum distribution $p(\cdot)$, a step size $\epsilon$ and an integration time $T$.

  • Momentum distribution $p(\cdot)$
    • Empirical results seem to imply that a Gaussian distribution is the best choice
    • For Stan, a mass matrix $M$ can be supplied as input; the momentum is drawn from a Gaussian with covariance $M$, and adaptation aims for $M^{-1}$ to approximate the covariance of the target distribution
  • Step size $\epsilon$
    • Smaller step sizes result in more accurate estimation of the trajectory
    • Smaller step sizes result in less exploration and higher autocorrelation
  • Integration time $T$
    • A shorter integration time results in insufficient exploration and undesirable random-walk behavior
    • A longer integration time wastes computational resources
    • The optimal integration time depends on the underlying trajectory, and no single integration time is optimal everywhere
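To see these hyperparameters in action, here is a hypothetical usage of the `hmc_step` sketch from the algorithm section, sampling a 2-D standard Gaussian with illustrative (untuned) values of $\epsilon$ and $T$:

```python
import numpy as np

# Hypothetical target: a 2-D standard Gaussian
log_prob = lambda q: -0.5 * np.dot(q, q)
grad_log_prob = lambda q: -q

rng = np.random.default_rng(0)
q = rng.standard_normal(2)                       # random starting position
samples = []
for _ in range(1000):
    # epsilon and T are illustrative; in practice both must be tuned
    q = hmc_step(q, log_prob, grad_log_prob, epsilon=0.1, T=20, rng=rng)
    samples.append(q)
samples = np.array(samples)                      # shape (1000, 2)
```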

Hamiltonian Monte Carlo is an efficient sampling tool built upon a rich theoretical foundation. I find the parallel between exploring a probability distribution and orbiting in a gravitational field quite intriguing. If you are interested in more details about HMC, please refer to the original paper. You may also consider Michael Betancourt’s talk at the Stan Conference for more information.

The No-U-Turn Sampler (NUTS)

Apart from wasting computational resources, a longer integration time may send the particle back to its starting position or a previously explored region of the orbit (see figure 1). Consider a starting point $q$ and a proposal $q'$: NUTS looks at the derivative of the squared distance between these two points, $\frac{\mathrm{d}}{\mathrm{d}t}\left[(q'-q) \cdot (q'-q)\right] = 2\,(q'-q) \cdot p'$ (assuming an identity mass matrix), and continues to simulate the trajectory until this derivative turns negative. Intuitively, a negative derivative implies that the particle has started to make a U-turn and head back toward the starting point.
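As a sketch, the U-turn check itself is a one-liner (again assuming an identity mass matrix; `q_start` is where the trajectory began, and `q`, `p` are the current position and momentum):

```python
import numpy as np

def u_turn(q_start, q, p):
    # The trajectory starts doubling back once the momentum points
    # "toward" the start: d/dt |q - q_start|^2 = 2 (q - q_start) . p < 0
    return np.dot(q - q_start, p) < 0
```

In the full algorithm, this check is applied between the two endpoints of the doubling trajectory, in both directions.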

Compared to HMC, NUTS:

  • Automatically determines the integration time $T$ (and, via dual averaging, adapts the step size $\epsilon$)
  • Integration is performed in both directions
  • The proposal is sampled from a set of candidate position-momentum states along the trajectory

Figure: NUTS sampling. Trajectory simulation is terminated before a “U-turn”.

The above is just a basic sketch of the NUTS idea. The implementation of NUTS involves quite a lot of details, like the trajectory binary tree, repeated doubling, and candidate position-momentum states. You can refer to the NUTS paper for details.