The bandit problem is a single-state Markov Decision Process. The agent gains knowledge by interacting with the environment. The goal is to design an agent that learns to take optimal actions while making as few mistakes as possible.

Consider a scenario where we need to conduct A/B testing for online advertising. Suppose we have $n$ candidate ads to display to a user, which gives ${n \choose 2}$ pairwise combinations to A/B test. It is also reasonable to assume that the optimal choice of ad depends on the user profile $x$, where $x$ is a feature vector; in that case it is unclear how to conduct A/B testing without setting up user buckets. Each time we display a suboptimal ad, we expect to lose a click-through event and its associated revenue. The bandit framework helps us design algorithms that learn the optimal action with minimal expected revenue loss, as in the sketch below.
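As a rough illustration (not part of the course material), here is a minimal epsilon-greedy sketch of the ad-selection problem, assuming Bernoulli click-through rewards and made-up click-through rates; every time a suboptimal ad is shown, its expected revenue gap is added to the cumulative regret.

```python
import random

def epsilon_greedy_bandit(ctr, horizon=10_000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent choosing among ads with unknown click-through rates `ctr`."""
    rng = random.Random(seed)
    n = len(ctr)
    counts = [0] * n      # times each ad has been shown
    means = [0.0] * n     # empirical click-through rate of each ad
    best = max(ctr)       # optimal expected reward, used only to measure regret
    regret = 0.0

    for _ in range(horizon):
        # Explore with probability epsilon, otherwise exploit the current best estimate.
        if rng.random() < epsilon:
            a = rng.randrange(n)
        else:
            a = max(range(n), key=lambda i: means[i])
        # Bernoulli reward: 1 if the user clicks the displayed ad.
        reward = 1.0 if rng.random() < ctr[a] else 0.0
        counts[a] += 1
        means[a] += (reward - means[a]) / counts[a]  # incremental mean update
        regret += best - ctr[a]                      # expected revenue lost this round
    return means, regret

if __name__ == "__main__":
    # Hypothetical click-through rates for three ads.
    means, regret = epsilon_greedy_bandit(ctr=[0.02, 0.05, 0.03])
    print("estimated CTRs:", [round(m, 3) for m in means])
    print("cumulative expected regret:", round(regret, 1))
```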

Most of the posts below are based on Prof. Akshay Krishnamurthy’s topic course on Bandits and Reinforcement Learning. Special thanks to his excellent teaching.

Topics