This post is mainly based on

Entropy

The entropy of a random variable is the average level of “information”, “surprise”, or “uncertainty” inherent in the variable’s possible outcomes. Entropy was introduced by Claude Shannon in his 1948 paper “A Mathematical Theory of Communication”.

Consider a coin flipping example:

  • A biased coin with probability $p$ of landing on heads and probability $1-p$ of landing on tails
  • For the maximum uncertainty, set $p = 1/2$
    • Each outcome is equally likely
    • The state of the coin requires $1$ binary variable to store and has an entropy of $1$ bit
  • For the minimum uncertainty, set $p = 0$ or $p = 1$
    • The outcome of the event is known
    • The state of the coin requires $0$ binary variables to store and has an entropy of $0$ bits

We now have a rough idea of what the average level of “information” of a random variable is, but many details are missing: for example, how do we measure the average level of “information” when $p=0.25$? The next step is to formalize the idea.

For an event $A$ that occurs with low probability $P(A)$, the uncertainty about its occurrence is high. Consider measuring the uncertainty by $\frac{1}{P(A)}$, and define the information level as the reduction of uncertainty when an event occurs. We will measure $1$ bit of information as:

\[1 \text{ bit of information} = \text{reduce the uncertainty of an event by a factor of }2\]

For $P(A) = 0.25$, its uncertainty is $4$. The occurrence of $A$ reduces the uncertainty from $4$ to $1$; therefore, the occurrence of $A$ carries $2$ bits of information. This naturally yields the definition of Information Content:

Definition Information Content

Given an event $E$ that occurs with probability $P(E)$, the information content of $E$ is defined as

\[I(E) = \log_2 \frac{1}{P(E)} = -\log_2 P(E)\]

$I(E)$ measures how surprising the outcome is, which is why it is also called surprisal. If $P(E)$ is low, the occurrence of $E$ is a surprise and $I(E)$ is high. Intuitively, unexpected news carries a lot of information, while confirming an event you are already confident about contributes only a small amount of information.
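
As a quick sanity check of the definition, here is a minimal Python sketch (standard library only) that evaluates $I(E)$ for a few probabilities; $P(E)=0.25$ recovers the $2$ bits from the factor-of-$4$ argument above.

```python
import math

def information_content(p: float) -> float:
    """Information content (surprisal) of an event with probability p, in bits."""
    return -math.log2(p)

for p in [0.5, 0.25, 0.01]:
    print(f"P(E) = {p:<5}  I(E) = {information_content(p):.2f} bits")
# P(E) = 0.5 -> 1.00 bits, P(E) = 0.25 -> 2.00 bits, P(E) = 0.01 -> 6.64 bits
```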

Definition Entropy

Given a discrete random variable $X$ with possible outcomes $x_1, …, x_n$, which occur with probabilities $p(x_1), …, p(x_n)$, the entropy $H(X)$ is defined as

\[H(X) = -\sum_{i=1}^n p(x_i) \log p(x_i) = \mathbb{E}[ -\log p(X) ]\]

Interpretation

  • The entropy of $X$ is the expectation of the information content $I(X = x)$
  • Entropy measures how unpredictable $X$ is
  • $H(X)$ equals the minimal average transmission cost of $X$ under its optimal encoding (discussed later)
  • A uniform distribution yields maximum uncertainty and therefore maximum entropy, as the sketch below illustrates
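
To connect the definition back to the coin-flipping example, the sketch below (standard library only) evaluates $H(X)$ for a biased coin at several values of $p$: $p = 1/2$ gives the maximum of $1$ bit, while $p = 1$ gives $0$ bits.

```python
import math

def entropy(probs) -> float:
    """Shannon entropy in bits; outcomes with p = 0 contribute nothing by convention."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

for p in [0.5, 0.25, 0.1, 1.0]:
    print(f"p = {p:<4}  H = {entropy([p, 1 - p]):.4f} bits")
# p = 0.5 -> 1.0000, p = 0.25 -> 0.8113, p = 0.1 -> 0.4690, p = 1.0 -> 0.0000
```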

Cross Entropy

Optimal Encoding

Example 1

Consider a random variable $A$ with $8$ possible outcomes $\mathcal{A} = \{A_1, …, A_8\}$. $A$ is uniformly distributed: $p_a(A=a) = 1/8, \forall a \in \mathcal{A}$. Encoding the outcomes requires $3$ bits. If you want to communicate the outcome of $A$ to someone else, the average transmission cost is $3$ bits.

Example 2

Now consider a random variable $B$ with the following outcome distribution:

  • $ P(B_1) = P(B_2) = 0.35$
  • $ P(B_3) = P(B_4) = 0.10$
  • $ P(B_5) = P(B_6) = 0.04$
  • $ P(B_7) = P(B_8) = 0.01$

If you still encode the outcomes of $B$ with $A$’s encoding ($3$ bits per outcome), the average transmission cost will be $3$ bits. However, because some outcomes occur with higher probability, you can change the encoding scheme to achieve a lower average transmission cost.

Example 3

Consider the following encoding:

  • $\operatorname{Encode}(B_1) = 00$, $\operatorname{Encode}(B_2) = 01$
  • $\operatorname{Encode}(B_3) = 100$, $\operatorname{Encode}(B_4) = 101$
  • $\operatorname{Encode}(B_5) = 1100$, $\operatorname{Encode}(B_6) = 1101$
  • $\operatorname{Encode}(B_7) = 11100$, $\operatorname{Encode}(B_8) = 11101$

Note that this encoding is unambiguous: if we chain multiple messages into one communication, there is only one way to interpret them. For example, $011100$ can only mean $B_2, B_5$.

\[\text{The average transmission cost} = 0.7 \cdot 2 + 0.2 \cdot 3 + 0.08 \cdot 4 + 0.02 \cdot 5 = 2.42\]
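
The average cost above can be checked directly from the probabilities and the code lengths of Example 3; a minimal sketch:

```python
# Probabilities of B_1..B_8 and the corresponding code lengths from Example 3.
probs   = [0.35, 0.35, 0.10, 0.10, 0.04, 0.04, 0.01, 0.01]
lengths = [2, 2, 3, 3, 4, 4, 5, 5]

avg_cost = sum(p * l for p, l in zip(probs, lengths))
print(round(avg_cost, 2))  # 2.42
```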

Cross Entropy

The cross entropy $H(P, Q)$ measures the average transmission cost of outcomes drawn from $P$ when the optimal encoding for $Q$ is used. The cross entropy of the distribution $P$ relative to a distribution $Q$ over a given set $\mathcal{X}$ is defined as:

\[H(P,Q) = \sum_{x \in \mathcal{X}} -p(x) \log q(x) = -\mathbb{E}_{x \sim P}[ \log q(x) ]\]

If $P$ and $Q$ are the same distribution, then $H(P,Q) = H(P)$.

Let’s verify this “transmission cost under the optimal encoding” interpretation using the examples above:

  • Example 1: $H(A,A) = H(A) = 3$
  • Example 2: $H(B,A) = 3$

Example 3:

The entropy of $B$ is

\[H(B) = -\left[ 0.7 \log_2(0.35) + 0.2 \log_2(0.1) + 0.08 \log_2(0.04) + 0.02 \log_2(0.01) \right] = 2.22897\]

However, the average transmission cost we calculated is $2.42$. This implies that the encoding used in Example 3 is not optimal for $B$. Treat this encoding as the optimal encoding for some random variable $C$. We can solve for the “distribution” implied by $C$’s encoding (not a real distribution, since the probabilities do not sum to $1$):

\[\frac{1}{2^2} = 0.25, \frac{1}{2^3} = 0.125, \frac{1}{2^4} = 0.0625, \frac{1}{2^5} = 0.03125\]

The “distribution” of $C$ is much closer to the distribution of $B$.
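
A quick numerical check, using the implied code-length “distribution” of $C$ above: the cross entropy of $B$ relative to $C$ reproduces the $2.42$-bit average transmission cost, while $H(B) \approx 2.23$ remains the lower bound.

```python
import math

p_B = [0.35, 0.35, 0.10, 0.10, 0.04, 0.04, 0.01, 0.01]
q_C = [0.25, 0.25, 0.125, 0.125, 0.0625, 0.0625, 0.03125, 0.03125]  # implied by C's code lengths

H_B   = -sum(p * math.log2(p) for p in p_B)
H_B_C = -sum(p * math.log2(q) for p, q in zip(p_B, q_C))

print(f"H(B)    = {H_B:.5f}")   # 2.22897
print(f"H(B, C) = {H_B_C:.2f}") # 2.42
```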

Example 4

Suppose we flip the distribution to:

  • $ P(D_1) = P(D_2) = 0.01$
  • $ P(D_3) = P(D_4) = 0.04$
  • $ P(D_5) = P(D_6) = 0.10$
  • $ P(D_7) = P(D_8) = 0.35$

Now the implied “distribution” changed to:

\[0.03125, 0.0625, 0.125, 0.25\]

If we use $C$’s encoding to transmit $D$, the cross entropy $H(D,C)$ jumps to $4.58$ bits.

Intuitively, if $P$ and $Q$ induce “very different” distributions, $H(P,Q)$ is high. Consider the low-density “regions” of $Q$, where $q(x)$ is low for all \(x \in \mathcal{X}_{low}\). To achieve a lower average transmission cost, the optimal encoding should allocate longer codes to \(x \in \mathcal{X}_{low}\). However, if $P$ is “very different” from $Q$, then $p(x)$ is high for some \(x \in \mathcal{X}_{low}\). This results in transmitting high-frequency $x$ with longer codes, hence $H(P,Q)$ is high. This is a very interesting property, as it allows us to quantify the difference between two distributions.

Cross entropy is commonly used as the loss function in classification problems. Each one-hot encoded label induces a one-hot distribution. Minimizing the cross entropy loss forces the softmax distribution to move closer to the one-hot distribution of the ground-truth label.
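
As an illustration (a minimal sketch, not tied to any particular framework, with made-up `logits` and label indices), the cross entropy between a one-hot label and a softmax output reduces to the negative log-probability of the true class:

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(logits, label_idx):
    """H(one_hot(label), softmax(logits)): only the true-class term survives the sum."""
    q = softmax(logits)
    return -math.log(q[label_idx])

logits = [2.0, 0.5, -1.0]                        # hypothetical scores for 3 classes
print(cross_entropy_loss(logits, label_idx=0))   # small loss: class 0 is already likely
print(cross_entropy_loss(logits, label_idx=2))   # large loss: class 2 is unlikely
```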

KL-divergence

Kullback–Leibler divergence $\operatorname{KL}(P \| Q)$ measures the increase in cross entropy due to the use of a suboptimal encoding derived from a different distribution $Q$:

\[D_{\operatorname{KL}}(p \| q) = H(p,q) - H(p)\]

It is also called relative entropy. KL-divergence can be used to measure how one probability distribution differs from a second, reference probability distribution. It can be written in several equivalent forms:

\[\begin{align} D_{\operatorname{KL}}(p \| q) &= H(p,q) - H(p)\\ &= \sum_{x \in \mathcal{X}} -p(x) \log q(x) - \sum_{x \in \mathcal{X}} -p(x) \log p(x) \\ &= \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \\ &= \mathbb{E}_{x \sim P}[ \log p(x) - \log q(x) ] \end{align}\]

The last form above also tells us how to estimate the KL-divergence: if we can draw samples from $P$ and evaluate both densities $p(x)$ and $q(x)$, we can estimate $D_{\operatorname{KL}}(P \| Q)$ as a Monte Carlo average of $\log p(x) - \log q(x)$.
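
Here is a minimal Monte Carlo sketch of that estimator, using two small hypothetical discrete distributions so the estimate can be compared against the exact value:

```python
import math
import random

# Two hypothetical distributions over the same three outcomes.
p = {0: 0.5, 1: 0.3, 2: 0.2}
q = {0: 0.2, 1: 0.3, 2: 0.5}

def kl_exact(p, q):
    return sum(p[x] * math.log(p[x] / q[x]) for x in p)

def kl_monte_carlo(p, q, n=100_000):
    xs = random.choices(list(p.keys()), weights=list(p.values()), k=n)
    return sum(math.log(p[x]) - math.log(q[x]) for x in xs) / n

print(kl_exact(p, q))        # ~0.275 nats
print(kl_monte_carlo(p, q))  # close to the exact value, up to sampling noise
```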

KL-divergence is NOT a distance metric, because it does not satisfy the symmetry requirement of a metric space:

\[D_{\operatorname{KL}}(P \| Q) \not= D_{\operatorname{KL}}(Q \| P)\]

The KL-divergence can also be used as a loss function in optimization problems, for example in the derivation of the ELBO in Variational Inference.

Another view on the cross entropy loss: given \(H(p,q) = H(p) + D_{\operatorname{KL}}(p \| q)\), since the samples and labels are fixed, $H(p)$ is a constant. Therefore, minimizing the cross entropy loss is equivalent to minimizing the KL-divergence.

Conditional Entropy

The conditional entropy quantifies the amount of information needed to describe the outcome of a random variable $Y$, given that the value of another random variable $X$ is known.

Definition Conditional Entropy

Given two discrete random variables $X, Y$ with marginal probabilities $p(x), p(y)$ and joint probability $p(x,y)$, the conditional entropy $H(Y|X)$ is defined as

\[H(Y | X) = \sum_i \sum_j - p(x_i, y_j) \log\frac{p(x_i, y_j)}{p(x_i)} = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} -p(x,y) \log \frac{p(x,y)}{p(x)}\]

proof

Define $H(Y | X=x)$ as

\[H(Y | X=x) = \sum_{y \in \mathcal{Y}} -p(y|x) \log p(y|x)\]

Then, taking the expectation over $X$, we have

\[\begin{align} H(Y | X) &= \sum_{x \in \mathcal{X}} p(x) H(Y | X=x) \\ &= \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} -p(y|x) \log p(y|x) \\ &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} -p(x,y) \log p(y|x) \\ &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} -p(x,y) \log \frac{p(x,y)}{p(x)} \end{align}\]

If X,Y are independent

\[\begin{align} H(Y | X) &= \sum_x \sum_y -p(x,y) \log \frac{p(x,y)}{p(x)} \\ &= \sum_x \sum_y -p(x)p(y) \log \frac{p(x)p(y)}{p(x)} \\ &= \sum_x p(x) \sum_y -p(y) \log p(y) \\ &= \sum_y -p(y) \log p(y) \\ &= H(Y) \end{align}\]

If Y is completely determined by X

\[\begin{align} H(Y | X) &= \sum_x \sum_y -p(x,y) \log \frac{p(x,y)}{p(x)} \\ &= \sum_x -p(x) \log \frac{p(x)}{p(x)} \\ &= \sum_x p(x) \cdot 0 \\ &= 0 \end{align}\]

Chain Rule

There is a notational overload here: $H(X,Y)$ represents the joint entropy, where $X,Y$ are random variables. The cross entropy $H(P,Q)$, in contrast, operates on the same random variable, where $P,Q$ are two distributions over it.

\[H(Y | X) = H(X,Y) - H(X)\]

proof

\[\begin{align} H(Y | X) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} -p(x,y) \log \frac{p(x,y)}{p(x)} \\ &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) [\log p(x) - \log p(x,y)] \\ &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} - p(x,y) \log p(x,y) - \sum_{x \in \mathcal{X}} - p(x) \log p(x)\\ &= H(X,Y) - H(X) \end{align}\]
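
The chain rule is easy to verify numerically; the sketch below computes $H(Y|X)$ directly from a small hypothetical joint table and compares it with $H(X,Y) - H(X)$.

```python
import math

# Hypothetical joint distribution p(x, y) over X in {0, 1} and Y in {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x  = {0: 0.5, 1: 0.5}  # marginal over X

H_XY        = -sum(p * math.log2(p) for p in p_xy.values())
H_X         = -sum(p * math.log2(p) for p in p_x.values())
H_Y_given_X = -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items())

print(H_Y_given_X)   # ~0.8464
print(H_XY - H_X)    # same value, by the chain rule
```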

Mutual Information

The mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. Compared to the Pearson correlation, MI captures both linear and nonlinear dependence (and a metric can be constructed from MI). MI can also be interpreted as the amount of information obtained about one random variable by observing the other random variable.

Definition Mutual Information

Let $X, Y$ be random variables with joint distribution $p(x,y)$ and marginals $p(x), p(y)$. The mutual information $I(X;Y)$ is defined as

\[I(X;Y) = D_{\operatorname{KL}}(p(x,y) \| p(x)p(y)) = \sum_{y \in \mathcal{Y}}\sum_{x \in \mathcal{X}}p(x,y) \log \left(\frac{p(x,y)}{p(x)p(y)} \right )\]

Properties of Mutual Information

  • Non-negative: $I(X;Y) \geq 0$
  • Symmetric: $I(X;Y) = I(Y;X)$
  • $I(X;Y) = 0$ if and only if $X,Y$ are independent

The properties above are easy to prove (the first one requires Jensen’s inequality).

MI vs conditional and joint entropy

\[\begin{align} I (X;Y) &= H(X) - H(X | Y)\\ &= H(Y) - H(Y | X)\\ &= H(X) + H(Y) - H(X,Y)\\ &= H(X,Y) - H(X | Y) - H(Y | X) \end{align}\]
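
These identities are easy to check numerically; the sketch below uses a small hypothetical joint table and verifies $I(X;Y) = H(X) + H(Y) - H(X,Y)$.

```python
import math

# Hypothetical joint distribution with both marginals.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x  = {0: 0.5, 1: 0.5}
p_y  = {0: 0.6, 1: 0.4}

# I(X;Y) as the KL divergence between the joint and the product of the marginals.
I = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

H_X  = -sum(p * math.log2(p) for p in p_x.values())
H_Y  = -sum(p * math.log2(p) for p in p_y.values())
H_XY = -sum(p * math.log2(p) for p in p_xy.values())

print(I)                  # ~0.1245
print(H_X + H_Y - H_XY)   # same value, from the identities above
```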