### Developing an Intuition for Entropy

Posted on August 14, 2018
by Christopher Blöcker

Tags: Entropy, PhD

#### Introduction

Entropy is a concept used in information theory, thermodynamics, and statistics in general; it measures the predictability (or chaos) of a system. One application of entropy is the design of efficient encodings for describing the behaviour of stochastic processes. But what does an entropy value of, say, 2.45 mean? Here, we will develop an intuition for entropy by looking at the definition and behaviour of Shannon entropy.

#### Background

Before talking about entropy, we cover some background on probability spaces and random variables. If you already know these topics, feel free to skip ahead to Shannon entropy.

##### Probability Space

Probability spaces are used to describe situations with uncertain outcomes. The most prominent examples are tossing coins or throwing dice. We know that fair coins come up heads half of the time and tails the other half. On a fair six-sided die, each outcome is equally likely, i.e., they all have a probability of \(\frac{1}{6}\). But there are also examples where the outcomes have different probabilities: think of dropping a marmalade toast. Marmalade toasts almost always hit the floor with the marmalade side down, rarely up.

Mathematically, a probability space is a triple \(\left( \Omega, \mathcal{F}, P \right)\) where

- \(\Omega\) is the sample space, i.e., the set of possible outcomes,
- \(\mathcal{F} \subseteq 2^\Omega\) is the set of possible events, i.e., combinations of outcomes, and
- \(P \colon \mathcal{F} \to \left[0, 1\right]\) is a function, called the probability measure, which assigns probabilities to events.

In the discrete case (and that is what we care about here), probabilities are assigned to the elements of the sample space by the so-called probability mass function \(p \colon \Omega \to \left[0, 1\right]\). The probabilities must add up to \(1\), i.e., \(\sum_{\omega \in \Omega} p\left(\omega\right) = 1\). The probability measure for events \(f \in \mathcal{F}\) is then defined in terms of \(p\) as \(P \left(f\right) = \sum_{\omega \in f} p\left(\omega\right)\).
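This setup translates directly into code. Below is a minimal sketch of the discrete case with illustrative names of my own choosing (not from any particular library): a probability mass function as a dict from outcomes to probabilities, and the probability measure `P` built on top of it.

```python
def P(event, p):
    """Probability measure: sum the masses of the outcomes in the event."""
    return sum(p[omega] for omega in event)

# Probability mass function of a fair six-sided die.
p_die = {face: 1 / 6 for face in range(1, 7)}

# The masses add up to 1, as required.
assert abs(sum(p_die.values()) - 1) < 1e-12

# The event "the die shows 1 or 6" has probability 2/6 = 1/3.
print(P({1, 6}, p_die))  # ≈ 0.333
```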

For example, the probability space for tossing a fair coin is \(\left( \Omega_{coin}, \mathcal{F}_{coin}, P_{coin} \right)\) with

- \(\Omega_{coin} = \left\{ heads, tails\right\}\)
- \(\mathcal{F}_{coin} = 2^{\Omega_{coin}} = \left\{ \emptyset, \left\{heads\right\}, \left\{tails\right\}, \Omega_{coin} \right\}\)
- \(p_{coin}\left(heads\right) = \frac{1}{2}\)
- \(p_{coin}\left(tails\right) = \frac{1}{2}\)
- \(P_{coin}\left(\emptyset\right) = 0\)
- \(P_{coin}\left(\left\{heads\right\}\right) = p_{coin}\left(heads\right) = \frac{1}{2}\)
- \(P_{coin}\left(\left\{tails\right\}\right) = p_{coin}\left(tails\right) = \frac{1}{2}\)
- \(P_{coin}\left(\Omega_{coin}\right) = \sum_{\omega \in \Omega_{coin}} P_{coin}\left(\left\{\omega\right\}\right) = P_{coin}\left(\left\{heads\right\}\right) + P_{coin}\left(\left\{tails\right\}\right) = 1\).

And the probability space for dropping a marmalade toast is \(\left( \Omega_{toast}, \mathcal{F}_{toast}, P_{toast} \right)\) with

- \(\Omega_{toast} = \left\{ up, down\right\}\)
- \(\mathcal{F}_{toast} = 2^{\Omega_{toast}} = \left\{ \emptyset, \left\{up\right\}, \left\{down\right\}, \Omega_{toast} \right\}\)
- \(p_{toast}\left(up\right) = \frac{1}{100}\)
- \(p_{toast}\left(down\right) = \frac{99}{100}\)
- \(P_{toast}\left(\emptyset\right) = 0\)
- \(P_{toast}\left(\left\{up\right\}\right) = p_{toast}\left(up\right) = \frac{1}{100}\)
- \(P_{toast}\left(\left\{down\right\}\right) = p_{toast}\left(down\right) = \frac{99}{100}\)
- \(P_{toast}\left(\Omega_{toast}\right) = \sum_{\omega \in \Omega_{toast}} P_{toast}\left(\left\{\omega\right\}\right) = P_{toast}\left(\left\{up\right\}\right) + P_{toast}\left(\left\{down\right\}\right) = 1\).
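In code, the two example spaces look like this (again a sketch; the dict-based representation is just one convenient choice):

```python
# The probability mass functions of the two examples as dicts.
p_coin = {"heads": 1 / 2, "tails": 1 / 2}
p_toast = {"up": 1 / 100, "down": 99 / 100}

def P(event, p):
    """Probability measure: sum the masses of the outcomes in the event."""
    return sum(p[omega] for omega in event)

# The defining property holds: the full sample space has probability 1.
assert abs(P(p_coin.keys(), p_coin) - 1) < 1e-12
assert abs(P(p_toast.keys(), p_toast) - 1) < 1e-12

print(P({"up"}, p_toast))  # 0.01
print(P(set(), p_coin))    # 0 -- the empty event never happens
```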

Technically, both the coin and the toast could end up on their edges, but we ignore those cases.

##### Random Variable

A random variable \(X \colon \Omega \to \mathbb{R}\) is a function that maps outcomes, i.e., the elements of a sample space, to numerical values. For example, we could define a game:

- we roll a fair six-sided die
- when it comes up \(1\) or \(6\), you win \(3€\)
- when it comes up \(2, 3, 4\), or \(5\), you lose \(1€\).

And model it as a random variable:

\(\omega\) | \(1\) | \(2\) | \(3\) | \(4\) | \(5\) | \(6\)
---|---|---|---|---|---|---
\(X_{game}\left(\omega\right)\) | \(+3€\) | \(-1€\) | \(-1€\) | \(-1€\) | \(-1€\) | \(+3€\)

Now we can ask, e.g.: what is the probability that you win \(3€\) on a roll? Since this happens for \(2\) out of \(6\) possible outcomes and the die is fair, it is \(\frac{2}{6} = \frac{1}{3}\). Similarly, the probability of losing \(1€\) on a roll is \(\frac{4}{6} = \frac{2}{3}\). By the way, your expected winnings per roll are \(\frac{1}{3} \cdot 3€ - \frac{2}{3} \cdot 1€ = \frac{1}{3}€\), so you should play this game!
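These two quantities can be checked with a few lines of code (a sketch; `X_game` and `p_die` are my own names for the random variable and the die's probability mass function):

```python
# Fair die pmf and the random variable of the game above.
p_die = {face: 1 / 6 for face in range(1, 7)}
X_game = {1: 3, 2: -1, 3: -1, 4: -1, 5: -1, 6: 3}  # winnings in €

# P(X_game = 3): total mass of the outcomes that pay 3€.
p_win = sum(p_die[w] for w in p_die if X_game[w] == 3)

# Expected winnings per roll: value times probability, summed over outcomes.
expected = sum(X_game[w] * p_die[w] for w in p_die)

print(p_win)     # ≈ 1/3
print(expected)  # ≈ 1/3 € per roll
```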

We can define random variables to assign all sorts of values to outcomes. For example, we could go to a shopping mall, randomly select persons, and measure their height to define the random variable \(X_{height}\). Or we could ask them for their income to define \(X_{income}\).

#### Shannon Entropy

The idea behind entropy is to measure how predictable a situation with uncertain outcome is. Situations whose outcome is easy to predict should have low entropy while situations whose outcome is hard to predict should have high entropy.

In other words, entropy measures how surprising the outcome of an experiment is. If one of the outcomes is very likely, i.e., much more likely than all other outcomes together, then it does not surprise us to observe it; after all we expect it because it is so likely. But if all outcomes are similarly likely, we have no reason to expect one outcome any more than another. There is no “safe bet” and we would be surprised by the outcome most of the time.

##### Definition

For an experiment with sample space \(\Omega\) and probability mass function \(p \colon \Omega \to \left[0, 1\right]\), we can define the entropy as a random variable \(X_{entropy}\). The amount of entropy, or surprise, for an outcome \(\omega \in \Omega\) is \(-\log_2\left(p\left(\omega\right)\right)\) and we set \(X_{entropy} \left(\omega\right) = -\log_{2}\left(p\left(\omega\right)\right)\). And the Shannon entropy \(\mathcal{H}\) for the entire sample space \(\Omega\) is the sum over the probability-weighted entropy values of the outcomes \[\begin{eqnarray} \mathcal{H}\left(p\right) & = & \sum_{\omega \in \Omega} p\left(\omega\right) X_{entropy}\left(\omega\right) \\ & & \\ & = & - \sum_{\omega \in \Omega} p\left(\omega\right) \log_2 \left(p\left(\omega\right)\right). \end{eqnarray}\]

(Since logarithms of values \(\lt 1\) are negative, the minus sign is needed to make the expression positive.)
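The definition translates almost directly into code. Here is a minimal sketch (the function name is my own, not a standard one):

```python
import math

def shannon_entropy(p):
    """Shannon entropy (in bits) of a pmf given as a dict of probabilities."""
    # Outcomes with probability 0 contribute nothing to the sum; skip them
    # so that log2 is never called on 0.
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

print(shannon_entropy({"heads": 0.5, "tails": 0.5}))  # 1.0
```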

Shannon defined this measure back in 1948 in his paper A Mathematical Theory of Communication. He wanted to compress messages such that the amount of information sent per unit of time over channels with limited capacity is maximised. For that, he considered how likely certain messages are to occur and, therefore, how surprising they are. He encoded less surprising messages, i.e., common ones, with short code words, and more surprising messages, i.e., rare ones, with long code words.

##### Examples

To get an idea about how Shannon entropy behaves, we look at the examples from before.

The entropy in the coin toss example is \[\begin{eqnarray} \mathcal{H}\left(p_{coin}\right) & = & - p_{coin}\left(heads\right) \log_2\left(p_{coin}\left(heads\right)\right) - p_{coin}\left(tails\right) \log_2\left(p_{coin}\left(tails\right)\right) \\ & & \\ & = & - \frac{1}{2} \log_2\left(\frac{1}{2}\right) - \frac{1}{2} \log_2\left(\frac{1}{2}\right) \\ & & \\ & = & 1. \end{eqnarray}\]

While the entropy in the toast example is \[\begin{eqnarray} \mathcal{H}\left(p_{toast}\right) & = & - p_{toast}\left(up\right) \log_2\left(p_{toast}\left(up\right)\right) - p_{toast}\left(down\right) \log_2\left(p_{toast}\left(down\right)\right) \\ & & \\ & = & - \frac{1}{100} \log_2\left(\frac{1}{100}\right) - \frac{99}{100} \log_2\left(\frac{99}{100}\right) \\ & & \\ & \approx & 0.081. \end{eqnarray}\]

The coin toss has a much higher entropy than the toast drop, which tells us that the outcome of a coin toss is more surprising than the outcome of a toast drop. If you think about it, it makes sense: we know already that the toast will land on the marmalade side, but we do not know what side the coin will land on.

Let us consider the general case for an experiment with two possible outcomes which we simply call \(a\) and \(b\). The probability for \(a\) is \(p\left(a\right)\) and the probability for \(b\) is \(1-p\left(a\right)\), so that \(p\left(a\right) + p\left(b\right) = 1\). The entropy is highest when \(a\) and \(b\) are equally likely, i.e., \(p\left(a\right) = p\left(b\right)\). But as soon as one of them is more likely than the other, the entropy decreases and the outcome of the experiment becomes less surprising.

Entropy distribution for an experiment with two possible outcomes.
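The shape of this curve is easy to probe numerically. A small sketch (`binary_entropy` is an illustrative name for the two-outcome special case):

```python
import math

def binary_entropy(p):
    """Entropy of a two-outcome experiment with probabilities p and 1 - p."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

# The curve peaks at p = 0.5 and falls off towards 0 at the edges.
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(p, round(binary_entropy(p), 3))
```

Note the symmetry: swapping the two outcome probabilities leaves the entropy unchanged, which is why the curve is mirrored around \(p = 0.5\).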

Similarly, for an experiment with three possible outcomes, \(x\), \(y\), and \(z\), the entropy is highest when \(p\left(x\right) = p\left(y\right) = p\left(z\right)\).

Entropy distribution for an experiment with three possible outcomes. This plot was created with python-ternary.

##### Upper Bound

As we see, entropy reaches its highest value when all outcomes are equally likely as it was the case in the coin toss example. For \(n\) possible outcomes, the upper bound on their entropy is \[\begin{eqnarray} \mathcal{H}_{max}^n & = & - \sum_{i = 1}^n \frac{1}{n}\log_2\left(\frac{1}{n}\right) \\ & & \\ & = & n \cdot \left(-\frac{1}{n} \log_2\left(\frac{1}{n}\right)\right) \\ & & \\ & = & - \log_2\left(\frac{1}{n}\right) \\ & & \\ & = & \log_2\left(n\right) . \end{eqnarray}\]

That means the upper bound on entropy grows slowly with the number of possible outcomes and is \(\mathcal{O}\left(\log n\right)\). But it also means that we have to consider what the upper limit for entropy is when we think about what an entropy value of, say, \(2.45\) means. For an experiment with \(3\) outcomes, an entropy value of \(2.45\) is impossible because the upper bound is \(\approx 1.58\). For an experiment with \(7\) outcomes, it is rather high because the upper bound is \(\approx 2.81\). And for an experiment with \(1024\) outcomes, it is rather low because the upper bound is \(10\).
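This bound is easy to confirm numerically. A sketch, reusing the entropy definition from above but on a plain sequence of probabilities:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a sequence of probabilities."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

# Uniform distributions attain the upper bound log2(n).
for n in (3, 7, 1024):
    h = shannon_entropy([1 / n] * n)
    print(n, round(h, 3), round(math.log2(n), 3))
```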

#### Conclusion

We looked at some background on probability spaces and random variables, and discussed the idea behind entropy. Intuitively, a low entropy level says that the outcome of an experiment is easy to predict and a high level says that the experiment is hard to predict. Then, we looked at the definition of Shannon entropy and visualised its values for the outcome probabilities for experiments with \(2\) and \(3\) possible outcomes. Finally, we saw that every experiment has an upper bound on entropy and that we have to think about entropy levels in relation to this upper bound when we want to judge whether an entropy level is high or low.