Probability is a hard thing for humans to think about. The debate between Bayesian and orthodox (frequentist) statistics around the relationship of event frequency and probability makes that clear. Setting that aside, there are a whole bunch of fields that care about *log probability*. Log probability is an elemental quantity of information theory. Entropy is a measure of uncertainty, and at the core of information theory and data compression/coding. Expected log probability is entropy:

For a uniform distribution this can be simplified even further. A fair coin toss, or die throw, for example, has uniform probability of heads/tails or any number, and we get:

where for a coin flip and for a 8-sided die (because there are 2 and 8 possible values, respectively). So with a base-2 log, we get for the coin flip, and for a 8-sided die.

One thing that might not be immediately obvious is that this allows us to compare different types of events to each other. We can now compare the uncertainty in a coin flip and the uncertainty in a 8-sided die. so it takes 3 coin flips to have the same uncertainty. In fact, this means you could simulate an 8-sided die with 3 coin flips, but not 2 coin flips with some sort of tree structure: the first flip determines if the die is in 1-4 or 5-8, the next if it is in 1-2 or 3-4 if the first flip was heads, or 5-6 vs 7-8 for tails on the first flip, and the last flip resolves which of those two numbers the die ends up on.

You probably would have been able to come up with this scheme to simulate die throws from coinflips without knowing about how entropy is formulated. I find this interesting for a couple reasons. First, this means there may be something intuitive about entropy that we have in our brains that we can dig up. The second is that this gives us a formal way to verify and check what we intuited about randomness. For this post I want to focus on intuition.

The first time you are presented with entropy, you might wonder why we take the log of probability. That would be a funny thing to do without a reason. Why couldn’t I say, ‘take your logs and build a door and get out, I’ll just take the square or root instead and use that for *my* measure of uncertainty’, and continue with my life? It turns out there are reasons. I wanted to use this post to capture those reasons and the reference.

If you look at Shannon’s A Mathematical Theory of Communication, you will find a proof-based solution that’s quite nice. Even after looking at it if you haven’t looked at convexity-based proof in a while, it can still be somewhat unintuitive why there needs to be a logarithm involved. Here is a less formal and incomplete explanation that would have been useful for me to get more perspective on the problem.

There are a few desirable properties of entropy that Shannon describes. For example, entropy should be highest when all the events are equally likely. Another one of them is how combining independent events like coin flips or dice rolls combine the possibilities. This means that the number of outcomes is exponential on the number coin flips or die rolls. So if I compute the entropy of one coin flip and another coinflip, and add them together, the sum should be the same value if I were to compute a single entropy on those two coinflips together.

If you want a measure that grows linearly on number of coin flips or die rolls that achieves this property, then taking the logarithm of the number of combinations gives you just that. No other function that isn’t a logarithm will do that. This is because the number of possibilities of n coinflips is exponential. Notice that , where is the number of coin tosses and are the possible outcomes for coin tosses). So the log inverts the exponent added by the combinations of multiple events, which gives entropy the linear property on .