The Currency of Certainty: Entropy

What is "Information"? In 1948, Claude Shannon realized that information is simply the reduction of uncertainty. If I tell you "The sun will rise tomorrow," I've given you zero information, because you were already 100% sure. If I tell you "The lottery numbers are 5, 12, 23...", I've given you a lot of information because those numbers were highly uncertain.

Entropy ($H$) is the mathematical measure of that uncertainty.

Intuition
The Yes/No Question Game

Entropy is, to a close approximation, the minimum average number of Yes/No questions you need to ask to figure out a random outcome, assuming you choose your questions as cleverly as possible.

  • Fair Coin ($p = 0.5$): Exactly 1 question needed ("Is it Heads?"). $H = 1$ bit.
  • Biased Coin (99% Heads): You almost always know the answer. A single flip still costs one question, but if you ask about long runs of flips at once, you need far fewer than 1 question per flip on average. $H \approx 0.08$ bits.
  • Certain Event: 0 questions needed. $H = 0$ bits.
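The question game can be made concrete with a small sketch. It uses Huffman's algorithm, which is not named in the text above and is brought in here only as one standard way to build a good question strategy. The four-outcome distribution is a made-up example, chosen so that the average number of questions matches the entropy exactly.

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Return how many yes/no questions each outcome needs under a Huffman strategy."""
    # Heap entries: (probability, unique tie-breaker, outcome indices in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, ids1 = heapq.heappop(heap)
        p2, _, ids2 = heapq.heappop(heap)
        # Merging two subtrees adds one question for every outcome inside them.
        for i in ids1 + ids2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, ids1 + ids2))
        counter += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]  # hypothetical distribution over four outcomes
lengths = huffman_lengths(probs)
avg_questions = sum(p * n for p, n in zip(probs, lengths))
entropy_bits = sum(p * log2(1 / p) for p in probs)
print(lengths, avg_questions, entropy_bits)
```

The most likely outcome is settled with one question, while the rare ones take three. For distributions whose probabilities are not powers of 1/2, the average number of questions is slightly above the entropy rather than equal to it.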

The Mathematical Definition

For a discrete random variable $X$ with outcomes $x_1, \ldots, x_n$:

$$H(X) = -\sum_{i=1}^n P(x_i) \log_2 P(x_i)$$
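As a minimal sketch, the definition translates almost directly into Python. The function name `entropy` is an illustrative choice, and the code writes each term as $P(x_i) \log_2 (1/P(x_i))$, which is the same sum with the minus sign folded into the logarithm. Outcomes with zero probability are skipped, following the usual convention that they contribute nothing.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    # p * log2(1 / p) is the same term as -p * log2(p); zero-probability outcomes add 0.
    return sum(p * log2(1 / p) for p in probs if p > 0)

fair = entropy([0.5, 0.5])       # fair coin
biased = entropy([0.99, 0.01])   # coin that lands heads 99% of the time
certain = entropy([1.0, 0.0])    # outcome known in advance
print(fair, round(biased, 4), certain)
```

The three calls correspond to the fair coin, the 99% biased coin, and the certain event from the question game.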

Binary Entropy Function

Entropy is highest (1 bit) when we are most uncertain (p=0.5). If we are certain of the outcome (p=0 or p=1), the uncertainty (and thus entropy) drops to zero.
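For a coin that lands heads with probability $p$, the general definition reduces to a function of $p$ alone. The symbol $H_b$ is introduced here only as a label for this special case, and the convention $0 \log_2 0 = 0$ is assumed at the endpoints:

$$H_b(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$

At $p = 0.5$ both terms equal $0.5$, so

$$H_b(0.5) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 0.5 + 0.5 = 1 \text{ bit}.$$

The function is symmetric, $H_b(p) = H_b(1 - p)$, which is why a coin that is 99% heads is exactly as predictable as one that is 99% tails.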

[Figure: plot of the binary entropy function. Series: Uncertainty (Bits).]
KL Divergence: The Cost of Being Wrong

Kullback-Leibler (KL) Divergence ($D_{KL}(P \parallel Q)$) measures the "distance" between two probability distributions, although it is not a true distance: $D_{KL}(P \parallel Q)$ and $D_{KL}(Q \parallel P)$ generally differ. More specifically, it measures how many extra bits we pay on average if we assume the world follows distribution $Q$ when it actually follows $P$.
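For discrete distributions over the same outcomes $x_1, \ldots, x_n$, the standard definition, written with the same base-2 logarithm as the entropy formula, is:

$$D_{KL}(P \parallel Q) = \sum_{i=1}^n P(x_i) \log_2 \frac{P(x_i)}{Q(x_i)}$$

Each term weighs the mismatch $\log_2 \frac{P(x_i)}{Q(x_i)}$ by how often outcome $x_i$ actually occurs. The sum is never negative, and it is zero only when $P$ and $Q$ agree on every outcome.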

Example
The Optimistic Gambler

Imagine a gambler thinks a coin is fair (their model $Q$), but it's actually biased (the truth $P$). The KL Divergence tells the gambler exactly how much extra surprise, in bits per flip on average, they will encounter because their model doesn't match reality.
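To put a number on the gambler's mistake, here is a rough sketch. The 90% heads bias is an invented figure for illustration, and `kl_divergence` is a hypothetical helper that assumes the model never assigns zero probability to something that can actually happen.

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

truth = [0.9, 0.1]    # hypothetical: the coin really lands heads 90% of the time
belief = [0.5, 0.5]   # the gambler's fair-coin model

extra_bits = kl_divergence(truth, belief)   # cost of believing "fair" when it is biased
reverse = kl_divergence(belief, truth)      # cost of believing "biased" when it is fair
print(round(extra_bits, 3), round(reverse, 3))
```

The two directions give different numbers, which shows in practice that the divergence depends on which distribution plays the role of the truth.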

Machine Learning Connection

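In ML, $P$ is the true distribution of our data (the labels) and $Q$ is our model's prediction. Because the entropy of $P$ is fixed by the data, minimizing the KL Divergence is the same as minimizing the cross-entropy loss, and both pull the model's predictions as close to the data as possible.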
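The link to training losses rests on the identity $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$, where $H(P, Q) = -\sum_i P(x_i) \log_2 Q(x_i)$ is the cross-entropy. The sketch below checks it numerically; the label distribution and the two candidate predictions are made-up values, and the function names are illustrative.

```python
from math import log2

def entropy(p):
    """H(P) in bits."""
    return sum(pi * log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q): average bits paid when outcomes follow p but are scored with q."""
    return sum(pi * log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]         # hypothetical true label distribution
model_a = [0.6, 0.3, 0.1]   # hypothetical prediction close to p
model_b = [0.2, 0.3, 0.5]   # hypothetical prediction far from p

for q in (model_a, model_b):
    gap = cross_entropy(p, q) - kl_divergence(p, q)
    # The gap is H(P) for every model, so both losses rank models identically.
    print(round(cross_entropy(p, q), 4), round(kl_divergence(p, q), 4), round(gap, 4))
```

Since the difference between the two quantities does not depend on the model, whichever prediction has the lower cross-entropy also has the lower KL Divergence.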

KL Divergence Visualization

The gap between the True Distribution (blue) and our Estimated Model (green) is summarized by the KL Divergence. Reducing this 'gap' is the goal of training a neural network with a cross-entropy loss.

[Figure: KL Divergence Visualization. Series: The Truth (P), Our Model (Q).]
Mutual Information: The Signal in the Noise

Mutual Information ($I(X;Y)$) measures how much knowing $X$ reduces your uncertainty about $Y$. It is symmetric, so $Y$ tells you just as much about $X$.

Example
Radio Channels

If you're talking on a phone, $X$ is what you said and $Y$ is what the receiver hears. If the connection is perfect, the mutual information is high (knowing $Y$ tells you exactly what $X$ was). If there's only static, $I(X;Y) = 0$, meaning the two variables are independent.
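The phone example can be sketched with a one-bit message. The code uses the standard formula $I(X;Y) = \sum_{x,y} P(x,y) \log_2 \frac{P(x,y)}{P(x) P(y)}$; the three joint probability tables are invented for illustration, with rows indexed by the bit sent and columns by the bit heard.

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]    # marginal distribution of Y
    return sum(
        pxy * log2(pxy / (px[x] * py[y]))
        for x, row in enumerate(joint)
        for y, pxy in enumerate(row)
        if pxy > 0
    )

perfect = [[0.5, 0.0], [0.0, 0.5]]        # receiver always hears the bit that was sent
static = [[0.25, 0.25], [0.25, 0.25]]     # what is heard is unrelated to what was sent
noisy = [[0.45, 0.05], [0.05, 0.45]]      # hypothetical line that flips 10% of bits

print(mutual_information(perfect), mutual_information(static),
      round(mutual_information(noisy), 3))
```

The perfect line carries the full bit, pure static carries nothing, and the noisy line lands in between.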

Summary Table

| Metric | Intuition | Variable Meanings |
| --- | --- | --- |
| Entropy | Total uncertainty | $X$: A random outcome |
| KL Divergence | Distance between models | $P$: Truth, $Q$: Model |
| Cross-Entropy | Cost of using our model | Used as a loss function in Deep Learning |
| Mutual Information | Overlap between variables | Measures any dependence, including non-linear, between $X$ and $Y$ |
Example
The Big Picture

Information theory is the hidden machinery behind everything from ZIP file compression and JPEG images to 5G networks and Large Language Models (LLMs). It teaches us that uncertainty isn't just a nuisance: it's a measurable quantity that can be optimized.
