Information Theory

The Currency of Certainty: Entropy

What is "Information"? In 1948, Claude Shannon realized that information is simply the reduction of uncertainty. If I tell you "The sun will rise tomorrow," I've given you zero information, because you were already 100% sure. If I tell you "The lottery numbers are 5, 12, 23...", I've given you a lot of information because those numbers were highly uncertain.

Entropy ( $H$ ) is the mathematical measure of that uncertainty.

Intuition

The Yes/No Question Game

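Entropy is the smallest average number of Yes/No questions you need to ask to figure out a random outcome, assuming you always ask the most efficient questions. Strictly speaking it is a lower bound: a good questioning strategy gets within one question of it, and gets arbitrarily close when you guess many outcomes at once.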

Fair Coin ( $p=0.5$ ): Exactly 1 question needed ("Is it Heads?"). $H = 1$ bit.
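Biased Coin (99% Heads): You almost always know the answer. Over a long run of flips you can ask about many flips at once ("Were the next 50 all Heads?"), so you need far fewer than 1 question per flip on average. $H \approx 0.08$ bits.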
Certain Event: 0 questions needed. $H = 0$ bits.
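The question game can be sketched in code. A Huffman code is one standard way to build an efficient sequence of Yes/No questions: each bit of an outcome's codeword is the answer to one question. The sketch below uses a hypothetical four-outcome distribution (an assumption chosen so the numbers come out exactly) and compares the average number of questions with the entropy, computed from the definition given in the next section.

```python
import heapq
from math import log2

def huffman_code_lengths(probs):
    """Return, for each outcome, how many yes/no questions a Huffman code asks."""
    # Heap entries: (subtree probability, unique tie-breaker, outcome indices in subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie_breaker = len(probs)
    while len(heap) > 1:
        # Merge the two least likely subtrees; every outcome inside them
        # gains one more question (one more bit in its codeword).
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        for i in left + right:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, tie_breaker, left + right))
        tie_breaker += 1
    return lengths

# Hypothetical distribution, chosen for illustration only.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_code_lengths(probs)
avg_questions = sum(p * n for p, n in zip(probs, lengths))
entropy_bits = sum(-p * log2(p) for p in probs)

print(lengths)        # [1, 2, 3, 3]
print(avg_questions)  # 1.75
print(entropy_bits)   # 1.75
```

Here the average number of questions equals the entropy because every probability is a power of 1/2. For other distributions the Huffman average lands slightly above the entropy, never below it.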

The Mathematical Definition

For a discrete random variable $X$ with outcomes $x_1, \ldots, x_n$ :

$$H(X) = -\sum_{i=1}^n P(x_i) \log_2 P(x_i)$$
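As a quick check of the definition, here is a minimal sketch that evaluates the sum for the three coins from the question game. The function name `entropy` and the convention of skipping zero-probability outcomes (treating $0 \log_2 0$ as 0) are choices made for this illustration.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits. Outcomes with probability 0 contribute nothing."""
    return sum(-p * log2(p) for p in probs if p > 0)

fair = entropy([0.5, 0.5])       # fair coin
biased = entropy([0.99, 0.01])   # 99% Heads
certain = entropy([1.0])         # certain event

print(fair)              # 1.0
print(round(biased, 4))  # 0.0808
print(certain)           # 0.0
```

The values match the intuition: 1 bit for the fair coin, about 0.08 bits for the heavily biased coin, and 0 bits when nothing is uncertain.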

Binary Entropy Function

Entropy is highest (1 bit) when we are most uncertain ( $p=0.5$ ). If we are certain of the outcome ( $p=0$ or $p=1$ ), the uncertainty (and thus entropy) drops to zero.

[Figure: Binary Entropy Function. Axis label: Uncertainty (Bits)]

KL Divergence: The Cost of Being Wrong

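Kullback-Leibler (KL) Divergence ( $D_{KL}(P \parallel Q)$ ) measures how far one probability distribution is from another. It is not a true distance, because it is not symmetric: $D_{KL}(P \parallel Q)$ and $D_{KL}(Q \parallel P)$ are generally different. More specifically, it measures how many extra bits we pay on average if we assume the world follows distribution $Q$ when it actually follows $P$ .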

Example

The Optimistic Gambler

Imagine a gambler thinks a coin is fair (their model $Q$ ), but it's actually biased (the truth $P$ ). The KL Divergence tells the gambler how much extra surprise, in bits per flip on average, they will encounter because their model doesn't match reality.
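To put a number on the gambler's mistake, the sketch below uses the standard definition $D_{KL}(P \parallel Q) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}$ . The 90% Heads bias is a made-up value for illustration, and the function assumes $Q(x) > 0$ wherever $P(x) > 0$ .

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(P || Q) in bits. Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

truth = [0.9, 0.1]    # hypothetical: the coin really lands Heads 90% of the time
belief = [0.5, 0.5]   # the gambler's model: a fair coin

extra_bits = kl_divergence(truth, belief)   # cost of believing "fair" when biased
reverse = kl_divergence(belief, truth)      # the opposite mistake

print(round(extra_bits, 3))  # 0.531
print(round(reverse, 3))     # 0.737
```

The two directions give different answers, which is why KL Divergence is not a distance in the usual sense. If the gambler's belief matched the truth exactly, the result would be 0.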

Machine Learning Connection

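In ML, $P$ is the true distribution of our data (the labels) and $Q$ is our model's prediction. Minimizing the KL Divergence pulls the model's predicted distribution toward the true one. Because the entropy of $P$ does not depend on the model, this is the same as minimizing the cross-entropy loss.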
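The link to training losses rests on the identity $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$ , where $H(P, Q) = -\sum_x P(x) \log_2 Q(x)$ is the cross-entropy. The sketch below checks it numerically on a hypothetical three-class label distribution and three made-up model predictions; all numbers are assumptions for illustration.

```python
from math import log2

def entropy(p):
    """H(P) in bits."""
    return sum(-pi * log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits. Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(-pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in bits. Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

labels = [0.7, 0.2, 0.1]   # hypothetical true distribution P
models = [                 # hypothetical predictions Q, from poor to perfect
    [0.4, 0.4, 0.2],
    [0.6, 0.3, 0.1],
    [0.7, 0.2, 0.1],
]

h_p = entropy(labels)
rows = [(cross_entropy(labels, q), kl_divergence(labels, q)) for q in models]
for ce, kl in rows:
    # Cross-entropy is always the fixed H(P) plus the KL gap.
    print(f"cross-entropy={ce:.4f}  H(P)+KL={h_p + kl:.4f}  KL={kl:.4f}")
```

Since $H(P)$ is the same in every row, the cross-entropy and the KL Divergence shrink together, and the KL term reaches 0 only when the prediction matches the labels.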

KL Divergence Visualization

The gap between the True Distribution (blue) and our Estimated Model (green) is the KL Divergence. Reducing this gap is what training a neural network with a cross-entropy loss aims to do.

[Figure legend: The Truth (P), Our Model (Q)]

Mutual Information: The Signal in the Noise

Mutual Information ( $I(X;Y)$ ) measures how much knowing $X$ reduces your uncertainty about $Y$ . It is symmetric: $Y$ tells you exactly as much about $X$ .

Example

Radio Channels

If you're talking on a phone, $X$ is what you said and $Y$ is what the receiver hears. If the connection is perfect, the mutual information is as high as it can be, $I(X;Y) = H(X)$ , because knowing $Y$ tells you exactly what $X$ was. If there's only static, $Y$ is independent of $X$ and $I(X;Y) = 0$ .
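The phone example can be made concrete with one-bit messages. The sketch below uses the standard formula $I(X;Y) = \sum_{x,y} P(x,y) \log_2 \frac{P(x,y)}{P(x)P(y)}$ on three hypothetical joint distributions: a perfect line, pure static, and a line that flips 10% of bits. The tables are assumptions for illustration.

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]    # marginal of Y (column sums)
    return sum(
        pxy * log2(pxy / (px[x] * py[y]))
        for x, row in enumerate(joint)
        for y, pxy in enumerate(row)
        if pxy > 0
    )

# Hypothetical channels; the sender transmits 0 or 1 with equal probability.
perfect = [[0.5, 0.0], [0.0, 0.5]]       # receiver always hears what was sent
static = [[0.25, 0.25], [0.25, 0.25]]    # what is heard is unrelated to what was sent
noisy = [[0.45, 0.05], [0.05, 0.45]]     # 10% of bits are flipped

mi_perfect = mutual_information(perfect)
mi_static = mutual_information(static)
mi_noisy = mutual_information(noisy)

print(mi_perfect)          # 1.0
print(mi_static)           # 0.0
print(round(mi_noisy, 3))  # 0.531
```

The perfect line carries the full 1 bit of $H(X)$ , pure static carries nothing, and the noisy line falls in between.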

Summary Table

Metric	Intuition	Notes
Entropy	Total uncertainty	$X$ : A random outcome
KL Divergence	Extra cost of assuming the wrong model	$P$ : Truth, $Q$ : Model; not symmetric
Cross-Entropy	Total cost of using our model	$P$ : Truth, $Q$ : Model; used as a loss function in Deep Learning
Mutual Information	Overlap between variables	Captures any dependence between $X$ and $Y$ , not only linear

Example

The Big Picture

Information theory is the hidden machinery behind everything from ZIP file compression and JPEG images to 5G networks and Large Language Models (LLMs). It teaches us that uncertainty isn't just a nuisance: it's a quantity we can measure in bits and optimize.
