What is "Information"? In 1948, Claude Shannon realized that information is simply the reduction of uncertainty. If I tell you "The sun will rise tomorrow," I've given you zero information, because you were already 100% sure. If I tell you "The lottery numbers are 5, 12, 23...", I've given you a lot of information because those numbers were highly uncertain.
Entropy ($H$) is the mathematical measure of that uncertainty.
Equivalently, entropy is the minimum average number of yes/no questions you need to ask to figure out a random outcome.
- Fair Coin ($p = 0.5$): Exactly 1 question needed ("Is it Heads?"). $H = 1$ bit.
- Biased Coin (99% Heads): You almost always know the answer. Averaged over many flips guessed together, you need far fewer than 1 question per flip. $H \approx 0.08$ bits.
- Certain Event: 0 questions needed. $H = 0$ bits.
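The three cases above can be checked numerically. The following is a minimal sketch using the standard Shannon formula $H = -\sum p \log_2 p$; the function name `entropy_bits` and the variable names are illustrative choices, not part of any library.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy_bits([0.5, 0.5])       # 1.0
biased_coin = entropy_bits([0.99, 0.01])   # about 0.081
certain_event = entropy_bits([1.0, 0.0])   # 0.0

print(f"fair coin:     {fair_coin:.3f} bits")
print(f"biased coin:   {biased_coin:.3f} bits")
print(f"certain event: {certain_event:.3f} bits")
```

The biased coin lands well below 1 bit because one outcome is almost guaranteed, which matches the "far fewer than 1 question" intuition.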
The Mathematical Definition
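For a discrete random variable $X$ with outcomes $x_1, \dots, x_n$:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

Here $p(x_i)$ is the probability of outcome $x_i$, and the base-2 logarithm gives the result in bits.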
Binary Entropy Function
Entropy is highest (1 bit) when we are most uncertain ($p = 0.5$). If we are certain of the outcome ($p = 0$ or $p = 1$), the uncertainty (and thus entropy) drops to zero.
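For reference, the curve described here is the standard binary entropy function. It is written below with $p$ as the probability of one of the two outcomes, using the usual convention $0 \log_2 0 = 0$; the symbol $H_b$ is a common notation assumed for this note.

$$H_b(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$

Plugging in the most uncertain case gives the 1-bit peak:

$$H_b(0.5) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 0.5 + 0.5 = 1 \text{ bit}$$

The function is symmetric around $p = 0.5$, so a coin that lands Heads 99% of the time has the same entropy as one that lands Heads 1% of the time.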
Kullback-Leibler (KL) Divergence ($D_{KL}(P \parallel Q)$) measures the "distance" between two probability distributions. More specifically, it measures how much information we lose if we assume the world follows distribution $Q$ when it actually follows $P$. It is not a true distance, because $D_{KL}(P \parallel Q)$ generally differs from $D_{KL}(Q \parallel P)$.
Imagine a gambler thinks a coin is fair (their model $Q$), but it's actually biased (the truth $P$). The KL Divergence tells the gambler how many extra bits of surprise per flip, on average, they will encounter because their model doesn't match reality.
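The gambler example can be made concrete with a short sketch. It uses the standard definition $D_{KL}(P \parallel Q) = \sum_i p_i \log_2 (p_i / q_i)$, where $P$ is the truth and $Q$ is the model. The 90% bias is a hypothetical value chosen for illustration, and `kl_divergence_bits` is an illustrative helper name.

```python
import math

def kl_divergence_bits(p, q):
    """D_KL(P || Q) in bits for two distributions over the same outcomes.

    Assumes q[i] > 0 wherever p[i] > 0; otherwise the divergence is infinite.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

truth = [0.9, 0.1]   # hypothetical: the coin really lands Heads 90% of the time
model = [0.5, 0.5]   # the gambler's belief: a fair coin

extra_bits = kl_divergence_bits(truth, model)   # about 0.531
reversed_kl = kl_divergence_bits(model, truth)  # about 0.737

print(f"D_KL(truth || model) = {extra_bits:.3f} bits per flip")
print(f"D_KL(model || truth) = {reversed_kl:.3f} bits per flip")
```

The two directions give different numbers, which is why KL divergence is only a "distance" in a loose sense.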
Machine Learning Connection
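In ML, $P$ is the true distribution of our data (the labels) and $Q$ is our model's prediction. Minimizing the KL Divergence pulls the model's predicted distribution toward the true one. Because the entropy of $P$ is fixed by the data, this is equivalent to minimizing the cross-entropy loss.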
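One way to see the link to training losses is the identity: cross-entropy equals the entropy of the true distribution plus the KL divergence from the truth to the model. The sketch below checks this numerically. The label distribution and the three candidate models are made-up values, and all function names are illustrative.

```python
import math

def entropy(p):
    """Entropy of P in bits."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy of truth P under model Q, in bits."""
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

labels = [0.7, 0.2, 0.1]  # hypothetical true class distribution P
candidates = {            # hypothetical model predictions Q
    "rough": [0.4, 0.4, 0.2],
    "better": [0.6, 0.3, 0.1],
    "exact": [0.7, 0.2, 0.1],
}

for name, q in candidates.items():
    ce = cross_entropy(labels, q)
    gap = kl_divergence(labels, q)
    # The cross-entropy always exceeds entropy(labels) by exactly the KL gap.
    print(f"{name:>6}: cross-entropy = {ce:.4f}, entropy + KL = {entropy(labels) + gap:.4f}")
```

Since `entropy(labels)` does not depend on the model, lowering the cross-entropy and lowering the KL divergence are the same optimization.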
KL Divergence Visualization
The gap between the True Distribution (blue) and our Estimated Model (green) is the KL Divergence. Shrinking this gap is what training a neural network with a cross-entropy loss does.
Mutual Information ($I(X; Y)$) measures how much information $X$ tells you about $Y$.
If you're talking on a phone, $X$ is what you said and $Y$ is what the receiver hears. If the connection is perfect, the mutual information is high (knowing $Y$ tells you exactly what $X$ was). If there's only static, $I(X; Y) = 0$, meaning the two variables are independent.
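The phone example can be sketched with a one-bit message. The code below uses the standard definition $I(X; Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}$ on small joint probability tables. The three tables, including the 10% error rate of the noisy line, are hypothetical values chosen for illustration.

```python
import math

def mutual_information_bits(joint):
    """I(X; Y) in bits from a joint table where joint[x][y] = P(X=x, Y=y)."""
    px = [sum(row) for row in joint]         # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]   # marginal distribution of Y
    total = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:
                total += pxy * math.log2(pxy / (px[x] * py[y]))
    return total

perfect_line = [[0.5, 0.0], [0.0, 0.5]]     # receiver always hears what was said
noisy_line = [[0.45, 0.05], [0.05, 0.45]]   # hypothetical: 10% of bits get flipped
pure_static = [[0.25, 0.25], [0.25, 0.25]]  # what is heard is unrelated to what was said

for name, table in [("perfect", perfect_line), ("noisy", noisy_line), ("static", pure_static)]:
    print(f"{name:>7}: I(X; Y) = {mutual_information_bits(table):.3f} bits")
```

The perfect line carries the full 1 bit, the static carries 0 bits, and the noisy line falls in between.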
Summary Table
| Metric | Intuition | Variables and Notes |
|---|---|---|
| Entropy | Total uncertainty | $X$: A random outcome |
| KL Divergence | Distance between models | $P$: Truth, $Q$: Model |
| Cross-Entropy | Cost of using our model | Used as a loss function in Deep Learning |
| Mutual Information | Overlap between variables | Measures non-linear dependence between $X$ and $Y$ |
Information theory is the hidden machinery behind everything from ZIP file compression and JPEG images to 5G networks and Large Language Models (LLMs). It teaches us that uncertainty isn't just a nuisance: it's a measurable quantity that can be optimized.