How Good is Your Guess?

We now know how to estimate parameters using MoM or MLE, but how do we know which estimator is "best"? Is the Sample Mean better than the Sample Median? To answer this, we evaluate estimators ($\hat{\theta}$) based on two fundamental criteria: Bias and Variance.

Intuition
The Archery Analogy

Imagine shooting arrows at a target where the bullseye is the true parameter $\theta$:

  • Bias: How far is the average of all your shots from the bullseye? (Systematic Error/Accuracy)
  • Variance: How tight is the grouping of your shots? (Precision)
  • MSE: The overall quality of your shooting, accounting for both aim and consistency.

| Scenario | Bias | Variance | Interpretation |
| --- | --- | --- | --- |
| Bullseye | Zero | Low | Your aim is true and your hands are steady. |
| Systematic Error | High | Low | Your hands are steady, but your sight is misaligned. |
| High Noise | Zero | High | Your aim is true on average, but your hands are shaky. |
| Chaos | High | High | Your sight is off AND your hands are shaky. |

Mean Squared Error (MSE): The Total Error

The MSE is the most common metric for comparing estimators. It measures the average squared distance between the estimator and the true parameter.

$$\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$$
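
To make the definition concrete, the sketch below approximates the expectation by simulation and compares the Sample Mean with the Sample Median. The helper name `simulate_estimator`, the normal model, and the values of `theta`, `sigma`, `n`, and `trials` are illustrative assumptions, not part of the original text.

```python
import numpy as np

def simulate_estimator(estimator, theta=5.0, sigma=2.0, n=25, trials=20_000, seed=0):
    """Monte Carlo estimate of the bias, variance, and MSE of `estimator`.

    Assumes (for illustration only) i.i.d. Normal(theta, sigma^2) samples.
    """
    rng = np.random.default_rng(seed)
    samples = rng.normal(theta, sigma, size=(trials, n))
    estimates = estimator(samples, axis=1)    # one estimate per simulated sample
    bias = estimates.mean() - theta           # average miss
    variance = estimates.var()                # spread of the estimates
    mse = np.mean((estimates - theta) ** 2)   # average squared error
    return bias, variance, mse

for name, est in [("mean", np.mean), ("median", np.median)]:
    bias, variance, mse = simulate_estimator(est)
    print(f"{name:>6}: bias={bias:+.4f}  variance={variance:.4f}  MSE={mse:.4f}")
```

Under these assumptions both estimators should come out nearly unbiased, while the median shows the larger variance and therefore the larger MSE. With heavy-tailed data the ranking can reverse, so the result is specific to the normal model used here.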

One of the most important results in statistics is that MSE can be decomposed into two parts:

$$\text{MSE}(\hat{\theta}) = \text{Bias}^2(\hat{\theta}) + \text{Var}(\hat{\theta})$$
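
The decomposition follows from adding and subtracting $E[\hat{\theta}]$ inside the square. A short derivation, writing $\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$ (the standard definition, which this section uses without spelling out):

```latex
\begin{aligned}
\text{MSE}(\hat{\theta})
  &= E\big[(\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta)^2\big] \\
  &= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big]
     + 2\,(E[\hat{\theta}] - \theta)\,E\big[\hat{\theta} - E[\hat{\theta}]\big]
     + (E[\hat{\theta}] - \theta)^2 \\
  &= \text{Var}(\hat{\theta}) + 0 + \text{Bias}^2(\hat{\theta})
\end{aligned}
```

The cross term vanishes because $E[\hat{\theta}] - \theta$ is a constant and $E\big[\hat{\theta} - E[\hat{\theta}]\big] = 0$.
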
The Bias-Variance Tradeoff

In a perfect world, we want zero bias and zero variance. However, in the real world (and especially in Machine Learning), reducing one often increases the other.

Example
The Overfitting Trap

A highly complex model (like a 100th-degree polynomial) can fit your specific data points almost perfectly (very low bias), but its predictions can change wildly if you add even one new data point (high variance). A simpler model might miss some nuances (higher bias) but is much more stable from one data set to the next (lower variance). Finding the model complexity that minimizes the total error is the "holy grail" of data science.
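
The effect can be sketched numerically by refitting a simple and a flexible polynomial to many noisy data sets and measuring how their predictions behave. Everything in this setup is a hypothetical illustration: the true curve `sin(2*pi*x)`, the noise level, the sample size, and the degrees 1 and 9 (a smaller stand-in for the 100th-degree polynomial mentioned above).

```python
import numpy as np

def bias_variance_of_fit(degree, n=15, noise_sd=0.3, trials=500, seed=0):
    """Average squared bias and variance of a degree-`degree` polynomial fit.

    Hypothetical setup: y = sin(2*pi*x) + Gaussian noise on a fixed grid of n points.
    """
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n)
    x_test = np.linspace(0.05, 0.95, 50)
    truth = np.sin(2 * np.pi * x_test)

    predictions = np.empty((trials, x_test.size))
    for t in range(trials):
        # Draw a fresh noisy data set; only the noise changes between trials.
        y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, noise_sd, n)
        fit = np.polynomial.Polynomial.fit(x_train, y_train, degree)
        predictions[t] = fit(x_test)

    # Bias: how far the average prediction is from the truth.
    bias_sq = np.mean((predictions.mean(axis=0) - truth) ** 2)
    # Variance: how much predictions move from one data set to the next.
    variance = np.mean(predictions.var(axis=0))
    return bias_sq, variance

for degree in (1, 9):
    bias_sq, variance = bias_variance_of_fit(degree)
    print(f"degree {degree}: bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

In this setup the straight line should show the larger squared bias and the degree-9 fit the larger variance, which is the tradeoff described above.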

Why Does Sample Variance Divide by $n-1$?

If you've ever used a statistics calculator, you might have noticed that the Sample Variance formula ($S^2$) divides by $n-1$ instead of $n$. This is called Bessel's Correction.

Intuition
Correcting for 'Mean Drift'

If we divide by $n$, our estimator for variance is systematically too small (it is biased). This happens because we don't know the true population mean ($\mu$); we are forced to use the sample mean ($\bar{X}$). Since $\bar{X}$ is calculated from the same data, the data points are naturally "closer" to it than they are to the true $\mu$. Dividing by $n-1$ slightly inflates the result, making the estimator unbiased.
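
The size of the shortfall is known exactly: for i.i.d. data with variance $\sigma^2$, the divide-by-$n$ estimator has expected value $\frac{n-1}{n}\sigma^2$. The simulation below is a minimal check of that claim; the normal model, `sigma2 = 4`, `n = 5`, and the number of trials are assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0   # true population variance (assumed for the demo)
n = 5          # a small sample makes the bias easy to see
samples = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, n))

var_div_n = samples.var(axis=1, ddof=0)          # divide by n
var_div_n_minus_1 = samples.var(axis=1, ddof=1)  # divide by n - 1 (Bessel)

print("true variance:               ", sigma2)
print("average of /n estimates:     ", var_div_n.mean())
print("average of /(n-1) estimates: ", var_div_n_minus_1.mean())
print("theory for /n, (n-1)/n * s^2:", (n - 1) / n * sigma2)
```

The `ddof` argument of NumPy's `var` sets the divisor to `n - ddof`, so `ddof=1` applies Bessel's Correction.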

Efficiency: The 'Speed Limit' of Estimators

Some estimators are more "efficient" than others: they extract more information from the same amount of data. The Fisher Information ($I(\theta)$) measures how much information each data point provides about the parameter.
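
The text does not define $I(\theta)$. A common textbook definition, assumed here, is the expected squared score of a single observation with density $f(x;\theta)$. As a worked case, take one draw from a Normal($\mu$, $\sigma^2$) distribution with known $\sigma^2$:

```latex
I(\theta) = E\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^{2}\right]

\log f(x;\mu) = -\frac{(x-\mu)^2}{2\sigma^2} + \text{const},
\qquad
\frac{\partial}{\partial\mu}\log f(x;\mu) = \frac{x-\mu}{\sigma^2},
\qquad
I(\mu) = \frac{E[(X-\mu)^2]}{\sigma^4} = \frac{1}{\sigma^2}
```

So in this example, the less noisy the data (small $\sigma^2$), the more information each observation carries about $\mu$.
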

Theorem
The Cramér-Rao Lower Bound

There is a mathematical "speed limit" on how low the variance of an unbiased estimator can be. For $n$ independent observations (and under standard regularity conditions), you can never do better than:

$$\text{Var}(\hat{\theta}) \ge \frac{1}{n \cdot I(\theta)}$$

An estimator that attains this bound is called Efficient. Many MLEs do so, at least as $n$ grows large.
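
As a rough check of the bound, consider Poisson data with rate $\lambda$, where $I(\lambda) = 1/\lambda$ and the bound is therefore $\lambda/n$. Both the sample mean (the Poisson MLE) and the sample variance are unbiased for $\lambda$, so the two can be compared directly. The Poisson model and the values of `lam`, `n`, and `trials` are illustrative assumptions, not taken from the text.

```python
import numpy as np

lam, n, trials = 3.0, 50, 20_000   # illustrative values, not from the text
rng = np.random.default_rng(0)
samples = rng.poisson(lam, size=(trials, n))

mean_estimates = samples.mean(axis=1)        # sample mean (the Poisson MLE)
s2_estimates = samples.var(axis=1, ddof=1)   # sample variance, also unbiased for lam

crlb = lam / n   # 1 / (n * I(lam)) with I(lam) = 1 / lam

print("CRLB (lam / n):             ", crlb)
print("variance of sample mean:    ", mean_estimates.var())
print("variance of sample variance:", s2_estimates.var())
```

In this setting the sample mean should sit essentially on the bound, while the sample variance, although unbiased, has a much larger variance and is therefore less efficient.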

Efficiency Comparison

As we gather more data ($n$), the variance of each estimator shown drops toward zero. The MLE (blue) stays at the theoretical floor, while the MoM estimator (green) is 'wasteful': in the plotted values its variance is 1.5 times the floor at every $n$, so it needs about 1.5 times as much data to reach the same precision.

[Figure: variance versus sample size $n$ for three series. Cramér-Rao Limit (Floor) and MLE (Efficient): (10, 0.1), (50, 0.02), (100, 0.01), (500, 0.002). MoM (Less Efficient): (10, 0.15), (50, 0.03), (100, 0.015), (500, 0.003).]