We now know how to estimate parameters using MoM or MLE, but how do we know which estimator is "best"? Is the Sample Mean better than the Sample Median? To answer this, we evaluate estimators ($\hat{\theta}$) based on two fundamental criteria: Bias and Variance.
Imagine shooting arrows at a target where the bullseye is the true parameter $\theta$:
- Bias: How far is the average of all your shots from the bullseye? (Systematic Error/Accuracy)
- Variance: How tight is the grouping of your shots? (Precision)
- MSE: The overall quality of your shooting, accounting for both aim and consistency.
| Scenario | Bias | Variance | Interpretation |
|---|---|---|---|
| Bullseye | Zero | Low | Your aim is true and your hands are steady. |
| Systematic Error | High | Low | Your hands are steady, but your sight is misaligned. |
| High Noise | Zero | High | Your aim is true on average, but your hands are shaky. |
| Chaos | High | High | Your sight is off AND your hands are shaky. |
The Mean Squared Error (MSE) is the most common metric for comparing estimators. It measures the average squared distance between the estimator and the true parameter:

$$\text{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]$$
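One of the most important results in statistics is that MSE can be decomposed into two parts:

$$\text{MSE}(\hat{\theta}) = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})$$

where $\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$.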
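The decomposition can be checked numerically, and the same experiment gives a first answer to the mean-versus-median question. The sketch below is only an illustration under assumed settings: normally distributed data, and hypothetical values for `theta`, `sigma`, `n`, and `trials` that do not come from the text. The helper name `simulate` is also made up for this example.

```python
import random
import statistics


def simulate(estimator, theta=5.0, sigma=2.0, n=25, trials=5000, seed=0):
    """Return (bias, variance, mse) of `estimator` over many repeated samples.

    All default values are illustrative assumptions, not values from the text.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Draw a fresh sample of size n from Normal(theta, sigma^2).
        sample = [rng.gauss(theta, sigma) for _ in range(n)]
        estimates.append(estimator(sample))

    mean_estimate = statistics.fmean(estimates)
    bias = mean_estimate - theta                              # E[theta_hat] - theta
    variance = statistics.pvariance(estimates, mu=mean_estimate)
    mse = statistics.fmean((e - theta) ** 2 for e in estimates)
    return bias, variance, mse


for name, estimator in [("mean", statistics.fmean), ("median", statistics.median)]:
    bias, variance, mse = simulate(estimator)
    print(f"{name:>6}: bias={bias:+.4f}  var={variance:.4f}  "
          f"mse={mse:.4f}  bias^2+var={bias ** 2 + variance:.4f}")
```

For each estimator the last two columns should agree up to floating-point rounding. On normal data both estimators should show a bias near zero, with the median showing the larger variance and therefore the larger MSE.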
In a perfect world, we want zero bias and zero variance. However, in the real world (and especially in Machine Learning), reducing one often increases the other.
A highly complex model (like a 100th-degree polynomial) can fit your specific data points perfectly (very low bias), but it will change wildly if you add even one new data point (high variance). A simpler model might miss some nuances (higher bias) but is much more stable and reliable (lower variance). Striking this balance, known as the bias-variance tradeoff, is the "holy grail" of data science.
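A small simulation makes the tradeoff concrete. The sketch below is an assumption-laden illustration rather than a recipe. It uses a made-up true curve `sin(2*pi*x)`, 10 evenly spaced training points with Gaussian noise, and a single prediction point `x0`. The "complex" model is the degree-9 polynomial that passes through every training point, a smaller stand-in for the 100th-degree polynomial mentioned above. The "simple" model is a straight-line least-squares fit. All function names are hypothetical.

```python
import math
import random
import statistics


def true_f(x):
    # Assumed ground-truth curve, chosen only for illustration.
    return math.sin(2 * math.pi * x)


def linear_fit_predict(xs, ys, x0):
    """Simple model: ordinary least-squares straight line, evaluated at x0."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar + slope * (x0 - x_bar)


def interpolating_poly_predict(xs, ys, x0):
    """Complex model: the degree n-1 polynomial through every point (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x0 - xj) / (xi - xj)
        total += weight * yi
    return total


def bias_and_variance(predict, x0=0.05, n=10, sigma=0.3, trials=3000, seed=3):
    """Refit the model on many noisy training sets and study its prediction at x0."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]
    predictions = []
    for _ in range(trials):
        ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
        predictions.append(predict(xs, ys, x0))
    bias = statistics.fmean(predictions) - true_f(x0)
    return bias, statistics.pvariance(predictions)


for name, model in [("straight line", linear_fit_predict),
                    ("degree-9 interpolant", interpolating_poly_predict)]:
    bias, variance = bias_and_variance(model)
    print(f"{name:>21}: bias={bias:+.3f}  variance={variance:.3f}  "
          f"mse={bias ** 2 + variance:.3f}")
```

Under these assumptions the straight line should show a clearly nonzero bias with a small variance. The interpolating polynomial should show a bias near zero but a variance many times larger, because it chases the noise in each training set.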
If you've ever used a statistics calculator, you might have noticed that the Sample Variance formula ($s^2$) divides by $n-1$ instead of $n$. This is called Bessel's Correction.
If we divide by $n$, our estimator for variance is systematically too small (it is biased). This happens because we don't know the true population mean ($\mu$); we are forced to use the sample mean ($\bar{x}$). Since $\bar{x}$ is calculated from the same data, the data points are naturally "closer" to it than they are to the true $\mu$. Dividing by $n-1$ instead slightly inflates the result, making the estimator unbiased.
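The size of the bias is known exactly: the divide-by-$n$ estimator has expected value $\frac{n-1}{n}\sigma^2$. The simulation below is a minimal sketch of how one might confirm this. It assumes normal data and uses illustrative values (`n=5`, `sigma=3.0`) that do not come from the text; the function name is hypothetical.

```python
import random
import statistics


def average_variance_estimates(n=5, sigma=3.0, trials=20000, seed=1):
    """Average the divide-by-n and divide-by-(n-1) variance estimates over many samples."""
    rng = random.Random(seed)
    total_div_n = 0.0
    total_div_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        # Squared deviations are measured from the sample mean, not the true mean.
        sum_sq = sum((x - x_bar) ** 2 for x in sample)
        total_div_n += sum_sq / n
        total_div_n_minus_1 += sum_sq / (n - 1)
    return total_div_n / trials, total_div_n_minus_1 / trials


biased, unbiased = average_variance_estimates()
print(f"true variance:          {3.0 ** 2:.3f}")
print(f"average, divide by n:   {biased:.3f}")
print(f"average, divide by n-1: {unbiased:.3f}")
```

With these settings the true variance is 9, so the divide-by-$n$ average should come out close to $\frac{4}{5} \cdot 9 = 7.2$, while the divide-by-$(n-1)$ average should sit close to 9. A small `n` is used on purpose, since the bias shrinks as the sample grows.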
Some estimators are more "efficient" than others: they extract more information from the same amount of data. The Fisher Information ($I(\theta)$) measures how much information each data point provides about the parameter.
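There is a mathematical "speed limit" on how low the variance of an unbiased estimator can be, known as the Cramér-Rao lower bound. With $n$ independent observations, you can never do better than:

$$\text{Var}(\hat{\theta}) \geq \frac{1}{n \, I(\theta)}$$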
An estimator that hits this limit (as many MLEs do, at least for large samples) is called Efficient.
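As a concrete case, take normal data with known $\sigma$. The Fisher Information about the mean is $I(\mu) = 1/\sigma^2$, so the variance floor is $\sigma^2/n$. The sample mean (the MLE of $\mu$ here) attains it, while the sample median has a variance of roughly $\pi\sigma^2/(2n)$ for large $n$. The sketch below compares both against the floor by simulation. The sample sizes, trial count, and function name are assumptions chosen for illustration.

```python
import random
import statistics


def estimator_variance(estimator, n, mu=0.0, sigma=1.0, trials=3000, seed=2):
    """Empirical variance of `estimator` across many Normal(mu, sigma^2) samples of size n."""
    rng = random.Random(seed)
    estimates = [
        estimator([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(trials)
    ]
    return statistics.pvariance(estimates)


sigma = 1.0
for n in (10, 40, 160):
    floor = sigma ** 2 / n  # 1 / (n * I(mu)), with I(mu) = 1 / sigma^2
    var_mean = estimator_variance(statistics.fmean, n, sigma=sigma)
    var_median = estimator_variance(statistics.median, n, sigma=sigma)
    print(f"n={n:>3}: floor={floor:.4f}  mean={var_mean:.4f}  median={var_median:.4f}")
```

The sample mean should track the floor closely at every `n`, while the median should stay around 1.5 times higher. In other words, the median needs roughly 50% more data to reach the same precision on normal data.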
Figure: Efficiency Comparison. As we gather more data ($n$), the variance of all estimators drops toward zero. The MLE (blue) stays at the absolute theoretical floor, while others (green) are "wasteful" and require more data to reach the same precision.