Two Ways to See the World

In the previous chapters, we knew the parameters (like $p$ or $\lambda$) and predicted the data. In estimation, we do the opposite: we have the data, and we need to guess the parameters that generated it.

There are two main schools of thought:

Definition
Frequentist vs. Bayesian
  • Frequentist: Parameters ($\theta$) are fixed, unknown constants. If you flip a coin, it has one true bias $p$. We look for the single "best" value of $p$ from the observed data, and we judge the method by how it would perform over many repeated experiments.
  • Bayesian: Parameters ($\theta$) are random variables themselves. We have an initial "prior belief" about $p$, and we update that belief to a "posterior distribution" as we observe more data.

This chapter focuses on the Frequentist goal: finding a single "best guess" (a point estimate, $\hat{\theta}$) for an unknown parameter.

Method of Moments (MoM): The Simple Match

The Method of Moments is the most intuitive estimation technique: you simply match the averages you see in your sample to the theoretical averages of the mathematical model.

Example
MoM Intuition

If you know that, on average, a bus arrives every $\lambda$ minutes (the theoretical mean), and you observe 5 buses with an average wait time of 10 minutes (the sample mean), your MoM estimate is $\hat{\lambda} = 10$. You are literally just matching the Sample Mean ($\bar{X}$) to the Theoretical Mean ($E[X]$). In this example $\lambda$ is a mean time in minutes rather than a rate.

The Formula

To estimate $\theta$, we set the sample mean equal to the theoretical mean and solve for $\theta$:

$$\text{Sample Mean: } \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i \quad \longleftrightarrow \quad E[X] = \mu(\theta)$$
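As a quick illustration of the solving step, here are two textbook cases worked out in LaTeX. The choice of Poisson and Uniform models is ours, made only to show how the matching equation turns into an estimator; they are not part of the examples in this chapter.

```latex
\begin{align*}
\text{Poisson}(\lambda):\quad & E[X] = \lambda
  && \Rightarrow\quad \hat{\lambda} = \bar{X} \\
\text{Uniform}(0, \theta):\quad & E[X] = \frac{\theta}{2}
  && \Rightarrow\quad \hat{\theta} = 2\bar{X}
\end{align*}
```

In both cases the recipe is the same: write $E[X]$ in terms of the parameter, replace $E[X]$ with $\bar{X}$, and solve.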

Estimating Customer Arrivals

Difficulty: medium

You track the time (in minutes) between customers entering a shop: [2, 5, 1, 3, 4]. Assuming an Exponential distribution ($E[X] = 1/\lambda$), what is your MoM estimate for the arrival rate $\lambda$?
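One way to work through this exercise is sketched below in Python. The variable names (`gaps`, `sample_mean`, `lambda_hat`) are illustrative choices, not part of the exercise; the only modelling assumption is the Exponential mean $E[X] = 1/\lambda$ stated in the question.

```python
# Inter-arrival times in minutes, taken from the exercise above.
gaps = [2, 5, 1, 3, 4]

# Sample mean: (2 + 5 + 1 + 3 + 4) / 5 = 3 minutes.
sample_mean = sum(gaps) / len(gaps)

# For an Exponential model E[X] = 1 / lambda, so matching
# sample_mean = 1 / lambda gives lambda_hat = 1 / sample_mean.
lambda_hat = 1 / sample_mean

print(f"sample mean = {sample_mean} min")
print(f"lambda_hat  = {lambda_hat:.4f} customers per minute")
```

The estimate comes out to one customer every three minutes, or roughly 0.33 customers per minute.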

Maximum Likelihood Estimation (MLE): The Best Fit

MLE is often treated as the default method of estimation. It asks: "Out of all possible values for $\theta$, which one makes the specific data I actually saw the most likely to have occurred?"

Intuition
The Likelihood Peak

Imagine you flip a coin 10 times and get 9 heads.

  • Could the true bias be $p = 0.5$? Yes, but 9 heads would then be very unlikely (probability about 0.01).
  • Could it be $p = 0.9$? Yes, and 9 heads would then be quite plausible (probability about 0.39).
  • Could it be $p = 0.1$? Then 9 heads would be almost impossible.

MLE chooses $p = 0.9$ because, of all values of $p$, it "maximizes the likelihood" (the probability) of the observed data.
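The three candidates can be compared numerically. The sketch below assumes a Binomial model for the 10 flips; the helper name `binomial_likelihood` is ours, and `math.comb` is the standard-library binomial coefficient.

```python
from math import comb

def binomial_likelihood(p, heads=9, flips=10):
    """Probability of exactly `heads` heads in `flips` flips when P(heads) = p."""
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

# Evaluate the likelihood of "9 heads in 10 flips" at each candidate bias.
for p in (0.5, 0.9, 0.1):
    print(f"p = {p}: likelihood = {binomial_likelihood(p):.3g}")
```

The values are roughly 0.0098 for $p = 0.5$, 0.39 for $p = 0.9$, and 9e-9 for $p = 0.1$, which is why $p = 0.9$ wins among these three.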

The Likelihood Function

We define the Likelihood $\mathcal{L}(\theta \mid x_1, \ldots, x_n)$ as the joint probability (or density) of the data points given $\theta$, viewed as a function of $\theta$. For independent observations it is a product:

$$\mathcal{L}(\theta \mid x) = \prod_{i=1}^n f(x_i \mid \theta)$$

Likelihood Curve for Binomial p

This curve shows the likelihood of seeing our data for every possible value of p from 0 to 1. The highest point (the peak) is our MLE estimate.

[Figure: L(p | observed heads) plotted against p]
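A curve like this can be traced by evaluating the likelihood on a grid of candidate values and picking the highest point. The sketch below is only an illustration: it assumes the data are 9 heads in 10 flips (borrowed from the coin example, not read off the figure), and the names `likelihood`, `grid`, and `p_hat` are ours.

```python
from math import comb

heads, flips = 9, 10  # assumed data, borrowed from the coin example

def likelihood(p):
    """Binomial likelihood of the assumed data at bias p."""
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

# Candidate values p = 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]

# The grid point with the highest likelihood approximates the MLE.
p_hat = max(grid, key=likelihood)
print(p_hat, round(likelihood(p_hat), 4))
```

A grid search only finds the peak up to the grid spacing; for this model the exact answer is available in closed form as heads divided by flips.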
Example
The German Tank Problem

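In WWII, the Allies estimated German tank production from the serial numbers on captured tanks. If serial numbers run 1, 2, ..., N and every tank is equally likely to be captured, the MLE of N is simply the largest serial number observed. That value can never overshoot N, so it tends to underestimate, and in practice it is adjusted upward to account for the gaps between observed numbers. The statistical estimates (about 250 tanks per month) turned out far closer to the truth than conventional intelligence estimates (well over 1,000 per month).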
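To make the idea concrete, here is a small Python sketch. The serial numbers are made up for illustration, and the model (serials numbered 1 to N, each equally likely to be captured, no repeats) is an assumption. The adjusted formula m + m/k - 1 is the commonly cited bias correction for this model, not necessarily the exact calculation the wartime analysts performed.

```python
# Hypothetical serial numbers from captured tanks (made up for illustration).
serials = [19, 40, 42, 60]

k = len(serials)   # number of tanks observed
m = max(serials)   # largest serial number observed

# MLE of the total N: the likelihood is highest at the smallest N that is
# still consistent with the data, which is the sample maximum.
mle = m

# Bias-adjusted estimate: add the average gap between observed serials.
adjusted = m + m / k - 1

print(mle, adjusted)
```

With these made-up numbers the MLE is 60 and the adjusted estimate is 74, which shows how far the raw maximum can sit below a more careful estimate when only a few tanks have been seen.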

MLE Workflow

Input: Observed sample data and a model (e.g., Normal, Poisson)
Output: The best-fit parameter estimate $\hat{\theta}$

  1. Write Likelihood
    Form the product of the probabilities (or densities) of every data point: $\mathcal{L}(\theta) = \prod_{i=1}^n f(x_i \mid \theta)$.
  2. Log-Likelihood
    Take the natural log of $\mathcal{L}(\theta)$. The product becomes a sum, and sums are much easier to differentiate than products. Because the log is an increasing function, the maximizer is unchanged.
  3. Maximize
    Take the derivative of the log-likelihood with respect to $\theta$, set it to zero, and solve.
  4. Verify
    Check that the second derivative is negative at the solution, confirming that you found a maximum and not a minimum.
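The four steps can be followed on the coin example. The derivation below is a worked sketch assuming a Binomial model with $k$ heads in $n$ flips; the symbols $k$, $n$, and $\ell$ (for the log-likelihood) are notation introduced here.

```latex
\begin{align*}
\text{1. Likelihood:}\quad & \mathcal{L}(p) = \binom{n}{k}\, p^{k} (1-p)^{n-k} \\
\text{2. Log-likelihood:}\quad & \ell(p) = \log\binom{n}{k} + k \log p + (n-k)\log(1-p) \\
\text{3. Maximize:}\quad & \ell'(p) = \frac{k}{p} - \frac{n-k}{1-p} = 0
  \;\Longrightarrow\; \hat{p} = \frac{k}{n} \\
\text{4. Verify:}\quad & \ell''(p) = -\frac{k}{p^{2}} - \frac{n-k}{(1-p)^{2}} < 0
  \quad \text{for } 0 < p < 1
\end{align*}
```

With $k = 9$ and $n = 10$ this gives $\hat{p} = 0.9$, matching the intuition from the coin example.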