In the previous chapters, we knew the parameters (like a coin's bias $p$ or an arrival rate $\lambda$) and predicted the data. In estimation, we do the opposite: we have the data, and we need to infer the parameters that generated it.
There are two main schools of thought:
- Frequentist: Parameters ($\theta$) are fixed, unknown constants. If you flip a coin, it has one true bias $p$. We try to find the "best" single value for $\theta$ based on many repeated experiments.
- Bayesian: Parameters ($\theta$) are random variables themselves. We have an initial "prior belief" about $\theta$, and we update that belief to a "posterior distribution" as we observe more data.
This chapter focuses on the Frequentist goal: finding a single "best guess" (a point estimate, $\hat{\theta}$) for an unknown parameter.
The Method of Moments is the most intuitive estimation technique: you simply match the averages you see in your sample to the theoretical averages of the mathematical model.
Suppose a bus arrives, on average, every $\theta$ minutes (theoretical mean), and you observe 5 buses with an average wait time of 10 minutes (sample mean). Your "MoM" estimate is then $\hat{\theta} = 10$. You are literally just matching the Sample Mean ($\bar{X}$) to the Theoretical Mean ($E[X]$).
The Formula
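To estimate $\theta$, we set the theoretical mean (which depends on $\theta$) equal to the sample mean and solve for $\theta$:
$$E[X] = \bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$$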
Estimating Customer Arrivals
You track the time (in minutes) between customers entering a shop: [2, 5, 1, 3, 4]. Assuming an Exponential distribution ($X \sim \text{Exp}(\lambda)$, with mean $E[X] = 1/\lambda$), what is your MoM estimate for the arrival rate $\lambda$?
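A minimal sketch of how this exercise could be worked in Python, assuming the Exponential model with mean 1/λ. The function name `mom_exponential_rate` is hypothetical and chosen only for illustration. Matching the sample mean to 1/λ gives the estimate λ = 1 / (sample mean).

```python
from statistics import mean

# Inter-arrival times in minutes, taken from the exercise above.
gaps = [2, 5, 1, 3, 4]


def mom_exponential_rate(samples):
    """Method of Moments estimate of the rate for an Exponential model.

    Sets the sample mean equal to the theoretical mean 1/lambda
    and solves for lambda.
    """
    sample_mean = mean(samples)
    if sample_mean <= 0:
        raise ValueError("sample mean must be positive")
    return 1 / sample_mean


rate_hat = mom_exponential_rate(gaps)
print(f"sample mean = {mean(gaps)} minutes")
print(f"lambda_hat  = {rate_hat:.4f} customers per minute")
```

The sample mean is 15 / 5 = 3 minutes, so the estimated rate is 1/3, or roughly one customer every three minutes.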
Maximum Likelihood Estimation (MLE) is often considered the gold standard of estimation. It asks: "Out of all possible values for $\theta$, which one makes the specific data I actually saw the most likely to have occurred?"
Imagine you flip a coin 10 times and get 9 heads.
- Could the true bias be $p = 0.5$? Yes, but it's very unlikely you'd get 9 heads (about a 1% chance).
- Could it be $p = 0.9$? Yes, that makes 9 heads quite likely (about a 39% chance).
- Could it be $p = 0.1$? It's almost impossible.

MLE chooses $\hat{p} = 0.9$ because it "maximizes the likelihood" (the probability) of the observed data.
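The comparison can be checked numerically. The sketch below assumes a Binomial model for 9 heads in 10 flips. The candidate biases 0.1, 0.5, and 0.9 are illustrative values, and `binomial_likelihood` is a hypothetical helper name.

```python
from math import comb


def binomial_likelihood(p, heads=9, flips=10):
    """Probability of seeing `heads` heads in `flips` flips when the bias is p."""
    return comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)


# Illustrative candidate values for the coin's bias (an assumption of this sketch).
candidates = [0.1, 0.5, 0.9]
likelihoods = {p: binomial_likelihood(p) for p in candidates}

for p, value in likelihoods.items():
    print(f"p = {p}: likelihood = {value:.3g}")

# The candidate that makes the observed data most probable.
best = max(likelihoods, key=likelihoods.get)
print(f"most likely candidate: p = {best}")
```

Among these three candidates, 0.9 gives by far the largest likelihood, which is why it is the preferred value.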
The Likelihood Function
We define the Likelihood as the joint probability of the data points given $\theta$:
$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
Likelihood Curve for Binomial p
This curve shows the likelihood of seeing our data for every possible value of $p$ from 0 to 1. The highest point (the peak) is the maximum likelihood estimate.
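A rough way to trace such a curve without any plotting library is to evaluate the likelihood on a grid and locate the peak. The sketch below assumes the data are 9 heads in 10 flips (the coin example from earlier). The grid resolution of 0.001 is an arbitrary choice.

```python
from math import comb

heads, flips = 9, 10  # assumed data: the coin example

# Evaluate the Binomial likelihood on an evenly spaced grid of p values in [0, 1].
grid = [i / 1000 for i in range(1001)]
curve = [comb(flips, heads) * p**heads * (1 - p) ** (flips - heads) for p in grid]

# The grid point with the highest likelihood approximates the MLE.
peak_index = max(range(len(grid)), key=curve.__getitem__)
p_hat = grid[peak_index]

print(f"peak of the likelihood curve at p = {p_hat}")
```

A grid search only approximates the maximum to within the grid spacing. For this model the peak can also be found exactly with calculus.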
In WWII, the Allies applied this kind of statistical estimation to work out how many tanks the Germans were producing. By looking at the serial numbers on captured tanks, they inferred the likely "maximum serial number." Their statistical estimates (which said production was ~250/month) were far more accurate than conventional intelligence estimates (which said ~1,000/month)!
MLE Workflow
- 1. Write Likelihood: Form the product of the probabilities for every data point: $L(\theta) = \prod_{i} f(x_i \mid \theta)$.
- 2. Log-Likelihood: Take the natural log of $L(\theta)$. Sums are much easier to differentiate than products!
- 3. Maximize: Take the derivative with respect to $\theta$, set it to zero, and solve.
- 4. Verify: Ensure the second derivative is negative to confirm you found a maximum, not a minimum.
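As a worked instance of these four steps, consider a coin with unknown bias $p$ that shows $k$ heads in $n$ flips. The symbols $k$, $n$, and $\ell$ (for the log-likelihood) are notation introduced here for illustration. The binomial coefficient is a constant that does not depend on $p$, so it drops out when differentiating.

```latex
\begin{aligned}
\text{1. Likelihood:} \quad & L(p) = \binom{n}{k}\, p^{k} (1-p)^{n-k} \\
\text{2. Log-likelihood:} \quad & \ell(p) = \ln L(p) = \ln\binom{n}{k} + k \ln p + (n-k) \ln(1-p) \\
\text{3. Maximize:} \quad & \ell'(p) = \frac{k}{p} - \frac{n-k}{1-p} = 0
  \;\Longrightarrow\; \hat{p} = \frac{k}{n} \\
\text{4. Verify:} \quad & \ell''(p) = -\frac{k}{p^{2}} - \frac{n-k}{(1-p)^{2}} < 0
  \quad \text{for } 0 < p < 1
\end{aligned}
```

With $k = 9$ and $n = 10$, this gives $\hat{p} = 9/10 = 0.9$, matching the coin example.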