The N-Dimensional Bell Curve

In previous chapters, we looked at the 1D Bell Curve. But the real world is multi-dimensional. When you're modeling a robot's position $(x, y, z)$, or the prices of the 500 stocks in the S&P 500, you need the Multivariate Normal (MVN) distribution.

The MVN is the bedrock of modern Machine Learning, from Gaussian Processes to Variational Autoencoders (VAEs).

Intuition
The Geometry of Probability

In 1D, the variance $\sigma^2$ tells you how wide the bell is. In $N$ dimensions, we have a Mean Vector ($\boldsymbol{\mu}$) and a Covariance Matrix ($\Sigma$). This matrix doesn't just tell you how wide the distribution is; it tells you its shape (is it a circle? a stretched ellipse?) and its orientation (is it tilted?).

The Covariance Matrix (Σ)

| Layer | Operation | Shape | Note |
|---|---|---|---|
| Variables | Input dim | [d] | Number of random variables in the vector |
| Covariance | Σ = E[(X-μ)(X-μ)ᵀ] | [d, d] | Symmetric, positive semi-definite matrix |

$$f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)$$
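As a rough illustration, the density formula can be evaluated numerically. The sketch below is one way to do it, assuming NumPy and a positive definite $\Sigma$ (so that $\Sigma^{-1}$ exists); the function name `mvn_logpdf` and the example numbers are made up for illustration. It works in log space and uses a Cholesky factorization instead of forming the inverse explicitly.

```python
import numpy as np

def mvn_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma) at x; sigma is assumed positive definite."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.shape[0]
    diff = x - mu
    # Cholesky factor L with sigma = L @ L.T (raises LinAlgError if sigma is not positive definite).
    L = np.linalg.cholesky(sigma)
    # Solve L z = diff, so z @ z equals (x - mu)^T sigma^{-1} (x - mu).
    z = np.linalg.solve(L, diff)
    maha = z @ z
    # log|sigma| = 2 * sum(log(diag(L)))
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)

# Illustrative 2D example: unit variances, covariance 0.8.
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
density_at_mean = np.exp(mvn_logpdf(mu, mu, sigma))
```

At the mean the exponent vanishes, so the density reduces to the normalizing constant $1 / \sqrt{(2\pi)^d |\Sigma|}$.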

Visualizing the Covariance Matrix

The diagonal elements ($\Sigma_{ii}$) are the variances of each individual variable. The off-diagonal elements ($\Sigma_{ij}$) are the covariances between variable $i$ and variable $j$.
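To make the roles of the diagonal and off-diagonal entries concrete, here is a small sketch that estimates a covariance matrix from data with NumPy. The observations are invented purely for illustration.

```python
import numpy as np

# Hypothetical observations: 5 samples (rows) of 2 variables (columns).
data = np.array([
    [1.0, 2.1],
    [2.0, 3.9],
    [3.0, 6.2],
    [4.0, 7.8],
    [5.0, 10.0],
])

# rowvar=False treats columns as variables; np.cov divides by (n - 1) by default.
cov = np.cov(data, rowvar=False)

variances = np.diag(cov)  # Sigma_ii: variance of each variable
cov_01 = cov[0, 1]        # Sigma_01: covariance between variable 0 and variable 1
```

The diagonal matches the per-column sample variances, and the matrix is symmetric because $\Sigma_{ij} = \Sigma_{ji}$.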

Visualizing Correlation

Large off-diagonal values (such as the 0.8 in the matrix below) mean the variables tend to move together. In a 2D contour plot, this looks like a thin, tilted ellipse rather than a circle.

| | Stock A | Stock B |
|---|---|---|
| Stock A | 1 | 0.8 |
| Stock B | 0.8 | 1 |

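The "thin, tilted ellipse" can be read directly off the matrix through its eigen-decomposition: the eigenvectors give the directions of the ellipse's axes and the square roots of the eigenvalues give their relative lengths. A minimal sketch, assuming NumPy and using the Stock A / Stock B matrix above:

```python
import numpy as np

sigma = np.array([[1.0, 0.8], [0.8, 1.0]])

# eigh is meant for symmetric matrices; eigenvalues are returned in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(sigma)

# Semi-axis lengths of the contour at Mahalanobis distance 1.
axis_lengths = np.sqrt(eigenvalues)

# Orientation of the long axis: angle of the eigenvector with the largest eigenvalue.
# The sign of an eigenvector is arbitrary, so the angle is reduced modulo 180 degrees.
major = eigenvectors[:, -1]
tilt_degrees = np.degrees(np.arctan2(major[1], major[0])) % 180.0
```

For this matrix the eigenvalues are 0.2 and 1.8, so the ellipse is three times longer along the diagonal direction (where both stocks rise or fall together) than across it.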
Why is MVN so Special?

The MVN is beloved by mathematicians and engineers because it is closed under several important operations; apply any of them to a Normal distribution and you get another Normal distribution:

  1. Marginals are Normal: If you have a 100-dimensional Normal distribution and you ignore 98 of the variables, the remaining 2 variables follow a 2D Normal distribution.
  2. Conditionals are Normal: If Temperature and Humidity are jointly Normal, and you observe that it's exactly 30°C, the distribution of Humidity given that observation (the conditional distribution) is still a Normal distribution.
  3. Linear Combinations are Normal: If you add two jointly Normal vectors together (independent Normal vectors are one such case), or multiply a Normal vector by a matrix, the result is guaranteed to be Normal. This makes it incredibly easy to "propagate" uncertainty through linear systems.
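The first two properties have simple closed forms. The marginal is obtained by keeping the matching entries of $\boldsymbol{\mu}$ and $\Sigma$. For the conditional, split the variables into a kept block $k$ and an observed block $o$; the standard result is $\boldsymbol{\mu}_{k|o} = \boldsymbol{\mu}_k + \Sigma_{ko}\Sigma_{oo}^{-1}(\mathbf{x}_o - \boldsymbol{\mu}_o)$ and $\Sigma_{k|o} = \Sigma_{kk} - \Sigma_{ko}\Sigma_{oo}^{-1}\Sigma_{ok}$. The sketch below assumes NumPy; the helper name `condition` and the temperature/humidity numbers are hypothetical.

```python
import numpy as np

def condition(mu, sigma, obs_idx, obs_values):
    """Mean and covariance of the unobserved variables given the observed ones."""
    keep = np.setdiff1d(np.arange(mu.shape[0]), obs_idx)
    s_kk = sigma[np.ix_(keep, keep)]
    s_ko = sigma[np.ix_(keep, obs_idx)]
    s_oo = sigma[np.ix_(obs_idx, obs_idx)]
    # gain = s_ko @ inv(s_oo), computed with a solve instead of an explicit inverse.
    gain = np.linalg.solve(s_oo, s_ko.T).T
    cond_mu = mu[keep] + gain @ (obs_values - mu[obs_idx])
    cond_sigma = s_kk - gain @ s_ko.T
    return cond_mu, cond_sigma

# Hypothetical joint model of (temperature in °C, relative humidity in %).
mu = np.array([25.0, 60.0])
sigma = np.array([[16.0, -12.0],
                  [-12.0, 25.0]])

# Marginal of humidity: read off the matching entries.
humidity_marginal_mean = mu[1]
humidity_marginal_var = sigma[1, 1]

# Conditional of humidity given that temperature is observed to be 30 °C.
cond_mu, cond_sigma = condition(mu, sigma, [0], np.array([30.0]))
```

With these made-up numbers, observing a temperature above its mean shifts the expected humidity down (the covariance is negative) and shrinks its variance from 25 to 16.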

For a linear transformation, the new mean and covariance follow directly:

  1. Let $\mathbf{X}$ be a Gaussian random vector: $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$.
  2. Consider a linear transformation: $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$.
  3. The new mean vector is: $E[\mathbf{Y}] = \mathbf{A}\boldsymbol{\mu} + \mathbf{b}$.
  4. The new covariance matrix is: $\mathrm{Cov}(\mathbf{Y}) = \mathbf{A}\Sigma\mathbf{A}^T$.
  5. Result: $\mathbf{Y} \sim \mathcal{N}(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\Sigma\mathbf{A}^T)$. This property is the mathematical foundation of the Kalman Filter, which was used for navigation on NASA's Apollo missions to the Moon!
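The transformation rule is easy to check numerically. The sketch below, assuming NumPy, computes $\mathbf{A}\boldsymbol{\mu} + \mathbf{b}$ and $\mathbf{A}\Sigma\mathbf{A}^T$ in closed form and compares them with estimates from transformed samples. The particular $\boldsymbol{\mu}$, $\Sigma$, $\mathbf{A}$, and $\mathbf{b}$ are arbitrary illustrative choices.

```python
import numpy as np

mu = np.array([1.0, 2.0])
sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
A = np.array([[1.0, 1.0],    # first output: sum of the two inputs
              [1.0, -1.0]])  # second output: difference of the two inputs
b = np.array([0.5, 0.0])

# Closed-form mean and covariance of Y = A X + b.
mean_y = A @ mu + b
cov_y = A @ sigma @ A.T

# Monte Carlo check: sample X, transform each sample, and estimate the moments.
rng = np.random.default_rng(0)
x = rng.multivariate_normal(mu, sigma, size=200_000)
y = x @ A.T + b
empirical_mean = y.mean(axis=0)
empirical_cov = np.cov(y, rowvar=False)
```

Here `cov_y` comes out diagonal: for two variables with equal variance, their sum and their difference are uncorrelated, even though the inputs are strongly correlated.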
Example
Portfolio Optimization

In finance, investors use the MVN to model the returns of multiple assets. The Covariance Matrix tells them which stocks move together (bad for diversification) and which move in opposite directions (good for hedging). Finding the "Minimum Variance Portfolio" is an exercise in linear algebra using these Gaussian properties.
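As a sketch of that linear algebra: for a fully invested portfolio with short positions allowed, the minimum variance weights have the closed form $\mathbf{w} = \Sigma^{-1}\mathbf{1} / (\mathbf{1}^T \Sigma^{-1} \mathbf{1})$. The example below assumes NumPy, and the three-asset covariance matrix is invented for illustration, not real market data.

```python
import numpy as np

# Hypothetical covariance matrix of returns for three assets.
sigma = np.array([
    [0.040, 0.018, 0.002],
    [0.018, 0.090, 0.006],
    [0.002, 0.006, 0.010],
])
ones = np.ones(sigma.shape[0])

# w = Sigma^{-1} 1 / (1^T Sigma^{-1} 1); solve avoids forming the inverse explicitly.
raw = np.linalg.solve(sigma, ones)
weights = raw / raw.sum()

# Portfolio variance is w^T Sigma w; compare against a naive equal-weight portfolio.
portfolio_variance = weights @ sigma @ weights
equal_weights = ones / ones.size
equal_weight_variance = equal_weights @ sigma @ equal_weights
```

The weights sum to 1 by construction, and the resulting variance equals $1 / (\mathbf{1}^T \Sigma^{-1} \mathbf{1})$. Real portfolios usually add constraints (such as no short selling), which turns this into a quadratic program rather than a single linear solve.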