
The Central Limit Theorem: Why the Normal Distribution Is Everywhere

We state and prove the Central Limit Theorem — the reason the bell curve appears throughout nature, science, and statistics — and explore its assumptions, generalizations, and applications.

The Theorem

The Central Limit Theorem (Lindeberg-Lévy)

Let $X_1, X_2, \ldots$ be independent and identically distributed random variables with mean $\mu$ and finite variance $\sigma^2 > 0$. Let $S_n = X_1 + \cdots + X_n$. Then:

$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty$$

In words: regardless of the distribution of the individual $X_i$, their normalized sum converges in distribution to a standard normal. This is why the bell curve appears everywhere — it is the universal attractor for sums of independent random variables.


What Does "Convergence in Distribution" Mean?

The notation $Z_n \xrightarrow{d} Z$ means that for every $a \in \mathbb{R}$ at which the CDF of $Z$ is continuous:

$$\lim_{n \to \infty} P(Z_n \leq a) = P(Z \leq a) = \Phi(a)$$

where $\Phi(a) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^a e^{-t^2/2}\,dt$ is the standard normal CDF.


Intuition

Why Sums Become Normal

Consider rolling a single die — the distribution is uniform on $\{1, 2, 3, 4, 5, 6\}$. Now roll $n$ dice and sum them:

  • $n = 1$: flat distribution (uniform)
  • $n = 2$: triangular distribution
  • $n = 5$: already visibly bell-shaped
  • $n = 30$: nearly indistinguishable from a Gaussian

The CLT explains this universality: the specific shape of the original distribution is "washed out" by summation. Only the mean and variance survive in the limit.
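As a quick empirical check (a Python sketch; the sample count and seed are arbitrary choices), we can simulate sums of 30 dice and compare the empirical mean and standard deviation to the CLT prediction:

```python
import random
import statistics

def dice_sum_samples(n_dice, n_samples, seed=0):
    """Draw n_samples independent sums of n_dice fair six-sided dice."""
    rng = random.Random(seed)
    return [sum(rng.randint(1, 6) for _ in range(n_dice))
            for _ in range(n_samples)]

# One die has mean 3.5 and variance 35/12, so the sum of 30 dice
# should have mean 30 * 3.5 = 105 and variance 30 * 35/12 = 87.5.
samples = dice_sum_samples(30, 20_000)
mean = statistics.fmean(samples)
std = statistics.pstdev(samples)
```

With 20,000 samples, `mean` lands close to 105 and `std` close to $\sqrt{87.5} \approx 9.35$, and a histogram of `samples` is visibly bell-shaped.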

A Precise Statement

If $\bar{X}_n = S_n/n$ is the sample mean, the CLT equivalently says:

$$\sqrt{n}\left(\bar{X}_n - \mu\right) \xrightarrow{d} N(0, \sigma^2)$$

or in the approximate form used in practice:

$$\bar{X}_n \;\dot\sim\; N\!\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{for large } n$$
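This sample-mean form is easy to test by simulation. The Python sketch below standardizes many sample means drawn from a skewed distribution; the choice of $\mathrm{Exp}(1)$, $n = 400$, and the repetition count are illustrative assumptions, not anything canonical:

```python
import random
import statistics

# Exp(1) has mu = 1 and sigma^2 = 1, so sqrt(n) * (sample mean - 1)
# should be approximately N(0, 1) even though Exp(1) itself is skewed.
rng = random.Random(1)
n, reps = 400, 3000
vals = [
    n ** 0.5 * (statistics.fmean(rng.expovariate(1.0) for _ in range(n)) - 1.0)
    for _ in range(reps)
]
```

The standardized values in `vals` come out with mean near 0 and standard deviation near 1, as the CLT predicts.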


Proof via Characteristic Functions

The most elegant proof uses characteristic functions (Fourier transforms of probability distributions).

Proof.

Step 1 — Setup. Without loss of generality, assume $\mu = 0$ and $\sigma = 1$ (replace $X_i$ by $(X_i - \mu)/\sigma$). We must show:

$$Z_n = \frac{S_n}{\sqrt{n}} = \frac{X_1 + \cdots + X_n}{\sqrt{n}} \xrightarrow{d} N(0,1)$$

Step 2 — Characteristic function of $Z_n$. The characteristic function of $X_i$ is $\varphi(t) = E[e^{itX_i}]$. By independence:

$$\varphi_{Z_n}(t) = E\!\left[e^{itZ_n}\right] = \prod_{k=1}^n E\!\left[e^{it X_k / \sqrt{n}}\right] = \left[\varphi\!\left(\frac{t}{\sqrt{n}}\right)\right]^n$$

Step 3 — Taylor expansion. Since $E[X_i] = 0$ and $E[X_i^2] = 1$:

$$\varphi(s) = 1 + is \cdot E[X_i] - \frac{s^2}{2}E[X_i^2] + o(s^2) = 1 - \frac{s^2}{2} + o(s^2)$$

Substituting $s = t/\sqrt{n}$:

$$\varphi\!\left(\frac{t}{\sqrt{n}}\right) = 1 - \frac{t^2}{2n} + o(1/n)$$

Step 4 — Take the limit.

$$\varphi_{Z_n}(t) = \left[1 - \frac{t^2}{2n} + o(1/n)\right]^n \to e^{-t^2/2} \quad \text{as } n \to \infty$$

The function $e^{-t^2/2}$ is the characteristic function of $N(0,1)$.

Step 5 — Apply Lévy's continuity theorem. Since $\varphi_{Z_n}(t) \to e^{-t^2/2}$ pointwise and $e^{-t^2/2}$ is continuous at $0$, we conclude $Z_n \xrightarrow{d} N(0,1)$. $\square$
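The convergence in Step 4 can also be watched numerically. The Python sketch below evaluates $[\varphi(t/\sqrt{n})]^n$ for a mean-zero, unit-variance uniform distribution; the uniform choice and the value $t = 1.7$ are arbitrary illustrative assumptions:

```python
import math

def phi_uniform(t):
    """Characteristic function of Uniform(-sqrt(3), sqrt(3)),
    which has mean 0 and variance 1: phi(t) = sin(sqrt(3) t) / (sqrt(3) t)."""
    a = math.sqrt(3.0)
    return 1.0 if t == 0 else math.sin(a * t) / (a * t)

def phi_Zn(t, n):
    """Characteristic function of Z_n = (X_1 + ... + X_n) / sqrt(n)."""
    return phi_uniform(t / math.sqrt(n)) ** n

t = 1.7
limit = math.exp(-t ** 2 / 2)  # characteristic function of N(0, 1) at t
# The gap |phi_Zn(t) - e^{-t^2/2}| shrinks as n grows.
errs = [abs(phi_Zn(t, n) - limit) for n in (10, 100, 1000)]
```

The entries of `errs` decrease monotonically toward 0, matching the pointwise convergence the proof establishes.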


The Berry-Esseen Theorem

The CLT says the distribution converges — but how fast?

Berry-Esseen Theorem

If $E[|X_i|^3] = \rho < \infty$, then:

$$\sup_x \left|P\!\left(\frac{S_n - n\mu}{\sigma\sqrt{n}} \leq x\right) - \Phi(x)\right| \leq \frac{C \rho}{\sigma^3 \sqrt{n}}$$

where $C \leq 0.4748$ is a universal constant.

The error is $O(1/\sqrt{n})$ — so for practical purposes, $n \geq 30$ often gives a good normal approximation.
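For a concrete check of the bound, consider $X_i = \pm 1$ with probability $1/2$ each (so $\mu = 0$, $\sigma = 1$, $\rho = 1$), where the exact CDF of $S_n$ is binomial and can be computed directly. This Python sketch is our own construction, not a standard API:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sup_error_pm1(n):
    """sup_x |P(S_n/sqrt(n) <= x) - Phi(x)| for X_i = +/-1, each w.p. 1/2.

    The CDF of S_n/sqrt(n) is a step function, so the supremum is attained
    just below or at one of its jump points x = (2k - n)/sqrt(n), where
    S_n = 2*Binomial(n, 1/2) - n.
    """
    cdf, err = 0.0, 0.0
    for k in range(n + 1):
        x = (2 * k - n) / math.sqrt(n)
        err = max(err, abs(cdf - normal_cdf(x)))  # just below the jump
        cdf += math.comb(n, k) / 2 ** n
        err = max(err, abs(cdf - normal_cdf(x)))  # at the jump
    return err

n = 100
err = sup_error_pm1(n)          # exact Kolmogorov distance at n = 100
bound = 0.4748 / math.sqrt(n)   # Berry-Esseen bound with rho = sigma = 1
```

At $n = 100$ the exact error is roughly $0.04$, safely below the bound $0.4748/\sqrt{100} \approx 0.047$; the $\pm 1$ case is close to the worst case for the constant.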


Generalizations

Lindeberg CLT (Non-Identical Distributions)

If $X_1, X_2, \ldots$ are independent (but not necessarily identically distributed) with $E[X_k] = 0$, $\operatorname{Var}(X_k) = \sigma_k^2$, $s_n^2 = \sum_{k=1}^n \sigma_k^2$, and the Lindeberg condition holds:

$$\frac{1}{s_n^2} \sum_{k=1}^n E\!\left[X_k^2 \cdot \mathbf{1}_{|X_k| > \varepsilon s_n}\right] \to 0 \quad \text{for every } \varepsilon > 0$$

then $S_n / s_n \xrightarrow{d} N(0,1)$.
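A small simulation illustrates the Lindeberg CLT. In the Python sketch below, the sinusoidally varying scales $a_k$ are an arbitrary illustrative choice; because the $X_k$ are uniformly bounded while $s_n \to \infty$, the Lindeberg condition holds:

```python
import math
import random
import statistics

rng = random.Random(2)
n = 1000
# Independent but *not* identically distributed: X_k ~ Uniform(-a_k, a_k),
# with Var(X_k) = a_k^2 / 3. The a_k stay bounded, so for large n the
# indicator 1{|X_k| > eps * s_n} is eventually zero: Lindeberg holds.
a = [1.0 + 0.5 * math.sin(k) for k in range(n)]
s_n = math.sqrt(sum(ak * ak / 3.0 for ak in a))

reps = 3000
vals = [sum(rng.uniform(-ak, ak) for ak in a) / s_n for _ in range(reps)]
# By the Lindeberg CLT, vals should look like draws from N(0, 1).
```

The normalized sums in `vals` come out with mean near 0 and standard deviation near 1, despite no two summands sharing a distribution.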

Multivariate CLT

If $\mathbf{X}_1, \mathbf{X}_2, \ldots \in \mathbb{R}^d$ are i.i.d. with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$, then:

$$\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} N(\mathbf{0}, \Sigma)$$

CLT for Dependent Variables

Under various mixing conditions, CLT-type results hold for weakly dependent sequences — essential in time series analysis and ergodic theory.


When the CLT Fails

The CLT requires finite variance. If $\operatorname{Var}(X_i) = \infty$, the theorem fails. For example:

  • Cauchy distribution: if $X_i \sim \text{Cauchy}$, then $\bar{X}_n$ is still Cauchy — no convergence to a normal.
  • Stable distributions: for heavy-tailed distributions with infinite variance, normalized sums converge to non-Gaussian stable laws.

The generalized CLT states that the only possible limits of normalized sums are the $\alpha$-stable distributions with $0 < \alpha \leq 2$ (the Gaussian corresponds to $\alpha = 2$).
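The Cauchy failure is easy to see in simulation: the sample mean of standard Cauchy draws is again standard Cauchy, so $P(|\bar{X}_n| > 1) = 1/2$ for every $n$ and averaging never concentrates. A Python sketch (the sample sizes, repetition count, and seed are arbitrary choices):

```python
import math
import random
import statistics

def cauchy_sample_mean(n, rng):
    """Sample mean of n standard Cauchy draws (inverse-CDF sampling)."""
    return statistics.fmean(
        math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)
    )

# For standard Cauchy data the sample mean is itself standard Cauchy,
# so the fraction of runs with |mean| > 1 stays near 1/2 for every n.
rng = random.Random(3)
reps = 5000
fracs = [
    sum(abs(cauchy_sample_mean(n, rng)) > 1 for _ in range(reps)) / reps
    for n in (1, 100)
]
```

Both entries of `fracs` hover around 0.5: averaging 100 Cauchy draws spreads out exactly as much as a single draw, in sharp contrast to the $\sigma/\sqrt{n}$ shrinkage the CLT gives for finite-variance data.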


Applications

Polling and Surveys

If you survey $n$ people, the sample proportion $\hat{p}$ satisfies:

$$\hat{p} \;\dot\sim\; N\!\left(p, \frac{p(1-p)}{n}\right)$$

A $95\%$ confidence interval is $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$, a direct application of the CLT.
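As a sketch (in Python; the poll numbers and the helper name are made up for illustration), this normal-approximation interval is a one-liner:

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation 95% CI for a proportion, via the CLT."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# A hypothetical poll: 520 of 1000 respondents favor a candidate.
lo, hi = proportion_ci(520, 1000)
# half-width = 1.96 * sqrt(0.52 * 0.48 / 1000), about 0.031
```

For this made-up poll the interval is roughly $(0.489,\ 0.551)$: the margin of error of "about 3 points" quoted in news polls of $n \approx 1000$ comes straight from this formula.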

Hypothesis Testing

Most standard statistical tests ($z$-test, $t$-test for large $n$) rely on the CLT to justify using normal critical values.

Finance

The Black-Scholes model assumes log-returns are normally distributed — justified by viewing daily returns as sums of many small, roughly independent shocks. (When the independence or finite-variance assumptions fail, as in financial crises, the model breaks down.)

Physics

The Maxwell-Boltzmann distribution of molecular velocities in a gas arises because each velocity component is a sum of many independent random impulses — the CLT in action.


Historical Development

  • De Moivre (1733) proved the CLT for coin flips: the binomial distribution $B(n, 1/2)$ converges to a normal.
  • Laplace (1812) extended this to general $B(n, p)$ and recognized the broader principle.
  • Chebyshev (1887) and Markov (1898) gave proofs using the method of moments.
  • Lyapunov (1901) proved the CLT under his condition ($E[|X_i|^{2+\delta}] < \infty$) using characteristic functions.
  • Lindeberg (1922) gave the definitive condition for non-identical variables.
  • Feller (1935) proved the Lindeberg condition is also necessary (in a certain sense).

Summary

$$\begin{aligned} &X_1, X_2, \ldots \text{ i.i.d., } E[X_i] = \mu, \; \operatorname{Var}(X_i) = \sigma^2 < \infty \\[8pt] &\frac{X_1 + \cdots + X_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0,1) \\[8pt] &\text{Rate: } O(1/\sqrt{n}) \text{ (Berry-Esseen)} \\[8pt] &\text{Failure: infinite variance} \implies \text{stable laws, not Gaussian} \end{aligned}$$
