Scaling Laws

Bigger is Better?

Difficulty
Intermediate
Duration
10-12 min
Prerequisites
Pre-training
Step
1/7

The Scaling Hypothesis

The scaling hypothesis is one of the most important ideas in modern AI: performance improves predictably as you increase model size, data, and compute.

This wasn't obvious. Before the scaling era, the common belief was that architectural innovations (better attention mechanisms, clever training tricks) were the primary driver of progress. The scaling hypothesis flipped this: given a good-enough architecture (the transformer), simply making it bigger yields consistent, predictable improvements.

The key observations:

  • Loss follows a power law with respect to model size, data size, and compute
  • These power laws hold over many orders of magnitude (10M to 100B+ parameters)
  • The improvements are smooth and predictable — no sudden breakthroughs or plateaus
  • This means you can predict how well a larger model will perform before training it

L(N) = (N_c / N)^α — loss as a function of parameter count N, where N_c and α are fitted constants
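To make the formula concrete, here is a minimal Python sketch that evaluates it. The constants are roughly those reported by Kaplan et al. (2020) for the parameter scaling law (N_c ≈ 8.8×10¹³, α ≈ 0.076); treat them as illustrative rather than definitive.

```python
# Evaluate the parameter scaling law L(N) = (N_c / N)^alpha.
# Constants are roughly the Kaplan et al. (2020) values for the
# parameter-limited regime; they are used here only for illustration.
def loss_from_params(n_params, n_c=8.8e13, alpha=0.076):
    """Predicted cross-entropy loss for a model with n_params parameters."""
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.2f}")
```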

This predictability is remarkable. In most engineering fields, scaling doesn't work so cleanly — you hit diminishing returns, new failure modes, or fundamental bottlenecks. For LLMs, the loss just keeps going down on a smooth curve.

The practical implication: labs can run small-scale experiments, fit the scaling curve, and extrapolate to determine whether a much larger (and much more expensive) training run is worth the investment.
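A minimal sketch of that workflow, assuming a handful of hypothetical pilot-run losses: fit the power law in log-log space (where it is a straight line), then extrapolate to a planned larger model. The pilot numbers and target size below are made up for illustration.

```python
import numpy as np

# Hypothetical losses from small pilot runs (model sizes in parameters).
pilot_params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
pilot_loss   = np.array([4.10, 3.85, 3.60, 3.38, 3.17])

# L(N) = (N_c / N)^alpha is linear in log-log space:
#   log L = alpha * log N_c - alpha * log N
slope, intercept = np.polyfit(np.log(pilot_params), np.log(pilot_loss), 1)
alpha = -slope
n_c = np.exp(intercept / alpha)

# Extrapolate to a much larger model before committing to the training run.
target_params = 7e10  # e.g. a planned 70B-parameter model
predicted_loss = (n_c / target_params) ** alpha
print(f"alpha ≈ {alpha:.3f}, N_c ≈ {n_c:.2e}")
print(f"Predicted loss at {target_params:.0e} params: {predicted_loss:.2f}")
```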

Loss vs Model Size (Log Scale)

[Chart: cross-entropy loss vs. parameters (millions, log scale), following the Kaplan et al. (2020) power law]

Rules of Thumb for Scaling

Scale Factor      | What Changes                    | Observed Effect
10x parameters    | Model capacity (width, depth)   | Loss decreases by ~0.3-0.5
10x training data | Information available to learn  | Loss decreases by ~0.2-0.4
10x compute       | Total FLOPs (params × data)     | Loss decreases by ~0.3-0.5
100x compute      | Major scale-up                  | Qualitatively new capabilities may emerge
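As a rough sanity check, the per-10x rows follow from the power law above: multiplying parameters by 10 multiplies the loss by 10^(−α), so the absolute drop depends on the exponent and on where you currently sit on the curve. A quick illustration using the same assumed α ≈ 0.076:

```python
# Relate the "10x" rules of thumb to the power law (alpha is illustrative).
alpha = 0.076
ratio_per_10x = 10 ** (-alpha)  # multiplicative loss factor per 10x parameters
print(f"Loss shrinks to ~{ratio_per_10x:.2f}x of its value per 10x parameters")

# The absolute decrease depends on the current loss level:
for current_loss in (3.0, 2.5, 2.0):
    drop = current_loss * (1 - ratio_per_10x)
    print(f"At loss {current_loss:.1f}, a 10x scale-up saves ~{drop:.2f}")
```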