activation-functions · ReLU · sigmoid · neural-networks · deep-learning

Activation Functions: ReLU vs Sigmoid vs Tanh

Activation functions add non-linearity to neural networks. Compare ReLU, Sigmoid, and Tanh — learn when to use each, their trade-offs, and why ReLU became the default.

CS Visualizations · April 15, 2026 · 7 min


Every neuron in a neural network does two things: compute a weighted sum of its inputs, then pass that sum through an activation function. Without activation functions, a neural network — no matter how deep — would just be a fancy linear regression. Activation functions are what give networks the ability to learn complex, non-linear patterns.

But which one should you use? Let's compare the three most important activation functions.

Why Activation Functions Exist

Consider a two-layer network without activation functions:

Layer 1: h = W₁ × x + b₁
Layer 2: y = W₂ × h + b₂

Substitute layer 1 into layer 2:

y = W₂ × (W₁ × x + b₁) + b₂
y = (W₂ × W₁) × x + (W₂ × b₁ + b₂)
y = W' × x + b'

It collapses to a single linear transformation. Two layers behave exactly like one. A hundred layers would behave like one. The depth is useless.

Activation functions break this linearity. By applying a non-linear function after each layer, the network can represent arbitrarily complex mappings from inputs to outputs.
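You can verify the collapse numerically. Here's a minimal NumPy sketch; the shapes are arbitrary, chosen just for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                          # input vector
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3,))
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=(2,))

# Two "layers" with no activation function...
y_two_layers = W2 @ (W1 @ x + b1) + b2

# ...are exactly one linear layer with merged weights and bias.
W_prime, b_prime = W2 @ W1, W2 @ b1 + b2
y_one_layer = W_prime @ x + b_prime

print(np.allclose(y_two_layers, y_one_layer))      # True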

Figure: Sigmoid (blue) squashes inputs to (0, 1), Tanh (green) to (-1, 1), and ReLU (red) passes positive inputs through unchanged (x-axis: input z).

Sigmoid: The Original

σ(z) = 1 / (1 + e^(-z))

Output range: (0, 1)

Sigmoid squashes any input into a value between 0 and 1, making it naturally interpretable as a probability. It was the default activation function for decades.

Strengths:

  • Output is bounded (0, 1) — good for probabilities
  • Smooth and differentiable everywhere
  • Intuitive interpretation

Weaknesses:

  • Vanishing gradients: The maximum derivative is 0.25 (at z=0). In deep networks, gradients shrink by at least 75% at every layer, making early layers nearly impossible to train
  • Not zero-centered: Outputs are always positive, which can cause zig-zagging during gradient descent
  • Computationally expensive: The exponential function is slower than simple operations

When to use: Output layer for binary classification (predicting probabilities). Rarely used in hidden layers of modern networks.
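For a feel of the numbers, here's a minimal NumPy sketch of sigmoid and its derivative; the derivative's peak of 0.25 at z = 0 is the source of the vanishing-gradient problem described above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # derivative of the sigmoid, peaks at z = 0

z = np.linspace(-10, 10, 1001)
print(sigmoid_grad(z).max())        # 0.25 -> each sigmoid layer scales gradients by at most 0.25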

Tanh: The Improved Sigmoid

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))

Output range: (-1, 1)

Tanh is a rescaled sigmoid: tanh(z) = 2σ(2z) - 1. It fixes one of sigmoid's practical problems: its outputs are zero-centered.

Strengths:

  • Zero-centered: Outputs range from -1 to 1, making optimization easier
  • Stronger gradients than sigmoid (max derivative is 1.0 vs 0.25)
  • Smooth and differentiable

Weaknesses:

  • Still suffers from vanishing gradients in very deep networks (derivative under 1 for most inputs)
  • Computationally expensive (two exponentials)
  • Saturates at extreme values (very positive or very negative inputs have near-zero gradient)

When to use: The candidate and cell-state activations inside LSTMs and GRUs (where values between -1 and 1 are needed; the gates themselves use sigmoid). Sometimes in hidden layers when zero-centered output matters. Rarely the first choice for deep feedforward networks.
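A quick numerical check of the rescaling identity and the zero-centering claim (plain NumPy, nothing framework-specific):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 1001)

# tanh is a rescaled, recentered sigmoid: tanh(z) = 2*sigmoid(2z) - 1
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))   # True

# Zero-centered: over symmetric inputs, tanh's mean output is ~0, sigmoid's is ~0.5
print(np.tanh(z).mean(), sigmoid(z).mean())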

ReLU: The Modern Default

ReLU(z) = max(0, z)

Output range: [0, ∞)

ReLU (Rectified Linear Unit) is disarmingly simple: if the input is positive, pass it through unchanged. If negative, output zero. This simplicity is its greatest strength.

Strengths:

  • No vanishing gradient for positive values — the gradient is exactly 1, allowing gradients to flow unchanged through many layers
  • Computationally trivial — just a comparison, no exponentials
  • Sparse activation — many neurons output exactly zero, which can be beneficial for regularization and efficiency
  • Empirically works great — faster convergence than sigmoid or tanh in almost all cases

Weaknesses:

  • Dead neurons: If a neuron's input is always negative (due to a large negative bias or unlucky weight initialization), its gradient is permanently zero. The neuron never updates and is effectively dead. With high learning rates, this can affect a sizable fraction of a network's neurons (figures of 10-40% are sometimes reported).
  • Not zero-centered: Outputs are always ≥ 0
  • Unbounded: Very large activations can cause numerical issues

When to use: Default choice for hidden layers in most architectures (CNNs, feedforward networks, etc.). The go-to unless you have a specific reason to use something else.
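Here's a minimal sketch of ReLU along with a toy illustration of the dead-neuron problem; the data and the large negative bias are made up purely to force the effect:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)         # 1 for positive inputs, 0 otherwise

rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 8))           # a batch of made-up inputs
w, b = rng.normal(size=8), -20.0         # large negative bias: pre-activation stays negative

z = x @ w + b
print(relu(z).max())                     # 0.0 -> the neuron never fires
print(relu_grad(z).sum())                # 0.0 -> no gradient ever reaches w or b: a "dead" neuron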

The Comparison Table

Property                  | Sigmoid                | Tanh      | ReLU
Range                     | (0, 1)                 | (-1, 1)   | [0, ∞)
Zero-centered             | No                     | Yes       | No
Max gradient              | 0.25                   | 1.0       | 1.0
Vanishing gradient        | Severe                 | Moderate  | None (for z > 0)
Dead neurons              | No                     | No        | Yes
Compute cost              | High                   | High      | Very low
Default for hidden layers | No                     | No        | Yes
Good for output layer     | Binary classification  | RNN gates | No (use softmax/sigmoid)

ReLU Variants

ReLU's dead neuron problem has spawned several variants:

Leaky ReLU

LeakyReLU(z) = z if z > 0, else 0.01 × z

Instead of outputting zero for negative inputs, it scales them by a small slope (0.01), so the gradient for negative inputs is small but never zero. This prevents dead neurons entirely.

Parametric ReLU (PReLU)

PReLU(z) = z if z > 0, else α × z

Same as Leaky ReLU, but α is a learnable parameter. The network decides how much negative signal to allow.

ELU (Exponential Linear Unit)

ELU(z) = z if z > 0, else α × (e^z - 1)

Smooth transition at z=0 and produces negative values, pushing the mean activation closer to zero. Slightly more expensive to compute.

GELU (Gaussian Error Linear Unit)

GELU(z) = z × Φ(z)  // Φ is the Gaussian CDF

Used in Transformers (BERT, GPT). Instead of a hard cutoff at zero, it smoothly gates the input based on its value. The most popular activation in modern language models.

Swish / SiLU

Swish(z) = z × sigmoid(z)

Self-gated activation that often outperforms ReLU in deep networks. Used in EfficientNet and some modern architectures.
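All of these variants are one-liners. Here's a NumPy sketch of each; note that the GELU line uses the common tanh approximation of z × Φ(z) rather than the exact Gaussian CDF, and the α defaults are conventional choices, not fixed by the definitions above:

import numpy as np

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def prelu(z, alpha):
    # alpha is learned during training; here it is simply passed in
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def gelu(z):
    # tanh approximation of z * Phi(z), widely used in Transformer implementations
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def swish(z):
    return z / (1.0 + np.exp(-z))        # z * sigmoid(z)

z = np.linspace(-3, 3, 7)
print(leaky_relu(z))
print(gelu(z))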

Choosing the Right Activation Function

Here's the decision flowchart:

Hidden layers in a feedforward network or CNN? → ReLU (or Leaky ReLU if you're worried about dead neurons)

RNN/LSTM/GRU hidden layers? → Tanh (for the hidden state) and Sigmoid (for the gates)

Transformer hidden layers? → GELU (the standard for modern transformers)

Output layer — binary classification? → Sigmoid (output is a probability)

Output layer — multi-class classification? → Softmax (outputs sum to 1)

Output layer — regression? → Linear (no activation, or ReLU if the output must be non-negative)
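As a concrete illustration (PyTorch here is just one possible framework, not something this article depends on), the flowchart might translate into model code like this:

import torch.nn as nn

# Hidden layers of a small feedforward classifier: ReLU, per the flowchart.
mlp = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),              # raw logits; pair with nn.CrossEntropyLoss (softmax inside)
)

# A Transformer-style feedforward block: GELU.
ffn = nn.Sequential(
    nn.Linear(512, 2048), nn.GELU(),
    nn.Linear(2048, 512),
)

# Binary classification head: a single logit squashed to a probability with sigmoid.
binary_head = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())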

The Impact on Training

The choice of activation function dramatically affects how well a network trains:

A 10-layer network with sigmoid activations: gradients shrink by 75% or more at each layer, so by the time they reach the first layer they have been scaled by at most 0.25¹⁰ ≈ 0.000001. The first few layers barely learn.

The same network with ReLU: gradients pass through unchanged for active neurons. Layer 1 gets the same gradient magnitude as layer 10. The entire network learns simultaneously.

This is why the switch from sigmoid to ReLU was such a breakthrough — it made deep networks (10+ layers) practically trainable for the first time.
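A toy backprop calculation makes the contrast concrete. Multiplying the per-layer activation derivatives along a single path (ignoring weight matrices, which is a simplification) shows how much gradient survives after 10 layers:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
z = rng.normal(size=10)                           # made-up pre-activations, one per layer

# Sigmoid: each layer multiplies the gradient by sigma'(z) <= 0.25.
sigmoid_factors = sigmoid(z) * (1.0 - sigmoid(z))
print(np.prod(sigmoid_factors))                   # tiny (on the order of 1e-7 or smaller)

# ReLU: each active layer multiplies the gradient by exactly 1.
relu_factors = np.ones_like(z)                    # assuming every neuron on this path is active
print(np.prod(relu_factors))                      # 1.0 -> the gradient arrives at layer 1 undiminished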


See It In Action

Our interactive visualization lets you compare activation functions side by side. See how each one transforms inputs, examine their derivatives, explore the dead neuron phenomenon with ReLU, and understand why gradient flow matters for training deep networks.

The visual difference between sigmoid's gradient squeezing and ReLU's clean gradient flow makes the theory click instantly.
