activation-functions · ReLU · sigmoid · neural-networks · deep-learning

Activation Functions: ReLU vs Sigmoid vs Tanh

Activation functions add non-linearity to neural networks. Compare ReLU, Sigmoid, and Tanh — learn when to use each, their trade-offs, and why ReLU became the default.

CS Visualizations · April 15, 2026 · 7 min


Every neuron in a neural network does two things: compute a weighted sum of its inputs, then pass that sum through an activation function. Without activation functions, a neural network — no matter how deep — would just be a fancy linear regression. Activation functions are what give networks the ability to learn complex, non-linear patterns.

But which one should you use? Let's compare the three most important activation functions.

Why Activation Functions Exist

Consider a two-layer network without activation functions:

Layer 1: h = W₁ × x + b₁
Layer 2: y = W₂ × h + b₂

Substitute layer 1 into layer 2:

y = W₂ × (W₁ × x + b₁) + b₂
y = (W₂ × W₁) × x + (W₂ × b₁ + b₂)
y = W' × x + b'

It collapses to a single linear transformation. Two layers behave exactly like one. A hundred layers would behave like one. The depth is useless.

Activation functions break this linearity. By applying a non-linear function after each layer, the network can represent arbitrarily complex mappings from inputs to outputs.
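You can verify the collapse numerically. Here's a minimal NumPy sketch; the shapes are arbitrary, chosen just for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                          # input vector
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3,))
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=(2,))

# Two "layers" with no activation function...
y_two_layers = W2 @ (W1 @ x + b1) + b2

# ...are exactly one linear layer with merged weights and bias.
W_prime, b_prime = W2 @ W1, W2 @ b1 + b2
y_one_layer = W_prime @ x + b_prime

print(np.allclose(y_two_layers, y_one_layer))      # True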

Figure: Sigmoid (blue) squashes inputs to (0, 1), Tanh (green) to (-1, 1), and ReLU (red) passes positive inputs through unchanged (x-axis: input z).

Sigmoid: The Original

σ(z) = 1 / (1 + e^(-z))

Output range: (0, 1)

Sigmoid squashes any input into a value between 0 and 1, making it naturally interpretable as a probability. It was the default activation function for decades.

Strengths:

  • Output is bounded (0, 1) — good for probabilities
  • Smooth and differentiable everywhere
  • Intuitive interpretation

Weaknesses:

  • Vanishing gradients: The maximum derivative is 0.25 (at z=0). In deep networks, gradients shrink by at least 75% at every layer, making early layers nearly impossible to train
  • Not zero-centered: Outputs are always positive, which can cause zig-zagging during gradient descent
  • Computationally expensive: The exponential function is slower than simple operations

When to use: Output layer for binary classification (predicting probabilities). Rarely used in hidden layers of modern networks.
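For a feel of the numbers, here's a minimal NumPy sketch of sigmoid and its derivative; the derivative's peak of 0.25 at z = 0 is the source of the vanishing-gradient problem described above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # derivative of the sigmoid, peaks at z = 0

z = np.linspace(-10, 10, 1001)
print(sigmoid_grad(z).max())        # 0.25 -> each sigmoid layer scales gradients by at most 0.25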

Tanh: The Improved Sigmoid

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))

Output range: (-1, 1)

Tanh is a rescaled sigmoid: tanh(z) = 2σ(2z) - 1. It fixes one of sigmoid's practical problems: its outputs are zero-centered.

Strengths:

  • Zero-centered: Outputs range from -1 to 1, making optimization easier
  • Stronger gradients than sigmoid (max derivative is 1.0 vs 0.25)
  • Smooth and differentiable

Weaknesses:

  • Still suffers from vanishing gradients in very deep networks (derivative under 1 for most inputs)
  • Computationally expensive (two exponentials)
  • Saturates at extreme values (very positive or very negative inputs have near-zero gradient)

When to use: The candidate and cell-state activations inside LSTMs and GRUs (where values between -1 and 1 are needed; the gates themselves use sigmoid). Sometimes in hidden layers when zero-centered output matters. Rarely the first choice for deep feedforward networks.
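A quick numerical check of the rescaling identity and the zero-centering claim (plain NumPy, nothing framework-specific):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 1001)

# tanh is a rescaled, recentered sigmoid: tanh(z) = 2*sigmoid(2z) - 1
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))   # True

# Zero-centered: over symmetric inputs, tanh's mean output is ~0, sigmoid's is ~0.5
print(np.tanh(z).mean(), sigmoid(z).mean())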

ReLU: The Modern Default

ReLU(z) = max(0, z)

Output range: [0, ∞)

ReLU (Rectified Linear Unit) is disarmingly simple: if the input is positive, pass it through unchanged. If negative, output zero. This simplicity is its greatest strength.

Strengths:

  • No vanishing gradient for positive values — the gradient is exactly 1, allowing gradients to flow unchanged through many layers
  • Computationally trivial — just a comparison, no exponentials
  • Sparse activation — many neurons output exactly zero, which can be beneficial for regularization and efficiency
  • Empirically works great — faster convergence than sigmoid or tanh in almost all cases

Weaknesses:

  • Dead neurons: If a neuron's input is always negative (due to a large negative bias or unlucky weight initialization), its gradient is permanently zero. The neuron never updates and is effectively dead. With high learning rates, this can affect a sizable fraction of a network's neurons (figures of 10-40% are sometimes reported).
  • Not zero-centered: Outputs are always ≥ 0
  • Unbounded: Very large activations can cause numerical issues

When to use: Default choice for hidden layers in most architectures (CNNs, feedforward networks, etc.). The go-to unless you have a specific reason to use something else.
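Here's a minimal sketch of ReLU along with a toy illustration of the dead-neuron problem; the data and the large negative bias are made up purely to force the effect:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)         # 1 for positive inputs, 0 otherwise

rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 8))           # a batch of made-up inputs
w, b = rng.normal(size=8), -20.0         # large negative bias: pre-activation stays negative

z = x @ w + b
print(relu(z).max())                     # 0.0 -> the neuron never fires
print(relu_grad(z).sum())                # 0.0 -> no gradient ever reaches w or b: a "dead" neuron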

The Comparison Table

Property                  | Sigmoid                | Tanh      | ReLU
Range                     | (0, 1)                 | (-1, 1)   | [0, ∞)
Zero-centered             | No                     | Yes       | No
Max gradient              | 0.25                   | 1.0       | 1.0
Vanishing gradient        | Severe                 | Moderate  | None (for z > 0)
Dead neurons              | No                     | No        | Yes
Compute cost              | High                   | High      | Very low
Default for hidden layers | No                     | No        | Yes
Good for output layer     | Binary classification  | RNN gates | No (use softmax/sigmoid)

ReLU Variants

ReLU's dead neuron problem has spawned several variants:

Leaky ReLU

LeakyReLU(z) = z if z > 0, else 0.01 × z

Instead of outputting zero for negative inputs, it scales them by a small slope (0.01), so the gradient for negative inputs is small but never zero. This prevents dead neurons entirely.

Parametric ReLU (PReLU)

PReLU(z) = z if z > 0, else α × z

Same as Leaky ReLU, but α is a learnable parameter. The network decides how much negative signal to allow.

ELU (Exponential Linear Unit)

ELU(z) = z if z > 0, else α × (e^z - 1)

Smooth transition at z=0 and produces negative values, pushing the mean activation closer to zero. Slightly more expensive to compute.

GELU (Gaussian Error Linear Unit)

GELU(z) = z × Φ(z)  // Φ is the Gaussian CDF

Used in Transformers (BERT, GPT). Instead of a hard cutoff at zero, it smoothly gates the input based on its value. The most popular activation in modern language models.

Swish / SiLU

Swish(z) = z × sigmoid(z)

Self-gated activation that often outperforms ReLU in deep networks. Used in EfficientNet and some modern architectures.
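All of these variants are one-liners. Here's a NumPy sketch of each; note that the GELU line uses the common tanh approximation of z × Φ(z) rather than the exact Gaussian CDF, and the α defaults are conventional choices, not fixed by the definitions above:

import numpy as np

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def prelu(z, alpha):
    # alpha is learned during training; here it is simply passed in
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def gelu(z):
    # tanh approximation of z * Phi(z), widely used in Transformer implementations
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def swish(z):
    return z / (1.0 + np.exp(-z))        # z * sigmoid(z)

z = np.linspace(-3, 3, 7)
print(leaky_relu(z))
print(gelu(z))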

Choosing the Right Activation Function

Here's the decision flowchart:

Hidden layers in a feedforward network or CNN? → ReLU (or Leaky ReLU if you're worried about dead neurons)

RNN/LSTM/GRU hidden layers? → Tanh (for the hidden state) and Sigmoid (for the gates)

Transformer hidden layers? → GELU (the standard for modern transformers)

Output layer — binary classification? → Sigmoid (output is a probability)

Output layer — multi-class classification? → Softmax (outputs sum to 1)

Output layer — regression? → Linear (no activation, or ReLU if the output must be non-negative)
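As a concrete illustration (PyTorch here is just one possible framework, not something this article depends on), the flowchart might translate into model code like this:

import torch.nn as nn

# Hidden layers of a small feedforward classifier: ReLU, per the flowchart.
mlp = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),              # raw logits; pair with nn.CrossEntropyLoss (softmax inside)
)

# A Transformer-style feedforward block: GELU.
ffn = nn.Sequential(
    nn.Linear(512, 2048), nn.GELU(),
    nn.Linear(2048, 512),
)

# Binary classification head: a single logit squashed to a probability with sigmoid.
binary_head = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())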

The Impact on Training

The choice of activation function dramatically affects how well a network trains:

A 10-layer network with sigmoid activations: gradients shrink by 75% or more at each layer, so by the time they reach the first layer they have been scaled by at most 0.25¹⁰ ≈ 0.000001. The first few layers barely learn.

The same network with ReLU: gradients pass through unchanged for active neurons. Layer 1 gets the same gradient magnitude as layer 10. The entire network learns simultaneously.

This is why the switch from sigmoid to ReLU was such a breakthrough — it made deep networks (10+ layers) practically trainable for the first time.
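A toy backprop calculation makes the contrast concrete. Multiplying the per-layer activation derivatives along a single path (ignoring weight matrices, which is a simplification) shows how much gradient survives after 10 layers:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
z = rng.normal(size=10)                           # made-up pre-activations, one per layer

# Sigmoid: each layer multiplies the gradient by sigma'(z) <= 0.25.
sigmoid_factors = sigmoid(z) * (1.0 - sigmoid(z))
print(np.prod(sigmoid_factors))                   # tiny (on the order of 1e-7 or smaller)

# ReLU: each active layer multiplies the gradient by exactly 1.
relu_factors = np.ones_like(z)                    # assuming every neuron on this path is active
print(np.prod(relu_factors))                      # 1.0 -> the gradient arrives at layer 1 undiminished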


See It In Action

Our interactive visualization lets you compare activation functions side by side. See how each one transforms inputs, examine their derivatives, explore the dead neuron phenomenon with ReLU, and understand why gradient flow matters for training deep networks.

The visual difference between sigmoid's gradient squeezing and ReLU's clean gradient flow makes the theory click instantly.
