Tags: LSTM, RNN, deep-learning, sequential-data, AI

LSTM Networks Explained: How AI Remembers

LSTM networks solve the vanishing gradient problem with gates that control memory. Learn how forget, input, and output gates work — with clear examples and interactive demos.

CS Visualizations · April 22, 2026 · 8 min

Interactive Visualization

Recurrent Neural Networks (RNNs)

See this concept in action with our step-by-step interactive visualization.

Try the Visualization

Regular neural networks have no memory. They process each input independently, with no awareness of what came before. That's fine for classifying images, but useless for understanding language, predicting stock prices, or generating music — tasks where context matters.

Recurrent Neural Networks (RNNs) were designed to fix this by maintaining a hidden state — a form of memory. But vanilla RNNs have a fatal flaw: they can't remember things for very long. Enter Long Short-Term Memory networks.

The Problem: Vanishing Gradients

[Figure: gradient magnitude at each time step, shrinking ×0.25 per step — t=1: 1.000, t=2: 0.250, t=3: 0.063, t=4: 0.016, t=5: 0.004, t=6: 0.001]
Vanishing gradients: the gradient shrinks exponentially at each time step, making early inputs invisible.

Imagine reading this sentence: "The author, who grew up in France and studied at the Sorbonne before moving to London where she worked for a decade, speaks fluent ___."

You need to remember "France" from the beginning to predict "French" at the end. That's a long-range dependency — the relevant information is far from where it's needed.

Vanilla RNNs struggle with this because during training, gradients (the learning signals) must flow backward through every time step. At each step, the gradient is multiplied by a weight matrix. If those weights are small, the gradient shrinks exponentially:

  • After 5 steps: gradient × 0.25⁵ = 0.001
  • After 10 steps: gradient × 0.25¹⁰ = 0.000001
  • After 20 steps: effectively zero

The network simply cannot learn from information that's more than a few steps back. The gradient vanishes before it arrives.
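The repeated multiplication above is simple enough to verify directly. This short sketch (the 0.25 decay factor is the same illustrative value used in the bullets, not a property of any particular network) reproduces those numbers:

```python
# Illustrative sketch: how a gradient shrinks when it is multiplied by a
# factor of 0.25 at every backward step (values match the bullets above).
def gradient_after_steps(factor: float, steps: int) -> float:
    grad = 1.0
    for _ in range(steps):
        grad *= factor  # one multiplication per time step
    return grad

print(gradient_after_steps(0.25, 5))   # ~0.000977
print(gradient_after_steps(0.25, 10))  # ~9.5e-07
print(gradient_after_steps(0.25, 20))  # effectively zero (~9e-13)
```

With a factor even slightly below 1, the decay is exponential, which is why depth in time is so much more punishing than depth in layers.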

The LSTM Solution: Gated Memory

LSTM networks, introduced by Hochreiter and Schmidhuber in 1997, solve this with an elegant mechanism: instead of one hidden state, they maintain two:

  • Hidden state (h): The short-term working memory — what the network is currently "thinking about"
  • Cell state (c): The long-term memory — a highway that information can flow along with minimal interference

The cell state is the key innovation. Information can travel along it for many time steps with the gradient barely diminishing — because the operations on it are carefully controlled by gates.

The Three Gates

[Diagram: the cell state (long-term memory) flows through three gates — Forget (σ → [0,1]), Input (σ × tanh), and Output (σ × tanh(c)) — which produce the hidden state (short-term / output)]
An LSTM cell has three gates controlling information flow: forget, input, and output.

An LSTM has three gates, each a neural network layer with sigmoid activation (outputting values between 0 and 1):

1. Forget Gate: "What should I erase?"

The forget gate looks at the current input and previous hidden state, then decides which parts of the cell state to keep and which to erase.

f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f)
  • Output of 1.0 = keep this information completely
  • Output of 0.0 = erase this information completely

Analogy: You're reading a book and a new chapter starts with a different character. The forget gate says "forget the previous character's details, we're following someone new now."
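The forget-gate formula above can be sketched in a few lines of NumPy. The weight values and dimensions here are arbitrary illustrations, not learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative forget gate: f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f).
# Sizes (hidden=4, inputs=3) and random weights are arbitrary for the sketch.
rng = np.random.default_rng(0)
hidden, inputs = 4, 3
W_f = rng.standard_normal((hidden, hidden + inputs))
b_f = np.zeros(hidden)

h_prev = rng.standard_normal(hidden)   # previous hidden state
x_t = rng.standard_normal(inputs)      # current input

f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)  # each entry lies in (0, 1): near 1 = keep, near 0 = erase
```

Note that the gate produces one value per cell-state dimension, so the network can keep some memories while erasing others in the same step.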

2. Input Gate: "What new information should I store?"

The input gate has two parts:

  • A sigmoid layer decides which values to update
  • A tanh layer creates candidate values to add
i_t = sigmoid(W_i · [h_{t-1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)

The new cell state combines forgetting and remembering:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

Analogy: You encounter an important plot point. The input gate says "this is important, store it in long-term memory" and writes it into the cell state.
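The cell-state update c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t is just elementwise arithmetic. This sketch uses hand-picked gate values (rather than learned weights) to show the keep/erase/store behavior per dimension:

```python
import numpy as np

# Illustrative cell-state update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t,
# with hand-picked gate values instead of learned weights.
c_prev = np.array([0.8, -0.5, 0.3])
f_t    = np.array([1.0,  0.0, 0.5])   # keep, erase, half-keep
i_t    = np.array([0.0,  1.0, 0.5])   # ignore, store, half-store
c_cand = np.array([0.9,  0.7, -0.2])  # candidate values (from tanh)

c_t = f_t * c_prev + i_t * c_cand     # ⊙ is elementwise multiplication
print(c_t)  # [0.8, 0.7, 0.05]
```

Dimension 1 keeps the old memory untouched, dimension 2 overwrites it with the candidate, and dimension 3 blends the two.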

3. Output Gate: "What should I reveal right now?"

The output gate decides which parts of the cell state to expose as the hidden state (the network's current output).

o_t = sigmoid(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)

Analogy: You know many facts about the story, but right now someone asks "what just happened?" The output gate selects the relevant recent events to share, not everything you remember.
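Putting all three gates together, one full forward step looks like the following sketch. Stacking the four weight sets in a single array is an illustrative layout chosen here for brevity, not a library API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM forward step. W and b hold the four weight sets stacked
    as rows [forget, input, candidate, output] — an illustrative layout."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W[0] @ z + b[0])        # forget gate
    i = sigmoid(W[1] @ z + b[1])        # input gate
    c_cand = np.tanh(W[2] @ z + b[2])   # candidate values
    o = sigmoid(W[3] @ z + b[3])        # output gate
    c_t = f * c_prev + i * c_cand       # new long-term memory
    h_t = o * np.tanh(c_t)              # new short-term state / output
    return h_t, c_t

# Run a 5-step sequence with arbitrary small random weights.
hidden, inputs = 4, 3
rng = np.random.default_rng(1)
W = rng.standard_normal((4, hidden, hidden + inputs)) * 0.1
b = np.zeros((4, hidden))
h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.standard_normal((5, inputs)):
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Note how the hidden state is bounded (it passes through tanh and a sigmoid gate), while the cell state is only ever scaled and added to — that unbounded additive path is the "highway" discussed next.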

Why This Solves Vanishing Gradients

The cell state update is:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

When the forget gate is close to 1 and the input gate is close to 0, the cell state passes through almost unchanged: c_t ≈ c_{t-1}. This means the gradient can flow backward through time along the cell state with minimal decay.

Compare this to vanilla RNNs where the gradient passes through a tanh activation at every step (maximum derivative 1.0, typically much less). The LSTM's cell state creates a gradient highway — a path where information and gradients can travel long distances.
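A toy calculation makes the difference concrete. The 0.5 tanh-derivative and 0.99 forget-gate values below are plausible illustrative numbers, not measurements from a trained network:

```python
# Toy comparison: how much gradient survives 50 backward steps.
# Vanilla RNN: multiplied by a tanh derivative (max 1.0, typically less).
# LSTM cell state: multiplied by the forget gate, which can sit near 1.
steps = 50
tanh_deriv = 0.5     # a typical tanh'(z) value away from zero
forget_gate = 0.99   # a forget gate that has learned to "keep"

rnn_grad = tanh_deriv ** steps
lstm_grad = forget_gate ** steps
print(f"vanilla RNN: {rnn_grad:.2e}")   # ~8.9e-16, vanished
print(f"LSTM cell:   {lstm_grad:.2f}")  # ~0.61, still usable
```

After 50 steps the vanilla RNN's signal is below floating-point noise, while the LSTM's cell-state path retains over half of it.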

LSTM in Action: Predicting Text

Let's trace an LSTM processing "The cat sat on the ___":

| Time Step | Input | Forget Gate | Input Gate | Cell State | Output |
|-----------|-------|-------------|------------|------------|--------|
| t=1 | "The" | Keep all (new sequence) | Store: article detected | [article context] | Low confidence |
| t=2 | "cat" | Keep article info | Store: subject is "cat" | [article, cat subject] | "cat" features |
| t=3 | "sat" | Keep subject | Store: action is sitting | [cat, sitting] | "sat" features |
| t=4 | "on" | Keep subject+action | Store: preposition follows | [cat, sitting, prep] | Expecting location |
| t=5 | "the" | Keep all context | Store: another article | [cat, sitting, prep, article] | High confidence: noun next |

At t=5, the LSTM "knows" it needs a location noun because it remembers the subject (cat), the action (sat), and the preposition (on). This long-range context is exactly what vanilla RNNs lose.

GRU: The Simpler Alternative

The Gated Recurrent Unit (GRU), introduced in 2014, simplifies the LSTM by combining the forget and input gates into a single update gate, and merging the cell state and hidden state:

| Aspect | LSTM | GRU |
|--------|------|-----|
| Gates | 3 (forget, input, output) | 2 (update, reset) |
| States | 2 (hidden + cell) | 1 (hidden only) |
| Parameters | More | ~25% fewer |
| Training speed | Slower | Faster |
| Long sequences | Slightly better | Comparable |

GRUs perform comparably to LSTMs on most tasks while being faster to train. Use LSTMs when you need maximum memory capacity; use GRUs when training speed matters.
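The "~25% fewer parameters" figure follows directly from the gate counts: an LSTM learns four weight sets (forget, input, candidate, output) while a GRU learns three (update, reset, candidate). A rough sketch, counting one weight matrix and bias per set:

```python
# Rough per-layer parameter counts: each weight set maps the concatenated
# [hidden, input] vector to a hidden-sized output, plus a bias.
def lstm_params(hidden: int, inputs: int) -> int:
    return 4 * (hidden * (hidden + inputs) + hidden)  # 4 weight sets

def gru_params(hidden: int, inputs: int) -> int:
    return 3 * (hidden * (hidden + inputs) + hidden)  # 3 weight sets

h, x = 256, 128
print(lstm_params(h, x))                          # 394240
print(gru_params(h, x))                           # 295680
print(1 - gru_params(h, x) / lstm_params(h, x))   # 0.25 → 25% fewer
```

The ratio is exactly 3/4 regardless of layer size, which is where the 25% figure comes from.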

Real-World Applications

LSTMs power (or powered) many of the AI systems you use daily:

  • Machine translation: Google Translate used LSTM-based seq2seq models before switching to Transformers
  • Speech recognition: Siri, Alexa, and Google Assistant used LSTM layers for converting speech to text
  • Text generation: Early language models were LSTM-based
  • Music composition: LSTMs can learn musical patterns and generate new compositions
  • Anomaly detection: Monitoring server logs, financial transactions, or sensor data for unusual patterns
  • Time series forecasting: Stock prices, weather, energy demand

LSTMs vs Transformers

Since 2017, Transformers have largely replaced LSTMs for natural language processing. But LSTMs still have advantages:

LSTMs win when:

  • Processing streaming data in real-time (one element at a time)
  • Memory is limited (Transformers need O(n²) memory for attention)
  • The sequence is very long and you don't need global attention
  • You're working with time series or sensor data

Transformers win when:

  • You need to capture relationships between distant tokens
  • Parallel training is important (LSTMs are inherently sequential)
  • You have enough data and compute
  • The task is NLP (Transformers dominate modern language benchmarks)


See It In Action

Understanding LSTMs requires seeing how gates open and close as data flows through. Our interactive RNN visualization lets you:

  • Watch hidden states evolve as the network processes a sequence
  • See how the gates control information flow at each time step
  • Compare vanilla RNN hidden states with LSTM cell states
  • Understand why gradients vanish in regular RNNs but not in LSTMs

Step through the process and build intuition for how gated memory works.
