Tags: CNN, convolutional-neural-networks, deep-learning, computer-vision, AI

What are CNNs? Convolutional Neural Networks Explained

CNNs use filters to detect visual patterns in images — from edges to faces. Learn how convolution, feature maps, and pooling work with clear examples and interactive demos.

CS Visualizations · April 19, 2026 · 8 min

Interactive Visualization

CNN Interactive Visualization

See this concept in action with our step-by-step interactive visualization.

Try the Visualization

Every time you unlock your phone with your face, search Google Photos for "beach," or see a self-driving car navigate traffic, a Convolutional Neural Network (CNN) is doing the heavy lifting. CNNs are the architecture that gave machines the ability to see.

But how do they actually work?

Why Regular Neural Networks Fail at Images

A regular (fully connected) neural network treats every input independently. For a 28×28 grayscale image, that's 784 inputs — each connected to every neuron in the first layer.

This has three major problems:

Too many parameters. A 224×224 color image has 150,528 pixels. With 256 neurons in the first layer, that's over 38 million connections — just in one layer (the quick calculation below makes this concrete). The network would be massive and slow.

No spatial understanding. A fully connected network doesn't know that pixel [10,10] is next to pixel [10,11]. It treats a pixel in the top-left corner the same as one in the bottom-right. All spatial information is lost.

Not translation invariant. If the network learns to recognize a cat in the center of an image, it can't recognize the same cat in the corner — because it learned specific pixel positions, not visual patterns.

CNNs solve all three problems.
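To put the first problem in numbers, here's a quick back-of-the-envelope check in Python. The 256-filter convolutional layer is an illustrative comparison (not from any specific network); it is cheap because each small filter is reused at every position in the image:

# First-layer weight counts, ignoring biases
pixels = 224 * 224 * 3             # 150,528 input values
fc_weights = pixels * 256          # fully connected: one weight per input, per neuron
conv_weights = 3 * 3 * 3 * 256     # 256 filters of size 3x3x3, shared across positions
print(f"fully connected: {fc_weights:,}")    # 38,535,168
print(f"convolutional:   {conv_weights:,}")  # 6,912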

The Key Insight: Local Patterns

Images have a special property: nearby pixels are related. An edge, a corner, a texture — these are all local patterns defined by neighboring pixels. You don't need to look at the entire image to detect an edge. You just need to look at a small region.

CNNs exploit this by using filters (also called kernels) — small matrices that slide across the image, detecting patterns one local region at a time.

How Convolution Works

Input patch          Filter
[1  2  0]        [-1  0  1]
[0  1  3]   ×    [-2  0  2]   =   4
[2  3  1]        [-1  0  1]
Convolution: multiply the filter by the input region, sum all products → one output value.

The convolution operation is beautifully simple:

  1. Take a small filter (e.g., 3×3 pixels)
  2. Place it on the top-left of the image
  3. Multiply each filter value by the corresponding pixel value
  4. Sum all the products — that's one output value
  5. Slide the filter one pixel to the right and repeat
  6. When you reach the right edge, move down and start from the left

The result is a feature map — a new "image" that highlights wherever the filter's pattern was detected.
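Here's a minimal NumPy sketch of those six steps: a stride-1, no-padding convolution that produces one feature map. The function name is ours; any real framework provides this as a built-in.

import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` across `image` (stride 1, no padding).
    Each output value is the sum of products over one patch."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):        # move down the image
        for j in range(out.shape[1]):    # slide left to right
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # multiply, then sum
    return out

# The patch and filter from the figure above: the lone output value is 4
patch = np.array([[1, 2, 0],
                  [0, 1, 3],
                  [2, 3, 1]])
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])
print(conv2d(patch, kernel))   # [[4.]]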

Concrete Example

Consider a 3×3 vertical edge detection filter:

[-1  0  1]
[-2  0  2]
[-1  0  1]

The negative values on the left and positive values on the right create a "difference detector." When this filter slides over a vertical edge (dark on left, light on right), the output is large. Over a uniform region, the output is near zero.
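You can check this behavior on a tiny synthetic image, dark on the left and light on the right. The sketch below uses SciPy's correlate2d, because what deep learning frameworks call "convolution" is, strictly speaking, cross-correlation (the filter is not flipped):

import numpy as np
from scipy.signal import correlate2d

image = np.zeros((5, 5))
image[:, 3:] = 9        # columns 3-4 are bright: a vertical edge

kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

print(correlate2d(image, kernel, mode="valid"))
# [[ 0. 36. 36.]
#  [ 0. 36. 36.]
#  [ 0. 36. 36.]]
# Large values where windows straddle the edge, zero on the flat region.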

The network doesn't use hand-crafted filters like this — it learns which filters to use during training. Early layers might learn edge detectors, while deeper layers learn to detect increasingly complex patterns.

The Feature Hierarchy

This is where CNNs get really interesting. Each layer builds on what the previous layer detected:

Layer 1 (Simple Features): Edges, color gradients, simple textures. These are the building blocks of all visual patterns.

Layer 2 (Combinations): Edges combine into corners, curves, and more complex textures. A horizontal edge plus a vertical edge might activate a "corner detector."

Layer 3+ (Complex Features): Corners and curves combine into parts of objects — eyes, wheels, windows, fur patterns.

Final Layers: Object-level features — "this looks like a face," "this looks like a car."

This hierarchical feature learning is what makes CNNs so powerful. They automatically discover the right representation for the task, from low-level pixels to high-level concepts.

Pooling: Reducing Dimensions

After convolution, pooling layers shrink the feature maps by summarizing local regions. The most common type is max pooling:

  • Slide a 2×2 window across the feature map
  • Keep only the maximum value in each window
  • The feature map shrinks by half in each dimension

Why pooling matters:

  • Reduces computation — smaller feature maps mean fewer operations
  • Translation invariance — if a feature shifts by a pixel, the max is likely the same
  • Controls overfitting — fewer parameters to memorize

A 32×32 feature map becomes 16×16 after one pooling layer, and 8×8 after two. The spatial dimensions shrink while the number of learned features (channels) grows.
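A minimal NumPy sketch of 2×2 max pooling, assuming the feature map has even side lengths:

import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2: keep the largest value per window."""
    h, w = fmap.shape
    # Regroup the map into 2x2 blocks, then take the max of each block
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 1, 8, 6],
                 [2, 3, 4, 7]])
print(max_pool_2x2(fmap))
# [[4 5]
#  [3 8]]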

Stride and Padding

Two parameters control the convolution operation:

Stride — how far the filter moves at each step. Stride 1 moves one pixel at a time (most detail). Stride 2 skips every other position (halves the output size).

Padding — adding zeros around the image border. Without padding, a 3×3 filter on a 5×5 image produces a 3×3 output (it shrinks). With "same" padding, the output stays the same size as the input.

The output size formula (the division rounds down when the filter doesn't fit evenly):

output = (input + 2 × padding - filter_size) / stride + 1
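The same formula in Python, with integer division doing the rounding:

def conv_output_size(input_size, filter_size, padding=0, stride=1):
    return (input_size + 2 * padding - filter_size) // stride + 1

print(conv_output_size(5, 3))                         # 3: a 3x3 filter shrinks a 5x5 input
print(conv_output_size(5, 3, padding=1))              # 5: "same" padding preserves the size
print(conv_output_size(224, 3, padding=1, stride=2))  # 112: stride 2 halves the output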

The Complete CNN Architecture

Input (32×32×3) → Conv → 32×32×16 → Pool → 16×16×16 → Conv → 16×16×32 → Pool → 8×8×32 → Dense → 10
A CNN pipeline: convolution extracts features, pooling reduces size, dense layers classify.

A typical CNN for image classification chains these operations:

Input Image (224×224×3)
  → Conv + ReLU (32 filters) → Pool → Feature maps (112×112×32)
  → Conv + ReLU (64 filters) → Pool → Feature maps (56×56×64)
  → Conv + ReLU (128 filters) → Pool → Feature maps (28×28×128)
  → Flatten → Dense layer (256 neurons) → Output (10 classes)

The convolutional layers extract features. The pooling layers reduce dimensions. The final dense layers combine all features to make a classification decision.
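For reference, here's a sketch of that pipeline in PyTorch. It assumes 3×3 convolutions with "same" padding, which the diagram above leaves unspecified:

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 224 -> 112
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 112 -> 56
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 56 -> 28
    nn.Flatten(),                                  # 28 x 28 x 128 = 100,352 features
    nn.Linear(128 * 28 * 28, 256), nn.ReLU(),
    nn.Linear(256, 10),                            # one score per class
)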

Famous CNN Architectures

CNNs have a rich history of increasingly clever architectures:

LeNet-5 (1998): The original CNN, designed by Yann LeCun for handwritten digit recognition. Just seven layers, but it proved that CNNs work.

AlexNet (2012): Won the ImageNet competition by a massive margin, sparking the deep learning revolution. Used ReLU activation and dropout — techniques that are now standard.

VGGNet (2014): Showed that deeper networks (16-19 layers) with small 3×3 filters outperform shallower networks with larger filters.

ResNet (2015): Introduced skip connections, enabling networks with 50, 101, or even 152 layers. Solved the degradation problem where deeper networks performed worse than shallower ones.

EfficientNet (2019): Systematically balanced network depth, width, and resolution for optimal performance per computational budget.

CNNs Beyond Image Classification

While CNNs were invented for images, they're used for any grid-like data:

  • Object detection: YOLO and SSD detect and locate multiple objects in a single image
  • Semantic segmentation: Label every pixel in an image (autonomous driving)
  • Medical imaging: Detect tumors, analyze X-rays, read pathology slides
  • Natural language processing: 1D convolutions over text sequences
  • Audio processing: Spectrograms are 2D images — CNNs work great on them
  • Video analysis: 3D convolutions across space and time

See It In Action

The convolution operation is much easier to understand when you can watch a filter slide across an image and see the feature map emerge. Our interactive CNN visualization lets you:

  • See a 3×3 filter applied to a real input matrix
  • Watch feature maps highlight detected patterns
  • Step through max pooling as it shrinks dimensions
  • Trace data through a complete CNN pipeline

There's no substitute for seeing convolution happen in real time. Try it and the concept clicks immediately.
