Day 132: Building CNNs in PyTorch

Building a CNN in PyTorch matters because it forces you to turn architectural intuition into concrete tensors, shapes, and modules that actually run.


Today's "Aha!" Moment

Up to now, the CNN block has been mostly architectural: convolutions detect local patterns, pooling reduces spatial size, deeper stages build richer features, and classic models show different ways to combine those ideas.

PyTorch is where that architectural story becomes executable. A CNN in code is not mysterious. It is still a sequence of blocks that transform a tensor from "image-shaped input" to "compact feature representation" to "class logits." The main new difficulty is not the concept itself. It is learning to keep track of shapes, channels, and boundaries between the feature extractor and the classifier head.

That is why building a CNN is such a good exercise. It teaches you that architecture is not just a diagram on a slide. Every design choice becomes a concrete tensor transformation with a real cost and a real output shape.

That is the aha. A CNN in PyTorch is simply the architectural logic you already know, made explicit in modules and tensor dimensions.


Why This Matters

Suppose the warehouse team now wants a first real vision model in PyTorch for damaged-package detection. At this point, there are two equally bad extremes.

One extreme is to stay too abstract: talk about convolutions, pooling, and feature hierarchies without ever writing a model that can actually train. The other is to copy a code sample blindly and hope the shapes line up.

This lesson sits in the middle. It shows how to build a small CNN from first principles in a way that stays readable. That matters because most practical work with deep learning is not inventing a new architecture from scratch. It is reading, modifying, debugging, and adapting existing model code without losing the conceptual picture.


Learning Objectives

By the end of this session, you will be able to:

  1. Translate a CNN diagram into PyTorch modules - Build convolutional blocks, downsampling, and a classifier head.
  2. Track tensor shapes through the network - Reason about channels, height, width, and flattening without guesswork.
  3. Recognize the implementation mistakes that break beginner CNNs - Especially shape mismatches and incorrect train/eval assumptions.

Core Concepts Explained

Concept 1: A Small CNN Is Usually "Feature Extractor + Classifier Head"

The cleanest way to read a CNN in code is to split it into two parts.

The first part is the feature extractor. This is where convolutional blocks and pooling gradually transform the image into richer but spatially smaller feature maps.

The second part is the classifier head. This takes the final feature representation and turns it into class logits.

image
  -> conv block
  -> conv block
  -> downsample
  -> conv block
  -> downsample
  -> compact feature map
  -> flatten or global pooling
  -> linear layer(s)
  -> logits

That split is useful because it matches how you think about the network conceptually. The early part answers "what visual features are present?" The final part answers "given those features, which class is most plausible?"

In PyTorch, that often looks like this:

import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # Feature extractor: two conv blocks, each followed by a
        # 2x2 max pool that halves the spatial size.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Classifier head. The 32 * 16 * 16 input size assumes
        # 64 x 64 input images (64 -> 32 -> 16 after two pools).
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

This code is not doing anything beyond the ideas from the previous lessons. It is just packaging them in a reusable structure.
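One habit worth adopting immediately is a shape smoke test: run a single dummy batch through the model before writing any training code. The sketch below repeats the model definition so it runs standalone; the batch size and image resolution are just the example values used throughout this lesson.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64),  # assumes 64 x 64 inputs
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN(num_classes=2)
dummy = torch.randn(32, 3, 64, 64)  # one fake batch: (B, C, H, W)
logits = model(dummy)
print(logits.shape)  # torch.Size([32, 2]): one logit vector per image
```

If the linear layer's expected input size is wrong, this one-line test fails immediately, long before a real training run would.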

Concept 2: Most Beginner CNN Bugs Are Really Shape Bugs

The hardest part of implementing a CNN for the first time is usually not convolution itself. It is keeping track of tensor dimensions as they change.

For images in PyTorch, the standard layout is:

(batch, channels, height, width)

So an RGB batch of 64 x 64 images might start as:

(32, 3, 64, 64)

After one Conv2d(3, 16, kernel_size=3, padding=1), the channels become 16 and the spatial size stays 64 x 64 because padding preserves it:

(32, 16, 64, 64)

After MaxPool2d(2), height and width are halved:

(32, 16, 32, 32)

This is why it helps to read a CNN as a shape pipeline:

(B, 3, 64, 64)
-> conv -> (B, 16, 64, 64)
-> pool -> (B, 16, 32, 32)
-> conv -> (B, 32, 32, 32)
-> pool -> (B, 32, 16, 16)
-> flatten -> (B, 8192)
-> linear -> logits

If you lose track of this pipeline, your classifier head is usually the first thing to break. That is why experienced practitioners often print shapes during the first forward pass, or use modules like nn.AdaptiveAvgPool2d to make the final transition easier to control.
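The shape pipeline above is easy to verify directly: run one forward pass and print the shape after every layer. A minimal sketch, using the same layer sequence as the model in Concept 1:

```python
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

x = torch.randn(32, 3, 64, 64)  # (B, C, H, W)
for layer in features:
    x = layer(x)
    # Print each layer's name next to the shape it produces.
    print(f"{layer.__class__.__name__:10s} -> {tuple(x.shape)}")

flat = x.flatten(1)  # keep the batch dimension, flatten the rest
print("Flatten    ->", tuple(flat.shape))  # (32, 8192), i.e. 32 * 16 * 16
```

This is a throwaway debugging script, not production code, but it turns "I think the shapes line up" into something you can read off the screen.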

Concept 3: The Implementation Details Still Reflect Architectural Trade-offs

Once you can build a small CNN, the next useful question is not "can it run?" but "what architectural decisions are now visible in code?"

Some examples:

  • The channel counts passed to each Conv2d decide how much capacity each stage has.
  • Each pooling layer trades spatial resolution for a cheaper, more abstract representation.
  • Flattening versus global pooling decides how many parameters the classifier head needs.

That last point is especially important. Instead of flattening a large spatial tensor, some models use adaptive pooling to compress the spatial dimensions before the final linear layer:

self.head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),  # (B, 32, H, W) -> (B, 32, 1, 1)
    nn.Flatten(),                  # -> (B, 32)
    nn.Linear(32, num_classes),    # 32 matches the final channel count
)

This is a nice example of architecture and implementation meeting each other. The code change is small, but the design implication is large: the classifier now depends less on exact spatial layout and uses far fewer parameters.
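The parameter difference is easy to check directly. As a rough sketch, the snippet below compares the flatten-based head from the earlier model against the pooled head; the `count_params` helper is just for illustration, not part of any PyTorch API.

```python
import torch.nn as nn

def count_params(module):
    # Total number of learnable scalars in a module.
    return sum(p.numel() for p in module.parameters())

# Head that flattens a (B, 32, 16, 16) feature map directly.
flatten_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)

# Head that pools down to (B, 32, 1, 1) first.
pooled_head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(32, 2),
)

print(count_params(flatten_head))  # 524482
print(count_params(pooled_head))   # 66
```

The pooled head also works for any input resolution, because the adaptive pool always produces a 1 x 1 spatial output regardless of the incoming height and width.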

That is the deeper practical lesson. PyTorch code is not separate from architecture. The code is where the architecture becomes real enough to inspect, train, and modify.

Troubleshooting

Issue: The model crashes at the first linear layer.

Why it happens / is confusing: The convolutional blocks ran fine, so it feels like the architecture is correct.

Clarification / Fix: This is usually a shape mismatch between the final feature map and the expected input size of Linear. Trace the tensor shapes through the network explicitly.

Issue: The model runs, but the classifier head has far too many parameters.

Why it happens / is confusing: Flattening works, so the design feels acceptable.

Clarification / Fix: Check the spatial size before flattening. If it is still large, consider more downsampling or a global pooling layer before the final linear layer.

Issue: Validation behavior changes strangely even though the code "looks right."

Why it happens / is confusing: CNN code often includes modules whose behavior changes between training and inference, especially once batch norm or dropout are added.

Clarification / Fix: Use model.train() during training and model.eval() during validation or inference, just as in earlier PyTorch lessons.
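A tiny illustration of why the mode matters, using dropout as the example module (the model here is hypothetical, chosen only to make the train/eval difference visible):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(4, 8)

drop.train()          # training mode: randomly zero activations
train_out = drop(x)

drop.eval()           # eval mode: dropout becomes the identity
eval_out = drop(x)

# In train mode, surviving values are scaled by 1 / (1 - p) = 2,
# so every entry is either 0.0 or 2.0. In eval mode, x passes through unchanged.
print(sorted(set(train_out.flatten().tolist())))
print(torch.equal(eval_out, x))  # True
```

Batch norm behaves analogously: it uses batch statistics in train mode and running statistics in eval mode, which is why forgetting model.eval() often shows up as mysteriously unstable validation metrics.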

Issue: Copying an architecture diagram does not produce a good model.

Why it happens / is confusing: The diagram captures the block pattern, but not the task-specific details like input resolution, class count, normalization, augmentation, or optimization settings.

Clarification / Fix: Treat architecture diagrams as design templates, not complete runnable systems.


Advanced Connections

Connection 1: CNNs in PyTorch ↔ Architectural Reading Skills

The parallel: Once you can implement a small CNN yourself, architecture papers and codebases become much easier to read because you can map blocks directly to tensor transformations.

Real-world case: This is the step that turns "I know what a CNN is" into "I can modify a real model without guessing."

Connection 2: CNNs in PyTorch ↔ Production Model Hygiene

The parallel: The same lessons from the PyTorch block still apply here: explicit modules, controlled state, train/eval mode, and reproducible shape assumptions.

Real-world case: Many deployment bugs in vision systems are not caused by exotic models, but by simple implementation mismatches in preprocessing, tensor layout, or inference mode.




Key Insights

  1. A CNN in code is still the same architecture story - Feature extractor first, classifier head second.
  2. Shape tracking is a first-class skill - Most beginner implementation failures come from losing track of tensor dimensions.
  3. Implementation details expose architectural trade-offs - Choices about pooling, channels, flattening, and heads directly change cost and behavior.

Knowledge Check (Test Questions)

  1. What is the cleanest way to mentally split a small CNN in PyTorch?

    • A) Into random layers and helper functions.
    • B) Into a feature extractor and a classifier head.
    • C) Into training code and optimizer code only.
  2. What tensor layout does PyTorch usually expect for image batches?

    • A) (height, width, channels, batch)
    • B) (batch, height, width, channels)
    • C) (batch, channels, height, width)
  3. Why might AdaptiveAvgPool2d((1, 1)) be useful before the final linear layer?

    • A) It increases the number of channels automatically.
    • B) It reduces spatial dimensions in a controlled way so the classifier head needs fewer parameters.
    • C) It replaces the need for a loss function.

Answers

1. B: That split matches both the architecture and the implementation logic of most small CNNs.

2. C: PyTorch image tensors are typically ordered as batch, channels, height, width.

3. B: Global-style pooling is a common way to simplify the head and reduce parameter count.


