Day 138: Fine-Tuning Pre-trained Models

Fine-tuning matters because a pretrained model is useful only if you can adapt it without destroying the structure it already learned.


Today's "Aha!" Moment

Yesterday's lesson introduced transfer learning as representation reuse. Fine-tuning is the next question: once you have a useful pretrained model, how much should you let it change?

That is the key tension. If you freeze everything, the model may remain too tied to the source task. If you unfreeze everything too aggressively, you can quickly overwrite useful pretrained structure, especially when your target dataset is small.

So fine-tuning is not just "train the pretrained model a bit more." It is controlled adaptation. You are deciding how much of the old representation to preserve, how much of it to reshape, and how safely to move the weights toward the new task.

That is the aha. Fine-tuning is really a question about how to spend your adaptation budget without erasing the knowledge you started with.


Why This Matters

Suppose the warehouse team now has a pretrained vision backbone and a modest labeled dataset from a new camera angle. A head-only model works, but performance stalls because the new images differ enough from the source domain that the frozen representation is no longer ideal.

Fine-tuning is what lets the model cross that gap. But it also introduces risk. With too high a learning rate, or too much of the backbone unfrozen too early, the model may overfit the small target dataset or forget the robust generic features that made transfer valuable in the first place.

This is why fine-tuning is one of the most important practical skills in applied deep learning. It is the difference between "I loaded pretrained weights" and "I adapted a pretrained model intelligently."


Learning Objectives

By the end of this session, you will be able to:

  1. Explain what fine-tuning changes compared with feature extraction - Understand that you are adapting the backbone, not only replacing the head.
  2. Choose a fine-tuning strategy deliberately - Decide how much of the model to unfreeze and how aggressively to update it.
  3. Recognize the main failure modes - Especially overfitting, catastrophic forgetting, and unstable optimization.

Core Concepts Explained

Concept 1: Fine-Tuning Means Updating Part of the Pretrained Backbone

Feature extraction treats the pretrained model as fixed and learns only a new task head. Fine-tuning goes further: it allows some or all of the backbone to update on the target task.

Conceptually:

pretrained backbone + new head
    |
feature extraction: train head only
fine-tuning: train head + some/all backbone
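The difference can be made concrete by counting trainable parameters. A minimal sketch, using a hypothetical toy backbone and head (any `nn.Module` pair would behave the same way):

```python
import torch.nn as nn

# Hypothetical small backbone plus a new task head.
backbone = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
)
head = nn.Linear(64, 5)  # freshly initialized head for the target task

def trainable_params(*modules):
    """Count parameters that will actually receive gradient updates."""
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)

# Feature extraction: freeze the backbone, train only the head.
for p in backbone.parameters():
    p.requires_grad = False
print(trainable_params(backbone, head))  # 325 (head only: 64*5 + 5)

# Fine-tuning: unfreeze the backbone as well.
for p in backbone.parameters():
    p.requires_grad = True
print(trainable_params(backbone, head))  # 6597 (backbone + head)
```

The jump from 325 to 6597 trainable parameters is exactly the extra adaptation capacity (and extra overfitting risk) that fine-tuning buys.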

This matters because the pretrained representation may be close to what you need, but not perfectly aligned. Fine-tuning lets the model shift its internal features toward the new domain.

For example, a vision backbone pretrained on general images may already detect useful edges and textures. But if your target domain has unusual lighting, packaging materials, or camera geometry, the later layers may need to adapt.

So the question is not whether the pretrained model has knowledge. It is whether that knowledge is sufficiently aligned already, or whether it needs controlled updating.

Concept 2: Unfreezing Strategy Is the Real Decision

In practice, fine-tuning is often gradual rather than all-or-nothing.

Common strategies:

  1. Head-only training - Keep the backbone frozen and train just the new head.
  2. Partial unfreezing - Unfreeze the last block or two of the backbone along with the head.
  3. Gradual unfreezing - Unfreeze progressively deeper layers as training stabilizes.
  4. Full fine-tuning - Update everything, usually with a small learning rate.

This staged view is useful because early layers often contain more generic features, while later layers are more task-specific. That means later layers are usually the safest place to begin adaptation.

A small PyTorch sketch looks like this:

import torch

# Freeze the whole backbone, then selectively unfreeze its last stage.
for param in model.backbone.parameters():
    param.requires_grad = False

for param in model.backbone.layer4.parameters():
    param.requires_grad = True

# Give the pretrained layers a smaller learning rate than the new head.
optimizer = torch.optim.AdamW([
    {"params": model.backbone.layer4.parameters(), "lr": 1e-4},
    {"params": model.head.parameters(), "lr": 1e-3},
])

That code reflects two practical ideas:

  1. Unfreeze selectively - Only the most task-specific part of the backbone (here, the last stage) is allowed to change.
  2. Update pretrained layers more gently - The pretrained parameters get a smaller learning rate than the freshly initialized head.

This is often called discriminative fine-tuning in spirit, even if the exact optimizer setup varies.
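Gradual unfreezing can be scripted as a simple schedule. A sketch, assuming a hypothetical four-stage backbone whose stage names (`layer1` through `layer4`) mirror ResNet-style models:

```python
import torch.nn as nn

# Hypothetical backbone with four named stages, stand-ins for real blocks.
class Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(8, 8)
        self.layer2 = nn.Linear(8, 8)
        self.layer3 = nn.Linear(8, 8)
        self.layer4 = nn.Linear(8, 8)

def set_trainable(model, epoch, schedule):
    """Freeze everything, then unfreeze each stage once its start epoch arrives."""
    for p in model.parameters():
        p.requires_grad = False
    for start_epoch, stage_name in schedule.items():
        if epoch >= start_epoch:
            for p in getattr(model, stage_name).parameters():
                p.requires_grad = True

backbone = Backbone()
# Unfreeze from the last stage backward: layer4 immediately, layer3 at
# epoch 3, layer2 at epoch 6. layer1 stays frozen throughout.
schedule = {0: "layer4", 3: "layer3", 6: "layer2"}

set_trainable(backbone, epoch=4, schedule=schedule)
# At epoch 4, layer4 and layer3 are trainable; layer1 and layer2 are frozen.
```

Calling `set_trainable` at the start of each epoch keeps the unfreezing policy in one place instead of scattering `requires_grad` flips through the training loop.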

Concept 3: Fine-Tuning Fails Mostly When Adaptation Is Too Aggressive for the Data

The biggest risks in fine-tuning are not mysterious. They usually come from a mismatch between how much freedom the model has and how much reliable target data you actually have.

Three common failure modes:

too little adaptation     -> underfitting to the target domain
too much adaptation       -> overfitting or forgetting useful pretrained structure
too high a learning rate  -> unstable updates that damage the backbone quickly

This is where catastrophic forgetting becomes a useful mental model. The pretrained model starts in a good region of parameter space. If you push it too hard with noisy target gradients, it can lose the broad, useful features it had learned before replacing them with brittle target-specific behavior.
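One way to hedge against forgetting is to penalize how far the weights drift from their pretrained values, in the spirit of L2-SP-style regularization. A sketch with a stand-in module (any pretrained `nn.Module` would work the same way):

```python
import copy

import torch
import torch.nn as nn

model = nn.Linear(4, 4)            # stands in for a pretrained backbone
pretrained = copy.deepcopy(model)  # frozen snapshot of the starting weights

def distance_penalty(model, reference, strength=0.01):
    """L2 penalty on parameter drift away from the pretrained snapshot."""
    penalty = 0.0
    for p, p0 in zip(model.parameters(), reference.parameters()):
        penalty = penalty + ((p - p0.detach()) ** 2).sum()
    return strength * penalty

# Added to the task loss, this pulls each update back toward the
# pretrained solution:
#   loss = task_loss + distance_penalty(model, pretrained)
```

The penalty is zero at initialization and grows as the backbone moves away from its starting point, which directly encodes the "good region of parameter space" intuition above.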

That is why careful fine-tuning usually combines:

  1. Partial or gradual unfreezing - So the amount of change is matched to the amount of data.
  2. Smaller learning rates for pretrained layers - So updates stay gentle near a good starting point.
  3. Close validation monitoring with early stopping - So adaptation stops before it turns into forgetting.

Fine-tuning is therefore not just an optimization setting. It is a risk-management strategy for adaptation.

Troubleshooting

Issue: Head-only training plateaus well below the desired accuracy.

Why it happens / is confusing: The pretrained backbone feels powerful enough that freezing it should be sufficient.

Clarification / Fix: The source representation may not be aligned enough to the target task. Unfreezing later layers is often the next sensible step.

Issue: Validation performance gets worse right after unfreezing.

Why it happens / is confusing: More trainable layers sound like they should only help.

Clarification / Fix: The learning rate may be too high for pretrained weights, or too much of the backbone was unfrozen too early.

Issue: The model fits the training set quickly but generalizes badly.

Why it happens / is confusing: Fine-tuning often reduces training loss fast, which can look like success.

Clarification / Fix: Small target datasets make overfitting easy. Use validation carefully and do not mistake fast adaptation for robust transfer.
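The "use validation carefully" advice amounts to tracking the best validation score and stopping once it stops improving. A minimal early-stopping sketch with hypothetical accuracy numbers:

```python
def early_stop(val_scores, patience=2):
    """Return the epoch of the best validation score, stopping once
    `patience` epochs pass without improvement."""
    best, best_epoch = float("-inf"), -1
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch  # restore the checkpoint saved at this epoch
    return best_epoch

# Validation accuracy rises, then overfitting sets in:
early_stop([0.70, 0.78, 0.81, 0.79, 0.77])  # -> 2
```

In a real fine-tuning loop you would save a checkpoint whenever the best score improves and reload it when training stops, so fast training-loss drops never get mistaken for robust transfer.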

Issue: Treating all pretrained layers as equally task-specific.

Why it happens / is confusing: The checkpoint looks like one block of weights.

Clarification / Fix: Earlier layers are often more generic, later layers more specialized. Fine-tuning strategy should reflect that.


Advanced Connections

Connection 1: Fine-Tuning ↔ Optimization Stability

The parallel: Fine-tuning is not only about architecture reuse; it is about making smaller, safer parameter updates from a good initialization.

Real-world case: Much of practical fine-tuning success comes from conservative optimization choices rather than from architecture changes alone.

Connection 2: Fine-Tuning ↔ Domain Shift

The parallel: The more the target domain diverges from the source domain, the more carefully you must think about how much adaptation the backbone needs.

Real-world case: A model transferred across camera setups, writing styles, or user populations often needs more than a frozen head, but not always full retraining.



Key Insights

  1. Fine-tuning is controlled adaptation of a pretrained representation - You are updating the backbone, not just adding a new head.
  2. How much you unfreeze is the core strategic choice - Later layers are often the safest place to begin adaptation.
  3. Most fine-tuning failures are adaptation-budget failures - Too much change, too quickly, on too little data.

Knowledge Check (Test Questions)

  1. What makes fine-tuning different from pure feature extraction?

    • A) Fine-tuning updates some or all of the pretrained backbone on the target task.
    • B) Fine-tuning removes the classifier head entirely.
    • C) Fine-tuning never uses validation data.
  2. Why is it common to use a smaller learning rate for pretrained layers than for a new head?

    • A) Because pretrained layers often need gentler updates to avoid damaging useful learned structure.
    • B) Because pretrained layers do not have gradients.
    • C) Because the new head cannot overfit.
  3. What is a common reason for catastrophic forgetting during fine-tuning?

    • A) The model is forced to use fewer parameters.
    • B) The pretrained backbone is updated too aggressively relative to the size and quality of the target data.
    • C) The backbone is kept frozen for too long.

Answers

1. A: Fine-tuning changes the pretrained representation itself, not only the task-specific head.

2. A: Smaller updates are often safer for valuable pretrained weights than large steps tuned for a randomly initialized head.

3. B: Forgetting usually happens when the target-task gradients move the pretrained representation too far, too fast.


