Day 111: Overfitting, Underfitting, and Regularization
A model has learned something useful only if its performance survives outside the training set; fitting the training data beautifully is not the goal, just one step on the way.
Today's "Aha!" Moment
Imagine you are still working on churn prediction. You train a very flexible model and the training score looks excellent. That feels like progress. But when you evaluate on validation data, the gain shrinks or even disappears. The model did learn something, but part of what it learned was too specific to the quirks of the training sample.
That is the important shift in perspective. Overfitting is not "learning too much" in some vague sense. It is learning patterns that do not travel well. Underfitting is the opposite failure: the model is too weak or the representation too poor to capture the useful structure even in the training data.
Once you see those two failure modes clearly, regularization stops looking like a bag of tricks. It becomes a way of pushing the model away from brittle, overly specific solutions and toward patterns that are more likely to repeat on new data.
So the aha is this: the real object you are tuning is not training fit, but the gap between what the model can explain in-sample and what still holds out-of-sample.
Why This Matters
The problem: Good training performance and good generalization are not the same thing, and practical ML depends on not confusing them.
Before:
- Higher training accuracy feels like unquestionable progress.
- Model complexity is added without a clear diagnostic reason.
- Regularization feels abstract or decorative.
After:
- Model quality is judged by what survives on validation data.
- Overfitting and underfitting become different diagnoses with different fixes.
- Regularization becomes a tool for controlling how aggressively the model fits the sample.
Real-world impact: A large part of applied ML is really the work of controlling model flexibility so the system captures reusable structure instead of sample-specific noise.
Learning Objectives
By the end of this session, you will be able to:
- Distinguish overfitting from underfitting - Use training and validation behavior to tell which failure mode you are seeing.
- Explain regularization as controlled restraint - Understand how penalties, early stopping, and related mechanisms discourage brittle fits.
- Choose the next intervention more intelligently - Decide whether the problem calls for more capacity, less capacity, better features, more data, or stronger regularization.
Core Concepts Explained
Concept 1: Underfitting and Overfitting Are Opposite Ways to Miss Generalization
The cleanest way to read these two failures is by comparing training and validation performance.
If both training and validation are poor, the model is usually underfitting. It is not flexible enough, the features are too weak, or the task representation is still missing useful structure. The model fails early, before it even gets the training sample right.
If training is very strong but validation is noticeably worse, you are usually seeing overfitting. The model has started using patterns that are too tied to this dataset and do not transfer well.
underfitting:
    training:   bad
    validation: bad
overfitting:
    training:   good
    validation: noticeably worse
That distinction matters because the next move depends on it. More flexibility may help underfitting and worsen overfitting. Stronger regularization may help overfitting and worsen underfitting.
The trade-off is between expressive power and stability. Models need enough capacity to capture real structure, but not so much unchecked freedom that they start memorizing noise.
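As a sketch of this read-out, consider fitting polynomials of increasing degree to noisy quadratic data. The data-generating process and the degree choices here are illustrative assumptions, not part of the lesson's own examples:

```python
# Sketch: reading underfit vs overfit from train/validation scores.
# The synthetic quadratic data and the degrees (1, 2, 15) are assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=200)  # quadratic signal + noise

X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

scores = {}
for degree in (1, 2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    scores[degree] = (model.score(X_tr, y_tr), model.score(X_va, y_va))
    print(f"degree {degree:2d}: train R^2 {scores[degree][0]:.3f}, "
          f"validation R^2 {scores[degree][1]:.3f}")

# Typical pattern: degree 1 is poor on both (underfit), degree 2 is good
# and close on both, degree 15 fits training best while validation lags.
```

Reading the two scores side by side, rather than the training score alone, is what makes the diagnosis possible.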
Concept 2: Regularization Discourages Brittle Solutions
Regularization is the general idea of making certain kinds of solutions less attractive during training.
For linear models, L2 regularization discourages large coefficients and tends to spread weight more smoothly. L1 regularization pushes some coefficients toward zero, which can produce sparser models. For iterative learners such as boosting or neural networks, early stopping can play a similar role by preventing training from continuing into increasingly noisy refinements.
objective = data_loss + lambda_ * penalty
That one line captures the core idea: fitting the data is not the only objective. You also penalize solutions that are unnecessarily extreme, unstable, or complex.
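A minimal numerical sketch of that objective, assuming a ridge-style squared-error loss; the function name and the example weights are hypothetical, chosen only to show the penalty at work:

```python
import numpy as np

def ridge_objective(w, X, y, lambda_):
    """Penalized objective: squared-error data loss plus an L2 penalty."""
    data_loss = np.mean((X @ w - y) ** 2)   # how well w fits the sample
    penalty = np.sum(w ** 2)                # discourages extreme coefficients
    return data_loss + lambda_ * penalty

# With lambda_ > 0, a wilder weight vector scores worse even before we
# ask whether it fits the data better.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0])
small_w = np.array([1.0, 1.0])
big_w = np.array([5.0, -3.0])
print(ridge_objective(small_w, X, y, 0.1))
print(ridge_objective(big_w, X, y, 0.1))
```

Increasing `lambda_` shifts the balance: the same extreme solution becomes progressively less attractive relative to a restrained one.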
Regularization is not a rejection of learning. It is a way of saying: among the models that fit the data reasonably well, prefer the one that does so with less fragility.
The trade-off is immediate. Stronger regularization usually improves stability but can suppress useful signal if pushed too far. Too little regularization leaves the model free to chase accidental quirks in the sample.
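In scikit-learn's linear models the penalty strength is exposed as `alpha`. A small sketch of how it changes the fitted coefficients; the synthetic data here is an assumption for illustration:

```python
# Sketch: coefficient behavior under weak vs strong L2, and L1 sparsity.
# The data-generating process (3 informative features out of 10) is assumed.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:3] = [3.0, -2.0, 1.5]           # only 3 informative features
y = X @ true_w + rng.normal(scale=0.5, size=100)

weak = Ridge(alpha=0.01).fit(X, y)      # almost unpenalized
strong = Ridge(alpha=100.0).fit(X, y)   # heavily penalized
sparse = Lasso(alpha=0.1).fit(X, y)     # L1 penalty

print("weak  |coef| sum:", np.abs(weak.coef_).sum())
print("strong|coef| sum:", np.abs(strong.coef_).sum())
print("lasso zeros:", int((sparse.coef_ == 0).sum()))
```

The stronger Ridge penalty shrinks every coefficient toward zero, while Lasso tends to drive some of the uninformative coefficients exactly to zero, matching the L2/L1 contrast described above.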
Concept 3: Diagnosis Should Determine the Fix
This is where many workflows go wrong. Teams see disappointing validation performance and start changing knobs at random: bigger model, smaller model, more features, fewer features, more penalty, less penalty.
The better approach is diagnostic.
If both training and validation are poor:
- the model may be too simple
- the features may be too weak
- the representation may still hide useful structure
If training is strong but validation lags:
- the model may be too flexible for the amount of data
- regularization may be too weak
- the features may include noisy or leaky signals
- more data may help if the task supports it
read the pattern first
|
+--> underfit? increase useful capacity
+--> overfit? add restraint or cleaner signal
This is why regularization belongs next to evaluation, not apart from it. You do not regularize because regularization is fashionable. You regularize because the pattern of failure tells you the model is fitting too specifically.
The trade-off is speed versus rigor. Random tuning can feel faster in the moment, but good diagnosis usually gets to the right intervention with far fewer wasted iterations.
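The read-the-pattern-first decision above can be sketched as a tiny helper; the score thresholds are illustrative placeholders, not universal rules, and should be set per task and metric:

```python
def diagnose(train_score, val_score, good_enough=0.8, gap_tolerance=0.05):
    """Read the train/validation pattern before picking an intervention.

    Thresholds are illustrative assumptions; tune them per task and metric.
    """
    if train_score < good_enough:
        # Fails even in-sample: capacity or representation problem.
        return "underfit: add useful capacity, better features, or a richer representation"
    if train_score - val_score > gap_tolerance:
        # Strong in-sample, weak out-of-sample: fitting too specifically.
        return "overfit: add regularization, clean the features, or get more data"
    return "reasonable: both scores acceptable and close"

print(diagnose(0.55, 0.52))  # both poor
print(diagnose(0.99, 0.81))  # large gap
print(diagnose(0.90, 0.88))  # close and strong
```

The point of the helper is not the thresholds but the order of the checks: rule out underfitting before blaming overfitting.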
Troubleshooting
Issue: Seeing training performance improve and assuming the model is getting better overall.
Why it happens / is confusing: Training feedback is immediate and concrete, so it feels authoritative.
Clarification / Fix: Always read training and validation together. Better fit on training data can be a warning sign if validation does not follow.
Issue: Applying stronger regularization to an already underfitting model.
Why it happens / is confusing: Regularization is often taught as a generally good thing, so it can look like a safe default.
Clarification / Fix: If the model cannot even fit the useful structure in the training set, stronger regularization usually makes the real problem worse.
Issue: Calling every weak validation score "overfitting."
Why it happens / is confusing: Overfitting is the most famous failure mode, so it becomes the default label.
Clarification / Fix: Check whether training performance is also weak. If it is, the model may be underfitting instead.
Advanced Connections
Connection 1: Regularization ↔ Bayesian Preference for Simpler Explanations
The parallel: Many regularization terms act like a prior preference against unnecessarily extreme parameter settings.
Real-world case: Penalized optimization often encodes the same practical instinct as Bayesian priors: prefer explanations that fit without becoming implausibly wild.
Connection 2: Overfitting ↔ Memorization Without Transfer
The parallel: The difference between memorizing examples and understanding the pattern also exists in human learning.
Real-world case: A student can ace rehearsed practice problems and still fail on new variants, which is exactly the shape of poor generalization.
Resources
Optional Deepening Resources
- These resources are optional and are not required for the core 30-minute path.
- [VIDEO] StatQuest - Bias and Variance
- Link: https://www.youtube.com/watch?v=EuBBz3bI-aA
- Focus: Reinforce the intuitive difference between models that are too simple and models that are too fragile.
- [TUTORIAL] Scikit-learn User Guide - Linear models and regularization
- Link: https://scikit-learn.org/stable/modules/linear_model.html
- Focus: Review how Ridge, Lasso, and Elastic Net express different regularization choices.
- [DOCS] Scikit-learn User Guide - Model complexity and validation curves
- Link: https://scikit-learn.org/stable/modules/learning_curve.html#validation-curve
- Focus: Connect regularization and model complexity to the behavior of training and validation scores.
- [BOOK] Hands-On Machine Learning
- Link: https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/
- Focus: Reinforce how penalties and early stopping change model behavior in practice.
Key Insights
- Generalization is the real target - Training fit matters only insofar as it leads to performance that survives on new data.
- Regularization is controlled restraint - It discourages solutions that fit the sample too specifically.
- Diagnosis should come before intervention - Underfitting and overfitting need different fixes, so the training/validation pattern matters.
Knowledge Check (Test Questions)
1. What pattern most strongly suggests overfitting?
- A) Training performance is strong, but validation performance is clearly worse.
- B) Training and validation are both weak.
- C) The model has only a few parameters.
2. What is regularization trying to do?
- A) Discourage brittle or overly extreme fits that do not generalize well.
- B) Guarantee that the model becomes simple enough to be perfect.
- C) Replace the need for validation data.
3. If both training and validation scores are poor, what is often more plausible than "use more regularization"?
- A) The model may be underfitting and need better features or more expressive capacity.
- B) The model has already memorized the training set.
- C) The only problem is that the test set is too small.
Answers
1. A: A strong training score combined with noticeably weaker validation performance is the classic overfitting pattern.
2. A: Regularization tries to steer learning away from unstable fits that exploit quirks of the training sample.
3. A: When both scores are poor, the model is often too weak or the representation too limited rather than insufficiently regularized.