Day 102: Decision Trees and Rule-Based Splits
Decision trees matter because they classify by asking a sequence of data-driven questions, which makes the model feel less like a hidden formula and more like a visible path through the feature space.
Today's "Aha!" Moment
The last few lessons were about models that compute a score from features. Decision trees take a very different route: instead of combining evidence into one number, they keep splitting the data into simpler and simpler groups.
We will use one running example throughout the lesson. The learning platform wants to classify whether a student is at high risk of dropping out. A tree might ask:
- Is attendance below 70%?
- If yes, is assignment completion below 50%?
- If no, has quiz performance recently dropped sharply?
Each question divides the students into smaller groups that are less mixed than before.
That is the aha. A decision tree learns by repeatedly asking, "What question would make this group easier to classify right now?" The goal is not to write rules by hand. The goal is to let the data determine which sequence of splits best separates the classes.
Once you see that, two things become clear. First, trees are often easier to inspect than many other models because you can follow the path of one prediction. Second, interpretability does not make them safe by default. A tree can still grow into an overfit mess if it keeps making highly specific splits.
Why This Matters
The problem: Many classification models work, but not all of them expose an understandable chain of reasoning when someone asks why a decision was made.
Before:
- Classification can feel like one opaque formula.
- It is harder to see how features divide the data.
- Teams may trust a model without noticing that it has become too specific to the training set.
After:
- A prediction becomes a visible path through a sequence of decisions.
- Split quality can be understood as reducing label confusion step by step.
- Overfitting becomes easier to see as a tree grows deeper and more specific.
Real-world impact: Decision trees are useful on their own for interpretability and also form the basis of powerful ensemble methods such as random forests and gradient boosting.
Learning Objectives
By the end of this session, you will be able to:
- Explain how a decision tree makes a prediction - Describe classification as recursive splitting into simpler groups.
- Explain what impurity reduction means - Understand why trees choose questions that make labels less mixed.
- Explain why trees can still overfit - See how depth and overly specific splits can hurt generalization.
Core Concepts Explained
Concept 1: A Decision Tree Learns by Splitting the Feature Space into Easier Regions
At the root, the tree starts with one big mixed pool of examples. Some students drop out, some do not. The tree's job is to find one question that makes the groups on each side more predictable than the original mixture.
For the student-risk example, a first split might be:
attendance < 0.70 ?
This could separate many high-risk students from lower-risk ones. Each child node is then treated as its own smaller problem, and the tree asks the next best question within that group.
all students
|
+-- attendance < 0.70 ?
    |
    +-- yes -> ask about assignment completion
    |
    +-- no  -> ask about quiz trend
This is what makes trees feel so different from linear models. A tree does not assume one global score explains the world. It carves the space into regions using local questions.
The trade-off is flexibility versus stability. Trees can capture nonlinear patterns naturally, but that same flexibility lets them become too tailored if they keep splitting too far.
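The "carve the space with local questions" idea fits in a few lines of plain Python. The student records and the 0.70 threshold below are made up purely for illustration:

```python
# Hypothetical student records; field names are illustrative, not a real dataset.
students = [
    {"attendance": 0.55, "dropped_out": True},
    {"attendance": 0.62, "dropped_out": True},
    {"attendance": 0.68, "dropped_out": False},
    {"attendance": 0.81, "dropped_out": False},
    {"attendance": 0.90, "dropped_out": False},
    {"attendance": 0.95, "dropped_out": True},
]

def split(examples, feature, threshold):
    """Partition examples into (left, right) by one threshold question."""
    left = [e for e in examples if e[feature] < threshold]
    right = [e for e in examples if e[feature] >= threshold]
    return left, right

left, right = split(students, "attendance", 0.70)
print([e["dropped_out"] for e in left])   # mostly True  (higher risk)
print([e["dropped_out"] for e in right])  # mostly False (lower risk)
```

Each side is not perfectly pure, and that is normal: the tree would now treat each side as its own smaller problem and ask the next question within it.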
Concept 2: Good Splits Are the Ones That Reduce Label Confusion
The tree needs a way to judge whether one question is better than another. That is where impurity measures such as Gini impurity or entropy come in.
You do not need the full math first. The intuition is enough:
- a node with a 50/50 class mix is very impure
- a node with 95% of one class is much purer
So a good split is one that turns a confusing node into child nodes that are each easier to classify.
def gini(class_probabilities):
    # 0.0 for a perfectly pure node; 0.5 for a 50/50 two-class node
    return 1 - sum(p * p for p in class_probabilities)
The formula is less important than what it represents: "how mixed is this node?"
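As a quick check of that intuition, the function above (repeated here so the snippet runs on its own) scores the two example nodes from the bullet list:

```python
def gini(class_probabilities):
    # Gini impurity: 1 minus the sum of squared class probabilities.
    return 1 - sum(p * p for p in class_probabilities)

print(gini([0.5, 0.5]))    # 0.5   -> maximally impure for two classes
print(gini([0.95, 0.05]))  # ~0.095 -> much purer
```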
Imagine two possible questions:
- Split A creates child groups that are still half-and-half
- Split B creates one mostly-positive node and one mostly-negative node
Split B is better because it reduces confusion more.
bad split:
mixed -> mixed + mixed
good split:
mixed -> mostly positive + mostly negative
The trade-off is greedy local improvement versus global certainty. Trees choose the best split they can see at each step, which works well often, but it is still a step-by-step heuristic, not a guarantee of perfect global structure.
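In practice, candidate splits are compared by the size-weighted average impurity of their children: the split whose children are least mixed, on average, wins. A sketch with made-up label counts that mirror split A and split B above:

```python
def gini_from_counts(counts):
    """Gini impurity of a node given label counts, e.g. [positives, negatives]."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_impurity(children):
    """Average child impurity, weighted by child size."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini_from_counts(child) for child in children)

parent  = [10, 10]              # 50/50: very impure
split_a = [[5, 5], [5, 5]]      # children still half-and-half
split_b = [[9, 1], [1, 9]]      # mostly positive / mostly negative

print(gini_from_counts(parent))    # 0.5
print(weighted_impurity(split_a))  # 0.5   -> no reduction at all
print(weighted_impurity(split_b))  # ~0.18 -> large reduction
```

The greedy rule picks split B here because it lowers the weighted impurity the most at this step, even though no single step is guaranteed to be globally optimal.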
Concept 3: Trees Are Easy to Read, but They Can Memorize Noise If Left Unchecked
It is tempting to trust trees because they look understandable. But a readable model can still overfit.
Suppose the tree keeps splitting until tiny leaves contain only one or two training examples. Training accuracy may look excellent, but the model may now be using quirks that will not repeat on new students.
shallow tree:
broader rules
easier generalization
very deep tree:
many tiny special cases
higher overfitting risk
This is why tree training always involves control knobs such as:
- max_depth
- min_samples_split
- min_samples_leaf
- pruning strategies
These controls limit how specific the tree is allowed to become.
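A minimal sketch of how such knobs act as pre-pruning checks before a candidate split is accepted. The parameter names mirror scikit-learn's, but the helper and threshold values are illustrative assumptions, not library code:

```python
def allow_split(n_node, n_left, n_right, depth,
                max_depth=3, min_samples_split=4, min_samples_leaf=2):
    """Pre-pruning checks applied to one candidate split (illustrative)."""
    if depth >= max_depth:
        return False  # tree is already as deep as allowed
    if n_node < min_samples_split:
        return False  # node too small to be worth splitting
    if min(n_left, n_right) < min_samples_leaf:
        return False  # split would create a tiny, memorization-prone leaf
    return True

print(allow_split(n_node=50, n_left=30, n_right=20, depth=1))  # True
print(allow_split(n_node=50, n_left=49, n_right=1, depth=1))   # False: tiny leaf
print(allow_split(n_node=50, n_left=30, n_right=20, depth=3))  # False: too deep
```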
The core lesson is important: interpretability is not the same as reliability. A tree can explain its path clearly and still be learning the wrong thing too literally from the training data.
The trade-off is transparency versus over-specificity. Trees are appealing because you can inspect them, but that does not remove the need for proper evaluation on unseen data.
Troubleshooting
Issue: Thinking decision trees are just hand-written if/else logic.
Why it happens / is confusing: Once trained, they look like explicit rules.
Clarification / Fix: The final model is rule-like, but the rules were learned from data by optimizing split quality, not authored manually.
Issue: Assuming an interpretable tree cannot overfit.
Why it happens / is confusing: Readability feels like a sign of trustworthiness.
Clarification / Fix: A tree can be easy to read and still too specific to the training set. Interpretability helps inspection, not automatic generalization.
Issue: Treating Gini or entropy as abstract formulas to memorize.
Why it happens / is confusing: The math can overshadow the simple intuition.
Clarification / Fix: Translate both back into one idea: how mixed is this node, and does the split make the child groups less mixed?
Advanced Connections
Connection 1: Decision Trees ↔ Human Heuristics
The parallel: Trees resemble human decision checklists because both make choices by asking conditional questions in sequence.
Real-world case: Triage protocols and support-routing playbooks often look tree-like even when they are not learned from data.
Connection 2: Decision Trees ↔ Ensemble Methods
The parallel: Many strong modern tabular-data methods use trees as building blocks rather than relying on one tree alone.
Real-world case: Random forests and gradient-boosted trees improve performance by combining many trees to reduce variance or bias.
Resources
Optional Deepening Resources
- These resources are optional and are not required for the core 30-minute path.
- [TUTORIAL] Scikit-learn User Guide - Decision Trees
- Link: https://scikit-learn.org/stable/modules/tree.html
- Focus: Review how trees split data and how practical depth controls work.
- [BOOK] Interpretable Machine Learning - Decision Trees
- Link: https://christophm.github.io/interpretable-ml-book/tree.html
- Focus: See how tree paths map into human-readable decision logic and what their limits are.
- [BOOK] Hands-On Machine Learning
- Link: https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/
- Focus: Use the decision-tree chapter as a practical next step after this lesson.
- [VIDEO] Decision Trees, Clearly Explained - StatQuest
- Link: https://www.youtube.com/watch?v=7VeUPuFGJHk
- Focus: Reinforce the intuition of recursive splitting and impurity reduction.
Key Insights
- Decision trees classify by asking learned questions - Each split tries to create child groups that are easier to classify than the parent group.
- Impurity reduction is the tree's split criterion - Gini and entropy are just ways to score how mixed a node is.
- Interpretability does not remove overfitting risk - Trees still need depth and split controls to avoid memorizing the training set.
Knowledge Check (Test Questions)
1. What is the main goal of a split in a decision tree?
   - A) Create child groups that are less mixed than the parent group.
   - B) Maximize the number of leaves as quickly as possible.
   - C) Remove the need for evaluation data.
2. Why are Gini impurity and entropy useful?
   - A) They measure how mixed the labels are in a node, which helps compare candidate splits.
   - B) They make trees immune to overfitting.
   - C) They are only decorative statistics with no role in training.
3. Why can a very deep tree be risky?
   - A) It may memorize training quirks and generalize poorly to new data.
   - B) It automatically becomes a linear classifier.
   - C) It stops being interpretable in any sense.
Answers
1. A: A useful split reduces label confusion by producing child groups that are easier to classify than the original mixed node.
2. A: Both metrics quantify node mixedness, which is why they help the tree choose better questions.
3. A: Deeper trees gain more flexibility, but that flexibility can turn into memorization of noise instead of reusable structure.