Lesson 6 • Intermediate
Decision Trees & Random Forests
Learn exactly how a tree chooses its questions, why deep trees overfit, and how random forests combine many trees into one of the most reliable models in machine learning.
What You'll Learn in This Lesson
- ✓How a tree splits data, one yes/no question at a time
- ✓Measure how mixed a group is with Gini impurity and entropy
- ✓Score a split by hand using information gain
- ✓Why deep trees overfit, and how pruning fixes it
- ✓Read feature importance to see which inputs matter
- ✓How a random forest votes many trees into one answer
🎯 Real-World Analogy: 20 Questions
A decision tree works just like the game 20 Questions. To guess an animal you ask: "Is it bigger than a cat?" → Yes → "Does it live in water?" → No → "Does it have stripes?" → Yes → "It's a tiger!" Each question splits the remaining possibilities into smaller groups until only one answer is left.
A good player asks the most useful question first — the one that rules out the most options. A decision tree does the same thing automatically: at every step it picks the question that separates the classes best. The rest of this lesson is really just "how does the tree decide which question is best?"
1How a Tree Splits Data
A decision tree is built from nodes. Each internal node asks one question about a single feature (like "is temperature > 23?"). The answer sends each row down a branch. When a group is "good enough" — mostly one class — the tree stops and that group becomes a leaf, which holds the prediction.
To choose a question, the tree needs to measure how mixed a group is. A group of all "yes" rows is pure and needs no further questions; a 50/50 mix of "yes" and "no" is the most impure. The two standard ways to measure this are Gini impurity and entropy.
The split-building loop, in plain English:
- For every feature, try every possible threshold or category.
- Measure the impurity of the two groups that split would create.
- Keep the split that lowers impurity the most.
- Repeat on each child group until the groups are pure or a stop rule is hit.
2Gini Impurity — Scoring a Split by Hand
Gini impurity is the chance you'd label a random row wrong if you guessed based only on the group's class mix. The formula is 1 − Σ pᵢ², where pᵢ is the fraction of each class. It is 0.0 for a pure group and 0.5 for a 50/50 two-class mix. To score a split, you take the impurity of each side and weight it by how many rows land there — the side with more rows counts for more.
Run this worked example and watch the numbers. Lower weighted Gini means a better split.
Worked Example: Gini Impurity
Compute Gini for pure, mixed, and split groups — plain Python, no libraries
# How a tree scores a split: Gini impurity (no libraries needed)
# Gini = 1 - sum(p_i ** 2) where p_i is the fraction of each class
# 0.0 = pure group (all one class) -> a perfect leaf
# 0.5 = worst case for 2 classes (50/50 mix)
def gini(labels):
n = len(labels)
if n == 0:
return 0.0
counts = {}
for label in labels:
counts[label] = counts.get(label, 0) + 1 # tally each class
impurity = 1.0
for c in counts.values():
p = c / n
...3Entropy & Information Gain
Entropy measures the same idea as Gini — how mixed a group is — but with logarithms: −Σ pᵢ·log₂(pᵢ). It is 0 for a pure group and 1.0 for a perfect 50/50 two-class split. Information gain is simply how much the entropy drops after a split: the entropy before, minus the weighted entropy of the groups after. The tree picks the feature with the highest information gain.
Gini and entropy almost always agree on which split is best. Gini is a touch faster (no logs), which is why it's scikit-learn's default; entropy/information gain is the classic textbook framing.
Worked Example: Information Gain
Compute entropy and the information gain of splitting on 'outlook'
import math
# Entropy is another impurity measure. Information Gain = how much
# entropy DROPS after a split. The tree picks the split with the most gain.
def entropy(labels):
n = len(labels)
if n == 0:
return 0.0
counts = {}
for label in labels:
counts[label] = counts.get(label, 0) + 1
return -sum((c / n) * math.log2(c / n) for c in counts.values())
# 10 rows: should we play tennis? (label = "play")
data = [
{"outlook": "sunny", "play": "no"},
{
...🌍 Worked Example: The Real Tool (scikit-learn)
In practice you don't hand-code splits — you call a library. Here's the same idea with scikit-learn's DecisionTreeClassifier. Study it (it isn't runnable in the in-browser sandbox), then notice how the concepts map: criterion="gini" is the impurity measure you just computed, max_depth=2 limits how deep the tree can grow (your anti-overfit dial), and feature_importances_ tells you which inputs mattered.
# WORKED EXAMPLE — what the real tool looks like (scikit-learn).
# You don't run this here; study it so the library code feels familiar.
from sklearn.tree import DecisionTreeClassifier, export_text
# Features: [temperature, humidity] Label: play tennis? (0 = no, 1 = yes)
X = [[30, 85], [32, 90], [21, 70], [18, 65], [25, 80], [22, 60]]
y = [0, 0, 1, 1, 0, 1]
# criterion="gini" is the default; max_depth=2 keeps the tree shallow (anti-overfit)
clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
clf.fit(X, y) # the tree learns its split thresholds here
print(clf.predict([[20, 60]])) # predict a cool, dry day
print(dict(zip(["temp", "humidity"], clf.feature_importances_.round(2))))
print(export_text(clf, feature_names=["temp", "humidity"]))
# Expected output:
# [1]
# {'temp': 1.0, 'humidity': 0.0}
# |--- temp <= 23.50
# | |--- class: 1
# |--- temp > 23.50
# | |--- class: 0The expected output is shown in the comments. Notice temp has importance 1.0 and humidity 0.0 — this tiny tree solved the problem using temperature alone, so humidity contributed nothing.
4Depth, Overfitting & Pruning
Left unchecked, a tree keeps splitting until every leaf holds a single training row. That tree scores nearly 100% on the data it learned from — but it has memorised the training set, including its random noise, instead of learning the real pattern. This is overfitting, and it shows up as a big gap between training accuracy and test accuracy.
You keep a tree general by limiting its complexity — this is pruning:
- max_depth — cap how many questions can be chained (e.g. 3–6).
- min_samples_leaf — refuse leaves smaller than, say, 5 rows.
- min_samples_split — don't split a node that's already tiny.
- cost-complexity pruning (
ccp_alpha) — grow fully, then snip back weak branches.
5Feature Importance
Because a tree explicitly chooses which feature to split on, it can tell you which inputs mattered most. Each feature's importance is the total impurity it removed across all the splits that used it, scaled so the scores add up to 1.0. A feature that's never used scores 0.0.
This is one of the biggest reasons people love trees: they're interpretable. A doctor, a loan officer, or a regulator can read the importances and the split thresholds and understand exactly why the model said what it said — something a neural network struggles to offer.
6Random Forests — Wisdom of the Crowd
One deep tree overfits easily. A random forest fixes that by training many trees and letting them vote. Two tricks make the trees different from each other:
- Bootstrap sampling (bagging): each tree trains on a random sample of the rows, drawn with replacement, so every tree sees a slightly different dataset.
- Feature randomness: at each split a tree may only choose from a random subset of features, so no single strong feature dominates every tree.
Each tree makes different mistakes. When you take the majority vote (for classification) or the average (for regression), the random errors cancel out and the shared signal remains. The forest is almost always more accurate and far more stable than any single tree — which is why random forests are a go-to baseline for tabular data.
🎯 Your Turn 1: Finish the Gini Function
Fill in the two blanks so gini() returns the impurity of a group. Remember: each p is a class's fraction, and you subtract p² from 1. The expected output is in the comments so you can self-check.
Your Turn: Complete gini()
Fill in the ___ blanks to finish the impurity calculation
# 🎯 YOUR TURN — finish the Gini impurity function
# Gini = 1 - sum(p ** 2) for each class fraction p.
def gini(labels):
n = len(labels)
if n == 0:
return 0.0
counts = {}
for label in labels:
counts[label] = counts.get(label, 0) + 1
impurity = 1.0
for c in counts.values():
p = ___ # 👉 fraction in this class: count c divided by n
impurity -= ___ # 👉 subtract p squared (p * p)
return impurity
print(round(gini(
...🎯 Your Turn 2: A One-Split Classifier
A decision tree is ultimately a stack of if statements. Here you write the prediction rule of a tree with a single decision node. Fill in the two blanks so the rule predicts correctly for hot and cool days.
Your Turn: Tiny Rule-Based Tree
Complete the predict() rule for a one-node decision tree
# 🎯 YOUR TURN — write the prediction rule of a tiny one-split tree.
# The tree already "learned" this split: if temperature > 23, predict "no",
# otherwise predict "yes". Fill in the rule below.
def predict(temperature):
if temperature > 23: # this is the tree's single decision node
return ___ # 👉 return "no" (hot days: don't play)
else:
return ___ # 👉 return "yes" (cool days: play)
# Test the rule on a few days
for t in [30, 18, 24,
...🎯 Mini-Challenge: Pick the Better Split
Now do it with the support faded. You're given two candidate splits of the same six rows. Write the code to compute each split's weighted Gini and print which one the tree should pick (lower wins). Only a comment outline is provided — the logic is up to you.
Mini-Challenge: Better Split
Compute the weighted Gini of two splits and choose the lower one
# 🎯 MINI-CHALLENGE: pick the better split by hand
# You are given two ways to split the same 6 rows. Lower weighted Gini wins.
#
# 1. Define gini(labels) = 1 - sum((count/total) ** 2 for each class)
# 2. Split A: left = ["yes","yes","yes"] right = ["no","no","yes"]
# 3. Split B: left = ["yes","no","yes"] right = ["no","yes","no"]
# 4. For each split, compute the weighted Gini:
# (len(left)/6)*gini(left) + (len(right)/6)*gini(right)
# 5. print() both scores and which split the tree
...7Common Errors (And How to Fix Them)
These are the traps that bite almost everyone working with trees. Watch for them.
❌ Letting the tree grow with no depth limit
A fully grown tree memorises the training set: 100% train accuracy, poor test accuracy.
# ❌ No limits — this tree will overfit clf = DecisionTreeClassifier() # grows until every leaf is pure
✅ Fix: cap the depth (and/or set a minimum leaf size):
# ✅ Shallow tree generalises better clf = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5)
❌ Judging the tree on training accuracy only
If you only look at training accuracy you'll never notice overfitting, so you'll never prune.
print(clf.score(X_train, y_train)) # 1.00 — looks perfect, but meaningless
✅ Fix: always compare against held-out test data:
print(clf.score(X_train, y_train)) # 0.92 print(clf.score(X_test, y_test)) # 0.88 <- the number that matters
❌ Wasting time scaling features for a tree
Standardising inputs is required for k-NN/SVM, but for trees it's pointless effort and can confuse your pipeline.
# ❌ Unnecessary for trees — splits are threshold-based X_scaled = StandardScaler().fit_transform(X) DecisionTreeClassifier().fit(X_scaled, y)
✅ Fix: feed trees the raw values — no scaling needed:
DecisionTreeClassifier().fit(X, y) # raw features are fine
❌ Data leakage — letting the answer sneak into the features
If a feature secretly encodes the label (or you fit on the test set), accuracy looks amazing in training and collapses in production.
# ❌ "approved" basically IS the label "loan_repaid" — leakage! X = df[["income", "age", "approved"]] y = df["loan_repaid"]
✅ Fix: drop anything you wouldn't know at prediction time, and split before fitting:
X = df[["income", "age"]] # only legitimate, available features X_train, X_test = train_test_split(X) # split FIRST, fit on train only
📋 Quick Reference
| Concept | What It Is | Key Point |
|---|---|---|
| Decision tree | Flowchart of yes/no splits | Easy to read and explain |
| Gini impurity | 1 − Σ p² | 0 = pure, 0.5 = 50/50 mix |
| Entropy | −Σ p·log₂(p) | 0 = pure, 1.0 = 50/50 mix |
| Information gain | Entropy drop after a split | Higher = better split |
| max_depth | Cap on tree depth | Main anti-overfit dial |
| Pruning | Trimming weak branches | Improves generalisation |
| Feature importance | Impurity removed per feature | Sums to 1.0; unused = 0 |
| Random forest | Many trees that vote | More accurate, less overfit |
| Bootstrap (bagging) | Sample rows with replacement | Each tree sees different data |
❓ Frequently Asked Questions
Q: What is a decision tree in machine learning?
A: A decision tree is a model that makes predictions by asking a series of yes/no questions about your data. Each internal node tests one feature, each branch is an answer, and each leaf is a final prediction. It works like a flowchart, which makes it one of the easiest models to read and explain.
Q: What is Gini impurity and how is it different from entropy?
A: Both measure how mixed the labels are in a group. Gini impurity is 1 minus the sum of squared class probabilities; entropy is the sum of -p*log2(p). Both are 0 when a group is pure (all one class) and largest when classes are evenly split. Gini is slightly faster to compute and is the default in scikit-learn; entropy (used to compute information gain) is the classic textbook choice. They usually pick very similar splits.
Q: What is information gain?
A: Information gain is how much the impurity (entropy) drops after you split the data on a feature. The tree tries every feature, measures the weighted impurity of the resulting groups, and picks the split with the highest information gain — the question that best separates the classes.
Q: Why do deep decision trees overfit?
A: If you let a tree keep splitting, it eventually creates a tiny leaf for almost every training row. That memorises the training data — including its noise — so it scores nearly 100% on training but poorly on new data. Limiting max_depth, raising min_samples_leaf, or pruning keeps the tree general.
Q: Do I need to scale or normalise features for decision trees?
A: No. Trees split on thresholds for one feature at a time ("is age > 30?"), so the scale or units of a feature do not matter. This is different from distance- or gradient-based models like k-NN, SVMs, or neural networks, which do need scaling.
Q: What is a random forest and why is it better than one tree?
A: A random forest trains many trees on random bootstrap samples of the data and random subsets of features, then takes a majority vote. The individual trees make different mistakes, so averaging them cancels out the noise. The result is usually more accurate and far less prone to overfitting than any single deep tree.
🎉 Lesson Complete!
You now understand how a tree turns data into questions: it measures impurity with Gini or entropy, picks splits by information gain, controls overfitting with depth limits and pruning, reports feature importance, and scales up to a random forest that votes many trees into one robust answer.
🚀 Up next: Neural Networks — where the model learns its own features instead of splitting on the ones you give it.
Sign up for free to track which lessons you've completed and get learning reminders.