Lesson 4 • Beginner
Linear Regression
Your first real machine-learning algorithm. By the end you'll fit a line to data by hand, measure its error with a cost function, watch gradient descent learn the line for you, split data to test honestly, and run it all in three lines with scikit-learn.
What You'll Learn in This Lesson
- ✓ What a line y = wx + b represents as a model
- ✓ How the MSE cost function scores a fit
- ✓ What gradient descent does, step by step
- ✓ How to code a tiny training loop in plain Python
- ✓ Why you split data into training and test sets
- ✓ How to do it for real with sklearn's LinearRegression
🌍 Real-World Analogy
Imagine a seasoned estate agent who has watched hundreds of houses sell. Over time they develop a gut feeling: "bigger houses sell for more." Linear regression turns that gut feeling into an exact formula — something like price = 150 × size + 50000.
The agent's instinct is fuzzy; the formula is precise and repeatable. That is the whole job of this lesson: take a scatter of dots and find the single straight line that best summarises the trend, so you can predict the next dot you haven't seen yet.
1. The Model Is Just a Line: y = wx + b
A linear-regression model is nothing more than two numbers. You feed it an input x (a feature, like study hours) and it returns a prediction y using the straight-line formula:
y = w·x + b
- x: the input feature you measure
- w: the weight (slope), how much y changes per unit of x
- b: the bias (intercept), the value of y when x is 0
- y: the prediction the model produces
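To make the formula concrete, here is a minimal sketch of the model as a Python function. The function name `predict` is an illustrative choice, and w = 6.5, b = 46.0 are the hand-guessed values this lesson uses for its study-hours data, not fitted results.

```python
# A sketch of the model y = w*x + b as a plain function.
# `predict` is an illustrative name; w and b are hand-picked guesses.

def predict(x, w, b):
    """Return the line's prediction for input x."""
    return w * x + b

w = 6.5   # slope: predicted points gained per extra study hour (a guess)
b = 46.0  # intercept: predicted score at 0 hours (a guess)

for hours in [1, 3, 5]:
    print(hours, "hours ->", predict(hours, w, b))
```

Changing w tilts the line and changing b shifts it up or down; nothing else about the model can vary.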
"Training" the model just means picking the w and b that draw the line closest to all your data points. Everything else in this lesson is about how we pick them. The example below starts with a hand-guessed line and a function that applies it.
2. Scoring the Fit: the MSE Cost Function
How do you know one line is better than another? You need a single number that says "this is how wrong you are." That number is the cost function, and the standard one for regression is Mean Squared Error (MSE):
MSE = average of (actual − predicted)²
For each point you take the gap between the real value and the model's guess (the error), square it, then average across all points. Squaring forces every error to be positive (so a +5 and a −5 don't cancel) and makes big misses hurt much more than small ones. Lower MSE means a better fit. The worked example below prints each error and then the MSE.
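As a compact, runnable illustration of that recipe, the sketch below scores two candidate lines on the study-hours data. The helper name `mse` and its argument order are assumptions made for this example.

```python
# A sketch of Mean Squared Error for a line y = w*x + b.
# `mse` is an illustrative helper, not a library function.

hours = [1, 2, 3, 4, 5]          # feature (x)
scores = [52, 58, 67, 71, 79]    # target (y)

def mse(xs, ys, w, b):
    """Average of (actual - predicted) squared over all points."""
    total = 0.0
    for x, y in zip(xs, ys):
        error = y - (w * x + b)   # actual - predicted
        total += error ** 2       # squaring keeps every term positive
    return total / len(xs)

print("hand-guessed line:", mse(hours, scores, 6.5, 46.0))
print("flat line at zero:", mse(hours, scores, 0.0, 0.0))
```

For the hand-guessed line the squared errors add up to 4.75, so the MSE is 4.75 / 5 = 0.95. The flat line at zero scores in the thousands, which is exactly the kind of difference the cost function exists to expose.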
Worked Example: Fit a line and measure it with MSE
A hand-picked line, then the cost that scores it — plain Python, no numpy.
# === Fitting a line y = w*x + b, then measuring how good it is ===
# Plain Python only — no numpy needed. Data is just two lists.
# Dataset: study hours -> exam score
hours = [1, 2, 3, 4, 5] # the feature (x)
scores = [52, 58, 67, 71, 79] # the target (y)
# A "model" is just two numbers: a slope w and an intercept b.
# Read it as: "score = w * hours + b".
w = 6.5 # each extra hour adds ~6.5 points (a guess for now)
b = 46.0 # a student who studied 0 hours scores ~46 (a guess
...
🎯 Your Turn: build the MSE calculator
🎯 Your Turn: complete the MSE function
Two blanks: square the error, then average over the points.
# 🎯 YOUR TURN — finish the MSE (Mean Squared Error) calculator
# Fill in each blank marked with ___ . Plain Python, no numpy.
actual = [10, 20, 30, 40] # the real values
predicted = [12, 18, 33, 39] # a model's guesses
total = 0.0
for a, p in zip(actual, predicted):
    # 👉 1) the error is (actual - predicted); square it and add to total
    total += (a - p) ___ 2  # 👉 replace ___ with the power operator (**)
# 👉 2) MSE is the average, so divide the total by the number of poi
...
3. Gradient Descent: How the Model Learns
Guessing w and b by hand doesn't scale. Gradient descent is the algorithm that finds them automatically. Picture the MSE as a valley: every (w, b) is a spot on the hillside, and the lowest point is the best line. Gradient descent is like walking downhill blindfolded — feel which way the ground slopes, take a small step that way, and repeat.
- Gradient — the direction the cost increases fastest. You step the opposite way (downhill).
- Learning rate — how big each step is. Too big and you overshoot the valley; too small and training crawls.
- Epoch — one full pass over the data that nudges w and b once.
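The three terms above fit together in a short loop. What follows is a sketch under stated assumptions rather than a definitive implementation: it uses batch gradient descent on the MSE, where the gradient for w is (2/n) × the sum of error × x and the gradient for b is (2/n) × the sum of the errors. The epoch count of 5000 is an assumed value chosen so this small dataset has time to settle.

```python
# A sketch of batch gradient descent for y = w*x + b on the study-hours data.

hours = [1, 2, 3, 4, 5]
scores = [52, 58, 67, 71, 79]
n = len(hours)

w, b = 0.0, 0.0          # deliberately bad starting guess
learning_rate = 0.01     # step size
epochs = 5000            # assumed value: enough passes for this data to settle

for epoch in range(epochs):
    err_sum = 0.0
    err_x_sum = 0.0
    for x, y in zip(hours, scores):
        error = (w * x + b) - y      # predicted - actual
        err_sum += error
        err_x_sum += error * x
    grad_w = (2 / n) * err_x_sum     # slope of the cost with respect to w
    grad_b = (2 / n) * err_sum       # slope of the cost with respect to b
    w -= learning_rate * grad_w      # step opposite the gradient (downhill)
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))
```

With these settings the loop should settle very close to w = 6.7 and b = 45.3, the least-squares line for this data. Try cutting `epochs` to 100 to see how far from the valley floor an early stop leaves you.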
Worked Example: a tiny gradient-descent training loop
Start from a bad guess and watch w and b settle near the best line — plain Python.
# === Gradient descent: let the computer FIND w and b for us ===
# This is the core loop behind how almost every ML model learns.
# Still plain Python — only the 'lists + arithmetic' you already know.
# Same study-hours data
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 67, 71, 79]
n = len(hours)
# Start from a deliberately bad guess and let the loop fix it.
w = 0.0
b = 0.0
learning_rate = 0.01 # step size: how big a nudge we take each round
# One "epoch" = one full pass over the data that nud
...
🎯 Your Turn: complete the update step
🎯 Your Turn: nudge w and b downhill
Fill in the two parameter-update lines and recover the line y = 2x + 1.
# 🎯 YOUR TURN — complete ONE gradient-descent step
# The loop body is written for you except for the two update lines.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9] # the true line is y = 2x + 1
n = len(xs)
w = 0.0
b = 0.0
learning_rate = 0.05
for epoch in range(1000):
    err_sum = err_x_sum = 0.0
    for x, y in zip(xs, ys):
        error = (w * x + b) - y  # predicted - actual
        err_sum += error
        err_x_sum += error * x
    grad_w = (2 / n) * err_x_sum
    grad_b = (2 / n
...
4. Training vs Test: the Honesty Check
A model that scores its own homework will always claim an A. To get an honest measure of how it performs on data it has never seen, you split your data into two parts before training:
- Training set (usually ~80%) — the model learns w and b from these.
- Test set (usually ~20%) — held back, used only to score the finished model.
If the model does great on the training set but poorly on the test set, it has overfit — it memorised the specific points instead of learning the underlying trend. The test score is the number you actually trust. In real projects, train_test_split from scikit-learn does this split for you in one line.
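To show what such a split does mechanically, here is a plain-Python sketch. The function `split_train_test`, its parameters, and the seed are all hypothetical choices for this example; it only mirrors the return order of scikit-learn's train_test_split and is not that library's implementation. The data is synthetic, generated from the line y = 2x + 1.

```python
import random

def split_train_test(xs, ys, test_fraction=0.2, seed=0):
    """Shuffle the row indices, then hold back a fraction as the test set."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)        # seeded so the split is repeatable
    n_test = max(1, round(len(xs) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [xs[i] for i in train_idx]
    y_train = [ys[i] for i in train_idx]
    x_test = [xs[i] for i in test_idx]
    y_test = [ys[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Synthetic points on the line y = 2x + 1, used only to show the mechanics.
xs = list(range(1, 11))
ys = [2 * x + 1 for x in xs]

x_train, x_test, y_train, y_test = split_train_test(xs, ys)
print(len(x_train), "training points,", len(x_test), "test points")
```

Shuffling before slicing matters: if the data is sorted (by date, by size), taking the last 20% as the test set would score the model on a range it never trained on.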
5. The Real Way: scikit-learn
You now understand what happens under the hood — so in real projects you let a library do it. scikit-learn (imported as sklearn) gives you LinearRegression: create it, call .fit(X, y) to train, and .predict(...) to use it. It runs the maths far faster and more reliably than hand-rolled code.
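Read the example below as a worked walkthrough: it needs scikit-learn installed locally to run, and the comments show the expected output. The w and b it finds should closely match what your gradient-descent loop learned above. LinearRegression solves the least-squares problem directly rather than iterating, so small differences in the last decimals are normal.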
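With a single feature, the least-squares line also has a short closed-form solution, which is a useful cross-check on any fitted w and b. The sketch below works through that arithmetic in plain Python; it illustrates the textbook formula and is not a copy of scikit-learn's internal code.

```python
# Closed-form least squares for one feature:
#   w = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b = mean_y - w * mean_x

hours = [1, 2, 3, 4, 5]
scores = [52, 58, 67, 71, 79]
n = len(hours)

mean_x = sum(hours) / n
mean_y = sum(scores) / n

# How x and y vary together, and how much x varies on its own.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
s_xx = sum((x - mean_x) ** 2 for x in hours)

w = s_xy / s_xx
b = mean_y - w * mean_x

print(round(w, 2), round(b, 2))
```

For the study-hours data this works out to w = 6.7 and b = 45.3, so a well-trained gradient-descent loop and a direct solver should agree on the same line.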
Worked Example: sklearn LinearRegression (read-along)
The same fit in three lines — fit, predict, done. Expected output is in the comments.
# === The real-world way: scikit-learn does the maths for you ===
# In practice you do NOT hand-write gradient descent. You call a library.
# (This needs scikit-learn installed: pip install scikit-learn)
from sklearn.linear_model import LinearRegression
# sklearn expects X as a list of rows; each row is one sample's features.
X = [[1], [2], [3], [4], [5]] # study hours (one feature per row)
y = [52, 58, 67, 71, 79] # exam scores
model = LinearRegression() # create the (untrained)
...
Common Errors (And How to Fix Them)
❌ Not splitting the data
You report a near-perfect score measured on the same data you trained on, ship the model, and it fails on real users.
✅ Fix: hold back a test set the model never trains on, and judge it only by the test score. Use train_test_split(X, y, test_size=0.2).
❌ Features on wildly different scales
One feature is in the thousands (square feet) and another is single digits (bedrooms). Gradient descent zig-zags or diverges, and weights become hard to compare.
✅ Fix: scale features to a similar range first (e.g. StandardScaler) before training. A smaller learning rate also helps stop the cost blowing up.
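As a rough picture of what that scaling does, the sketch below rescales each feature to mean 0 and standard deviation 1 in plain Python. The helper `standardize` is a hypothetical name, and the house values are made up for illustration; in a real project you would normally reach for StandardScaler instead.

```python
def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n   # population variance
    std = variance ** 0.5   # note: would be 0 for a constant column
    return [(v - mean) / std for v in values]

# Made-up illustrative values, not real housing data.
square_feet = [850, 1200, 1500, 2100, 2600]
bedrooms = [2, 3, 3, 4, 5]

scaled_sqft = standardize(square_feet)
scaled_beds = standardize(bedrooms)

print([round(v, 2) for v in scaled_sqft])
print([round(v, 2) for v in scaled_beds])
```

After scaling, both columns sit in roughly the same range of about -2 to 2, so one learning rate suits both weights. Compute the mean and standard deviation from the training set only, then reuse them for the test set.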
❌ Assuming the relationship is linear
Your data curves, but you force a straight line through it. The MSE stays high no matter how long you train.
✅ Fix: plot the data first. If it bends, add polynomial features, transform a column (e.g. a log), or pick a model that can capture curves.
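The effect is easy to see on synthetic data. In the sketch below, the points follow y = x² exactly; the best straight line through x still leaves error behind, while a line fitted to the transformed feature x² fits perfectly. The helpers `fit_line` and `mse` are illustrative names, and swapping x for x² is a simplified stand-in for full polynomial features.

```python
def fit_line(xs, ys):
    """Closed-form least-squares line for one feature; returns (w, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    w = s_xy / s_xx
    return w, mean_y - w * mean_x

def mse(xs, ys, w, b):
    """Mean squared error of the line y = w*x + b."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]            # synthetic curved data: y = x squared

w1, b1 = fit_line(xs, ys)            # best straight line through the raw feature
mse_raw = mse(xs, ys, w1, b1)

squared = [x ** 2 for x in xs]       # transformed feature
w2, b2 = fit_line(squared, ys)       # still a line, but in terms of x squared
mse_transformed = mse(squared, ys, w2, b2)

print("straight line on x:", mse_raw)
print("line on x squared:", mse_transformed)
```

The model is still linear regression in both cases; only the input changed. That is why transforming a column is often the cheapest fix to try first.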
❌ Extrapolating far beyond the training range
You trained on houses 600–2000 sq ft, then predict a 10,000 sq ft mansion. The line keeps going forever, but reality doesn't.
✅ Fix: only trust predictions inside (or near) the range you trained on. Flag inputs far outside it as unreliable.
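One simple way to flag such inputs is a range check before predicting. The sketch below is a hypothetical guard: the function name `in_training_range`, the 10% `margin`, and the training sizes are all assumptions for illustration, and a real project might choose a different rule.

```python
def in_training_range(x, train_xs, margin=0.1):
    """True if x lies within the training range, widened by a small margin."""
    lo, hi = min(train_xs), max(train_xs)
    slack = margin * (hi - lo)       # allow predictions slightly outside
    return (lo - slack) <= x <= (hi + slack)

# Made-up training sizes spanning 600 to 2000 sq ft.
train_sizes = [600, 900, 1400, 2000]

for size in [1500, 10000]:
    if in_training_range(size, train_sizes):
        print(size, "sq ft: inside the trained range")
    else:
        print(size, "sq ft: outside the trained range, treat as unreliable")
```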
📋 Quick Reference
| Term | Formula / Code | Meaning |
|---|---|---|
| Model | y = w*x + b | A line: weight w, bias b |
| MSE (cost) | mean((y - ŷ)²) | How wrong the line is (lower is better) |
| Gradient step | w -= lr * grad_w | Nudge parameters downhill |
| Learning rate | lr = 0.01 | Step size each epoch |
| Train / test split | train_test_split(...) | Learn on one slice, score on another |
| sklearn fit | model.fit(X, y) | Finds best w and b for you |
| sklearn predict | model.predict(X) | Apply the trained line |
Frequently Asked Questions
Q: What does linear regression actually predict?
A continuous number — a value on a sliding scale rather than a category. Think house prices, exam scores, temperatures, or sales figures. If the answer you want is 'how much' or 'how many', linear regression is a sensible first tool. If the answer is a label like 'spam vs not spam' or 'cat vs dog', that is classification instead, which you'll meet in the next lesson.
Q: What is the cost function and why square the errors?
The cost function is a single number that measures how wrong the model is across all the data, and the standard choice for regression is Mean Squared Error (MSE): the average of (actual - predicted) squared. Squaring does two jobs at once — it makes every error positive so they can't cancel out, and it punishes large mistakes far more than small ones (an error of 10 contributes 100, an error of 1 contributes only 1). Training means finding the w and b that make this number as small as possible.
Q: What is gradient descent in one sentence?
It is a loop that repeatedly nudges the model's parameters a tiny step 'downhill' on the cost surface — compute which direction lowers the error (the gradient), take a small step that way (controlled by the learning rate), and repeat until the cost stops shrinking. It is the same core idea that trains neural networks, just with many more parameters.
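The "repeat until the cost stops shrinking" part can be sketched as a stopping rule. In the example below, training ends once an epoch improves the cost by less than a small `tolerance`; both `tolerance` and the `max_epochs` safety cap are assumed values picked for this toy data, which lies on the line y = 2x + 1.

```python
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]            # points on the line y = 2x + 1
n = len(xs)

def cost(w, b):
    """MSE of the line y = w*x + b on the data above."""
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / n

w, b = 0.0, 0.0
learning_rate = 0.05
tolerance = 1e-12            # assumed threshold for "stopped shrinking"
max_epochs = 10000           # assumed safety cap so the loop always ends

previous_cost = cost(w, b)
epochs_run = 0
for epoch in range(max_epochs):
    grad_w = (2 / n) * sum(((w * x + b) - y) * x for x, y in zip(xs, ys))
    grad_b = (2 / n) * sum(((w * x + b) - y) for x, y in zip(xs, ys))
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    epochs_run = epoch + 1
    current_cost = cost(w, b)
    if previous_cost - current_cost < tolerance:   # barely improving: stop
        break
    previous_cost = current_cost

print(epochs_run, round(w, 3), round(b, 3))
```

A fixed epoch count is simpler to reason about, while a tolerance avoids wasting passes once the line has effectively stopped moving.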
Q: Why do I split data into training and test sets?
Because a model that has memorised the exact answers it was trained on can look perfect and still fail on anything new. You fit (train) on one slice of the data and measure the error on a separate test slice the model never saw. If it does well on training data but badly on the test set, it has overfit: it memorised noise instead of learning the real pattern. The test score is your honest estimate of real-world performance.
Q: When is linear regression the wrong choice?
When the real relationship is not roughly a straight line. If sales rise then plateau, or price grows exponentially with size, a straight line will fit poorly no matter how you train it. Plot your data first — if it curves, you need polynomial features, a transform (like taking a log), or a different model entirely. Forcing a line onto a curve gives confident-looking but wrong predictions.
Mini-Challenge: pick the better line
No blanks this time — just a brief and a comment outline. Write the two helper functions, score both candidate lines with MSE, and print which one fits better. Check your answer against the expected result in the comments.
🎯 Mini-Challenge: compare two lines with MSE
Build predict() and mse(), score line A and line B, print the winner.
# 🎯 MINI-CHALLENGE: compare two candidate lines and pick the better one
#
# Data (temperature -> ice creams sold):
# temps = [15, 18, 20, 25, 30]
# sales = [22, 33, 40, 62, 81]
#
# 1. Write predict(x, w, b) that returns w * x + b
# 2. Write mse(w, b) that loops over the data and returns the Mean Squared Error
# 3. Score line A: w = 3.0, b = -25
# 4. Score line B: w = 3.0, b = -20
# 5. Print both MSEs and print which line fits better (lower MSE wins)
#
# ✅ Expected: line B has the lower MSE,
...
🎉 Lesson Complete!
You've built your first ML model from the ground up: you fit a line y = wx + b, scored it with the MSE cost function, wrote a gradient-descent loop that learns w and b on its own, understood why training and test sets matter, and saw how scikit-learn does it all in three lines.
🚀 Up next: Classification — instead of predicting a number on a sliding scale, you'll predict a category (spam vs not spam, cat vs dog), reusing this same predict-measure-improve loop.