Lesson 2 • Beginner
Python for Machine Learning
Meet the four tools every ML project leans on — NumPy, pandas, scikit-learn, and matplotlib — and learn the one workflow (split → fit → predict → score) that ties them together.
What You'll Learn in This Lesson
- ✓ What a NumPy array is and why it beats a plain Python list for maths
- ✓ How a pandas DataFrame stores labelled, table-shaped data
- ✓ The scikit-learn pattern: fit, predict, and score on every model
- ✓ Why you split data with train_test_split before you train
- ✓ How pipelines and transformers stop data leakage
- ✓ Where matplotlib fits, and where it deliberately does not
🛠️ Real-World Analogy: A Workshop and Its Tools
Picture building a piece of furniture in a workshop. You don't reach for one magic machine — you reach for the right tool at the right step. The Python ML stack works the same way: four specialised tools, each doing one job well, used in a fixed order.
📐 NumPy — the workbench
Fast arrays and matrix maths. Every other tool stacks its work on this surface.
📊 pandas — the parts bins
Labelled tables (DataFrames) to load, clean, filter, and sort your raw materials.
🧪 scikit-learn — the power tools
The algorithms that do the cutting and joining: fit, predict, score — same handles on every one.
📈 matplotlib — the tape measure
Charts to check your work before and after — it measures and inspects, it doesn't build.
Keep this picture in mind: pandas prepares the materials, NumPy holds the numbers, scikit-learn does the work, matplotlib checks it.
1. NumPy Arrays: The Container for the Numbers
A NumPy array is a grid of numbers that all share one type, stored together in memory. That layout is why maths on an array is fast: one operation runs across every element at once — called a vectorised operation — instead of you writing a Python loop.
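To make the contrast concrete, here is a small sketch that does the same job twice: once with a hand-written loop over a list, once with a single vectorised expression. The variable names (`prices_list`, `loop_result`, `vector_result`) are illustrative choices, not anything NumPy requires.

```python
import numpy as np

prices_list = [10.0, 20.0, 30.0, 40.0]

# Plain Python: you visit each element yourself.
loop_result = []
for p in prices_list:
    loop_result.append(p + 5.0)

# NumPy: one expression covers every element of the array.
prices = np.array(prices_list)
vector_result = prices + 5.0

print(loop_result)     # [15.0, 25.0, 35.0, 45.0]
print(vector_result)   # [15. 25. 35. 45.]
print(np.array_equal(vector_result, np.array(loop_result)))  # True
```

Both versions produce the same numbers; the vectorised one is shorter, and on large arrays it is usually much faster.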
Two words you'll meet constantly: a 1D array is a vector (a single row of values), and a 2D array is a matrix (a table where rows are samples and columns are features). The array's .shape tells you how many of each.
Worked example — read the comments, every line states its result:
import numpy as np
# A NumPy array is a fast, fixed-type grid of numbers. It is the
# container every ML library passes data around in.
prices = np.array([10.0, 20.0, 30.0, 40.0]) # 1D array (a "vector")
print(prices) # Expected output: [10. 20. 30. 40.]
print(prices.shape) # Expected output: (4,) <- 4 elements, 1 axis
# Vectorised math: one operation applies to EVERY element, no loop.
print(prices * 1.2) # Expected output: [12. 24. 36. 48.]
print(prices.mean()) # Expected output: 25.0
# A 2D array is a matrix: rows = samples, columns = features.
X = np.array([[1, 2],
              [3, 4],
              [5, 6]])
print(X.shape) # Expected output: (3, 2) <- 3 samples, 2 features
print(X.sum(axis=0)) # Expected output: [ 9 12] <- sum down each column
Get comfortable with .shape and vectorised maths, and the rest of the stack stops feeling like magic.
2. pandas DataFrames: Your Data as a Table
Real data rarely arrives as a tidy grid of one type — it has named columns like age, city, and price, mixing numbers, text, and true/false. A DataFrame is pandas' answer: a labelled table you can program, like a spreadsheet with code.
The move you'll repeat in every project is splitting that table into X (the feature columns the model learns from) and y (the single target column it learns to predict). Everything after this lesson assumes you can do that split.
Worked example:
import pandas as pd
# A DataFrame is a labelled table: named columns + indexed rows.
# Think of it as a spreadsheet you can program.
df = pd.DataFrame({
    "city": ["London", "Paris", "Berlin", "Rome"],
    "temp_c": [12, 16, 9, 18],
    "rain": [True, False, True, False],
})
print(df.shape) # Expected output: (4, 3) <- 4 rows, 3 columns
print(df["temp_c"].mean()) # Expected output: 13.75
# Filter rows with a boolean condition (this is the workhorse of pandas):
warm = df[df["temp_c"] > 12]
print(warm["city"].tolist()) # Expected output: ['Paris', 'Rome']
# Split into features (X) and the label/target (y) for ML:
X = df[["temp_c", "rain"]] # the inputs the model learns FROM
y = df["city"] # the answer the model learns to predict
print(X.shape, y.shape) # Expected output: (4, 2) (4,)
Notice df[df["temp_c"] > 12]: you build a column of True/False values, then use it to keep only the matching rows. That boolean-filter trick is one of the most-used pandas operations.
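The two steps of a boolean filter are easier to see if the True/False column gets its own variable first. This is a minimal sketch using the same toy table; the names `mask` and `mild_and_rainy` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["London", "Paris", "Berlin", "Rome"],
    "temp_c": [12, 16, 9, 18],
    "rain": [True, False, True, False],
})

# Step 1: the comparison yields one True/False per row (a boolean Series).
mask = df["temp_c"] > 12
print(mask.tolist())            # [False, True, False, True]

# Step 2: indexing with the mask keeps only the rows marked True.
warm = df[mask]
print(warm["city"].tolist())    # ['Paris', 'Rome']

# Conditions combine with & (and) and | (or); wrap each one in parentheses.
mild_and_rainy = df[(df["temp_c"] > 10) & (df["rain"])]
print(mild_and_rainy["city"].tolist())   # ['London']
```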
3. The scikit-learn Pattern: fit / predict / score
scikit-learn's superpower is consistency: a decision tree, a linear model, and a support-vector machine all expose the same three methods. Learn them once and you can drive any model.
- .fit(X, y): train. Learn the patterns linking features X to labels y.
- .predict(X): use. Guess labels for new, unseen rows.
- .score(X, y): check. For a classifier, the fraction of predictions that were correct (accuracy).
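To see what those three methods mean in practice, here is a toy classifier in plain Python. `MajorityClassifier` is a hypothetical class invented for this sketch, not part of scikit-learn. It only mimics the method names, and it ignores the features entirely.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical toy model: always predicts the most common training label."""

    def fit(self, X, y):
        # "Learning" here is just remembering the most frequent label.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # One guess per input row; the features themselves are ignored.
        return [self.majority_ for _ in X]

    def score(self, X, y):
        # Accuracy: the fraction of guesses that match the true labels.
        preds = self.predict(X)
        correct = sum(1 for p, t in zip(preds, y) if p == t)
        return correct / len(y)


X_train = [[1], [2], [3], [4]]
y_train = ["dog", "dog", "cat", "dog"]
X_test = [[5], [6]]
y_test = ["dog", "cat"]

model = MajorityClassifier().fit(X_train, y_train)
print(model.predict(X_test))        # ['dog', 'dog']
print(model.score(X_test, y_test))  # 0.5
```

A real model learns something far richer in `fit`, but the calling pattern is the same shape.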
Two supporting pieces make those three honest. train_test_split holds back a slice of data the model never trains on, so the score is earned, not memorised. A Pipeline glues a transformer (something that reshapes the data, like a scaler that re-scales features) to the model, so the transformer is fitted on the training data only — never on the test data.
Worked example — the full pattern in one place:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Every scikit-learn model speaks the SAME three verbs:
# .fit(X, y) -> learn from labelled data
# .predict(X) -> guess labels for new data
# .score(X, y) -> how often it was right (accuracy, 0.0-1.0)
# 1) Split FIRST so the test set stays unseen during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 2) A Pipeline chains a transformer + a model into one object, so the
#    scaler is fitted on TRAIN ONLY and reused on test, with no leakage.
model = make_pipeline(
    StandardScaler(), # transformer: re-scales features
    LogisticRegression(), # estimator: makes the prediction
)
model.fit(X_train, y_train) # learn
preds = model.predict(X_test) # predict
print(model.score(X_test, y_test)) # Expected output: a number from 0.0 to 1.0, e.g. 0.97 (illustrative; depends on your data)
# .score() just compares preds to the true labels, which is exactly what you will
# compute by hand in the runnable exercise below.
4. Under the Hood: A Prediction Is Just Arithmetic
np.dot and .score sound advanced, but underneath they are plain arithmetic you can write without any library. Seeing that demystifies the whole stack — and it runs in the editor right now.
A single linear prediction is a dot product: multiply each feature by its weight, sum the results, and add a constant bias. That's all np.dot(features, weights) + bias does.
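As a quick illustration of that expression, the sketch below runs it on one sample. The weights and bias are made-up numbers for the example, not values learned by any model.

```python
import numpy as np

features = np.array([2.0, 3.0, 1.0])   # one sample: 3 feature values
weights = np.array([0.5, 1.0, 2.0])    # illustrative weights, not learned here
bias = 1.0

# 2.0*0.5 + 3.0*1.0 + 1.0*2.0 = 1 + 3 + 2 = 6, then add the bias.
prediction = np.dot(features, weights) + bias
print(prediction)   # 7.0
```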
Worked example — a dot product by hand:
The arithmetic behind every np.dot — no imports needed
# What does numpy actually DO under the hood? A model prediction is mostly
# a "dot product": multiply each feature by its weight, then add them up.
# Here it is in plain Python so you can see there is no magic.
features = [2.0, 3.0, 1.0] # one sample: 3 feature values
weights = [0.5, 1.0, 2.0] # what the model "learned" for each feature
bias = 1.0 # a constant added at the end
total = bias
for f, w in zip(features, weights):
    total += f * w # accumulate
...
🎯 Your Turn: Finish the Dot Product
Fill in the two blanks so the prediction comes out to 12.0
# 🎯 YOUR TURN — compute a dot product by hand (this is what np.dot does)
features = [4.0, 2.0, 5.0]
weights = [1.0, 3.0, 0.0]
bias = 2.0
total = ___ # 👉 start the running total at the bias value
for f, w in zip(features, weights):
    total = ___ # 👉 add f * w to the running total each loop
print("Prediction:", total)
# ✅ Expected output: Prediction: 12.0
# (2.0 + 4*1.0 + 2*3.0 + 5*0.0 = 2 + 4 + 6 + 0 = 12.0)
🎯 Your Turn: Score the Model by Hand
Count the correct guesses and compute accuracy — exactly what .score() does
# 🎯 YOUR TURN — score a model by hand (this is what .score() does)
# Accuracy = (number of correct guesses) / (total guesses)
predictions = ["cat", "dog", "cat", "bird", "dog"] # what the model guessed
labels = ["cat", "dog", "dog", "bird", "dog"] # the true answers
correct = 0
for pred, true in zip(predictions, labels):
    if ___: # 👉 count it only when the guess matches the label
        correct += 1
accuracy = ___ / len(labels) # 👉 divide correct guess
...
📈 Where matplotlib Fits
matplotlib (and Seaborn, which is built on top of it) is for seeing your data — it never trains a model. You reach for it at two moments: before modelling to explore patterns, and after to inspect how the model did.
import matplotlib.pyplot as plt
# BEFORE modelling: explore the data
plt.scatter(df["temp_c"], df["rain"]) # do features relate to the target?
plt.hist(df["temp_c"]) # what does the spread look like?
# AFTER modelling: inspect the results
plt.scatter(y_test, preds) # predicted vs actual: close = good
plt.show() # render the figure on screen
Keep the boundary clear: pandas/NumPy hold the data, scikit-learn models it, matplotlib pictures it. A chart never changes a prediction; it changes your understanding.
Common Errors (And How to Fix Them)
These four trip up almost every beginner. Spotting them early saves hours.
❌ Fitting on the test data (data leakage)
Scaling or fitting using the whole dataset before splitting:
scaler.fit(X) # ❌ sees the test rows too
X_train, X_test = train_test_split(X)
✅ Fix: split first, then fit on train only (a Pipeline does this for you):
X_train, X_test = train_test_split(X)
scaler.fit(X_train) # ✅ test set stays unseen
❌ Not splitting at all
Scoring on the same rows the model trained on:
model.fit(X, y)
model.score(X, y) # ❌ ~1.0, because it just memorised the answers
✅ Fix: always evaluate on held-out data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
model.score(X_test, y_test) # ✅ an honest score
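An extreme case shows why a score on the training rows can't be trusted. The sketch below uses a hypothetical "model" that is nothing more than a lookup table of the rows it was trained on. The data and the fallback guess are made up for illustration.

```python
# A hypothetical "memoriser": it stores every training row and its label.
train_rows = [(1, "fail"), (2, "fail"), (8, "pass"), (9, "pass")]
test_rows = [(3, "fail"), (7, "pass"), (10, "pass")]

memory = {x: label for x, label in train_rows}


def predict(x):
    # Known row: return the stored answer. Unknown row: fall back to "fail".
    return memory.get(x, "fail")


def accuracy(rows):
    correct = sum(1 for x, label in rows if predict(x) == label)
    return correct / len(rows)


print(accuracy(train_rows))  # 1.0, since it has seen every one of these rows
print(accuracy(test_rows))   # only 1 of the 3 unseen rows is right
```

The perfect training score says nothing about new data; only the held-out rows reveal how little was actually learned.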
❌ Shape mismatch
ValueError: Found input variables with inconsistent numbers of samples means X and y have different lengths. A related ValueError (Expected 2D array, got 1D array instead) appears when a 1D array is passed where 2D was expected:
model.fit(X, y) # ❌ X has 100 rows, y has 90
✅ Fix: check shapes line up, and reshape a single feature to 2D:
print(X.shape, y.shape) # must share the same first number
X = X.reshape(-1, 1) # ✅ make one feature column 2D (X is a 1D NumPy array here)
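Here is a small sketch of what that reshape does to the shape. The `ages` and `labels` arrays are made-up data for illustration.

```python
import numpy as np

ages = np.array([22, 35, 47, 51])   # one feature, stored as a 1D array
print(ages.shape)                   # (4,)

# -1 means "work this dimension out for me"; 1 means "one column".
ages_2d = ages.reshape(-1, 1)
print(ages_2d.shape)                # (4, 1)

labels = np.array([0, 0, 1, 1])
# The first number of each shape is the sample count, and the two must match.
print(ages_2d.shape[0] == labels.shape[0])   # True
```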
❌ Scaling after the split, but fitting the scaler on test too
You split correctly, then re-fit the scaler on the test set:
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.fit_transform(X_test) # ❌ refits on test
✅ Fix: fit once on train; only transform the test set:
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test) # ✅ same parameters, no leakage
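To see why refitting on the test set matters, here is a hand-rolled stand-in for a standard scaler in plain Python. `fit_scaler` and `transform` are hypothetical helpers written for this sketch, not scikit-learn's API, and the numbers are made up.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    # "Fitting" means recording the mean and the (population) standard
    # deviation of whatever data the scaler is shown.
    return mean(values), pstdev(values)


def transform(values, params):
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


train = [10.0, 20.0, 30.0]
test = [40.0, 50.0]

train_params = fit_scaler(train)          # learned from train only
print(transform(test, train_params))      # test values on the TRAIN scale

wrong_params = fit_scaler(test)           # the mistake: refit on test
print(transform(test, wrong_params))      # [-1.0, 1.0]
```

With the training parameters, the test values come out at roughly 2.45 and 3.67, which correctly shows they sit well above the training range. Refitting on the test set squashes them to -1.0 and 1.0, so the model would see inputs on a different scale from the one it was trained on.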
📋 Quick Reference
📐 NumPy
| Call | Does |
|---|---|
| np.array([1,2,3]) | Make an array (vector / matrix) |
| arr.shape | Rows & columns: (samples, features) |
| arr * 2, arr + arr2 | Vectorised maths, no loop |
| np.dot(a, b) | Dot / matrix product |
| arr.mean(), arr.std() | Summary statistics |
📊 pandas
| Call | Does |
|---|---|
| pd.DataFrame(data) | Build a labelled table |
| df["col"] | Select one column (a Series) |
| df[df["x"] > 0] | Filter rows by a condition |
| df.groupby("g").mean() | Aggregate by group |
| df.describe() | Quick stats for every column |
🧪 scikit-learn
| Call | Does |
|---|---|
| train_test_split(X, y) | Hold back a test set |
| model.fit(X_train, y_train) | Train the model |
| model.predict(X_test) | Guess labels for new data |
| model.score(X_test, y_test) | Accuracy on held-out data |
| make_pipeline(scaler, model) | Chain transformer + model, no leakage |
❓ Frequently Asked Questions
Q: Why use NumPy arrays instead of plain Python lists?
A: NumPy stores numbers in a contiguous, single-type block of memory and runs maths in fast compiled C, so vectorised operations like arr * 2 are often 10-50x faster than a Python loop. It also adds the shape, broadcasting, and matrix maths that pandas and scikit-learn are built on top of.
Q: What is the difference between fit, predict, and score in scikit-learn?
A: fit(X, y) trains the model by learning patterns from labelled data. predict(X) uses the trained model to guess labels for new, unseen inputs. score(X, y) measures how good those guesses are — for classifiers it returns accuracy (the fraction of correct predictions), a number from 0.0 to 1.0.
Q: Why do I have to split data into train and test sets?
A: If you evaluate a model on the same rows it learned from, it can simply memorise them and look perfect while failing on real data. train_test_split holds back a portion (commonly 20%) that the model never sees during training, so the score reflects how it performs on genuinely new examples.
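The idea of holding back a portion can be sketched in a few lines of plain Python. This is a rough illustration of the concept, not scikit-learn's actual implementation; the 10 integer "rows" and the seed are arbitrary choices.

```python
import random

rows = list(range(10))          # stand-ins for 10 data rows

rng = random.Random(42)         # fixed seed so the shuffle is repeatable
shuffled = rows[:]
rng.shuffle(shuffled)

n_test = int(len(shuffled) * 0.2)   # hold back 20%
test_rows = shuffled[:n_test]
train_rows = shuffled[n_test:]

print(len(train_rows), len(test_rows))      # 8 2
print(set(train_rows) & set(test_rows))     # set(), so no row is in both
```

Shuffling before slicing matters: if the data is sorted (say, by date or by label), taking the first 20% without shuffling would give a test set that doesn't resemble the rest.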
Q: What is data leakage and how does a Pipeline prevent it?
A: Leakage is when information from the test set sneaks into training — most commonly by fitting a scaler on the whole dataset before splitting. A scikit-learn Pipeline fits every transformer on the training fold only and reuses those exact parameters on the test fold, so the test set stays truly unseen.
Q: Where does matplotlib fit into the machine-learning workflow?
A: Matplotlib (and Seaborn, built on it) is for visualising, not modelling. You use it before training to explore distributions and relationships, and after training to inspect predictions, errors, and feature importance. It never touches the model itself — it just helps you understand the data and results.
🎯 Mini-Challenge: Train, Predict, Score (Plain Python)
Put the whole workflow together — no libraries. Your "model" is a simple threshold rule, and you'll score it by hand, exactly the way scikit-learn's .score() works. The starter below is a comment outline only.
Build predictions from a threshold, then compute accuracy
# 🎯 MINI-CHALLENGE: a tiny train / predict / score loop in plain Python
#
# A "model" here is just a threshold: predict "pass" if score >= 50, else "fail".
#
# 1. Make a list 'scores' = [80, 45, 60, 30, 95] and a matching list of true
# 'labels' = ["pass", "fail", "pass", "fail", "pass"].
# 2. Loop through scores and build a 'predictions' list using the threshold 50.
# 3. Count how many predictions match the labels.
# 4. Print the accuracy (correct / total).
#
# ✅ Expected output: Accuracy:
...
Lesson 2 complete: you know the ML toolkit and the workflow!
You can describe a NumPy array and its shape, split a DataFrame into X and y, drive any scikit-learn model with fit / predict / score, split data to avoid leakage, and place matplotlib correctly in the workflow. You even computed a prediction and an accuracy by hand, so none of it is a black box.
🚀 Up next: Data Preprocessing — turn messy, real-world data into clean features a model can actually learn from.