Lesson 46 • Advanced
Graph Neural Networks (GNNs) 🕸️
Learn how GNNs read graph-structured data — nodes, edges, and features — by passing messages between neighbours, and build one message-passing step by hand in plain Python before meeting GCN, GraphSAGE, and GAT in PyTorch Geometric.
What You'll Learn in This Lesson
- ✓How graphs store data: nodes, edges, features, adjacency
- ✓The message-passing / neighbour-aggregation paradigm
- ✓Code one aggregation step by hand in plain Python
- ✓How GCN, GraphSAGE, and GAT (attention) differ
- ✓Node-level, edge-level, and graph-level tasks
- ✓Where GNNs win: social, molecules, recommenders
🌍 Real-World Analogy: Learning From Your Friends
Want to guess someone's politics, music taste, or whether they will repay a loan? You learn a surprising amount just from who they spend time with. "You are the average of the five people you hang out with" is folk wisdom — and it is almost exactly how a GNN thinks.
Picture a friendship network. Each person holds a few facts about themselves (their features). In one "round of gossip", everyone updates their own view by averaging in what their direct friends think. Do it again and your friends-of-friends start to influence you too. After a few rounds, your updated profile quietly encodes your whole local neighbourhood — not just you. That round of gossip is message passing, and stacking rounds is exactly what stacking GNN layers does.
1Graph Data — Nodes, Edges, Features, Adjacency
A graph is data made of nodes (the things — people, atoms, products) joined by edges (the relationships — friendships, bonds, "bought together"). Unlike a table, where every row is independent, a graph's whole point is that the connections carry meaning.
Three pieces describe any graph for a GNN:
- Node features — a vector of numbers per node (a user's age and activity; an atom's element and charge).
- Edges — which nodes are connected, sometimes with their own features (a bond type, a transaction amount).
- Adjacency — the bookkeeping of who-connects-to-whom. You can store it as an
N×Nmatrix (1 = edge, 0 = no edge) or, more simply, as a dictionary mapping each node to its list of neighbours.
🗂️ The simplest way to store a graph in code
An adjacency dict — every example below uses this:
adjacency = {
"Alice": ["Bob", "Dave"], # Alice is friends with Bob and Dave
"Bob": ["Alice", "Carol"],
# ...
}2Message Passing — The One Idea Behind Every GNN
Almost every GNN is a stack of message-passing layers, and each layer does the same three steps for every node:
- Gather the feature vectors of the node's neighbours (the "messages").
- Aggregate them into one vector — usually a mean, sometimes a sum or a max.
- Transform that aggregate with a learnable weight matrix and a non-linearity (like ReLU).
After one layer, a node's new embedding reflects its 1-hop neighbourhood. Stack a second layer and it reaches 2 hops (friends of friends); k layers reach k hops. That is the entire trick — aggregate, then transform, then repeat.
The worked example below does step 1 and step 2 in plain Python: each node's new feature is simply the average of its neighbours' features. No NumPy, no PyTorch — just a dictionary and a loop. Read every comment, then run it.
Worked Example: One Message-Passing Step (plain Python)
Each node's new feature becomes the average of its neighbours' features, using an adjacency dict
# === ONE message-passing step, in plain Python ===
# A graph is just nodes joined by edges. Store it as an "adjacency dict":
# key = a node
# value = the list of nodes it is connected to (its neighbours)
#
# Friendship graph (undirected):
# Alice -- Bob -- Carol
# | |
# Dave -- Eve --- Frank
adjacency = {
"Alice": ["Bob", "Dave"],
"Bob": ["Alice", "Carol"],
"Carol": ["Bob", "Frank"],
"Dave": ["Alice", "Eve"],
"Eve": ["Dave", "Frank"],
...Run the same averaging rule several times and watch a single signal spread across the graph — and start to flatten out. That flattening is a clue to a real GNN pitfall you will meet in Common Errors.
Worked Example: Information Spreading Over Rounds
Apply message passing three times and watch one node's signal reach the whole graph
# === Watch information spread over several message-passing rounds ===
# Same averaging rule, applied 3 times. Notice how the numbers drift TOGETHER.
adjacency = {
"Alice": ["Bob", "Dave"],
"Bob": ["Alice", "Carol"],
"Carol": ["Bob", "Frank"],
"Dave": ["Alice", "Eve"],
"Eve": ["Dave", "Frank"],
"Frank": ["Carol", "Eve"],
}
# One number per node so the output is easy to read.
state = {"Alice": 0.0, "Bob": 0.0, "Carol": 0.0, "Dave": 0.0, "Eve": 9.0, "Frank": 0.0}
# O
...3GCN, GraphSAGE, and GAT — The Three You Must Know
All three are message-passing layers. They differ in how they aggregate neighbours:
GCN
Takes a degree-normalised average of all neighbours. The simplest layer and a strong baseline — it is exactly the averaging you coded above, with a normalisation that stops high-degree nodes from dominating.
GraphSAGE
Samples a fixed number of neighbours instead of using all of them. That makes it scale to graphs with billions of edges and lets it handle nodes it never saw in training (inductive learning).
GAT
Learns an attention weight for each neighbour, so important neighbours count more. Not every friend should count equally — GAT figures out who matters, the same idea that powers Transformers.
In practice you rarely hand-code these. PyTorch Geometric (PyG) ships GCNConv, SAGEConv, and GATConv as drop-in layers. The next example is read-only PyG showing a single GCNConv doing aggregate-then-transform for you.
Worked Example: A GCN Layer in PyTorch Geometric
Read-only — the same aggregate-then-transform idea, built into a single GCNConv layer with expected output
# === The same idea in PyTorch Geometric (PyG) — the real-world way ===
# PyG ships GCN, GraphSAGE, GAT and more so you don't hand-code aggregation.
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv
# A graph in PyG: edge_index is a 2 x E tensor of [source, target] columns.
# These edges describe a tiny 4-node graph: 0-1, 1-2, 2-3 (both directions).
edge_index = torch.tensor(
[[0, 1, 1, 2, 2, 3],
[1, 0, 2, 1, 3, 2]], dtype=torch.long)
# Node fea
...4Three Task Levels — Node, Edge, and Graph
Once message passing has given every node a rich embedding, you can predict at three different levels:
| Level | Question | Example | How |
|---|---|---|---|
| Node | What is this node? | Flag a fraudulent account | Classify each node embedding |
| Edge | Is there a link? | Recommend a friend; predict a bond | Score pairs of node embeddings |
| Graph | What is the whole graph? | Is this molecule toxic? | Pool all nodes into one vector (a "readout") |
The read-only example below uses a GATConv (attention) layer and then produces all three: per-node logits, per-edge scores, and one graph-level vector from mean pooling.
Worked Example: Attention (GAT) + Node / Edge / Graph Tasks
Read-only — one GAT layer feeds node classification, edge scoring, and a graph-level readout, with expected output
# === GAT (attention) and the three GNN task levels in PyG ===
import torch
from torch_geometric.nn import GATConv, global_mean_pool
# 5 nodes, 4 features each.
x = torch.randn(5, 4)
edge_index = torch.tensor(
[[0, 1, 1, 2, 3, 4],
[1, 0, 2, 1, 4, 3]], dtype=torch.long)
# GAT learns how much ATTENTION each node pays to each neighbour,
# instead of treating all neighbours equally like GCN does.
gat = GATConv(in_channels=4, out_channels=8, heads=2) # 2 attention heads
node_embeddings =
...🚀 Where GNNs Win in the Real World
GNNs shine wherever relationships matter as much as the items themselves:
- Social networks. Detect communities, predict who will connect, and flag fake or fraudulent accounts by the company they keep.
- Molecules & drug discovery. Atoms are nodes and bonds are edges, so a molecule is a graph. GNNs predict toxicity, solubility, and binding affinity, and propose new candidate compounds.
- Recommenders. Users and items form a giant interaction graph. Pinterest's PinSage and many e-commerce systems use GNNs to power "you might also like".
- Maps & traffic. Road networks are graphs; Google Maps uses GNNs to improve arrival-time estimates.
- Knowledge graphs & fraud rings. Complete missing facts, and spot money-laundering patterns that single-account models miss entirely.
Now you try. Fill in the blanks marked ___. Gathering a node's neighbours and averaging them is message passing — get this and you have the heart of every GNN.
🎯 Your Turn 1: Finish the Aggregation Step
Collect each node's neighbours' feature vectors and average them
# 🎯 YOUR TURN — finish the message-passing step.
# Each node's new feature = the AVERAGE of its neighbours' features.
adjacency = {
"A": ["B", "C"],
"B": ["A"],
"C": ["A"],
}
features = {
"A": [2.0, 4.0],
"B": [6.0, 8.0],
"C": [10.0, 12.0],
}
def average(vectors):
n = len(vectors)
length = len(vectors[0])
return [sum(v[i] for v in vectors) / n for i in range(length)]
new_features = {}
for node, neighbours in adjacency.items():
# 👉 build a list of the
...Your turn again. Real GNN layers include each node in its own neighbourhood — a self-loop — so a node never forgets its own features. Add it here.
🎯 Your Turn 2: Add a Self-Loop Before Averaging
Include the node itself in the group, then average over itself plus its neighbours
# 🎯 YOUR TURN — aggregate over the node AND its neighbours (a self-loop).
# Pure GCN-style averaging forgets a node's own features. Fix that by
# including the node itself in the group before averaging.
adjacency = {
"A": ["B", "C"],
"B": ["A"],
"C": ["A"],
}
features = {
"A": [3.0],
"B": [6.0],
"C": [9.0],
}
new_features = {}
for node, neighbours in adjacency.items():
# 👉 start the group with the node's OWN id, then add its neighbours
group_nodes = [node] +
...5Common Errors (And How to Fix Them)
These four mistakes trip up almost everyone building their first GNN:
❌ Over-smoothing from too many layers
Stack too many message-passing layers and every node keeps averaging in more of the graph until all embeddings collapse to nearly the same vector — the model can no longer tell nodes apart and accuracy crashes.
# ❌ 8 GCN layers on a small graph → embeddings all converge
for _ in range(8):
h = gcn_layer(h, edges)✅ Fix: keep it shallow (2-4 layers); add residual / jumping-knowledge connections only if you truly need depth.
for _ in range(2): # ✅ 2-3 layers is usually plenty
h = relu(gcn_layer(h, edges))❌ Forgetting self-loops
If you aggregate only over neighbours, a node throws away its own features in every layer and loses its identity.
agg = average(neighbours[node]) # ❌ the node's own features vanish
✅ Fix: add a self-loop — include the node in its own neighbourhood. PyG's GCNConv does this automatically.
group = [node] + neighbours[node] # ✅ self-loop keeps the node's own info agg = average([features[n] for n in group])
❌ Loading a giant graph all at once
Full-batch GCN on a graph with millions of nodes builds an enormous adjacency operation and runs out of memory.
out = gcn(full_graph_x, full_edge_index) # ❌ OOM on huge graphs
✅ Fix: sample neighbourhoods in mini-batches (GraphSAGE-style) with a neighbour loader.
from torch_geometric.loader import NeighborLoader loader = NeighborLoader(data, num_neighbors=[10, 10], batch_size=512) # ✅
❌ Choosing the wrong aggregation
Mean aggregation throws away how many neighbours a node has — yet for tasks like counting substructures in molecules, degree is the signal. Mean can't distinguish "1 neighbour" from "100 identical neighbours".
agg = mean(messages) # ❌ blind to neighbour count for some tasks
✅ Fix: match the aggregator to the task — sum (GIN) preserves count and is most expressive; mean and max suit others.
from torch_geometric.nn import GINConv # ✅ sum aggregation, very expressive
📋 Quick Reference
| Concept | What it is | Use when |
|---|---|---|
Adjacency | Who connects to whom (dict or matrix) | Describing any graph's structure |
Message passing | Gather, aggregate, transform neighbours | The core of every GNN layer |
GCN | Degree-normalised mean of neighbours | Strong, simple baseline |
GraphSAGE | Sample neighbours; inductive | Huge / changing graphs |
GAT | Attention-weighted neighbours | Some neighbours matter more |
Self-loop | Node aggregates itself too | Always — keep a node's own features |
Readout / pooling | Pool nodes into one graph vector | Graph-level tasks (e.g. molecules) |
❓ Frequently Asked Questions
Q: What is a graph neural network (GNN)?
A: A GNN is a neural network that operates directly on graph-structured data — data made of nodes (entities) joined by edges (relationships), where each node and edge can carry a feature vector. Instead of assuming a fixed grid like an image or a sequence like text, a GNN learns a representation for each node by repeatedly mixing in information from its neighbours. This makes it the natural model for social networks, molecules, knowledge graphs, road networks, and recommendation graphs.
Q: What is message passing in a GNN?
A: Message passing is the core operation of almost every GNN. In each layer, every node (1) collects 'messages' — the feature vectors of its neighbours, (2) aggregates them into one vector, usually by averaging, summing, or taking a max, and (3) transforms that aggregate with a learnable weight matrix and a non-linearity. After one layer a node's embedding reflects its 1-hop neighbourhood; after k stacked layers it reflects everything within k hops. This 'aggregate, then transform' loop is what lets structure influence predictions.
Q: What is the difference between GCN, GraphSAGE, and GAT?
A: They differ mainly in how they aggregate neighbours. GCN (Graph Convolutional Network) takes a degree-normalised average of all neighbours — simple and a strong baseline. GraphSAGE samples a fixed number of neighbours instead of using all of them, which makes it scale to huge graphs and generalise to nodes never seen during training (inductive learning). GAT (Graph Attention Network) learns an attention weight for each neighbour, so important neighbours count more than unimportant ones rather than every neighbour being treated equally.
Q: What are node-level, edge-level, and graph-level tasks?
A: They are the three levels at which a GNN makes predictions. Node-level tasks label individual nodes — for example flagging fraudulent accounts in a transaction graph. Edge-level tasks predict relationships between pairs of nodes — for example link prediction ('will these two users connect?') or recommendation. Graph-level tasks produce one prediction for an entire graph — for example deciding whether a molecule is toxic. Node and edge tasks read node embeddings directly; graph-level tasks first pool all node embeddings into a single vector (a 'readout').
Q: What is over-smoothing in graph neural networks?
A: Over-smoothing is what happens when you stack too many message-passing layers: every node keeps averaging in its neighbours' neighbours' features until all the node embeddings collapse toward the same value and become indistinguishable. The model then can't tell nodes apart and accuracy drops. Most GNNs are shallow (2-4 layers) for this reason, and tricks like residual connections, jumping-knowledge layers, or PairNorm are used when you genuinely need depth.
Q: What is over-smoothing's opposite problem — and why add self-loops?
A: If you aggregate only over neighbours and forget the node itself, a node can lose its own identity in a single layer. Adding a self-loop — treating each node as one of its own neighbours — preserves its original features in the aggregate. Frameworks like PyTorch Geometric add self-loops automatically inside GCNConv, but if you implement message passing by hand you must remember to include each node in its own neighbourhood, or your embeddings will throw away useful information.
Q: What are graph neural networks used for in the real world?
A: GNNs power recommendation at Pinterest and Uber Eats, fraud and money-laundering detection in financial transaction graphs, drug discovery and molecular property prediction (atoms are nodes, bonds are edges), traffic and arrival-time estimation in Google Maps, knowledge-graph completion, and even chip-layout optimisation. Anywhere the relationships between things matter as much as the things themselves, a GNN tends to beat models that look at each item in isolation.
🎯 Mini-Challenge: Two Rounds of Message Passing
Time to fly with less scaffolding. The starter gives you a line graph and a comment outline — write the step() function yourself and apply it twice. Remember the self-loop: average each node together with its neighbours.
Mini-Challenge
Write a step() that averages each node with its neighbours, then run two rounds
# 🎯 MINI-CHALLENGE: two rounds of message passing
#
# You have a 4-node line graph: A - B - C - D.
# Each node holds a single number. Spread information with the averaging rule.
#
# 1. Write a step(adjacency, state) function that returns a NEW dict where
# each node = the average of itself AND its neighbours (include a self-loop).
# 2. Apply it TWICE, printing the state after round 1 and round 2.
#
# ✅ After round 1, A should be 1.5 (average of A=1, B=2).
# The values should drift closer
...Lesson 46 complete — you can now reason about graphs the way a GNN does!
You can describe a graph with nodes, edges, features, and an adjacency dict; you have hand-coded a message-passing step (average of neighbours) and watched information spread; you can explain how GCN, GraphSAGE, and GAT aggregate differently; you know the node-, edge-, and graph-level tasks; and you can spot the classic traps — over-smoothing, missing self-loops, scalability, and the wrong aggregator.
🚀 Up next: AutoML & NAS — let algorithms search for the best model and architecture for you.
Sign up for free to track which lessons you've completed and get learning reminders.