Skip to main content
    Courses/AI & ML/Graph Neural Networks

    Lesson 46 • Advanced

    Graph Neural Networks (GNNs) 🕸️

    Learn how GNNs read graph-structured data — nodes, edges, and features — by passing messages between neighbours, and build one message-passing step by hand in plain Python before meeting GCN, GraphSAGE, and GAT in PyTorch Geometric.

    What You'll Learn in This Lesson

    • How graphs store data: nodes, edges, features, adjacency
    • The message-passing / neighbour-aggregation paradigm
    • Code one aggregation step by hand in plain Python
    • How GCN, GraphSAGE, and GAT (attention) differ
    • Node-level, edge-level, and graph-level tasks
    • Where GNNs win: social, molecules, recommenders

    🌍 Real-World Analogy: Learning From Your Friends

    Want to guess someone's politics, music taste, or whether they will repay a loan? You learn a surprising amount just from who they spend time with. "You are the average of the five people you hang out with" is folk wisdom — and it is almost exactly how a GNN thinks.

    Picture a friendship network. Each person holds a few facts about themselves (their features). In one "round of gossip", everyone updates their own view by averaging in what their direct friends think. Do it again and your friends-of-friends start to influence you too. After a few rounds, your updated profile quietly encodes your whole local neighbourhood — not just you. That round of gossip is message passing, and stacking rounds is exactly what stacking GNN layers does.

    1Graph Data — Nodes, Edges, Features, Adjacency

    A graph is data made of nodes (the things — people, atoms, products) joined by edges (the relationships — friendships, bonds, "bought together"). Unlike a table, where every row is independent, a graph's whole point is that the connections carry meaning.

    Three pieces describe any graph for a GNN:

    • Node features — a vector of numbers per node (a user's age and activity; an atom's element and charge).
    • Edges — which nodes are connected, sometimes with their own features (a bond type, a transaction amount).
    • Adjacency — the bookkeeping of who-connects-to-whom. You can store it as an N×N matrix (1 = edge, 0 = no edge) or, more simply, as a dictionary mapping each node to its list of neighbours.

    🗂️ The simplest way to store a graph in code

    An adjacency dict — every example below uses this:

    adjacency = {
        "Alice": ["Bob", "Dave"],   # Alice is friends with Bob and Dave
        "Bob":   ["Alice", "Carol"],
        # ...
    }

    2Message Passing — The One Idea Behind Every GNN

    Almost every GNN is a stack of message-passing layers, and each layer does the same three steps for every node:

    1. Gather the feature vectors of the node's neighbours (the "messages").
    2. Aggregate them into one vector — usually a mean, sometimes a sum or a max.
    3. Transform that aggregate with a learnable weight matrix and a non-linearity (like ReLU).

    After one layer, a node's new embedding reflects its 1-hop neighbourhood. Stack a second layer and it reaches 2 hops (friends of friends); k layers reach k hops. That is the entire trick — aggregate, then transform, then repeat.

    The worked example below does step 1 and step 2 in plain Python: each node's new feature is simply the average of its neighbours' features. No NumPy, no PyTorch — just a dictionary and a loop. Read every comment, then run it.

    Worked Example: One Message-Passing Step (plain Python)

    Each node's new feature becomes the average of its neighbours' features, using an adjacency dict

    Try it Yourself »
    Python
    # === ONE message-passing step, in plain Python ===
    # A graph is just nodes joined by edges. Store it as an "adjacency dict":
    #   key   = a node
    #   value = the list of nodes it is connected to (its neighbours)
    #
    # Friendship graph (undirected):
    #     Alice -- Bob -- Carol
    #       |               |
    #     Dave -- Eve --- Frank
    
    adjacency = {
        "Alice": ["Bob", "Dave"],
        "Bob":   ["Alice", "Carol"],
        "Carol": ["Bob", "Frank"],
        "Dave":  ["Alice", "Eve"],
        "Eve":   ["Dave", "Frank"],
    
    ...

    Run the same averaging rule several times and watch a single signal spread across the graph — and start to flatten out. That flattening is a clue to a real GNN pitfall you will meet in Common Errors.

    Worked Example: Information Spreading Over Rounds

    Apply message passing three times and watch one node's signal reach the whole graph

    Try it Yourself »
    Python
    # === Watch information spread over several message-passing rounds ===
    # Same averaging rule, applied 3 times. Notice how the numbers drift TOGETHER.
    
    adjacency = {
        "Alice": ["Bob", "Dave"],
        "Bob":   ["Alice", "Carol"],
        "Carol": ["Bob", "Frank"],
        "Dave":  ["Alice", "Eve"],
        "Eve":   ["Dave", "Frank"],
        "Frank": ["Carol", "Eve"],
    }
    
    # One number per node so the output is easy to read.
    state = {"Alice": 0.0, "Bob": 0.0, "Carol": 0.0, "Dave": 0.0, "Eve": 9.0, "Frank": 0.0}
    # O
    ...

    3GCN, GraphSAGE, and GAT — The Three You Must Know

    All three are message-passing layers. They differ in how they aggregate neighbours:

    GCN

    Takes a degree-normalised average of all neighbours. The simplest layer and a strong baseline — it is exactly the averaging you coded above, with a normalisation that stops high-degree nodes from dominating.

    GraphSAGE

    Samples a fixed number of neighbours instead of using all of them. That makes it scale to graphs with billions of edges and lets it handle nodes it never saw in training (inductive learning).

    GAT

    Learns an attention weight for each neighbour, so important neighbours count more. Not every friend should count equally — GAT figures out who matters, the same idea that powers Transformers.

    In practice you rarely hand-code these. PyTorch Geometric (PyG) ships GCNConv, SAGEConv, and GATConv as drop-in layers. The next example is read-only PyG showing a single GCNConv doing aggregate-then-transform for you.

    Worked Example: A GCN Layer in PyTorch Geometric

    Read-only — the same aggregate-then-transform idea, built into a single GCNConv layer with expected output

    Try it Yourself »
    Python
    # === The same idea in PyTorch Geometric (PyG) — the real-world way ===
    # PyG ships GCN, GraphSAGE, GAT and more so you don't hand-code aggregation.
    import torch
    from torch_geometric.data import Data
    from torch_geometric.nn import GCNConv
    
    # A graph in PyG: edge_index is a 2 x E tensor of [source, target] columns.
    # These edges describe a tiny 4-node graph: 0-1, 1-2, 2-3 (both directions).
    edge_index = torch.tensor(
        [[0, 1, 1, 2, 2, 3],
         [1, 0, 2, 1, 3, 2]], dtype=torch.long)
    
    # Node fea
    ...

    4Three Task Levels — Node, Edge, and Graph

    Once message passing has given every node a rich embedding, you can predict at three different levels:

    LevelQuestionExampleHow
    NodeWhat is this node?Flag a fraudulent accountClassify each node embedding
    EdgeIs there a link?Recommend a friend; predict a bondScore pairs of node embeddings
    GraphWhat is the whole graph?Is this molecule toxic?Pool all nodes into one vector (a "readout")

    The read-only example below uses a GATConv (attention) layer and then produces all three: per-node logits, per-edge scores, and one graph-level vector from mean pooling.

    Worked Example: Attention (GAT) + Node / Edge / Graph Tasks

    Read-only — one GAT layer feeds node classification, edge scoring, and a graph-level readout, with expected output

    Try it Yourself »
    Python
    # === GAT (attention) and the three GNN task levels in PyG ===
    import torch
    from torch_geometric.nn import GATConv, global_mean_pool
    
    # 5 nodes, 4 features each.
    x = torch.randn(5, 4)
    edge_index = torch.tensor(
        [[0, 1, 1, 2, 3, 4],
         [1, 0, 2, 1, 4, 3]], dtype=torch.long)
    
    # GAT learns how much ATTENTION each node pays to each neighbour,
    # instead of treating all neighbours equally like GCN does.
    gat = GATConv(in_channels=4, out_channels=8, heads=2)   # 2 attention heads
    node_embeddings =
    ...

    🚀 Where GNNs Win in the Real World

    GNNs shine wherever relationships matter as much as the items themselves:

    • Social networks. Detect communities, predict who will connect, and flag fake or fraudulent accounts by the company they keep.
    • Molecules & drug discovery. Atoms are nodes and bonds are edges, so a molecule is a graph. GNNs predict toxicity, solubility, and binding affinity, and propose new candidate compounds.
    • Recommenders. Users and items form a giant interaction graph. Pinterest's PinSage and many e-commerce systems use GNNs to power "you might also like".
    • Maps & traffic. Road networks are graphs; Google Maps uses GNNs to improve arrival-time estimates.
    • Knowledge graphs & fraud rings. Complete missing facts, and spot money-laundering patterns that single-account models miss entirely.

    Now you try. Fill in the blanks marked ___. Gathering a node's neighbours and averaging them is message passing — get this and you have the heart of every GNN.

    🎯 Your Turn 1: Finish the Aggregation Step

    Collect each node's neighbours' feature vectors and average them

    Try it Yourself »
    Python
    # 🎯 YOUR TURN — finish the message-passing step.
    # Each node's new feature = the AVERAGE of its neighbours' features.
    
    adjacency = {
        "A": ["B", "C"],
        "B": ["A"],
        "C": ["A"],
    }
    
    features = {
        "A": [2.0, 4.0],
        "B": [6.0, 8.0],
        "C": [10.0, 12.0],
    }
    
    def average(vectors):
        n = len(vectors)
        length = len(vectors[0])
        return [sum(v[i] for v in vectors) / n for i in range(length)]
    
    new_features = {}
    for node, neighbours in adjacency.items():
        # 👉 build a list of the
    ...

    Your turn again. Real GNN layers include each node in its own neighbourhood — a self-loop — so a node never forgets its own features. Add it here.

    🎯 Your Turn 2: Add a Self-Loop Before Averaging

    Include the node itself in the group, then average over itself plus its neighbours

    Try it Yourself »
    Python
    # 🎯 YOUR TURN — aggregate over the node AND its neighbours (a self-loop).
    # Pure GCN-style averaging forgets a node's own features. Fix that by
    # including the node itself in the group before averaging.
    
    adjacency = {
        "A": ["B", "C"],
        "B": ["A"],
        "C": ["A"],
    }
    
    features = {
        "A": [3.0],
        "B": [6.0],
        "C": [9.0],
    }
    
    new_features = {}
    for node, neighbours in adjacency.items():
        # 👉 start the group with the node's OWN id, then add its neighbours
        group_nodes = [node] + 
    ...

    5Common Errors (And How to Fix Them)

    These four mistakes trip up almost everyone building their first GNN:

    ❌ Over-smoothing from too many layers

    Stack too many message-passing layers and every node keeps averaging in more of the graph until all embeddings collapse to nearly the same vector — the model can no longer tell nodes apart and accuracy crashes.

    # ❌ 8 GCN layers on a small graph → embeddings all converge
    for _ in range(8):
        h = gcn_layer(h, edges)

    ✅ Fix: keep it shallow (2-4 layers); add residual / jumping-knowledge connections only if you truly need depth.

    for _ in range(2):                 # ✅ 2-3 layers is usually plenty
        h = relu(gcn_layer(h, edges))

    ❌ Forgetting self-loops

    If you aggregate only over neighbours, a node throws away its own features in every layer and loses its identity.

    agg = average(neighbours[node])    # ❌ the node's own features vanish

    ✅ Fix: add a self-loop — include the node in its own neighbourhood. PyG's GCNConv does this automatically.

    group = [node] + neighbours[node]  # ✅ self-loop keeps the node's own info
    agg = average([features[n] for n in group])

    ❌ Loading a giant graph all at once

    Full-batch GCN on a graph with millions of nodes builds an enormous adjacency operation and runs out of memory.

    out = gcn(full_graph_x, full_edge_index)   # ❌ OOM on huge graphs

    ✅ Fix: sample neighbourhoods in mini-batches (GraphSAGE-style) with a neighbour loader.

    from torch_geometric.loader import NeighborLoader
    loader = NeighborLoader(data, num_neighbors=[10, 10], batch_size=512)  # ✅

    ❌ Choosing the wrong aggregation

    Mean aggregation throws away how many neighbours a node has — yet for tasks like counting substructures in molecules, degree is the signal. Mean can't distinguish "1 neighbour" from "100 identical neighbours".

    agg = mean(messages)   # ❌ blind to neighbour count for some tasks

    ✅ Fix: match the aggregator to the task — sum (GIN) preserves count and is most expressive; mean and max suit others.

    from torch_geometric.nn import GINConv   # ✅ sum aggregation, very expressive

    📋 Quick Reference

    ConceptWhat it isUse when
    AdjacencyWho connects to whom (dict or matrix)Describing any graph's structure
    Message passingGather, aggregate, transform neighboursThe core of every GNN layer
    GCNDegree-normalised mean of neighboursStrong, simple baseline
    GraphSAGESample neighbours; inductiveHuge / changing graphs
    GATAttention-weighted neighboursSome neighbours matter more
    Self-loopNode aggregates itself tooAlways — keep a node's own features
    Readout / poolingPool nodes into one graph vectorGraph-level tasks (e.g. molecules)

    ❓ Frequently Asked Questions

    Q: What is a graph neural network (GNN)?

    A: A GNN is a neural network that operates directly on graph-structured data — data made of nodes (entities) joined by edges (relationships), where each node and edge can carry a feature vector. Instead of assuming a fixed grid like an image or a sequence like text, a GNN learns a representation for each node by repeatedly mixing in information from its neighbours. This makes it the natural model for social networks, molecules, knowledge graphs, road networks, and recommendation graphs.

    Q: What is message passing in a GNN?

    A: Message passing is the core operation of almost every GNN. In each layer, every node (1) collects 'messages' — the feature vectors of its neighbours, (2) aggregates them into one vector, usually by averaging, summing, or taking a max, and (3) transforms that aggregate with a learnable weight matrix and a non-linearity. After one layer a node's embedding reflects its 1-hop neighbourhood; after k stacked layers it reflects everything within k hops. This 'aggregate, then transform' loop is what lets structure influence predictions.

    Q: What is the difference between GCN, GraphSAGE, and GAT?

    A: They differ mainly in how they aggregate neighbours. GCN (Graph Convolutional Network) takes a degree-normalised average of all neighbours — simple and a strong baseline. GraphSAGE samples a fixed number of neighbours instead of using all of them, which makes it scale to huge graphs and generalise to nodes never seen during training (inductive learning). GAT (Graph Attention Network) learns an attention weight for each neighbour, so important neighbours count more than unimportant ones rather than every neighbour being treated equally.

    Q: What are node-level, edge-level, and graph-level tasks?

    A: They are the three levels at which a GNN makes predictions. Node-level tasks label individual nodes — for example flagging fraudulent accounts in a transaction graph. Edge-level tasks predict relationships between pairs of nodes — for example link prediction ('will these two users connect?') or recommendation. Graph-level tasks produce one prediction for an entire graph — for example deciding whether a molecule is toxic. Node and edge tasks read node embeddings directly; graph-level tasks first pool all node embeddings into a single vector (a 'readout').

    Q: What is over-smoothing in graph neural networks?

    A: Over-smoothing is what happens when you stack too many message-passing layers: every node keeps averaging in its neighbours' neighbours' features until all the node embeddings collapse toward the same value and become indistinguishable. The model then can't tell nodes apart and accuracy drops. Most GNNs are shallow (2-4 layers) for this reason, and tricks like residual connections, jumping-knowledge layers, or PairNorm are used when you genuinely need depth.

    Q: What is over-smoothing's opposite problem — and why add self-loops?

    A: If you aggregate only over neighbours and forget the node itself, a node can lose its own identity in a single layer. Adding a self-loop — treating each node as one of its own neighbours — preserves its original features in the aggregate. Frameworks like PyTorch Geometric add self-loops automatically inside GCNConv, but if you implement message passing by hand you must remember to include each node in its own neighbourhood, or your embeddings will throw away useful information.

    Q: What are graph neural networks used for in the real world?

    A: GNNs power recommendation at Pinterest and Uber Eats, fraud and money-laundering detection in financial transaction graphs, drug discovery and molecular property prediction (atoms are nodes, bonds are edges), traffic and arrival-time estimation in Google Maps, knowledge-graph completion, and even chip-layout optimisation. Anywhere the relationships between things matter as much as the things themselves, a GNN tends to beat models that look at each item in isolation.

    🎯 Mini-Challenge: Two Rounds of Message Passing

    Time to fly with less scaffolding. The starter gives you a line graph and a comment outline — write the step() function yourself and apply it twice. Remember the self-loop: average each node together with its neighbours.

    Mini-Challenge

    Write a step() that averages each node with its neighbours, then run two rounds

    Try it Yourself »
    Python
    # 🎯 MINI-CHALLENGE: two rounds of message passing
    #
    # You have a 4-node line graph: A - B - C - D.
    # Each node holds a single number. Spread information with the averaging rule.
    #
    # 1. Write a step(adjacency, state) function that returns a NEW dict where
    #    each node = the average of itself AND its neighbours (include a self-loop).
    # 2. Apply it TWICE, printing the state after round 1 and round 2.
    #
    # ✅ After round 1, A should be 1.5 (average of A=1, B=2).
    #    The values should drift closer 
    ...
    🎉

    Lesson 46 complete — you can now reason about graphs the way a GNN does!

    You can describe a graph with nodes, edges, features, and an adjacency dict; you have hand-coded a message-passing step (average of neighbours) and watched information spread; you can explain how GCN, GraphSAGE, and GAT aggregate differently; you know the node-, edge-, and graph-level tasks; and you can spot the classic traps — over-smoothing, missing self-loops, scalability, and the wrong aggregator.

    🚀 Up next: AutoML & NAS — let algorithms search for the best model and architecture for you.

    Sign up for free to track which lessons you've completed and get learning reminders.

    Previous

    Cookie & Privacy Settings

    We use cookies to improve your experience, analyze traffic, and show personalized ads. You can manage your preferences below.

    By clicking "Accept All", you consent to our use of cookies for analytics and personalized advertising. You can customize your preferences or reject non-essential cookies.

    Privacy PolicyTerms of Service