Advanced Track
Full-Text Search & Fuzzy Matching
By the end of this lesson you'll be able to build real, scalable search: index your text, find documents by meaning (not just exact characters), rank the hits by relevance, and tolerate typos. You'll see exactly why LIKE '%word%' falls apart — and what professionals reach for instead.
What You'll Learn
- ✓Why LIKE '%...%' can't scale or rank results
- ✓PostgreSQL tsvector, tsquery, and the @@ match operator
- ✓Boolean & natural-language search (AND/OR/NOT, phrases)
- ✓Rank results by relevance with ts_rank
- ✓Make it fast with a GIN index and weighted vectors
- ✓MySQL MATCH() AGAINST() plus fuzzy matching for typos
Our Sample Table: articles
Every query in this lesson searches this little articles table. The body column is shown shortened, but full-text search reads the whole thing.
Result:
| id | title | body (excerpt) |
|---|---|---|
| 1 | Getting Started with Databases | A database stores rows in tables… |
| 2 | Indexing Strategies for Big Tables | An index makes lookups fast; database tuning… |
| 3 | Normalisation Explained | Split data to remove duplication… |
| 4 | When a B-tree Index Beats a Full Scan | Index performance and query speed… |
| 5 | A Gentle Intro to NoSQL | Document stores trade joins for speed… |
1. Why Not Just LIKE '%word%'?
You already know how to filter text with LIKE. So why does every serious app use something else for search? Because a leading-wildcard LIKE '%word%' has three problems that get worse as your data grows.
📚 Real-world analogy
LIKE '%word%' is reading every book in the library cover to cover to find a word. Full-text search is the index at the back of each book that a librarian built ahead of time — ask for "database optimization" and she pulls the right books instantly, sorted by how relevant they are.
The LIKE approach (and why it hurts)
A full table scan that can't rank or stem.
-- The "obvious" way to search text: LIKE with wildcards
SELECT id, title
FROM articles
WHERE body LIKE '%database%';
-- This works on tiny tables, but it has three deal-breakers:
-- 1. It can't use a normal index, so it reads EVERY row
-- and scans EVERY character — a full table scan.
-- 2. It can't rank results. A title hit and a hit buried
-- in paragraph 90 come back equally "matched".
-- 3. It's literal: 'database' won't match 'databases',
-- 'Database', or a search fo
...The fix is to pre-process text into a searchable form once, index it, and then search the index. That's what full-text search does.
2. PostgreSQL: tsvector, tsquery & @@
PostgreSQL splits search into two data types. A tsvector is the document side: your text broken into normalised root words (called lexemes). A tsquery is the search side: the words you're looking for. The @@ operator returns true when the document matches the query.
to_tsvector('english', …) reduces every word to its root (running, ran, runs → run) and throws away ultra-common stop-words (the, is, a, of) that carry no search value. That's why a search for databases still finds the word database — both stem to databas.to_tsvector — the document side
Turn text into stemmed, stop-word-free lexemes.
-- to_tsvector turns text into a tsvector: a sorted list of
-- normalised "lexemes" (root words) with their positions.
SELECT to_tsvector('english',
'PostgreSQL is a powerful open-source database system');
-- Result (one tsvector value):
-- 'databas':6 'open-sourc':4 'postgresql':1 'power':3
-- 'sourc':5 'system':7
-- Notice what happened:
-- • 'powerful' became 'power' (stemming)
-- • 'database' became 'databas' (stemming)
-- • 'is' and 'a' vanished (stop words r
...Result — one tsvector value:
| to_tsvector |
|---|
| 'databas':6 'open-sourc':4 'postgresql':1 'power':3 'sourc':5 'system':7 |
to_tsquery & the @@ operator
Build the search side and match it.
-- to_tsquery builds the SEARCH side. It is also stemmed,
-- so your search word matches every form of the word.
SELECT to_tsquery('english', 'powerful & databases');
-- Result: 'power' & 'databas'
-- The @@ operator asks "does this document match this query?"
SELECT to_tsvector('english', 'PostgreSQL is a powerful database')
@@ to_tsquery('english', 'powerful & databases');
-- Result: true
-- 't' even though the text says "database" (singular) and
-- the query said "databases" (pl
...Result — the @@ match returns true:
| ?column? |
|---|
| t |
Your Turn: complete the match
Fill in the blanks so the query finds every article whose title or body mentions index. Hint: use the friendly query builder for a single plain word. The expected result is in the comments so you can check yourself.
🎯 Your Turn: tsvector @@ tsquery
Complete the to_tsquery call to match documents.
-- 🎯 YOUR TURN — fill in the two blanks, then press "Try it Yourself"
-- Goal: return every article whose title OR body mentions "index".
SELECT id, title
FROM articles
WHERE to_tsvector('english', title || ' ' || body)
@@ ___('english', ___); -- 👉 1) the query builder for a plain word
-- 2) the search word, in single quotes
-- ✅ Expected result: rows 2 and 4
-- 2 | Indexing Strategies for Big Tables
-- 4 | When a B-tree Index Beats a Full Sca
...3. Boolean & Phrase Search
Beyond single words, to_tsquery understands a small operator language so users can be precise: combine terms, exclude terms, or demand an exact phrase.
AND / OR / NOT / phrase
The & | ! and <-> operators.
-- tsquery has a small operator language:
-- & AND (both terms must appear)
-- | OR (either term)
-- ! NOT (exclude a term)
-- <-> FOLLOWED BY (phrase: the words are adjacent)
-- Articles about databases but NOT about MySQL:
SELECT id, title FROM articles
WHERE to_tsvector('english', body)
@@ to_tsquery('english', 'database & !mysql');
-- The exact phrase "full text search":
SELECT id, title FROM articles
WHERE to_tsvector('english', body)
@@ to_tsquery('english'
...4. Ranking Results with ts_rank
The @@ operator only answers yes/no — but real search shows the best result first. ts_rank(document, query) returns a relevance score (higher is better) based on how often the terms appear and where. Sort by it with ORDER BY rank DESC.
Listing the query once in the FROM clause (plainto_tsquery(…) AS query) lets you reuse it in both the SELECT and the WHERE without building it twice. Prefer ts_rank_cd when how close the words sit to each other matters.
ts_rank — relevance scoring
Score and sort matches like a search engine.
-- @@ only answers yes/no. ts_rank gives a relevance SCORE
-- (higher = more relevant) so you can ORDER results like Google.
SELECT id, title,
ts_rank(to_tsvector('english', title || ' ' || body), query) AS rank
FROM articles,
plainto_tsquery('english', 'database performance') AS query -- run query once
WHERE to_tsvector('english', title || ' ' || body) @@ query
ORDER BY rank DESC;
-- ts_rank scores on term frequency and position.
-- ts_rank_cd ("cover density") also rewards terms
...Result — ordered by relevance, best first:
| id | title | rank |
|---|---|---|
| 2 | Indexing Strategies for Big Tables | 0.0607 |
| 4 | When a B-tree Index Beats a Full Scan | 0.0304 |
| 1 | Getting Started with Databases | 0.0152 |
Your Turn: add relevance ranking
The WHERE already finds the matches. Add the scoring function and the sort direction so the most relevant article comes back first.
🎯 Your Turn: rank & order
Compute the score and order best-first.
-- 🎯 YOUR TURN — add relevance ranking to this search.
-- The WHERE already finds the matches; you must SCORE and SORT them.
SELECT id, title,
___(to_tsvector('english', title || ' ' || body), query) AS rank
FROM articles,
plainto_tsquery('english', 'index performance') AS query
WHERE to_tsvector('english', title || ' ' || body) @@ query
ORDER BY rank ___; -- 👉 1) the function that returns a relevance score
-- 2) the direction so the BEST match is first
...5. Make It Fast: GIN Index & Weights
Everything so far recomputes to_tsvector() for every row on every query — fine for demos, far too slow for millions of rows. Store the vector in a column and put a GIN index on it (Generalised INverted index — the same idea as a book's back-of-book index). With setweight() you also tag where each word came from so title hits outrank body hits.
GENERATED ALWAYS AS (…) STORED column keeps the tsvector perfectly in sync with your text automatically — no triggers to maintain, and the GIN index updates itself.GIN index + weighted vector
Pre-store lexemes for instant, weighted search.
-- Speed: by default Postgres recomputes to_tsvector() for every
-- row on every query. A GIN index pre-stores the lexemes so the
-- search jumps straight to matching rows — instant on millions.
-- Best practice (Postgres 12+): a STORED generated column,
-- so the tsvector is always in sync with the text automatically.
ALTER TABLE articles
ADD COLUMN search tsvector
GENERATED ALWAYS AS (
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector
...6. MySQL: MATCH() AGAINST()
MySQL takes a different route. You add a FULLTEXT index over the text columns, then search with MATCH(columns) AGAINST('terms'). Natural-language mode (the default) ranks by relevance automatically; boolean mode unlocks +/-/"phrase" operators much like Postgres' tsquery.
MySQL MATCH … AGAINST
FULLTEXT index, natural-language & boolean mode.
-- MySQL does full-text search with a FULLTEXT index and
-- the MATCH(...) AGAINST(...) pair instead of tsvector/tsquery.
-- 1) Add a FULLTEXT index over the columns you'll search:
CREATE FULLTEXT INDEX idx_articles_ft ON articles (title, body);
-- 2) Natural-language mode (the default): MySQL ranks by relevance.
-- Putting MATCH in the SELECT returns that relevance score.
SELECT id, title,
MATCH(title, body) AGAINST('database performance') AS score
FROM articles
WHERE MATCH(title, b
...7. Fuzzy Matching for Typos
Full-text search needs the right word (after stemming). But users misspell things — "databse", "Samsnug". Fuzzy matching measures how similar two strings are so a near-miss still matches. Two common tools, both high-level here: trigram similarity (pg_trgm) and Levenshtein distance (fuzzystrmatch).
similarity() match when nothing comes back.Trigram similarity & Levenshtein
Typo-tolerant matching at a high level.
-- Full-text search needs the RIGHT word (after stemming).
-- For typos and misspellings you need FUZZY matching.
-- (a) Trigram similarity — pg_trgm. It compares the 3-letter
-- chunks two strings share and scores them from 0 to 1.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
SELECT similarity('database', 'databse'); -- ~0.55 (typo still close)
SELECT title FROM articles
WHERE similarity(title, 'databse') > 0.3 -- catches the misspelling
ORDER BY similarity(title, 'databse') DESC;
-- In
...Common Errors (and the fix)
- No index, so search crawls: querying
to_tsvector(body) @@ …with no GIN index (Postgres) or noFULLTEXTindex (MySQL) silently does a full table scan. It "works" on 100 rows and times out on a million — add the index before you ship. - Using
LIKEfor real search:LIKE '%term%'can't use a normal index, can't stem, and can't rank. Reach for it only for tiny tables or exact substring checks, never for a search box. - Ignoring stop-words: searching for
'the' & 'of'returns nothing — those words were stripped from the vector. Search on the meaningful words, and remember a search fordatabasesmatchesdatabasethanks to stemming. - Forgetting to rank: filtering with
@@but nots_rank+ORDER BYreturns matches in arbitrary order. Always score and sort, or your "search" lists the worst hit first. - Wrong tsquery for user input: feeding a raw phrase to
to_tsquery(to_tsquery('english', 'database performance')) errors on the space — useplainto_tsqueryfor plain user text.
📘 Quick Reference
| Syntax | Purpose |
|---|---|
| to_tsvector('english', t) | Text → searchable lexemes (stemmed, stop-words removed) |
| to_tsquery('english', 'a & b') | Build a search with operators (& | ! <->) |
| plainto_tsquery('english', t) | Build a search from plain user text (ANDs words) |
| doc @@ query | TRUE if the document matches the query |
| ts_rank(doc, query) | Relevance score — ORDER BY it, DESC |
| setweight(v, 'A') | Tag lexemes A…D so title outranks body |
| USING gin (search) | Index a tsvector column for fast search |
| MATCH(cols) AGAINST('t') | MySQL full-text search (needs a FULLTEXT index) |
| similarity(a, b) | pg_trgm trigram score 0–1 for typo tolerance |
Frequently Asked Questions
Q: When is LIKE still fine?
For exact substring checks on small tables, or a prefix match like LIKE 'abc%' (no leading wildcard), which can use a normal index. For a real search box over lots of text, use full-text search.
Q: What's the difference between to_tsquery and plainto_tsquery?
to_tsquery expects operator syntax ('database & !mysql') and errors on a bare space. plainto_tsquery takes free text from a user and ANDs the words for you — safer for search-box input.
Q: Do I need a GIN index to use full-text search?
No — it works without one, but every query rescans the table. Once your data grows, a GIN index (Postgres) or FULLTEXT index (MySQL) is what makes search fast.
Q: Full-text search vs. fuzzy matching — which do I want?
Both. Full-text finds documents about a topic (with stemming and ranking); fuzzy matching (pg_trgm, Levenshtein) catches typos. Production search usually layers full-text first with a fuzzy fallback.
Mini-Challenge: A Ranked Search, End to End
Put it all together — a brief, a blank canvas, and the expected result in the comments. Write it, then copy it into a PostgreSQL playground to confirm.
🎯 Mini-Challenge
Search 'tuning' OR 'speed', rank with ts_rank, top 5.
-- 🎯 MINI-CHALLENGE — a ranked search, end to end
-- Using ONLY what this lesson covered:
-- 1. Search the articles table for the words "tuning" OR "speed"
-- (build the query with to_tsquery and the | operator).
-- 2. Compute a relevance column called rank with ts_rank.
-- 3. Return id, title, rank — most relevant FIRST — top 5 only.
--
-- ✅ Expected: at most 5 rows, ordered by rank descending,
-- with the best-matching article at the top.
-- your query here🎉 Lesson Complete
- ✅
LIKE '%word%'can't scale, stem, or rank — full-text search can - ✅
to_tsvector+to_tsquerymatched with@@(with stemming & stop-words) - ✅
ts_rankscores relevance so you canORDER BY rank DESC - ✅ A GIN index on a weighted
tsvectorcolumn makes it fast - ✅ MySQL does the same with
MATCH() AGAINST()and aFULLTEXTindex - ✅ Fuzzy tools (
pg_trgm, Levenshtein) tolerate typos - ✅ Next: storing and querying
JSON& semi-structured data
Sign up for free to track which lessons you've completed and get learning reminders.