    Lesson 8 • Intermediate

    Text Processing 🔍

    By the end of this lesson you'll be able to search, slice, summarise, and transform text from the command line — and chain those tools into a real log-analysis pipeline that answers questions in seconds, not minutes.

    What You'll Learn

    • Search files with grep and its key flags (-i, -r, -n, -v, -E)
    • Locate files by name and age with find (and combine it with grep)
    • Slice columns with cut and count things with sort | uniq -c
    • Count lines/words with wc and peek with head, tail, and tail -f
    • Find-and-replace with sed — and avoid the -i in-place trap
    • Read columns with a gentle awk intro ($1, $2 fields)

    Setup: a log file to practise on

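    Every example in this lesson works on the same web-server log. Run this block first to create access.log in your current folder. Each line has the same shape: date time LEVEL METHOD url status user, so each piece sits in its own column (fields 1 to 7). INFO and WARN are padded with a second space so the columns line up under ERROR; tools that split on a single space, such as cut, see that padding as an extra empty field.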

    Create the sample log
    # Create a realistic web-server log to practise on.
    # (A heredoc just writes these lines verbatim into access.log.)
    cat > access.log <<'EOF'
    2026-06-15 09:01:12 INFO  GET /home 200 alice
    2026-06-15 09:01:30 INFO  GET /products 200 bob
    2026-06-15 09:02:05 WARN  GET /search 200 alice
    2026-06-15 09:02:41 ERROR POST /login 500 bob
    2026-06-15 09:03:10 INFO  GET /products 200 carol
    2026-06-15 09:03:55 ERROR GET /cart 404 alice
    2026-06-15 09:04:22 INFO  POST /login 200 carol
    2026-06-15 09:05:01 ERROR POST /checkout 500 bob
    2026-06-15 09:05:48 WARN  GET /products 200 alice
    EOF
    
    # How many lines did we write?
    wc -l access.log
    Output
    9 access.log
    This is real code. Run it for free at onecompiler.com/bash or in your own editor.

    1️⃣ grep — search for patterns

    grep ("Global Regular Expression Print") scans a file line by line and prints every line that matches your pattern. A handful of flags do most of the work: -i ignores case, -n adds line numbers, -v inverts the match (lines that don't match), -r searches a whole folder, and -E turns on extended regex so | means "or".

    grep with -i, -n, -v, -E, and -r
    # grep prints every line that MATCHES a pattern.
    
    # 1) Plain match: every line containing ERROR
    grep "ERROR" access.log
    
    echo "--- -i: case-insensitive ---"
    # -i ignores case, so 'warn', 'WARN', 'Warn' all match
    grep -i "warn" access.log
    
    echo "--- -n: show line numbers ---"
    grep -n "ERROR" access.log
    
    echo "--- -v: INVERT — lines that do NOT match ---"
    # Everything that is NOT an INFO line
    grep -v "INFO" access.log
    
    echo "--- -E: extended regex (ERROR or WARN) ---"
    # -E turns on extended regex so | means OR. Without -E you'd
    # have to write the clunky \| instead.
    grep -E "ERROR|WARN" access.log
    
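    echo "--- -r: search a whole folder recursively ---"
    # Make a logs/ folder holding a copy of the log, so there is something to search
    mkdir -p logs && cp access.log logs/
    # -r walks every file under logs/ ; -n adds the line number after each filename
    grep -rn "ERROR" logs/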
    Output
    2026-06-15 09:02:41 ERROR POST /login 500 bob
    2026-06-15 09:03:55 ERROR GET /cart 404 alice
    2026-06-15 09:05:01 ERROR POST /checkout 500 bob
    --- -i: case-insensitive ---
    2026-06-15 09:02:05 WARN  GET /search 200 alice
    2026-06-15 09:05:48 WARN  GET /products 200 alice
    --- -n: show line numbers ---
    4:2026-06-15 09:02:41 ERROR POST /login 500 bob
    6:2026-06-15 09:03:55 ERROR GET /cart 404 alice
    8:2026-06-15 09:05:01 ERROR POST /checkout 500 bob
    --- -v: INVERT — lines that do NOT match ---
    2026-06-15 09:02:05 WARN  GET /search 200 alice
    2026-06-15 09:02:41 ERROR POST /login 500 bob
    2026-06-15 09:03:55 ERROR GET /cart 404 alice
    2026-06-15 09:05:01 ERROR POST /checkout 500 bob
    2026-06-15 09:05:48 WARN  GET /products 200 alice
    --- -E: extended regex (ERROR or WARN) ---
    2026-06-15 09:02:05 WARN  GET /search 200 alice
    2026-06-15 09:02:41 ERROR POST /login 500 bob
    2026-06-15 09:03:55 ERROR GET /cart 404 alice
    2026-06-15 09:05:01 ERROR POST /checkout 500 bob
    2026-06-15 09:05:48 WARN  GET /products 200 alice
    --- -r: search a whole folder recursively ---
    logs/access.log:4:2026-06-15 09:02:41 ERROR POST /login 500 bob
    logs/access.log:6:2026-06-15 09:03:55 ERROR GET /cart 404 alice
    logs/access.log:8:2026-06-15 09:05:01 ERROR POST /checkout 500 bob
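To see why -E matters, here is a small sketch that runs the same pattern with and without it, and combines a few flags. It builds its own four-line file in a temporary folder (demo.log and the variable names are made up for this example), so it doesn't touch access.log.

```shell
# Self-contained demo: write a tiny log into a throwaway folder.
workdir=$(mktemp -d)
cat > "$workdir/demo.log" <<'EOF'
09:01 INFO  GET /home 200
09:02 warn  GET /search 200
09:03 ERROR POST /login 500
09:04 ERROR|WARN literal text
EOF

# Without -E, | is an ordinary character: only the literal "ERROR|WARN" matches.
basic_count=$(grep -c "ERROR|WARN" "$workdir/demo.log")

# With -E, | means OR; -i also catches the lowercase "warn".
extended_count=$(grep -ciE "ERROR|WARN" "$workdir/demo.log")

# Flags combine freely: -v inverts, -c counts. How many lines are NOT INFO?
not_info=$(grep -vc "INFO" "$workdir/demo.log")

echo "basic=$basic_count extended=$extended_count not_info=$not_info"
rm -r "$workdir"
```

This should print basic=1 extended=3 not_info=3: the plain pattern only finds the one line that literally contains ERROR|WARN.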

    🎯 Your Turn: find the 500 errors

    Fill in the blank with the flag that prints line numbers, then run it and check the output matches.

    Fill in the blank
    # 🎯 YOUR TURN — find every line that is a 500 error.
    # Goal: show the line numbers of lines containing "500".
    
    # 1) Add the flag that prints line numbers
    grep ___ "500" access.log   # 👉 replace ___ with the line-number flag
    
    # ✅ Expected output:
    # 4:2026-06-15 09:02:41 ERROR POST /login 500 bob
    # 8:2026-06-15 09:05:01 ERROR POST /checkout 500 bob
    Output
    4:2026-06-15 09:02:41 ERROR POST /login 500 bob
    8:2026-06-15 09:05:01 ERROR POST /checkout 500 bob

    2️⃣ find — locate files

    Where grep searches inside files, find searches for files by name, type, size, or age. Give it a starting folder (. means "here") and conditions like -name "*.log" or -mtime -1 (changed in the last day). With -exec you can even run another command — like grep — on every file it finds.

    find by name, by age, and combined with grep
    # find walks a directory tree looking for files by NAME, type,
    # size, or age — it finds files, grep finds text inside them.
    
    # Every .log file anywhere under the current folder (. means "here")
    find . -name "*.log"
    
    echo "--- only files (not folders) modified in the last day ---"
    # -type f = regular files only; -mtime -1 = changed < 1 day ago
    find . -type f -mtime -1
    
    echo "--- combine find + grep: search inside everything find returns ---"
    # -exec runs grep on each match; {} is the filename, \; ends -exec
    find . -name "*.log" -exec grep -l "ERROR" {} \;
    Output
    ./access.log
    ./logs/access.log
    --- only files (not folders) modified in the last day ---
    ./access.log
    ./logs/access.log
    --- combine find + grep: search inside everything find returns ---
    ./access.log
    ./logs/access.log
    Run this in your own terminal — the output assumes a logs/ subfolder also exists. Online runners are sandboxed, so file listings will differ.
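If you'd like a result that is the same on any machine, the sketch below builds a throwaway folder tree first and searches only that. The file names (app.log, quiet.log, notes.txt) are invented for the demo. It also uses {} + instead of {} \; which hands many filenames to a single grep call rather than starting grep once per file.

```shell
# Build a small sandbox so the result is predictable.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/logs"
printf 'INFO ok\nERROR boom\n' > "$sandbox/app.log"
printf 'INFO fine\n'           > "$sandbox/logs/quiet.log"
printf 'ERROR in a txt\n'      > "$sandbox/notes.txt"

# find picks the files (-type f, -name "*.log");
# grep -l prints only the NAMES of files that contain a match.
matches=$(cd "$sandbox" && find . -type f -name "*.log" -exec grep -l "ERROR" {} +)

echo "$matches"
rm -r "$sandbox"
```

Only ./app.log is printed: notes.txt contains ERROR but fails the -name test, and logs/quiet.log passes the name test but has no ERROR line.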

    3️⃣ cut, sort, uniq -c & wc — slice and summarise

    Most logs are columns. cut -d' ' -f3 splits on a space and keeps field 3, the log level. cut treats every single space as a delimiter, and this log pads INFO and WARN with a second space, so squeeze repeated spaces with tr -s ' ' first to keep the later fields lined up. The classic counting trick is sort | uniq -c: uniq only collapses adjacent duplicates, so you must sort first to group identical lines together; -c then prefixes each with its count. Add another sort -rn to rank biggest-first, and wc -l to count lines.

    cut a column, then sort | uniq -c to count
    # These tools slice text into columns and summarise it.
    
    # cut -d' ' splits on EVERY single space, and INFO/WARN are padded with two,
    # so tr -s ' ' squeezes each run of spaces into one first.
    # After that, -f3 keeps field 3 (the log LEVEL).
    echo "--- cut: pull out the log level (field 3) ---"
    tr -s ' ' < access.log | cut -d' ' -f3
    
    echo "--- sort + uniq -c: COUNT how many of each level ---"
    # uniq only collapses ADJACENT duplicates, so you MUST sort first.
    # -c prefixes each line with its count.
    tr -s ' ' < access.log | cut -d' ' -f3 | sort | uniq -c
    
    echo "--- sort -rn: sort those counts, biggest first ---"
    # -n = numeric sort, -r = reverse (descending)
    tr -s ' ' < access.log | cut -d' ' -f3 | sort | uniq -c | sort -rn
    
    echo "--- wc: count lines, then count just the errors ---"
    wc -l access.log
    grep -c "ERROR" access.log
    Output
    --- cut: pull out the log level (field 3) ---
    INFO
    INFO
    WARN
    ERROR
    INFO
    ERROR
    INFO
    ERROR
    WARN
    --- sort + uniq -c: COUNT how many of each level ---
          3 ERROR
          4 INFO
          2 WARN
    --- sort -rn: sort those counts, biggest first ---
          4 INFO
          3 ERROR
          2 WARN
    --- wc: count lines, then count just the errors ---
    9 access.log
    3
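One thing worth seeing for yourself is how cut counts fields when a line contains two spaces in a row. The sketch below takes a single padded line from the sample log and extracts "field 4" three ways; the variable names are just for this demo.

```shell
# One padded line from the lesson's log (note the two spaces after INFO).
line='2026-06-15 09:01:12 INFO  GET /home 200 alice'

# cut sees an empty field between the two spaces, so field 4 is blank
# and everything after it is shifted one position to the right.
naive=$(printf '%s\n' "$line" | cut -d' ' -f4)

# tr -s ' ' squeezes each run of spaces into one, so field 4 is the method.
squeezed=$(printf '%s\n' "$line" | tr -s ' ' | cut -d' ' -f4)

# awk splits on runs of whitespace by default, so no squeezing is needed.
with_awk=$(printf '%s\n' "$line" | awk '{print $4}')

echo "cut alone:   [$naive]"
echo "tr then cut: [$squeezed]"
echo "awk:         [$with_awk]"
```

The first result is empty and the other two are GET. On an ERROR line, which has no padding, all three agree, which is exactly why this kind of bug is easy to miss.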

    4️⃣ head & tail — peek at the ends

    head -3 shows the first 3 lines; tail -2 shows the last 2. They're perfect on the end of a pipeline to keep just the "top N" results. The killer feature is tail -f ("follow"): it stays open and prints new lines as they're written, so you can watch a live server log in real time (press Ctrl+C to stop).

    head, tail, and a top-3 one-liner
    # head and tail peek at the START or END of a file.
    
    echo "--- head -3: first 3 lines ---"
    head -3 access.log
    
    echo "--- tail -2: last 2 lines ---"
    tail -2 access.log
    
    echo "--- top 3 busiest pages (a real one-liner) ---"
    # tr -s ' ' squeezes the padded spaces so field 5 is always the URL;
    # then count, sort descending, keep the top 3.
    # (Four pages tie on 1 hit, so which of them lands third can vary.)
    tr -s ' ' < access.log | cut -d' ' -f5 | sort | uniq -c | sort -rn | head -3
    
    # tail -f FOLLOWS a file live: new lines appear as they're written.
    # Perfect for watching a server log in real time. Ctrl+C to stop.
    #   tail -f access.log
    Output
    --- head -3: first 3 lines ---
    2026-06-15 09:01:12 INFO  GET /home 200 alice
    2026-06-15 09:01:30 INFO  GET /products 200 bob
    2026-06-15 09:02:05 WARN  GET /search 200 alice
    --- tail -2: last 2 lines ---
    2026-06-15 09:05:01 ERROR POST /checkout 500 bob
    2026-06-15 09:05:48 WARN  GET /products 200 alice
    --- top 3 busiest pages (a real one-liner) ---
          3 /products
          2 /login
          1 /search
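head and tail also combine with each other. As a rough sketch, the block below uses ten generated lines as a stand-in for a log file and pulls out the start, the end, a slice from the middle, and "everything from line N onward". It uses the -n form of the options (head -n 3 means the same as head -3).

```shell
# Ten numbered lines stand in for a log file.
sample=$(printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10)

first_two=$(printf '%s\n' "$sample" | head -n 2)
last_two=$(printf '%s\n' "$sample" | tail -n 2)

# Lines 4 to 6: keep the first 6 lines, then the last 3 of those.
middle=$(printf '%s\n' "$sample" | head -n 6 | tail -n 3)

# A plus sign means "start AT line N": everything from line 9 onward.
from_nine=$(printf '%s\n' "$sample" | tail -n +9)

echo "--- middle ---"
echo "$middle"
echo "--- from line 9 ---"
echo "$from_nine"
```

The middle slice is lines 4, 5 and 6; the last part is lines 9 and 10.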

    5️⃣ sed — find and replace

    sed ("stream editor") transforms text as it flows past. Its bread-and-butter is substitution: sed 's/old/new/' replaces the first match on each line, and adding /g replaces every match. By default it prints the result and leaves your file untouched — -i rewrites the file in place with no undo, so always preview first.

    sed substitution and line deletion
    # sed is a stream editor. Its #1 job is find-and-replace:
    #   sed 's/old/new/'   replaces the FIRST match on each line
    #   sed 's/old/new/g'  replaces EVERY match (g = global)
    
    echo "--- swap GET for FETCH (first match per line) ---"
    sed 's/GET/FETCH/' access.log | head -3
    
    echo "--- delete every INFO line, keep the rest ---"
    # /pattern/d deletes lines matching the pattern
    sed '/INFO/d' access.log
    
    # ⚠️ sed prints to the screen and leaves the file untouched.
    # Add -i to rewrite the file IN PLACE — there is no undo:
    #   sed -i 's/ERROR/CRITICAL/' access.log
    Output
    --- swap GET for FETCH (first match per line) ---
    2026-06-15 09:01:12 INFO  FETCH /home 200 alice
    2026-06-15 09:01:30 INFO  FETCH /products 200 bob
    2026-06-15 09:02:05 WARN  FETCH /search 200 alice
    --- delete every INFO line, keep the rest ---
    2026-06-15 09:02:05 WARN  GET /search 200 alice
    2026-06-15 09:02:41 ERROR POST /login 500 bob
    2026-06-15 09:03:55 ERROR GET /cart 404 alice
    2026-06-15 09:05:01 ERROR POST /checkout 500 bob
    2026-06-15 09:05:48 WARN  GET /products 200 alice
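A safer way to use -i is to ask sed for a backup copy. The sketch below previews the change, then edits in place with -i.bak, a form (suffix attached, no space) that both GNU sed and the BSD sed shipped with macOS accept. It works on a throwaway file, app.log, in a temporary folder; those names are made up for the demo.

```shell
workdir=$(mktemp -d)
printf 'ERROR one\nINFO two\nERROR ERROR three\n' > "$workdir/app.log"

# 1) Preview: no -i, so the result only goes to the screen (here, a variable).
preview=$(sed 's/ERROR/CRITICAL/g' "$workdir/app.log")

# 2) Edit in place, keeping the original as app.log.bak.
sed -i.bak 's/ERROR/CRITICAL/g' "$workdir/app.log"

edited=$(cat "$workdir/app.log")
backup_first=$(head -n 1 "$workdir/app.log.bak")

echo "$edited"
echo "backup still starts with: $backup_first"
rm -r "$workdir"
```

The edited file matches the preview, and the .bak copy still holds the original text, so a bad substitution can be undone by copying it back.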

    6️⃣ awk — a gentle intro to fields

    awk is a tiny language for columnar text. It reads each line and splits it into fields you reference by number: $1 is the first field, $2 the second, and $0 the whole line. awk '{print $1, $2}' prints the first two columns; put a /pattern/ before the braces to act only on matching lines. That's already enough to do real work.

    awk fields: $1, $2, filtering, and a tally
    # awk reads each line as fields split on whitespace:
    #   $1 = first field, $2 = second, ... $0 = the whole line.
    
    # Single quotes stop the shell from expanding $2 inside the echo text.
    echo '--- print the time ($2) and the user ($7, the last field) ---'
    awk '{print $2, $7}' access.log | head -4
    
    echo '--- only ERROR lines, show URL ($5) and status ($6) ---'
    # A /pattern/ before {...} filters to matching lines first
    awk '/ERROR/ {print $5, $6}' access.log
    
    echo "--- count requests per user with an awk tally ---"
    # Build up a count keyed by user ($7), print the totals at the END
    awk '{count[$7]++} END {for (u in count) print count[u], u}' access.log | sort -rn
    Output
    --- print the time ($2) and the user ($7, the last field) ---
    09:01:12 alice
    09:01:30 bob
    09:02:05 alice
    09:02:41 bob
    --- only ERROR lines, show URL ($5) and status ($6) ---
    /login 500
    /cart 404
    /checkout 500
    --- count requests per user with an awk tally ---
    4 alice
    3 bob
    2 carol
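awk can also test the value of a single field instead of matching text anywhere on the line. Here is a minimal sketch, assuming the lesson's layout where $6 is the status code; the three lines are copied from the sample log and the variable names are invented for the demo.

```shell
# Three lines in the lesson's layout (the INFO line keeps its padding).
log='2026-06-15 09:02:41 ERROR POST /login 500 bob
2026-06-15 09:03:10 INFO  GET /products 200 carol
2026-06-15 09:03:55 ERROR GET /cart 404 alice'

# Compare a FIELD, not the whole line: status ($6) of 500 or more.
server_errors=$(printf '%s\n' "$log" | awk '$6 >= 500 {print $5, $7}')

# A little arithmetic: how many requests had a status of 400 or higher,
# out of NR (the number of lines read so far, i.e. the total at END).
error_rate=$(printf '%s\n' "$log" | awk '$6 >= 400 {bad++} END {printf "%d/%d", bad, NR}')

echo "server errors: $server_errors"
echo "error rate:    $error_rate"
```

This prints /login bob for the server errors and 2/3 for the error rate. A plain grep "500" could not do this reliably, because it would also match 500 anywhere else on the line.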

    🎯 Your Turn: requests per user

    This pipeline is almost done. Fill in the one missing step so the counts come out right (remember the rule about uniq).

    Complete the pipeline
    # 🎯 YOUR TURN — build a mini log-analysis pipeline.
    # Goal: count how many requests EACH user made, busiest first.
    # Pull out the user (field 7) -> sort -> count uniques -> sort by count.
    # (awk splits on runs of spaces, so the padded INFO/WARN lines still line up.)
    
    awk '{print $7}' access.log | ___ | uniq -c | sort -rn
    #                             👆 replace ___ : uniq -c only collapses
    #                                ADJACENT duplicates, so what must run first?
    
    # ✅ Expected output:
    #       4 alice
    #       3 bob
    #       2 carol
    Output
          4 alice
          3 bob
          2 carol

    Putting It Together: a log-analysis report

    Here's a single script that answers the questions an on-call engineer actually asks — total traffic, a breakdown by level, every error with its line number, the busiest page, and who hit the most errors. Every line uses a tool from this lesson. Read it top to bottom; you understand all of it now.

    A real-world log-analysis pipeline
    # === A real log-analysis report, built from this lesson's tools ===
    
    echo "TOTAL REQUESTS:"
    wc -l < access.log                       # < feeds the file to wc (no filename printed)
    
    echo
    echo "REQUESTS BY LEVEL:"
    # awk splits on runs of spaces, so the padded INFO/WARN lines don't shift the columns
    awk '{print $3}' access.log | sort | uniq -c | sort -rn
    
    echo
    echo "ALL ERRORS (with line numbers):"
    grep -n "ERROR" access.log
    
    echo
    echo "TOP PAGE:"
    awk '{print $5}' access.log | sort | uniq -c | sort -rn | head -1
    
    echo
    echo "ERRORS PER USER:"
    awk '/ERROR/ {print $7}' access.log | sort | uniq -c | sort -rn
    Output
    TOTAL REQUESTS:
    9
    
    REQUESTS BY LEVEL:
          4 INFO
          3 ERROR
          2 WARN
    
    ALL ERRORS (with line numbers):
    4:2026-06-15 09:02:41 ERROR POST /login 500 bob
    6:2026-06-15 09:03:55 ERROR GET /cart 404 alice
    8:2026-06-15 09:05:01 ERROR POST /checkout 500 bob
    
    TOP PAGE:
          3 /products
    
    ERRORS PER USER:
          2 bob
          1 alice
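If you find yourself running the same report on several files, one option is to wrap a few of these steps in a shell function. The sketch below is only one way to do it: log_report is a hypothetical name, and it assumes the file uses this lesson's column layout (URL in field 5).

```shell
# log_report FILE: print a one-line summary of a log in the lesson's layout.
# (Hypothetical helper, not part of the lesson's script.)
log_report() {
  file=$1
  total=$(wc -l < "$file" | tr -d ' ')   # tr strips the padding some wc versions add
  errors=$(grep -c "ERROR" "$file")
  top_page=$(awk '{print $5}' "$file" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')
  echo "total=$total errors=$errors top_page=$top_page"
}

# Try it on a small throwaway file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2026-06-15 09:01:12 INFO  GET /home 200 alice
2026-06-15 09:01:30 INFO  GET /products 200 bob
2026-06-15 09:02:41 ERROR POST /login 500 bob
2026-06-15 09:03:10 INFO  GET /products 200 carol
EOF

summary=$(log_report "$sample")
echo "$summary"
rm "$sample"
```

For this four-line sample it prints total=4 errors=1 top_page=/products. Calling log_report access.log would summarise the lesson's log the same way.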

    Pro Tips

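    • 💡 The counting idiom is worth memorising: cut -d' ' -f<N> file | sort | uniq -c | sort -rn answers "what are the most common values in column N?" for any file whose columns are separated by exactly one space. If the columns are padded with extra spaces, squeeze them with tr -s ' ' first, or pull the column with awk '{print $N}' instead.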
    • 💡 Build pipelines one stage at a time. Add a | step, check the output, then add the next. It's far easier than debugging a 5-stage pipe all at once.
    • 💡 Preview before you destroy. Run sed and find -exec rm commands without the destructive part first to confirm they target the right lines/files.
    • 💡 grep -c counts matching lines; wc -l counts all lines. Use the right one for the question you're asking.
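The "preview before you destroy" tip can look like this in practice. It's a sketch in a throwaway folder with made-up file names: run find with the conditions only, read the list, and add the destructive part once the list is right.

```shell
sandbox=$(mktemp -d)
touch "$sandbox/keep.txt" "$sandbox/old1.tmp" "$sandbox/old2.tmp"

# Stage 1: preview. Same conditions, but find only PRINTS what would be removed.
preview=$(find "$sandbox" -type f -name "*.tmp" | sort)
preview_count=$(printf '%s\n' "$preview" | wc -l | tr -d ' ')
echo "would remove $preview_count files:"
echo "$preview"

# Stage 2: the list looks right, so repeat the command with the destructive part.
find "$sandbox" -type f -name "*.tmp" -exec rm {} +

remaining=$(ls "$sandbox")
echo "left behind: $remaining"
rm -r "$sandbox"
```

Two .tmp files are listed and then removed, and keep.txt is the only file left.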

    Common Errors (and the fix)

    • Regex | matches nothing: plain grep "ERROR|WARN" looks for the literal text ERROR|WARN. Add -E for extended regex so | means "or": grep -E "ERROR|WARN".
    • Counts look wrong / duplicates not merged: you ran uniq without sorting. uniq only collapses adjacent duplicates — always sort first: sort file | uniq -c.
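    • sed -i wiped your file and there's no undo: run it without -i first to preview. On macOS, -i also needs an argument: use sed -i '' 's/a/b/' file (an empty string). Without it, sed takes your script as the backup suffix and tries to run the filename as the script, so you get a confusing error such as "command a expects \ followed by text" or "invalid command code", depending on the first letter of the filename.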
    • grep: folder: Is a directory: you searched a folder without -r. Add it: grep -r "text" folder/.
    • Sorting numbers gives 1, 10, 2, 3…: that's alphabetical order. Use sort -n for numeric order (and -rn for biggest-first).
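The last error in that list is easy to reproduce. A quick sketch with four numbers (the variable names are just for the demo; paste -sd' ' - joins the sorted lines onto one line for display):

```shell
numbers=$(printf '%s\n' 10 2 1 3)

# Plain sort compares text character by character, so "10" lands before "2".
alphabetical=$(printf '%s\n' "$numbers" | sort | paste -sd' ' -)

# -n compares the values as numbers; adding -r puts the biggest first.
numeric=$(printf '%s\n' "$numbers" | sort -n | paste -sd' ' -)
biggest_first=$(printf '%s\n' "$numbers" | sort -rn | paste -sd' ' -)

echo "sort:     $alphabetical"
echo "sort -n:  $numeric"
echo "sort -rn: $biggest_first"
```

Plain sort gives 1 10 2 3, sort -n gives 1 2 3 10, and sort -rn gives 10 3 2 1.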

    📋 Quick Reference

    Command                   What it does
    grep -in "x" f            Find "x" (case-insensitive) with line numbers
    grep -rE "a|b" dir/       Recursively find a OR b (extended regex)
    grep -v "x" f             Lines that do NOT contain "x"
    find . -name "*.log"      All .log files under the current folder
    cut -d' ' -f4 f           Keep field 4 (space-separated)
    sort | uniq -c            Count each unique line (sort first!)
    sort -rn                  Numeric sort, biggest first
    wc -l f                   Count lines in a file
    head -3 / tail -2         First 3 / last 2 lines
    tail -f f                 Follow a file live (new lines stream in)
    sed 's/old/new/g' f       Replace every "old" with "new"
    awk '{print $1, $2}' f    Print fields 1 and 2 of each line

    Frequently Asked Questions

    Q: What's the difference between grep and find?

    find locates FILES by name, type, size, or age (e.g. every *.log file in a folder). grep searches the TEXT inside files for a pattern. They pair up perfectly: use find to choose which files, then grep to search within them with find ... -exec grep ... {} \;.

    Q: Why do I need to sort before uniq?

    uniq only collapses duplicate lines that are next to each other. If identical lines are scattered through the file it won't see them as duplicates. Running sort first groups all identical lines together, so the pipeline cut ... | sort | uniq -c gives correct counts.
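You can watch this happen with five lines. The sketch below counts how many groups uniq -c produces with and without sort; the input is made up to mimic log levels in arrival order.

```shell
# Levels in arrival order, the way a column extracted from a log would look.
levels=$(printf '%s\n' INFO ERROR INFO ERROR INFO)

# Without sort: no two equal lines are neighbours, so nothing collapses.
unsorted_groups=$(printf '%s\n' "$levels" | uniq -c | wc -l | tr -d ' ')

# With sort: equal lines sit next to each other and uniq -c counts them.
sorted_groups=$(printf '%s\n' "$levels" | sort | uniq -c | wc -l | tr -d ' ')
most_common=$(printf '%s\n' "$levels" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $1, $2}')

echo "groups without sort: $unsorted_groups"
echo "groups with sort:    $sorted_groups"
echo "most common:         $most_common"
```

Without sort you get 5 groups of 1; with sort you get the 2 real groups, and the most common value is 3 INFO.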

    Q: When do I need grep -E?

    Use -E (extended regular expressions) when you want features like alternation a|b, grouping (ab)+, or + and ? quantifiers without backslashes. Plain grep uses basic regex where you'd have to escape them as \|, \+ and so on. -E keeps patterns readable: grep -E "ERROR|WARN".

    Q: Is sed -i safe to run?

    Be careful: -i edits the file in place and there is no undo. Always run the command WITHOUT -i first to preview the result on screen, and ideally keep the file in version control. On macOS, -i even requires an argument (use sed -i '' 's/.../.../' file) — another reason to test first.

    Q: Should I learn awk if I know grep, cut, and sed?

    For simple jobs grep/cut/sed are quicker. Reach for awk when you need columns AND logic together — filtering on a field's value, doing arithmetic, or tallying with arrays (count[$7]++). It's a tiny programming language built for tabular text, and a little goes a long way.

    Q: Do these tools work on Windows?

    They're native to macOS and Linux. On Windows, use WSL (Windows Subsystem for Linux) or Git Bash, both of which ship these exact commands. PowerShell has its own equivalents (Select-String for grep, Where-Object for filtering), but the Unix tools above are the universal standard.

    Mini-Challenge: profile a log file

    No blanks this time — just a brief and an outline. Build the pipeline yourself on access.log (or any log you have), run it, and check it against the expected snippet in the comments. This is exactly the kind of throwaway analysis engineers run every day.

    Your challenge
    # 🎯 MINI-CHALLENGE: profile your own log file.
    # Using ONLY the tools from this lesson (grep, cut, sort,
    # uniq -c, wc, head, awk), produce a short report that shows:
    #
    #   1. The total number of requests        (hint: wc -l)
    #   2. A count of each HTTP method          (GET / POST is field 4; awk '{print $4}' copes with the padded spaces)
    #   3. The single busiest user              (count field 7, take head -1)
    #   4. Every WARN or ERROR line, numbered   (hint: grep -nE "WARN|ERROR")
    #
    # ✅ Example of what #3 should print:
    #       4 alice
    
    # your pipeline here

    🎉 Lesson Complete — and that's the Command Line course!

    • grep searches text — -i (case), -n (numbers), -v (invert), -r (recursive), -E (extended regex)
    • find locates files by name/age and can -exec another command on each
    • cut slices columns; sort | uniq -c | sort -rn counts and ranks them
    • wc -l counts lines; head/tail peek at the ends; tail -f follows live logs
    • sed 's/old/new/g' find-and-replaces (mind the -i in-place trap)
    • awk reads fields $1, $2… for columnar logic and tallies

    This was the final lesson in the Command Line course — you can now navigate the filesystem, manage files, control permissions, write small scripts, chain commands with pipes, and slice through text like a pro. Where next? Take these skills into the Git course to version-control your projects, or revisit the CLI course overview to fill any gaps. The terminal is now yours. 🚀
