Advanced grep and awk Techniques - find/grep/awk Master Series
The advanced guide covers grep environment variable optimization, next-generation high-speed tools, and awk's associative arrays, user-defined functions, and stream processing. Master professional-level data processing techniques.
grep Command: The Text Search Wizard
grep stands for "Global Regular Expression Print". It extracts lines matching a pattern from files or input. Combined with regular expressions, it becomes an extremely powerful search tool.
Basic Syntax
grep [options] pattern filename
Basic Usage
String search:
# Display lines containing "Linux"
grep "Linux" document.txt
# Case-insensitive search
grep -i "linux" document.txt
# Display lines NOT containing "error" (inverse search)
grep -v "error" log.txt
Line numbers and context:
# Show line numbers
grep -n "function" script.js
# Show 3 lines before and after
grep -C 3 "ERROR" app.log
# Show 1 line before and 2 lines after
grep -A 2 -B 1 "WARNING" app.log
File search and counting:
# Show only filenames containing "TODO"
grep -l "TODO" *.js
# Count lines containing "error"
grep -c "error" log.txt
# Recursive search (note: /etc/ may contain sensitive information)
grep -r "password" /etc/
Combining with Regular Expressions
The true power of grep comes from combining it with regular expressions.
# Lines starting with "Linux"
grep "^Linux" document.txt
# Lines ending with "finished"
grep "finished$" log.txt
# Empty lines
grep "^$" file.txt
# Lines containing one or more digits
grep -E "[0-9]+" numbers.txt
# Match "color" or "colour"
grep -E "colou?r" text.txt
# Match any of multiple patterns (OR)
grep -E "error|warning|fatal" log.txt
Practical patterns:
# IP address pattern
grep -E "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" access.log
# Email address pattern
grep -E "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" contacts.txt
# Date pattern (YYYY-MM-DD)
grep -E "20[0-9]{2}-[0-1][0-9]-[0-3][0-9]" log.txt
Combining grep with Pipes
You can build powerful data processing pipelines by piping to other commands.
# Show only nginx processes
ps aux | grep "nginx"
# Exclude grep itself when filtering
ps aux | grep -v "grep" | grep "python"
# Real-time error monitoring
tail -f /var/log/app.log | grep --line-buffered "ERROR"
# Count 404 errors
cat access.log | grep "404" | wc -l
# Show processes listening on port 80
netstat -an | grep ":80 "
Speed-up Techniques
Locale settings and fixed-string matching can noticeably speed up searches on large files.
# Use the C locale to avoid UTF-8 processing overhead (reported as up to 10x faster; the actual gain depends on the grep version, the pattern, and the data)
LC_ALL=C grep "ERROR" huge_log.txt
# Skip binary files
LC_ALL=C grep --binary-files=without-match "pattern" /var/log/*
# Disable color output (GREP_OPTIONS is obsolete and ignored by GNU grep 3.6 and later, so pass the flag directly)
LC_ALL=C grep --color=never -F "ERROR" *.log
# Fixed-string search (skip regex engine)
grep -F "literal_string" file.txt
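Besides skipping the regex engine, -F also changes what matches, because every character in the pattern is taken literally. The following is a minimal sketch of that difference; the temporary file and its three sample lines are made up for illustration.

```shell
# Throwaway sample data (contents are illustrative only).
tmp=$(mktemp)
printf '%s\n' 'value=1.5' 'value=125' 'path=[a-z]' > "$tmp"

# Regex search: "." matches any character, so both "1.5" and "125" match.
regex_hits=$(grep -c '1.5' "$tmp")

# Fixed-string search: "." is a literal dot, so only "1.5" matches.
fixed_hits=$(grep -c -F '1.5' "$tmp")

# -F also finds text that would otherwise be read as a bracket expression.
bracket_hits=$(grep -c -F '[a-z]' "$tmp")

echo "regex=$regex_hits fixed=$fixed_hits bracket=$bracket_hits"
rm -f "$tmp"
```

If a pattern contains no regex metacharacters, both forms return the same lines, so -F is a safe default for literal searches.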
Next-Generation grep: ripgrep and ag
Faster, more feature-rich alternatives to traditional grep.
ripgrep (rg) - Rust-based fast grep:
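# Search only JavaScript files (fast)
rg --type js "function" /var/www/
# JSON output for structured processing (keep only match events, which carry the line text)
rg --json "ERROR" /var/log/ | jq -r 'select(.type == "match") | .data.lines.text'
# Show per-file match counts plus search statistics
rg --stats --count "TODO" ./src/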
ag (The Silver Searcher):
# ag searches with multiple worker threads by default; --workers sets how many
ag --workers 4 "pattern" /large/directory/
# Show 5 lines of context, grouped by file
ag --context=5 --group "ERROR" /var/log/
Performance comparison (1GB file search):
| Tool | Time | Memory | Notes |
|---|---|---|---|
| grep | 15.2s | 2MB | Standard, stable |
| LC_ALL=C grep | 8.1s | 2MB | Optimized |
| ripgrep (rg) | 2.3s | 8MB | Fastest, feature-rich |
| ag | 4.1s | 12MB | Fast, dev-friendly |
Large File Processing
# Real-time log monitoring with search
tail -f /var/log/huge.log | grep --line-buffered "ERROR"
# Search compressed files directly
zgrep "ERROR" /var/log/app.log.gz
# bzip2 files too
bzgrep "pattern" archive.log.bz2
# Split a large file into chunks, then search up to 4 chunks in parallel
split -l 1000000 huge.log chunk_ && printf '%s\n' chunk_* | xargs -P 4 -n 1 grep -H "ERROR" | sort
Report Generation
# Generate CSV error report
grep -n "ERROR" *.log | awk -F: '{print $1","$2","$3}' > error_report.csv
# Comprehensive error analysis report
{
echo "=== ERROR Analysis Report $(date) ==="
echo "Total errors: $(grep -c ERROR app.log)"
echo "Top 5 errors:"
grep -o 'ERROR.*' app.log | sort | uniq -c | sort -nr | head -5
}
awk Command: The Data Processing Magician
awk is named after the initials of its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan. It is a powerful text processing language that excels with CSV files and log files.
How awk Thinks
awk processes input as records (typically lines) and fields (typically columns).
Name,Age,Job
Tanaka,25,Engineer
Sato,30,Designer
Yamada,28,Manager
- $1: First field (Name)
- $2: Second field (Age)
- $3: Third field (Job)
- $0: Entire record
- NF: Number of fields
- NR: Record number
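To see these variables in action, the sketch below runs awk over the sample table. The temporary file is created only for this illustration, and -F ',' tells awk to split fields on commas.

```shell
# Recreate the sample table in a throwaway file (for illustration only).
csv=$(mktemp)
printf '%s\n' 'Name,Age,Job' 'Tanaka,25,Engineer' 'Sato,30,Designer' 'Yamada,28,Manager' > "$csv"

# For each record, print its number, its field count, and the first and last field.
fields_out=$(awk -F ',' '{ print NR ": NF=" NF ", first=" $1 ", last=" $NF }' "$csv")
echo "$fields_out"
rm -f "$csv"
```

Because NF holds the field count, $NF always refers to the last field of the current record.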
Basic Syntax
awk 'pattern { action }' file
Run an action on lines matching a pattern.
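Either half of the pattern-action pair may be omitted. The sketch below shows all three forms on a small made-up input (the file and its values are hypothetical).

```shell
# Throwaway whitespace-separated sample data (illustrative only).
data=$(mktemp)
printf '%s\n' 'alpha 10' 'beta 20' 'gamma 30' > "$data"

# Pattern only: the default action is to print the matching record.
pattern_only=$(awk '$2 > 15' "$data")

# Action only: with no pattern, the action runs for every record.
action_only=$(awk '{ print $1 }' "$data")

# Pattern and action: a regex selects the lines, the action formats them.
both=$(awk '/^g/ { print $1 " -> " $2 }' "$data")

printf '%s\n---\n%s\n---\n%s\n' "$pattern_only" "$action_only" "$both"
rm -f "$data"
```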
Basic awk Operations
Extracting columns:
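# Print column 1 (-F ',' sets the comma as field separator; without it the whole CSV line is one field)
awk -F ',' '{print $1}' employees.csv
# Print columns 2 and 3
awk -F ',' '{print $2, $3}' employees.csv
# Print with line number
awk '{print NR ": " $0}' file.txt
Specifying delimiters: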
# Comma-separated, column 1
awk -F ',' '{print $1}' data.csv
# Colon-separated /etc/passwd, username and UID
awk -F ':' '{print $1, $3}' /etc/passwd
# Tab-separated, column 2
awk 'BEGIN {FS="\t"} {print $2}' tab_separated.txt
Conditional processing:
# Show people older than 25 (NR > 1 skips the header row)
awk -F ',' 'NR > 1 && $2 > 25 {print $1, $2}' employees.csv
# Show engineers
awk -F ',' '$3 == "Engineer" {print $1}' employees.csv
# Show lines with more than 3 fields
awk 'NF > 3 {print NR, $0}' data.txt
Calculations and Aggregation
Basic calculations:
# Sum of column 3
awk -F ',' '{sum += $3} END {print "Sum:", sum}' sales.csv
# Average of column 2 (guarded against empty input)
awk '{sum += $2; count++} END {if (count > 0) print "Avg:", sum/count}' ages.txt
# Maximum of column 2 (seeded from the first record, so negative values work)
awk 'NR == 1 || $2 > max {max = $2 + 0} END {print "Max:", max}' numbers.txt
Group-by aggregation:
# Salary sum by department
awk -F ',' '{dept[$3] += $2} END {for (d in dept) print d, dept[d]}' salary.csv
# Access count by IP
awk '{count[$1]++} END {for (c in count) print c, count[c]}' access.log
BEGIN and END Patterns
- BEGIN: Runs before processing the file
- END: Runs after processing the file
# Print header before processing data
awk 'BEGIN {print "Start", "Name", "Age"} {print NR, $1, $2}' data.txt
# Print total records after processing
awk '{count++} END {print "Total records:", count}' data.txt
# Report-style sales aggregation
awk 'BEGIN {print "=== Sales Report ==="} {total+=$3} END {print "Total:", total}' sales.txt
Advanced Usage
# Process multiple files with file name labels
awk 'FNR==1{print "=== " FILENAME " ==="} {print NR, $0}' file1.txt file2.txt
# Add pass/fail based on condition
awk '{if($2>=60) grade="Pass"; else grade="Fail"; print $1, $2, grade}' scores.txt
Mastering Associative Arrays
awk's true power lies in associative arrays (hash tables). They shine in multidimensional data processing.
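For awk implementations without nested arrays, POSIX awk offers comma-separated subscripts: `total[a, b]` stores the value under the single key `a SUBSEP b`. The sketch below is a portable variant of a region-by-month total, assuming a hypothetical CSV layout of id,region,month,amount.

```shell
# Throwaway sample data; the column layout (id,region,month,amount) is assumed.
sales=$(mktemp)
printf '%s\n' 'id,region,month,amount' '1,East,Jan,100' '2,East,Feb,50' '3,West,Jan,70' '4,East,Jan,30' > "$sales"

subsep_out=$(awk -F ',' '
NR > 1 {
    total[$2, $3] += $4           # the key is $2 SUBSEP $3
}
END {
    for (key in total) {
        split(key, part, SUBSEP)  # recover region and month from the key
        print part[1], part[2], total[key]
    }
}' "$sales" | LC_ALL=C sort)
echo "$subsep_out"
rm -f "$sales"
```

The `for (key in total)` loop visits keys in an unspecified order, which is why the output is piped through sort.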
Multidimensional aggregation (region x month sales; the nested sales[region][month] syntax requires GNU awk 4.0 or later):
awk -F, '
NR>1 {
sales[$2][$3] += $4;
total_by_region[$2] += $4;
total_by_month[$3] += $4;
grand_total += $4;
}
END {
printf "%-12s", "Region/Month";
for (month in total_by_month) printf "%10s", month;
printf "%12s\n", "Region Total";
for (region in total_by_region) {
printf "%-12s", region;
for (month in total_by_month) {
printf "%10d", (month in sales[region]) ? sales[region][month] : 0;
}
printf "%12d\n", total_by_region[region];
}
}' sales_data.csv
User-Defined Functions
Reuse complex logic with functions for maintainable code.
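awk has no keyword for declaring local variables. Instead, any extra parameter that the caller does not pass acts as a local. A minimal sketch, using a hypothetical helper named repeat:

```shell
func_out=$(awk '
# Parameters after the real arguments (here: i, result) act as local
# variables; the extra spaces before them are only a naming convention.
function repeat(str, n,    i, result) {
    result = ""
    for (i = 1; i <= n; i++) result = result str
    return result
}
BEGIN {
    i = 99                 # global i
    print repeat("ab", 3)
    print i                # still 99: the function used its own i
}')
echo "$func_out"
```

Without the extra parameters, i and result would be globals, and calling the function would silently overwrite the caller's i.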
Statistical library:
awk '
function average(arr, count, sum, i) {
sum = 0;
for (i = 1; i <= count; i++) sum += arr[i];
return sum / count;
}
function stddev(arr, count, avg, sum_sq, i) {
avg = average(arr, count);
sum_sq = 0;
for (i = 1; i <= count; i++) {
sum_sq += (arr[i] - avg) ^ 2;
}
return sqrt(sum_sq / count);
}
{
if (NF >= 2 && $2 ~ /^[0-9]+\.?[0-9]*$/) {
values[++count] = $2;
sum += $2;
}
}
END {
if (count > 0) {
printf "n=%d\n", count;
printf "Avg: %.2f\n", average(values, count);
printf "StdDev: %.2f\n", stddev(values, count);
}
}' numerical_data.txt
Stream Processing and getline
Real-time data processing and external command integration shine here.
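The building block for external command integration is `command | getline variable`, which reads one line of the command's output per call. The sketch below uses a harmless placeholder command (two echo calls) to show the read loop and the matching close().

```shell
getline_out=$(awk '
BEGIN {
    cmd = "echo one; echo two"    # placeholder command for illustration
    # getline returns 1 while a line is available, 0 at end of output.
    while ((cmd | getline line) > 0) {
        n++
        print n ": " line
    }
    close(cmd)                    # release the pipe so cmd can be run again
}')
echo "$getline_out"
```

Comparing the result with `> 0` matters because getline returns -1 on error, and a bare `while (cmd | getline line)` would then loop forever.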
Real-time log monitoring:
tail -f /var/log/apache2/access.log | awk '
BEGIN {
window_size = 300;
alert_threshold = 100;
}
{
"date +%s" | getline current_time;
close("date +%s");
access_times[current_time]++;
for (time in access_times) {
if (current_time - time > window_size) {
delete access_times[time];
}
}
total_access = 0;
for (time in access_times) total_access += access_times[time];
if (total_access > alert_threshold) {
printf "[ALERT] High traffic: %d requests in last 5 minutes\n", total_access;
}
}'
Performance Optimization
awk speed-up techniques
- Avoid unnecessary string concatenation (use arrays)
- Periodically delete from large data structures
- Process only the fields you need
- Initialize constants in BEGIN
Memory-efficient large file processing:
awk '
BEGIN {
processed = 0;
batch_size = 10000;
threshold = 0;  # minimum column-2 value to aggregate (example value; adjust as needed)
}
{
process_record($0);
processed++;
if (processed % batch_size == 0) {
cleanup_memory();
printf "Processed: %d records\n", processed > "/dev/stderr";
}
}
function process_record(record, fields) {
split(record, fields, ",");
if (fields[2] > threshold) {
summary[fields[1]] += fields[3];
}
}
function cleanup_memory( key) {
for (key in old_cache) delete old_cache[key];
}
END {
for (key in summary) printf "%s: %d\n", key, summary[key];
}' huge_data_file.csv
Advanced Output Formatting
ASCII art chart generation:
awk -F, '
NR > 1 { sales[$1] += $3; }
END {
max_sales = 0;
for (person in sales) {
if (sales[person] > max_sales) max_sales = sales[person];
}
chart_width = 50;
# Guard against division by zero when there are no positive sales
scale = (max_sales > 0) ? max_sales / chart_width : 1;
print "Sales Chart";
print "===========";
for (person in sales) {
bar_length = int(sales[person] / scale);
printf "%-10s |", person;
for (j = 1; j <= bar_length; j++) printf "█";
printf " %d\n", sales[person];
}
}' sales_report.csv