Combining find, grep, and awk - Practical Techniques for Real Work

This practical guide covers data processing patterns that combine find, grep, and awk, along with real-world use cases and performance optimization. It is aimed at engineers and data analysts who want skills they can apply immediately.

What You'll Learn

  • Practical patterns that pipe find, grep, and awk together
  • Real-world use cases and workflows organized by job role
  • Performance optimization tips for handling large datasets
  • Immediately usable data processing skills for the field

Combination Techniques

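Conclusion: Piping find, grep, and awk together solves tasks that are hard to handle with any one of them alone.

Each command covers a different step: find selects the files, grep filters the lines, and awk extracts and aggregates the fields. Tasks that are awkward with one command become manageable when the three are combined.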

Basic Pipe Patterns

find + grep:

# Find log files containing ERROR
find /var/log -name "*.log" -exec grep -l "ERROR" {} \;

# Search "password" in txt files with line numbers
# (-print0 / -0 keep file names with spaces intact; -H always prints the file name)
find /home -name "*.txt" -print0 | xargs -0 grep -Hn "password"
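
File names under /home often contain spaces, and a plain find | xargs pipeline splits such names into separate arguments. The following is a minimal sketch of the null-delimited variant, assuming a find and xargs that support -print0 and -0 (GNU and BSD both do). The directory and file names are made up for the demonstration.

```shell
# Build a throwaway directory with an awkward file name (hypothetical data).
demo_dir=$(mktemp -d)
printf 'user=alice\npassword=secret\n' > "$demo_dir/my notes.txt"
printf 'nothing to see here\n' > "$demo_dir/plain.txt"

# -print0 / -0 pass each path as one NUL-terminated argument,
# so "my notes.txt" is not split at the space.
matches=$(find "$demo_dir" -name "*.txt" -print0 | xargs -0 grep -l "password")
echo "$matches"

rm -rf "$demo_dir"
```

Only the file that actually contains the word is listed, with its full path intact.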

grep + awk:

# Extract date, time, and last field from error lines
grep "ERROR" /var/log/app.log | awk '{print $1, $2, $NF}'

# Sum CPU usage of nginx processes
# (%CPU is column 3 of ps aux; "[n]ginx" keeps the grep process itself out of the match)
ps aux | grep "[n]ginx" | awk '{sum+=$3} END {print "Total CPU:", sum "%"}'

find + awk:

# Calculate total size and count of log files
find /var -name "*.log" -printf "%s %p\n" | awk '{size+=$1; count++} END {printf "Total: %.2f MB Files: %d\n", size/1024/1024, count}'

Real-World Combined Processing

Scenario 1: Web server access analysis

Extract the top 10 IP addresses with the most errors from the last week.

find /var/log/apache2 -name "access.log*" -mtime -7 | \
xargs grep -h " 5[0-9][0-9] " | \
awk '{print $1}' | \
sort | uniq -c | \
sort -rn | \
head -10 | \
awk '{printf "%-15s %d times\n", $2, $1}'

Step-by-step explanation:

  1. find: Find access logs from the last 7 days
  2. grep: Extract lines with 5xx errors (server errors)
  3. awk: Extract IP addresses (column 1)
  4. sort | uniq -c: Count by IP address
  5. sort -rn: Sort by count descending
  6. head -10: Top 10
  7. awk: Format output
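
The steps above can be tried safely on sample data before pointing them at a real server. This sketch uses two made-up access logs in a temporary directory. It also shows why grep needs -h when it receives several files: without it, each line is prefixed with the file path and awk's $1 is no longer the bare IP.

```shell
# Two made-up access logs; only the first field (IP) and the status code matter here.
log_dir=$(mktemp -d)
cat > "$log_dir/access.log" <<'EOF'
10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET / HTTP/1.1" 500 123
10.0.0.2 - - [01/Jan/2024:10:00:01 +0000] "GET / HTTP/1.1" 200 456
10.0.0.1 - - [01/Jan/2024:10:00:02 +0000] "GET /a HTTP/1.1" 503 78
EOF
cat > "$log_dir/access.log.1" <<'EOF'
10.0.0.3 - - [31/Dec/2023:23:59:59 +0000] "GET /b HTTP/1.1" 502 90
10.0.0.1 - - [31/Dec/2023:23:59:58 +0000] "GET /c HTTP/1.1" 500 12
EOF

# grep -h drops the "file:" prefix that grep adds when given several files.
top_ips=$(find "$log_dir" -name "access.log*" -print0 |
    xargs -0 grep -h " 5[0-9][0-9] " |
    awk '{print $1}' |
    sort | uniq -c | sort -rn | head -10 |
    awk '{printf "%-15s %d times\n", $2, $1}')
echo "$top_ips"

rm -rf "$log_dir"
```

With this sample, 10.0.0.1 is listed first with three 5xx responses and 10.0.0.3 second with one.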

Scenario 2: Bulk delete old temporary files

Safely delete temporary files older than 30 days from the system.

# 1. First check target files (use exactly the same criteria as the deletion)
find /tmp /var/tmp /home -type f -name "*.tmp" -mtime +30 -size +0 -exec ls -la {} +

# 2. After confirming, run the deletion
# (-exec passes each path as a separate argument, so odd file names cannot be run as shell code)
find /tmp /var/tmp /home -type f -name "*.tmp" -mtime +30 -size +0 -exec rm -v -- {} +

Always list and verify target files before deletion. Only target files older than 30 days with size greater than zero.
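
One way to make sure the preview and the deletion never drift apart is to keep the selection criteria in a single function. The sketch below runs in a throwaway directory standing in for /tmp. The function name select_stale_tmp is hypothetical, and touch -d with a relative date assumes GNU coreutils.

```shell
# Hypothetical sandbox standing in for /tmp.
sandbox=$(mktemp -d)
printf 'stale\n' > "$sandbox/old report.tmp"
printf 'fresh\n' > "$sandbox/new.tmp"
: > "$sandbox/empty.tmp"
touch -d "40 days ago" "$sandbox/old report.tmp" "$sandbox/empty.tmp"

# One function holds the selection criteria; extra arguments become the action.
select_stale_tmp() {
    find "$sandbox" -type f -name "*.tmp" -mtime +30 -size +0 "$@"
}

# 1. Preview exactly what would be removed.
preview=$(select_stale_tmp -print)
echo "Would delete: $preview"

# 2. Delete the same set; -exec passes each path as a single argument.
select_stale_tmp -exec rm -- {} +

remaining=$(find "$sandbox" -type f | sort)
echo "$remaining"
rm -rf "$sandbox"
```

Only the old, non-empty file is removed. The recent file and the empty file are left in place.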

Scenario 3: Database connection log analysis

Analyze MySQL log connection counts by hour.

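# Requires gawk: the three-argument match() that captures groups is a GNU extension
find /var/log/mysql -name "*.log" -mtime -1 -print0 | \
xargs -0 grep -h "Connect" | \
gawk '{
    if (!match($0, /[0-9]{4}-[0-9]{2}-[0-9]{2}T([0-9]{2})/, time_parts)) next;
    hour = time_parts[1] + 0;   # "09" -> 9, so the key matches the loop index in END
    connections[hour]++;
}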
END {
    print "MySQL connections by hour (last 24h)";
    for (h = 0; h < 24; h++) {
        printf "%02d:00-%02d:59 | ", h, h;
        count = (h in connections) ? connections[h] : 0;
        printf "%5d ", count;
        for (i = 0; i < count/10; i++) printf "▓";
        printf "\n";
    }
}'
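
On systems where awk is mawk or another POSIX awk, the capture-group form of match() is not available. This is a minimal portable sketch of the same hourly bucketing that uses the two-argument match() with RSTART and substr. The sample lines are made up and only imitate the ISO-8601 timestamp style, so real log layouts may differ by MySQL version.

```shell
# Made-up lines with ISO-8601 timestamps (assumed format).
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
2024-01-15T09:12:01.000000Z    12 Connect   app@localhost on shop
2024-01-15T09:47:30.000000Z    13 Connect   app@localhost on shop
2024-01-15T14:03:11.000000Z    14 Connect   report@localhost on shop
2024-01-15T14:05:00.000000Z    14 Query     SELECT 1
EOF

hourly=$(grep -h "Connect" "$sample_log" | awk '
# Two-argument match() sets RSTART; this works in POSIX awk, mawk and gawk.
match($0, /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]/) {
    # "+ 0" turns "09" into the number 9 so it matches the loop index below.
    hour = substr($0, RSTART + 11, 2) + 0
    connections[hour]++
}
END {
    for (h = 0; h < 24; h++)
        if (h in connections) printf "%02d:00 %d\n", h, connections[h]
}')
echo "$hourly"

rm -f "$sample_log"
```

Converting the hour to a number matters: an array key of "09" would never be found by a loop that counts 0, 1, 2, and so on.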

Useful One-Liners

One-liners that come up often in day-to-day operations. Adjust paths and patterns to your environment before running them.

Disk and file management:

# Top 20 largest files
find . -type f -exec du -h {} + | sort -rh | head -20

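# Calculate total size of old log files
# (sum exact byte counts; the human-readable sizes from ls -lh such as "1.2M" cannot be added up)
find /var -type f -name "*.log" -mtime +7 -printf "%s\n" | awk '{size+=$1} END {printf "Recoverable size: %.2f MB\n", size/1024/1024}'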

Network and access analysis:

# Top 10 IPs by today's access count
grep "$(date '+%d/%b/%Y')" /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10

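# IPs with most SSH login failures
# (the IP is the word after "from"; its column number differs for "invalid user" lines)
find /var/log -name "*.log" -print0 | xargs -0 grep -h "Failed password" | awk '{for (i=1; i<NF; i++) if ($i == "from") {print $(i+1); break}}' | sort | uniq -c | sort -rn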
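
A fixed column number is fragile for sshd messages, because "Failed password for root" and "Failed password for invalid user admin" place the IP in different fields. The sketch below, on made-up log lines, picks the token that follows the word "from" instead.

```shell
# Made-up sshd lines; note the IP lands in a different column in each.
auth_sample=$(mktemp)
cat > "$auth_sample" <<'EOF'
Jan  5 10:00:01 host sshd[101]: Failed password for root from 203.0.113.7 port 4242 ssh2
Jan  5 10:00:05 host sshd[102]: Failed password for invalid user admin from 203.0.113.7 port 4243 ssh2
Jan  5 10:00:09 host sshd[103]: Failed password for invalid user test from 198.51.100.9 port 4244 ssh2
EOF

# Print the token that follows the word "from" instead of a fixed column.
failed_by_ip=$(grep -h "Failed password" "$auth_sample" |
    awk '{for (i = 1; i < NF; i++) if ($i == "from") {print $(i + 1); break}}' |
    sort | uniq -c | sort -rn | awk '{print $2, $1}')
echo "$failed_by_ip"

rm -f "$auth_sample"
```

In the second sample line, field 11 is the user name "admin", which is what a fixed $11 would have reported as an IP.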

System monitoring:

# Memory usage per process
find /proc -maxdepth 2 -name "status" 2>/dev/null | xargs grep -l "VmRSS" | xargs -I {} bash -c 'echo -n "$(basename $(dirname {})): "; grep VmRSS {}'

# Today's system error/warning sources
# (traditional syslog pads single-digit days with a space, hence %e rather than %d)
find /var/log -name "syslog*" | xargs grep -h "$(date '+%b %e')" | grep -i "error\|warn\|fail" | awk '{print $5}' | sort | uniq -c | sort -rn

Pipeline Design Patterns

Error handling and recovery patterns:

In production, failure is expected. Designs that handle errors and continue processing matter.

#!/bin/bash
# -E lets the ERR trap below also fire inside functions
set -Eeuo pipefail

handle_error() {
    echo "ERROR: pipeline failure on line $1" >&2
    exit 1
}

trap 'handle_error $LINENO' ERR

process_logs_safely() {
    local input_pattern="$1"
    local output_file="$2"
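    local temp_dir
    # mktemp creates the directory with an unpredictable name, safer than /tmp/pipeline_$$
    temp_dir=$(mktemp -d "${TMPDIR:-/tmp}/pipeline.XXXXXX")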

    find /var/log -name "$input_pattern" -type f 2>/dev/null > "$temp_dir/file_list" || {
        echo "WARNING: some files not accessible" >&2
    }

    if [[ ! -s "$temp_dir/file_list" ]]; then
        echo "ERROR: no target files" >&2
        rm -rf "$temp_dir"
        return 1
    fi

    while IFS= read -r logfile; do
        if [[ -r "$logfile" ]]; then
            grep -h "ERROR\|WARN" "$logfile" 2>/dev/null >> "$temp_dir/errors.log" || true
        fi
    done < "$temp_dir/file_list"

    if [[ -s "$temp_dir/errors.log" ]]; then
        awk '
        {
            if ($0 ~ /ERROR/) error_count++;
            if ($0 ~ /WARN/) warn_count++;
        }
        END {
            printf "ERROR: %d\n", error_count;
            printf "WARN:  %d\n", warn_count;
        }' "$temp_dir/errors.log" > "$output_file"
    fi

    rm -rf "$temp_dir"
}
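
Manual rm -rf calls on every return path are easy to miss when a function grows. A common alternative, sketched here under the assumption that the work fits in one function, is to run the body in a subshell and remove the scratch directory from an EXIT trap. The helper name count_errors and the sample log are hypothetical.

```shell
# Hypothetical helper: counts ERROR lines in the files given as arguments,
# using a private scratch directory that is removed on every exit path.
count_errors() (
    # The body runs in a subshell, so this EXIT trap fires when the
    # function finishes and does not replace the caller's own traps.
    scratch=$(mktemp -d) || exit 1
    trap 'rm -rf "$scratch"' EXIT

    # "|| true" keeps grep's exit status 1 (no match) from counting as a failure.
    grep -h "ERROR" "$@" > "$scratch/errors.log" 2>/dev/null || true
    count=$(wc -l < "$scratch/errors.log")
    echo $((count + 0))
)

demo=$(mktemp -d)
printf 'ok\nERROR disk full\nERROR timeout\n' > "$demo/app.log"
error_total=$(count_errors "$demo/app.log")
echo "errors: $error_total"
rm -rf "$demo"
```

The trade-off is that variables set inside the subshell are not visible to the caller, so results have to be returned on standard output or through a file.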

Parallel processing pipeline:

Speed up CPU-intensive processing with parallelism.

parallel_log_analysis() {
    local log_pattern="$1"
    local output_dir="$2"
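    local cpu_cores max_parallel
    cpu_cores=$(nproc)
    max_parallel=$(( cpu_cores > 1 ? cpu_cores - 1 : 1 ))   # never 0: xargs -P 0 means "unlimited"

    # -I {} already runs one command per input line, so -n 1 is not needed
    find /var/log -name "$log_pattern" -type f | \
    xargs -P "$max_parallel" -I {} bash -c '
        logfile="$1"
        output_dir="$2"
        # One result file per log file; $$ alone can collide when PIDs are reused
        result_file="$output_dir/result_$(basename "$logfile").$$.tmp"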

        grep -E "([0-9]{1,3}\.){3}[0-9]{1,3}" "$logfile" | \
        awk "
        {
            ip = \$1;
            if (match(ip, /^192\.168\./)) region = \"local\";
            else if (match(ip, /^10\./)) region = \"internal\";
            else region = \"external\";
            total_by_region[region]++;
        }
        END {
            for (region in total_by_region) {
                printf \"region:%s total:%d\n\", region, total_by_region[region];
            }
        }" > "$result_file"
    ' -- {} "$output_dir"
}
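
The parallel workers each write a partial result file, so a final step has to merge them. This is a minimal sketch of such a merge, assuming the "region:NAME total:N" line format printed above. The two partial files are made up for the demonstration.

```shell
# Two made-up partial result files in the "region:NAME total:N" format.
results_dir=$(mktemp -d)
printf 'region:local total:3\nregion:external total:10\n' > "$results_dir/result_1.tmp"
printf 'region:external total:5\nregion:internal total:2\n' > "$results_dir/result_2.tmp"

# Split on ":" or space so $2 is the region name and $4 is its count,
# then sum across every partial file and sort for stable output.
merged=$(cat "$results_dir"/result_*.tmp |
    awk -F'[: ]' '{total[$2] += $4} END {for (r in total) print r, total[r]}' |
    sort)
echo "$merged"

rm -rf "$results_dir"
```

The final sort is there because awk's for-in loop visits array keys in an unspecified order.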

Real-World Use Cases

Conclusion: Role-based examples show how the three commands help in real work.

The examples below are grouped by job role and favor practice over theory.

Web Engineer

Sudden production incident response

A "site is slow" report comes in, and the cause has to be identified quickly.

# 1. Check error logs
find /var/log/apache2 /var/log/nginx -name "*.log" | xargs grep -E "$(date '+%d/%b/%Y')" | grep -E "5[0-9][0-9]|error|timeout" | tail -50

# 2. Identify slow queries (Query_time of 5 seconds or more, compared numerically so 10s+ is included)
find /var/log/mysql -name "*slow.log" -exec cat {} + | awk '/^# Query_time:/ {slow = ($3 + 0 >= 5)} slow && !/^# (Time|User@Host):/'

# 3. Detect abnormal access patterns
grep "$(date '+%d/%b/%Y')" /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | awk '$1 > 1000 {print "Abnormal:", $2, "count:", $1}'

Checks that take 30-60 minutes by hand can be finished in about 5 minutes this way.

Monthly report generation

Summarize last month's access stats and error rates.

#!/bin/bash
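# Anchor on the 15th: "last month" alone can land in the wrong month when run on the 29th-31st (GNU date)
LAST_MONTH=$(date -d "$(date +%Y-%m-15) -1 month" '+%b/%Y')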

echo "=== $LAST_MONTH Access Stats ==="

# Total access
TOTAL_ACCESS=$(find /var/log/apache2 -name "access.log*" | xargs grep "$LAST_MONTH" | wc -l)
echo "Total access: $TOTAL_ACCESS"

# Unique visitors
UNIQUE_VISITORS=$(find /var/log/apache2 -name "access.log*" | xargs grep "$LAST_MONTH" | awk '{print $1}' | sort -u | wc -l)
echo "Unique visitors: $UNIQUE_VISITORS"

# Error rate
ERROR_COUNT=$(find /var/log/apache2 -name "access.log*" | xargs grep "$LAST_MONTH" | grep -E " [45][0-9][0-9] " | wc -l)
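if [ "$TOTAL_ACCESS" -gt 0 ]; then
    ERROR_RATE=$(echo "scale=2; $ERROR_COUNT * 100 / $TOTAL_ACCESS" | bc)
else
    ERROR_RATE=0   # no requests found; avoid dividing by zero
fi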
echo "Error rate: $ERROR_RATE%"
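
The report above scans the same logs three times. On large logs a single awk pass can compute all three figures at once. This sketch uses made-up lines in the common log format, where field 1 is the client IP, field 4 the timestamp, and field 9 the status code. The month string is passed in as an awk variable.

```shell
# Made-up access log lines (common/combined log format assumed).
month_log=$(mktemp)
cat > "$month_log" <<'EOF'
10.0.0.1 - - [03/Jan/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [04/Jan/2024:11:00:00 +0000] "GET /x HTTP/1.1" 404 128
10.0.0.1 - - [05/Jan/2024:12:00:00 +0000] "GET /y HTTP/1.1" 500 64
10.0.0.3 - - [01/Feb/2024:09:00:00 +0000] "GET / HTTP/1.1" 200 512
EOF

report=$(awk -v month="Jan/2024" '
index($4, month) {                       # $4 looks like [03/Jan/2024:10:00:00
    total++
    if (!($1 in seen)) { seen[$1] = 1; unique++ }
    if ($9 ~ /^[45][0-9][0-9]$/) errors++
}
END {
    rate = (total > 0) ? errors * 100 / total : 0   # avoid dividing by zero
    printf "total=%d unique=%d error_rate=%.2f%%\n", total, unique, rate
}' "$month_log")
echo "$report"

rm -f "$month_log"
```

Matching the status code in its own field also avoids counting response sizes such as 404 or 500 bytes as errors.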

Infrastructure Engineer

Server monitoring and maintenance

Periodic health check of multiple servers.

#!/bin/bash

echo "=== Server Health Report ==="
date

# Disk usage warning
echo "=== Disk Usage (warn at 80%+) ==="
# $5+0 forces a numeric comparison after gsub; -P keeps each filesystem on one line
df -hP | awk 'NR>1 {gsub(/%/, "", $5); if ($5+0 >= 80) printf "WARNING: %s: %s used (%s%%)\n", $6, $3, $5}'

# Memory usage
echo "=== Memory Usage ==="
free -m | awk 'NR==2{printf "Memory: %.1f%% (%dMB / %dMB)\n", $3*100/$2, $3, $2}'

# Top CPU processes
echo "=== Top 5 CPU ==="
ps aux --no-headers | sort -rn -k3 | head -5 | awk '{printf "%-10s %5.1f%% %s\n", $1, $3, $11}'

# Recent error spike check
echo "=== Errors in last hour ==="
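# %e matches the space-padded day used by traditional syslog timestamps; -r skips grep when no file matched
find /var/log -name "*.log" -mmin -60 -print0 | xargs -0 -r grep -h -E "$(date '+%b %e %H')|$(date -d '1 hour ago' '+%b %e %H')" | grep -ci error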

Data Analyst

Large data preprocessing

A CSV file of several gigabytes is too large to open in Excel, so it needs preprocessing first.

#!/bin/bash

CSV_FILE="sales_data_2024.csv"
OUTPUT_DIR="processed_data"
mkdir -p "$OUTPUT_DIR"

# File size and row count
echo "File size: $(du -h "$CSV_FILE" | cut -f1)"
echo "Total rows: $(wc -l < "$CSV_FILE")"

# Data quality check
echo "Empty rows: $(grep -c '^$' "$CSV_FILE")"
echo "Invalid rows: $(awk -F',' 'NF != 5 {count++} END {print count+0}' "$CSV_FILE")"

# Split into monthly files (assumes no quoted commas inside fields)
awk -F',' -v outdir="$OUTPUT_DIR" 'NR==1 {header=$0; next}
{
    month = substr($1,1,7);
    out = outdir "/sales_" month ".csv";   # build the path first; "print > a b c" is ambiguous in some awks
    if (!seen[month]) {
        print header > out;
        seen[month] = 1;
    }
    print $0 > out
}' "$CSV_FILE"

# Monthly summary
find "$OUTPUT_DIR" -name "sales_*.csv" | sort | while IFS= read -r file; do
    month=$(basename "$file" .csv | cut -d'_' -f2)
    # %.0f keeps large or fractional sums from being printed in exponent form
    total_sales=$(awk -F',' 'NR>1 {sum+=$4} END {printf "%.0f", sum}' "$file")
    record_count=$(( $(wc -l < "$file") - 1 ))
    printf "%s: %d records, total sales: %s\n" "$month" "$record_count" "$total_sales"
done

Industry Case Studies

Game development: large-scale log analysis

Detect cheating from 100GB/day of player action logs.

analyze_game_logs() {
    local log_date="$1"
    local output_dir="/analysis/$(date +%Y%m%d)"
    mkdir -p "$output_dir"

    find /game/logs -name "*${log_date}*.log" -type f -print0 | \
    xargs -0 grep -h "PLAYER_ACTION" | \
    awk -F'|' -v outdir="$output_dir" '
    BEGIN {
        # Write findings under the output directory created above
        levelup_log = outdir "/cheat_suspects.log";
        gold_log = outdir "/gold_anomaly.log";
    }
    {
        player_id = $3;
        action = $4;
        value = $5;

        # Detect rapid mass actions
        if (action == "LEVEL_UP") {
            player_levelups[player_id]++;
            if (player_levelups[player_id] > 10) {
                print "SUSPICIOUS_LEVELUP", player_id > levelup_log;
            }
        }

        # Abnormal currency spike
        if (action == "GOLD_CHANGE" && value > 1000000) {
            print "SUSPICIOUS_GOLD", player_id, value > gold_log;
        }

        if (!(player_id in player_actions)) player_count++;
        player_actions[player_id]++;
        total_actions++;
    }
    END {
        if (player_count == 0) exit;   # no matching lines: nothing to average
        avg_actions = total_actions / player_count;
        for (player in player_actions) {
            if (player_actions[player] > avg_actions * 5) {
                printf "HIGH_ACTIVITY: %s (%d actions)\n", player, player_actions[player];
            }
        }
    }'
}

E-commerce: customer behavior analysis

Analyze purchase patterns from EC site access logs.

analyze_customer_journey() {
    local analysis_period="$1"
    local output_dir="/analytics/customer_journey"
    mkdir -p "$output_dir"

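    # gawk is required for the three-argument match(); grep -h drops file name prefixes so $1 is the IP
    find /var/log/nginx -name "access.log*" | \
    xargs grep -h "$analysis_period" | \
    gawk '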
    BEGIN { session_timeout = 1800; }
    {
        ip = $1;
        url = $7;

        if (url ~ /\/checkout|\/purchase/) {
            purchase_sessions[ip]++;
        }

        if (url ~ /\/products\/([0-9]+)/) {
            match(url, /\/products\/([0-9]+)/, product_match);
            product_views[product_match[1]]++;
        }
    }
    END {
        print "=== Customer Journey Analysis ===";
        for (product in product_views) {
            printf "Product %s: %d views\n", product, product_views[product];
        }
    }'
}

Performance Optimization

Conclusion: Limit scope, use fixed strings, and cut waste to run fast on big data.

Basic Speed-up Techniques

# 1. Limit search scope
find /var/log -name "*.log"  # Good
# find / -name "*.log"       # Bad (slow)

# 2. Locale optimization
LC_ALL=C grep "ERROR" huge.log

# 3. Fixed strings with -F
grep -F "literal_string" file.txt

# 4. Skip unwanted directories
find /var -path "*/node_modules" -prune -o -name "*.log" -print
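
The -prune idiom is easy to get subtly wrong, so it helps to try it on a small tree first. The sketch below builds a throwaway directory (all names are made up) with one log inside node_modules and one outside, then confirms that only the latter is reported.

```shell
# Throwaway tree: one real log and one log buried in node_modules (made-up names).
tree=$(mktemp -d)
mkdir -p "$tree/app/logs" "$tree/app/node_modules/pkg"
: > "$tree/app/logs/server.log"
: > "$tree/app/node_modules/pkg/debug.log"

# -prune stops find from descending into node_modules at all;
# the explicit -print after -o keeps the pruned directory itself out of the output.
found=$(find "$tree" -path "*/node_modules" -prune -o -name "*.log" -print)
echo "$found"

rm -rf "$tree"
```

Because find never enters the pruned directory, the saving grows with the size of the skipped subtree.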

Advanced Optimization Techniques

# Parallel processing (-n caps files per grep so xargs actually starts several processes)
find /var/log -name "*.log" -print0 | xargs -0 -n 8 -P 4 grep -H "ERROR"

# Search compressed files directly
zgrep "ERROR" /var/log/app.log.gz

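# Note: --mmap is obsolete in current GNU grep (ignored or rejected), so it offers no speed-up.
# When a sample of matches is enough, stop early instead:
grep -m 100 "ERROR" huge_file.log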

Performance Measurement

# Measure execution time
time grep "ERROR" /var/log/huge.log

# Detailed resource usage
/usr/bin/time -v grep "ERROR" /var/log/huge.log

# Effect of parallel processing
for cores in 1 2 4 8; do
    echo "cores: $cores"
    time find /var/log -name "*.log" -print0 | xargs -0 -n 8 -P "$cores" grep -c "ERROR"
done

Next Steps

To take these combination techniques further, continue with the exercises and troubleshooting in the professional guide.