Combining find, grep, and awk - Practical Techniques for Real Work

This practical guide covers data processing patterns that combine find, grep, and awk, along with real-world use cases and performance optimization. The techniques are aimed at engineers and data analysts who need skills they can apply right away.

Combination Techniques

Experienced Linux users rarely run find, grep, and awk in isolation. Tasks that are awkward with a single command become straightforward once the three are combined in a pipeline.

Basic Pipe Patterns

find + grep:

# Find log files containing ERROR
find /var/log -name "*.log" -exec grep -l "ERROR" {} \;

# Search "password" in txt files with file names and line numbers
# (-print0 / -0 keep file names that contain spaces intact)
find /home -name "*.txt" -print0 | xargs -0 grep -Hn "password"

grep + awk:

# Extract date, time, and last field from error lines
grep "ERROR" /var/log/app.log | awk '{print $1, $2, $NF}'

# Sum CPU usage of nginx processes (%CPU is column 3 of ps aux;
# the [n] pattern keeps grep from matching its own process)
ps aux | grep "[n]ginx" | awk '{sum+=$3} END {print "Total CPU:", sum "%"}'

find + awk:

# Calculate total size and count of log files
find /var -name "*.log" -printf "%s %p\n" | awk '{size+=$1; count++} END {printf "Total: %.2f MB Files: %d\n", size/1024/1024, count}'

Real-World Combined Processing

Scenario 1: Web server access analysis

Extract the top 10 IP addresses with the most errors from the last week.

# grep -h drops the "file:" prefix so that $1 is always the client IP
find /var/log/apache2 -name "access.log*" -mtime -7 | \
xargs grep -h " 5[0-9][0-9] " | \
awk '{print $1}' | \
sort | uniq -c | \
sort -rn | \
head -10 | \
awk '{printf "%-15s %d times\n", $2, $1}'

Step-by-step explanation:

  1. find: Find access logs modified within the last 7 days
  2. grep: Extract lines with 5xx errors (server errors)
  3. awk: Extract IP addresses (column 1)
  4. sort | uniq -c: Count by IP address
  5. sort -rn: Sort by count descending
  6. head -10: Top 10
  7. awk: Format output
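The steps above can be sketched as a small reusable function. This is only an illustration: the function name top_error_ips and the sample log lines are hypothetical, and the pattern assumes the status code is surrounded by spaces as in the common log format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the top N client IPs that received 5xx responses.
# Usage: top_error_ips N FILE...
top_error_ips() {
    local limit="$1"
    shift
    # -h suppresses the "file:" prefix so that $1 is always the client IP
    grep -h " 5[0-9][0-9] " "$@" |
        awk '{print $1}' |
        sort | uniq -c |
        sort -rn |
        head -n "$limit" |
        awk '{printf "%-15s %d times\n", $2, $1}'
}

# Made-up sample in common log format, for demonstration only
sample=$(mktemp)
cat > "$sample" <<'EOF'
10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /a HTTP/1.1" 500 123
10.0.0.2 - - [01/Jan/2024:10:00:01 +0000] "GET /b HTTP/1.1" 200 456
10.0.0.1 - - [01/Jan/2024:10:00:02 +0000] "GET /c HTTP/1.1" 503 0
10.0.0.3 - - [01/Jan/2024:10:00:03 +0000] "GET /d HTTP/1.1" 502 10
EOF
top_error_ips 10 "$sample"
rm -f "$sample"
```

Keeping the pipeline in a function makes it easy to try against a small sample before pointing it at real logs.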

Scenario 2: Bulk delete old temporary files

Safely delete temporary files older than 30 days from the system.

# 1. First list exactly the files that would be deleted
find /tmp /var/tmp /home -type f -name "*.tmp" -mtime +30 -size +0 -exec ls -la {} +

# 2. After confirming, run the deletion with the same criteria
#    (find passes the names directly to rm, so unusual file names are handled safely)
find /tmp /var/tmp /home -type f -name "*.tmp" -mtime +30 -size +0 \
    -printf 'Delete: %p\n' -exec rm -- {} +

Always list and verify the target files before deleting anything. The deletion targets only files older than 30 days with a size greater than zero.
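One way to keep the check and the deletion in sync is to put both behind a single function that defaults to a dry run. The sketch below is an assumption about how this could be packaged: the name clean_old_tmp and the --delete switch are hypothetical, and -delete requires GNU or BSD find.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list (default) or delete non-empty *.tmp files
# older than DAYS days under DIR.
# Usage: clean_old_tmp DIR DAYS [--delete]
clean_old_tmp() {
    local dir="$1" days="$2" mode="${3:-}"
    # The same criteria are used for the dry run and for the deletion
    local -a criteria=(-type f -name "*.tmp" -mtime "+$days" -size +0)
    if [[ "$mode" == "--delete" ]]; then
        find "$dir" "${criteria[@]}" -print -delete
    else
        find "$dir" "${criteria[@]}" -print
    fi
}

# Example (dry run first, then the real thing):
#   clean_old_tmp /var/tmp 30
#   clean_old_tmp /var/tmp 30 --delete
```

Because the criteria live in one array, the list shown during the dry run cannot drift away from what the deletion removes.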

Scenario 3: Database connection log analysis

Analyze MySQL log connection counts by hour.

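# Requires GNU awk (gawk): the three-argument match() is a gawk extension
find /var/log/mysql -name "*.log" -mtime -1 -print0 | \
xargs -0 grep -h "Connect" | \
gawk '{
    # Skip lines that carry no ISO 8601 timestamp
    if (match($0, /[0-9]{4}-[0-9]{2}-[0-9]{2}T([0-9]{2})/, time_parts)) {
        # + 0 converts "09" to 9 so the key matches the numeric loop below
        hour = time_parts[1] + 0;
        connections[hour]++;
    }
}
END {
    print "MySQL connections by hour (last 24h)";
    for (h = 0; h < 24; h++) {
        printf "%02d:00-%02d:59 | ", h, h;
        count = (h in connections) ? connections[h] : 0;
        printf "%5d ", count;
        # One bar segment per 10 connections
        for (i = 0; i < count/10; i++) printf "▓";
        printf "\n";
    }
}'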
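The hourly counting above relies on a gawk extension. Where only a POSIX awk is available, the same idea can be sketched with RSTART and substr. The function name hourly_counts is hypothetical, and the sketch assumes each line contains a timestamp of the form YYYY-MM-DDTHH.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count stdin lines per hour, keyed on the first
# ISO 8601 timestamp (YYYY-MM-DDTHH) found in each line. POSIX awk only.
hourly_counts() {
    awk '
    match($0, /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]/) {
        # The hour is the last two characters of the 13-character match;
        # + 0 turns "09" into 9 so the key matches the numeric loop below
        hour = substr($0, RSTART + 11, 2) + 0
        count[hour]++
    }
    END {
        for (h = 0; h < 24; h++)
            if (h in count) printf "%02d %d\n", h, count[h]
    }'
}

# Example: grep -h "Connect" /var/log/mysql/*.log | hourly_counts
```

Converting the hour to a number before using it as an array key is what lets the lookup in the END loop find hours such as 09.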

Useful One-Liners

The following one-liners come up often in day-to-day work and can be used as they are.

Disk and file management:

# Top 20 largest files
find . -type f -exec du -h {} + | sort -rh | head -20

# Calculate total size of old log files (%s prints the size in bytes;
# ls -lh output such as "1.2M" cannot be summed)
find /var -name "*.log" -mtime +7 -printf "%s\n" | awk '{size+=$1} END {printf "Recoverable size: %.2f MB\n", size/1024/1024}'

Network and access analysis:

# Top 10 IPs by today's access count
grep "$(date '+%d/%b/%Y')" /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10

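# IPs with most SSH login failures (the address follows the word "from";
# its field position differs between valid and invalid users)
find /var/log -name "*.log" | xargs grep -h "Failed password" | awk '{for (i = 1; i < NF; i++) if ($i == "from") {print $(i+1); break}}' | sort | uniq -c | sort -rn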

System monitoring:

# Memory usage per process
find /proc -maxdepth 2 -name "status" 2>/dev/null | xargs grep -l "VmRSS" | xargs -I {} bash -c 'echo -n "$(basename $(dirname {})): "; grep VmRSS {}'

# Today's system error/warning sources
# (%e pads single-digit days with a space, as traditional syslog timestamps do)
find /var/log -name "syslog*" | xargs grep -h "$(date '+%b %e')" | grep -i "error\|warn\|fail" | awk '{print $5}' | sort | uniq -c | sort -rn

Pipeline Design Patterns

Error handling and recovery patterns:

In production, failures are to be expected, so a pipeline should detect errors, report them, and keep processing whatever input is still usable.

#!/bin/bash
set -euo pipefail

handle_error() {
    echo "ERROR: pipeline failure on line $1" >&2
    exit 1
}

trap 'handle_error $LINENO' ERR

process_logs_safely() {
    local input_pattern="$1"
    local output_file="$2"
    # mktemp -d creates a private directory with an unpredictable name
    local temp_dir
    temp_dir=$(mktemp -d "${TMPDIR:-/tmp}/pipeline.XXXXXX")

    find /var/log -name "$input_pattern" -type f 2>/dev/null > "$temp_dir/file_list" || {
        echo "WARNING: some files not accessible" >&2
    }

    if [[ ! -s "$temp_dir/file_list" ]]; then
        echo "ERROR: no target files" >&2
        rm -rf "$temp_dir"
        return 1
    fi

    while IFS= read -r logfile; do
        if [[ -r "$logfile" ]]; then
            grep -h "ERROR\|WARN" "$logfile" 2>/dev/null >> "$temp_dir/errors.log" || true
        fi
    done < "$temp_dir/file_list"

    if [[ -s "$temp_dir/errors.log" ]]; then
        awk '
        {
            if ($0 ~ /ERROR/) error_count++;
            if ($0 ~ /WARN/) warn_count++;
        }
        END {
            printf "ERROR: %d\n", error_count;
            printf "WARN:  %d\n", warn_count;
        }' "$temp_dir/errors.log" > "$output_file"
    fi

    rm -rf "$temp_dir"
}
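A related detail is that a pipeline normally reports only the exit status of its last command. The following sketch, with the hypothetical function name count_matches, inspects bash's PIPESTATUS array so that a real grep failure (status 2 or higher) is reported while "no match" (status 1) is still treated as success.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run "grep PATTERN FILE | sort | uniq -c" and report
# whether any stage really failed. grep exit status 1 (no match) is accepted.
count_matches() {
    local pattern="$1" file="$2"
    local -a status
    grep -h -- "$pattern" "$file" 2>/dev/null | sort | uniq -c
    # PIPESTATUS must be copied immediately, before any other command runs
    status=("${PIPESTATUS[@]}")
    # grep: 0 = match, 1 = no match, 2 or higher = error (e.g. unreadable file)
    if (( status[0] > 1 || status[1] != 0 || status[2] != 0 )); then
        echo "pipeline failed: ${status[*]}" >&2
        return 1
    fi
}

# Example: count_matches "ERROR" /var/log/app.log || echo "check the input file"
```

This keeps the distinction between "nothing found" and "could not read the file", which `|| true` would hide.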

Parallel processing pipeline:

Speed up CPU-intensive processing with parallelism.

parallel_log_analysis() {
    local log_pattern="$1"
    local output_dir="$2"
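    local cpu_cores
    cpu_cores=$(nproc)
    # Leave one core free, but never drop below one worker
    # (xargs treats -P 0 as "run as many processes as possible")
    local max_parallel=$(( cpu_cores > 1 ? cpu_cores - 1 : 1 ))

    mkdir -p "$output_dir"

    # -I already runs one command per input item, so -n 1 is not needed
    find /var/log -name "$log_pattern" -type f -print0 | \
    xargs -0 -P "$max_parallel" -I {} bash -c '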
        logfile="$1"
        output_dir="$2"
        result_file="$output_dir/result_$$.tmp"

        grep -E "([0-9]{1,3}\.){3}[0-9]{1,3}" "$logfile" | \
        awk "
        {
            ip = \$1;
            if (match(ip, /^192\.168\./)) region = \"local\";
            else if (match(ip, /^10\./)) region = \"internal\";
            else region = \"external\";
            total_by_region[region]++;
        }
        END {
            for (region in total_by_region) {
                printf \"region:%s total:%d\n\", region, total_by_region[region];
            }
        }" > "$result_file"
    ' -- {} "$output_dir"
}
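For simpler jobs the nested bash -c is not needed, and xargs can run grep in parallel directly. The sketch below is a reduced illustration of the same idea; the function name parallel_count is hypothetical, and the -r option assumes GNU xargs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count lines matching PATTERN in every *.log under DIR,
# running up to JOBS grep processes at once.
# Usage: parallel_count DIR PATTERN [JOBS]
parallel_count() {
    local dir="$1" pattern="$2" jobs="${3:-2}"
    # -print0 / -0 keep unusual file names intact; -n 1 gives each grep one file;
    # -H forces the "file:count" form; sort makes the output order deterministic
    find "$dir" -type f -name "*.log" -print0 |
        xargs -0 -r -n 1 -P "$jobs" grep -c -H -- "$pattern" |
        sort
}

# Example: parallel_count /var/log ERROR 4
```

Each grep prints a single short line, so the parallel workers are unlikely to interleave their output, and the final sort restores a stable order.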

Real-World Use Cases

Practice over theory. Examples by job role.

Web Engineer

Sudden production incident response

A "site is slow" report arrives, and the cause has to be identified quickly.

# 1. Check error logs
find /var/log/apache2 /var/log/nginx -name "*.log" | xargs grep -E "$(date '+%d/%b/%Y')" | grep -E "5[0-9][0-9]|error|timeout" | tail -50

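# 2. Identify slow queries (5 seconds or longer; a numeric comparison
#    also catches values of 10 seconds and above)
find /var/log/mysql -name "*slow.log" | xargs grep -h -A 5 "Query_time" | awk '/Query_time:/ {slow = 0; for (i = 1; i < NF; i++) if ($i == "Query_time:") slow = ($(i+1) >= 5)} slow'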

# 3. Detect abnormal access patterns
grep "$(date '+%d/%b/%Y')" /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | awk '$1 > 1000 {print "Abnormal:", $2, "count:", $1}'

Checks that can take 30-60 minutes by hand can often be completed in about 5 minutes this way.

Monthly report generation

Summarize last month's access stats and error rates.

#!/bin/bash
# Anchor on the 15th so that "-1 month" cannot skip a month on the 29th-31st
LAST_MONTH=$(date -d "$(date +%Y-%m-15) -1 month" '+%b/%Y')

echo "=== $LAST_MONTH Access Stats ==="

# Total access
TOTAL_ACCESS=$(find /var/log/apache2 -name "access.log*" | xargs grep "$LAST_MONTH" | wc -l)
echo "Total access: $TOTAL_ACCESS"

# Unique visitors
UNIQUE_VISITORS=$(find /var/log/apache2 -name "access.log*" | xargs grep "$LAST_MONTH" | awk '{print $1}' | sort -u | wc -l)
echo "Unique visitors: $UNIQUE_VISITORS"

# Error rate
ERROR_COUNT=$(find /var/log/apache2 -name "access.log*" | xargs grep "$LAST_MONTH" | grep -E " [45][0-9][0-9] " | wc -l)
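# Guard against division by zero when no requests were found
if [ "$TOTAL_ACCESS" -gt 0 ]; then
    ERROR_RATE=$(echo "scale=2; $ERROR_COUNT * 100 / $TOTAL_ACCESS" | bc)
else
    ERROR_RATE=0
fi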
echo "Error rate: $ERROR_RATE%"
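The report script above scans the logs once per metric. As a rough alternative, a single awk pass can compute all three figures at once. The sketch assumes the common or combined log format, where the status code is field 9, and the function name access_summary is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read access-log lines on stdin and print total requests,
# unique client IPs, and the 4xx/5xx error rate in one pass.
access_summary() {
    awk '
    {
        total++
        if (!($1 in seen)) { seen[$1] = 1; unique++ }
        # In common/combined log format the status code is field 9
        if ($9 ~ /^[45][0-9][0-9]$/) errors++
    }
    END {
        printf "Total access: %d\n", total
        printf "Unique visitors: %d\n", unique
        printf "Error rate: %.2f%%\n", (total > 0 ? errors * 100 / total : 0)
    }'
}

# Example: grep -h "Jan/2024" /var/log/apache2/access.log | access_summary
```

Matching the status code by field position also avoids counting response sizes or URLs that merely contain a three-digit number starting with 4 or 5.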

Infrastructure Engineer

Server monitoring and maintenance

Periodic health check of a server. The script below inspects the host it runs on, so run it on each server in turn.

#!/bin/bash

echo "=== Server Health Report ==="
date

# Disk usage warning
echo "=== Disk Usage (warn at 80%+) ==="
# -P keeps each file system on one line; use+0 forces a numeric comparison
df -hP | awk 'NR>1 {use=$5; sub(/%/, "", use); if (use+0 >= 80) printf "WARNING: %s: %s used (%s%%)\n", $6, $3, use}'

# Memory usage
echo "=== Memory Usage ==="
free -m | awk 'NR==2{printf "Memory: %.1f%% (%dMB / %dMB)\n", $3*100/$2, $3, $2}'

# Top CPU processes
echo "=== Top 5 CPU ==="
ps aux --no-headers | sort -rn -k3 | head -5 | awk '{printf "%-10s %5.1f%% %s\n", $1, $3, $11}'

# Recent error spike check
echo "=== Errors in last hour ==="
# %e pads single-digit days with a space, as traditional syslog timestamps do
find /var/log -name "*.log" -mmin -60 | xargs grep -h -E "$(date '+%b %e %H')|$(date -d '1 hour ago' '+%b %e %H')" | grep -ci error

Data Analyst

Large data preprocessing

A CSV file of several gigabytes cannot be opened in Excel, so it has to be preprocessed on the command line first.

#!/bin/bash

CSV_FILE="sales_data_2024.csv"
OUTPUT_DIR="processed_data"
mkdir -p "$OUTPUT_DIR"

# File size and row count
echo "File size: $(du -h "$CSV_FILE" | cut -f1)"
echo "Total rows: $(wc -l < "$CSV_FILE")"

# Data quality check
echo "Empty rows: $(grep -c '^$' "$CSV_FILE")"
echo "Invalid rows: $(awk -F',' 'NF != 5 {count++} END {print count+0}' "$CSV_FILE")"

# Split into monthly files (the output directory is passed in with -v
# instead of being spliced into the awk program text)
awk -F',' -v outdir="$OUTPUT_DIR" 'NR==1 {header=$0; next}
{
    month = substr($1,1,7);
    outfile = outdir "/sales_" month ".csv";
    if (!seen[month]) {
        print header > outfile;
        seen[month]=1;
    }
    print $0 > outfile
}' "$CSV_FILE"

# Monthly summary
find "$OUTPUT_DIR" -name "sales_*.csv" | sort | while IFS= read -r file; do
    month=$(basename "$file" .csv | cut -d'_' -f2)
    total_sales=$(awk -F',' 'NR>1 {sum+=$4} END {printf "%.2f\n", sum}' "$file")
    record_count=$(( $(wc -l < "$file") - 1 ))
    # %s keeps decimal totals intact; %d would reject them
    printf "%s: %d records, total sales: %s\n" "$month" "$record_count" "$total_sales"
done
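When the quality check reports invalid rows, it helps to know which rows they are. A minimal sketch, assuming the same naive comma split as the script above (quoted fields that contain commas are not handled); the function name csv_bad_rows is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the line numbers of rows whose comma-separated
# field count differs from EXPECTED. Empty lines count as zero fields.
# Usage: csv_bad_rows EXPECTED FILE
csv_bad_rows() {
    local expected="$1" file="$2"
    awk -F',' -v n="$expected" 'NF != n {print NR}' "$file"
}

# Example: csv_bad_rows 5 sales_data_2024.csv | head
```

The printed line numbers can be passed to sed -n to inspect the offending rows before deciding whether to repair or drop them.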

Industry Case Studies

Game development: large-scale log analysis

Detect cheating from 100GB/day of player action logs.

analyze_game_logs() {
    local log_date="$1"
    local output_dir="/analysis/$(date +%Y%m%d)"
    mkdir -p "$output_dir"

    find /game/logs -name "*${log_date}*.log" -type f -print0 | \
    xargs -0 grep -h "PLAYER_ACTION" | \
    awk -F'|' -v outdir="$output_dir" '
    BEGIN {
        # Write the suspect lists into the output directory for this run
        levelup_log = outdir "/cheat_suspects.log";
        gold_log = outdir "/gold_anomaly.log";
    }
    {
        player_id = $3;
        action = $4;
        value = $5;

        # Detect rapid mass actions (report each player once, on the 11th level-up)
        if (action == "LEVEL_UP") {
            player_levelups[player_id]++;
            if (player_levelups[player_id] == 11) {
                print "SUSPICIOUS_LEVELUP", player_id > levelup_log;
            }
        }

        # Abnormal currency spike
        if (action == "GOLD_CHANGE" && value + 0 > 1000000) {
            print "SUSPICIOUS_GOLD", player_id, value > gold_log;
        }

        if (!(player_id in player_actions)) player_count++;
        player_actions[player_id]++;
        total_actions++;
    }
    END {
        # Avoid dividing by zero when no PLAYER_ACTION lines were found
        if (player_count == 0) exit;
        avg_actions = total_actions / player_count;
        for (player in player_actions) {
            if (player_actions[player] > avg_actions * 5) {
                printf "HIGH_ACTIVITY: %s (%d actions)\n", player, player_actions[player];
            }
        }
    }'
}

E-commerce: customer behavior analysis

Analyze purchase patterns from the access logs of an e-commerce site.

analyze_customer_journey() {
    local analysis_period="$1"
    local output_dir="/analytics/customer_journey"
    mkdir -p "$output_dir"

    find /var/log/nginx -name "access.log*" -print0 | \
    xargs -0 grep -h "$analysis_period" | \
    awk '
    {
        ip = $1;
        url = $7;

        # Count IPs that reached a checkout or purchase page
        if (url ~ /\/checkout|\/purchase/) {
            purchase_sessions[ip]++;
        }

        # Extract the numeric product ID; "/products/" is 10 characters long
        if (match(url, /\/products\/[0-9]+/)) {
            product_id = substr(url, RSTART + 10, RLENGTH - 10);
            product_views[product_id]++;
        }
    }
    END {
        print "=== Customer Journey Analysis ===";
        for (product in product_views) {
            printf "Product %s: %d views\n", product, product_views[product];
        }
        buyers = 0;
        for (ip in purchase_sessions) buyers++;
        printf "IPs reaching checkout/purchase: %d\n", buyers;
    }'
}

Performance Optimization

Basic Speed-up Techniques

# 1. Limit search scope
find /var/log -name "*.log"  # Good
# find / -name "*.log"       # Bad (slow)

# 2. Locale optimization
LC_ALL=C grep "ERROR" huge.log

# 3. Fixed strings with -F
grep -F "literal_string" file.txt

# 4. Skip unwanted directories
find /var -path "*/node_modules" -prune -o -name "*.log" -print
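The -prune pattern in item 4 is easy to get wrong, so it can help to try it on a throwaway directory tree first. A minimal sketch, with the hypothetical function name find_logs_skipping:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list *.log files under DIR while skipping every
# directory named SKIP (find does not descend into pruned directories).
# Usage: find_logs_skipping DIR SKIP
find_logs_skipping() {
    local dir="$1" skip="$2"
    # Left of -o: prune matching directories. Right of -o: print log files.
    find "$dir" -type d -name "$skip" -prune -o -type f -name "*.log" -print | sort
}

# Example: find_logs_skipping /var node_modules
```

The explicit -print on the right-hand side matters: without it, find would also print the pruned directory names themselves.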

Advanced Optimization Techniques

# Parallel processing
find /var/log -name "*.log" | xargs -P 4 grep "ERROR"

# Search compressed files directly
zgrep "ERROR" /var/log/app.log.gz

# Stop reading as soon as the first match is found
# (the old --mmap option is obsolete in current GNU grep and should not be used)
grep -m 1 "ERROR" huge_file.log
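Rotated logs are often a mix of plain and gzip-compressed files, which matters for patterns such as access.log* used earlier in this guide. The sketch below shows one way to search both kinds in a single call. The function name search_rotated is hypothetical, and it assumes compressed files carry a .gz suffix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: search PATTERN in plain and gzip-compressed files alike.
# Usage: search_rotated PATTERN FILE...
search_rotated() {
    local pattern="$1" f
    shift
    for f in "$@"; do
        case "$f" in
            # Decompress to stdout and search the stream
            *.gz) gzip -cd -- "$f" | grep -- "$pattern" ;;
            *)    grep -- "$pattern" "$f" ;;
        esac
    done
}

# Example: search_rotated "ERROR" /var/log/app.log /var/log/app.log.1.gz
```

Running plain grep over a .gz file usually finds nothing useful, so dispatching on the suffix avoids silently skipping older log data.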

Performance Measurement

# Measure execution time
time grep "ERROR" /var/log/huge.log

# Detailed resource usage
/usr/bin/time -v grep "ERROR" /var/log/huge.log

# Effect of parallel processing
for cores in 1 2 4 8; do
    echo "cores: $cores"
    time find /var/log -name "*.log" | xargs -P $cores grep -c "ERROR"
done

Next Steps

To take these combination techniques further, continue with the exercises and troubleshooting in the professional guide.