grep and awk Advanced Techniques - find/grep/awk Master Series Advanced

September 16, 2025 Intermediate to Advanced ⏱ ~20-25 minutes Series 2/4

Advanced guide covering grep environment variable optimization, next-gen high-speed tools, awk associative arrays, user-defined functions, and stream processing. Master professional-level data processing techniques.

📋 Table of Contents

grep Command: The Text Search Wizard
awk Command: The Data Processing Sorcerer

4. grep Command: The Text Search Wizard

grep stands for "Global Regular Expression Print" - a command that extracts lines matching specific patterns from files or input. When combined with regular expressions, it becomes an extremely powerful search tool.

🔧 Basic Syntax

grep [options] pattern filename

Display lines containing the specified pattern

🔰 Basic Usage

String Search

grep "Linux" document.txt

Display lines containing "Linux"

grep -i "linux" document.txt

Case-insensitive search

grep -v "error" log.txt

Display lines NOT containing "error" (inverse search)

Line Numbers and Context

grep -n "function" script.js

Show line numbers with matches

grep -C 3 "ERROR" app.log

Show 3 lines before and after matches

grep -A 2 -B 1 "WARNING" app.log

Show 1 line before, 2 lines after matches

File Search and Counting

grep -l "TODO" *.js

Show only filenames containing "TODO"

grep -c "error" log.txt

Count lines containing "error"

grep -r "password" /etc/

Recursive directory search

🎯 Combining with Regular Expressions

grep's true power is unleashed when combined with regular expressions.

Basic Regular Expression Patterns

grep "^Linux" document.txt

Lines starting with "Linux"

grep "finished$" log.txt

Lines ending with "finished"

grep "^$" file.txt

Empty lines

Character Classes and Quantifiers

grep -E "[0-9]+" numbers.txt

Lines containing one or more digits (+ requires -E; in basic regex it is a literal character)

grep -E "colou?r" text.txt

"color" or "colour" (? means 0 or 1 occurrence; also requires -E)

grep -E "error|warning|fatal" log.txt

Match any of multiple patterns (OR search)
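
To see why the quantifier examples need -E, the sketch below counts matches for the same pattern in basic and extended mode. The sample lines are made up for illustration, and the behavior shown assumes GNU grep or another POSIX-style grep where an unescaped ? is literal in basic regex.

```shell
# Hypothetical sample data in a temporary file
tmp=$(mktemp)
printf '%s\n' 'color' 'colour' 'colr' 'a+b' > "$tmp"

# Basic regex: "?" is a literal character, so no line matches
bre_count=$(grep -c 'colou?r' "$tmp" || true)

# Extended regex: "?" means zero or one "u"
ere_count=$(grep -Ec 'colou?r' "$tmp" || true)

echo "basic: $bre_count, extended: $ere_count"
rm -f "$tmp"
```

The `|| true` only keeps the script going when grep exits with status 1 for "no match".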

Practical Pattern Examples

grep -E "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" access.log

Search for IP address patterns

grep -E "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" contacts.txt

Search for email address patterns

grep -E "20[0-9]{2}-[0-1][0-9]-[0-3][0-9]" log.txt

Search for date pattern (YYYY-MM-DD)

🔄 Combining grep with Pipes

By connecting with other commands through pipes, you can build powerful data processing pipelines.

Process Management Combinations

ps aux | grep "nginx"

Display only nginx processes

ps aux | grep -v "grep" | grep "python"

Display python processes excluding grep itself

Log Analysis Combinations

tail -f /var/log/app.log | grep --line-buffered "ERROR"

Monitor errors in real-time

cat access.log | grep "404" | wc -l

Count 404 error occurrences

Network Information Combinations

netstat -an | grep ":80 "

Display sockets using port 80 (add -p on Linux to show the owning process)

ifconfig | grep -E "inet [0-9]+"

Extract only IP address information

💡 Practical grep Techniques

Combining Multiple Conditions

grep "error" log.txt | grep -v "timeout"

Lines with "error" but not "timeout"

grep -E "(error|warning)" log.txt | grep "2025-01-15"

Errors or warnings on specific date

Efficient Search Configuration

grep --color=always "pattern" file.txt | less -R

Preserve color output when using less

alias grep='grep --color=auto'

Set default options with a shell alias (the GREP_OPTIONS environment variable is obsolete and ignored by GNU grep 3.6 and later)

Speed Optimization Techniques

LC_ALL=C grep "pattern" large_file.txt

Disable UTF-8 processing with locale setting for speed

grep -F "literal_string" file.txt

Fixed string search (disable regex processing)
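
Besides speed, -F changes what matches. A minimal sketch with made-up sample lines: the dot in the pattern is a wildcard in regex mode but a plain character with -F.

```shell
# Hypothetical sample data
tmp=$(mktemp)
printf '%s\n' 'a.b' 'axb' > "$tmp"

# Regex mode: "." matches any character, so both lines match
regex_count=$(grep -c 'a.b' "$tmp" || true)

# Fixed-string mode: only the literal text "a.b" matches
fixed_count=$(grep -Fc 'a.b' "$tmp" || true)

echo "regex: $regex_count, fixed: $fixed_count"
rm -f "$tmp"
```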

🚀 Advanced grep Techniques

Parallel Search Across Multiple Files

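find /var/log -name "*.log" -print0 | xargs -0 -P 4 -n 16 grep -H "ERROR"

Parallel search with up to 4 grep processes, 16 files per process (-print0 and -0 keep unusual filenames intact)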
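
The following sketch runs the same find and xargs pattern against a throwaway directory so it can be tried safely. The directory, file names, and contents are hypothetical, and -P and -0 are assumed to be available (they are in GNU and BSD xargs).

```shell
# Build a temporary directory with three hypothetical log files
dir=$(mktemp -d)
printf 'ERROR one\nINFO ok\n'     > "$dir/a.log"
printf 'INFO ok\n'                > "$dir/b.log"
printf 'ERROR two\nERROR three\n' > "$dir/my app.log"   # name contains a space

# -print0 / -0 pass NUL-separated names; -n 1 hands each grep one file,
# -P 4 allows up to four grep processes at a time
match_count=$(find "$dir" -name '*.log' -print0 |
  xargs -0 -P 4 -n 1 grep -H 'ERROR' | wc -l | tr -d ' ')

echo "matching lines: $match_count"
rm -rf "$dir"
```

Without -n (or -L), xargs may pass every file to a single grep, in which case -P has no effect.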

🎯 grep Ultimate Techniques: Professional Level

Once you're comfortable with the basics, grep's lesser-known features and advanced techniques enable expert-level data processing.

🌐 Environment Variables and Locale Optimization

For large file processing, locale settings significantly impact performance.

🐌 Slow Method (UTF-8 Processing)

grep "ERROR" huge_log.txt

Character encoding processing creates overhead

⚡ Speed Optimization (ASCII Processing)

LC_ALL=C grep "ERROR" huge_log.txt

Often noticeably faster with byte-oriented matching; the gain depends on grep version, pattern, and data, so benchmark on your own files

LC_ALL=C grep --binary-files=without-match "pattern" /var/log/*

High-speed search skipping binary files

LC_ALL=C grep -F --color=never "ERROR" *.log

Fixed-string search in the C locale with color disabled (passed as a flag, since GREP_OPTIONS is obsolete)

🔗 Pipeline Combination Mastery

Combine multiple grep commands to efficiently handle complex conditions.

🎯 Progressive Filtering

grep "ERROR" app.log | grep -v "Timeout" | grep "$(date +%Y-%m-%d)"

Extract today's ERROR lines excluding timeouts

📊 Search with Statistics

grep -oh "ERROR.*" /var/log/*.log | sort | uniq -c | sort -nr

Rank error messages by occurrence count (-o drops the varying timestamp prefix so identical errors group together)
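
A small self-contained run of the counting pipeline, using invented log lines. It shows why extracting only the message part with -o matters: whole lines would differ by timestamp and never group.

```shell
# Hypothetical log lines
tmp=$(mktemp)
printf '%s\n' \
  '2025-01-15 10:00:01 ERROR disk full' \
  '2025-01-15 10:05:12 ERROR connection refused' \
  '2025-01-15 11:20:45 ERROR disk full' \
  '2025-01-15 11:21:00 INFO started' > "$tmp"

# Keep only the message part, count identical messages, most frequent first.
# The final awk squeezes the padding that uniq -c adds before the count.
top_error=$(grep -o 'ERROR.*' "$tmp" | sort | uniq -c | sort -nr |
  head -n 1 | awk '{$1 = $1; print}')

echo "$top_error"
rm -f "$tmp"
```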

🕐 Time Series Analysis

grep "ERROR" app.log | grep -oE "[0-9]{2}:[0-9]{2}:[0-9]{2}" | cut -c1-2 | sort | uniq -c

Aggregate error occurrences by hour

⚡ Next-Gen grep: ripgrep and ag

Master alternative tools that are faster and more feature-rich than traditional grep.

🦀 ripgrep (rg) - Rust-based High-Speed grep

rg --type js "function" /var/www/

High-speed search targeting only JavaScript files

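rg --json "ERROR" /var/log/ | jq -r 'select(.type == "match") | .data.lines.text'

JSON output for structured data processing (select keeps only match messages; begin, end, and summary messages carry no line text)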

rg --stats --count "TODO" ./src/

Display search statistics and counts simultaneously

⚡ ag (The Silver Searcher)

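ag --workers 8 "pattern" /large/directory/

Search with 8 worker threads (ag is multi-threaded by default; --parallel is a different option, meant for feeding search terms from tools such as GNU parallel)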

ag --context=5 --group "ERROR" /var/log/

Display 5 lines context with grouping

📈 Performance Comparison (1GB File Search)

Tool	Execution Time	Memory Usage	Features
grep	15.2 sec	2MB	Standard, Stable
LC_ALL=C grep	8.1 sec	2MB	Optimized
ripgrep (rg)	2.3 sec	8MB	Fastest, Feature-rich
ag	4.1 sec	12MB	Fast, Developer-friendly

🧠 Complex Pattern Matching Strategies

Advanced techniques for efficiently combining multiple conditions and exclusions.

🎯 Multiple Keyword AND Conditions

grep "ERROR" app.log | grep "database" | grep "timeout"

Basic method (3 pipes)

grep -E "ERROR.*database.*timeout" app.log

Single regex in one process, but it only matches when the keywords appear in this order
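
The chained and single-regex forms are not equivalent when keyword order varies. The sketch below compares them on invented lines and adds an awk variant that checks all three keywords in any order within one process.

```shell
# Hypothetical log lines; the second has the keywords in a different order
tmp=$(mktemp)
printf '%s\n' \
  'ERROR database timeout on node1' \
  'timeout: database ERROR on node2' \
  'ERROR cache timeout' > "$tmp"

# Three chained greps: order-independent AND
chained=$(grep 'ERROR' "$tmp" | grep 'database' | grep -c 'timeout' || true)

# Single regex: requires ERROR, then database, then timeout
ordered=$(grep -Ec 'ERROR.*database.*timeout' "$tmp" || true)

# awk: order-independent AND in a single process
any_order=$(awk '/ERROR/ && /database/ && /timeout/ {n++} END {print n + 0}' "$tmp")

echo "chained=$chained ordered=$ordered any_order=$any_order"
rm -f "$tmp"
```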

🚫 Complex Exclusion Patterns

grep -v -E "(DEBUG|INFO|TRACE)" app.log | grep -v "health_check"

Multi-level exclusion filtering

📅 Time Range Search

grep -E "2024-01-(0[1-9]|[12][0-9]|3[01]) (0[89]|1[0-7]):" app.log

Extract logs for January 1-31, hours 8-17

💾 Large File Processing Mastery

Efficient methods for processing multi-GB to TB class files.

🔄 Streaming Processing

tail -f /var/log/huge.log | grep --line-buffered "ERROR"

Monitor and search logs in real-time

📦 Direct Compressed File Search

zgrep "ERROR" /var/log/app.log.gz

Search gzip-compressed files without a separate decompression step (zgrep decompresses on the fly)

bzgrep "pattern" archive.log.bz2

Direct search of bzip2 files also possible

⚡ Parallel Split Processing

split -l 1000000 huge.log chunk_ && printf '%s\0' chunk_* | xargs -0 -P 4 -n 1 grep -h "ERROR" | sort

Split a large file into 1,000,000-line chunks and search them with up to 4 parallel grep processes

🎨 Output Customization and Report Generation

Techniques for formatting search results for readability and report processing.

🌈 Color Output Optimization

GREP_COLORS='ms=1;31:mc=1;31:sl=:cx=:fn=1;32:ln=1;33:bn=1;33:se=' grep --color=always "ERROR" app.log

Custom color settings for improved visibility

📋 Structured Output Generation

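grep -Hn "ERROR" *.log | awk -F: '{msg = $0; sub(/^[^:]*:[^:]*:/, "", msg); print $1 "," $2 "," msg}' > error_report.csv

Generate a CSV error report (file, line number, full matching line, even when the line itself contains colons)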

📊 Automatic Statistical Report Generation


{
  echo "=== ERROR Analysis Report $(date) ==="
  echo "Total Errors: $(grep -c ERROR app.log)"
  echo "Unique Errors: $(grep -o 'ERROR.*' app.log | sort -u | wc -l)"
  echo "Top 5 Errors:"
  grep -o 'ERROR.*' app.log | sort | uniq -c | sort -nr | head -5
}

Generate comprehensive error analysis report

5. awk Command: The Data Processing Sorcerer

awk is named after its creators, Alfred Aho, Peter Weinberger, and Brian Kernighan. It is a powerful text processing language that excels at processing CSV files and log files.

🔧 Basic Concepts

📊 Understanding awk

awk divides input into records (usually lines) and fields (usually columns) for processing.

Data Structure Example

name,age,occupation
Tanaka,25,Engineer
Sato,30,Designer
Yamada,28,Manager

$1: 1st field (name)
$2: 2nd field (age)
$3: 3rd field (occupation)
$0: Entire record
NF: Number of fields
NR: Record number
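
As a quick check of these variables, the sketch below feeds the sample data above to awk with a comma delimiter and prints the record number, first field, and field count for each data row.

```shell
# The sample data from above, held in a shell variable
data='name,age,occupation
Tanaka,25,Engineer
Sato,30,Designer
Yamada,28,Manager'

# -F, splits on commas; NR > 1 skips the header row
field_report=$(printf '%s\n' "$data" |
  awk -F, 'NR > 1 {print NR ": " $1 " (" NF " fields)"}')

echo "$field_report"
```

Because the header is record 1, the first data row is reported as record 2.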

🔧 Basic Syntax

awk 'pattern { action }' filename

Execute action on lines matching pattern

🔰 Basic awk Operations

Column Extraction

awk -F, '{print $1}' employees.csv

Display only 1st column (name); -F, is needed because the file is comma-separated

awk -F, '{print $2, $3}' employees.csv

Display 2nd and 3rd columns

awk '{print NR ": " $0}' file.txt

Display entire content with line numbers

Specifying Delimiters

awk -F ',' '{print $1}' data.csv

Display 1st column of comma-separated file

awk -F ':' '{print $1, $3}' /etc/passwd

Display username and UID from colon-separated file

awk 'BEGIN {FS="\t"} {print $2}' tab_separated.txt

Display 2nd column of tab-separated file

Conditional Processing

awk -F, 'NR > 1 && $2 > 25 {print $1, $2}' employees.csv

Display name and age for people over 25 (NR > 1 skips the header row)

awk -F, '$3 == "Engineer" {print $1}' employees.csv

Display names of engineers

awk 'NF > 3 {print NR, $0}' data.txt

Display lines with more than 3 fields with line numbers
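
One pitfall with numeric conditions on CSV files is the header row: a non-numeric field such as "age" is compared as a string, and in the C locale "age" sorts after "25". The sketch below, using the sample employee data, counts matches with and without skipping the header.

```shell
data='name,age,occupation
Tanaka,25,Engineer
Sato,30,Designer
Yamada,28,Manager'

# Header included: "age" > 25 is a string comparison and evaluates true
with_header=$(printf '%s\n' "$data" |
  LC_ALL=C awk -F, '$2 > 25 {n++} END {print n + 0}')

# Header skipped: only the numeric rows are tested
without_header=$(printf '%s\n' "$data" |
  LC_ALL=C awk -F, 'NR > 1 && $2 > 25 {n++} END {print n + 0}')

echo "with header: $with_header, without header: $without_header"
```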

📊 Calculation and Aggregation

One of awk's powerful features is numerical calculation and aggregation.

Basic Calculations

awk -F, '{sum += $3} END {print "Total:", sum}' sales.csv

Calculate sum of 3rd column (sales, etc.)

awk '{sum += $2; count++} END {print "Average:", sum/count}' ages.txt

Calculate average of 2nd column

awk 'NR == 1 || $2 > max {max = $2} END {print "Max:", max}' numbers.txt

Find maximum value in 2nd column (seeding max from the first record keeps it correct when all values are negative)

Group Aggregation

awk -F, '{dept[$3] += $2} END {for (d in dept) print d, dept[d]}' salary.csv

Calculate total salary by department

awk '{count[$1]++} END {for (c in count) print c, count[c]}' access.log

Count access by IP address
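
The order of `for (key in array)` is unspecified in awk, so grouped output is usually piped through sort. A minimal sketch with invented access-log lines:

```shell
# Count requests per client IP (first field), then sort for stable output
ip_counts=$(printf '%s\n' \
  '10.0.0.1 GET /index.html' \
  '10.0.0.2 GET /about.html' \
  '10.0.0.1 GET /contact.html' |
  awk '{count[$1]++} END {for (ip in count) print ip, count[ip]}' |
  LC_ALL=C sort)

echo "$ip_counts"
```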

Complex Processing Examples

awk -F, 'NR>1 {sales[$2]+=$4; count[$2]++} END {for(region in sales) printf "%s: Sales %d Count %d Avg %.1f\n", region, sales[region], count[region], sales[region]/count[region]}' sales_data.csv

Regional sales statistics (total, count, average)

🎭 BEGIN and END Patterns

Using Special Patterns

BEGIN Pattern

Execute before file processing

awk 'BEGIN {print "No.", "Name", "Age"} {print NR, $1, $2}' data.txt

Output header before processing data

END Pattern

Execute after file processing

awk '{count++} END {print "Total Records:", count}' data.txt

Display total record count after processing

Combined Example

awk 'BEGIN {print "=== Sales Report ==="} {total+=$3} END {print "Total Sales:", total, "yen"}' sales.txt

Sales aggregation in report format

🚀 Advanced awk Techniques

📊 Processing Multiple Files

awk 'FNR==1{print "=== " FILENAME " ==="} {print NR, $0}' file1.txt file2.txt

Process multiple files with filename labels
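
The difference between NR and FNR is easiest to see side by side: NR keeps counting across files, while FNR restarts at 1 for each file. The sketch uses two throwaway files with made-up contents.

```shell
# Two hypothetical input files with two lines each
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/file1.txt"
printf 'c\nd\n' > "$dir/file2.txt"

# Print per-file line number, overall record number, and the line itself
numbered=$(awk '{print FNR ":" NR ":" $0}' "$dir/file1.txt" "$dir/file2.txt")
last_line=$(printf '%s\n' "$numbered" | tail -n 1)

echo "$numbered"
rm -rf "$dir"
```

The last line of file2.txt is line 2 of its file but record 4 overall.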

🔄 Conditional Branching and Functions

awk '{if($2>=60) grade="Pass"; else grade="Fail"; print $1, $2, grade}' scores.txt

Add judgment result based on conditions

📅 Date/Time Processing

awk '{cmd = "date -d \"" $1 "\" +%w"; cmd | getline weekday; close(cmd); print $0, weekday}' dates.txt

Append the day of week (0 = Sunday) for the date in the 1st field; requires GNU date, and close(cmd) lets repeated dates be looked up again

🥋 awk Black Belt Level: Data Processing Mastery

Once you're comfortable with the basics, move on to awk's lesser-known capabilities and advanced programming techniques.

🧠 Complete Associative Array Mastery

awk's true power lies in associative arrays (hash tables). They excel at multi-dimensional data processing.

📊 Multi-dimensional Aggregation (Sales by Region × Month)

gawk -F, '
NR>1 {
    # sales[region][month] += sales_amount (arrays of arrays require gawk 4.0+)
    sales[$2][$3] += $4;
    total_by_region[$2] += $4;
    total_by_month[$3] += $4;
    grand_total += $4;
}
END {
    # Header output
    printf "%-12s", "Region/Month";
    for (month in total_by_month) printf "%10s", month;
    printf "%12s\n", "Region Total";

    # Data output
    for (region in total_by_region) {
        printf "%-12s", region;
        for (month in total_by_month) {
            printf "%10d", (month in sales[region]) ? sales[region][month] : 0;
        }
        printf "%12d\n", total_by_region[region];
    }

    # Month totals output
    printf "%-12s", "Month Total";
    for (month in total_by_month) printf "%10d", total_by_month[month];
    printf "%12d\n", grand_total;
}' sales_data.csv

Generate cross-tabulation from CSV sales data
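
Where gawk is not available, the same two-key aggregation can be sketched with POSIX awk's comma subscripts, which join the keys with the built-in SUBSEP character. The column layout (id, region, month, amount) and the sample rows are assumptions for illustration.

```shell
# Hypothetical sales rows: id,region,month,amount
cross_tab=$(printf '%s\n' \
  'id,region,month,amount' \
  '1,East,Jan,100' \
  '2,East,Feb,50' \
  '3,West,Jan,70' \
  '4,East,Jan,25' |
  awk -F, '
  NR > 1 { sales[$2, $3] += $4 }          # key is region SUBSEP month
  END {
      for (key in sales) {
          split(key, part, SUBSEP)        # recover region and month
          print part[1], part[2], sales[key]
      }
  }' | LC_ALL=C sort)

echo "$cross_tab"
```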

🔍 Duplicate Data Detection and Statistics

awk '
{
    # Count occurrences of entire line
    count[$0]++;
    # Record line number of first occurrence
    if (!first_occurrence[$0]) {
        first_occurrence[$0] = NR;
    }
}
END {
    print "=== Duplicate Data Analysis Report ===";
    duplicates = 0;
    unique_count = 0;

    for (line in count) {
        if (count[line] > 1) {
            printf "Duplicate: %s (Count: %d, First Line: %d)\n",
                   line, count[line], first_occurrence[line];
            duplicates++;
        } else {
            unique_count++;
        }
    }

    printf "\nStatistics:\n";
    printf "Total Lines: %d\n", NR;
    printf "Unique Lines: %d\n", unique_count;
    printf "Duplicate Patterns: %d\n", duplicates;
    printf "Data Duplication Rate: %.2f%%\n", (duplicates * 100.0) / (unique_count + duplicates);
}' data_file.txt

Detect data duplicates and generate detailed statistical report
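
When only de-duplication is needed rather than a report, the same associative-array idea shrinks to a well-known one-liner. A sketch with made-up input:

```shell
# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++ is
# true only for first occurrences; input order is preserved, unlike sort -u
deduped=$(printf '%s\n' a b a c b | awk '!seen[$0]++' | paste -sd ' ' -)

echo "$deduped"
```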

🔧 User-Defined Functions and Modularization

Functionalize complex processing for reuse and create maintainable code.

📅 Date Processing Library

awk '
# Date validity check function
function is_valid_date(date_str,    parts, year, month, day, month_days, max_day) {
    if (split(date_str, parts, "-") != 3) return 0;

    # Adding 0 forces numeric values, so month "01" indexes month_days[1]
    year = parts[1] + 0; month = parts[2] + 0; day = parts[3] + 0;
    if (year < 1900 || year > 2100) return 0;
    if (month < 1 || month > 12) return 0;

    # Check days in month (consider leap years)
    split("31,28,31,30,31,30,31,31,30,31,30,31", month_days, ",");

    if (month == 2 && is_leap_year(year)) {
        max_day = 29;
    } else {
        max_day = month_days[month];
    }

    return (day >= 1 && day <= max_day);
}

# Leap year determination function
function is_leap_year(year) {
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

# Date difference calculation function (simplified): date2 minus date1 in days
# Requires GNU date; timestamps are declared as locals to avoid leaking globals
function date_diff_days(date1, date2,    cmd, timestamp1, timestamp2) {
    cmd = sprintf("date -d \"%s\" +%%s", date1);
    cmd | getline timestamp1; close(cmd);

    cmd = sprintf("date -d \"%s\" +%%s", date2);
    cmd | getline timestamp2; close(cmd);

    return int((timestamp2 - timestamp1) / 86400);
}

# Main processing
{
    if (is_valid_date($1)) {
        # Listed date minus today, so a positive diff means the date is in the future
        diff = date_diff_days("'$(date +%Y-%m-%d)'", $1);
        printf "%s: %s (%d days %s)\n", $1,
               (diff >= 0) ? "Future" : "Past",
               (diff < 0) ? -diff : diff,
               (diff >= 0) ? "from now" : "ago";
    } else {
        printf "%s: Invalid date format\n", $1;
    }
}' date_list.txt

Function library for date validation, leap year check, and date difference calculation
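
User-defined functions can be exercised on their own from a BEGIN block, with no input file. The sketch below checks the leap year rule from the library above against four years chosen to hit each branch.

```shell
# 1900: divisible by 100 but not 400 -> not leap
# 2000: divisible by 400             -> leap
# 2024: divisible by 4, not by 100   -> leap
# 2025: not divisible by 4           -> not leap
leap_flags=$(awk '
function is_leap_year(year) {
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0)
}
BEGIN {
    print is_leap_year(1900), is_leap_year(2000), is_leap_year(2024), is_leap_year(2025)
}')

echo "$leap_flags"
```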

🔢 Statistical Calculation Library

awk '
# Calculate array average
function average(arr, count,    sum, i) {
    sum = 0;
    for (i = 1; i <= count; i++) sum += arr[i];
    return sum / count;
}

# Calculate array standard deviation (population form: divides by n, not n-1)
function stddev(arr, count,    avg, sum_sq, i) {
    avg = average(arr, count);
    sum_sq = 0;
    for (i = 1; i <= count; i++) {
        sum_sq += (arr[i] - avg) ^ 2;
    }
    return sqrt(sum_sq / count);
}

# Calculate array median
function median(arr, count,    temp_arr, i, j, tmp) {
    # Copy array and sort
    for (i = 1; i <= count; i++) temp_arr[i] = arr[i];

    # Simple exchange sort, O(n^2) (for small arrays)
    for (i = 1; i <= count; i++) {
        for (j = i + 1; j <= count; j++) {
            if (temp_arr[i] > temp_arr[j]) {
                tmp = temp_arr[i];
                temp_arr[i] = temp_arr[j];
                temp_arr[j] = tmp;
            }
        }
    }

    if (count % 2 == 1) {
        return temp_arr[int(count/2) + 1];
    } else {
        return (temp_arr[count/2] + temp_arr[count/2 + 1]) / 2;
    }
}

# Data collection
{
    if (NF >= 2 && $2 ~ /^[0-9]+\.?[0-9]*$/) {
        values[++count] = $2;
        sum += $2;
        if (min == "" || $2 < min) min = $2;
        if (max == "" || $2 > max) max = $2;
    }
}

END {
    if (count > 0) {
        printf "Statistical Summary (n=%d)\n", count;
        printf "==================\n";
        printf "Min:     %8.2f\n", min;
        printf "Max:     %8.2f\n", max;
        printf "Mean:    %8.2f\n", average(values, count);
        printf "Median:  %8.2f\n", median(values, count);
        printf "Std Dev: %8.2f\n", stddev(values, count);
        printf "Total:   %8.2f\n", sum;
    }
}' numerical_data.txt

Function suite for numerical data statistical analysis (mean, median, standard deviation, etc.)
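
For mean and population standard deviation alone, a single-pass variant avoids storing the values. This is an alternative to the array-based functions above, not a replacement for them; it uses the identity variance = mean of squares minus square of mean, and the input numbers are made up so the result is easy to check by hand.

```shell
# Values 2 4 4 4 5 5 7 9: sum = 40, sum of squares = 232, n = 8
summary=$(printf '%s\n' 2 4 4 4 5 5 7 9 | awk '
{ n++; sum += $1; sum_sq += $1 * $1 }
END {
    mean = sum / n
    variance = sum_sq / n - mean * mean   # population variance
    printf "%.2f %.2f\n", mean, sqrt(variance)
}')

echo "$summary"
```

This shortcut can lose precision when values are large and close together, in which case the two-pass approach in the library is the safer choice.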

🌊 Stream Processing and getline Utilization

Techniques that excel at real-time data processing and external command integration.

📡 Real-Time Log Monitoring

# Real-time monitoring with tail -f (systime and strftime require gawk or mawk)
tail -f /var/log/apache2/access.log | awk '
BEGIN {
    # Time window setting (5 minutes)
    window_size = 300;
    alert_threshold = 100;
}

{
    # Only handle lines whose 4th field looks like an Apache timestamp,
    # e.g. [10/Oct/2025:13:55:36 (the closing bracket belongs to field 5)
    if ($4 ~ /^\[/) {
        # Get current time (built-in; avoids spawning date for every line)
        current_time = systime();

        # Record access
        access_times[current_time]++;

        # Delete old data (older than 5 minutes)
        for (time in access_times) {
            if (current_time - time > window_size) {
                delete access_times[time];
            }
        }

        # Count current window access
        total_access = 0;
        for (time in access_times) {
            total_access += access_times[time];
        }

        # Alert determination
        if (total_access > alert_threshold) {
            printf "[ALERT] %s: High traffic detected - %d requests in last 5 minutes\n",
                   strftime("%Y-%m-%d %H:%M:%S", current_time), total_access > "/dev/stderr";
        }

        # Regular report (at most once per minute)
        if (current_time - last_report >= 60) {
            printf "[INFO] %s: Current window traffic: %d requests\n",
                   strftime("%Y-%m-%d %H:%M:%S", current_time), total_access;
            last_report = current_time;
        }
    }
}'

Real-time web server log monitoring with high-load alerts

🔄 External API Integration Data Processing

awk -F, '
# IP geo-location function (results cached per IP, at most 100 lookups per run)
function get_geo_info(ip,    cmd, result) {
    if (ip in geo_cache) return geo_cache[ip];
    if (api_calls >= 100) return "";  # Respect API rate limits

    # The csv endpoint returns the requested fields on one comma-separated line
    cmd = sprintf("curl -s \"http://ip-api.com/csv/%s?fields=country,regionName,city\"", ip);
    result = "";
    cmd | getline result;
    close(cmd);
    api_calls++;

    # Cache result
    geo_cache[ip] = result;
    return result;
}

# Main processing (access log analysis)
NR > 1 {
    ip = $1;
    url = $7;
    status = $9;

    # Get geo info (empty once the lookup budget is used up)
    geo_info = get_geo_info(ip);
    if (geo_info != "") {
        split(geo_info, geo_parts, ",");
        country = geo_parts[1];
        region = geo_parts[2];
        city = geo_parts[3];

        # Country statistics
        country_stats[country]++;
        if (status >= 400) {
            country_errors[country]++;
        }
    }

    # URL statistics
    url_stats[url]++;
    if (status >= 400) {
        url_errors[url]++;
    }
}

END {
    print "=== Geographic Access Analysis ===";
    for (country in country_stats) {
        error_rate = (country in country_errors) ?
                     (country_errors[country] * 100.0 / country_stats[country]) : 0;
        printf "%-20s: %6d Access (Error Rate: %5.1f%%)\n",
               country, country_stats[country], error_rate;
    }

    print "\n=== Problematic URLs ===";
    for (url in url_stats) {
        if (url in url_errors && url_errors[url] > 10) {
            error_rate = url_errors[url] * 100.0 / url_stats[url];
            printf "%-50s: Errors %3d/%3d (%.1f%%)\n",
                   url, url_errors[url], url_stats[url], error_rate;
        }
    }
}' access_log.csv

Add IP geo-location to access logs and analyze error rates by country
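
The caching pattern can be tried without network access by swapping the API call for a cheap local command. In the sketch below, `lookup` and the `echo | tr` command are hypothetical stand-ins for the geo lookup; the point is that the external command runs once per distinct key.

```shell
result=$(printf '%s\n' a b a a b | awk '
function lookup(key,    cmd, value) {
    if (key in cache) return cache[key]     # cache hit: no command is run
    cmd = "echo " key " | tr a-z A-Z"       # stand-in for a slow external call
    cmd | getline value
    close(cmd)                              # close so the same command can run again
    calls++
    cache[key] = value
    return value
}
{ printf "%s ", lookup($1) }
END { printf "calls=%d\n", calls }')

echo "$result"
```

Five input lines trigger only two external calls. Building a command line from input data is only safe when the keys are trusted, so real input such as IP addresses should be validated before it reaches the shell.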

🚀 Performance Optimization and Memory Management

Techniques to maximize speed and memory efficiency for large data processing.

⚡ High-Speed String Processing

🐌 Slow Method

# Repeated string concatenation (slow)
awk '{
    result = "";
    for (i = 1; i <= NF; i++) {
        result = result $i " ";  # Creates new string each time
    }
    print result;
}'

⚡ Fast Method

# Efficient string processing with arrays
awk '{
    for (i = 1; i <= NF; i++) {
        words[i] = $i;  # Store in array
    }
    # Join and output at once
    for (i = 1; i <= NF; i++) {
        printf "%s%s", words[i], (i < NF) ? " " : "\n";
    }
    # Clear array (save memory)
    delete words;
}'

💾 Memory-Efficient Large File Processing

awk '
BEGIN {
    # Processed record counter
    processed = 0;
    batch_size = 10000;
}

{
    # Process record
    process_record($0);
    processed++;

    # Batch processing (memory usage control)
    if (processed % batch_size == 0) {
        # Periodically delete unnecessary data
        cleanup_memory();

        # Progress report
        printf "Processing: %d records completed (%.1f MB processed)\n",
               processed, processed * length($0) / 1024 / 1024 > "/dev/stderr";
    }
}

function process_record(record,    fields) {
    # Process only necessary fields
    split(record, fields, ",");

    # Aggregate only rows whose 2nd field exceeds threshold
    # (threshold is not set in this script, so it compares as 0; pass one with -v threshold=N)
    if (fields[2] > threshold) {
        summary[fields[1]] += fields[3];
    }

    # Local variables automatically deleted
}

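function cleanup_memory(    key) {
    # Delete old data or unnecessary cache
    for (key in old_cache) {
        delete old_cache[key];
    }

    # awk releases deleted elements itself; there is no manual garbage collection.
    # (Writing to /proc/sys/vm/drop_caches only flushes the kernel page cache,
    # needs root, and does not reduce awk memory use, so it is not called here.)
}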

END {
    # Final results output
    for (key in summary) {
        printf "%s: %d\n", key, summary[key];
    }
    printf "Total Records Processed: %d\n", processed > "/dev/stderr";
}' huge_data_file.csv

Efficiently process large CSV files with controlled memory usage

🎨 Advanced Output Formatting

Professional report generation and data visualization techniques.

📊 ASCII Art Chart Generation

awk -F, '
NR > 1 {
    sales[$1] += $3;      # Total sales by salesperson
    total_sales += $3;    # Grand total used by the summary lines in END
}

END {
    # Find maximum value
    max_sales = 0;
    for (person in sales) {
        if (sales[person] > max_sales) {
            max_sales = sales[person];
        }
    }

    # Chart settings
    chart_width = 50;
    scale = max_sales / chart_width;

    print "Sales Performance Chart";
    print "================";
    printf "Scale: 1 character = %.0f (10k units)\n\n", scale / 10000;

    # Prepare parallel arrays for sorting by sales
    # (names are kept alongside values so equal sales figures do not collide)
    n = 0;
    for (person in sales) {
        n++;
        sorted_sales[n] = sales[person];
        sorted_name[n] = person;
    }

    # Exchange sort (descending sales), swapping names together with values
    for (i = 1; i <= n; i++) {
        for (j = i + 1; j <= n; j++) {
            if (sorted_sales[i] < sorted_sales[j]) {
                tmp = sorted_sales[i];
                sorted_sales[i] = sorted_sales[j];
                sorted_sales[j] = tmp;
                tmp = sorted_name[i];
                sorted_name[i] = sorted_name[j];
                sorted_name[j] = tmp;
            }
        }
    }

    # Chart output
    for (i = 1; i <= n; i++) {
        current_sales = sorted_sales[i];
        person = sorted_name[i];
        bar_length = (scale > 0) ? int(current_sales / scale) : 0;

        printf "%-10s |", person;
        for (j = 1; j <= bar_length; j++) printf "█";
        printf " %d (10k)\n", current_sales / 10000;
    }

    print "";
    printf "Total Sales: %d (10k)\n", total_sales / 10000;
    printf "Average Sales: %.1f (10k)\n", (total_sales / n) / 10000;
}' sales_report.csv

Generate ASCII art bar chart from sales data
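
A stripped-down version of the bar rendering is sketched below: one `#` per unit, with no scaling or sorting. The names and values are invented, and `#` is used instead of a block character to stay within ASCII.

```shell
chart=$(printf '%s\n' 'alice 3' 'bob 5' | awk '
{
    bar = ""
    for (i = 0; i < $2; i++) bar = bar "#"   # one character per unit
    printf "%-6s|%s %d\n", $1, bar, $2       # left-align names in 6 columns
}')

echo "$chart"
```

For real data, dividing each value by a scale factor (as the full script does with max_sales / chart_width) keeps the longest bar within the terminal width.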
