find/grep/awk Master Series: Advanced
grep/awk Ultimate Techniques

Advanced guide covering grep environment variable optimization, next-gen high-speed tools, awk associative arrays, user-defined functions, and stream processing. Master professional-level data processing techniques.

📋 Table of Contents

  1. grep Command: The Text Search Wizard
  2. awk Command: The Data Processing Sorcerer

1. grep Command: The Text Search Wizard

grep stands for "Global Regular Expression Print" - a command that extracts lines matching specific patterns from files or input. When combined with regular expressions, it becomes an extremely powerful search tool.

🔧 Basic Syntax

grep [options] pattern filename

Display lines containing the specified pattern

🔰 Basic Usage

String Search

grep "Linux" document.txt

Display lines containing "Linux"

grep -i "linux" document.txt

Case-insensitive search

grep -v "error" log.txt

Display lines NOT containing "error" (inverse search)

Line Numbers and Context

grep -n "function" script.js

Show line numbers with matches

grep -C 3 "ERROR" app.log

Show 3 lines before and after matches

grep -A 2 -B 1 "WARNING" app.log

Show 1 line before, 2 lines after matches

File Search and Counting

grep -l "TODO" *.js

Show only filenames containing "TODO"

grep -c "error" log.txt

Count lines containing "error"

grep -r "password" /etc/

Recursive directory search

🎯 Combining with Regular Expressions

grep's true power is unleashed when combined with regular expressions.

Basic Regular Expression Patterns

grep "^Linux" document.txt

Lines starting with "Linux"

grep "finished$" log.txt

Lines ending with "finished"

grep "^$" file.txt

Empty lines

Character Classes and Quantifiers

grep "[0-9]+" numbers.txt

Lines containing one or more digits

grep "colou?r" text.txt

"color" or "colour" (? means 0 or 1 occurrence)

grep -E "error|warning|fatal" log.txt

Match any of multiple patterns (OR search)

Practical Pattern Examples

grep -E "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" access.log

Search for IP address patterns

grep -E "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" contacts.txt

Search for email address patterns

grep -E "20[0-9]{2}-[0-1][0-9]-[0-3][0-9]" log.txt

Search for date pattern (YYYY-MM-DD)

🔄 Combining grep with Pipes

By connecting with other commands through pipes, you can build powerful data processing pipelines.

Process Management Combinations

ps aux | grep "nginx"

Display only nginx processes

ps aux | grep -v "grep" | grep "python"

Display python processes excluding grep itself
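
If pgrep is installed (part of procps on most Linux systems), it sidesteps the self-match problem entirely; a minimal alternative:

pgrep -af python

List matching processes with PID and full command line, without ever matching the grep/pgrep process itself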

Log Analysis Combinations

tail -f /var/log/app.log | grep --line-buffered "ERROR"

Monitor errors in real-time

cat access.log | grep "404" | wc -l

Count 404 error occurrences

Network Information Combinations

netstat -an | grep ":80 "

Display connections and listening sockets on port 80 (netstat -an alone does not show process names)
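
On modern Linux systems, ss from iproute2 is the usual replacement for netstat; a rough equivalent (the -p flag, which adds process names, may require root):

ss -tlnp | grep ":80 "

List TCP listening sockets on port 80, including the owning process where permitted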

ifconfig | grep -E "inet [0-9]+"

Extract only IP address information

💡 Practical grep Techniques

Combining Multiple Conditions

grep "error" log.txt | grep -v "timeout"

Lines with "error" but not "timeout"

grep -E "(error|warning)" log.txt | grep "2025-01-15"

Errors or warnings on specific date

Efficient Search Configuration

grep --color=always "pattern" file.txt | less -R

Preserve color output when using less

GREP_OPTIONS="--color=auto" grep "pattern" file.txt

Set default options via environment variable (deprecated and removed in current GNU grep; see the alias below)
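
Since GREP_OPTIONS is no longer honored by current GNU grep, a shell alias is the usual way to set defaults now; for example:

alias grep='grep --color=auto'

Add to ~/.bashrc (or your shell's startup file) to enable colored matches by default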

Speed Optimization Techniques

LC_ALL=C grep "pattern" large_file.txt

Disable UTF-8 processing with locale setting for speed

grep -F "literal_string" file.txt

Fixed string search (disable regex processing)
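
When you have many literal strings to look for, -F pairs well with -f to read them from a file (patterns.txt here is an assumed file with one string per line):

grep -Ff patterns.txt file.txt

Match lines containing any of the fixed strings listed in patterns.txt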

🚀 Advanced grep Techniques

Parallel Search Across Multiple Files
find /var/log -name "*.log" | xargs -P 4 grep "ERROR"

Parallel search with 4 processes
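
If filenames may contain spaces, the null-delimited form is safer, and -H keeps the filename prefix even when a worker receives a single file:

find /var/log -name "*.log" -print0 | xargs -0 -P 4 grep -H "ERROR"

Same parallel search, robust against unusual filenames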

🎯 grep Ultimate Techniques: Professional Level

Once you've mastered the basics, move on to grep's hidden features and advanced techniques for expert-level data processing.

🌐 Environment Variables and Locale Optimization

For large file processing, locale settings significantly impact performance.

🐌 Slow Method (UTF-8 Processing)
grep "ERROR" huge_log.txt

Character encoding processing creates overhead

⚡ Speed Optimization (ASCII Processing)
LC_ALL=C grep "ERROR" huge_log.txt

Up to 10x faster with ASCII processing

LC_ALL=C grep --binary-files=without-match "pattern" /var/log/*

High-speed search skipping binary files

LC_ALL=C grep --color=never -F "ERROR" *.log

Further speed improvement by disabling color

🔗 Pipeline Combination Mastery

Combine multiple grep commands to efficiently handle complex conditions.

🎯 Progressive Filtering
grep "ERROR" app.log | grep -v "Timeout" | grep "$(date +%Y-%m-%d)"

Extract today's ERROR lines excluding timeouts

📊 Search with Statistics
grep -h "ERROR" /var/log/*.log | sort | uniq -c | sort -nr

Rank error types by occurrence count

🕐 Time Series Analysis
grep "ERROR" app.log | grep -o "[0-9]{2}:[0-9]{2}:[0-9]{2}" | cut -c1-2 | sort | uniq -c

Aggregate error occurrences by hour

⚡ Next-Gen grep: ripgrep and ag

Master alternative tools that are faster and more feature-rich than traditional grep.

🦀 ripgrep (rg) - Rust-based High-Speed grep
rg --type js "function" /var/www/

High-speed search targeting only JavaScript files

rg --json "ERROR" /var/log/ | jq -r 'select(.type == "match") | .data.lines.text'

JSON output for structured data processing

rg --stats --count "TODO" ./src/

Display search statistics and counts simultaneously

⚡ ag (The Silver Searcher)
ag --parallel "pattern" /large/directory/

Multi-core parallel processing for large searches

ag --context=5 --group "ERROR" /var/log/

Display 5 lines context with grouping

📈 Performance Comparison (1GB File Search)
Tool             Execution Time   Memory Usage   Features
grep             15.2 sec         2 MB           Standard, stable
LC_ALL=C grep    8.1 sec          2 MB           Optimized
ripgrep (rg)     2.3 sec          8 MB           Fastest, feature-rich
ag               4.1 sec          12 MB          Fast, developer-friendly

🧠 Complex Pattern Matching Strategies

Advanced techniques for efficiently combining multiple conditions and exclusions.

🎯 Multiple Keyword AND Conditions
grep "ERROR" app.log | grep "database" | grep "timeout"

Basic method (3 pipes)

grep -E "^.*ERROR.*database.*timeout.*$" app.log

Single regex processing (faster)

🚫 Complex Exclusion Patterns
grep -v -E "(DEBUG|INFO|TRACE)" app.log | grep -v "health_check"

Multi-level exclusion filtering

📅 Time Range Search
grep -E "2024-01-(0[1-9]|[12][0-9]|3[01]) (0[89]|1[0-7]):" app.log

Extract logs for January 1-31, hours 8-17

💾 Large File Processing Mastery

Efficient methods for processing multi-GB to TB class files.

🔄 Streaming Processing
tail -f /var/log/huge.log | grep --line-buffered "ERROR"

Monitor and search logs in real-time

📦 Direct Compressed File Search
zgrep "ERROR" /var/log/app.log.gz

Search gzip-compressed files without decompression

bzgrep "pattern" archive.log.bz2

Direct search of bzip2 files also possible
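
The same pattern extends to other compression formats; if the xz and zstd packages are installed, they ship xzgrep and zstdgrep:

xzgrep "ERROR" archive.log.xz

Search xz-compressed logs without decompressing them first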

⚡ Parallel Split Processing
split -l 1000000 huge.log chunk_ && ls chunk_* | xargs -n 1 -P 4 grep -H "ERROR" | sort

Split large files for parallel processing

🎨 Output Customization and Report Generation

Techniques for formatting search results for readability and report processing.

🌈 Color Output Optimization
GREP_COLORS='ms=1;31:mc=1;31:sl=:cx=:fn=1;32:ln=1;33:bn=1;33:se=' grep --color=always "ERROR" app.log

Custom color settings for improved visibility

📋 Structured Output Generation
grep -n "ERROR" *.log | awk -F: '{print $1","$2","$3}' > error_report.csv

Generate error report in CSV format
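
A small variation prepends a header row so the CSV opens cleanly in spreadsheet tools (note that messages containing extra colons are truncated at the third field):

{ echo "file,line,message"; grep -n "ERROR" *.log | awk -F: '{print $1","$2","$3}'; } > error_report.csv

CSV error report with a header line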

📊 Automatic Statistical Report Generation
{
  echo "=== ERROR Analysis Report $(date) ==="
  echo "Total Errors: $(grep -c ERROR app.log)"
  echo "Unique Errors: $(grep -o 'ERROR.*' app.log | sort -u | wc -l)"
  echo "Top 5 Errors:"
  grep -o 'ERROR.*' app.log | sort | uniq -c | sort -nr | head -5
}

Generate comprehensive error analysis report

2. awk Command: The Data Processing Sorcerer

awk takes its name from the initials of its creators - Alfred Aho, Peter Weinberger, and Brian Kernighan. It is a powerful text processing language that excels at processing CSV files and log files.

🔧 Basic Concepts

📊 Understanding awk

awk divides input into records (usually lines) and fields (usually columns) for processing.

Data Structure Example
name,age,occupation
Tanaka,25,Engineer
Sato,30,Designer
Yamada,28,Manager
  • $1: 1st field (name)
  • $2: 2nd field (age)
  • $3: 3rd field (occupation)
  • $0: Entire record
  • NF: Number of fields
  • NR: Record number
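
A quick way to see these variables in action, using one record from the sample data above:

echo "Tanaka,25,Engineer" | awk -F ',' '{print "NF=" NF, "NR=" NR, "$1=" $1, "$0=" $0}'

Prints the field count, record number, first field, and the whole record for a single CSV line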

🔧 Basic Syntax

awk 'pattern { action }' filename

Execute action on lines matching pattern

🔰 Basic awk Operations

Column Extraction

awk -F ',' '{print $1}' employees.csv

Display only 1st column (name)

awk -F ',' '{print $2, $3}' employees.csv

Display 2nd and 3rd columns

awk '{print NR ": " $0}' file.txt

Display entire content with line numbers

Specifying Delimiters

awk -F ',' '{print $1}' data.csv

Display 1st column of comma-separated file

awk -F ':' '{print $1, $3}' /etc/passwd

Display username and UID from colon-separated file

awk 'BEGIN {FS="\t"} {print $2}' tab_separated.txt

Display 2nd column of tab-separated file

Conditional Processing

awk -F ',' 'NR > 1 && $2 > 25 {print $1, $2}' employees.csv

Display name and age for people older than 25 (NR > 1 skips the header row)

awk -F ',' '$3 == "Engineer" {print $1}' employees.csv

Display names of engineers

awk 'NF > 3 {print NR, $0}' data.txt

Display lines with more than 3 fields with line numbers

📊 Calculation and Aggregation

One of awk's powerful features is numerical calculation and aggregation.

Basic Calculations

awk -F ',' '{sum += $3} END {print "Total:", sum}' sales.csv

Calculate sum of 3rd column (sales, etc.)

awk '{sum += $2; count++} END {print "Average:", sum/count}' ages.txt

Calculate average of 2nd column

awk 'BEGIN {max=0} {if($2>max) max=$2} END {print "Max:", max}' numbers.txt

Find maximum value in 2nd column

Group Aggregation

awk -F ',' '{dept[$3] += $2} END {for (d in dept) print d, dept[d]}' salary.csv

Calculate total salary by department

awk '{count[$1]++} END {for (c in count) print c, count[c]}' access.log

Count access by IP address
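
To see only the heaviest hitters, the totals can be piped through sort; for example:

awk '{count[$1]++} END {for (c in count) print count[c], c}' access.log | sort -rn | head -10

Show the 10 client IPs with the most requests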

Complex Processing Examples

awk -F, 'NR>1 {sales[$2]+=$4; count[$2]++} END {for(region in sales) printf "%s: Sales %d Count %d Avg %.1f\n", region, sales[region], count[region], sales[region]/count[region]}' sales_data.csv

Regional sales statistics (total, count, average)

🎭 BEGIN and END Patterns

Using Special Patterns

BEGIN Pattern

Execute before file processing

awk 'BEGIN {print "Processing Start", "Name", "Age"} {print NR, $1, $2}' data.txt

Output header before processing data

END Pattern

Execute after file processing

awk '{count++} END {print "Total Records:", count}' data.txt

Display total record count after processing

Combined Example
awk 'BEGIN {print "=== Sales Report ==="} {total+=$3} END {print "Total Sales:", total, "yen"}' sales.txt

Sales aggregation in report format

🚀 Advanced awk Techniques

📊 Processing Multiple Files

awk 'FNR==1{print "=== " FILENAME " ==="} {print NR, $0}' file1.txt file2.txt

Process multiple files with filename labels
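
Another common multi-file idiom compares two files with the NR==FNR trick (the filenames here are illustrative):

awk 'NR==FNR {seen[$1]; next} $1 in seen' allowlist.txt data.txt

Print lines of data.txt whose first field also appears in allowlist.txt (a join-like filter)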

🔄 Conditional Branching and Functions

awk '{if($2>=60) grade="Pass"; else grade="Fail"; print $1, $2, grade}' scores.txt

Add judgment result based on conditions

📅 Date/Time Processing

awk '{gsub(/-/, "/", $1); cmd = "date -d \"" $1 "\" +%w"; cmd | getline weekday; close(cmd); print $0, weekday}' dates.txt

Calculate and add day of week from date
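
If GNU awk (gawk) is available, its built-in mktime and strftime avoid spawning a date process per line; a rough equivalent:

awk '{split($1, d, "-"); print $0, strftime("%a", mktime(d[1] " " d[2] " " d[3] " 0 0 0"))}' dates.txt

Append the weekday abbreviation, computed entirely inside gawk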

🥋 awk Black Belt Level: Data Processing Mastery

Once you've mastered the basics, learn awk's hidden powers and advanced programming techniques.

🧠 Complete Associative Array Mastery

awk's true power lies in associative arrays (hash tables). They excel at multi-dimensional data processing.

📊 Multi-dimensional Aggregation (Sales by Region × Month)
awk -F, '
NR > 1 {
    # sales[region][month] += sales_amount
    sales[$2][$3] += $4;
    total_by_region[$2] += $4;
    total_by_month[$3] += $4;
    grand_total += $4;
}
END {
    # Header output
    printf "%-12s", "Region/Month";
    for (month in total_by_month) printf "%10s", month;
    printf "%12s\n", "Region Total";

    # Data output
    for (region in total_by_region) {
        printf "%-12s", region;
        for (month in total_by_month) {
            printf "%10d", (month in sales[region]) ? sales[region][month] : 0;
        }
        printf "%12d\n", total_by_region[region];
    }

    # Month totals output
    printf "%-12s", "Month Total";
    for (month in total_by_month) printf "%10d", total_by_month[month];
    printf "%12d\n", grand_total;
}' sales_data.csv

Generate cross-tabulation from CSV sales data
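
Note that the sales[$2][$3] syntax (true arrays of arrays) requires GNU awk 4.0 or later. In POSIX awk the same aggregation is normally written with a compound key; a minimal sketch:

awk -F, 'NR > 1 {sales[$2, $3] += $4} END {for (key in sales) {split(key, k, SUBSEP); print k[1], k[2], sales[key]}}' sales_data.csv

Portable compound-key version (plain totals per region and month, without the formatted cross-table)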

🔍 Duplicate Data Detection and Statistics
awk '
{
    # Count occurrences of entire line
    count[$0]++;
    # Record line number of first occurrence
    if (!first_occurrence[$0]) {
        first_occurrence[$0] = NR;
    }
}
END {
    print "=== Duplicate Data Analysis Report ===";
    duplicates = 0;
    unique_count = 0;
    for (line in count) {
        if (count[line] > 1) {
            printf "Duplicate: %s (Count: %d, First Line: %d)\n", line, count[line], first_occurrence[line];
            duplicates++;
        } else {
            unique_count++;
        }
    }
    printf "\nStatistics:\n";
    printf "Total Lines: %d\n", NR;
    printf "Unique Lines: %d\n", unique_count;
    printf "Duplicate Patterns: %d\n", duplicates;
    printf "Data Duplication Rate: %.2f%%\n", (duplicates * 100.0) / (unique_count + duplicates);
}' data_file.txt

Detect data duplicates and generate detailed statistical report

🔧 User-Defined Functions and Modularization

Functionalize complex processing for reuse and create maintainable code.

📅 Date Processing Library
awk '
# Date validity check function
function is_valid_date(date_str,    parts, year, month, day, max_day) {
    if (split(date_str, parts, "-") != 3) return 0;
    year = parts[1]; month = parts[2]; day = parts[3];
    if (year < 1900 || year > 2100) return 0;
    if (month < 1 || month > 12) return 0;
    # Check days in month (consider leap years)
    split("31,28,31,30,31,30,31,31,30,31,30,31", month_days, ",");
    if (month == 2 && is_leap_year(year)) {
        max_day = 29;
    } else {
        max_day = month_days[month];
    }
    return (day >= 1 && day <= max_day);
}

# Leap year determination function
function is_leap_year(year) {
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

# Date difference calculation function (simplified)
function date_diff_days(date1, date2,    cmd, timestamp1, timestamp2) {
    cmd = sprintf("date -d \"%s\" +%%s", date1);
    cmd | getline timestamp1; close(cmd);
    cmd = sprintf("date -d \"%s\" +%%s", date2);
    cmd | getline timestamp2; close(cmd);
    return int((timestamp2 - timestamp1) / 86400);
}

# Main processing
{
    if (is_valid_date($1)) {
        # Days from today to $1 (positive = future, negative = past)
        diff = date_diff_days("'$(date +%Y-%m-%d)'", $1);
        printf "%s: %s (%d days %s)\n", $1,
            (diff >= 0) ? "Future" : "Past",
            (diff < 0) ? -diff : diff,
            (diff >= 0) ? "from now" : "ago";
    } else {
        printf "%s: Invalid date format\n", $1;
    }
}' date_list.txt

Function library for date validation, leap year check, and date difference calculation

🔢 Statistical Calculation Library
awk '
# Calculate array average
function average(arr, count,    sum, i) {
    sum = 0;
    for (i = 1; i <= count; i++) sum += arr[i];
    return sum / count;
}

# Calculate array standard deviation
function stddev(arr, count,    avg, sum_sq, i) {
    avg = average(arr, count);
    sum_sq = 0;
    for (i = 1; i <= count; i++) {
        sum_sq += (arr[i] - avg) ^ 2;
    }
    return sqrt(sum_sq / count);
}

# Calculate array median
function median(arr, count,    temp_arr, i, j, tmp) {
    # Copy array and sort
    for (i = 1; i <= count; i++) temp_arr[i] = arr[i];
    # Bubble sort (for small arrays)
    for (i = 1; i <= count; i++) {
        for (j = i + 1; j <= count; j++) {
            if (temp_arr[i] > temp_arr[j]) {
                tmp = temp_arr[i]; temp_arr[i] = temp_arr[j]; temp_arr[j] = tmp;
            }
        }
    }
    if (count % 2 == 1) {
        return temp_arr[int(count/2) + 1];
    } else {
        return (temp_arr[count/2] + temp_arr[count/2 + 1]) / 2;
    }
}

# Data collection
{
    if (NF >= 2 && $2 ~ /^[0-9]+\.?[0-9]*$/) {
        values[++count] = $2;
        sum += $2;
        if (min == "" || $2 < min) min = $2;
        if (max == "" || $2 > max) max = $2;
    }
}
END {
    if (count > 0) {
        printf "Statistical Summary (n=%d)\n", count;
        printf "==================\n";
        printf "Min: %8.2f\n", min;
        printf "Max: %8.2f\n", max;
        printf "Mean: %8.2f\n", average(values, count);
        printf "Median: %8.2f\n", median(values, count);
        printf "Std Dev: %8.2f\n", stddev(values, count);
        printf "Total: %8.2f\n", sum;
    }
}' numerical_data.txt

Function suite for numerical data statistical analysis (mean, median, standard deviation, etc.)

🌊 Stream Processing and getline Utilization

Techniques that excel at real-time data processing and external command integration.

📡 Real-Time Log Monitoring
# Real-time monitoring with tail -f
tail -f /var/log/apache2/access.log | awk '
BEGIN {
    # Time window setting (5 minutes)
    window_size = 300;
    alert_threshold = 100;
}
{
    # Extract timestamp
    if (match($4, /\[([^\]]+)\]/, timestamp)) {
        # Get current time
        "date +%s" | getline current_time;
        close("date +%s");

        # Record access
        access_times[current_time]++;

        # Delete old data (older than 5 minutes)
        for (time in access_times) {
            if (current_time - time > window_size) {
                delete access_times[time];
            }
        }

        # Count current window access
        total_access = 0;
        for (time in access_times) {
            total_access += access_times[time];
        }

        # Alert determination
        if (total_access > alert_threshold) {
            printf "[ALERT] %s: High traffic detected - %d requests in last 5 minutes\n",
                strftime("%Y-%m-%d %H:%M:%S", current_time), total_access | "cat >&2";
        }

        # Regular report (every minute)
        if (current_time % 60 == 0) {
            printf "[INFO] %s: Current window traffic: %d requests\n",
                strftime("%Y-%m-%d %H:%M:%S", current_time), total_access;
        }
    }
}'

Real-time web server log monitoring with high-load alerts

🔄 External API Integration Data Processing
awk -F, '
# IP geo-location function
function get_geo_info(ip,    cmd, result, location) {
    if (ip in geo_cache) return geo_cache[ip];
    cmd = sprintf("curl -s \"http://ip-api.com/line/%s?fields=country,regionName,city\"", ip);
    cmd | getline result;
    close(cmd);
    # Cache result
    geo_cache[ip] = result;
    return result;
}

# Main processing (access log analysis)
NR > 1 {
    ip = $1; url = $7; status = $9;

    # Get geo info (consider API rate limits)
    if (++api_calls <= 100) {  # Max 100 API calls per run
        geo_info = get_geo_info(ip);
        split(geo_info, geo_parts, ",");
        country = geo_parts[1]; region = geo_parts[2]; city = geo_parts[3];

        # Country statistics
        country_stats[country]++;
        if (status >= 400) {
            country_errors[country]++;
        }
    }

    # URL statistics
    url_stats[url]++;
    if (status >= 400) {
        url_errors[url]++;
    }
}
END {
    print "=== Geographic Access Analysis ===";
    for (country in country_stats) {
        error_rate = (country in country_errors) ? (country_errors[country] * 100.0 / country_stats[country]) : 0;
        printf "%-20s: %6d Access (Error Rate: %5.1f%%)\n", country, country_stats[country], error_rate;
    }

    print "\n=== Problematic URLs ===";
    for (url in url_stats) {
        if (url in url_errors && url_errors[url] > 10) {
            error_rate = url_errors[url] * 100.0 / url_stats[url];
            printf "%-50s: Errors %3d/%3d (%.1f%%)\n", url, url_errors[url], url_stats[url], error_rate;
        }
    }
}' access_log.csv

Add IP geo-location to access logs and analyze error rates by country

🚀 Performance Optimization and Memory Management

Techniques to maximize speed and memory efficiency for large data processing.

⚡ High-Speed String Processing
🐌 Slow Method
# Repeated string concatenation (slow)
awk '{
    result = "";
    for (i = 1; i <= NF; i++) {
        result = result $i " ";  # Creates a new string each time
    }
    print result;
}'
⚡ Fast Method
# Efficient string processing with arrays
awk '{
    for (i = 1; i <= NF; i++) {
        words[i] = $i;  # Store in array
    }
    # Join and output at once
    for (i = 1; i <= NF; i++) {
        printf "%s%s", words[i], (i < NF) ? " " : "\n";
    }
    # Clear array (save memory)
    delete words;
}'
💾 Memory-Efficient Large File Processing
awk '
BEGIN {
    # Processed record counter
    processed = 0;
    batch_size = 10000;
}
{
    # Process record
    process_record($0);
    processed++;

    # Batch processing (memory usage control)
    if (processed % batch_size == 0) {
        # Periodically delete unnecessary data
        cleanup_memory();
        # Progress report
        printf "Processing: %d records completed (%.1f MB processed)\n",
            processed, processed * length($0) / 1024 / 1024 > "/dev/stderr";
    }
}
function process_record(record,    fields) {
    # Process only necessary fields
    split(record, fields, ",");
    # Important: delete large temp variables immediately
    if (fields[2] > threshold) {
        summary[fields[1]] += fields[3];
    }
    # Local variables automatically deleted
}
function cleanup_memory(    key) {
    # Delete old data or unnecessary cache
    for (key in old_cache) {
        delete old_cache[key];
    }
    # Garbage collection-like processing
    system("echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || true");
}
END {
    # Final results output
    for (key in summary) {
        printf "%s: %d\n", key, summary[key];
    }
    printf "Total Records Processed: %d\n", processed > "/dev/stderr";
}' huge_data_file.csv

Efficiently process large CSV files with controlled memory usage

🎨 Advanced Output Formatting

Professional report generation and data visualization techniques.

📊 ASCII Art Chart Generation
awk -F, '
NR > 1 {
    sales[$1] += $3;     # Total sales by salesperson
    total_sales += $3;   # Grand total (used in the summary below)
}
END {
    # Find maximum value
    max_sales = 0;
    for (person in sales) {
        if (sales[person] > max_sales) {
            max_sales = sales[person];
        }
    }

    # Chart settings
    chart_width = 50;
    scale = max_sales / chart_width;

    print "Sales Performance Chart";
    print "================";
    printf "Scale: 1 character = %.0f (10k units)\n\n", scale / 10000;

    # Prepare array for sorting by sales
    n = 0;
    for (person in sales) {
        sorted_sales[++n] = sales[person];
        person_by_sales[sales[person]] = person;
    }

    # Bubble sort (descending sales)
    for (i = 1; i <= n; i++) {
        for (j = i + 1; j <= n; j++) {
            if (sorted_sales[i] < sorted_sales[j]) {
                tmp = sorted_sales[i]; sorted_sales[i] = sorted_sales[j]; sorted_sales[j] = tmp;
            }
        }
    }

    # Chart output
    for (i = 1; i <= n; i++) {
        current_sales = sorted_sales[i];
        person = person_by_sales[current_sales];
        bar_length = int(current_sales / scale);
        printf "%-10s |", person;
        for (j = 1; j <= bar_length; j++) printf "█";
        printf " %d (10k)\n", current_sales / 10000;
    }

    print "";
    printf "Total Sales: %d (10k)\n", total_sales / 10000;
    printf "Average Sales: %.1f (10k)\n", (total_sales / n) / 10000;
}' sales_report.csv

Generate ASCII art bar chart from sales data