find/grep/awk Master Series Advanced
grep/awk Ultimate Techniques
Advanced guide covering grep environment variable optimization, next-gen high-speed tools, awk associative arrays, user-defined functions, and stream processing. Master professional-level data processing techniques.
4. grep Command: The Text Search Wizard
grep stands for "Global Regular Expression Print" - a command that extracts lines matching specific patterns from files or input. When combined with regular expressions, it becomes an extremely powerful search tool.
🔧 Basic Syntax
grep [options] pattern filename
Display lines containing the specified pattern
🔰 Basic Usage
String Search
grep "Linux" document.txt
Display lines containing "Linux"
grep -i "linux" document.txt
Case-insensitive search
grep -v "error" log.txt
Display lines NOT containing "error" (inverse search)
Line Numbers and Context
grep -n "function" script.js
Show line numbers with matches
grep -C 3 "ERROR" app.log
Show 3 lines before and after matches
grep -A 2 -B 1 "WARNING" app.log
Show 1 line before, 2 lines after matches
File Search and Counting
grep -l "TODO" *.js
Show only filenames containing "TODO"
grep -c "error" log.txt
Count lines containing "error"
grep -r "password" /etc/
Recursive directory search
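A useful refinement of recursive search (a sketch using hypothetical paths) is to narrow the file set with GNU grep's --include and --exclude-dir options so irrelevant files are never scanned:
grep -rn --include="*.py" --exclude-dir=".git" "password" /srv/app/
Recursive search restricted to Python files, skipping the .git directory, with line numbers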
🎯 Combining with Regular Expressions
grep's true power is unleashed when combined with regular expressions.
Basic Regular Expression Patterns
grep "^Linux" document.txt
Lines starting with "Linux"
grep "finished$" log.txt
Lines ending with "finished"
grep "^$" file.txt
Empty lines
Character Classes and Quantifiers
grep "[0-9]+" numbers.txt
Lines containing one or more digits
grep "colou?r" text.txt
"color" or "colour" (? means 0 or 1 occurrence)
grep -E "error|warning|fatal" log.txt
Match any of multiple patterns (OR search)
Practical Pattern Examples
grep -E "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" access.log
Search for IP address patterns
grep -E "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" contacts.txt
Search for email address patterns
grep -E "20[0-9]{2}-[0-1][0-9]-[0-3][0-9]" log.txt
Search for date pattern (YYYY-MM-DD)
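These patterns become even more useful with -o, which prints only the matched text. As a sketch (access.log stands in for any web server log), the IP pattern above can be turned into a list of unique client addresses:
grep -oE "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" access.log | sort -u
Extract every IP-like string and keep one copy of each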
🔄 Combining grep with Pipes
By connecting with other commands through pipes, you can build powerful data processing pipelines.
Process Management Combinations
ps aux | grep "nginx"
Display only nginx processes
ps aux | grep -v "grep" | grep "python"
Display python processes excluding grep itself
Log Analysis Combinations
tail -f /var/log/app.log | grep --line-buffered "ERROR"
Monitor errors in real-time
grep "404" access.log | wc -l
Count 404 error occurrences (grep -c "404" access.log gives the same count without the extra pipe)
Network Information Combinations
netstat -an | grep ":80 "
Display sockets using port 80 (add -p on Linux to also show the owning process)
ifconfig | grep -E "inet [0-9]+"
Extract only IP address lines (on modern Linux, ip addr | grep -E "inet " gives the same information)
💡 Practical grep Techniques
Combining Multiple Conditions
grep "error" log.txt | grep -v "timeout"
Lines with "error" but not "timeout"
grep -E "(error|warning)" log.txt | grep "2025-01-15"
Errors or warnings on specific date
Efficient Search Configuration
grep --color=always "pattern" file.txt | less -R
Preserve color output when using less
GREP_OPTIONS="--color=auto" grep "pattern" file.txt
Set default options via environment variable
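The supported way to get default options today is a shell alias. A minimal sketch for ~/.bashrc (the option shown is standard GNU grep; adjust to taste):
alias grep='grep --color=auto'
Make color highlighting the default for interactive grep without relying on GREP_OPTIONS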
Speed Optimization Techniques
LC_ALL=C grep "pattern" large_file.txt
Disable UTF-8 processing with locale setting for speed
grep -F "literal_string" file.txt
Fixed string search (disable regex processing)
🚀 Advanced grep Techniques
Parallel Search Across Multiple Files
find /var/log -name "*.log" -print0 | xargs -0 -P 4 grep -H "ERROR"
Parallel search with 4 processes (-print0/-0 copes with spaces in filenames; -H keeps the filename in each match)
🎯 grep Ultimate Techniques: Professional Level
Once you've mastered the basics, master grep's hidden features and advanced techniques for expert-level data processing.
🌐 Environment Variables and Locale Optimization
For large file processing, locale settings significantly impact performance.
🐌 Slow Method (UTF-8 Processing)
grep "ERROR" huge_log.txt
Character encoding processing creates overhead
⚡ Speed Optimization (ASCII Processing)
LC_ALL=C grep "ERROR" huge_log.txt
Often several times faster because multibyte (UTF-8) character handling is skipped
LC_ALL=C grep --binary-files=without-match "pattern" /var/log/*
High-speed search skipping binary files
GREP_OPTIONS="--color=never" LC_ALL=C grep -F "ERROR" *.log
Further speed improvement by disabling color
🔗 Pipeline Combination Mastery
Combine multiple grep commands to efficiently handle complex conditions.
🎯 Progressive Filtering
grep "ERROR" app.log | grep -v "Timeout" | grep "$(date +%Y-%m-%d)"
Extract today's ERROR lines excluding timeouts
📊 Search with Statistics
grep -h "ERROR" /var/log/*.log | sort | uniq -c | sort -nr
Rank error types by occurrence count
🕐 Time Series Analysis
grep "ERROR" app.log | grep -o "[0-9]{2}:[0-9]{2}:[0-9]{2}" | cut -c1-2 | sort | uniq -c
Aggregate error occurrences by hour
⚡ Next-Gen grep: ripgrep and ag
Master alternative tools that are faster and more feature-rich than traditional grep.
🦀 ripgrep (rg) - Rust-based High-Speed grep
rg --type js "function" /var/www/
High-speed search targeting only JavaScript files
rg --json "ERROR" /var/log/ | jq -r 'select(.type == "match") | .data.lines.text'
JSON output for structured data processing (the select keeps only the match records, dropping begin/end/summary entries)
rg --stats --count "TODO" ./src/
Display search statistics and counts simultaneously
⚡ ag (The Silver Searcher)
ag --workers 4 "pattern" /large/directory/
Search with 4 worker threads (ag already uses one worker per core by default, so this is mainly for tuning)
ag --context=5 --group "ERROR" /var/log/
Display 5 lines context with grouping
📈 Performance Comparison (1GB File Search)
| Tool | Execution Time | Memory Usage | Features |
|---|---|---|---|
| grep | 15.2 sec | 2MB | Standard, Stable |
| LC_ALL=C grep | 8.1 sec | 2MB | Optimized |
| ripgrep (rg) | 2.3 sec | 8MB | Fastest, Feature-rich |
| ag | 4.1 sec | 12MB | Fast, Developer-friendly |
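Benchmark numbers like these depend heavily on hardware, locale, and the pattern, so it is worth measuring on your own data. A minimal timing sketch using the shell's time keyword (huge_log.txt is a placeholder for your own large file; rg must be installed for the last line):
time grep "ERROR" huge_log.txt > /dev/null
time LC_ALL=C grep "ERROR" huge_log.txt > /dev/null
time rg "ERROR" huge_log.txt > /dev/null
Compare wall-clock times of the three approaches on the same file; run each a couple of times so the file cache state is consistent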
🧠 Complex Pattern Matching Strategies
Advanced techniques for efficiently combining multiple conditions and exclusions.
🎯 Multiple Keyword AND Conditions
grep "ERROR" app.log | grep "database" | grep "timeout"
Basic method (3 pipes)
grep -E "^.*ERROR.*database.*timeout.*$" app.log
Single regex processing (faster)
🚫 Complex Exclusion Patterns
grep -v -E "(DEBUG|INFO|TRACE)" app.log | grep -v "health_check"
Multi-level exclusion filtering
📅 Time Range Search
grep -E "2024-01-(0[1-9]|[12][0-9]|3[01]) (0[89]|1[0-7]):" app.log
Extract logs for January 1-31, hours 8-17
💾 Large File Processing Mastery
Efficient methods for processing multi-GB to TB class files.
🔄 Streaming Processing
tail -f /var/log/huge.log | grep --line-buffered "ERROR"
Monitor and search logs in real-time
📦 Direct Compressed File Search
zgrep "ERROR" /var/log/app.log.gz
Search gzip-compressed files without decompression
bzgrep "pattern" archive.log.bz2
Direct search of bzip2 files also possible
⚡ Parallel Split Processing
split -l 1000000 huge.log chunk_ && printf '%s\n' chunk_* | xargs -P 4 -n 1 grep -H "ERROR"
Split a large file into chunks and grep them with 4 parallel processes
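If GNU parallel is available, the temporary chunk files can be skipped entirely: --pipepart splits the file on the fly and feeds each block to a separate grep. A sketch assuming GNU parallel is installed:
parallel --pipepart --block 100M -a huge.log grep "ERROR"
Stream 100 MB blocks of huge.log into parallel grep processes without writing chunk files to disk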
🎨 Output Customization and Report Generation
Techniques for formatting search results for readability and report processing.
🌈 Color Output Optimization
GREP_COLORS='ms=1;31:mc=1;31:sl=:cx=:fn=1;32:ln=1;33:bn=1;33:se=' grep --color=always "ERROR" app.log
Custom color settings for improved visibility
📋 Structured Output Generation
grep -n "ERROR" *.log | awk -F: '{print $1","$2","$3}' > error_report.csv
Generate error report in CSV format
📊 Automatic Statistical Report Generation
{
echo "=== ERROR Analysis Report $(date) ==="
echo "Total Errors: $(grep -c ERROR app.log)"
echo "Unique Errors: $(grep -o 'ERROR.*' app.log | sort -u | wc -l)"
echo "Top 5 Errors:"
grep -o 'ERROR.*' app.log | sort | uniq -c | sort -nr | head -5
}
Generate comprehensive error analysis report
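To turn this into a recurring report, the brace group can be saved as a script (error_report.sh is a hypothetical name) and scheduled with cron:
0 6 * * * /usr/local/bin/error_report.sh > /var/tmp/error_report.txt 2>&1
Run the report every morning at 06:00 and keep the latest output in a file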
5. awk Command: The Data Processing Sorcerer
awk is named after the initials of its creators - Alfred Aho, Peter Weinberger, and Brian Kernighan. It is a powerful text processing language that excels at processing CSV files and log files.
🔧 Basic Concepts
📊 Understanding awk
awk divides input into records (usually lines) and fields (usually columns) for processing.
Data Structure Example
name,age,occupation
Tanaka,25,Engineer
Sato,30,Designer
Yamada,28,Manager
- $1: 1st field (name)
- $2: 2nd field (age)
- $3: 3rd field (occupation)
- $0: Entire record
- NF: Number of fields
- NR: Record number
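A quick way to see these variables in action is to feed a single record to awk and print them side by side (the sample record matches the CSV layout above):
echo "Tanaka,25,Engineer" | awk -F ',' '{print "NF=" NF, "NR=" NR, "$1=" $1, "$3=" $3}'
Prints NF=3 NR=1 $1=Tanaka $3=Engineer, showing how awk splits the record into fields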
🔧 Basic Syntax
awk 'pattern { action }' filename
Execute action on lines matching pattern
🔰 Basic awk Operations
Column Extraction
awk -F ',' '{print $1}' employees.csv
Display only 1st column (name); -F ',' tells awk the file is comma-separated
awk -F ',' '{print $2, $3}' employees.csv
Display 2nd and 3rd columns
awk '{print NR ": " $0}' file.txt
Display entire content with line numbers
Specifying Delimiters
awk -F ',' '{print $1}' data.csv
Display 1st column of comma-separated file
awk -F ':' '{print $1, $3}' /etc/passwd
Display username and UID from colon-separated file
awk 'BEGIN {FS="\t"} {print $2}' tab_separated.txt
Display 2nd column of tab-separated file
Conditional Processing
awk -F ',' '$2 > 25 {print $1, $2}' employees.csv
Display name and age for people over 25
awk -F ',' '$3 == "Engineer" {print $1}' employees.csv
Display names of engineers
awk 'NF > 3 {print NR, $0}' data.txt
Display lines with more than 3 fields with line numbers
📊 Calculation and Aggregation
One of awk's powerful features is numerical calculation and aggregation.
Basic Calculations
awk -F ',' '{sum += $3} END {print "Total:", sum}' sales.csv
Calculate sum of 3rd column (sales, etc.)
awk '{sum += $2; count++} END {print "Average:", sum/count}' ages.txt
Calculate average of 2nd column
awk 'BEGIN {max=0} {if($2>max) max=$2} END {print "Max:", max}' numbers.txt
Find maximum value in 2nd column
Group Aggregation
awk -F ',' '{dept[$3] += $2} END {for (d in dept) print d, dept[d]}' salary.csv
Calculate total salary by department
awk '{count[$1]++} END {for (c in count) print c, count[c]}' access.log
Count access by IP address
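Because for (key in array) returns keys in no particular order, aggregated results are usually piped through sort. A sketch that ranks the per-IP counts from the example above:
awk '{count[$1]++} END {for (c in count) print count[c], c}' access.log | sort -nr | head -10
Top 10 client IPs by request count; printing the count first makes the numeric sort straightforward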
Complex Processing Examples
awk -F, 'NR>1 {sales[$2]+=$4; count[$2]++} END {for(region in sales) printf "%s: Sales %d Count %d Avg %.1f\n", region, sales[region], count[region], sales[region]/count[region]}' sales_data.csv
Regional sales statistics (total, count, average)
🎭 BEGIN and END Patterns
Using Special Patterns
BEGIN Pattern
Execute before file processing
awk 'BEGIN {print "Processing Start", "Name", "Age"} {print NR, $1, $2}' data.txt
Output header before processing data
END Pattern
Execute after file processing
awk '{count++} END {print "Total Records:", count}' data.txt
Display total record count after processing
Combined Example
awk 'BEGIN {print "=== Sales Report ==="} {total+=$3} END {print "Total Sales:", total, "yen"}' sales.txt
Sales aggregation in report format
🚀 Advanced awk Techniques
📊 Processing Multiple Files
awk 'FNR==1{print "=== " FILENAME " ==="} {print NR, $0}' file1.txt file2.txt
Process multiple files with filename labels
🔄 Conditional Branching and Functions
awk '{if($2>=60) grade="Pass"; else grade="Fail"; print $1, $2, grade}' scores.txt
Add judgment result based on conditions
📅 Date/Time Processing
awk '{gsub(/-/, "/", $1); cmd="date -d " $1 " +%w"; cmd | getline weekday; close(cmd); print $0, weekday}' dates.txt
Calculate and add day of week from date (close(cmd) is needed so the command runs again when the same date repeats)
🥋 awk Black Belt Level: Data Processing Mastery
Once you've mastered the basics, learn awk's hidden powers and advanced programming techniques.
🧠 Complete Associative Array Mastery
awk's true power lies in associative arrays (hash tables). They excel at multi-dimensional data processing.
📊 Multi-dimensional Aggregation (Sales by Region × Month)
awk -F, '
NR>1 {
# sales[region][month] += sales_amount
sales[$2][$3] += $4;
total_by_region[$2] += $4;
total_by_month[$3] += $4;
grand_total += $4;
}
END {
# Header output
printf "%-12s", "Region/Month";
for (month in total_by_month) printf "%10s", month;
printf "%12s\n", "Region Total";
# Data output
for (region in total_by_region) {
printf "%-12s", region;
for (month in total_by_month) {
printf "%10d", (month in sales[region]) ? sales[region][month] : 0;
}
printf "%12d\n", total_by_region[region];
}
# Month totals output
printf "%-12s", "Month Total";
for (month in total_by_month) printf "%10d", total_by_month[month];
printf "%12d\n", grand_total;
}' sales_data.csv
Generate cross-tabulation from CSV sales data (arrays of arrays such as sales[region][month] require GNU awk 4.0 or later)
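On systems with only POSIX awk (mawk, BusyBox awk, etc.), the same cross-tabulation idea can be expressed with the classic simulated form sales[region, month], which joins the indices with SUBSEP. A minimal portable sketch using the same hypothetical CSV layout (region in $2, month in $3, amount in $4):
awk -F, 'NR>1 {sales[$2, $3] += $4} END {for (key in sales) {split(key, idx, SUBSEP); print idx[1], idx[2], sales[key]}}' sales_data.csv
Print region, month, and total amount, one combination per line, without requiring gawk's arrays of arrays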
🔍 Duplicate Data Detection and Statistics
awk '
{
# Count occurrences of entire line
count[$0]++;
# Record line number of first occurrence
if (!first_occurrence[$0]) {
first_occurrence[$0] = NR;
}
}
END {
print "=== Duplicate Data Analysis Report ===";
duplicates = 0;
unique_count = 0;
for (line in count) {
if (count[line] > 1) {
printf "Duplicate: %s (Count: %d, First Line: %d)\n",
line, count[line], first_occurrence[line];
duplicates++;
} else {
unique_count++;
}
}
printf "\nStatistics:\n";
printf "Total Lines: %d\n", NR;
printf "Unique Lines: %d\n", unique_count;
printf "Duplicate Patterns: %d\n", duplicates;
printf "Data Duplication Rate: %.2f%%\n", (duplicates * 100.0) / (unique_count + duplicates);
}' data_file.txt
Detect data duplicates and generate detailed statistical report
🔧 User-Defined Functions and Modularization
Functionalize complex processing for reuse and create maintainable code.
📅 Date Processing Library
awk '
# Date validity check function
function is_valid_date(date_str, parts, year, month, day, days_in_month) {
if (split(date_str, parts, "-") != 3) return 0;
year = parts[1] + 0; month = parts[2] + 0; day = parts[3] + 0;  # force numeric so month_days[month] is indexed as "2", not "02"
if (year < 1900 || year > 2100) return 0;
if (month < 1 || month > 12) return 0;
# Check days in month (consider leap years)
days_in_month = "31,28,31,30,31,30,31,31,30,31,30,31";
split(days_in_month, month_days, ",");
if (month == 2 && is_leap_year(year)) {
max_day = 29;
} else {
max_day = month_days[month];
}
return (day >= 1 && day <= max_day);
}
# Leap year determination function
function is_leap_year(year) {
return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}
# Date difference calculation function (simplified)
function date_diff_days(date1, date2, cmd, result) {
cmd = sprintf("date -d \"%s\" +%%s", date1);
cmd | getline timestamp1; close(cmd);
cmd = sprintf("date -d \"%s\" +%%s", date2);
cmd | getline timestamp2; close(cmd);
return int((timestamp2 - timestamp1) / 86400);
}
# Main processing
{
if (is_valid_date($1)) {
diff = date_diff_days($1, "'$(date +%Y-%m-%d)'");
printf "%s: %s (%d days %s)\n", $1,
(diff >= 0) ? "Future" : "Past",
(diff < 0) ? -diff : diff,
(diff >= 0) ? "from now" : "ago";
} else {
printf "%s: Invalid date format\n", $1;
}
}' date_list.txt
Function library for date validation, leap year check, and date difference calculation
🔢 Statistical Calculation Library
awk '
# Calculate array average
function average(arr, count, sum, i) {
sum = 0;
for (i = 1; i <= count; i++) sum += arr[i];
return sum / count;
}
# Calculate array standard deviation
function stddev(arr, count, avg, sum_sq, i) {
avg = average(arr, count);
sum_sq = 0;
for (i = 1; i <= count; i++) {
sum_sq += (arr[i] - avg) ^ 2;
}
return sqrt(sum_sq / count);
}
# Calculate array median
function median(arr, count, temp_arr, i, j, tmp) {
# Copy array and sort
for (i = 1; i <= count; i++) temp_arr[i] = arr[i];
# Bubble sort (for small arrays)
for (i = 1; i <= count; i++) {
for (j = i + 1; j <= count; j++) {
if (temp_arr[i] > temp_arr[j]) {
tmp = temp_arr[i];
temp_arr[i] = temp_arr[j];
temp_arr[j] = tmp;
}
}
}
if (count % 2 == 1) {
return temp_arr[int(count/2) + 1];
} else {
return (temp_arr[count/2] + temp_arr[count/2 + 1]) / 2;
}
}
# Data collection
{
if (NF >= 2 && $2 ~ /^[0-9]+\.?[0-9]*$/) {
values[++count] = $2;
sum += $2;
if (min == "" || $2 < min) min = $2;
if (max == "" || $2 > max) max = $2;
}
}
END {
if (count > 0) {
printf "Statistical Summary (n=%d)\n", count;
printf "==================\n";
printf "Min: %8.2f\n", min;
printf "Max: %8.2f\n", max;
printf "Mean: %8.2f\n", average(values, count);
printf "Median: %8.2f\n", median(values, count);
printf "Std Dev: %8.2f\n", stddev(values, count);
printf "Total: %8.2f\n", sum;
}
}' numerical_data.txt
Function suite for numerical data statistical analysis (mean, median, standard deviation, etc.)
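To try the library without real data, a synthetic two-column file can be generated first (numerical_data.txt and the value range are just placeholders):
seq 1 100 | awk '{print "sample_" $1, $1 * 1.5}' > numerical_data.txt
Create 100 rows of "label value" test data; running the statistics script on it should report a minimum of 1.5, a maximum of 150, and a mean of 75.75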
🌊 Stream Processing and getline Utilization
Techniques that excel at real-time data processing and external command integration.
📡 Real-Time Log Monitoring
# Real-time monitoring with tail -f
tail -f /var/log/apache2/access.log | awk '
BEGIN {
# Time window setting (5 minutes)
window_size = 300;
alert_threshold = 100;
}
{
# Check that the timestamp field is present ($4 alone has no closing "]", so just test its prefix)
if ($4 ~ /^\[/) {
# Get current time
"date +%s" | getline current_time;
close("date +%s");
# Record access
access_times[current_time]++;
# Delete old data (older than 5 minutes)
for (time in access_times) {
if (current_time - time > window_size) {
delete access_times[time];
}
}
# Count current window access
total_access = 0;
for (time in access_times) {
total_access += access_times[time];
}
# Alert determination
if (total_access > alert_threshold) {
printf "[ALERT] %s: High traffic detected - %d requests in last 5 minutes\n",
strftime("%Y-%m-%d %H:%M:%S", current_time), total_access | "cat >&2";
}
# Regular report (every minute)
if (current_time % 60 == 0) {
printf "[INFO] %s: Current window traffic: %d requests\n",
strftime("%Y-%m-%d %H:%M:%S", current_time), total_access;
}
}
}'
Real-time web server log monitoring with high-load alerts (strftime requires GNU awk)
🔄 External API Integration Data Processing
awk -F, '
# IP geo-location function
function get_geo_info(ip, cmd, result, location) {
if (ip in geo_cache) return geo_cache[ip];
cmd = sprintf("curl -s \"http://ip-api.com/line/%s?fields=country,regionName,city\"", ip);
cmd | getline result;
close(cmd);
# Cache result
geo_cache[ip] = result;
return result;
}
# Main processing (access log analysis)
NR > 1 {
ip = $1;
url = $7;
status = $9;
# Get geo info (consider API rate limits)
if (++api_calls <= 100) { # Max 100 API calls per run
geo_info = get_geo_info(ip);
split(geo_info, geo_parts, ",");
country = geo_parts[1];
region = geo_parts[2];
city = geo_parts[3];
# Country statistics
country_stats[country]++;
if (status >= 400) {
country_errors[country]++;
}
}
# URL statistics
url_stats[url]++;
if (status >= 400) {
url_errors[url]++;
}
}
END {
print "=== Geographic Access Analysis ===";
for (country in country_stats) {
error_rate = (country in country_errors) ?
(country_errors[country] * 100.0 / country_stats[country]) : 0;
printf "%-20s: %6d Access (Error Rate: %5.1f%%)\n",
country, country_stats[country], error_rate;
}
print "\n=== Problematic URLs ===";
for (url in url_stats) {
if (url in url_errors && url_errors[url] > 10) {
error_rate = url_errors[url] * 100.0 / url_stats[url];
printf "%-50s: Errors %3d/%3d (%.1f%%)\n",
url, url_errors[url], url_stats[url], error_rate;
}
}
}' access_log.csv
Add IP geo-location to access logs and analyze error rates by country
🚀 Performance Optimization and Memory Management
Techniques to maximize speed and memory efficiency for large data processing.
⚡ High-Speed String Processing
🐌 Slow Method
# Repeated string concatenation (slow)
awk '{
result = "";
for (i = 1; i <= NF; i++) {
result = result $i " "; # Creates new string each time
}
print result;
}'
⚡ Fast Method
# Efficient string processing with arrays
awk '{
for (i = 1; i <= NF; i++) {
words[i] = $i; # Store in array
}
# Join and output at once
for (i = 1; i <= NF; i++) {
printf "%s%s", words[i], (i < NF) ? " " : "\n";
}
# Clear array (save memory)
delete words;
}'
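Whether the array version actually wins depends on the awk implementation and the data, so it is worth timing both on a representative file. A rough harness (test_input.txt is a placeholder; both one-liners just rebuild each line, so the output is discarded):
yes "one two three four five six seven eight" | head -500000 > test_input.txt
time awk '{s=""; for(i=1;i<=NF;i++) s = s $i " "; print s}' test_input.txt > /dev/null
time awk '{for(i=1;i<=NF;i++) printf "%s%s", $i, (i<NF)?" ":"\n"}' test_input.txt > /dev/null
Generate a half-million-line test file and compare wall-clock times of the concatenation and printf approaches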
💾 Memory-Efficient Large File Processing
awk '
BEGIN {
# Processed record counter and batch settings
processed = 0;
batch_size = 10000;
threshold = 100;  # example cutoff for field 2 used by process_record(); adjust for your data
}
{
# Process record
process_record($0);
processed++;
# Batch processing (memory usage control)
if (processed % batch_size == 0) {
# Periodically delete unnecessary data
cleanup_memory();
# Progress report
printf "Processing: %d records completed (%.1f MB processed)\n",
processed, processed * length($0) / 1024 / 1024 > "/dev/stderr";
}
}
function process_record(record, fields) {
# Process only necessary fields
split(record, fields, ",");
# Important: delete large temp variables immediately
if (fields[2] > threshold) {
summary[fields[1]] += fields[3];
}
# Local variables automatically deleted
}
function cleanup_memory( key) {
# Delete old data or unnecessary cache
for (key in old_cache) {
delete old_cache[key];
}
# Note: awk reclaims the memory of deleted array elements itself; there is no explicit GC to trigger
}
END {
# Final results output
for (key in summary) {
printf "%s: %d\n", key, summary[key];
}
printf "Total Records Processed: %d\n", processed > "/dev/stderr";
}' huge_data_file.csv
Efficiently process large CSV files with controlled memory usage
🎨 Advanced Output Formatting
Professional report generation and data visualization techniques.
📊 ASCII Art Chart Generation
awk -F, '
NR > 1 {
sales[$1] += $3; # Total sales by salesperson
total_sales += $3; # Grand total used in the summary lines at the end
}
END {
# Find maximum value
max_sales = 0;
for (person in sales) {
if (sales[person] > max_sales) {
max_sales = sales[person];
}
}
# Chart settings
chart_width = 50;
scale = max_sales / chart_width;
print "Sales Performance Chart";
print "================";
printf "Scale: 1 character = %.0f (10k units)\n\n", scale / 10000;
# Prepare array for sorting by sales
n = 0;
for (person in sales) {
sorted_sales[++n] = sales[person];
person_by_sales[sales[person]] = person;
}
# Bubble sort (descending sales)
for (i = 1; i <= n; i++) {
for (j = i + 1; j <= n; j++) {
if (sorted_sales[i] < sorted_sales[j]) {
tmp = sorted_sales[i];
sorted_sales[i] = sorted_sales[j];
sorted_sales[j] = tmp;
}
}
}
# Chart output
for (i = 1; i <= n; i++) {
current_sales = sorted_sales[i];
person = person_by_sales[current_sales];
bar_length = int(current_sales / scale);
printf "%-10s |", person;
for (j = 1; j <= bar_length; j++) printf "█";
printf " %d (10k)\n", current_sales / 10000;
}
print "";
printf "Total Sales: %d (10k)\n", total_sales / 10000;
printf "Average Sales: %.1f (10k)\n", (total_sales / n) / 10000;
}' sales_report.csv
Generate ASCII art bar chart from sales data
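The script assumes a header line plus rows of the form name,region,amount, with the amount in column 3. A tiny sample file (values are made up) to try it with:
printf 'name,region,amount\nTanaka,Tokyo,1200000\nSato,Osaka,800000\nYamada,Nagoya,1500000\n' > sales_report.csv
Create a three-person sample sales_report.csv; running the chart script on it prints one bar per salesperson, longest first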