Combining find, grep, and awk - Practical Techniques for Real Work
This practical guide covers data processing patterns that combine find, grep, and awk, along with real-world use cases and performance optimization. It focuses on skills that engineers and data analysts can apply immediately.
Combination Techniques
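Experienced Linux users rarely use find, grep, and awk in isolation. Tasks that are awkward with a single command become straightforward when the three are combined: find selects files, grep filters lines, and awk extracts and aggregates fields.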
Basic Pipe Patterns
find + grep:
# Find log files containing ERROR
find /var/log -name "*.log" -exec grep -l "ERROR" {} \;
# Search "password" in txt files with line numbers
find /home -name "*.txt" -print0 | xargs -0 grep -n "password"
grep + awk:
# Extract date, time, and last field from error lines
grep "ERROR" /var/log/app.log | awk '{print $1, $2, $NF}'
# Sum CPU usage of nginx processes
# (%CPU is column 3 of ps aux; "[n]ginx" keeps grep from matching its own process)
ps aux | grep "[n]ginx" | awk '{sum+=$3} END {print "Total CPU:", sum+0 "%"}'
find + awk:
# Calculate total size and count of log files
find /var -name "*.log" -printf "%s %p\n" | awk '{size+=$1; count++} END {printf "Total: %.2f MB Files: %d\n", size/1024/1024, count}'
Real-World Combined Processing
Scenario 1: Web server access analysis
Extract the top 10 IP addresses with the most errors from the last week.
find /var/log/apache2 -name "access.log*" -mtime -7 | \
xargs grep -h " 5[0-9][0-9] " | \
awk '{print $1}' | \
sort | uniq -c | \
sort -rn | \
head -10 | \
awk '{printf "%-15s %d times\n", $2, $1}'
Step-by-step explanation:
find: Find access logs modified in the last 7 days
grep: Extract lines with 5xx (server error) status codes
awk: Extract the IP address (column 1)
sort | uniq -c: Count occurrences per IP address
sort -rn: Sort by count, descending
head -10: Keep the top 10
awk: Format the output
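
The steps above can also be wrapped in a small function so the same analysis runs against any log directory. This is a minimal sketch, not a definitive implementation: the function name top_error_ips and its arguments are hypothetical, and it assumes the client IP is the first field of each line and that the matched files are uncompressed. It uses find -exec ... {} + instead of xargs so that file names containing spaces are handled safely.

```shell
# top_error_ips DIR [N]
# Print the N client IPs with the most 5xx responses found in access logs
# under DIR that were modified within the last 7 days.
# Assumption: the client IP is the first whitespace-separated field.
top_error_ips() {
    local dir="$1" limit="${2:-10}"
    find "$dir" -type f -name "access.log*" -mtime -7 \
        -exec grep -h " 5[0-9][0-9] " {} + |
        awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' |
        sort -k1,1nr -k2,2 |
        head -n "$limit" |
        awk '{printf "%-15s %d times\n", $2, $1}'
}
```

Counting inside awk replaces the sort | uniq -c stage, so only one line per IP address has to be sorted. A call such as top_error_ips /var/log/apache2 10 mirrors the pipeline above.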
Scenario 2: Bulk delete old temporary files
Safely delete temporary files older than 30 days from the system.
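# 1. First check target files (use exactly the same criteria as the deletion below)
find /tmp /var/tmp /home -type f -name "*.tmp" -mtime +30 -size +0 -exec ls -la {} +
# 2. After confirming, run the deletion (each removed path is printed)
find /tmp /var/tmp /home -type f -name "*.tmp" -mtime +30 -size +0 -print -exec rm -- {} +
Always list and verify the target files before deleting them. Both commands select only regular files older than 30 days with a size greater than zero, so the list you review is exactly what gets removed. Letting find run rm directly also avoids passing file names through a shell, where spaces or quotes in a name could break the command.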
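
One way to enforce the list-then-delete habit is a helper that defaults to a dry run. The sketch below is illustrative only: the function name clean_old_tmp and the --delete flag are hypothetical, and the selection criteria (regular *.tmp files, older than 30 days, non-empty) are taken from the scenario above.

```shell
# clean_old_tmp DIR [--delete]
# Without --delete: only print the files that would be removed (dry run).
# With --delete: print each file and remove it.
clean_old_tmp() {
    local dir="$1" mode="${2:-dry-run}"
    # Reuse the positional parameters as one shared list of find criteria,
    # so the dry run and the real run can never drift apart.
    set -- "$dir" -type f -name "*.tmp" -mtime +30 -size +0
    if [ "$mode" = "--delete" ]; then
        find "$@" -print -exec rm -- {} +
    else
        find "$@" -print
    fi
}
```

Run clean_old_tmp /var/tmp first, review the output, then repeat the call with --delete.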
Scenario 3: Database connection log analysis
Analyze MySQL log connection counts by hour.
find /var/log/mysql -name "*.log" -mtime -1 | \
xargs grep -h "Connect" | \
awk '{
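# Requires gawk: the three-argument match() is a GNU extension
if (match($0, /[0-9]{4}-[0-9]{2}-[0-9]{2}T([0-9]{2})/, time_parts)) {
hour = time_parts[1] + 0;  # numeric, so "09" matches loop index 9 in END
connections[hour]++;
}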
}
END {
print "MySQL connections by hour (last 24h)";
for (h = 0; h < 24; h++) {
printf "%02d:00-%02d:59 | ", h, h;
count = (h in connections) ? connections[h] : 0;
printf "%5d ", count;
for (i = 0; i < count/10; i++) printf "▓";
printf "\n";
}
}'
Useful One-Liners
Convenient one-liners commonly used in practice. Ready to use, with high practical value.
Disk and file management:
# Top 20 largest files
find . -type f -exec du -h {} + | sort -rh | head -20
# Calculate total size of old log files
# (sizes come from find in bytes; ls -lh prints values such as 1.2M that cannot be summed)
find /var -name "*.log" -mtime +7 -printf "%s\n" | awk '{size+=$1} END {printf "Recoverable size: %.2f MB\n", size/1024/1024}'
Network and access analysis:
# Top 10 IPs by today's access count
grep "$(date '+%d/%b/%Y')" /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
# IPs with most SSH login failures
# ($(NF-3) is the source IP whether or not the line contains "invalid user")
find /var/log -name "*.log" -print0 | xargs -0 grep -h "Failed password" | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn
System monitoring:
# Memory usage per process
find /proc -maxdepth 2 -name "status" 2>/dev/null | xargs grep -l "VmRSS" | xargs -I {} bash -c 'echo -n "$(basename $(dirname {})): "; grep VmRSS {}'
# Today's system error/warning sources
# (%e pads the day with a space, matching traditional syslog timestamps such as "Jan  5")
find /var/log -name "syslog*" | xargs grep -h "$(date '+%b %e')" | grep -i "error\|warn\|fail" | awk '{print $5}' | sort | uniq -c | sort -rn
Pipeline Design Patterns
Error handling and recovery patterns:
In production, failure is expected. Designs that handle errors and continue processing matter.
#!/bin/bash
set -Eeuo pipefail  # -E lets the ERR trap fire inside functions as well
handle_error() {
echo "ERROR: pipeline failure on line $1" >&2
exit 1
}
trap 'handle_error $LINENO' ERR
process_logs_safely() {
local input_pattern="$1"
local output_file="$2"
# mktemp creates a private directory with an unpredictable name
local temp_dir
temp_dir=$(mktemp -d "${TMPDIR:-/tmp}/pipeline.XXXXXX")
find /var/log -name "$input_pattern" -type f 2>/dev/null > "$temp_dir/file_list" || {
echo "WARNING: some files not accessible" >&2
}
if [[ ! -s "$temp_dir/file_list" ]]; then
echo "ERROR: no target files" >&2
rm -rf "$temp_dir"
return 1
fi
while IFS= read -r logfile; do
if [[ -r "$logfile" ]]; then
grep -h "ERROR\|WARN" "$logfile" 2>/dev/null >> "$temp_dir/errors.log" || true
fi
done < "$temp_dir/file_list"
if [[ -s "$temp_dir/errors.log" ]]; then
awk '
{
if ($0 ~ /ERROR/) error_count++;
if ($0 ~ /WARN/) warn_count++;
}
END {
printf "ERROR: %d\n", error_count;
printf "WARN: %d\n", warn_count;
}' "$temp_dir/errors.log" > "$output_file"
fi
rm -rf "$temp_dir"
}
Parallel processing pipeline:
Speed up CPU-intensive processing with parallelism.
parallel_log_analysis() {
local log_pattern="$1"
local output_dir="$2"
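local cpu_cores
cpu_cores=$(nproc)
# Leave one core free, but never drop to 0 (xargs -P 0 runs as many jobs as possible)
local max_parallel=$(( cpu_cores > 1 ? cpu_cores - 1 : 1 ))
find /var/log -name "$log_pattern" -type f | \
xargs -P "$max_parallel" -I {} bash -c '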
logfile="$1"
output_dir="$2"
result_file="$output_dir/result_$$.tmp"
grep -E "([0-9]{1,3}\.){3}[0-9]{1,3}" "$logfile" | \
awk "
{
ip = \$1;
if (match(ip, /^192\.168\./)) region = \"local\";
else if (match(ip, /^10\./)) region = \"internal\";
else region = \"external\";
total_by_region[region]++;
}
END {
for (region in total_by_region) {
printf \"region:%s total:%d\n\", region, total_by_region[region];
}
}" > "$result_file"
' -- {} "$output_dir"
}
Real-World Use Cases
Practice over theory. Examples by job role.
Web Engineer
Sudden production incident response
"Site is slow" report. Need to identify the cause quickly.
# 1. Check error logs
find /var/log/apache2 /var/log/nginx -name "*.log" | xargs grep -E "$(date '+%d/%b/%Y')" | grep -E "5[0-9][0-9]|error|timeout" | tail -50
# 2. Identify slow queries
# (prints the statements of entries whose Query_time is 5 seconds or more, including 10+ seconds)
find /var/log/mysql -name "*slow.log" -exec awk '/^# Query_time:/ {slow = ($3 + 0 >= 5)} slow && !/^#/' {} +
# 3. Detect abnormal access patterns
grep "$(date '+%d/%b/%Y')" /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | awk '$1 > 1000 {print "Abnormal:", $2, "count:", $1}'
Tasks that take 30-60 minutes manually are completed in 5 minutes.
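
For the slow-query step, a numeric comparison in awk is more reliable than matching digits with a pattern, because it also catches queries that run for 10 seconds or longer. The following is a sketch under stated assumptions: the function name slow_queries is hypothetical, and the log is assumed to use the common slow-log layout in which each entry has a header line of the form "# Query_time: <seconds>  Lock_time: ..." followed by the statement text.

```shell
# slow_queries SECONDS FILE...
# Print the duration and statement text of every slow-log entry whose
# Query_time is at least SECONDS.
slow_queries() {
    local limit="$1"
    shift
    awk -v limit="$limit" '
        # Header line: "# Query_time: <seconds>  Lock_time: ..."
        /^# Query_time:/ {
            slow = ($3 + 0 >= limit + 0)
            if (slow) print "-- " $3 "s"
            next
        }
        /^#/ || /^SET timestamp=/ { next }   # skip other metadata lines
        slow { print }                       # statement text of a slow entry
    ' "$@"
}
```

For example, slow_queries 5 /var/log/mysql/mysql-slow.log lists every statement that took five seconds or more; the file path here is only an example.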
Monthly report generation
Summarize last month's access stats and error rates.
#!/bin/bash
LAST_MONTH=$(date -d "last month" '+%b/%Y')
echo "=== $LAST_MONTH Access Stats ==="
# Total access
TOTAL_ACCESS=$(find /var/log/apache2 -name "access.log*" | xargs grep "$LAST_MONTH" | wc -l)
echo "Total access: $TOTAL_ACCESS"
# Unique visitors
UNIQUE_VISITORS=$(find /var/log/apache2 -name "access.log*" | xargs grep -h "$LAST_MONTH" | awk '{print $1}' | sort -u | wc -l)
echo "Unique visitors: $UNIQUE_VISITORS"
# Error rate
ERROR_COUNT=$(find /var/log/apache2 -name "access.log*" | xargs grep "$LAST_MONTH" | grep -E " [45][0-9][0-9] " | wc -l)
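# Guard against division by zero when no requests matched
if [ "$TOTAL_ACCESS" -gt 0 ]; then
ERROR_RATE=$(echo "scale=2; $ERROR_COUNT * 100 / $TOTAL_ACCESS" | bc)
else
ERROR_RATE=0
fi
echo "Error rate: $ERROR_RATE%"
Infrastructure Engineer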
Server monitoring and maintenance
Periodic health check of multiple servers.
#!/bin/bash
echo "=== Server Health Report ==="
date
# Disk usage warning
echo "=== Disk Usage (warn at 80%+) ==="
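# ($5+0 forces a numeric comparison; after gsub the field would otherwise compare as a string)
df -hP | awk 'NR>1 {gsub(/%/, "", $5); if ($5+0 >= 80) printf "WARNING: %s: %s used (%s%%)\n", $6, $3, $5}'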
# Memory usage
echo "=== Memory Usage ==="
free -m | awk 'NR==2{printf "Memory: %.1f%% (%dMB / %dMB)\n", $3*100/$2, $3, $2}'
# Top CPU processes
echo "=== Top 5 CPU ==="
ps aux --no-headers | sort -rn -k3 | head -5 | awk '{printf "%-10s %5.1f%% %s\n", $1, $3, $11}'
# Recent error spike check
echo "=== Errors in last hour ==="
find /var/log -name "*.log" -mmin -60 | xargs grep -h -E "$(date '+%b %e %H')|$(date -d '1 hour ago' '+%b %e %H')" | grep -ci error
Data Analyst
Large data preprocessing
A CSV file of several gigabytes cannot be opened in Excel, so it needs preprocessing first.
#!/bin/bash
CSV_FILE="sales_data_2024.csv"
OUTPUT_DIR="processed_data"
mkdir -p "$OUTPUT_DIR"
# File size and row count
echo "File size: $(du -h "$CSV_FILE" | cut -f1)"
echo "Total rows: $(wc -l < "$CSV_FILE")"
# Data quality check
echo "Empty rows: $(grep -c '^$' "$CSV_FILE")"
echo "Invalid rows: $(awk -F',' 'NF != 5 {count++} END {print count+0}' "$CSV_FILE")"
# Split into monthly files (the output directory is passed with -v instead of being spliced into the script)
awk -F',' -v outdir="$OUTPUT_DIR" 'NR==1 {header=$0; next}
{
    month = substr($1, 1, 7);
    out = outdir "/sales_" month ".csv";
    if (!seen[month]++) print header > out;
    print $0 > out
}' "$CSV_FILE"
# Monthly summary
find "$OUTPUT_DIR" -name "sales_*.csv" | sort | while IFS= read -r file; do
    month=$(basename "$file" .csv | cut -d'_' -f2)
    total_sales=$(awk -F',' 'NR>1 {sum+=$4} END {print sum+0}' "$file")
    record_count=$(( $(wc -l < "$file") - 1 ))
    printf "%s: %d records, total sales: %s\n" "$month" "$record_count" "$total_sales"
done
Industry Case Studies
Game development: large-scale log analysis
Detect cheating from 100GB/day of player action logs.
analyze_game_logs() {
    local log_date="$1"
    local output_dir="/analysis/$(date +%Y%m%d)"
    mkdir -p "$output_dir"
    find /game/logs -name "*${log_date}*.log" -type f -print0 | \
    xargs -0 grep -h "PLAYER_ACTION" | \
    awk -F'|' -v outdir="$output_dir" '
    {
        player_id = $3;
        action = $4;
        value = $5;
        # Detect rapid mass actions (report each player once, on the 11th level-up)
        if (action == "LEVEL_UP" && ++player_levelups[player_id] == 11) {
            print "SUSPICIOUS_LEVELUP", player_id > (outdir "/cheat_suspects.log");
        }
        # Abnormal currency spike
        if (action == "GOLD_CHANGE" && value + 0 > 1000000) {
            print "SUSPICIOUS_GOLD", player_id, value > (outdir "/gold_anomaly.log");
        }
        if (!(player_id in player_actions)) player_count++;
        player_actions[player_id]++;
        total_actions++;
    }
    END {
        # Nothing to report (and no division by zero) when no lines matched
        if (player_count == 0) exit;
        avg_actions = total_actions / player_count;
        for (player in player_actions) {
            if (player_actions[player] > avg_actions * 5) {
                printf "HIGH_ACTIVITY: %s (%d actions)\n", player, player_actions[player];
            }
        }
    }'
}
E-commerce: customer behavior analysis
Analyze purchase patterns from EC site access logs.
analyze_customer_journey() {
    local analysis_period="$1"
    local output_dir="/analytics/customer_journey"
    mkdir -p "$output_dir"
    find /var/log/nginx -name "access.log*" -print0 | \
    xargs -0 grep -h "$analysis_period" | \
    awk '
    # Not used yet: sessions are approximated by client IP
    BEGIN { session_timeout = 1800; }
    {
        ip = $1;
        url = $7;
        if (url ~ /\/checkout|\/purchase/) {
            purchase_sessions[ip]++;
        }
        # "/products/" is 10 characters long, so the ID starts at RSTART + 10
        if (match(url, /\/products\/[0-9]+/)) {
            product_id = substr(url, RSTART + 10, RLENGTH - 10);
            product_views[product_id]++;
        }
    }
    END {
        print "=== Customer Journey Analysis ===";
        for (product in product_views) {
            printf "Product %s: %d views\n", product, product_views[product];
        }
        for (ip in purchase_sessions) buyers++;
        printf "Clients reaching checkout/purchase: %d\n", buyers;
    }'
}
Performance Optimization
Basic Speed-up Techniques
# 1. Limit search scope
find /var/log -name "*.log"    # Good
# find / -name "*.log"         # Bad (slow)
# 2. Locale optimization
LC_ALL=C grep "ERROR" huge.log
# 3. Fixed strings with -F
grep -F "literal_string" file.txt
# 4. Skip unwanted directories
find /var -path "*/node_modules" -prune -o -name "*.log" -print
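
The -prune form in item 4 is easy to get wrong, so here is a small sketch that isolates it. The function name find_logs_pruned is hypothetical. The expression reads as: if the entry is a directory with the given name, do not descend into it; otherwise, if it is a regular *.log file, print it. The explicit -print on the right-hand side matters, because without it find would also print the pruned directory itself.

```shell
# find_logs_pruned ROOT SKIP_DIR
# List *.log files under ROOT without descending into any directory
# named SKIP_DIR (for example node_modules).
find_logs_pruned() {
    local root="$1" skip="$2"
    find "$root" -type d -name "$skip" -prune -o -type f -name "*.log" -print
}
```

Because find never enters the pruned directories, this saves the traversal cost as well as the filtering, which is where the speed-up comes from on large trees.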
Advanced Optimization Techniques
# Parallel processing
find /var/log -name "*.log" | xargs -P 4 grep "ERROR"
# Search compressed files directly
zgrep "ERROR" /var/log/app.log.gz
# Note: the --mmap option is obsolete in GNU grep and no longer speeds up reads, so use plain grep
grep "ERROR" huge_file.log
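
Several of these techniques combine naturally. The sketch below counts matching lines across many files in parallel. It is an illustration rather than a tuned implementation: the function name count_matches_parallel and its default of 4 jobs are assumptions, and it relies on find -print0 and xargs -0 / -P, which are available in GNU and BSD tools but are not part of POSIX.

```shell
# count_matches_parallel DIR STRING [JOBS]
# Count the lines containing the fixed string STRING in all *.log files
# under DIR, running up to JOBS grep processes at once.
count_matches_parallel() {
    local dir="$1" pattern="$2" jobs="${3:-4}"
    find "$dir" -type f -name "*.log" -print0 |
        LC_ALL=C xargs -0 -n 1 -P "$jobs" grep -c -F -- "$pattern" |
        awk '{total += $1} END {print total + 0}'
}
```

With -n 1 each grep receives a single file, so grep -c prints a bare number per file and awk only has to add them up. LC_ALL=C and -F apply the locale and fixed-string optimizations from the basic techniques.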
Performance Measurement
# Measure execution time
time grep "ERROR" /var/log/huge.log
# Detailed resource usage
/usr/bin/time -v grep "ERROR" /var/log/huge.log
# Effect of parallel processing
for cores in 1 2 4 8; do
echo "cores: $cores"
time find /var/log -name "*.log" | xargs -P $cores grep -c "ERROR"
done