Combining find, grep, and awk - Practical Techniques for Real Work
This guide covers practical data-processing patterns that combine find, grep, and awk, along with real-world use cases and performance optimization. The techniques are meant to be immediately usable by engineers and data analysts.
What You'll Learn
- Practical patterns that pipe find, grep, and awk together
- Real-world use cases and workflows organized by job role
- Performance optimization tips for handling large datasets
- Immediately usable data processing skills for the field
Combination Techniques
Conclusion: Piping find, grep, and awk together solves tasks that are awkward with any single command.
Each tool has one job: find selects files, grep filters lines, and awk extracts and aggregates fields. Connected by pipes, they handle work that none of them can do alone.
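As a minimal sketch of that division of labor, the following builds a throwaway directory and runs one pipeline over it. The file names, log lines, and the function name count_errors_by_component are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: the directory layout and log lines below are invented sample data.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT

printf '2024-01-01 10:00:00 ERROR db timeout\n2024-01-01 10:00:05 INFO ok\n' > "$demo_dir/app.log"
printf '2024-01-01 11:00:00 ERROR disk full\n2024-01-01 11:00:01 ERROR db timeout\n' > "$demo_dir/worker.log"
printf 'ERROR not a log file\n' > "$demo_dir/notes.txt"

count_errors_by_component() {
    # $1: directory to search (hypothetical helper, not from the article)
    # find selects *.log files, grep keeps ERROR lines, awk counts by field 4
    find "$1" -type f -name "*.log" -exec grep -h "ERROR" {} + |
        awk '{count[$4]++} END {for (c in count) print c, count[c]}' |
        sort
}

count_errors_by_component "$demo_dir"
```

The notes.txt file is skipped by find, the INFO line is dropped by grep, and awk only sees the lines that survive both filters.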
Basic Pipe Patterns
find + grep:
# Find log files containing ERROR
find /var/log -name "*.log" -exec grep -l "ERROR" {} \;
# Search "password" in txt files with line numbers
# (-print0 / -0 keeps file names containing spaces intact)
find /home -name "*.txt" -print0 | xargs -0 grep -n "password"
grep + awk:
# Extract date, time, and last field from error lines
grep "ERROR" /var/log/app.log | awk '{print $1, $2, $NF}'
# Sum CPU usage of nginx processes
# (%CPU is column 3 of ps aux; [n]ginx keeps the grep process itself out of the match)
ps aux | grep "[n]ginx" | awk '{sum+=$3} END {print "Total CPU:", sum "%"}'
find + awk:
# Calculate total size and count of log files
find /var -name "*.log" -printf "%s %p\n" | awk '{size+=$1; count++} END {printf "Total: %.2f MB Files: %d\n", size/1024/1024, count}'
Real-World Combined Processing
Scenario 1: Web server access analysis
Extract the top 10 IP addresses with the most errors from the last week.
find /var/log/apache2 -name "access.log*" -mtime -7 | \
xargs grep -h " 5[0-9][0-9] " | \
awk '{print $1}' | \
sort | uniq -c | \
sort -rn | \
head -10 | \
awk '{printf "%-15s %d times\n", $2, $1}'
Step-by-step explanation:
- find: Find access logs modified in the last 7 days
- grep: Extract lines with 5xx errors (server errors); -h suppresses the file-name prefix so that column 1 stays the IP address
- awk: Extract IP addresses (column 1)
- sort | uniq -c: Count by IP address
- sort -rn: Sort by count descending
- head -10: Top 10
- awk: Format output
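The steps above can be tried safely on sample data. The sketch below is a variation, not the pipeline itself: it swaps the grep stage for an awk test on field 9, assuming the Apache combined log format where field 9 is the status code. That avoids counting a response size such as 512 as a server error. The log lines and the function name top_error_ips are invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: sample log lines use documentation IP ranges and are made up.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT

cat > "$demo_dir/access.log" <<'EOF'
203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /a HTTP/1.1" 500 123 "-" "curl/8.0"
203.0.113.5 - - [10/Oct/2024:13:55:37 +0000] "GET /b HTTP/1.1" 502 123 "-" "curl/8.0"
198.51.100.7 - - [10/Oct/2024:13:55:38 +0000] "GET /c HTTP/1.1" 503 123 "-" "curl/8.0"
198.51.100.7 - - [10/Oct/2024:13:55:39 +0000] "GET /d HTTP/1.1" 200 500 "-" "curl/8.0"
192.0.2.9 - - [10/Oct/2024:13:55:40 +0000] "GET /e HTTP/1.1" 200 512 "-" "curl/8.0"
EOF
cat > "$demo_dir/access.log.1" <<'EOF'
203.0.113.5 - - [09/Oct/2024:08:00:00 +0000] "GET /a HTTP/1.1" 500 123 "-" "curl/8.0"
EOF

top_error_ips() {
    # $1: log directory, $2: number of IPs to show (hypothetical helper)
    # Assumes combined log format: field 1 = client IP, field 9 = status code
    find "$1" -type f -name "access.log*" -mtime -7 \
        -exec awk '$9 ~ /^5[0-9][0-9]$/ {print $1}' {} + |
        sort | uniq -c | sort -rn | head -n "$2" |
        awk '{printf "%-15s %d times\n", $2, $1}'
}

top_error_ips "$demo_dir" 10
```

With this sample data the request from 192.0.2.9 is not reported, even though its response size is 512, because only the status field is tested.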
Scenario 2: Bulk delete old temporary files
Safely delete temporary files older than 30 days from the system.
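# 1. First check target files (same conditions as the deletion step)
find /tmp /var/tmp /home -type f -name "*.tmp" -mtime +30 -size +0 -exec ls -la {} +
# 2. After confirming, run the deletion
find /tmp /var/tmp /home -type f -name "*.tmp" -mtime +30 -size +0 -print -exec rm -- {} +
Always list and verify target files before deletion. Both commands use identical conditions, so the preview shows exactly what will be removed: only files older than 30 days with size greater than zero. Passing file names through -exec, rather than interpolating them into a bash -c string, keeps names containing spaces or quotes from breaking the command.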
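One way to keep the preview and the deletion from drifting apart is to put both behind a single function. This is a minimal sketch, assuming GNU findutils and coreutils (for -delete and touch -d); the function name purge_old_tmp and its --delete switch are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: purge_old_tmp is an illustrative helper, not a standard tool.
purge_old_tmp() {
    # $1: directory, $2: minimum age in days, $3: pass "--delete" to really remove
    local dir="$1" days="$2" mode="${3:-}"
    if [ "$mode" = "--delete" ]; then
        find "$dir" -type f -name "*.tmp" -mtime +"$days" -size +0 -print -delete
    else
        # Dry run: identical conditions, but only print the matches
        find "$dir" -type f -name "*.tmp" -mtime +"$days" -size +0 -print
    fi
}

# Throwaway test data: one old file (name contains a space on purpose), one new file
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
printf 'old data\n' > "$demo_dir/old report.tmp"
printf 'new data\n' > "$demo_dir/new.tmp"
touch -d '45 days ago' "$demo_dir/old report.tmp"

purge_old_tmp "$demo_dir" 30            # preview only
```

After checking the preview, running purge_old_tmp "$demo_dir" 30 --delete removes the same set of files.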
Scenario 3: Database connection log analysis
Analyze MySQL log connection counts by hour.
find /var/log/mysql -name "*.log" -mtime -1 | \
xargs grep -h "Connect" | \
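gawk '{
# The three-argument match() is a gawk extension, hence gawk rather than awk
if (match($0, /[0-9]{4}-[0-9]{2}-[0-9]{2}T([0-9]{2})/, time_parts)) {
hour = time_parts[1] + 0;   # "+ 0" turns "09" into 9 so it matches the loop index in END
connections[hour]++;
}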
}
END {
print "MySQL connections by hour (last 24h)";
for (h = 0; h < 24; h++) {
printf "%02d:00-%02d:59 | ", h, h;
count = (h in connections) ? connections[h] : 0;
printf "%5d ", count;
for (i = 0; i < count/10; i++) printf "▓";
printf "\n";
}
}'
Useful One-Liners
These one-liners come up often in day-to-day operations and can be adapted by changing the paths and patterns.
Disk and file management:
# Top 20 largest files
find . -type f -exec du -h {} + | sort -rh | head -20
# Calculate total size of old log files
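# (-printf "%s" prints the size in bytes; the human-readable sizes from ls -lh cannot be summed)
find /var -name "*.log" -mtime +7 -printf "%s\n" | awk '{size+=$1} END {printf "Recoverable size: %.2f MB\n", size/1024/1024}'
Network and access analysis: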
# Top 10 IPs by today's access count
grep "$(date '+%d/%b/%Y')" /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
# IPs with most SSH login failures
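# ($(NF-3) is the source IP for both valid and "invalid user" attempts, assuming the line ends in "from IP port N ssh2")
find /var/log -name "*.log" | xargs grep -h "Failed password" | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn
System monitoring: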
# Memory usage per process
find /proc -maxdepth 2 -name "status" 2>/dev/null | xargs grep -l "VmRSS" | xargs -I {} bash -c 'echo -n "$(basename $(dirname {})): "; grep VmRSS {}'
# Today's system error/warning sources
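# (%e pads single-digit days with a space, matching traditional syslog timestamps such as "Jan  5")
find /var/log -name "syslog*" | xargs grep -h "$(date '+%b %e')" | grep -i "error\|warn\|fail" | awk '{print $5}' | sort | uniq -c | sort -rn
Pipeline Design Patterns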
Error handling and recovery patterns:
In production, failure is expected. Designs that handle errors and continue processing matter.
#!/bin/bash
set -euo pipefail
handle_error() {
echo "ERROR: pipeline failure on line $1" >&2
exit 1
}
trap 'handle_error $LINENO' ERR
process_logs_safely() {
local input_pattern="$1"
local output_file="$2"
local temp_dir
temp_dir="$(mktemp -d)"   # unpredictable name, unlike /tmp/pipeline_$$
find /var/log -name "$input_pattern" -type f 2>/dev/null > "$temp_dir/file_list" || {
echo "WARNING: some files not accessible" >&2
}
if [[ ! -s "$temp_dir/file_list" ]]; then
echo "ERROR: no target files" >&2
rm -rf "$temp_dir"
return 1
fi
while IFS= read -r logfile; do
if [[ -r "$logfile" ]]; then
grep -h "ERROR\|WARN" "$logfile" 2>/dev/null >> "$temp_dir/errors.log" || true
fi
done < "$temp_dir/file_list"
if [[ -s "$temp_dir/errors.log" ]]; then
awk '
{
if ($0 ~ /ERROR/) error_count++;
if ($0 ~ /WARN/) warn_count++;
}
END {
printf "ERROR: %d\n", error_count;
printf "WARN: %d\n", warn_count;
}' "$temp_dir/errors.log" > "$output_file"
fi
rm -rf "$temp_dir"
}
Parallel processing pipeline:
Speed up CPU-intensive processing with parallelism.
parallel_log_analysis() {
local log_pattern="$1"
local output_dir="$2"
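local cpu_cores
cpu_cores=$(nproc)
# Leave one core free, but never drop to 0 (xargs -P 0 means "as many as possible")
local max_parallel=$(( cpu_cores > 1 ? cpu_cores - 1 : 1 ))
# -I already runs one command per input line, so -n 1 is dropped (GNU xargs warns when both are given)
find /var/log -name "$log_pattern" -type f | \
xargs -P "$max_parallel" -I {} bash -c '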
logfile="$1"
output_dir="$2"
result_file="$output_dir/result_$$.tmp"
grep -E "([0-9]{1,3}\.){3}[0-9]{1,3}" "$logfile" | \
awk "
{
ip = \$1;
if (match(ip, /^192\.168\./)) region = \"local\";
else if (match(ip, /^10\./)) region = \"internal\";
else region = \"external\";
total_by_region[region]++;
}
END {
for (region in total_by_region) {
printf \"region:%s total:%d\n\", region, total_by_region[region];
}
}" > "$result_file"
' -- {} "$output_dir"
}
Real-World Use Cases
Conclusion: Role-based examples show how the three commands help in real work.
Practice over theory. Examples by job role.
Web Engineer
Sudden production incident response
"Site is slow" report. Need to identify the cause quickly.
# 1. Check error logs
find /var/log/apache2 /var/log/nginx -name "*.log" | xargs grep -E "$(date '+%d/%b/%Y')" | grep -E "5[0-9][0-9]|error|timeout" | tail -50
# 2. Identify slow queries
# (numeric test on the Query_time value, so 10 seconds and above are included, not just 5 to 9)
find /var/log/mysql -name "*slow.log" -exec awk '$2 == "Query_time:" && $3 >= 5 {print; getline; print}' {} +
# 3. Detect abnormal access patterns
grep "$(date '+%d/%b/%Y')" /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | awk '$1 > 1000 {print "Abnormal:", $2, "count:", $1}'
Triage that takes 30 to 60 minutes by hand can often be finished in about 5 minutes this way.
Monthly report generation
Summarize last month's access stats and error rates.
#!/bin/bash
LAST_MONTH=$(date -d "last month" '+%b/%Y')
echo "=== $LAST_MONTH Access Stats ==="
# Total access
TOTAL_ACCESS=$(find /var/log/apache2 -name "access.log*" | xargs grep "$LAST_MONTH" | wc -l)
echo "Total access: $TOTAL_ACCESS"
# Unique visitors
UNIQUE_VISITORS=$(find /var/log/apache2 -name "access.log*" | xargs grep -h "$LAST_MONTH" | awk '{print $1}' | sort -u | wc -l)
echo "Unique visitors: $UNIQUE_VISITORS"
# Error rate
ERROR_COUNT=$(find /var/log/apache2 -name "access.log*" | xargs grep "$LAST_MONTH" | grep -E " [45][0-9][0-9] " | wc -l)
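# Guard against division by zero when no requests matched
if [ "$TOTAL_ACCESS" -gt 0 ]; then
ERROR_RATE=$(echo "scale=2; $ERROR_COUNT * 100 / $TOTAL_ACCESS" | bc)
else
ERROR_RATE=0
fi
echo "Error rate: $ERROR_RATE%"
Infrastructure Engineer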
Server monitoring and maintenance
Periodic health check, run on each server.
#!/bin/bash
echo "=== Server Health Report ==="
date
# Disk usage warning
echo "=== Disk Usage (warn at 80%+) ==="
# (pct+0 forces a numeric comparison; -P keeps each filesystem on one line)
df -hP | awk 'NR>1 {pct=$5; sub(/%/, "", pct); if (pct+0 >= 80) printf "WARNING: %s: %s used (%d%%)\n", $6, $3, pct}'
# Memory usage
echo "=== Memory Usage ==="
free -m | awk 'NR==2{printf "Memory: %.1f%% (%dMB / %dMB)\n", $3*100/$2, $3, $2}'
# Top CPU processes
echo "=== Top 5 CPU ==="
ps aux --no-headers | sort -rn -k3 | head -5 | awk '{printf "%-10s %5.1f%% %s\n", $1, $3, $11}'
# Recent error spike check
echo "=== Errors in last hour ==="
# (%e pads single-digit days with a space, matching traditional syslog timestamps)
find /var/log -name "*.log" -mmin -60 | xargs grep -h -E "$(date '+%b %e %H')|$(date -d '1 hour ago' '+%b %e %H')" | grep -ci error
Data Analyst
Large data preprocessing
A CSV file of several gigabytes can't be opened in Excel, so it needs preprocessing first.
#!/bin/bash
CSV_FILE="sales_data_2024.csv"
OUTPUT_DIR="processed_data"
mkdir -p "$OUTPUT_DIR"
# File size and row count
echo "File size: $(du -h "$CSV_FILE" | cut -f1)"
echo "Total rows: $(wc -l < "$CSV_FILE")"
# Data quality check
echo "Empty rows: $(grep -c '^$' "$CSV_FILE")"
echo "Invalid rows: $(awk -F',' 'NF != 5 {count++} END {print count+0}' "$CSV_FILE")"
# Split into monthly files
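# (assumes column 1 starts with a YYYY-MM date; the directory is passed with -v instead of splicing it into the program)
awk -F',' -v outdir="$OUTPUT_DIR" 'NR==1 {header=$0; next}
{
month=substr($1,1,7);
file=outdir "/sales_" month ".csv";
if(!seen[month]) {
print header > file;
seen[month]=1;
}
print $0 > file
}' "$CSV_FILE"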
# Monthly summary
find "$OUTPUT_DIR" -name "sales_*.csv" | sort | while IFS= read -r file; do
month=$(basename "$file" .csv | cut -d'_' -f2)
# %.0f avoids scientific notation for large or fractional sums
total_sales=$(awk -F',' 'NR>1 {sum+=$4} END {printf "%.0f", sum}' "$file")
record_count=$(( $(wc -l < "$file") - 1 ))
printf "%s: %d records, total sales: %s\n" "$month" "$record_count" "$total_sales"
done
Industry Case Studies
Game development: large-scale log analysis
Detect cheating from 100GB/day of player action logs.
analyze_game_logs() {
local log_date="$1"
local output_dir="/analysis/$(date +%Y%m%d)"
mkdir -p "$output_dir"
find /game/logs -name "*${log_date}*.log" -type f | \
xargs grep -h "PLAYER_ACTION" | \
awk -F'|' -v outdir="$output_dir" '
{
player_id = $3;
action = $4;
value = $5;
# Detect rapid mass actions (report each player once, on the 11th level-up)
if (action == "LEVEL_UP") {
player_levelups[player_id]++;
if (player_levelups[player_id] == 11) {
print "SUSPICIOUS_LEVELUP", player_id > (outdir "/cheat_suspects.log");
}
}
# Abnormal currency spike
if (action == "GOLD_CHANGE" && value > 1000000) {
print "SUSPICIOUS_GOLD", player_id, value > (outdir "/gold_anomaly.log");
}
player_actions[player_id]++;
total_actions++;
}
END {
if (total_actions == 0) exit;   # no matching log lines, nothing to average
avg_actions = total_actions / length(player_actions);
for (player in player_actions) {
if (player_actions[player] > avg_actions * 5) {
printf "HIGH_ACTIVITY: %s (%d actions)\n", player, player_actions[player];
}
}
}'
}
E-commerce: customer behavior analysis
Analyze purchase patterns from e-commerce site access logs.
analyze_customer_journey() {
local analysis_period="$1"
local output_dir="/analytics/customer_journey"
mkdir -p "$output_dir"
find /var/log/nginx -name "access.log*" | \
xargs grep -h "$analysis_period" | \
gawk '
# gawk is required for the three-argument match() used below; grep -h keeps $1 a bare IP
BEGIN { session_timeout = 1800; }
{
ip = $1;
url = $7;
if (url ~ /\/checkout|\/purchase/) {
purchase_sessions[ip]++;
}
if (url ~ /\/products\/([0-9]+)/) {
match(url, /\/products\/([0-9]+)/, product_match);
product_views[product_match[1]]++;
}
}
END {
print "=== Customer Journey Analysis ===";
for (product in product_views) {
printf "Product %s: %d views\n", product, product_views[product];
}
}'
}
Performance Optimization
Conclusion: Limit scope, use fixed strings, and cut waste to run fast on big data.
Basic Speed-up Techniques
# 1. Limit search scope
find /var/log -name "*.log"   # Good
# find / -name "*.log"        # Bad (slow)
# 2. Locale optimization
LC_ALL=C grep "ERROR" huge.log
# 3. Fixed strings with -F
grep -F "literal_string" file.txt
# 4. Skip unwanted directories
find /var -path "*/node_modules" -prune -o -name "*.log" -print
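Besides speed, -F also changes what matches, which is easy to overlook. The sketch below illustrates this on a throwaway file; the request lines are made up.

```shell
#!/usr/bin/env bash
# Sketch only: sample lines are invented to show regex vs fixed-string matching.
demo_file=$(mktemp)
trap 'rm -f "$demo_file"' EXIT
printf '%s\n' 'GET /v1.2/users' 'GET /v1x2/users' 'GET /v1.2/orders' > "$demo_file"

# Regex mode: "." matches any character, so /v1x2/ is counted as well
regex_hits=$(LC_ALL=C grep -c 'v1.2' "$demo_file")
# Fixed-string mode: "." is a literal dot
fixed_hits=$(LC_ALL=C grep -c -F 'v1.2' "$demo_file")

echo "regex: $regex_hits, fixed: $fixed_hits"   # prints "regex: 3, fixed: 2"
```

When the search term is a literal such as an IP address, a version string, or a file name, -F is usually both the faster and the more accurate choice.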
Advanced Optimization Techniques
# Parallel processing
find /var/log -name "*.log" | xargs -P 4 grep "ERROR"
# Search compressed files directly
zgrep "ERROR" /var/log/app.log.gz
# Note: --mmap is obsolete in GNU grep (newer releases ignore or reject it), so it is omitted here
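Rotated logs are often a mix of plain and gzip-compressed files. As a rough sketch of searching both in one pass, the following relies on gzip -cdf, which decompresses .gz input and copies unrecognized input through unchanged. The file names and the function name count_errors_all_rotations are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: sample files are invented; requires gzip on the PATH.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
printf 'ERROR one\nINFO two\n' > "$demo_dir/app.log"
printf 'ERROR three\nERROR four\n' > "$demo_dir/app.log.1"
gzip "$demo_dir/app.log.1"          # becomes app.log.1.gz

count_errors_all_rotations() {
    # $1: directory, $2: base log name (hypothetical helper)
    # gzip -cdf: decompress to stdout, pass non-gzip files through as-is
    find "$1" -type f -name "$2*" -exec gzip -cdf {} + |
        LC_ALL=C grep -c -F "ERROR"
}

count_errors_all_rotations "$demo_dir" "app.log"   # prints 3
```

This keeps a single grep process for all files, at the cost of losing the per-file names that zgrep would report.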
Performance Measurement
# Measure execution time
time grep "ERROR" /var/log/huge.log
# Detailed resource usage
/usr/bin/time -v grep "ERROR" /var/log/huge.log
# Effect of parallel processing
for cores in 1 2 4 8; do
echo "cores: $cores"
time find /var/log -name "*.log" | xargs -P $cores grep -c "ERROR"
done