Analyzing Web Server Logs for Bot Traffic Using Python

Introduction

Understanding bot traffic is essential for optimizing website performance, improving security, and enhancing SEO strategy. This project provides a Python script to analyze web server logs and categorize bot traffic, with a focus on search engine crawlers. It leverages the pandas library for data manipulation and analysis.

Why Analyze Bot Traffic?

Bot traffic plays a crucial role in indexing your website, but excessive or malicious bot activity can negatively impact performance. This script helps you:

  • Identify different types of bots accessing your site.
  • Analyze their behavior.
  • Take appropriate actions based on insights.

Getting Started

Prerequisites

  • Python 3.7 or later.
  • Web server log files stored in a logs directory at the project root.
  • Log files must follow a standard format (e.g., Combined Log Format).
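For reference, a single entry in the Combined Log Format looks like the following illustrative line (adapted from the standard Apache documentation example, not taken from a real log):

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

The parsing regex in analyze_logs below expects this layout: client IP, timestamp in brackets, quoted request line, status code, response size, referer, and User-Agent.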

Configuring the Script

Modify the log_files list in the if __name__ == "__main__": block to specify your log files:

if __name__ == "__main__":
    log_files = ["logs/your_log_file.log", "logs/another_log_file.log"]  # Add your log file paths
    df = analyze_logs(log_files)
    summarize_analysis(df)

Code Structure and Explanation

The script is organized into key functions:

1. Helper Functions

is_google_ip(ip)

  • Purpose: Checks if an IP belongs to Google’s crawler network.
  • Implementation: Uses the ipaddress module to check whether the IP falls within a hard-coded list of known Googlebot ranges.
import ipaddress

def is_google_ip(ip):
    """Return True if the IP address falls within a known Googlebot range."""
    # Known Googlebot IPv4/IPv6 ranges; Google publishes the current list, so update as needed.
    google_ranges = [
        "66.249.64.0/19", "66.249.80.0/20", "64.233.160.0/19", "216.239.32.0/19",
        "2001:4860:4801::/48", "2404:6800:4003::/48", "2607:f8b0:4003::/48", "2800:3f0:4003::/48"
    ]
    try:
        ip_obj = ipaddress.ip_address(ip)
    except ValueError:
        # Not a valid IPv4/IPv6 address.
        return False
    return any(ip_obj in ipaddress.ip_network(net) for net in google_ranges)
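Note that is_google_ip is not called by analyze_logs below; it is a standalone helper you can apply to the parsed data, for example to check whether a request that claims to be Googlebot really originates from a Google address (a fuller sketch of that idea appears under "Advanced Bot Detection" later). A quick check against the listed ranges:

# 66.249.66.1 falls inside 66.249.64.0/19, one of the ranges listed above.
print(is_google_ip("66.249.66.1"))   # True
print(is_google_ip("203.0.113.7"))   # False: a documentation/example address, not a Google range
print(is_google_ip("not-an-ip"))     # False: invalid input is handled gracefully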

identify_search_engine(user_agent)

  • Purpose: Identifies the search engine based on the User-Agent string.
  • Implementation: Uses regular expressions to match known bot patterns.
import re

def identify_search_engine(user_agent):
    """Map a User-Agent string to a search engine name, or 'Other' if no pattern matches."""
    # Known bot identifiers to look for in the User-Agent string.
    search_engines = {
        'Google': ['Googlebot', 'AdsBot-Google'],
        'Bing': ['bingbot'],
        'DuckDuckGo': ['DuckDuckBot'],
        'Yandex': ['YandexBot']
    }
    for engine, patterns in search_engines.items():
        if any(re.search(pattern, user_agent, re.IGNORECASE) for pattern in patterns):
            return engine
    return 'Other'
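A quick usage example; the first string below is the User-Agent that Googlebot commonly identifies itself with:

print(identify_search_engine("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # Google
print(identify_search_engine("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # Other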

2. Main Analysis Function

analyze_logs(log_files)

  • Purpose: Parses web server logs to extract key information.
  • Implementation: Reads logs line by line with a regular expression for the Combined Log Format (IPv4 client addresses), extracts the IP, timestamp, method, URL, status, referer, and User-Agent, then classifies search engine bots.
import pandas as pd

# Regular expression for the Combined Log Format (matches IPv4 client addresses).
LOG_PATTERN = re.compile(
    r'(\d+\.\d+\.\d+\.\d+).*?\[(.*?)\].*?"(\w+) (.*?) HTTP.*?" (\d+) .*?"(.*?)" "(.*?)"'
)

def analyze_logs(log_files):
    """Parse log files into a DataFrame of requests tagged with the detected search engine."""
    records = []
    for log_file in log_files:
        with open(log_file, 'r') as f:
            for line in f:
                match = LOG_PATTERN.search(line)
                if match:
                    ip, timestamp, method, url, status, referer, user_agent = match.groups()
                    search_engine = identify_search_engine(user_agent)
                    records.append({
                        'ip': ip, 'datetime': timestamp, 'method': method,
                        'url': url, 'status': int(status), 'referer': referer,
                        'user_agent': user_agent, 'search_engine': search_engine
                    })
    df = pd.DataFrame(records)
    if not df.empty:
        # Log timestamps include a timezone offset, e.g. "10/Oct/2000:13:55:36 -0700";
        # parse it and normalize to UTC so mixed offsets are handled consistently.
        df['datetime'] = pd.to_datetime(df['datetime'], format="%d/%b/%Y:%H:%M:%S %z",
                                        utc=True, errors='coerce')
    return df

3. Summarization and Output

summarize_analysis(df)

  • Purpose: Provides a high-level summary of search engine bot activity.
def summarize_analysis(df):
    if df.empty:
        print("No data to summarize.")
        return
    print("\n--- Summary of Log Analysis ---")
    print("\nTop Search Engines:")
    print(df['search_engine'].value_counts())

4. Main Execution

  • Purpose: Runs the analysis workflow.
if __name__ == "__main__":
    log_files = ["logs/your_log_file.log"]
    df = analyze_logs(log_files)
    summarize_analysis(df)

Running the Script

  1. Open a terminal.
  2. Navigate to the project directory.
  3. Execute: python analyze_logs.py

Extending the Script

Here are ways to improve the script, with a rough sketch after each group of suggestions:

More Detailed Analysis

  • Calculate status code distributions per search engine.
  • Identify top pages crawled by different bots.
  • Filter out static assets (CSS, JavaScript, images).
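A minimal pandas sketch of these three ideas, assuming the DataFrame returned by analyze_logs above (the list of static-asset extensions is just one possible choice):

def detailed_analysis(df):
    # Status code distribution per search engine.
    print("\nStatus codes per search engine:")
    print(df.groupby('search_engine')['status'].value_counts())

    # Top 10 URLs crawled by each identified bot (excluding 'Other').
    bots = df[df['search_engine'] != 'Other']
    print("\nTop crawled pages per search engine:")
    print(bots.groupby('search_engine')['url'].value_counts().groupby(level=0).head(10))

    # Drop common static assets before further analysis.
    static_suffixes = ('.css', '.js', '.png', '.jpg', '.jpeg', '.gif', '.svg', '.ico')
    pages_only = df[~df['url'].str.lower().str.endswith(static_suffixes)]
    return pages_only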

Data Visualization

  • Use matplotlib and seaborn to generate insights through charts.
  • Create automated reports in HTML or PDF.
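As a starting point, a small matplotlib sketch (seaborn would work equally well), again assuming the DataFrame returned by analyze_logs; the output file names are arbitrary:

import matplotlib.pyplot as plt

def plot_bot_traffic(df):
    # Bar chart of request counts per search engine.
    df['search_engine'].value_counts().plot(kind='bar', title='Requests per search engine')
    plt.tight_layout()
    plt.savefig('bot_traffic_by_engine.png')
    plt.close()

    # Daily request counts for identified bots, to spot crawl spikes over time.
    bots = df[df['search_engine'] != 'Other'].dropna(subset=['datetime'])
    bots.set_index('datetime').resample('D').size().plot(title='Bot requests per day')
    plt.tight_layout()
    plt.savefig('bot_requests_per_day.png')
    plt.close()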

GeoIP Analysis

  • Integrate a GeoIP library to track bot locations.
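One possible sketch, assuming the geoip2 package is installed and a MaxMind GeoLite2-Country.mmdb database file is available locally (both are assumptions, not part of this project):

import geoip2.database
import geoip2.errors

def add_countries(df, db_path="GeoLite2-Country.mmdb"):
    # Resolve each distinct IP once, then map the country code back onto the DataFrame.
    with geoip2.database.Reader(db_path) as reader:
        def lookup(ip):
            try:
                return reader.country(ip).country.iso_code
            except (geoip2.errors.AddressNotFoundError, ValueError):
                return None
        df['country'] = df['ip'].map({ip: lookup(ip) for ip in df['ip'].unique()})
    return df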

Advanced Bot Detection

  • Analyze request patterns and detect anomalies in user agents.
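One simple approach combines the parsed DataFrame with the is_google_ip helper from earlier: flag requests that claim to be Googlebot but come from non-Google addresses, and flag IPs with unusually high request counts (the threshold below is an arbitrary example value):

def flag_suspicious_traffic(df, max_requests_per_ip=1000):
    # Requests whose User-Agent claims Googlebot but whose IP is outside Google's known ranges.
    claims_google = df['user_agent'].str.contains('Googlebot', case=False, na=False)
    verified = df['ip'].map(is_google_ip)
    spoofed = df[claims_google & ~verified]

    # IPs with an unusually high number of requests in the analyzed logs.
    counts = df['ip'].value_counts()
    heavy_hitters = counts[counts > max_requests_per_ip]

    print(f"\nRequests claiming Googlebot from non-Google IPs: {len(spoofed)}")
    print(f"IPs exceeding {max_requests_per_ip} requests: {len(heavy_hitters)}")
    return spoofed, heavy_hitters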

Database Integration

  • Store parsed log data in SQLite or PostgreSQL for efficient querying.
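A minimal sketch using the standard-library sqlite3 module and pandas' to_sql; the database path and table name are just examples:

import sqlite3
import pandas as pd

def save_to_sqlite(df, db_path="bot_traffic.db"):
    # SQLite has no native datetime type, so store timestamps as ISO strings.
    out = df.copy()
    out['datetime'] = out['datetime'].astype(str)

    conn = sqlite3.connect(db_path)
    try:
        # Write the parsed log records to a table for later querying.
        out.to_sql('requests', conn, if_exists='replace', index=False)

        # Example query: request counts per search engine, straight from SQL.
        summary = pd.read_sql_query(
            "SELECT search_engine, COUNT(*) AS hits "
            "FROM requests GROUP BY search_engine ORDER BY hits DESC",
            conn
        )
        print(summary)
    finally:
        conn.close()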

Conclusion

This Python script provides a powerful way to analyze bot traffic from web server logs. By identifying and categorizing search engine bots, it helps webmasters optimize SEO strategies and website performance. With additional enhancements, it can serve as a robust tool for deeper traffic insights.

You can find the complete code here.