This repository has been archived by the owner on Sep 9, 2023. It is now read-only.
zhicwu edited this page Apr 27, 2013 · 4 revisions

Welcome to the apache-log-analysis wiki!

This is a small project for learning Python programming and data analysis (and git, for sure :). It is not meant to be something that can actually solve your problem. Anyway, the main parts are as follows:

  • Tool
    A tool named log2csv extracts information from Apache logs and stores it in CSV format so that it can be easily imported into a data warehouse for further analysis.

Usage: log2csv.py [options]

A tool to convert Apache httpd log to csv.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -o DIRECTORY, --output-dir=DIRECTORY
                        write files to the DIRECTORY
  -l LOG_FORMAT, --log-format=LOG_FORMAT
                        apache log format(default: "%h %v %l %D %u %t \"%r\"
                        %>s %b \"%{Referer}i\" \"%{User-Agent}i\"
                        \"%{Cookie}i\"")
  -c CSV_FORMAT, --csv-format=CSV_FORMAT
                        output CSV format(default: "%(id)s      %(region)s
                        %(server)s      %(ip)s  %(date)s        %(time)s
                        %(method)s      %(app)s %(resp_code)s   %(resp_size)s
                        %(resp_time)s   %(user)s        %(session)s
                        %(client)s      %(device)s      %(os)s  %(browser)s
                        %(ref_server)s  %(ref_app)s")
  -k SKIP_LINES, --skip-lines=SKIP_LINES
                        skip lines for the first file(default: "0")
  -w, --wipe-cache      wipe off all cache files
  -t TIME_FORMAT, --time-format=TIME_FORMAT
                        time format used in apache log(default:
                        "%d/%b/%Y:%H:%M:%S")
  -r REGION, --region=REGION
                        region where apache log belongs to(1 for US)
  -s SERVER, --server=SERVER
                        server where apache log belongs to(0 for MyServer)
  -i, --ignore-user-agent
                        whether to parse user agent, which is very slow
  -f, --enable-filter   whether to filter invalid urls
  -q, --quiet           don't print status messages to stdout
  • Data Warehouse
    I came from the RDBMS world, so I chose a column-oriented database for data warehousing. I was using LucidDB and liked its built-in ETL a lot. However, since it has been discontinued, I switched to InfiniDB for better performance.
  • Reporting
    There are not many open-source choices for OLAP, so I chose Pentaho BI Server along with the C-Tools to visualize information from the Apache logs.
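
To give a rough idea of what the log-to-CSV step involves, here is a minimal sketch in Python. It is not the actual log2csv code. It assumes the default log format and time format listed in the help text above, and it guesses that the output is tab-separated. The names `LOG_PATTERN`, `parse_line`, `to_csv`, `FIELDS` and the sample log line are made up for illustration, and only a subset of the CSV columns is produced.

```python
import csv
import io
import re
from datetime import datetime

# Regex mirroring the default --log-format from the help text:
# %h %v %l %D %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i" "%{Cookie}i"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<vhost>\S+) (?P<ident>\S+) (?P<resp_time>\d+) '
    r'(?P<user>\S+) \[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<resp_code>\d{3}) (?P<resp_size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)" "(?P<cookie>[^"]*)"'
)

TIME_FORMAT = "%d/%b/%Y:%H:%M:%S"  # default --time-format

# Hypothetical subset of the columns named in the default --csv-format.
FIELDS = ["ip", "date", "time", "method", "url",
          "resp_code", "resp_size", "resp_time", "user"]


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    rec = match.groupdict()
    # %t looks like "27/Apr/2013:10:15:32 +0000"; drop the zone offset.
    stamp = datetime.strptime(rec["timestamp"].split()[0], TIME_FORMAT)
    parts = rec["request"].split()
    return {
        "ip": rec["ip"],
        "date": stamp.strftime("%Y-%m-%d"),
        "time": stamp.strftime("%H:%M:%S"),
        "method": parts[0] if parts else "",
        "url": parts[1] if len(parts) > 1 else "",
        "resp_code": rec["resp_code"],
        # Apache writes "-" for %b when no body was sent.
        "resp_size": "0" if rec["resp_size"] == "-" else rec["resp_size"],
        "resp_time": rec["resp_time"],
        "user": rec["user"],
    }


def to_csv(lines):
    """Convert log lines to tab-separated text, skipping unparsable lines."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS,
                            delimiter="\t", lineterminator="\n")
    for line in lines:
        row = parse_line(line)
        if row is not None:
            writer.writerow(row)
    return out.getvalue()


# Made-up sample line in the default format.
SAMPLE = ('10.0.0.1 example.com - 1234 alice [27/Apr/2013:10:15:32 +0000] '
          '"GET /app/index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0" "-"')

if __name__ == "__main__":
    print(to_csv([SAMPLE]), end="")
```

Lines that do not match the pattern are silently dropped in this sketch; a real converter would probably want to count or log them, and would also need to handle user-agent parsing and URL filtering, which the `-i` and `-f` options hint at.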

Sample Reports
