This repository has been archived by the owner on Sep 9, 2023. It is now read-only.
Home
zhicwu edited this page Apr 27, 2013 · 4 revisions
Welcome to the apache-log-analysis wiki!
This is a small project whose purpose is learning Python programming and data analysis (and git, for sure :). So it is not going to be something that can actually solve your problem. Anyway, the main parts are as follows:
- Tool
A tool named log2csv that extracts information from Apache logs and stores it in CSV format, so that it can be easily imported into a data warehouse for further analysis.
Usage: log2csv.py [options]

A tool to convert Apache httpd log to csv.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -o DIRECTORY, --output-dir=DIRECTORY
                        write files to the DIRECTORY
  -l LOG_FORMAT, --log-format=LOG_FORMAT
                        apache log format(default: "%h %v %l %D %u %t \"%r\"
                        %>s %b \"%{Referer}i\" \"%{User-Agent}i\"
                        \"%{Cookie}i\"")
  -c CSV_FORMAT, --csv-format=CSV_FORMAT
                        output CSV format(default: "%(id)s %(region)s
                        %(server)s %(ip)s %(date)s %(time)s
                        %(method)s %(app)s %(resp_code)s %(resp_size)s
                        %(resp_time)s %(user)s %(session)s
                        %(client)s %(device)s %(os)s %(browser)s
                        %(ref_server)s %(ref_app)s")
  -k SKIP_LINES, --skip-lines=SKIP_LINES
                        skip lines for the first file(default: "0")
  -w, --wipe-cache      wipe off all cache files
  -t TIME_FORMAT, --time-format=TIME_FORMAT
                        time format used in apache log(default:
                        "%d/%b/%Y:%H:%M:%S")
  -r REGION, --region=REGION
                        region where apache log belongs to(1 for US)
  -s SERVER, --server=SERVER
                        server where apache log belongs to(0 for MyServer)
  -i, --ignore-user-agent
                        whether to parse user agent, which is very slow
  -f, --enable-filter   whether to filter invalid urls
  -q, --quiet           don't print status messages to stdout
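
To give a rough idea of what the default options describe, here is a minimal Python 3 sketch that parses one log line in the default log format, converts the timestamp with the default time format, and writes a CSV row driven by a `%(name)s` style template. This is not the actual log2csv code: the regular expression, the helper names `parse_line` and `to_csv_row`, the sample log line, and the reduced field list are all assumptions made for illustration, and fields such as `id`, `region`, `session` or `browser` are left out.

```python
import csv
import io
import re
from datetime import datetime

# Hypothetical regex for the default log format:
# %h %v %l %D %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i" "%{Cookie}i"
LINE_RE = re.compile(
    r'(?P<ip>\S+) (?P<vhost>\S+) (?P<logname>\S+) (?P<resp_time>\S+) '
    r'(?P<user>\S+) \[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<resp_code>\d{3}) (?P<resp_size>\S+) "(?P<referer>[^"]*)" '
    r'"(?P<agent>[^"]*)" "(?P<cookie>[^"]*)"'
)

TIME_FORMAT = "%d/%b/%Y:%H:%M:%S"  # default of --time-format

# Pulls the field names out of a CSV format such as "%(ip)s %(date)s".
FIELD_RE = re.compile(r"%\((\w+)\)s")


def parse_line(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    rec = match.groupdict()
    # %t is logged as "[27/Apr/2013:10:15:32 +0800]"; drop the zone offset
    # so that the remaining text fits TIME_FORMAT.
    stamp = datetime.strptime(rec["timestamp"].split()[0], TIME_FORMAT)
    parts = rec["request"].split()
    return {
        "ip": rec["ip"],
        "date": stamp.strftime("%Y-%m-%d"),
        "time": stamp.strftime("%H:%M:%S"),
        "method": parts[0] if parts else "-",
        "resp_code": rec["resp_code"],
        # %b logs "-" when no bytes were sent.
        "resp_size": "0" if rec["resp_size"] == "-" else rec["resp_size"],
        "resp_time": rec["resp_time"],  # %D, microseconds
        "user": rec["user"],
    }


def to_csv_row(record, csv_format):
    """Render the fields named in csv_format as one CSV line (assumed behavior)."""
    names = FIELD_RE.findall(csv_format)
    buf = io.StringIO()
    csv.writer(buf).writerow([record[name] for name in names])
    return buf.getvalue().rstrip("\r\n")


# Made-up sample line and a shortened CSV format, for illustration only.
SAMPLE = (
    '203.0.113.7 www.example.com - 1534 alice [27/Apr/2013:10:15:32 +0800] '
    '"GET /app/index.html HTTP/1.1" 200 5120 "http://ref.example.org/home" '
    '"Mozilla/5.0" "JSESSIONID=abc123"'
)
CSV_FORMAT = ("%(ip)s %(date)s %(time)s %(method)s "
              "%(resp_code)s %(resp_size)s %(resp_time)s %(user)s")

if __name__ == "__main__":
    print(to_csv_row(parse_line(SAMPLE), CSV_FORMAT))
```

Note that `%b` in the time format matches English month abbreviations only under an English or C locale, so a sketch like this would need adjusting for logs parsed under a different locale.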
- Data Warehouse
I come from the RDBMS world, so I chose a column-oriented database for data warehousing. I was using LucidDB and liked its built-in ETL a lot. However, since it has been discontinued, I switched to InfiniDB for better performance.
- Reporting
There are not many choices in the open-source field for OLAP, hence I chose Pentaho BI Server along with the C-Tools to visualize information from the Apache logs.