The replication package for "Plug it and Play on Logs: A Configuration-Free Statistic-Based Log Parser".
The overview of our tool structure is shown in the following figure. The design of PIPLUP comprises two parsing stages, online log clustering and cluster updating, followed by a template matching stage in which, after all log lines are parsed, the results are stored in a CSV file for further verification. PIPLUP leverages a novel tree structure that does not pre-assume the position of constant tokens, and it enhances the template extraction approach based on template similarity and describability. Further, it uses a set of data-insensitive parameters, which enables users to directly "plug and play" on their log files without excessive configuration.
Online Log Clustering: Inspired by Drain, PIPLUP leverages a similar tree structure as a hashing function to find the most compatible leaf for an incoming log message and conduct further comparisons. Instead of hashing with tokens at fixed (e.g., leading) positions as Drain does, PIPLUP builds the tree without pre-assuming where constant tokens appear.
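For intuition, the following minimal sketch shows a Drain-style parse tree used as a hashing function. The class and method names are hypothetical, and the key-selection heuristic is only an assumption of how a position-free variant might pick its key; it is not PIPLUP's actual implementation.

```python
# Minimal sketch of a Drain-style parse tree used as a hashing function.
# Names (ParseTree, route) are hypothetical; the key-selection heuristic below
# is only an illustration of avoiding a fixed-position assumption.
from collections import defaultdict

class ParseTree:
    def __init__(self):
        # First level keys on token count, second level on a chosen constant token.
        self.root = defaultdict(lambda: defaultdict(list))

    def route(self, tokens):
        """Return the candidate cluster list for a tokenized log message."""
        length_node = self.root[len(tokens)]
        # Drain keys on the leading token(s); a position-free variant could
        # instead pick, e.g., the first token that contains no digits.
        key = next((t for t in tokens if not any(c.isdigit() for c in t)), "<*>")
        return length_node[key]

tree = ParseTree()
tokens = "Connection closed by 10.0.0.1".split()
candidates = tree.route(tokens)                           # clusters to compare against
candidates.append({"template": tokens, "line_ids": [1]})  # start a new cluster here
```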
Cluster Updating: The matched clusters are dynamically updated to allow online parsing. First, an in-cluster update is triggered to append the log line number to the cluster's log line list and to update the cluster's message example, path list, and template list. If the template list is updated, an inter-cluster template merging process among clusters under the same constant token node is triggered to reduce redundancy in event templates.
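The sketch below illustrates, with a hypothetical cluster structure and a single template per cluster, what such an in-cluster update can look like: the line number is recorded and the template is generalized token by token. PIPLUP additionally maintains a path list and a full template list and performs the inter-cluster merging described above, which this sketch omits.

```python
# Illustrative in-cluster update (hypothetical structure, single template only):
# record the log line, keep a message example, and generalize the template by
# marking token positions that disagree as <*>.
def update_cluster(cluster, line_id, tokens):
    cluster.setdefault("line_ids", []).append(line_id)
    cluster.setdefault("example", tokens)
    old = cluster.get("template", tokens)
    cluster["template"] = [a if a == b else "<*>" for a, b in zip(old, tokens)]
    return cluster

cluster = {}
update_cluster(cluster, 1, "Connection closed by 10.0.0.1".split())
update_cluster(cluster, 2, "Connection closed by 10.0.0.2".split())
print(" ".join(cluster["template"]))   # Connection closed by <*>
```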
Template Matching: PIPLUP may create multiple templates for a log cluster. If a cluster contains only one template, all of its log messages are matched with this event directly. Conversely, if multiple templates are inferred, we assign templates to in-cluster log messages using regex matching. The matched results are stored in a CSV file for further analysis.
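A minimal sketch of such regex matching is shown below; the helper name is hypothetical, and it simply assumes templates mark parameters with `<*>` placeholders.

```python
# Minimal sketch of regex-based template matching: each <*> placeholder becomes
# a non-greedy capture group, and a message is assigned to the first template
# whose pattern matches it end to end.
import re

def template_to_regex(template):
    escaped = re.escape(template)
    return re.compile("^" + escaped.replace(re.escape("<*>"), "(.*?)") + "$")

templates = ["Connection closed by <*>", "Failed password for <*> from <*>"]
patterns = [(t, template_to_regex(t)) for t in templates]

message = "Failed password for root from 10.0.0.1"
matched = next((t for t, p in patterns if p.match(message)), None)
print(matched)   # Failed password for <*> from <*>
```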
- python>=3.8
- chardet==5.1.0
- ipython==8.12.0
- matplotlib==3.7.2
- natsort==8.4.0
- numpy==1.24.4
- pandas==2.0.3
- regex==2022.3.2
- scipy
- tqdm==4.65.0
- rpy2
- spacy
For our experiments, we leveraged the log data from Loghub 2.0; please obtain the original data from the Loghub 2.0 repository before replicating the experiments. Before starting the experiments, we corrected the event templates with the latest rules provided by LOGPAI to ensure the quality of the ground truths. The ground truth templates can be automatically corrected with template_correction.py.
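The actual correction rules live in template_correction.py and follow the LOGPAI guidelines; the snippet below is only a made-up example of what one such normalization rule could look like, not the script itself.

```python
# Purely illustrative example of a ground-truth normalization rule; the real
# rules are implemented in template_correction.py following LOGPAI's guidelines.
import re

def merge_consecutive_wildcards(template):
    # Collapse runs such as "<*> <*>" or "<*>.<*>" into a single "<*>".
    return re.sub(r"<\*>(?:[ .,:]<\*>)+", "<*>", template)

print(merge_consecutive_wildcards("Deleting block <*> <*> from file <*>"))
# Deleting block <*> from file <*>
```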
To replicate the results in RQ1, change to the benchmark folder and run ./run_rq1_br_thresh.sh and ./run_rq1_sim_thresh.sh; to replicate the overall evaluation of RQ2 and RQ3, run ./run_all_full.sh. To conduct the Scott-Knott effect size difference (ESD) analysis, first follow the ScottKnottESD tutorial to install the package, then run ./sk_analysis.py.
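Since rpy2 appears in the requirements, the Scott-Knott ESD analysis presumably calls the R ScottKnottESD package from Python. The sketch below shows one common way to do that; it is not the repository's sk_analysis.py, and the scores are placeholders.

```python
# Hedged sketch: ranking parsers with the R ScottKnottESD package via rpy2.
# The data frame below is a placeholder; sk_analysis.py defines the real input.
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
from rpy2.robjects.packages import importr

sk = importr("ScottKnottESD")  # R package installed per the ScottKnottESD tutorial

# Wide format: one column per parser, one row per dataset (placeholder values).
scores = pd.DataFrame({
    "PIPLUP": [0.95, 0.91, 0.88],
    "Drain":  [0.90, 0.85, 0.80],
    "LILAC":  [0.94, 0.92, 0.87],
})

with localconverter(ro.default_converter + pandas2ri.converter):
    r_scores = ro.conversion.py2rpy(scores)

ranking = sk.sk_esd(r_scores)  # Scott-Knott ESD grouping / rank per parser
print(ranking)
```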
Two parameters are required in PIPLUP's parsing process, namely sim_thresh and br_thresh.
Evaluating the impact of br_thresh (./run_rq1_br_thresh.sh):
Evaluating the impact of sim_thresh (./run_rq1_sim_thresh.sh):
Detailed results can be found under ./results/RQ1/. We inherit the parameter settings from RQ1 and use them to parse all 14 datasets in RQ2 and RQ3.
PIPLUP is compared with 7 state-of-the-art log parsers, including Drain, XDrain, Preprocessed-Drain, LILAC, LibreLog, LogBatcher, and LUNAR. Due to resource limitations, we did not replicate LibreLog; its parsing results and time consumption are therefore obtained from its original study, and the evaluations are re-run on the corrected ground truths. We also conducted the experiment for PILAR, another data-insensitive log parser; the replication code of PILAR can be found in the PILAR_implementation folder. The following table shows the parsing effectiveness of PIPLUP along with the seven benchmark parsers.
According to the table, PIPLUP obtains significantly better average performance than state-of-the-art statistic-based parsers on all four metrics. Moreover, even with LLM-powered semantic-based parsers included, the simple PIPLUP approach remains statistically optimal or near-optimal on all four metrics.
The datasets are sorted from the smallest (i.e., the fewest lines) to the largest (i.e., the most lines), and their time consumption is documented in the following table. As shown, all parsers exhibit anomalously high time consumption on the Thunderbird dataset. We therefore provide two versions of the statistical time rankings (i.e., all files, and all files excluding Thunderbird) to avoid misleading conclusions.
PIPLUP requires more processing time than Drain, but less time than XDrain and Preprocessed-Drain on average. It also has a much lower time consumption than the semantic-based parsers. According to the Scott-Knott ESD ranking, PIPLUP is the second most efficient parser. It requires only ~1.5 seconds to ~25 minutes to parse each of the studied datasets. Its time efficiency is statistically comparable to state-of-the-art statistic-based parsers and much better than semantic-based ones that rely on expensive computing resources (e.g., it uses only ~6% of LUNAR's parsing time).
Detailed results for RQ2 and RQ3 can be found under ./results/RQ2&RQ3/.
├── 2k_dataset                # Loghub-2k
├── PILAR_implementation      # PILAR evaluation with Loghub 2.0 evaluation functions
├── benchmark
│   ├── evaluation            # Configurations for the parsers
│   ├── logparser             # Main code for parsers
│   │   ├── Drain
│   │   ├── PIPLUP
│   │   ├── Preprocessed_Drain
│   │   ├── utils
│   │   ├── XDrain
│   │   └── __init__.py
│   ├── old_benchmark         # Default settings for the Drain series
│   ├── run_all_full.sh       # Script for running PIPLUP
│   ├── run_rq1_br_thresh.sh
│   ├── run_rq1_sim_thresh.sh
│   └── README.md
├── figures
├── result                    # Results for all parsers in CSV format (including PILAR)
│   ├── RQ1
│   └── RQ2&RQ3
├── sk_analysis.py            # Code for Scott-Knott ESD analysis
├── template_correction.py    # Code for ground truth template correction
└── README.md





