diff --git a/README.md b/README.md
index 282ead1..a168676 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,11 @@
 # log-surgeon: A performant log parsing library
 Project Link: [Homepage][home-page]
 
+Video Demo Link: [Video Demo][video-demo]
+
 ## Team Members
-- Student 1: Siwei (Louis) He, 1004220960
-- Student 2: Zhihao Lin, 1005071299
+- Student 1: Siwei (Louis) He, 1004220960, siwei.he@mail.utoronto.ca
+- Student 2: Zhihao Lin, 1005071299,
 
 ## Introduction
 
@@ -76,7 +78,7 @@ Our project, [log-surgeon-rust][home-page], is designed to improve CLP's parsing
 safe and high-performant regular expression engine specialized for unstructured logs, allowing users
 to extract named variables from raw text log messages efficiently according to user-defined schema.
 
-## Objective and Key Features
+## Objective
 The objective of this project is to fill the gap explained in the motivation above in the current
 Rust ecosystem. We shall deliver a high-performance and memory-safe log parsing library using Rust.
 The project should consist of the core regex engine, the parser, and the user-oriented log parsing
@@ -101,35 +103,97 @@ The log parsing interface will provide user programmatic APIs to:
 - Feed input stream to the log parser using the configured regex engine
 - Retrieve outputs (parsed log events structured according to the user schema) from the parser
 
-[Zhihao Lin][github-zhihao] will be working on the parser implementation.
+## Features
+The log-surgeon library provides the following features:
+- Parsing and extracting variable values like the log event's log-level and any other user-specified variables,
+no matter where they appear in each log event.
+- Parsing by using regular expressions for each variable type rather than regular expressions for an entire log event.
+- Parsing multi-line log events (delimited by timestamps).
+
+Since log-surgeon is a Rust library, there are also some features that are not available externally:
+- any string to AST conversion
+- AST to NFA conversion
+- multiple NFAs to a single DFA conversion
+- DFA simulation on the input stream
+- user schema parser
+- log parser
+If you need these features, you can reference the implementation of the log-surgeon library in your Rust project.
+
+## User's Guide
+log-surgeon is a Rust library for high-performance parsing of unstructured text logs. It is
+shipped as a Rust crate and can be included in your Rust project by adding the following lines to
+your `Cargo.toml` file:
+```toml
+[dependencies]
+log-surgeon = { git = "https://github.com/Toplogic-Inc/log-surgeon-rust", branch = "main" }
+```
+
+Example usage of the library can be found in the examples directory of the repository. You can use
+the following code to confirm that you successfully included the library and check the version of
+the library:
+```rust
+extern crate log_surgeon;
+
+fn main() {
+    println!("You are using log-surgeon version: {}", log_surgeon::version());
+}
+```
+
+## Reproducibility Guide
+There are several regression tests in the `tests` directory of the repository as well as in the
+individual components of the project. You can run the tests to ensure that the library is working
+as expected. The tests cover the AST to NFA conversion, the NFA to DFA conversion, the
+DFA simulation on the input stream, and the correct parsing of unstructured logs given an input file
+and a log-searching schema.
+
+To run the tests, you can use the following command:
+```shell
+cargo test
+```
+
+There are also usage examples of the library in the `examples` directory of the repository. You can
+run the examples to see how the library can be used or be reproduced in a real-world scenario. Assuming
+you are in the root directory of the repository, you can run the following commands to change your
+directory to the examples directory and run the example:
+```shell
+cd examples
+cargo run
+```
+The example uses the repository relative path to include the dependency. If you want to include the
+library in your project, you can follow the user's guide above, where you should specify the git URL
+to obtain the latest version of the library.
+
+## Contributions by each team member
+1. **[Louis][github-siwei]**
+- Implemented the draft version of the AST to NFA conversion.
+- Implemented the conversion from one or more NFAs to a single DFA.
+- Implemented the simulation of the DFA on the input stream.
 
-[Siwei (Louis) He][github-siwei] will be working on the core regex engine implementation.
 
-Both will be working on the log parsing interface.
+2. **[Zhihao][github-zhihao]**
+-
 
-One will review the other's implementation through GitHub's Pull Request for the purpose of the
-correctness and efficiency.
+Both members of the team have contributed to the design of the project. Each will review the other's implementation
+through GitHub Pull Requests for correctness and efficiency.
 
-## Tentative Plan and Status
-1. **Louis**
+## Lessons learned and concluding remarks
+This project is a great opportunity for us to learn about the Rust programming language. There is an existing
+C++ implementation of the log parsing library, and we have learned how to port the existing code to Rust. The Rust
+programming language has a very different coding mindset compared to C++. It is a memory-safe language that has
+a very strict borrowing system. We have learned how to use Rust's borrowing system to ensure the safety of our code.
 
-| Time                  | Tentative Schedule                          | Status      |
-|-----------------------|---------------------------------------------|-------------|
-| Oct. 18th ~ Oct. 25th | Complete AST common structs for the project | Done        |
-| Oct. 25th ~ Nov. 8th  | Complete NFA structs and research           | On track    |
-| Nov. 1st ~ Nov. 8th   | Implement AST to NFA translation            | Not started |
-| Nov. 8th ~ Nov. 15th  | Implement AST to NFA translation            | Not started |
-| Nov. 15th ~ Nov. 22nd | Complete DFA structs and research           | Not started |
-| Nov. 22nd ~ Nov. 29th | Implement NFA to DFA translation            | Not started |
-| Nov. 29th ~ Dec. 6th  | Stages integration and final reporting      | Not started |
+Alongside the successful completion of the project, we also noticed a few places where we could potentially do better.
+First, we could have spent more time on the research and design part of the project. We spent significant time
+iterating on how the AST to NFA conversion should be implemented. A consensus on the design could have saved us time in
+the implementation phase.
 
-2. **Zhihao**
+Second, given the time constraint, we did not have time to optimize the performance of the library. We have implemented
+the core functionality of the library, but we have not done enough performance optimization. We might have chosen
+a project that is too ambitious for the very limited time frame.
 
-| Time                  | Tentative Schedule                                          | Status      |
-|-----------------------|-------------------------------------------------------------|-------------|
-| Nov. 1st ~ Nov. 15th  | Implement LALR parser for schema parsing and AST generation | Not started |
-| Nov. 15th ~ Nov. 29nd | Implement lexer for input stream processing                 | Not started |
-| Nov. 29nd ~ Dec. 6th  | Formalize log parsing APIs                                  | Not started |
+Overall, the project is a great learning experience. We have learned a lot about Rust, how to ship a Rust crate,
+and how regex processing works behind the scenes. We are proud to fill the gap in the Rust ecosystem, where
+there is no high-performance unstructured log parsing library.
 
 [clp-paper]: https://www.usenix.org/system/files/osdi21-rodrigues.pdf
 [clp-s-paper]: https://www.usenix.org/system/files/osdi24-wang-rui.pdf
@@ -143,3 +207,4 @@ correctness and efficiency.
 [wiki-lalr]: https://en.wikipedia.org/wiki/LALR_parser
 [wiki-nfa]: https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton
 [wiki-tagged-dfa]: https://en.wikipedia.org/wiki/Tagged_Deterministic_Finite_Automaton
+[video-demo]: todo
\ No newline at end of file
diff --git a/proposal.md b/proposal.md
new file mode 100644
index 0000000..282ead1
--- /dev/null
+++ b/proposal.md
@@ -0,0 +1,145 @@
+# log-surgeon: A performant log parsing library
+Project Link: [Homepage][home-page]
+
+## Team Members
+- Student 1: Siwei (Louis) He, 1004220960
+- Student 2: Zhihao Lin, 1005071299
+
+## Introduction
+
+`log-surgeon` is a library for high-performance parsing of unstructured text
+logs implemented using Rust.
+
+
+## Motivation
+Today's large technology companies generate logs the magnitude of petabytes per day as a critical
+source for runtime failure diagnostics and data analytics. In a real-world production environment,
+logs can be split into two categories: unstructured logs and structured logs, where unstructured logs
+usually consist of a timestamp and a raw text message (i.e.,[Hadoop logs][hadoop-logs]), and
+structured logs are normally JSON records (i.e., [mongoDB logs][mongodb-logs]). [CLP][github-clp],
+is a distributed system designed to compress, search, and analyze large-scale log data. It provides
+solutions for both unstructured and structured logs, as discussed in its
+[2021's OSDI paper][clp-paper] and [2024's OSDI paper][clp-s-paper].
+
+CLP has been deployed in many large-scale production software systems in thousands of cloud servers
+and commercial electric vehicles. Throughout the deployment experiences, an interesting issue has
+been found. Consider the following log event:
+```text
+2022-10-10 12:30:02 1563 1827 I AppControl: Removed item: AppOpItem(Op code=1, UID=1000)
+```
+This is an unstructured log event collected from the Android system on a mobile device. It can be
+manually structured in the following way:
+```JSON
+{
+  "timestamp": "2022-10-10 12:30:02",
+  "PID": 1563,
+  "TID": 1827,
+  "priority": "I",
+  "tag": "AppControl",
+  "record": {
+    "action": "Removed item",
+    "op_code": 1,
+    "UID": 1000
+  }
+}
+```
+Intuitively, the structured version makes it easier to query relevant data fields. For example, if
+an application wants to query `UID=1000`, it can take advantage of the tree-style key-value pair
+structure that JSON format provides. Otherwise, it might need a complicated regular expression to
+extract the number from the raw-text log message. Unfortunately, it is impossible to deprecate
+unstructured logging infrastructures in any real-world software systems for the following reasons:
+- Unstructured logs are more run-time-efficient: it does not introduce overhead of structuring data.
+- Legacy issues: real-world software systems use countless software components; some
+  may not be compatible with structured logging infrastructure.
+
+Hence, the high-level motivation of our project has been formed: how to improve the analyzability of
+unstructured logs to make it as usable as structured logs? The scope of this problem is vast,
+and we will focus on one aspect: log parsing. CLP has introduced an innovative way of handling
+unstructured logs. The basic idea behind is to find the static text and variables in a raw text log
+message, where the static text is like a format string. For instance, the above log event can be
+interpreted as the following:
+```Python
+print(
+    f"{timestamp}, {pid}, {tid}, {priority}, {tag}: Removed item: AppOpItem(Op code={op}, UID={uid})"
+)
+```
+`timestamp`, `pid`, `tid`, `priority`, `tag`, `op`, and `uid` are all variables. This provides
+some simple data structuring, however, it has a few limitations:
+- CLP's heuristic parser cannot parse logs based on user-defined schema. For example,
+  `"Removed item"` above may be a variable, but CLP's heuristic parser cannot handle that.
+- CLP's heuristic parser cannot parse complicated substrings, i.e., a substring described by the
+  regular expression `capture:((?a)*)|(((?c)|(?d)){0,10})`.
+- The parsed variables are unnamed. For example, users cannot name the 7th variable to be `"uid"` in
+  the above example.
+
+Our project, [log-surgeon-rust][home-page], is designed to improve CLP's parsing features. It is a
+safe and high-performant regular expression engine specialized for unstructured logs, allowing users
+to extract named variables from raw text log messages efficiently according to user-defined schema.
+
+## Objective and Key Features
+The objective of this project is to fill the gap explained in the motivation above in the current
+Rust ecosystem. We shall deliver a high-performance and memory-safe log parsing library using Rust.
+The project should consist of the core regex engine, the parser, and the user-oriented log parsing
+interface.
+
+The core regex engine is designed for high-performance schema matching and variable extraction.
+User-defined schemas will be described in regular expressions, and the underlying engine will parse
+the schema regular expressions into abstract syntax trees (AST), convert ASTs into non-deterministic
+finite automata ([NFA][wiki-nfa]), and merge all NFAs into one large deterministic finite automata
+([DFA][wiki-dfa]). This single-DFA design will ensure the execution time is bounded by the length of
+the input stream. If time allows, we will even implement [tagged DFA][wiki-tagged-dfa] to make
+the schema more powerful.
+
+The parser has two components:
+- The schema parser, which is an implementation of [LALR parser][wiki-lalr], parses user-input
+schema into regex AST.
+- The log parser, which operates similarly to a simple compiler, uses a lexer to process the input
+text and emits tokens, and makes decisions based on emitted tokens using the core regex engine.
+
+The log parsing interface will provide user programmatic APIs to:
+- Specify inputs (variable schemas) to configure the regex engine
+- Feed input stream to the log parser using the configured regex engine
+- Retrieve outputs (parsed log events structured according to the user schema) from the parser
+
+[Zhihao Lin][github-zhihao] will be working on the parser implementation.
+
+[Siwei (Louis) He][github-siwei] will be working on the core regex engine implementation.
+
+Both will be working on the log parsing interface.
+
+One will review the other's implementation through GitHub's Pull Request for the purpose of the
+correctness and efficiency.
+
+## Tentative Plan and Status
+1. **Louis**
+
+| Time                  | Tentative Schedule                          | Status      |
+|-----------------------|---------------------------------------------|-------------|
+| Oct. 18th ~ Oct. 25th | Complete AST common structs for the project | Done        |
+| Oct. 25th ~ Nov. 8th  | Complete NFA structs and research           | On track    |
+| Nov. 1st ~ Nov. 8th   | Implement AST to NFA translation            | Not started |
+| Nov. 8th ~ Nov. 15th  | Implement AST to NFA translation            | Not started |
+| Nov. 15th ~ Nov. 22nd | Complete DFA structs and research           | Not started |
+| Nov. 22nd ~ Nov. 29th | Implement NFA to DFA translation            | Not started |
+| Nov. 29th ~ Dec. 6th  | Stages integration and final reporting      | Not started |
+
+2. **Zhihao**
+
+| Time                  | Tentative Schedule                                          | Status      |
+|-----------------------|-------------------------------------------------------------|-------------|
+| Nov. 1st ~ Nov. 15th  | Implement LALR parser for schema parsing and AST generation | Not started |
+| Nov. 15th ~ Nov. 29nd | Implement lexer for input stream processing                 | Not started |
+| Nov. 29nd ~ Dec. 6th  | Formalize log parsing APIs                                  | Not started |
+
+[clp-paper]: https://www.usenix.org/system/files/osdi21-rodrigues.pdf
+[clp-s-paper]: https://www.usenix.org/system/files/osdi24-wang-rui.pdf
+[github-clp]: https://github.com/y-scope/clp
+[github-siwei]: https://github.com/Louis-He
+[github-zhihao]: https://github.com/LinZhihao-723
+[hadoop-logs]: https://zenodo.org/records/7114847
+[home-page]: https://github.com/Toplogic-Inc/log-surgeon-rust
+[mongodb-logs]: https://zenodo.org/records/11075361
+[wiki-dfa]: https://en.wikipedia.org/wiki/Deterministic_finite_automaton
+[wiki-lalr]: https://en.wikipedia.org/wiki/LALR_parser
+[wiki-nfa]: https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton
+[wiki-tagged-dfa]: https://en.wikipedia.org/wiki/Tagged_Deterministic_Finite_Automaton