Commit: move proposal from README.md to porposal.md, add report as README.md
Showing 2 changed files with 235 additions and 25 deletions.
@@ -1,9 +1,11 @@
 # log-surgeon: A performant log parsing library
 Project Link: [Homepage][home-page]
 
+Video Demo Link: [Video Demo][video-demo]
+
 ## Team Members
-- Student 1: Siwei (Louis) He, 1004220960
-- Student 2: Zhihao Lin, 1005071299
+- Student 1: Siwei (Louis) He, 1004220960, [email protected]
+- Student 2: Zhihao Lin, 1005071299, <TO-DO: Zhihao's email>
 
 ## Introduction
 
@@ -76,7 +78,7 @@ Our project, [log-surgeon-rust][home-page], is designed to improve CLP's parsing
 safe and high-performant regular expression engine specialized for unstructured logs, allowing users
 to extract named variables from raw text log messages efficiently according to user-defined schema.
 
-## Objective and Key Features
+## Objective
 The objective of this project is to fill the gap explained in the motivation above in the current
 Rust ecosystem. We shall deliver a high-performance and memory-safe log parsing library using Rust.
 The project should consist of the core regex engine, the parser, and the user-oriented log parsing
@@ -101,35 +103,97 @@ The log parsing interface will provide user programmatic APIs to:
 - Feed input stream to the log parser using the configured regex engine
 - Retrieve outputs (parsed log events structured according to the user schema) from the parser
 
-[Zhihao Lin][github-zhihao] will be working on the parser implementation.
+## Features
+The log-surgeon library provides the following features:
+- Parsing and extracting variable values, such as the log event's log level and any other
+  user-specified variables, no matter where they appear in each log event.
+- Parsing with regular expressions for each variable type rather than a single regular expression
+  for an entire log event.
+- Parsing multi-line log events (delimited by timestamps).
+
+Since log-surgeon is a Rust library, some internal components are not exposed externally:
+- string-to-AST conversion
+- AST-to-NFA conversion
+- conversion of multiple NFAs into a single DFA
+- DFA simulation on the input stream
+- the user-schema parser
+- the log parser
+If you need these components, you can reference the implementation of the log-surgeon library in
+your Rust project.
+
+## User's Guide
+log-surgeon is a Rust library for high-performance parsing of unstructured text logs. It is
+shipped as a Rust crate and can be included in your Rust project by adding the following line to
+your `Cargo.toml` file:
+```toml
+[dependencies]
+log-surgeon = { git = "https://github.com/Toplogic-Inc/log-surgeon-rust", branch = "main" }
+```
+
+Example usage of the library can be found in the `examples` directory of the repository. You can
+use the following code to confirm that you have successfully included the library and to check its
+version:
+```rust
+extern crate log_surgeon;
+
+fn main() {
+    println!("You are using log-surgeon version: {}", log_surgeon::version());
+}
+```
+
+## Reproducibility Guide
+There are several regression tests in the `tests` directory of the repository, as well as in the
+individual components of the project. You can run the tests to ensure that the library is working
+as expected. The tests cover the AST-to-NFA conversion, the NFA-to-DFA conversion, the DFA
+simulation on the input stream, and the correct parsing of unstructured logs given an input file
+and a log-searching schema.
+
+To run the tests, use the following command:
+```shell
+cargo test
+```
+
+There are also usage examples in the `examples` directory of the repository. You can run the
+examples to see how the library can be used or reproduced in a real-world scenario. Assuming you
+are in the root directory of the repository, you can run the following commands to change into the
+examples directory and run the example:
+```shell
+cd examples
+cargo run
+```
+The example uses a repository-relative path to include the dependency. If you want to include the
+library in your own project, follow the User's Guide above, which specifies the git URL to obtain
+the latest version of the library.
+
+## Contributions by each team member
+1. **[Louis][github-siwei]**
+   - Implemented the draft version of the AST-to-NFA conversion.
+   - Implemented the conversion from one or more NFAs to a single DFA.
+   - Implemented the simulation of the DFA on the input stream.
+
-[Siwei (Louis) He][github-siwei] will be working on the core regex engine implementation.
-
-Both will be working on the log parsing interface.
+2. **[Zhihao][github-zhihao]**
+   -
+
-One will review the other's implementation through GitHub's Pull Request for the purpose of the
-correctness and efficiency.
+Both members of the team have contributed to the design of the project. Each will review the
+other's implementation through GitHub's Pull Requests for correctness and efficiency.
+
-## Tentative Plan and Status
-1. **Louis**
+## Lessons learned and concluding remarks
+This project was a great opportunity for us to learn the Rust programming language. There is an
+existing C++ implementation of the log parsing library, and we learned how to port the existing
+code to Rust. Rust demands a very different coding mindset than C++: it is a memory-safe language
+with a very strict borrowing system, and we learned to use that borrowing system to ensure the
+safety of our code.
+
-| Time                  | Tentative Schedule                           | Status      |
-|-----------------------|----------------------------------------------|-------------|
-| Oct. 18th ~ Oct. 25th | Complete AST common structs for the project  | Done        |
-| Oct. 25th ~ Nov. 8th  | Complete NFA structs and research            | On track    |
-| Nov. 1st ~ Nov. 8th   | Implement AST to NFA translation             | Not started |
-| Nov. 8th ~ Nov. 15th  | Implement AST to NFA translation             | Not started |
-| Nov. 15th ~ Nov. 22nd | Complete DFA structs and research            | Not started |
-| Nov. 22nd ~ Nov. 29th | Implement NFA to DFA translation             | Not started |
-| Nov. 29th ~ Dec. 6th  | Stages integration and final reporting       | Not started |
+Alongside the successful completion of the project, we also noticed a few places where we could
+have done better. First, we could have spent more time on the research and design phases of the
+project. We spent significant time iterating on how the AST-to-NFA conversion should be
+implemented; an earlier consensus on the design could have saved us time in the implementation
+phase.
+
-2. **Zhihao**
+Second, given the time constraints, we did not have time to optimize the performance of the
+library. We implemented the core functionality, but we have not done enough performance
+optimization. We may have chosen a project that was too ambitious for the very limited time frame.
+
-| Time                  | Tentative Schedule                                           | Status      |
-|-----------------------|--------------------------------------------------------------|-------------|
-| Nov. 1st ~ Nov. 15th  | Implement LALR parser for schema parsing and AST generation  | Not started |
-| Nov. 15th ~ Nov. 29th | Implement lexer for input stream processing                  | Not started |
-| Nov. 29th ~ Dec. 6th  | Formalize log parsing APIs                                   | Not started |
+Overall, the project was a great learning experience. We learned a lot about Rust, how to ship a
+Rust crate, and how everything works behind regex processing. We are proud to fill a gap in the
+Rust ecosystem, where there was no high-performance unstructured log parsing library.
+
 [clp-paper]: https://www.usenix.org/system/files/osdi21-rodrigues.pdf
 [clp-s-paper]: https://www.usenix.org/system/files/osdi24-wang-rui.pdf
 
@@ -143,3 +207,4 @@ correctness and efficiency.
 [wiki-lalr]: https://en.wikipedia.org/wiki/LALR_parser
 [wiki-nfa]: https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton
 [wiki-tagged-dfa]: https://en.wikipedia.org/wiki/Tagged_Deterministic_Finite_Automaton
+[video-demo]: todo
@@ -0,0 +1,145 @@
# log-surgeon: A performant log parsing library
Project Link: [Homepage][home-page]

## Team Members
- Student 1: Siwei (Louis) He, 1004220960
- Student 2: Zhihao Lin, 1005071299

## Introduction

`log-surgeon` is a library for high-performance parsing of unstructured text
logs implemented using Rust.

## Motivation
Today's large technology companies generate logs on the order of petabytes per day as a critical
source for runtime failure diagnostics and data analytics. In a real-world production environment,
logs can be split into two categories: unstructured logs and structured logs. Unstructured logs
usually consist of a timestamp and a raw text message (e.g., [Hadoop logs][hadoop-logs]), while
structured logs are normally JSON records (e.g., [MongoDB logs][mongodb-logs]). [CLP][github-clp]
is a distributed system designed to compress, search, and analyze large-scale log data. It provides
solutions for both unstructured and structured logs, as discussed in its
[2021 OSDI paper][clp-paper] and [2024 OSDI paper][clp-s-paper].

CLP has been deployed in many large-scale production software systems, across thousands of cloud
servers and commercial electric vehicles. Throughout these deployments, an interesting issue has
been found. Consider the following log event:
```text
2022-10-10 12:30:02 1563 1827 I AppControl: Removed item: AppOpItem(Op code=1, UID=1000)
```
This is an unstructured log event collected from the Android system on a mobile device. It can be
manually structured in the following way:
```JSON
{
  "timestamp": "2022-10-10 12:30:02",
  "PID": 1563,
  "TID": 1827,
  "priority": "I",
  "tag": "AppControl",
  "record": {
    "action": "Removed item",
    "op_code": 1,
    "UID": 1000
  }
}
```
Intuitively, the structured version makes it easier to query relevant data fields. For example, if
an application wants to query `UID=1000`, it can take advantage of the tree-style key-value pair
structure that the JSON format provides. Otherwise, it might need a complicated regular expression
to extract the number from the raw-text log message. Unfortunately, it is impossible to deprecate
unstructured logging infrastructure in real-world software systems, for the following reasons:
- Unstructured logging is more runtime-efficient: it does not introduce the overhead of structuring
  data.
- Legacy issues: real-world software systems use countless software components, some of which may
  not be compatible with structured logging infrastructure.

Hence, the high-level motivation of our project: how can we improve the analyzability of
unstructured logs to make them as usable as structured logs? The scope of this problem is vast,
and we will focus on one aspect: log parsing. CLP has introduced an innovative way of handling
unstructured logs. The basic idea is to find the static text and variables in a raw text log
message, where the static text acts like a format string. For instance, the above log event can be
interpreted as the following:
```Python
print(
    f"{timestamp} {pid} {tid} {priority} {tag}: Removed item: AppOpItem(Op code={op}, UID={uid})"
)
```
`timestamp`, `pid`, `tid`, `priority`, `tag`, `op`, and `uid` are all variables. This provides
some simple data structuring; however, it has a few limitations:
- CLP's heuristic parser cannot parse logs based on a user-defined schema. For example,
  `"Removed item"` above may be a variable, but CLP's heuristic parser cannot handle that.
- CLP's heuristic parser cannot parse complicated substrings, e.g., a substring described by the
  regular expression `capture:((?<letterA>a)*)|(((?<letterC>c)|(?<letterD>d)){0,10})`.
- The parsed variables are unnamed. For example, users cannot name the 7th variable `"uid"` in the
  example above.

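To make the "named variables" limitation concrete, the following std-only Rust sketch extracts `key=value` style variables by name. This is a hypothetical illustration: the helper `extract_named` is not part of CLP or log-surgeon, and a real engine would use a DFA rather than substring search.

```rust
use std::collections::HashMap;

// Hypothetical sketch: pull out `name=value` pairs for the requested variable
// names, so callers can ask for "UID" instead of "the 7th variable".
fn extract_named(msg: &str, names: &[&str]) -> HashMap<String, String> {
    let mut vars = HashMap::new();
    for name in names {
        let pattern = format!("{name}=");
        if let Some(start) = msg.find(&pattern) {
            // The value runs until the first non-alphanumeric byte.
            let rest = &msg[start + pattern.len()..];
            let end = rest
                .find(|c: char| !c.is_ascii_alphanumeric())
                .unwrap_or(rest.len());
            vars.insert(name.to_string(), rest[..end].to_string());
        }
    }
    vars
}

fn main() {
    let msg = "Removed item: AppOpItem(Op code=1, UID=1000)";
    let vars = extract_named(msg, &["UID"]);
    // Prints the extracted value by name: {"UID": "1000"}
    println!("{vars:?}");
}
```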
Our project, [log-surgeon-rust][home-page], is designed to improve CLP's parsing features. It is a
safe and high-performant regular expression engine specialized for unstructured logs, allowing users
to extract named variables from raw text log messages efficiently according to user-defined schema.

## Objective and Key Features
The objective of this project is to fill the gap explained in the motivation above in the current
Rust ecosystem. We shall deliver a high-performance and memory-safe log parsing library using Rust.
The project should consist of the core regex engine, the parser, and the user-oriented log parsing
interface.

The core regex engine is designed for high-performance schema matching and variable extraction.
User-defined schemas will be described as regular expressions, and the underlying engine will parse
the schema regular expressions into abstract syntax trees (ASTs), convert the ASTs into
non-deterministic finite automata ([NFAs][wiki-nfa]), and merge all NFAs into one large
deterministic finite automaton ([DFA][wiki-dfa]). This single-DFA design ensures that the execution
time is bounded by the length of the input stream. If time allows, we will also implement a
[tagged DFA][wiki-tagged-dfa] to make the schema more powerful.

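The NFA-to-DFA step above can be sketched in self-contained Rust. This is an illustrative toy, not log-surgeon's actual implementation: it hand-builds a tiny NFA (skipping the regex-to-AST step), determinizes it with the classic subset construction, and then simulates the resulting DFA in time linear in the input length.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

#[derive(Default)]
struct Nfa {
    transitions: Vec<HashMap<u8, Vec<usize>>>, // per state: byte -> next states
    epsilon: Vec<Vec<usize>>,                  // per state: epsilon moves
    accepting: Vec<bool>,
}

impl Nfa {
    fn add_state(&mut self, accepting: bool) -> usize {
        self.transitions.push(HashMap::new());
        self.epsilon.push(Vec::new());
        self.accepting.push(accepting);
        self.transitions.len() - 1
    }

    fn epsilon_closure(&self, states: &BTreeSet<usize>) -> BTreeSet<usize> {
        let mut closure = states.clone();
        let mut stack: Vec<usize> = states.iter().copied().collect();
        while let Some(s) = stack.pop() {
            for &t in &self.epsilon[s] {
                if closure.insert(t) {
                    stack.push(t);
                }
            }
        }
        closure
    }
}

struct Dfa {
    transitions: Vec<HashMap<u8, usize>>,
    accepting: Vec<bool>,
}

// Subset construction: each DFA state is a set of NFA states.
fn determinize(nfa: &Nfa, start: usize) -> Dfa {
    let start_set = nfa.epsilon_closure(&BTreeSet::from([start]));
    let mut ids: HashMap<BTreeSet<usize>, usize> = HashMap::from([(start_set.clone(), 0)]);
    let mut dfa = Dfa {
        transitions: vec![HashMap::new()],
        accepting: vec![start_set.iter().any(|&s| nfa.accepting[s])],
    };
    let mut queue = VecDeque::from([start_set]);
    while let Some(set) = queue.pop_front() {
        let id = ids[&set];
        // Group reachable NFA states by input byte.
        let mut by_byte: HashMap<u8, BTreeSet<usize>> = HashMap::new();
        for &s in &set {
            for (&b, nexts) in &nfa.transitions[s] {
                by_byte.entry(b).or_default().extend(nexts.iter().copied());
            }
        }
        for (b, nexts) in by_byte {
            let closure = nfa.epsilon_closure(&nexts);
            let next_id = *ids.entry(closure.clone()).or_insert_with(|| {
                dfa.transitions.push(HashMap::new());
                dfa.accepting.push(closure.iter().any(|&s| nfa.accepting[s]));
                queue.push_back(closure.clone());
                dfa.transitions.len() - 1
            });
            dfa.transitions[id].insert(b, next_id);
        }
    }
    dfa
}

// DFA simulation: one table lookup per input byte, so O(n) in the input.
fn matches(dfa: &Dfa, input: &str) -> bool {
    let mut state = 0;
    for &b in input.as_bytes() {
        match dfa.transitions[state].get(&b) {
            Some(&next) => state = next,
            None => return false,
        }
    }
    dfa.accepting[state]
}

fn main() {
    // Hand-built NFA for the regex `ab*`.
    let mut nfa = Nfa::default();
    let s0 = nfa.add_state(false);
    let s1 = nfa.add_state(true);
    nfa.transitions[s0].insert(b'a', vec![s1]);
    nfa.transitions[s1].insert(b'b', vec![s1]);
    let dfa = determinize(&nfa, s0);
    println!("a -> {}, abbb -> {}, ba -> {}",
        matches(&dfa, "a"), matches(&dfa, "abbb"), matches(&dfa, "ba"));
}
```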
The parser has two components:
- The schema parser, an implementation of an [LALR parser][wiki-lalr], which parses the user-input
  schema into a regex AST.
- The log parser, which operates similarly to a simple compiler: it uses a lexer to process the
  input text and emit tokens, and makes decisions based on the emitted tokens using the core regex
  engine.

The log parsing interface will provide programmatic APIs to:
- Specify inputs (variable schemas) to configure the regex engine
- Feed an input stream to the log parser using the configured regex engine
- Retrieve outputs (parsed log events structured according to the user schema) from the parser

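As a sketch of how the three-step interface above might look to a caller, here is a hypothetical, std-only Rust mock-up. The names `LogParser`, `feed`, and `events` are illustrative assumptions, not log-surgeon's actual API, and the toy "engine" merely splits each line on spaces instead of running a real schema-driven DFA.

```rust
#[derive(Debug, PartialEq)]
struct LogEvent {
    variables: Vec<(String, String)>, // (variable name, extracted value)
}

struct LogParser {
    schema: Vec<String>, // variable names, in order of appearance
    parsed: Vec<LogEvent>,
}

impl LogParser {
    // Step 1: specify inputs (variable schemas) to configure the engine.
    fn new(schema: &[&str]) -> Self {
        LogParser {
            schema: schema.iter().map(|s| s.to_string()).collect(),
            parsed: Vec::new(),
        }
    }

    // Step 2: feed an input stream to the parser. Toy engine: each line is one
    // log event, and the leading space-separated tokens bind to the schema.
    fn feed(&mut self, input: &str) {
        for line in input.lines() {
            let values: Vec<&str> = line.splitn(self.schema.len() + 1, ' ').collect();
            let variables = self
                .schema
                .iter()
                .zip(values.iter())
                .map(|(name, value)| (name.clone(), value.to_string()))
                .collect();
            self.parsed.push(LogEvent { variables });
        }
    }

    // Step 3: retrieve parsed log events structured according to the schema.
    fn events(&self) -> &[LogEvent] {
        &self.parsed
    }
}

fn main() {
    let mut parser = LogParser::new(&["timestamp", "level"]);
    parser.feed("2022-10-10T12:30:02 INFO Removed item\n2022-10-10T12:30:03 WARN Retry");
    for event in parser.events() {
        println!("{:?}", event.variables);
    }
}
```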
[Zhihao Lin][github-zhihao] will be working on the parser implementation.

[Siwei (Louis) He][github-siwei] will be working on the core regex engine implementation.

Both will be working on the log parsing interface.

Each will review the other's implementation through GitHub's Pull Requests for correctness and
efficiency.

## Tentative Plan and Status
1. **Louis**

| Time                  | Tentative Schedule                           | Status      |
|-----------------------|----------------------------------------------|-------------|
| Oct. 18th ~ Oct. 25th | Complete AST common structs for the project  | Done        |
| Oct. 25th ~ Nov. 8th  | Complete NFA structs and research            | On track    |
| Nov. 1st ~ Nov. 8th   | Implement AST to NFA translation             | Not started |
| Nov. 8th ~ Nov. 15th  | Implement AST to NFA translation             | Not started |
| Nov. 15th ~ Nov. 22nd | Complete DFA structs and research            | Not started |
| Nov. 22nd ~ Nov. 29th | Implement NFA to DFA translation             | Not started |
| Nov. 29th ~ Dec. 6th  | Stages integration and final reporting       | Not started |

2. **Zhihao**

| Time                  | Tentative Schedule                                           | Status      |
|-----------------------|--------------------------------------------------------------|-------------|
| Nov. 1st ~ Nov. 15th  | Implement LALR parser for schema parsing and AST generation  | Not started |
| Nov. 15th ~ Nov. 29th | Implement lexer for input stream processing                  | Not started |
| Nov. 29th ~ Dec. 6th  | Formalize log parsing APIs                                   | Not started |

[clp-paper]: https://www.usenix.org/system/files/osdi21-rodrigues.pdf
[clp-s-paper]: https://www.usenix.org/system/files/osdi24-wang-rui.pdf
[github-clp]: https://github.com/y-scope/clp
[github-siwei]: https://github.com/Louis-He
[github-zhihao]: https://github.com/LinZhihao-723
[hadoop-logs]: https://zenodo.org/records/7114847
[home-page]: https://github.com/Toplogic-Inc/log-surgeon-rust
[mongodb-logs]: https://zenodo.org/records/11075361
[wiki-dfa]: https://en.wikipedia.org/wiki/Deterministic_finite_automaton
[wiki-lalr]: https://en.wikipedia.org/wiki/LALR_parser
[wiki-nfa]: https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton
[wiki-tagged-dfa]: https://en.wikipedia.org/wiki/Tagged_Deterministic_Finite_Automaton