Add clp-s for compressing and searching semi-structured logs. (#217)
wraymo authored Jan 9, 2024
1 parent 4293499 commit 1a17b85
Showing 133 changed files with 16,503 additions and 110 deletions.
6 changes: 6 additions & 0 deletions .gitmodules
Original file line number Diff line number Diff line change
@@ -17,3 +17,9 @@
[submodule "components/core/submodules/boost-outcome"]
path = components/core/submodules/boost-outcome
url = https://github.com/boostorg/outcome.git
[submodule "components/core/submodules/simdjson"]
path = components/core/submodules/simdjson
url = https://github.com/simdjson/simdjson.git
[submodule "components/core/submodules/abseil-cpp"]
path = components/core/submodules/abseil-cpp
url = https://github.com/abseil/abseil-cpp.git
4 changes: 2 additions & 2 deletions components/core/.clang-format
@@ -72,8 +72,8 @@ IncludeCategories:
# NOTE: A header is grouped by first matching regex
# Library headers. Update when adding new libraries.
# NOTE: clang-format retains leading white-space on a line in violation of the YAML spec.
- Regex: "^<(archive|boost|catch2|date|fmt|json|log_surgeon|mariadb|spdlog|sqlite3|string_utils\
|yaml-cpp|zstd)"
- Regex: "<(absl|antlr4|archive|boost|catch2|date|fmt|json|log_surgeon|mariadb|simdjson|spdlog\
|sqlite3|string_utils|yaml-cpp|zstd)"
Priority: 3
# C system headers
- Regex: "^<.+\\.h>"
1 change: 1 addition & 0 deletions components/core/.gitignore
@@ -1,2 +1,3 @@
build/**
submodules/sqlite3/*
third-party/**
17 changes: 17 additions & 0 deletions components/core/CMakeLists.txt
@@ -75,6 +75,15 @@ elseif (CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
endif ()
endif ()

# Find and setup ANTLR Library
# We build and link to the static library
find_package(ANTLR REQUIRED)
if (ANTLR_FOUND)
message(STATUS "Found ANTLR ${ANTLR_VERSION}")
else()
message(FATAL_ERROR "Could not find libraries for ANTLR ${ANTLR4_TAG}")
endif()

# Find and setup Boost Library
if(CLP_USE_STATIC_LIBS)
set(Boost_USE_STATIC_LIBS ON)
@@ -142,6 +151,13 @@ else()
message(FATAL_ERROR "Could not find msgpack-cxx")
endif()

# Add abseil-cpp
set(ABSL_PROPAGATE_CXX_STD ON)
add_subdirectory(submodules/abseil-cpp EXCLUDE_FROM_ALL)

# Add simdjson
add_subdirectory(submodules/simdjson EXCLUDE_FROM_ALL)

# Add yaml-cpp
add_subdirectory(submodules/yaml-cpp EXCLUDE_FROM_ALL)

@@ -167,6 +183,7 @@ add_subdirectory(src/clp/clg)
add_subdirectory(src/clp/clo)
add_subdirectory(src/clp/clp)
add_subdirectory(src/clp/make_dictionaries_readable)
add_subdirectory(src/clp_s)

set(SOURCE_FILES_unitTest
src/clp/BufferedFileReader.cpp
119 changes: 11 additions & 108 deletions components/core/README.md
@@ -12,10 +12,7 @@ CLP core is the low-level component that performs compression, decompression, an
* [Docker Environment](#docker-environment)
* [Build](#build)
* [Running](#running)
* [`clp`](#clp)
* [`clg`](#clg)
* [`make-dictionaries-readable`](#make-dictionaries-readable)
* [Parallel Compression](#parallel-compression)


## Requirements

@@ -36,10 +33,13 @@ tools/scripts/deps-download/download-all.sh
```

This will download:
* [abseil-cpp](https://github.com/abseil/abseil-cpp) (20230802.1)
* [ANTLR](https://www.antlr.org) (v4.13.1)
* [Catch2](https://github.com/catchorg/Catch2.git) (v2.13.7)
* [date](https://github.com/HowardHinnant/date.git) (v3.0.1)
* [json](https://github.com/nlohmann/json.git) (v3.10.4)
* [log-surgeon](https://github.com/y-scope/log-surgeon) (895f464)
* [simdjson](https://github.com/simdjson/simdjson) (v3.6.3)
* [SQLite3](https://www.sqlite.org/download.html) (v3.36.0)
* [yaml-cpp](https://github.com/jbeder/yaml-cpp.git) (v0.7.0)

@@ -98,108 +98,11 @@ the relevant paths on your machine.

## Running

* CLP contains two core executables: `clp` and `clg`
* `clp` is used for compressing and extracting logs
* `clg` is used for performing wildcard searches on the compressed logs

### `clp`

To compress some logs without a schema file:
```shell
./clp c archives-dir /home/my/logs
```
* `archives-dir` is where compressed logs should be output
* `clp` will create a number of files and directories within, so it's best if this directory is empty
* You can use the same directory repeatedly and `clp` will add to the compressed logs within.
* `/home/my/logs` is any log file or directory containing log files
* In this mode, `clp` will use heuristics to determine what are the variables in
each uncompressed message.
* The heuristics roughly correspond to the example schema file in
`config/schemas.txt`.

To compress with a user-defined schema file:
```shell
./clp c --schema-path path-to-schema-file archives-dir /home/my/logs
```
* `path-to-schema-file` is the location of a schema file. For more details on
schema files, see README-Schema.md.
* CLP contains three core executables: `clp`, `clg`, and `clp-s`.
* `clp` is used for compressing and extracting unstructured (plain text) logs.
* `clg` is used for performing wildcard searches on the compressed unstructured logs.
* `clp-s` is used for compressing and searching semi-structured logs (e.g., JSON) with support for
handling highly dynamic schemas.

To decompress those logs:
```shell
./clp x archive-dir decompressed
```
* `archives-dir` is where the compressed logs were previously stored
* `decompressed` is a directory where they will be decompressed to

You can also decompress a specific file:
```shell
./clp x archive-dir decompressed /my/file/path.log
```
* `/my/file/path.log` is the uncompressed file's path (the one that was passed to `clp` for compression)

More usage instructions can be found by running:
```shell
./clp --help
```

### `clg`

To search the compressed logs:
```shell
./clg archives-dir " a *wildcard* search phrase "
```
* `archives-dir` is where the compressed logs were previously stored
* For archives compressed without a schema file:
* The search phrase can contain the `*` wildcard which matches 0 or more
characters, or the `?` wildcard which matches any single character.
* For archives compressed using a schema file:
* `*` may only represent non-delimiter characters.

Similar to `clp`, `clg` can search a single file:
```shell
./clg archives-dir " a *wildcard* search phrase " /my/file/path.log
```
* `/my/file/path.log` is the uncompressed file's path (the one that was passed to `clp` for compression)

More usage instructions can be found by running:
```shell
./clg --help
```

### `make-dictionaries-readable`

If you'd like to convert the dictionaries of an individual archive into a human-readable form, you
can use `make-dictionaries-readable`.

```shell
./make-dictionaries-readable archive-path <output dir>
```
* `archive-path` is a path to a specific archive (inside `archives-dir`)

See the `make-dictionaries-readable` [README](src/clp/make_dictionaries_readable/README.md) for
details on the output format.

## Parallel Compression

By default, `clp` uses an embedded SQLite database, so each directory containing archives can only
be accessed by a single `clp` instance.

To enable parallel compression to the same archives directory, `clp`/`clg` can be configured to
use a MySQL-type database (MariaDB) as follows:

* Install and configure MariaDB using the instructions for your platform
* Create a user that has privileges to create databases, create tables, insert records, and delete
records.
* Copy and change `config/metadata-db.yml`, setting the type to `mysql` and uncommenting the MySQL
parameters.
* Install the MariaDB and PyYAML Python packages `pip3 install mariadb PyYAML`
* This is necessary to run the database initialization script. If you prefer, you can run the
SQL statements in `tools/scripts/db/init-db.py` directly.
* Run `tools/scripts/db/init-db.py` with the updated config file. This will initialize the
database CLP requires.
* Run `clp` or `clg` as before, with the addition of the `--db-config-file` option pointing at
the updated config file.
* To compress in parallel, simply run another instance of `clp` concurrently.

Note that currently, decompression (`clp x`) and search (`clg`) can only be run with a single
instance. We are in the process of open-sourcing parallelized versions of these as well.
See [Using CLP for unstructured logs](../../docs/core/clp-unstructured.md) and
[Using CLP for semi-structured logs](../../docs/core/clp-structured.md) for usage instructions.
180 changes: 180 additions & 0 deletions components/core/cmake/Modules/ExternalAntlr4Cpp.cmake
@@ -0,0 +1,180 @@
# NOTE: ExternalAntlr4Cpp.cmake taken from
# https://github.com/antlr/antlr4/blob/4.13.1/runtime/Cpp/cmake/ExternalAntlr4Cpp.cmake

cmake_minimum_required(VERSION 3.7)

if(POLICY CMP0114)
cmake_policy(SET CMP0114 NEW)
endif()

include(ExternalProject)

set(ANTLR4_ROOT ${CMAKE_CURRENT_BINARY_DIR}/antlr4_runtime/src/antlr4_runtime)
set(ANTLR4_INCLUDE_DIRS ${ANTLR4_ROOT}/runtime/Cpp/runtime/src)
set(ANTLR4_GIT_REPOSITORY https://github.com/antlr/antlr4.git)
if(NOT DEFINED ANTLR4_TAG)
# Set to branch name to keep library updated at the cost of needing to rebuild after 'clean'
# Set to commit hash to keep the build stable and does not need to rebuild after 'clean'
set(ANTLR4_TAG master)
endif()

# Ensure that the include dir already exists at configure time (to avoid cmake erroring
# on non-existent include dirs)
file(MAKE_DIRECTORY "${ANTLR4_INCLUDE_DIRS}")

if(${CMAKE_GENERATOR} MATCHES "Visual Studio.*")
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist/$(Configuration))
elseif(${CMAKE_GENERATOR} MATCHES "Xcode.*")
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist/$(CONFIGURATION))
else()
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist)
endif()

if(MSVC)
set(ANTLR4_STATIC_LIBRARIES
${ANTLR4_OUTPUT_DIR}/antlr4-runtime-static.lib)
set(ANTLR4_SHARED_LIBRARIES
${ANTLR4_OUTPUT_DIR}/antlr4-runtime.lib)
set(ANTLR4_RUNTIME_LIBRARIES
${ANTLR4_OUTPUT_DIR}/antlr4-runtime.dll)
else()
set(ANTLR4_STATIC_LIBRARIES
${ANTLR4_OUTPUT_DIR}/libantlr4-runtime.a)
if(MINGW)
set(ANTLR4_SHARED_LIBRARIES
${ANTLR4_OUTPUT_DIR}/libantlr4-runtime.dll.a)
set(ANTLR4_RUNTIME_LIBRARIES
${ANTLR4_OUTPUT_DIR}/libantlr4-runtime.dll)
elseif(CYGWIN)
set(ANTLR4_SHARED_LIBRARIES
${ANTLR4_OUTPUT_DIR}/libantlr4-runtime.dll.a)
set(ANTLR4_RUNTIME_LIBRARIES
${ANTLR4_OUTPUT_DIR}/cygantlr4-runtime-${ANTLR4_TAG}.dll)
elseif(APPLE)
set(ANTLR4_RUNTIME_LIBRARIES
${ANTLR4_OUTPUT_DIR}/libantlr4-runtime.dylib)
else()
set(ANTLR4_RUNTIME_LIBRARIES
${ANTLR4_OUTPUT_DIR}/libantlr4-runtime.so)
endif()
endif()

if(${CMAKE_GENERATOR} MATCHES ".* Makefiles")
# This avoids
# 'warning: jobserver unavailable: using -j1. Add '+' to parent make rule.'
set(ANTLR4_BUILD_COMMAND $(MAKE))
elseif(${CMAKE_GENERATOR} MATCHES "Visual Studio.*")
set(ANTLR4_BUILD_COMMAND
${CMAKE_COMMAND}
--build .
--config $(Configuration)
--target)
elseif(${CMAKE_GENERATOR} MATCHES "Xcode.*")
set(ANTLR4_BUILD_COMMAND
${CMAKE_COMMAND}
--build .
--config $(CONFIGURATION)
--target)
else()
set(ANTLR4_BUILD_COMMAND
${CMAKE_COMMAND}
--build .
--target)
endif()

if(NOT DEFINED ANTLR4_WITH_STATIC_CRT)
set(ANTLR4_WITH_STATIC_CRT ON)
endif()

if(ANTLR4_ZIP_REPOSITORY)
ExternalProject_Add(
antlr4_runtime
PREFIX antlr4_runtime
URL ${ANTLR4_ZIP_REPOSITORY}
DOWNLOAD_DIR ${CMAKE_CURRENT_BINARY_DIR}
BUILD_COMMAND ""
BUILD_IN_SOURCE 1
SOURCE_DIR ${ANTLR4_ROOT}
SOURCE_SUBDIR runtime/Cpp
CMAKE_CACHE_ARGS
-DCMAKE_BUILD_TYPE:STRING=${CMAKE_BUILD_TYPE}
-DWITH_STATIC_CRT:BOOL=${ANTLR4_WITH_STATIC_CRT}
-DDISABLE_WARNINGS:BOOL=ON
# -DCMAKE_CXX_STANDARD:STRING=17 # if desired, compile the runtime with a different C++ standard
# -DCMAKE_CXX_STANDARD:STRING=${CMAKE_CXX_STANDARD} # alternatively, compile the runtime with the same C++ standard as the outer project
INSTALL_COMMAND ""
EXCLUDE_FROM_ALL 1)
else()
ExternalProject_Add(
antlr4_runtime
PREFIX antlr4_runtime
GIT_REPOSITORY ${ANTLR4_GIT_REPOSITORY}
GIT_TAG ${ANTLR4_TAG}
DOWNLOAD_DIR ${CMAKE_CURRENT_BINARY_DIR}
BUILD_COMMAND ""
BUILD_IN_SOURCE 1
SOURCE_DIR ${ANTLR4_ROOT}
SOURCE_SUBDIR runtime/Cpp
CMAKE_CACHE_ARGS
-DCMAKE_BUILD_TYPE:STRING=${CMAKE_BUILD_TYPE}
-DWITH_STATIC_CRT:BOOL=${ANTLR4_WITH_STATIC_CRT}
-DDISABLE_WARNINGS:BOOL=ON
# -DCMAKE_CXX_STANDARD:STRING=17 # if desired, compile the runtime with a different C++ standard
# -DCMAKE_CXX_STANDARD:STRING=${CMAKE_CXX_STANDARD} # alternatively, compile the runtime with the same C++ standard as the outer project
INSTALL_COMMAND ""
EXCLUDE_FROM_ALL 1)
endif()

# Separate build step as rarely people want both
set(ANTLR4_BUILD_DIR ${ANTLR4_ROOT})
if(${CMAKE_VERSION} VERSION_GREATER_EQUAL "3.14.0")
# CMake 3.14 builds in above's SOURCE_SUBDIR when BUILD_IN_SOURCE is true
set(ANTLR4_BUILD_DIR ${ANTLR4_ROOT}/runtime/Cpp)
endif()

ExternalProject_Add_Step(
antlr4_runtime
build_static
COMMAND ${ANTLR4_BUILD_COMMAND} antlr4_static
# Depend on target instead of step (a custom command)
# to avoid running dependent steps concurrently
DEPENDS antlr4_runtime
BYPRODUCTS ${ANTLR4_STATIC_LIBRARIES}
EXCLUDE_FROM_MAIN 1
WORKING_DIRECTORY ${ANTLR4_BUILD_DIR})
ExternalProject_Add_StepTargets(antlr4_runtime build_static)

add_library(antlr4_static STATIC IMPORTED)
add_dependencies(antlr4_static antlr4_runtime-build_static)
set_target_properties(antlr4_static PROPERTIES
IMPORTED_LOCATION ${ANTLR4_STATIC_LIBRARIES})
target_include_directories(antlr4_static
INTERFACE
${ANTLR4_INCLUDE_DIRS}
)

ExternalProject_Add_Step(
antlr4_runtime
build_shared
COMMAND ${ANTLR4_BUILD_COMMAND} antlr4_shared
# Depend on target instead of step (a custom command)
# to avoid running dependent steps concurrently
DEPENDS antlr4_runtime
BYPRODUCTS ${ANTLR4_SHARED_LIBRARIES} ${ANTLR4_RUNTIME_LIBRARIES}
EXCLUDE_FROM_MAIN 1
WORKING_DIRECTORY ${ANTLR4_BUILD_DIR})
ExternalProject_Add_StepTargets(antlr4_runtime build_shared)

add_library(antlr4_shared SHARED IMPORTED)
add_dependencies(antlr4_shared antlr4_runtime-build_shared)
set_target_properties(antlr4_shared PROPERTIES
IMPORTED_LOCATION ${ANTLR4_RUNTIME_LIBRARIES})
target_include_directories(antlr4_shared
INTERFACE
${ANTLR4_INCLUDE_DIRS}
)

if(ANTLR4_SHARED_LIBRARIES)
set_target_properties(antlr4_shared PROPERTIES
IMPORTED_IMPLIB ${ANTLR4_SHARED_LIBRARIES})
endif()
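
Reviewer note: the module above defines two imported targets, `antlr4_static` and `antlr4_shared`, each carrying the runtime's include directories and a dependency on the matching `ExternalProject` build step. The following is a minimal sketch of how a consumer might link against the static runtime. It assumes the module is on `CMAKE_MODULE_PATH`; the project name `antlr_demo`, the target `demo_parser`, and the source `main.cpp` are hypothetical and do not come from this commit. It is not necessarily how `clp-s` wires ANTLR in, since the `src/clp_s` build files are not shown here.

```cmake
cmake_minimum_required(VERSION 3.7)
project(antlr_demo CXX)  # hypothetical project name

# The ANTLR C++ runtime at 4.13.x is built as C++17.
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Assumption: ExternalAntlr4Cpp.cmake lives under cmake/Modules, as in this commit.
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake/Modules")

# Pin the runtime to a release tag; the module falls back to `master`
# when ANTLR4_TAG is not defined before the include.
set(ANTLR4_TAG 4.13.1)
include(ExternalAntlr4Cpp)

# Hypothetical consumer target.
add_executable(demo_parser main.cpp)

# Linking the imported target pulls in ANTLR4_INCLUDE_DIRS and triggers the
# `build_static` step before demo_parser is built.
target_link_libraries(demo_parser PRIVATE antlr4_static)
```

Pinning `ANTLR4_TAG` matters because the module's own comment notes that a branch name forces a rebuild after `clean`, whereas a fixed tag or commit hash keeps the build stable.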