zstd-jsonl-filter

A tool written in Rust to efficiently decompress and filter zstd files before writing to disk.

What is zstd-jsonl-filter?

zstd-jsonl-filter takes advantage of zstd's incredibly high decompression speed to filter terabytes of data faster than you could read them from disk. It does so with minimal memory usage and I/O writes, holding only the relevant section of each file in memory. The only limitation is that your data must be interpretable line-by-line (e.g. jsonl, log and csv). Under the hood it uses zstd's stream decoder and rayon's parallelism to decompress and filter several files at once, entirely in memory.
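
Conceptually, the core of the pipeline looks something like the following minimal sketch (using the zstd and regex crates; an illustration of the idea, not the actual implementation):

use std::fs::File;
use std::io::{BufRead, BufReader, Write};

use regex::Regex;

// Stream-decompress one .zst file and write only the lines matching `pattern`.
// Only the current line is held in memory, never the whole file.
fn filter_zst(path: &str, pattern: &str, out: &mut impl Write) -> Result<(), Box<dyn std::error::Error>> {
    let decoder = zstd::stream::read::Decoder::new(File::open(path)?)?;
    let re = Regex::new(pattern)?;
    for line in BufReader::new(decoder).lines() {
        let line = line?;
        if re.is_match(&line) {
            writeln!(out, "{line}")?;
        }
    }
    Ok(())
}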




Is zstd-jsonl-filter the right tool for you?

  • You have multiple zstd archives that can be interpreted line-by-line
    (any newline-separated format should work, although I mainly use it for JSON Lines)
  • The uncompressed data is infeasible to store in RAM or on disk
  • You want to process data faster than you could read it from your source


How to use zstd-jsonl-filter

Say you had a couple billion Destiny 2 PGCR log entries from your Fortune 500 employer (please don't actually use it for that), neatly separated into smaller chunks and compressed with zstd. If you wanted to extract only the relevant entries (e.g. all Team Scorched matches), you would normally have to store a decompressed version in RAM or on disk to filter them. With zstd-jsonl-filter you can perform this pattern matching without ever needing to hold the entire file.

  1. Build zstd-jsonl-filter or download the latest release
  2. Open config.toml and set your parameters (see All parameters below for more details), in particular pattern, the regex term that is matched against each line of the decompressed files. When the pattern matches, the entire line is included in the output; otherwise it is discarded.
  3. Make sure your data is in the following format (i.e. no tar):
input_path/
├───12000000000-12010000000.jsonl.zst 
│   └───12000000000-12010000000.jsonl
├───12010000000-12020000000.jsonl.zst
│   └───12010000000-12020000000.jsonl
.
.
.

The names are irrelevant, but you need to make sure your file type and patterns are applicable line-by-line. zstd-jsonl-filter is not going to stop you from reading e.g. a .tar, but you may get unexpected results.

  4. Run zstd-jsonl-filter
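
For example, a minimal run with hypothetical paths could look like this:

.\zstd-jsonl-filter.exe --input "C:\pgcr\archives" --output "C:\pgcr\filtered" --pattern ',"mode":62,'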

Options

You can launch zstd-jsonl-filter

  • with command-line parameters (these take priority over the config file)
  • with the included config.toml file

Also see the practical examples below.

You usually want to include a regex term to filter the output and make zstd-jsonl-filter more than a plain decoder. You can supply it with --pattern or in config.toml. If you need more substantial filtering, you can fork this code and implement your own logic. I might look into more powerful filtering options in the future.

Created files will follow the structure {output_path}original_filename_without_extension{output_suffix}{output_file_extension}.
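
For example, with the default suffix _filtered, an empty file extension and zstd compression disabled, 12000000000-12010000000.jsonl.zst becomes 12000000000-12010000000_filtered.jsonl in the output path.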

All parameters

  • --config: Point zstd-jsonl-filter to the config file. Default: config.toml in the same folder.
  • --input: The path where your .zst files are located. Both / slashes and \ backslashes work. It is also possible to point to a single file. Default: ./ (the current folder).
  • --output: Where the output files should be stored. Default: ./ (the current folder).
  • --zstd: Whether the output should be stored as a compressed .zst file. Default: false (no zstd compression).
  • --compression-level: The zstd compression level, from 1 (fastest) to 22 (smallest). Default: 0 (use zstd's default).
  • --suffix: Name appended to output files, producing e.g. 12000000000-12010000000_filtered.zst. Default: _filtered.
  • --file-extension: Replaces the file extension of output files. You can usually leave this empty; otherwise do not include a dot, i.e. jsonl. Default: "" (keep the original extension).
  • --pattern: The regex pattern applied line-by-line. A match means the line will be included in the output. Keep in mind that regex terms with special characters need to be escaped properly. Look-arounds are not supported, in favor of a worst-case performance of O(m * n). Also note the impact of regex patterns on your performance. Default: ^ (matches every line).
  • --threads: The maximum number of threads used by rayon. Since each thread reads from one file, changing this number also affects I/O. Default: 0 (unlimited).
  • --buffer: The maximum buffer size per thread before matched lines are written to disk. Default: 4096 (4 KiB).
  • --quiet: Display only the current progress and error messages. Default: false.
  • --no-write: Do not write any output files. Can be used for testing or line counting. It still checks for already existing output files so you can spot conflicts before running long tasks. Default: false.

Practical examples

Using config.toml

# Input Parameters
input = 'C:/Users/User/Documents/Destiny_PGCR/bungo-pgcr-12b/'

# Output Parameters
output = 'C:\Users\User\Documents\Destiny_PGCR\test' # Backslashes also work
suffix = "_scorch"
file_extension = "" # You can use this to change the output file extension e.g. from {file}.jsonl.zst to {file}_filtered.txt
no_write = false

# In this example we want the output to be uncompressed, thus we set zstd to false
zstd = false
compression_level = 0

# Regex Filter
pattern = ',"mode":62,' # Make sure to properly escape if needed, look-arounds are not supported

# Performance
threads = 0
buffer = 4096
quiet = false

This finds all Team Scorched matches in Destiny PGCRs by identifying ,"mode":62,. Make sure your source files are well defined and your regex terms are robust enough. Given a file named {file}.jsonl.zst it then writes the output to uncompressed files called {file}_scorch.jsonl.
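
For illustration, a simplified, hypothetical PGCR line like the following would match, since it contains that literal substring:

{"period":"2023-01-15T10:00:00Z","mode":62,"entries":[...]}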

Using arguments

.\zstd-jsonl-filter.exe --input "\\10.0.0.2\D2_PGCR\bungo-pgcr-12b" --output "C:\Users\User\Documents\Destiny_PGCR\test" --zstd --compression-level 14 --threads 2 --pattern ',"mode":62,' --quiet

This example also finds all Team Scorched matches with ,"mode":62, in the network share \\10.0.0.2\D2_PGCR\bungo-pgcr-12b and writes the output to compressed files called {file}_filtered.zst. It is restricted to only 2 threads, and with --quiet it will only display the current progress and important error messages.

Without arguments or config.toml, zstd-jsonl-filter falls back to extracting every .zst archive in the current directory without filtering any lines.

Only counting with --no-write

.\zstd-jsonl-filter.exe --no-write --input "\\10.0.0.2\D2_PGCR\bungo-pgcr-12b"  --output "E:\already occupied path\" --threads 2 --pattern '("teams":\[\{("\w*":[\w\."]*,)*"score":6\d,("\w*":[\w\."]*,*)*\},\{("\w*":[\w\.]*,)*"score":0,)|("teams":\[\{("\w*":[\w\."]*,)*"score":0,("\w*":[\w\."]*,*)*\},\{("\w*":[\w\.]*,)*"score":6\d,)'

This is really useful if you just want to see how often your term occurs, or if you want to test your regex before committing to a potentially long write. I would recommend not using the --quiet flag, so that it shows you all potential issues that would occur if you were to run it again without --no-write. This includes e.g. already existing files in the output path, which zstd-jsonl-filter will not overwrite. This is why --output still affects the outcome.

If you're curious about the regex: it should return all matches that were won 60:0 or 0:60. As it turns out, though, there can be more than 2 teams when a player was not assigned to one. These kinds of quick checks are what --no-write is very useful for.

Performance

Using a Ryzen 9 3900X and a test set of 200 GB of zstd archives stored on NVMe drives, zstd-jsonl-filter is pretty much always CPU bound, with average read speeds of ~600 MB/s from disk and ~5 GB/s of uncompressed data processed in memory. These operations took on average 8 minutes and processed 3 TB of uncompressed data.

CPU

zstd-jsonl-filter uses rayon for parallelization across files, decompressing one file per thread. If you have fewer files than available threads, you will not see any speedup. You can set the maximum number of threads with max_threads.

Matching is performed via regex because it was significantly faster than parsing each line as JSON. It uses the regex crate, which runs in linear time. Whether that trade-off makes sense depends on what you have to work with; adjust as needed.
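
A rough sketch of how the file-level parallelism could be wired up with rayon (reusing the hypothetical filter_zst from the earlier sketch; the actual implementation may differ):

use rayon::prelude::*;
use std::path::PathBuf;

// Process files in parallel: one file per rayon task, so each worker thread
// streams and filters its own file. max_threads == 0 lets rayon decide,
// mirroring the --threads default.
fn run_all(files: Vec<PathBuf>, pattern: &str, max_threads: usize) {
    if max_threads > 0 {
        rayon::ThreadPoolBuilder::new()
            .num_threads(max_threads)
            .build_global()
            .expect("thread pool already initialized");
    }
    files.par_iter().for_each(|path| {
        // the real tool derives an output file per input;
        // here the output is discarded for brevity
        if let Err(e) = filter_zst(&path.to_string_lossy(), pattern, &mut std::io::sink()) {
            eprintln!("{}: {e}", path.display());
        }
    });
}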

Memory

Stream decompression drastically reduces memory usage. For my test set it sits around 200 MB. However, the exact value depends on several factors, like the size of a single line in your decompressed files and whether you write zstd-compressed output. Also note the impact of regex patterns on your performance.
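
The pattern that keeps memory flat is, in essence, reusing a single line buffer per file; a sketch, assuming any BufRead over the decoder:

use std::io::{BufRead, Write};

// Peak memory per thread stays near the length of the longest line,
// no matter how large the decompressed stream is.
fn filter_lines(mut reader: impl BufRead, re: &regex::Regex, out: &mut impl Write) -> std::io::Result<()> {
    let mut line = String::new();
    while reader.read_line(&mut line)? > 0 {
        if re.is_match(&line) {
            out.write_all(line.as_bytes())?;
        }
        line.clear(); // reuse the allocation for the next line
    }
    Ok(())
}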

I/O

Although zstd-jsonl-filter is usually CPU bound when reading from an NVMe, that can quickly change for other sources. Reading from network storage or hard drives can be a bottleneck. You can set max_threads to change the number of simultaneous file operations if that impacts your source medium.

Write speeds depend on how many entries are filtered, your storage speed, and whether compression is used. By default zstd-jsonl-filter writes through a 4 KiB buffer, which you can adjust as needed.
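
That buffering amounts to wrapping the output file in a fixed-capacity writer, roughly like this sketch (the flag is real, the helper is illustrative):

use std::fs::File;
use std::io::BufWriter;

// Matched lines accumulate in memory and are flushed to disk in
// buffer_size chunks, mirroring --buffer (default 4096 bytes, i.e. 4 KiB).
fn open_output(path: &str, buffer_size: usize) -> std::io::Result<BufWriter<File>> {
    Ok(BufWriter::with_capacity(buffer_size, File::create(path)?))
}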

Donate

You can support this project by donating! Specifically: bungo-pgcr, bungo-pgcr-10b, bungo-pgcr-11b and bungo-pgcr-14b.

Why Rust?

I read somewhere that it's fast and I wanted to learn something new. While I didn't intend to create a public tool, I uploaded this just in case there's another person out there who, for whatever reason, needs to efficiently go through terabytes of zstd compressed data.

Why Team Scorched?

It is the single best game mode.
