JSON is everywhere on the Internet. Servers spend a lot of time parsing it. We need a fresh approach.
Welcome to zimdjson: a high-performance JSON parser that takes advantage of SIMD vector instructions, based on the paper Parsing Gigabytes of JSON per Second.
The majority of the source code is based on the C++ implementation https://github.com/simdjson/simdjson, with the addition of some fundamental features such as:
- Streaming support, which can handle arbitrarily large documents with O(1) memory usage.
- An ergonomic, Serde-like deserialization interface thanks to Zig's compile-time reflection. See Reflection-based JSON.
- More efficient memory usage.
Install the zimdjson library by running the following command in your project root:
zig fetch --save git+https://github.com/ezequielramis/zimdjson#0.1.1
Then write the following in your build.zig:
const zimdjson = b.dependency("zimdjson", .{});
exe.root_module.addImport("zimdjson", zimdjson.module("zimdjson"));
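For context, a complete build.zig might look like the sketch below. The executable name and root source path are placeholders, and the snippet assumes the standard Build API of a recent Zig release:

const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    // "example" and src/main.zig are placeholders; use your project's values.
    const exe = b.addExecutable(.{
        .name = "example",
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });

    // The two lines from above: resolve the dependency and expose its module
    // to the executable as @import("zimdjson").
    const zimdjson = b.dependency("zimdjson", .{});
    exe.root_module.addImport("zimdjson", zimdjson.module("zimdjson"));

    b.installArtifact(exe);
}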
As an example, download a sample file called twitter.json.
Then execute the following:
const std = @import("std");
const zimdjson = @import("zimdjson");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}).init;
    const allocator = gpa.allocator();

    var parser = zimdjson.ondemand.StreamParser(.default).init;
    defer parser.deinit(allocator);

    const file = try std.fs.cwd().openFile("twitter.json", .{});
    defer file.close();

    const document = try parser.parseFromReader(allocator, file.reader().any());
    const metadata_count = try document.at("search_metadata").at("count").asUnsigned();
    std.debug.print("{} results.", .{metadata_count});
}
> zig build run
100 results.
To see how the streaming parser above handles multi-gigabyte JSON documents with minimal memory usage, download one of these dumps or try it with a file of your choice.
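As a sketch of what that might look like, the program below reuses the stream parser from the example above on the 626MB systemsPopulated.json dump referenced in the benchmarks below. The buffered reader is my own assumption to cut down on syscalls, not something the library requires:

const std = @import("std");
const zimdjson = @import("zimdjson");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}).init;
    const allocator = gpa.allocator();

    var parser = zimdjson.ondemand.StreamParser(.default).init;
    defer parser.deinit(allocator);

    // A 626MB dump; the stream parser consumes it through the reader instead
    // of loading the whole file into memory at once.
    const file = try std.fs.cwd().openFile("systemsPopulated.json", .{});
    defer file.close();

    // Wrapping the file reader in std.io.bufferedReader reduces syscall
    // overhead (an optional choice, not required by the parser).
    var buffered = std.io.bufferedReader(file.reader());
    const document = try parser.parseFromReader(allocator, buffered.reader().any());

    // Navigate the document exactly as in the twitter.json example; only the
    // parts you touch are materialized, so memory usage stays bounded.
    _ = document;
}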
Currently, Linux, Windows, and macOS targets with SIMD-capable CPUs are supported. Missing targets can be added by contributing.
The most recent documentation can be found at https://zimdjson.ramis.ar.
Although the provided interfaces are simple enough, deserializing many data structures by hand tends to produce unnecessary boilerplate. Thanks to Zig's compile-time reflection, we can eliminate it:
const std = @import("std");
const zimdjson = @import("zimdjson");
const Film = struct {
    name: []const u8,
    year: u32,
    characters: []const []const u8, // we could also use std.ArrayListUnmanaged([]const u8)
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}).init;
    const allocator = gpa.allocator();

    var parser = zimdjson.ondemand.FullParser(.default).init;
    defer parser.deinit(allocator);

    const json =
        \\{
        \\  "name": "Esperando la carroza",
        \\  "year": 1985,
        \\  "characters": [
        \\    "Mamá Cora",
        \\    "Antonio",
        \\    "Sergio",
        \\    "Emilia",
        \\    "Jorge"
        \\  ]
        \\}
    ;
    const document = try parser.parseFromSlice(allocator, json);

    const film = try document.as(Film, allocator, .{});
    defer film.deinit();

    try std.testing.expectEqualDeep(
        Film{
            .name = "Esperando la carroza",
            .year = 1985,
            .characters = &.{
                "Mamá Cora",
                "Antonio",
                "Sergio",
                "Emilia",
                "Jorge",
            },
        },
        film.value,
    );
}
This is just a simple example, but this way of deserializing is as powerful as Serde, so there are many more features we can use, such as:
- Deserializing data structures from the Zig Standard Library (see the sketch after this list).
- Renaming fields.
- Using different union representations.
- Custom handling of unknown fields.
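For instance, for the first item above, the characters field of the Film struct could be deserialized straight into a standard-library container, as the comment in that struct already hints. A minimal sketch, assuming the container is filled the same way the slice was:

const std = @import("std");
const zimdjson = @import("zimdjson");

const FilmWithList = struct {
    name: []const u8,
    year: u32,
    // A standard-library container instead of a plain slice, as the comment
    // in the Film struct above suggests.
    characters: std.ArrayListUnmanaged([]const u8),
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}).init;
    const allocator = gpa.allocator();

    var parser = zimdjson.ondemand.FullParser(.default).init;
    defer parser.deinit(allocator);

    const json =
        \\{ "name": "Esperando la carroza", "year": 1985, "characters": ["Mamá Cora", "Antonio"] }
    ;
    const document = try parser.parseFromSlice(allocator, json);

    const film = try document.as(FilmWithList, allocator, .{});
    defer film.deinit();

    // The deserialized strings are available through the container's items slice.
    std.debug.print("{s} has {} characters.\n", .{ film.value.name, film.value.characters.items.len });
}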
To see all the available options it offers, check out its reference.
To see all the supported Zig Standard Library data structures, check out this list.
To see how it can be used in practice, check out the test suite for more examples.
Note
As a rule of thumb, do not trust any benchmark — always verify it yourself. There may be biases that favor a particular candidate, including mine.
The following picture shows parsing speed in GB/s for tasks similar to those presented in the paper On-Demand JSON: A Better Way to Parse Documents?, where the first three tasks iterate over twitter.json and the others iterate over a 626MB JSON file called systemsPopulated.json from these dumps.
Ok, it seems the benchmark got borked, but it did not: caches do wonders for small files, and the streaming parser happily stopped early once it found the tweet in the middle of the file.
Let's get rid of that task to see the other results better.
The following picture corresponds to a second simple benchmark, representing parsing speed in GB/s for near-complete parsing of the twitter.json file with reflection-based parsers (serde_json, std.json).
Note: If you look closely, you'll notice that "zimdjson (On-Demand, Unordered)" is the slowest of all. This is, unfortunately, a behaviour that also occurs with simdjson when object keys are unordered. If you do not know the order, it can be mitigated by using a schema. Thanks to the glaze library author for pointing this out.
All benchmarks were run on a 3.30GHz Intel Skylake processor.