Skip to content

Commit

Permalink
Serialize header for new serialization format (#1028)
Browse files Browse the repository at this point in the history
This PR adds a tool to serialize the header for the new serialization
format that will be used by the proof hints. The new format focuses on
fast serialization of llvm backend terms rather than generality. It
should be quite fast to deserialize as well, however.

Not ready for review yet. I still have to:

* [x] write tests to ensure that the header emitted matches expectations
* [x] double check that this code doesn't break anything
* [x] write up documentation of the new serialization format
  • Loading branch information
Dwight Guth authored May 9, 2024
1 parent 26902b1 commit 6dbf12e
Show file tree
Hide file tree
Showing 10 changed files with 7,930 additions and 0 deletions.
87 changes: 87 additions & 0 deletions docs/binary-kore-2.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
# Binary KORE Format 2.0

This document specifies a new binary format for KORE patterns. It is not intended to be backwards compatible with the Binary KORE 1.0 format, and will exist side by side with that format. This format has slightly different design goals, namely:

* Serialization of terms must be as fast as possible
* Terms do not need to be composable
* Optimized for serializing multiple terms in sequence
* Not optimized for very large terms.

## Preface

Byte values are written as space-separated pairs, similarly to the default
output of `xxd`. For example, `ABCD EFGH` represents the byte sequence `0xAB`,
`0xCD`, `0xEF`, `0xGH` (where `A`, `B`, ... are hexadecimal digits).

The serialization format assumes little-endianness throughout. For example, the
32-bit integer 1 would be serialized as `0100 0000`.

## Header

A stream of binary KORE terms depends on a header designed to make serialization as fast as possible by storing data about the terms that might be serialized. The header does not necessarily have to be serialized in the same stream as the terms being serialized, but it must be referenced when deserializing.

The header begins with a 4-byte magic number: the sequence `7f4b 5232`. It is followed by a 4-byte integer representing the version number (currently 1).

Following the version number are 3 4-byte integers: the size of the string table, the size of the sort table, and the size of the symbol table, represented as the number of entries, in that order.

### String table

Following the size of the symbol table is the string table itself. It contains a number of entries equal to the size serialized above. Each entry consists of the following:

1. A 4-byte integer length of the string in bytes.
2. The zero-terminated string itself.

### Sort table

Following the last entry in the string table is the sort table. It contains a number of entries equal to the size serialized above. Each entry consists of the following:

1. A 4-byte integer representing a 0-indexed array offset in the string table.
2. A 1-byte size representing the number of sort parameters of the sort.
3. For each sort parameter, a 4-byte integer representing the offset of the parameter in the sort table.

### Symbol table

Following the last entry in the sort table is the symbol table. It contains a number of entries equal to the size serialized above. Each entry consists of the following:

1. A 4-byte integer representing a 0-indexed array offset in the string table.
2. A 1-byte size representing the number of sort parameters of the symbol.
3. A 1-byte size representing the arity of the symbol.
4. For each sort parameter, a 4-byte integer representing the offset of the parameter in the sort table.

## Terms

Only concrete terms are currently supported by this serialization format. We do not support object or sort variables yet. This may change in a future version of this format.

Terms can be divided into two categories:

1. String patterns
2. Composite patterns.

### String patterns

A string pattern begins with the byte `00`. It is followed by:

1. A 8-byte integer representing the length of the string.
2. The zero-terminated string itself.

Strings in patterns do not utilize the string table because the string table is intended to be generated statically, and the strings that we may wish to serialize do not exist at that point in time.

### Composite patterns

A composite pattern begins with the byte `01`. It is followed by:

1. A 4-byte integer representing the offset of the symbol in the symbol table.
2. For each child of the symbol, the serialized data associated with the child.

## Serialization algorithm

A brief note is left here to explain exactly how to serialize terms using this format. We assume a definition exists with N tags and M sorts.

* The header has N + M + 1 strings in the string table. Strings 0 to N - 1 are the names of the symbols with the corresponding tags. Strings N to N + M - 1 are the names of the sorts of the definition. String N + M is '\dv'.$
* The header has M sorts, corresponding to each sort in the definition. M may not exactly be equal to the number of sort declarations in the kore definition if the definition has parametric sorts, since each instantiation of a parametric symbol exists separately in the sort table.
* The header has N + M symbols. Symbols 0 to N - 1 are the symbols with the corresponding tags. Symbols N to N + M - 1 are the symbol '\dv' instantiated with each sort in the sort table.
* Builtin constants are serialized with the assistance of a recursive descent function which is, for each constant, passed by its parent the '\dv' symbol index associated with its corresponding sort.
* Builtin collections are serialized with the assistance of a recursive descent function which is, for each collection, passed by its parent the unit, element, and concatenation symbol indices associated with its corresponding sort.
* String tokens are serialized the same way as builtin constants.
* Symbols with zero children are serialized by simply serializing their 32-bit tag.
* Symbols with nonzero children are serialized by serializing their 32-bit tag, followed by recursively serializing each child.
8 changes: 8 additions & 0 deletions include/kllvm/ast/AST.h
Original file line number Diff line number Diff line change
Expand Up @@ -156,6 +156,7 @@ class kore_composite_sort : public kore_sort {
std::string name_;
std::vector<sptr<kore_sort>> arguments_;
value_type category_;
uint32_t ordinal_{};

public:
static sptr<kore_composite_sort> create(
Expand All @@ -168,11 +169,13 @@ class kore_composite_sort : public kore_sort {
value_type get_category(kore_definition *definition);
std::string get_hook(kore_definition *definition) const;
static value_type get_category(std::string const &hook_name);
uint32_t get_ordinal() const { return ordinal_; }

bool is_concrete() const override;
sptr<kore_sort> substitute(substitution const &subst) override;

void add_argument(sptr<kore_sort> const &argument);
void set_ordinal(uint32_t ordinal) { ordinal_ = ordinal; }
void print(std::ostream &out, unsigned indent = 0) const override;
void pretty_print(std::ostream &out) const override;
void serialize_to(serializer &s) const override;
Expand Down Expand Up @@ -909,6 +912,7 @@ class kore_definition {
kore_composite_sortMapType hooked_sorts_;
kore_symbolStringMapType fresh_functions_;
KOREAxiomMapType ordinals_;
std::vector<kore_composite_sort *> all_sorts_;

std::vector<sptr<kore_module>> modules_;
attribute_set attributes_;
Expand Down Expand Up @@ -1024,6 +1028,10 @@ class kore_definition {
return fresh_functions_;
}
kore_symbol *get_inj_symbol() { return inj_symbol_; }

std::vector<kore_composite_sort *> const &get_all_sorts() const {
return all_sorts_;
}
};

void read_multimap(
Expand Down
3 changes: 3 additions & 0 deletions include/kllvm/binary/serializer.h
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
#ifndef AST_SERIALIZER_H
#define AST_SERIALIZER_H

#include <kllvm/ast/AST.h>
#include <kllvm/binary/version.h>

#include <array>
Expand Down Expand Up @@ -138,6 +139,8 @@ void serializer::emit(T val) {
next_idx_ += sizeof(T);
}

void emit_kore_rich_header(std::ostream &os, kore_definition *definition);

} // namespace kllvm

#endif
10 changes: 10 additions & 0 deletions lib/ast/definition.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -252,7 +252,9 @@ void kore_definition::preprocess() {
}

uint32_t next_symbol = 0;
uint32_t next_sort = 0;
uint16_t next_layout = 1;
auto sorts = std::unordered_map<kore_composite_sort, uint32_t, hash_sort>{};
auto instantiations
= std::unordered_map<kore_symbol, uint32_t, hash_symbol>{};
auto layouts = std::unordered_map<std::string, uint16_t>{};
Expand All @@ -262,6 +264,14 @@ void kore_definition::preprocess() {
uint32_t first_tag = next_symbol;
for (auto *symbol : entry.second) {
if (symbol->is_concrete()) {
for (auto const &sort : symbol->get_arguments()) {
auto *ctr = dynamic_cast<kore_composite_sort *>(sort.get());
if (!sorts.contains(*ctr)) {
sorts.emplace(*ctr, next_sort++);
all_sorts_.push_back(ctr);
}
ctr->set_ordinal(sorts[*ctr]);
}
if (!instantiations.contains(*symbol)) {
instantiations.emplace(*symbol, next_symbol++);
}
Expand Down
66 changes: 66 additions & 0 deletions lib/binary/serializer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -144,4 +144,70 @@ void serializer::emit_direct_string(std::string const &s) {
next_idx_ += s.size();
}

void emit_kore_rich_header(std::ostream &os, kore_definition *definition) {
const uint32_t version = 1;
const uint32_t num_tags = definition->get_symbols().size();
const uint32_t num_sorts = definition->get_all_sorts().size();
const uint32_t num_strings = num_tags + num_sorts + 1;
const uint32_t num_symbols = num_tags + num_sorts;

os.write("\177KR2", 4);
os.write(reinterpret_cast<char const *>(&version), 4);
os.write(reinterpret_cast<char const *>(&num_strings), 4);
os.write(reinterpret_cast<char const *>(&num_sorts), 4);
os.write(reinterpret_cast<char const *>(&num_symbols), 4);

for (uint32_t i = 0; i < num_tags; i++) {
auto name = definition->get_symbols().at(i)->get_name();
const uint32_t len = name.size();
os.write(reinterpret_cast<char const *>(&len), 4);
os << name;
os.put('\000');
}
for (uint32_t i = 0; i < num_sorts; i++) {
auto name = definition->get_all_sorts()[i]->get_name();
const uint32_t len = name.size();
os.write(reinterpret_cast<char const *>(&len), 4);
os << name;
os.put('\000');
}
auto const *name = "\\dv";
const uint32_t len = 3;
os.write(reinterpret_cast<char const *>(&len), 4);
os.write(name, 4);

for (uint32_t i = 0; i < num_sorts; i++) {
uint32_t idx = i + num_tags;
os.write(reinterpret_cast<char const *>(&idx), 4);
if (!definition->get_all_sorts()[i]->get_arguments().empty()) {
throw std::runtime_error(
"cannot yet serialize sorts with sort parameters");
}
os.put('\000');
}

for (uint32_t i = 0; i < num_tags; i++) {
os.write(reinterpret_cast<char const *>(&i), 4);
auto const &symbol = definition->get_symbols().at(i);
int const num_params = symbol->get_formal_arguments().size();
os.put((uint8_t)num_params);
os.put((uint8_t)symbol->get_arguments().size());
for (int j = 0; j < num_params; j++) {
uint32_t ordinal = dynamic_cast<kore_composite_sort *>(
symbol->get_formal_arguments()[j].get())
->get_ordinal();
os.write(reinterpret_cast<char const *>(&ordinal), 4);
}
}
for (uint32_t i = 0; i < num_sorts; i++) {
uint32_t idx = num_strings - 1;
os.write(reinterpret_cast<char const *>(&idx), 4);
int const num_params = 1;
os.put((uint8_t)num_params);
os.put((uint8_t)num_params);
uint32_t ordinal = definition->get_all_sorts()[i]->get_ordinal();
os.write(reinterpret_cast<char const *>(&ordinal), 4);
}
}

} // namespace kllvm
Loading

0 comments on commit 6dbf12e

Please sign in to comment.