Commit

Move the persist docs
alex-sparus committed Aug 13, 2024
1 parent ea67620 commit 691a95a
Showing 2 changed files with 234 additions and 235 deletions.
1 change: 0 additions & 1 deletion doc/persist.rst

This file was deleted.

234 changes: 234 additions & 0 deletions doc/persist.rst
@@ -0,0 +1,234 @@

Persist
===============

This library preserves the structural sharing of immer containers while serializing and deserializing them.


Motivation: serialization
-------------------------

Structural sharing is what makes immer containers efficient. At runtime, two distinct containers can be operated on independently while internally sharing nodes,
which keeps memory usage low. But when such containers are serialized in a simple, direct way (for example, as lists), this sharing is lost: they become truly
independent, and the same data is stored multiple times on disk and later, when it is read back, in memory.

This library operates on the internal structure of immer containers, allowing it to be serialized and deserialized (and also transformed). That makes storage more
efficient (especially when many nodes are reused) and, even more importantly, preserves structural sharing after the containers are deserialized.
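
To make the problem concrete, here is a minimal sketch that uses only the standard library. It is not how immer is implemented: ``node``, ``chunked_vector``, ``append_chunk``, ``flatten`` and ``stored_ints`` are hypothetical names, and a vector is modeled simply as a list of shared, immutable chunks. It shows that two vectors can share a chunk in memory, and that flattening each one into a plain list duplicates the shared data.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a leaf node: an immutable chunk of ints.
using node = std::shared_ptr<const std::vector<int>>;

// Hypothetical stand-in for a persistent vector: a sequence of shared chunks.
struct chunked_vector {
    std::vector<node> chunks;
};

// Returns a new vector that reuses every chunk of `base` and adds one more.
chunked_vector append_chunk(const chunked_vector& base, std::vector<int> values) {
    assert(!values.empty()); // this sketch never stores empty chunks
    chunked_vector result = base; // copies the pointers, not the ints
    result.chunks.push_back(
        std::make_shared<const std::vector<int>>(std::move(values)));
    return result;
}

// Naive serialization: flatten into a plain list, which forgets the chunks.
std::vector<int> flatten(const chunked_vector& v) {
    std::vector<int> out;
    for (const auto& chunk : v.chunks)
        out.insert(out.end(), chunk->begin(), chunk->end());
    return out;
}

// Counts the ints held in memory by several vectors, counting each shared
// chunk only once.
std::size_t stored_ints(const std::vector<chunked_vector>& vs) {
    std::set<const std::vector<int>*> seen;
    std::size_t total = 0;
    for (const auto& v : vs)
        for (const auto& chunk : v.chunks)
            if (seen.insert(chunk.get()).second)
                total += chunk->size();
    return total;
}
```

If ``v1`` holds one chunk of three ints and ``v2`` is built from ``v1`` by appending a second chunk of three, ``stored_ints`` reports six ints in memory, while the two flattened lists together hold nine.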


Motivation: transformation
--------------------------

Imagine this scenario: an application has a document type that uses an immer container internally in multiple places, for example, a vector of strings. Some of these vectors
would be completely identical, while others would differ in just a few elements (stored in an undo history, for example). And we want to run a transformation function
over these vectors.

A direct approach would be to take each vector and create a new vector applying the transformation function for each element. But after this, all the structural sharing
of the original containers would be lost: we will have multiple independent vectors without any structural sharing.

This library applies the transformation function directly to the nodes, which preserves structural sharing. Additionally, no matter how many times
a node is reused, the transformation needs to be performed on it only once.
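
The idea of transforming each node only once can be sketched with a small cache keyed by node identity. This is an illustration built on the standard library, not the library's actual mechanism: ``node``, ``chunked_vector`` and ``node_transformer`` are hypothetical names, and a vector is again modeled as a list of shared, immutable chunks.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins: an immutable chunk of ints and a vector of chunks.
using node = std::shared_ptr<const std::vector<int>>;

struct chunked_vector {
    std::vector<node> chunks;
};

// Applies a function to every element, converting each distinct node once and
// reusing the converted node wherever the original node was shared.
class node_transformer {
public:
    explicit node_transformer(std::function<int(int)> fn) : fn_(std::move(fn)) {}

    chunked_vector operator()(const chunked_vector& v) {
        chunked_vector out;
        for (const auto& chunk : v.chunks)
            out.chunks.push_back(convert(chunk));
        return out;
    }

    // Number of nodes that actually had to be converted so far.
    int nodes_converted() const { return converted_; }

private:
    node convert(const node& old_node) {
        assert(old_node != nullptr); // this sketch never stores null nodes
        auto it = cache_.find(old_node);
        if (it != cache_.end())
            return it->second; // already converted: reuse the result
        std::vector<int> values;
        for (int x : *old_node)
            values.push_back(fn_(x));
        node fresh = std::make_shared<const std::vector<int>>(std::move(values));
        cache_.emplace(old_node, fresh);
        ++converted_;
        return fresh;
    }

    std::function<int(int)> fn_;
    // Keyed by the original node; holding the key keeps the original alive,
    // so its address cannot be reused while the cache exists.
    std::map<node, node> cache_;
    int converted_ = 0;
};
```

Running one ``node_transformer`` over two vectors that share a chunk converts that chunk once, and both results point at the same converted chunk, so the sharing survives the transformation.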

.. _first-example:

First example
-------------

For this example, we'll use a ``document`` type that contains two immer vectors.

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: intro/start-types
   :end-before: intro/end-types

Let's say we have two vectors ``v1`` and ``v2``, where ``v2`` is derived from ``v1`` so that it shares data with it:

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: intro/start-prepare-value
   :end-before: intro/end-prepare-value

We can serialize the document using ``cereal`` with this:

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: intro/start-serialize-with-cereal
   :end-before: intro/end-serialize-with-cereal

This generates JSON like the following:

.. code-block:: json

   {"value0": {"ints": [1, 2, 3], "ints2": [1, 2, 3, 4, 5, 6]}}

As you can see, ``ints`` and ``ints2`` contain the full linearization of each vector.
The structural sharing between these two data structures is not represented in their
serialized form. However, with ``immer-persist`` we can serialize the document with:

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: intro/start-serialize-with-persist
   :end-before: intro/end-serialize-with-persist

Which generates some JSON like this:

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: include:intro/start-persist-json
   :end-before: include:intro/end-persist-json

As you can see, the value is serialized with every ``immer`` container replaced by an identifier.
This identifier is a key into a pool, which is serialized just after.

A pool represents a *set* of ``immer`` containers of a given type. For example, we may have a pool that contains all
``immer::vector<int>`` of our document. You can think of it as a little database of ``immer`` containers. When
serializing the pool, the internal structure of all those ``immer`` containers is written, preserving the structural
sharing between those containers. The nodes of the trees that implement the ``immer`` containers are represented
directly in the JSON and, because we are representing all the containers as a whole, those nodes that are referenced in
multiple trees can be stored only once. That same structure is preserved when reading the pool back from disk and
reconstructing the vectors (and other containers) from it, thus allowing us to preserve the structural sharing across
sessions.
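
The pool idea can be sketched with the standard library alone. The code below is a simplified illustration, not the format or API of ``immer-persist``: ``node``, ``chunked_vector`` and ``toy_pool`` are hypothetical names, a vector is modeled as a list of shared chunks, and saving and loading live in one type for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins: an immutable chunk of ints and a vector of chunks.
using node = std::shared_ptr<const std::vector<int>>;

struct chunked_vector {
    std::vector<node> chunks;
};

// Hypothetical pool: every distinct node is stored once, and each container
// is recorded as a list of node ids.
struct toy_pool {
    std::vector<node> nodes;                          // node id -> node
    std::vector<std::vector<std::size_t>> containers; // container id -> node ids
    std::map<node, std::size_t> node_ids;             // node -> node id

    // Adds a container and returns its id. A serialized document would store
    // this id in place of the container's contents.
    std::size_t add(const chunked_vector& v) {
        std::vector<std::size_t> ids;
        for (const auto& chunk : v.chunks) {
            assert(chunk != nullptr); // this sketch never stores null nodes
            auto result = node_ids.emplace(chunk, nodes.size());
            if (result.second)
                nodes.push_back(chunk); // first time this node is seen
            ids.push_back(result.first->second);
        }
        containers.push_back(std::move(ids));
        return containers.size() - 1;
    }

    // Rebuilds a container from its id. Containers that list the same node id
    // receive the very same node, so the sharing is restored.
    chunked_vector load(std::size_t container_id) const {
        chunked_vector v;
        for (std::size_t id : containers.at(container_id))
            v.chunks.push_back(nodes.at(id));
        return v;
    }
};
```

Adding two vectors that share a chunk stores that chunk once in ``nodes``, and loading both vectors back yields containers that point at the same chunk.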

.. note::
   Currently, ``immer-persist`` makes a distinction between pools used for saving containers (*output* pools) and for loading containers (*input* pools),
   similar to ``cereal`` with its ``InputArchive`` and ``OutputArchive`` distinction.

Currently, ``immer-persist`` focuses on JSON as the serialization format and uses the ``cereal`` library internally. In principle, other formats
and serialization libraries could be supported in the future.


Custom policy
-------------

We can use a policy to control the names of the pools for each container.

For this example, let's define a new document type ``doc_2``. It will also contain another type, ``extra_data``, with a ``vector`` of strings in it.
To demonstrate the responsibilities of the policy, the ``doc_2`` type will not be a ``boost::hana::Struct`` and so will not support compile-time reflection.

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: include:start-doc_2-type
   :end-before: include:end-doc_2-type

We define the ``doc_2_policy`` as follows:

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: include:start-doc_2_policy
   :end-before: include:end-doc_2_policy

The ``get_pool_types`` function returns the types of the containers that should be serialized with pools; in this case, these are both the ``vector`` of ints and the ``vector`` of strings.
The ``save`` and ``load`` functions control the name of the document node, which is ``doc2_value`` in this case.
The overloaded ``get_pool_name`` functions supply the name of the pool for each corresponding ``immer`` container.

We can create and serialize a value of ``doc_2`` like this:

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: include:start-doc_2-cereal_save_with_pools
   :end-before: include:end-doc_2-cereal_save_with_pools

The serialized JSON looks like this:

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: include:start-doc_2-json
   :end-before: include:end-doc_2-json

And it can also be loaded from JSON like this:

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: include:start-doc_2-load
   :end-before: include:end-doc_2-load

This example also demonstrates a case where the main document type ``doc_2`` contains another type ``extra_data`` with a ``vector``.
As you can see in the resulting JSON, nested types are also serialized with pools: ``"extra": {"comments": 1}``. Only the ID of the ``comments`` ``vector``
is serialized instead of its content.
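
The way overloaded ``get_pool_name`` functions pick a name per container type can be sketched without ``immer`` at all. The code below is only an illustration of overload resolution: ``toy_policy``, ``pool_name_for``, the ``std::vector`` stand-in types and the pool names are all hypothetical and are not taken from the library or from ``doc_2_policy``.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the two container types held by the document.
using vector_of_ints = std::vector<int>;
using vector_of_strings = std::vector<std::string>;

// Hypothetical policy: one overload per container type decides the pool name.
struct toy_policy {
    std::string get_pool_name(const vector_of_ints&) const {
        return "ints_pool";
    }
    std::string get_pool_name(const vector_of_strings&) const {
        return "strings_pool";
    }
};

// Generic code only needs the container's static type; the compiler selects
// the matching overload, so no runtime type lookup is involved.
template <class Policy, class Container>
std::string pool_name_for(const Policy& policy, const Container& container) {
    std::string name = policy.get_pool_name(container);
    assert(!name.empty()); // a pool needs a usable name to act as a JSON key
    return name;
}
```

With this shape, supporting a further container type means adding one more overload to the policy, and a container type without an overload is rejected at compile time.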


Transformations with pools
--------------------------

Suppose we want to apply certain transforming functions to the ``immer`` containers inside a large document type.
The most straightforward way would be to simply create new containers with the new data, running the transforming
function over each element. However, this approach has some disadvantages:

- All new containers will be independent: no structural sharing will be preserved, and the same data would be stored
  multiple times.
- The transformation would be applied more times than necessary when some of the data is shared. For example, when one
  vector is built by appending elements to another vector, transforming the shared elements more than once is
  unnecessary.

Let's look at a simple case using the document from the :ref:`first-example`. The desired transformation would be to
multiply each element of the ``immer::vector<int>`` by 10.

First, the document value would be created in the same way:

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: intro/start-prepare-value
   :end-before: intro/end-prepare-value

The next component we need is the pools of all the containers from the value:

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: start-get_auto_pool
   :end-before: end-get_auto_pool

The ``get_auto_pool`` function returns the output pools of all ``immer`` containers that would be serialized using
pools, as controlled by the policy. Here we use the default policy, ``hana_struct_auto_policy``, which uses pools for
all ``immer`` containers inside the document type; that type must be a ``hana::Struct``.

The other required component is the ``conversion_map``:

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: start-conversion_map
   :end-before: end-conversion_map

This is a ``hana::map`` that describes the desired transformations to be applied. The key of the map is an ``immer``
container type and the value is the function to be applied to each element of the corresponding container type. In this
case, it will apply ``[](int val) { return val * 10; }`` to each ``int`` of the ``vector_one`` type; the ``document``
contains two vectors of that type.

Having these two parts, we can create the new pools with the transformations:

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: start-transformed_pools
   :end-before: end-transformed_pools

At this point, we can start converting the ``immer`` containers and create the transformed document value with them,
``new_value``:

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: start-convert-containers
   :end-before: end-convert-containers

In order to confirm that the structural sharing has been preserved after applying the transformations, let's serialize
the ``new_value`` and inspect the JSON:

.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
   :language: c++
   :start-after: start-save-new_value
   :end-before: end-save-new_value

And indeed, we can see in the JSON that the node ``{"key": 2, "value": [10, 20]}`` is reused in both vectors.


Policy
------

.. doxygengroup:: Persist-policy
   :project: immer
   :content-only:


API Overview
------------

.. doxygengroup:: persist-api
   :project: immer
   :content-only: