Move the persist docs

arximboldi · Aug 13, 2024 · 691a95a · 691a95a
1 parent ea67620
commit 691a95a
Showing 2 changed files with 234 additions and 235 deletions.
diff --git a/doc/persist.rst b/doc/persist.rst
@@ -0,0 +1,234 @@
+
+Persist
+===============
+
+This library preserves the structural sharing of immer containers when serializing and deserializing them.
+
+
+Motivation: serialization
+-------------------------
+
+Structural sharing is what makes immer containers efficient. At runtime, two distinct containers can be operated on independently while internally sharing nodes,
+which keeps memory usage low. But when such containers are serialized in a simple, direct way (for example, as lists), this sharing is lost: they become truly
+independent, and the same data is stored multiple times on disk and, once it is read back, in memory.
+
+This library operates on the internal structure of immer containers, allowing it to be serialized, deserialized, and transformed. This makes storage more efficient
+(especially when many nodes are reused) and, more importantly, preserves structural sharing after the containers are deserialized.
+
+
+Motivation: transformation
+--------------------------
+
+Imagine this scenario: an application has a document type that uses an immer container internally in multiple places, for example, a vector of strings. Some of these vectors
+would be completely identical, while others would differ in just a few elements (stored in an undo history, for example). We want to run a transformation function
+over these vectors.
+
+A direct approach would be to take each vector and create a new vector by applying the transformation function to each element. But after this, all the structural sharing
+of the original containers would be lost: we would end up with multiple independent vectors that share nothing.
+
+This library applies the transformation function directly to the nodes, which preserves structural sharing. Additionally, no matter how many times
+a node is reused, the transformation needs to be performed on it only once.
+
+.. _first-example:
+
+First example
+-------------
+
+For this example, we'll use a ``document`` type that contains two immer vectors.
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: intro/start-types
+   :end-before:  intro/end-types
+
+Let's say we have two vectors ``v1`` and ``v2``, where ``v2`` is derived from ``v1`` so that it shares data with it:
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: intro/start-prepare-value
+   :end-before:  intro/end-prepare-value
+
+We can serialize the document using ``cereal`` with this:
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: intro/start-serialize-with-cereal
+   :end-before:  intro/end-serialize-with-cereal
+
+This generates JSON like the following:
+
+.. code-block:: json
+
+   {"value0": {"ints": [1, 2, 3], "ints2": [1, 2, 3, 4, 5, 6]}}
+
+As you can see, ``ints`` and ``ints2`` contain the full linearization of each vector.
+The structural sharing between these two data structures is not represented in their
+serialized form. However, with ``immer-persist`` we can serialize the document with:
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: intro/start-serialize-with-persist
+   :end-before:  intro/end-serialize-with-persist
+
+The result is JSON like this:
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: include:intro/start-persist-json
+   :end-before:  include:intro/end-persist-json
+
+As you can see, the value is serialized with every ``immer`` container replaced by an identifier.
+This identifier is a key into a pool, which is serialized just after.
+
+A pool represents a *set* of ``immer`` containers of a given type. For example, we may have a pool that contains all
+``immer::vector<int>`` of our document. You can think of it as a little database of ``immer`` containers. When
+serializing the pool, the internal structure of all those ``immer`` containers is written, preserving the structural
+sharing between those containers. The nodes of the trees that implement the ``immer`` containers are represented
+directly in the JSON and, because we are representing all the containers as a whole, those nodes that are referenced in
+multiple trees can be stored only once. That same structure is preserved when reading the pool back from disk and
+reconstructing the vectors (and other containers) from it, thus allowing us to preserve the structural sharing across
+sessions.
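+
+The idea of a pool can be sketched with the standard library alone. The following is a simplified model, not the
+implementation used by ``immer-persist``: it assumes a container is just a flat sequence of shared leaves (real ``immer``
+vectors are trees), and the names ``saved_pool``, ``pool_builder``, ``load_all`` and ``save_example`` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-in for an immer container: a sequence of shared, immutable
// leaves. This is an assumption for illustration, NOT immer's real layout.
using leaf      = std::vector<int>;
using leaf_ptr  = std::shared_ptr<const leaf>;
using container = std::vector<leaf_ptr>;

// Plain data, comparable to what would be written to disk.
struct saved_pool
{
    std::vector<leaf> leaves;                         // every distinct leaf, once
    std::vector<std::vector<std::size_t>> containers; // leaf ids per container
};

// Collects containers and assigns each distinct leaf a single id.
class pool_builder
{
public:
    // Registers a container and returns its id within the pool.
    std::size_t add(const container& c)
    {
        std::vector<std::size_t> leaf_ids;
        for (const auto& l : c) {
            auto [it, inserted] = ids_.try_emplace(l, data_.leaves.size());
            if (inserted)
                data_.leaves.push_back(*l); // first time we see this leaf
            leaf_ids.push_back(it->second);
        }
        data_.containers.push_back(std::move(leaf_ids));
        return data_.containers.size() - 1;
    }

    const saved_pool& data() const { return data_; }

private:
    saved_pool data_;
    // Keyed by the shared pointer itself, so registered leaves stay alive and
    // their addresses cannot be reused while the builder exists.
    std::map<leaf_ptr, std::size_t> ids_;
};

// Rebuilds all containers; each saved leaf becomes exactly one shared object,
// so containers that referenced the same leaf id share it again.
std::vector<container> load_all(const saved_pool& saved)
{
    std::vector<leaf_ptr> leaves;
    for (const auto& l : saved.leaves)
        leaves.push_back(std::make_shared<leaf>(l));

    std::vector<container> result;
    for (const auto& leaf_ids : saved.containers) {
        container c;
        for (auto id : leaf_ids)
            c.push_back(leaves.at(id));
        result.push_back(std::move(c));
    }
    return result;
}

// Usage sketch: v2 holds 1..6 and reuses the first leaf of v1, which holds 1..3.
saved_pool save_example()
{
    leaf_ptr a = std::make_shared<leaf>(leaf{1, 2});
    container v1{a, std::make_shared<leaf>(leaf{3})};
    container v2{a,
                 std::make_shared<leaf>(leaf{3, 4}),
                 std::make_shared<leaf>(leaf{5, 6})};
    pool_builder builder;
    builder.add(v1);
    builder.add(v2);
    return builder.data();
}
```

+In this model, the shared leaf is written once and referenced by id from both containers, and ``load_all`` hands the
+same leaf object back to both of them.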
+
+.. note::
+   Currently, ``immer-persist`` makes a distinction between pools used for saving containers (*output* pools) and for loading containers (*input* pools),
+   similar to ``cereal`` with its ``InputArchive`` and ``OutputArchive`` distinction.
+
+Currently, ``immer-persist`` focuses on JSON as the serialization format and uses the ``cereal`` library internally. In principle, other formats
+and serialization libraries could be supported in the future.
+
+
+Custom policy
+-------------
+
+We can use a policy to control the names of the pools for each container.
+
+For this example, let's define a new document type ``doc_2``. It will also contain another type ``extra_data`` with a ``vector`` of ``strings`` in it.
+To demonstrate the responsibilities of the policy, the ``doc_2`` type will not be a ``boost::hana::Struct`` and will not support compile-time reflection.
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: include:start-doc_2-type
+   :end-before:  include:end-doc_2-type
+
+We define the ``doc_2_policy`` as follows:
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: include:start-doc_2_policy
+   :end-before:  include:end-doc_2_policy
+
+The ``get_pool_types`` function returns the types of containers that should be serialized with pools; in this case, both the ``vector`` of ``ints`` and the ``vector`` of ``strings``.
+The ``save`` and ``load`` functions control the name of the document node; in this case, it is ``doc2_value``.
+The overloaded ``get_pool_name`` functions supply the name of the pool for each corresponding ``immer`` container.
+We can create and serialize a value of ``doc_2`` like this:
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: include:start-doc_2-cereal_save_with_pools
+   :end-before:  include:end-doc_2-cereal_save_with_pools
+
+The serialized JSON looks like this:
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: include:start-doc_2-json
+   :end-before:  include:end-doc_2-json
+
+And it can also be loaded from JSON like this:
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: include:start-doc_2-load
+   :end-before:  include:end-doc_2-load
+
+This example also demonstrates a case where the main document type ``doc_2`` contains another type ``extra_data`` with a ``vector``.
+As you can see in the resulting JSON, nested types are also serialized with pools: ``"extra": {"comments": 1}``. Only the ID of the ``comments`` ``vector``
+is serialized instead of its content.
+
+
+Transformations with pools
+--------------------------
+
+Suppose we want to apply certain transformation functions to the ``immer`` containers inside of a large document type.
+The most straightforward way would be to simply create new containers with the new data, running the transforming
+function over each element. However, this approach has some disadvantages:
+
+- All new containers will be independent, no structural sharing will be preserved and the same data would be stored
+  multiple times.
+- The transformation would be applied more times than necessary when some of the data is shared. Example: one vector
+  is built by appending elements to the other vector. Transforming shared elements multiple times could be
+  unnecessary.
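+
+How both disadvantages can be avoided is sketched below using only the standard library. This is a simplified
+model and not the ``immer-persist`` implementation: a container is assumed to be a flat sequence of shared leaves, and
+the class ``shared_transform`` is a hypothetical name. Each distinct leaf is converted once and the result is cached,
+so containers that shared a leaf before the transformation share the converted leaf afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-in for an immer container: a sequence of shared, immutable
// leaves. This is an assumption for illustration, NOT immer's real layout.
using leaf      = std::vector<int>;
using leaf_ptr  = std::shared_ptr<const leaf>;
using container = std::vector<leaf_ptr>;

// Applies a function to every element while converting each leaf only once.
class shared_transform
{
public:
    explicit shared_transform(std::function<int(int)> fn)
        : fn_{std::move(fn)}
    {
    }

    // Converts a whole container leaf by leaf.
    container operator()(const container& input)
    {
        container output;
        for (const auto& l : input)
            output.push_back(convert(l));
        return output;
    }

    // Number of times the element function has been invoked so far.
    std::size_t element_calls() const { return element_calls_; }

private:
    leaf_ptr convert(const leaf_ptr& old_leaf)
    {
        // A leaf that was already converted is reused as is.
        auto found = cache_.find(old_leaf);
        if (found != cache_.end())
            return found->second;

        leaf fresh;
        for (int x : *old_leaf) {
            fresh.push_back(fn_(x));
            ++element_calls_;
        }
        leaf_ptr new_leaf = std::make_shared<leaf>(std::move(fresh));
        cache_.emplace(old_leaf, new_leaf);
        return new_leaf;
    }

    std::function<int(int)> fn_;
    // Old leaf -> converted leaf. Holding the old pointer as the key keeps the
    // old leaf alive, so its address cannot be reused by a different leaf.
    std::map<leaf_ptr, leaf_ptr> cache_;
    std::size_t element_calls_ = 0;
};
```

+With a 3-element and a 6-element container that share one 2-element leaf, this sketch invokes the element function
+7 times instead of 9, and the converted leaf is shared by both results.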
+
+Let's look at a simple case using the document from the :ref:`first-example`. The desired transformation would be to
+multiply each element of the ``immer::vector<int>`` by 10.
+
+First, the document value would be created in the same way:
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: intro/start-prepare-value
+   :end-before:  intro/end-prepare-value
+
+The next component we need is the pools of all the containers from the value:
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: start-get_auto_pool
+   :end-before:  end-get_auto_pool
+
+The ``get_auto_pool`` function returns the output pools of all ``immer`` containers that would be serialized using
+pools, as controlled by the policy. Here we use the default policy ``hana_struct_auto_policy``, which will use pools for
+all ``immer`` containers inside the document type, which must be a ``hana::Struct``.
+
+The other required component is the ``conversion_map``:
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: start-conversion_map
+   :end-before:  end-conversion_map
+
+This is a ``hana::map`` that describes the desired transformations. Each key of the map is an ``immer``
+container type and each value is the function to be applied to every element of containers of that type. In this case,
+it will apply ``[](int val) { return val * 10; }`` to each ``int`` of the ``vector_one`` type; there are two such vectors in
+the ``document``.
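+
+For readers unfamiliar with type-keyed maps, the following standard-library sketch shows the general idea of looking
+up a function by container type at compile time. It is only an analogue of a ``hana::map``: it does not use Boost.Hana
+or ``immer``, it uses ``std::vector`` in place of an ``immer`` container, and the names ``entry``, ``make_entry``,
+``type_map``, ``find_fn`` and ``convert`` are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One map entry: the key is a type, the value is a function object.
template <class Key, class Fn>
struct entry
{
    Fn fn;
};

// Builds an entry for an explicitly given key type.
template <class Key, class Fn>
entry<Key, Fn> make_entry(Fn fn)
{
    return {std::move(fn)};
}

// The map simply inherits from all of its entries.
template <class... Entries>
struct type_map : Entries...
{
};

template <class... Entries>
type_map(Entries...) -> type_map<Entries...>;

// Finds the function stored for Key: template argument deduction selects the
// single base class entry<Key, Fn> of the map.
template <class Key, class Fn>
const Fn& find_fn(const entry<Key, Fn>& e)
{
    return e.fn;
}

// Applies the function registered for std::vector<T> to every element.
template <class Map, class T>
std::vector<T> convert(const Map& map, const std::vector<T>& input)
{
    const auto& fn = find_fn<std::vector<T>>(map);
    std::vector<T> output;
    for (const auto& x : input)
        output.push_back(fn(x));
    return output;
}
```

+A map built with ``type_map{make_entry<std::vector<int>>([](int v) { return v * 10; })}`` would then make
+``convert`` multiply every element of a ``std::vector<int>`` by 10, while a container type without an entry fails to
+compile.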
+
+Having these two parts, we can create the new pools with the transformations:
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: start-transformed_pools
+   :end-before:  end-transformed_pools
+
+At this point, we can start converting the ``immer`` containers and create the transformed document value with them,
+``new_value``:
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: start-convert-containers
+   :end-before:  end-convert-containers
+
+In order to confirm that the structural sharing has been preserved after applying the transformations, let's serialize
+the ``new_value`` and inspect the JSON:
+
+.. literalinclude:: ../test/extra/persist/test_for_docs.cpp
+   :language: c++
+   :start-after: start-save-new_value
+   :end-before:  end-save-new_value
+
+And indeed, we can see in the JSON that the node ``{"key": 2, "value": [10, 20]}`` is reused in both vectors.
+
+
+Policy
+------
+
+.. doxygengroup:: Persist-policy
+   :project: immer
+   :content-only:
+
+
+API Overview
+------------
+
+.. doxygengroup:: persist-api
+   :project: immer
+   :content-only: