From 691a95af82767cd61f6ffe71581cf4c6533c1287 Mon Sep 17 00:00:00 2001 From: Alex Shabalin Date: Tue, 13 Aug 2024 15:36:44 +0200 Subject: [PATCH] Move the persist docs --- doc/persist.rst | 235 ++++++++++++++++++++++++++++++++- immer/extra/persist/README.rst | 234 -------------------------------- 2 files changed, 234 insertions(+), 235 deletions(-) mode change 120000 => 100644 doc/persist.rst delete mode 100644 immer/extra/persist/README.rst diff --git a/doc/persist.rst b/doc/persist.rst deleted file mode 120000 index 28a2b9ae..00000000 --- a/doc/persist.rst +++ /dev/null @@ -1 +0,0 @@ -../immer/extra/persist/README.rst \ No newline at end of file diff --git a/doc/persist.rst b/doc/persist.rst new file mode 100644 index 00000000..845ac78b --- /dev/null +++ b/doc/persist.rst @@ -0,0 +1,234 @@ + +Persist +=============== + +This library preserves the structural sharing of immer containers while serializing and deserializing them. + + +Motivation: serialization +------------------------- + +Structural sharing allows immer containers to be efficient. At runtime, two distinct containers can be operated on independently, but internally they share nodes and +use memory efficiently in that way. But when such containers are serialized in a simple, direct way, for example, as lists, this sharing is lost: they become truly +independent, and the same data is stored multiple times on disk and later, when it is read from disk, in memory. + +This library operates on the internal structure of immer containers, allowing it to be serialized and deserialized (and also transformed). That allows for more efficient +storage (especially when many nodes are reused) and, even more importantly, for preserving structural sharing after deserializing the containers. + + +Motivation: transformation +-------------------------- + +Imagine this scenario: an application has a document type that uses an immer container internally in multiple places, for example, a vector of strings. 
Some of these vectors +would be completely identical, some would differ in just a few elements (stored in an undo history, for example). And we want to run a transformation function +over these vectors. + +A direct approach would be to take each vector and create a new vector by applying the transformation function to each element. But after this, all the structural sharing +of the original containers would be lost: we would have multiple independent vectors without any structural sharing. + +This library applies the transformation function directly to the nodes, which preserves structural sharing. Additionally, no matter how many times +a node is reused, the transformation needs to be performed only once. + +.. _first-example: + +First example +------------- + +For this example, we'll use a ``document`` type that contains two immer vectors. + +.. literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: intro/start-types + :end-before: intro/end-types + +Let's say we have two vectors ``v1`` and ``v2``, where ``v2`` is derived from ``v1`` so that it shares data with it: + +.. literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: intro/start-prepare-value + :end-before: intro/end-prepare-value + +We can serialize the document using ``cereal`` with this: + +.. literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: intro/start-serialize-with-cereal + :end-before: intro/end-serialize-with-cereal + +This generates JSON like the following: + +.. code-block:: json + + {"value0": {"ints": [1, 2, 3], "ints2": [1, 2, 3, 4, 5, 6]}} + +As you can see, ``ints`` and ``ints2`` contain the full linearization of each vector. +The structural sharing between these two data structures is not represented in their +serialized form. However, with ``immer-persist`` we can serialize it with: + +.. 
literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: intro/start-serialize-with-persist + :end-before: intro/end-serialize-with-persist + +Which generates some JSON like this: + +.. literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: include:intro/start-persist-json + :end-before: include:intro/end-persist-json + +As you can see, the value is serialized with every ``immer`` container replaced by an identifier. +This identifier is a key into a pool, which is serialized just after. + +A pool represents a *set* of ``immer`` containers of a given type. For example, we may have a pool that contains all +``immer::vector`` of our document. You can think of it as a little database of ``immer`` containers. When +serializing the pool, the internal structure of all those ``immer`` containers is written, preserving the structural +sharing between those containers. The nodes of the trees that implement the ``immer`` containers are represented +directly in the JSON and, because we are representing all the containers as a whole, those nodes that are referenced in +multiple trees can be stored only once. That same structure is preserved when reading the pool back from disk and +reconstructing the vectors (and other containers) from it, thus allowing us to preserve the structural sharing across +sessions. + +.. note:: + Currently, ``immer-persist`` makes a distinction between pools used for saving containers (*output* pools) and for loading containers (*input* pools), + similar to ``cereal`` with its ``InputArchive`` and ``OutputArchive`` distinction. + +Currently, ``immer-persist`` focuses on JSON as the serialization format and uses the ``cereal`` library internally. In principle, other formats +and serialization libraries could be supported in the future. + + +Custom policy +------------- + +We can use a policy to control the names of the pools for each container. 
+ +For this example, let's define a new document type ``doc_2``. It will also contain another type ``extra_data`` with a ``vector`` of ``strings`` in it. +To demonstrate the responsibilities of the policy, the ``doc_2`` type will not be a ``boost::hana::Struct`` and will not support compile-time reflection. + +.. literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: include:start-doc_2-type + :end-before: include:end-doc_2-type + +We define the ``doc_2_policy`` as follows: + +.. literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: include:start-doc_2_policy + :end-before: include:end-doc_2_policy + +The ``get_pool_types`` function returns the types of containers that should be serialized with pools; in this case, these are the ``vector`` of ``ints`` and the ``vector`` of ``strings``. +The ``save`` and ``load`` functions control the name of the document node; in this case, it is ``doc2_value``. +The overloaded ``get_pool_name`` functions supply the name of the pool for each corresponding ``immer`` container. +We can create and serialize a value of ``doc_2`` like this: + +.. literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: include:start-doc_2-cereal_save_with_pools + :end-before: include:end-doc_2-cereal_save_with_pools + +The serialized JSON looks like this: + +.. literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: include:start-doc_2-json + :end-before: include:end-doc_2-json + +And it can also be loaded from JSON like this: + +.. literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: include:start-doc_2-load + :end-before: include:end-doc_2-load + +This example also demonstrates a case where the main document type ``doc_2`` contains another type ``extra_data`` with a ``vector``. +As you can see in the resulting JSON, nested types are also serialized with pools: ``"extra": {"comments": 1}``. 
Only the ID of the ``comments`` ``vector`` +is serialized instead of its content. + + +Transformations with pools +-------------------------- + +Suppose we want to apply certain transforming functions to the ``immer`` containers inside of a large document type. +The most straightforward way would be to simply create new containers with the new data, running the transforming +function over each element. However, this approach has some disadvantages: + +- All new containers would be independent: no structural sharing would be preserved, and the same data would be stored + multiple times. +- The transformation would be applied more times than necessary when some of the data is shared. For example, when one vector + is built by appending elements to another vector, the shared elements would be transformed multiple times, which is + unnecessary. + +Let's look at a simple case using the document from the :ref:`first-example`. The desired transformation would be to +multiply each element of the ``immer::vector`` by 10. + +First, the document value would be created in the same way: + +.. literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: intro/start-prepare-value + :end-before: intro/end-prepare-value + +The next component we need is the pools of all the containers from the value: + +.. literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: start-get_auto_pool + :end-before: end-get_auto_pool + +The ``get_auto_pool`` function returns the output pools of all ``immer`` containers that would be serialized using +pools, as controlled by the policy. Here we use the default policy ``hana_struct_auto_policy``, which will use pools for +all ``immer`` containers inside of the document type, which must be a ``hana::Struct``. + +The other required component is the ``conversion_map``: + +.. 
literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: start-conversion_map + :end-before: end-conversion_map + +This is a ``hana::map`` that describes the desired transformations to be applied. Each key of the map is an ``immer`` +container type and the value is the function to be applied to each element of the corresponding container type. In this case, +it will apply ``[](int val) { return val * 10; }`` to each ``int`` of the ``vector_one`` type; we have two of those in +the ``document``. + +Having these two parts, we can create the new pools with the transformations: + +.. literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: start-transformed_pools + :end-before: end-transformed_pools + +At this point, we can start converting the ``immer`` containers and create the transformed document value with them, +``new_value``: + +.. literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: start-convert-containers + :end-before: end-convert-containers + +To confirm that the structural sharing has been preserved after applying the transformations, let's serialize +the ``new_value`` and inspect the JSON: + +.. literalinclude:: ../test/extra/persist/test_for_docs.cpp + :language: c++ + :start-after: start-save-new_value + :end-before: end-save-new_value + +And indeed, we can see in the JSON that the node ``{"key": 2, "value": [10, 20]}`` is reused in both vectors. + + +Policy +------ + +.. doxygengroup:: Persist-policy + :project: immer + :content-only: + + +API Overview +------------ + +.. 
doxygengroup:: persist-api + :project: immer + :content-only: diff --git a/immer/extra/persist/README.rst b/immer/extra/persist/README.rst deleted file mode 100644 index 845ac78b..00000000 --- a/immer/extra/persist/README.rst +++ /dev/null @@ -1,234 +0,0 @@ - -Persist -=============== - -This library allows to preserve structural sharing of immer containers while serializing and deserializing them. - - -Motivation: serialization ----------- - -Structural sharing allows immer containers to be efficient. In runtime, two distinct containers can be operated on independently but internally they share nodes and -use memory efficiently in that way. But when such containers are serialized in a simple direct way, for example, as lists, this sharing is lost: they become truly -independent, same data is stored multiple times on disk and later, when it is read from disk, in memory. - -This library operates on the internal structure of immer containers: allowing it to be serialized and deserialized (and also transformed). That allows for more efficient -storage (especially, in case when a lot of nodes are reused) and, even more importantly, for preserving structural sharing after deserializing the containers. - - -Motivation: transformation ----------- - -Imagine this scenario: an application has a document type that uses an immer container internally in multiple places, for example, a vector of strings. Some of these vectors -would be completely identical, some would have just a few elements different (stored in an undo history, for example). And we want to run a transformation function -over these vectors. - -A direct approach would be to take each vector and create a new vector applying the transformation function for each element. But after this, all the structural sharing -of the original containers would be lost: we will have multiple independent vectors without any structural sharing. 
- -This library allows to apply the transformation function directly on the nodes which allows to preserve structural sharing. Additionally, it doesn't matter how many times -a node is reused, the transformation needs to be performed only once. - -.. _first-example: - -First example -------------- - -For this example, we'll use a `document` type that contains two immer vectors. - -.. literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: intro/start-types - :end-before: intro/end-types - -Let's say we have two vectors ``v1`` and ``v2``, where ``v2`` is derived from ``v1`` so that it shares data with it: - -.. literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: intro/start-prepare-value - :end-before: intro/end-prepare-value - -We can serialize the document using ``cereal`` with this: - -.. literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: intro/start-serialize-with-cereal - :end-before: intro/end-serialize-with-cereal - -Generating a JSON like this one: - -.. code-block:: c++ - - {"value0": {"ints": [1, 2, 3], "ints2": [1, 2, 3, 4, 5, 6]}} - -As you can see, ``ints`` and ``ints2`` contain the full linearization of each vector. -The structural sharing between these two data structures is not represented in its -serialized form. However, with ``immer-persist`` we can serialize it with: - -.. literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: intro/start-serialize-with-persist - :end-before: intro/end-serialize-with-persist - -Which generates some JSON like this: - -.. literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: include:intro/start-persist-json - :end-before: include:intro/end-persist-json - -As you can see, the value is serialized with every ``immer`` container replaced by an identifier. -This identifier is a key into a pool, which is serialized just after. 
- -A pool represents a *set* of ``immer`` containers of a given type. For example, we may have a pool that contains all -``immer::vector`` of our document. You can think of it as a little database of ``immer`` containers. When -serializing the pool, the internal structure of all those ``immer`` containers is written, preserving the structural -sharing between those containers. The nodes of the trees that implement the ``immer`` containers are represented -directly in the JSON and, because we are representing all the containers as a whole, those nodes that are referenced in -multiple trees can be stored only once. That same structure is preserved when reading the pool back from disk and -reconstructing the vectors (and other containers) from it, thus allowing us to preserve the structural sharing across -sessions. - -.. note:: - Currently, ``immer-persist`` makes a distiction between pools used for saving containers (*output* pools) and for loading containers (*input* pools), - similar to ``cereal`` with its ``InputArchive`` and ``OutputArchive`` distiction. - -Currently, ``immer-persist`` focuses on JSON as the serialization format and uses the ``cereal`` library internally. In principle, other formats -and serialization libraries could be supported in the future. - - -Custom policy ----------- - -We can use policy to control the names of the pools for each container. - -For this example, let's define a new document type ``doc_2``. It will also contain another type ``extra_data`` with a ``vector`` of ``strings`` in it. -To demonstrate the responsibilities of the policy, the ``doc_2`` type will not be a ``boost::hana::Struct`` and will not allow for a compile-time reflection. - -.. literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: include:start-doc_2-type - :end-before: include:end-doc_2-type - -We define the ``doc_2_policy`` as following: - -.. 
literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: include:start-doc_2_policy - :end-before: include:end-doc_2_policy - -The ``get_pool_types`` function returns the types of containers that should be serialized with pools, in this case it's both ``vector`` of ``ints`` and ``strings``. -The ``save`` and ``load`` functions control the name of the document node, in this case it is ``doc2_value``. -And the ``get_pool_name`` overloaded functions supply the name of the pool for each corresponding ``immer`` container. -We can create and serialize a value of ``doc_2`` like this: - -.. literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: include:start-doc_2-cereal_save_with_pools - :end-before: include:end-doc_2-cereal_save_with_pools - -The serialized JSON looks like this: - -.. literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: include:start-doc_2-json - :end-before: include:end-doc_2-json - -And it can also be loaded from JSON like this: - -.. literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: include:start-doc_2-load - :end-before: include:end-doc_2-load - -This example also demonstrates a case where the main document type ``doc_2`` contains another type ``extra_data`` with a ``vector``. -As you can see in the resulting JSON, nested types are also serialized with pools: ``"extra": {"comments": 1}``. Only the ID of the ``comments`` ``vector`` -is serialized instead of its content. - - -Transformations with pools --------------------------- - -Suppose, we want to apply certain transforming functions to the ``immer`` containers inside of a large document type. -The most straightforward way would be to simply create new containers with the new data, running the transforming -function over each element. 
However, this approach has some disadvantages: - -- All new containers will be independent, no structural sharing will be preserved and the same data would be stored - multiple times. -- The transformation would be applied more times than necessary when some of the data is shared. Example: one vector - is built by appending elements to the other vector. Transforming shared elements multiple times could be - unnecessary. - -Let's look at a simple case using the document from the :ref:`first-example`. The desired transformation would be to -multiply each element of the ``immer::vector`` by 10. - -First, the document value would be created in the same way: - -.. literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: intro/start-prepare-value - :end-before: intro/end-prepare-value - -The next component we need is the pools of all the containers from the value: - -.. literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: start-get_auto_pool - :end-before: end-get_auto_pool - -The ``get_auto_pool`` function returns the output pools of all ``immer`` containers that would be serialized using -pools, as controlled by the policy. Here we use the default policy ``hana_struct_auto_policy`` which will use pools for -all ``immer`` containers inside of the document type which must be a ``hana::Struct``. - -The other required component is the ``conversion_map``: - -.. literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: start-conversion_map - :end-before: end-conversion_map - -This is a ``hana::map`` that describes the desired transformations to be applied. The key of the map is an ``immer`` -container and the value is the function to be applied to each element of the corresponding container type. In this case, -it will apply ``[](int val) { return val * 10; }`` to each ``int`` of the ``vector_one`` type, we have two of those in -the ``document``. 
- -Having these two parts, we can create the new pools with the transformations: - -.. literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: start-transformed_pools - :end-before: end-transformed_pools - -At this point, we can start converting the ``immer`` containers and create the transformed document value with them, -``new_value``: - -.. literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: start-convert-containers - :end-before: end-convert-containers - -In order to confirm that the structural sharing has been preserved after applying the transformations, let's serialize -the ``new_value`` and inspect the JSON: - -.. literalinclude:: ../test/extra/persist/test_for_docs.cpp - :language: c++ - :start-after: start-save-new_value - :end-before: end-save-new_value - -And indeed, we can see in the JSON that the node ``{"key": 2, "value": [10, 20]}`` is reused in both vectors. - - -Policy ------- - -.. doxygengroup:: Persist-policy - :project: immer - :content-only: - - -API Overview ------------- - -.. doxygengroup:: persist-api - :project: immer - :content-only: