diff --git a/doc/index.rst b/doc/index.rst
index 2c642ec3..f2029e8e 100644
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -21,8 +21,8 @@ blocks*, used to develop explicit memory and data management policies.
 
 The goals of AML are:
 
 * **composability**: application developers and performance experts should be
-  able to pick and choose the building blocks to use depending on their specific
-  needs.
+  able to pick and choose which building blocks to use depending on their
+  specific needs.
 * **flexibility**: users should be able to customize, replace, or change the
   configuration of each building block as much as possible.
@@ -36,7 +36,7 @@ AML currently implements the following abstractions:
 * :doc:`Area `, a set of addressable physical memories,
 * :doc:`Layout `, a description of data structure organization,
 * :doc:`Tiling `, a description of data blocking (decomposition)
-* :doc:`DMA `, an engine to asynchronously move data structures between areas,
+* :doc:`DMA `, an engine to asynchronously move data structures between areas.
 
 Each of these abstractions has several implementations. For instance, areas may
 refer to the usual DRAM or its subset, to GPU memory, or to non-volatile memory.
@@ -76,7 +76,7 @@ Installation
 Workflow
 ~~~~~~~~
 
-Include the AML header:
+Include the AML header:
 
 .. code-block:: c
 
@@ -93,7 +93,7 @@ Check the AML version:
 
      return 1;
   }
 
-Initialize and clean up the library:
+Initialize and clean up AML:
 
 .. code-block:: c
 
@@ -106,8 +106,8 @@ Initialize and clean up the library:
 
 Link your program with *-laml*.
 
-Check the above building-blocks-specific pages for further examples and
-information on the library features.
+See the above pages on specific building blocks for further examples and
+information on library features.
 Support
 -------
diff --git a/doc/pages/area_cuda_api.rst b/doc/pages/area_cuda_api.rst
index 78742f74..2c4c4511 100644
--- a/doc/pages/area_cuda_api.rst
+++ b/doc/pages/area_cuda_api.rst
@@ -1,4 +1,18 @@
 Area Cuda Implementation API
 =================================
+
+Cuda implementation of AML areas.
+
+.. code-block:: c
+
+   #include <aml/area/cuda.h>
+
+This building block relies on the Cuda implementation of
+malloc/free to provide mmap/munmap on device memory.
+Additional documentation of the Cuda runtime API can be found here:
+https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html
+
+AML Cuda areas may be created to allocate on the current or on a specific Cuda
+device. Allocations can also be private to a single device or shared across
+devices. Finally, allocations can be backed by host memory.
 
 .. doxygengroup:: aml_area_cuda
diff --git a/doc/pages/area_linux_api.rst b/doc/pages/area_linux_api.rst
index 8f8a61f0..da3c3767 100644
--- a/doc/pages/area_linux_api.rst
+++ b/doc/pages/area_linux_api.rst
@@ -1,4 +1,89 @@
-Area Linux Implementation API
-=================================
+Area Linux Implementation
+=========================
+
+This is the Linux implementation of AML areas.
+
+This building block relies on libnuma and the Linux mmap() / munmap() calls
+to provide mmap() / munmap() on NUMA host processor memory.
+New areas may be created to allocate from a specific subset of memories.
+This building block also includes a static declaration of a default initialized
+area that can be used out-of-the-box with the abstract area API.
+
+.. code-block:: c
+
+   #include <aml/area/linux.h>
+
+   // your custom mmap hook, built on top of the linux area operations
+   void* my_mmap(const struct aml_area_data* data, void* ptr, size_t size)
+   {
+      void* program_data = aml_area_linux_mmap(data->linux_data, ptr, size);
+      aml_area_linux_mbind(data->linux_data, program_data, size);
+      // additional work we want to do on top of area linux work
+      whatever_shark(data->my_data, program_data, size);
+      return program_data;
+   }
+
+   // same for munmap
+   int my_munmap(const struct aml_area_data* data, void* ptr, size_t size);
+
+   // build your custom area
+   struct aml_area_ops my_area_ops = {
+      .mmap = my_mmap,
+      .munmap = my_munmap,
+   };
+
+   struct aml_area my_area = {
+      .ops = &my_area_ops,
+      .data = &my_area_data,
+   };
+
+   void* program_data = aml_area_mmap(&my_area, NULL, size);
+
+And now you can call the generic API on your area.
+
+Area Linux API
+==============
 
 .. doxygengroup:: aml_area_linux
diff --git a/doc/pages/area_opencl_api.rst b/doc/pages/area_opencl_api.rst
index 0cf2cdce..df6105cb 100644
--- a/doc/pages/area_opencl_api.rst
+++ b/doc/pages/area_opencl_api.rst
@@ -1,4 +1,15 @@
 Area OpenCL Implementation API
 =================================
+
+OpenCL implementation of AML areas.
+
+.. code-block:: c
+
+   #include <aml/area/opencl.h>
+
+This building block relies on the OpenCL implementation of
+device memory allocation to provide mmap/munmap on device memory.
+Additional documentation of the OpenCL memory model can be found here:
+https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_memory_model
 
 .. doxygengroup:: aml_area_opencl
diff --git a/doc/pages/area_ze_api.rst b/doc/pages/area_ze_api.rst
index 014c8814..04038a26 100644
--- a/doc/pages/area_ze_api.rst
+++ b/doc/pages/area_ze_api.rst
@@ -1,4 +1,16 @@
 Area Level Zero Implementation API
 ==================================
+
+Implementation of AML areas with the Level Zero API.
+
+.. code-block:: c
+
+   #include <aml/area/ze.h>
+
+This building block relies on the Ze implementation of
+host and device memory mapping to provide mmap/munmap on device memory.
+Additional documentation of the Ze memory model can be found here:
+https://spec.oneapi.com/level-zero/latest/core/api.html#memory
 
 .. doxygengroup:: aml_area_ze
diff --git a/doc/pages/areas.rst b/doc/pages/areas.rst
index 8447fc04..63a4c718 100644
--- a/doc/pages/areas.rst
+++ b/doc/pages/areas.rst
@@ -1,10 +1,90 @@
 Areas: Addressable Physical Memories
 ====================================
 
+AML areas represent places where data can be stored.
+In shared memory systems, locality is a major concern for performance, so
+being able to request memory from specific places is of major interest.
+AML areas provide low-level mmap() / munmap() functions to request memory from
+specific places materialized as areas.
+Available area implementations dictate the way such places can be arranged and
+their properties.
+
+.. figure:: ../img/area.png
+   :width: 700px
+
+   Illustration of areas on a complex system.
+
+An AML area is an implementation of memory operations for several types of
+devices through a consistent abstraction.
+This abstraction is meant to be implemented for several kinds of devices, i.e.,
+the same function calls allocate memory on different kinds of devices depending
+on the area implementation used.
+
+With the high-level API, you can:
+
+* use an area to allocate space for your data,
+* release the data in this area.
+
+Example
+-------
+
+Let's look at how these operations can be done in a C program.
+
+.. code-block:: c
+
+   #include <aml.h>
+   #include <aml/area/linux.h>
+
+   int main(){
+      void* data = aml_area_mmap(&aml_area_linux, s);
+      do_work(data);
+      aml_area_munmap(data, s);
+   }
+
+We start by including the AML interface, as well as the area implementation we
+want to use.
+
+We then allocate space for the data of size s using the defaults of the AML
+Linux implementation.
+The data will only be visible to this process and bound to the CPU with the
+default Linux allocation policy.
+
+Finally, when the work on the data is done, we free it.
+
+Area API
+--------
+
+It is important to note that the functions provided through the Area API are
+low-level functions and are not optimized for performance the way allocators
+are.
+
 .. doxygengroup:: aml_area
+
 Implementations
 ---------------
+
+Advanced users may create or modify an implementation by assembling the
+appropriate operations in an aml_area_ops structure.
+
+The Linux implementation is the go-to choice for simple areas on NUMA CPUs
+running the Linux operating system.
+
+There is ongoing work on hwloc, CUDA, and OpenCL areas.
+
+Let's look at an example of dynamically creating a Linux area identical to the
+static default aml_area_linux:
+
+.. code-block:: c
+
+   #include <aml.h>
+   #include <aml/area/linux.h>
+
+   int main(){
+      struct aml_area* area;
+      aml_area_linux_create(&area, AML_AREA_LINUX_MMAP_FLAG_PRIVATE, NULL,
+                            AML_AREA_LINUX_BINDING_FLAG_DEFAULT);
+      do_work(area);
+      aml_area_linux_destroy(&area);
+   }
 
 .. toctree::
diff --git a/doc/pages/layout.rst b/doc/pages/layout.rst
index 329fc87e..664ddb80 100644
--- a/doc/pages/layout.rst
+++ b/doc/pages/layout.rst
@@ -1,6 +1,130 @@
 Layout: Description of Data Organization
 ========================================
 
+A layout describes how contiguous elements of a flat memory address space are
+organized into a multidimensional array of fixed-size elements.
+The abstraction provides functions to build layouts, access elements, reshape a
+layout, or subset a layout.
+
+A layout is characterized by:
+
+* a pointer to the data it describes,
+* a set of dimensions on which the data spans,
+* a stride between elements of a dimension,
+* a pitch indicating the space between contiguous elements of a dimension.
+
+For a definition of the rows and columns of matrices, see:
+https://en.wikipedia.org/wiki/Matrix_(mathematics)
+
+The figure below describes a 2D layout with a sub-layout (obtained with the
+aml_layout_slice() operation).
+The sub-layout has a stride of 1 element along the second dimension.
+The slice has an offset of 1 element along the same dimension, and its pitch is
+the pitch of the original layout.
+Calling aml_layout_deref() on this sub-layout with appropriate coordinates will
+return a pointer to the element noted (coord_x, coord_y).
+See aml_layout_slice().
+
+.. figure:: img/layout.png
+   :width: 400px
+
+   2D layout with a 2D slice.
+
+Access to specific elements of a layout can be done with the aml_layout_deref()
+function.
+Access to an element is always done relative to the dimensions' order set by
+the user at creation time.
+However, internally, the library will always store dimensions in such a way
+that elements along the first dimension are contiguous in memory.
+This order is defined with the value AML_LAYOUT_ORDER_COLUMN_MAJOR
+(AML_LAYOUT_ORDER_FORTRAN). See:
+https://en.wikipedia.org/wiki/Row-_and_column-major_order
+
+Additionally, AML provides access to elements without the overhead of the user
+order choice through the functions suffixed with "native".
+
+The layout abstraction also provides a function to reshape data with a
+different set of dimensions.
+A reshaped layout will access the same data but with different coordinates, as
+depicted in the figure below.
+
+.. figure:: img/reshape.png
+   :width: 700px
+
+   2D layout turned into a 3D layout.
+
+Example
+-------
+
+Let's look at a problem where layouts can be quite useful: DGEMM at multiple
+levels.
+Say you want to multiply matrix A (size [m, k]) by matrix B (size [k, n]) to
+get matrix C (size [m, n]).
+
+The naive matrix multiplication algorithm looks something like:
+
+.. code-block:: c
+
+   for (i = 0; i < m; i++){
+     for (j = 0; j < n; j++){
+       cij = C[i*n + j];
+       for (l = 0; l < k; l++)
+         cij += A[i*k + l] * B[l*n + j];
+       C[i*n + j] = cij;
+     }
+   }
+
+Unfortunately, this algorithm makes poor use of the memory hierarchy and does
+not perform well on large matrices.
+
+We can instead run 3 nested loops on blocks of the matrices.
+With several sizes of memory, we want to leverage the power of using blocks of
+different sizes.
+Let's take an algorithm with three levels of granularity.
+
+The first level is focused on fitting our blocks in the smallest cache.
+We compute a block of C of size [mr, nr], noted C_r, using a block of
+A of size [mr, kb], noted A_r, and a block of B of size [kb, nr], noted B_r.
+A_r is stored in column-major order while C_r and B_r are stored in row-major
+order, allowing us to read A_r row by row, and to go through B_r and C_r
+column by column.
+
+.. code-block:: c
+
+   for (i = 0; i < m_r; i++){
+     for (j = 0; j < n_r; j++){
+       for (l = 0; l < k_b; l++)
+         C_r[i][j] += A_r[i][l] * B_r[l][j];
+     }
+   }
+
+These are our smallest blocks.
+The implementation at this level simply does the multiplication at a level
+where it is fast enough.
+B_r blocks need to be transposed before they can be accessed column by column.
+
+The second level is for when the matrices are so big that you need a second
+level of caching.
+We then use blocks of intermediate size.
+We compute a block of C of size [mb, n], noted C_b, using a block
+of A of size [mb, kb], noted A_b, and a block of B of size [kb, n], noted B_b.
+To be efficient, A_b is stored as mb/mr consecutive blocks of size [mr, kb]
+(A_r) in column-major order, while C_b is stored as (mb/mr)*(n/nr) blocks of
+size [mr, nr] (C_r) in row-major order and B_b is stored as n/nr blocks of
+size [kb, nr] (B_r) in row-major order.
+
+This means we need A_b laid out as a 3-dimensional array mr x kb x (mb/mr),
+B as nr x kb x (n/nr), and C with 4 dimensions as nr x mr x (mb/mr) x (n/nr).
+
+The last level uses the actual matrices, of any size.
+The original matrices are C of size [m, n], A of size [m, k] and B of size
+[k, n].
+The layouts used here are: C stored as m/mb blocks of C_b, A stored as
+(k/kb) * (m/mb) blocks of A_b, and B stored as k/kb blocks of B_b.
+
+This means we need to rework A to be laid out in 5 dimensions as
+mr x kb x mb/mr x m/mb x k/kb,
+B in 4 dimensions as nr x kb x n/nr x k/kb, and
+C in 5 dimensions as nr x mr x mb/mr x n/nr x m/mb.
+
+High level API
+--------------
+
 .. doxygengroup:: aml_layout
 
 Implementations
diff --git a/doc/pages/tilings.rst b/doc/pages/tilings.rst
index 0558aa6a..0fe722fd 100644
--- a/doc/pages/tilings.rst
+++ b/doc/pages/tilings.rst
@@ -1,11 +1,25 @@
 Tilings: Decomposing Data
 ====================================
 
+Tiling is a representation of the decomposition of data structures.
+It identifies ways in which a layout can be split into layouts of smaller
+sizes.
+As such, the main function of a tiling is to provide an index into the
+subcomponents of a layout.
+Implementations focus on the ability to provide sub-layouts of different sizes
+at the corners, and on linearization of the index range.
+
+Tiling High Level API
+---------------------
+
 .. doxygengroup:: aml_tiling
 
 Implementations
 ---------------
 
+So far there are two implementations of the AML tiling, in 1D and in 2D:
+
 .. toctree::
 
   tiling_resize_api
diff --git a/include/aml.h b/include/aml.h
index 8448999f..ec3d7344 100644
--- a/include/aml.h
+++ b/include/aml.h
@@ -84,22 +84,6 @@ int aml_finalize(void);
 /**
  * @}
  * @defgroup aml_area "AML Area"
- * @brief Area High-Level API
- *
- * AML areas represent places where data can be stored.
- * In shared memory systems, locality is a major concern for performance.
- * Being able to query memory from specific places is of major interest
- * to achieve this goal. AML areas provide low-level mmap() / munmap() functions
- * to query memory from specific places materialized as areas. Available area
- * implementations dictate the way such places can be arranged and their
- * properties. It is important to note that the functions provided through the
- * Area API are low-level and are not optimized for performance as allocators
- * are.
- *
- * @image html area.png "Illustration of areas in a complex system." width=700cm
- *
- * @see aml_area_linux
- *
  * @{
  **/
@@ -223,56 +207,6 @@ int aml_area_fprintf(FILE *stream,
 /**
  * @}
  * @defgroup aml_layout "AML Layout"
- * @brief Low level description of data organization at the byte granularity.
- *
- * Layout describes how contiguous elements of a flat memory address space are
- * organized into a multidimensional array of elements of a fixed size.
- * The abstraction provide functions to build layouts, access elements,
- * reshape a layout, or subset a layout.
- *
- * Layouts are characterized by:
- * * A pointer to the data it describes
- * * A set of dimensions on which data spans.
- * * A stride in between elements of a dimension.
- * * A pitch indicating the space between contiguous elements of a dimension.
- *
- * For a definition of row and columns of matrices see :
- * https://en.wikipedia.org/wiki/Matrix_(mathematics)
- *
- * The figure below describes a 2D layout with a sub-layout
- * (obtained with aml_layout_slice()) operation. The sub-layout has a stride
- * of 1 element along the second dimension. The slice has an offset of 1 element
- * along the same dimension, and its pitch is the pitch of the original
- * layout. Calling aml_layout_deref() on this sublayout with appropriate
- * coordinates will return a pointer to elements noted (coor_x, coord_y).
- * @see aml_layout_slice()
- *
- * @image html layout.png "2D layout with a 2D slice." width=400cm
- *
- * Access to specific elements of a layout can be done with
- * the aml_layout_deref() function. Access to an element is always done
- * relatively to the dimensions' order set by the user at creation time.
- * However, internally, the library will always store dimensions in such a way
- * that elements along the first dimension
- * are contiguous in memory. This order is defined with the value
- * AML_LAYOUT_ORDER_COLUMN_MAJOR (AML_LAYOUT_ORDER_FORTRAN). See:
- * https://en.wikipedia.org/wiki/Row-_and_column-major_order
- * Additionally, AML provides access to elements without the overhead of user
- * order choice through function suffixed with "native".
- * @see aml_layout_deref()
- * @see aml_layout_deref_native()
- * @see aml_layout_dims_native()
- * @see aml_layout_slice_native()
- *
- * The layout abstraction also provides a function to reshape data
- * with a different set of dimensions. A reshaped layout will access
- * the same data but with different coordinates as depicted in the
- * figure below.
- * @see aml_layout_reshape()
- *
- * @image html reshape.png "2D layout turned into a 3D layout." width=700cm
- *
- * @see aml_layout_dense
  * @{
  **/
@@ -662,13 +596,6 @@ void aml_layout_destroy(struct aml_layout **layout);
 /**
  * @}
  * @defgroup aml_tiling "AML Tiling"
- * @brief Tiling Data Structure High-Level API
- *
- * Tiling is a representation of the decomposition of data structures. It
- * identifies ways a layout can be split into layouts of smaller size. As such,
- * the main function of a tiling is to provide an index into subcomponents of a
- * layout. Implementations focus on the ability to provide sublayouts of
- * different sizes at the corners, and linearization of the index range.
  * @{
  **/
diff --git a/include/aml/area/cuda.h b/include/aml/area/cuda.h
index 88a74596..6338fb58 100644
--- a/include/aml/area/cuda.h
+++ b/include/aml/area/cuda.h
@@ -17,21 +17,6 @@ extern "C" {
 /**
  * @defgroup aml_area_cuda "AML Cuda Areas"
- * @brief Cuda Implementation of Areas.
- * @code
- * #include <aml/area/cuda.h>
- * @endcode
- *
- * Cuda implementation of AML areas.
- * This building block relies on Cuda implementation of
- * malloc/free to provide mmap/munmap on device memory.
- * Additional documentation of cuda runtime API can be found here:
- * @see https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html
- *
- * AML cuda areas may be created to allocate current or specific cuda devices.
- * Also allocations can be private to a single device or shared across devices.
- * Finally allocations can be backed by host memory allocation.
- *
  * @{
  **/
diff --git a/include/aml/area/linux.h b/include/aml/area/linux.h
index 6d01cfb0..bb6fda45 100644
--- a/include/aml/area/linux.h
+++ b/include/aml/area/linux.h
@@ -17,19 +17,6 @@ extern "C" {
 /**
  * @defgroup aml_area_linux "AML Linux Areas"
- * @brief Linux Implementation of AML Areas.
- *
- * This building block relies on the libnuma implementation and
- * the Linux mmap() / munmap() to provide mmap() / munmap() on NUMA host
- * processor memory. New areas may be created
- * to allocate a specific subset of memories.
- * This building block also includes a static declaration of
- * a default initialized area that can be used out-of-the-box with
- * the abstract area API.
- *
- * @code
- * #include <aml/area/linux.h>
- * @endcode
  * @{
  **/
@@ -92,17 +79,17 @@ struct aml_area_linux_mmap_options {
 /**
  * \brief Linux area creation.
  *
- * Allocates and initializes struct aml_area implemented by aml_area_linux
+ * Allocate and initialize a struct aml_area implemented by aml_area_linux
  * operations.
- * @param[out] area pointer to an uninitialized struct aml_area pointer to
+ * @param[out] area: pointer to an uninitialized struct aml_area pointer to
  * receive the new area.
- * @param[in] nodemask list of memory nodes to use. Defaults to all allowed
+ * @param[in] nodemask: list of memory nodes to use. Defaults to all allowed
  * memory nodes if NULL.
- * @param[in] policy: The memory allocation policy to use when binding to
+ * @param[in] policy: the memory allocation policy to use when binding to
  * nodeset.
- * @return On success, returns 0 and fills "area" with a pointer to the new
+ * @return on success, returns 0 and fills "area" with a pointer to the new
  * aml_area.
- * @return On failure, fills "area" with NULL and returns one of AML error
+ * @return on failure, fills "area" with NULL and returns one of AML error
  * codes:
  * - AML_ENOMEM if there wasn't enough memory available.
  * - AML_EINVAL if input flags were invalid.
@@ -117,21 +104,21 @@ int aml_area_linux_create(struct aml_area **area,
 /**
  * \brief Linux area destruction.
  *
- * Destroys (finalizes and frees resources) struct aml_area created by
+ * Destroy (finalizes and frees resources) a struct aml_area created by
  * aml_area_linux_create().
  *
- * @param area address of an initialized struct aml_area pointer, which will be
+ * @param[in,out] area: address of an initialized struct aml_area pointer, which will be
  * reset to NULL on return from this call.
  **/
 void aml_area_linux_destroy(struct aml_area **area);
 
 /**
- * Binds memory of size "size" pointed to by "ptr" using the binding provided
+ * Bind memory of size "size" pointed to by "ptr" using the binding provided
  * in "bind". If the mbind() call was not successfull, i.e., AML_FAILURE is
  * returned, then "errno" should be inspected for further error information.
- * @param bind: The requested binding. "mmap_flags" is actually unused.
- * @param ptr: The memory to bind.
- * @param size: The size of the memory pointed to by "ptr".
+ * @param[in] bind: the requested binding. "mmap_flags" is actually unused.
+ * @param[in] ptr: the memory to bind.
+ * @param[in] size: the size of the memory pointed to by "ptr".
  * @return 0 if successful; an error code otherwise.
 **/
 int
@@ -140,12 +127,12 @@ aml_area_linux_mbind(struct aml_area_linux_data *bind,
 size_t size);
 
 /**
- * Checks whether the binding of a pointer obtained with
+ * Check whether the binding of a pointer obtained with
  * aml_area_linux_mmap() followed by aml_area_linux_mbind() matches the area
  * settings.
- * @param area_data: The expected binding settings.
- * @param ptr: The supposedly bound memory.
- * @param size: The memory size.
+ * @param[in] area_data: the expected binding settings.
+ * @param[in] ptr: the supposedly bound memory.
+ * @param[in] size: the memory size.
  * @return 1 if the mapped memory binding in "ptr" matches the "area_data"
  * binding settings, else 0.
 **/
@@ -161,10 +148,10 @@ aml_area_linux_check_binding(struct aml_area_linux_data *area_data,
  * "mmap_flags" of "area_data".
  * This function does not perform binding, unlike what is done in areas created
  * using aml_area_linux_create().
- * @param area_data: The structure containing "mmap_flags" for the mmap() call.
+ * @param[in] area_data: the structure containing "mmap_flags" for the mmap() call.
  * "nodemask" and "bind_flags" fields are ignored.
- * @param size: The size to allocate.
- * @param opts: See "aml_area_linux_mmap_options".
+ * @param[in] size: the size to allocate.
+ * @param[in] opts: see "aml_area_linux_mmap_options".
  * @return a valid memory pointer, or NULL on failure.
  * On failure, "errno" should be checked for further information.
 **/
@@ -177,9 +164,9 @@ aml_area_linux_mmap(const struct aml_area_data *area_data,
  * \brief munmap hook for AML area.
  *
  * Unmaps memory mapped with aml_area_linux_mmap().
- * @param area_data: unused
- * @param ptr: The virtual memory to unmap.
- * @param size: The size of the virtual memory to unmap.
+ * @param[in] area_data: unused.
+ * @param[in,out] ptr: the virtual memory to unmap.
+ * @param[in] size: the size of the virtual memory to unmap.
  * @return AML_SUCCESS on success, else AML_FAILURE.
  * On failure, "errno" should be checked for further information.
 **/
diff --git a/include/aml/area/opencl.h b/include/aml/area/opencl.h
index 6664dbeb..40872cb7 100644
--- a/include/aml/area/opencl.h
+++ b/include/aml/area/opencl.h
@@ -19,18 +19,6 @@ extern "C" {
 /**
  * @defgroup aml_area_opencl "AML OpenCL Areas"
- * @brief OpenCL Implementation of Areas.
- * @code
- * #include <aml/area/opencl.h>
- * @endcode
- *
- * OpenCL implementation of AML areas.
- * This building block relies on OpenCL implementation of
- * device memory allocation to provide mmap/munmap on device memory.
- * Additional documentation of OpenCL memory model can be found here:
- * @see
- *https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_memory_model
- *
  * @{
  **/
diff --git a/include/aml/area/ze.h b/include/aml/area/ze.h
index ce2cfe22..dd816b11 100644
--- a/include/aml/area/ze.h
+++ b/include/aml/area/ze.h
@@ -20,18 +20,6 @@ extern "C" {
 /**
  * @defgroup aml_area_ze "AML Level Zero Areas"
- * @brief Implementation of Areas with Level Zero API.
- * @code
- * #include <aml/area/ze.h>
- * @endcode
- *
- * Implementation of Areas with Level Zero API.
- * This building block relies on Ze implementation of
- * host and device memory mapping to provide mmap/munmap on device memory.
- * Additional documentation of Ze memory model can be found here:
- * @see
- * https://spec.oneapi.com/level-zero/latest/core/api.html#memory
- *
  * @{
  **/