diff --git a/_sources/api_for_sycl_kernels/random.rst b/_sources/api_for_sycl_kernels/random.rst index 9e82d86c7b..6e200cc01a 100644 --- a/_sources/api_for_sycl_kernels/random.rst +++ b/_sources/api_for_sycl_kernels/random.rst @@ -11,53 +11,79 @@ Random Number Engines --------------------- Random number engines use seed data as an entropy source to generate pseudo-random numbers. -|onedpl_short| provides several class templates for customized engines, they are defined in the header -````. - -============================== ========================================================================================================= -Engine Description -============================== ========================================================================================================= -``linear_congruential_engine`` Implements a linear congruential algorithm -``subtract_with_carry_engine`` Implements a subtract-with-carry algorithm -``discard_block_engine`` Implements a discard block adaptor -============================== ========================================================================================================= +|onedpl_short| provides several class templates for customizable engines, defined in the header +```` under the ``oneapi::dpl::`` namespace. + +=============================== ============================================ +Engine Description +=============================== ============================================ +``linear_congruential_engine`` Implements a linear congruential algorithm +``subtract_with_carry_engine`` Implements a subtract-with-carry algorithm +``discard_block_engine`` Implements a discard block adaptor +``experimental::philox_engine`` Implements a Philox algorithm +=============================== ============================================ Predefined Random Number Engines -------------------------------- -Predefined random number engines are instantiations of random number engines class templates. 
-The types below are defined in the header ```` under the ``oneapi::dpl::`` namespace. - -===================================================================== ========================================================================================================= -Type Description -===================================================================== ========================================================================================================= -``minstd_rand0`` ``oneapi::dpl::linear_congruential_engine`` -``minstd_rand`` ``oneapi::dpl::linear_congruential_engine`` -``ranlux24_base`` ``oneapi::dpl::subtract_with_carry_engine`` -``ranlux48_base`` ``oneapi::dpl::subtract_with_carry_engine`` -``ranlux24`` ``oneapi::dpl::discard_block_engine`` -``ranlux48`` ``oneapi::dpl::discard_block_engine`` -===================================================================== ========================================================================================================= - -The engines described below can efficiently generate vectors of random numbers. These types are -defined in the header ```` under the ``oneapi::dpl::`` namespace. 
- -===================================================================== ========================================================================================================= -Type Description -===================================================================== ========================================================================================================= -``template minstd_rand0_vec`` ``oneapi::dpl::linear_congruential_engine, 16807, 0, 2147483647>`` - minstd_rand0 for a vector generation case -``template minstd_rand_vec`` ``oneapi::dpl::linear_congruential_engine, 48271, 0, 2147483647>`` - minstd_rand for a vector generation case -``template ranlux24_base_vec`` ``oneapi::dpl::subtract_with_carry_engine, 24, 10, 24>`` - ranlux24_base for a vector generation case -``template ranlux48_base_vec`` ``oneapi::dpl::subtract_with_carry_engine, 48, 5, 12>`` - ranlux48_base for a vector generation case -``template ranlux24_vec`` ``oneapi::dpl::discard_block_engine, 223, 23>`` - ranlux24 for a vector generation case -``template ranlux48_vec`` ``oneapi::dpl::discard_block_engine, 389, 11>`` - ranlux48 for vector generation case -===================================================================== ========================================================================================================= +Predefined random number engines are instantiations of the random number engines class templates +with selected engine parameters. + +The types below are defined in the header ```` in the same namespaces as their +respective class templates. 
+ +================== ================================================================================= +Type Description +================== ================================================================================= +``minstd_rand0`` ``oneapi::dpl::linear_congruential_engine`` +``minstd_rand`` ``oneapi::dpl::linear_congruential_engine`` +``ranlux24_base`` ``oneapi::dpl::subtract_with_carry_engine`` +``ranlux48_base`` ``oneapi::dpl::subtract_with_carry_engine`` +``ranlux24`` ``oneapi::dpl::discard_block_engine`` +``ranlux48`` ``oneapi::dpl::discard_block_engine`` +``philox4x32`` ``oneapi::dpl::experimental::philox_engine`` +``philox4x64`` ``oneapi::dpl::experimental::philox_engine`` +================== ================================================================================= + +The following predefined engines can efficiently generate vectors of random numbers. +They differ from the scalar engines above by using ``sycl::vec`` as the data type, +while other engine parameters remain the same. + +================================================== =============================================================================================== +Type Description +================================================== =============================================================================================== +``template minstd_rand0_vec`` ``oneapi::dpl::linear_congruential_engine, 16807, 0, 2147483647>`` + + ``minstd_rand0`` that generates a vector. +-------------------------------------------------- ----------------------------------------------------------------------------------------------- +``template minstd_rand_vec`` ``oneapi::dpl::linear_congruential_engine, 48271, 0, 2147483647>`` + + ``minstd_rand`` that generates a vector. 
+-------------------------------------------------- ----------------------------------------------------------------------------------------------- +``template ranlux24_base_vec`` ``oneapi::dpl::subtract_with_carry_engine, 24, 10, 24>`` + + ``ranlux24_base`` that generates a vector. +-------------------------------------------------- ----------------------------------------------------------------------------------------------- +``template ranlux48_base_vec`` ``oneapi::dpl::subtract_with_carry_engine, 48, 5, 12>`` + + ``ranlux48_base`` that generates a vector. +-------------------------------------------------- ----------------------------------------------------------------------------------------------- +``template ranlux24_vec`` ``oneapi::dpl::discard_block_engine, 223, 23>`` + + ``ranlux24`` that generates a vector. +-------------------------------------------------- ----------------------------------------------------------------------------------------------- +``template ranlux48_vec`` ``oneapi::dpl::discard_block_engine, 389, 11>`` + + ``ranlux48`` that generates a vector. +-------------------------------------------------- ----------------------------------------------------------------------------------------------- +``template philox4x32_vec`` ``oneapi::dpl::experimental::philox_engine, 32, 4, 10, 0xCD9E8D57, 0x9E3779B9, 0xD2511F53, 0xBB67AE85>`` + + ``philox4x32`` that generates a vector. +-------------------------------------------------- ----------------------------------------------------------------------------------------------- +``template philox4x64_vec`` ``oneapi::dpl::experimental::philox_engine, 64, 4, 10, 0xCA5A826395121157, 0x9E3779B97F4A7C15, 0xD2E7470EE14C6C93, 0xBB67AE8584CAA73B>`` + + ``philox4x64`` that generates a vector. 
+================================================== =============================================================================================== Random Number Distributions --------------------------- @@ -66,9 +92,9 @@ Random number distributions process the output of random number engines in such resulting output is distributed according to a defined statistical probability density function. They are defined in the header ```` under the ``oneapi::dpl::`` namespace. -============================== ========================================================================================================= +============================== ============================================================================ Distribution Description -============================== ========================================================================================================= +============================== ============================================================================ ``uniform_int_distribution`` Produces integer values evenly distributed across a range ``uniform_real_distribution`` Produces real values evenly distributed across a range ``normal_distribution`` Produces real values according to the Normal (Gaussian) distribution @@ -79,7 +105,10 @@ Distribution Description ``lognormal_distribution`` Produces real values according to the Lognormal distribution ``extreme_value_distribution`` Produces real values according to the Extreme value (Gumbel) distribution ``cauchy_distribution`` Produces real values according to the Cauchy distribution -============================== ========================================================================================================= +============================== ============================================================================ + +.. 
note:: + ``bernoulli_distribution``, ``geometric_distribution``, and ``uniform_int_distribution`` can only be used on devices with FP64 support as they rely on double precision in their implementation (use ``sycl::aspect::fp64`` to check if the device supports FP64). Usage Model of |onedpl_short| Random Number Generation Functionality -------------------------------------------------------------------- diff --git a/_sources/api_for_sycl_kernels/tested_standard_cpp_api.rst b/_sources/api_for_sycl_kernels/tested_standard_cpp_api.rst index af87a010f9..cd7691bb60 100644 --- a/_sources/api_for_sycl_kernels/tested_standard_cpp_api.rst +++ b/_sources/api_for_sycl_kernels/tested_standard_cpp_api.rst @@ -463,9 +463,9 @@ C++ Standard API libstdc++ libc++ MSVC These tests were done for the following versions of the standard C++ library: ============================================= ============================================= -libstdc++(GNU) Provided with GCC*-7.5.0, GCC*-9.3.0 +libstdc++ (GNU) Provided with GCC*-7.5.0, GCC*-9.3.0 --------------------------------------------- --------------------------------------------- -libc++(LLVM) Provided with Clang*-11.0 +libc++ (LLVM) Provided with Clang*-11.0 --------------------------------------------- --------------------------------------------- Microsoft Visual C++* (MSVC) Standard Library Provided with Microsoft Visual Studio* 2017; Microsoft Visual Studio 2019; and Microsoft diff --git a/_sources/api_for_sycl_kernels/utility_function_object_classes.rst b/_sources/api_for_sycl_kernels/utility_function_object_classes.rst index b760041ebf..af1f368442 100644 --- a/_sources/api_for_sycl_kernels/utility_function_object_classes.rst +++ b/_sources/api_for_sycl_kernels/utility_function_object_classes.rst @@ -2,11 +2,11 @@ Utility Function Object Classes ################################## The definitions of the utility function objects are available through the -``oneapi/dpl/functional`` header. 
All function objects are implemented in the ``oneapi::dpl`` namespace.
+``<oneapi/dpl/functional>`` header. All function objects are implemented in the ``oneapi::dpl`` namespace.
 
-* ``identity``: A C++11 implementation of the C++20 ``std::identity`` function object type, where the operator() returns the
-  argument unchanged.
-* ``minimum``: A function object type where the operator() applies ``std::less`` to its arguments, then returns the
-  lesser argument unchanged.
-* ``maximum``: A function object type where the operator() applies ``std::greater`` to its arguments, then returns the
-  greater argument unchanged.
\ No newline at end of file
+* ``identity``: A function object type where the operator() returns the argument unchanged.
+  It is an implementation of ``std::identity`` that can be used prior to C++20.
+* ``minimum``: A function object type where the operator() applies ``std::less`` to its arguments,
+  then returns the lesser argument unchanged.
+* ``maximum``: A function object type where the operator() applies ``std::greater`` to its arguments,
+  then returns the greater argument unchanged.
\ No newline at end of file
diff --git a/_sources/api_for_sycl_kernels_main.rst b/_sources/api_for_sycl_kernels_main.rst
index 333f9e24c2..2a9aad44a7 100644
--- a/_sources/api_for_sycl_kernels_main.rst
+++ b/_sources/api_for_sycl_kernels_main.rst
@@ -1,5 +1,5 @@
-API for the SYCL* Kernels
-#########################
+API for SYCL* Kernels
+#####################
 
 |onedpl_long| (|onedpl_short|) includes the following APIs for SYCL* kernels:
 
@@ -13,9 +13,8 @@ API for SYCL* Kernels
 .. 
toctree:: :maxdepth: 2 :titlesonly: - :glob: :hidden: - api_for_sycl_kernels/random api_for_sycl_kernels/tested_standard_cpp_api - api_for_sycl_kernels/utility_function_object_classes \ No newline at end of file + api_for_sycl_kernels/random + api_for_sycl_kernels/utility_function_object_classes diff --git a/_sources/cmake_support.rst b/_sources/cmake_support.rst index 61eb16ef28..2d50cfa506 100644 --- a/_sources/cmake_support.rst +++ b/_sources/cmake_support.rst @@ -17,7 +17,10 @@ Requirements ============ The minimal supported CMake version for |onedpl_short| is 3.11 on Linux and 3.20 on Windows. -The supported `CMake Generator `_ for Linux is `Unix Makefiles `_ (default). In the Windows environment, the supported generator is `Ninja `_ as described in the `Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference `_ which may be specified via ``-GNinja``. +The supported `CMake Generator `_ +for Linux is `Unix Makefiles `_ (default). +In the Windows environment, the supported generator is `Ninja `_ +which may be specified via ``-GNinja`` as described in the |dpcpp_cmake_support|_. |onedpl_short| Backend Options ============================== diff --git a/_sources/dynamic_selection_api/auto_tune_policy.rst b/_sources/dynamic_selection_api/auto_tune_policy.rst index 4bdf3ea792..8ab8a2be46 100644 --- a/_sources/dynamic_selection_api/auto_tune_policy.rst +++ b/_sources/dynamic_selection_api/auto_tune_policy.rst @@ -1,248 +1,260 @@ -Auto-Tune Policy -################ - -The dynamic selection API is an experimental feature in the |onedpl_long| -(|onedpl_short|) that selects an *execution resource* based on a chosen -*selection policy*. There are several policies provided as part -of the API. Policies encapsulate the logic and any associated state needed -to make a selection. - -The auto-tune policy selects resources using runtime profiling. ``auto_tune_policy`` -is useful for determining which resource performs best -for a given kernel. 
The choice is made based on runtime performance -history, so this policy is only useful for kernels that have stable -performance. Initially, this policy acts like ``round_robin_policy``, -rotating through each resource (one or more times). Then, once it has -determined which resource is performing best, it uses that resource -thereafter. Optionally, a resampling interval can be set to return to -the profiling phase periodically. - -.. code:: cpp - - namespace oneapi::dpl::experimental { - - template - class auto_tune_policy { - public: - // useful types - using resource_type = typename Backend::resource_type; - using wait_type = typename Backend::wait_type; - - class selection_type { - public: - auto_tune_policy get_policy() const; - resource_type unwrap() const; - }; - - // constructors - auto_tune_policy(deferred_initialization_t); - auto_tune_policy(uint64_t resample_interval_in_milliseconds = 0); - auto_tune_policy(const std::vector& u, - uint64_t resample_interval_in_milliseconds = 0); - - // deferred initializer - void initialize(uint64_t resample_interval_in_milliseconds = 0); - void initialize(const std::vector& u, - uint64_t resample_interval_in_milliseconds = 0); - - // queries - auto get_resources() const; - auto get_submission_group(); - - // other implementation defined functions... - }; - - } - -This policy can be used with all the dynamic selection functions, such as ``select``, ``submit``, -and ``submit_and_wait``. It can also be used with ``policy_traits``. - -Example -------- - -In the following example, an ``auto_tune_policy`` is used to dynamically select between -two queues, a CPU queue and a GPU queue. - -.. 
code:: cpp - - #include - #include - #include - - namespace ex = oneapi::dpl::experimental; - - int main() { - std::vector r { sycl::queue{sycl::cpu_selector_v}, - sycl::queue{sycl::gpu_selector_v} }; - - const std::size_t N = 10000; - std::vector av(N, 0.0); - std::vector bv(N, 0.0); - std::vector cv(N, 0.0); - for (int i = 0; i < N; ++i) { - av[i] = bv[i] = i; - } - - ex::auto_tune_policy p{r}; // (1) - - { - sycl::buffer a_b(av); - sycl::buffer b_b(bv); - sycl::buffer c_b(cv); - - - for (int i = 0; i < 6; ++i) { - ex::submit_and_wait(p, [&](sycl::queue q) { // (2) - // (3) - std::cout << (q.get_device().is_cpu() ? "using cpu\n" : "using gpu\n"); - return q.submit([&](sycl::handler &h) { // (4) - sycl::accessor a_a(a_b, h, sycl::read_only); - sycl::accessor b_a(b_b, h, sycl::read_only); - sycl::accessor c_a(c_b, h, sycl::read_write); - h.parallel_for(N, [=](auto i) { c_a[i] = a_a[i] + b_a[i]; }); - }); - }); - }; - } - - for (int i = 0; i < N; ++i) { - if (cv[i] != 2*i) { - std::cout << "ERROR!\n"; - } - } - std::cout << "Done.\n"; - } - -The key points in this example are: - -#. An ``auto_tune_policy`` is constructed to select between the CPU and GPU. -#. ``submit_and_wait`` is invoked with the policy as the first argument. The selected queue will be passed to the user-provided function. -#. For clarity when run, the type of device is displayed. -#. The queue is used in function to perform and asynchronous offload. The SYCL event returned from the call to ``submit`` is returned. Returning an event is required for functions passed to ``submit`` and ``submit_and_wait``. - -Selection Algorithm -------------------- - -The selection algorithm for ``auto_tune_policy`` uses runtime profiling -to choose the best resource for the given function. A simplified, expository -implementation of the selection algorithm follows: - -.. 
code:: - - template - selection_type auto_tune_policy::select(Function&& f, Args&&...args) { - if (initialized_) { - auto k = make_task_key(f, args...); - auto tuner = get_tuner(k); - auto offset = tuner->get_resource_to_profile(); - if (offset == use_best) { - return selection_type {*this, tuner->best_resource_, tuner}; - } else { - auto r = resources_[offset]; - return selection{*this, r, tuner}; - } - } else { - throw std::logic_error(“selected called before initialization”); - } - } - -where ``make_task_key`` combines the inputs, including the function and its -arguments, into a key that uniquely identifies the user function that is being -profiled. ``tuner`` is the encapsulated logic for performing runtime profiling -and choosing the best option for a given key. When the call to ``get_resource_to_profile()`` -return ``use_best``, the tuner is not in the profiling phase, and so the previously -determined best resource is used. Otherwise, the resource at index ``offset`` -in the ``resources_`` vector is used and its resulting performance is profiled. -When an ``auto_tune_policy`` is initialized with a non-zero resample interval, -the policy will periodically return to the profiling phase base on the provided -interval value. - -Constructors ------------- - -``auto_tune_policy`` provides three constructors. - -.. list-table:: ``auto_tune_policy`` constructors - :widths: 50 50 - :header-rows: 1 - - * - Signature - - Description - * - ``auto_tune_policy(deferred_initialization_t);`` - - Defers initialization. An ``initialize`` function must be called prior to use. - * - ``auto_tune_policy(uint64_t resample_interval_in_milliseconds = 0);`` - - Initialized to use the default set of resources. An optional resampling interval can be provided. - * - ``auto_tune_policy(const std::vector& u, uint64_t resample_interval_in_milliseconds = 0);`` - - Overrides the default set of resources. An optional resampling interval can be provided. 
- -Deferred Initialization ------------------------ - -A ``auto_tune_policy`` that was constructed with deferred initialization must be -initialized by calling one its ``initialize`` member functions before it can be used -to select or submit. - -.. list-table:: ``auto_tune_policy`` constructors - :widths: 50 50 - :header-rows: 1 - - * - Signature - - Description - * - ``initialize(uint64_t resample_interval_in_milliseconds = 0);`` - - Initialize to use the default set of resources. An optional resampling interval can be provided. - * - ``initialize(const std::vector& u, uint64_t resample_interval_in_milliseconds = 0);`` - - Overrides the default set of resources. An optional resampling interval can be provided. - -Queries -------- - -A ``auto_tune_policy`` has ``get_resources`` and ``get_submission_group`` -member functions. - -.. list-table:: ``auto_tune_policy`` constructors - :widths: 50 50 - :header-rows: 1 - - * - Signature - - Description - * - ``std::vector get_resources();`` - - Returns the set of resources the policy is selecting from. - * - ``auto get_submission_group();`` - - Returns an object that can be used to wait for all active submissions. - -Reporting Requirements ----------------------- - -If a resource returned by ``select`` is used directly without calling -``submit`` or ``submit_and_wait``, it may be necessary to call ``report`` -to provide feedback to the policy. The ``auto_tune_policy`` tracks the -performance of submissions on each device via callbacks that report -the execution time. The instrumentation to report these events is included -in the implementations of ``submit`` and ``submit_and_wait``. However, if you -use ``select`` and then submit work directly to the selected resource, it -is necessary to explicitly report these events. - -.. list-table:: ``auto_tune_policy`` reporting requirements - :widths: 50 50 - :header-rows: 1 - - * - ``execution_info`` - - is reporting required? 
- * - ``task_submission`` - - No - * - ``task_completion`` - - No - * - ``task_time`` - - Yes - -In generic code, it is possible to perform compile-time checks to avoid -reporting overheads when reporting is not needed, while still writing -code that will work with any policy, as demonstrated below: - -.. code:: cpp - - auto s = select(my_policy); - if constexpr (report_info_v) - { - s.report(execution_info::task_submission); - } +Auto-Tune Policy +################ + +The dynamic selection API is an experimental feature in the |onedpl_long| +(|onedpl_short|) that selects an *execution resource* based on a chosen +*selection policy*. There are several policies provided as part +of the API. Policies encapsulate the logic and any associated state needed +to make a selection. + +The auto-tune policy selects resources using runtime profiling. ``auto_tune_policy`` +is useful for determining which resource performs best +for a given kernel. The choice is made based on runtime performance +history, so this policy is only useful for kernels that have stable +performance. Initially, this policy acts like ``round_robin_policy``, +rotating through each resource (one or more times). Then, once it has +determined which resource is performing best, it uses that resource +thereafter. Optionally, a resampling interval can be set to return to +the profiling phase periodically. + +.. 
code:: cpp
+
+    namespace oneapi::dpl::experimental {
+
+      template <typename Backend, typename... KeyArgs>
+      class auto_tune_policy {
+      public:
+        // useful types
+        using resource_type = typename Backend::resource_type;
+        using wait_type = typename Backend::wait_type;
+
+        class selection_type {
+        public:
+          auto_tune_policy get_policy() const;
+          resource_type unwrap() const;
+        };
+
+        // constructors
+        auto_tune_policy(deferred_initialization_t);
+        auto_tune_policy(uint64_t resample_interval_in_milliseconds = 0);
+        auto_tune_policy(const std::vector<resource_type>& u,
+                         uint64_t resample_interval_in_milliseconds = 0);
+
+        // deferred initializer
+        void initialize(uint64_t resample_interval_in_milliseconds = 0);
+        void initialize(const std::vector<resource_type>& u,
+                        uint64_t resample_interval_in_milliseconds = 0);
+
+        // queries
+        auto get_resources() const;
+        auto get_submission_group();
+
+        // other implementation defined functions...
+      };
+
+    }
+
+This policy can be used with all the dynamic selection functions, such as ``select``, ``submit``,
+and ``submit_and_wait``. It can also be used with ``policy_traits``.
+
+Example
+-------
+
+In the following example, an ``auto_tune_policy`` is used to dynamically select between
+two queues, a CPU queue and a GPU queue.
+
+.. code:: cpp
+
+    #include <oneapi/dpl/dynamic_selection>
+    #include <sycl/sycl.hpp>
+    #include <iostream>
+
+    namespace ex = oneapi::dpl::experimental;
+
+    int main() {
+      std::vector<sycl::queue> r { sycl::queue{sycl::cpu_selector_v},
+                                   sycl::queue{sycl::gpu_selector_v} };
+
+      const std::size_t N = 10000;
+      std::vector<float> av(N, 0.0);
+      std::vector<float> bv(N, 0.0);
+      std::vector<float> cv(N, 0.0);
+      for (int i = 0; i < N; ++i) {
+        av[i] = bv[i] = i;
+      }
+
+      ex::auto_tune_policy p{r}; // (1)
+
+      {
+        sycl::buffer a_b(av);
+        sycl::buffer b_b(bv);
+        sycl::buffer c_b(cv);
+
+        for (int i = 0; i < 6; ++i) {
+          ex::submit_and_wait(p, [&](sycl::queue q) { // (2)
+            // (3)
+            std::cout << (q.get_device().is_cpu() ? "using cpu\n" : "using gpu\n");
+            return q.submit([&](sycl::handler &h) { // (4)
+              sycl::accessor a_a(a_b, h, sycl::read_only);
+              sycl::accessor b_a(b_b, h, sycl::read_only);
+              sycl::accessor c_a(c_b, h, sycl::read_write);
+              h.parallel_for(N, [=](auto i) { c_a[i] = a_a[i] + b_a[i]; });
+            });
+          });
+        }
+      }
+
+      for (int i = 0; i < N; ++i) {
+        if (cv[i] != 2*i) {
+          std::cout << "ERROR!\n";
+        }
+      }
+      std::cout << "Done.\n";
+    }
+
+The key points in this example are:
+
+#. An ``auto_tune_policy`` is constructed to select between the CPU and GPU.
+#. ``submit_and_wait`` is invoked with the policy as the first argument. The selected queue will be passed to the user-provided function.
+#. For clarity when run, the type of device is displayed.
+#. The queue is used in a function to perform an asynchronous offload. The SYCL event returned from the call to ``submit`` is returned. Returning an event is required for functions passed to ``submit`` and ``submit_and_wait``.
+
+Selection Algorithm
+-------------------
+
+The selection algorithm for ``auto_tune_policy`` uses runtime profiling
+to choose the best resource for the given function. A simplified, expository
+implementation of the selection algorithm follows:
+
+.. code:: cpp
+
+    template <typename Function, typename... Args>
+    selection_type auto_tune_policy::select(Function&& f, Args&&... args) {
+      if (initialized_) {
+        auto k = make_task_key(f, args...);
+        auto tuner = get_tuner(k);
+        auto offset = tuner->get_resource_to_profile();
+        if (offset == use_best) {
+          return selection_type{*this, tuner->best_resource_, tuner};
+        } else {
+          auto r = resources_[offset];
+          return selection_type{*this, r, tuner};
+        }
+      } else {
+        throw std::logic_error("select called before initialization");
+      }
+    }
+
+where ``make_task_key`` combines the inputs, including the function and its
+arguments, into a key that uniquely identifies the user function that is being
+profiled. 
``tuner`` is the encapsulated logic for performing runtime profiling
+and choosing the best option for a given key. When the call to ``get_resource_to_profile()``
+returns ``use_best``, the tuner is not in the profiling phase, and so the previously
+determined best resource is used. Otherwise, the resource at index ``offset``
+in the ``resources_`` vector is used and its resulting performance is profiled.
+When an ``auto_tune_policy`` is initialized with a non-zero resample interval,
+the policy will periodically return to the profiling phase based on the provided
+interval value.
+
+Constructors
+------------
+
+``auto_tune_policy`` provides three constructors.
+
+.. list-table:: ``auto_tune_policy`` constructors
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``auto_tune_policy(deferred_initialization_t);``
+     - Defers initialization. An ``initialize`` function must be called prior to use.
+   * - ``auto_tune_policy(uint64_t resample_interval_in_milliseconds = 0);``
+     - Initialized to use the default set of resources. An optional resampling interval can be provided.
+   * - ``auto_tune_policy(const std::vector<resource_type>& u, uint64_t resample_interval_in_milliseconds = 0);``
+     - Overrides the default set of resources. An optional resampling interval can be provided.
+
+.. Note::
+
+   When initializing the ``auto_tune_policy`` with SYCL queues, constructing the queues with the
+   ``sycl::property::queue::enable_profiling`` property allows a more accurate determination of the
+   best-performing device to be made.
+
+Deferred Initialization
+-----------------------
+
+An ``auto_tune_policy`` that was constructed with deferred initialization must be
+initialized by calling one of its ``initialize`` member functions before it can be used
+to select or submit.
+
+.. 
list-table:: ``auto_tune_policy`` ``initialize`` functions
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``initialize(uint64_t resample_interval_in_milliseconds = 0);``
+     - Initializes to use the default set of resources. An optional resampling interval can be provided.
+   * - ``initialize(const std::vector<resource_type>& u, uint64_t resample_interval_in_milliseconds = 0);``
+     - Overrides the default set of resources. An optional resampling interval can be provided.
+
+.. Note::
+
+   When initializing the ``auto_tune_policy`` with SYCL queues, constructing the queues with the
+   ``sycl::property::queue::enable_profiling`` property allows a more accurate determination of the
+   best-performing device to be made.
+
+Queries
+-------
+
+An ``auto_tune_policy`` has ``get_resources`` and ``get_submission_group``
+member functions.
+
+.. list-table:: ``auto_tune_policy`` queries
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``std::vector<resource_type> get_resources();``
+     - Returns the set of resources the policy is selecting from.
+   * - ``auto get_submission_group();``
+     - Returns an object that can be used to wait for all active submissions.
+
+Reporting Requirements
+----------------------
+
+If a resource returned by ``select`` is used directly without calling
+``submit`` or ``submit_and_wait``, it may be necessary to call ``report``
+to provide feedback to the policy. The ``auto_tune_policy`` tracks the
+performance of submissions on each device via callbacks that report
+the execution time. The instrumentation to report these events is included
+in the implementations of ``submit`` and ``submit_and_wait``. However, if you
+use ``select`` and then submit work directly to the selected resource, it
+is necessary to explicitly report these events.
+
+.. list-table:: ``auto_tune_policy`` reporting requirements
+   :widths: 50 50
+   :header-rows: 1
+
+   * - ``execution_info``
+     - Is reporting required? 
+ * - ``task_submission`` + - No + * - ``task_completion`` + - No + * - ``task_time`` + - Yes + +In generic code, it is possible to perform compile-time checks to avoid +reporting overheads when reporting is not needed, while still writing +code that will work with any policy, as demonstrated below: + +.. code:: cpp + + auto s = select(my_policy); + if constexpr (report_info_v) + { + s.report(execution_info::task_submission); + } diff --git a/_sources/dynamic_selection_api/dynamic_load_policy.rst b/_sources/dynamic_selection_api/dynamic_load_policy.rst index ffd7dae485..93a1f6c482 100644 --- a/_sources/dynamic_selection_api/dynamic_load_policy.rst +++ b/_sources/dynamic_selection_api/dynamic_load_policy.rst @@ -1,231 +1,231 @@ -Dynamic Load Policy -################### - -The dynamic selection API is an experimental feature in the |onedpl_long| -(|onedpl_short|) that selects an *execution resource* based on a chosen -*selection policy*. There are several policies provided as part -of the API. Policies encapsulate the logic and any associated state needed -to make a selection. - -The dynamic load policy tracks the number of submissions currently submitted but not yet completed on each -resource and selects the resource that has the fewest unfinished submissions. -``dynamic_load_policy`` is useful for offloading kernels of varying cost to devices -of varying performance. A load-based assignment may achieve a good load balancing -by submitting tasks to a resource that completes work faster. - -.. 
code:: cpp - - namespace oneapi::dpl::experimental { - - template - class dynamic_load_policy { - public: - // useful types - using resource_type = typename Backend::resource_type; - using wait_type = typename Backend::wait_type; - - class selection_type { - public: - dynamic_load_policy get_policy() const; - resource_type unwrap() const; - }; - - // constructors - dynamic_load_policy(deferred_initialization_t); - dynamic_load_policy(); - dynamic_load_policy(const std::vector& u); - - // deferred initializer - void initialize(); - void initialize(const std::vector& u); - - // queries - auto get_resources() const; - auto get_submission_group(); - - // other implementation defined functions... - }; - - } - -This policy can be used with all the dynamic selection functions, such as ``select``, ``submit``, -and ``submit_and_wait``. It can also be used with ``policy_traits``. - -Example -------- - -The following example demonstrates a simple approach to send work to more than -one queue concurrently using ``dynamic_load_policy``. The policy selects the -resource with the fewest number of unfinished submissions. - -.. 
code:: cpp - - #include - #include - #include - - const std::size_t N = 10000; - namespace ex = oneapi::dpl::experimental; - - void f(sycl::handler& h, float* v); - void do_cpu_work(); - - int dynamic_load_example(std::vector& devices, - std::vector& usm_data) { - - ex::dynamic_load_policy p{devices}; // (1) - - auto num_devices = p.get_resources().size(); - auto num_arrays = usm_data.size(); - // (2) - auto submission_group_size = num_arrays; - - std::cout << "Running with " << num_devices << " queues\n" - << " " << num_arrays << " usm arrays\n" - << "Will perform " << submission_group_size << " concurrent offloads\n"; - - - for (int i = 0; i < 100; i+=submission_group_size) { // (3) - for (int j = 0; j < submission_group_size; ++j) { // (4) - ex::submit(p, [&](sycl::queue q) { // (5) - float *data = usm_data[j]; - return q.submit([=](sycl::handler &h) { // (6) - f(h, data); - }); - }); - do_cpu_work(); // (7) - } - ex::wait(p.get_submission_group()); // (8) - } - return 0; - } - -The key points in this example are: - -#. A ``dynamic_load_policy`` is constructed that selects from queues in the ``devices`` vector. -#. The total number of concurrent offloads, ``submission_group_size``, will be limited to the number of USM arrays. In this example, we allow multiple simultaneous offloads to the same queue. The only limitation is that there should be enough available vectors to support the concurrent executions. -#. The outer ``i``-loop iterates from 0 to 99, stepping by the ``submission_group_size``. This number of submissions will be offloaded concurrently. -#. The inner ``j``-loop iterates over ``submission_group_size`` submissions. -#. ``submit`` is used to select a queue and pass it to the user's function, but does not block until the event returned by that function completes. This provides the opportunity for concurrency across the submissions. -#. The queue is used in a function to perform an asynchronous offload. 
The SYCL event returned from the call to ``submit`` is returned. Returning an event is required for functions passed to ``submit`` and ``submit_and_wait``. -#. Some additional work is done between calls to ``submit``. ``dynamic_load_policy`` is most useful when there is time for work to complete on some devices before the next assignment is made. If all submissions are performed simultaneously, all devices will appear equally loaded, since the fast devices would have had no time to complete their work. -#. ``wait`` is called to block for all the concurrent ``submission_group_size`` submissions to complete. - -Selection Algorithm -------------------- - -The selection algorithm for ``dynamic_load_policy`` chooses the resource -that has the fewest number of unfinished offloads. The number of unfinished -offloads is the difference between the number of reported task submissions -and then number of reported task completions. This value is tracked for each -available resource. - -Simplified, expository implementation of the selection algorithm: - -.. code:: - - template - selection_type dynamic_load_policy::select(Args&& ...) { - if (initialized_) { - auto least_loaded_resource = find_least_loaded(resources_); - return selection_type{dynamic_load_policy(*this), least_loaded}; - } else { - throw std::logic_error("select called before initialialization"); - } - } - -where ``resources_`` is a container of resources, such as -``std::vector`` of ``sycl::queue``. The function ``find_least_loaded`` -iterates through the resources available to the policy and returns the -resource with the fewest number of unfinished offloads. - -Constructors ------------- - -``dynamic_load_policy`` provides three constructors. - -.. list-table:: ``dynamic_load_policy`` constructors - :widths: 50 50 - :header-rows: 1 - - * - Signature - - Description - * - ``dynamic_load_policy(deferred_initialization_t);`` - - Defers initialization. An ``initialize`` function must be called prior to use. 
- * - ``dynamic_load_policy();`` - - Initialized to use the default set of resources. - * - ``dynamic_load_policy(const std::vector& u);`` - - Overrides the default set of resources. - -Deferred Initialization ------------------------ - -A ``dynamic_load_policy`` that was constructed with deferred initialization must be -initialized by calling one of its ``initialize`` member functions before it can be used -to select or submit. - -.. list-table:: ``dynamic_load_policy`` constructors - :widths: 50 50 - :header-rows: 1 - - * - Signature - - Description - * - ``initialize();`` - - Initialize to use the default set of resources. - * - ``initialize(const std::vector& u);`` - - Overrides the default set of resources. - -Queries -------- - -A ``dynamic_load_policy`` has ``get_resources`` and ``get_submission_group`` -member functions. - -.. list-table:: ``dynamic_load_policy`` constructors - :widths: 50 50 - :header-rows: 1 - - * - Signature - - Description - * - ``std::vector get_resources();`` - - Returns the set of resources the policy is selecting from. - * - ``auto get_submission_group();`` - - Returns an object that can be used to wait for all active submissions. - -Reporting Requirements ----------------------- - -If a resource returned by ``select`` is used directly without calling -``submit`` or ``submit_and_wait``, it may be necessary to call ``report`` -to provide feedback to the policy. The ``dynamic_load_policy`` tracks the -number of outstanding submissions on each device via callbacks that report -when a submission is started, and when it is completed. The instrumentation -to report these events is included in the implementations of -``submit`` and ``submit_and_wait``. However, if you use ``select`` and then -submit work directly to the selected resource, it is necessary to explicitly -report these events. - -.. 
list-table:: ``dynamic_load_policy`` reporting requirements - :widths: 50 50 - :header-rows: 1 - - * - ``execution_info`` - - is reporting required? - * - ``task_submission`` - - Yes - * - ``task_completion`` - - Yes - * - ``task_time`` - - No - -In generic code, it is possible to perform compile-time checks to avoid -reporting overheads when reporting is not needed, while still writing -code that will work with any policy, as demonstrated below: - -.. code:: cpp - - auto s = select(my_policy); - if constexpr (report_info_v) - { - s.report(execution_info::task_submission); - } +Dynamic Load Policy +################### + +The dynamic selection API is an experimental feature in the |onedpl_long| +(|onedpl_short|) that selects an *execution resource* based on a chosen +*selection policy*. There are several policies provided as part +of the API. Policies encapsulate the logic and any associated state needed +to make a selection. + +The dynamic load policy tracks the number of submissions currently submitted but not yet completed on each +resource and selects the resource that has the fewest unfinished submissions. +``dynamic_load_policy`` is useful for offloading kernels of varying cost to devices +of varying performance. A load-based assignment may achieve a good load balancing +by submitting tasks to a resource that completes work faster. + +.. 
code:: cpp
+
+   namespace oneapi::dpl::experimental {
+
+     template <typename Backend>
+     class dynamic_load_policy {
+     public:
+       // useful types
+       using resource_type = typename Backend::resource_type;
+       using wait_type = typename Backend::wait_type;
+
+       class selection_type {
+       public:
+         dynamic_load_policy get_policy() const;
+         resource_type unwrap() const;
+       };
+
+       // constructors
+       dynamic_load_policy(deferred_initialization_t);
+       dynamic_load_policy();
+       dynamic_load_policy(const std::vector<resource_type>& u);
+
+       // deferred initializers
+       void initialize();
+       void initialize(const std::vector<resource_type>& u);
+
+       // queries
+       auto get_resources() const;
+       auto get_submission_group();
+
+       // other implementation defined functions...
+     };
+
+   }
+
+This policy can be used with all the dynamic selection functions, such as ``select``, ``submit``,
+and ``submit_and_wait``. It can also be used with ``policy_traits``.
+
+Example
+-------
+
+The following example demonstrates a simple approach to send work to more than
+one queue concurrently using ``dynamic_load_policy``. The policy selects the
+resource with the fewest number of unfinished submissions.
+
+..
code:: cpp
+
+   #include <oneapi/dpl/dynamic_selection>
+   #include <sycl/sycl.hpp>
+   #include <iostream>
+
+   const std::size_t N = 10000;
+   namespace ex = oneapi::dpl::experimental;
+
+   void f(sycl::handler& h, float* v);
+   void do_cpu_work();
+
+   int dynamic_load_example(std::vector<sycl::queue>& devices,
+                            std::vector<float*>& usm_data) {
+
+     ex::dynamic_load_policy p{devices}; // (1)
+
+     auto num_devices = p.get_resources().size();
+     auto num_arrays = usm_data.size();
+     // (2)
+     auto submission_group_size = num_arrays;
+
+     std::cout << "Running with " << num_devices << " queues\n"
+               << " " << num_arrays << " usm arrays\n"
+               << "Will perform " << submission_group_size << " concurrent offloads\n";
+
+     for (int i = 0; i < 100; i+=submission_group_size) { // (3)
+       for (int j = 0; j < submission_group_size; ++j) {  // (4)
+         ex::submit(p, [&](sycl::queue q) {               // (5)
+           float *data = usm_data[j];
+           return q.submit([=](sycl::handler &h) {        // (6)
+             f(h, data);
+           });
+         });
+         do_cpu_work();                                   // (7)
+       }
+       ex::wait(p.get_submission_group());                // (8)
+     }
+     return 0;
+   }
+
+The key points in this example are:
+
+#. A ``dynamic_load_policy`` is constructed that selects from queues in the ``devices`` vector.
+#. The total number of concurrent offloads, ``submission_group_size``, will be limited to the number of USM arrays. In this example, we allow multiple simultaneous offloads to the same queue. The only limitation is that there should be enough available vectors to support the concurrent executions.
+#. The outer ``i``-loop iterates from 0 to 99, stepping by the ``submission_group_size``. This number of submissions will be offloaded concurrently.
+#. The inner ``j``-loop iterates over ``submission_group_size`` submissions.
+#. ``submit`` is used to select a queue and pass it to the user's function, but does not block until the event returned by that function completes. This provides the opportunity for concurrency across the submissions.
+#. The queue is used in a function to perform an asynchronous offload.
The SYCL event returned from the call to ``submit`` is returned. Returning an event is required for functions passed to ``submit`` and ``submit_and_wait``.
+#. Some additional work is done between calls to ``submit``. ``dynamic_load_policy`` is most useful when there is time for work to complete on some devices before the next assignment is made. If all submissions are performed simultaneously, all devices will appear equally loaded, since the fast devices would have had no time to complete their work.
+#. ``wait`` is called to block for all the concurrent ``submission_group_size`` submissions to complete.
+
+Selection Algorithm
+-------------------
+
+The selection algorithm for ``dynamic_load_policy`` chooses the resource
+that has the fewest number of unfinished offloads. The number of unfinished
+offloads is the difference between the number of reported task submissions
+and the number of reported task completions. This value is tracked for each
+available resource.
+
+Simplified, expository implementation of the selection algorithm:
+
+.. code:: cpp
+
+   template <typename... Args>
+   selection_type dynamic_load_policy::select(Args&&...) {
+     if (initialized_) {
+       auto least_loaded_resource = find_least_loaded(resources_);
+       return selection_type{dynamic_load_policy(*this), least_loaded_resource};
+     } else {
+       throw std::logic_error("select called before initialization");
+     }
+   }
+
+where ``resources_`` is a container of resources, such as
+``std::vector`` of ``sycl::queue``. The function ``find_least_loaded``
+iterates through the resources available to the policy and returns the
+resource with the fewest number of unfinished offloads.
+
+Constructors
+------------
+
+``dynamic_load_policy`` provides three constructors.
+
+.. list-table:: ``dynamic_load_policy`` constructors
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``dynamic_load_policy(deferred_initialization_t);``
+     - Defers initialization. An ``initialize`` function must be called prior to use.
+   * - ``dynamic_load_policy();``
+     - Initializes to use the default set of resources.
+   * - ``dynamic_load_policy(const std::vector<resource_type>& u);``
+     - Overrides the default set of resources.
+
+Deferred Initialization
+-----------------------
+
+A ``dynamic_load_policy`` that was constructed with deferred initialization must be
+initialized by calling one of its ``initialize`` member functions before it can be used
+to select or submit.
+
+.. list-table:: ``dynamic_load_policy`` deferred initializers
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``initialize();``
+     - Initializes to use the default set of resources.
+   * - ``initialize(const std::vector<resource_type>& u);``
+     - Overrides the default set of resources.
+
+Queries
+-------
+
+A ``dynamic_load_policy`` has ``get_resources`` and ``get_submission_group``
+member functions.
+
+.. list-table:: ``dynamic_load_policy`` queries
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``std::vector<resource_type> get_resources();``
+     - Returns the set of resources the policy is selecting from.
+   * - ``auto get_submission_group();``
+     - Returns an object that can be used to wait for all active submissions.
+
+Reporting Requirements
+----------------------
+
+If a resource returned by ``select`` is used directly without calling
+``submit`` or ``submit_and_wait``, it may be necessary to call ``report``
+to provide feedback to the policy. The ``dynamic_load_policy`` tracks the
+number of outstanding submissions on each device via callbacks that report
+when a submission is started and when it is completed. The instrumentation
+to report these events is included in the implementations of
+``submit`` and ``submit_and_wait``. However, if you use ``select`` and then
+submit work directly to the selected resource, it is necessary to explicitly
+report these events.
+
+..
list-table:: ``dynamic_load_policy`` reporting requirements + :widths: 50 50 + :header-rows: 1 + + * - ``execution_info`` + - is reporting required? + * - ``task_submission`` + - Yes + * - ``task_completion`` + - Yes + * - ``task_time`` + - No + +In generic code, it is possible to perform compile-time checks to avoid +reporting overheads when reporting is not needed, while still writing +code that will work with any policy, as demonstrated below: + +.. code:: cpp + + auto s = select(my_policy); + if constexpr (report_info_v) + { + s.report(execution_info::task_submission); + } diff --git a/_sources/dynamic_selection_api/fixed_resource_policy.rst b/_sources/dynamic_selection_api/fixed_resource_policy.rst index eb2ceda028..664d9e0477 100644 --- a/_sources/dynamic_selection_api/fixed_resource_policy.rst +++ b/_sources/dynamic_selection_api/fixed_resource_policy.rst @@ -1,244 +1,244 @@ -Fixed-Resource Policy -##################### - -The dynamic selection API is an experimental feature in the |onedpl_long| -(|onedpl_short|) that selects an *execution resource* based on a chosen -*selection policy*. There are several policies provided as part -of the API. Policies encapsulate the logic and any associated state needed -to make a selection. - -The fixed-resource policy always returns the same resource selection. -``fixed_resource_policy`` is designed for two primary scenarios: - -#. debugging the use of dynamic selection -#. special casing a dynamic selection capable application for a specific resource when it is known to be best on that platform. - -.. 
code:: cpp - - namespace oneapi::dpl::experimental { - - template - class fixed_resource_policy { - public: - // useful types - using resource_type = typename Backend::resource_type; - using wait_type = typename Backend::wait_type; - - class selection_type { - public: - fixed_resource_policy get_policy() const; - resource_type unwrap() const; - }; - - // constructors - fixed_resource_policy(deferred_initialization_t); - fixed_resource_policy(std::size_t offset = 0); - fixed_resource_policy(const std::vector& u, - std::size_t offset = 0); - - // deferred initializers - void initialize(std::size_t offset = 0); - void initialize(const std::vector& u, - std::size_t offset = 0); - - // queries - auto get_resources() const; - auto get_submission_group(); - - // other implementation defined functions... - }; - - } - -This policy can be used with all the dynamic selection functions, such as ``select``, ``submit``, -and ``submit_and_wait``. It can also be used with ``policy_traits``. - - -Example -------- - -In the following example, a ``fixed_resource_policy`` is used when the code is -compiled so that it selects a specific device. When ``USE_CPU`` is defined at -compile-time, this example always uses the CPU queue. When ``USE_GPU`` is defined -at compile-time, it always uses the GPU queue. Otherwise, it uses an -``auto_tune_policy`` to dynamically select between these two queues. Such a scenario -could be used for debugging or simply to maintain the dynamic selection code even if -the best device to use is known for some subset of platforms. - -.. 
code:: cpp - - #include - #include - #include - - namespace ex = oneapi::dpl::experimental; - - int main() { - std::vector r { sycl::queue{sycl::cpu_selector_v}, - sycl::queue{sycl::gpu_selector_v} }; - - const std::size_t N = 10000; - std::vector av(N, 0.0); - std::vector bv(N, 0.0); - std::vector cv(N, 0.0); - for (int i = 0; i < N; ++i) { - av[i] = bv[i] = i; - } - - #if USE_CPU - ex::fixed_resource_policy p{r}; // (1) uses index 0 of r, the cpu - #elif USE_GPU - ex::fixed_resource_policy p{r, 1}; // (2) uses index 1 of r, the gpu - #else - ex::auto_tune_policy p{r}; - #endif - - { - sycl::buffer a_b(av); - sycl::buffer b_b(bv); - sycl::buffer c_b(cv); - - - for (int i = 0; i < 6; ++i) { - ex::submit_and_wait(p, [&](sycl::queue q) { // (3) - // (4) - std::cout << (q.get_device().is_cpu() ? "using cpu\n" : "using gpu\n"); - return q.submit([&](sycl::handler &h) { // (5) - sycl::accessor a_a(a_b, h, sycl::read_only); - sycl::accessor b_a(b_b, h, sycl::read_only); - sycl::accessor c_a(c_b, h, sycl::read_write); - h.parallel_for(N, [=](auto i) { c_a[i] = a_a[i] + b_a[i]; }); - }); - }); - }; - } - - for (int i = 0; i < N; ++i) { - if (cv[i] != 2*i) { - std::cout << "ERROR!\n"; - } - } - std::cout << "Done.\n"; - } - -The key points in this example are: - -#. If ``USE_CPU`` is defined, a ``fixed_resouce_policy`` is constructed that targets the CPU. -#. If ``USE_GPU`` is defined, a ``fixed_resouce_policy`` is constructed that targets the GPU. -#. ``submit_and_wait`` is invoked with the policy as the first argument. The selected queue will be passed to the user-provided function. -#. For clarity when run, the type of device is displayed. -#. The queue is used in a function to perform an asynchronous offload. The SYCL event returned from the call to ``submit`` is returned. Returning an event is required for functions passed to ``submit`` and ``submit_and_wait``. 
- -Selection Algorithm -------------------- - -The selection algorithm for ``fixed_resource_policy`` always returns -the same specific resource from its set of resources. The index of the -resource is set during construction or deferred initialization. - -Simplified, expository implementation of the selection algorithm: - -.. code:: - - template - selection_type fixed_resource_policy::select(Args&& ...) { - if (initialized_) { - return selection_type{*this, resources_[fixed_offset_]}; - } else { - throw std::logic_error(“select called before initialization”); - } - } - -where ``resources_`` is a container of resources, such as -``std::vector`` of ``sycl::queue``, and ``fixed_offset_`` stores a -fixed integer offset. Both ``resources_`` and ``fixed_offset`` -are set during construction or deferred initialization of the policy -and then remain constant. - -Constructors ------------- - -``fixed_resource_policy`` provides three constructors. - -.. list-table:: ``fixed_resource_policy`` constructors - :widths: 50 50 - :header-rows: 1 - - * - Signature - - Description - * - ``fixed_resource_policy(deferred_initialization_t);`` - - Defers initialization. An ``initialize`` function must be called prior to use. - * - ``fixed_resource_policy(std::size_t offset = 0);`` - - Sets the index for the resource to be selected. Uses the default set of resources. - * - ``fixed_resource_policy(const std::vector& u, std::size_t offset = 0);`` - - Overrides the default set of resources and optionally sets the index for the resource to be selected. - -Deferred Initialization ------------------------ - -A ``fixed_resource_policy`` that was constructed with deferred initialization must be -initialized by calling one its ``initialize`` member functions before it can be used -to select or submit. - -.. 
list-table:: ``fixed_resource_policy`` constructors - :widths: 50 50 - :header-rows: 1 - - * - Signature - - Description - * - ``initialize(std::size_t offset = 0);`` - - Sets the index for the resource to be selected. Uses the default set of resources. - * - ``initialize(const std::vector& u, std::size_t offset = 0);`` - - Overrides the default set of resources and optionally sets the index for the resource to be selected. - -Queries -------- - -A ``fixed_resource_policy`` has ``get_resources`` and ``get_submission_group`` -member functions. - -.. list-table:: ``fixed_resource_policy`` constructors - :widths: 50 50 - :header-rows: 1 - - * - Signature - - Description - * - ``std::vector get_resources();`` - - Returns the set of resources the policy is selecting from. - * - ``auto get_submission_group();`` - - Returns an object that can be used to wait for all active submissions. - -Reporting Requirements ----------------------- - -If a resource returned by ``select`` is used directly without calling -``submit`` or ``submit_and_wait``, it may be necessary to call ``report`` -to provide feedback to the policy. However, the ``fixed_resource_policy`` -does not require any feedback about the system state or the behavior of -the workload. Therefore, no explicit reporting of execution information -is needed, as is summarized in the table below. - -.. list-table:: ``fixed_resource_policy`` reporting requirements - :widths: 50 50 - :header-rows: 1 - - * - ``execution_info`` - - is reporting required? - * - ``task_submission`` - - No - * - ``task_completion`` - - No - * - ``task_time`` - - No - -In generic code, it is possible to perform compile-time checks to avoid -reporting overheads when reporting is not needed, while still writing -code that will work with any policy, as demonstrated below: - -.. 
code:: cpp
-
-   auto s = select(my_policy);
-   if constexpr (report_info_v)
-   {
-     s.report(execution_info::task_submission);
-   }
+Fixed-Resource Policy
+#####################
+
+The dynamic selection API is an experimental feature in the |onedpl_long|
+(|onedpl_short|) that selects an *execution resource* based on a chosen
+*selection policy*. There are several policies provided as part
+of the API. Policies encapsulate the logic and any associated state needed
+to make a selection.
+
+The fixed-resource policy always returns the same resource selection.
+``fixed_resource_policy`` is designed for two primary scenarios:
+
+#. debugging the use of dynamic selection
+#. special casing a dynamic selection capable application for a specific resource when it is known to be best on that platform.
+
+.. code:: cpp
+
+   namespace oneapi::dpl::experimental {
+
+     template <typename Backend>
+     class fixed_resource_policy {
+     public:
+       // useful types
+       using resource_type = typename Backend::resource_type;
+       using wait_type = typename Backend::wait_type;
+
+       class selection_type {
+       public:
+         fixed_resource_policy get_policy() const;
+         resource_type unwrap() const;
+       };
+
+       // constructors
+       fixed_resource_policy(deferred_initialization_t);
+       fixed_resource_policy(std::size_t offset = 0);
+       fixed_resource_policy(const std::vector<resource_type>& u,
+                             std::size_t offset = 0);
+
+       // deferred initializers
+       void initialize(std::size_t offset = 0);
+       void initialize(const std::vector<resource_type>& u,
+                       std::size_t offset = 0);
+
+       // queries
+       auto get_resources() const;
+       auto get_submission_group();
+
+       // other implementation defined functions...
+     };
+
+   }
+
+This policy can be used with all the dynamic selection functions, such as ``select``, ``submit``,
+and ``submit_and_wait``. It can also be used with ``policy_traits``.
+
+
+Example
+-------
+
+In the following example, a ``fixed_resource_policy`` is used when the code is
+compiled so that it selects a specific device.
When ``USE_CPU`` is defined at
+compile-time, this example always uses the CPU queue. When ``USE_GPU`` is defined
+at compile-time, it always uses the GPU queue. Otherwise, it uses an
+``auto_tune_policy`` to dynamically select between these two queues. Such a scenario
+could be used for debugging or simply to maintain the dynamic selection code even if
+the best device to use is known for some subset of platforms.
+
+.. code:: cpp
+
+   #include <oneapi/dpl/dynamic_selection>
+   #include <sycl/sycl.hpp>
+   #include <iostream>
+
+   namespace ex = oneapi::dpl::experimental;
+
+   int main() {
+     std::vector<sycl::queue> r { sycl::queue{sycl::cpu_selector_v},
+                                  sycl::queue{sycl::gpu_selector_v} };
+
+     const std::size_t N = 10000;
+     std::vector<float> av(N, 0.0);
+     std::vector<float> bv(N, 0.0);
+     std::vector<float> cv(N, 0.0);
+     for (int i = 0; i < N; ++i) {
+       av[i] = bv[i] = i;
+     }
+
+   #if USE_CPU
+     ex::fixed_resource_policy p{r};    // (1) uses index 0 of r, the cpu
+   #elif USE_GPU
+     ex::fixed_resource_policy p{r, 1}; // (2) uses index 1 of r, the gpu
+   #else
+     ex::auto_tune_policy p{r};
+   #endif
+
+     {
+       sycl::buffer a_b(av);
+       sycl::buffer b_b(bv);
+       sycl::buffer c_b(cv);
+
+       for (int i = 0; i < 6; ++i) {
+         ex::submit_and_wait(p, [&](sycl::queue q) { // (3)
+           // (4)
+           std::cout << (q.get_device().is_cpu() ? "using cpu\n" : "using gpu\n");
+           return q.submit([&](sycl::handler &h) {   // (5)
+             sycl::accessor a_a(a_b, h, sycl::read_only);
+             sycl::accessor b_a(b_b, h, sycl::read_only);
+             sycl::accessor c_a(c_b, h, sycl::read_write);
+             h.parallel_for(N, [=](auto i) { c_a[i] = a_a[i] + b_a[i]; });
+           });
+         });
+       };
+     }
+
+     for (int i = 0; i < N; ++i) {
+       if (cv[i] != 2*i) {
+         std::cout << "ERROR!\n";
+       }
+     }
+     std::cout << "Done.\n";
+   }
+
+The key points in this example are:
+
+#. If ``USE_CPU`` is defined, a ``fixed_resource_policy`` is constructed that targets the CPU.
+#. If ``USE_GPU`` is defined, a ``fixed_resource_policy`` is constructed that targets the GPU.
+#. ``submit_and_wait`` is invoked with the policy as the first argument.
The selected queue will be passed to the user-provided function.
+#. For clarity when run, the type of device is displayed.
+#. The queue is used in a function to perform an asynchronous offload. The SYCL event returned from the call to ``submit`` is returned. Returning an event is required for functions passed to ``submit`` and ``submit_and_wait``.
+
+Selection Algorithm
+-------------------
+
+The selection algorithm for ``fixed_resource_policy`` always returns
+the same specific resource from its set of resources. The index of the
+resource is set during construction or deferred initialization.
+
+Simplified, expository implementation of the selection algorithm:
+
+.. code:: cpp
+
+   template <typename... Args>
+   selection_type fixed_resource_policy::select(Args&&...) {
+     if (initialized_) {
+       return selection_type{*this, resources_[fixed_offset_]};
+     } else {
+       throw std::logic_error("select called before initialization");
+     }
+   }
+
+where ``resources_`` is a container of resources, such as
+``std::vector`` of ``sycl::queue``, and ``fixed_offset_`` stores a
+fixed integer offset. Both ``resources_`` and ``fixed_offset_``
+are set during construction or deferred initialization of the policy
+and then remain constant.
+
+Constructors
+------------
+
+``fixed_resource_policy`` provides three constructors.
+
+.. list-table:: ``fixed_resource_policy`` constructors
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``fixed_resource_policy(deferred_initialization_t);``
+     - Defers initialization. An ``initialize`` function must be called prior to use.
+   * - ``fixed_resource_policy(std::size_t offset = 0);``
+     - Sets the index for the resource to be selected. Uses the default set of resources.
+   * - ``fixed_resource_policy(const std::vector<resource_type>& u, std::size_t offset = 0);``
+     - Overrides the default set of resources and optionally sets the index for the resource to be selected.
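+The fixed-offset selection logic described above can be modeled in a few lines of
+standard C++. The sketch below is illustrative only: ``ToyFixedResourcePolicy`` and
+its string resources are invented stand-ins for this documentation, not oneDPL types;
+a real policy selects among ``sycl::queue`` objects through a backend.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-in for a resource; a real policy would hold sycl::queue objects.
using Resource = std::string;

// ToyFixedResourcePolicy is an invented name used only to model the behavior
// described in this section; it is not part of oneDPL.
class ToyFixedResourcePolicy {
  std::vector<Resource> resources_;
  std::size_t fixed_offset_ = 0;
  bool initialized_ = false;

public:
  ToyFixedResourcePolicy() = default; // models deferred initialization

  ToyFixedResourcePolicy(std::vector<Resource> u, std::size_t offset = 0)
      : resources_(std::move(u)), fixed_offset_(offset), initialized_(true) {}

  void initialize(std::vector<Resource> u, std::size_t offset = 0) {
    resources_ = std::move(u);
    fixed_offset_ = offset;
    initialized_ = true;
  }

  // Every call returns the same resource, chosen by the stored offset.
  const Resource& select() const {
    if (!initialized_)
      throw std::logic_error("select called before initialization");
    return resources_.at(fixed_offset_);
  }
};
```

+Note how the offset is fixed at construction (or deferred initialization) and never
+changes afterwards, which is why no runtime feedback is needed by this policy.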
+
+Deferred Initialization
+-----------------------
+
+A ``fixed_resource_policy`` that was constructed with deferred initialization must be
+initialized by calling one of its ``initialize`` member functions before it can be used
+to select or submit.
+
+.. list-table:: ``fixed_resource_policy`` deferred initializers
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``initialize(std::size_t offset = 0);``
+     - Sets the index for the resource to be selected. Uses the default set of resources.
+   * - ``initialize(const std::vector<resource_type>& u, std::size_t offset = 0);``
+     - Overrides the default set of resources and optionally sets the index for the resource to be selected.
+
+Queries
+-------
+
+A ``fixed_resource_policy`` has ``get_resources`` and ``get_submission_group``
+member functions.
+
+.. list-table:: ``fixed_resource_policy`` queries
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``std::vector<resource_type> get_resources();``
+     - Returns the set of resources the policy is selecting from.
+   * - ``auto get_submission_group();``
+     - Returns an object that can be used to wait for all active submissions.
+
+Reporting Requirements
+----------------------
+
+If a resource returned by ``select`` is used directly without calling
+``submit`` or ``submit_and_wait``, it may be necessary to call ``report``
+to provide feedback to the policy. However, the ``fixed_resource_policy``
+does not require any feedback about the system state or the behavior of
+the workload. Therefore, no explicit reporting of execution information
+is needed, as summarized in the table below.
+
+.. list-table:: ``fixed_resource_policy`` reporting requirements
+   :widths: 50 50
+   :header-rows: 1
+
+   * - ``execution_info``
+     - is reporting required?
+ * - ``task_submission`` + - No + * - ``task_completion`` + - No + * - ``task_time`` + - No + +In generic code, it is possible to perform compile-time checks to avoid +reporting overheads when reporting is not needed, while still writing +code that will work with any policy, as demonstrated below: + +.. code:: cpp + + auto s = select(my_policy); + if constexpr (report_info_v) + { + s.report(execution_info::task_submission); + } diff --git a/_sources/dynamic_selection_api/functions.rst b/_sources/dynamic_selection_api/functions.rst index 6dd574b502..63083b8f54 100644 --- a/_sources/dynamic_selection_api/functions.rst +++ b/_sources/dynamic_selection_api/functions.rst @@ -42,9 +42,7 @@ An example that calls ``select`` using a ``round_robin_policy``: } } -The output of this example: - -.. code:: +The output of this example is:: selected queue is cpu selected queue is gpu @@ -134,9 +132,7 @@ submitted to the selected queue. } } -The output from this example: - -.. code:: +The output from this example is:: (j == 0): submit to cpu (i == 0): async work on main thread @@ -204,9 +200,7 @@ command groups can be submitted to the selected queue. } } -The output from this example: - -.. code:: +The output from this example is:: (j == 0): submit to cpu (i == 0): async work on main thread @@ -269,9 +263,7 @@ the object returned by ``get_submission_group()`` to ``wait``: std::cout << "done waiting for all submissions\n"; } -The output from this example: - -.. code:: +The output from this example is:: (j == 0): submit to cpu (i == 0): async work on main thread @@ -345,9 +337,7 @@ submitted to the selected queue. } } -The output from this example: - -.. code:: +The output from this example is:: (j == 0): submit to cpu (i == 0): submission done @@ -410,9 +400,7 @@ submitted to the selected queue. } -The output from this example: - -.. code:: +The output from this example is:: (j == 0): submit to cpu (i == 0): submission done @@ -462,9 +450,7 @@ selects from. 
The following example demonstrates the use of the function std::cout << "queue is " << ((q.get_device().is_gpu()) ? "gpu\n" : "not-gpu\n"); } -The output from this example on a test machine is shown below. - -.. code:: +The output from this example on a test machine is:: Resources in explicitly set policy queue is cpu diff --git a/_sources/dynamic_selection_api/policies.rst b/_sources/dynamic_selection_api/policies.rst index f74a2f0dca..afb156b7a8 100644 --- a/_sources/dynamic_selection_api/policies.rst +++ b/_sources/dynamic_selection_api/policies.rst @@ -100,9 +100,7 @@ An example, demonstrating this difference, is shown below: print_type("p2 selection 4: ", p2s4); } -The output of this example: - -.. code:: +The output of this example is:: p1 selection 1: cpu p2 selection 1: cpu @@ -125,7 +123,6 @@ More detailed information about the API is provided in the following sections: .. toctree:: :maxdepth: 2 :titlesonly: - :glob: fixed_resource_policy round_robin_policy diff --git a/_sources/dynamic_selection_api/round_robin_policy.rst b/_sources/dynamic_selection_api/round_robin_policy.rst index 0d766dc566..ead7101d94 100644 --- a/_sources/dynamic_selection_api/round_robin_policy.rst +++ b/_sources/dynamic_selection_api/round_robin_policy.rst @@ -1,222 +1,222 @@ -Round-Robin Policy -################## - -The dynamic selection API is an experimental feature in the |onedpl_long| -(|onedpl_short|) that selects an *execution resource* based on a chosen -*selection policy*. There are several policies provided as part -of the API. Policies encapsulate the logic and any associated state needed -to make a selection. - -The round-robin policy cycles through the set of resources at each selection. ``round_robin_policy`` -is useful for offloading kernels of similar cost to devices of similar -capabilities. In those cases, a round-robin assignment of kernels to devices -will achieve a good load balancing. - -.. 
code:: cpp - - namespace oneapi::dpl::experimental { - - template - class round_robin_policy { - public: - // useful types - using resource_type = typename Backend::resource_type; - using wait_type = typename Backend::wait_type; - - class selection_type { - public: - round_robin_policy get_policy() const; - resource_type unwrap() const; - }; - - // constructors - round_robin_policy(deferred_initialization_t); - round_robin_policy(); - round_robin_policy(const std::vector& u); - - // deferred initializer - void initialize(); - void initialize(const std::vector& u); - - // queries - auto get_resources() const; - auto get_submission_group(); - - // other implementation defined functions... - }; - - } - -This policy can be used with all the dynamic selection functions, such as ``select``, ``submit``, -and ``submit_and_wait``. It can also be used with ``policy_traits``. - -Example -------- - -The following example demonstrates a simple approach to send work to each -queue in a set of queues, and then wait for all devices to complete the work -before repeating the process. A ``round_robin_policy`` is used rotate through -the available devices. - -.. code:: cpp - - #include - #include - #include - - const std::size_t N = 10000; - namespace ex = oneapi::dpl::experimental; - - void f(sycl::handler& h, float* v); - - - int round_robin_example(std::vector& similar_devices, - std::vector& usm_data) { - - ex::round_robin_policy p{similar_devices}; // (1) - - auto num_devices = p.get_resources().size(); - auto num_arrays = usm_data.size(); - - // (2) - auto submission_group_size = (num_arrays < num_devices) ? 
num_arrays : num_devices; - - std::cout << "Running with " << num_devices << " queues\n" - << " " << num_arrays << " usm arrays\n" - << "Will perform " << submission_group_size << " concurrent offloads\n"; - - for (int i = 0; i < 100; i += submission_group_size) { // (3) - for (int j = 0; j < submission_group_size; ++j) { // (4) - ex::submit(p, [&](sycl::queue q) { // (5) - float* data = usm_data[j]; - return q.submit([=](sycl::handler &h) { // (6) - f(h, data); - }); - }); - } - ex::wait(p.get_submission_group()); // (7) - } - return 0; - } - -The key points in this example are: - -#. A ``round_robin_policy`` is constructed that rotates between the CPU and GPU queues. -#. The total number of concurrent offloads, ``submission_group_size``, will be limited to the number of USM arrays or the number of queues, whichever is smaller. -#. The outer ``i``-loop iterates from 0 to 99, stepping by the ``submission_group_size``. This number of submissions will be offload concurrently. -#. The inner ``j``-loop iterates over ``submission_group_size`` submissions. -#. ``submit`` is used to select a queue and pass it to the user's function, but does not block until the event returned by that function completes. This provides the opportunity for concurrency across the submissions. -#. The queue is used in a function to perform an asynchronous offload. The SYCL event returned from the call to ``submit`` is returned. Returning an event is required for functions passed to ``submit`` and ``submit_and_wait``. -#. ``wait`` is called to block for all the concurrent ``submission_group_size`` submissions to complete. - -Selection Algorithm -------------------- - -The selection algorithm for ``round_robin_policy`` rotates through -the elements of the set of available resources. A simplified, expository -implementation of the selection algorithm follows: - -.. code:: - - template - selection_type round_robin_policy::select(Args&&...) 
{ - if (initialized_) { - auto& r = resources_[next_context_++ % num_resources_]; - return selection_type{*this, r}; - } else { - throw std::logic_error(“selected called before initialization”); - } - } - -where ``resources_`` is a container of resources, such as -``std::vector`` of ``sycl::queue``, ``next_context_`` is -a counter that increments at each selection, and ``num_resources_`` -is the size of the ``resources_`` vector. - -Constructors ------------- - -``round_robin_policy`` provides three constructors. - -.. list-table:: ``round_robin_policy`` constructors - :widths: 50 50 - :header-rows: 1 - - * - Signature - - Description - * - ``round_round_policy(deferred_initialization_t);`` - - Defers initialization. An ``initialize`` function must be called prior to use. - * - ``round_robin_policy();`` - - Initialized to use the default set of resources. - * - ``round_robin_policy(const std::vector& u);`` - - Overrides the default set of resources. - -Deferred Initialization ------------------------ - -A ``round_robin_policy`` that was constructed with deferred initialization must be -initialized by calling one its ``initialize`` member functions before it can be used -to select or submit. - -.. list-table:: ``round_robin_policy`` constructors - :widths: 50 50 - :header-rows: 1 - - * - Signature - - Description - * - ``initialize();`` - - Initialize to use the default set of resources. - * - ``initialize(const std::vector& u);`` - - Overrides the default set of resources. - -Queries -------- - -A ``round_robin_policy`` has ``get_resources`` and ``get_submission_group`` -member functions. - -.. list-table:: ``round_robin_policy`` constructors - :widths: 50 50 - :header-rows: 1 - - * - Signature - - Description - * - ``std::vector get_resources();`` - - Returns the set of resources the policy is selecting from. - * - ``auto get_submission_group();`` - - Returns an object that can be used to wait for all active submissions. 
- -Reporting Requirements ----------------------- - -If a resource returned by ``select`` is used directly without calling -``submit`` or ``submit_and_wait``, it may be necessary to call ``report`` -to provide feedback to the policy. However, the ``round_robin_policy`` -does not require any feedback about the system state or the behavior of -the workload. Therefore, no explicit reporting of execution information -is needed, as is summarized in the table below. - -.. list-table:: ``round_robin_policy`` reporting requirements - :widths: 50 50 - :header-rows: 1 - - * - ``execution_info`` - - is reporting required? - * - ``task_submission`` - - No - * - ``task_completion`` - - No - * - ``task_time`` - - No - -In generic code, it is possible to perform compile-time checks to avoid -reporting overheads when reporting is not needed, while still writing -code that will work with any policy, as demonstrated below: - -.. code:: cpp - - auto s = select(my_policy); - if constexpr (report_info_v) - { - s.report(execution_info::task_submission); - } +Round-Robin Policy +################## + +The dynamic selection API is an experimental feature in the |onedpl_long| +(|onedpl_short|) that selects an *execution resource* based on a chosen +*selection policy*. There are several policies provided as part +of the API. Policies encapsulate the logic and any associated state needed +to make a selection. + +The round-robin policy cycles through the set of resources at each selection. ``round_robin_policy`` +is useful for offloading kernels of similar cost to devices of similar +capabilities. In those cases, a round-robin assignment of kernels to devices +will achieve a good load balancing. + +.. 
code:: cpp
+
+  namespace oneapi::dpl::experimental {
+
+    template <typename Backend>
+    class round_robin_policy {
+    public:
+      // useful types
+      using resource_type = typename Backend::resource_type;
+      using wait_type = typename Backend::wait_type;
+
+      class selection_type {
+      public:
+        round_robin_policy get_policy() const;
+        resource_type unwrap() const;
+      };
+
+      // constructors
+      round_robin_policy(deferred_initialization_t);
+      round_robin_policy();
+      round_robin_policy(const std::vector<resource_type>& u);
+
+      // deferred initializer
+      void initialize();
+      void initialize(const std::vector<resource_type>& u);
+
+      // queries
+      auto get_resources() const;
+      auto get_submission_group();
+
+      // other implementation defined functions...
+    };
+
+  }
+
+This policy can be used with all the dynamic selection functions, such as ``select``, ``submit``,
+and ``submit_and_wait``. It can also be used with ``policy_traits``.
+
+Example
+-------
+
+The following example demonstrates a simple approach to send work to each
+queue in a set of queues, and then wait for all devices to complete the work
+before repeating the process. A ``round_robin_policy`` is used to rotate through
+the available devices.
+
+.. code:: cpp
+
+  #include <oneapi/dpl/dynamic_selection>
+  #include <sycl/sycl.hpp>
+  #include <iostream>
+
+  const std::size_t N = 10000;
+  namespace ex = oneapi::dpl::experimental;
+
+  void f(sycl::handler& h, float* v);
+
+
+  int round_robin_example(std::vector<sycl::queue>& similar_devices,
+                          std::vector<float*>& usm_data) {
+
+    ex::round_robin_policy p{similar_devices}; // (1)
+
+    auto num_devices = p.get_resources().size();
+    auto num_arrays = usm_data.size();
+
+    // (2)
+    auto submission_group_size = (num_arrays < num_devices) ? 
num_arrays : num_devices;
+
+    std::cout << "Running with " << num_devices << " queues\n"
+              << "             " << num_arrays << " usm arrays\n"
+              << "Will perform " << submission_group_size << " concurrent offloads\n";
+
+    for (int i = 0; i < 100; i += submission_group_size) { // (3)
+      for (int j = 0; j < submission_group_size; ++j) { // (4)
+        ex::submit(p, [&](sycl::queue q) { // (5)
+          float* data = usm_data[j];
+          return q.submit([=](sycl::handler &h) { // (6)
+            f(h, data);
+          });
+        });
+      }
+      ex::wait(p.get_submission_group()); // (7)
+    }
+    return 0;
+  }
+
+The key points in this example are:
+
+#. A ``round_robin_policy`` is constructed that rotates through the queues in ``similar_devices``.
+#. The total number of concurrent offloads, ``submission_group_size``, will be limited to the number of USM arrays or the number of queues, whichever is smaller.
+#. The outer ``i``-loop iterates from 0 to 99, stepping by the ``submission_group_size``. This number of submissions will be offloaded concurrently.
+#. The inner ``j``-loop iterates over ``submission_group_size`` submissions.
+#. ``submit`` is used to select a queue and pass it to the user's function, but does not block until the event returned by that function completes. This provides the opportunity for concurrency across the submissions.
+#. The queue is used in a function to perform an asynchronous offload. The SYCL event returned from the call to ``submit`` is returned. Returning an event is required for functions passed to ``submit`` and ``submit_and_wait``.
+#. ``wait`` is called to block for all the concurrent ``submission_group_size`` submissions to complete.
+
+Selection Algorithm
+-------------------
+
+The selection algorithm for ``round_robin_policy`` rotates through
+the elements of the set of available resources. A simplified, expository
+implementation of the selection algorithm follows:
+
+.. code:: cpp
+
+  template <typename... Args>
+  selection_type round_robin_policy<Backend>::select(Args&&...) 
{
+    if (initialized_) {
+      auto& r = resources_[next_context_++ % num_resources_];
+      return selection_type{*this, r};
+    } else {
+      throw std::logic_error("select called before initialization");
+    }
+  }
+
+where ``resources_`` is a container of resources, such as a
+``std::vector`` of ``sycl::queue``, ``next_context_`` is
+a counter that increments at each selection, and ``num_resources_``
+is the size of the ``resources_`` vector.
+
+Constructors
+------------
+
+``round_robin_policy`` provides three constructors.
+
+.. list-table:: ``round_robin_policy`` constructors
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``round_robin_policy(deferred_initialization_t);``
+     - Defers initialization. An ``initialize`` function must be called prior to use.
+   * - ``round_robin_policy();``
+     - Initializes to use the default set of resources.
+   * - ``round_robin_policy(const std::vector<resource_type>& u);``
+     - Overrides the default set of resources.
+
+Deferred Initialization
+-----------------------
+
+A ``round_robin_policy`` that was constructed with deferred initialization must be
+initialized by calling one of its ``initialize`` member functions before it can be used
+to select or submit.
+
+.. list-table:: ``round_robin_policy`` deferred initializers
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``initialize();``
+     - Initializes to use the default set of resources.
+   * - ``initialize(const std::vector<resource_type>& u);``
+     - Overrides the default set of resources.
+
+Queries
+-------
+
+A ``round_robin_policy`` has ``get_resources`` and ``get_submission_group``
+member functions.
+
+.. list-table:: ``round_robin_policy`` queries
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``std::vector<resource_type> get_resources();``
+     - Returns the set of resources the policy is selecting from.
+   * - ``auto get_submission_group();``
+     - Returns an object that can be used to wait for all active submissions. 
+ +Reporting Requirements +---------------------- + +If a resource returned by ``select`` is used directly without calling +``submit`` or ``submit_and_wait``, it may be necessary to call ``report`` +to provide feedback to the policy. However, the ``round_robin_policy`` +does not require any feedback about the system state or the behavior of +the workload. Therefore, no explicit reporting of execution information +is needed, as is summarized in the table below. + +.. list-table:: ``round_robin_policy`` reporting requirements + :widths: 50 50 + :header-rows: 1 + + * - ``execution_info`` + - is reporting required? + * - ``task_submission`` + - No + * - ``task_completion`` + - No + * - ``task_time`` + - No + +In generic code, it is possible to perform compile-time checks to avoid +reporting overheads when reporting is not needed, while still writing +code that will work with any policy, as demonstrated below: + +.. code:: cpp + + auto s = select(my_policy); + if constexpr (report_info_v) + { + s.report(execution_info::task_submission); + } diff --git a/_sources/dynamic_selection_api_main.rst b/_sources/dynamic_selection_api_main.rst index 2b3f18d97a..62c62f49e1 100644 --- a/_sources/dynamic_selection_api_main.rst +++ b/_sources/dynamic_selection_api_main.rst @@ -68,9 +68,7 @@ In the preceding example, the key points in the code include: #. The submit function returns an object that can be waited on. Calling ``wait`` on the ``done`` object blocks the main thread until the work submitted to the queue by your function is complete. #. The whole group of submissions made during the loop can be waited on. In this example, the call is redundant, since each submission was already waited for inside of the loop body. -The output from this example is: - -.. code:: +The output from this example is:: submit task to cpu submit task to gpu @@ -86,7 +84,6 @@ More detailed information about the API is provided in the following sections: .. 
toctree:: :maxdepth: 2 :titlesonly: - :glob: dynamic_selection_api/functions dynamic_selection_api/policies diff --git a/_sources/index.rst b/_sources/index.rst index eb0588e092..f460357ad1 100644 --- a/_sources/index.rst +++ b/_sources/index.rst @@ -10,6 +10,7 @@ For general information, refer to the `oneDPL GitHub* repository .. toctree:: :maxdepth: 2 + :titlesonly: :caption: Get Started introduction/release_notes.rst @@ -17,17 +18,27 @@ For general information, refer to the `oneDPL GitHub* repository .. toctree:: :maxdepth: 2 - :caption: Library Guide + :titlesonly: + :includehidden: + :caption: Core Functionality introduction parallel_api_main api_for_sycl_kernels_main - dynamic_selection_api_main - kernel_templates_main macros cmake_support oneDPL_2022.0_changes +.. toctree:: + :maxdepth: 1 + :titlesonly: + :includehidden: + :caption: Technology Preview + + parallel_api/async_api + dynamic_selection_api_main + kernel_templates_main + .. toctree:: :hidden: :caption: Notices and Disclaimers diff --git a/_sources/introduction.rst b/_sources/introduction.rst index 531a0f4637..77e592d33d 100644 --- a/_sources/introduction.rst +++ b/_sources/introduction.rst @@ -1,15 +1,11 @@ |onedpl_long| Introduction ####################################### -Parallel API can be used with the `C++ Standard Execution -Policies `_ -to enable parallelism on the host. - The |onedpl_long| (|onedpl_short|) is implemented in accordance with the `oneDPL -Specification `_. +Specification `_. -To support heterogeneity, |onedpl_short| works with the DPC++ API. More information can be found in the -`oneAPI Specification `_. +To support heterogeneity, |onedpl_short| uses `SYCL `_. +More information about SYCL can be found in the `SYCL Specification`_. 
Before You Begin ================ @@ -24,18 +20,11 @@ page for: * Fixed Issues * Deprecation Notice * Known Issues and Limitations -* Previous Release Notes +* Previous Release Notes -Install the `Intel® oneAPI Base Toolkit (Base Kit) `_ +Install the `Intel® oneAPI Base Toolkit (Base Kit) `_ to use |onedpl_short|. -All |onedpl_short| header files are in the ``oneapi/dpl`` directory. To use the |onedpl_short| API, -include the corresponding header in your source code with the ``#include `` directive. -|onedpl_short| introduces the namespace ``oneapi::dpl`` for most its classes and functions. - -To use tested C++ standard APIs, you need to include the corresponding C++ standard header files -and use the ``std`` namespace. - System Requirements =================== @@ -46,19 +35,56 @@ C++17 is the minimal supported version of the C++ standard. That means, any use of |onedpl_short| may require a C++17 compiler. While some APIs of the library may accidentally work with earlier versions of the C++ standard, it is no more guaranteed. -To call Parallel API with the C++ standard policies, you need to install the following software: +To call Parallel API with the C++ standard aligned policies, you need to install the following software: * A C++ compiler with support for OpenMP* 4.0 (or higher) SIMD constructs -* Depending on what parallel backend you want to use install either: +* Depending on what parallel backend you want to use, install either: - * |onetbb_long| or |tbb_long| 2019 and later - * A C++ compiler with support for OpenMP 4.5 (or higher) + * |onetbb_long| or |tbb_long| 2019 and later, + * A C++ compiler with support for OpenMP 4.5 (or higher). -For more information about parallel backends, see :doc:`Execution Policies ` +For more information about parallel backends, see :doc:`Execution Policies `. 
To use Parallel API with the device execution policies, you need to install the following software: -* A C++ compiler with support for SYCL 2020 +* A C++ compiler with support for SYCL 2020. + +Develop and Build Your Code with |onedpl_short| +=============================================== + +All |onedpl_short| header files are in the ``oneapi/dpl`` directory. To use the |onedpl_short| API, +include the corresponding header in your source code with the ``#include `` directive. +For better coexistence with the C++ standard library, include |onedpl_short| header files before the standard C++ ones. + +|onedpl_short| introduces the ``namespace oneapi::dpl`` for its classes and functions. For brevity, +``namespace dpl`` is defined as an alias to ``oneapi::dpl`` and can be used interchangeably. + +To use :doc:`tested C++ standard APIs ` in SYCL device code, +include the corresponding C++ standard header files and use the ``std`` namespace. + +Follow the steps below to build your code with |onedpl_short|: + +#. To build with the |dpcpp_cpp|, see the |dpcpp_gsg|_ for details. +#. Set the environment variables for |onedpl_short| and |onetbb_short|. + +Here is an example of a command line used to compile code that contains |onedpl_short| parallel algorithms +on Linux* (depending on the code, parameters within [] could be unnecessary):: + + icpx [-fsycl] [-fiopenmp] program.cpp [-ltbb] -o program + +You may also use the |pstl_offload_option|_ of |dpcpp_cpp| powered by |onedpl_short| +to build the standard C++ code for execution on a SYCL device:: + + icpx -fsycl -fsycl-pstl-offload=gpu program.cpp -o program + +This option redirects C++ parallel algorithms invoked with the ``std::execution::par_unseq`` policy +to |onedpl_short| algorithms. It does not change the behavior of the |onedpl_short| algorithms and +execution policies that are directly used in the code. + +Useful Information +================== + +.. 
_library-restrictions: Difference with Standard C++ Parallel Algorithms ************************************************ @@ -66,25 +92,24 @@ Difference with Standard C++ Parallel Algorithms * oneDPL execution policies only result in parallel execution if random access iterators are provided, the execution will remain serial for other iterator types. * Function objects passed in to algorithms executed with device policies must provide ``const``-qualified ``operator()``. - `The SYCL specification `_ states that writing to such an object during a SYCL - kernel is undefined behavior. -* For the following algorithms, par_unseq and unseq policies do not result in vectorized execution: + The `SYCL specification`_ states that writing to such an object during a SYCL kernel is undefined behavior. +* For the following algorithms, ``par_unseq`` and ``unseq`` policies do not result in SIMD execution: ``includes``, ``inplace_merge``, ``merge``, ``set_difference``, ``set_intersection``, ``set_symmetric_difference``, ``set_union``, ``stable_partition``, ``unique``. * The following algorithms require additional O(n) memory space for parallel execution: ``copy_if``, ``inplace_merge``, ``partial_sort``, ``partial_sort_copy``, ``partition_copy``, ``remove``, ``remove_if``, ``rotate``, ``sort``, ``stable_sort``, ``unique``, ``unique_copy``. - Restrictions ************ -When called with |dpcpp_short| execution policies, |onedpl_short| algorithms apply the same restrictions as -|dpcpp_short| does (see the |dpcpp_short| specification and the SYCL specification for details), such as: +When called with device execution policies, |onedpl_short| algorithms apply the same restrictions as +|dpcpp_short| does (see the |dpcpp_cpp| documentation and the SYCL specification for details), such as: * Adding buffers to a lambda capture list is not allowed for lambdas passed to an algorithm. 
* Passing data types, which are not trivially copyable, is only allowed via USM, but not via buffers or host-allocated containers. +* Objects of pointer-to-member types cannot be passed to an algorithm. * The definition of lambda functions used with parallel algorithms should not depend on preprocessor macros that makes it different for the host and the device. Otherwise, the behavior is undefined. * When used within SYCL kernels or transferred to/from a device, a container class can only hold objects @@ -94,12 +119,11 @@ When called with |dpcpp_short| execution policies, |onedpl_short| algorithms app Known Limitations ***************** -* When compiled with ``-fsycl-pstl-offload`` option of Intel oneAPI DPC++/C++ compiler and with - ``libstdc++`` version 8 or ``libc++``, ``oneapi::dpl::execution::par_unseq`` offloads - standard parallel algorithms to the SYCL device similarly to ``std::execution::par_unseq`` - in accordance with the ``-fsycl-pstl-offload`` option value. +* The ``oneapi::dpl::execution::par_unseq`` policy is affected by ``-fsycl-pstl-offload`` option of |dpcpp_cpp| + when |onedpl_short| substitutes this policy for the ``std::execution::par_unseq`` policy + missing in a standard C++ library, particularly in libstdc++ version 8 and in libc++. * For ``transform_exclusive_scan`` and ``exclusive_scan`` to run in-place (that is, with the same data - used for both input and destination) and with an execution policy of ``unseq`` or ``par_unseq``, + used for both input and destination) and with an execution policy of ``unseq`` or ``par_unseq``, it is required that the provided input and destination iterators are equality comparable. Furthermore, the equality comparison of the input and destination iterator must evaluate to true. If these conditions are not met, the result of these algorithm calls is undefined. 
@@ -107,34 +131,27 @@ Known Limitations convertible to the type of the initial value if one is provided, otherwise it is convertible to the type of values in the processed data sequence: ``std::iterator_traits::value_type``. * ``exclusive_scan`` and ``transform_exclusive_scan`` algorithms may provide wrong results with - vector execution policies when building a program with GCC 10 and using ``-O0`` option. -* Compiling ``reduce`` and ``transform_reduce`` algorithms with the Intel DPC++ Compiler, versions 2021 and older, - may result in a runtime error. To fix this issue, use an Intel DPC++ Compiler version 2022 or newer. + unsequenced execution policies when building a program with GCC 10 and using ``-O0`` option. +* Compiling ``reduce`` and ``transform_reduce`` algorithms with |dpcpp_cpp| versions 2021 and older + may result in a runtime error. To fix this issue, use |dpcpp_cpp| version 2022 or newer. * When compiling on Windows, add the option ``/EHsc`` to the compilation command to avoid errors with oneDPL's experimental ranges API that uses exceptions. -* The use of |onedpl_short| together with the GNU C++ standard library (libstdc++) version 9 or 10 may lead to - compilation errors (caused by oneTBB API changes). - Using libstdc++ version 9 requires TBB version 2020 for the header file. This may result in compilation errors when - using C++17 or C++20 and TBB is not found in the environment, even if its use in |onedpl_short| is switched off. - To overcome these issues, include |onedpl_short| header files before the standard C++ header files, - or disable parallel algorithms support in the standard library. - For more information, please see `Intel® oneAPI Threading Building Blocks (oneTBB) Release Notes`_. * The ``using namespace oneapi;`` directive in a |onedpl_short| program code may result in compilation errors with some compilers including GCC 7 and earlier. 
Instead of this directive, explicitly use - ``oneapi::dpl`` namespace, or create a namespace alias. + the ``oneapi::dpl`` namespace, the shorter ``dpl`` namespace alias, or create your own alias. * ``std::array::at`` member function cannot be used in kernels because it may throw an exception; use ``std::array::operator[]`` instead. * Due to specifics of Microsoft* Visual C++, some standard floating-point math functions (including ``std::ldexp``, ``std::frexp``, ``std::sqrt(std::complex)``) require device support - for double precision. + for double precision. * ``exclusive_scan``, ``inclusive_scan``, ``exclusive_scan_by_segment``, ``inclusive_scan_by_segment``, ``transform_exclusive_scan``, ``transform_inclusive_scan``, - when used with C++ standard policies, impose limitations on the initial value type if an - initial value is provided, and on the value type of the input iterator if an initial value is + when used with C++ standard aligned policies, impose limitations on the initial value type if an + initial value is provided, and on the value type of the input iterator if an initial value is not provided. Firstly, it must satisfy the ``DefaultConstructible`` requirements. Secondly, a default-constructed instance of that type should act as the identity element for the binary scan function. -* ``reduce_by_segment``, when used with C++ standard policies, imposes limitations on the value type. +* ``reduce_by_segment``, when used with C++ standard aligned policies, imposes limitations on the value type. Firstly, it must satisfy the ``DefaultConstructible`` requirements. Secondly, a default-constructed instance of that type should act as the identity element for the binary reduction function. * The initial value type for ``exclusive_scan``, ``inclusive_scan``, ``exclusive_scan_by_segment``, @@ -144,28 +161,5 @@ Known Limitations the dereferenced value type of the provided iterators should satisfy the ``DefaultConstructible`` requirements. 
* For ``remove``, ``remove_if``, ``unique`` the dereferenced value type of the provided iterators should be ``MoveConstructible``. -* The algorithms that process uninitialized storage: ``uninitialized_copy``, ``uninitialized_copy_n``, ``uninitialized_fill``, ``uninitialized_fill_n``, ``uninitialized_fill_n``, ``uninitialized_move``, - ``uninitialized_move_n``, ``uninitialized_default_construct``, ``uninitialized_default_construct_n``, ``uninitialized_value_construct``, ``uninitialized_value_construct_n`` - should be called with a device policy when using device data and should be called with a host policy when using host data. Otherwise, the result is undefined. -* The algorithms that destroy data: ``destroy`` and ``destroy_n`` should be called with a host policy when using host data that was initialized on the host, and should be called with a device policy when using device data that was initialized on the device. Otherwise, the result is undefined. - - -Build Your Code with |onedpl_short| -=================================== - -Follow the steps below to build your code with |onedpl_short|: - -#. To build with the |dpcpp_cpp|, see the `Get Started with the Intel® oneAPI DPC++/C++ Compiler - `_ - for details. -#. Set the environment variables for |onedpl_short| and |onetbb_short|. -#. To avoid naming device policy objects explicitly, add the ``-fsycl-unnamed-lambda`` option. - -Below is an example of a command line used to compile code that contains -|onedpl_short| parallel algorithms on Linux* (depending on the code, parameters within [] could be unnecessary): - -.. code:: cpp - - dpcpp [-fsycl-unnamed-lambda] test.cpp [-ltbb|-fopenmp] -o test -.. _`Intel® oneAPI Threading Building Blocks (oneTBB) Release Notes`: https://www.intel.com/content/www/us/en/developer/articles/release-notes/intel-oneapi-threading-building-blocks-release-notes.html +.. 
_`SYCL Specification`: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html \ No newline at end of file diff --git a/_sources/introduction/onedpl_gsg.rst b/_sources/introduction/onedpl_gsg.rst index 5b598c23c3..761b7d677b 100644 --- a/_sources/introduction/onedpl_gsg.rst +++ b/_sources/introduction/onedpl_gsg.rst @@ -1,217 +1,223 @@ -Get Started with the |onedpl_long| -################################## - -|onedpl_long| (|onedpl_short|) works with the -`Intel® oneAPI DPC++/C++ Compiler `_ -to provide high-productivity APIs to developers, which can minimize SYCL* -programming efforts across devices for high performance parallel applications. - -|onedpl_short| consists of the following components: - -* Parallel API -* API for SYCL Kernels -* Macros - - -For general information about |onedpl_short|, visit the `oneDPL GitHub* repository `_, -or visit the `Intel® oneAPI DPC++ Library Guide `_ -and the `Intel® oneAPI DPC++ Library main page `_. - -Quick Start -=========== - -Installation ------------- - -Visit the |onedpl_short| `Release Notes -`_ -page for: - -* Where to Find the Release -* Overview -* New Features -* Fixed Issues -* Known Issues and Limitations - -Install the `Intel® oneAPI Base Toolkit (Base Kit) `_ -to use |onedpl_short|. - -To use Parallel API, include the corresponding header files in your source code. - -All |onedpl_short| header files are in the ``oneapi/dpl`` directory. Use ``#include `` to include them. -|onedpl_short| uses the namespace ``oneapi::dpl`` for most its classes and functions. - -To use tested C++ standard APIs, you need to include the corresponding C++ standard header files -and use the ``std`` namespace. - -CMake Support -------------- -`CMake `_ generates build scripts which can then be used to build and link your application. |onedpl_short| can be added to your project via CMake. - -A simple example for Linux is provided below. 
For more detailed usage and options including details specific to Windows, please look to the `CMake Support Page `_. - -Simple Example CMake File -************************* -To use |onedpl_short| with CMake, create a CMakeLists.txt file for your project's base directory and use `find_package `_ and `target_link_libraries `_ to add oneDPL. -For example: - -.. code:: cpp - - project(Foo) - add_executable(foo foo.cpp) - - # Search to find oneDPL - find_package(oneDPL REQUIRED) - - # Connect oneDPL to foo - target_link_libraries(foo oneDPL) - -Simple Example CMake Invocation -******************************* -The following is an example CMake invocation which generates build scripts for the project in the parent directory: - -.. code:: cpp - - mkdir build && cd build - cmake -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=release .. - -Example Build Command -********************* -Once build scripts have been generated for your desired configuration following the instruction above, a `build command `_ can be issued to build your project: - -.. code:: cpp - - cmake --build . - -pkg-config Support ------------------- - -The pkg-config program is used to retrieve information about your installed libraries, and -to compile and link against one or more libraries. - -Use pkg-config with |onedpl_short| -********************************** - -Use pkg-config with the ``--cflags`` flag to get the include path to the oneDPL directory: - -.. code:: cpp - - icpx -fsycl foo.cpp $(pkg-config --cflags dpl) - -The ``--msvc-syntax`` flag is required when you use a Microsoft Visual C++* compiler. -This flag converts your compiling and linking flags to the appropriate form: - -.. code:: cpp - - icpx -fsycl foo.cpp $(pkg-config --msvc-syntax --cflags dpl) - -.. note:: - Use the pkg-config tool to get rid of large hard-coded paths and make compilation more portable. - - -Usage Examples --------------- - -|onedpl_short| sample code is available from the -`oneAPI GitHub samples repository `_. 
-Each sample includes a readme with build instructions. - -\ Header Usage Example -****************************************** - -This example illustrates |onedpl_short| random number generator usage. -The sample below shows you how to create an random number generator engine object (the source of pseudo-randomness), -a distribution object (specifying the desired probability distribution), and how to generate -the random numbers themselves. Random number generation is performed in a vectorized manner -to improve the speed of your computations. - -This example performs its computations on your default SYCL device. You can set the -``SYCL_DEVICE_TYPE`` environment variable to CPU or GPU. - -.. code:: cpp - - template - void random_fill(float* usmptr, std::size_t n) { - auto zero = oneapi::dpl::counting_iterator(0); - - std::for_each(oneapi::dpl::execution::dpcpp_default, - zero, zero + n/VecSize, - [usmptr](std::size_t i) { - auto offset = i * VecSize; - - oneapi::dpl::minstd_rand_vec engine(seed, offset); - oneapi::dpl::uniform_real_distribution> distr; - - auto res = distr(engine); - res.store(i, sycl::global_ptr(usmptr)); - }); - } - -Pi Benchmark Usage Example -************************** - -This example uses a Monte Carlo method to estimate the value of π. -The basic idea is to generate random points within a square, and to check what -fraction of these random points lie in a quarter-circle inscribed within that square. -The expected value is the ratio of the areas of the quarter-circle and the square (π/4). -You can take the observed fraction of points in the quarter-circle as an estimate of π/4. - -This example shows you how to create an random number generator engine object (the source of pseudo-randomness), -a distribution object (specifying the desired probability distribution), generate the -random numbers themselves, and then perform a reduction to count quantity of points that -fit into the square *S*. 
Random number generation is performed in scalar manner to simplify your code. - - -.. figure:: images/pi_benchmark.png - :alt: An image of pi chart. - -.. code:: cpp - - float estimated_pi; - { - sycl::queue q(sycl::gpu_selector_v); - auto policy = oneapi::dpl::execution::make_device_policy(q); - - float sum = std::transform_reduce( policy, - oneapi::dpl::counting_iterator(0), - oneapi::dpl::counting_iterator(N), - 0.0f, - std::plus{}, - [=](int n){ - float local_sum = 0.0f; - oneapi::dpl::minstd_rand engine(SEED, n * ITER * 2); - oneapi::dpl::uniform_real_distribution distr; - for(int i = 0; i < ITER; ++i) { - float x = distr(engine); - float y = distr(engine); - if (x * x + y * y <= 1.0) - local_sum += 1.0; - } - return local_sum / (float)ITER; - } - ); - estimated_pi = 4.0f * (float)sum / N; - } - - -Find More -========= - -.. list-table:: - :widths: 50 50 - :header-rows: 1 - - * - Resource Link - - Description - * - `Intel® oneAPI DPC++ Library Guide `_ - - Refer to the |onedpl_short| guide for more in depth information. - * - `System Requirements `_ - - Check system requirements before you install |onedpl_short|. - * - `Intel® oneAPI DPC++ Library Release Notes `_ - - Check the release notes to learn about updates in the latest release. - * - `oneDPL Samples `_ - - Learn how to use |onedpl_short| with samples. - * - `Layers for Yocto* Project `_ - - Add oneAPI components to a Yocto project build using the meta-intel layers. - * - `oneAPI Samples Catalog `_ - - Explore the complete list of oneAPI code samples in the oneAPI Samples Catalog (GitHub*). These samples were designed to help you develop, offload, and optimize multiarchitecture applications targeting CPUs, GPUs, and FPGAs. 
\ No newline at end of file
+Get Started with the |onedpl_long|
+##################################
+
+|onedpl_long| (|onedpl_short|) works with the |dpcpp_cpp_with_gsg_link|_
+to provide high-productivity APIs to developers, which can minimize SYCL*
+programming efforts across devices for high performance parallel applications.
+
+|onedpl_short| consists of the following components:
+
+* Parallel API
+* API for SYCL Kernels
+* Macros
+
+
+For general information about |onedpl_short|, visit the `oneDPL GitHub* repository `_,
+or visit the |onedpl_library_guide|_ and the `Intel® oneAPI DPC++ Library main page
+`_.
+
+Quick Start
+===========
+
+Installation
+------------
+
+Visit the |onedpl_short| `Release Notes
+`_
+page for:
+
+* Where to Find the Release
+* Overview
+* New Features
+* Fixed Issues
+* Known Issues and Limitations
+
+Install the `Intel® oneAPI Base Toolkit (Base Kit) `_
+to use |onedpl_short|.
+
+To use Parallel API, include the corresponding header files in your source code.
+
+All |onedpl_short| header files are in the ``oneapi/dpl`` directory. Use ``#include `` to include them.
+|onedpl_short| uses the namespace ``oneapi::dpl`` for most of its classes and functions.
+
+To use tested C++ standard APIs, you need to include the corresponding C++ standard header files
+and use the ``std`` namespace.
+
+CMake Support
+-------------
+`CMake `_ generates build scripts which can then be used
+to build and link your application. |onedpl_short| can be added to your project via CMake.
+
+A simple example for Linux is provided below. For more detailed usage and options, including details specific to Windows,
+refer to the |dpcpp_cmake_support|_.
+
+Simple Example CMake File
+*************************
+To use |onedpl_short| with CMake, create a CMakeLists.txt file for your project's base directory and use
+`find_package `_
+and `target_link_libraries `_ to add oneDPL.
+For example:
+
+.. 
code:: cpp + + project(Foo) + add_executable(foo foo.cpp) + + # Search to find oneDPL + find_package(oneDPL REQUIRED) + + # Connect oneDPL to foo + target_link_libraries(foo oneDPL) + +Simple Example CMake Invocation +******************************* +The following is an example CMake invocation which generates build scripts for the project in the parent directory: + +.. code:: cpp + + mkdir build && cd build + cmake -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=release .. + +Example Build Command +********************* +Once build scripts have been generated for your desired configuration following the instruction above, a `build command +`_ can be issued to build your project: + +.. code:: cpp + + cmake --build . + +pkg-config Support +------------------ + +The pkg-config program is used to retrieve information about your installed libraries, and +to compile and link against one or more libraries. + +Use pkg-config with |onedpl_short| +********************************** + +Use pkg-config with the ``--cflags`` flag to get the include path to the oneDPL directory: + +.. code:: cpp + + icpx -fsycl foo.cpp $(pkg-config --cflags dpl) + +The ``--msvc-syntax`` flag is required when you use a Microsoft Visual C++* compiler. +This flag converts your compiling and linking flags to the appropriate form: + +.. code:: cpp + + icpx -fsycl foo.cpp $(pkg-config --msvc-syntax --cflags dpl) + +.. note:: + Use the pkg-config tool to get rid of large hard-coded paths and make compilation more portable. + + +Usage Examples +-------------- + +|onedpl_short| sample code is available from the +`oneAPI GitHub samples repository `_. +Each sample includes a readme with build instructions. + +\ Header Usage Example +****************************************** + +This example illustrates |onedpl_short| random number generator usage. 
+The sample below shows you how to create a random number generator engine object (the source of pseudo-randomness),
+a distribution object (specifying the desired probability distribution), and how to generate
+the random numbers themselves. Random number generation is performed in a vectorized manner
+to improve the speed of your computations.
+
+This example performs its computations on your default SYCL device. You can set the
+``SYCL_DEVICE_TYPE`` environment variable to CPU or GPU.
+
+.. code:: cpp
+
+    template <int VecSize>
+    void random_fill(float* usmptr, std::size_t n) {
+        auto zero = oneapi::dpl::counting_iterator<std::size_t>(0);
+
+        std::for_each(oneapi::dpl::execution::dpcpp_default,
+            zero, zero + n/VecSize,
+            [usmptr](std::size_t i) {
+                auto offset = i * VecSize;
+
+                oneapi::dpl::minstd_rand_vec<VecSize> engine(seed, offset);
+                oneapi::dpl::uniform_real_distribution<sycl::vec<float, VecSize>> distr;
+
+                auto res = distr(engine);
+                res.store(i, sycl::global_ptr<float>(usmptr));
+            });
+    }
+
+Pi Benchmark Usage Example
+**************************
+
+This example uses a Monte Carlo method to estimate the value of π.
+The basic idea is to generate random points within a square, and to check what
+fraction of these random points lie in a quarter-circle inscribed within that square.
+The expected value is the ratio of the areas of the quarter-circle and the square (π/4).
+You can take the observed fraction of points in the quarter-circle as an estimate of π/4.
+
+This example shows you how to create a random number generator engine object (the source of pseudo-randomness),
+a distribution object (specifying the desired probability distribution), generate the
+random numbers themselves, and then perform a reduction to count the number of points that
+fit into the square *S*. Random number generation is performed in a scalar manner to simplify your code.
+
+
+.. figure:: images/pi_benchmark.png
+   :alt: An image of a pi chart.
+
+.. 
code:: cpp
+
+    float estimated_pi;
+    {
+        sycl::queue q(sycl::gpu_selector_v);
+        auto policy = oneapi::dpl::execution::make_device_policy(q);
+
+        float sum = std::transform_reduce( policy,
+                                           oneapi::dpl::counting_iterator<int>(0),
+                                           oneapi::dpl::counting_iterator<int>(N),
+                                           0.0f,
+                                           std::plus<float>{},
+                                           [=](int n){
+                                               float local_sum = 0.0f;
+                                               oneapi::dpl::minstd_rand engine(SEED, n * ITER * 2);
+                                               oneapi::dpl::uniform_real_distribution<float> distr;
+                                               for(int i = 0; i < ITER; ++i) {
+                                                   float x = distr(engine);
+                                                   float y = distr(engine);
+                                                   if (x * x + y * y <= 1.0)
+                                                       local_sum += 1.0;
+                                               }
+                                               return local_sum / (float)ITER;
+                                           }
+        );
+        estimated_pi = 4.0f * (float)sum / N;
+    }
+
+
+Find More
+=========
+
+.. list-table::
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Resource Link
+     - Description
+   * - |onedpl_library_guide|_
+     - Refer to the |onedpl_short| guide for more in-depth information.
+   * - `System Requirements `_
+     - Check system requirements before you install |onedpl_short|.
+   * - `Intel® oneAPI DPC++ Library Release Notes
+       `_
+     - Check the release notes to learn about updates in the latest release.
+   * - `oneDPL Samples `_
+     - Learn how to use |onedpl_short| with samples.
+   * - |yocto_layers|_
+     - Add oneAPI components to a Yocto project build using the meta-intel layers.
+   * - `oneAPI Samples Catalog `_
+     - Explore the complete list of oneAPI code samples in the oneAPI Samples Catalog (GitHub*).
+       These samples were designed to help you develop, offload, and optimize multiarchitecture applications targeting CPUs, GPUs, and FPGAs.
\ No newline at end of file
diff --git a/_sources/introduction/release_notes.rst b/_sources/introduction/release_notes.rst
index 8f5ce72db7..f69dffa710 100644
--- a/_sources/introduction/release_notes.rst
+++ b/_sources/introduction/release_notes.rst
@@ -1,8 +1,8 @@
-.. |release_notes| replace:: |onedpl_long| Release Notes
-.. 
_release_notes: https://www.intel.com/content/www/us/en/developer/articles/release-notes/intel-oneapi-dpcpp-library-release-notes.html
-
-=============
-Release Notes
-=============
-
+.. |release_notes| replace:: |onedpl_long| Release Notes
+.. _release_notes: https://www.intel.com/content/www/us/en/developer/articles/release-notes/intel-oneapi-dpcpp-library-release-notes.html
+
+=============
+Release Notes
+=============
+
 Refer to |release_notes|_.
\ No newline at end of file
diff --git a/_sources/kernel_templates/esimd/radix_sort.rst b/_sources/kernel_templates/esimd/radix_sort.rst
index c74ece79a2..f6ba4ad36d 100644
--- a/_sources/kernel_templates/esimd/radix_sort.rst
+++ b/_sources/kernel_templates/esimd/radix_sort.rst
@@ -1,16 +1,17 @@
 Radix Sort
 ##########
 
----------------------------------------------------
-radix_sort and radix_sort_by_key Function Templates
----------------------------------------------------
+-----------------------------
+radix_sort Function Templates
+-----------------------------
 
-The ``radix_sort`` and ``radix_sort_by_key`` functions sort data using the radix sort algorithm.
-The sorting is stable, ensuring the preservation of the relative order of elements with equal keys.
-The functions implement a Onesweep* [#fnote1]_ algorithm variant. Both in-place and out-of-place
-overloads are provided. For out-of-place overloads, the input data order is preserved.
+The ``radix_sort`` function sorts data using the radix sort algorithm.
+The sorting is stable, preserving the relative order of elements with equal keys.
+Both in-place and out-of-place overloads are provided. Out-of-place overloads do not alter the input sequence.
 
-A synopsis of the ``radix_sort`` and ``radix_sort_by_key`` functions is provided below:
+The function implements a Onesweep* [#fnote1]_ algorithm variant.
+
+A synopsis of the ``radix_sort`` function is provided below:
 
 .. 
code:: cpp @@ -18,7 +19,7 @@ A synopsis of the ``radix_sort`` and ``radix_sort_by_key`` functions is provided namespace oneapi::dpl::experimental::kt::gpu::esimd { - // Sort a single sequence + // Sort in-place template sycl::event @@ -31,55 +32,24 @@ A synopsis of the ``radix_sort`` and ``radix_sort_by_key`` functions is provided radix_sort (sycl::queue q, Range&& r, KernelParam param); // (2) - // Sort a single sequence out-of-place + // Sort out-of-place template sycl::event radix_sort (sycl::queue q, Iterator1 first, Iterator1 last, - Iterator2 first_out, KernelParam param) // (3) + Iterator2 first_out, KernelParam param); // (3) template sycl::event radix_sort (sycl::queue q, Range1&& r, Range2&& r_out, - KernelParam param) // (4) - - - // Sort a sequence of keys and apply the same order to a sequence of values - template - sycl::event - radix_sort_by_key (sycl::queue q, Iterator1 keys_first, Iterator1 keys_last, - Iterator2 values_first, KernelParam param); // (5) - - template - sycl::event - radix_sort_by_key (sycl::queue q, KeysRng&& keys, - ValuesRng&& values, KernelParam param); // (6) - - - // Sort a sequence of keys and values out-of-place - template - sycl::event - radix_sort_by_key (sycl::queue q, KeysIterator1 keys_first, - KeysIterator1 keys_last, ValsIterator1 vals_first, - KeysIterator2 keys_out_first, ValsIterator2 vals_out_first, - KernelParam param) // (7) - - template - sycl::event - radix_sort_by_key (sycl::queue q, KeysRng1&& keys, ValsRng1&& values, - KeysRng2&& keys_out, ValsRng2&& vals_out, - KernelParam param) // (8) + KernelParam param); // (4) } +.. note:: + The ``radix_sort`` is currently available only for Intel® Data Center GPU Max Series, + and requires Intel® oneAPI DPC++/C++ Compiler 2023.2 or newer. 
Template Parameters -------------------- @@ -99,22 +69,22 @@ Parameters +-----------------------------------------------+---------------------------------------------------------------------+ | Name | Description | +===============================================+=====================================================================+ -| ``q`` | The SYCL* queue where kernels are submitted. | +| ``q`` | The SYCL* queue where kernels are submitted. | +-----------------------------------------------+---------------------------------------------------------------------+ | | | | | The sequences to apply the algorithm to. | | - ``first``, ``last`` (1), | Supported sequence types: | | - ``r`` (2), | | -| - ``first``, ``last``, ``first_out`` (3), | - :ref:`USM pointers ` (1,3,5,7), | -| - ``r``, ``r_out`` (4), | - :ref:`oneapi::dpl::begin and oneapi::dpl::end | -| - ``keys_first``, ``keys_last``, | ` (1,3,5,7). | -| ``values_first`` (5), | - ``sycl::buffer`` (2,4,6,8), | -| - ``keys``, ``values`` (6), | - :ref:`oneapi::dpl::experimental::ranges::views::all | -| - ``keys_first``, ``keys_last``, | ` (2,4,6,8), | -| ``vals_first``, ``keys_out_first``, | - :ref:`oneapi::dpl::experimental::ranges::views::subrange | -| ``values_out_first`` (7) | ` (2,4,6,8), | -| - ``keys``, ``values``, | | -| ``keys_out``, ``values_out`` (8), | | +| - ``first``, ``last``, ``first_out`` (3), | - :ref:`USM pointers ` (1,3), | +| - ``r``, ``r_out`` (4). | - :ref:`oneapi::dpl::begin and oneapi::dpl::end | +| | ` (1,3). | +| | - ``sycl::buffer`` (2,4), | +| | - :ref:`oneapi::dpl::experimental::ranges::views::all | +| | ` (2,4), | +| | - :ref:`oneapi::dpl::experimental::ranges::views::subrange | +| | ` (2,4). | +| | | +| | | | | | +-----------------------------------------------+---------------------------------------------------------------------+ | ``param`` | A :doc:`kernel_param <../kernel_configuration>` object. 
| @@ -147,8 +117,8 @@ Usage Examples -------------- -radix_sort In-Place Example ---------------------------- +In-Place Example +---------------- .. code:: cpp @@ -185,74 +155,14 @@ radix_sort In-Place Example return 0; } -**Output:** - -.. code:: none +**Output**:: 5 3 3 3 2 1 -radix_sort_by_key In-Place Example ----------------------------------- - -.. code:: cpp - - // possible build and run commands: - // icpx -fsycl radix_sort_by_key.cpp -o radix_sort_by_key -I /path/to/oneDPL/include && ./radix_sort_by_key - - #include - #include - #include - - #include - - namespace kt = oneapi::dpl::experimental::kt; - - int main() - { - std::size_t n = 6; - sycl::queue q{sycl::gpu_selector_v}; - sycl::buffer keys{sycl::range<1>(n)}; - sycl::buffer values{sycl::range<1>(n)}; - - // initialize - { - sycl::host_accessor k_acc{keys, sycl::write_only}; - k_acc[0] = 3, k_acc[1] = 2, k_acc[2] = 1, k_acc[3] = 5, k_acc[4] = 3, k_acc[5] = 3; - - sycl::host_accessor v_acc{values, sycl::write_only}; - v_acc[0] = 'r', v_acc[1] = 'o', v_acc[2] = 's', v_acc[3] = 'd', v_acc[4] = 't', v_acc[5] = 'e'; - } - - // sort - auto e = kt::gpu::esimd::radix_sort_by_key(q, keys, values, kt::kernel_param<96, 64>{}); // (6) - e.wait(); - - // print - { - sycl::host_accessor k_acc{keys, sycl::read_only}; - for(std::size_t i = 0; i < n; ++i) - std::cout << k_acc[i] << ' '; - std::cout << '\n'; - - sycl::host_accessor v_acc{values, sycl::read_only}; - for(std::size_t i = 0; i < n; ++i) - std::cout << v_acc[i] << ' '; - std::cout << '\n'; - } - - return 0; - } - -**Output:** - -.. code:: none - - 1 2 3 3 3 5 - s o r t e d -radix_sort Out-of-Place Example -------------------------------- +Out-of-Place Example +-------------------- .. code:: cpp @@ -294,101 +204,24 @@ radix_sort Out-of-Place Example return 0; } -**Output:** - -.. code:: none +**Output**:: 3 2 1 5 3 3 5 3 3 3 2 1 -radix_sort_by_key Out-of-Place Example --------------------------------------- - -.. 
code:: cpp - - // possible build and run commands: - // icpx -fsycl radix_sort_by_key.cpp -o radix_sort_by_key -I /path/to/oneDPL/include && ./radix_sort_by_key - #include - #include - #include - - #include - - namespace kt = oneapi::dpl::experimental::kt; - - int main() - { - std::size_t n = 6; - sycl::queue q{sycl::gpu_selector_v}; - sycl::buffer keys{sycl::range<1>(n)}; - sycl::buffer keys_out{sycl::range<1>(n)}; - sycl::buffer values{sycl::range<1>(n)}; - sycl::buffer values_out{sycl::range<1>(n)}; - - - // initialize - { - sycl::host_accessor k_acc{keys, sycl::write_only}; - k_acc[0] = 3, k_acc[1] = 2, k_acc[2] = 1, k_acc[3] = 5, k_acc[4] = 3, k_acc[5] = 3; - - sycl::host_accessor v_acc{values, sycl::write_only}; - v_acc[0] = 'r', v_acc[1] = 'o', v_acc[2] = 's', v_acc[3] = 'd', v_acc[4] = 't', v_acc[5] = 'e'; - } - - // sort - auto e = kt::gpu::esimd::radix_sort_by_key(q, keys, values, keys_out, values_out, - kt::kernel_param<96, 64>{}); // (8) - e.wait(); - - // print - { - sycl::host_accessor k_acc{keys, sycl::read_only}; - for(std::size_t i = 0; i < n; ++i) - std::cout << k_acc[i] << ' '; - std::cout << '\n'; - - sycl::host_accessor v_acc{values, sycl::read_only}; - for(std::size_t i = 0; i < n; ++i) - std::cout << v_acc[i] << ' '; - std::cout << "\n\n"; - - sycl::host_accessor k_out_acc{keys_out, sycl::read_only}; - for(std::size_t i = 0; i < n; ++i) - std::cout << k_out_acc[i] << ' '; - std::cout << '\n'; - - sycl::host_accessor v_out_acc{values_out, sycl::read_only}; - for(std::size_t i = 0; i < n; ++i) - std::cout << v_out_acc[i] << ' '; - std::cout << '\n'; - } - - return 0; - } - -**Output:** - -.. code:: none - - 3 2 1 5 3 3 - r o s d t e - - 1 2 3 3 3 5 - s o r t e d - - -.. _memory-requirements: +.. 
_radix-sort-memory-requirements: ------------------- Memory Requirements ------------------- -The algorithms use global and local device memory (see `SYCL 2020 Specification +The algorithm uses global and local device memory (see `SYCL 2020 Specification `_) -for intermediate data storage. For the algorithms to operate correctly, there must be enough memory -on the device; otherwise, the behavior is undefined. The amount of memory that is required -depends on input data and configuration parameters, as described below. +for intermediate data storage. For the algorithm to operate correctly, there must be enough memory on the device. +If there is not enough global device memory, a ``std::bad_alloc`` exception is thrown. +The behavior is undefined if there is not enough local memory. +The amount of memory that is required depends on input data and configuration parameters, as described below. Global Memory Requirements -------------------------- @@ -396,12 +229,9 @@ Global Memory Requirements Global memory is used for copying the input sequence(s) and storing internal data such as radix value counters. The used amount depends on many parameters; below is an upper bound approximation: -:``radix_sort``: N\ :sub:`keys` + C * N\ :sub:`keys` - -:``radix_sort_by_key``: N\ :sub:`keys` + N\ :sub:`values` + C * N\ :sub:`keys` + N\ :sub:`keys` + C * N\ :sub:`keys` -where the sequence with keys takes N\ :sub:`keys` space, the sequence with values takes N\ :sub:`values` space, -and the additional space is C * N\ :sub:`keys`. +where the sequence with keys takes N\ :sub:`keys` space, and the additional space is C * N\ :sub:`keys`. The value of `C` depends on ``param.data_per_workitem``, ``param.workgroup_size``, and ``RadixBits``. 
For ``param.data_per_workitem`` set to `32`, ``param.workgroup_size`` to `64`, and ``RadixBits`` to `8`, @@ -413,35 +243,26 @@ Incrementing ``RadixBits`` increases `C` up to twice, while doubling either If the number of elements to sort does not exceed ``param.data_per_workitem * param.workgroup_size``, ``radix_sort`` is executed by a single work-group and does not use any global memory. - For ``radix_sort_by_key`` there is no single work-group implementation yet. .. The estimation above is not very precise and it seems it is not necessary for the global memory. The C coefficient base is actually 0.53 instead of 1. An increment of RadixBits multiplies C by the factor of ~1.5 on average. - Additionally, C exceeds 1 for radix_sort_by_key, - when N is small and the global histogram takes more space than the sequences. - This space is small, single WG implementation will be added, therefore this is neglected. - -.. _local-memory: Local Memory Requirements ------------------------- -Local memory is used for reordering keys or key-value pairs within a work-group, +Local memory is used for reordering keys within a work-group, and for storing internal data such as radix value counters. The used amount depends on many parameters; below is an upper bound approximation: -:``radix_sort``: N\ :sub:`keys_per_workgroup` + C - -:``radix_sort_by_key``: N\ :sub:`keys_per_workgroup` + N\ :sub:`values_per_workgroup` + C + N\ :sub:`keys_per_workgroup` + C -where N\ :sub:`keys_per_workgroup` and N\ :sub:`values_per_workgroup` are the amounts of memory -to store keys and values, respectively. `C` is some additional space for storing internal data. +where N\ :sub:`keys_per_workgroup` is the amount of memory to store keys. +`C` is some additional space for storing internal data. 
N\ :sub:`keys_per_workgroup` equals to ``sizeof(key_type) * param.data_per_workitem * param.workgroup_size``, -N\ :sub:`values_per_workgroup` equals to ``sizeof(value_type) * param.data_per_workitem * param.workgroup_size``, `C` does not exceed `4KB`. .. @@ -484,7 +305,7 @@ The initial configuration may be selected according to these high-level guidelin .. warning:: Avoid setting too large ``param.data_per_workitem`` and ``param.workgroup_size`` values. - Make sure that :ref:`Memory requirements ` are satisfied. + Make sure that :ref:`Memory requirements ` are satisfied. .. note:: @@ -492,7 +313,7 @@ The initial configuration may be selected according to these high-level guidelin since ``param.workgroup_size`` currently supports only one value (`64`). -.. [#fnote1] Andy Adinets and Duane Merrill (2022). Onesweep: A Faster Least Significant Digit Radix Sort for GPUs. Retrieved from https://arxiv.org/abs/2206.01784. +.. [#fnote1] Andy Adinets and Duane Merrill (2022). Onesweep: A Faster Least Significant Digit Radix Sort for GPUs. https://arxiv.org/abs/2206.01784. .. [#fnote2] The X\ :sup:`e`-core term is described in the `oneAPI GPU Optimization Guide `_. Check the number of cores in the device specification, such as `Intel® Data Center GPU Max specification diff --git a/_sources/kernel_templates/esimd/radix_sort_by_key.rst b/_sources/kernel_templates/esimd/radix_sort_by_key.rst new file mode 100644 index 0000000000..9ee99fadb6 --- /dev/null +++ b/_sources/kernel_templates/esimd/radix_sort_by_key.rst @@ -0,0 +1,361 @@ +Radix Sort By Key +################# + +------------------------------------ +radix_sort_by_key Function Templates +------------------------------------ + +The ``radix_sort_by_key`` function sorts keys using the radix sort algorithm, applying the same order to the corresponding values. +The sorting is stable, preserving the relative order of elements with equal keys. +Both in-place and out-of-place overloads are provided. 
Out-of-place overloads do not alter the input sequences.
+
+The function implements a Onesweep* [#fnote1]_ algorithm variant.
+
+A synopsis of the ``radix_sort_by_key`` function is provided below:
+
+.. code:: cpp
+
+   // defined in <oneapi/dpl/experimental/kernel_templates>
+
+   namespace oneapi::dpl::experimental::kt::gpu::esimd {
+
+   // Sort in-place
+   template <bool IsAscending = true, std::uint8_t RadixBits = 8,
+             typename Iterator1, typename Iterator2, typename KernelParam>
+   sycl::event
+   radix_sort_by_key (sycl::queue q, Iterator1 keys_first, Iterator1 keys_last,
+                      Iterator2 values_first, KernelParam param); // (1)
+
+   template <bool IsAscending = true, std::uint8_t RadixBits = 8,
+             typename KeysRng, typename ValuesRng, typename KernelParam>
+   sycl::event
+   radix_sort_by_key (sycl::queue q, KeysRng&& keys,
+                      ValuesRng&& values, KernelParam param); // (2)
+
+
+   // Sort out-of-place
+   template <bool IsAscending = true, std::uint8_t RadixBits = 8,
+             typename KeysIterator1, typename ValuesIterator1,
+             typename KeysIterator2, typename ValuesIterator2, typename KernelParam>
+   sycl::event
+   radix_sort_by_key (sycl::queue q, KeysIterator1 keys_first,
+                      KeysIterator1 keys_last, ValuesIterator1 values_first,
+                      KeysIterator2 keys_out_first, ValuesIterator2 values_out_first,
+                      KernelParam param); // (3)
+
+   template <bool IsAscending = true, std::uint8_t RadixBits = 8,
+             typename KeysRng1, typename ValuesRng1,
+             typename KeysRng2, typename ValuesRng2, typename KernelParam>
+   sycl::event
+   radix_sort_by_key (sycl::queue q, KeysRng1&& keys, ValuesRng1&& values,
+                      KeysRng2&& keys_out, ValuesRng2&& values_out,
+                      KernelParam param); // (4)
+   }
+
+.. note::
+   ``radix_sort_by_key`` is currently available only for Intel® Data Center GPU Max Series,
+   and requires Intel® oneAPI DPC++/C++ Compiler 2023.2 or newer.
+
+Template Parameters
+-------------------
+
++----------------------------+----------------------------------------------------------------+
+| Name                       | Description                                                    |
++============================+================================================================+
+| ``bool IsAscending``       | The sort order. Ascending: ``true``; Descending: ``false``.    |
++----------------------------+----------------------------------------------------------------+
+| ``std::uint8_t RadixBits`` | The number of bits to sort for each radix sort algorithm pass. |
++----------------------------+----------------------------------------------------------------+
+
+
+Parameters
+----------
+
++-----------------------------------------------+---------------------------------------------------------------------+
+| Name                                          | Description                                                         |
++===============================================+=====================================================================+
+| ``q``                                         | The SYCL* queue where kernels are submitted.                        |
++-----------------------------------------------+---------------------------------------------------------------------+
+|                                               |                                                                     |
+|                                               | The sequences to apply the algorithm to.                            |
+| - ``keys_first``, ``keys_last``,              | Supported sequence types:                                           |
+|   ``values_first`` (1),                       |                                                                     |
+| - ``keys``, ``values`` (2),                   | - :ref:`USM pointers ` (1,3),                                       |
+| - ``keys_first``, ``keys_last``,              | - :ref:`oneapi::dpl::begin and oneapi::dpl::end                     |
+|   ``values_first``, ``keys_out_first``,       |   ` (1,3),                                                          |
+|   ``values_out_first`` (3),                   | - ``sycl::buffer`` (2,4),                                           |
+| - ``keys``, ``values``,                       | - :ref:`oneapi::dpl::experimental::ranges::views::all               |
+|   ``keys_out``, ``values_out`` (4).           |   ` (2,4),                                                          |
+|                                               | - :ref:`oneapi::dpl::experimental::ranges::views::subrange          |
+|                                               |   ` (2,4).                                                          |
+|                                               |                                                                     |
++-----------------------------------------------+---------------------------------------------------------------------+
+| ``param``                                     | A :doc:`kernel_param <../kernel_configuration>` object.             |
+|                                               | Its ``data_per_workitem`` must be a positive multiple of 32.        |
+|                                               |                                                                     |
++-----------------------------------------------+---------------------------------------------------------------------+
+
+
+**Type Requirements**:
+
+- The element type of sequence(s) to sort must be a C++ integral or floating-point type
+  other than ``bool`` with a width of up to 64 bits.
+
+.. note::
+
+   Current limitations:
+
+   - Number of elements to sort must not exceed `2^30`.
+   - ``RadixBits`` can only be `8`.
+   - ``param.workgroup_size`` can only be `64`.
+
+Return Value
+------------
+
+A ``sycl::event`` object representing the status of the algorithm execution.
+
+--------------
+Usage Examples
+--------------
+
+
+In-Place Example
+----------------
+
+.. code:: cpp
+
+   // possible build and run commands:
+   //    icpx -fsycl radix_sort_by_key.cpp -o radix_sort_by_key -I /path/to/oneDPL/include && ./radix_sort_by_key
+
+   #include <cstdint>
+   #include <iostream>
+   #include <sycl/sycl.hpp>
+
+   #include <oneapi/dpl/experimental/kernel_templates>
+
+   namespace kt = oneapi::dpl::experimental::kt;
+
+   int main()
+   {
+      std::size_t n = 6;
+      sycl::queue q{sycl::gpu_selector_v};
+      sycl::buffer<std::uint32_t> keys{sycl::range<1>(n)};
+      sycl::buffer<char> values{sycl::range<1>(n)};
+
+      // initialize
+      {
+         sycl::host_accessor k_acc{keys, sycl::write_only};
+         k_acc[0] = 3, k_acc[1] = 2, k_acc[2] = 1, k_acc[3] = 5, k_acc[4] = 3, k_acc[5] = 3;
+
+         sycl::host_accessor v_acc{values, sycl::write_only};
+         v_acc[0] = 'r', v_acc[1] = 'o', v_acc[2] = 's', v_acc[3] = 'd', v_acc[4] = 't', v_acc[5] = 'e';
+      }
+
+      // sort
+      auto e = kt::gpu::esimd::radix_sort_by_key(q, keys, values, kt::kernel_param<96, 64>{}); // (2)
+      e.wait();
+
+      // print
+      {
+         sycl::host_accessor k_acc{keys, sycl::read_only};
+         for (std::size_t i = 0; i < n; ++i)
+            std::cout << k_acc[i] << ' ';
+         std::cout << '\n';
+
+         sycl::host_accessor v_acc{values, sycl::read_only};
+         for (std::size_t i = 0; i < n; ++i)
+            std::cout << v_acc[i] << ' ';
+         std::cout << '\n';
+      }
+
+      return 0;
+   }
+
+**Output**::
+
+   1 2 3 3 3 5
+   s o r t e d
+
+Out-of-Place Example
+--------------------
+
+.. code:: cpp
+
+   // possible build and run commands:
+   //    icpx -fsycl radix_sort_by_key.cpp -o radix_sort_by_key -I /path/to/oneDPL/include && ./radix_sort_by_key
+
+   #include <cstdint>
+   #include <iostream>
+   #include <sycl/sycl.hpp>
+
+   #include <oneapi/dpl/experimental/kernel_templates>
+
+   namespace kt = oneapi::dpl::experimental::kt;
+
+   int main()
+   {
+      std::size_t n = 6;
+      sycl::queue q{sycl::gpu_selector_v};
+      sycl::buffer<std::uint32_t> keys{sycl::range<1>(n)};
+      sycl::buffer<std::uint32_t> keys_out{sycl::range<1>(n)};
+      sycl::buffer<char> values{sycl::range<1>(n)};
+      sycl::buffer<char> values_out{sycl::range<1>(n)};
+
+      // initialize
+      {
+         sycl::host_accessor k_acc{keys, sycl::write_only};
+         k_acc[0] = 3, k_acc[1] = 2, k_acc[2] = 1, k_acc[3] = 5, k_acc[4] = 3, k_acc[5] = 3;
+
+         sycl::host_accessor v_acc{values, sycl::write_only};
+         v_acc[0] = 'r', v_acc[1] = 'o', v_acc[2] = 's', v_acc[3] = 'd', v_acc[4] = 't', v_acc[5] = 'e';
+      }
+
+      // sort
+      auto e = kt::gpu::esimd::radix_sort_by_key(q, keys, values, keys_out, values_out,
+                                                 kt::kernel_param<96, 64>{}); // (4)
+      e.wait();
+
+      // print
+      {
+         sycl::host_accessor k_acc{keys, sycl::read_only};
+         for (std::size_t i = 0; i < n; ++i)
+            std::cout << k_acc[i] << ' ';
+         std::cout << '\n';
+
+         sycl::host_accessor v_acc{values, sycl::read_only};
+         for (std::size_t i = 0; i < n; ++i)
+            std::cout << v_acc[i] << ' ';
+         std::cout << "\n\n";
+
+         sycl::host_accessor k_out_acc{keys_out, sycl::read_only};
+         for (std::size_t i = 0; i < n; ++i)
+            std::cout << k_out_acc[i] << ' ';
+         std::cout << '\n';
+
+         sycl::host_accessor v_out_acc{values_out, sycl::read_only};
+         for (std::size_t i = 0; i < n; ++i)
+            std::cout << v_out_acc[i] << ' ';
+         std::cout << '\n';
+      }
+
+      return 0;
+   }
+
+**Output**::
+
+   3 2 1 5 3 3
+   r o s d t e
+
+   1 2 3 3 3 5
+   s o r t e d
+
+
+.. _radix-sort-by-key-memory-requirements:
+
+-------------------
+Memory Requirements
+-------------------
+
+The algorithm uses global and local device memory (see `SYCL 2020 Specification
+<https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_sycl_device_memory_model>`_)
+for intermediate data storage.
For the algorithm to operate correctly, there must be enough memory on the device.
+If there is not enough global device memory, a ``std::bad_alloc`` exception is thrown.
+The behavior is undefined if there is not enough local memory.
+The amount of memory that is required depends on input data and configuration parameters, as described below.
+
+Global Memory Requirements
+--------------------------
+
+Global memory is used for copying the input sequence(s) and storing internal data such as radix value counters.
+The used amount depends on many parameters; below is an upper bound approximation:
+
+   N\ :sub:`keys` + N\ :sub:`values` + C * N\ :sub:`keys`
+
+where the sequence with keys takes N\ :sub:`keys` space, the sequence with values takes N\ :sub:`values` space,
+and the additional space is C * N\ :sub:`keys`.
+
+The value of `C` depends on ``param.data_per_workitem``, ``param.workgroup_size``, and ``RadixBits``.
+For ``param.data_per_workitem`` set to `32`, ``param.workgroup_size`` to `64`, and ``RadixBits`` to `8`,
+`C` is approximately `1`.
+Incrementing ``RadixBits`` increases `C` by up to a factor of two, while doubling either
+``param.data_per_workitem`` or ``param.workgroup_size`` halves `C`.
+
+..
+   The estimation above is not very precise and it seems it is not necessary for the global memory.
+   The C coefficient base is actually 0.53 instead of 1.
+   An increment of RadixBits multiplies C by the factor of ~1.5 on average.
+
+   Additionally, C exceeds 1 for radix_sort_by_key,
+   when N is small and the global histogram takes more space than the sequences.
+   This space is small and a single-WG implementation will be added, therefore this is neglected.
+
+Local Memory Requirements
+-------------------------
+
+Local memory is used for reordering key-value pairs within a work-group,
+and for storing internal data such as radix value counters.
+The used amount depends on many parameters; below is an upper bound approximation:
+
+   N\ :sub:`keys_per_workgroup` + N\ :sub:`values_per_workgroup` + C
+
+where N\ :sub:`keys_per_workgroup` and N\ :sub:`values_per_workgroup` are the amounts of memory
+to store keys and values, respectively. `C` is some additional space for storing internal data.
+
+N\ :sub:`keys_per_workgroup` equals ``sizeof(key_type) * param.data_per_workitem * param.workgroup_size``,
+N\ :sub:`values_per_workgroup` equals ``sizeof(value_type) * param.data_per_workitem * param.workgroup_size``,
+and `C` does not exceed `4KB`.
+
+..
+   C as 4KB stands on these points:
+   1) Extra space is needed to store a histogram to distribute keys. Its size is 4 * (2^RadixBits).
+      The estimation is correct for RadixBits 9 (2KB) and smaller. Support of larger RadixBits is not expected.
+   2) N_keys + N_values is rounded up to a 2KB boundary (temporarily, as a workaround for a GPU driver bug).
+
+..
+   The estimation assumes that reordering keys/pairs takes more space than ranking keys.
+   The ranking takes approximately "2 * workgroup_size * (2^RadixBits)" bytes.
+   It surpasses Intel Data Center GPU Max SLM capacity in only marginal cases,
+   e.g., when RadixBits is 10 and workgroup_size is 64, or when RadixBits is 9 and workgroup_size is 128.
+   It is ignored as an unrealistic case.
+
+-----------------------------------------
+Recommended Settings for Best Performance
+-----------------------------------------
+
+The general advice is to choose kernel parameters based on performance measurements and profiling information.
+The initial configuration may be selected according to these high-level guidelines:
+
+..
+   TODO: add this part when param.workgroup_size supports more than one value:
+   Increasing ``param.data_per_workitem`` should usually be preferred to increasing ``param.workgroup_size``,
+   to avoid extra synchronization overhead within a work-group.
+
+- When the number of elements to sort ``N`` is less than 1M, utilizing all available
+  compute cores is key for better performance. Allow creating enough work chunks to feed all
+  X\ :sup:`e`-cores [#fnote2]_ on a GPU: ``param.data_per_workitem * param.workgroup_size ≈ N / xe_core_count``.
+
+- When the number of elements to sort is large (more than ~1M), maximizing the number of elements
+  processed by a work-group, which equals ``param.data_per_workitem * param.workgroup_size``,
+  reduces synchronization overheads between work-groups and usually benefits the overall performance.
+
+.. warning::
+
+   Avoid ``param.data_per_workitem`` and ``param.workgroup_size`` values that are too large.
+   Make sure that the :ref:`Memory requirements <radix-sort-by-key-memory-requirements>` are satisfied.
+
+.. note::
+
+   ``param.data_per_workitem`` is the only available parameter to tune the performance,
+   since ``param.workgroup_size`` currently supports only one value (`64`).
+
+
+.. [#fnote1] Andy Adinets and Duane Merrill (2022). Onesweep: A Faster Least Significant Digit Radix Sort for GPUs. https://arxiv.org/abs/2206.01784.
+.. [#fnote2] The X\ :sup:`e`-core term is described in the `oneAPI GPU Optimization Guide
+   `_.
+   Check the number of cores in the device specification, such as `Intel® Data Center GPU Max specification
+   `_.
diff --git a/_sources/kernel_templates/esimd_main.rst b/_sources/kernel_templates/esimd_main.rst
index c717e46d1e..3b1c4b555c 100644
--- a/_sources/kernel_templates/esimd_main.rst
+++ b/_sources/kernel_templates/esimd_main.rst
@@ -1,31 +1,19 @@
 ESIMD-Based Kernel Templates
 ############################
 
-The ESIMD kernel templates are based on `Explicit SIMD SYCL extension
-`_
-of Intel® oneAPI DPC++/C++ Compiler.
+The ESIMD kernel templates are based on |esimd_sycl_extension|_ of |dpcpp_cpp|.
 This technology only supports Intel GPU devices.
 
 These templates are available in the ``oneapi::dpl::experimental::kt::gpu::esimd`` namespace.
The following are implemented: -* :doc:`radix_sort and radix_sort_by_key ` +* :doc:`radix_sort ` +* :doc:`radix_sort_by_key ` .. toctree:: :maxdepth: 2 :titlesonly: - :glob: :hidden: esimd/radix_sort + esimd/radix_sort_by_key -------------------- -System Requirements -------------------- - -- Hardware: Intel® Data Center GPU Max Series. -- Compiler: Intel® oneAPI DPC++/C++ Compiler 2023.2 and newer. -- Operating Systems: - - - Red Hat Enterprise Linux* 9.2, - - SUSE Linux Enterprise Server* 15 SP5, - - Ubuntu* 22.04. diff --git a/_sources/kernel_templates/kernel_configuration.rst b/_sources/kernel_templates/kernel_configuration.rst index 8b2b9ac796..185d4fc735 100644 --- a/_sources/kernel_templates/kernel_configuration.rst +++ b/_sources/kernel_templates/kernel_configuration.rst @@ -58,21 +58,19 @@ Member Types | ``kernel_name`` | ``KernelName`` | An optional parameter that is used to set a kernel name. | | | | | | | | .. note:: | -| | | | -| | | The ``KernelName`` parameter might be required in case an implementation of | -| | | SYCL is not fully compliant with the SYCL 2020 Specification and | -| | | does not support optional kernel names. | +| | | The ``KernelName`` parameter might be required in case an implementation | +| | | of SYCL is not fully compliant with the `SYCL 2020 Specification`_ | +| | | and does not support optional kernel names. | | | | | | | | If omitted, SYCL kernel name(s) will be automatically generated. | | | | | | | | If provided, it must be a unique C++ typename that satisfies the requirements | -| | | for SYCL kernel names (see `SYCL 2020 Specification | -| | | `_). | +| | | for SYCL kernel names in the `SYCL 2020 Specification`_. | | | | | | | | .. note:: | -| | | | | | | The provided name can be augmented by oneDPL when used with | | | | a template that creates multiple SYCL kernels. | | | | | +-----------------+----------------+----------------------------------------------------------------------------------+ + +.. 
_`SYCL 2020 Specification`: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:naming.kernels
\ No newline at end of file
diff --git a/_sources/kernel_templates/single_pass_scan.rst b/_sources/kernel_templates/single_pass_scan.rst
index de2472e97b..06dd40bdf5 100644
--- a/_sources/kernel_templates/single_pass_scan.rst
+++ b/_sources/kernel_templates/single_pass_scan.rst
@@ -78,10 +78,12 @@ Parameters
 
    Current limitations:
 
-   - The function will internally block until the issued kernels have completed execution.
+   - The function is currently synchronous: although it is intended to become an asynchronous call in the future, it will not return until the algorithm fully completes.
    - The SYCL device associated with the provided queue must support 64-bit atomic operations if the element type is 64-bits.
-   - There must be a known identity value for the provided combination of the element type and the binary operation. That is, ``sycl::has_known_identity_v`` must evaluate to true. Such operators are listed in the `SYCL 2020 specification `_.
+   - There must be a known identity value for the provided combination of the element type and the binary operation. That is,
+     ``sycl::has_known_identity_v`` must evaluate to true. Such operators are listed in
+     the `SYCL 2020 specification <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:identities>`_.
 
 Return Value
 ------------
@@ -133,9 +135,7 @@ inclusive_scan Example
     return 0;
   }
 
-**Output:**
-
-.. code:: none
+**Output**::
 
   1 3 4 7 8 10
 
@@ -147,9 +147,10 @@ Memory Requirements
 
 The algorithm uses global and local device memory (see `SYCL 2020 Specification
 <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_sycl_device_memory_model>`__)
-for intermediate data storage. For the algorithm to operate correctly, there must be enough memory
-on the device. It throws a ``std::bad_alloc`` exception if there is not enough global device memory. The behavior is undefined if there is not enough local memory.
The amount of memory that is required
-depends on input data and configuration parameters, as described below.
+for intermediate data storage. For the algorithm to operate correctly, there must be enough memory on the device.
+If there is not enough global device memory, a ``std::bad_alloc`` exception is thrown.
+The behavior is undefined if there is not enough local memory.
+The amount of memory that is required depends on input data and configuration parameters, as described below.
 
 Global Memory Requirements
 --------------------------
diff --git a/_sources/kernel_templates_main.rst b/_sources/kernel_templates_main.rst
index 77affdc416..108e54221f 100644
--- a/_sources/kernel_templates_main.rst
+++ b/_sources/kernel_templates_main.rst
@@ -10,7 +10,7 @@ It is recommended to use kernel templates when there is an opportunity to custom
 for a particular workload (for example, the number of elements and their type),
 or for a specific device (for example, based on the available local memory).
 
-To use the API, include the ``oneapi/dpl/experimental/kernel_templates`` header file.
+To use the API, include the ``<oneapi/dpl/experimental/kernel_templates>`` header file.
 The primary API namespace is ``oneapi::dpl::experimental::kt``, and nested namespaces are used to further categorize the templates.
 
 * :doc:`Kernel Configuration <kernel_templates/kernel_configuration>`. Generic structure for configuring a kernel template.
@@ -20,7 +20,6 @@ The primary API namespace is ``oneapi::dpl::experimental::kt``, and nested names
 .. toctree::
    :maxdepth: 2
    :titlesonly:
-   :glob:
    :hidden:
 
    kernel_templates/kernel_configuration
diff --git a/_sources/macros.rst b/_sources/macros.rst
index 09718e798c..2a1b5e4f8b 100644
--- a/_sources/macros.rst
+++ b/_sources/macros.rst
@@ -27,6 +27,24 @@ Macro Description
 ``_PSTL_VERSION_PATCH`` ``_PSTL_VERSION % 10``: The patch number.
 ================================= ==============================
 
+.. _feature-macros:
+
+Feature Macros
+==============
+Use these macros to test the presence of specific |onedpl_short| functionality.
+ +================================== =============================================== +Macro Macro values and the functionality +================================== =============================================== +``ONEDPL_HAS_RANDOM_NUMBERS`` Pseudo-random number generators and distributions. + + * ``202409L`` - added support of comparison and I/O stream operators and an experimental Philox engine +---------------------------------- ----------------------------------------------- +``ONEDPL_HAS_RANGE_ALGORITHMS`` Parallel range algorithms. + + * ``202409L`` - see :ref:`available algorithms `. +================================== =============================================== + Additional Macros ================== Use these macros to control aspects of |onedpl_short| usage. You can set them in your program code @@ -38,9 +56,7 @@ Macro Description ``PSTL_USE_NONTEMPORAL_STORES`` This macro enables the use of ``#pragma vector nontemporal`` for write-only data when algorithms such as ``std::copy``, ``std::fill``, etc., are executed with unsequenced policies. - For further details about the pragma, - see the `vector page in the Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference - `_. + For further details about the pragma, see the |vector_pragma|_. If the macro evaluates to a non-zero value, the use of ``#pragma vector nontemporal`` is enabled. By default, the macro is not defined. @@ -94,11 +110,17 @@ Macro Description such as ``dpcpp_default`` and ``dpcpp_fpga``. When the macro is not defined (by default) or evaluates to non-zero, predefined policies objects can be used. When the macro is set to 0, predefined policies objects and make functions - without arguments, when ``make_device_policy()``, - ``make_fpga_policy()``, are not available. + without arguments (``make_device_policy()`` and ``make_fpga_policy()``) are not available. 
---------------------------------- ------------------------------ ``ONEDPL_ALLOW_DEFERRED_WAITING`` This macro allows waiting for completion of certain algorithms executed with device policies to be deferred. (Disabled by default.) + + When the macro evaluates to non-zero, a call to a oneDPL algorithm with + a device policy might return before the computation completes on the device. + + .. Warning:: Before accessing data produced or modified by the call, waiting + for completion of all tasks in the corresponding SYCL queue is required; + otherwise, the program behavior is undefined. ---------------------------------- ------------------------------ ``ONEDPL_FPGA_DEVICE`` Use this macro to build your code containing |onedpl_short| parallel algorithms for FPGA devices. (Disabled by default.) diff --git a/_sources/onedpl_gsg.rst b/_sources/onedpl_gsg.rst index 5b598c23c3..761b7d677b 100644 --- a/_sources/onedpl_gsg.rst +++ b/_sources/onedpl_gsg.rst @@ -1,217 +1,223 @@ -Get Started with the |onedpl_long| -################################## - -|onedpl_long| (|onedpl_short|) works with the -`Intel® oneAPI DPC++/C++ Compiler `_ -to provide high-productivity APIs to developers, which can minimize SYCL* -programming efforts across devices for high performance parallel applications. - -|onedpl_short| consists of the following components: - -* Parallel API -* API for SYCL Kernels -* Macros - - -For general information about |onedpl_short|, visit the `oneDPL GitHub* repository `_, -or visit the `Intel® oneAPI DPC++ Library Guide `_ -and the `Intel® oneAPI DPC++ Library main page `_. - -Quick Start -=========== - -Installation ------------- - -Visit the |onedpl_short| `Release Notes -`_ -page for: - -* Where to Find the Release -* Overview -* New Features -* Fixed Issues -* Known Issues and Limitations - -Install the `Intel® oneAPI Base Toolkit (Base Kit) `_ -to use |onedpl_short|. - -To use Parallel API, include the corresponding header files in your source code. 
- -All |onedpl_short| header files are in the ``oneapi/dpl`` directory. Use ``#include `` to include them. -|onedpl_short| uses the namespace ``oneapi::dpl`` for most its classes and functions. - -To use tested C++ standard APIs, you need to include the corresponding C++ standard header files -and use the ``std`` namespace. - -CMake Support -------------- -`CMake `_ generates build scripts which can then be used to build and link your application. |onedpl_short| can be added to your project via CMake. - -A simple example for Linux is provided below. For more detailed usage and options including details specific to Windows, please look to the `CMake Support Page `_. - -Simple Example CMake File -************************* -To use |onedpl_short| with CMake, create a CMakeLists.txt file for your project's base directory and use `find_package `_ and `target_link_libraries `_ to add oneDPL. -For example: - -.. code:: cpp - - project(Foo) - add_executable(foo foo.cpp) - - # Search to find oneDPL - find_package(oneDPL REQUIRED) - - # Connect oneDPL to foo - target_link_libraries(foo oneDPL) - -Simple Example CMake Invocation -******************************* -The following is an example CMake invocation which generates build scripts for the project in the parent directory: - -.. code:: cpp - - mkdir build && cd build - cmake -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=release .. - -Example Build Command -********************* -Once build scripts have been generated for your desired configuration following the instruction above, a `build command `_ can be issued to build your project: - -.. code:: cpp - - cmake --build . - -pkg-config Support ------------------- - -The pkg-config program is used to retrieve information about your installed libraries, and -to compile and link against one or more libraries. 
- -Use pkg-config with |onedpl_short| -********************************** - -Use pkg-config with the ``--cflags`` flag to get the include path to the oneDPL directory: - -.. code:: cpp - - icpx -fsycl foo.cpp $(pkg-config --cflags dpl) - -The ``--msvc-syntax`` flag is required when you use a Microsoft Visual C++* compiler. -This flag converts your compiling and linking flags to the appropriate form: - -.. code:: cpp - - icpx -fsycl foo.cpp $(pkg-config --msvc-syntax --cflags dpl) - -.. note:: - Use the pkg-config tool to get rid of large hard-coded paths and make compilation more portable. - - -Usage Examples --------------- - -|onedpl_short| sample code is available from the -`oneAPI GitHub samples repository `_. -Each sample includes a readme with build instructions. - -\ Header Usage Example -****************************************** - -This example illustrates |onedpl_short| random number generator usage. -The sample below shows you how to create an random number generator engine object (the source of pseudo-randomness), -a distribution object (specifying the desired probability distribution), and how to generate -the random numbers themselves. Random number generation is performed in a vectorized manner -to improve the speed of your computations. - -This example performs its computations on your default SYCL device. You can set the -``SYCL_DEVICE_TYPE`` environment variable to CPU or GPU. - -.. 
code:: cpp - - template - void random_fill(float* usmptr, std::size_t n) { - auto zero = oneapi::dpl::counting_iterator(0); - - std::for_each(oneapi::dpl::execution::dpcpp_default, - zero, zero + n/VecSize, - [usmptr](std::size_t i) { - auto offset = i * VecSize; - - oneapi::dpl::minstd_rand_vec engine(seed, offset); - oneapi::dpl::uniform_real_distribution> distr; - - auto res = distr(engine); - res.store(i, sycl::global_ptr(usmptr)); - }); - } - -Pi Benchmark Usage Example -************************** - -This example uses a Monte Carlo method to estimate the value of π. -The basic idea is to generate random points within a square, and to check what -fraction of these random points lie in a quarter-circle inscribed within that square. -The expected value is the ratio of the areas of the quarter-circle and the square (π/4). -You can take the observed fraction of points in the quarter-circle as an estimate of π/4. - -This example shows you how to create an random number generator engine object (the source of pseudo-randomness), -a distribution object (specifying the desired probability distribution), generate the -random numbers themselves, and then perform a reduction to count quantity of points that -fit into the square *S*. Random number generation is performed in scalar manner to simplify your code. - - -.. figure:: images/pi_benchmark.png - :alt: An image of pi chart. - -.. 
code:: cpp - - float estimated_pi; - { - sycl::queue q(sycl::gpu_selector_v); - auto policy = oneapi::dpl::execution::make_device_policy(q); - - float sum = std::transform_reduce( policy, - oneapi::dpl::counting_iterator(0), - oneapi::dpl::counting_iterator(N), - 0.0f, - std::plus{}, - [=](int n){ - float local_sum = 0.0f; - oneapi::dpl::minstd_rand engine(SEED, n * ITER * 2); - oneapi::dpl::uniform_real_distribution distr; - for(int i = 0; i < ITER; ++i) { - float x = distr(engine); - float y = distr(engine); - if (x * x + y * y <= 1.0) - local_sum += 1.0; - } - return local_sum / (float)ITER; - } - ); - estimated_pi = 4.0f * (float)sum / N; - } - - -Find More -========= - -.. list-table:: - :widths: 50 50 - :header-rows: 1 - - * - Resource Link - - Description - * - `Intel® oneAPI DPC++ Library Guide `_ - - Refer to the |onedpl_short| guide for more in depth information. - * - `System Requirements `_ - - Check system requirements before you install |onedpl_short|. - * - `Intel® oneAPI DPC++ Library Release Notes `_ - - Check the release notes to learn about updates in the latest release. - * - `oneDPL Samples `_ - - Learn how to use |onedpl_short| with samples. - * - `Layers for Yocto* Project `_ - - Add oneAPI components to a Yocto project build using the meta-intel layers. - * - `oneAPI Samples Catalog `_ - - Explore the complete list of oneAPI code samples in the oneAPI Samples Catalog (GitHub*). These samples were designed to help you develop, offload, and optimize multiarchitecture applications targeting CPUs, GPUs, and FPGAs. \ No newline at end of file +Get Started with the |onedpl_long| +################################## + +|onedpl_long| (|onedpl_short|) works with the |dpcpp_cpp_with_gsg_link|_ +to provide high-productivity APIs to developers, which can minimize SYCL* +programming efforts across devices for high performance parallel applications. 
+
+|onedpl_short| consists of the following components:
+
+* Parallel API
+* API for SYCL Kernels
+* Macros
+
+
+For general information about |onedpl_short|, visit the `oneDPL GitHub* repository `_,
+or visit the |onedpl_library_guide|_ and the `Intel® oneAPI DPC++ Library main page
+`_.
+
+Quick Start
+===========
+
+Installation
+------------
+
+Visit the |onedpl_short| `Release Notes
+`_
+page for:
+
+* Where to Find the Release
+* Overview
+* New Features
+* Fixed Issues
+* Known Issues and Limitations
+
+Install the `Intel® oneAPI Base Toolkit (Base Kit) `_
+to use |onedpl_short|.
+
+To use Parallel API, include the corresponding header files in your source code.
+
+All |onedpl_short| header files are in the ``oneapi/dpl`` directory. Use ``#include <oneapi/dpl/...>`` to include them.
+|onedpl_short| uses the namespace ``oneapi::dpl`` for most of its classes and functions.
+
+To use tested C++ standard APIs, you need to include the corresponding C++ standard header files
+and use the ``std`` namespace.
+
+CMake Support
+-------------
+`CMake `_ generates build scripts which can then be used
+to build and link your application. |onedpl_short| can be added to your project via CMake.
+
+A simple example for Linux is provided below. For more detailed usage and options including details specific to Windows,
+please refer to the |dpcpp_cmake_support|_.
+
+Simple Example CMake File
+*************************
+To use |onedpl_short| with CMake, create a CMakeLists.txt file for your project's base directory and use
+`find_package `_
+and `target_link_libraries `_ to add oneDPL.
+For example:
+
+.. code:: cmake
+
+   project(Foo)
+   add_executable(foo foo.cpp)
+
+   # Search to find oneDPL
+   find_package(oneDPL REQUIRED)
+
+   # Connect oneDPL to foo
+   target_link_libraries(foo oneDPL)
+
+Simple Example CMake Invocation
+*******************************
+The following is an example CMake invocation which generates build scripts for the project in the parent directory:
+
+..
 code:: bash
+
+   mkdir build && cd build
+   cmake -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=release ..
+
+Example Build Command
+*********************
+Once build scripts have been generated for your desired configuration following the instruction above, a `build command
+`_ can be issued to build your project:
+
+.. code:: bash
+
+   cmake --build .
+
+pkg-config Support
+------------------
+
+The pkg-config program is used to retrieve information about your installed libraries, and
+to compile and link against one or more libraries.
+
+Use pkg-config with |onedpl_short|
+**********************************
+
+Use pkg-config with the ``--cflags`` flag to get the include path to the oneDPL directory:
+
+.. code:: bash
+
+   icpx -fsycl foo.cpp $(pkg-config --cflags dpl)
+
+The ``--msvc-syntax`` flag is required when you use a Microsoft Visual C++* compiler.
+This flag converts your compiling and linking flags to the appropriate form:
+
+.. code:: bash
+
+   icpx -fsycl foo.cpp $(pkg-config --msvc-syntax --cflags dpl)
+
+.. note::
+   Use the pkg-config tool to get rid of large hard-coded paths and make compilation more portable.
+
+
+Usage Examples
+--------------
+
+|onedpl_short| sample code is available from the
+`oneAPI GitHub samples repository `_.
+Each sample includes a readme with build instructions.
+
+``<oneapi/dpl/random>`` Header Usage Example
+********************************************
+
+This example illustrates |onedpl_short| random number generator usage.
+The sample below shows you how to create a random number generator engine object (the source of pseudo-randomness),
+a distribution object (specifying the desired probability distribution), and how to generate
+the random numbers themselves. Random number generation is performed in a vectorized manner
+to improve the speed of your computations.
+
+This example performs its computations on your default SYCL device. You can set the
+``SYCL_DEVICE_TYPE`` environment variable to CPU or GPU.
+
+..
 code:: cpp
+
+   template <int VecSize>
+   void random_fill(float* usmptr, std::size_t n) {
+       auto zero = oneapi::dpl::counting_iterator<std::size_t>(0);
+
+       std::for_each(oneapi::dpl::execution::dpcpp_default,
+                     zero, zero + n/VecSize,
+                     [usmptr](std::size_t i) {
+                         auto offset = i * VecSize;
+
+                         oneapi::dpl::minstd_rand_vec<VecSize> engine(seed, offset);
+                         oneapi::dpl::uniform_real_distribution<sycl::vec<float, VecSize>> distr;
+
+                         auto res = distr(engine);
+                         res.store(i, sycl::global_ptr<float>(usmptr));
+                     });
+   }
+
+Pi Benchmark Usage Example
+**************************
+
+This example uses a Monte Carlo method to estimate the value of π.
+The basic idea is to generate random points within a square, and to check what
+fraction of these random points lie in a quarter-circle inscribed within that square.
+The expected value is the ratio of the areas of the quarter-circle and the square (π/4).
+You can take the observed fraction of points in the quarter-circle as an estimate of π/4.
+
+This example shows you how to create a random number generator engine object (the source of pseudo-randomness),
+a distribution object (specifying the desired probability distribution), generate the
+random numbers themselves, and then perform a reduction to count the number of points that
+fit into the square *S*. Random number generation is performed in a scalar manner to simplify your code.
+
+
+.. figure:: images/pi_benchmark.png
+   :alt: An image of a pi chart.
+
+..
 code:: cpp
+
+   float estimated_pi;
+   {
+       sycl::queue q(sycl::gpu_selector_v);
+       auto policy = oneapi::dpl::execution::make_device_policy(q);
+
+       float sum = std::transform_reduce( policy,
+                                          oneapi::dpl::counting_iterator<int>(0),
+                                          oneapi::dpl::counting_iterator<int>(N),
+                                          0.0f,
+                                          std::plus<float>{},
+                                          [=](int n){
+                                              float local_sum = 0.0f;
+                                              oneapi::dpl::minstd_rand engine(SEED, n * ITER * 2);
+                                              oneapi::dpl::uniform_real_distribution<float> distr;
+                                              for(int i = 0; i < ITER; ++i) {
+                                                  float x = distr(engine);
+                                                  float y = distr(engine);
+                                                  if (x * x + y * y <= 1.0)
+                                                      local_sum += 1.0;
+                                              }
+                                              return local_sum / (float)ITER;
+                                          }
+       );
+       estimated_pi = 4.0f * (float)sum / N;
+   }
+
+
+Find More
+=========
+
+.. list-table::
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Resource Link
+     - Description
+   * - |onedpl_library_guide|_
+     - Refer to the |onedpl_short| guide for more in-depth information.
+   * - `System Requirements `_
+     - Check system requirements before you install |onedpl_short|.
+   * - `Intel® oneAPI DPC++ Library Release Notes
+       `_
+     - Check the release notes to learn about updates in the latest release.
+   * - `oneDPL Samples `_
+     - Learn how to use |onedpl_short| with samples.
+   * - |yocto_layers|_
+     - Add oneAPI components to a Yocto project build using the meta-intel layers.
+   * - `oneAPI Samples Catalog `_
+     - Explore the complete list of oneAPI code samples in the oneAPI Samples Catalog (GitHub*).
+       These samples were designed to help you develop, offload, and optimize multiarchitecture applications targeting CPUs, GPUs, and FPGAs.
\ No newline at end of file
diff --git a/_sources/parallel_api/additional_algorithms.rst b/_sources/parallel_api/additional_algorithms.rst
index 6827f7d606..5223ba8d91 100644
--- a/_sources/parallel_api/additional_algorithms.rst
+++ b/_sources/parallel_api/additional_algorithms.rst
@@ -1,147 +1,157 @@
-Additional Algorithms
-######################
-
-The definitions of the algorithms listed below are available through the ``oneapi/dpl/algorithm``
-header.
All algorithms are implemented in the ``oneapi::dpl`` namespace. - -* ``reduce_by_segment``: performs partial reductions on a sequence's values and keys. Each - reduction is computed with a given reduction operation for a contiguous subsequence of values, which are - determined by keys being equal according to a predicate. A return value is a pair of iterators holding - the end of the output sequences for keys and values. - - For correct computation, the reduction operation should be associative. If no operation is specified, - the default operation for the reduction is ``std::plus``, and the default predicate is ``std::equal_to``. - The algorithm requires that the type of the elements used for values be default constructible. For example:: - - keys: [0,0,0,1,1,1] - values: [1,2,3,4,5,6] - output_keys: [0,1] - output_values: [1+2+3=6,4+5+6=15] - -* ``inclusive_scan_by_segment``: performs partial prefix scans on a sequence's values. Each - scan applies to a contiguous subsequence of values, which are determined by the keys associated with the - values being equal. The return value is an iterator targeting the end of the result sequence. - - For correct computation, the prefix scan operation should be associative. If no operation is specified, - the default operation is ``std::plus``, and the default predicate is ``std::equal_to``. The algorithm - requires that the type of the elements used for values be default constructible. For example:: - - keys: [0,0,0,1,1,1] - values: [1,2,3,4,5,6] - result: [1,1+2=3,1+2+3=6,4,4+5=9,4+5+6=15] - -* ``exclusive_scan_by_segment``: performs partial prefix scans on a sequence's values. Each - scan applies to a contiguous subsequence of values that are determined by the keys associated with the values - being equal, and sets the first element to the initial value provided. The return value is an iterator - targeting the end of the result sequence. - - For correct computation, the prefix scan operation should be associative. 
If no operation is specified, - the default operation is ``std::plus``, and the default predicate is ``std::equal_to``. For example:: - - keys: [0,0,0,1,1,1] - values: [1,2,3,4,5,6] - initial value: [0] - result: [0,0+1=1,0+1+2=3,0,0+4=4,0+4+5=9] - -* ``binary_search``: performs a binary search of the input sequence for each of the values in - the search sequence provided. For each element of the search sequence the algorithm writes a boolean value - to the result sequence that indicates whether the search value was found in the input sequence. An iterator - to one past the last value in the result sequence is returned. The algorithm assumes the input sequence has - been sorted by the comparator provided. If no comparator is provided, then a function object that uses - ``operator<`` to compare the elements is used. For example:: - - input sequence: [0, 2, 2, 2, 3, 3, 3, 3, 6, 6] - search sequence: [0, 2, 4, 7, 6] - result sequence: [true, true, false, false, true] - -* ``lower_bound``: performs a binary search of the input sequence for each of the values in - the search sequence provided to identify the lowest index in the input sequence where the search value could - be inserted without violating the sorted ordering of the input sequence. The lowest index for each search - value is written to the result sequence, and the algorithm returns an iterator to one past the last value - written to the result sequence. If no comparator is provided, then a function object that uses ``operator<`` - to compare the elements is used. For example:: - - input sequence: [0, 2, 2, 2, 3, 3, 3, 3, 6, 6] - search sequence: [0, 2, 4, 7, 6] - result sequence: [0, 1, 8, 10, 8] - -* ``upper_bound``: performs a binary search of the input sequence for each of the values in - the search sequence provided to identify the highest index in the input sequence where the search value could - be inserted without violating the sorted ordering of the input sequence. 
The highest index for each search - value is written to the result sequence, and the algorithm returns an iterator to one past the last value - written to the result sequence. If no comparator is provided, then a function object that uses ``operator<`` - to compare the elements is used. For example:: - - input sequence: [0, 2, 2, 2, 3, 3, 3, 3, 6, 6] - search sequence: [0, 2, 4, 7, 6] - result sequence: [1, 4, 8, 10, 10] - -* ``sort_by_key``: performs a stable key-value sort. The algorithm sorts the sequence's keys according to - a comparioson operator. If no comparator is provided, then the elements are compared with ``operator<``. - The sequence's values are permutated according to the sorted sequence's keys. The prerequisite for correct - behavior is that the size for both keys sequence and values sequence shall be the same. - For example:: - - keys: [3, 5, 0, 4, 3, 0] - values: ['a', 'b', 'c', 'd', 'e', 'f'] - output_keys: [0, 0, 3, 3, 4, 5] - output_values: ['c', 'f', 'a', 'e', 'd', 'b'] - -* ``transform_if``: performs a transform on the input sequence(s) elements and stores the result into the - corresponding position in the output sequence at each position for which the predicate applied to the - element(s) evaluates to ``true``. If the predicate evaluates to ``false``, the transform is not applied for - the elements(s), and the output sequence's corresponding position is left unmodified. There are two overloads - of this function, one for a single input sequence with a unary transform and a unary predicate, and another - for two input sequences and a binary transform and a binary predicate. 
- - Unary example:: - - unary predicate: [](auto i){return i % 2 == 0;} // is even - unary transform: [](auto i){return i * 2;} // double element - input sequence: [0, 1, 2, 3, 3, 3, 4, 4, 7, 6] - original output sequence: [9, 8, 7, 6, 5, 4, 3, 2, 1, 0] - final output sequence: [0, 8, 4, 6, 5, 4, 8, 8, 1, 12] - - - Binary example:: - - binary predicate: [](auto a, auto b){return a == b;} // are equal - unary transform: [](auto a, auto b){return a + b;} // sum values - input sequence1: [0, 1, 2, 3, 3, 3, 4, 4, 7, 6] - input sequence2: [5, 1, 3, 4, 3, 3, 4, 4, 7, 9] - original output sequence: [9, 9, 9, 9, 9, 9, 9, 9, 9, 9] - final output sequence: [9, 2, 9, 9, 6, 6, 8, 8, 14, 9] - -* ``histogram``: performs a histogram on a sequence of of input elements. Histogram counts the number of - elements which map to each of a defined set of bins. The algorithm has two overloads. - - The first overload takes as input the number of bins, range minimum, and range maximum, then evenly - divides bins within that range. An input element ``a`` maps to a bin ``i`` such that - ``i = floor((a - minimum) / ((maximum - minimum) / num_bins)))``. - - The other overload defines ``m`` bins from a sorted sequence of ``m + 1`` user-provided boundaries - where an input element ``a`` maps to a bin ``i`` if and only if - ``__boundary_first[i] <= a < __boundary_first[i + 1]``. - - Input values which do not map to a defined bin are skipped silently. The algorithm counts the number of - input elements which map to each bin and outputs the result to a user-provided sequence of ``m`` output - bin counts. The user must provide sufficient output data to store each bin, and the type of the output - sequence must be sufficient to store the counts of the histogram without overflow. All input and output - sequences must be ``RandomAccessIterators``. Histogram currently only supports execution with device - policies. 
- - Evenly divided bins example:: - - inputs: [9, 9, 3, 8, 4, 4, 4, 5, 1, 99] - num_bins: 5 - min: 0 - max: 10 - output: [1, 1, 4, 0 3] - - Custom range bins example:: - - inputs: [9, 9, 3, 8, 4, 4, 4, 5, 1, 99] - boundaries: [-1, 0, 8, 12] - output: [0, 6, 3] - - +Additional Algorithms +###################### + +The definitions of the algorithms listed below are available through the ```` +header. All algorithms are implemented in the ``oneapi::dpl`` namespace. + +* ``reduce_by_segment``: performs partial reductions on a sequence's values and keys. Each + reduction is computed with a given reduction operation for a contiguous subsequence of values, which are + determined by keys being equal according to a predicate. A return value is a pair of iterators holding + the end of the output sequences for keys and values. + + For correct computation, the reduction operation should be associative. If no operation is specified, + the default operation for the reduction is ``std::plus``, and the default predicate is ``std::equal_to``. + The algorithm requires that the type of the elements used for values be default constructible. For example:: + + keys: [0,0,0,1,1,1] + values: [1,2,3,4,5,6] + output_keys: [0,1] + output_values: [1+2+3=6,4+5+6=15] + +* ``inclusive_scan_by_segment``: performs partial prefix scans on a sequence's values. Each + scan applies to a contiguous subsequence of values, which are determined by the keys associated with the + values being equal. The return value is an iterator targeting the end of the result sequence. + + For correct computation, the prefix scan operation should be associative. If no operation is specified, + the default operation is ``std::plus``, and the default predicate is ``std::equal_to``. The algorithm + requires that the type of the elements used for values be default constructible. 
For example:: + + keys: [0,0,0,1,1,1] + values: [1,2,3,4,5,6] + result: [1,1+2=3,1+2+3=6,4,4+5=9,4+5+6=15] + +* ``exclusive_scan_by_segment``: performs partial prefix scans on a sequence's values. Each + scan applies to a contiguous subsequence of values that are determined by the keys associated with the values + being equal, and sets the first element to the initial value provided. The return value is an iterator + targeting the end of the result sequence. + + For correct computation, the prefix scan operation should be associative. If no operation is specified, + the default operation is ``std::plus``, and the default predicate is ``std::equal_to``. For example:: + + keys: [0,0,0,1,1,1] + values: [1,2,3,4,5,6] + initial value: [0] + result: [0,0+1=1,0+1+2=3,0,0+4=4,0+4+5=9] + +* ``binary_search``: performs a binary search of the input sequence for each of the values in + the search sequence provided. For each element of the search sequence the algorithm writes a boolean value + to the result sequence that indicates whether the search value was found in the input sequence. An iterator + to one past the last value in the result sequence is returned. The algorithm assumes the input sequence has + been sorted by the comparator provided. If no comparator is provided, then a function object that uses + ``operator<`` to compare the elements is used. For example:: + + input sequence: [0, 2, 2, 2, 3, 3, 3, 3, 6, 6] + search sequence: [0, 2, 4, 7, 6] + result sequence: [true, true, false, false, true] + +* ``lower_bound``: performs a binary search of the input sequence for each of the values in + the search sequence provided to identify the lowest index in the input sequence where the search value could + be inserted without violating the sorted ordering of the input sequence. The lowest index for each search + value is written to the result sequence, and the algorithm returns an iterator to one past the last value + written to the result sequence. 
If no comparator is provided, then a function object that uses ``operator<`` + to compare the elements is used. For example:: + + input sequence: [0, 2, 2, 2, 3, 3, 3, 3, 6, 6] + search sequence: [0, 2, 4, 7, 6] + result sequence: [0, 1, 8, 10, 8] + +* ``upper_bound``: performs a binary search of the input sequence for each of the values in + the search sequence provided to identify the highest index in the input sequence where the search value could + be inserted without violating the sorted ordering of the input sequence. The highest index for each search + value is written to the result sequence, and the algorithm returns an iterator to one past the last value + written to the result sequence. If no comparator is provided, then a function object that uses ``operator<`` + to compare the elements is used. For example:: + + input sequence: [0, 2, 2, 2, 3, 3, 3, 3, 6, 6] + search sequence: [0, 2, 4, 7, 6] + result sequence: [1, 4, 8, 10, 10] + +* ``sort_by_key``: performs a key-value sort. + The algorithm sorts a sequence of keys using a given comparison function object. + If it is not provided, the elements are compared with ``operator<``. + A sequence of values is simultaneously permuted according to the sorted order of keys. + There must be at least as many values as the keys, otherwise the behavior is undefined. + + For example:: + + keys: [3, 5, 0, 4, 3, 0] + values: ['a', 'b', 'c', 'd', 'e', 'f'] + output_keys: [0, 0, 3, 3, 4, 5] + output_values: ['c', 'f', 'a', 'e', 'd', 'b'] + +.. note:: + ``sort_by_key`` currently implements a stable sort for device execution policies, + but may implement an unstable sort in the future. + Use ``stable_sort_by_key`` if stability is essential. + +* ``stable_sort_by_key``: performs a key-value sort similar to ``sort_by_key``, + but with the added guarantee of stability. 
+
+* ``transform_if``: performs a transform on the input sequence(s) elements and stores the result into the
+  corresponding position in the output sequence at each position for which the predicate applied to the
+  element(s) evaluates to ``true``. If the predicate evaluates to ``false``, the transform is not applied for
+  the element(s), and the output sequence's corresponding position is left unmodified. There are two overloads
+  of this function, one for a single input sequence with a unary transform and a unary predicate, and another
+  for two input sequences and a binary transform and a binary predicate.
+
+  Unary example::
+
+    unary predicate: [](auto i){return i % 2 == 0;} // is even
+    unary transform: [](auto i){return i * 2;} // double element
+    input sequence:           [0, 1, 2, 3, 3, 3, 4, 4, 7, 6]
+    original output sequence: [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
+    final output sequence:    [0, 8, 4, 6, 5, 4, 8, 8, 1, 12]
+
+
+  Binary example::
+
+    binary predicate: [](auto a, auto b){return a == b;} // are equal
+    binary transform: [](auto a, auto b){return a + b;} // sum values
+    input sequence1:          [0, 1, 2, 3, 3, 3, 4, 4, 7, 6]
+    input sequence2:          [5, 1, 3, 4, 3, 3, 4, 4, 7, 9]
+    original output sequence: [9, 9, 9, 9, 9, 9, 9, 9, 9, 9]
+    final output sequence:    [9, 2, 9, 9, 6, 6, 8, 8, 14, 9]
+
+* ``histogram``: performs a histogram on a sequence of input elements. Histogram counts the number of
+  elements which map to each of a defined set of bins. The algorithm has two overloads.
+
+  The first overload takes as input the number of bins, range minimum, and range maximum, then evenly
+  divides bins within that range. An input element ``a`` maps to a bin ``i`` such that
+  ``i = floor((a - minimum) / ((maximum - minimum) / num_bins))``.
+
+  The other overload defines ``m`` bins from a sorted sequence of ``m + 1`` user-provided boundaries
+  where an input element ``a`` maps to a bin ``i`` if and only if
+  ``__boundary_first[i] <= a < __boundary_first[i + 1]``.
+
+  Input values which do not map to a defined bin are skipped silently. The algorithm counts the number of
+  input elements which map to each bin and outputs the result to a user-provided sequence of ``m`` output
+  bin counts. The user must provide sufficient output data to store each bin, and the type of the output
+  sequence must be sufficient to store the counts of the histogram without overflow. All input and output
+  sequences must be ``RandomAccessIterators``. Histogram currently only supports execution with device
+  policies.
+
+  Evenly divided bins example::
+
+    inputs:   [9, 9, 3, 8, 4, 4, 4, 5, 1, 99]
+    num_bins: 5
+    min:      0
+    max:      10
+    output:   [1, 1, 4, 0, 3]
+
+  Custom range bins example::
+
+    inputs:     [9, 9, 3, 8, 4, 4, 4, 5, 1, 99]
+    boundaries: [-1, 0, 8, 12]
+    output:     [0, 6, 3]
+
+
diff --git a/_sources/parallel_api/async_api.rst b/_sources/parallel_api/async_api.rst
index d6e6f77fe1..242b30c92a 100644
--- a/_sources/parallel_api/async_api.rst
+++ b/_sources/parallel_api/async_api.rst
@@ -1,5 +1,5 @@
-Asynchronous API Algorithms
-###########################
+Asynchronous Algorithms
+#######################
 
 The functions defined in the STL ```` or ```` headers are traditionally blocking. |onedpl_long|
 (|onedpl_short|) extends the functionality of the C++17 parallel algorithms by providing asynchronous
 algorithms with non-blocking behavior.
@@ -55,19 +55,16 @@ Example of Async API Usage #include int main() { - using namespace oneapi; - { - /* Build and compute a simple dependency chain: Fill buffer -> Transform -> Reduce */ - sycl::buffer a{10}; - - auto fut1 = dpl::experimental::fill_async(dpl::execution::dpcpp_default, - dpl::begin(a),dpl::end(a),7); - - auto fut2 = dpl::experimental::transform_async(dpl::execution::dpcpp_default, - dpl::begin(a),dpl::end(a),dpl::begin(a), - [&](const int& x){return x + 1; },fut1); - auto ret_val = dpl::experimental::reduce_async(dpl::execution::dpcpp_default, - dpl::begin(a),dpl::end(a),fut1,fut2).get(); - } + /* Build and compute a simple dependency chain: Fill buffer -> Transform -> Reduce */ + sycl::buffer a{10}; + + auto fut1 = dpl::experimental::fill_async(dpl::execution::dpcpp_default, + dpl::begin(a),dpl::end(a),7); + + auto fut2 = dpl::experimental::transform_async(dpl::execution::dpcpp_default, + dpl::begin(a),dpl::end(a),dpl::begin(a), + [&](const int& x){return x + 1; },fut1); + auto ret_val = dpl::experimental::reduce_async(dpl::execution::dpcpp_default, + dpl::begin(a),dpl::end(a),fut1,fut2).get(); return 0; } diff --git a/_sources/parallel_api/execution_policies.rst b/_sources/parallel_api/execution_policies.rst index ad922c04f2..a5188446d4 100644 --- a/_sources/parallel_api/execution_policies.rst +++ b/_sources/parallel_api/execution_policies.rst @@ -1,68 +1,70 @@ Execution Policies ################## -The implementation supports the device execution policies used to run the massive parallel -computational model for heterogeneous systems. The policies are specified in -the |onedpl_long| (|onedpl_short|) section of the `oneAPI Specification -`_. - -For any of the implemented algorithms, pass one of the execution policy objects as the first -argument in a call to specify the desired execution behavior. 
The policies have -the following meaning: - -================================= ============================== -Execution Policy Value Description -================================= ============================== -``seq`` Sequential execution. ---------------------------------- ------------------------------ -``unseq`` Unsequenced SIMD execution. This policy requires that - all functions provided are SIMD-safe. ---------------------------------- ------------------------------ -``par`` Parallel execution by multiple threads. ---------------------------------- ------------------------------ -``par_unseq`` Combined effect of ``unseq`` and ``par``. ---------------------------------- ------------------------------ -``dpcpp_default`` Massive parallel execution on devices using |dpcpp_short|. ---------------------------------- ------------------------------ -``dpcpp_fpga`` Massive parallel execution on FPGA devices. -================================= ============================== +According to `the oneAPI specification +`_, +|onedpl_long| (|onedpl_short|) provides execution policies semantically aligned with the C++ standard, +referred to as *standard-aligned* or *host execution policies*, as well as *device execution policies* +to run data parallel computations on heterogeneous systems. + +The execution policies are defined in the ``oneapi::dpl::execution`` namespace and provided +in the ```` header. The policies have the following meaning: + +====================== ===================================================== +Policy Name / Type Description +====================== ===================================================== +``seq`` The standard-aligned policy for sequential execution. +---------------------- ----------------------------------------------------- +``unseq`` The standard-aligned policy for possible unsequenced SIMD execution. + This policy requires user-provided functions to be SIMD-safe. 
+---------------------- ----------------------------------------------------- +``par`` The standard-aligned policy for possible parallel execution by multiple threads. +---------------------- ----------------------------------------------------- +``par_unseq`` The standard-aligned policy with the combined effect of ``unseq`` and ``par``. +---------------------- ----------------------------------------------------- +``device_policy`` The class template to create device policies for data parallel execution. +---------------------- ----------------------------------------------------- +``dpcpp_default`` The device policy for data parallel execution on the default SYCL device. +---------------------- ----------------------------------------------------- +``fpga_policy`` The class template to create policies for execution on FPGA devices. +---------------------- ----------------------------------------------------- +``dpcpp_fpga`` The device policy for data parallel execution on a SYCL FPGA device. +====================== ===================================================== The implementation is based on Parallel STL from the `LLVM Project `_. |onedpl_short| supports two parallel backends for execution with ``par`` and ``par_unseq`` policies: -#. TBB backend (enabled by default) uses |onetbb_long| or |tbb_long| for parallel execution. +#. The TBB backend (enabled by default) uses |onetbb_long| or |tbb_long| for parallel execution. -#. OpenMP backend uses OpenMP* pragmas for parallel execution. Visit +#. The OpenMP backend uses OpenMP* pragmas for parallel execution. Visit :doc:`Macros <../macros>` for the information how to enable the OpenMP backend. +OpenMP pragmas are also used for SIMD execution with ``unseq`` and ``par_unseq`` policies. + Follow these steps to add Parallel API to your application: #. Add ``#include `` to your code. Then include one or more of the following header files, depending on the algorithms you intend to use: - #. ``#include `` - #. 
``#include `` - #. ``#include `` - - For better coexistence with the C++ standard library, - include |onedpl_short| header files before the standard C++ ones. + - ``#include `` + - ``#include `` + - ``#include `` -#. Pass a |onedpl_short| execution policy object, defined in the ``oneapi::dpl::execution`` - namespace, to a parallel algorithm. -#. Use the C++ standard execution policies: +#. Pass a |onedpl_short| execution policy object as the first argument to a parallel algorithm + to indicate the desired execution behavior. - #. Compile the code with options that enable OpenMP parallelism and/or vectorization pragmas. - #. Link with the |onetbb_long| or |tbb_long| dynamic library for TBB-based parallelism. +#. If you use the standard-aligned execution policies: -#. Use the device execution policies: + - Compile the code with options that enable OpenMP parallelism and/or SIMD vectorization pragmas. + - Compile and link with the |onetbb_short| or |tbb_short| library for TBB-based parallelism. - #. Compile the code with options that enable support for SYCL 2020. + If you use the device execution policies, compile the code with options that enable support for SYCL 2020. -Use the C++ Standard Execution Policies -======================================= +Use the C++ Standard Aligned Execution Policies +=============================================== Example: @@ -80,12 +82,10 @@ Example: } Use the Device Execution Policies -======================================== +================================= -The device execution policy specifies where a parallel algorithm runs. -It encapsulates a SYCL device or queue and allows you to -set an optional kernel name. Device execution policies can be used with all -standard C++ algorithms that support execution policies. +The device execution policy specifies where a |onedpl_short| parallel algorithm runs. +It encapsulates a SYCL device or queue and allows you to set an optional kernel name. 
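As an illustrative sketch of creating such policies (assuming the ``oneapi::dpl::execution::make_device_policy`` helper; the queue, selector, and kernel name here are illustrative, and the snippet requires a SYCL toolchain to compile):

```cpp
#include <oneapi/dpl/execution>
#include <sycl/sycl.hpp>

// A device policy wrapping an existing SYCL queue;
// class my_kernel is an illustrative kernel name.
sycl::queue q{sycl::default_selector_v};
auto policy_from_queue =
    oneapi::dpl::execution::make_device_policy<class my_kernel>(q);

// A device policy wrapping a SYCL device directly.
auto policy_from_device =
    oneapi::dpl::execution::make_device_policy(sycl::device{sycl::default_selector_v});
```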
To create a policy object, you may use one of the following constructor arguments: @@ -100,8 +100,8 @@ names for SYCL kernel functions. The |dpcpp_cpp| supports it by default; for other compilers it may need to be enabled with compilation options such as ``-fsycl-unnamed-lambda``. Refer to your compiler documentation for more information. -The ``oneapi::dpl::execution::dpcpp_default`` object is a predefined object of -the ``device_policy`` class. It is created with a default kernel name and a default queue. +The ``oneapi::dpl::execution::dpcpp_default`` object is a predefined immutable object of +the ``device_policy`` class. It is created with a default kernel name and uses a default queue. Use it to construct customized policy objects or pass directly when invoking an algorithm. If ``dpcpp_default`` is passed directly to more than one algorithm, you must ensure that the @@ -117,8 +117,8 @@ and ``using namespace sycl;`` directives when referring to policy classes and fu .. code:: cpp - auto policy_a = device_policy {}; - std::for_each(policy_a, ...); + auto policy_a = device_policy {}; + std::for_each(policy_a, ...); .. code:: cpp @@ -167,18 +167,16 @@ The default constructor of ``fpga_policy`` wraps a SYCL queue created for ``fpga_selector``, or for ``fpga_emulator_selector`` if the ``ONEDPL_FPGA_EMULATOR`` is defined. -``oneapi::dpl::execution::dpcpp_fpga`` is a predefined object of +``oneapi::dpl::execution::dpcpp_fpga`` is a predefined immutable object of the ``fpga_policy`` class created with a default unroll factor and a default kernel name. Use it to create customized policy objects or pass directly when invoking an algorithm. .. Note:: Specifying the unroll factor for a policy enables loop unrolling in the implementation of - your algorithms. The default value is 1. - To find out how to choose a more precise value, refer to the `unroll Pragma `_ - and `Loop Analysis `_ content in - the `Intel® oneAPI FPGA Handbook - `_. + |onedpl_short| algorithms. 
The default value is 1. + To find out how to choose a more precise value, refer to the |unroll_pragma|_ + and |loop_analysis|_ content in the |fpga_handbook|_. The ``make_fpga_policy`` function templates simplify ``fpga_policy`` creation. @@ -197,10 +195,10 @@ The code below assumes you have added ``using namespace oneapi::dpl::execution;` Error Handling with Device Execution Policies -==================================================== +============================================= -The SYCL error handling model supports two types of errors: Synchronous errors cause the SYCL host -runtime libraries throw exceptions. Asynchronous errors may only be processed in a user-supplied error handler +The SYCL error handling model supports two types of errors. Synchronous errors cause the SYCL API functions +to throw exceptions. Asynchronous errors may only be processed in a user-supplied error handler associated with a SYCL queue. For algorithms executed with device policies, handling all errors, synchronous or asynchronous, is a @@ -213,4 +211,4 @@ responsibility of the caller. Specifically: To process SYCL asynchronous errors, the queue associated with a device policy must be created with an error handler object. The predefined policy objects (``dpcpp_default``, etc.) have -no error handlers; do not use them if you need to process asynchronous errors. \ No newline at end of file +no error handlers; do not use them if you need to process asynchronous errors. diff --git a/_sources/parallel_api/iterators.rst b/_sources/parallel_api/iterators.rst index 13884eafa2..c67c9da2b4 100644 --- a/_sources/parallel_api/iterators.rst +++ b/_sources/parallel_api/iterators.rst @@ -1,19 +1,20 @@ Iterators ######### -The definitions of the iterators are available through the ``oneapi/dpl/iterator`` +The definitions of the iterators are available through the ```` header. All iterators are implemented in the ``oneapi::dpl`` namespace. 
* ``counting_iterator``: a random-access iterator-like type whose dereferenced value is an integer counter.
   Instances of a ``counting_iterator`` provide read-only dereference operations. The counter of a
-  ``counting_iterator`` instance changes according to the arithmetic of the random-access iterator type::
+  ``counting_iterator`` instance changes according to the arithmetic of the random-access iterator type:
+
+  .. code:: cpp
 
-     using namespace oneapi;
      dpl::counting_iterator<int> count_a(0);
      dpl::counting_iterator<int> count_b = count_a + 10;
      int init = count_a[0]; // OK: init == 0
      *count_b = 7; // ERROR: counting_iterator does not provide write operations
-     auto sum = std::reduce(dpl::execution::dpcpp_default,
+     auto sum = dpl::reduce(dpl::execution::dpcpp_default,
                             count_a, count_b, init); // sum is (0 + 0 + 1 + ... + 9) = 45
 
 * ``zip_iterator``: an iterator constructed with one or more iterators as input. The result of
@@ -26,20 +27,22 @@ header. All iterators are implemented in the ``oneapi::dpl`` namespace.
 
   The ``zip_iterator`` is useful in defining by key algorithms where input iterators representing keys
   and values are processed as key-value pairs. The example below demonstrates a stable sort
-  by key, where only the keys are compared but both keys and values are swapped::
+  by key, where only the keys are compared but both keys and values are swapped:
+
+  .. code:: cpp
 
-     using namespace oneapi;
      auto zipped_begin = dpl::make_zip_iterator(keys_begin, vals_begin);
-     std::stable_sort(dpl::execution::dpcpp_default, zipped_begin, zipped_begin + n,
+     dpl::stable_sort(dpl::execution::dpcpp_default, zipped_begin, zipped_begin + n,
                       [](auto lhs, auto rhs) { return get<0>(lhs) < get<0>(rhs); });
 
   The dereferenced object of ``zip_iterator`` supports the *structured binding* feature
   (`C++17 and above `_) for easier access to
-  wrapped iterators values::
+  wrapped iterator values:
+
+  ..
code:: cpp - using namespace oneapi; auto zipped_begin = dpl::make_zip_iterator(sequence1.begin(), sequence2.begin(), sequence3.begin()); - auto found = std::find(dpl::execution::dpcpp_default, zipped_begin, zipped_begin + n, + auto found = dpl::find(dpl::execution::dpcpp_default, zipped_begin, zipped_begin + n, [](auto tuple_like_obj) { auto [e1, e2, e3] = tuple_like_obj; return e1 == e2 && e1 == e3; @@ -57,11 +60,12 @@ header. All iterators are implemented in the ``oneapi::dpl`` namespace. The ``discard_iterator`` is useful in the implementation of stencil algorithms where the stencil is not part of the desired output. An example of this would be a ``copy_if`` algorithm that receives an input iterator range, a stencil iterator range, and copies the elements of the input whose corresponding stencil value is 1. Use - ``discard_iterator`` so you do not declare a temporary allocation to store the copy of the stencil:: + ``discard_iterator`` so you do not declare a temporary allocation to store the copy of the stencil: + + .. code:: cpp - using namespace oneapi; auto zipped_first = dpl::make_zip_iterator(first, stencil); - std::copy_if(dpl::execution::dpcpp_default, + dpl::copy_if(dpl::execution::dpcpp_default, zipped_first, zipped_first + (last - first), dpl::make_zip_iterator(result, dpl::discard_iterator()), [](auto t){return get<1>(t) == 1;} @@ -87,14 +91,15 @@ header. All iterators are implemented in the ``oneapi::dpl`` namespace. To simplify the construction of the iterator, ``oneapi::dpl::make_transform_iterator`` is provided. The function receives the base iterator and transform operation instance as arguments, and constructs the - ``transform_iterator`` instance:: + ``transform_iterator`` instance: + + .. 
code:: cpp - using namespace oneapi; dpl::counting_iterator first(0); dpl::counting_iterator last(10); auto transform_first = dpl::make_transform_iterator(first, std::negate()); auto transform_last = transform_first + (last - first); - auto sum = std::reduce(dpl::execution::dpcpp_default, + auto sum = dpl::reduce(dpl::execution::dpcpp_default, transform_first, transform_last); // sum is (0 + -1 + ... + -9) = -45 * ``permutation_iterator``: an iterator whose dereferenced value set is defined by the source iterator @@ -106,7 +111,9 @@ header. All iterators are implemented in the ``oneapi::dpl`` namespace. in cases where algorithms are executed with device policies. The ``make_permutation_iterator`` is provided to simplify construction of iterator instances. The function - receives the source iterator and the iterator or function object representing the index map:: + receives the source iterator and the iterator or function object representing the index map: + + .. code:: cpp struct multiply_index_by_two { template @@ -118,7 +125,6 @@ header. All iterators are implemented in the ``oneapi::dpl`` namespace. 
// compute the number of elements in the range between the first and last that are accessed
      // by the permutation iterator
      size_t num_elements = std::distance(first, last) / 2 + std::distance(first, last) % 2;
-     using namespace oneapi;
      auto permutation_first = dpl::make_permutation_iterator(first, multiply_index_by_two());
      auto permutation_last = permutation_first + num_elements;
-     std::copy(dpl::execution::dpcpp_default, permutation_first, permutation_last, result);
+     dpl::copy(dpl::execution::dpcpp_default, permutation_first, permutation_last, result);
diff --git a/_sources/parallel_api/parallel_range_algorithms.rst b/_sources/parallel_api/parallel_range_algorithms.rst
new file mode 100644
index 0000000000..f0356f71e2
--- /dev/null
+++ b/_sources/parallel_api/parallel_range_algorithms.rst
@@ -0,0 +1,111 @@
+Parallel Range Algorithms
+#########################
+
+C++20 introduces the `Ranges library `_ and
+`range algorithms `_ as a modern paradigm for expressing
+generic operations on data sequences.
+
+|onedpl_long| (|onedpl_short|) extends them with *parallel range algorithms*, which can be used with the standard range
+classes to leverage the |onedpl_short| capabilities for parallel execution on both the host computer and data parallel devices.
+
+|onedpl_short| supports only random access ranges, because they allow simultaneous constant-time access to elements
+at any position in the range. This enables efficient workload distribution among multiple threads or processing units,
+which is essential for achieving high performance in parallel execution.
+
+.. Note::
+
+   The use of parallel range algorithms requires C++20 and the C++ standard libraries shipped with GCC 10 (or higher),
+   Clang 16 (or higher), and Microsoft* Visual Studio* 2019 16.10 (or higher).
+
+Supported Range Views
+---------------------
+
+`Views `_ are lightweight ranges typically used to describe
+data transformation pipelines.
The C++20 standard defines two categories of standard range views, called +*factories* and *adaptors*: + +* A range factory generates its data elements on access via an index or an iterator to the range. +* A range adaptor transforms its underlying data range(s) or view(s) into a new view with modified behavior. + +The following C++ standard random access adaptors and factories can be used with the |onedpl_short| +parallel range algorithms: + +* ``std::ranges::views::all``: A range adaptor that returns a view that includes all elements of a range + (only with standard-aligned execution policies). +* ``std::ranges::subrange``: A utility that produces a view from an iterator and a sentinel or from a range. +* ``std::span``: A view to a contiguous data sequence. +* ``std::ranges::iota_view``: A range factory that generates a sequence of elements by repeatedly incrementing + an initial value. +* ``std::ranges::single_view``: A view that contains exactly one element of a specified value. +* ``std::ranges::transform_view``: A range adaptor that produces a view that applies a transformation to each element + of another view. +* ``std::ranges::reverse_view``: A range adaptor that produces a reversed sequence of elements provided by another view. +* ``std::ranges::take_view``: A range adaptor that produces a view of the first N elements from another view. +* ``std::ranges::drop_view``: A range adaptor that produces a view excluding the first N elements from another view. + +Visit :doc:`pass_data_algorithms` for more information, especially on the :ref:`use of range views ` +with device execution policies. + +Supported Algorithms +-------------------- + +The ```` header defines the parallel range algorithms in the ``namespace oneapi::dpl::ranges``. +All algorithms work with both standard-aligned (host) and device execution policies. + +The ``ONEDPL_HAS_RANGE_ALGORITHMS`` :ref:`feature macro ` may be used to test for the presence of +parallel range algorithms. + +.. 
_range-algorithms-202409L: + +If ``ONEDPL_HAS_RANGE_ALGORITHMS`` is defined to ``202409L`` or a greater value, the following algorithms are provided: + +* ``for_each`` +* ``transform`` +* ``find`` +* ``find_if`` +* ``find_if_not`` +* ``adjacent_find`` +* ``all_of`` +* ``any_of`` +* ``none_of`` +* ``search`` +* ``search_n`` +* ``count`` +* ``count_if`` +* ``equal`` +* ``sort`` +* ``stable_sort`` +* ``is_sorted`` +* ``min_element`` +* ``max_element`` +* ``copy`` +* ``copy_if`` +* ``merge`` + +Usage Example for Parallel Range Algorithms +------------------------------------------- + +.. code:: cpp + + { + std::vector<int> vec_in = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; + std::vector<int> vec_out(vec_in.size()); + + auto view_in = std::ranges::views::all(vec_in) | std::ranges::views::reverse; + oneapi::dpl::ranges::copy(oneapi::dpl::execution::par, view_in, vec_out); + } + { + using usm_shared_allocator = sycl::usm_allocator<int, sycl::usm::alloc::shared>; + // Allocate for the queue used by the execution policy + usm_shared_allocator alloc{oneapi::dpl::execution::dpcpp_default.queue()}; + + std::vector<int, usm_shared_allocator> vec_in{{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, alloc}; + std::vector<int, usm_shared_allocator> vec_out(vec_in.size(), alloc); + + auto view_in = std::ranges::subrange(vec_in.begin(), vec_in.end()) | std::ranges::views::reverse; + oneapi::dpl::ranges::copy(oneapi::dpl::execution::dpcpp_default, view_in, std::span(vec_out)); + } + +.. rubric:: See also: + +:doc:`range_based_api` diff --git a/_sources/parallel_api/pass_data_algorithms.rst b/_sources/parallel_api/pass_data_algorithms.rst index 4c0ec8a4ca..5d3a0b32b5 100644 --- a/_sources/parallel_api/pass_data_algorithms.rst +++ b/_sources/parallel_api/pass_data_algorithms.rst @@ -1,20 +1,34 @@ Pass Data to Algorithms ####################### -When using the C++ standard execution policies, oneDPL supports data being passed to its algorithms as specified -in the ISO/IEC 14882:2017 standard (commonly called C++17).
According to the standard, the calling code -must prevent data races when using algorithms with parallel execution policies. - -Note: Implementations of ``std::vector<bool>`` are not required to avoid data races for concurrent modifications -of vector elements. Some implementations may optimize multiple ``bool`` elements into a bitfield, making it unsafe -for multithreading. For this reason, it is recommended to avoid ``std::vector<bool>`` for anything but a read-only -input with the C++ standard execution policies. - -When using a device execution policy, you can use one of the following ways to pass data to an algorithm: - -* ``oneapi:dpl::begin`` and ``oneapi::dpl::end`` functions -* Unified shared memory (USM) pointers -* ``std::vector`` with or without a USM allocator +For an algorithm to access data, it is important that the execution policy used matches the data storage type. +The following table shows which execution policies can be used with various data storage types. + +================================================ ========================== ============= +Data Storage Device Policies Host Policies +================================================ ========================== ============= +`SYCL buffer`_ Yes No +Device-allocated `unified shared memory`_ (USM) Yes No +Shared and host-allocated USM Yes Yes +``std::vector`` with ``sycl::usm_allocator`` Yes Yes +``std::vector`` with an ordinary allocator See :ref:`use-std-vector` Yes +Other data in host memory No Yes +================================================ ========================== ============= + +When using the standard-aligned (or *host*) execution policies, |onedpl_short| supports data being passed +to its algorithms as specified in the C++ standard (C++17 for algorithms working with iterators, +C++20 for parallel range algorithms), with :ref:`known restrictions and limitations `.
+ +According to the standard, the calling code must prevent data races when using algorithms +with parallel execution policies. + +.. note:: + Implementations of ``std::vector<bool>`` are not required to avoid data races for concurrent modifications + of vector elements. Some implementations may optimize multiple ``bool`` elements into a bitfield, making it unsafe + for multithreading. For this reason, it is recommended to avoid ``std::vector<bool>`` for anything but a read-only + input with the standard-aligned execution policies. + +The following subsections describe proper ways to pass data to an algorithm invoked with a device execution policy. .. _use-buffer-wrappers: Use oneapi::dpl::begin and oneapi::dpl::end Functions ``oneapi::dpl::begin`` and ``oneapi::dpl::end`` are special helper functions that allow you to pass SYCL buffers to parallel algorithms. These functions accept -a SYCL buffer and return an object of an unspecified type that provides the following API: +a `SYCL buffer`_ and return an object of an unspecified type that provides the following API: * It satisfies ``CopyConstructible`` and ``CopyAssignable`` C++ named requirements and is comparable with ``operator==`` and ``operator!=``. * It provides the ``get_buffer`` method, which returns the buffer passed to the ``begin`` and ``end`` functions. The ``begin`` and ``end`` functions can take SYCL 2020 deduction tags and ``sycl::no_init`` as arguments -to explicitly mention which access mode should be applied to the buffer accessor when submitting a -SYCL kernel to a device. For example: +to explicitly control which access mode should be applied to a particular buffer when submitting +a SYCL kernel to a device: ..
code:: cpp - auto first1 = begin(buf, sycl::read_only); - auto first2 = begin(buf, sycl::write_only, sycl::no_init); - auto first3 = begin(buf, sycl::no_init); - -The example above allows you to control the access mode for the particular buffer passing to a parallel algorithm. + sycl::buffer buf{/*...*/}; + auto first_ro = oneapi::dpl::begin(buf, sycl::read_only); + auto first_wo = oneapi::dpl::begin(buf, sycl::write_only, sycl::no_init); + auto first_ni = oneapi::dpl::begin(buf, sycl::no_init); To use the functions, add ``#include <oneapi/dpl/iterator>`` to your code. For example: .. code:: cpp #include #include #include + int main(){ std::vector<int> vec(1000); std::generate(vec.begin(), vec.end(), std::minstd_rand{}); - //create a buffer from host memory - sycl::buffer<int> buf { vec.data(), vec.size() }; + sycl::buffer<int> buf{ vec.data(), vec.size() }; auto buf_begin = oneapi::dpl::begin(buf); auto buf_end = oneapi::dpl::end(buf); - std::sort(oneapi::dpl::execution::dpcpp_default, buf_begin, buf_end); + oneapi::dpl::sort(oneapi::dpl::execution::dpcpp_default, buf_begin, buf_end); return 0; } @@ -71,9 +84,9 @@ To use the functions, add ``#include <oneapi/dpl/iterator>`` to your code. For e Use Unified Shared Memory ------------------------- -If you have USM-allocated memory, pass the pointers to the start and past the end -of the sequence to a parallel algorithm. Make sure that the execution policy and -the USM-allocated memory were created for the same queue. For example: +If you have USM-allocated data, pass the pointers to the start and past the end +of the data sequence to a parallel algorithm. Make sure that the execution policy and +the USM allocation use the same SYCL queue. For example:
For example: #include #include #include + int main(){ sycl::queue q; const int n = 1000; int* d_head = sycl::malloc_shared(n, q); std::generate(d_head, d_head + n, std::minstd_rand{}); - std::sort(oneapi::dpl::execution::make_device_policy(q), d_head, d_head + n); + oneapi::dpl::sort(oneapi::dpl::execution::make_device_policy(q), d_head, d_head + n); sycl::free(d_head, q); return 0; } +.. note:: + Use of non-USM pointers is not supported for algorithms with device execution policies. + When using device USM, such as allocated by ``malloc_device``, you are responsible for data transfers to and from the device to ensure that input data is device accessible during oneDPL algorithm execution and that the result is available to the subsequent operations. -Use std::vector ------------------------------ - -The following examples demonstrate two ways to use the parallel algorithms with ``std::vector``: +.. _use-std-vector: -* Host allocators -* USM allocators +Use std::vector +--------------- -You can use iterators to host allocated ``std::vector`` data -as shown in the following example: +You can use iterators to an ordinary ``std::vector`` with data in host memory, as shown in the following example: .. code:: cpp @@ -114,28 +127,42 @@ as shown in the following example: #include #include #include + int main(){ std::vector vec( 1000 ); std::generate(vec.begin(), vec.end(), std::minstd_rand{}); - std::sort(oneapi::dpl::execution::dpcpp_default, vec.begin(), vec.end()); + oneapi::dpl::sort(oneapi::dpl::execution::dpcpp_default, vec.begin(), vec.end()); return 0; } -When using iterators to host allocated data, a temporary SYCL buffer is created, and the data -is copied to this buffer. After processing on a device is complete, the modified data is copied -from the temporary buffer back to the host container. While convenient, using host allocated -data can lead to unintended copying between host and device. 
We recommend working with SYCL buffers -or USM memory to reduce data copying between the host and device. +In this case a temporary SYCL buffer is created, the data is copied to this buffer, and it is processed +according to the algorithm semantics. After processing on a device is complete, the modified data is copied +from the temporary buffer back to the host container. + +.. note:: + For parallel range algorithms with device execution policies the use of ordinary ``std::vector``\s is not supported. -You can also use ``std::vector`` with a USM allocator, as shown in the following example: +While convenient, direct use of an ordinary ``std::vector`` can lead to unintended copying between the host +and the device. We recommend working with SYCL buffers or with USM to reduce data copying. + +.. note:: + For specialized memory algorithms that begin or end the lifetime of data objects, that is, + ``uninitialized_*`` and ``destroy*`` families of functions, the data to initialize or destroy + should be accessible on the device without extra copying. Therefore these algorithms may not use + data storage on the host with device execution policies. + +You can also use ``std::vector`` with a ``sycl::usm_allocator``, as shown in the following example. +Make sure that the allocator and the execution policy use the same SYCL queue: .. 
code:: cpp #include #include #include + #include #include + int main(){ const int n = 1000; auto policy = oneapi::dpl::execution::dpcpp_default; @@ -144,18 +171,67 @@ You can also use ``std::vector`` with a USM allocator, as shown in the following std::generate(vec.begin(), vec.end(), std::minstd_rand{}); // Recommended to use USM pointers: - std::sort(policy, vec.data(), vec.data() + vec.size()); - - // Iterators for USM allocators might require extra copying - not recommended method - // std::sort(policy, vec.begin(), vec.end()); + oneapi::dpl::sort(policy, vec.data(), vec.data() + vec.size()); + /* + // Iterators for USM allocators might require extra copying - not a recommended method + oneapi::dpl::sort(policy, vec.begin(), vec.end()); + */ return 0; } -Make sure that the execution policy and the USM-allocated memory were created for the same queue. - For ``std::vector`` with a USM allocator, we recommend using ``std::vector::data()`` in combination with ``std::vector::size()`` as shown in the example above, rather than iterators to ``std::vector``. That is because for some implementations of the C++ Standard Library it might not be possible for |onedpl_short| to detect that iterators are pointing to USM-allocated data. In that -case the data will be treated as if it were host-allocated, with an extra copy made to a SYCL buffer. -Retrieving USM pointers from ``std::vector`` as shown guarantees no unintended copying. \ No newline at end of file +case the data will be treated as if it were in host memory, with an extra copy made to a SYCL buffer. +Retrieving USM pointers from ``std::vector`` as shown guarantees no unintended copying. + +.. _use-range-views: + +Use Range Views +--------------- + +For :doc:`parallel range algorithms ` with device execution policies, +place the data in USM or a USM-allocated ``std::vector``, and pass it to an algorithm +via a device-copyable range or view object such as ``std::ranges::subrange`` or ``std::span``. + ..
note:: + Use of ``std::ranges::views::all`` is not supported for algorithms with device execution policies. + +These data ranges, as well as the supported range adaptors and factories, may be combined into +data transformation pipelines that can also be used with parallel range algorithms. For example: + +.. code:: cpp + + #include + #include + #include + #include + #include + #include + #include + #include + + int main(){ + const int n = 1000; + auto policy = oneapi::dpl::execution::dpcpp_default; + sycl::queue q = policy.queue(); + + int* d_head = sycl::malloc_host<int>(n, q); + std::generate(d_head, d_head + n, std::minstd_rand{}); + + sycl::usm_allocator<int, sycl::usm::alloc::shared> alloc(q); + std::vector<int, decltype(alloc)> vec(n, alloc); + + oneapi::dpl::ranges::copy(policy, + std::ranges::subrange(d_head, d_head + n) | std::views::transform(std::negate{}), + std::span(vec)); + + oneapi::dpl::ranges::sort(policy, std::span(vec)); + + sycl::free(d_head, q); + return 0; + } + +.. _`SYCL buffer`: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#subsec:buffers +.. _`unified shared memory`: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:usm diff --git a/_sources/parallel_api/range_based_api.rst b/_sources/parallel_api/range_based_api.rst index 7323814c59..5d55e11585 100644 --- a/_sources/parallel_api/range_based_api.rst +++ b/_sources/parallel_api/range_based_api.rst @@ -1,32 +1,75 @@ -Range-Based API Algorithms -########################## -.. Note:: +Experimental Range-Based API +############################ + +The ``<oneapi/dpl/ranges>`` header file contains experimental classes and functions that implement +functionality similar to that provided by the C++20 Ranges Library while requiring only C++17. +This allows you to combine |onedpl_short| data parallel execution capabilities with some aspects +of a modern range-based API. The functionality is only implemented for the device execution policies.
- The use of the range-based API requires C++17 and the C++ standard libraries coming with GCC 8.1 (or higher) - or Clang 7 (or higher). +.. Note:: + The use of the experimental range-based API requires the C++ standard libraries + coming with GCC 8.1 (or higher) or Clang 7 (or higher). -C++20 introduces the Ranges library. C++20 standard splits ranges into two categories: factories and adaptors. -A range factory does not have underlying data. An element is generated on success by an index or by dereferencing an iterator. -A range adaptor, from the |onedpl_long| (|onedpl_short|) perspective, is a utility that transforms the base range, -or another adapted range, into a view with custom behavior. +.. Warning:: + This experimental functionality will be gradually superseded by the + :doc:`parallel range algorithms ` and eventually discontinued. -|onedpl_short| supports an ``iota_view`` range factory. +Range Views -A ``sycl::buffer`` wrapped with ``all_view`` can be used as the range. +----------- -|onedpl_short| considers the supported factories and ``all_view`` as base ranges. -The range adaptors may be combined into a pipeline with a ``base`` range at the beginning. For example: +.. _viewable-ranges: + +The following viewable ranges are defined in the ``oneapi::dpl::experimental::ranges`` namespace: + +* ``views::iota``: A range factory that generates a sequence of elements by repeatedly incrementing an initial value. +* ``views::all``: A custom utility that represents a view of all or a part of ``sycl::buffer`` elements + for reading and writing on a device. +* ``views::all_read``: A custom utility that represents a view of all or a part of ``sycl::buffer`` elements + for reading on a device. +* ``views::all_write``: A custom utility that represents a view of all or a part of ``sycl::buffer`` elements + for writing on a device.
+* ``views::host_all``: A custom utility that represents a view of all or a part of ``sycl::buffer`` elements + for reading and writing on the host. +* ``views::subrange``: A utility that represents a view of a unified shared memory (USM) data range + defined by two USM pointers. +* ``views::zip``: A custom range adaptor that produces one ``zip_view`` from several other views. +* ``views::transform``: A range adaptor that represents a view of an underlying sequence after applying + a transformation to each element. +* ``views::reverse``: A range adaptor that produces a reversed sequence of elements provided by another view. +* ``views::take``: A range adaptor that produces a view of the first N elements from another view. +* ``views::drop``: A range adaptor that produces a view excluding the first N elements from another view. + +Only these ranges, ``sycl::buffer``, and their combinations can be passed to the experimental range-based algorithms. + +A ``sycl::buffer`` wrapped with ``views::all`` and similar utilities, ``views::subrange`` over USM, and ``views::iota`` +are considered *base ranges*. The range adaptors may be combined into a pipeline with a base range at the beginning. +For example: .. code:: cpp sycl::buffer buf(data, sycl::range<1>(10)); - auto range_1 = iota_view(0, 10) | views::reverse(); - auto range_2 = all_view(buf) | views::reverse(); + auto range_1 = views::iota(0, 10) | views::reverse; + auto range_2 = views::all(buf) | views::take(10); + +For ranges based on a SYCL buffer, data access is only permitted on a device, while ``size()`` and ``empty()`` +methods are allowed to be called on both host and device. + +Range-Based Algorithms +---------------------- + +The functions for experimental range-based algorithms resemble the standard C++ parallel algorithm overloads +where all data sequences are represented by ranges instead of iterators or iterator pairs, for example: + +..
code:: cpp + + template <typename ExecutionPolicy, typename Range1, typename Range2> + void copy(ExecutionPolicy&& exec, Range1&& source, Range2&& destination); -For the range, based on the ``all_view`` factory, data access is permitted on a device only. ``size()`` and ``empty()`` methods are allowed -to be called on both host and device. +Note that ``source`` is used instead of two iterators to represent the input, and ``destination`` represents the output. -The following algorithms are available to use with the ranges: +The following algorithms are available to use with the ranges. These algorithms are defined in the +``oneapi::dpl::experimental::ranges`` namespace and can only be invoked with device execution policies. +To use these algorithms, include both ``<oneapi/dpl/execution>`` and ``<oneapi/dpl/ranges>`` header files. * ``adjacent_find`` * ``all_of`` @@ -76,49 +119,19 @@ The following algorithms are available to use with the ranges: * ``unique`` * ``unique_copy`` -The signature example of the range-based algorithms looks like: - -.. code:: cpp - - template <typename ExecutionPolicy, typename Range1, typename Range2> - void copy(ExecutionPolicy&& exec, Range1&& source, Range2&& destination); -where ``source`` is used instead of two iterators to represent the input, and ``destination`` represents the output. -These algorithms are declared in the ``oneapi::dpl::experimental::ranges`` namespace and implemented only for device execution policies. -To make these algorithms available, the ``<oneapi/dpl/ranges>`` header should be included (after ``<oneapi/dpl/execution>``). -Use of the range-based API requires C++17 and the C++ standard libraries that come with GCC 8.1 (or higher) or Clang 7 (or higher). -The following viewable ranges are declared in the ``oneapi::dpl::experimental::ranges`` namespace. -Only the ranges shown below and ``sycl::buffer`` are available as ranges for range-based algorithms. -.. _viewable-ranges: -* ``views::iota``: A range factory that generates a sequence of N elements, which starts from an initial value and ends by final N-1.
-* ``views::all``: A custom utility that represents a view of all or a part of ``sycl::buffer`` underlying elements for reading and writing on a device. -* ``views::all_read``: A custom utility that represents a view of all or a part of ``sycl::buffer`` underlying elements for reading on a device. -* ``views::all_write``: A custom utility that represents a view of all or a part of ``sycl::buffer`` underlying elements for writing on a device. -* ``views::host_all``: A custom utility that represents a view of all or a part of ``sycl::buffer`` underlying elements for reading and writing on the host. -* ``views::subrange``: A utility that represents a view of unified shared memory (USM) data range defined by a two USM pointers. -* ``views::zip``: A custom range adapter that produces one ``zip_view`` from other several views. -* ``views::transform``: A range adapter that represents a view of a underlying sequence after applying a transformation to each element. -* ``views::reverse``: A range adapter that produces a reversed sequence of elements provided by another view. -* ``views::take``: A range adapter that produces a view of the first N elements from another view. -* ``views::drop``: A range adapter that produces a view excluding the first N elements from another view. - -Example of Range-Based API Usage --------------------------------- +Usage Example +------------- .. 
code:: cpp - using namespace oneapi::dpl::experimental::ranges; + namespace rangexp = oneapi::dpl::experimental::ranges; { sycl::buffer A(data, sycl::range<1>(max_n)); sycl::buffer B(data2, sycl::range<1>(max_n)); - auto view = all_view(A) | views::reverse(); - auto range_res = all_view(B); + auto view = rangexp::views::all(A) | rangexp::views::reverse; + auto range_res = rangexp::views::all_write(B); - copy(oneapi::dpl::execution::dpcpp_default, view, range_res); + rangexp::copy(oneapi::dpl::execution::dpcpp_default, view, range_res); } diff --git a/_sources/parallel_api_main.rst b/_sources/parallel_api_main.rst index 638aecc3ad..5c78c6f46f 100644 --- a/_sources/parallel_api_main.rst +++ b/_sources/parallel_api_main.rst @@ -1,38 +1,38 @@ Parallel API ############ -Parallel API is an implementation of the C++ standard libraries algorithms and execution -policies, as specified in the ISO/IEC 14882:2017 standard (commonly called C++17). The implementation -supports the unsequenced execution policy and the ``shift_left``/``shift_right`` algorithms, which are specified -in the final draft of the C++ 20 standard (N4860). For more details see the `C++ Standard Execution -Policies `_. |onedpl_long| (|onedpl_short|) -provides specific versions of the algorithms, including: +The Parallel API in |onedpl_long| (|onedpl_short|) is an implementation of the C++ standard algorithms +with `execution policies `_, +as specified in the ISO/IEC 14882:2017 standard (commonly called C++17), as well as those added in C++20. +It offers threaded and SIMD execution of these algorithms on Intel® processors implemented on top of OpenMP* +and |onetbb_short|, as well as data parallel execution on accelerators backed by SYCL* support in |dpcpp_cpp|. + +Extending the capabilities of `range algorithms `_ in C++20, +the Parallel API provides analogous *parallel range algorithms* that execute according to an execution policy. 
+ +In addition, |onedpl_short| provides specific variations of some algorithms, including: * Segmented reduce * Segmented scan * Vectorized search algorithms - -Parallel API offers support for the parallel and vectorized execution of algorithms on Intel® -processors and heterogeneity support with a DPC++ based implementation for device execution policies. -For sequential execution, |onedpl_short| relies on an available implementation of the C++ standard library. +* Sorting of key-value pairs +* Conditional transform The utility API includes iterators and function object classes. The iterators implement zip, transform, complete permutation operations on other iterators, and include a counting and discard iterator. The function object classes provide minimum, maximum, and identity operations that may be passed to algorithms such as reduce or transform. -|onedpl_short| also includes an experimental implementation of range-based algorithms with their -required ranges and Async API. +|onedpl_short| also includes an experimental implementation of asynchronous algorithms. .. 
toctree:: :maxdepth: 2 :titlesonly: - :glob: :hidden: parallel_api/execution_policies - parallel_api/iterators - parallel_api/async_api - parallel_api/range_based_api + parallel_api/parallel_range_algorithms parallel_api/additional_algorithms parallel_api/pass_data_algorithms + parallel_api/iterators + parallel_api/range_based_api diff --git a/_static/documentation_options.js b/_static/documentation_options.js index 8302a8473b..89523924e1 100644 --- a/_static/documentation_options.js +++ b/_static/documentation_options.js @@ -1,6 +1,6 @@ var DOCUMENTATION_OPTIONS = { URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'), - VERSION: '2022.6.0', + VERSION: '2022.7.0', LANGUAGE: 'None', COLLAPSE_INDEX: false, BUILDER: 'html', diff --git a/api_for_sycl_kernels/random.html b/api_for_sycl_kernels/random.html index 912411955a..3bdadbd936 100644 --- a/api_for_sycl_kernels/random.html +++ b/api_for_sycl_kernels/random.html @@ -8,7 +8,7 @@ - Random Number Generators — oneDPL Documentation 2022.6.0 documentation + Random Number Generators — oneDPL Documentation 2022.7.0 documentation @@ -45,8 +45,8 @@ - - + + @@ -151,24 +151,31 @@