From fb7dc3582d4e0fa5ae4d5f15131e6be8b60ec0ef Mon Sep 17 00:00:00 2001 From: Lukasz Wawrzyniak Date: Mon, 30 Sep 2024 13:33:47 -0700 Subject: [PATCH] Dynamic generation docs --- CHANGELOG.md | 1 + docs/codegen.rst | 701 ++++++++++++++++++++++++++++++++++++++- docs/index.rst | 2 +- docs/modules/runtime.rst | 200 +++++++---- 4 files changed, 831 insertions(+), 73 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index ac067870..c58005d4 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -26,6 +26,7 @@ - Support for a new `wp.static(expr)` function that allows arbitrary Python expressions to be evaluated at the time of function/kernel definition ([docs](https://nvidia.github.io/warp/codegen.html#static-expressions)). - Add a contributing guide to the Sphinx docs. +- Add documentation for dynamic code generation. ### Changed diff --git a/docs/codegen.rst b/docs/codegen.rst index 20e4afa0..9696381c 100644 --- a/docs/codegen.rst +++ b/docs/codegen.rst @@ -1,7 +1,99 @@ +.. _code_generation: + Code Generation =============== -Warp explicitly generates C++/CUDA code for CPU/GPU and stores the .cpp/.cu source files under the module directories of the kernel cache. +Overview +-------- + +Warp kernels are grouped together by Python module. Before they can run on a device, they must be translated and compiled for the device architecture. All kernels in a module are compiled together, which is faster than compiling each one individually. When a kernel is launched, Warp checks if the module is up-to-date and will compile it if needed. Adding new kernels to a module at runtime modifies the module, which means that it will need to be reloaded on next launch. + +.. code:: python + + @wp.kernel + def kernel_foo(): + print("foo") + + wp.launch(kernel_foo, dim=1) + + @wp.kernel + def kernel_bar(): + print("bar") + + wp.launch(kernel_bar, dim=1) + +In the snippet above, kernel definitions are interspersed with kernel launches. To execute ``kernel_foo``, the module is compiled during the first launch. Defining ``kernel_bar`` modifies the module, so it needs to be recompiled during the second launch: + +.. code:: text + + Module __main__ 6cd1d53 load on device 'cuda:0' took 168.19 ms (compiled) + foo + Module __main__ c7c0e9a load on device 'cuda:0' took 160.35 ms (compiled) + bar + +The compilation can take a long time for modules with numerous complex kernels, so Warp caches the compiled modules and can reuse them on the next run of the program: + +.. code:: text + + Module __main__ 6cd1d53 load on device 'cuda:0' took 4.97 ms (cached) + foo + Module __main__ c7c0e9a load on device 'cuda:0' took 0.40 ms (cached) + bar + +Loading cached modules is much faster, but it's not free. In addition, module reloading can cause problems during CUDA graph capture, so there are good reasons to try to avoid it. + +The best way to avoid module reloading is to define all the kernels before launching any of them. This way, the module will be compiled only once: + +.. code:: python + + @wp.kernel + def kernel_foo(): + print("foo") + + @wp.kernel + def kernel_bar(): + print("bar") + + wp.launch(kernel_foo, dim=1) + wp.launch(kernel_bar, dim=1) + +.. code:: text + + Module __main__ c7c0e9a load on device 'cuda:0' took 174.57 ms (compiled) + foo + bar + +On subsequent runs it will be loaded from the kernel cache only once: + +.. code:: text + + Module __main__ c7c0e9a load on device 'cuda:0' took 4.96 ms (cached) + foo + bar + +Warp tries to recognize duplicate kernels to avoid unnecessary module reloading. 
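+Module loading can also be triggered explicitly, which is useful when it needs to happen at a controlled point, for example before capturing a CUDA graph. The snippet below is a minimal sketch; it assumes that ``wp.load_module()`` is available in your version of Warp (otherwise, launching each kernel once ahead of time has the same effect):
+
+.. code:: python
+
+    import warp as wp
+
+    @wp.kernel
+    def kernel_foo():
+        print("foo")
+
+    @wp.kernel
+    def kernel_bar():
+        print("bar")
+
+    # Compile and load the current module explicitly (assumed API: wp.load_module()).
+    # After this point, the launches below reuse the already-loaded module.
+    wp.load_module()
+
+    wp.launch(kernel_foo, dim=1)
+    wp.launch(kernel_bar, dim=1)
+
+Note that Warp's duplicate-kernel detection still applies when modules are loaded this way.
+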
For example, this program creates kernels in a loop, but they are always identical, so the module does not need to be recompiled on every launch: + +.. code:: python + + for i in range(3): + + @wp.kernel + def kernel_hello(): + print("hello") + + wp.launch(kernel_hello, dim=1) + +Warp filters out the duplicate kernels, so the module is only loaded once: + +.. code:: text + + Module __main__ 8194f57 load on device 'cuda:0' took 178.24 ms (compiled) + hello + hello + hello + + +Warp generates C++/CUDA source code for CPU/GPU and stores the .cpp/.cu source files under the module directories of the kernel cache. The kernel cache folder path is printed during the :ref:`Warp initialization ` and can be retrieved after Warp has been initialized from the ``warp.config.kernel_cache_dir`` :ref:`configuration setting `. @@ -42,6 +134,10 @@ To ease the readability, comments referring to the original Python source code l Besides the forward pass, the gradient function is also generated, and, if a :ref:`custom replay function ` is provided, the replay function is generated as well. +Warp passes the generated source code to native compilers (e.g., LLVM for CPU and NVRTC for CUDA) to produce executable code that is invoked when launching kernels. + +.. _static_expressions: + Static Expressions ------------------ @@ -201,8 +297,609 @@ Example: Function Pointers The above program uses a static expression to select the right function given the captured ``op`` variable and prints the following output while compiling the module containing the ``operate`` kernel three times: -.. code:: console +.. code:: text [3. 3.] [-1. 3.] [2. 0.] + + +.. _dynamic_generation: + +Dynamic Kernel Creation +----------------------- + +It is often desirable to dynamically customize kernels with different constants, types, or functions. We can achieve this through runtime kernel specialization using Python closures. + +Kernel Closures +~~~~~~~~~~~~~~~ + +Constants +^^^^^^^^^ + +Warp allows references to external constants in kernels: + +.. code:: python + + def create_kernel_with_constant(constant): + @wp.kernel + def k(a: wp.array(dtype=float)): + tid = wp.tid() + a[tid] += constant + return k + + k1 = create_kernel_with_constant(17.0) + k2 = create_kernel_with_constant(42.0) + + a = wp.zeros(5, dtype=float) + + wp.launch(k1, dim=a.shape, inputs=[a]) + wp.launch(k2, dim=a.shape, inputs=[a]) + + print(a) + +Output: + +.. code:: text + + [59. 59. 59. 59. 59.] + + +Data Types +^^^^^^^^^^ + +Warp data types can also be captured in a closure. Here is an example of creating kernels that work with different vector dimensions: + +.. code:: python + + def create_kernel_with_dtype(vec_type): + @wp.kernel + def k(a: wp.array(dtype=vec_type)): + tid = wp.tid() + a[tid] += float(tid) * vec_type(1.0) + return k + + k2 = create_kernel_with_dtype(wp.vec2) + k4 = create_kernel_with_dtype(wp.vec4) + + a2 = wp.ones(3, dtype=wp.vec2) + a4 = wp.ones(3, dtype=wp.vec4) + + wp.launch(k2, dim=a2.shape, inputs=[a2]) + wp.launch(k4, dim=a4.shape, inputs=[a4]) + + print(a2) + print(a4) + +Output: + +.. code:: text + + [[1. 1.] + [2. 2.] + [3. 3.]] + [[1. 1. 1. 1.] + [2. 2. 2. 2.] + [3. 3. 3. 3.]] + + +Functions +^^^^^^^^^ + +Here's a kernel generator that's parameterized using different functions: + +.. 
code:: python + + def create_kernel_with_function(f): + @wp.kernel + def k(a: wp.array(dtype=float)): + tid = wp.tid() + a[tid] = f(a[tid]) + return k + + @wp.func + def square(x: float): + return x * x + + @wp.func + def cube(x: float): + return x * x * x + + k1 = create_kernel_with_function(square) + k2 = create_kernel_with_function(cube) + + a1 = wp.array([1, 2, 3, 4, 5], dtype=float) + a2 = wp.array([1, 2, 3, 4, 5], dtype=float) + + wp.launch(k1, dim=a1.shape, inputs=[a1]) + wp.launch(k2, dim=a2.shape, inputs=[a2]) + + print(a1) + print(a2) + +Output: + +.. code:: text + + [ 1. 4. 9. 16. 25.] + [ 1. 8. 27. 64. 125.] + + +Function Closures +~~~~~~~~~~~~~~~~~ + +Warp functions (``@wp.func``) also support closures, just like kernels: + +.. code:: python + + def create_function_with_constant(constant): + @wp.func + def f(x: float): + return constant * x + return f + + f1 = create_function_with_constant(2.0) + f2 = create_function_with_constant(3.0) + + @wp.kernel + def k(a: wp.array(dtype=float)): + tid = wp.tid() + x = float(tid) + a[tid] = f1(x) + f2(x) + + a = wp.ones(5, dtype=float) + + wp.launch(k, dim=a.shape, inputs=[a]) + + print(a) + +Output: + +.. code:: text + + [ 0. 5. 10. 15. 20.] + + +We can also create related function and kernel closures together like this: + +.. code:: python + + def create_fk(a, b): + @wp.func + def f(x: float): + return a * x + + @wp.kernel + def k(a: wp.array(dtype=float)): + tid = wp.tid() + a[tid] = f(a[tid]) + b + + return f, k + + # create related function and kernel closures + f1, k1 = create_fk(2.0, 3.0) + f2, k2 = create_fk(4.0, 5.0) + + # use the functions separately in a new kernel + @wp.kernel + def kk(a: wp.array(dtype=float)): + tid = wp.tid() + a[tid] = f1(a[tid]) + f2(a[tid]) + + a1 = wp.array([1, 2, 3, 4, 5], dtype=float) + a2 = wp.array([1, 2, 3, 4, 5], dtype=float) + ak = wp.array([1, 2, 3, 4, 5], dtype=float) + + wp.launch(k1, dim=a1.shape, inputs=[a1]) + wp.launch(k2, dim=a2.shape, inputs=[a2]) + wp.launch(kk, dim=ak.shape, inputs=[ak]) + + print(a1) + print(a2) + print(ak) + +Output: + +.. code:: text + + [ 5. 7. 9. 11. 13.] + [ 9. 13. 17. 21. 25.] + [ 6. 12. 18. 24. 30.] + + +Dynamic Structs +~~~~~~~~~~~~~~~ + +Sometimes it's useful to customize Warp structs with different data types. + +Customize Precision +^^^^^^^^^^^^^^^^^^^ + +For example, we can create structs with different floating point precision: + +.. code:: python + + def create_struct_with_precision(dtype): + @wp.struct + class S: + a: dtype + b: dtype + return S + + # create structs with different floating point precision + S16 = create_struct_with_precision(wp.float16) + S32 = create_struct_with_precision(wp.float32) + S64 = create_struct_with_precision(wp.float64) + + s16 = S16() + s32 = S32() + s64 = S64() + + s16.a, s16.b = 2.0001, 3.0000002 + s32.a, s32.b = 2.0001, 3.0000002 + s64.a, s64.b = 2.0001, 3.0000002 + + # create a generic kernel that works with the different types + @wp.kernel + def k(s: Any, output: wp.array(dtype=Any)): + tid = wp.tid() + x = output.dtype(tid) + output[tid] = x * s.a + s.b + + a16 = wp.empty(5, dtype=wp.float16) + a32 = wp.empty(5, dtype=wp.float32) + a64 = wp.empty(5, dtype=wp.float64) + + wp.launch(k, dim=a16.shape, inputs=[s16, a16]) + wp.launch(k, dim=a32.shape, inputs=[s32, a32]) + wp.launch(k, dim=a64.shape, inputs=[s64, a64]) + + print(a16) + print(a32) + print(a64) + +We can see the effect of using different floating point precision in the output: + +.. code:: text + + [ 3. 5. 7. 9. 11.] 
+ [ 3.0000002 5.0001 7.0002003 9.000299 11.0004 ] + [ 3.0000002 5.0001002 7.0002002 9.0003002 11.0004002] + + +Customize Dimensions +^^^^^^^^^^^^^^^^^^^^ + +Another useful application of dynamic structs is the ability to customize dimensionality. Here, we create structs that work with 2D and 3D data: + +.. code:: python + + # create struct with different vectors and matrix dimensions + def create_struct_nd(dim): + @wp.struct + class S: + v: wp.types.vector(dim, float) + m: wp.types.matrix((dim, dim), float) + return S + + S2 = create_struct_nd(2) + S3 = create_struct_nd(3) + + s2 = S2() + s2.v = (1.0, 2.0) + s2.m = ((2.0, 0.0), + (0.0, 0.5)) + + s3 = S3() + s3.v = (1.0, 2.0, 3.0) + s3.m = ((2.0, 0.0, 0.0), + (0.0, 0.5, 0.0), + (0.0, 0.0, 1.0)) + + # create a generic kernel that works with the different types + @wp.kernel + def k(s: Any, output: wp.array(dtype=Any)): + tid = wp.tid() + x = float(tid) + output[tid] = x * s.v * s.m + + a2 = wp.empty(5, dtype=wp.vec2) + a3 = wp.empty(5, dtype=wp.vec3) + + wp.launch(k, dim=a2.shape, inputs=[s2, a2]) + wp.launch(k, dim=a3.shape, inputs=[s3, a3]) + + print(a2) + print(a3) + +Output: + +.. code:: text + + [[0. 0.] + [2. 1.] + [4. 2.] + [6. 3.] + [8. 4.]] + [[ 0. 0. 0.] + [ 2. 1. 3.] + [ 4. 2. 6.] + [ 6. 3. 9.] + [ 8. 4. 12.]] + + +Module Reloading +~~~~~~~~~~~~~~~~ + +Frequent recompilation can add overhead to a program, especially if the program is creating kernels at runtime. Consider this program: + +.. code:: python + + def create_kernel_with_constant(constant): + @wp.kernel + def k(a: wp.array(dtype=float)): + tid = wp.tid() + a[tid] += constant + return k + + a = wp.zeros(5, dtype=float) + + k1 = create_kernel_with_constant(17.0) + wp.launch(k1, dim=a.shape, inputs=[a]) + print(a) + + k2 = create_kernel_with_constant(42.0) + wp.launch(k2, dim=a.shape, inputs=[a]) + print(a) + + k3 = create_kernel_with_constant(-9.0) + wp.launch(k3, dim=a.shape, inputs=[a]) + print(a) + +Kernel creation is interspersed with kernel launches, which forces reloading on each kernel launch: + +.. code:: text + + Module __main__ 96db544 load on device 'cuda:0' took 165.46 ms (compiled) + [17. 17. 17. 17. 17.] + Module __main__ 9f609a4 load on device 'cuda:0' took 151.69 ms (compiled) + [59. 59. 59. 59. 59.] + Module __main__ e93fbb9 load on device 'cuda:0' took 167.84 ms (compiled) + [50. 50. 50. 50. 50.] + +To avoid reloading, all kernels should be created before launching them: + +.. code:: python + + def create_kernel_with_constant(constant): + @wp.kernel + def k(a: wp.array(dtype=float)): + tid = wp.tid() + a[tid] += constant + return k + + k1 = create_kernel_with_constant(17.0) + k2 = create_kernel_with_constant(42.0) + k3 = create_kernel_with_constant(-9.0) + + a = wp.zeros(5, dtype=float) + + wp.launch(k1, dim=a.shape, inputs=[a]) + print(a) + + wp.launch(k2, dim=a.shape, inputs=[a]) + print(a) + + wp.launch(k3, dim=a.shape, inputs=[a]) + print(a) + +.. code:: text + + Module __main__ e93fbb9 load on device 'cuda:0' took 164.87 ms (compiled) + [17. 17. 17. 17. 17.] + [59. 59. 59. 59. 59.] + [50. 50. 50. 50. 50.] + +Redefining identical kernels, functions, and structs should not cause module reloading, since Warp is able to detect duplicates: + +.. 
code:: python + + def create_struct(dtype): + @wp.struct + class S: + a: dtype + b: dtype + return S + + def create_function(dtype, S): + @wp.func + def f(s: S): + return s.a * s.b + return f + + def create_kernel(dtype, S, f, C): + @wp.kernel + def k(a: wp.array(dtype=dtype)): + tid = wp.tid() + s = S(a[tid], C) + a[tid] = f(s) + return k + + # create identical struct, function, and kernel in a loop + for i in range(3): + S = create_struct(float) + f = create_function(float, S) + k = create_kernel(float, S, f, 3.0) + + a = wp.array([1, 2, 3, 4, 5], dtype=float) + + wp.launch(k, dim=a.shape, inputs=[a]) + print(a) + +Even though struct ``S``, function ``f``, and kernel ``k`` are re-created in each iteration of the loop, they are duplicates so the module is only loaded once: + +.. code:: text + + Module __main__ 4af2d60 load on device 'cuda:0' took 181.34 ms (compiled) + [ 3. 6. 9. 12. 15.] + [ 3. 6. 9. 12. 15.] + [ 3. 6. 9. 12. 15.] + + +.. _late_binding: + +Late Binding and Static Expressions +----------------------------------- + +Python uses late binding, which means that variables can be referenced in a function before they are defined: + +.. code:: python + + def k(): + # Function f() and constant C are not defined yet. + # They will be resolved when k() is called. + print(f() + C) + + def f(): + return 42 + + C = 17 + + # late binding occurs in this call + k() + +Warp follows this convention by default, because it's the Pythonic way. Here is a similar program written in Warp: + +.. code:: python + + @wp.kernel + def k(): + # Function f() and constant C are not defined yet. + # They will be resolved when k() is called. + print(f() + C) + + @wp.func + def f(): + return 42 + + C = 17 + + # late binding occurs in this launch, when the module is compiled + wp.launch(k, dim=1) + + # wait for the output + wp.synchronize_device() + +Late binding is often convenient, but it can sometimes lead to surprising results. Consider this snippet, which creates kernels in a loop. The kernels reference the loop variable as a constant. + +.. code:: python + + # create a list of kernels that use the loop variable + kernels = [] + for i in range(3): + @wp.kernel + def k(): + print(i) + kernels.append(k) + + # launch the kernels + for k in kernels: + wp.launch(k, dim=1) + + wp.synchronize_device() + +This prints: + +.. code:: text + + 2 + 2 + 2 + +This might be surprising, but creating a similar program in pure Python would lead to the same results. Because of late binding, the captured loop variable ``i`` is not evaluated until the kernels are launched. At that moment, the value of ``i`` is 2 and we see the same output from each kernel. + +In Warp, ``wp.static()`` can be used to get around this problem: + +.. code:: python + + # create a list of kernels that use the loop variable + kernels = [] + for i in range(3): + @wp.kernel + def k(): + print(wp.static(i)) # wp.static() for the win + kernels.append(k) + + # launch the kernels + for k in kernels: + wp.launch(k, dim=1) + + wp.synchronize_device() + +Warp replaces the call to ``wp.static()`` with the value of the expression passed as its argument. The expression is evaluated immediately at the time of kernel definition. This is similar to static binding used by languages like C++, which means that all variables referenced by the static expression must already be defined. + +To further illustrate the difference between the default late binding behavior and static expressions, consider this program: + +.. 
code:: python + + C = 17 + + @wp.kernel + def k1(): + print(C) + + @wp.kernel + def k2(): + print(wp.static(C)) + + # redefine constant + C = 42 + + wp.launch(k1, dim=1) + wp.launch(k2, dim=1) + + wp.synchronize_device() + +Output: + +.. code:: text + + 42 + 17 + +Kernel ``k1`` uses late binding of ``C``. This means that it captures the latest value of ``C``, determined when the module is built during the launch. Kernel ``k2`` consumes ``C`` in a static expression, thus it captures the value of ``C`` when the kernel is defined. + +The same rules apply to resolving Warp functions: + +.. code:: python + + @wp.func + def f(): + return 17 + + @wp.kernel + def k1(): + print(f()) + + @wp.kernel + def k2(): + print(wp.static(f)()) + + # redefine function + @wp.func + def f(): + return 42 + + wp.launch(k1, dim=1) + wp.launch(k2, dim=1) + + wp.synchronize_device() + +Output: + +.. code:: text + + 42 + 17 + +Kernel ``k1`` uses the latest definition of function ``f``, while kernel ``k2`` uses the definition of ``f`` when the kernel was declared. diff --git a/docs/index.rst b/docs/index.rst index 5ffef7f2..be23fc3b 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -361,10 +361,10 @@ Full Table of Contents :maxdepth: 2 :caption: Advanced Topics + codegen modules/allocators modules/concurrency profiling - codegen .. toctree:: :maxdepth: 2 diff --git a/docs/modules/runtime.rst b/docs/modules/runtime.rst index 4d0d4fad..ae007d70 100644 --- a/docs/modules/runtime.rst +++ b/docs/modules/runtime.rst @@ -45,67 +45,7 @@ generated compilation artifacts as Warp does not automatically try to keep the c Runtime Kernel Creation ####################### -It is often desirable to specialize kernels for different types, constants, or functions at runtime. -We can achieve this through the use of runtime kernel specialization using Python closures. - -For example, we might require a variety of kernels that execute particular functions for each item in an array. -We might also want this function call to be valid for a variety of data types. Making use of closure and generics, we can generate -these kernels using a single kernel definition:: - - def make_kernel(func, dtype): - def closure_kernel_fn(data: wp.array(dtype=dtype), out: wp.array(dtype=dtype)): - tid = wp.tid() - out[tid] = func(data[tid]) - - return wp.Kernel(closure_kernel_fn) - -In practice, we might use our kernel generator, ``make_kernel()`` as follows:: - - @wp.func - def sqr(x: Any) -> Any: - return x * x - - @wp.func - def cube(x: Any) -> Any: - return sqr(x) * x - - sqr_float = make_kernel(sqr, wp.float32) - cube_double = make_kernel(cube, wp.float64) - - arr = [1.0, 2.0, 3.0] - N = len(arr) - - data_float = wp.array(arr, dtype=wp.float32, device=device) - data_double = wp.array(arr, dtype=wp.float64, device=device) - - out_float = wp.zeros(N, dtype=wp.float32, device=device) - out_double = wp.zeros(N, dtype=wp.float64, device=device) - - wp.launch(sqr_float, dim=N, inputs=[data_float], outputs=[out_float], device=device) - wp.launch(cube_double, dim=N, inputs=[data_double], outputs=[out_double], device=device) - -We can specialize kernel definitions over Warp constants similarly. 
The following generates kernels that add a specified constant
-to a generic-typed array value::
-
-    def make_add_kernel(key, constant):
-        def closure_kernel_fn(data: wp.array(dtype=Any), out: wp.array(dtype=Any)):
-            tid = wp.tid()
-            out[tid] = data[tid] + constant
-
-        return wp.Kernel(closure_kernel_fn, key=key)
-
-    add_ones_int = make_add_kernel("add_one", wp.constant(1))
-    add_ones_vec3 = make_add_kernel("add_ones_vec3", wp.constant(wp.vec3(1.0, 1.0, 1.0)))
-
-    a = wp.zeros(2, dtype=int)
-    b = wp.zeros(2, dtype=wp.vec3)
-
-    a_out = wp.zeros_like(a)
-    b_out = wp.zeros_like(b)
-
-    wp.launch(add_ones_int, dim=a.size, inputs=[a], outputs=[a_out], device=device)
-    wp.launch(add_ones_vec3, dim=b.size, inputs=[b], outputs=[b_out], device=device)
-
+Warp allows generating kernels on the fly with various customizations, including closure support. Refer to the :ref:`Code Generation <code_generation>` section for the latest features.
 
 .. _Arrays:
 
@@ -682,31 +622,151 @@ This can be surprising for users that are accustomed to C-style conversions but
 Users should explicitly cast variables to compatible types using constructors like ``int()``, ``float()``, ``wp.float16()``, ``wp.uint8()``, etc.
 
+.. note::
+    For performance reasons, Warp relies on native compilers to perform numeric conversions (e.g., LLVM for CPU and NVRTC for CUDA). This is generally not a problem, but in some cases the results may vary on different devices. For example, the conversion ``wp.uint8(-1.0)`` results in undefined behavior, since the floating point value -1.0 is out of range for unsigned integer types. C++ compilers are free to handle such cases as they see fit. Numeric conversions are only guaranteed to produce correct results when the value being converted is in the range supported by the target data type.
+
 Constants
 ---------
 
-In general, Warp kernels cannot access variables in the global Python interpreter state. One exception to this is for compile-time constants, which may be declared globally (or as class attributes) and folded into the kernel definition.
+Warp kernels can access Python variables with some restrictions. External values referenced in a kernel are treated as compile-time constants and are folded into the kernel definition when the module is built.
 
-Constants are defined using the ``wp.constant()`` function. An example is shown below::
+.. code:: python
 
-    TYPE_SPHERE = wp.constant(0)
-    TYPE_CUBE = wp.constant(1)
-    TYPE_CAPSULE = wp.constant(2)
+    TYPE_SPHERE = 0
+    TYPE_CUBE = 1
+    TYPE_CAPSULE = 2
 
     @wp.kernel
     def collide(geometry: wp.array(dtype=int)):
 
         t = geometry[wp.tid()]
 
-        if (t == TYPE_SPHERE):
+        if t == TYPE_SPHERE:
            print("sphere")
-        if (t == TYPE_CUBE):
+        elif t == TYPE_CUBE:
            print("cube")
-        if (t == TYPE_CAPSULE):
+        elif t == TYPE_CAPSULE:
            print("capsule")
 
-.. autoclass:: constant
+Supported Constant Types
+########################
+
+Only value types can be used as constants in Warp kernels. This includes integers, floating point numbers, vectors (``wp.vec*``), matrices (``wp.mat*``), and other built-in math types. Attempting to capture other variable types will result in an exception:
+
+.. code:: python
+
+    global_array = wp.zeros(5, dtype=int)
+
+    @wp.kernel
+    def k():
+        tid = wp.tid()
+        global_array[tid] = 42  # referencing external arrays is not allowed!
+
+    wp.launch(k, dim=global_array.shape, inputs=[])
+
+Output:
+
+.. code:: text
+
+    TypeError: Invalid external reference type: <class 'warp.types.array'>
+
+Arrays cannot be captured as constants because they exist on a particular device and contain pointers to the device memory, which would make the kernel non-portable across devices. Arrays should always be passed as kernel inputs.
+
+
+Usage of ``wp.constant()``
+##########################
+
+In older versions of Warp, ``wp.constant()`` was required to declare constants that can be used in a kernel. This is no longer necessary, but the old syntax is still supported for backward compatibility. ``wp.constant()`` can still be used to check whether a value can be referenced in a kernel:
+
+.. code:: python
+
+    x = wp.constant(17.0)  # ok
+    v = wp.constant(wp.vec3(1.0, 2.0, 3.0))  # ok
+    a = wp.constant(wp.zeros(n=5, dtype=int))  # error, invalid constant type
+
+    @wp.kernel
+    def k():
+        tid = wp.tid()
+        a[tid] = x * v
+
+In this snippet, a ``TypeError`` will be raised when declaring the array with ``wp.constant()``. If ``wp.constant()`` were omitted, the error would be raised later, during code generation, which might be slightly harder to debug.
+
+
+Updating Constants
+##################
+
+One limitation of using external variables in Warp kernels is that Warp doesn't know when the value is modified:
+
+.. code:: python
+
+    C = 17
+
+    @wp.kernel
+    def k():
+        print(C)
+
+    wp.launch(k, dim=1)
+
+    # redefine constant
+    C = 42
+
+    wp.launch(k, dim=1)
+
+This prints:
+
+.. code:: text
+
+    Module __main__ 4494df2 load on device 'cuda:0' took 163.54 ms (compiled)
+    17
+    17
+
+During the first launch of kernel ``k``, the kernel is compiled using the existing value of ``C`` (17). Since ``C`` is just a plain Python variable, Warp has no way of detecting when it is modified, so on the second launch the old value is printed again.
+
+One way to get around this limitation is to tell Warp that the module was modified:
+
+.. code:: python
+
+    C = 17
+
+    @wp.kernel
+    def k():
+        print(C)
+
+    wp.launch(k, dim=1)
+
+    # redefine constant
+    C = 42
+
+    # tell Warp that the module was modified
+    k.module.mark_modified()
+
+    wp.launch(k, dim=1)
+
+This produces the updated output:
+
+.. code:: text
+
+    Module __main__ 4494df2 load on device 'cuda:0' took 167.92 ms (compiled)
+    17
+    Module __main__ 9a0664f load on device 'cuda:0' took 164.83 ms (compiled)
+    42
+
+Notice that calling ``module.mark_modified()`` caused the module to be recompiled on the second launch using the latest value of ``C``.
+
+.. note::
+    The ``Module`` class and the ``mark_modified()`` method are considered internal. A public API for working with modules is planned, but for now this interface is subject to change without notice. Programs should not rely heavily on ``mark_modified()``, but it can be used in a pinch.
+
+
+Related Links
+#############
+
+The :ref:`Code Generation <code_generation>` section contains additional information about working with constants and external variables:
+
+* :ref:`Static expressions <static_expressions>`
+* :ref:`Dynamic kernel creation <dynamic_generation>`
+* :ref:`Late binding <late_binding>`
+
 Predefined Constants
 ####################