How to improve JIT tracing speed with inevitable scalar loops in C++ ? #1435

xacond00 · 2024-12-14T23:07:35Z

I'm writing an algorithm that needs to store the sampling data of all sensors in a multi-sensor setup in vectorized variants. In a nutshell, it takes samples from one sensor and reprojects them to all the other sensors in the scene, plus some additional weights.
I.e.:

struct SampleData {
    Spectrum result;
    Vector2f pos;
    Float weight;
    Float pdf;
    Bool visible;
    // ... other vector data + DRJIT_STRUCT definition
};

The sample data is allocated as a plain array inside an integrator using new SampleData[] and passed into a custom function that works as follows (for simplicity only the single-sensor case is given, but in reality we sample all sensors at once in the first step):

1. Take one sensor and sample its rays
2. With these rays in mind, run a custom sample() algorithm that:
   a) Finds the intersection with the scene from the selected camera
   b) For all the other sensors, calls sample_direction and checks visibility, filling the Data array in a single scalar loop
   c) Based on the stored pdf and other factors, computes the weight for each sample using MIS in a scalar NESTED loop
   d) Samples direct emitters and irradiance at the point
   e) Stores the computed radiance in all sensors
3. Store the sample data in a block according to the precomputed offsets of the sensors

The problem is that with a larger number of sensors (> 8), the algorithm takes FOREVER to trace and compile (upwards of 10 minutes), and the main culprit is the loop in 2c).
Here is a simplified version of what it is trying to do:

for (int k = 0; k < Nsensors; k++) {
    Float adj = 0;
    for (int l = 0; l < Nsensors; l++) {
        // weightSamples is a somewhat more complicated function: a bunch of divisions, selects, etc.
        adj += weightSamples(sample[k], sample[l]);
    }
    sample[k].weight = sample[k].pdf / adj;
}

As I said, this takes up the majority of the tracing and compilation time, and I have no idea how to fix it.

a) I've tried transforming the loop into a dr::while_loop, but later realised that with a scalar index it doesn't do anything and doesn't affect the times in the slightest. Nesting two such loops kills the performance even more :/.

b) When running the while loop with a vector index, I cannot keep a scalar index alongside it to index into the array, as that gives a runtime error. Why aren't vectorized/symbolic loops possible with a single scalar counter rather than a vector?

c) Using a vector index might be possible, but I would have to use gather/scatter, and I'm not sure how to make that work with an array of vector structures. I've only ever seen it used inside BatchSensor to gather a simple dynamic array of sensor pointers into a SensorPtr vector, which is a totally different use case.

d) What even is the difference between a normal C++ loop and a dr::while_loop that wouldn't diverge at all? In the documentation I've seen something about running in symbolic mode, but aren't normal loops with vector code also running in symbolic mode?

e) From my POV, this loop is absolutely unavoidable, as the rest of the data structures are already vectorized as much as they can be.

Thanks for any ideas!

rtabbara · 2024-12-17T15:40:51Z

Collaborator

Hi @xacond00,

I would recommend checking out the Dr.Jit documentation section on control flow to get a good idea of the differences between symbolic and evaluated modes.

Additionally, I would suggest at least as a first pass, if possible, trying to implement your algorithm using Python first, so that you can leverage the @dr.syntax decorator. In particular, it lets you express vectorized loops without getting bogged down in the plumbing of setting up a dr::while_loop, which can be tricky particularly in the nested case.

But in short, using symbolic loops should at the very least reduce tracing time, because you're now no longer unrolling your loop >8 * >8 times, so it could be an implementation issue on your part. But that's why I think writing it in Python first would be beneficial to narrow down any issues.
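To illustrate why unrolling matters, here is a toy model in plain Python. It is not Dr.Jit: `TraceVar` and `traced_ops` are hypothetical names, and each arithmetic operation simply appends one entry to a log, standing in for a node recorded during tracing. The single addition in the inner loop is a stand-in for the much larger `weightSamples` call, so real graphs would grow far faster than this count suggests.

```python
class TraceVar:
    """Toy stand-in for a JIT variable: every operation records one node."""

    def __init__(self, log):
        self.log = log

    def __add__(self, other):
        self.log.append("add")
        return TraceVar(self.log)

    def __truediv__(self, other):
        self.log.append("div")
        return TraceVar(self.log)


def traced_ops(n_sensors):
    """Count the nodes recorded by an unrolled scalar double loop."""
    log = []
    pdf = [TraceVar(log) for _ in range(n_sensors)]
    for k in range(n_sensors):          # scalar loop: unrolled while tracing
        adj = TraceVar(log)
        for l in range(n_sensors):      # scalar loop: unrolled while tracing
            adj = adj + pdf[l]          # stand-in for the weightSamples call
        _ = pdf[k] / adj                # weight = pdf / adj
    return len(log)


print(traced_ops(8))    # 72 recorded nodes
print(traced_ops(32))   # 1056 recorded nodes
```

In this model the recorded graph grows as n * (n + 1), whereas a loop that is traced once would record its body a single time regardless of the sensor count.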

7 replies

xacond00 Dec 18, 2024
Author

> If something is broken with baseline Mitsuba, then we're happy to help. But please understand that we will not be able to delve into your code in detail. Responses like "blew me off" are not helpful.

How is the inability to write non-divergent symbolic loops not "broken"? In the Python + Dr.Jit + Mitsuba front end, the same sort of loops take ages to compile too, and this forced unrolling is not always the most efficient behavior, nor is it always wanted.
(I'm talking about how even the while_loop construct always reduces to a scalar while loop once it detects a scalar index.)

> If you need some kind of a batch sensor, please first check if the built in batch plugin addresses your use case. This groups multiple sensors into one to trace the rendering just once while rendering many views.

I'm already using the batch sensor. The problem is that some control flow still has to run sequentially. I.e., if I trace the first hit of 32 cameras at once, I still need to check the visibility/pdf of every other camera against the primary hit point.
This computation expands into something like this (the numbers are camera indices, and in practice each is replicated millions of times!):

[0 1 2 3 ... 31]    // Primary hit
[1 2 3 4 ... 0]     // Visibility and pdf check from the other viewpoints
[2 3 4 5 ... 1]
...
[31 0 1 2 ... 30]

The nested MIS computation example is even worse, as it has to check every index with all the other indices.
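For reference, the rotation pattern listed above can be written down in a few lines of plain Python. This is only a sketch of the index bookkeeping; `rotated_rows` is a hypothetical helper name, not part of Mitsuba or Dr.Jit.

```python
def rotated_rows(n_cameras):
    """Row r holds, for each primary camera i, the camera (i + r) % n checked in pass r."""
    return [[(i + r) % n_cameras for i in range(n_cameras)]
            for r in range(n_cameras)]


rows = rotated_rows(32)
print(rows[0][:4], rows[0][-1])    # primary hit: starts 0 1 2 3, ends with 31
print(rows[1][:4], rows[1][-1])    # first check pass: starts 1 2 3 4, ends with 0
print(rows[31][:4], rows[31][-1])  # last pass: starts 31 0 1 2, ends with 30
```

Every pass touches all cameras once, so n cameras need n passes in total, and the nested MIS step compares each entry with every other entry on top of that.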

> Sorry, but our resources to support custom modifications are quite limited

If you cannot provide any help, that's all good.

Edit:

In case you also wonder why I don't simply convert AoS to SoA to be able to use scatter/gather operations with symbolic loops:
Memory usage. A single SampleData already holds vectors of size N = 2048*2048*16 in a realistic use case. Packing this into a single vector of size Np = N * 32 (the number of sensors) would lead to enormous memory usage during evaluation, especially with Spectrum components: 32GB according to my own calculations, which just isn't feasible.
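A quick back-of-the-envelope check of that figure, as a sketch in plain Python. The post does not state the storage format, so the 4 bytes per value (single precision) and the 4 spectrum channels used here are assumptions, and `soa_bytes` is a hypothetical helper.

```python
def soa_bytes(samples_per_sensor, n_sensors, channels, bytes_per_value=4):
    """Bytes needed to hold one packed field for all sensors at once."""
    return samples_per_sensor * n_sensors * channels * bytes_per_value


N = 2048 * 2048 * 16                      # samples per sensor (from the post)
GIB = 2 ** 30

spectrum = soa_bytes(N, 32, channels=4)   # ASSUMPTION: 4-channel, float32 spectrum
scalar = soa_bytes(N, 32, channels=1)     # a single Float field such as pdf

print(spectrum / GIB)   # 32.0 GiB under these assumptions
print(scalar / GIB)     # 8.0 GiB for each additional scalar field
```

Under these assumptions the packed Spectrum field alone comes to 32 GiB, before counting any of the other members of SampleData.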

wjakob Dec 18, 2024
Maintainer

I still don't understand your issue. If I write the following code (Python for simplicity):

import drjit
from drjit.llvm import Int

@drjit.syntax
def foo():
    i = Int(0)
    while i < 10000:
        j = Int(0)
        while j < 10000:
            print('hello')
            j += 1
        i += 1

foo()

I get exactly one hello() printed on the console, because this code is traced just once. So the fact that this loop is scalar doesn't prevent tracing.

If you need to dynamically index into arrays using symbolic indices, then you can use dr.Local (https://drjit.readthedocs.io/en/latest/misc.html#local-memory) to create a scratch pad. This also supports custom data structures (e.g. data classes). This is an area where the Python bindings are currently much more mature than the C++ interface, but it would be possible to expose that functionality in C++ as well.

Does this address your use case?

xacond00 Dec 18, 2024
Author

> I still don't understand your issue. If I write the following code (Python for simplicity):

The issue is not with existing functionality, it's with a "missing" one.

> I get exactly one hello() printed on the console, because this code is traced just once. So the fact that this loop is scalar doesn't prevent tracing.

That's not at all what I'm doing. Those indices are no longer scalar, as expected by the documentation.

> If you need to dynamically index into arrays using symbolic indices, then you can use dr.Local (https://drjit.readthedocs.io/en/latest/misc.html#local-memory) to create a scratch pad. This also supports custom data structures (e.g. data classes). This is an area where the Python bindings are currently much more mature than the C++ interface, but it would be possible to expose that functionality in C++ as well.

Even if I were to expose the interface and use this local memory, it seems awfully inefficient, especially since I don't need "computed indices" at all. I cannot afford the extra memory cost at all.

I'm going to boil the code down to something much more comprehensible, so that I don't confuse you with my specific algorithm.

Let's say you want to compute a result over a very large array, where you also need to access individual parts of it, split into a thousand pieces.

size_t huge = ...;
Float x = dr::arange<Float>(huge);
Float y = dr::arange<Float>(huge);
dr::scatter(x, val, idx); // Modify the first 1/1000th of the array
... // Many more operations on the individual parts
Float z = dr::sin(x * y) * x + y.....; // Imagine a much more complex and longer operation
return dr::sum(z);

If the array doesn't fit into the device's memory, it cannot be evaluated; the JIT will just crash.
What you could do instead is break the computation down so that you don't overwhelm the device's memory.

size_t mid = huge / 1000;
Float x[] = [dr::arange<Float>(mid) + i * mid for i in range(1000)]; // Sorry for list generators :D
Float y[] = [dr::arange<Float>(mid) + i * mid for i in range(1000)];
x[0] = val; // Modify the first 1/1000th of the array
... // Other operations on the individual parts
Float z = 0;
for (int i = 0; i < 1000; i++) {
    z += dr::sin(x[i] * y[i]) * x[i] + y[i].....; // Imagine a much more complex operation
}
return dr::sum(z);

Now each evaluation fits perfectly into the device's memory...
But there is a huge slowdown from the loop being unrolled during tracing and compilation, as every single iteration is recorded into the graph.
A symbolic loop isn't possible because of the access into a scalar array. Local memory would be too inefficient and is also not needed, as the indices are not vectors.
However, there is no mechanism for making this type of loop symbolic, even when there are no dependencies between individual threads, and in theory this should be perfectly possible at the hardware level.
Do you see how this might be a problem?
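As a scalar reference for the two formulations above, here is a sketch in plain Python rather than Dr.Jit. It assumes x and y are both the index sequence 0, 1, 2, ..., omits the scatter/modification step, and assumes the length is divisible by the chunk count. `full_sum` and `chunked_sum` are hypothetical names.

```python
import math


def full_sum(n):
    """One pass over the whole range: sum of sin(x * y) * x + y with x = y = i."""
    return sum(math.sin(i * i) * i + i for i in range(n))


def chunked_sum(n, n_chunks):
    """Same reduction, evaluated one chunk at a time (n must be divisible by n_chunks)."""
    chunk = n // n_chunks
    total = 0.0
    for c in range(n_chunks):           # this is the loop that gets unrolled when traced
        start = c * chunk
        total += sum(math.sin(i * i) * i + i for i in range(start, start + chunk))
    return total


print(math.isclose(full_sum(10_000), chunked_sum(10_000, 100), rel_tol=1e-9))
```

Both versions compute the same value up to floating-point rounding. The difference discussed in this thread is only in how the outer loop over chunks is handled during tracing.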

wjakob Dec 18, 2024
Maintainer

Can I ask you to take a close look at the dr::Local feature? I think it exists precisely to address this kind of use case. Basically, it lets you take what you have described here in C code and trace it into an equivalent program.

xacond00 Dec 18, 2024
Author

> Can I ask you to take a close look at the dr::Local feature? I think it precisely exists to address this kind of use case. Basically it allows to take what you have described here in C code, and to trace it into an equivalent program.

Don't you see the inefficiency here though? The use case is completely different. Yes, it would allow me to index into individual elements, but at what cost? I don't even need to access different elements across threads, just the same ones.

xacond00 · 2024-12-21T16:08:26Z

Author

If anyone is wondering: these kinds of situations cannot even be optimized into a packed vector for a single computation instead of a loop if the types used are Dr.Jit structs (Intersections3f etc.), because dr::tile doesn't work on them.
I couldn't get the gather function to work on structs either, so vector/symbolic loops are still out of the question.

The previously suggested local memory is very poorly documented and also super inefficient for this use case, considering that this shouldn't be thread-local data at all and rather requires some broadcasting capability / saving of multiple (but predictable) JIT variables during JIT traversal.

This is a huge bottleneck, which should have been remedied in the JIT backend.

I really do hope this problem becomes solvable one way or another once the number of "useful" operations in the Dr.Jit library grows. The few operations available so far prevent the implementation of many less straightforward parallel PT algorithms.

xacond00 · 2025-01-07T16:08:55Z

Author

To update on this "issue":

The main problem was calling the expensive brdf->pdf() virtual functions inside the nested loop.
What marginally helped with compilation speed (about 700 s vs. 900 s total runtime on the first pass) was putting the whole inner loop inside an "if_stmt" whose condition checks for "valid" interactions (visible, non-delta, etc.).
I have no idea whether it allowed the code to get symbolized, or whether if_stmt does some magic like proper per-warp jumps inside CUDA, but somehow it works.

Additionally, avoiding the virtual pdf computation altogether by using a simplified fixed function resulted in just 220 s total runtime. (It works, but it is not a correct solution for all microfacet BRDFs.)
