Feature/limited multithreading #23

BaptisteLamic · 2025-02-05T20:39:27Z

Hi,
This PR introduces some parallelization in HssMatrices. Benchmarks on an M1 Pro show a performance increase when parallelization is enabled, with no degradation compared to main when it is disabled.

There is probably room for further performance improvement, but I think this is a good start.

Benchmark Results
The benchmarks were performed using an updated benchmarking script for both the feature branch and the old version. Results are as follows:

New Version

Without Multithreading (multithreaded = false)

Benchmarking full...
12.468 ms (1473 allocations: 70.92 MiB)
Benchmarking getindex...
75.417 μs (1227 allocations: 688.86 KiB)
Benchmarking compression...
126.705 ms (10080 allocations: 437.57 MiB)
Benchmarking randomized compression...
47.087 ms (20965 allocations: 65.62 MiB)
Benchmarking re-compression...
2.398 ms (14970 allocations: 4.80 MiB)
Benchmarking proper...
3.492 ms (7490 allocations: 6.37 MiB)
Benchmarking addition...
206.250 μs (1086 allocations: 4.89 MiB)
Benchmarking matvec...
360.417 μs (1374 allocations: 1.48 MiB)
Benchmarking matrix products...
2.243 ms (4640 allocations: 11.13 MiB)
Benchmarking ulvfactsolve...
8.661 ms (7497 allocations: 22.56 MiB)
Benchmarking hssldivide...
42.087 ms (66937 allocations: 76.29 MiB)

With Multithreading (multithreaded = true)

Threads.nthreads() = 6
Benchmarking full...
12.143 ms (1473 allocations: 70.92 MiB)
Benchmarking getindex...
76.583 μs (1227 allocations: 688.86 KiB)
Benchmarking compression...
121.999 ms (10080 allocations: 437.57 MiB)
Benchmarking randomized compression...
47.028 ms (21475 allocations: 65.68 MiB)
Benchmarking re-compression...
988.250 μs (15195 allocations: 4.82 MiB)
Benchmarking proper...
830.834 μs (7565 allocations: 6.38 MiB)
Benchmarking addition...
103.042 μs (1236 allocations: 4.90 MiB)
Benchmarking matvec...
343.208 μs (1629 allocations: 1.51 MiB)
Benchmarking matrix products...
1.382 ms (4865 allocations: 11.15 MiB)
Benchmarking ulvfactsolve...
8.834 ms (7497 allocations: 22.56 MiB)
Benchmarking hssldivide...
19.697 ms (65897 allocations: 72.22 MiB)

Old version

Benchmarking full...
12.572 ms (1473 allocations: 70.92 MiB)
Benchmarking getindex...
77.625 μs (1227 allocations: 688.86 KiB)
Benchmarking compression...
119.936 ms (10080 allocations: 437.57 MiB)
Benchmarking randomized compression...
45.687 ms (20281 allocations: 65.58 MiB)
Benchmarking re-compression...
2.354 ms (14736 allocations: 4.78 MiB)
Benchmarking proper...
3.482 ms (7396 allocations: 6.37 MiB)
Benchmarking addition...
207.500 μs (1085 allocations: 4.89 MiB)
Benchmarking matvec...
316.625 μs (1032 allocations: 1.46 MiB)
Benchmarking matrix products...
2.270 ms (4453 allocations: 11.12 MiB)
Benchmarking ulvfactsolve...
8.861 ms (7497 allocations: 22.56 MiB)
Benchmarking hssldivide...
41.304 ms (63843 allocations: 76.16 MiB)
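For a quick read of the numbers, the ratio between the two new-version runs can be computed directly from the timings listed above. This is only a small sketch: the values are copied from the output and converted to milliseconds, and the variable names are illustrative.

```julia
# Timings copied from the benchmark output above, in milliseconds.
# Each entry is (multithreaded = false, multithreaded = true).
timings_ms = [
    "re-compression"  => (2.398, 0.988250),
    "proper"          => (3.492, 0.830834),
    "addition"        => (0.206250, 0.103042),
    "matvec"          => (0.360417, 0.343208),
    "matrix products" => (2.243, 1.382),
    "hssldivide"      => (42.087, 19.697),
]

# Speedup = serial time / multithreaded time (values > 1 mean faster).
speedups = [name => serial / threaded for (name, (serial, threaded)) in timings_ms]

for (name, s) in speedups
    println(rpad(name, 16), " ", round(s; digits = 2), "x")
end
```

The operations left out of the list (full, getindex, compression, randomized compression, ulvfactsolve) show essentially the same timings in both runs.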

BaptisteLamic · 2025-02-05T20:42:33Z

I just noticed that there are a lot of “fake changes”; I probably ran the autoformatter at some point.
I will fix that tomorrow.
Best

bonevbs · 2025-02-06T13:08:45Z

src/HssMatrices.jl

Looking at the diff, there is a lot of removed whitespace. Is this a formatting issue?

BaptisteLamic · 2025-02-07

Yes, for some reason these changes did not appear in the VS Code comparison tool. I will fix that.

bonevbs · 2025-02-06T13:11:05Z

src/compression.jl

@@ -285,8 +293,8 @@ function randcompress(A::AbstractMatOrLinOp{T}, rcl::ClusterTree, ccl::ClusterTr

  # compute initial sampling
  k = kest; r = opts.noversampling;
-  Ωcol = randn(n, k+r)
-  Ωrow = randn(m, k+r)
+  Ωcol = randn(T,n, k+r)


Is the inclusion of T here correct? I am not sure what the sampling matrix in randomized SVD should look like for complex numbers.

BaptisteLamic · 2025-02-07

Good point, I will check that.
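As a point of reference for this question, the sketch below shows what `randn(T, ...)` returns in Julia for real and complex element types, and runs a toy range-finder step with a complex sampling matrix. It does not settle whether the change is right for `randcompress`; the matrix `A` and its sizes are made up for illustration and are not taken from HssMatrices.

```julia
using LinearAlgebra, Random

Random.seed!(0)

# For a real element type, randn(T, ...) draws standard normal entries.
Ωreal = randn(Float64, 100, 8)

# For a complex element type, Julia's randn draws circularly symmetric
# complex normal entries (real and imaginary parts each have variance 1/2).
Ωcplx = randn(ComplexF64, 100, 8)

# Toy range-finder step on a hypothetical rank-5 complex matrix A.
A = randn(ComplexF64, 60, 5) * randn(ComplexF64, 5, 100)
Q = Matrix(qr(A * Ωcplx).Q)      # orthonormal basis of the sampled range
relerr = norm(A - Q * (Q' * A)) / norm(A)

println("eltype(Ωreal) = ", eltype(Ωreal))
println("eltype(Ωcplx) = ", eltype(Ωcplx))
println("relative projection error = ", relerr)
```

With 8 samples for a rank-5 matrix, the projection error in this toy case should sit near machine precision, which suggests that a complex Gaussian sampling matrix is at least a workable choice for complex inputs.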

bonevbs · 2025-02-06T13:14:45Z

test/runtests.jl

@@ -1,4 +1,5 @@
 using Test, LinearAlgebra, HssMatrices
+BLAS.set_num_threads(2)


What does this do here?

BaptisteLamic · 2025-02-07

Using too many threads when combining BLAS operations with Julia multithreading results in slowdowns due to CPU oversubscription. I put 2 here as a compromise; 1 could make more sense.
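The two thread pools involved here are independent of each other, which is why they can oversubscribe the CPU. A minimal sketch of how to inspect and limit them, using only `Threads.nthreads` and the `BLAS` functions from the `LinearAlgebra` standard library (the choice of 1 BLAS thread is just an example value):

```julia
using LinearAlgebra

# Number of Julia threads (fixed at startup, e.g. `julia --threads 6`).
println("Julia threads: ", Threads.nthreads())

# Number of threads used inside BLAS calls such as matrix products.
previous = BLAS.get_num_threads()
println("BLAS threads before: ", previous)

# Limit BLAS to one thread so that Julia-level tasks do not each fan out
# into several BLAS threads and oversubscribe the CPU cores.
BLAS.set_num_threads(1)
println("BLAS threads now: ", BLAS.get_num_threads())

# Restore the previous setting when done.
BLAS.set_num_threads(previous)
```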

bonevbs

Thanks for contributing this. The changes look good to me.

I see that you already have unit tests for both the multithreaded and single-threaded cases. This is great for making sure things work as intended. Are there any corner cases that we haven't thought of due to the introduction of parallelism?

Otherwise LGTM, just minor questions on my end.

bonevbs · 2025-02-06T13:24:29Z

Great work Baptiste! Do you have any intuition why matvec isn't faster?

bonevbs · 2025-02-06T13:26:57Z

Also, I would suggest bumping the version number and publishing a new package release once this is merged.

BaptisteLamic · 2025-02-07T07:16:39Z

Hi,
I think that I have been quite conservative in the parallelization, and so far I have not noticed any issues when using it for my simulation, so I think it should be fine. The only issue I encountered is the old problem where the code crashes when a block completely vanishes. I may have a look at this over the next few weeks.

Regarding the absence of performance gain for matvec, I suspect that the serial version is simply so fast that the multithreading overhead isn’t worth it. The situation may change for larger problem sizes. Overall, the efficiency of the parallel version is not great, but it still provides a useful speedup.

I agree that this deserves a version bump.
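To illustrate the overhead argument, here is a minimal sketch of an optionally multithreaded tree recursion. It is not the HssMatrices implementation: the `Node` type, `treesum`, and `build` are hypothetical, and the sketch uses `Threads.@spawn` from Base, whereas the PR may rely on a different mechanism.

```julia
# Hypothetical binary tree; this is not the HssMatrices data structure.
struct Node
    left::Union{Node, Nothing}
    right::Union{Node, Nothing}
    data::Vector{Float64}
end

isleaf(n::Node) = n.left === nothing

# Sum over all leaves, optionally running the two subtrees as parallel tasks.
function treesum(n::Node; multithreaded::Bool = false)
    isleaf(n) && return sum(n.data)
    if multithreaded
        # Spawn the left subtree as a task and compute the right one here.
        task = Threads.@spawn treesum(n.left; multithreaded = true)
        r = treesum(n.right; multithreaded = true)
        return fetch(task) + r
    else
        return treesum(n.left) + treesum(n.right)
    end
end

# Build a balanced tree of the given depth with `leafsize` entries per leaf.
function build(depth::Int, leafsize::Int)
    if depth == 0
        return Node(nothing, nothing, ones(leafsize))
    end
    return Node(build(depth - 1, leafsize), build(depth - 1, leafsize), Float64[])
end

tree = build(4, 1000)
serial = treesum(tree)
threaded = treesum(tree; multithreaded = true)
println("serial = ", serial, ", threaded = ", threaded)
```

Every spawned task has a fixed scheduling cost, so when the work per leaf is small, as it plausibly is for matvec at this problem size, that cost can cancel the gain from running subtrees in parallel.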

BaptisteLamic added 12 commits November 30, 2024 16:20

56d9715  Limited mulithreading
9bd0506  Extend multithreading in ulvdivide and matmul
f362297  Replace Threads by RecursionTools
50bcd94  Tune tests parameters to reduce tests time
3807377  Make multithreading in generators.jl optional
ab9d49e  Recursion optional for ldiv
cbb8db0  Optional multithreading for ulvdivide.jl
26229c1  Update benchmark script
39fc7f6  + and - optionaly multithreaded
c02e8c0  Optional multithreading for matmul.jl
4b7ac1d  Fine grained threading control
67f96be  Add a conversion and improve benchmarking

bonevbs reviewed Feb 6, 2025
