Parallel Pixel Maps #602

ripytide · 2024-05-12T16:13:45Z

This is just a draft so I can get some feedback on an initial implementation of map_subpixels_parallel() before I add the remaining variation of functions map_subpixels_parallel_mut() map_pixels_parallel() map_pixels_parallel_mut().

This would address #429

ripytide · 2024-05-12T16:33:55Z

I've just realized this was attempted in #446 but was closed due to being slower than the original. After reading the PR it looks like it used par_bridge as image may not have had raw par iterators 4 years ago which it now has. I'll do some benchmarks to test how my implementation is compared to the non-parallel version.

ripytide · 2024-05-12T16:41:35Z

Here are the results:

running 4 tests
test map::benches::bench_map_subpixels_100           ... bench:       9,856 ns/iter (+/- 1,715)
test map::benches::bench_map_subpixels_1000              ... bench:   1,047,116 ns/iter (+/- 42,256)
test map::benches::bench_map_subpixels_parallel_100   ... bench:      26,220 ns/iter (+/- 14,109)
test map::benches::bench_map_subpixels_parallel_1000   ... bench:     375,886 ns/iter (+/- 170,319)

As can be seen which function is faster depends on the size of the image, 100x100 images are faster with the non-parallel version whereas 1000x1000 images are faster with the parallel version.

Morgane55440 · 2024-05-12T20:28:20Z

my PC is slower than yours (probably all the other stuff i'm running), but i might have something to help for the inefficiencies for low values.

lots of time is lost on dispatching, so we need to dispatch less, by parallelizing on lines instead of pixels.
of course, there is no support for this from the image crate, so i had to improvize.

here are my results :
your impl on my pc

test map::benches::bench_map_subpixels_100                                            ... bench:      16,226 ns/iter (+/- 189)
test map::benches::bench_map_subpixels_1000                                           ... bench:   2,021,782 ns/iter (+/- 138,602)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:      31,478 ns/iter (+/- 3,454)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     685,178 ns/iter (+/- 105,519)

my impl on my pc

test map::benches::bench_map_subpixels_100                                            ... bench:      16,320 ns/iter (+/- 214)
test map::benches::bench_map_subpixels_1000                                           ... bench:   2,001,935 ns/iter (+/- 114,785)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:      15,008 ns/iter (+/- 2,176)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     634,393 ns/iter (+/- 158,002)

it is still much slower for very small images, so one could just default to sequential for images under 10k pixels

test map::benches::bench_map_subpixels_10                                             ... bench:         200 ns/iter (+/- 12)
test map::benches::bench_map_subpixels_100                                            ... bench:      16,187 ns/iter (+/- 278)
test map::benches::bench_map_subpixels_1000                                           ... bench:   2,009,675 ns/iter (+/- 115,502)
test map::benches::bench_map_subpixels_parallel_10                                    ... bench:       5,757 ns/iter (+/- 1,303)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:      15,283 ns/iter (+/- 1,232)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     699,168 ns/iter (+/- 148,572)

Morgane55440 · 2024-05-12T20:31:08Z

here is my take on the function body

{
    use itertools::Itertools;
    use rayon::iter::IndexedParallelIterator;
    use rayon::iter::ParallelIterator;
    use rayon::slice::{ParallelSlice, ParallelSliceMut};

    let (width, height) = image.dimensions();
    let mut out: ImageBuffer<ChannelMap<P, S>, Vec<S>> = ImageBuffer::new(width, height);

    image
        .par_chunks(width as usize * <P as Pixel>::CHANNEL_COUNT as usize)
        .zip_eq(
            out.par_chunks_mut(
                width as usize * <ChannelMap<P, S> as Pixel>::CHANNEL_COUNT as usize,
            ),
        )
        .for_each(|(input_line, output_line)| {
            input_line
                .chunks_exact(<P as Pixel>::CHANNEL_COUNT as usize)
                .map(|v| <P as Pixel>::from_slice(v))
                .zip_eq(
                    output_line
                        .chunks_exact_mut(<ChannelMap<P, S> as Pixel>::CHANNEL_COUNT as usize)
                        .map(|v| <ChannelMap<P, S> as Pixel>::from_slice_mut(v)),
                )
                .for_each(|(input_pixel, output_channels)| {
                    input_pixel
                        .channels()
                        .iter()
                        .zip(output_channels.channels_mut())
                        .for_each(|(input_subpixel, output_subpixel)| {
                            *output_subpixel = f(*input_subpixel)
                        })
                });
        });

    out
}

this is more efficient the faster the function is, which is great when it's just a sum or multiplication

ripytide · 2024-05-12T20:45:09Z

Great work, line based parallelism does seem like the best option for CPU cache since all the memory would be right next to each other. I wonder how these scale at 10000x10000.

Morgane55440 · 2024-05-12T20:47:51Z

i have a better iteration, that is both faster for small values and more readable

{
    use rayon::iter::IndexedParallelIterator;
    use rayon::iter::ParallelIterator;
    use rayon::slice::{ParallelSlice, ParallelSliceMut};

    let (width, height) = image.dimensions();
    let mut out: ImageBuffer<ChannelMap<P, S>, Vec<S>> = ImageBuffer::new(width, height);

    const SUBPIXEL_PER_CHUNK : usize = 512;
    image
        .par_chunks(SUBPIXEL_PER_CHUNK)
        .zip_eq(out.par_chunks_mut(SUBPIXEL_PER_CHUNK))
        .for_each(|(input_line, output_line)| {
            input_line.iter().zip(output_line.iter_mut()).for_each(
                |(input_subpixel, output_subpixel)| *output_subpixel = f(*input_subpixel),
            );
        });

    out
}

Morgane55440 · 2024-05-12T20:49:40Z

SUBPIXEL_PER_CHUNK is arbitrary, but 512 seems pretty good experimentally

ripytide · 2024-05-12T21:01:01Z

I've added you to my fork in case you want to update the implementation as we find faster versions.

Morgane55440 · 2024-05-12T21:08:27Z

Great work, line based parallelism does seem like the best option for CPU cache since all the memory would be right next to each other. I wonder how these scale at 10000x10000.

sadly for me it doesn't seem to improve much beyond 3x. might be my pc's fault

test map::benches::bench_map_subpixels_10                                             ... bench:         204 ns/iter (+/- 2)
test map::benches::bench_map_subpixels_100                                            ... bench:      16,439 ns/iter (+/- 195)
test map::benches::bench_map_subpixels_1000                                           ... bench:   2,079,892 ns/iter (+/- 395,519)
test map::benches::bench_map_subpixels_10_000                                         ... bench: 222,443,410 ns/iter (+/- 20,580,424)
test map::benches::bench_map_subpixels_parallel_10                                    ... bench:          66 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:      12,425 ns/iter (+/- 3,846)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     700,052 ns/iter (+/- 121,973)
test map::benches::bench_map_subpixels_parallel_10_000                                ... bench:  61,087,340 ns/iter (+/- 5,226,359)

Morgane55440 · 2024-05-12T21:46:19Z

the values i get from my tests are really weird, i think there might be something weird going on.

it'd be nice if you could test it on your machine.

Morgane55440 · 2024-05-12T21:50:34Z

it says it's 10x faster for a 30 by 30 but only about 3x faster for a 1000 by 1000

test map::benches::bench_map_subpixels_10                                             ... bench:         192 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_100                                            ... bench:      16,343 ns/iter (+/- 233)
test map::benches::bench_map_subpixels_1000                                           ... bench:   2,019,145 ns/iter (+/- 997,715)
test map::benches::bench_map_subpixels_30                                             ... bench:       1,425 ns/iter (+/- 47)
test map::benches::bench_map_subpixels_60                                             ... bench:       5,580 ns/iter (+/- 86)
test map::benches::bench_map_subpixels_parallel_10                                    ... bench:          65 ns/iter (+/- 0)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:      13,283 ns/iter (+/- 1,725)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     728,398 ns/iter (+/- 118,186)
test map::benches::bench_map_subpixels_parallel_30                                    ... bench:         156 ns/iter (+/- 4)
test map::benches::bench_map_subpixels_parallel_60                                    ... bench:       2,558 ns/iter (+/- 1,018)

which is extra weird because for 30 by 30 there should only be one thread running so there shouldn't be any significant improvement

Morgane55440 · 2024-05-12T22:03:19Z

small question : why does map_subpixels_parallel only support ImageBuffer while all the other functions support GenericImage ?

ripytide · 2024-05-12T22:38:33Z

GenericImage doesn't require the image to be stored contiguously, whereas ImageBuffer does since it implements Deref<Target=[P::Subpixel]> which is needed for rayon methods like par_chunks(). Actually, on reading the source it looks like it implements Deref<[]> only when the Container also implements Deref<> which is the case for Vec. Maybe we could expand it to GenericImage with some bound like Deref<[]>?

Morgane55440 · 2024-05-12T22:45:29Z

well, i have bad news
all the speed increase is solely due to the fact that we're working with ImageBuffer and contiguous memory, and removing the parralel bits makes it all faster including 1000 by 1000

ripytide · 2024-05-12T22:45:55Z

The benches on my machine (6-core Ryzen 5 4500U):

test map::benches::bench_map_subpixels_10                                             ... bench:         207 ns/iter (+/- 2)
test map::benches::bench_map_subpixels_100                                            ... bench:      16,271 ns/iter (+/- 145)
test map::benches::bench_map_subpixels_1000                                           ... bench:   1,725,553 ns/iter (+/- 107,609)
test map::benches::bench_map_subpixels_30                                             ... bench:       1,467 ns/iter (+/- 12)
test map::benches::bench_map_subpixels_60                                             ... bench:       5,660 ns/iter (+/- 26)
test map::benches::bench_map_subpixels_parallel_10                                    ... bench:          81 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:       9,735 ns/iter (+/- 1,870)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     243,560 ns/iter (+/- 43,263)
test map::benches::bench_map_subpixels_parallel_30                                    ... bench:         168 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_60                                    ... bench:       7,178 ns/iter (+/- 1,279)

Morgane55440 · 2024-05-12T22:48:33Z

what happens if you replace the body by

{
    use itertools::Itertools;

    let (width, height) = image.dimensions();
    let mut out: ImageBuffer<ChannelMap<P, S>, Vec<S>> = ImageBuffer::new(width, height);

    image.iter().zip_eq(out.iter_mut()).for_each(
            |(input_subpixel, output_subpixel)| *output_subpixel = f(*input_subpixel),
        );

    out
}

?

ripytide · 2024-05-12T22:50:28Z

test map::benches::bench_map_subpixels_10                                             ... bench:         208 ns/iter (+/- 2)
test map::benches::bench_map_subpixels_100                                            ... bench:      16,292 ns/iter (+/- 337)
test map::benches::bench_map_subpixels_1000                                           ... bench:   1,698,793 ns/iter (+/- 202,548)
test map::benches::bench_map_subpixels_30                                             ... bench:       1,464 ns/iter (+/- 19)
test map::benches::bench_map_subpixels_60                                             ... bench:       6,032 ns/iter (+/- 54)
test map::benches::bench_map_subpixels_parallel_10                                    ... bench:          58 ns/iter (+/- 0)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:       1,035 ns/iter (+/- 9)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     384,083 ns/iter (+/- 22,183)
test map::benches::bench_map_subpixels_parallel_30                                    ... bench:         137 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_60                                    ... bench:         393 ns/iter (+/- 2)

Morgane55440 · 2024-05-12T22:51:41Z

well it seems the parallel version is slightly faster for big images

ripytide · 2024-05-12T22:53:24Z

Well at least we've learnt that there is massive room for optimization via ImageBuffer over GenericImage hehe.

Morgane55440 · 2024-05-12T22:57:46Z

still a bit perplexed at how inefficient the parallelization is, there might be something else to it i'm missing

ripytide · 2024-05-12T23:04:14Z

From the cargo docs:

The --jobs argument affects the building of the benchmark executable but does not affect how many threads are used when running the benchmarks. The Rust test harness runs benchmarks serially in a single thread.

I think libtest may only be running programs with singlethreaded? Oh wait I think that just means they are run sequentially and not that they are limited to one thread when executing.

Morgane55440 · 2024-05-13T07:13:41Z

the parallelization is happening for sure

this is all very fishy

Morgane55440 · 2024-05-13T08:31:22Z

here is what happen if i remove the with_min_len (dramatic slow down for small images), but use more expensive computation as the input function (|x| f64::from(x).sqrt() as u8 + 1) :

test map::benches::bench_map_subpixels_parallel_10                                    ... bench:      15,389 ns/iter (+/- 2,866)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:      40,917 ns/iter (+/- 5,783)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     911,286 ns/iter (+/- 100,335)
test map::benches::bench_map_subpixels_parallel_10_000                                ... bench:  79,796,480 ns/iter (+/- 8,970,528)
test map::benches::bench_map_subpixels_parallel_30                                    ... bench:      26,007 ns/iter (+/- 7,937)
test map::benches::bench_map_subpixels_parallel_60                                    ... bench:      33,506 ns/iter (+/- 7,006)
test map::benches::bench_map_subpixels_sequential_10                                  ... bench:         401 ns/iter (+/- 6)
test map::benches::bench_map_subpixels_sequential_100                                 ... bench:      37,682 ns/iter (+/- 605)
test map::benches::bench_map_subpixels_sequential_1000                                ... bench:   4,110,240 ns/iter (+/- 190,140)
test map::benches::bench_map_subpixels_sequential_10_000                              ... bench: 427,149,130 ns/iter (+/- 326,256,736)
test map::benches::bench_map_subpixels_sequential_30                                  ... bench:       3,537 ns/iter (+/- 572)
test map::benches::bench_map_subpixels_sequential_60                                  ... bench:      13,524 ns/iter (+/- 326)

the good value of MIN_SUBPIXEL_PER_CHUNK completely depends on the input functions ; if we had a function that took 1ms to compute, then the parallel version would be much faster even for a 10 by 10 image.
therefore, imo, we should ship the parralel version without using with_min_len, explaining that it is slower than sequential for small images and fast functions, but can be significantly faster for large images(1 million pixel+) and slow fuctions

Morgane55440 · 2024-05-13T08:34:16Z

i have also come to realize that chunking manually is actually pointless, because that's already what rayon does

Morgane55440 · 2024-05-13T09:09:44Z

T_T

i thought i had done clippy on my machine

ripytide · 2024-05-13T13:31:15Z

I've de-duplicated the benchmark mapping function into the macro and switched map_subpixels_parallel() to be more similar to the rest of the mapping functions using GenericImage but with a Deref<[]> bound. I tested benchmarks before and after the commits and it had no effects as expected.

ripytide · 2024-05-13T13:33:35Z

The benchmarks take forever (200s) on my machine mainly due to the 10_000 functions (especially the sequential one), we'll probably have to remove those for the final PR.

Morgane55440 · 2024-05-13T13:48:13Z

makes sense

theotherphil · 2024-05-14T07:11:19Z

Thanks for the PR. I’ll wait until after the next release to merge this. Which will probably be next weekend as I don’t have much time on weekdays.

theotherphil · 2024-05-14T07:12:21Z

src/map.rs

+    S: std::marker::Send,
+    S: std::marker::Sync,
+    <P as WithChannel<S>>::Pixel: std::marker::Send,
+    <P as WithChannel<S>>::Pixel: std::marker::Sync,


This where clause is horrific! Is there any way of shortening it?

we could add use std::marker::{Send, Sync} at the top of the file and then

S: std::marker::Send, S: std::marker::Sync,

would become

S: Send + Sync,

the final result would be something like that :

pub fn map_subpixels_parallel<I, P, F, S>(image: &I, f: F) -> Image<ChannelMap<P, S>> where I: GenericImage<Pixel = P> + Deref<Target = [P::Subpixel]>, P: WithChannel<S> + Send + Sync, S: Primitive + Send + Sync, F: Fn(P::Subpixel) -> S + Sync, P::Subpixel: Send + Sync, >::Pixel: Send + Sync,

Send and Sync are already in rust prelude

…e parallel vs sequential version

ripytide · 2024-05-25T13:17:05Z

Okay now that all the blockers have been resolved I've fleshed out the rest of this PR.
Some other changes I made:

Renamed red_channel() to into_red_channel() since that is more clear on the operation being done.
Renamed as_red_channel() to from_red_channel() for the same reason as above, also as_ functions are supposed to be cheap operations according to the rust API guidelines.
All signatures have changes from map_pixels<I, P>(image: &I) where I: GenericImage to map_pixels(image: &Image) since it provides substantial performance improvements as it enforces contiguous storage. This is more consistent with other imageproc functions which use Image's and is more inline with the projects principle of performance over maximum genericness.

ripytide · 2024-05-25T13:24:39Z

@theotherphil this is now ready for review.

theotherphil · 2024-05-26T07:29:33Z

The new code looks good and the signature change makes sense. I’m not convinced by the name changes though. You’re right that as_ should be cheap, but there’s also a standard convention that into_ and from_ consume an owned value which isn’t the case here.

ripytide · 2024-05-26T10:22:18Z

That's true, but it is the lesser evil in my opinion as users can see their signatures fairly easily from the docs. Unless you have an alternative suggestion?

ripytide · 2024-05-26T10:24:01Z

One alternative:
grayscale_from_red_channel() and rgb_from_red_grayscale()?

theotherphil · 2024-05-26T13:14:48Z

I’ll merge as is and think about what best to name these before the next release.

Thanks for the new functions, and it’s good to see your new doc macros already paying off.

theotherphil reviewed May 14, 2024

View reviewed changes

ripytide force-pushed the map_parallel branch from 44de7c6 to f444758 Compare May 22, 2024 10:51

ripytide and others added 13 commits May 25, 2024 12:56

implement map_subpixels_parallel()

496fe40

add benchmarks for parallel vs non_parallel pixels maps

3d74298

fix overflowing multiplication

a585371

improved speed for low size + more tests in critical range

dcd85f7

simplified the function + updated/refactored benches

236431b

clippy

ce8b28b

refactor benches and switch to GenericImage over ImageBuffer

1d5f1fa

switch to GenericImage + Deref<Target=[P::Subpixel]>

0f6316e

added a note to the documentation to talk about the performance of th…

a4994ed

…e parallel vs sequential version

reduced where bounds

847d7ba

remove benches now testing has been carried out

4fc7cd1

simplify map_subpixels

7682027

rename colors to pixels and pixels to enumerated_pixels

5bd645f

ripytide force-pushed the map_parallel branch from f444758 to 5bd645f Compare May 25, 2024 11:57

ripytide added 3 commits May 25, 2024 14:00

finish creating all new mapping functions and their variations

5cb51f0

minor bugfix

ab0c4f6

add back map_pixels2

c894227

fix tests

f467c26

ripytide marked this pull request as ready for review May 25, 2024 13:24

theotherphil merged commit 161be33 into image-rs:master May 26, 2024
15 checks passed

ripytide deleted the map_parallel branch May 26, 2024 13:19

This was referenced May 26, 2024

Rename imageproc::map::map_colors() to map_pixels() #631

Closed

Parallelize map_colors/map_pixels/map_* #429

Closed

Parallel Pixel Maps #602

Parallel Pixel Maps #602

Conversation

ripytide commented May 12, 2024 • edited Loading

ripytide commented May 12, 2024

ripytide commented May 12, 2024 • edited Loading

Morgane55440 commented May 12, 2024

Morgane55440 commented May 12, 2024 • edited Loading

ripytide commented May 12, 2024

Morgane55440 commented May 12, 2024

Morgane55440 commented May 12, 2024 • edited Loading

ripytide commented May 12, 2024

Morgane55440 commented May 12, 2024 • edited Loading

Morgane55440 commented May 12, 2024

Morgane55440 commented May 12, 2024 • edited Loading

Morgane55440 commented May 12, 2024 • edited Loading

ripytide commented May 12, 2024 • edited Loading

Morgane55440 commented May 12, 2024

ripytide commented May 12, 2024 • edited Loading

Morgane55440 commented May 12, 2024

ripytide commented May 12, 2024

Morgane55440 commented May 12, 2024

ripytide commented May 12, 2024

Morgane55440 commented May 12, 2024

ripytide commented May 12, 2024 • edited Loading

Morgane55440 commented May 13, 2024

Morgane55440 commented May 13, 2024 • edited Loading

Morgane55440 commented May 13, 2024

Morgane55440 commented May 13, 2024

ripytide commented May 13, 2024

ripytide commented May 13, 2024

Morgane55440 commented May 13, 2024

theotherphil commented May 14, 2024

theotherphil May 14, 2024

Choose a reason for hiding this comment

Morgane55440 May 14, 2024

Choose a reason for hiding this comment

Morgane55440 May 14, 2024 • edited Loading

Choose a reason for hiding this comment

cospectrum May 14, 2024

Choose a reason for hiding this comment

ripytide commented May 25, 2024 • edited Loading

ripytide commented May 25, 2024

theotherphil commented May 26, 2024

ripytide commented May 26, 2024

ripytide commented May 26, 2024 • edited Loading

theotherphil commented May 26, 2024

ripytide commented May 12, 2024 •

edited

Loading

ripytide commented May 12, 2024 •

edited

Loading

Morgane55440 commented May 12, 2024 •

edited

Loading

Morgane55440 commented May 12, 2024 •

edited

Loading

Morgane55440 commented May 12, 2024 •

edited

Loading

Morgane55440 commented May 12, 2024 •

edited

Loading

Morgane55440 commented May 12, 2024 •

edited

Loading

ripytide commented May 12, 2024 •

edited

Loading

ripytide commented May 12, 2024 •

edited

Loading

ripytide commented May 12, 2024 •

edited

Loading

Morgane55440 commented May 13, 2024 •

edited

Loading

Morgane55440 May 14, 2024 •

edited

Loading

ripytide commented May 25, 2024 •

edited

Loading

ripytide commented May 26, 2024 •

edited

Loading