
Looked at permute paper #79

Open

DenisYaroshevskiy opened this issue Jun 16, 2023 · 15 comments

Comments

@DenisYaroshevskiy

I believe that p2664r3 is not implementable outside of very modern Intel (the dynamic index lookup).
I also fail to see how

// Free function to generate compile-time permute of simd
template<std::size_t SizeSelector = 0, typename T, typename Abi, typename PermuteGenerator>
constexpr resize_t<output-size, simd<T, Abi>>
permute(const simd<T, Abi>& v, PermuteGenerator&& fn)

Can be done since fn will return runtime indexes. You need compile-time parameters to make that work.

@DenisYaroshevskiy
Author

simd<T, Abi> compress(const simd<T, Abi>& value, const simd_mask<T, Abi>& mask, T default_value);

Is not efficiently implementable almost anywhere. Even on AVX-512 Skylake you can't make it work well.
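
For reference, a scalar sketch of what that signature is presumably meant to compute (selected lanes packed to the front, the tail filled with default_value); this is a reading of the paper, not code from it:

#include <array>
#include <cstddef>

// Scalar reference for the proposed compress: selected lanes move to the
// front in order, remaining lanes get default_value. A reading of p2664,
// not the paper's own code.
template <class T, std::size_t N>
std::array<T, N> compress_ref(const std::array<T, N>& value,
                              const std::array<bool, N>& mask,
                              T default_value)
{
    std::array<T, N> out{};
    std::size_t o = 0;
    for (std::size_t i = 0; i < N; ++i)
        if (mask[i]) out[o++] = value[i];       // keep selected lanes in order
    for (; o < N; ++o) out[o] = default_value;  // pad the tail
    return out;
}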

@mattkretz
Owner

Can be done since fn will return runtime indexes. You need compile-time parameters to make that work.

You meant "Can not be done", right?

Well it can, and it has been implemented by Daniel and me independently.

[...] compress [...] Is not efficiently implementable almost anywhere.

Even if, so what? std::simd is not a thin wrapper around SSE2. This is about defining a data-parallel type. Compress and expand are very useful (and even necessary) operations for some of the patterns that come up when expressing data-parallelism via the type system.

@DenisYaroshevskiy
Author

DenisYaroshevskiy commented Jun 17, 2023

Well it can, and it has been implemented by Daniel and me independently.

Can't be done with any degree of performance. Can you show me a 16-short shuffle on AVX2 with short indexes?

Even if, so what?

A slow function is a useless function that is going to confuse people. There is no reason to use a slow function.
If you propose a slow function, it means you didn't do your job as a simd library well.

For example, compress.

There is efficient compress-like functionality that you can provide.

It should return something like:

compress_good(simd, simd_mask) -> std::array<std::pair<smaller_simd, int>, N>
UPD: the int is how many elements were true; that probably wasn't clear.

It splits the register into smaller chunks that it can actually compress efficiently.
You can always do 2 or 4 elements in a reasonable way.

As a result, eve::remove_if, for example, works reasonably well on SSE2, while the proposed compress will have horrible performance in most cases for no good reason.

[benchmark plot]

Here is the code to see how many chunks we generate

https://godbolt.org/z/rKGzYa3od

@DenisYaroshevskiy
Author

Compress and expand are very useful (and even necessary) operations

If you write simd code, inefficient operations are almost never useful.
Everything inside the loop (in this case compress/expand) should be fast to be useful.
Outside of the loop, it should not kill the performance of the entire loop (example: masked store).

@danieltowner
Collaborator

This is all about defining an interface to useful functions. Something like compress cannot be easily implemented in terms of anything else that is in the simd library, so if this feature is missing then users will be forced to come up with their own way to achieve the same effect. Users won't have portable access to hardware features to help them, nor will they normally have much understanding of the internals of the library implementation, nor an understanding of the uarch of the processor. By providing the compress function, the library can combine all the specialised knowledge necessary for an efficient implementation of compress for a given target, giving users the best possible chance of solving their problem.

Processors are constantly evolving too and simd should also reflect what is possible in modern processors, rather than limiting itself only to what used to be available many years ago. If you have a recent processor in your laptop, then you should be able to compile simd to use its features.

@DenisYaroshevskiy
Author

The library can combine all the specialised knowledge necessary for an efficient implementation of compress for a given target

I don't see how you cannot understand me. Your interface is by design very inefficient for whatever the user wants to do. A zero-overhead abstraction it is not.

@danieltowner
Collaborator

I'm sorry I don't understand what you are trying to say. Can you give an example of how the interface should be changed so that users who want to do a compress-like operation (or std::copy_if) can do so efficiently using the existing API of std::simd?

@DenisYaroshevskiy
Author

DenisYaroshevskiy commented Jun 17, 2023

Start from the algorithm: which one do you want?

If it's std::ranges::remove, compress_store worked well for us, but it is a bit restrictive.
eve::detail::compress returns a tuple of smaller compressed parts together with how many elements were selected in each.

For example, 32 chars on AVX2 with BMI would be split into 2 chunks of 16 chars each:

tuple<pair<wide<char, 16>, int>, pair<wide<char, 16>, int>>

The int is how many elements from the front are selected.

I was thinking of doing partition the same way, but I haven't done it yet.

The advantage of this is that you don't have to put the chunks together in one register, which is expensive.
UPD: well, at least I don't know how to combine two __m128i together with a runtime offset efficiently.
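
To illustrate the consumption pattern (hypothetical names modeled on the shape described above, not eve's actual API): each chunk is stored whole, and the output pointer only advances by that chunk's count, so the chunks never have to be stitched into one register.

#include <tuple>

// Hypothetical remove_if-style sink over a chunked compress result.
// compress_chunks and store are placeholders for an eve-like API.
template <class Wide, class Mask, class T>
T* compress_store_chunks(const Wide& v, const Mask& m, T* out)
{
    auto chunks = compress_chunks(v, m);  // tuple of (part, count) pairs
    std::apply([&](const auto&... chunk) {
        // Store each part in full; garbage lanes past `count` are
        // overwritten by the next store or trimmed by the caller.
        ((store(chunk.first, out), out += chunk.second), ...);
    }, chunks);
    return out;
}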

@DenisYaroshevskiy
Author

Looked more at:

permute(const simd<T, Abi>& v, PermuteGenerator&& fn)

I think you meant

permute(const simd<T, Abi>& v, PermuteGenerator)
    requires std::default_initializable<PermuteGenerator>

This is not how we generally do lambdas in C++, but if everyone is on board with this trickery, fine.
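
For what it's worth, the trick presumably works like this (a sketch assuming the generator is default-constructible and constexpr-invocable; none of these names are from the paper):

#include <array>
#include <cstddef>
#include <utility>

// Evaluate a default-constructible index generator entirely at compile time:
// only the generator's *type* is used, never a runtime value, so its results
// can feed compile-time shuffles.
template <class Fn, std::size_t N>
constexpr auto generate_indices()
{
    return []<std::size_t... Is>(std::index_sequence<Is...>) {
        return std::array<std::size_t, N>{std::size_t(Fn{}(Is))...};
    }(std::make_index_sequence<N>{});
}

struct reverse4 {
    constexpr std::size_t operator()(std::size_t i) const { return 3 - i; }
};

constexpr auto idx = generate_indices<reverse4, 4>();  // {3, 2, 1, 0}
static_assert(idx[0] == 3 && idx[3] == 0);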

@danieltowner
Collaborator

For example, 32 chars on AVX2 with BMI would be split into 2 chunks of 16 chars each:

tuple<pair<wide<char, 16>, int>, pair<wide<char, 16>, int>>

The int is how many elements from the front are selected.

That isn't so far removed from how I implement it in some situations either. My implementation can break the larger simd into smaller pieces (typically native-sized, but the implementation decides for itself depending on what ISA is available), where each piece is a compressed block + offset. Those blocks are then concatenated together, such as by assembling the values into memory, or by using a 2-source permute to combine compressed regions of adjacent registers.

However, there are other ways to do the compress too. For example, create 8-bit indexes using iota, compress those, and do a permute using the compressed indexes. This method works really well on Ice Lake (which has 64-lane byte compress and 2-source permutes). It can also be used standalone for small simd compress (i.e., without blocking), or combined into the block+offset approach as a building block for larger simd value compression.

If I understand correctly, you want the API to allow a compress to be broken down into smaller pieces which the user then assembles into the final result depending on what they are doing. The user can decide how to glue the pieces together depending on what they are trying to achieve. So you combine the faster performance of compressing small blocks with user knowledge of how those blocks get used?

Although my implementation often internally does the same thing as yours - block+offset - I don't want to expose that detail. I want to keep everything hidden away inside the compress/expand functions so that the implementation can use alternative methods when that would yield better performance.
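
A sketch of that standalone index-based form (assuming AVX-512 VBMI/VBMI2; note that for this exact case _mm512_maskz_compress_epi8 could be applied to the data directly, so the index form matters mainly as a building block, e.g. feeding 2-source permutes):

#include <immintrin.h>

// Compress 8-bit iota indexes, then permute the data through them
// (Ice Lake: VPCOMPRESSB from VBMI2, VPERMB from VBMI).
__m512i compress_bytes_via_indices(__m512i v, __mmask64 m)
{
    alignas(64) char iota[64];
    for (int i = 0; i < 64; ++i) iota[i] = (char)i;  // 0, 1, ..., 63
    __m512i idx = _mm512_maskz_compress_epi8(m, _mm512_load_si512(iota));
    // Indexes of selected lanes are now packed at the front; gather through them.
    return _mm512_permutexvar_epi8(idx, v);
}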

@DenisYaroshevskiy
Author

..This method works really well on Ice Lake..

Does it look like this? https://github.com/jfalcou/eve/blob/main/include/eve/detail/compress/simd/x86/compress_using_bmi.hpp

You can do at most 16 chars on AVX2 this way.
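
Roughly, that trick looks like this (a sketch in the spirit of eve's compress_using_bmi, not its actual code); the 4-bit shuffle indexes are exactly why it caps out at 16 bytes:

#include <immintrin.h>
#include <cstdint>

// BMI2 compress of 16 bytes: PEXT packs the selected lanes' 4-bit indexes
// to the front, PSHUFB then moves the data. Lanes past the popcount end up
// holding copies of element 0 here; real code pairs this with the count.
__m128i compress16_epi8(__m128i v, std::uint16_t mask)
{
    // Spread each mask bit over its lane's whole 4-bit index slot.
    std::uint64_t nib_mask = _pdep_u64(mask, 0x1111111111111111ull) * 0xF;
    // Packed nibble iota: index i lives in bits [4i, 4i+4).
    std::uint64_t packed = _pext_u64(0xFEDCBA9876543210ull, nib_mask);
    // Unpack the 16 nibbles back into 16 shuffle-control bytes.
    std::uint64_t lo = _pdep_u64(packed,       0x0F0F0F0F0F0F0F0Full);
    std::uint64_t hi = _pdep_u64(packed >> 32, 0x0F0F0F0F0F0F0F0Full);
    return _mm_shuffle_epi8(v, _mm_set_epi64x((long long)hi, (long long)lo));
}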

Those blocks are then concatenated together, such as by assembling the values into memory

This is an extra 4–5 cycles for no clear reason.

I don't want to expose that detail

Your API is nice. The problem is that it is not efficient. If you are doing simd, it means the code is performance-critical. Then a less-than-efficient API becomes a burden instead of a help.

@danieltowner
Collaborator

..This method works really well on Ice Lake..

Does it look like this? https://github.com/jfalcou/eve/blob/main/include/eve/detail/compress/simd/x86/compress_using_bmi.hpp

You can do at most 16 chars on AVX2 this way.

I am talking about Ice Lake's VBMI2 ISA, which has compress/expand for 8-, 16-, 32- and 64-bit elements. You can compress 64 8-bit iota indexes with that instruction and then do a permute from the result.

@DenisYaroshevskiy
Author

I am talking about Ice Lake's VBMI2 ISA

Yes, a $10k Intel chip and the next generation of Arm can do your API well. I don't believe that's enough.

@danieltowner
Collaborator

If a user wants to do a compress-like operation then they have no choice. If we don't provide an API for doing what they want then the user has to invent their own, and whatever they invent is likely to be less efficient than the one provided by the implementor of std::simd.

Not only have parallel overloads of functions like std::copy_if been around since C++17, but the addition of these compress/expand instruction sets to hardware shows how important those operations are. Hardware vendors don't add such expensive machinery to their silicon without significant justification. So I think that std::simd should reflect the importance of that by making those instructions portably available.

VBMI2 is available on more than just expensive servers. 3 of the 5 laptops in my household have VBMI2 for example.

@DenisYaroshevskiy
Author

DenisYaroshevskiy commented Jun 19, 2023

If a user wants to do a compress-like operation then they have no choice

This is where I disagree with you completely. You give people the things you can do well, and they build on top. If "this is what can be done", then they will figure out how to write code on top of it. And that code will be faster.

VBMI2 is available on more than just expensive servers. 3 of the 5 laptops in my household have VBMI2 for example.

AVX2-as-a-maximum is still very relevant.
