
Looked at permute paper #79

Open

DenisYaroshevskiy opened this issue Jun 16, 2023 · 15 comments

Comments

@DenisYaroshevskiy

I believe that p2664r3 is not implementable outside of very modern Intel (the dynamic index lookup).
I also fail to see how

// Free function to generate compile-time permute of simd
template<std::size_t SizeSelector = 0, typename T, typename Abi, typename PermuteGenerator>
constexpr resize_t<output-size, simd<T, Abi>>
permute(const simd<T, Abi>& v, PermuteGenerator&& fn)

Can be done since fn will return runtime indexes. You need compile-time parameters to make that work.

@DenisYaroshevskiy
Author

simd<T, Abi> compress(const simd<T, Abi>& value, const simd_mask<T, Abi>& mask, T default_value);

Is not efficiently implementable almost anywhere. Even on AVX-512 Skylake you can't make it work well.
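
For reference, a scalar sketch of what that signature is presumably meant to compute (selected lanes packed to the front, the tail filled with default_value); this is a reading of the paper, not code from it:

#include <array>
#include <cstddef>

// Scalar reference for the proposed compress: selected lanes move to the
// front in order, remaining lanes get default_value. A reading of p2664,
// not the paper's own code.
template <class T, std::size_t N>
std::array<T, N> compress_ref(const std::array<T, N>& value,
                              const std::array<bool, N>& mask,
                              T default_value)
{
    std::array<T, N> out{};
    std::size_t o = 0;
    for (std::size_t i = 0; i < N; ++i)
        if (mask[i]) out[o++] = value[i];       // keep selected lanes in order
    for (; o < N; ++o) out[o] = default_value;  // pad the tail
    return out;
}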

@mattkretz
Owner

Can be done since fn will return runtime indexes. You need compile-time parameters to make that work.

You meant "Can not be done", right?

Well it can, and it has been implemented by Daniel and me independently.

[...] compress [...] Is not efficiently implementable almost anywhere.

Even if, so what? std::simd is not a thin wrapper around SSE2. This is about defining a data-parallel type. Compress and expand are very useful (and even necessary) operations for some of the patterns that come up when expressing data-parallelism via the type system.

@DenisYaroshevskiy
Author

DenisYaroshevskiy commented Jun 17, 2023

Well it can, and it has been implemented by Daniel and me independently.

Can't be done with any degree of performance. Can you show me a 16-short shuffle on AVX2 with short indexes?

Even if, so what?

A slow function is a useless function that is going to confuse people. There is no reason to use a slow function.
If you propose a slow function, it means you didn't do your job as a simd library well.

For example, compress.

There is efficient compress-like functionality that you can provide.

It should return something like:

compress_good(simd, simd_mask) -> std::array<std::pair<smaller_simd, int>, N>
UPD: the int is how many elements were true; that probably wasn't clear.

It splits the register into smaller chunks that it can actually compress efficiently.
You can always do 2 or 4 elements in a reasonable way.

As a result, eve::remove_if, for example, works reasonably well on SSE2, while the proposed compress will have horrible performance in most cases for no good reason.

[benchmark plot]

Here is the code to see how many chunks we generate

https://godbolt.org/z/rKGzYa3od

@DenisYaroshevskiy
Author

Compress and expand are very useful (and even necessary) operations

If you write simd code, inefficient operations are almost never useful.
Everything inside the loop (in this case compress/expand) should be fast to be useful.
Outside of the loop, it should not kill the performance of the entire loop (example: masked store).

@danieltowner
Collaborator

This is all about defining an interface to useful functions. Something like compress cannot be easily implemented in terms of anything else that is in the simd library, so if this feature is missing then users will be forced to come up with their own way to achieve the same effect. Users won't have portable access to hardware features to help them, nor will they normally have much understanding of the internals of the library implementation, nor an understanding of the uarch of the processor. By providing the compress function, the library can combine all the specialised knowledge necessary for an efficient implementation of compress for a given target, giving users the best possible chance of solving their problem.

Processors are constantly evolving too and simd should also reflect what is possible in modern processors, rather than limiting itself only to what used to be available many years ago. If you have a recent processor in your laptop, then you should be able to compile simd to use its features.

@DenisYaroshevskiy
Author

The library can combine all the specialised knowledge necessary for an efficient implementation of compress for a given target

I don't see how you cannot understand me. Your interface is by design very inefficient for whatever the user wants to do. A zero-overhead abstraction it is not.

@danieltowner
Collaborator

I'm sorry I don't understand what you are trying to say. Can you give an example of how the interface should be changed so that users who want to do a compress-like operation (or std::copy_if) can do so efficiently using the existing API of std::simd?

@DenisYaroshevskiy
Author

DenisYaroshevskiy commented Jun 17, 2023

Start from the algorithm: which one do you want?

If it's std::ranges::remove, compress_store worked well for us, but it is a bit restrictive.
eve::detail::compress returns a tuple of smaller compressed parts together with how many elements were selected in each.

For example, 32 chars on AVX2 with BMI would be split into 2 chunks of 16 chars each:

tuple<pair<wide<char, 16>, int>, pair<wide<char, 16>, int>>

The int is how many elements from the front are selected.

I was thinking of doing partition the same way, but I haven't done it yet.

The advantage of this is that you don't have to put the chunks together in one register, which is expensive.
UPD: well, at least I don't know how to combine two __m128i together with a runtime offset efficiently.
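
To illustrate the consumption pattern (hypothetical names modeled on the shape described above, not eve's actual API): each chunk is stored whole, and the output pointer only advances by that chunk's count, so the chunks never have to be stitched into one register.

#include <tuple>

// Hypothetical remove_if-style sink over a chunked compress result.
// compress_chunks and store are placeholders for an eve-like API.
template <class Wide, class Mask, class T>
T* compress_store_chunks(const Wide& v, const Mask& m, T* out)
{
    auto chunks = compress_chunks(v, m);  // tuple of (part, count) pairs
    std::apply([&](const auto&... chunk) {
        // Store each part in full; garbage lanes past `count` are
        // overwritten by the next store or trimmed by the caller.
        ((store(chunk.first, out), out += chunk.second), ...);
    }, chunks);
    return out;
}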

@DenisYaroshevskiy
Author

Looked more at:

permute(const simd<T, Abi>& v, PermuteGenerator&& fn)

I think you meant

permute(const simd<T, Abi>& v, PermuteGenerator)
    requires std::default_initializable<PermuteGenerator>

This is not how we generally do lambdas in C++, but if everyone is on board with this trickery, fine.
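
For what it's worth, the trick presumably works like this (a sketch assuming the generator is default-constructible and constexpr-invocable; none of these names are from the paper):

#include <array>
#include <cstddef>
#include <utility>

// Evaluate a default-constructible index generator entirely at compile time:
// only the generator's *type* is used, never a runtime value, so its results
// can feed compile-time shuffles.
template <class Fn, std::size_t N>
constexpr auto generate_indices()
{
    return []<std::size_t... Is>(std::index_sequence<Is...>) {
        return std::array<std::size_t, N>{std::size_t(Fn{}(Is))...};
    }(std::make_index_sequence<N>{});
}

struct reverse4 {
    constexpr std::size_t operator()(std::size_t i) const { return 3 - i; }
};

constexpr auto idx = generate_indices<reverse4, 4>();  // {3, 2, 1, 0}
static_assert(idx[0] == 3 && idx[3] == 0);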

@danieltowner
Collaborator

For example, 32 chars on AVX2 with BMI would be split into 2 chunks of 16 chars each:

tuple<pair<wide<char, 16>, int>, pair<wide<char, 16>, int>>

The int is how many elements from the front are selected.

That isn't so far removed from how I implement it in some situations either. My implementation can break the larger simd into smaller pieces (typically native-sized, but the implementation decides for itself depending on what ISA is available), where each piece is a compressed block + offset. Those blocks are then concatenated together, such as by assembling the values into memory, or by using a 2-source permute to combine compressed regions of adjacent registers.

However, there are other ways to do the compress too. For example, create 8-bit indexes using iota, compress those, and do a permute using the compressed indexes. This method works really well on Ice Lake (which has 64-lane byte compress and 2-source permutes). It can also be used standalone for small simd compress (i.e., without blocking), or combined into the block+offset approach as a building block for larger simd value compression.

If I understand correctly, you want the API to allow a compress to be broken down into smaller pieces which the user then assembles into the final result depending on what they are doing. The user can decide how to glue the pieces together depending on what they are trying to achieve. So you combine the faster performance of compressing small blocks with user knowledge of how those blocks get used?

Although my implementation often internally does the same thing as yours - block+offset - I don't want to expose that detail. I want to keep everything hidden away inside the compress/expand functions so that the implementation can use alternative methods when that would yield better performance.
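
A sketch of that standalone index-based form (assuming AVX-512 VBMI/VBMI2; note that for this exact case _mm512_maskz_compress_epi8 could be applied to the data directly, so the index form matters mainly as a building block, e.g. feeding 2-source permutes):

#include <immintrin.h>

// Compress 8-bit iota indexes, then permute the data through them
// (Ice Lake: VPCOMPRESSB from VBMI2, VPERMB from VBMI).
__m512i compress_bytes_via_indices(__m512i v, __mmask64 m)
{
    alignas(64) char iota[64];
    for (int i = 0; i < 64; ++i) iota[i] = (char)i;  // 0, 1, ..., 63
    __m512i idx = _mm512_maskz_compress_epi8(m, _mm512_load_si512(iota));
    // Indexes of selected lanes are now packed at the front; gather through them.
    return _mm512_permutexvar_epi8(idx, v);
}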

@DenisYaroshevskiy
Author

..This method works really well on Ice Lake..

Does it look like this? https://github.com/jfalcou/eve/blob/main/include/eve/detail/compress/simd/x86/compress_using_bmi.hpp

You can do at most 16 chars on AVX2 this way.
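
Roughly, that trick looks like this (a sketch in the spirit of eve's compress_using_bmi, not its actual code); the 4-bit shuffle indexes are exactly why it caps out at 16 bytes:

#include <immintrin.h>
#include <cstdint>

// BMI2 compress of 16 bytes: PEXT packs the selected lanes' 4-bit indexes
// to the front, PSHUFB then moves the data. Lanes past the popcount end up
// holding copies of element 0 here; real code pairs this with the count.
__m128i compress16_epi8(__m128i v, std::uint16_t mask)
{
    // Spread each mask bit over its lane's whole 4-bit index slot.
    std::uint64_t nib_mask = _pdep_u64(mask, 0x1111111111111111ull) * 0xF;
    // Packed nibble iota: index i lives in bits [4i, 4i+4).
    std::uint64_t packed = _pext_u64(0xFEDCBA9876543210ull, nib_mask);
    // Unpack the 16 nibbles back into 16 shuffle-control bytes.
    std::uint64_t lo = _pdep_u64(packed,       0x0F0F0F0F0F0F0F0Full);
    std::uint64_t hi = _pdep_u64(packed >> 32, 0x0F0F0F0F0F0F0F0Full);
    return _mm_shuffle_epi8(v, _mm_set_epi64x((long long)hi, (long long)lo));
}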

Those blocks are then concatenated together, such as by assembling the values into memory

This is an extra 4–5 cycles for no clear reason.

I don't want to expose that detail

Your API is nice. The problem is that it is not efficient. If you are doing simd, it means the code is performance-critical. Then a less-than-efficient API becomes a burden instead of a help.

@danieltowner
Collaborator

..This method works really well on Ice Lake..

Does it look like this? https://github.com/jfalcou/eve/blob/main/include/eve/detail/compress/simd/x86/compress_using_bmi.hpp

You can do at most 16 chars on AVX2 this way.

I am talking about Ice Lake's VBMI2 ISA, which has compress/expand for 8-, 16-, 32- and 64-bit elements. You can compress 64 8-bit iota indexes with that instruction and then do a permute from the result.

@DenisYaroshevskiy
Author

I am talking about Ice Lake's VBMI2 ISA

Yes, a $10k Intel chip and the next generation of Arm can do your API well. I don't believe that's enough.

@danieltowner
Collaborator

If a user wants to do a compress-like operation then they have no choice. If we don't provide an API for doing what they want then the user has to invent their own, and whatever they invent is likely to be less efficient than the one provided by the implementor of std::simd.

Not only have parallel overloads of functions like std::copy_if been around since C++17, but the addition of these compress/expand instruction sets to hardware shows how important those operations are. Hardware vendors don't add such expensive machinery to their silicon without significant justification. So I think that std::simd should reflect the importance of that by making those instructions portably available.

VBMI2 is available on more than just expensive servers. 3 of the 5 laptops in my household have VBMI2 for example.

@DenisYaroshevskiy
Author

DenisYaroshevskiy commented Jun 19, 2023

If a user wants to do a compress-like operation then they have no choice

This is where I disagree with you completely. You give people the things you can do well, and they build on top. If "this is what can be done", then they will figure out how to write code on top of it. And that code will be faster.

VBMI2 is available on more than just expensive servers. 3 of the 5 laptops in my household have VBMI2 for example.

AVX2-as-a-maximum is still very relevant.
