Looked at permute paper #79
It is not efficiently implementable almost anywhere. Even on AVX-512 Skylake you can't make it work well.
You meant "Can not be done", right? Well it can, and it has been implemented by Daniel and me independently.
Even if it has been, so what?
Can't be done with any degree of performance? Can you show me a 16-short shuffle on AVX2 with short indexes?
A slow function is a useless function that is going to confuse people. There is no reason to use a slow function. Take compress, for example: there is an efficient compress-like operation that you can provide instead. It should return something like:
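A hedged sketch of what such a chunk-based result could look like (the names here are illustrative, not the actual eve API):

```cpp
#include <array>
#include <cstddef>

// Each chunk is a register the target can compress efficiently, plus
// a count of how many elements from its front are selected.
template <typename Chunk>
struct compressed_chunk {
    Chunk data;   // selected elements packed at the front; the tail is garbage
    int   count;  // number of valid elements in data
};

template <typename Chunk, std::size_t NChunks>
using compress_result = std::array<compressed_chunk<Chunk>, NChunks>;
```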
It splits the register into smaller chunks that it can actually compress efficiently. As a result, eve::remove_if, for example, works reasonably well on SSE2, while the proposed compress will have horrible performance in most cases for no good reason. Here is the code that shows how many chunks we generate.
If you write SIMD code, inefficient operations are almost never useful.
This is all about defining an interface to useful functions. Something like compress cannot easily be implemented in terms of anything else in the simd library, so if this feature is missing then users will be forced to come up with their own way to achieve the same effect. Users won't have portable access to hardware features to help them, nor will they normally have much understanding of the internals of the library implementation, or of the uarch of the processor.

By providing the compress function, the library can combine all the specialised knowledge necessary for an efficient implementation of compress on a given target, giving users the best possible chance of solving their problem. Processors are constantly evolving, and simd should reflect what is possible in modern processors rather than limiting itself to what used to be available many years ago. If you have a recent processor in your laptop, then you should be able to compile simd to use its features.
I don't see how you cannot understand me. Your interface is by design very inefficient for whatever the user wants to do. A zero-overhead abstraction it is not.
I'm sorry, I don't understand what you are trying to say. Can you give an example of how the interface should be changed so that users who want to do a compress-like operation (or
Start from the algorithm: which one do you want? For example, 32 chars on AVX2 with BMI would be split into 2 chunks of 16 chars each.
The int is how many elements from the front are selected. I was thinking of doing partition the same way, but I haven't done it yet. The advantage of this is that you don't have to put the chunks together into one register, which is expensive.
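For example, a remove_if-style consumer could stream such chunks straight to memory without ever merging them. A minimal sketch, under assumptions: a copy_to/element_aligned store as in std::experimental::simd, and an output buffer with enough slack for the final full-width store. The helper name is hypothetical:

```cpp
// Hypothetical consumer of a chunked compress result. Each full-width
// store writes the garbage tail too, but the next store overwrites it
// because the pointer only advances by the chunk's count.
template <typename T, typename Chunks>
T* store_compressed(const Chunks& chunks, T* out) {
    for (const auto& c : chunks) {
        c.data.copy_to(out, std::experimental::element_aligned);
        out += c.count;  // advance only past the selected elements
    }
    return out;
}
```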
Looked more at: permute(const simd<T, Abi>& v, PermuteGenerator&& fn). I think you meant permute(const simd<T, Abi>& v, PermuteGenerator fn). This is not how we generally do lambdas in C++, but if everyone is on board with this trickery, fine.
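For illustration, a hedged sketch of how such a generator-based permute might be called, assuming the generator maps each output lane index to a source lane index (the exact generator signature in P2664 may differ):

```cpp
// Hypothetical usage; 'permute' and the generator contract are taken
// from the discussion above, not verified against the paper's text.
// Broadcast source lane 0 into every output lane:
auto broadcast = permute(v, [](std::size_t /*output_lane*/) {
    return 0;  // every output lane reads source lane 0
});
```

The point of taking a plain callable rather than a forwarding reference is presumably that the implementation wants to invoke a stateless generator at compile time to build a constant index table, which is the "trickery" referred to above.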
That isn't so far removed from how I implement it in some situations either. My implementation can break the larger simd into smaller pieces, typically native sized, though the implementation decides for itself depending on what ISA is available, where each piece is a compressed block + offset. Those blocks are then concatenated, for example by assembling the values into memory, or by using a 2-source permute to combine compressed regions of adjacent registers.

However, there are other ways to do the compress too. For example: create 8-bit indexes using iota, compress those, and then do a permute using the compressed indexes. This method works really well on Ice Lake (which has 64-byte compress and 2-source permutes), and can also be used standalone for small simd compress (i.e., without blocking), or combined into the block+offset approach as a building block for larger simd value compression.

If I understand correctly, you want the API to allow a compress to be broken down into smaller pieces which the user then assembles into the final result depending on what they are doing. The user can decide how to glue the pieces together depending upon what they are trying to achieve. So you combine the faster performance of compressing small blocks with user knowledge of how those blocks get used? Although my implementation often internally does the same thing as yours (block + offset), I don't want to expose that detail. I want to keep everything hidden away inside the compress/expand functions so that the implementation can use alternative methods when that would yield better performance.
Does it look like this? https://github.com/jfalcou/eve/blob/main/include/eve/detail/compress/simd/x86/compress_using_bmi.hpp You can do at most 16 chars for AVX2 this way.
This is an extra 4-5 cycles for a very unclear reason.
Your API is nice. The problem is that it is not efficient. If you are doing SIMD, it means the function is performance-critical. Then a less-than-efficient API becomes a burden instead of a help.
I am talking about Ice Lake's VBMI2 ISA, which has compress/expand for 8, 16, 32 and 64-bit elements. You can compress 64 8-bit iota indexes with that instruction, and then do a permute from the result.
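As a concrete illustration of that trick, here is a minimal sketch with raw intrinsics, assuming AVX-512 VBMI2 (for the byte compress) and VBMI (for the byte permute). For a single register you could of course compress v directly; compressing iota indexes is the building block for the 2-source and block+offset variants described above:

```cpp
#include <immintrin.h>

// Compress the bytes of v selected by 'keep' to the front, by first
// compressing an identity index vector and then permuting with it.
__m512i compress_bytes_via_indexes(__m512i v, __mmask64 keep) {
    const __m512i iota = _mm512_set_epi8(
        63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48,
        47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32,
        31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
        15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0);
    __m512i idx = _mm512_maskz_compress_epi8(keep, iota);  // VBMI2
    return _mm512_permutexvar_epi8(idx, v);                // VBMI
}
```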
Yes, a 10k-dollar Intel chip and the next generation of Arm can do your API well. I don't believe that's enough.
If a user wants to do a compress-like operation then they have no choice: if we don't provide an API for doing what they want, the user has to invent their own, and whatever they invent is likely to be less efficient than the one provided by the implementor of std::simd. And VBMI2 is available on more than just expensive servers; 3 of the 5 laptops in my household have VBMI2, for example.
This is where I disagree with you completely. You give people the things you can, and they build on top. If "this is what can be done", then they will figure out how to write code on top of it. And that code will be faster.
AVX2 as the maximum ISA is very relevant.
I believe that P2664R3 is not implementable outside of very modern Intel (the dynamic index lookup).
I also fail to see how the generator-based permute can be done, since fn will return runtime indexes. You need compile-time parameters to make that work.
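To illustrate the compile-time versus runtime distinction: when the indexes are known at compile time, an implementation can plan a fixed instruction sequence, such as baking them into a constant pshufb control (or a vpermq + vpshufb pair for cross-lane patterns). With runtime 16-bit indexes there is no cross-lane 16-bit permute on AVX2 at all; vpermw only arrives with AVX-512BW. A minimal sketch of the constant-control case (within 128-bit lanes only, which is all pshufb can do):

```cpp
#include <immintrin.h>

// Reverse the eight 16-bit elements within each 128-bit lane. Because
// the indexes are compile-time constants, the control vector is just a
// constant load; with runtime short indexes, any shuffle crossing the
// 128-bit lane boundary needs a multi-instruction sequence on AVX2.
__m256i reverse_shorts_in_lanes(__m256i v) {
    const __m256i ctl = _mm256_setr_epi8(
        14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1,
        14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1);
    return _mm256_shuffle_epi8(v, ctl);
}
```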