Any way to specify how the registers are stored internally? #77
I don't think we really want to expose exactly how the registers are stored since that could easily vary from one implementation to another. In the Intel implementation we build on top of the compiler's own vector extensions (e.g., https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html), so we just hand off all the storage to gcc or clang without caring how it handles it. GCC requires power-of-2, but clang is happy to use any size by simply using a multiple of full-sized registers, plus some remainder.

Maybe we should think about this the other way around. Rather than exposing the internal storage so that the more unusual intrinsics may be used on it, we should instead allow small intrinsic building blocks to be given to std::simd to be applied to the internal storage. This is how Intel's implementation works and it allows us to create new operations from intrinsics very easily.

@mattkretz IIRC you have proposed something similar, if not identical, to what I set out below too, but I can't find the reference anywhere, so my apologies if I have restated something you have already said. This has turned into something bigger than I originally set out to say, but I think it is useful to say it anyway.

Firstly, std::simd already has ways to convert to and from the implementation types for calling intrinsics. From https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p1928r4.pdf:
For simd values which fit a native register, this allows intrinsics to be called directly. I actually take this a little further and provide named functions:
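The named functions themselves are not shown above, so here is a hedged sketch of what they might look like. The names `to_intrin` and `to_simd` are my own invention (not Intel's actual API), and GCC's `std::experimental::simd` stands in for the proposed std::simd:

```cpp
// Sketch only: to_intrin/to_simd are illustrative names, and
// std::experimental::simd stands in for the proposed std::simd.
#include <experimental/simd>
#include <immintrin.h>
#include <cstdint>

namespace stdx = std::experimental;
using simd4i = stdx::fixed_size_simd<std::int32_t, 4>;

// Convert a register-sized simd into the target's intrinsic type.
inline __m128i to_intrin(const simd4i& v) {
    static_assert(sizeof(simd4i) <= sizeof(__m128i),
                  "the value must fit one native register");
    alignas(16) std::int32_t tmp[4];
    v.copy_to(tmp, stdx::element_aligned);
    return _mm_load_si128(reinterpret_cast<const __m128i*>(tmp));
}

// ...and convert an intrinsic value back into a simd.
inline simd4i to_simd(__m128i r) {
    alignas(16) std::int32_t tmp[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(tmp), r);
    return simd4i(tmp, stdx::element_aligned);
}
```

The round trip through memory is only for clarity; an implementation with access to its own internals would hand the register over directly.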
I do this so that it makes it very clear what the intent of the code is, and also to ensure that those operators can statically assert that the data fits a native register. In all cases - my version and std::simd - the functions always deal in full native registers which may be partially filled. What happens in the extra elements that aren't used is undefined. My version always uses the smallest possible register too.

Converting to and from a register works for small simd values, but simd values which span multiple native registers need a different approach. Here is an example of how we do this, starting with a lambda which handles one register-sized piece:
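The lambda is not shown above, so here is a hedged illustration of such a register-sized building block. The names `piece_t` and `add_one_piece` are mine, `_mm_add_epi32` stands in for whatever intrinsic the user wants, and `std::experimental::simd` stands in for std::simd:

```cpp
// Sketch: a building block which handles exactly one SSE register's
// worth of elements. piece_t and add_one_piece are illustrative names.
#include <experimental/simd>
#include <immintrin.h>
#include <cstdint>

namespace stdx = std::experimental;
using piece_t = stdx::fixed_size_simd<std::int32_t, 4>;  // one __m128i

inline auto add_one_piece = [](piece_t p) {
    alignas(16) std::int32_t tmp[4];
    p.copy_to(tmp, stdx::element_aligned);
    __m128i r = _mm_load_si128(reinterpret_cast<const __m128i*>(tmp));
    r = _mm_add_epi32(r, _mm_set1_epi32(1));          // the intrinsic call
    _mm_store_si128(reinterpret_cast<__m128i*>(tmp), r);
    return piece_t(tmp, stdx::element_aligned);
};
```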
and we then call it like this:
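The call site itself is missing above; here is a hedged sketch of what such a chunking helper could look like. `invoke_per_piece` is my own name, while `split` and `concat` are the Parallelism TS v2 functions as shipped in GCC's `std::experimental::simd`:

```cpp
// Sketch: split a large simd into register-sized pieces, apply the
// building block to each piece, and reassemble the full-width result.
#include <experimental/simd>
#include <cstdint>
#include <tuple>    // std::apply

namespace stdx = std::experimental;

template <class Piece, class T, class Abi, class Fn>
auto invoke_per_piece(const stdx::simd<T, Abi>& big, Fn&& fn) {
    auto pieces = stdx::split<Piece>(big);   // std::array of Piece values
    for (auto& p : pieces)
        p = fn(p);                           // one register at a time
    return std::apply([](auto... ps) { return stdx::concat(ps...); },
                      pieces);
}
```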
The
My example above is therefore something equivalent to doing this:
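The expansion itself is not shown; as a hedged sketch, "doing this by hand" might mean splitting into register-sized halves, processing each, and concatenating, using the TS `split`/`concat` (the `+ 1` stands in for whatever intrinsic-backed lambda is applied per piece):

```cpp
// Sketch: manually splitting an 8-wide simd into two register-sized
// halves, processing each, and concatenating the results.
#include <experimental/simd>
#include <cstdint>

namespace stdx = std::experimental;
using big_t = stdx::fixed_size_simd<std::int32_t, 8>;

inline auto process(const big_t& x) {
    auto [lo, hi] = stdx::split<4, 4>(x);  // two register-sized pieces
    return stdx::concat(lo + 1, hi + 1);   // apply the op piece by piece
}
```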
and the generated code might look like:

Originally I made the invocable have implementation-defined parameters (e.g., __m512i) but these lack information about what exactly is being passed in (i.e., 8 x uint64, or 64 x uint8_t), and whether it is a full or partial register, so I went with breaking it down into simd pieces instead.

Users have found that this mechanism is useful in breaking other problems down into smaller pieces which can be processed separately, not just for calling intrinsics.

Thoughts and comments welcome. |
There is a lot here. Let's do it in pieces. If it's of a natively supported size: 16, 32, 64 bytes on x86, I should be able to bit_cast to and from safely, correct? If x is 16 chars
is free and gives me a correct result, right? And the element number 0 stays consistent |
Agreed. |
I'm thinking, I 100% will need to cast to native intrinsics to do things. If specifying the internals is something we don't want to do, can we specify a cast function?
And encourage the implementations to provide: simd_cast_to_native<__m128(i)>, simd_cast_to_native<__m256(i)>, simd_cast_to_native<__m512(i)>, and similar ones for ARM NEON? The implementation is encouraged to provide these if sizeof(element_type) * number_of_elements <= sizeof(intrinsic_register_type). Otherwise, I know I will be writing these functions with a bunch of platform-dependent macros. |
As I noted in my previous response, P1928 already provides a conversion operator and a constructor for working with implementation-defined data. Also, I think an extension which provides |
It does, but those do not tell me what they are at all.
The implementation knows how to do this better than I do - I would like them to do it. |
Hi, sorry for being late to the discussion. I admit I have not read everything in depth yet.
The standard cannot tell you. What are you asking for then? A member type? How would that help? And how would you do it for multiple types (e.g.
It still doesn't necessarily know what you want. Again, the integer SSE types: the intrinsic type or the GCC vector type? But in general I'm currently tending rather towards removing than adding stuff in this area. That's because something like this already works:

simd<float> addsub(simd<float> lhs, simd<float> rhs)
{
  if constexpr (sizeof(lhs) == sizeof(__m128))
    return std::bit_cast<simd<float>>(_mm_addsub_ps(std::bit_cast<__m128>(lhs),
                                                    std::bit_cast<__m128>(rhs)));
  else if constexpr (sizeof(lhs) == sizeof(__m256))
    return std::bit_cast<simd<float>>(_mm256_addsub_ps(std::bit_cast<__m256>(lhs),
                                                       std::bit_cast<__m256>(rhs)));
  // ...
}

What might be interesting though, is to provide a better "split (or up-size) to register-sized chunks" function. See P1928R5 Section 5.5 for a start. |
Oh, and another point I wanted to add to your topic question: the ABI tag specifies "how the registers are stored internally". The implementation can (and really should) add implementation-defined ABI tags. This implies the implementation has to document them. E.g. GCC/libstdc++ documents (well, I should make it so, I suspect) |
Although there are potentially several types which can be used to define a register, I think std::simd itself should specify one register type which it thinks is the most appropriate type to use to call an intrinsic. In my implementation I currently do this as a member type in simd itself:
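The member type itself is not shown above; here is a hedged sketch of how such a member might be chosen on x86. The `simd` stand-in and the `register_type` name are illustrative, and for brevity only the 128-bit register types are considered:

```cpp
// Sketch: a member type naming the most appropriate intrinsic register
// type for the element type. Real code would also consider the simd's
// width; this sketch only picks among the 128-bit register types.
#include <immintrin.h>
#include <type_traits>

template <class T, int N>
struct simd {   // illustrative stand-in, showing only the member type
    using register_type =
        std::conditional_t<std::is_integral_v<T>, __m128i,
        std::conditional_t<std::is_same_v<T, double>, __m128d,
                           __m128>>;
    // ...
};
```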
I don't think that the member register type should be permitted for simd values which don't fit a native register. I can see that using std::bit_cast is a neat way to avoid having to put ctors or conversion operators into std::simd itself, but should we do this for programmers, to make it as easy as possible to call an intrinsic when they need to? I think we should keep the implementation-defined constructor, but tweaked to:
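The tweak is not spelled out above; here is a hedged guess at its shape, using an illustrative stand-in type: the constructor takes the simd's own register_type, so the fit can be checked statically:

```cpp
// Sketch: an implementation-defined constructor taking the simd's own
// register_type. simd4i is an illustrative stand-in, not real std::simd.
#include <immintrin.h>
#include <cstdint>

class simd4i {
public:
    using register_type = __m128i;

    // Tweaked implementation-defined constructor: explicit, and typed
    // on register_type so the fit can be asserted at compile time.
    explicit simd4i(register_type r) : reg_(r) {
        static_assert(sizeof(simd4i) == sizeof(register_type));
    }

    std::int32_t operator[](int i) const {
        alignas(16) std::int32_t tmp[4];
        _mm_store_si128(reinterpret_cast<__m128i*>(tmp), reg_);
        return tmp[i];
    }

private:
    register_type reg_;
};
```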
and to retrieve a value I always like named accessors like this:
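The accessor is not shown above; a hedged sketch, with `to_register` as my guess at a name, written as a free function over GCC's `std::experimental::simd`:

```cpp
// Sketch: a named accessor which extracts a small simd's value into the
// best register type for the target. The name to_register is illustrative.
#include <experimental/simd>
#include <immintrin.h>

namespace stdx = std::experimental;
using simd4f = stdx::fixed_size_simd<float, 4>;

inline __m128 to_register(const simd4f& v) {
    static_assert(sizeof(simd4f) <= sizeof(__m128));
    alignas(16) float tmp[4];
    v.copy_to(tmp, stdx::element_aligned);
    return _mm_load_ps(tmp);
}
```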
Providing an accessor ensures that the value is correctly retrieved out of the simd into a valid register for the target, it makes the intent of the code clear, and it ensures the best register type is chosen to interact with intrinsics. It then allows overloading to be used to select from several different options without complicated if-else conditionals being needed, which makes the code simpler:
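As a hedged sketch of that overloading idea (the names and the add operation are illustrative): each register type gets its own overload, and ordinary overload resolution replaces the if-constexpr chain:

```cpp
// Sketch: overloads select the right intrinsic from the register type,
// so no if-constexpr ladder is needed at the call site.
#include <immintrin.h>

inline __m128  intrin_add(__m128 a, __m128 b)   { return _mm_add_ps(a, b); }
inline __m128d intrin_add(__m128d a, __m128d b) { return _mm_add_pd(a, b); }
inline __m128i intrin_add(__m128i a, __m128i b) { return _mm_add_epi32(a, b); }
```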
Once we have agreed on what minimal support is needed to invoke intrinsics with small simd objects, then we can separately address how to break big simd objects up in ways which allow the small calls to be invoked. |
One reason why I tried to be very conservative is that experience with "native handle" functions in standard library types has been a source of problems and has been mentioned again and again as a "don't repeat that mistake again":
I'm sure it's not much work to write a user-defined non-member function:

template <std::floating_point FP>
simd<FP> addsub(simd<FP> lhs, simd<FP> rhs) {
  return to_simd(intel_addsub(to_intrin(lhs), to_intrin(rhs)));
}

By keeping this part of the standard vague the type can much better stick to being an abstraction of a data-parallel type, rather than a manifestation of a CPU register. 😉 |
I think based on the platform it will be a lot of work, like, a lot. Especially for non-native register sizes. One random way we can specify it is: a function behaving as if
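A hedged sketch of such an "as if" specification in code. `simd_cast_to_native` is the name suggested earlier in the thread; the memcpy-based wording deliberately leaves the bytes beyond the simd's size unspecified:

```cpp
// Sketch: "behaves as if" the value representation is copied into the
// wider intrinsic type; the bytes beyond sizeof(Simd) are unspecified.
#include <cstring>
#include <type_traits>

template <class Intrin, class Simd>
Intrin simd_cast_to_native(const Simd& v) {
    static_assert(std::is_trivially_copyable_v<Simd>);
    static_assert(std::is_trivially_copyable_v<Intrin>);
    static_assert(sizeof(Simd) <= sizeof(Intrin),
                  "the simd value must fit the intrinsic register type");
    Intrin buf{};                        // spec would leave tail bytes unspecified;
                                         // zero-initialized here for simplicity
    std::memcpy(&buf, &v, sizeof(Simd)); // copy only the simd's bytes
    return buf;
}
```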
(There is probably UB somewhere, but you get my point.) And then we encourage specialized implementations for native intrinsic types that people might want to use. |
I can see this is tricky, and having something abstract is cleaner and easier to deal with. But I think we also need to be pragmatic and accept that intrinsics and target interaction are inevitable, and we should make that as easy as possible. But if we can't get that view past the committee then we don't have any choice but to accept it.

At the moment the audience of Intel's simd implementation is experienced programmers who would normally use intrinsics, and are using std::simd for the easier syntax, but don't want to lose control over the power of the more unusual intrinsics that targets often have. It's the question that they raise again and again - how do they call their favourite intrinsic without inventing their own mechanism? I agree that it is simple to have user-defined wrappers.

I like your point about innovation, and in an abstract way I agree that tying down the mechanism could be bad. But practically, is there any simd target which wouldn't have at least some basic level of being able to define a container (or register) for interacting with its intrinsics? |
Maybe even
|
Another suggestion: what if we just add a Note with suggested conversions? We cannot specify them formally, but we can help people provide a good interface. Something like:

NOTE: if the current platform supports AVX and simd::size() * sizeof(simd::value_type) == 32 (what is in the remaining bits of the buf is unspecified and can be different each time).

So basically it's the same approach as now in the paper, but we nudge them to do the thing that'd be useful. |
The proposal seems to be geared towards a seamless interaction between intrinsics and std::simd, so that you can fall back to intrinsics when the standard does not provide the tools you want.
This is awesome and I wholeheartedly support it.
However, it is in no way specified how exactly the values are represented, specifically for non-standard sizes.
Can this be done as a note?
It would be nice if this intrinsic code was portable between compilers, even if not in a strictly standard way then at least in practice.
What can be done here?
FYI:
in eve,
Note1 - we do not support arbitrary sizes, only powers of 2.
Note2 - If I'm not mistaken our ppc tests are for a big-endian infrastructure. I believe it still works the same. Sorry - I very rarely touch ppc; I can find out if helpful.