Description
Summary
With #49397 we approved and exposed cross-platform APIs on Vector64/128/256 to help developers more easily support multiple platforms. This was done by mirroring the surface area exposed by `Vector<T>`. However, due to their fixed sizes there are some additional APIs that would be beneficial to expose. Likewise, there are a few APIs for loading/storing vectors that are commonly used with hardware intrinsics and that would benefit from cross-platform helpers.
The APIs exposed would include the following:

- `ExtractMostSignificantBits`
  - On x86/x64 this would be emitted as `MoveMask` and performs exactly as expected
  - On ARM64, this would be emulated via an `and`, an element-wise shift-right, a 64-bit pairwise add, and an extract. The JIT could optionally detect that the input is the result of a `Compare` instruction and elide the shift-right.
  - On WASM, this is called `bitmask` and works identically to `MoveMask`
  - This API and its emulation are used throughout the BCL; see the search-loop sketch after this list
- `Load`/`Store`
  - These are the basic load/store operations already in use for x86, x64, and ARM64
- `LoadAligned`/`StoreAligned`
  - These work exactly like the same-named APIs on x86/x64
  - When optimizations are disabled, the alignment is verified
  - When optimizations are enabled, this alignment checking may be skipped due to the load being folded into another instruction on modern hardware
  - This enables efficient usage of the instruction on both older (pre-AVX) hardware as well as newer (post-AVX) or ARM64 hardware (where no load/store-aligned instructions exist); see the aligned-allocation sketch after this list
- `LoadAlignedNonTemporal`/`StoreAlignedNonTemporal`
  - These behave as `LoadAligned`/`StoreAligned` but may optionally treat the memory access as non-temporal and avoid polluting the cache
- `LoadUnsafe`/`StoreUnsafe`
  - These are "new" APIs; they cover a "gap" in the API surface that has been encountered and worked around in the BCL and which is semi-regularly requested by the community
  - The overload that just takes a `ref T` behaves exactly like the version that takes a pointer, just without requiring pinning (see the search-loop sketch after this list)
  - The overload that additionally takes an `nuint index` behaves like `ref Unsafe.Add(ref value, index)` and avoids needing to further bloat IL and hinder readability
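To make the intended usage concrete, here is a minimal sketch of the kind of search loop `LoadUnsafe` and `ExtractMostSignificantBits` enable, assuming the `Vector128` signatures proposed below; `IndexOfByte` is a hypothetical helper for illustration, not a BCL API:

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class Example
{
    // Hypothetical helper: returns the index of the first occurrence of
    // 'value' in 'span', or -1 if it isn't present.
    public static int IndexOfByte(ReadOnlySpan<byte> span, byte value)
    {
        ref byte start = ref MemoryMarshal.GetReference(span);
        Vector128<byte> target = Vector128.Create(value);

        nuint i = 0;
        for (; i + 16 <= (nuint)span.Length; i += 16)
        {
            // No pinning needed: LoadUnsafe takes a 'ref byte' plus an
            // element index, like 'ref Unsafe.Add(ref start, i)'.
            Vector128<byte> data = Vector128.LoadUnsafe(ref start, i);
            uint mask = Vector128.ExtractMostSignificantBits(
                Vector128.Equals(data, target));
            if (mask != 0)
            {
                // The lowest set bit identifies the first matching lane.
                return (int)i + BitOperations.TrailingZeroCount(mask);
            }
        }

        // Scalar tail for the final, partial vector.
        for (; i < (nuint)span.Length; i++)
        {
            if (Unsafe.Add(ref start, i) == value)
            {
                return (int)i;
            }
        }

        return -1;
    }
}
```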
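Similarly, here is a minimal sketch of `LoadAligned` over explicitly aligned native memory, again assuming the signatures proposed below and using `NativeMemory.AlignedAlloc` from .NET 6 (the 16-byte alignment matches `Vector128`'s size; the buffer size and contents are placeholders):

```csharp
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

unsafe
{
    const nuint ByteCount = 256;

    // Allocate memory guaranteed to be 16-byte aligned so that
    // LoadAligned's alignment requirement is always satisfied.
    byte* buffer = (byte*)NativeMemory.AlignedAlloc(ByteCount, alignment: 16);
    try
    {
        NativeMemory.Clear(buffer, ByteCount);

        Vector128<byte> first = Vector128.LoadAligned(buffer);
        // ... operate on 'first', then write back without polluting the
        // cache if the data won't be touched again soon ...
        Vector128.StoreAlignedNonTemporal(buffer, first);
    }
    finally
    {
        NativeMemory.AlignedFree(buffer);
    }
}
```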
API Proposal
```csharp
namespace System.Runtime.Intrinsics
{
    public static partial class Vector64
    {
        public static uint ExtractMostSignificantBits<T>(Vector64<T> vector);

        public static Vector64<T> Load<T>(T* address);
        public static Vector64<T> LoadAligned<T>(T* address);
        public static Vector64<T> LoadAlignedNonTemporal<T>(T* address);
        public static Vector64<T> LoadUnsafe<T>(ref T address);
        public static Vector64<T> LoadUnsafe<T>(ref T address, nuint index);

        public static void Store<T>(T* address, Vector64<T> source);
        public static void StoreAligned<T>(T* address, Vector64<T> source);
        public static void StoreAlignedNonTemporal<T>(T* address, Vector64<T> source);
        public static void StoreUnsafe<T>(ref T address, Vector64<T> source);
        public static void StoreUnsafe<T>(ref T address, nuint index, Vector64<T> source);
    }

    public static partial class Vector128
    {
        public static uint ExtractMostSignificantBits<T>(Vector128<T> vector);

        public static Vector128<T> Load<T>(T* address);
        public static Vector128<T> LoadAligned<T>(T* address);
        public static Vector128<T> LoadAlignedNonTemporal<T>(T* address);
        public static Vector128<T> LoadUnsafe<T>(ref T address);
        public static Vector128<T> LoadUnsafe<T>(ref T address, nuint index);

        public static void Store<T>(T* address, Vector128<T> source);
        public static void StoreAligned<T>(T* address, Vector128<T> source);
        public static void StoreAlignedNonTemporal<T>(T* address, Vector128<T> source);
        public static void StoreUnsafe<T>(ref T address, Vector128<T> source);
        public static void StoreUnsafe<T>(ref T address, nuint index, Vector128<T> source);
    }

    public static partial class Vector256
    {
        public static uint ExtractMostSignificantBits<T>(Vector256<T> vector);

        public static Vector256<T> Load<T>(T* address);
        public static Vector256<T> LoadAligned<T>(T* address);
        public static Vector256<T> LoadAlignedNonTemporal<T>(T* address);
        public static Vector256<T> LoadUnsafe<T>(ref T address);
        public static Vector256<T> LoadUnsafe<T>(ref T address, nuint index);

        public static void Store<T>(T* address, Vector256<T> source);
        public static void StoreAligned<T>(T* address, Vector256<T> source);
        public static void StoreAlignedNonTemporal<T>(T* address, Vector256<T> source);
        public static void StoreUnsafe<T>(ref T address, Vector256<T> source);
        public static void StoreUnsafe<T>(ref T address, nuint index, Vector256<T> source);
    }
}
```
Additional Notes
Ideally we would also expose "shuffle" APIs allowing the elements of one or more vectors to be reordered:

- On x86/x64 these are referred to as `Shuffle` or `Permute` (generally taking two input vectors and one input vector, respectively; but that isn't always the case)
- On ARM64, these are referred to as `VectorTableLookup` (only takes two input vectors)
- On WASM, these are referred to as `Shuffle` (takes two input vectors) and `Swizzle` (takes one input vector)
- On LLVM, these are referred to as `VectorShuffle` and only take two input vectors
Due to the complexities of these APIs, they can't trivially be exposed as a "single" generic API. Likewise, while the behavior for `Vector128<T>` is consistent on all platforms, `Vector64<T>` is ARM64-specific and `Vector256<T>` is x86/x64-specific. The former behaves like `Vector128<T>`, while the latter generally behaves like 2x `Vector128<T>` (outside a few APIs called `Permute#x#`). For consistency, the `Vector256<T>` APIs exposed here would behave identically to `Vector128<T>` and allow "cross-lane permutation".
For the single-vector reordering, the APIs are "trivial":
```csharp
public static Vector128<byte> Shuffle(Vector128<byte> vector, Vector128<byte> indices);
public static Vector128<sbyte> Shuffle(Vector128<sbyte> vector, Vector128<sbyte> indices);
public static Vector128<short> Shuffle(Vector128<short> vector, Vector128<short> indices);
public static Vector128<ushort> Shuffle(Vector128<ushort> vector, Vector128<ushort> indices);
public static Vector128<int> Shuffle(Vector128<int> vector, Vector128<int> indices);
public static Vector128<uint> Shuffle(Vector128<uint> vector, Vector128<uint> indices);
public static Vector128<float> Shuffle(Vector128<float> vector, Vector128<int> indices);
public static Vector128<long> Shuffle(Vector128<long> vector, Vector128<long> indices);
public static Vector128<ulong> Shuffle(Vector128<ulong> vector, Vector128<ulong> indices);
public static Vector128<double> Shuffle(Vector128<double> vector, Vector128<long> indices);
```
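For example, assuming the single-vector overload above, reversing the bytes of a `Vector128<byte>` is a single shuffle with constant indices (`value` here is an arbitrary input):

```csharp
Vector128<byte> indices = Vector128.Create(
    (byte)15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
Vector128<byte> reversed = Vector128.Shuffle(value, indices);
```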
For the two-vector reordering, the APIs are generally the same:
```csharp
public static Vector128<byte> Shuffle(Vector128<byte> lower, Vector128<byte> upper, Vector128<byte> indices);
public static Vector128<sbyte> Shuffle(Vector128<sbyte> lower, Vector128<sbyte> upper, Vector128<sbyte> indices);
public static Vector128<short> Shuffle(Vector128<short> lower, Vector128<short> upper, Vector128<short> indices);
public static Vector128<ushort> Shuffle(Vector128<ushort> lower, Vector128<ushort> upper, Vector128<ushort> indices);
public static Vector128<int> Shuffle(Vector128<int> lower, Vector128<int> upper, Vector128<int> indices);
public static Vector128<uint> Shuffle(Vector128<uint> lower, Vector128<uint> upper, Vector128<uint> indices);
public static Vector128<float> Shuffle(Vector128<float> lower, Vector128<float> upper, Vector128<int> indices);
public static Vector128<long> Shuffle(Vector128<long> lower, Vector128<long> upper, Vector128<long> indices);
public static Vector128<ulong> Shuffle(Vector128<ulong> lower, Vector128<ulong> upper, Vector128<ulong> indices);
public static Vector128<double> Shuffle(Vector128<double> lower, Vector128<double> upper, Vector128<long> indices);
```
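Assuming the two-vector overload above treats the indices as selecting from the 32-byte concatenation of `lower` and `upper` (indices 0-15 from `lower`, 16-31 from `upper`; an assumption about the proposal's semantics, not confirmed by it), extracting a misaligned 16-byte window across two adjacent vectors becomes:

```csharp
// Selects bytes 3..18 of the concatenated (lower, upper) pair, similar
// to x86 'palignr' or ARM64 'EXT' when the indices are constant.
Vector128<byte> indices = Vector128.Create(
    (byte)3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18);
Vector128<byte> window = Vector128.Shuffle(lower, upper, indices);
```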
An upside of these APIs is that for common input scenarios involving constant indices, the generated code can be massively simplified. A downside is that non-constant indices on older hardware, or certain `Vector256<T>` shuffles involving `byte`, `sbyte`, `short`, or `ushort` that cross the 128-bit lane boundary, can take a couple of instructions rather than a single instruction. This is ultimately no worse than a few other scenarios on each platform, where one platform may have slightly better instruction generation due to the instructions it provides.