
Additional cross-platform hardware intrinsic APIs for loading/storing, reordering, and extracting a per-element "mask" #63331

@tannergooding

Description


Summary

With #49397 we approved and exposed cross-platform APIs on Vector64/128/256 to help developers more easily support multiple platforms.

This was done by mirroring the surface area exposed by Vector&lt;T&gt;. However, due to their fixed sizes, there are some additional APIs that would be beneficial to expose. Likewise, there are a few APIs for loading/storing vectors that are commonly used with hardware intrinsics and would benefit from cross-platform helpers.

The APIs exposed would include the following:

  • ExtractMostSignificantBits
    • On x86/x64 this would be emitted as MoveMask and performs exactly as expected
    • On ARM64, this would be emulated via and, element-wise shift-right, 64-bit pairwise add, extract. The JIT could optionally detect if the input is the result of a Compare instruction and elide the shift-right.
    • On WASM, this is called bitmask and works identically to MoveMask
    • This API and its emulation are used throughout the BCL
  • Load/Store
    • This is the basic load/store operations already in use for x86, x64, and ARM64
  • LoadAligned/StoreAligned
    • This works exactly like the same-named APIs on x86/x64
    • When optimizations are disabled the alignment is verified
    • When optimizations are enabled, this alignment check may be skipped, since on modern hardware the load can be folded into the consuming instruction
    • This enables efficient usage of the instruction on both older (pre-AVX) hardware as well as newer (post-AVX) or ARM64 hardware (where no load/store aligned instructions exist)
  • LoadAlignedNonTemporal/StoreAlignedNonTemporal
    • This behaves as LoadAligned/StoreAligned but may optionally treat the memory access as non-temporal and avoid polluting the cache
  • LoadUnsafe/StoreUnsafe
    • These are "new" APIs; they cover a "gap" in the API surface that has been encountered and worked around in the BCL, and which is semi-regularly requested by the community
    • The API that just takes a ref T behaves exactly like the version that takes a pointer, just without requiring pinning
    • The API that additionally takes an nuint index behaves like ref Unsafe.Add(ref value, index) and avoids needing to further bloat IL and hinder readability
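As a sketch of how these APIs compose (assuming the Vector128 variants are available, as they became in .NET 7), a vectorized "count matching bytes" helper can use LoadUnsafe to avoid pinning and ExtractMostSignificantBits to turn a per-element comparison into a scalar mask. The CountEquals helper here is hypothetical, for illustration only:

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class MaskDemo
{
    // Counts occurrences of 'value' in 'data', one vector at a time.
    public static int CountEquals(ReadOnlySpan<byte> data, byte value)
    {
        int count = 0;
        nuint i = 0;
        ref byte start = ref MemoryMarshal.GetReference(data);

        if (Vector128.IsHardwareAccelerated)
        {
            Vector128<byte> target = Vector128.Create(value);
            while (i + 16 <= (nuint)data.Length)
            {
                // LoadUnsafe(ref, index) reads without pinning or pointer math.
                Vector128<byte> chunk = Vector128.LoadUnsafe(ref start, i);
                // Equals sets all bits of matching elements; the extracted mask
                // has one bit per element, so PopCount gives the match count.
                uint mask = Vector128.ExtractMostSignificantBits(Vector128.Equals(chunk, target));
                count += BitOperations.PopCount(mask);
                i += 16;
            }
        }

        // Scalar tail for the remaining elements.
        for (; i < (nuint)data.Length; i++)
        {
            if (data[(int)i] == value) count++;
        }
        return count;
    }
}
```

On x86/x64 the mask extraction lowers to MoveMask; on ARM64 it uses the emulation sequence described above.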

API Proposal

namespace System.Runtime.Intrinsics
{
    public static partial class Vector64
    {
        public static uint ExtractMostSignificantBits<T>(Vector64<T> vector) where T : struct;

        public static unsafe Vector64<T> Load<T>(T* address) where T : unmanaged;
        public static unsafe Vector64<T> LoadAligned<T>(T* address) where T : unmanaged;
        public static unsafe Vector64<T> LoadAlignedNonTemporal<T>(T* address) where T : unmanaged;
        public static Vector64<T> LoadUnsafe<T>(ref T address) where T : struct;
        public static Vector64<T> LoadUnsafe<T>(ref T address, nuint index) where T : struct;

        public static unsafe void Store<T>(T* address, Vector64<T> source) where T : unmanaged;
        public static unsafe void StoreAligned<T>(T* address, Vector64<T> source) where T : unmanaged;
        public static unsafe void StoreAlignedNonTemporal<T>(T* address, Vector64<T> source) where T : unmanaged;
        public static void StoreUnsafe<T>(ref T address, Vector64<T> source) where T : struct;
        public static void StoreUnsafe<T>(ref T address, nuint index, Vector64<T> source) where T : struct;
    }

    public static partial class Vector128
    {
        public static uint ExtractMostSignificantBits<T>(Vector128<T> vector) where T : struct;

        public static unsafe Vector128<T> Load<T>(T* address) where T : unmanaged;
        public static unsafe Vector128<T> LoadAligned<T>(T* address) where T : unmanaged;
        public static unsafe Vector128<T> LoadAlignedNonTemporal<T>(T* address) where T : unmanaged;
        public static Vector128<T> LoadUnsafe<T>(ref T address) where T : struct;
        public static Vector128<T> LoadUnsafe<T>(ref T address, nuint index) where T : struct;

        public static unsafe void Store<T>(T* address, Vector128<T> source) where T : unmanaged;
        public static unsafe void StoreAligned<T>(T* address, Vector128<T> source) where T : unmanaged;
        public static unsafe void StoreAlignedNonTemporal<T>(T* address, Vector128<T> source) where T : unmanaged;
        public static void StoreUnsafe<T>(ref T address, Vector128<T> source) where T : struct;
        public static void StoreUnsafe<T>(ref T address, nuint index, Vector128<T> source) where T : struct;
    }

    public static partial class Vector256
    {
        public static uint ExtractMostSignificantBits<T>(Vector256<T> vector) where T : struct;

        public static unsafe Vector256<T> Load<T>(T* address) where T : unmanaged;
        public static unsafe Vector256<T> LoadAligned<T>(T* address) where T : unmanaged;
        public static unsafe Vector256<T> LoadAlignedNonTemporal<T>(T* address) where T : unmanaged;
        public static Vector256<T> LoadUnsafe<T>(ref T address) where T : struct;
        public static Vector256<T> LoadUnsafe<T>(ref T address, nuint index) where T : struct;

        public static unsafe void Store<T>(T* address, Vector256<T> source) where T : unmanaged;
        public static unsafe void StoreAligned<T>(T* address, Vector256<T> source) where T : unmanaged;
        public static unsafe void StoreAlignedNonTemporal<T>(T* address, Vector256<T> source) where T : unmanaged;
        public static void StoreUnsafe<T>(ref T address, Vector256<T> source) where T : struct;
        public static void StoreUnsafe<T>(ref T address, nuint index, Vector256<T> source) where T : struct;
    }
}

Additional Notes

Ideally we would also expose "shuffle" APIs allowing the elements of a single or multiple vectors to be reordered:

  • On x86/x64 these are referred to as Shuffle or Permute (generally taking two vectors and one vector, respectively; but that isn't always the case)
  • On ARM64, these are referred to as VectorTableLookup (only takes two vectors)
  • On WASM, these are referred to as Shuffle (takes two vectors) and Swizzle (takes one vector).
  • On LLVM, these are referred to as VectorShuffle and only take two vectors

Due to the complexities of these APIs, they can't trivially be exposed as a "single" generic API. Likewise, while the behavior for Vector128&lt;T&gt; is consistent on all platforms, Vector64&lt;T&gt; is ARM64 specific and Vector256&lt;T&gt; is x86/x64 specific. The former behaves like Vector128&lt;T&gt;, while the latter generally behaves like 2x Vector128&lt;T&gt; (outside a few APIs called Permute#x#). For consistency, the Vector256&lt;T&gt; APIs exposed here would behave identically to Vector128&lt;T&gt; and allow "cross lane permutation".

For the single-vector reordering, the APIs are "trivial":

public static Vector128<byte>   Shuffle(Vector128<byte>   vector, Vector128<byte>   indices)
public static Vector128<sbyte>  Shuffle(Vector128<sbyte>  vector, Vector128<sbyte>  indices)

public static Vector128<short>  Shuffle(Vector128<short>  vector, Vector128<short>  indices)
public static Vector128<ushort> Shuffle(Vector128<ushort> vector, Vector128<ushort> indices)

public static Vector128<int>    Shuffle(Vector128<int>    vector, Vector128<int>    indices)
public static Vector128<uint>   Shuffle(Vector128<uint>   vector, Vector128<uint>   indices)
public static Vector128<float>  Shuffle(Vector128<float>  vector, Vector128<int>    indices)

public static Vector128<long>   Shuffle(Vector128<long>   vector, Vector128<long>   indices)
public static Vector128<ulong>  Shuffle(Vector128<ulong>  vector, Vector128<ulong>  indices)
public static Vector128<double> Shuffle(Vector128<double> vector, Vector128<long>   indices)
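For instance, with constant indices the single-vector byte form expresses a full byte reversal directly. This sketch assumes the Vector128 overload above (which shipped in .NET 7, where out-of-range indices produce zero); with constant indices it can lower to a single pshufb on SSSE3-capable x86/x64 hardware:

```csharp
using System;
using System.Runtime.Intrinsics;

static class ShuffleDemo
{
    // Reverses the 16 bytes of a vector using the single-vector Shuffle
    // with a constant index vector.
    public static Vector128<byte> ReverseBytes(Vector128<byte> value)
    {
        Vector128<byte> indices = Vector128.Create(
            (byte)15, 14, 13, 12, 11, 10, 9, 8,
            7, 6, 5, 4, 3, 2, 1, 0);
        return Vector128.Shuffle(value, indices);
    }
}
```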

For the two-vector reordering, the APIs are generally the same:

public static Vector128<byte>   Shuffle(Vector128<byte>  lower,  Vector128<byte>   upper, Vector128<byte>   indices)
public static Vector128<sbyte>  Shuffle(Vector128<sbyte> lower,  Vector128<sbyte>  upper, Vector128<sbyte>  indices)

public static Vector128<short>  Shuffle(Vector128<short>  lower, Vector128<short>  upper, Vector128<short>  indices)
public static Vector128<ushort> Shuffle(Vector128<ushort> lower, Vector128<ushort> upper, Vector128<ushort> indices)

public static Vector128<int>    Shuffle(Vector128<int>    lower, Vector128<int>    upper, Vector128<int>    indices)
public static Vector128<uint>   Shuffle(Vector128<uint>   lower, Vector128<uint>   upper, Vector128<uint>   indices)
public static Vector128<float>  Shuffle(Vector128<float>  lower, Vector128<float>  upper, Vector128<int>    indices)

public static Vector128<long>   Shuffle(Vector128<long>   lower, Vector128<long>   upper, Vector128<long>   indices)
public static Vector128<ulong>  Shuffle(Vector128<ulong>  lower, Vector128<ulong>  upper, Vector128<ulong>  indices)
public static Vector128<double> Shuffle(Vector128<double> lower, Vector128<double> upper, Vector128<long>   indices)

An upside of these APIs is that common input scenarios involving constant indices can be massively simplified.
A downside is that non-constant indices on older hardware, or certain Vector256&lt;T&gt; shuffles involving byte, sbyte, short, or ushort that cross the 128-bit lane boundary, can take a couple of instructions rather than a single instruction.

This is ultimately no worse than a few other scenarios on each platform where one platform may have slightly better instruction generation due to the instructions it provides.
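To illustrate the "couple of instructions" cost, the proposed two-vector byte Shuffle could be emulated in terms of the shipped single-vector form plus a select. Shuffle2 is a hypothetical helper, not a proposed API; it relies on out-of-range indices yielding zero in the single-vector Shuffle:

```csharp
using System;
using System.Runtime.Intrinsics;

static class TwoVectorShuffleDemo
{
    // Hypothetical emulation of the proposed two-vector byte Shuffle:
    // indices 0..15 select from 'lower', 16..31 select from 'upper',
    // anything else produces zero.
    public static Vector128<byte> Shuffle2(
        Vector128<byte> lower, Vector128<byte> upper, Vector128<byte> indices)
    {
        // Each half-shuffle zeroes elements whose (adjusted) index is out of range,
        // so the final select picks the correct contribution per element.
        Vector128<byte> fromLower = Vector128.Shuffle(lower, indices);
        Vector128<byte> fromUpper = Vector128.Shuffle(upper, indices - Vector128.Create((byte)16));
        Vector128<byte> useLower = Vector128.LessThan(indices, Vector128.Create((byte)16));
        return Vector128.ConditionalSelect(useLower, fromLower, fromUpper);
    }
}
```

On hardware with a native two-register table lookup (such as ARM64 TBL) the JIT could instead emit a single instruction.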
