-
Notifications
You must be signed in to change notification settings - Fork 259
Implementing a New Function
Adding a new function to SIMDe, including tests to make sure it's implemented correctly, is pretty straightforward once you know where things are.
This document will guide you through adding everything you need in order to get a patch accepted. We'll use an AVX2 function, _mm256_add_epi32
, as an example but the process is similar for other ISA extensions.
First, we'll create a stub implementation of our function in the right place. For AVX2, that's simde/x86/avx2.h
. The functions are generally in alphabetical order, but if it makes more sense to put it next to a function that is similar instead you can do that, too.
We don't need to worry about the portable implementation much right now, we just want something that will compile and run so we can generate a test vector. I generally just copy an existing function with the same prototype if possible (for example, the body of our _mm256_add_epi32
could just be a copy of _mm256_sub_epi32
with the function name changed for now).
We do need to change the native implementation to call the right function, though, since we'll be using it soon to generate a test vector.
All functions in SIMDe require at least one test case; patches missing a test will not be accepted. Writing these tests can be quite repetitive, but SIMDe has lots of code to help you get started. Since we're writing an AVX2 function, the first thing you'll want to do is open up test/x86/skel.c
and find function that corresponds with the types you want to use; in the case of _mm256_add_epi32
, we're adding two 256-bit vectors of 8 32-bit integers each, so you'll want to use test_simde_mm256_xxx_epi32
. Note the naming convention; you should be able to find the necessary function quickly by searching instead of reading through the whole thing.
Here is what that function looks like at the time this document was written:
static MunitResult
test_simde_mm256_xxx_epi32(const MunitParameter params[], void* data) {
(void) params;
(void) data;
const struct {
simde__m256i a;
simde__m256i b;
simde__m256i r;
} test_vec[8] = {
};
printf("\n");
for (size_t i = 0 ; i < (sizeof(test_vec) / (sizeof(test_vec[0]))) ; i++) {
simde__m256i a, b, r;
munit_rand_memory(sizeof(a), (uint8_t*) &a);
munit_rand_memory(sizeof(b), (uint8_t*) &b);
r = simde_mm256_xxx_epi32(a, b);
printf(" { simde_mm256_set_epi32(INT32_C(%11d), INT32_C(%11d), INT32_C(%11d), INT32_C(%11d),\n"
" INT32_C(%11d), INT32_C(%11d), INT32_C(%11d), INT32_C(%11d)),\n",
a.i32[7], a.i32[6], a.i32[5], a.i32[4], a.i32[3], a.i32[2], a.i32[1], a.i32[0]);
printf(" simde_mm256_set_epi32(INT32_C(%11d), INT32_C(%11d), INT32_C(%11d), INT32_C(%11d),\n"
" INT32_C(%11d), INT32_C(%11d), INT32_C(%11d), INT32_C(%11d)),\n",
b.i32[7], b.i32[6], b.i32[5], b.i32[4], b.i32[3], b.i32[2], b.i32[1], b.i32[0]);
printf(" simde_mm256_set_epi32(INT32_C(%11d), INT32_C(%11d), INT32_C(%11d), INT32_C(%11d),\n"
" INT32_C(%11d), INT32_C(%11d), INT32_C(%11d), INT32_C(%11d)) },\n",
r.i32[7], r.i32[6], r.i32[5], r.i32[4], r.i32[3], r.i32[2], r.i32[1], r.i32[0]);
}
return MUNIT_FAIL;
for (size_t i = 0 ; i < (sizeof(test_vec) / sizeof(test_vec[0])); i++) {
simde__m256i r = simde_mm256_xxx_epi32(test_vec[i].a, test_vec[i].b);
simde_assert_m256i_i32(r, ==, test_vec[i].r);
}
return MUNIT_OK;
}
Copy the code into the file with all the other test cases for the ISA extension you're working on; in this case, it's test/x86/avx2.c
and change the name to reflect the function you're working on, which is test_simde_mm256_xxx_epi32
for this example. Don't forget to also change the function calls in the body of the function; you can usually just to a find and replace in the file to replace "xxx" with whatever you need ("add") in this case.
You'll also need to tell the test runner about the function; at the end of the file, there is an array called test_suite_tests
, just add an entry to that:
static MunitTest test_suite_tests[] = {
/* ... */
TEST_FUNC(mm256_add_epi32),
/* ... */
Now our test function will run when the test suite runs.
Most of the test function is currently not actually used for testing, but rather to generate a test vector. If you run the tests on a machine which supports the instruction it should print out a test vector for you to copy into the test function. For example, for _mm256_add_epi32
you might end up with something like:
{ simde_mm256_set_epi32(INT32_C( 1102687755), INT32_C( 1275949869), INT32_C( -388043296), INT32_C( 1616523445),
INT32_C( -312991452), INT32_C(-1980926618), INT32_C( 1274012126), INT32_C( -45808693)),
simde_mm256_set_epi32(INT32_C(-1821401638), INT32_C( 1143218625), INT32_C(-1072188421), INT32_C( -228883992),
INT32_C( 1453787917), INT32_C(-1686415046), INT32_C(-1856178723), INT32_C(-1344248495)),
simde_mm256_set_epi32(INT32_C( -718713883), INT32_C(-1875798802), INT32_C(-1460231717), INT32_C( 1387639453),
INT32_C( 1140796465), INT32_C( 627625632), INT32_C( -582166597), INT32_C(-1390057188)) },
{ simde_mm256_set_epi32(INT32_C( -511556352), INT32_C( 512138684), INT32_C( 2115720361), INT32_C( -345092241),
INT32_C( -115713034), INT32_C( 1435785542), INT32_C( -578341737), INT32_C( 626663856)),
simde_mm256_set_epi32(INT32_C( 1905028737), INT32_C( 164639990), INT32_C(-1952346601), INT32_C( 1853095591),
INT32_C(-1825217200), INT32_C(-1102744367), INT32_C(-1105586227), INT32_C(-1908622941)),
simde_mm256_set_epi32(INT32_C( 1393472385), INT32_C( 676778674), INT32_C( 163373760), INT32_C( 1508003350),
INT32_C(-1940930234), INT32_C( 333041175), INT32_C(-1683927964), INT32_C(-1281959085)) },
{ simde_mm256_set_epi32(INT32_C( 841608097), INT32_C(-2001797484), INT32_C(-1658305288), INT32_C( 966942303),
INT32_C( 842108123), INT32_C( 697774066), INT32_C(-1273233002), INT32_C( -331057125)),
simde_mm256_set_epi32(INT32_C( 824745259), INT32_C( 1162513122), INT32_C( 1536105364), INT32_C( 1572988069),
INT32_C( 1601630355), INT32_C( 105174023), INT32_C( -548723565), INT32_C( 342919548)),
simde_mm256_set_epi32(INT32_C( 1666353356), INT32_C( -839284362), INT32_C( -122199924), INT32_C(-1755036924),
INT32_C(-1851228818), INT32_C( 802948089), INT32_C(-1821956567), INT32_C( 11862423)) },
{ simde_mm256_set_epi32(INT32_C(-1982661498), INT32_C( -454967885), INT32_C( 1606399367), INT32_C( 1911771725),
INT32_C( -320200723), INT32_C( 2055189331), INT32_C( 1782567162), INT32_C( 617047003)),
simde_mm256_set_epi32(INT32_C(-1988185598), INT32_C( 1350171177), INT32_C( -741176174), INT32_C( 1024642864),
INT32_C( 1174775607), INT32_C(-1489493977), INT32_C( 2114610376), INT32_C(-1150946108)),
simde_mm256_set_epi32(INT32_C( 324120200), INT32_C( 895203292), INT32_C( 865223193), INT32_C(-1358552707),
INT32_C( 854574884), INT32_C( 565695354), INT32_C( -397789758), INT32_C( -533899105)) },
{ simde_mm256_set_epi32(INT32_C(-1636237507), INT32_C(-2022044523), INT32_C( 1298417038), INT32_C( -498789244),
INT32_C(-1120565370), INT32_C( -10552717), INT32_C( 1267811859), INT32_C( 1736112342)),
simde_mm256_set_epi32(INT32_C( 30746202), INT32_C( 1464439343), INT32_C( 1694184093), INT32_C(-1066802952),
INT32_C( -664495133), INT32_C(-2016253412), INT32_C(-1975304715), INT32_C( -70672826)),
simde_mm256_set_epi32(INT32_C(-1605491305), INT32_C( -557605180), INT32_C(-1302366165), INT32_C(-1565592196),
INT32_C(-1785060503), INT32_C(-2026806129), INT32_C( -707492856), INT32_C( 1665439516)) },
{ simde_mm256_set_epi32(INT32_C( 289000373), INT32_C( 1573632519), INT32_C( -39248751), INT32_C( -989305129),
INT32_C( -946333511), INT32_C( -275686449), INT32_C( -98660627), INT32_C(-1519479102)),
simde_mm256_set_epi32(INT32_C( 297476793), INT32_C( 436731799), INT32_C( 124294563), INT32_C(-1635813332),
INT32_C( 263383074), INT32_C( -533172755), INT32_C( 1125990821), INT32_C( -786980387)),
simde_mm256_set_epi32(INT32_C( 586477166), INT32_C( 2010364318), INT32_C( 85045812), INT32_C( 1669848835),
INT32_C( -682950437), INT32_C( -808859204), INT32_C( 1027330194), INT32_C( 1988507807)) },
{ simde_mm256_set_epi32(INT32_C( 518182194), INT32_C(-1204047142), INT32_C( -66070725), INT32_C( 499109808),
INT32_C(-2041576579), INT32_C( -621515360), INT32_C( 566201077), INT32_C( 301667364)),
simde_mm256_set_epi32(INT32_C(-1846226401), INT32_C(-1479610627), INT32_C( -205605694), INT32_C( 2074175879),
INT32_C( 797873427), INT32_C( 232260429), INT32_C( 2122451120), INT32_C(-1502060759)),
simde_mm256_set_epi32(INT32_C(-1328044207), INT32_C( 1611309527), INT32_C( -271676419), INT32_C(-1721681609),
INT32_C(-1243703152), INT32_C( -389254931), INT32_C(-1606315099), INT32_C(-1200393395)) },
{ simde_mm256_set_epi32(INT32_C( 405834501), INT32_C(-1910761465), INT32_C( 957239954), INT32_C( -786856288),
INT32_C( 843920617), INT32_C( 327146567), INT32_C( -333483012), INT32_C(-1269489720)),
simde_mm256_set_epi32(INT32_C( -343554450), INT32_C( -768698719), INT32_C(-1629325598), INT32_C( -86112156),
INT32_C(-1762054840), INT32_C(-1230219631), INT32_C(-1955142376), INT32_C( 681367456)),
simde_mm256_set_epi32(INT32_C( 62280051), INT32_C( 1615507112), INT32_C( -672085644), INT32_C( -872968444),
INT32_C( -918134223), INT32_C( -903073064), INT32_C( 2006341908), INT32_C( -588122264)) }
Go ahead and copy that, and paste it right into the empty test_vector
array near the beginning of the function. Then you can get rid of all the code to generate the test vector, which is everything from the printf("\n");
to the return MUNIT_FAIL;
.
Now you're done with the test, and you can get on with the implementation.
Head back over to your implementation stub in simde/x86/avx2.h
. You'll probably be able to figure out the rest from looking at other functions in the same file, so feel free to skip the rest of this document.
First, take a look at Intel's documentation; remember, our example is for _mm256_add_epi32
. The documentation isn't always easy to understand, but between it and the test vector you should be able to figure it out. In our case, we just have to add the elements in a
to the elements in b
.
The simde_mm256
type is a giant union, and you should probably take a look at the definition if you haven't already. It's in the header for the first version of the ISA extension that uses the type, so in this case simde/x86/avx2.h
. You'll see members for various different integer sizes, both signed and unsigned, with a consistent naming scheme (32-bit integers are i32
, 64-bit unsigned integers are u64
, etc.). You should also note the n
(for "native") member, which exists if the compiler supports the datatype natively in the current configuration. Finally, you can treat the simde_m256
as an array of two __m128
vectors; this makes it easy to create a fallback using two 128-bit operations.
An initial version of our function might look like this:
SIMDE__FUNCTION_ATTRIBUTES
simde__m256i
simde_mm256_add_epi32 (simde__m256i a, simde__m256i b) {
simde__m256i r;
#if defined(SIMDE_AVX2_NATIVE)
r.n = _mm256_add_epi32(a.n, b.n);
#else
for (size_t i = 0 ; i < (sizeof(r.i32) / sizeof(r.i32[0])) ; i++) {
r.i32[i] = a.i32[i] + b.i32[i];
}
#endif
return r;
}
The first thing you'll probably want to do is provide a hint to the compiler that it can (and should) try to automatically vectorize the loop. We do this using one the SIMDE__VECTORIZE
macro which is declared in simde/simde-common.h
header. These macros use either OpenMP 4 SIMD, Cilk Plus, or compiler-specific pragmas, depending on which compiler is in use. Usually you'll just need SIMDE__VECTORIZE
, not SIMDE__VECTORIZE_SAFELEN
, SIMDE__VECTORIZE_REDUCTION
, or SIMDE__VECTORIZE_ALIGNED
. Just place the macro before the loop:
SIMDE__VECTORIZE
for (size_t i = 0 ; i < (sizeof(r.i32) / sizeof(r.i32[0])) ; i++) {
r.i32[i] = a.i32[i] + b.i32[i];
}
For a simple function like add
, the compiler will likely be able to generate vector instructions even for architectures we haven't explicitly created an implementation for, but it certainly doesn't hurt to add optimized fallbacks. Let's go ahead and create one for SSE2:
SIMDE__FUNCTION_ATTRIBUTES
simde__m256i
simde_mm256_add_epi32 (simde__m256i a, simde__m256i b) {
simde__m256i r;
#if defined(SIMDE_AVX2_NATIVE)
r.n = _mm256_add_epi32(a.n, b.n);
#elif defined(SIMDE_SSE2_NATIVE)
r.m128i[0] = _mm_add_epi32(a.m128i[0], b.m128i[0]);
r.m128i[1] = _mm_add_epi32(a.m128i[1], b.m128i[1]);
#else
for (size_t i = 0 ; i < (sizeof(r.i32) / sizeof(r.i32[0])) ; i++) {
r.i32[i] = a.i32[i] + b.i32[i];
}
#endif
return r;
}
Note: currently running the tests locally doesn't test all code paths, so it is advisable to comment out different implementations (AVX2_NATIVE, SSE2_NATIVE, SIMDE__SHUFFLE_VECTOR, etc) to double check
And you're done. If your implementations are correct the tests should pass, and you're ready to submit a pull request (please!).