
No obvious way to xor byte stream with [u8; 4] #46

Open
Shnatsel opened this issue Jul 10, 2018 · 14 comments

@Shnatsel

I'm trying to get into SIMD by implementing a trivial operation: XOR unmasking of a byte stream as required by the WebSocket specification. The implementation in x86 intrinsics is actually very straightforward, but I have a hard time wrapping my head around expressing it in terms of Faster iterators API.

The part I'm having trouble with is getting an input [u8; 4] to cycle within a SIMD vector of u8. I have looked at:

  1. load(), which does accept &[u8] as input, but its behavior in case of a length mismatch is completely undocumented. It's also not obvious what the offset parameter does.
  2. Casting the input [u8; 4] to u32, calling vecs::u32s() and then downcasting repeatedly to get a SIMD vector of u8, but Downcast seems to do something else entirely.
  3. Getting a SIMD vector of length 4 with an arbitrary element type, loading [u8; 4] into it (the lengths now match, so it should work), then downcasting repeatedly until I get a vector of u8 of arbitrary length. Except there seems to be no way to request a SIMD vector of length 4 with an arbitrary element type.
  4. After over an hour of head-scratching I've noticed that From<u32x4> is implemented for u8x16, so I could replace Downcast with it in approach 2 and probably get the correct result, except I have no idea how such conversions interact with host endianness.

I actually expected this to be a trivial task. I guess for someone familiar with SIMD it is, but for the likes of me a snippet in the examples/ folder that loads [u8; 4] into a vector would go a long way. Or perhaps even a convenience function in the API that deals with endianness properly, to make it harder to mess up.

@AdamNiederer (Owner)

AdamNiederer commented Jul 11, 2018

There aren't any 32-bit SIMD registers on x86. Typically the "cross-platform" way to do something like that would be to repeat the 4 u8s across the width of a vector.
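A scalar sketch of what "repeat the 4 u8s across the width of a vector" means for a 16-byte SSE register; repeat_mask is a hypothetical helper for illustration, not part of faster's API:

```rust
// Hypothetical helper: cycle a 4-byte mask across a 16-byte buffer,
// i.e. the lane pattern a 128-bit SIMD register would hold.
fn repeat_mask(mask: [u8; 4]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (i, b) in out.iter_mut().enumerate() {
        *b = mask[i % 4];
    }
    out
}

fn main() {
    let lanes = repeat_mask([0xAA, 0xBB, 0xCC, 0xDD]);
    // The 4-byte pattern repeats four times across the register width.
    assert_eq!(&lanes[..8], &[0xAA, 0xBB, 0xCC, 0xDD, 0xAA, 0xBB, 0xCC, 0xDD]);
    println!("{:02x?}", lanes);
}
```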

The Stack Overflow thread you reference doesn't initialize the mask to anything, so I'm not sure about the exact pattern you're looking to XOR with your byte stream. However, you should be able to do something like

byte_stream.simd_iter().simd_map(|v| v ^ u32s(YOUR_4_BYTE_PATTERN_HERE).be_u8s()).scalar_fill(&mut output)

Casting the input [u8; 4] to u32, calling vecs::u32s() and then downcasting repeatedly to get a SIMD vector of u8, but Downcast seems to do not at all what I want.

Downcast preserves the numeric value of each element, or saturates it if it's too big to fit in the newly-sized integer. You can use the mem::transmute or faster's Transmute trait to re-type the vector without changing its contents.
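A scalar analogy of that distinction, using only std (no faster types), may make it clearer: a downcast-style conversion works on numeric values and saturates, while a transmute-style conversion just reinterprets the underlying bytes.

```rust
fn main() {
    // Downcast-style conversion preserves the numeric value,
    // saturating when it doesn't fit in the narrower integer:
    let wide: u16 = 0x1234;
    let saturated = if wide > u8::MAX as u16 { u8::MAX } else { wide as u8 };
    assert_eq!(saturated, 0xFF); // 0x1234 doesn't fit in a u8

    // Transmute-style re-typing keeps the raw bytes and only changes
    // how they are interpreted; nothing is recomputed:
    let bytes: [u8; 2] = wide.to_ne_bytes();
    assert_eq!(u16::from_ne_bytes(bytes), 0x1234);
}
```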

@Shnatsel (Author)

Typically the "cross-platform" way to do something like that would be to repeat the 4 u8s across the width of a vector.

Yes, this is exactly what I'm trying to do. However, currently the only way to do it is to convert [u8; 4] to a u32, call u32s() on that, and then convert the result to a vector of u8s, which requires me to care about host endianness. It would be great if Faster could expose an API to repeat [u8; 4] across the width of a vector without touching endianness at all.
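The endianness hazard in the u32 round-trip can be seen with std alone: the same 4 bytes produce different u32 values depending on which byte-order conversion you pick, and the original byte order is only reproduced when the read and write directions agree.

```rust
fn main() {
    let mask = [0x01u8, 0x02, 0x03, 0x04];

    // The same bytes yield different u32 values per byte order:
    let be = u32::from_be_bytes(mask);
    let le = u32::from_le_bytes(mask);
    assert_eq!(be, 0x0102_0304);
    assert_eq!(le, 0x0403_0201);

    // Splatting the u32 and reinterpreting as bytes recovers the
    // original order only if read and write endianness match:
    assert_eq!(be.to_be_bytes(), mask);
    assert_eq!(le.to_le_bytes(), mask);
}
```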

Thanks a lot for the code snippet! If I manage to get it to do what I want, I'll open a PR to add my code to examples.

@AdamNiederer (Owner)

You may be able to use the Endian trait (vec.to_be() and vec.to_le()) to ensure the endianness is correct across platforms. Since it's a computation on a constant, LLVM should be able to just compile the whole thing down to a movaps (or similar).

@Shnatsel (Author)

Ignoring endianness for now, the following code seems to do roughly what I want:

let mask_u32 = u32::from_bytes(mask);
let pattern = faster::u32s(mask_u32).be_u8s();
buf.simd_iter(u8s(0)).simd_map(|v| v ^ pattern).scalar_collect().truncate(buf.len())

However, this is 2x slower than scalar code that does the same in-place. I've tried rewriting this to mutate in-place with .simd_iter_mut(), but I'm not sure what to chain it with to mutate the input: simd_map() insists I should return something, and there are no uses of .simd_iter_mut() in examples. Is in-place mutation even supported at this point?
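For reference, the scalar in-place version being compared against looks roughly like this (a std-only sketch of RFC 6455 unmasking; apply_mask_in_place is an illustrative name, not from the benchmark):

```rust
// In-place scalar unmasking: byte i is XORed with mask[i % 4],
// as the WebSocket spec (RFC 6455) requires.
fn apply_mask_in_place(buf: &mut [u8], mask: [u8; 4]) {
    for (i, b) in buf.iter_mut().enumerate() {
        *b ^= mask[i % 4];
    }
}

fn main() {
    let mask = [0x10, 0x20, 0x30, 0x40];
    let mut buf = vec![0u8; 6];
    apply_mask_in_place(&mut buf, mask);
    assert_eq!(buf, vec![0x10, 0x20, 0x30, 0x40, 0x10, 0x20]);

    // XOR is an involution: masking twice restores the input.
    apply_mask_in_place(&mut buf, mask);
    assert_eq!(buf, vec![0u8; 6]);
}
```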

@AdamNiederer (Owner)

In-place mutation is in progress, but I think you may be better served by a scalar_fill call instead of scalar_collect, as the latter performs a heap allocation. You also shouldn't need to truncate the output buffer, as faster takes care of misaligned and arbitrarily-sized data for you.
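A scalar analogy of the collect-vs-fill distinction (std-only, no faster types; the closure and buffer names are illustrative): collecting allocates a fresh Vec on every call, while filling writes into a buffer the caller already owns.

```rust
fn main() {
    let input = [1u8, 2, 3, 4];

    // collect-style: allocates a new Vec each call.
    let collected: Vec<u8> = input.iter().map(|b| b ^ 0xFF).collect();

    // fill-style: writes into a caller-provided buffer; no allocation
    // happens inside the hot loop.
    let mut output = [0u8; 4];
    for (dst, src) in output.iter_mut().zip(&input) {
        *dst = src ^ 0xFF;
    }

    assert_eq!(collected, output);
}
```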

@Shnatsel (Author)

The input can be almost arbitrarily large and is not guaranteed to fit on the stack, so it's going to be either a heap allocation or an in-place mutation.

FWIW I've also tried the following, but it was 25% slower than .scalar_collect():

let mask_u32 = u32::from_bytes(mask);
let pattern = faster::u32s(mask_u32).be_u8s();
let mut output = vec![0u8; buf.len()];
buf.simd_iter(u8s(0)).simd_map(|v| v ^ pattern).scalar_fill(&mut output);

Also: curiously, on an AMD CPU with AVX, using RUSTFLAGS="-C target-cpu=native" actually makes the code 10% slower compared to RUSTFLAGS="-C target-cpu=x86-64" on a 10 KB input buffer.

@AdamNiederer (Owner)

Hm, that's interesting. Are you using Zen? I'll see if I can find anything weird in the disassembly and bench that code on my boxes.

@Shnatsel (Author)

Nope, plain old FX-4300. I'm on a nightly compiler, obviously, so code generation might vary from version to version. The compiler version I've tested this on is rustc 1.28.0-nightly (60efbdead 2018-06-23).

I'm benchmarking with Criterion, I can share the entire test harness if you're interested.

@AdamNiederer (Owner)

That would be awesome, thanks. You won't see a speedup with AVX compared to SSE2 for what you're doing, but compiling for your native CPU shouldn't slow you down by that much.

@Shnatsel (Author)

https://github.com/Shnatsel/tungstenite-rs/tree/mask-simd - it's under benches/ in branch mask-simd.
I've been working on it in tungstenite codebase (websocket protocol implementation in Rust), but it's a self-contained file and can probably be decoupled fairly easily.

@Shnatsel (Author)

Also, with SIMD disabled I get performance roughly equal to the naive per-byte XOR, apply_mask_fallback() in that file. The polyfill does not have to be that slow: for example, apply_mask_fast32() in the same file, which does not use SIMD instructions but operates on 32 bits at a time, is 20x faster than apply_mask_fallback().
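A std-only sketch of the "32 bits at a time" approach (analogous to apply_mask_fast32, though apply_mask_u32 here is an illustrative reimplementation, not the tungstenite code): XOR whole u32 words, then handle the tail bytes. Reading the mask with from_ne_bytes keeps the byte order consistent with the native-endian word loads, so no endianness fix-up is needed.

```rust
// XOR the mask over the buffer one u32 word at a time, then finish
// the remainder bytewise. Since the remainder starts at a multiple
// of 4, mask[i % 4] stays aligned with the word-sized part.
fn apply_mask_u32(buf: &mut [u8], mask: [u8; 4]) {
    let mask_u32 = u32::from_ne_bytes(mask);
    let mut chunks = buf.chunks_exact_mut(4);
    for chunk in &mut chunks {
        let word = u32::from_ne_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        chunk.copy_from_slice(&(word ^ mask_u32).to_ne_bytes());
    }
    for (i, b) in chunks.into_remainder().iter_mut().enumerate() {
        *b ^= mask[i % 4];
    }
}

fn main() {
    let mut buf = vec![0u8; 6];
    apply_mask_u32(&mut buf, [1, 2, 3, 4]);
    assert_eq!(buf, vec![1, 2, 3, 4, 1, 2]);
}
```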

@Shnatsel (Author)

Turns out AVX requires a switch to a higher power-consumption state, and this takes time; until the switch happens, the CPU runs at significantly lower frequencies. Using AVX is therefore not worthwhile unless the workload runs long enough that the time spent before the switch to the higher power state becomes negligible. Source

This is one possible explanation for the performance drop on AVX.

@AdamNiederer (Owner)

The main reason you're not seeing a speedup is because faster isn't well-optimized for AVX-only CPUs at the moment, and uses the SSE xor instruction. If you compile on a machine with AVX2, you will see the speedup.

Depending on how busy I am with work, a fix for this may or may not make it into the next release (along with runtime detection).

@AdamNiederer (Owner)

AdamNiederer commented Jul 13, 2018

Here's the main loop, according to cargo-disassemble

.LBB3_7:
	cmp	rax, rsi
	ja	.LBB3_8
	cmp	rax, rcx
	ja	.LBB3_23
	vpxor	xmm0, xmm1, xmmword ptr [r8 + rax]
	vmovdqu	xmmword ptr [rdx + rax], xmm0
	lea	rdi, [rax + 16]
	add	rax, 32
	cmp	rax, rsi
	mov	rax, rdi
	jbe	.LBB3_7

The jump at the end makes sense, but the two at the beginning shouldn't be there. It looks like a bounds check, which hints that I may be missing an unchecked load somewhere.
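For illustration (this is not faster's actual code), the checked-vs-unchecked access pattern in std-only Rust looks like this; get_unchecked is the usual way such a bounds check is elided once the index range has already been proven valid:

```rust
// Indexing with buf[i] compiles to a checked access that can panic;
// get_unchecked (inside unsafe) skips the check entirely, relying on
// the caller to guarantee the index is in range.
fn xor_checked(buf: &[u8], mask: u8) -> Vec<u8> {
    (0..buf.len()).map(|i| buf[i] ^ mask).collect()
}

fn xor_unchecked(buf: &[u8], mask: u8) -> Vec<u8> {
    (0..buf.len())
        // SAFETY: i < buf.len() by construction of the range.
        .map(|i| unsafe { buf.get_unchecked(i) } ^ mask)
        .collect()
}

fn main() {
    let data = [1u8, 2, 3];
    assert_eq!(xor_checked(&data, 0xFF), xor_unchecked(&data, 0xFF));
    println!("{:?}", xor_unchecked(&data, 0xFF));
}
```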
