
No obvious way to xor byte stream with [u8; 4] #46

Open
Shnatsel opened this issue Jul 10, 2018 · 14 comments

@Shnatsel

I'm trying to get into SIMD by implementing a trivial operation: XOR unmasking of a byte stream as required by the WebSocket specification. The implementation in x86 intrinsics is actually very straightforward, but I have a hard time wrapping my head around expressing it in terms of Faster iterators API.

The part I'm having trouble with is getting an input [u8; 4] to cycle within a SIMD vector of u8. I have looked at:

  1. load(), which does accept &[u8] as input, but its behavior in case of a length mismatch is completely undocumented. It's also not obvious what the offset parameter does.
  2. Casting the input [u8; 4] to u32, calling vecs::u32s() and then downcasting repeatedly to get a SIMD vector of u8, but Downcast seems to do something else entirely.
  3. Getting a SIMD vector of length 4 with an arbitrary element type, loading [u8; 4] into it (the lengths now match, so it should work), then downcasting repeatedly until I get a vector of u8 of arbitrary length. Except there seems to be no way to request a SIMD vector of length 4 with an arbitrary element type.
  4. After over an hour of head-scratching I've noticed that From<u32x4> is implemented for u8x16, so I could replace Downcast with it in approach 2 and probably get the correct result, except I have no idea how such conversions interact with host endianness.

I actually expected this to be a trivial task. I guess for someone familiar with SIMD it is, but for the likes of me a snippet in the examples/ folder that loads [u8; 4] into a vector would go a long way. Or perhaps even a convenience function in the API that deals with endianness properly, to make it harder to mess up.

@AdamNiederer (Owner)

AdamNiederer commented Jul 11, 2018

There aren't any 32-bit SIMD registers on x86. Typically the "cross-platform" way to do something like that would be to repeat the 4 u8s across the width of a vector.
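A scalar sketch of what "repeat the 4 u8s across the width of a vector" means for a 16-byte SSE register; repeat_mask is a hypothetical helper for illustration, not part of faster's API:

```rust
// Hypothetical helper: cycle a 4-byte mask across a 16-byte buffer,
// i.e. the lane pattern a 128-bit SIMD register would hold.
fn repeat_mask(mask: [u8; 4]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (i, b) in out.iter_mut().enumerate() {
        *b = mask[i % 4];
    }
    out
}

fn main() {
    let lanes = repeat_mask([0xAA, 0xBB, 0xCC, 0xDD]);
    // The 4-byte pattern repeats four times across the register width.
    assert_eq!(&lanes[..8], &[0xAA, 0xBB, 0xCC, 0xDD, 0xAA, 0xBB, 0xCC, 0xDD]);
    println!("{:02x?}", lanes);
}
```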

The Stack Overflow thread you reference doesn't initialize the mask to anything, so I'm not sure about the exact pattern you're looking to XOR with your byte stream. However, you should be able to do something like

byte_stream.simd_iter().simd_map(|v| v ^ u32s(YOUR_4_BYTE_PATTERN_HERE).be_u8s()).scalar_fill(&mut output)

Casting the input [u8; 4] to u32, calling vecs::u32s() and then downcasting repeatedly to get a SIMD vector of u8, but Downcast seems to do not at all what I want.

Downcast preserves the numeric value of each element, or saturates it if it's too big to fit in the newly-sized integer. You can use the mem::transmute or faster's Transmute trait to re-type the vector without changing its contents.
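A scalar analogy of that distinction, using only std (no faster types), may make it clearer: a downcast-style conversion works on numeric values and saturates, while a transmute-style conversion just reinterprets the underlying bytes.

```rust
fn main() {
    // Downcast-style conversion preserves the numeric value,
    // saturating when it doesn't fit in the narrower integer:
    let wide: u16 = 0x1234;
    let saturated = if wide > u8::MAX as u16 { u8::MAX } else { wide as u8 };
    assert_eq!(saturated, 0xFF); // 0x1234 doesn't fit in a u8

    // Transmute-style re-typing keeps the raw bytes and only changes
    // how they are interpreted; nothing is recomputed:
    let bytes: [u8; 2] = wide.to_ne_bytes();
    assert_eq!(u16::from_ne_bytes(bytes), 0x1234);
}
```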

@Shnatsel (Author)

Typically the "cross-platform" way to do something like that would be to repeat the 4 u8s across the width of a vector.

Yes, this is exactly what I'm trying to do. However, currently the only way to do it is to convert [u8; 4] to a u32, call u32s() on that, and then convert the result to a vector of u8s, which requires me to care about host endianness. It would be great if Faster could expose an API to repeat [u8; 4] across the width of a vector without touching endianness at all.
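The endianness hazard in the u32 round-trip can be seen with std alone: the same 4 bytes produce different u32 values depending on which byte-order conversion you pick, and the original byte order is only reproduced when the read and write directions agree.

```rust
fn main() {
    let mask = [0x01u8, 0x02, 0x03, 0x04];

    // The same bytes yield different u32 values per byte order:
    let be = u32::from_be_bytes(mask);
    let le = u32::from_le_bytes(mask);
    assert_eq!(be, 0x0102_0304);
    assert_eq!(le, 0x0403_0201);

    // Splatting the u32 and reinterpreting as bytes recovers the
    // original order only if read and write endianness match:
    assert_eq!(be.to_be_bytes(), mask);
    assert_eq!(le.to_le_bytes(), mask);
}
```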

Thanks a lot for the code snippet! If I manage to get it to do what I want, I'll open a PR to add my code to examples.

@AdamNiederer (Owner)

You may be able to use the Endian trait (vec.to_be() and vec.to_le()) to ensure the endianness is correct across platforms. Since it's a computation on a constant, LLVM should be able to just compile the whole thing down to a movaps (or similar).

@Shnatsel (Author)

Ignoring endianness for now, the following code seems to do roughly what I want:

let mask_u32 = u32::from_bytes(mask);
let pattern = faster::u32s(mask_u32).be_u8s();
buf.simd_iter(u8s(0)).simd_map(|v| v ^ pattern).scalar_collect().truncate(buf.len())

However, this is 2x slower than scalar code that does the same in-place. I've tried rewriting this to mutate in-place with .simd_iter_mut(), but I'm not sure what to chain it with to mutate the input: simd_map() insists I should return something, and there are no uses of .simd_iter_mut() in examples. Is in-place mutation even supported at this point?
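For reference, the scalar in-place version being compared against looks roughly like this (a std-only sketch of RFC 6455 unmasking; apply_mask_in_place is an illustrative name, not from the benchmark):

```rust
// In-place scalar unmasking: byte i is XORed with mask[i % 4],
// as the WebSocket spec (RFC 6455) requires.
fn apply_mask_in_place(buf: &mut [u8], mask: [u8; 4]) {
    for (i, b) in buf.iter_mut().enumerate() {
        *b ^= mask[i % 4];
    }
}

fn main() {
    let mask = [0x10, 0x20, 0x30, 0x40];
    let mut buf = vec![0u8; 6];
    apply_mask_in_place(&mut buf, mask);
    assert_eq!(buf, vec![0x10, 0x20, 0x30, 0x40, 0x10, 0x20]);

    // XOR is an involution: masking twice restores the input.
    apply_mask_in_place(&mut buf, mask);
    assert_eq!(buf, vec![0u8; 6]);
}
```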

@AdamNiederer (Owner)

In-place mutation is in progress, but I think you may be better served by a scalar_fill call instead of scalar_collect, as the latter performs a heap allocation. You also shouldn't need to truncate the output buffer, as faster takes care of misaligned and arbitrarily-sized data for you.
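A scalar analogy of the collect-vs-fill distinction (std-only, no faster types; the closure and buffer names are illustrative): collecting allocates a fresh Vec on every call, while filling writes into a buffer the caller already owns.

```rust
fn main() {
    let input = [1u8, 2, 3, 4];

    // collect-style: allocates a new Vec each call.
    let collected: Vec<u8> = input.iter().map(|b| b ^ 0xFF).collect();

    // fill-style: writes into a caller-provided buffer; no allocation
    // happens inside the hot loop.
    let mut output = [0u8; 4];
    for (dst, src) in output.iter_mut().zip(&input) {
        *dst = src ^ 0xFF;
    }

    assert_eq!(collected, output);
}
```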

@Shnatsel (Author)

The input can be almost arbitrarily large and is not guaranteed to fit on the stack, so it's going to be either a heap allocation or an in-place mutation.

FWIW I've also tried the following, but it was 25% slower than .scalar_collect():

let mask_u32 = u32::from_bytes(mask);
let pattern = faster::u32s(mask_u32).be_u8s();
let mut output = vec![0u8; buf.len()];
buf.simd_iter(u8s(0)).simd_map(|v| v ^ pattern).scalar_fill(&mut output);

Also: curiously, on an AMD CPU with AVX, using RUSTFLAGS="-C target-cpu=native" actually makes the code 10% slower compared to RUSTFLAGS="-C target-cpu=x86-64" on a 10 KB input buffer.

@AdamNiederer (Owner)

Hm, that's interesting. Are you using Zen? I'll see if I can find anything weird in the disassembly and bench that code on my boxes.

@Shnatsel (Author)

Nope, plain old FX-4300. I'm on a nightly compiler, obviously, so code generation might vary from version to version. The compiler version I've tested this on is rustc 1.28.0-nightly (60efbdead 2018-06-23).

I'm benchmarking with Criterion, I can share the entire test harness if you're interested.

@AdamNiederer (Owner)

That would be awesome, thanks. You won't see a speedup with AVX compared to SSE2 for what you're doing, but compiling for your native CPU shouldn't slow you down by that much.

@Shnatsel (Author)

https://github.com/Shnatsel/tungstenite-rs/tree/mask-simd - it's under benches/ in branch mask-simd.
I've been working on it in tungstenite codebase (websocket protocol implementation in Rust), but it's a self-contained file and can probably be decoupled fairly easily.

@Shnatsel (Author)

Also, with SIMD disabled I get performance roughly equal to the naive per-byte XOR, apply_mask_fallback() in that file. The polyfill does not have to be that slow: for example, apply_mask_fast32() in the same file, which does not use SIMD instructions but operates on 32 bits at a time, is 20x faster than apply_mask_fallback().
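A std-only sketch of the "32 bits at a time" approach (analogous to apply_mask_fast32, though apply_mask_u32 here is an illustrative reimplementation, not the tungstenite code): XOR whole u32 words, then handle the tail bytes. Reading the mask with from_ne_bytes keeps the byte order consistent with the native-endian word loads, so no endianness fix-up is needed.

```rust
// XOR the mask over the buffer one u32 word at a time, then finish
// the remainder bytewise. Since the remainder starts at a multiple
// of 4, mask[i % 4] stays aligned with the word-sized part.
fn apply_mask_u32(buf: &mut [u8], mask: [u8; 4]) {
    let mask_u32 = u32::from_ne_bytes(mask);
    let mut chunks = buf.chunks_exact_mut(4);
    for chunk in &mut chunks {
        let word = u32::from_ne_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        chunk.copy_from_slice(&(word ^ mask_u32).to_ne_bytes());
    }
    for (i, b) in chunks.into_remainder().iter_mut().enumerate() {
        *b ^= mask[i % 4];
    }
}

fn main() {
    let mut buf = vec![0u8; 6];
    apply_mask_u32(&mut buf, [1, 2, 3, 4]);
    assert_eq!(buf, vec![1, 2, 3, 4, 1, 2]);
}
```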

@Shnatsel (Author)

Turns out AVX requires a switch to a higher power-consumption state, and this takes time; until the switch happens, the CPU runs at significantly lower frequencies. Using AVX is therefore not worthwhile unless the workload runs long enough that the time spent before the switch to the higher power state becomes negligible. Source

This is one possible explanation for the performance drop on AVX.

@AdamNiederer (Owner)

The main reason you're not seeing a speedup is because faster isn't well-optimized for AVX-only CPUs at the moment, and uses the SSE xor instruction. If you compile on a machine with AVX2, you will see the speedup.

Depending on how busy I am with work, a fix for this may or may not make it into the next release (along with runtime detection).

@AdamNiederer (Owner)

AdamNiederer commented Jul 13, 2018

Here's the main loop, according to cargo-disassemble

.LBB3_7:
	cmp	rax, rsi
	ja	.LBB3_8
	cmp	rax, rcx
	ja	.LBB3_23
	vpxor	xmm0, xmm1, xmmword ptr [r8 + rax]
	vmovdqu	xmmword ptr [rdx + rax], xmm0
	lea	rdi, [rax + 16]
	add	rax, 32
	cmp	rax, rsi
	mov	rax, rdi
	jbe	.LBB3_7

The jump at the end makes sense, but the two at the beginning shouldn't be there. It looks like a bounds check, which hints that I may be missing an unchecked load somewhere.
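For illustration (this is not faster's actual code), the checked-vs-unchecked access pattern in std-only Rust looks like this; get_unchecked is the usual way such a bounds check is elided once the index range has already been proven valid:

```rust
// Indexing with buf[i] compiles to a checked access that can panic;
// get_unchecked (inside unsafe) skips the check entirely, relying on
// the caller to guarantee the index is in range.
fn xor_checked(buf: &[u8], mask: u8) -> Vec<u8> {
    (0..buf.len()).map(|i| buf[i] ^ mask).collect()
}

fn xor_unchecked(buf: &[u8], mask: u8) -> Vec<u8> {
    (0..buf.len())
        // SAFETY: i < buf.len() by construction of the range.
        .map(|i| unsafe { buf.get_unchecked(i) } ^ mask)
        .collect()
}

fn main() {
    let data = [1u8, 2, 3];
    assert_eq!(xor_checked(&data, 0xFF), xor_unchecked(&data, 0xFF));
    println!("{:?}", xor_unchecked(&data, 0xFF));
}
```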
