No obvious way to xor byte stream with [u8; 4] #46
Comments
There aren't any 32-bit SIMD registers on x86. Typically the "cross-platform" way to do something like that would be to repeat the 4-byte pattern across the full vector width. The Stack Overflow thread you reference doesn't initialize the mask to anything, so I'm not sure about the exact pattern you're looking to xor with your byte stream. However, you should be able to do something like `byte_stream.simd_iter().simd_map(|v| v ^ u32s(YOUR_4_BYTE_PATTERN_HERE).be_u8s()).scalar_fill(&mut output)`.
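Spelled out as a complete function, that suggestion might look roughly like the sketch below. This is not the library's documented example, just a guess at the wiring: it assumes faster's prelude, uses the `simd_iter(u8s(0))` padding argument that appears later in this thread, and `0x1234_5678` is a placeholder for the real 4-byte mask.

```rust
use faster::*;

// Sketch only: broadcast a placeholder 4-byte pattern across a u32 vector,
// view it as (big-endian) bytes, and XOR it over the input stream into a
// caller-provided output buffer of the same length.
fn xor_with_pattern(byte_stream: &[u8], output: &mut [u8]) {
    let pattern = u32s(0x1234_5678).be_u8s();
    byte_stream
        .simd_iter(u8s(0))
        .simd_map(|v| v ^ pattern)
        .scalar_fill(output);
}
```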
Yes, this is exactly what I'm trying to do. However, currently the only way I can see to do it is to convert the mask to a `u32` first. Thanks a lot for the code snippet! If I manage to get it to do what I want, I'll open a PR to add my code to examples.
You may be able to use … then …
Ignoring endianness for now, the following code seems to do roughly what I want:

```rust
let mask_u32 = u32::from_bytes(mask);
let pattern = faster::u32s(mask_u32).be_u8s();
let mut unmasked = buf.simd_iter(u8s(0)).simd_map(|v| v ^ pattern).scalar_collect();
unmasked.truncate(buf.len());
```

However, this is 2x slower than scalar code that does the same in-place. I've tried rewriting this to mutate in-place with …
In-place mutation is in progress, but I think you may be better served by a …
The input can be almost arbitrarily large and is not guaranteed to fit on the stack, so it's going to be either a heap allocation or an in-place mutation. FWIW I've also tried the following, but it was 25% slower than the `scalar_collect()` version above:

```rust
let mask_u32 = u32::from_bytes(mask);
let pattern = faster::u32s(mask_u32).be_u8s();
let mut output = vec![0u8; buf.len()];
buf.simd_iter(u8s(0)).simd_map(|v| v ^ pattern).scalar_fill(&mut output);
```

Also: curiously, on an AMD CPU with AVX, compiling for the native CPU is actually slower.
Hm, that's interesting. Are you using Zen? I'll see if I can find anything weird in the disassembly and bench that code on my boxes.
Nope, plain old FX-4300. I'm on the nightly compiler, obviously, so code generation might vary from version to version; the compiler version I've tested this on is … I'm benchmarking with Criterion; I can share the entire test harness if you're interested.
That would be awesome, thanks. You won't see a speedup with AVX compared to SSE2 for what you're doing, but compiling for your native CPU shouldn't slow you down by that much.
https://github.com/Shnatsel/tungstenite-rs/tree/mask-simd - it's under benches/ in the mask-simd branch.
Also, with SIMD disabled I get performance roughly equal to the naive per-byte XOR.
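For reference, the naive per-byte XOR being used as the baseline here presumably looks something like the following; this is an assumption, since the actual scalar code lives in the linked tungstenite-rs branch:

```rust
// Hypothetical scalar baseline: cycle the 4-byte mask over the buffer in place.
// The real implementation benchmarked in the linked branch may differ.
fn xor_unmask_scalar(buf: &mut [u8], mask: [u8; 4]) {
    for (i, byte) in buf.iter_mut().enumerate() {
        *byte ^= mask[i % 4];
    }
}
```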
Turns out AVX requires a switch to a higher power consumption state, and this takes time; until that happens, the CPU runs at significantly lower frequencies. So using AVX is not worthwhile unless you're going to use it long enough that the time spent before the switch to the higher power state becomes negligible (source). This is one possible explanation for the performance drop on AVX.
The main reason you're not seeing a speedup is because … Depending on how busy I am with work, a fix for this may or may not make it into the next release (along with runtime detection).
Here's the main loop, according to the disassembly:

```asm
.LBB3_7:
        cmp     rax, rsi
        ja      .LBB3_8
        cmp     rax, rcx
        ja      .LBB3_23
        vpxor   xmm0, xmm1, xmmword ptr [r8 + rax]
        vmovdqu xmmword ptr [rdx + rax], xmm0
        lea     rdi, [rax + 16]
        add     rax, 32
        cmp     rax, rsi
        mov     rax, rdi
        jbe     .LBB3_7
```

The jump at the end makes sense, but the two at the beginning shouldn't be there. It looks like a bounds check, which hints that I may be missing an unchecked load somewhere.
I'm trying to get into SIMD by implementing a trivial operation: XOR unmasking of a byte stream as required by the WebSocket specification. The implementation in x86 intrinsics is actually very straightforward, but I have a hard time wrapping my head around expressing it in terms of Faster's iterator API.

The part I'm having trouble with is getting an input `[u8; 4]` to cycle within a SIMD vector of `u8`. I have looked at:

1. `load()`, which does accept `&[u8]` as input, but its behavior in case of a length mismatch is completely undocumented. It's also not obvious what the `offset` parameter does.
2. Converting `[u8; 4]` to `u32`, calling `vecs::u32s()` and then downcasting repeatedly to get a SIMD vector of `u8`, but `Downcast` seems to do not at all what I want.
3. Requesting a SIMD vector of length 4 and loading `[u8; 4]` into it (lengths now match, so it should work), then downcasting repeatedly until I get a vector of `u8` of arbitrary length. Except there seems to be no way to request a SIMD vector of length 4 and arbitrary type.
4. `From<u32x4>` is implemented for `u8x16`, so I could replace `Downcast` with it in approach 2 and probably get the correct result, except I have no idea how such conversions interact with host endianness.

I actually expected this to be a trivial task. I guess for someone familiar with SIMD it is, but for the likes of me a snippet in the examples/ folder that loads a `[u8; 4]` into a vector would go a long way. Or perhaps even a convenience function in the API that deals with endianness properly, to make it harder to mess up.
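For what it's worth, one way to pin down both the "cycling" requirement and the endianness question above is to write out the target byte layout in plain Rust (an illustration only, no Faster APIs involved; the function name is made up):

```rust
// Illustration only: the bytes a 16-lane (SSE2-width) u8 vector should hold
// so that XORing it against the stream applies mask[i % 4] to byte i.
fn repeated_mask_sse2(mask: [u8; 4]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (i, byte) in out.iter_mut().enumerate() {
        *byte = mask[i % 4];
    }
    out
}
```

Splatting `u32::from_ne_bytes(mask)` across the lanes and bit-reinterpreting the register as bytes should reproduce this layout regardless of host endianness, since both steps use native byte order; it is the explicit big- or little-endian conversions that can scramble it.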