Windows TLS Emulation is extremely slow #61

Open
mratsim opened this issue Dec 21, 2019 · 4 comments

Comments

@mratsim
Owner

mratsim commented Dec 21, 2019

Overhead-bound benchmarks like Fibonacci and Depth-First Search are significantly slower on Windows than on Linux and Mac.

Config: i9-9980XE 18 cores, 36 threads, with 4.1GHz all core Turbo

On Fibonacci in particular, the default eager futures take 14s under Windows versus 370ms under Linux, a whopping ~38x slowdown.
Lazy futures allocated via alloca take 800ms under Windows versus 180ms under Linux.
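
As a quick sanity check on the timings quoted above, the slowdown factors can be recomputed directly. This is only a sketch over the numbers in this comment; the dictionary layout and names are illustrative, not part of Weave.

```python
# Timings quoted in this issue, in milliseconds.
timings_ms = {
    "eager futures": {"windows": 14_000, "linux": 370},
    "lazy futures (alloca)": {"windows": 800, "linux": 180},
}

# Slowdown factor = Windows time / Linux time for the same benchmark.
slowdowns = {name: t["windows"] / t["linux"] for name, t in timings_ms.items()}

for name, factor in slowdowns.items():
    print(f"{name}: {factor:.1f}x slower on Windows")
```

Eager futures come out at roughly 38x and lazy futures at roughly 4.4x, so the gap shrinks a lot once the futures no longer go through the heap.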

This points to a memory allocator issue.

Memory-bound benchmarks (transpose) and CPU-bound benchmarks (Black-Scholes) seem to behave somewhat similarly to Linux.

Similar issues:

Low priority, as we probably can't do anything more than what we already have in our memory subsystem. It's doubtful that even using Mimalloc on Windows (just for Weave) would help, as our memory pool is based on the same techniques. Lastly, Fibonacci is an extreme case with a computation load of 1 cycle, while Weave targets being efficient at 2000 cycles.
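
To illustrate why a 1-cycle task is such an extreme case compared to the 2000-cycle target, here is a minimal sketch of a fixed-overhead model. The `efficiency` function and the per-task overhead values are hypothetical, chosen for illustration only; they are not measurements of Weave or of any allocator.

```python
def efficiency(task_cycles: float, overhead_cycles: float) -> float:
    """Fraction of time spent on useful work when each task pays a fixed overhead."""
    return task_cycles / (task_cycles + overhead_cycles)

# Hypothetical per-task scheduling/allocation overheads, in cycles (not measured).
for overhead in (50, 200, 1000):
    tiny = efficiency(1, overhead)        # Fibonacci-like task
    target = efficiency(2000, overhead)   # grain size Weave targets
    print(f"overhead={overhead:>5} cycles: "
          f"1-cycle task {tiny:.2%} useful, 2000-cycle task {target:.2%} useful")
```

Under this simple model, any change in per-task overhead translates almost one-to-one into runtime for Fibonacci, while it is largely amortized at the 2000-cycle grain size.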

TODO: benchmark Cilk and TBB to make sure we are not missing something.

@mratsim
Owner Author

mratsim commented Dec 28, 2019

Not sure if it's only the allocator. Even GEMM, which shouldn't allocate much, is abysmally slow:

Compiled with MSVC

$ ./build/weave_gemm.exe

Backend:                        Weave (Pure Nim)
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Weave implementation
Collected 300 samples in 6679 ms
Average time: 21.157 ms
Stddev  time: 42.421 ms
Min     time: 11.000 ms
Max     time: 350.000 ms
Perf:         669.093 GFLOP/s

as of https://github.com/mratsim/weave/tree/7e414ad5fc59920a7e97930e636dd411f323c860
Performance on Linux is 2250 GFLOP/s on that commit (i.e. without #77)
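
The header figures in the benchmark output above can be rederived from the matrix shapes. The sketch below assumes the usual `2*M*N*K` operation count for GEMM and that "Required bytes" counts only the two input matrices (an inference, since that is what matches the reported 29.491 MB); the names are illustrative and not taken from the benchmark source.

```python
M = N = K = 1920            # matrix dimensions from the benchmark header
BYTES_PER_FLOAT32 = 4
PEAK_MULTI_GFLOPS = 4032.0  # "Theoretical peak multi" from the output

flops = 2 * M * N * K       # one multiply and one add per inner-product term
bytes_required = (M * K + K * N) * BYTES_PER_FLOAT32  # assumption: A and B only
intensity = flops / bytes_required                    # FLOP per byte

def gflops(avg_time_ms: float) -> float:
    """Throughput in GFLOP/s for one GEMM taking avg_time_ms milliseconds."""
    return flops / (avg_time_ms * 1e-3) / 1e9

msvc = gflops(21.157)       # average time reported for the MSVC build
print(f"operations: {flops / 1e6:.3f} millions")
print(f"bytes:      {bytes_required / 1e6:.3f} MB")
print(f"intensity:  {intensity:.3f} FLOP/byte")
print(f"MSVC:       {msvc:.1f} GFLOP/s ({msvc / PEAK_MULTI_GFLOPS:.1%} of peak)")
print(f"Linux:      2250.0 GFLOP/s ({2250.0 / PEAK_MULTI_GFLOPS:.1%} of peak)")
```

The recomputed throughput differs from the reported 669.093 GFLOP/s only in the last digits, because the average time in the output is rounded.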

@mratsim
Owner Author

mratsim commented Dec 29, 2019

So we shouldn't blame everything on Microsoft: Nim's TLS emulation, which seems to be turned on for both MSVC and GCC, shows up in VTune:

[image: VTune profile screenshot]

@mratsim changed the title from "Windows memory allocator is extremely slow" to "Windows is extremely slow" on Dec 29, 2019
@mratsim
Owner Author

mratsim commented Dec 29, 2019

We can get about 3x the performance on both Fibonacci and GEMM by disabling TLS emulation. GEMM does segfault from time to time, though.

With Clang on Windows:

Backend:                        Weave (Pure Nim)
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Weave implementation
Collected 300 samples in 2705 ms
Average time: 7.803 ms
Stddev  time: 4.028 ms
Min     time: 7.000 ms
Max     time: 44.000 ms
Perf:         1814.068 GFLOP/s

MSVC can only reach 1.2 TFLOP/s. Given that the microkernel is pure intrinsics, the generated code should be the same.
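
To put the GEMM figures quoted in this thread side by side, here is a small sketch that normalizes them against the first MSVC run and against the Linux result. The labels are illustrative, and the 1.2 TFLOP/s figure is taken as an approximate 1200 GFLOP/s.

```python
# GEMM throughput figures quoted in this thread, in GFLOP/s.
results = {
    "MSVC (first run)": 669.093,
    "MSVC (best quoted)": 1200.0,  # "1.2 TFLOP/s" in the comment, so approximate
    "Clang on Windows": 1814.068,
    "Linux": 2250.0,
}

baseline = results["MSVC (first run)"]
linux = results["Linux"]

for name, g in results.items():
    print(f"{name:<20} {g:8.1f} GFLOP/s  "
          f"{g / baseline:4.2f}x vs first MSVC run  {g / linux:6.1%} of Linux")
```

On these numbers the Clang build is about 2.7x the first MSVC run and reaches roughly 80% of the Linux throughput.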

@mratsim changed the title from "Windows is extremely slow" to "Windows TLS Emulation is extremely slow" on Dec 29, 2019
@Araq

Araq commented Dec 30, 2019

I wonder why tlsEmulation is still turned on after all these years... cough
