-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding CPU features detection code #468
base: develop
Are you sure you want to change the base?
Conversation
/* cpu_features.c -- Processor features detection. | ||
* | ||
* Copyright 2018 The Chromium Authors. All rights reserved. | ||
* Use of this source code is governed by a BSD-style license that can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@madler I'm looking for your feedback concerning the licensing of the new code.
Chromium's license can be found at: https://cs.chromium.org/chromium/src/LICENSE
I believe we may be able to adjust it to follow ZLIB license if that is your preference.
@@ -1,4 +1,4 @@ | |||
cmake_minimum_required(VERSION 2.4.4) | |||
cmake_minimum_required(VERSION 2.6) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMake 2.6 is the minimal to be able to use 'Threads' macro and later on add it to the libraries to be linked to zlib.
#include <cpuid.h> | ||
#endif | ||
|
||
/* TODO(cavalcantii): remove checks for x86_flags on deflate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I imported the code pretty much as it is from:
https://cs.chromium.org/chromium/src/third_party/zlib/cpu_features.c
To clarify: this TODO is not applicable within the context of this patch.
static void _cpu_check_features(void); | ||
#endif | ||
|
||
#if defined(ARMV8_OS_ANDROID) || defined(ARMV8_OS_LINUX) || defined(ARMV8_OS_FUCHSIA) || defined(X86_NOT_WINDOWS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An explanation here: we have to support many different combinations in Chromium (e.g. Android without SIMD, with SIMD, 32bit, 64bit, etc) and operating systems e.g. Fuchsia, WoA (Windows on ARM), etc.
Unfortunately, there isn't much else I can do to make the code simpler.
:-(
/* On x86 we simply use a instruction to check the CPU features. | ||
* (i.e. CPUID). | ||
*/ | ||
static void _cpu_check_features(void) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though we don't use x86 CPU features flags for now, they will be needed for the next patch (i.e. optimized insert_string).
@@ -648,15 +649,7 @@ unsigned long ZEXPORT crc32_z(crc, buf, len) | |||
z_size_t last, last2, i; | |||
z_size_t num; | |||
|
|||
/* Return initial CRC, if requested. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hooked into the portable code to save some code duplication.
As a result, for ARMv8 we will build both implementations (e.g. portable + optimized) and decide which one to use at runtime.
We also cache the CPU features detection flags to avoid wasting CPU cycles.
* detection early (and infrequently) on. | ||
*/ | ||
if (!len) | ||
cpu_check_features(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An explanation here: in Chromium code base, we have client code (e.g. network code) that may rely on using zlib's crc32.
There is no guarantee that the client code is performing compression/decompression of data, so we have to cover that corner case by ensuring that we perform CPU features detection in a scenario where the user only rely on calling crc32() (e.g. first to get an initial valid crc32 value and next with a real data vector).
// Feature detection is not triggered while using RAW mode (i.e. we never | ||
// call crc32() with a NULL buffer). | ||
#if defined(CRC32_ARMV8_CRC32) || defined(CRC32_SIMD_SSE42_PCLMUL) | ||
cpu_check_features(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another valid entry point to perform CPU features detection, covering the scenario where we perform compression using RAW mode.
I found this corner case last December while V8 was starting to use zlib's checksums for speeding up loading snapshots (e.g. code blobs).
I think we should target to have in near future an optimized CRC32 for intel as we have this done already, for reference:
https://cs.chromium.org/chromium/src/third_party/zlib/crc32_simd.c
|
||
#include "zlib.h" | ||
|
||
/* TODO(cavalcantii): remove checks for x86_flags on deflate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous note about importing the code.
return; /* Report nothing if ASIMD(NEON) is missing */ | ||
arm_cpu_enable_crc32 = !!(features & ZX_ARM64_FEATURE_ISA_CRC32); | ||
arm_cpu_enable_pmull = !!(features & ZX_ARM64_FEATURE_ISA_PMULL); | ||
#elif defined(ARMV8_OS_WINDOWS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for WoA (Windows on ARM).
static pthread_once_t cpu_check_inited_once = PTHREAD_ONCE_INIT; | ||
void ZLIB_INTERNAL cpu_check_features(void) | ||
{ | ||
pthread_once(&cpu_check_inited_once, _cpu_check_features); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that we could easily change this to make it optional to have threaded synchronization for CPU features detection.
That would allow users to opt-in into the features (i.e. if they plan to run the code in a multithreaded app), allowing zlib to keep its required set of dependencies minimal as it is today.
The implementation would be something like:
#if defined(HAVE_THREAD_SUPPORT)
// here goes code using POSIX thread API
#else
// Simply call _cpu_check_features().
#endif
@madler what you think?
@Adenilson do you want to compare your solution to this one. Both use non-GPL code from Chrome to accelerate CRC. The CloudFlare patch has a few features:
The benchmarks show robust performance across a wide range of systems. I have only examined x86 support. |
@neurolabusc I left a few comments on the patch you pointed out, I hope it helps. For the perspective of CPU features detection, unless I'm mistaken, it uses __get_cpuid() intrinsic which is the same apporach we use on Chromium's zlib ToT (https://cs.chromium.org/chromium/src/third_party/zlib/cpu_features.c?l=131). But for some reason, the patch you pointed doesn't perform detection on ARM? ARM is a platform that is supported by Chromium, is there a reason to drop support for it? Proceeding with the analysis: the extra function call to has_pcmul() at the beginning of crc32() function is troubling, when I implemented the optimized crc32 on Chromium's zlib for ARM, this kind of extraneous function call was enough to slow down things (i.e. which is why we cache the results of CPU features detection and have in a cold branch at the start of the function). Finally, for a multithreaded/multiprocess application (i.e. Chromium) there is the requirement to have some kind of thread barrier to avoid a race condition while performing CPU features detection. That requires OS specific thread synchronization functions and the code we distribute on Chromium is vastly tested for the last couple years. I understand that some zlib users may not have this kind of requirement, which is why I suggested that we can make this an opt-in feature. |
@Adenilson Thanks for your useful comments. I will try to address each in an update.
|
Hi @Adenilson, I don't think using
Although it may not be available for Windows systems, this can be the default for Linux and latest Android (bionic has support starting with Android Q), at least. The current approach can serve as a fallback. PRs #457 #458 #459 use it and have some infrastructure code that can be reused by other archs to implement ifuncs as well. |
@mscastanho thanks for the comments. Chromium's use case is a bit challenging, as we have to a) Run the code inside of the sandbox (i.e. while inside of a RendererProcess); b) Support really old versions of Android (i.e. kitkat 4.4). So I'm afraid an ifunc-based solution wouldn't scale for 2 of the required OSes (i.e. Android and Windows). Linux may be a possibility, but I would have to confirm if it would still work properly while running inside of the sandbox, as zlib runs both sandboxed (i.e. Renderer) as also in a privileged level (e.g. network operations). That being said, to justify maintaining the code only for Linux (and CrOS), there would have to be some advantage (e.g. performance). Do you happen to have some performance numbers comparing a pthread based implementation vs using ifunc? |
@Adenilson , @mscastanho is not suggesting to remove the support for this new mechanism based on But, as there are operating systems that still do not support ifunc, it's fair to add extra code in those scenarios in order to provide a similar functionality. However, it's unfair to require that operating systems that already provide ifunc have to also maintain extra code in order to do the same thing. In other words: I do not oppose adding this new mechanism, but I think it must not be mandatory for all the operating systems. The operating systems that already support ifunc should be able to use it. |
@tuliom my point is that any extra code comes with a price: maintainers time + zlib performance. To justify adding new code, the advantages (e.g. performance) must significantly surpass the cost of maintaining the code. The documentation referenced by @mscastanho points that (quote): "Unfortunately there are actually a lot of restrictions placed on IFUNC usage which aren't entirely clear and the documentation needs to be updated.". That is why I asked for performance numbers to help assess the cost/benefit of using ifunc. I decided to go ahead and perform a few experiments with ifunc on a) Performance: median improvement on x86 was around 1.14% for decompression and 0.5% (i.e. less than 1%). On ARM it was kind of a mixed bag: seems to help slightly (0.4% to 0.9% on little cores@aarch64) but hurts compression performance on big cores (-1.6% on a Pixel4) with little to no gain on decompression (0.24%). b) Ecosystem: as I've explained in my previous comment, in Chrome we need to support really old Android devices released way back in 2012. But even with an Android 10 device (Pixel 4), there are issues related with proper linker/loader support i.e. invoking in the 'resolver' any function not defined within the compilation unit will crash the client code i.e. cpu_check_features(). To complicate things, ifunc was enabled by default only on gcc 8.x (released in 2018) on ARM. That means that anyone running Ubuntu < 19.x won't have compilers supporting the feature by default (i.e. you can't even build the code using 'ifunc'). As an example, AWS Graviton (i.e. ARM big core) instances latest available Ubuntu version is 18.04 LTS. On Chrome we have extra restrictions (e.g. NaCL) that also disables support for using ifunc (notice the red bots in the link I posted at the start of the message). Contrast that with using pthread_once, that is part of a spec published 25 years ago (https://standards.ieee.org/standard/1003_1c-1995.html). Any C compiler that can link with a pthread library will work just fine. Wrapping up: I feel we have higher impact optimizations with broad ecosystem support to land on zlib before we start looking for alternative optimizations. |
@madler ping? |
1efeedf
to
df9eb9a
Compare
322b3c9
to
81db790
Compare
Some operations speed (e.g. crc-32) can be considerably boosted (e.g. 10x) by leveraging special instructions available on modern CPUs. The issue is that we must perform CPU features detection at run time and that can be easy (x86) or tricky (Arm) depending on the context (e.g. OS, running in a sandbox, multithreaded context, etc). This patch imports the code we are currently shipping on Chromium's zlib. The following operating systems are supported (Android, ChromeOS, Windows, Fuchsia, Linux) as also both ARM and Intel CPUs, running on both 64bit and 32bit. Caveats: only linux@x86-64bit, linux@aarch64 and Android@32bit were tested with the current CMake buildsystem. That means that users may have to add the proper defines (e.g. ARMV8_OS_FUCHSIA) if building for another target/environment. Change-Id: Idc57bae6bb061c5103350cc781e829bb8afc86e2
81db790
to
ce456fe
Compare
Some operations speed (e.g. crc-32) can be considerably boosted (e.g. 10x) by
leveraging special instructions available on modern CPUs.
The issue is that we got perform CPU features detection at run time and that
can be easy (x86) or tricky (ARM) depending on the context (e.g. OS, running
in a sandbox, multithreaded context, etc).
This patch will import the code we are currently shipping on Chromium's zlib.
The following operating systems are supported (Android, ChromeOS, Windows,
Fuchsia, Linux) as also both ARM and Intel CPUs, running on both 64bit and 32bit.
Caveats: only linux@x86-64bit and linux@aarch64 were tested with the current
CMake buildsystem. That means that users may have to add the proper defines
(e.g. ARMV8_OS_FUCHSIA) if building for another target/environment.