Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding CPU features detection code #468

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

Adenilson
Copy link

Some operations speed (e.g. crc-32) can be considerably boosted (e.g. 10x) by
leveraging special instructions available on modern CPUs.

The issue is that we got perform CPU features detection at run time and that
can be easy (x86) or tricky (ARM) depending on the context (e.g. OS, running
in a sandbox, multithreaded context, etc).

This patch will import the code we are currently shipping on Chromium's zlib.
The following operating systems are supported (Android, ChromeOS, Windows,
Fuchsia, Linux) as also both ARM and Intel CPUs, running on both 64bit and 32bit.

Caveats: only linux@x86-64bit and linux@aarch64 were tested with the current
CMake buildsystem. That means that users may have to add the proper defines
(e.g. ARMV8_OS_FUCHSIA) if building for another target/environment.

/* cpu_features.c -- Processor features detection.
*
* Copyright 2018 The Chromium Authors. All rights reserved.
* Use of this source code is governed by a BSD-style license that can be
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@madler I'm looking for your feedback concerning the licensing of the new code.

Chromium's license can be found at: https://cs.chromium.org/chromium/src/LICENSE

I believe we may be able to adjust it to follow ZLIB license if that is your preference.

@@ -1,4 +1,4 @@
cmake_minimum_required(VERSION 2.4.4)
cmake_minimum_required(VERSION 2.6)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake 2.6 is the minimal to be able to use 'Threads' macro and later on add it to the libraries to be linked to zlib.

#include <cpuid.h>
#endif

/* TODO(cavalcantii): remove checks for x86_flags on deflate.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imported the code pretty much as it is from:
https://cs.chromium.org/chromium/src/third_party/zlib/cpu_features.c

To clarify: this TODO is not applicable within the context of this patch.

static void _cpu_check_features(void);
#endif

#if defined(ARMV8_OS_ANDROID) || defined(ARMV8_OS_LINUX) || defined(ARMV8_OS_FUCHSIA) || defined(X86_NOT_WINDOWS)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An explanation here: we have to support many different combinations in Chromium (e.g. Android without SIMD, with SIMD, 32bit, 64bit, etc) and operating systems e.g. Fuchsia, WoA (Windows on ARM), etc.

Unfortunately, there isn't much else I can do to make the code simpler.
:-(

/* On x86 we simply use a instruction to check the CPU features.
* (i.e. CPUID).
*/
static void _cpu_check_features(void)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though we don't use x86 CPU features flags for now, they will be needed for the next patch (i.e. optimized insert_string).

@@ -648,15 +649,7 @@ unsigned long ZEXPORT crc32_z(crc, buf, len)
z_size_t last, last2, i;
z_size_t num;

/* Return initial CRC, if requested. */
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hooked into the portable code to save some code duplication.

As a result, for ARMv8 we will build both implementations (e.g. portable + optimized) and decide which one to use at runtime.

We also cache the CPU features detection flags to avoid wasting CPU cycles.

* detection early (and infrequently) on.
*/
if (!len)
cpu_check_features();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An explanation here: in Chromium code base, we have client code (e.g. network code) that may rely on using zlib's crc32.

There is no guarantee that the client code is performing compression/decompression of data, so we have to cover that corner case by ensuring that we perform CPU features detection in a scenario where the user only rely on calling crc32() (e.g. first to get an initial valid crc32 value and next with a real data vector).

// Feature detection is not triggered while using RAW mode (i.e. we never
// call crc32() with a NULL buffer).
#if defined(CRC32_ARMV8_CRC32) || defined(CRC32_SIMD_SSE42_PCLMUL)
cpu_check_features();
Copy link
Author

@Adenilson Adenilson Jan 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another valid entry point to perform CPU features detection, covering the scenario where we perform compression using RAW mode.

I found this corner case last December while V8 was starting to use zlib's checksums for speeding up loading snapshots (e.g. code blobs).

I think we should target to have in near future an optimized CRC32 for intel as we have this done already, for reference:
https://cs.chromium.org/chromium/src/third_party/zlib/crc32_simd.c


#include "zlib.h"

/* TODO(cavalcantii): remove checks for x86_flags on deflate.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous note about importing the code.

return; /* Report nothing if ASIMD(NEON) is missing */
arm_cpu_enable_crc32 = !!(features & ZX_ARM64_FEATURE_ISA_CRC32);
arm_cpu_enable_pmull = !!(features & ZX_ARM64_FEATURE_ISA_PMULL);
#elif defined(ARMV8_OS_WINDOWS)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for WoA (Windows on ARM).

@Adenilson Adenilson requested a review from madler January 22, 2020 01:43
static pthread_once_t cpu_check_inited_once = PTHREAD_ONCE_INIT;
void ZLIB_INTERNAL cpu_check_features(void)
{
pthread_once(&cpu_check_inited_once, _cpu_check_features);
Copy link
Author

@Adenilson Adenilson Jan 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that we could easily change this to make it optional to have threaded synchronization for CPU features detection.

That would allow users to opt-in into the features (i.e. if they plan to run the code in a multithreaded app), allowing zlib to keep its required set of dependencies minimal as it is today.

The implementation would be something like:
#if defined(HAVE_THREAD_SUPPORT)
// here goes code using POSIX thread API
#else
// Simply call _cpu_check_features().
#endif

@madler what you think?

@neurolabusc
Copy link

@Adenilson do you want to compare your solution to this one. Both use non-GPL code from Chrome to accelerate CRC. The CloudFlare patch has a few features:

  • The CloudFlare branch includes several other robust methods for acceleration.
    -The CLMUL detection works on older compilers (gcc 4.8.4) where the cpuid and pthread_once have issues.
    -It includes several of @ningfei's updates to the cmake that support modern Microsoft compilers.

The benchmarks show robust performance across a wide range of systems. I have only examined x86 support.

@Adenilson
Copy link
Author

Adenilson commented Jan 28, 2020

@Adenilson do you want to compare your solution to this one. Both use non-GPL code from Chrome to accelerate CRC. The CloudFlare patch has a few features:

@neurolabusc I left a few comments on the patch you pointed out, I hope it helps.

For the perspective of CPU features detection, unless I'm mistaken, it uses __get_cpuid() intrinsic which is the same apporach we use on Chromium's zlib ToT (https://cs.chromium.org/chromium/src/third_party/zlib/cpu_features.c?l=131).

But for some reason, the patch you pointed doesn't perform detection on ARM? ARM is a platform that is supported by Chromium, is there a reason to drop support for it?

Proceeding with the analysis: the extra function call to has_pcmul() at the beginning of crc32() function is troubling, when I implemented the optimized crc32 on Chromium's zlib for ARM, this kind of extraneous function call was enough to slow down things (i.e. which is why we cache the results of CPU features detection and have in a cold branch at the start of the function).

Finally, for a multithreaded/multiprocess application (i.e. Chromium) there is the requirement to have some kind of thread barrier to avoid a race condition while performing CPU features detection.

That requires OS specific thread synchronization functions and the code we distribute on Chromium is vastly tested for the last couple years.

I understand that some zlib users may not have this kind of requirement, which is why I suggested that we can make this an opt-in feature.

@neurolabusc
Copy link

@Adenilson Thanks for your useful comments. I will try to address each in an update.

  • The variable cpu_has_pclmul is set to -1 and is set to either 0 or 1 after first call to CRC. After that it the function returns the static value without having to read the cpuid. I would think this is very efficient, does not require a separate initialization and ensures the correct result even on a computer where the pcmul has been disabled in bios.

  • I only added the x86-64 because that is the only system I could test, and the platform used in my field. I would suggest I could create a first pass of adding your suggestions to the CloudFlare zlib and then you could refine my changes and validate support on other platforms. The CloudFlare zlib seems to have made the fewest changes to the zlib for the biggest benefit. Integrating your ideas would be a real help to the community.

@mscastanho
Copy link

Hi @Adenilson, I don't think using pthread_once is the best approach for Linux since ifunc is available. That's a more robust mechanism than using pthread_once. You can implement the detection code on the resolver and provide optimized functions for each case. The advantages:

  • Feature detection is done lazily, adding minimum overhead only to the first call to the function.
  • Since the linker updates the PLT, later calls to the optimized function will incur no extra overhead.
  • Supports multiple threads simultaneously.
  • Both GCC and clang support it

Although it may not be available for Windows systems, this can be the default for Linux and latest Android (bionic has support starting with Android Q), at least. The current approach can serve as a fallback.

PRs #457 #458 #459 use it and have some infrastructure code that can be reused by other archs to implement ifuncs as well.

@Adenilson
Copy link
Author

@mscastanho thanks for the comments.

Chromium's use case is a bit challenging, as we have to a) Run the code inside of the sandbox (i.e. while inside of a RendererProcess); b) Support really old versions of Android (i.e. kitkat 4.4).

So I'm afraid an ifunc-based solution wouldn't scale for 2 of the required OSes (i.e. Android and Windows).

Linux may be a possibility, but I would have to confirm if it would still work properly while running inside of the sandbox, as zlib runs both sandboxed (i.e. Renderer) as also in a privileged level (e.g. network operations).

That being said, to justify maintaining the code only for Linux (and CrOS), there would have to be some advantage (e.g. performance).

Do you happen to have some performance numbers comparing a pthread based implementation vs using ifunc?

@tuliom
Copy link

tuliom commented Feb 4, 2020

So I'm afraid an ifunc-based solution wouldn't scale for 2 of the required OSes (i.e. Android and Windows).
...
That being said, to justify maintaining the code only for Linux...

@Adenilson , @mscastanho is not suggesting to remove the support for this new mechanism based on pthread_once.
He's just saying that ifunc is already supported in most Linux distributions (Debian, Ubuntu, Fedora, RHEL, Open Suse, SLES) and the latest Android.
It does all of that without requiring extra code in zlib, because that code is already available in the compiler, linker and loaders: it's just the new function and the resolver.

But, as there are operating systems that still do not support ifunc, it's fair to add extra code in those scenarios in order to provide a similar functionality. However, it's unfair to require that operating systems that already provide ifunc have to also maintain extra code in order to do the same thing.

In other words: I do not oppose adding this new mechanism, but I think it must not be mandatory for all the operating systems. The operating systems that already support ifunc should be able to use it.

@Adenilson
Copy link
Author

Adenilson commented Feb 8, 2020

@tuliom my point is that any extra code comes with a price: maintainers time + zlib performance.

To justify adding new code, the advantages (e.g. performance) must significantly surpass the cost of maintaining the code.

The documentation referenced by @mscastanho points that (quote): "Unfortunately there are actually a lot of restrictions placed on IFUNC usage which aren't entirely clear and the documentation needs to be updated.".

That is why I asked for performance numbers to help assess the cost/benefit of using ifunc.

I decided to go ahead and perform a few experiments with ifunc on
Chrome's zlib (https://chromium-review.googlesource.com/c/chromium/src/+/2041483) and the results are a bit lacking from both a performance as also ecosystem's perspective.

a) Performance: median improvement on x86 was around 1.14% for decompression and 0.5% (i.e. less than 1%). On ARM it was kind of a mixed bag: seems to help slightly (0.4% to 0.9% on little cores@aarch64) but hurts compression performance on big cores (-1.6% on a Pixel4) with little to no gain on decompression (0.24%).

b) Ecosystem: as I've explained in my previous comment, in Chrome we need to support really old Android devices released way back in 2012. But even with an Android 10 device (Pixel 4), there are issues related with proper linker/loader support i.e. invoking in the 'resolver' any function not defined within the compilation unit will crash the client code i.e. cpu_check_features().

To complicate things, ifunc was enabled by default only on gcc 8.x (released in 2018) on ARM. That means that anyone running Ubuntu < 19.x won't have compilers supporting the feature by default (i.e. you can't even build the code using 'ifunc').

As an example, AWS Graviton (i.e. ARM big core) instances latest available Ubuntu version is 18.04 LTS.

On Chrome we have extra restrictions (e.g. NaCL) that also disables support for using ifunc (notice the red bots in the link I posted at the start of the message).

Contrast that with using pthread_once, that is part of a spec published 25 years ago (https://standards.ieee.org/standard/1003_1c-1995.html). Any C compiler that can link with a pthread library will work just fine.

Wrapping up: I feel we have higher impact optimizations with broad ecosystem support to land on zlib before we start looking for alternative optimizations.

@Adenilson
Copy link
Author

@madler ping?

crc32.c Outdated Show resolved Hide resolved
@Adenilson Adenilson force-pushed the developChromium branch 2 times, most recently from 322b3c9 to 81db790 Compare March 12, 2020 21:59
Some operations speed (e.g. crc-32) can be considerably boosted
(e.g. 10x) by leveraging special instructions available on
modern CPUs.

The issue is that we must perform CPU features detection at run
time and that can be easy (x86) or tricky (Arm) depending on the
context (e.g. OS, running in a sandbox, multithreaded context, etc).

This patch imports the code we are currently shipping on
Chromium's zlib. The following operating systems are supported
(Android, ChromeOS, Windows, Fuchsia, Linux) as also both ARM and
Intel CPUs, running on both 64bit and 32bit.

Caveats: only linux@x86-64bit, linux@aarch64 and Android@32bit were
tested with the current CMake buildsystem. That means that users may
have to add the proper defines (e.g. ARMV8_OS_FUCHSIA) if building for
another target/environment.

Change-Id: Idc57bae6bb061c5103350cc781e829bb8afc86e2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants