Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace memcmp to s2n_constant_time_equals #4709

Merged
merged 11 commits into from
Sep 5, 2024

Conversation

boquan-fang
Copy link
Contributor

@boquan-fang boquan-fang commented Aug 15, 2024

Resolved issues:

Solving issue #3062

Description of changes:

Change most of memcmp usages to s2n_constant_time_equals.

There are two parts that weren't change, which are in s2n_cipher_suits.c (line 1110) and s2n_config.c (line 323). Both of those parts use the return integer from memcmp which s2n_constant_time_equals can't do.

Regression Tests Results

We recently added new regression tests that lets us know the instruction count differences over some common codepaths.

--------------------------------------------------------------------------------
-- Summary
--------------------------------------------------------------------------------
Ir____________ 

1,707 (100.0%)  PROGRAM TOTALS

--------------------------------------------------------------------------------
-- File:function summary
--------------------------------------------------------------------------------
  Ir____________________  file:function

< 2,190 (128.3%, 128.3%)  /home/ubuntu/workspace/s2n-tls/bindings/rust/s2n-tls-sys/lib/utils/s2n_safety.c:s2n_constant_time_equals

<  -384 (-22.5%, 105.8%)  ./string/../sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S:__memcmp_avx2_movbe

<   -62  (-3.6%, 102.2%)  /rust/deps/hashbrown-0.14.5/src/raw/mod.rs:
    -58  (-3.4%)            hashbrown::map::HashMap<K,V,S,A>::insert
     -4  (-0.2%)            core::iter::adapters::try_process

<   -34  (-2.0%, 100.2%)  /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/../../stdarch/crates/core_arch/src/x86/sse2.rs:
    -30  (-1.8%)            hashbrown::map::HashMap<K,V,S,A>::insert
     -2  (-0.1%)            hashbrown::raw::RawTable<T,A>::reserve_rehash
     -2  (-0.1%)            core::iter::adapters::try_process

This shows us that there is about a 1,707 instruction count penalty for shifting to s2n_constant_time_equals. The main differences are the additional 2,190 instruction from s2n_constant_time_equals and the 384 instruction count reduction from memcmp. Relative to the 79,292,794 instructions for a complete RSA handshake, this is a 0.00215 % regression in handshake performance.

Note: The hashbrown instruction count differences are due to randomness between runs.

Handshake Benchmarks

We also have benchmarks covering handshake and throughput performance. Comparing these benchmarks between this PR and mainline we see the following differences.


handshake-rsa2048/s2n-tls
                        time:   [1.0697 ms 1.0702 ms 1.0707 ms]
                        change: [-0.8251% +0.1691% +1.2108%] (p = 0.78 > 0.05)
                        No change in performance detected.

handshake-rsa3072/s2n-tls
                        time:   [2.9331 ms 2.9364 ms 2.9396 ms]
                        change: [-0.1189% +0.0360% +0.1889%] (p = 0.65 > 0.05)
                        No change in performance detected.
                        
handshake-rsa4096/s2n-tls
                        time:   [6.2232 ms 6.2324 ms 6.2420 ms]
                        change: [-0.1699% +0.0467% +0.2659%] (p = 0.68 > 0.05)
                        No change in performance detected.

handshake-ecdsa384/s2n-tls
                        time:   [1.1160 ms 1.1162 ms 1.1164 ms]
                        change: [-0.1847% +0.0291% +0.2434%] (p = 0.81 > 0.05)
                        No change in performance detected.

handshake-ecdsa256/s2n-tls
                        time:   [398.52 µs 398.62 µs 398.74 µs]
                        change: [+0.3657% +0.7706% +1.1642%] (p = 0.00 < 0.05)
                        Change within noise threshold.

throughput-AES_128_GCM_SHA256/s2n-tls
                        time:   [108.28 µs 108.70 µs 109.14 µs]
                        thrpt:  [873.83 MiB/s 877.37 MiB/s 880.71 MiB/s]
                 change:
                        time:   [+0.8096% +1.3883% +1.9335%] (p = 0.00 < 0.05)
                        thrpt:  [-1.8968% -1.3693% -0.8031%]
                        Change within noise threshold.

throughput-AES_256_GCM_SHA384/s2n-tls
                        time:   [118.10 µs 118.55 µs 119.01 µs]
                        thrpt:  [801.35 MiB/s 804.47 MiB/s 807.49 MiB/s]
                 change:
                        time:   [-1.3449% -0.9600% -0.5646%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5678% +0.9693% +1.3632%]

As expected, we do not see any significant changes. While there are some slight differences reported, these are below the accuracy threshold of the benchmarks.

s2n_constant_time_equals vs memcmp Benchmarks

We ran criterion benchmarks comparing the performance of memcmp against s2n_constant_time_equals.

  • small data, not equal: how long does it take to compare an unequal blob of 255 bytes?
  • small data, equal: how long does it take to compare an equal blob of 255 bytes?

These cases give numbers for the common case of comparing small pieces of data.

  • many small blobs, not equal: how long does it take to compare 16 kB of 255 byte blobs.

This is a representation of a pathological case possible in places like s2n_protocol_preferences.c. Generally there would be only a small number of protocols, but it is possible for a client to send up to 16 kB of preferences, where each protocol has a maximum size of 255 bytes. This case gives us an understanding of the performance impact in this worst case scenario.

# 386 ps vs 130,280 ps
low-level-comparison - small data, not equal/memcmp
                        time:   [386.08 ps 386.10 ps 386.12 ps]

low-level-comparison - small data, not equal/s2n_constant_time_equals
                        time:   [130.28 ns 130.28 ns 130.29 ns]


# 386 ps vs 130,300 ps
low-level-comparison - small data, equal/memcmp
                        time:   [386.07 ps 386.09 ps 386.11 ps]

low-level-comparison - small data, equal/s2n_constant_time_equals_c
                        time:   [130.29 ns 130.30 ns 130.31 ns]


# 519 ns vs 30,593 ns
low-level-comparison - many small blobs, not equal/memcmp
                        time:   [519.23 ns 519.37 ns 519.50 ns]

low-level-comparison - many small blobs, not equal/s2n_constant_time_equals
                        time:   [30.592 µs 30.593 µs 30.595 µs]

We see that while s2n_constant_time_equals is significantly slower than memcmp, it remains a very fast function, and its cost is small relative to the cost of an entire handshake (120ns / 1_000_000ns => .0000012% ).

For pathological cases (comparing an entire 16kb extension) the s2n_const_time_equals would represent a noticeable portion of the handshake (30us / 1_000us => 3%). Therefore we avoid the usage of s2n_constant_time_equals in scenarios where largeish data (> 1kB) would be compared.

Call-outs:

Needs to make a note to those two parts that aren't changed.

Testing:

This change is a refactor change. The code ran and passed all test cases via cmake.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Copy link
Contributor

@jmayclin jmayclin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also update grep_simple_mistakes as part of this PR.

crypto/s2n_rsa.c Show resolved Hide resolved
crypto/s2n_rsa.c Outdated Show resolved Hide resolved
tls/s2n_connection.c Show resolved Hide resolved
tls/s2n_early_data.c Outdated Show resolved Hide resolved
* Let s2n_stuffer_read_expected_str use memcmp to avoid CBMC problem.
* Adding one more known memcmp location to grep_simple_mistakes.sh.
@boquan-fang boquan-fang enabled auto-merge (squash) August 16, 2024 00:07
Boquan Fang added 2 commits August 16, 2024 00:32
* Replace S2N_ERROR_IF to POSIX_ENSURE.
* Reorder the .sh order to fit the directory order.
@colmmacc
Copy link
Contributor

What's the performance impact of this change? libc's memcmp and memcpy are highly optimized to do word-wise comparisons, pipeline prefetching, etc.

Remove the usage of s2n_constant_time_equals from any functions where
it is possible to compare relatively large amounts of data (~ > 1 kB)
even if that scenario is unlikely.
@jmayclin
Copy link
Contributor

Hi Colm! For context, the goal of this PR is to reduce the number of places that we are using memcmp. This helps reduce the risk of sidechannels by removing data-dependent comparisons.

In-depth benchmarks were added to the PR description. For the specific s2n-tls configuration that we use for handshake benchmarks:

  • no regression was detected using wall clock benchmarks
  • there was a 0.00215% regression in instruction count

This is in line with expectations, because most of the places that memcmp was replaced were comparing small, statically sized amounts of data. The places that aren’t comparing statically sized data are comparing small, bounded amounts of data. For these bounded comparisons, s2n_constant_time_equals adds ~130 ns. For more information see s2n_constant_time_equals vs memcmp Benchmarks in the PR description.

non static, bounded data

  • s2n_early_data.c:
    • app_protocol_size is the size of the user supplied application protocol preferences
    • This is guaranteed to be less than 256 bytes, because the buffer that we store it in has a capacity of 256 bytes.
    • This happens at most once per handshake when validating the early data presented.

/* The application protocol decided upon during the client hello.
* If ALPN is being used, then:
* In server mode, this will be set by the time client_hello_cb is invoked.
* In client mode, this will be set after is_handshake_complete(connection) is true.
*/
char application_protocol[256];

  • s2n_server_hello.c
    • session_id_len is the length of the session id
    • This is guaranteed to be less than 32 bytes, because we bail with an error message if the length is more than 32 bytes.
    • This happens at most once per handshake when the server_hello is parsed.

S2N_ERROR_IF(session_id_len > S2N_TLS_SESSION_ID_MAX_LEN, S2N_ERR_BAD_MESSAGE);

Note: This PR was originally changing some “generally small” cases, but there was the possibility of pathological inputs, like a client sending 16 kB of application protocols. #4717 showed that this would have cost ~3 µs, so these cases were reverted.

@lrstewart lrstewart added the do_not_merge PR might needs something before merging, even if approved and passing label Aug 27, 2024
@boquan-fang boquan-fang merged commit 08d413a into aws:main Sep 5, 2024
36 checks passed
@boquan-fang boquan-fang removed the do_not_merge PR might needs something before merging, even if approved and passing label Sep 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants