Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify previously missed register clobbers in AES-NI asm blocks #9809

Open
wants to merge 3 commits into
base: development
Choose a base branch
from

Conversation

solardiz
Copy link

@solardiz solardiz commented Nov 30, 2024

Description

We have just integrated mbedTLS AES code into John the Ripper via openwall/john#5591 and I've been reviewing our changes as well as looking at mbedTLS original code. One of our concerns was how good or bad the inline asm blocks are compared to the intrinsics, and whether we should possibly get rid of the inline asm. I found you also have an issue open for that (#8231), which is great. Meanwhile, I noticed a number of performance issues with the inline asm code and what I think is one bug. This PR is to fix the bug:

While most of the inline asm blocks specify what they clobber (memory, condition flags, other registers), the one in mbedtls_aesni_crypt_ecb clobbers a couple of input registers without specifying so. This PR fixes that in the same fashion as further asm blocks in the same source file use.

The risk from this bug was a potential miscompile that could result in incorrect computation/behavior, including potentially a security vulnerability. However, in practice this is unlikely because the entire non-inline function consists solely of this asm block. Issues could arise with link-time optimization and aggressive function inlining into an application using mbedTLS.

Fix #9819

PR checklist

  • changelog provided
  • development PR provided (this one)
  • framework PR not needed
  • 3.6 PR TODO
  • 2.28 PR TODO
  • tests not required because: no functionality is added, and the bug isn't normally reproducible in testing

@solardiz
Copy link
Author

Oh, unfortunately this trivial change resulted in slight code size increase for me (2 more instructions), from:

0000000000000050 <mbedtls_aesni_crypt_ecb>:
  50:   49 89 d0                mov    %rdx,%r8
  53:   48 8b 57 08             mov    0x8(%rdi),%rdx
  57:   8b 07                   mov    (%rdi),%eax
  59:   48 8d 54 97 10          lea    0x10(%rdi,%rdx,4),%rdx
  5e:   f3 41 0f 6f 00          movdqu (%r8),%xmm0
  63:   f3 0f 6f 0a             movdqu (%rdx),%xmm1
  67:   66 0f ef c1             pxor   %xmm1,%xmm0
  6b:   48 83 c2 10             add    $0x10,%rdx
  6f:   83 e8 01                sub    $0x1,%eax
  72:   85 f6                   test   %esi,%esi
  74:   74 1d                   je     93 <mbedtls_aesni_crypt_ecb+0x43>
  76:   f3 0f 6f 0a             movdqu (%rdx),%xmm1
  7a:   66 0f 38 dc c1          aesenc %xmm1,%xmm0
  7f:   48 83 c2 10             add    $0x10,%rdx
  83:   83 e8 01                sub    $0x1,%eax
  86:   75 ee                   jne    76 <mbedtls_aesni_crypt_ecb+0x26>
  88:   f3 0f 6f 0a             movdqu (%rdx),%xmm1
  8c:   66 0f 38 dd c1          aesenclast %xmm1,%xmm0
  91:   eb 1b                   jmp    ae <mbedtls_aesni_crypt_ecb+0x5e>
  93:   f3 0f 6f 0a             movdqu (%rdx),%xmm1
  97:   66 0f 38 de c1          aesdec %xmm1,%xmm0
  9c:   48 83 c2 10             add    $0x10,%rdx
  a0:   83 e8 01                sub    $0x1,%eax
  a3:   75 ee                   jne    93 <mbedtls_aesni_crypt_ecb+0x43>
  a5:   f3 0f 6f 0a             movdqu (%rdx),%xmm1
  a9:   66 0f 38 df c1          aesdeclast %xmm1,%xmm0
  ae:   f3 0f 7f 01             movdqu %xmm0,(%rcx)
  b2:   31 c0                   xor    %eax,%eax
  b4:   c3                      retq   

to:

0000000000000050 <mbedtls_aesni_crypt_ecb>:
  50:   48 8b 47 08             mov    0x8(%rdi),%rax
  54:   41 89 f0                mov    %esi,%r8d
  57:   49 89 ca                mov    %rcx,%r10
  5a:   49 89 d1                mov    %rdx,%r9
  5d:   8b 0f                   mov    (%rdi),%ecx
  5f:   48 8d 74 87 10          lea    0x10(%rdi,%rax,4),%rsi
  64:   f3 41 0f 6f 01          movdqu (%r9),%xmm0
  69:   f3 0f 6f 0e             movdqu (%rsi),%xmm1
  6d:   66 0f ef c1             pxor   %xmm1,%xmm0
  71:   48 83 c6 10             add    $0x10,%rsi
  75:   83 e9 01                sub    $0x1,%ecx
  78:   45 85 c0                test   %r8d,%r8d
  7b:   74 1d                   je     9a <mbedtls_aesni_crypt_ecb+0x4a>
  7d:   f3 0f 6f 0e             movdqu (%rsi),%xmm1
  81:   66 0f 38 dc c1          aesenc %xmm1,%xmm0
  86:   48 83 c6 10             add    $0x10,%rsi
  8a:   83 e9 01                sub    $0x1,%ecx
  8d:   75 ee                   jne    7d <mbedtls_aesni_crypt_ecb+0x2d>
  8f:   f3 0f 6f 0e             movdqu (%rsi),%xmm1
  93:   66 0f 38 dd c1          aesenclast %xmm1,%xmm0
  98:   eb 1b                   jmp    b5 <mbedtls_aesni_crypt_ecb+0x65>
  9a:   f3 0f 6f 0e             movdqu (%rsi),%xmm1
  9e:   66 0f 38 de c1          aesdec %xmm1,%xmm0
  a3:   48 83 c6 10             add    $0x10,%rsi
  a7:   83 e9 01                sub    $0x1,%ecx
  aa:   75 ee                   jne    9a <mbedtls_aesni_crypt_ecb+0x4a>
  ac:   f3 0f 6f 0e             movdqu (%rsi),%xmm1
  b0:   66 0f 38 df c1          aesdeclast %xmm1,%xmm0
  b5:   f3 41 0f 7f 02          movdqu %xmm0,(%r10)
  ba:   31 c0                   xor    %eax,%eax
  bc:   c3                      retq   

This is with gcc version 11.3.1 20220421 (Red Hat 11.3.1-2).

Here's a more elaborate change (specifying these registers as input/output) that results in no change in generated code:

+++ b/src/mbedtls/aesni.c
@@ -456,12 +456,13 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
                             const unsigned char input[16],
                             unsigned char output[16])
 {
-    asm ("movdqu    (%3), %%xmm0    \n\t" // load input
+    uint32_t n = ctx->nr, *p = ctx->buf + ctx->rk_offset;
+    asm ("movdqu    (%4), %%xmm0    \n\t" // load input
          "movdqu    (%1), %%xmm1    \n\t" // load round key 0
          "pxor      %%xmm1, %%xmm0  \n\t" // round 0
          "add       $16, %1         \n\t" // point to next round key
          "subl      $1, %0          \n\t" // normal rounds = nr - 1
-         "test      %2, %2          \n\t" // mode?
+         "test      %3, %3          \n\t" // mode?
          "jz        2f              \n\t" // 0 = decrypt
 
          "1:                        \n\t" // encryption loop
@@ -486,10 +487,10 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
 #endif
 
          "3:                        \n\t"
-         "movdqu    %%xmm0, (%4)    \n\t" // export output
-         :
-         : "r" (ctx->nr), "r" (ctx->buf + ctx->rk_offset), "r" (mode), "r" (input), "r" (output)
-         : "memory", "cc", "xmm0", "xmm1");
+         "movdqu    %%xmm0, %2      \n\t" // export output
+         : "+r" (n), "+r" (p), "=m" (*(uint8_t(*)[16]) output)
+         : "r" (mode), "r" (input)
+         : "cc", "xmm0", "xmm1");
 
 
     return 0;

I'm not adding this as a commit yet - let me know if I should.

Also, while playing towards this more elaborate change, I briefly got the compiler to optimize this function's body out entirely - apparently, the reliance on "memory" to tell the compiler there might be outputs is not enough! In the patch above, I solve this by making the output explicit. However, there are many other asm blocks in this source file that similarly rely solely on "memory". A quick workaround for those may be to make them asm volatile.

What a can of worms.

@solardiz
Copy link
Author

while playing towards this more elaborate change, I briefly got the compiler to optimize this function's body out entirely

That was false alarm, kind of. Per gcc documentation, "asm statements that have no output operands [...] are implicitly volatile." So when I made a couple of operands input/output, I removed this implicit volatile, which is why the code got optimized out until I added the explicit output.

This means that the rest of asm blocks in this source file are fine in this respect without the volatile keyword because they have it implicit as long as they don't specify any outputs (but do specify clobbering "memory", so that further C code knows to reload the actual outputs... as well as unnecessarily everything else).

To summarize:

  1. This PR's current one-line change is fine to fix what its title says, but it may make the code slightly larger/slower (redundant MOVs). Apparently, this approach with declaring inputs as clobbered is inefficient, but it's the approach used by the rest of this source file (room for optimization there as well?)
  2. If desired, I can add a second commit with the more elaborate and invasive patch I posted above, which (despite of being a larger source code change) avoids any changes to the generated machine code (relative to before this PR) in my testing (of course, things may vary between compilers). I'm not eager to make elaborate changes to one asm block without doing similar for the rest, and it feels wrong to make many changes to legacy code like this (risk of introducing or exposing bugs, whereas this code is quite stable now and should be retired soon).

@gilles-peskine-arm gilles-peskine-arm added needs-review Every commit must be reviewed by at least two team members, component-crypto Crypto primitives and low-level interfaces needs-reviewer This PR needs someone to pick it up for review priority-high High priority - will be reviewed soon size-xs Estimated task size: extra small (a few hours at most) bug labels Dec 2, 2024
@gilles-peskine-arm
Copy link
Contributor

Thank you very much for reporting this! I've filed #9819 so we can keep track of the bug and when it's fixed.

Even if no known compiler optimizes the code badly, we take this seriously. For example, we had a similar bug in bignum assembly code that went undetected for years, and then the next version of Clang caused it to pretty systematically generate incorrect code.

Fixing the potential bug is more important than code size, especially if it's a tiny increase. As you note, at this point, this is somewhat legacy code, since we favor the intrinsics. (But if you find that the intrinsics are noticeably slower than the assembly with a recent compiler, please let us know!)

@solardiz
Copy link
Author

solardiz commented Dec 2, 2024

Thank you @gilles-peskine-arm. I think the slight code size increase is a non-issue per se, but it may have associated performance cost on some CPUs (although recent ones may end up handling the extra MOVs via register renaming).

For JtR, we went with the more elaborate patch above:
openwall/john@f2f7c2a
plus replacing SUB with DEC:
openwall/john@2480619
but we're about to switch to intrinsics anyway.

if you find that the intrinsics are noticeably slower than the assembly with a recent compiler, please let us know!

It's a mixed bag - a lot depends on how these functions are used - most notably, on whether it's primarily key setup or bulk en/decryption. We end up with key setup significantly affecting performance when there's only a little data to process, but such processing is in a loop. It appears that key setup becomes slower and en/decryption faster with intrinsics. Here I benchmarked two different JtR "formats" - one became up to 5% slower with intrinsics (new key setup performed per just a few CBC mode blocks decrypted), the other 25% faster (encryption of two independent blocks with the same key repeated a lot of times): openwall/john#5593 (comment)

@@ -489,7 +489,7 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
"movdqu %%xmm0, (%4) \n\t" // export output
:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a changelog entry file. Even if we don't know for sure that a platform is affected, insufficient clobbers are a bug. If the next GCC/Clang/MSVC/… triggers the bug, users should be informed of which version of Mbed TLS fixed it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like me to include this under Security, Bugfix, or Changes? I notice that a previous related change was somehow under Changes:

Changes
[...]
   * Fix clobber list in MIPS assembly for large integer multiplication.
     Previously, this could lead to functionally incorrect assembly being
     produced by some optimizing compilers, showing up as failures in
     e.g. RSA or ECC signature operations. Reported in #1722, fix suggested
     by Aurelien Jarno and submitted by Jeffrey Martin.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under Bugfix, please.

Looking at the history, it seems we messed up the changelog sections in the 2.17.0 release. Originally that entry was under Bugfix.

@gilles-peskine-arm gilles-peskine-arm added needs-work and removed needs-review Every commit must be reviewed by at least two team members, needs-reviewer This PR needs someone to pick it up for review labels Dec 2, 2024
@@ -489,7 +489,7 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
"movdqu %%xmm0, (%4) \n\t" // export output
:
: "r" (ctx->nr), "r" (ctx->buf + ctx->rk_offset), "r" (mode), "r" (input), "r" (output)
: "memory", "cc", "xmm0", "xmm1");
: "memory", "cc", "xmm0", "xmm1", "0", "1");

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm checking the other asm blocks in this file.

In aesni_setkey_enc_128, I think xmm0 and xmm1 are missing from the clobber list, right? And in aesni_setkey_enc_192 and aesni_setkey_enc_256, same and also xmm2. With your fix, the rest look fine to me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In aesni_setkey_enc_128, I think xmm0 and xmm1 are missing from the clobber list, right? And in aesni_setkey_enc_192 and aesni_setkey_enc_256, same and also xmm2. With your fix, the rest look fine to me.

Oh, you're right. Is this something you'd fix separately from this PR?

OTOH, I think in mbedtls_aesni_gcm_mult, the clobber list unnecessarily includes cc - I think none of those instructions modify flags. But this may be better to leave as-is at this point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd appreciate it if you could fix those clobber lists while you're at it, so we can say we fixed the assembly in the AESNI code and not just in one function. But if not we'll make a follow-up pull request.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I intend to fix these as well (and credit you in the commit message for noticing them), but it's a busy week and it's taking me a while to get back to "free time" work again. Just letting you know that I accepted the task, but can't handle it as a high priority.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just added a commit fixing the missed XMM clobbers and adding a changelog entry. I hope this is as desired.

I tested these same changes in a copy/revision of the code as we integrated it in JtR. I temporarily reverted from usage of intrinsics to asm, rebuilt JtR, and ran our tests - so whatever functions we do use there were tested. I also checked the aesni.o text section size, which remained unchanged at 1445 bytes for me (which does not guarantee no increase in actually used code as the function entry points are aligned, so each previous function is padded).

I never tried building/testing mbedTLS proper (sorry!) and quickly trying to do so now first gave me this:

$ make check
Makefile:19: *** /framework/exported.make not found (and does not appear to be a git checkout). Please ensure you have downloaded the right archive from the release page on GitHub..  Stop.

which made me look inside the Makefile, see the other error message nearby (is the condition on which message to print maybe wrong?) and then do:

$ git submodule update --init
Submodule 'framework' (https://github.com/Mbed-TLS/mbedtls-framework) registered for path 'framework'
Cloning into '/home/user/mbedtls/framework'...
Submodule path 'framework': checked out 'df0144c4a3c0fc9beea606afde07cf8708233675'

Then rerunning make check fails at:

  CC    src/test_helpers/ssl_helpers.c
make[1]: Leaving directory '/home/user/mbedtls/tests'
make[1]: Entering directory '/home/user/mbedtls/library'
  Gen   ../tf-psa-crypto/core/psa_crypto_driver_wrappers.h ../tf-psa-crypto/core/psa_crypto_driver_wrappers_no_static.c
Traceback (most recent call last):
  File "/home/user/mbedtls/library/../scripts/generate_driver_wrappers.py", line 18, in <module>
    import jsonschema
ModuleNotFoundError: No module named 'jsonschema'

I'm not eager to install a Python module without creating a dedicated environment for this testing first, so I stopped here.

I mention this as maybe-useful feedback on maybe improving error messages and maybe relaxing build/test dependencies for new users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I intend to fix these as well (and credit you in the commit message for noticing them), but it's a busy week and it's taking me a while to get back to "free time" work again. Just letting you know that I accepted the task, but can't handle it as a high priority.

No worries, I know all about “free time”! If you prefer, I can take over and finish the patch. Well, you've gone ahead and updated, thank you very much, but I can take over if there further updates are needed.

I never tried building/testing mbedTLS proper (sorry!) and quickly trying to do so now first gave me this: (…)

Feedback noted. Unfortunately, while we'd like to get rid of the Python dependencies, that would require significant engineering work. For what it's worth, releases should be fine on both counts, you just run into these difficulties when you download from a development branch. Hey, at least you didn't have to run autotools!

Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for letting us know about this and submitting a fix! And sorry for the suboptimal new user experience.

The code changes are correct and complete. But I'm afraid I need to request a changelog improvement. You've already helped us a lot, so please feel free to let us finish the boring bits. I can handle the changelog polishing and any further review feedback.

@@ -0,0 +1,3 @@
Bugfix
* Specify previously missed XMM register clobbers in AES-NI asm blocks.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not just XMM: for the one you'd spotted originally, it's general-purpose registers. But anyway that's more detail than matters in the changelog, and conversely we'd like the changelog to explain the impact. So something like this:

Fix missing constraints on the AESNI inline assembly which is used on GCC-like compilers when building AES for generic x86_64 targets. This may have resulted in incorrect code with some compilers, depending on optimizations. Fixes #9819.

Note that the other reviewers (we need two reviewers for each pull request) may request more changes.

I can handle the changelog updates (and any other rework) if you'd like.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sure, I didn't mean to include XMM in that changelog entry, but ended up copy-pasting with the commit message too much. And it should be a separate commit then.

When I fix this, OK to amend/force-push the previous commit, or should I strictly be adding commits?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amending the last commit would be fine here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I used your suggested wording almost as-is, although I'm not sure about the word "generic": not all x86_64 CPUs support AES-NI, but maybe in Mbed TLS context this word is somehow appropriate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant “generic” as in compiling without a target option like -maes. I think that's the usual way to phrase this, though I do realize it's ambiguous here. If you build the library in its default configuration for a generic x86_64 target, the AESNI assembly gets built, and then it may or may not be executed at runtime depending on whether AESNI is present.

@gilles-peskine-arm gilles-peskine-arm added the needs-backports Backports are missing or are pending review and approval. label Dec 9, 2024
solardiz and others added 2 commits December 11, 2024 02:44
Noticed by Gilles Peskine

Co-authored-by: Gilles Peskine <[email protected]>
Signed-off-by: Solar Designer <[email protected]>
Co-authored-by: Gilles Peskine <[email protected]>
Signed-off-by: Solar Designer <[email protected]>
Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the updates! Looks good to me now.

Our policy requires two reviewers for pull requests, and we'll want to backport the fix to long-term support branches (mbedtls-2.28 and mbedtls-3.6). If the second reviewer requests changes, and to do the backports, a team member can handle that if you want. Once again, thank you very much for reporting this — incorrect assembly constraints are pretty hard to notice!

@solardiz solardiz changed the title Specify register clobbers in mbedtls_aesni_crypt_ecb() Specify previously missed register clobbers in AES-NI asm blocks Dec 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug component-crypto Crypto primitives and low-level interfaces needs-backports Backports are missing or are pending review and approval. needs-work priority-high High priority - will be reviewed soon size-xs Estimated task size: extra small (a few hours at most)
Projects
Status: In Development
Development

Successfully merging this pull request may close these issues.

Insufficient clobber declarations in AESNI assembly
2 participants