Specify previously missed register clobbers in AES-NI asm blocks #9809
base: development
Signed-off-by: Solar Designer <[email protected]>
Oh, unfortunately this trivial change resulted in slight code size increase for me (2 more instructions), from:
to:
This is with
Here's a more elaborate change (specifying these registers as input/output) that results in no change in generated code:
+++ b/src/mbedtls/aesni.c
@@ -456,12 +456,13 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
const unsigned char input[16],
unsigned char output[16])
{
- asm ("movdqu (%3), %%xmm0 \n\t" // load input
+ uint32_t n = ctx->nr, *p = ctx->buf + ctx->rk_offset;
+ asm ("movdqu (%4), %%xmm0 \n\t" // load input
"movdqu (%1), %%xmm1 \n\t" // load round key 0
"pxor %%xmm1, %%xmm0 \n\t" // round 0
"add $16, %1 \n\t" // point to next round key
"subl $1, %0 \n\t" // normal rounds = nr - 1
- "test %2, %2 \n\t" // mode?
+ "test %3, %3 \n\t" // mode?
"jz 2f \n\t" // 0 = decrypt
"1: \n\t" // encryption loop
@@ -486,10 +487,10 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
#endif
"3: \n\t"
- "movdqu %%xmm0, (%4) \n\t" // export output
- :
- : "r" (ctx->nr), "r" (ctx->buf + ctx->rk_offset), "r" (mode), "r" (input), "r" (output)
- : "memory", "cc", "xmm0", "xmm1");
+ "movdqu %%xmm0, %2 \n\t" // export output
+ : "+r" (n), "+r" (p), "=m" (*(uint8_t(*)[16]) output)
+ : "r" (mode), "r" (input)
+ : "cc", "xmm0", "xmm1");
return 0;
I'm not adding this as a commit yet - let me know if I should.
Also, while playing towards this more elaborate change, I briefly got the compiler to optimize this function's body out entirely - apparently, the reliance on What a can of worms.
That was false alarm, kind of. Per gcc documentation, "asm statements that have no output operands [...] are implicitly volatile." So when I made a couple of operands input/output, I removed this implicit volatile, which is why the code got optimized out until I added the explicit output. This means that the rest of asm blocks in this source file are fine in this respect without the volatile keyword because they have it implicit as long as they don't specify any outputs (but do specify clobbering To summarize:
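To illustrate the implicit-volatile point, here is a minimal sketch. It is not code from aesni.c: the function names are hypothetical, and the asm path assumes GCC or Clang on x86_64 with AT&T syntax (a plain C fallback is used elsewhere). The first variant has no output operands, so per the GCC documentation quoted above it is implicitly volatile. The second declares an `"=m"` output, so it is kept only because the compiler can see that the written memory is read afterwards.

```c
#include <stdint.h>

/* No output operands: the asm is implicitly volatile. The "memory" clobber
 * tells the compiler that the asm writes memory it cannot see otherwise. */
void store_implicit_volatile(uint32_t *dst, uint32_t value)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ ("movl %1, (%0)"
             :
             : "r" (dst), "r" (value)
             : "memory");
#else
    *dst = value;   /* portable fallback, assumption for non-x86_64 builds */
#endif
}

/* One output operand: no longer implicitly volatile. The "=m" output names
 * exactly the memory that is written, so no "memory" clobber is needed. */
void store_with_output(uint32_t *dst, uint32_t value)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ ("movl %1, %0"
             : "=m" (*dst)
             : "r" (value));
#else
    *dst = value;   /* portable fallback, assumption for non-x86_64 builds */
#endif
}

/* Hypothetical self-check: returns 1 when both variants stored the value. */
int stores_agree(uint32_t value)
{
    uint32_t a = ~value, b = ~value;
    store_implicit_volatile(&a, value);
    store_with_output(&b, value);
    return a == value && b == value;
}
```

If the result of an asm with outputs is never used, the compiler is free to delete it, which matches the behaviour described above. Adding an explicit `volatile` qualifier is the usual way to prevent that when the asm has side effects the operands do not express.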
Thank you very much for reporting this! I've filed #9819 so we can keep track of the bug and when it's fixed. Even if no known compiler optimizes the code badly, we take this seriously. For example, we had a similar bug in bignum assembly code that went undetected for years, and then the next version of Clang caused it to pretty systematically generate incorrect code. Fixing the potential bug is more important than code size, especially if it's a tiny increase. As you note, at this point, this is somewhat legacy code, since we favor the intrinsics. (But if you find that the intrinsics are noticeably slower than the assembly with a recent compiler, please let us know!)
Thank you @gilles-peskine-arm. I think the slight code size increase is a non-issue per se, but it may have associated performance cost on some CPUs (although recent ones may end up handling the extra MOVs via register renaming). For JtR, we went with the more elaborate patch above:
It's a mixed bag - a lot depends on how these functions are used - most notably, on whether it's primarily key setup or bulk en/decryption. We end up with key setup significantly affecting performance when there's only a little data to process, but such processing is in a loop. It appears that key setup becomes slower and en/decryption faster with intrinsics. Here I benchmarked two different JtR "formats" - one became up to 5% slower with intrinsics (new key setup performed per just a few CBC mode blocks decrypted), the other 25% faster (encryption of two independent blocks with the same key repeated a lot of times): openwall/john#5593 (comment)
@@ -489,7 +489,7 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
"movdqu %%xmm0, (%4) \n\t" // export output
:
Please add a changelog entry file. Even if we don't know for sure that a platform is affected, insufficient clobbers are a bug. If the next GCC/Clang/MSVC/… triggers the bug, users should be informed of which version of Mbed TLS fixed it.
Would you like me to include this under Security, Bugfix, or Changes? I notice that a previous related change was somehow under Changes:
Changes
[...]
* Fix clobber list in MIPS assembly for large integer multiplication.
Previously, this could lead to functionally incorrect assembly being
produced by some optimizing compilers, showing up as failures in
e.g. RSA or ECC signature operations. Reported in #1722, fix suggested
by Aurelien Jarno and submitted by Jeffrey Martin.
Under Bugfix, please.
Looking at the history, it seems we messed up the changelog sections in the 2.17.0 release. Originally that entry was under Bugfix.
@@ -489,7 +489,7 @@ int mbedtls_aesni_crypt_ecb(mbedtls_aes_context *ctx,
"movdqu %%xmm0, (%4) \n\t" // export output
:
: "r" (ctx->nr), "r" (ctx->buf + ctx->rk_offset), "r" (mode), "r" (input), "r" (output)
- : "memory", "cc", "xmm0", "xmm1");
+ : "memory", "cc", "xmm0", "xmm1", "0", "1");
I'm checking the other asm blocks in this file.
In aesni_setkey_enc_128, I think xmm0 and xmm1 are missing from the clobber list, right? And in aesni_setkey_enc_192 and aesni_setkey_enc_256, same and also xmm2. With your fix, the rest look fine to me.
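As a hedged illustration of what listing XMM registers in the clobber list looks like, the sketch below XORs two 16-byte blocks. It is not taken from aesni.c: `xor_block16` and `xor_block16_selfcheck` are hypothetical names, and the asm path assumes GCC or Clang on x86_64 (where SSE2 is always available), with a plain C fallback elsewhere. Every XMM register the asm writes is named after the last colon, so the compiler will not keep a live value in it across the statement.

```c
#include <stdint.h>

/* out = a XOR b for 16-byte blocks. */
void xor_block16(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ ("movdqu (%0), %%xmm0   \n\t"   /* load a */
             "movdqu (%1), %%xmm1   \n\t"   /* load b */
             "pxor   %%xmm1, %%xmm0 \n\t"   /* xmm0 ^= xmm1 */
             "movdqu %%xmm0, (%2)   \n\t"   /* store result */
             :
             : "r" (a), "r" (b), "r" (out)
             /* "memory": the asm reads and writes through the pointers.
              * "xmm0", "xmm1": both registers are overwritten. */
             : "memory", "xmm0", "xmm1");
#else
    for (int i = 0; i < 16; i++) {          /* portable fallback (assumption) */
        out[i] = (uint8_t) (a[i] ^ b[i]);
    }
#endif
}

/* Hypothetical self-check: returns 1 if the asm matches a byte-wise XOR. */
int xor_block16_selfcheck(void)
{
    uint8_t a[16], b[16], out[16];
    for (int i = 0; i < 16; i++) {
        a[i] = (uint8_t) (i * 17);
        b[i] = (uint8_t) (0xA5 ^ i);
        out[i] = 0;
    }
    xor_block16(out, a, b);
    for (int i = 0; i < 16; i++) {
        if (out[i] != (uint8_t) (a[i] ^ b[i])) {
            return 0;
        }
    }
    return 1;
}
```

None of the instructions in this sketch modify the flags, so `"cc"` is left out of the clobber list here.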
> In aesni_setkey_enc_128, I think xmm0 and xmm1 are missing from the clobber list, right? And in aesni_setkey_enc_192 and aesni_setkey_enc_256, same and also xmm2. With your fix, the rest look fine to me.
Oh, you're right. Is this something you'd fix separately from this PR?
OTOH, I think in mbedtls_aesni_gcm_mult, the clobber list unnecessarily includes cc - I think none of those instructions modify flags. But this may be better to leave as-is at this point.
I'd appreciate it if you could fix those clobber lists while you're at it, so we can say we fixed the assembly in the AESNI code and not just in one function. But if not we'll make a follow-up pull request.
OK. I intend to fix these as well (and credit you in the commit message for noticing them), but it's a busy week and it's taking me a while to get back to "free time" work again. Just letting you know that I accepted the task, but can't handle it as a high priority.
I've just added a commit fixing the missed XMM clobbers and adding a changelog entry. I hope this is as desired.
I tested these same changes in a copy/revision of the code as we integrated it in JtR. I temporarily reverted from usage of intrinsics to asm, rebuilt JtR, and ran our tests - so whatever functions we do use there were tested. I also checked the aesni.o text section size, which remained unchanged at 1445 bytes for me (which does not guarantee no increase in actually used code as the function entry points are aligned, so each previous function is padded).
I never tried building/testing mbedTLS proper (sorry!) and quickly trying to do so now first gave me this:
$ make check
Makefile:19: *** /framework/exported.make not found (and does not appear to be a git checkout). Please ensure you have downloaded the right archive from the release page on GitHub.. Stop.
which made me look inside the Makefile, see the other error message nearby (is the condition on which message to print maybe wrong?) and then do:
$ git submodule update --init
Submodule 'framework' (https://github.com/Mbed-TLS/mbedtls-framework) registered for path 'framework'
Cloning into '/home/user/mbedtls/framework'...
Submodule path 'framework': checked out 'df0144c4a3c0fc9beea606afde07cf8708233675'
Then rerunning make check fails at:
CC src/test_helpers/ssl_helpers.c
make[1]: Leaving directory '/home/user/mbedtls/tests'
make[1]: Entering directory '/home/user/mbedtls/library'
Gen ../tf-psa-crypto/core/psa_crypto_driver_wrappers.h ../tf-psa-crypto/core/psa_crypto_driver_wrappers_no_static.c
Traceback (most recent call last):
File "/home/user/mbedtls/library/../scripts/generate_driver_wrappers.py", line 18, in <module>
import jsonschema
ModuleNotFoundError: No module named 'jsonschema'
I'm not eager to install a Python module without creating a dedicated environment for this testing first, so I stopped here.
I mention this as maybe-useful feedback on maybe improving error messages and maybe relaxing build/test dependencies for new users.
> OK. I intend to fix these as well (and credit you in the commit message for noticing them), but it's a busy week and it's taking me a while to get back to "free time" work again. Just letting you know that I accepted the task, but can't handle it as a high priority.
No worries, I know all about “free time”! If you prefer, I can take over and finish the patch. Well, you've gone ahead and updated, thank you very much, but I can take over if further updates are needed.
> I never tried building/testing mbedTLS proper (sorry!) and quickly trying to do so now first gave me this: (…)
Feedback noted. Unfortunately, while we'd like to get rid of the Python dependencies, that would require significant engineering work. For what it's worth, releases should be fine on both counts, you just run into these difficulties when you download from a development branch. Hey, at least you didn't have to run autotools!
Thank you very much for letting us know about this and submitting a fix! And sorry for the suboptimal new user experience.
The code changes are correct and complete. But I'm afraid I need to request a changelog improvement. You've already helped us a lot, so please feel free to let us finish the boring bits. I can handle the changelog polishing and any further review feedback.
@@ -0,0 +1,3 @@
Bugfix
* Specify previously missed XMM register clobbers in AES-NI asm blocks.
Not just XMM: for the one you'd spotted originally, it's general-purpose registers. But anyway that's more detail than matters in the changelog, and conversely we'd like the changelog to explain the impact. So something like this:
Fix missing constraints on the AESNI inline assembly which is used on GCC-like compilers when building AES for generic x86_64 targets. This may have resulted in incorrect code with some compilers, depending on optimizations. Fixes #9819.
Note that the other reviewers (we need two reviewers for each pull request) may request more changes.
I can handle the changelog updates (and any other rework) if you'd like.
Oh, sure, I didn't mean to include XMM in that changelog entry, but ended up copy-pasting with the commit message too much. And it should be a separate commit then.
When I fix this, OK to amend/force-push the previous commit, or should I strictly be adding commits?
Amending the last commit would be fine here.
Done. I used your suggested wording almost as-is, although I'm not sure about the word "generic": not all x86_64 CPUs support AES-NI, but maybe in Mbed TLS context this word is somehow appropriate.
I meant “generic” as in compiling without a target option like -maes. I think that's the usual way to phrase this, though I do realize it's ambiguous here. If you build the library in its default configuration for a generic x86_64 target, the AESNI assembly gets built, and then it may or may not be executed at runtime depending on whether AESNI is present.
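For readers unfamiliar with runtime detection, the following is a minimal sketch of how a program can check for AES-NI at run time. It is not Mbed TLS's own detection routine: `cpu_has_aesni` is a hypothetical name, and the code assumes GCC or Clang with their `<cpuid.h>` header. On other compilers or architectures the sketch simply reports that the feature is absent.

```c
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#endif

/* Returns 1 if the CPU reports AES-NI support, 0 otherwise. */
int cpu_has_aesni(void)
{
#if defined(HAVE_X86_CPUID)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid returns 0 when the requested leaf is not supported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return 0;
    }
    /* CPUID leaf 1, ECX bit 25 is the AES instruction set flag. */
    return (int) ((ecx >> 25) & 1u);
#else
    return 0;   /* assumption: treat non-x86 or unknown compilers as "no" */
#endif
}
```

A caller would branch on this result once, using the AES-NI path when it returns 1 and a portable implementation otherwise. This is what allows a binary built without `-maes` to still contain, and conditionally execute, the AES-NI assembly.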
Noticed by Gilles Peskine Co-authored-by: Gilles Peskine <[email protected]> Signed-off-by: Solar Designer <[email protected]>
Co-authored-by: Gilles Peskine <[email protected]> Signed-off-by: Solar Designer <[email protected]>
0da9f0f to 6b2ca18
Thank you very much for the updates! Looks good to me now.
Our policy requires two reviewers for pull requests, and we'll want to backport the fix to long-term support branches (mbedtls-2.28 and mbedtls-3.6). If the second reviewer requests changes, and to do the backports, a team member can handle that if you want. Once again, thank you very much for reporting this - incorrect assembly constraints are pretty hard to notice!
Description
We have just integrated mbedTLS AES code into John the Ripper via openwall/john#5591 and I've been reviewing our changes as well as looking at mbedTLS original code. One of our concerns was how good or bad the inline asm blocks are compared to the intrinsics, and whether we should possibly get rid of the inline asm. I found you also have an issue open for that (#8231), which is great. Meanwhile, I noticed a number of performance issues with the inline asm code and what I think is one bug. This PR is to fix the bug:
While most of the inline asm blocks specify what they clobber (memory, condition flags, other registers), the one in mbedtls_aesni_crypt_ecb clobbers a couple of input registers without specifying so. This PR fixes that in the same way as other asm blocks in the same source file do.

The risk from this bug was a potential miscompile that could result in incorrect computation/behavior, including potentially a security vulnerability. However, in practice this is unlikely because the entire non-inline function consists solely of this asm block. Issues could arise with link-time optimization and aggressive function inlining into an application using mbedTLS.
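The general pattern behind this class of bug can be sketched as follows. This is an illustration only, not the code from aesni.c: `sum_words` is a hypothetical function, and the asm path assumes GCC or Clang on x86_64 with AT&T syntax (with a plain C loop elsewhere). The asm loop advances its pointer and counts down its counter, so both are passed as read-write `"+r"` operands on local copies rather than as plain `"r"` inputs, which the compiler is entitled to assume are unchanged afterwards.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum `count` 32-bit words (wrapping on overflow). */
uint32_t sum_words(const uint32_t *words, uint32_t count)
{
    uint32_t total = 0;
#if defined(__GNUC__) && defined(__x86_64__)
    const uint32_t *p = words;   /* local copy: the asm advances it */
    uint32_t n = count;          /* local copy: the asm counts it down */

    if (n == 0) {
        return 0;                /* the loop below runs at least once */
    }
    __asm__ ("1:                \n\t"
             "addl (%1), %0     \n\t"   /* total += *p */
             "add  $4, %1       \n\t"   /* p++ */
             "subl $1, %2       \n\t"   /* n-- */
             "jnz  1b           \n\t"   /* loop while n != 0 */
             : "+r" (total), "+r" (p), "+r" (n)   /* all three are modified */
             :
             : "memory", "cc");  /* reads memory via p; changes the flags */
#else
    for (uint32_t i = 0; i < count; i++) {   /* portable fallback (assumption) */
        total += words[i];
    }
#endif
    return total;
}
```

Had `p` and `n` been declared as `"r"` inputs, the same instructions would still assemble, but after inlining the compiler could reuse those registers on the assumption that they still hold `words` and `count`.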
Fix #9819
PR checklist