Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce size even more: generate s-box table? #228

Open
simntonic opened this issue Jun 10, 2024 · 7 comments
Open

Reduce size even more: generate s-box table? #228

simntonic opened this issue Jun 10, 2024 · 7 comments

Comments

@simntonic
Copy link

simntonic commented Jun 10, 2024

Hi :)

First a HUGE thanks for making this and releasing it publically this way, really awesomely useful :))

I'm implementing this on a tiny uC and I'm trying to reduce (compiled) code size as much as possible. As suggested in the readme, I'm only using CTR.

The aes.c file mentions this:

// The lookup-tables are marked const so they can be placed in read-only storage instead of RAM
// The numbers below can be computed dynamically trading ROM for RAM - 
// This can be useful in (embedded) bootloader applications, where ROM is often limited.
static const uint8_t sbox[256] = {

I've been looking around the net for a way to generate this tables (I'm not using rsbox since I'm only using CTR), the best I could find is this: https://crypto.stackexchange.com/questions/98396/how-does-this-code-calculating-aes-s-box-work
But I am not sure how to get the value of p from the polynomial 1 + x + x^3 + x^4 + x^8 in C code?

Thank you!

@simntonic
Copy link
Author

Nervermind I found it in the comments, p is set to the polynomial with x=2, i.e. p=283.

So putting sbox in RAM and adding this function reduces the code size by 212 bytes :)
(but obviously increases the RAM usage by 265 bytes)

void AES_generateSbox(void) {
	uint32_t p = 283;

	uint32_t t[256];
	uint32_t x = 1;
	for (int i = 0; i < 256; i ++) {
		t[i] = x;
		x ^= (x << 1) ^ ((x >> 7) * p);
	}

	sbox[0] = 0x63;
	for (int i = 0; i < 255; i ++) {
		x = t[255 - i];
		x |= x << 8;
		x ^= (x >> 4) ^ (x >> 5) ^ (x >> 6) ^ (x >> 7);
		sbox[t[i]] = (x ^ 0x63) & 0xFF;
	}
}

I checked and the generated sbox is correct.

If anybody has any ideas on how to reduce the function's code size, I'm happy to hear it :)

@simntonic
Copy link
Author

simntonic commented Jun 10, 2024

Just FYI I found a few more minor optimizations, when compiling for a 32-bit processor. All the operations are byte-based, while for a 32-bit processor doing operations as 32-bit words (when possible) is much more efficient, saving both space and computing time. Fortunately the compiler optimizes already most of the possible ones, and there are many which are not optimizable (for example when working on 4 non-adjacent bytes).

Replacing all the uint8_t i with uint32_t i saved a couple of bytes.

Since I'm using only CTR most of the functions benefit from being inlined, so I added a bunch of "inline" on these functions' declaration... Only to find out that the compiler was inlining them already :D

This last optimization, in KeyExpansion() saved a few more (about 30?):

	// The first round key is the key itself.
	for (i = 0; i < Nk; ++i)
	{
		((uint32_t*)roundKey)[i] = ((uint32_t*)AES_key)[i]; // optimized version of the commented 4 lines below
		/*roundKey[(i * 4) + 0] = AES_key[(i * 4) + 0];
		roundKey[(i * 4) + 1] = AES_key[(i * 4) + 1];
		roundKey[(i * 4) + 2] = AES_key[(i * 4) + 2];
		roundKey[(i * 4) + 3] = AES_key[(i * 4) + 3];*/
	}

With these, as well as removing the AES_ctx structure (fixed key stored in ROM), I reduced the code size down to 684 bytes (including the key in ROM).

@paulmenzel
Copy link

Nice. It’d be great if you sent a merge/pull request.

@algrobman
Copy link

there are a few more cases where the code could be optimized for speed and maybe code size, assuming the buffers are DW aligned. the block structure can be represented by array of 2 uint64_t. uint64_t will be emulated with 32 bit operations on 32 bit machine and utilize power of 64 machine.

  1. memcpy - all data movement are limited to 16 bytes ( AES_BLOCKLEN):
void aes_memcpy(uint8_t *to, const uint8_t *from, size_t len){
// always copies 16 bytes 
    uint64_t *to64, *from64;
    from64 = (uint64_t *)from;
    to64   = (uint64_t *)to;
    to64[0] = from64[0];
    to64[1] = from64[1];
}
  1. XOR with Iv can be recoded as :
static void XorWithIv(uint8_t* buf, const uint8_t* Iv){
    uint64_t *buf_ptr, *Iv_ptr;
    buf_ptr = (uint64_t *) buf;
    Iv_ptr = (uint64_t *) Iv;
    buf_ptr[0] ^= Iv_ptr[0];
    buf_ptr[1] ^= Iv_ptr[1];
}
  1. finally : (there is no need to worry about 8 bit counters overflow and again Xor'ing can be speed up)
void AES_CTR_xcrypt_buffer(struct AES_ctx* ctx, uint8_t* buf, size_t length){

  uint8_t buffer[AES_BLOCKLEN];
  uint64_t *buffer64, *buf64;
  
  size_t i;
  int bi;

  buffer64 = (uint64_t *) buffer;
  buf64 = (uint64_t *) buf;
  
  for (i = 0; i < length; i += AES_BLOCKLEN){
      
      aes_memcpy(buffer, ctx->Iv, AES_BLOCKLEN);
      Cipher((state_t*)buffer,ctx->RoundKey);

      /* Increment Iv and handle overflow */
      for (bi = (AES_BLOCKLEN - 1); bi >= 0; --bi){
        ctx->Iv[bi]++;
        if(ctx->Iv[bi]) { break; }  
      }
 
      buf64[0] ^= buffer64[0];
      buf64[1] ^= buffer64[1];
      buf64 += 2;
  }
}

Probably 4/8 byte operations could be enforced more elegantly having a union in AES_ctx buffers with byte and DW arrays.

@battlesnake
Copy link

Perhaps types such as uint_fast8_t might be useful, to allow the compiler to choose the fastest type for the target platform, with a minimum width.

@kokke
Copy link
Owner

kokke commented Sep 17, 2024

Hi @Midronome 😄

First a HUGE thanks for making this and releasing it publically this way, really awesomely useful :))

You're very welcome! And thanks a lot, I really appreciate the kind words 😊 👍

It looks like you found the answer to your original question already (how to generate the s-box table) ?

there are a few more cases where the code could be optimized for speed and maybe code size

Yes definitely, both speed & size I'd say. There are some size optimizations easily available if you are willing to compromise on portability (e.g. targeting 32-bit MCUs) or flexibility (e.g. hard-coding key and IV).

W.r.t. speed, there is much that can be improved - this library is honestly quite slow compared to other implementations (e.g. with table-lookups) because of the focus on small binary size.

Nice. It’d be great if you sent a merge/pull request.

To be frank I would reject it if targeting main/master. I would perhaps be more positively inclined if the work was made available in a new branch instead.

A write-up / tutorial in a markdown-document would be much preferable IMO. I have been wanting a writeup of usage-examples for some times. Something that demonstrates various ways to use or (slightly) mold the code to various use-cases.

The main focus of this library, is to make AES available on machines where traditional implementations require too many resources (RAM/ROM). I originally wrote this library for an 8-bit processor many years ago, because I couldn't find anything I could fit on the chip 😆

@simntonic
Copy link
Author

Awesome :) I would love to hear about those specific optimization removing portability (32-bit MCUs only) and flexibility (hard-coding key + IV). Just like you, I have no concerns about execution speed, only about small binary size (actually in my case there is plenty RAM but ROM is extremely limited).
Thank you :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants