Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lednew #16

Open
wants to merge 27 commits into
base: main
Choose a base branch
from
Open

Lednew #16

wants to merge 27 commits into from

Conversation

osamuaoki
Copy link
Collaborator

Hi,

You updated source on your led_feature_support branch.

If user enable LED for promicro, binary size will be 514.

I rebased lednew branch to your led_feature_support branch and applied few more size optimization Sergey Vlasov [email protected] was doing in his testing branch. Yes, 12 bytes reduction!

I cherry picked them with some minor trivial touch up to make them cleanly applicable.

Binary size:

  • 494 bytes (as is)
  • 500 bytes (enable LED support with "LED_ACTIVE_LEVEL 1" (Leonardo, Nano, Teensy 2.0-type)
  • 502 bytes (enable LED support with "LED_ACTIVE_LEVEL 0" (Promicro-type)

I also updated README.md accordingly.

@osamuaoki osamuaoki mentioned this pull request Feb 5, 2022
@osamuaoki
Copy link
Collaborator Author

I included more optimization from Rodorigo's sites.

I see more possibilities around

  • UNHANDLED_SETUP_REQUEST
  • UNHANDLED_SETUP_REQUEST_1
  • fallthrough to SET_HID_REPORT

Consideration of br** command shorter than rjmp is a factor.

Since these needs to be checked in assembled code log, I am not yet merging them. (Besides they tends to cause merge conflicts)

@osamuaoki
Copy link
Collaborator Author

I now included all code shrinks. (More than enough for LED).

Mostly Sergey's idea but I made 6 bytes shrink using Y+ for WDT.

No rush except one of any patches to get under 512 for any use case.

@NicoHood
Copy link

NicoHood commented Feb 5, 2022

Guys, I love the passion that you have working on optimizing this bootloader. It is such a pleasure to (silently) look at how it improves. Great!

@osamuaoki
Copy link
Collaborator Author

Sorry ... I had a stupid mistake. I used "git push -f".

If you ever pulled today before reading this, please clone fresh.

@osamuaoki
Copy link
Collaborator Author

I think I covered every optimization I did so far by now. Many of them are borrowed from Sergey.

Since this series of changes were extensive, there may be some errors in my merge operations. Please check before accepting all. At least this is a linear history.

  • 468 bytes (default) -- no LED
  • 474 bytes (opt-in) -- enable LED support with "LED_ACTIVE_LEVEL 1" (Leonardo, Nano, Teensy 2.0-type)
  • 476 bytes (opt-in) -- enable LED support with "LED_ACTIVE_LEVEL 0" (Promicro-type)

Cheers!

@osamuaoki
Copy link
Collaborator Author

I tested on Teensy-type board and lednew did not work.

Oops, my last significant commit had a bug of dropping EP_ISR_END:. No wonder ...

  • 96d5453 ("Save 6 bytes by refactoring UNHANDLED_SETUP_REQUEST usage", 2022-02-05)

I will update tree with forth push.

@osamuaoki
Copy link
Collaborator Author

Hi,

Tested with Teensy compatible PCB. It works.

I used string-descriptors branch from https://github.com/sigprof/nanoBoot/
I had to drop sigprof's following size optimization since it caused problem.

  • 6bc21dd ("Save 6 bytes by refactoring UNHANDLED_SETUP_REQUEST usage", 2021-09-04)

Product name seems to use 2 bytes per character. So I trimmed product name to "nanoBt" (4 bytes savings)

Now all platform (including promicro type LED) should fit into 512 bytes.

I think this is the best we can do for 512 bytes.

sigprof and others added 19 commits February 17, 2022 01:08
Instead of looping until one of two bits in UEINTX is set, and then
retesting the value after the loop, just do `reti` inside the loop if
`RXSTPI` is set.  This saves 4 bytes by removing both a duplicate bit
test instruction and an extra jump instruction.
The `wait_TXINI` and `wait_RXOUTI` subroutines were always called after
a call to some other `clear_XXX` subroutine to clear a bit in `UEINTX`.
Replace those two subroutines with `clear_bit_and_wait_TXINI` and
`clear_bit_and_wait_RXOUTI`, which get the bit to be cleared as a
parameter in `r17`, and then inline the remaining `clear_XXX`
subroutines (a subroutine with just 2 instructions in its body actually
takes more space than inline code if it is called less than 3 times).
Rewrite the part of the GET_DESCRIPTOR handling code that loads the
address and length of the requested descriptor to use less jumps; the
resulting code consumes 2 bytes less, even though it is actually more
correct (no longer replies with some descriptor to requests for an
unknown descriptor type).  The new code also avoids hardcoding the high
address byte, and no longer depends on the fact that all descriptors
have the same high address byte (but it depends on the fact that all
offsets between adjacent descriptors can fit into the `adiw` constant
argument (0...63)).
The original LUFA code performed `UDCON |= (1 << DETACH);` to detach the
USB device; however, the other bits of `UDCON` are known to be 0 at this
time, therefore a simple write of a constant value could be performed
here.  In addition, the value of `(1 << DETACH)` is 1 on all AVR chips
that could be potentially supported by this code, therefore even the
instruction to load the constant value into a register could be omitted.
Doing these changes removes 2 instructions, saving 4 bytes.
The initialization of USBCON does not need to read the current register
value - just writing the reset value into that register is enough.  This
removes one instruction, saving 2 bytes.

In theory this write could even be omitted completely, saving 4 more
bytes, but this would make the code less robust if the application code
attempting to enter the bootloader does not initialize some USB
controller registers properly, therefore leaving that USBCON write there
is safer.
The instruction which applied the mask of `(CONTROL_REQTYPE_TYPE |
CONTROL_REQTYPE_RECIPIENT)` to bmRequestType was not actually needed,
because the value of this mask is 0x7F, and the only bit which is not
covered by the mask (0x80) is already known to be 0, therefore the
masking did not actually change anything.  Removing this instruction
saves 2 bytes.

Signed-off-by: Sergey Vlasov <[email protected]>

Since following commits are skipped previously when cherry-picking.
  97c8d96 ("TEMPORARY: EEPROM code which does not fit", 2021-08-25)
  e48801a ("WIP: Optimize some more code to make the EEPROM support fit", 2021-08-25)
  78b804d ("WIP: Redo the EEPROM write implementation", 2021-08-29)
  e7e94f3 ("Save 2 bytes by reusing the TXINI clearing mask", 2021-08-29)

Adjusted to keep branch to thunk code.

Signed-off-by: Osamu Aoki <[email protected]>
The DEVICE_TO_HOST handling code needs to handle only the GET_DESCRIPTOR
requests, which may come with two possible bmRequestType values:

  - 0x80 (IN direction, USB Standard Request, Recipient is the device) if
    a descriptor which applies to the device as a whole is requested (this
    code is used for the device and configuration descriptors);

  - 0x81 (IN direction, USB Standard Request, Recipient is the interface)
    if a descriptor specific to a particular interface is requested (this
    code is used for the HID class and HID report descriptors).

Because these codes are numerically sequential, and the direction bit has
already been tested (therefore it is known that bmRequestType >= 0x80), it
is enough to make a single comparison with 0x82 to detemine whether
bmRequestType has one of the above values.  Removing the bit masking
operation which is not needed after that change saves 2 bytes.
Signed-off-by: Osamu Aoki <[email protected]>
Access code to data in the extended IO address range can be made compact
by using Y+.  Both WDT and USB registers are in extended IO address.
Fortunately, calls to these 2 types of addresses do not overlap.

 * initial code calls WDT
 * main code calls USB
 * exit code calls WDT

This patch enables to use Y+ for WDT in addtion to USB.

 * YH is common to USB and WDT and 0 (very first).
 * YL is different.  It's initialization is moved to each subroutine to
   avoid code duplication.
   * inside of the set_watchdog_timer subroutine
   * USB Initialization section

Signed-off-by: Osamu Aoki <[email protected]>
Signed-off-by: Osamu Aoki <[email protected]>
No size or behavior changes, just a preparation to save some bytes with
fallthrough.

Cherry picked from Sergey's repo

LED code properly moved.

Signed-off-by: Osamu Aoki <[email protected]>
Restructure conditional jumps to use fallthrough for the
`SET_HID_REPORT` case; this removes one jump instruction, saving 2
bytes.

Cherry picked from Sergey Vlasov <[email protected]>:
a358e89 ("Save 2 bytes by using fallthrough for SET_HID_REPORT", 2021-08-24)

Comment by Osamu (conflict resolution)

UNHANDLED_SETUP_REQUEST_1 was used since br command PC(word) offset is 6
bit

    offset      address
 00             00007ede first call ending up at UNHANDLED_SETUP_REQUEST
                This requires thunk to reach with br** command

 4A  00         00007f28 old UNHANDLED_DEVICE_TO_HOST entry ending up at UNHANDLED_SETUP_REQUEST
 5E  14  00     00007f3c UNHANDLED_SETUP_REQUEST_1    entry ending up at UNHANDLED_SETUP_REQUEST
!B8  6E  5A     00007f96 UNHANDLED_SETUP_REQUEST      entry

    br** commands
       6 bit offset PC offset (64)
       +/- 2^7 byte offset +7E, -80 (126)

    rjmp command
      11 bit offset PC offset (2K)
      +/- 2^12 byte offset +FFE, -1000 (4094)

Signed-off-by: Osamu Aoki <[email protected]>
Due to reordering of code, UNHANDLED_SETUP_REQUEST is within scope of
direct jump from br** command now.

00007f3e <DEVICE_TO_HOST>:
      ;  - 0x80 - IN Type Request, USB Standard Request, Recipient is the device
      ;  - 0x81 - IN Type Request, USB Standard Request, Recipient is the interface
      ; At this step it is known that bmRequestType >= 0x80, therefore checking for bmRequestType < 0x82
      ; is enough to detect whether bmRequestType has one of the above values.

      cpi         reg_bmRequestType, 0x82       ; Check whether bmRequestType is less than 0x82 (then it must be either 0x80 or 0x81)
    7f3e:       22 38           cpi     r18, 0x82       ; 130
      brcc        UNHANDLED_SETUP_REQUEST       ; If bmRequestType >= 0x82, this request type is not handled here (it's not a GET_DESCRIPTOR request)
    7f40:       50 f5           brcc    .+84            ; 0x7f96 <UNHANDLED_SETUP_REQUEST>

      cpi         reg_bRequest, 0x06            ; Compare bRequest with value 0x06 (REQ_GetDescriptor)
    7f42:       36 30           cpi     r19, 0x06       ; 6
      brne        UNHANDLED_SETUP_REQUEST       ; jump to UNHANDLED_SETUP_REQUEST through f not equal
    7f44:       41 f5           brne    .+80            ; 0x7f96 <UNHANDLED_SETUP_REQUEST>

Signed-off-by: Osamu Aoki <[email protected]>
Signed-off-by: Osamu Aoki <[email protected]>
br** command has relatively narrow jump range.  Since
UNHANDLED_SETUP_REQUEST is frequent entry point, moving it up to keep it
at the center of code makes sense.

This enables to drop thunk.

Signed-off-by: Osamu Aoki <[email protected]>
No more need to call through its thunk

Signed-off-by: Osamu Aoki <[email protected]>
osamuaoki and others added 8 commits February 17, 2022 01:08
There is no need to turn off LED
Write program message gives the completion confirmation.

Signed-off-by: Osamu Aoki <[email protected]>
Borrowed from from 99c4dbf ("WIP: String descriptor implementation", 2021-09-04)

I skipped size optimization in USB signal handling since it was
problematic.  If no LED are used, this is good enough.

Signed-off-by: Osamu Aoki <[email protected]>
Although we can further reduce size by skipping cli and some clr
commands if we only care about power-on reset or real hardware reset.

But if bootloader is called by long jump, this may not be safe thing to
do.  So reactivating this.

Signed-off-by: Osamu Aoki <[email protected]>
This makes it possible to compile with opt-in features without touching
the source code.

Signed-off-by: Osamu Aoki <[email protected]>
/NUL certainly breaks on non-Windows system and should never be here
NUL is ugly since it leaves NUL on other platform

I like /dev/null better, but just in case, I am taking safer option.

This should be the least invasive change to reduce noise on non-Windows
platform without breaking Windows environment.

Signed-off-by: Osamu Aoki <[email protected]>
Signed-off-by: Osamu Aoki <[email protected]>
@osamuaoki
Copy link
Collaborator Author

osamuaoki commented Feb 16, 2022

Hi,

Thanks.

The USB device name thing is working for all platform types and within 512 bytes.
Now this branch is mostly code shrinks.
The last byte was squeezed by using shorter device name "nanoBt". (4 bytes saved).
Rebasing on top of your new main was clean and I just pushed it.

Before the rebase, this was tested on my hardware. So this should be in good shape.

I think this is as much we can do for nanoBoot.

Osamu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants