Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

microblaze: Fix -Os right shift optimization is allowed into delay slot #37

Open
wants to merge 1 commit into
base: zephyr-gcc-12.2.0
Choose a base branch
from

Conversation

alpsayin
Copy link

@alpsayin alpsayin commented Oct 18, 2024

During picolibc testing, it's found that -Os produces assembly code that compiler
squeezes into a single delay slot. Thus, only the first instruction emitted by this optimization is run and the rest is skipped.
Optimization is generated by

  [(set (match_operand:SI 0 "register_operand" "=&d")
       (ashiftrt:SI (match_operand:SI 1 "register_operand"  "d")
                   (match_operand:SI 2 "immediate_operand" "I")))]
  "(INTVAL (operands[2]) > 5 && optimize_size)"
  {
    operands[3] = gen_rtx_REG (SImode, MB_ABI_ASM_TEMP_REGNUM);

    output_asm_insn ("ori\t%3,r0,%2", operands);
    if (REGNO (operands[0]) != REGNO (operands[1]))
        output_asm_insn ("addk\t%0,%1,r0", operands);

    output_asm_insn ("addik\t%3,%3,-1", operands);
    output_asm_insn ("bneid\t%3,.-4", operands);
    return "sra\t%0,%0";
  }
  [(set_attr "type"    "arith")
  (set_attr "mode"    "SI")
  (set_attr "length"  "20")]

But arith type is NOT disallowed from going into delay slot (see below):

[(and (eq_attr "type" "!branch,call,jump,icmp,multi,no_delay_arith,no_delay_load,no_delay_store,no_delay_imul,no_delay_move,darith")

"Optimized" code is between [191b8-191c8]

   191a8:	bc830194 	bgti	r3, 404		// 1933c
    if (subnormal_y) { /* subnormal y */
   191ac:	b0007ff0 	imm	32752
   191b0:	a47c0000 	andi	r3, r28, 0
   191b4:	be2301ec 	bneid	r3, 492		// 193a0
   191b8:	a2400014 	ori	r18, r0, 20
   191bc:	131e0000 	addk	r24, r30, r0
   191c0:	3252ffff 	addik	r18, r18, -1
   191c4:	be32fffc 	bneid	r18, -4		// 191c0
   191c8:	93180001 	sra	r24, r24
...
        iy = (hy >> 20) - 1023;
   193a0:	b810fe40 	brid	-448		// 191e0
   193a4:	3318fc01 	addik	r24, r24, -1023

where operands are:

operands[0] = r24
operands[1] = r29
operands[2] = 20
operands[3] = r18

As a result, this code returns a iy (r24) value of (whatever was in r24) - 1023`

The fix is simple. I've redeclared this size-optimization as multi which is

  1. Not delay-slot allowed
  2. Also the same type for other shift optimizations (they're left shift optimizations)
  3. [(set_attr "type" "multi")
  4. [(set_attr "type" "multi")

For context, non-size-optimized code is N copies of:

sra	r24, r24 /* arithmetic 1-bit shift r24 and put result into r24 */

And as per the above optimization rule, if N <= 5 we'll still get 5 copies of sra instruction.

Currently under test via zephyrproject-rtos/sdk-ng#647

@alpsayin
Copy link
Author

alpsayin commented Oct 19, 2024

Fix seems to have worked, new disassembly as below:

previously now
if (subnormal_y) { /* subnormal y */ if (subnormal_y) { /* subnormal y */
191ac: b0007ff0 imm 32752 19198: b0007ff0 imm 32752
191b0: a47c0000 andi r3, r28, 0 1919c: a47c0000 andi r3, r28, 0
191b4: be2301ec bneid r3, 492 // 193a0 191a0: bc2301d8 bnei r3, 472 // 19378
/* jump to else but also somehow loop 20x single bit right shift */ /* jump to else no delay slot */
191b8: a2400014 ori r18, r0, 20
191bc: 131e0000 addk r24, r30, r0
191c0: 3252ffff addik r18, r18, -1
191c4: be32fffc bneid r18, -4 // 191c0
191c8: 93180001 sra r24, r24

After the jump

previously now
iy = (hy >> 20) - 1023; iy = (hy >> 20) - 1023;
/*jump to whatever but with delay-slot */ /* loop 20: single bit right shift */
193a0: b810fe40 brid -448 // 191e0 19378: a2400014 ori r18, r0, 20
193a4: 3318fc01 addik r24, r24, -1023 1937c: 131e0000 addk r24, r30, r0
/* -1023 executed and branched */ 19380: 3252ffff addik r18, r18, -1
19384: be32fffc bneid r18, -4 // 19380
19388: 93180001 sra r24, r24
1938c: b810fe2c brid -468 // 191b8
19390: 3318fc01 addik r24, r24, -1023

CC @keith-packard you can grab a copy of the toolchain from https://github.com/zephyrproject-rtos/sdk-ng/actions/runs/11407925329?pr=647 if it's of any interest.

p.s. new disassembly again but monospace cuz it hurts my eyes:

   19194:	bc830180 	bgti	r3, 384		// 19314
    if (subnormal_y) { /* subnormal y */
   19198:	b0007ff0 	imm	32752
   1919c:	a47c0000 	andi	r3, r28, 0
   191a0:	bc2301d8 	bnei	r3, 472		// 19378
...
        iy = (hy >> 20) - 1023;
   19378:	a2400014 	ori	r18, r0, 20
   1937c:	131e0000 	addk	r24, r30, r0
   19380:	3252ffff 	addik	r18, r18, -1
   19384:	be32fffc 	bneid	r18, -4		// 19380
   19388:	93180001 	sra	r24, r24
   1938c:	b810fe2c 	brid	-468		// 191b8
   19390:	3318fc01 	addik	r24, r24, -1023

compared to:

   191a8:	bc830194 	bgti	r3, 404		// 1933c
    if (subnormal_y) { /* subnormal y */
   191ac:	b0007ff0 	imm	32752
   191b0:	a47c0000 	andi	r3, r28, 0
   191b4:	be2301ec 	bneid	r3, 492		// 193a0
   191b8:	a2400014 	ori	r18, r0, 20
   191bc:	131e0000 	addk	r24, r30, r0
   191c0:	3252ffff 	addik	r18, r18, -1
   191c4:	be32fffc 	bneid	r18, -4		// 191c0
   191c8:	93180001 	sra	r24, r24
...
        iy = (hy >> 20) - 1023;
   193a0:	b810fe40 	brid	-448		// 191e0
   193a4:	3318fc01 	addik	r24, r24, -1023

During picolibc testing, it's found that `-Os` produces assembly code that
compiler squeezes into a single delay slot. Thus, only the first
instruction (the one in delay slot) emitted by this optimization is
executed and the rest is skipped.

Signed-off-by: Alp Sayin <[email protected]>
@alpsayin alpsayin force-pushed the zephyr-gcc-12.2.0-bad-Os-shift-optimisation branch from 0e3007a to 815ae62 Compare October 19, 2024 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant