RT1170 enhancements #2865

Merged
merged 17 commits into from
Dec 18, 2024

Conversation

HiFiPhile
Collaborator

@HiFiPhile HiFiPhile commented Nov 2, 2024

Describe the PR

  • Replace cache clean/invalidate with an MPU configuration. Since we can't guarantee that buffer sizes are a multiple of the cache line size, cache clean/invalidate can cause data consistency issues; it also hurts performance. Use the MPU to set RAM as non-cacheable, as the mcux-sdk examples do.
  • Add M4 core RAM image build support
    • cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBOARD=mimxrt1170_evkb -DM4=1 -G Ninja -B rt1170_cm4
    • make BOARD=mimxrt1170_evkb M4=1

PS: Microchip has a nice write-up on cache coherency: https://ww1.microchip.com/downloads/en/DeviceDoc/Managing-Cache-Coherency-on-Cortex-M7-Based-MCUs-DS90003195A.pdf
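The consistency concern in the first bullet comes from by-address cache maintenance working on whole cache lines. A minimal host-side sketch, assuming the Cortex-M7's 32-byte D-cache line; `cache_span()` and `cache_op_spills()` are hypothetical helper names for illustration, not part of TinyUSB or the SDK:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u /* Cortex-M7 D-cache line size in bytes */

/* Address range actually affected by a by-address cache operation. */
typedef struct {
  uintptr_t start;
  size_t size;
} cache_span_t;

/* Widen [addr, addr + size) to whole cache lines. */
static cache_span_t cache_span(uintptr_t addr, size_t size) {
  const uintptr_t mask = ~(uintptr_t) (CACHE_LINE_SIZE - 1u);
  cache_span_t span;
  uintptr_t end = (addr + size + CACHE_LINE_SIZE - 1u) & mask;
  span.start = addr & mask;
  span.size = (size_t) (end - span.start);
  return span;
}

/* Non-zero if the operation would also touch bytes outside the buffer. */
static int cache_op_spills(uintptr_t addr, size_t size) {
  cache_span_t span = cache_span(addr, size);
  return span.start != addr || span.size != size;
}
```

For a 20-byte buffer at 0x20000010 the affected span is 64 bytes starting at 0x20000000, so an invalidate on that buffer could also discard unrelated data sharing those two lines. Only buffers that are both line-aligned and a multiple of 32 bytes avoid this.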

CFLAGS += \
-D__STARTUP_CLEAR_BSS \
-DCFG_TUSB_MCU=OPT_MCU_MIMXRT1XXX \
-DCFG_TUSB_MEM_SECTION='__attribute__((section("NonCacheable")))' \
Collaborator Author

On the M7 core, NonCacheable is located in DTCM, so there is no need to add an if switch.

@HiFiPhile
Collaborator Author

Hi @mastupristi,

I've managed to run the TinyUSB stack on the M4 core. The DMA controller inside the USB IP can't access the M4 core's TCM, so packet buffers must be placed in OCRAM. This is done with CFG_TUSB_MEM_SECTION=__attribute__((section("NonCacheable"))), a section defined by the default linker script.
The memory attributes of that section are configured by BOARD_ConfigMPU();
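As a rough illustration of what that macro does to a buffer declaration (a sketch, not the driver's actual buffers: `usb_packet_buf` and its size are hypothetical, and the `#ifndef` fallback exists only so the snippet compiles stand-alone with GCC or Clang on an ELF target):

```c
#include <stdint.h>

/* Normally supplied on the command line, as in the CFLAGS of this PR. */
#ifndef CFG_TUSB_MEM_SECTION
#define CFG_TUSB_MEM_SECTION __attribute__((section("NonCacheable")))
#endif

/* Hypothetical packet buffer. The linker script decides where the
 * "NonCacheable" output section lives, so the same source line can end up
 * in OCRAM on the M4 build and in DTCM on the M7 build. */
CFG_TUSB_MEM_SECTION static uint8_t usb_packet_buf[512] __attribute__((aligned(4)));
```

The attribute only names the section; whether that memory is really non-cacheable depends on the MPU setup mentioned above.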

@mastupristi

Hi @HiFiPhile

> I've managed to run the TinyUSB stack on the M4 core. The DMA controller inside the USB IP can't access the M4 core's TCM, so packet buffers must be placed in OCRAM. This is done with CFG_TUSB_MEM_SECTION=__attribute__((section("NonCacheable"))), a section defined by the default linker script. The memory attributes of that section are configured by BOARD_ConfigMPU();

Just wanted to share some great news: I ran an initial test (cdc_msc example) of your branch for the CM4 on the RT1170, and it worked like a charm! 🎉

Here is what my kernel says:
image

Thanks so much for putting this together so quickly. We’re thrilled with the progress and super grateful for your help.
My colleagues and I will also try to review the code soon.

@hathach
Owner

hathach commented Nov 27, 2024

As mentioned in esp32p4 cache, I would still want to keep the dcache clean/invalidate. IMO, it does not hurt performance but actually improves it (depending on the usage), since one of the main differences between the M7 and the M3/M4 is the cache (data + instruction). The M7 can run insanely fast (up to 1 GHz) and can perform lots of computation on data, e.g. video, before passing it to USB/DMA for transfer.
@HiFiPhile If you are busy, I can pull and make the change myself later on when I have time.

@HiFiPhile
Collaborator Author

> As mentioned in esp32p4 cache, I would still want to keep the dcache clean/invalidate. IMO, it does not hurt performance but actually improves it (depending on the usage).

I was thinking about adding back cache support for the M7 core but didn't have the time. The secondary M4 core uses a customized cache controller, which is more complicated.

> One of the main differences between the M7 and the M3/M4 is the cache (data + instruction). The M7 can run insanely fast (up to 1 GHz) and can perform lots of computation on data, e.g. video, before passing it to USB/DMA for transfer.

I did a little test based on audio_4_channel_mic example:

  • ICache and DCache are ON by default
  • Change the linker script to put data in OCRAM (default is DTCM)
  • Copy i2s_dummy_buffer to a cached location and clean the cache
  • Copy i2s_dummy_buffer to a non-cached location
  • Compare the cycle counter values
Code:

volatile uint32_t clock_cycles_counter;
volatile uint32_t *DWT_CYCCNT = (volatile uint32_t *)0xE0001004; // DWT cycle count register
volatile uint32_t *DWT_CONTROL = (volatile uint32_t *)0xE0001000; // DWT control register, bit 0 = CYCCNTENA
volatile uint32_t *SCB_DEMCR = (volatile uint32_t *)0xE000EDFC; // DEMCR, bit 24 = TRCENA (enables the DWT)

uint16_t i2s_dummy_buffer2[CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO][CFG_TUD_AUDIO_FUNC_1_N_CHANNELS_TX*CFG_TUD_AUDIO_FUNC_1_SAMPLE_RATE/1000/CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO];
CFG_TUSB_MEM_SECTION uint16_t i2s_dummy_buffer3[CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO][CFG_TUD_AUDIO_FUNC_1_N_CHANNELS_TX*CFG_TUD_AUDIO_FUNC_1_SAMPLE_RATE/1000/CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO];

  // in main()
  clock_cycles_counter = 0;
  *SCB_DEMCR = *SCB_DEMCR | 0x01000000;
  *DWT_CYCCNT = 0;
  *DWT_CONTROL |=  1;
  
  memcpy(i2s_dummy_buffer2, i2s_dummy_buffer, sizeof(i2s_dummy_buffer2));
  
  SCB_CleanDCache_by_Addr((uint32_t*)i2s_dummy_buffer2, sizeof(i2s_dummy_buffer2));
  *DWT_CONTROL &= ~1;
  clock_cycles_counter = *DWT_CYCCNT;
  
  clock_cycles_counter = 0;
  *SCB_DEMCR = *SCB_DEMCR | 0x01000000;
  *DWT_CYCCNT = 0;
  *DWT_CONTROL |=  1;
  
  memcpy(i2s_dummy_buffer3, i2s_dummy_buffer, sizeof(i2s_dummy_buffer3));
  
  *DWT_CONTROL &= ~1;
  clock_cycles_counter = *DWT_CYCCNT;
  
  // Test cached location again to ensure memcpy is in ICACHE
  clock_cycles_counter = 0;
  *SCB_DEMCR = *SCB_DEMCR | 0x01000000;
  *DWT_CYCCNT = 0;
  *DWT_CONTROL |=  1;
  
  memcpy(i2s_dummy_buffer2, i2s_dummy_buffer, sizeof(i2s_dummy_buffer2));
  
  SCB_CleanDCache_by_Addr((uint32_t*)i2s_dummy_buffer2, sizeof(i2s_dummy_buffer2));
  *DWT_CONTROL &= ~1;
  clock_cycles_counter = *DWT_CYCCNT;
  
  asm("bkpt 0x55");

Although OCRAM is only clocked at fCPU/4 (on STM32H7 it is fCPU/2), performance on the non-cached location is higher.

| Cached | Non-cached | Cached, 2nd loop |
|--------|------------|------------------|
| 1121   | 963        | 1036             |

Still, that is slow for copying 384 bytes; I expect ~500 cycles to copy 96 words, counting 4 cycles per word.
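For reference, the measured counts can be turned into a per-word cost. A small host-side sketch that uses only the numbers reported above (384 bytes, i.e. 96 32-bit words); `cycles_per_word()` is a hypothetical helper:

```c
#include <stdint.h>

/* Average cycles spent per 32-bit word for a copy of `bytes` bytes. */
static double cycles_per_word(uint32_t cycles, uint32_t bytes) {
  return (double) cycles / ((double) bytes / 4.0);
}
```

With the figures above this gives roughly 11.7 cycles per word for the cached copy plus clean (1121 / 96), 10.0 for the non-cached copy (963 / 96) and 10.8 for the second cached run (1036 / 96), against the 4 cycles per word assumed in the estimate.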

@hathach
Owner

hathach commented Nov 28, 2024

@HiFiPhile thanks for the detailed test result, though this may not reflect all usage. Dcache clean/invalidate, like any solution, does introduce overhead. But in a scenario where the user needs to do heavy computation on memory, such as encrypting a large block of bytes and/or lots of video/DSP processing, the benefit of the cache would outweigh that overhead. As a general rule of thumb in the CPU world, I still think more cache is better/faster.

@HiFiPhile
Collaborator Author

Anyway, for most applications TCM is used for the buffers, as in the default linker script.

In a later test with both buffers inside TCM I got a result of less than 500 cycles, while copying from OCRAM without cache clean still cost 900 cycles.
I'm curious to find out why the cache is slower than TCM in this case.

@hathach
Owner

hathach commented Nov 28, 2024

> Anyway, for most applications TCM is used for the buffers, as in the default linker script.
>
> In a later test with both buffers inside TCM I got a result of less than 500 cycles, while copying from OCRAM without cache clean still cost 900 cycles. I'm curious to find out why the cache is slower than TCM in this case.

I have no idea, though; tbh I am new to dcache as well. This MPU configuration is also a bit complicated for me :)

Signed-off-by: HiFiPhile <[email protected]>
@mastupristi

@HiFiPhile I found erratum ERR050396, which describes what needs to be done if you intend to use the M7's TCM as the destination for USB writes.
I couldn't tell whether this is already done or not.

image

Signed-off-by: HiFiPhile <[email protected]>
@HiFiPhile
Collaborator Author

> @HiFiPhile I found erratum ERR050396, which describes what needs to be done if you intend to use the M7's TCM as the destination for USB writes. I couldn't tell whether this is already done or not.

Good to know. It's not handled in the dcd driver, but a quick search shows it's included in kSDK:
https://github.com/hathach/mcux-sdk/blob/e8902892850385d8fb99e01b785852df4691604b/devices/MIMXRT1176/system_MIMXRT1176_cm7.c#L125
https://github.com/hathach/mcux-sdk/blob/e8902892850385d8fb99e01b785852df4691604b/devices/MIMXRT1176/system_MIMXRT1176_cm4.c#L143

Signed-off-by: HiFiPhile <[email protected]>
Signed-off-by: HiFiPhile <[email protected]>
@HiFiPhile
Collaborator Author

HiFiPhile commented Nov 30, 2024

@hathach I've restored M7 cache support and also updated audio class.

PS: I did a more elaborate benchmark and posted my question: https://community.nxp.com/t5/i-MX-Processors/RT1170-question-about-memcpy-benchmark/m-p/2004350#M231361

@hathach
Owner

hathach commented Dec 6, 2024

> @hathach I've restored M7 cache support and also updated audio class.
>
> PS: I did a more elaborate benchmark and posted my question: https://community.nxp.com/t5/i-MX-Processors/RT1170-question-about-memcpy-benchmark/m-p/2004350#M231361

thank you, will review asap

#endif
}

/* MPU configuration. */
#if __CORTEX_M == 7
static void BOARD_ConfigMPU(void) {
Owner

@HiFiPhile does this BOARD_ConfigMPU() disable cached memory for the M7? I actually prefer to have the cache enabled for imxrt by default (CFG_TUD/H_MEM_DCACHE_ENABLE=1), since that is what most users of this chip would do. Having it enabled in the examples also makes sure we test with cacheable memory.

Collaborator Author

BOARD_ConfigMPU() is copied from ksdk; it enables DCache and ICache.
If you want to test the examples on cached OCRAM you can simply modify the linker script; the default linker script from ksdk uses DTCM.

Owner

@hathach hathach Dec 17, 2024

Thanks @HiFiPhile for the info, that sounds good. I am currently messing around with the mcux config tool for the MPU; it is too complicated. Maybe I will do that in a follow-up PR to test the cache function later.

image

@hathach
Owner

hathach commented Dec 17, 2024

I tried to compile and flash with both cmake and make with M4=1, but somehow flashing doesn't seem to get the MCU running at all. It will only run the last CM7 binary. If I erase the chip and only flash the CM4 image, the MCU won't boot at all. Maybe I am missing something.

@HiFiPhile
Collaborator Author

> I tried to compile and flash with both cmake and make with M4=1, but somehow flashing doesn't seem to get the MCU running at all. It will only run the last CM7 binary. If I erase the chip and only flash the CM4 image, the MCU won't boot at all. Maybe I am missing something.

Hmm... What happens if you erase the chip, then load the binary in J-Link Commander and go?

@hathach
Owner

hathach commented Dec 17, 2024

> > I tried to compile and flash with both cmake and make with M4=1, but somehow flashing doesn't seem to get the MCU running at all. It will only run the last CM7 binary. If I erase the chip and only flash the CM4 image, the MCU won't boot at all. Maybe I am missing something.
>
> Hmm... What happens if you erase the chip, then load the binary in J-Link Commander and go?

Here is my log. Even though J-Link reports that loading the file is OK, it may not be: normally when erasing/flashing there is a progress window pop-up, and that only appears with the normal M7 build. I guess it is not actually flashed at all. I tried with both cmake and make, and tried flashing with pyocd as well. Not sure what I am missing.

make + pyocd
msc_dual_lun$ make BOARD=mimxrt1170_evkb M4=1 flash
pyocd flash -t mimxrt1170_cm4  _build/mimxrt1170_evkb/msc_dual_lun.hex
0001380 I Loading /home/hathach/code/tinyusb/examples/device/msc_dual_lun/_build/mimxrt1170_evkb/msc_dual_lun.hex [load_cmd]
[==================================================] 100%
0002907 I Erased 0 bytes (0 sectors), programmed 47964 bytes (0 pages), skipped 0 bytes (0 pages) at 31.20 kB/s [loader]
#pyocd reset -t mimxrt1170_cm4
cmake + jlink
 JLinkExe -device MIMXRT1176xxxA_M4 -if swd -JTAGConf -1,-1 -speed auto -CommandFile /home/hathach/code/tinyusb/examples/cmake-build-mimxrt1170_evkb_cm4/device/msc_dual_lun/msc_dual_lun.jlink
SEGGER J-Link Commander V8.10l (Compiled Dec 11 2024 16:08:39)
DLL version V8.10l, compiled Dec 11 2024 16:07:04


J-Link Command File read successfully.
Processing script file...
J-Link>halt
J-Link connection not established yet but required for command.
Connecting to J-Link via USB...O.K.
Firmware: J-Trace PRO V2 Cortex-M compiled Dec  4 2024 17:58:04
Hardware version: V2.00
J-Link uptime (since boot): 0d 04h 46m 00s
S/N: 752001685
License(s): RDI, FlashBP, FlashDL, JFlash, GDB
USB speed mode: Super speed (5 GBit/s)
IP-Addr: DHCP (no addr. received yet)
Emulator has RAWTRACE capability
VTref=3.288V
Target connection not established yet but required for command.
Device "MIMXRT1176XXXA_M4" selected.


Connecting to target via SWD
InitTarget() start
InitTarget() end - Took 801ms
Found SW-DP with ID 0x6BA02477
Failed to power up DAP
InitTarget() start
Cortex-M4 not released yet. Preparing spin code in RAM @ 0x20200000
InitTarget() end - Took 5.89ms
Found SW-DP with ID 0x6BA02477
DPIDR: 0x6BA02477
CoreSight SoC-400 or earlier
AP map detection skipped. Manually configured AP map found.
AP[0]: AHB-AP (IDR: Not set, ADDR: 0x00000000)
AP[1]: AHB-AP (IDR: Not set, ADDR: 0x00000000)
AP[2]: APB-AP (IDR: Not set, ADDR: 0x00000000)
AP[1]: Skipped. Could not read CPUID register
Attach to CPU failed. Executing connect under reset.
Failed to power up DAP
Error occurred: Could not connect to the target device.
For troubleshooting steps visit: https://wiki.segger.com/J-Link_Troubleshooting
J-Link>loadfile /home/hathach/code/tinyusb/examples/cmake-build-mimxrt1170_evkb_cm4/device/msc_dual_lun/msc_dual_lun.elf
Target connection not established yet but required for command.
Device "MIMXRT1176XXXA_M4" selected.


Connecting to target via SWD
InitTarget() start
Cortex-M4 not released yet. Preparing spin code in RAM @ 0x20200000
InitTarget() end - Took 7.00ms
Found SW-DP with ID 0x6BA02477
DPIDR: 0x6BA02477
CoreSight SoC-400 or earlier
AP map detection skipped. Manually configured AP map found.
AP[0]: AHB-AP (IDR: Not set, ADDR: 0x00000000)
AP[1]: AHB-AP (IDR: Not set, ADDR: 0x00000000)
AP[2]: APB-AP (IDR: Not set, ADDR: 0x00000000)
AP[1]: Core found
AP[1]: AHB-AP ROM base: 0xE00FF000
CPUID register: 0x410FC241. Implementer code: 0x41 (ARM)
Found Cortex-M4 r0p1, Little endian.
FPUnit: 6 code (BP) slots and 2 literal slots
CoreSight components:
ROMTbl[0] @ E00FF000
[0][0]: E000E000 CID B105E00D PID 000BB00C SCS-M7
[0][1]: E0001000 CID B105E00D PID 003BB002 DWT
[0][2]: E0002000 CID B105E00D PID 002BB003 FPB
[0][3]: E0000000 CID B105E00D PID 003BB001 ITM
[0][5]: E0041000 CID B105900D PID 000BB925 ETM
[0][7]: E0043000 CID B105900D PID 001BB908 CSTF
[0][8]: E0042000 CID B105900D PID 005BB906 CTI
Memory zones:
  Zone: "Default" Description: Default access mode
Cortex-M4 identified.
'loadfile': Performing implicit reset & halt of MCU.
ResetTarget() start
HandleBeforeMemAccessWrite() start
HandleBeforeMemAccessWrite() end - Took 232us
ResetTarget() end - Took 673us
Device specific reset executed.
Downloading file [/home/hathach/code/tinyusb/examples/cmake-build-mimxrt1170_evkb_cm4/device/msc_dual_lun/msc_dual_lun.elf]...
O.K.
J-Link>r
Reset delay: 0 ms
ResetTarget() start
ResetTarget() end - Took 974us
Device specific reset executed.
J-Link>go
Memory map 'after startup completion point' is active
J-Link>exit

Script processing completed.


Build finished

@HiFiPhile
Collaborator Author

> normally when erasing/flashing there is a progress window pop-up, and that only appears with the normal M7 build. I guess it is not actually flashed at all

Since the M4 image is linked as a RAM image, maybe the popup is only displayed during ROM flashing. I'll take a look later.

@hathach
Owner

hathach commented Dec 17, 2024

> > normally when erasing/flashing there is a progress window pop-up, and that only appears with the normal M7 build. I guess it is not actually flashed at all
>
> Since the M4 image is linked as a RAM image, maybe the popup is only displayed during ROM flashing. I'll take a look later.

Oh, it runs from SRAM? I missed that. Do I need any code running on the M7 first, e.g. blinky, or to change any boot switch?

@HiFiPhile
Collaborator Author

> Oh, it runs from SRAM? I missed that. Do I need any code running on the M7 first, e.g. blinky, or to change any boot switch?

The M4 runs the same init code as the M7. I remember I put the M7 core in a while loop without doing anything, but I expected erasing the flash to do the same thing...

@HiFiPhile
Collaborator Author

In fact it's more complicated to run the M4 core:

  • 1st way: burn the fuse to set the M4 core as the startup core instead of the M7, which is irreversible.

  • 2nd way:

  1. Put the M7 core in a while loop. I tried simply erasing the flash, but then the M7 core enters the ROM bootloader, shows up as the device NXP SEMICONDUCTORS SE Blank RT Family - HID, and interferes with the M4 core.
  2. I suspect J-Link is buggy and can't reset the M4 core correctly: after loading the file with loadfile rt1170_cm4/cdc_dual_ports.hex, I have to manually set the PC to Reset_Handler with wreg "R15 (PC)",0x1ffe0400. Since Reset_Handler's address can differ between builds, it can't be hard-coded.
  3. Now the program will run with go.
    The full command sequence is:
connect
loadfile rt1170_cm4/cdc_dual_ports.hex
wreg "R15 (PC)",0x1ffe0400
go
  • 3rd way: load the .elf in the Segger Ozone debugger, which can figure it out by itself. I think it puts the M7 core in a loop.
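Because the Reset_Handler address is not fixed, it may help that it can be read back from the image itself: on Cortex-M, the first word of the vector table is the initial stack pointer and the second word is the reset vector with the Thumb bit set. A minimal sketch, assuming the RAM image starts with its vector table; `cm_boot_info()` is a hypothetical helper, and the stack pointer value in the note below is made up:

```c
#include <stdint.h>

/* First two entries of a Cortex-M vector table. */
typedef struct {
  uint32_t initial_sp;
  uint32_t reset_handler; /* entry address with the Thumb bit cleared */
} cm_boot_info_t;

/* Read a little-endian 32-bit word from a byte buffer. */
static uint32_t read_le32(const uint8_t *p) {
  return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
         ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Decode the first 8 bytes of the vector table. */
static cm_boot_info_t cm_boot_info(const uint8_t vectors[8]) {
  cm_boot_info_t info;
  info.initial_sp = read_le32(vectors);
  info.reset_handler = read_le32(vectors + 4) & ~(uint32_t) 1u;
  return info;
}
```

Feeding it the bytes 00 00 02 20 01 04 fe 1f yields a reset handler of 0x1ffe0400, the value used in the wreg command above.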

@hathach
Owner

hathach commented Dec 18, 2024

> In fact it's more complicated to run the M4 core:
>
>   • 1st way: burn the fuse to set the M4 core as the startup core instead of the M7, which is irreversible.
>   • 2nd way:
>   1. Put the M7 core in a while loop. I tried simply erasing the flash, but then the M7 core enters the ROM bootloader, shows up as the device NXP SEMICONDUCTORS SE Blank RT Family - HID, and interferes with the M4 core.
>   2. I suspect J-Link is buggy and can't reset the M4 core correctly: after loading the file with loadfile rt1170_cm4/cdc_dual_ports.hex, I have to manually set the PC to Reset_Handler with wreg "R15 (PC)",0x1ffe0400. Since Reset_Handler's address can differ between builds, it can't be hard-coded.
>   3. Now the program will run with go.
>     The full command sequence is:
> connect
> loadfile rt1170_cm4/cdc_dual_ports.hex
> wreg "R15 (PC)",0x1ffe0400
> go
>   • 3rd way: load the .elf in the Segger Ozone debugger, which can figure it out by itself. I think it puts the M7 core in a loop.

Ah, thank you very much for the detailed explanation. I found using Ozone much easier and more straightforward. I use it occasionally for ETM tracing as well, and it works perfectly.

Owner

@hathach hathach left a comment

perfect, thank you very much for this great work.

@hathach hathach merged commit 7c1afa8 into hathach:master Dec 18, 2024
107 checks passed