RT1170 enhancements #2865
CFLAGS += \
  -D__STARTUP_CLEAR_BSS \
  -DCFG_TUSB_MCU=OPT_MCU_MIMXRT1XXX \
  -DCFG_TUSB_MEM_SECTION='__attribute__((section("NonCacheable")))' \
On the M7 core, NonCacheable is located in DTCM, so there is no need to add an if switch.
Hi @mastupristi, I've managed to run the TinyUSB stack on the M4 core. The DMA controller inside the USB IP can't access the M4 core's TCM, so the packet buffer must be placed in OCRAM; it's done by
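For reference, a minimal sketch of how a buffer ends up in a dedicated section through a macro like CFG_TUSB_MEM_SECTION (GCC/Clang, ELF targets). The section name NonCacheable comes from the CFLAGS above; the buffer name, its size, and the accessor function are made up for illustration. Where that section actually lives (DTCM on M7, OCRAM on M4) is decided by the linker script, not by this code.

```c
#include <stdint.h>
#include <stddef.h>

/* Same definition as the one passed via CFLAGS in this PR. */
#ifndef CFG_TUSB_MEM_SECTION
#define CFG_TUSB_MEM_SECTION __attribute__((section("NonCacheable")))
#endif

/* Hypothetical packet buffer: name and size are for illustration only. */
#define EXAMPLE_EP_BUF_SIZE 512u

CFG_TUSB_MEM_SECTION static uint8_t example_ep_buf[EXAMPLE_EP_BUF_SIZE];

/* Returns the buffer a (hypothetical) application would hand to the USB stack. */
uint8_t *example_get_ep_buf(size_t *size) {
  if (size != NULL) {
    *size = sizeof(example_ep_buf);
  }
  return example_ep_buf;
}
```

Checking the map file after linking is probably the quickest way to confirm the buffer landed in a region the USB DMA can reach.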
Hi @HiFiPhile
Just wanted to share some great news: I ran an initial test ( Thanks so much for putting this together so quickly. We're thrilled with the progress and super grateful for your help.
As mentioned in esp32p4 cache, I would still want to keep the dcache clean/invalidate. IMO it does not hurt performance but actually improves it (depending on the usage), since one of the main differences between M7 and M3/M4 is the cache (data + instruction). The M7 can run insanely fast (up to 1 GHz) and can perform lots of computation on data, e.g. video, before passing it to USB/DMA for transfer.
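A rough sketch of what clean/invalidate by address involves on an M7, assuming the 32-byte D-cache line of the Cortex-M7 and the CMSIS functions named in the trailing comment. The helper name and struct are hypothetical. Cache maintenance works on whole lines, so the range is widened to line boundaries first; this is also why DMA buffers are usually line-aligned and padded, otherwise an invalidate can discard neighbouring data that shares a line.

```c
#include <stdint.h>
#include <stddef.h>

/* Cortex-M7 D-cache line size in bytes. */
#define DCACHE_LINE_SIZE 32u

/* Hypothetical helper type: a line-aligned address range. */
typedef struct {
  uintptr_t start; /* rounded down to a line boundary */
  size_t    size;  /* rounded up to whole lines */
} cache_range_t;

/* Widen [addr, addr + size) so it covers whole cache lines. */
cache_range_t cache_range_align(uintptr_t addr, size_t size) {
  const uintptr_t mask = (uintptr_t)(DCACHE_LINE_SIZE - 1u);
  cache_range_t r;
  uintptr_t end = (addr + size + mask) & ~mask;
  r.start = addr & ~mask;
  r.size = (size_t)(end - r.start);
  return r;
}

/* On target (assumption: CMSIS core_cm7.h is available):
 *   before DMA reads the buffer:
 *     SCB_CleanDCache_by_Addr((uint32_t *)r.start, (int32_t)r.size);
 *   after DMA has written the buffer:
 *     SCB_InvalidateDCache_by_Addr((uint32_t *)r.start, (int32_t)r.size);
 */
```

For example, a 384-byte buffer starting 4 bytes past a line boundary widens to 416 bytes (13 lines).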
I was thinking about adding back cache support for the M7 core but didn't have the time. The secondary M4 core uses a customized cache controller, which is more complicated.
I did a little test based on
code
volatile uint32_t clock_cycles_counter;
volatile uint32_t *DWT_CYCCNT = (volatile uint32_t *)0xE0001004; // DWT cycle counter register
volatile uint32_t *DWT_CONTROL = (volatile uint32_t *)0xE0001000; // DWT control register (bit 0 enables the cycle counter)
volatile uint32_t *SCB_DEMCR = (volatile uint32_t *)0xE000EDFC; // Debug Exception and Monitor Control Register (bit 24 = TRCENA)
uint16_t i2s_dummy_buffer2[CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO][CFG_TUD_AUDIO_FUNC_1_N_CHANNELS_TX*CFG_TUD_AUDIO_FUNC_1_SAMPLE_RATE/1000/CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO];
CFG_TUSB_MEM_SECTION uint16_t i2s_dummy_buffer3[CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO][CFG_TUD_AUDIO_FUNC_1_N_CHANNELS_TX*CFG_TUD_AUDIO_FUNC_1_SAMPLE_RATE/1000/CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO];
// in main()
// Test cached location: memcpy followed by a D-cache clean
clock_cycles_counter = 0;
*SCB_DEMCR = *SCB_DEMCR | 0x01000000;
*DWT_CYCCNT = 0;
*DWT_CONTROL |= 1;
memcpy(i2s_dummy_buffer2, i2s_dummy_buffer, sizeof(i2s_dummy_buffer2));
SCB_CleanDCache_by_Addr((uint32_t*)i2s_dummy_buffer2, sizeof(i2s_dummy_buffer2));
*DWT_CONTROL &= ~1;
clock_cycles_counter = *DWT_CYCCNT;
// Test non-cached location (buffer placed via CFG_TUSB_MEM_SECTION): memcpy only
clock_cycles_counter = 0;
*SCB_DEMCR = *SCB_DEMCR | 0x01000000;
*DWT_CYCCNT = 0;
*DWT_CONTROL |= 1;
memcpy(i2s_dummy_buffer3, i2s_dummy_buffer, sizeof(i2s_dummy_buffer3));
*DWT_CONTROL &= ~1;
clock_cycles_counter = *DWT_CYCCNT;
// Test cached location again to ensure memcpy is in ICACHE
clock_cycles_counter = 0;
*SCB_DEMCR = *SCB_DEMCR | 0x01000000;
*DWT_CYCCNT = 0;
*DWT_CONTROL |= 1;
memcpy(i2s_dummy_buffer2, i2s_dummy_buffer, sizeof(i2s_dummy_buffer2));
SCB_CleanDCache_by_Addr((uint32_t*)i2s_dummy_buffer2, sizeof(i2s_dummy_buffer2));
*DWT_CONTROL &= ~1;
clock_cycles_counter = *DWT_CYCCNT;
asm("bkpt 0x55");
Although OCRAM is only clocked at fCPU/4 (fCPU/2 on STM32H7), performance on the non-cached location is higher.
Still, copying 384 bytes is slow; I expect ~500 cycles to copy 96 words, counting 4 cycles per word.
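The measurement pattern above can be folded into two small helpers. This is only a sketch: the register addresses and bit values are the ones used in the test, while the cyc_regs_t wrapper and the function names are hypothetical. Passing the registers as pointers keeps the helpers usable against fake registers off-target.

```c
#include <stdint.h>

/* Register addresses as used in the test above. */
#define DEMCR_ADDR      0xE000EDFCu
#define DWT_CTRL_ADDR   0xE0001000u
#define DWT_CYCCNT_ADDR 0xE0001004u

#define DEMCR_TRCENA    (1u << 24) /* same value as 0x01000000 */
#define DWT_CYCCNTENA   (1u << 0)

/* Hypothetical wrapper around the three registers involved. */
typedef struct {
  volatile uint32_t *demcr;
  volatile uint32_t *ctrl;
  volatile uint32_t *cyccnt;
} cyc_regs_t;

/* Enable trace, zero the counter, start counting. */
void cyc_start(const cyc_regs_t *r) {
  *r->demcr |= DEMCR_TRCENA;
  *r->cyccnt = 0;
  *r->ctrl |= DWT_CYCCNTENA;
}

/* Stop counting and return the elapsed cycles. */
uint32_t cyc_stop(const cyc_regs_t *r) {
  *r->ctrl &= ~DWT_CYCCNTENA;
  return *r->cyccnt;
}
```

On target the struct would be filled with the three addresses cast to volatile uint32_t pointers, and each cyc_stop() result stored in its own variable so all three measurements are still visible at the final breakpoint.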
@HiFiPhile thanks for the detailed test result, though this may not reflect all usage. Dcache clean/invalidate, like any solution, does introduce overhead, but in a scenario where the user needs to do heavy computation on memory, such as encrypting a large block of bytes and/or lots of video/DSP processing, the benefit would outweigh the overhead. As a general rule of thumb in the CPU world, I still think the more cache the better/faster.
Anyway, for most applications TCM is used for the buffer, as in the default linker script. In a later test with both buffers inside TCM I got a result of less than 500 cycles, while the test from OCRAM without cache clean still costs 900 cycles.
I have no idea though; tbh I am new to dcache as well. The MPU configuration is also a bit complicated for me :)
Signed-off-by: HiFiPhile <[email protected]>
@HiFiPhile I found erratum ERR050396, which describes what needs to be done if you intend to use the M7's TCM as the destination for USB writes.
That's good to know. It's not fixed in the dcd driver, but from a quick search it's included in kSDK:
@hathach I've restored M7 cache support and also updated the audio class. PS: I did a more elaborate benchmark and posted my question: https://community.nxp.com/t5/i-MX-Processors/RT1170-question-about-memcpy-benchmark/m-p/2004350#M231361
Thank you, will review asap.
#endif
}

/* MPU configuration. */
#if __CORTEX_M == 7
static void BOARD_ConfigMPU(void) {
@HiFiPhile does this BOARD_ConfigMPU() disable cache memory for the M7? I actually prefer to have cache enabled for imxrt by default (CFG_TUD/H_MEM_DCACHE_ENABLE=1), since that is what most users of this chip would do. Having it enabled in the examples also makes sure we test with cacheable memory.
BOARD_ConfigMPU() is copied from ksdk, and it enables DCache and ICache.
If you want to test the examples on cached OCRAM you can simply modify the linker script; the default linker script from ksdk uses DTCM.
Thanks @HiFiPhile for the info, that sounds good. I am currently messing around with the mcux config tool for mpu(), and it is too complicated. Maybe I will do that in a follow-up PR to test the cache function later.
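For anyone following along, the CFG_TUD/H_MEM_DCACHE_ENABLE=1 shorthand above would presumably translate to something like this in tusb_config.h. This is a sketch that only spells out the two option names from the comment; whether the buffers then actually sit in cacheable memory still depends on the linker script and MPU setup discussed here.

```c
/* tusb_config.h fragment (sketch, option names taken from the comment above) */
#ifndef CFG_TUD_MEM_DCACHE_ENABLE
#define CFG_TUD_MEM_DCACHE_ENABLE 1 /* device stack: D-cache handling on */
#endif

#ifndef CFG_TUH_MEM_DCACHE_ENABLE
#define CFG_TUH_MEM_DCACHE_ENABLE 1 /* host stack: D-cache handling on */
#endif
```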
I tried to compile and flash with both cmake and make using M4=1, but somehow flashing doesn't seem to get the MCU running at all. It will only run the last CM7 binary. If I erase the chip and only flash the CM4 image, the MCU won't boot at all. Maybe I am missing something.
Hmm... What happens if you erase the chip, then load the binary in J-Link Commander and go?
Here is my log. Even though J-Link reports that loading the file is OK, it may not be: normally when erasing/flashing there is a progress window pop-up, and that only appears with the normal M7 build. I guess it is not actually flashed at all. I did this with both cmake and make and tried flashing with pyocd as well. Not sure what I am missing.
Since the M4 image is linked as a RAM image, maybe the popup is only displayed during ROM flashing. I'll take a look later.
Oh, it runs from SRAM? I missed that. Do I need any code running on the M7 first, e.g. blinky, or to change any boot switch?
The M4 runs the same init code as the M7. I remember I put the M7 core in a while loop without doing anything, but I expect erasing the flash will do the same thing...
In fact it's more complicated to run M4 core:
Ah, thank you very much for the detailed explanation. I found using Ozone much easier and more straightforward. I have used it occasionally for ETM tracing as well, and it works perfectly.
Perfect, thank you very much for this great work.
Describe the PR
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBOARD=mimxrt1170_evkb -DM4=1 -G Ninja -B rt1170_cm4
make BOARD=mimxrt1170_evkb M4=1
PS: MCHP has a nice write-up on cache: https://ww1.microchip.com/downloads/en/DeviceDoc/Managing-Cache-Coherency-on-Cortex-M7-Based-MCUs-DS90003195A.pdf