This repository has been archived by the owner on Oct 13, 2022. It is now read-only.

tRAS of 40 may be too low for newer Carambola 2 devices #207

Closed
codehero opened this issue Jul 16, 2018 · 16 comments

Comments

@codehero

I am having very intermittent stability issues using both Caraboot and pepe-2k.
Both u-boot versions assume a tRAS value of 40 ns.
This was true when the Carambola2 module used the W9751G6JB25 DDR2 module.
However, I popped off the cap of a newer Carambola2 module and it now uses the W9751G6KB25.
The tRAS value specified for the KB25 is 45 ns.

Should the default safe value be at 45 ns???

See datasheets

Page 45 of
http://digichip.ru/datasheet/PDF/df799b2e552ae92d5acb3f8b9c437f77/68da5750c408c276e3bcd1df60096ddc/W9751G6JB25.pdf

Page 45 of
https://www.winbond.com/resource-files/da00-w9751g6kbg1.pdf

@pepe2k
Owner

pepe2k commented Jul 18, 2018

Hi @codehero,

> I am having very intermittent stability issues using both Caraboot and pepe-2k.

Could you give more details here: what kind of issues are you having, and how can they be reproduced (if you have already found a way)?

> Should the default safe value be at 45 ns???

Have you tried changing it yourself? Does a higher value solve your issues?

Cheers,
Piotr

@codehero
Author

Hi Piotr,
I am having a hard time reproducing them, but some of the details are in this issue:

8devices/Caraboot#5

I am confused by mantas-p's comment:

"Is not the DDR speed 400 MHz??"
"This is PLL output to DDR controller, DRAM clock line runs at 200MHz, I've checked with oscilloscope."

As you are taking RAM timings in nanoseconds, are you calculating timing values using 400 MHz or 200 MHz?

@pepe2k
Owner

pepe2k commented Jul 18, 2018

Hi @codehero,

> I am confused by mantas-p's comment:
>
> "Is not the DDR speed 400 MHz??"
> "This is PLL output to DDR controller, DRAM clock line runs at 200MHz, I've checked with oscilloscope."

The internal clock for the DRAM controller runs at 400 MHz because of how DDR memories work: Double Data Rate means that data is transferred on both edges of the clock, so with a 200 MHz clock you get a theoretical 400 Mbps throughput per pin.

So, the internal DRAM controller clock is in fact 400 MHz (that's the PLL value @mantas-p mentioned above and the value the kernel shows as DDR clock), but the external one, observed on the CLK line (DDR_CK_P/N) with a scope, is 200 MHz.

> As you are taking RAM timings in nanoseconds, are you calculating timing values using 400 MHz or 200 MHz?

There are two relevant statements about setting tRAS and other timing values in QCA datasheets (the first one comes from the AR934x datasheet):

  1. Set the timing parameters in “DDR DRAM Configuration (DDR_CONFIG)”. These numbers typically use the values from the specification, but greater values can also be used. Numbers are in terms of the number of controller clocks.

DRAM tRAS parameter rounded up in memory core clock cycles

The default tRAS value in AR9331 DDR_CONFIG register is 0x10/16. If you assume that value is rounded up in external clock cycles as @mantas-p wrote (200 MHz/5 ns) then you will get tRAS = 80 ns. But... in my code I assume that these values are rounded up in DRAM controller clock cycles (in this case 400 MHz/2.5 ns). There are two main reasons for this assumption:

  • default value 0x10/16 makes in this case more sense as it gives by default 40 ns (16 * 2.5)
  • this line from datasheet: "Numbers are in terms of the number of controller clocks"

I just looked at a Carambola 2 running my last image, and the tRAS value in the register is set to 0x15/21 which, if I'm correct above, gives 52.5 ns (or 105 ns if I'm not). The reason is that for DDR2 I'm using a higher minimum clock for the calculations, see these lines: https://github.com/pepe2k/u-boot_mod/blob/master/u-boot/cpu/mips/ar7240/qca_dram.c#L917 and https://github.com/pepe2k/u-boot_mod/blob/master/u-boot/cpu/mips/ar7240/qca_dram.c#L517. With the latest Caraboot image, I see tRAS set to 0x10/16.
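
For anyone who wants to check the arithmetic above, here is a minimal shell sketch. It is not code from u-boot_mod or Caraboot, and the function names `cycles_to_ps` and `ns_to_cycles` are made up for illustration. It works in integer picoseconds to avoid floating point, and shows what a register value of 16 or 21 means under both the 400 MHz and the 200 MHz assumption, plus how many cycles a 45 ns tRAS needs when rounded up.

```shell
#!/bin/sh
# Convert between DDR_CONFIG cycle counts and time using integer
# picoseconds, so plain POSIX shell arithmetic is enough.

# cycles_to_ps CYCLES CLOCK_MHZ -> duration in picoseconds
cycles_to_ps() {
    echo $(( $1 * 1000000 / $2 ))
}

# ns_to_cycles NS CLOCK_MHZ -> number of clock cycles, rounded up
ns_to_cycles() {
    echo $(( ($1 * $2 + 999) / 1000 ))
}

# Register values mentioned in this thread: 0x10 (16) and 0x15 (21)
for reg in 16 21; do
    echo "tRAS=$reg: $(cycles_to_ps "$reg" 400) ps at 400 MHz, $(cycles_to_ps "$reg" 200) ps at 200 MHz"
done

# Cycles needed for the 45 ns given in the KB25 datasheet
echo "45 ns needs $(ns_to_cycles 45 400) cycles at 400 MHz"
```

Under the 400 MHz assumption, 16 and 21 cycles come out as 40 ns and 52.5 ns, and a 45 ns tRAS rounds up to 18 cycles. Under the 200 MHz assumption the same register values would mean 80 ns and 105 ns.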

Anyway, I don't think that the tRAS value is the reason for your problems. I have some other ideas about what might be wrong here, but currently I lack the free time to help you with that... maybe next week.

Cheers,
Piotr

@codehero
Author

I would certainly be glad to hear what your ideas are.

FWIW, since changing 40 to 45 I have not noticed any issues after programming a few dozen.
That is not conclusive proof of an issue, but I am sticking with 45.
I will keep this thread updated if I find anything further.

@pepe2k
Owner

pepe2k commented Jul 20, 2018

Hi @codehero

> FWIW, since changing 40 to 45 I have not noticed any issues after programming a few dozen.
> That is not conclusive proof of an issue, but I am sticking with 45.
> I will keep this thread updated if I find anything further.

I suppose you mean changing to 45 ns in Caraboot, not in my code? As I wrote above, a Carambola 2 running my code will have tRAS set to 52.5 ns.

Cheers,
Piotr

@thornley-touchstar
Contributor

Hi @codehero!

We have seen similar issues on hardware we have in production, which uses the Skylab SKW72A. We actually started receiving batches of devices that switched from A3R12E40DBF-AH (Zentel) to W9751G6KB-25 (Winbond) DRAM, which coincidentally is the same part you struggled with!

We are still trying a few things here, like resetting the DDR registers closer to the original AP121 reference settings (u-boot), but since you are already way ahead of us with this, can you please let me know if you resolved this completely with the adjustment to tRAS?

Interestingly, the actual value in the DDR_CONFIG register for tRAS was 21; in the original AP121 u-boot it was 16, but it sounds like you would have adjusted this to 24 (if our calculations are correct). Can you please confirm?

I am also running the same stress test here on a few devices, as this problem has been elusive. So far it has only affected devices with Winbond DRAM, and on some devices the MTBF has been 4-5 days of usage in the field.

Thanks!

David.

@thornley-touchstar
Contributor

Hi @codehero,

An update: we have continued with our testing here, and so far I can state the following:

  1. Reverting to AP121 reference timings (also what is shipped with the SKW72A) did not help.
  2. Forcing tRAS to 45 with u-boot-mod timings did not help.
  3. Confirmed PLL register configuration is consistent between u-boot-mod and AP121 reference.

Just curious if you were able to get to the bottom of this in the end?

Regards,

David.

@DanielRIOT

DanielRIOT commented May 6, 2019

Hi @thornley-touchstar ,
We're also using the Winbond W9751G6KB RAM modules and are also getting strange crashes and hangs, but it's fairly repeatable when I cycle the Wifi system ( change channel or tx power and then reload wifi ).

  • mine crashes as the Wifi system comes up, when the driver puts the wifi subsystem into sleep / low power mode ( last log entry is usually ath: phy0: AWAKE -> FULL-SLEEP )

A bug report was logged on LEDE's bug system, but I did not get anywhere other than it possibly being a DRAM issue.

I've also gone back and forth between the DRAM timings from the Arduino Yun's bootloader and pepe2k's standard ones, as well as messing with tRAS and CAS timings, with no change in behavior ( other than no boot if they are misconfigured ). The Carambola2 setup is pretty much identical to the Arduino Yun.
Other than timings, I've seen differences between the burst "size" configurations...

Do you know if memtest does large buffer flushes or large DMA burst writes, like I suspect the network driver does when it shuts down or creates network buffers? Those RAM interactions might be more sensitive to RAM timings than regular single page reads and writes.

Do you have inline resistors on your DRAM lines ( like the Arduino Yun, to attenuate signal reflections ), or are they directly connected and only length matched ( like ours are )?

The Arduino Yun uses the same Winbond memory but has series resistors on the DQ, clk and addr lines.

@thornley-touchstar
Contributor

thornley-touchstar commented May 13, 2019

Hi @DanielRIOT,

We tried the alt memory test in pepe2k's u-boot and couldn't reproduce the problem. Additionally we tried memtester (mentioned in 8devices/Caraboot#5) and also a combination of stress (CPU) and iperf (WIFI)... with no luck. I imagine the latter would stress DMA, but the memory tests themselves don't specifically target DMA (to my knowledge).

@pepe2k did suggest that we drop the memory bus bandwidth (from 200 MHz), but we haven't got around to that yet. It's interesting that you mentioned switching WIFI might be related; my understanding is that the AR9331 originally had issues with USB stability related to this. It could be a process issue or limitations of the WiSoC design itself, all speculation.

I did manage to source an Arduino Yun, and I can see a lot of additional resistors around the DRAM chip compared to the SKW72A and Carambola2, which also use the Winbond part.

Which device are you actually testing with? Also, do you know if the Arduino Yun specifically does not suffer from this problem?

@DanielRIOT

DanielRIOT commented May 13, 2019

I've also tried to get mine to fail with memtester, but they ran flawlessly for a few days until I cycled the wifi a few times ( change channel, call "wifi" ) as a sanity check and they crashed again.

I have not had the same issues on my modified Yuns, only on our layout.

We started on the Arduino Yun as a proof of concept and then created our own board from microscope images and bits and pieces of AR9331 schematics and layouts around the internet (QCA is not very helpful with new product development in EMEA until one gets the needed volumes). We use a different antenna arrangement as well as GPS and a few other peripherals on the main PCB.

I'll keep poking at the RAM configuration from u-boot ( it looks like Linux doesn't touch it after u-boot sets it up ), as well as add the 22R resistors on the DQ lines and do further tests ( on-die termination is not enabled in this application, so it could be ringing on the lines that messes with the data ). I have a spectrum analyser ( 40 MHz BW, but LO moves around ) and a 100 MHz scope, so I cannot easily capture a time series at 200 MHz with those fast rise times to "see" ringing.

@thornley-touchstar
Contributor

@DanielRIOT, today I tried switching channels with memtester running and iperf (pushing data over wifi) and could not reproduce what you experience (on a SKW72)

I use the following method to randomly change channel: -

uci set wireless.radio0.channel=`awk -v min=1 -v max=11 'BEGIN{srand(); print int(min+rand()*(max-min+1))}'`; uci changes ; uci commit wireless ; wifi

I am curious about one thing though, if you push 'memtester' too much so that it allocates all available memory. In my case, the 'want' is 40M but it shows 'got' as 24MB, and when this occurs it tends to naturally crash anyway as the kernel is exhausted of memory (I don't have any swap). Do you experience the same?
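
One way to keep the plain out-of-memory case out of the picture is to size the memtester run from what the kernel reports as available, rather than from a fixed number. The sketch below is only an illustration: the helper name `safe_test_mb` and the 50% default are assumptions, and it relies on a `MemAvailable` line in `/proc/meminfo`, which older kernels do not provide.

```shell
#!/bin/sh
# Hypothetical helper: suggest a memtester size (in whole MB) as a
# percentage of MemAvailable, leaving headroom for the kernel.

# safe_test_mb [MEMINFO_FILE] [PERCENT] -> size in megabytes
safe_test_mb() {
    meminfo=${1:-/proc/meminfo}
    percent=${2:-50}
    # MemAvailable is reported in kB; scale by the percentage, then to MB
    awk -v pct="$percent" \
        '/^MemAvailable:/ { printf "%d\n", $2 * pct / 100 / 1024 }' "$meminfo"
}

# Example use on the target (not run here):
#   memtester "$(safe_test_mb)M" 1
```

If a run sized this way still crashes without an out-of-memory message, that at least points away from simple memory exhaustion.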

@DanielRIOT

@thornley-touchstar I also get an eventual crash when I allocate large blocks with memtester ( 40M ish ), but those all had an obvious "out of memory" type message.

My channel change was a shell script from the OpenWrt bug page.

It's a strange crash and I only get it when the Wifi channel or power is changed, and sometimes on boot ( when the wifi system comes up; I'm usually running in 5 MHz wifi mode, but even when I put it in normal 20 MHz mode it acts the same ). I've disabled my batmesh and VLAN configurations and it still happens.

  • I thought it might be some race condition, or an unallocated buffer that's being used or brought up out of sequence, that may result in an illegal memory access ( an invalid pointer that makes the network driver destroy some important kernel data ), but that's just speculation. After poking around with the ath9k debug I found the crash always happens after an "AWAKE -> FULL-SLEEP" log output ( after the driver re-initializes ), and I have not been able to dig further than that yet.

Most other people who looked at the logs say it's a RAM issue; this week will be "scrape tracks under a microscope and add series resistors" week.

while true; do
 for chan in $(seq 1 13); do
   uci set wireless.radio0.channel=$chan;
   uci commit; wifi; sleep 5; iwinfo | grep Channel
   sleep 60
 done
done

@thornley-touchstar
Contributor

@DanielRIOT understood, if you do have any success with adding the 22R resistors it would be great to hear if it resolves the issue :)

@DanielRIOT

@thornley-touchstar after adding inline resistors to the DQ and control lines ( following a TI and Micron app note ) it didn't seem to change my crash behavior... will keep digging

@DanielRIOT

DanielRIOT commented Jul 17, 2019

I eventually reflowed a new RAM module ( micrel art IIRC ) onto the device. That destroyed 1 board, and the other had no change in behavior - bummer.

I also found a few older bug reports related to kernel crashes when networking stuff happens ( https://dev.archive.openwrt.org/ticket/22283 and https://dev.archive.openwrt.org/ticket/22265.html ), so I removed 2 patches ( 020-backport_netfilter_rtcache and 150-bridge_allow_receiption_on_disabled_port ). There was also a mention of removing these in the 19.07 pull requests.

The system was more stable, and while trying to get a clearer picture of why the crash happened ( in the few times I got a kernel crashlog ) I enabled a few more debugging features:

CONFIG_DEBUG=y
CONFIG_KERNEL_DEVMEM=y
CONFIG_KERNEL_ENABLE_DEFAULT_TRACERS=y
CONFIG_KERNEL_FTRACE=y
CONFIG_KERNEL_FTRACE_SYSCALLS=y
CONFIG_KERNEL_PROVE_LOCKING=y
CONFIG_KERNEL_RELAY=y

The most notable one is CONFIG_KERNEL_PROVE_LOCKING=y, which seemed to make the crash on wireless start go away. But then the system would crash on reboot, for which I found this bug report:
https://dev.archive.openwrt.org/ticket/17839#ticket

It seems that re-enabling the interrupts right after writing to the reset register caused the reset bit to be cleared before the system could reset, so I placed a delay right after the register write and now reboot also works again. Seems like a few race conditions are around :(

But my system works "good enough" for now.

@pepe2k
Owner

pepe2k commented Oct 13, 2022

Sorry but this project is no longer maintained.

@pepe2k pepe2k closed this as completed Oct 13, 2022