Intermittent ram issues on Carambola 2 boards #5

Open
codehero opened this issue Jul 16, 2018 · 5 comments

Comments

@codehero

I am having very intermittent stability issues using both Caraboot and pepe-2k.
Both u-boot versions assume a tRAS value of 40 ns.
That was correct when the Carambola2 module used the W9751G6JB25 DDR2 chip.
However, I popped the cap off a newer Carambola2 module, and it now uses the W9751G6KB25.

According to this line in include/configs/carambola2.h:

#define CFG_DDR_CONFIG_VAL 0x7fbc8cd0

tRAS is still specified as 40 ns.

Should the default safe value be 45 ns?

See the datasheets:

Page 45 of
http://digichip.ru/datasheet/PDF/df799b2e552ae92d5acb3f8b9c437f77/68da5750c408c276e3bcd1df60096ddc/W9751G6JB25.pdf

Page 45 of
https://www.winbond.com/resource-files/da00-w9751g6kbg1.pdf

@mantas-p
Contributor

Hi,

Carambola2 actually uses 200MHz DDR clock so:
tCK = 1/200MHz = 5ns
tRAS = (0x10) * 5ns = 80ns
Which is OK with both DDR chips used.
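
The arithmetic above can be reproduced with a short script. This is only a sketch: it assumes the tRAS field sits in the low five bits of `CFG_DDR_CONFIG_VAL` (an assumption, although it matches the 0x10 quoted above) and that the field counts DRAM clock cycles. The function names are made up for illustration.

```python
# Sketch: reproduce the tRAS arithmetic for CFG_DDR_CONFIG_VAL.
# ASSUMPTION: tRAS is the low 5 bits of the register and counts DRAM clock cycles.

CFG_DDR_CONFIG_VAL = 0x7FBC8CD0  # value from include/configs/carambola2.h
TRAS_MASK = 0x1F                 # assumed field position: bits 4:0


def tras_cycles(config_val: int) -> int:
    """Return the raw tRAS field (in clock cycles) under the assumed layout."""
    return config_val & TRAS_MASK


def tras_ns(config_val: int, dram_clock_hz: float) -> float:
    """Convert the tRAS field to nanoseconds for a given DRAM clock."""
    tck_ns = 1e9 / dram_clock_hz  # one clock period in nanoseconds
    return tras_cycles(config_val) * tck_ns


if __name__ == "__main__":
    print(f"tRAS field: {tras_cycles(CFG_DDR_CONFIG_VAL):#x} cycles")
    print(f"tRAS at 200 MHz: {tras_ns(CFG_DDR_CONFIG_VAL, 200e6):.1f} ns")
```

With a 200 MHz DRAM clock this gives the same 80 ns as the hand calculation, which is above both the 40 ns and the 45 ns figures discussed in this issue.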

Have you tried running memtester on the unstable devices? If not, please try it (https://github.com/mantas-p/files/blob/master/memtest_mipsbe/memtester?raw=true).
Run it like this:
/tmp/memtester 40M
Leave it running for a few hours and see if any errors come up.

Are there any specific conditions under which the issue happens, or is it completely random?
How often do you get crashes?
Do you have any logs to share?

Are you using the 8devices devboard or your own design? If it's your own design, please contact [email protected]; we may have suggestions regarding the HW design.

@codehero
Author

codehero commented Jul 16, 2018

This is a custom design.

I have memtester running on some units now.
I am confused by the following kernel log line, then:

[ 0.000000] Clocks: CPU:400.000MHz, DDR:400.000MHz, AHB:200.000MHz, Ref:40.000MHz
Is not the DDR speed 400 MHz??

The crashes are quite random: sometimes frequent, sometimes absent even under stress loads such as
stress -c 64

Here is a crash log:

<6>[ 0.000013] sched_clock: 32 bits at 200MHz, resolution 5ns, wraps every 10737418237ns
<6>[ 0.007920] Calibrating delay loop... 265.42 BogoMIPS (lpj=1327104)
<6>[ 0.089183] pid_max: default: 32768 minimum: 301
<6>[ 0.093968] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
<6>[ 0.100417] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
<6>[ 0.109143] Performance counters: mips/24K PMU enabled, 2 32-bit counters available to each CPU, irq 13
<6>[ 0.119380] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
<6>[ 0.127864] futex hash table entries: 256 (order: -1, 3072 bytes)
<6>[ 0.135737] NET: Registered protocol family 16
<6>[ 0.140840] MIPS: machine is Remote Access Device
<6>[ 0.601304] i2c-gpio i2c-gpio.0: using pins 20 (SDA) and 19 (SCL)
<6>[ 0.607269] clocksource: Switched to clocksource MIPS
<6>[ 0.612977] NET: Registered protocol family 2
<6>[ 0.617207] TCP established hash table entries: 1024 (order: 0, 4096 bytes)
<6>[ 0.622928] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
<6>[ 0.629240] TCP: Hash tables configured (established 1024 bind 1024)
<6>[ 0.635688] UDP hash table entries: 256 (order: 0, 4096 bytes)
<6>[ 0.641438] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
<6>[ 0.648063] NET: Registered protocol family 1
<7>[ 0.652060] PCI: CLS 0 bytes, default 32
<4>[ 0.659005] Crashlog allocated RAM at address 0x3f00000
<6>[ 0.684234] squashfs: version 4.0 (2009/01/31) Phillip Lougher
<6>[ 0.688729] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
<6>[ 0.702215] io scheduler noop registered
<6>[ 0.704698] io scheduler deadline registered (default)
<6>[ 0.710185] Serial: 8250/16550 driver, 1 ports, IRQ sharing disabled
<6>[ 0.716853] ar933x-uart: ttyATH0 at MMIO 0x18020000 (irq = 11, base_baud = 2500000) is a AR933X UART
<6>[ 0.726034] console [ttyATH0] enabled
<6>[ 0.732591] bootconsole [early0] disabled
<4>[ 0.744748] m25p80 spi0.0: found w25q128, expected m25p80
<6>[ 0.748800] m25p80 spi0.0: w25q128 (16384 Kbytes)
<5>[ 0.753417] 4 cmdlinepart partitions found on MTD device spi0.0
<5>[ 0.759307] Creating 4 MTD partitions on "spi0.0":
<5>[ 0.764074] 0x000000000000-0x000000040000 : "u-boot"
<5>[ 0.772327] 0x000000040000-0x000000050000 : "u-boot-env"
<5>[ 0.778578] 0x000000050000-0x000000ff0000 : "firmware"
<5>[ 0.809129] 2 uimage-fw partitions found on MTD device firmware
<5>[ 0.813617] 0x000000050000-0x000000190000 : "kernel"
<5>[ 0.820507] 0x000000190000-0x000000ff0000 : "rootfs"
<5>[ 0.826301] mtd: device 4 (rootfs) set to be root filesystem
<5>[ 0.830637] 1 squashfs-split partitions found on MTD device rootfs
<5>[ 0.836690] 0x000000580000-0x000000ff0000 : "rootfs_data"
<5>[ 0.844446] 0x000000ff0000-0x000001000000 : "art"
<4>[ 0.850300] m25p80 spi0.1: found mx25r3235f, expected m25p80
<6>[ 0.854531] m25p80 spi0.1: mx25r3235f (4096 Kbytes)
<6>[ 0.879734] libphy: ag71xx_mdio: probed
<6>[ 1.468750] ag71xx-mdio.1: Found an AR7240/AR9330 built-in switch
<6>[ 1.500853] eth0: Atheros AG71xx at 0xba000000, irq 5, mode:GMII
<6>[ 2.088983] ag71xx ag71xx.0: connected to PHY at ag71xx-mdio.1:04 [uid=004dd041, driver=Generic PHY]
<6>[ 2.097744] eth1: Atheros AG71xx at 0xb9000000, irq 4, mode:MII
<4>[ 2.104992] rtc-ds1374 0-0068: oscillator discontinuity flagged, time unreliable
<6>[ 2.119904] rtc-ds1374 0-0068: rtc core: registered ds1374 as rtc0
<6>[ 2.125245] leds-cat3626 0-0066: setting platform data
<6>[ 2.143925] NET: Registered protocol family 17
<6>[ 2.147047] bridge: automatic filtering via arp/ip/ip6tables has been deprecated. Update your scripts to load br_netfilter if you need this.
<6>[ 2.159745] 8021q: 802.1Q VLAN Support v1.8
<6>[ 2.172669] rtc-ds1374 0-0068: setting system clock to 1970-01-01 00:00:18 UTC (18)
<6>[ 2.185611] VFS: Mounted root (squashfs filesystem) readonly on device 31:4.
<6>[ 2.193583] Freeing unused kernel memory: 264K
<14>[ 3.623675] init: Console is alive
<14>[ 3.625918] init: - watchdog -
<14>[ 4.893845] kmodloader: loading kernel modules from /etc/modules-boot.d/*
<14>[ 4.969333] kmodloader: done loading kernel modules from /etc/modules-boot.d/*
<14>[ 4.985699] init: - preinit -
<5>[ 6.169690] jffs2: notice: (332) jffs2_build_xattr_subsystem: complete building xattr subsystem, 0 of xdatum (0 unchecked, 0 orphan) and 0 of xref (0 dead, 0 orphan) found.
<14>[ 6.185916] mount_root: switching to jffs2 overlay
<12>[ 6.200553] urandom-seed: Seeding with /etc/urandom.seed
<14>[ 6.499649] procd: - early -
<14>[ 6.501215] procd: - watchdog -
<14>[ 7.134013] procd: - watchdog -
<14>[ 7.136084] procd: - ubus -
<5>[ 7.276161] random: ubusd: uninitialized urandom read (4 bytes read, 11 bits of entropy available)
<5>[ 7.285056] random: ubusd: uninitialized urandom read (4 bytes read, 11 bits of entropy available)
<5>[ 7.293337] random: ubusd: uninitialized urandom read (4 bytes read, 11 bits of entropy available)
<5>[ 7.301690] random: ubusd: uninitialized urandom read (4 bytes read, 11 bits of entropy available)
<5>[ 7.311075] random: ubusd: uninitialized urandom read (4 bytes read, 11 bits of entropy available)
<5>[ 7.319584] random: ubusd: uninitialized urandom read (4 bytes read, 11 bits of entropy available)
<5>[ 7.328936] random: ubusd: uninitialized urandom read (4 bytes read, 11 bits of entropy available)
<5>[ 7.337649] random: ubusd: uninitialized urandom read (4 bytes read, 11 bits of entropy available)
<14>[ 7.346769] procd: - init -
<14>[ 7.621322] kmodloader: loading kernel modules from /etc/modules.d/*
<6>[ 7.678310] Initializing XFRM netlink socket
<6>[ 7.685490] NET: Registered protocol family 15
<6>[ 7.696791] ipip: IPv4 over IPv4 tunneling driver
<6>[ 7.880469] i2c /dev entries driver
<6>[ 7.891225] Loading modules backported from Linux version wt-2017-01-31-0-ge882dff19e7f
<6>[ 7.897858] Backport generated by backports.git backports-20160324-13-g24da7d3c
<6>[ 7.928404] nf_conntrack version 0.5.0 (946 buckets, 3784 max)
<6>[ 8.062540] xt_time: kernel timezone is -0000
<6>[ 8.126258] ip_tables: (C) 2000-2006 Netfilter Core Team
<7>[ 8.258583] ath: EEPROM regdomain: 0x0
<7>[ 8.258614] ath: EEPROM indicates default country code should be used
<7>[ 8.258628] ath: doing EEPROM country->regdmn map search
<7>[ 8.258658] ath: country maps to regdmn code: 0x3a
<7>[ 8.258673] ath: Country alpha2 being used: US
<7>[ 8.258687] ath: Regpair used: 0x3a
<7>[ 8.270494] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
<6>[ 8.278844] ieee80211 phy0: Atheros AR9330 Rev:1 mem=0xb8100000, irq=2
<14>[ 8.324993] kmodloader: done loading kernel modules from /etc/modules.d/*
<5>[ 9.767538] random: jshn: uninitialized urandom read (4 bytes read, 15 bits of entropy available)
<5>[ 9.849552] random: ubusd: uninitialized urandom read (4 bytes read, 15 bits of entropy available)
<6>[ 19.302849] device eth0 entered promiscuous mode
<6>[ 20.810260] eth0: link up (1000Mbps/Full duplex)
<6>[ 20.811712] br-lan: port 1(eth0) entered forwarding state
<6>[ 20.813214] br-lan: port 1(eth0) entered forwarding state
<6>[ 22.807284] br-lan: port 1(eth0) entered forwarding state
<6>[ 32.662409]
<6>[ 32.662409] do_page_fault(): sending SIGSEGV to rad_driver for invalid read access from 0000004c
<6>[ 32.664880] epc = 777f43b0 in libc.so[77788000+91000]
<6>[ 32.666555] ra = 777f43a4 in libc.so[77788000+91000]
<6>[ 32.668227]
<5>[ 60.916239] random: nonblocking pool is initialized
<4>[ 422.487687] Unhandled kernel unaligned access[#1]:
<4>[ 422.488844] CPU: 0 PID: 665 Comm: netifd Not tainted 4.4.92 #0
<4>[ 422.490776] task: 838b7340 ti: 83012000 task.ti: 83012000
<4>[ 422.492559] $ 0 : 00000000 772a8854 00000001 00000001
<4>[ 422.494295] $ 4 : 00000001 000f4240 00000400 838e9ef0
<4>[ 422.496031] $ 8 : 00000001 fffffffc 00000000 732d7365
<4>[ 422.497767] $12 : 61726368 00000264 00000000 00000000
<4>[ 422.499503] $16 : 00000001 83b4432c 000000c3 00000000
<4>[ 422.501240] $20 : 00000001 00000001 00000000 834ca140
<4>[ 422.502976] $24 : 00000000 8009ff18
<4>[ 422.504711] $28 : 83012000 83013c98 00000000 8026c840
<4>[ 422.506450] Hi : 001ba63f
<4>[ 422.507404] Lo : b1200000
<4>[ 422.508397] epc : 8026c848 alloc_skb_with_frags+0xb4/0x1d8
<4>[ 422.510251] ra : 8026c840 alloc_skb_with_frags+0xac/0x1d8
<4>[ 422.512120] Status: 1000f402 KERNEL EXL
<4>[ 422.513425] Cause : 00800010 (ExcCode 04)
<4>[ 422.514753] BadVA : 000000b1
<4>[ 422.515710] PrId : 00019374 (MIPS 24Kc)
<4>[ 422.517009] Modules linked in: ath9k ath9k_common iptable_nat ath9k_hw ath nf_nat_ipv4 nf_conntrack_ipv4 mac80211 iptable_mangle iptable_filter ipt_ah ipt_REJECT ipt_MASQUERADE ip_tables cfg80211 xt_time xt_tcpudp xt_state xt_policy xt_nat xt_multiport xt_mark xt_mac xt_limit xt_esp xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG x_tables nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_log_common nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack crc_ccitt compat i2c_dev ledtrig_heartbeat ipcomp xfrm4_tunnel xfrm4_mode_tunnel xfrm4_mode_transport xfrm4_mode_beet esp4 ah4 ipip tunnel4 ip_tunnel af_key xfrm_user xfrm_ipcomp xfrm_algo sha256_generic sha1_generic jitterentropy_rng drbg md5 hmac echainiv des_generic deflate zlib_inflate zlib_deflate cbc authenc cryptomgr aead crypto_null crypto_hash
<4>[ 422.541354] Process netifd (pid: 665, threadinfo=83012000, task=838b7340, tls=7736ed48)
<4>[ 422.544006] Stack : 83013ef4 833c06c0 00000000 802701bc 801247d4 833e5174 00000001 00000000
<4>[ 422.544006] 833e5180 800a2374 0000026c 00000000 00000000 0000026c 00000000 00000001
<4>[ 422.544006] 833c06c0 83965100 0000026c 83013eec 83965ba0 800a27dc 833c06c0 80305b6c
<4>[ 422.544006] 00000000 00000000 000000c3 833c06c0 83965100 802667ec 83013eec 83965ba0
<4>[ 422.544006] 00000000 834ca140 00000000 803075a8 0000003c 83013db8 7fa28c8c 00000001
<4>[ 422.544006] ...
<4>[ 422.555811] Call Trace:
<4>[ 422.556634] [<8026c848>] alloc_skb_with_frags+0xb4/0x1d8
<4>[ 422.558393]
<4>[ 422.558878]
<4>[ 422.558878] Code: 00402821 1040ffe6 00408021 <8e0300b0> 3c13ffbf 00121300 3673adff 00621021 02b39824
<1>[ 422.562261] CPU 0 Unable to handle kernel paging request at virtual address 00000000, epc == 00000000, ra == 00000000
<4>[ 422.565710] Oops[#2]:
<4>[ 422.566471] CPU: 0 PID: 665 Comm: netifd Tainted: G D 4.4.92 #0
<4>[ 422.568811] task: 838b7340 ti: 83012000 task.ti: 83012000
<4>[ 422.570597] $ 0 : 00000000 00000001 00000001 00000001
<4>[ 422.572333] $ 4 : 00000001 000f4240 00000400 838b4030
<4>[ 422.574069] $ 8 : 00000001 fffffffc 00000001 00000000
<4>[ 422.575806] $12 : 00989680 ac000000 00000000 00000000
<4>[ 422.577542] $16 : 833d3c80 83b441c4 00000000 00000000
<4>[ 422.579277] $20 : 00000001 00000000 803ed320 83013ac0
<4>[ 422.581014] $24 : 00000000 8009ff18
<4>[ 422.582750] $28 : 83012000 83809d84 00989680 00000000
<4>[ 422.584488] Hi : 001c349a
<4>[ 422.585443] Lo : e3400000
<4>[ 422.586399] epc : 00000000 (null)
<4>[ 422.587614] ra : 00000000 (null)
<4>[ 422.588827] Status: 10007402 KERNEL EXL
<4>[ 422.590132] Cause : 10800008 (ExcCode 02)
<4>[ 422.591460] BadVA : 00000000
<4>[ 422.592417] PrId : 00019374 (MIPS 24Kc)
<4>[ 422.593716] Modules linked in: ath9k ath9k_common iptable_nat ath9k_hw ath nf_nat_ipv4 nf_conntrack_ipv4 mac80211 iptable_mangle iptable_filter ipt_ah ipt_REJECT ipt_MASQUERADE ip_tables cfg80211 xt_time xt_tcpudp xt_state xt_policy xt_nat xt_multiport xt_mark xt_mac xt_limit xt_esp xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG x_tables nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_log_common nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack crc_ccitt compat i2c_dev ledtrig_heartbeat ipcomp xfrm4_tunnel xfrm4_mode_tunnel xfrm4_mode_transport xfrm4_mode_beet esp4 ah4 ipip tunnel4 ip_tunnel af_key xfrm_user xfrm_ipcomp xfrm_algo sha256_generic sha1_generic jitterentropy_rng drbg md5 hmac echainiv des_generic deflate zlib_inflate zlib_deflate cbc authenc cryptomgr aead crypto_null crypto_hash
<4>[ 422.618061] Process netifd (pid: 665, threadinfo=83012000, task=838b7340, tls=7736ed48)
<4>[ 422.620711] Stack : 00000000 80477b60 00000000 800b7660 82c48242 803fcac0 00000001 00000000
<4>[ 422.620711] 803fcacc 800a2374 001c2212 80477b60 00000000 07735940 00000000 00000000
<4>[ 422.620711] 00000006 00000007 fffffffe fffffffc 00000000 800a23c4 0000af3b 00000000
<4>[ 422.620711] 00000003 00000000 00000000 00000000 803ec920 800c8e1c 00000000 80470000
<4>[ 422.620711] 838b59a0 00000001 838b7340 838b7340 00000000 80470000 80470000 61a44d8b
<4>[ 422.620711] ...
<4>[ 422.632518] Call Trace:
<4>[ 422.633345] [<800b7660>] timekeeping_update+0x1b4/0x228
<4>[ 422.635112] [<800a2374>] __wake_up_common+0x84/0xb0
<4>[ 422.636700] [<800a23c4>] __wake_up+0x24/0x48
<4>[ 422.638126] [<800c8e1c>] irq_work_run_list+0xb0/0xd8
<4>[ 422.639790] [<800b1400>] update_process_times+0x50/0x70
<4>[ 422.641528] [<800bf184>] tick_sched_timer+0x1cc/0x258
<4>[ 422.643181] [<8009fcd4>] check_preempt_wakeup+0xd8/0x178
<4>[ 422.644950] [<800b223c>] __hrtimer_run_queues.constprop.6+0x128/0x1b0
<4>[ 422.647090] [<800b243c>] hrtimer_interrupt+0xd4/0x260
<4>[ 422.648780] [<800a8c54>] handle_irq_event_percpu+0x80/0x184
<4>[ 422.650626] [<80074430>] c0_compare_interrupt+0x4c/0x5c
<4>[ 422.652351] [<800b136c>] run_timer_softirq+0x1d4/0x1f8
<4>[ 422.654059] [<800a8c98>] handle_irq_event_percpu+0xc4/0x184
<4>[ 422.655926] [<800ac0e4>] handle_percpu_irq+0x50/0x80
<4>[ 422.657560] [<800a85b4>] generic_handle_irq+0x24/0x3c
<4>[ 422.659266] [<8006e944>] do_IRQ+0x1c/0x2c
<4>[ 422.660570] [<8006a6e0>] plat_irq_dispatch+0xf8/0x10c
<4>[ 422.662261] [<8026c848>] alloc_skb_with_frags+0xb4/0x1d8
<4>[ 422.664012] [<8026c840>] alloc_skb_with_frags+0xac/0x1d8
<4>[ 422.665776] [<80060bf8>] handle_int+0x138/0x144
<4>[ 422.667269]
<4>[ 422.667761]
<4>[ 422.667761] Code: (Bad address in epc)
<4>[ 422.669062]
<4>[ 422.669600] ---[ end trace 69daf50c69daf50c ]---

@mantas-p
Contributor

Is not the DDR speed 400 MHz??

This is the PLL output to the DDR controller; the DRAM clock line runs at 200 MHz, which I've checked with an oscilloscope.
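
To make the difference concrete, here is a small sketch that parses the kernel's "Clocks:" line and converts the 0x10-cycle tRAS setting under both readings of the DDR clock. The 200 MHz value is the oscilloscope measurement reported in this thread, not something the script derives, and the helper names are hypothetical.

```python
import re

# Kernel log line quoted earlier in this thread.
CLOCKS_LINE = ("[ 0.000000] Clocks: CPU:400.000MHz, DDR:400.000MHz, "
               "AHB:200.000MHz, Ref:40.000MHz")

TRAS_CYCLES = 0x10          # tRAS setting quoted in this thread
MEASURED_DRAM_MHZ = 200.0   # oscilloscope measurement reported in this thread


def parse_clocks(line: str) -> dict:
    """Return {name: MHz} for every 'NAME:123.456MHz' pair in the line."""
    pairs = re.findall(r"(\w+):(\d+(?:\.\d+)?)MHz", line)
    return {name: float(mhz) for name, mhz in pairs}


def cycles_to_ns(cycles: int, clock_mhz: float) -> float:
    """Convert a cycle count to nanoseconds at the given clock."""
    return cycles * 1000.0 / clock_mhz


if __name__ == "__main__":
    reported = parse_clocks(CLOCKS_LINE)["DDR"]
    print(f"tRAS if the DRAM ran at the reported {reported:.0f} MHz: "
          f"{cycles_to_ns(TRAS_CYCLES, reported):.1f} ns")
    print(f"tRAS at the measured {MEASURED_DRAM_MHZ:.0f} MHz: "
          f"{cycles_to_ns(TRAS_CYCLES, MEASURED_DRAM_MHZ):.1f} ns")
```

Reading the reported 400 MHz as the DRAM clock gives 40 ns, which would be consistent with the figure at the top of this issue; the measured 200 MHz gives 80 ns.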

The crash log definitely looks like a DRAM issue. Did you find anything with memtester? We ran it overnight on a module with the W9751G6KB25 and could not reproduce the issue.
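
One way to see a corrupt address in the first oops is to decode the faulting instruction word from the `Code:` line and combine it with the register dump. The sketch below uses the standard MIPS I-type field layout; the constants are copied from the crash log above and the helper names are hypothetical. It only shows that the base register held a bogus value, not what corrupted it.

```python
# Sketch: decode the faulting instruction from the first oops and
# recompute the address it tried to load from.

LW_OPCODE = 0x23             # MIPS 'lw' (load word) opcode

FAULTING_WORD = 0x8E0300B0   # "<8e0300b0>" in the Code: line
REG_16 = 0x00000001          # $16 in the first register dump
BADVA = 0x000000B1           # BadVA reported by the kernel


def decode_itype(word: int) -> dict:
    """Split a 32-bit MIPS I-type instruction into its fields."""
    imm = word & 0xFFFF
    if imm & 0x8000:         # sign-extend the 16-bit immediate
        imm -= 0x10000
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,   # base register number
        "rt": (word >> 16) & 0x1F,   # destination register number
        "imm": imm,
    }


def effective_address(base: int, imm: int) -> int:
    """Address a load/store accesses: base + offset, wrapped to 32 bits."""
    return (base + imm) & 0xFFFFFFFF


if __name__ == "__main__":
    f = decode_itype(FAULTING_WORD)
    addr = effective_address(REG_16, f["imm"])
    print(f"opcode={f['opcode']:#x} base=${f['rs']} dest=${f['rt']} "
          f"offset={f['imm']:#x}")
    print(f"computed address {addr:#x}, BadVA {BADVA:#x}, "
          f"word aligned: {addr % 4 == 0}")
```

The word decodes as a load from offset 0xb0 relative to $16, and $16 holds 1 instead of a kernel pointer (other pointers in the dump look like 0x83xxxxxx), which gives exactly the unaligned BadVA of 0xb1.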

How many Carambola2 samples are you having issues with? Did any previous batch work well in your design?

@codehero
Author

History
Over the past year I have been using several different OpenWrt variants (the original that came with the Carambola, Designated Driver, LEDE 17.01.5).
- I do not have crash data for the original.
- I saw the most instability with DD.
- LEDE has shown user-level instability, but I am unsure whether this is related to DRAM.

Testing
So I am focusing on the present batch of units (dozens) running LEDE on custom systems whose other subsystems work well.
I had been having issues with memtester crashing the kernel, so I have been using

stress -c 64 -m 16 --vm-bytes 1M

I have run 5 overnight tests.
None has crashed the system so far, but I have seen

           do_page_fault(): sending SIGSEGV to dnsmasq for invalid read access from 003fdad4

[38599.019192] epc = 003fdad5 in
[38599.020123] dnsmasq[400000+1f000]
[38599.020742] ra = 0040d961 in
[38599.021726] dnsmasq[400000+1f000]

I am not sure whether this is a kernel bug, a DRAM error, or a userland error.
On a malfunctioning test unit from a few months ago I saw several messages like this for random processes, resulting in eventual system crash.
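
For collecting this kind of evidence across units, a rough sketch like the following could tally the SIGSEGV messages per process from saved kernel logs. The regular expression is based on the read-access messages quoted in this issue; the write-access wording is an assumption, and `tally_segv` is a hypothetical helper. A spread across unrelated processes is only a hint towards hardware, not proof.

```python
import re
from collections import Counter

# Matches the do_page_fault() messages quoted in this issue.
# ASSUMPTION: write faults use the wording "write access to".
SEGV_RE = re.compile(
    r"do_page_fault\(\): sending SIGSEGV to (\S+) "
    r"for invalid (?:read access from|write access to) ([0-9a-fA-F]+)"
)

# Two messages copied from this issue, used as sample input.
SAMPLE = """\
<6>[ 32.662409] do_page_fault(): sending SIGSEGV to rad_driver for invalid read access from 0000004c
do_page_fault(): sending SIGSEGV to dnsmasq for invalid read access from 003fdad4
"""


def tally_segv(log_text: str) -> Counter:
    """Count SIGSEGV page-fault messages per process name."""
    counts = Counter()
    for match in SEGV_RE.finditer(log_text):
        counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for process, count in tally_segv(SAMPLE).most_common():
        print(f"{process}: {count}")
```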

I will take the rest of this issue to the support email, but I would still like to document the final resolution here.

@DanielRIOT

DanielRIOT commented May 6, 2019

@codehero, I get a similar crash (on a custom board too), but it's fairly repeatable when I cycle the WiFi system (change the channel or TX power and then reload WiFi).

A bug report was logged on LEDE's bug tracker, but I did not get anywhere other than it possibly being a DRAM issue.

What in the crash log points to a DRAM issue (corrupt addresses?)

I'm running memtester 40M to see whether the crash is triggered by memory reads/writes.

Do you have inline resistors on your DRAM lines (like the Arduino Yun and some other AR9331 designs), or are they directly connected and only length matched?

We're also using Winbond W9751G6KB RAM (as in pepe2k/u-boot_mod#207).
