T-X edited this page Jan 31, 2018 · 7 revisions

Introduction

On embedded devices we regularly deal with very limited flash and RAM. This page helps with debugging RAM issues and tracks their status.

The motivation of this page is ticket #1243 in particular.

Helpful knowledge, links and articles

Collect any external resources here that might help others understand and debug OOM issues.

  • [Add some links here, explaining userspace vs. kernelspace allocations, kmalloc(), kmem_cache_alloc(), vmalloc(), /proc/slabinfo, /proc/vmallocinfo, /proc/vmstat, echo 'm' > /proc/sysrq-trigger,...]

How to Debug

  • Build Gluon with
  • On OOM and after reboot, get crash report from /sys/kernel/debug/crashlog
  • Try to find a reproducible, isolated setup!
  • Observe:
    • /proc/slabinfo
    • /proc/vmallocinfo
    • /proc/vmstat
    • echo 'm' > /proc/sysrq-trigger; dmesg
  • Helpful tools:
    • Traffic monitoring: tcpdump, wireshark, etc.
    • Traffic generators: mausezahn, iperf, tcpreplay, etc.
  • ...
  • Profit
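The "Observe" files above can be captured in one go, so that snapshots taken before and during a memory build-up can be diffed later. The following is a minimal sketch only: the function name snapshot_mem and the default target directory are hypothetical, and on most nodes /tmp is RAM-backed, so snapshots should be copied off the device and deleted promptly.

```shell
#!/bin/sh
# Sketch: copy the kernel memory statistics listed above into a
# timestamped directory. snapshot_mem and /tmp/memsnap are hypothetical
# names chosen for this example.

snapshot_mem() {
    out_base="${1:-/tmp/memsnap}"
    out_dir="$out_base/$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$out_dir" || return 1

    for f in /proc/slabinfo /proc/vmallocinfo /proc/vmstat; do
        # Some of these files are root-only or absent, depending on the
        # kernel configuration; skip whatever cannot be read.
        if [ -r "$f" ]; then
            cat "$f" > "$out_dir/$(basename "$f")" 2>/dev/null
        fi
    done

    # Print the directory so that callers can diff two snapshots.
    echo "$out_dir"
}
```

Calling snapshot_mem twice a few minutes apart and running diff on the two printed directories is one way to spot counters that only ever grow.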

Current Issues, Observations and Status

Out-of-memory due to kernel allocations

Status: Unsolved

Issue: OOM due to allocations in kernelspace.

Related tickets: #1243, #1306, #1197

How to trigger: In networks with a high number of nodes?


Observations so far:

  • First observed after the first Gluon releases based on LEDE
  • Nothing suspicious in /proc/slabinfo on crash
    • Seems to rule out the Linux bridge or batman-adv as potential causes
  • Running 'echo fq_memory_limit 200 > /sys/kernel/debug/ieee80211/phy0/aqm' (seemingly?) had a positive effect
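On nodes with more than one radio, the fq_memory_limit observation above would presumably have to be applied per phy. A minimal sketch, assuming every radio exposes an aqm file under the same debugfs path as phy0; the function name set_fq_memory_limit and its optional base-directory argument are hypothetical and exist mainly so that the loop can be tried against a scratch directory first.

```shell
#!/bin/sh
# Sketch: write "fq_memory_limit <limit>" to the aqm file of every phy.
# set_fq_memory_limit is a hypothetical helper; the value 200 used in
# the usage note is the one from the observation above, not a tuned
# recommendation.

set_fq_memory_limit() {
    limit="$1"
    base="${2:-/sys/kernel/debug/ieee80211}"

    for aqm in "$base"/phy*/aqm; do
        # Skip if the glob did not match or the file is not writable
        # (e.g. debugfs not mounted, or not running as root).
        [ -w "$aqm" ] || continue
        if echo "fq_memory_limit $limit" > "$aqm"; then
            echo "$aqm"   # report which files were updated
        fi
    done
}
```

Usage on a node would be 'set_fq_memory_limit 200'. The setting is not persistent, so it has to be reapplied after a reboot.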

Tasks:

  • Find a setup that reproduces the issue in an isolated configuration.

OOM on IP Fragments

Status: Unsolved

Issue: The IPv4 and IPv6 fragment reassembly buffers may hold up to 8MB of packets in total (4MB per address family).

Related tickets: -

How to trigger: An OOM was easily triggered by running iperf3 on a node with packets large enough to be fragmented ($ iperf3 -l 1500). However, it should potentially be triggerable with external traffic alone, without extra tools on the node, too?
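While trying to reproduce this, it may help to watch how much memory the reassembly queues currently hold. The kernel reports this on the FRAG and FRAG6 lines of /proc/net/sockstat and /proc/net/sockstat6. A minimal sketch; the function name frag_memory and its optional file arguments are hypothetical.

```shell
#!/bin/sh
# Sketch: print the "memory" value of the FRAG / FRAG6 lines.
# frag_memory is a hypothetical helper; the two arguments default to
# the real procfs files and can be overridden for a dry run.

frag_memory() {
    for f in "${1:-/proc/net/sockstat}" "${2:-/proc/net/sockstat6}"; do
        [ -r "$f" ] || continue
        # Lines look like "FRAG: inuse 0 memory 0"; find the field
        # after the word "memory" rather than relying on its position.
        awk '/^FRAG6?:/ {
            for (i = 2; i < NF; i++)
                if ($i == "memory") print $1, $(i + 1)
        }' "$f"
    done
}
```

Running 'while true; do frag_memory; sleep 1; done' on the node during the iperf3 test would show whether the values climb towards the configured thresholds.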


This should be easily fixable by lowering /proc/sys/net/ipv6/ip6frag_{low,high}_thresh and /proc/sys/net/ipv4/ipfrag_{low,high}_thresh. Additional firewall rules might be considered, too.
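Lowering the four thresholds could look roughly like the sketch below. The function name trim_frag_thresh and its optional root-directory argument are hypothetical, and the byte values in the usage note are placeholders chosen for illustration, not tested recommendations for Gluon nodes.

```shell
#!/bin/sh
# Sketch: lower the IPv4 and IPv6 fragment reassembly thresholds.
# trim_frag_thresh is a hypothetical helper; LOW and HIGH are in bytes.

trim_frag_thresh() {
    low="$1"
    high="$2"
    root="${3:-/proc/sys/net}"

    for prefix in "$root/ipv4/ipfrag" "$root/ipv6/ip6frag"; do
        # Write the low threshold first: the kernel may reject a high
        # threshold that is below the currently configured low one.
        if [ -w "${prefix}_low_thresh" ]; then
            echo "$low" > "${prefix}_low_thresh"
        fi
        if [ -w "${prefix}_high_thresh" ]; then
            echo "$high" > "${prefix}_high_thresh"
        fi
    done
}
```

For example, 'trim_frag_thresh 196608 262144' would cap each address family at 256KB. As with any sysctl written this way, the change is lost on reboot unless it is also made persistent.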

Archived Issues

OOM on accessing transtable_global via debugfs

Status: Solved

Issue: In setups involving ~2500 client devices, nodes crashed frequently. The cause was alfred and respondd reading the global batman-adv translation table via debugfs, which led to high-order memory allocations due to the large table size.

Related tickets: #753

How to trigger: Spawn >2500 client devices, then 'cat /sys/kernel/debug/batman_adv/bat0/transtable_global'


The issue was fixed by implementing a netlink-based interface in batman-adv and having alfred and respondd use it to access the global batman-adv translation table.
