After 30 hours - crash where to start #10146

GC-RnD · 2022-12-01T20:42:38Z

GC-RnD
Dec 1, 2022

After 1,800 loops of 60 s each (= 30 hr; this almost slipped under the wire),
or 1,800 loops at 3.4 s each (= ~1 hr 42 min),
the ESP32 crashes.

INFO:
MicroPython v1.17-805-g7b1d10d69-dirty on 2022-04-05;
4MB/OTA/SPIram module with ESP32
LVGL v8.1.0-dev

Partition.find(Partition.TYPE_APP)
Partition type=0, subtype=16, address=65536, size=1835008, label=ota_0, encrypted=0
Partition type=0, subtype=17, address=1900544, size=1835008, label=ota_1, encrypted=0

Partition.find(Partition.TYPE_DATA)
Partition type=1, subtype=2, address=36864, size=16384, label=nvs, encrypted=0
Partition type=1, subtype=0, address=53248, size=8192, label=otadata, encrypted=0
Partition type=1, subtype=1, address=61440, size=4096, label=phy_init, encrypted=0
Partition type=1, subtype=129, address=3735552, size=264448, label=vfs, encrypted=0

Memory hardly moves after each loop.
mem_alloc: 99056
mem_free: 3999184

The errors vary, but the one below came up most often:
esp-idf/components/freertos/queue.c:705 (xQueueCreateCountingSemaphore)- assert failed!

The first thing to be said is... your memory is fragmented.
I call gc.collect() after every function I call and more: 62 times in a 2,200-line script.

The loop is a very simple scan for BLE devices,
return a buffered list of only two items,
and plug the values into pre allocated dictionaries.

Hardly any strings created.

Is there a way to monitor this "fragmentation"?
How does one go about finding the culprit?
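
One low-cost way to watch the heap over a long run is sketched below: sample `gc.mem_free()` once per loop and keep only the first and lowest values, so the monitor itself allocates almost nothing. `micropython.mem_info(1)` additionally prints a block-by-block map of the heap. The names `heap_sample` and `HeapTrend` are hypothetical, and the `getattr` fallback is only there so the sketch can be dry-run on desktop Python, where `gc.mem_free()` does not exist.

```python
import gc


def heap_sample():
    """Return (allocated, free) bytes of the MicroPython heap, or (None, None)."""
    gc.collect()
    mem_alloc = getattr(gc, "mem_alloc", None)  # MicroPython-only function
    mem_free = getattr(gc, "mem_free", None)    # MicroPython-only function
    if mem_alloc is None or mem_free is None:
        return None, None
    return mem_alloc(), mem_free()


class HeapTrend:
    """Track the first and the lowest free-heap value seen (hypothetical helper)."""

    def __init__(self):
        self.samples = 0
        self.first_free = None
        self.min_free = None

    def update(self, free):
        if free is None:
            return  # nothing to record off-device
        self.samples += 1
        if self.first_free is None:
            self.first_free = free
        if self.min_free is None or free < self.min_free:
            self.min_free = free

    def lost(self):
        """Bytes between the first sample and the lowest sample."""
        if self.first_free is None:
            return 0
        return self.first_free - self.min_free
```

In the main loop this would be called as `trend.update(heap_sample()[1])`, printing `trend.lost()` every hundred loops or so. A value that keeps growing would point at a leak in the MicroPython heap; a flat value would suggest looking elsewhere.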

My feeling is that the Bluetooth scan is the problem, as it was a problem in the past.
I see posts about requests and Bluetooth having issues. I use requests, but not during the 1,800 loops.
I also send WebSocket messages to a web page hosted on the ESP32 on every loop.

Also where is the heap located in the partition table?

This is a large project that is halted and any H$LP would be appreciated.

davefes · 2022-12-02T18:31:12Z

davefes
Dec 2, 2022

Whether or not this is of much help ... I ended up going to the PSRAM variant for two dataloggers that kept crashing on ESP32-WROOM REV1 chips.

It was by using try/except clauses everywhere, a software watchdog and saving the errors to a log file that I found out where the problems were.

In the software watchdog I would record the time of the fault to that logfile, so I could identify what actually went wrong.
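
A minimal sketch of the logging half of that approach, assuming a writable filesystem on the device: wrap each call in try/except and append one timestamped line per fault to a file. `log_fault`, `guarded` and the file name `fault.log` are hypothetical. It only catches Python-level exceptions, not resets that happen below the interpreter.

```python
import time

LOG_PATH = "fault.log"  # hypothetical file name


def log_fault(where, exc, path=LOG_PATH):
    """Append one line with time, location and exception to the log file."""
    line = "%d %s %s: %s\n" % (time.time(), where, type(exc).__name__, exc)
    with open(path, "a") as f:
        f.write(line)
    return line


def guarded(where, fn, *args, path=LOG_PATH):
    """Call fn(*args); on any exception, log it and return None."""
    try:
        return fn(*args)
    except Exception as exc:
        log_fault(where, exc, path)
        return None
```

For example, `guarded("ble_scan", start_scan)` would record which stage failed and when, so the log can be read back after the next boot.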

Good luck!

1 reply

GC-RnD Dec 3, 2022
Author

Done all that... I was hoping the except would give me a clue, but the unit just reboots... no messages.
There is so little discussion on the topic of debugging memory. I have 500 units ready to be flashed,
and for the life of me I cannot determine the issue.
I can run my main loop with no issue and can run the BLE scan in a loop with no problem, but put the two together... 1,800 loops and done.
As I mentioned before, I am wi$$ing for help.
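
When a unit reboots without any message, one small clue that can still be collected at the next boot is the reset cause. The sketch below assumes the `machine.reset_cause()` call and the reset constants of the MicroPython ESP32 port; `reset_cause_name` is a hypothetical helper, and the import fallback only allows a dry run off-device.

```python
try:
    import machine  # MicroPython only
except ImportError:
    machine = None  # allows a dry run on desktop Python

# Constant names looked up on the machine module; absent ones are skipped.
_CAUSES = ("PWRON_RESET", "HARD_RESET", "WDT_RESET", "DEEPSLEEP_RESET", "SOFT_RESET")


def reset_cause_name(mod=machine):
    """Return the name of the last reset cause as a string."""
    if mod is None:
        return "unavailable"
    cause = mod.reset_cause()
    for name in _CAUSES:
        if getattr(mod, name, None) == cause:
            return name
    return "unknown (%d)" % cause
```

Writing `reset_cause_name()` to the log file at the top of `boot.py` or `main.py` would at least separate watchdog resets from other restarts.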

peterhinch · 2022-12-03T16:11:59Z

peterhinch
Dec 3, 2022
Collaborator

Unfortunately I have no experience of Bluetooth but I have some of radio communication.

Getting radio links working is easy; getting them working reliably is less so. I would start out by trying to prove whether the BT code is the problem. Run one or more devices long term with the BT code simulated. While that test is ongoing, try a unit with BT enabled and see if you can provoke the failure. Possible causes are interference from other equipment, weak signal levels or both.

Lacking specialist RF testgear, weak signals can be achieved by increasing the distance between the device under test (DUT) and the machine being scanned. In extremis put the DUT in a microwave (an excellent Faraday cage) and gradually shut the door. Electrical interference could be generated from another ESP using WiFi heavily, a radio running in the same band such as NRF24L01, or some nasty random source like electrical sparks (take care how you create these).

If you can find a way to provoke failure, identifying the cause in code should be quicker. You might also be able to reduce the problem down to a simple test case which could be discussed.
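
Simulating the BT code could look something like the sketch below: a stand-in object that exposes only `irq()` and `gap_scan()` and replays canned advertisements through the registered handler. `SimulatedBLE` is a hypothetical class, not part of MicroPython; the event codes 5 and 6 are the values the `bluetooth` module documents for `_IRQ_SCAN_RESULT` and `_IRQ_SCAN_DONE`.

```python
_IRQ_SCAN_RESULT = 5
_IRQ_SCAN_DONE = 6


class SimulatedBLE:
    """Hypothetical stand-in for bluetooth.BLE with only irq() and gap_scan()."""

    def __init__(self, adverts):
        # Each advert: (addr_type, addr, adv_type, rssi, adv_data)
        self._adverts = adverts
        self._handler = None
        self.scans = 0

    def irq(self, handler):
        self._handler = handler

    def gap_scan(self, duration_ms, *args):
        if duration_ms is None or self._handler is None:
            return  # a stop request, or no handler registered: nothing to do
        self.scans += 1
        # Deliver every canned advert, then signal the end of the scan.
        for advert in self._adverts:
            self._handler(_IRQ_SCAN_RESULT, advert)
        self._handler(_IRQ_SCAN_DONE, None)
```

Swapping this in for the real BLE object leaves the rest of the application untouched, so a long-running unit exercises the same loop without any radio activity.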

Of course you may find that the units with simulated BT do fail, in which case you have a conventional (difficult) debugging task. Only in that case would I worry about issues like RAM fragmentation. Memory failures do not normally result in a crash: I'd expect an exception to be thrown.

3 replies

GC-RnD Dec 3, 2022
Author

The problem can be reproduced... ~1,800 loops and she's out of gas.
When I split the BLE scan out of the code (~2,300 lines), no problem.
Run the BLE scan on its own... again no problems.
It is when they are together that it crashes after 1,800 times around the block.

START WITH...
stack: 720 out of 31744
GC: total: 4098240, used: 120080, free: 3978160
No. of 1-blocks: 1897, 2-blocks: 593, max blk sz: 467, max free sz: 248627
GC memory layout; from 3f817740:

END WITH...
stack: 720 out of 31744
GC: total: 4098240, used: 92816, free: 4005424
No. of 1-blocks: 1228, 2-blocks: 316, max blk sz: 264, max free sz: 248412
GC memory layout; from 3f817740:

Does anything look wrong in this block dig?

I use global lists and dicts, with no real string creation at all.
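
One way to read those figures is to compare the largest free region with the total free heap. The sketch below parses the two summary lines and returns that ratio. It assumes that "max free sz" is counted in GC blocks and that a block is 16 bytes on this port; `parse_fields` and `contiguous_fraction` are hypothetical helpers.

```python
BLOCK_BYTES = 16  # assumption: GC block size on 32-bit ports such as ESP32


def parse_fields(line):
    """Turn 'GC: total: 1, used: 2' style text into {'total': 1, 'used': 2}."""
    fields = {}
    for part in line.split(","):
        pieces = part.split(":")
        if len(pieces) < 2:
            continue
        name = pieces[-2].strip()
        try:
            fields[name] = int(pieces[-1].strip())
        except ValueError:
            pass  # ignore anything that is not a plain integer
    return fields


def contiguous_fraction(gc_line, blocks_line):
    """Largest free region (bytes) divided by total free bytes."""
    gc_info = parse_fields(gc_line)
    blocks = parse_fields(blocks_line)
    return blocks["max free sz"] * BLOCK_BYTES / gc_info["free"]


start = contiguous_fraction(
    "GC: total: 4098240, used: 120080, free: 3978160",
    "No. of 1-blocks: 1897, 2-blocks: 593, max blk sz: 467, max free sz: 248627",
)
end = contiguous_fraction(
    "GC: total: 4098240, used: 92816, free: 4005424",
    "No. of 1-blocks: 1228, 2-blocks: 316, max blk sz: 264, max free sz: 248412",
)
```

If the 16-byte assumption holds, both ratios come out above 0.99, meaning almost all free heap is one contiguous region at both the start and the end of the run. That would argue against fragmentation of the MicroPython heap as the cause.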

jimmo Dec 5, 2022
Maintainer

@GC-RnD unfortunately there isn't really enough information here to be able to help very much.

About the only thing I can suggest concerns this line from your post:

MicroPython v1.17-805-g7b1d10d69-dirty on 2022-04-05;

It's definitely worth trying to update to v1.19 (or better, the latest nightly build) as there have been significant changes (and improvements) to Bluetooth since v1.17.

GC-RnD Dec 5, 2022
Author

This project also uses LVGL, and my IDF build was heavily mangled with custom modifications in order to get everything to work. The SSD in my new PC (~3 months old) died... no backup. It will take me a month or two to get back to where I am now.
I am out of time... I am sending my current .bin to our production house in China, who have been sitting on 500 units waiting to be flashed. I wish there were a real-time resource monitor for the ESP32, to see where the issue is.
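
A partial monitor does exist on the ESP32 port: `esp32.idf_heap_info()` reports the ESP-IDF heap, which is separate from the MicroPython heap shown by `gc.mem_free()`. Since the assert above comes from ESP-IDF's FreeRTOS code, that heap may be worth watching too, although this is a guess rather than a diagnosis. The sketch below assumes the documented 4-tuple layout (total bytes, free bytes, largest free block, minimum free seen); `idf_heap_regions` and `summarize` are hypothetical helpers, and the import fallback only allows a dry run off-device.

```python
try:
    import esp32  # available on the MicroPython ESP32 port
except ImportError:
    esp32 = None  # allows a dry run on desktop Python


def idf_heap_regions():
    """Return the IDF data-heap regions, or an empty list off-device."""
    if esp32 is None:
        return []
    return esp32.idf_heap_info(esp32.HEAP_DATA)


def summarize(regions):
    """regions: 4-tuples of (total, free, largest_free_block, min_free_seen)."""
    free = 0
    largest = 0
    min_free_sum = 0
    for _total, free_now, largest_block, min_free in regions:
        free += free_now
        min_free_sum += min_free
        if largest_block > largest:
            largest = largest_block
    return free, largest, min_free_sum
```

Printing `summarize(idf_heap_regions())` every hundred loops would show whether the IDF heap shrinks steadily as the loop count approaches 1,800.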

@jimmo I ran my code without BLE scanning with no issue, and ran BLE scanning by itself with no problems.
Is my code flow acceptable?
A) bt.gap_scan triggered from an async function
B) bt_irq calling uData...
    bt.gap_scan(None)
elif event == _IRQ_SCAN_DONE:
    uData()
uData just looks at a global list sensor=[b'\x00\x00\x000\x000', b'\x00\x00\x000\x000']
that is populated from _IRQ_SCAN_RESULT:
if sensor[0] == b'\x00\x00\x000\x000':
    sensor[0] = bytearray(pack(">h", data[3]) + bytes(data[4])[-4:])

If I run uData from the async function instead of from bt.gap_scan, I have no issues.

You are correct that the IDF will need to be rebuilt; unfortunately that's going to have to be after our first run of units.
