Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race Condition in otbr-web Start During Boot #2388

Open
rwkeane opened this issue Jul 19, 2024 · 3 comments
Open

Race Condition in otbr-web Start During Boot #2388

rwkeane opened this issue Jul 19, 2024 · 3 comments

Comments

@rwkeane
Copy link

rwkeane commented Jul 19, 2024

Describe the bug A clear and concise description of what the bug is.

I have OTBR currently running on an RPI device (both 3 and 4 repro this issue). I am able to successfully start otbr-agent and it runs fine (including when started automatically as part of the boot process), but otbr-web fails after boot with error [ERR ]-WEB-----: OpenThread daemon is not running.

Running systemctl stop otbr-web and then systemctl stop otbr-web to restart the service fixes the issue, and it runs without any errors.

To Reproduce Information to reproduce the behavior, including:

  1. Git commit id: commit 8e4d36b
  2. IEEE 802.15.4 hardware platform: Home Assistant SkyConnect
  3. Build steps: Set up as per the code lab
  4. Network topology: No network currently configured

Expected behavior A clear and concise description of what you expected to happen.

The web service should correctly delay starting until the agent is running.

Console/log output If applicable, add console/log output to help explain your problem.

on boot:

otbr-web (via systemctl status otbr-web.service):

Jul 18 20:33:33 ha-satellite systemd[1]: Started otbr-web.service - Border Router Web.
Jul 18 20:33:33 ha-satellite otbr-web[1398]: otbr-web[1398]: [INFO]-WEB-----: Running 0.3.0-8e4d36b
Jul 18 20:33:33 ha-satellite otbr-web[1398]: otbr-web[1398]: [INFO]-WEB-----: Border router web started on wpan0
Jul 18 20:33:33 ha-satellite otbr-web[1398]: otbr-web[1398]: [ERR ]-WEB-----: OpenThread daemon is not running.
Jul 18 20:33:33 ha-satellite otbr-web[1398]: [INFO]-WEB-----: Running 0.3.0-8e4d36b
Jul 18 20:33:33 ha-satellite otbr-web[1398]: [INFO]-WEB-----: Border router web started on wpan0
Jul 18 20:33:33 ha-satellite otbr-web[1398]: [ERR ]-WEB-----: OpenThread daemon is not running.

otbr-agent (via systemctl status otbr-agent.service):

Jul 18 20:33:34 ha-satellite otbr-agent[1392]: 00:00:00.151 [I] Platform------: Execute command `ipset swap otbr-ingress-deny-src-swap otbr-ingress-deny-src` = 0
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: 00:00:00.154 [I] Platform------: Execute command `ipset swap otbr-ingress-allow-dst-swap otbr-ingress-allow-dst` = 0
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: 00:00:00.154 [I] P-McastRtMgr--: Disable: OK
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: [INFO]-BA------: Publish meshcop service OpenThread BorderRouter #55D9._meshcop._udp.local.
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: 00:00:00.154 [I] RouterTable---: Route table
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: 00:00:00.157 [I] P-Netif-------: Host netif is down
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: 00:00:00.158 [I] P-Netif-------: Succeeded to process request#1
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: [INFO]-MDNS----: Successfully registered service OpenThread BorderRouter #55D9._meshcop._udp
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: [INFO]-BA------: Result of publish meshcop service OpenThread BorderRouter #55D9._meshcop._udp.local: OK
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: [INFO]-BA------: Result of publish meshcop service OpenThread BorderRouter #55D9._meshcop._udp.local: OK

after restarting otbr-web with commands earlier (ignore the timestamps):

Jul 18 20:30:06 ha-satellite systemd[1]: Started otbr-web.service - Border Router Web.
Jul 18 20:30:06 ha-satellite otbr-web[5031]: otbr-web[5031]: [INFO]-WEB-----: Running 0.3.0-8e4d36b
Jul 18 20:30:06 ha-satellite otbr-web[5031]: otbr-web[5031]: [INFO]-WEB-----: Border router web started on wpan0
Jul 18 20:30:06 ha-satellite otbr-web[5031]: [INFO]-WEB-----: Running 0.3.0-8e4d36b
Jul 18 20:30:06 ha-satellite otbr-web[5031]: [INFO]-WEB-----: Border router web started on wpan0

Additional context Add any other context about the problem here.

I've got 3 OTBR devices (RPI3, RPI4, mini computer w/ home assistant) as well as an official Thread Border Router (Nest device) running on my wifi network (though no thread network as of yet). It's possible they are all fighting over the same mDNS name, and the probe -> retry -> repeat... name resolution process is taking a long time

@superwhd
Copy link
Contributor

I guess it's probably because otbr-web didn't wait for the otbr-agent to be ready, which it should have done.

Running systemctl stop otbr-web and then systemctl stop otbr-web to restart the service fixes the issue, and it runs without any errors.

I'm curious why you needed to stop it twice (or is it a typo? 😃 ).

It's possible they are all fighting over the same mDNS name, and the probe -> retry -> repeat... name resolution process is taking a long time

Do you have logs supporting this? As far as I know this should rarely happen because the mDNS service instance name of a device ends with last few digits of its extended MAC address.

@rwkeane
Copy link
Author

rwkeane commented Jul 19, 2024

I'm curious why you needed to stop it twice (or is it a typo? 😃 )

That's just a typo :)

Do you have logs supporting this?

It was only my guess based on Result of publish meshcop service OpenThread BorderRouter #55D9._meshcop._udp.local: OK not occurring until after the otbr-web had already launched (and failed)

@superwhd
Copy link
Contributor

The error [ERR ]-WEB-----: OpenThread daemon is not running should be transient and will be automatically resolved when otbr-agent is launched. I just tried launching otbr-agent after otbr-web and it's working fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants