
[BUG] HTTP down #299

Open
eskey0 opened this issue Apr 8, 2024 · 185 comments

Comments

@eskey0

eskey0 commented Apr 8, 2024

Faikin hardware
Faikin-S3-MINI-N4-R2: 91c1bc5 2024-03-31T10:59:15 S21 from Amazon

Daikin hardware
FTXP35N5V1B via s403

Describe the bug
The web interface goes down. I can still control the unit via MQTT and ping it, but there is no HTTP response whatsoever.

To Reproduce
No idea; it happened out of the blue. I waited to see if it would come back, but no dice.

Expected behavior
The web service keeps working. I looked for a way to reboot via MQTT to see if that would fix it, but found none.

Additional context
I have 3 of them, all configured and set up on the same day; only one of them failed.

@revk
Owner

revk commented Apr 8, 2024

Hmm, odd. We had this ages ago on older code with an app using the legacy URLs, but that was fixed long ago.

Try just power cycling or sending the restart command over MQTT, and see if it comes back.

Try the web interface via IP address, not URL/domain, in case it is an mDNS issue.

@eskey0
Author

eskey0 commented Apr 8, 2024

Sorry, I didn't specify: yes, I do use the IP address directly to connect to the device.
After the restart command the website is up again. I don't know if you want to dig further into this or leave it for now.

@revk
Owner

revk commented Apr 8, 2024

OK, not sure. As I say, I have only seen this with some very specific (and now fixed) legacy IP polling. See if it happens again.

@eskey0
Author

eskey0 commented Apr 8, 2024

Sure, I'll keep an eye on this and keep you updated. Thanks, sir, you're awesome!

@revk revk closed this as completed Apr 8, 2024
@eskey0
Author

eskey0 commented Apr 15, 2024

Hello there again, just a heads-up: my second device also got into the "HTTP fail" state, and I again fixed it with an MQTT restart. Now my third device is in that state too.

EDIT: Just wanted to share the status. If no one else is experiencing this, maybe it's something in my setup.

@revk
Owner

revk commented Apr 15, 2024

Are you using the legacy URLs / polling them?

@eskey0
Author

eskey0 commented Apr 15, 2024

I just navigate to http://ipaddress in the browser; it usually just works.

@revk
Copy link
Owner

revk commented Apr 15, 2024

OK, but no tools, HA plug-ins, or anything else that may be accessing the legacy URLs for data?

@eskey0
Copy link
Author

eskey0 commented Apr 15, 2024

Not that I am aware of; just HA through MQTT. Nothing uses HTTP besides my browser, which I rarely use.

@revk
Copy link
Owner

revk commented Apr 15, 2024

OK. I know some HA plug-ins use the old URLs, but if you are using MQTT, that should be fine. Which leaves me rather puzzled by the issue, to be honest.

@eskey0
Copy link
Author

eskey0 commented Apr 15, 2024

It also looks timed: one failed, I rebooted it, about 3 to 5 days passed, then the next one failed, and so on. Now it is the third one (of 3). I can reboot it via MQTT too and see whether the cycle starts again with the first one that failed.

To give you more insight: I have a more complex than average network. The Faikins are on a restricted network segment, along with some cameras, with access only to HA through MQTT, to the web interface from my computer, and to your update server.

I do have some plug-ins in HA, but those were for the "official" modules; they were assigned different IP addresses and I disconnected them from the units, so I don't think that could be an issue.

@eskey0
Copy link
Author

eskey0 commented Apr 17, 2024

I have more information to share: it happened again, this time to 2 of the 3 devices I have. It happened just after I changed the Wi-Fi band on my AP; does that ring any bells? Again, after sending an MQTT reboot, the website came back online.

I should add that I live in an apartment that is very noisy, Wi-Fi-wise.

@antwin

antwin commented Aug 18, 2024

This has just happened to me. The device is online - responds to pings, nmap can see it but not analyse it, it works on mqtt, but the webserver times out. Addressed by ip address. Webserver is up again after an mqtt restart. Uptime was a few days.
Just before the web server stopped, I was looking at the page. It loaded fine the first time, then showed just the blue screen with no buttons. On a reload it loaded everything, then timed out.
Faikin-S3-MINI-N4-R2: b16bfc4 2024-08-12T14:02:04 S21
Would any more info help - wireshark capture, status output ... ?

@revk
Copy link
Owner

revk commented Aug 18, 2024

Just to check, are you using the legacy URLs? We think, somehow, there is a memory leak, possibly in the ESP IDF.

@antwin

antwin commented Aug 18, 2024

I'm not sure what you mean by legacy URLs. I'm using the IP address (192.168.0.150) directly.

@revk
Copy link
Owner

revk commented Aug 18, 2024

I.e. a monitoring app that talks http to Faikin to get/set data. The way the old Daikin wifi modules used to work.

@antwin

antwin commented Aug 18, 2024

I'm using Firefox to read from http://192.168.0.150 (the Faikin) on one computer. The page appears to be refreshed at intervals. I have not disconnected the original Daikin wifi module, but that has never been used, and the Daikin app is not available here.

@revk
Copy link
Owner

revk commented Aug 18, 2024

OK, sounds like you are not using the legacy HTTP API then. The web page on the Faikin is not "refreshed"; it uses a web socket. It should have no problem working indefinitely. I'm puzzled as to why you think it is being refreshed.

When we have seen issues with the web server stopping, it has always been down to someone using some app (not the Daikin app; usually some Home Assistant plug-in that is not using MQTT) that polls the legacy HTTP APIs constantly, and we think there is some memory leak issue from that, but we are not 100% sure.

If you are not doing that, it is the first case of a problem like this.

Can you check the settings / basic page occasionally and see if the memory figures on that page are going down over time?

@revk revk reopened this Aug 18, 2024
@antwin

antwin commented Aug 18, 2024

First off, thanks for the prompt replies - I'm very impressed!
My terminology was off. The page is updated, which is why I assumed it was refreshed. I must get the hang of websockets some day.
I'm not using HA. I intend to be using MQTT sometime.
I'll check the memory figures on the settings page, but it's a cold wet night here (NZ) and I'm off to bed, so there will be a pause of a day or two.

@revk
Owner

revk commented Aug 18, 2024

Have a good night. The fact this is not using legacy HTTP APIs is interesting, and so may give us clues.

@antwin

antwin commented Aug 20, 2024

Here are some preliminary results from status/faikin. Are these what you need to see?
{"ts":"2024-08-20T05:25:26Z","id":"DC5475EF52FC","up":true,"uptime":3690,"mqtt-up":3686,"mem":119504,"spi":2090296}
{"ts":"2024-08-20T08:28:48Z","id":"DC5475EF52FC","up":true,"uptime":14692,"mqtt-up":14688,"mem":119324,"spi":2090196}
{"ts":"2024-08-20T22:40:31Z","id":"DC5475EF52FC","up":true,"uptime":65794,"mqtt-up":65790,"mem":119120,"spi":2090196}
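Samples like these are easiest to judge as a trend. A minimal sketch, using a hypothetical `mem_trend` shell helper (not part of Faikin), that pulls `uptime`, `mem` and `spi` out of each status line and prints how far `mem` has moved since the first sample. It pattern-matches the flat numeric fields shown here rather than parsing JSON properly, and it skips lines with no `mem` field.

```shell
#!/bin/sh
# mem_trend: read Faikin status lines (one JSON object per line) on stdin and
# print "uptime mem spi change-in-mem-since-first-sample" for each of them.
# Hypothetical helper: it pattern-matches flat numeric fields with awk and is
# not a general JSON parser.
mem_trend() {
  awk '
    # Return the digits of "key":NNN in line, or "" if the key is absent.
    function field(line, key,    re, s) {
      re = "\"" key "\":[0-9]+"
      if (match(line, re) == 0) return ""
      s = substr(line, RSTART, RLENGTH)
      sub(/^.*:/, "", s)
      return s
    }
    {
      up = field($0, "uptime"); mem = field($0, "mem"); spi = field($0, "spi")
      if (mem == "") next              # skip samples with no "mem" field
      if (first == "") first = mem     # baseline is the first usable sample
      printf "%s %s %s %d\n", up, mem, spi, mem - first
    }'
}
```

Fed the three lines above, it prints `3690 119504 2090296 0`, `14692 119324 2090196 -180` and `65794 119120 2090196 -384`, a drop of a few hundred bytes over about 18 hours. To collect samples, something like `mosquitto_sub -h BROKER -t status/faikin > status.log` followed by `mem_trend < status.log` should work, assuming a broker reachable as `BROKER` and that the topic really is the `status/faikin` mentioned above.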

@revk
Copy link
Owner

revk commented Aug 21, 2024

Ah, perfect, yes: mem and SPI, over time.

@antwin

antwin commented Aug 25, 2024

No http hangs for several days!
More results:
{"ts":"2024-08-21T09:38:24Z","id":"DC5475EF52FC","up":true,"uptime":105267,"mqtt-up":21698,"mem":119324,"spi":2090196}
{"ts":"2024-08-21T23:28:38Z","id":"DC5475EF52FC","up":true,"uptime":155080,"mqtt-up":71511,"mem":118760,"spi":2090196}
{"ts":"2024-08-23T23:37:50Z","id":"DC5475EF52FC","up":true,"uptime":328430,"mqtt-up":244861,"mem":118676,"spi":2090040}
{"ts":"2024-08-25T05:06:19Z","id":"DC5475EF52FC","up":true,"uptime":434538,"mqtt-up":350969} mem 113600+2090108 (for some reason, it is no longer reporting "mem" in status)

@antwin

antwin commented Aug 27, 2024

MQTT is working fine. BUT although HTTP is working on one device I cannot connect on a second device. Current status:
{"ts":"2024-08-27T22:42:04Z","id":"DC5475EF52FC","up":true,"uptime":670681,"mqtt-up":587112,"mem":109792,"spi":2089848}

@revk
Copy link
Owner

revk commented Aug 28, 2024

OK, that means it is not a memory leak. I'll have to look at number of TCP sockets or something.

Does it eventually recover, or does it need a restart?

@antwin

antwin commented Aug 28, 2024

The working one worked for some hours. But it has also just stopped. It stopped with just the blue background page and 'settings....' at the bottom left, so no updating. So now no http connection on either, but pings and mqtt work fine.

@revk
Copy link
Owner

revk commented Aug 28, 2024

This sounds a lot like a TCP related issue. I'll have to have a play with the options.

@macmpi
Copy link

macmpi commented Feb 27, 2025

We have not really got to the bottom of why a very few customers have this issue. [...]
A big issue is that I cannot reproduce the problem. So I am trying things blind. [...]
But I am open to any suggestions, obviously.

@revk how can we provide more input if you still cannot reproduce it as suggested?
I'm open to giving you remote access to my device for live testing/diagnostics, as I can make it fail regularly.
We could do that over RustDesk, for you to operate my Linux machine, access the device through Firefox and a terminal, and see the MQTT results (MQTT 5 Explorer).
If up for that, let's find a 1:1 communication channel to share details.

@revk
Copy link
Owner

revk commented Feb 27, 2025

I don't have any other browsers installed, it would be a nuisance if browser specific, but if browsers are opening and holding open TCP connections that would be an issue as limited number of connections.
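If a limited number of simultaneous TCP connections is the factor, one rough way to probe it is to fire several requests at once and count how many fail. The hypothetical `burst` helper below does that with curl. Unlike a browser, it does not keep connections open between requests, so it only approximates the behaviour described; the 10-second `--max-time` per request is an arbitrary choice.

```shell
#!/bin/sh
# burst URL N: fire N concurrent requests at URL and count how many fail.
# Hypothetical helper for probing a connection limit; not part of Faikin.
burst() {
  url=$1 n=$2 i=0
  dir=$(mktemp -d) || return 2
  while [ "$i" -lt "$n" ]; do
    i=$((i + 1))
    # Each request runs in the background and records its curl exit code.
    ( curl --silent --output /dev/null --max-time 10 "$url"; echo $? > "$dir/$i" ) &
  done
  wait                                   # wait for every background request
  failed=0
  for f in "$dir"/*; do
    [ -f "$f" ] || continue              # glob did not match: N was 0
    [ "$(cat "$f")" -eq 0 ] || failed=$((failed + 1))
  done
  rm -rf "$dir"
  echo "$failed of $n requests failed"
}
```

For example, `burst http://DEVICE-IP/ 8` reports how many of 8 simultaneous page loads did not complete; raising N step by step may show whether failures start at a particular concurrency.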

Have you tried with safari at all?

@macmpi

macmpi commented Feb 27, 2025

I don't have any other browsers installed, it would be a nuisance if browser specific, but if browsers are opening and holding open TCP connections that would be an issue as limited number of connections.

I'm afraid your customers may have different browser preferences/constraints among the classic family of Safari, Chrome and Firefox.
I have not tested them all, but at least we know how to reproduce the bug on a few.
I can offer you remote access to my PC and networked device with this configuration; I can't do more to help you diagnose it if you can't diagnose it locally.
I think it is now the only remaining debugging option, should you want to get to the bottom of it.
Let me know if we shall work that out.

Have you tried with safari at all?

Yes, it fails on iOS/Safari too, as already reported.

@revk
Copy link
Owner

revk commented Feb 27, 2025

OK that is odd as Safari seems OK for me. I can try more. Remote access may not help a lot, ideally I need it broken here so I can add extra debug on serial and monitor what is going on in detail.

@macmpi
Copy link

macmpi commented Feb 27, 2025

Your call (hoped you could have a special debug version pushing meaningful logs over MQTT).

Try accessing the same Faikin home page from 2 Safari devices.
To see the issue start developing, it's best to have a debugger/inspector browser view: I guess Mac Safari can enable developer mode (the Web Inspector option seems possible on iPhone/iPad too).
You should probably see DOMExceptions popping up here and there, becoming more severe until the delayed page display is noticeable, and eventually no page updates at all (complete lock-up).

@revk
Copy link
Owner

revk commented Feb 27, 2025

I will see when I have a chance, two web sockets at a time may be a factor I guess.

@MaienM
Copy link

MaienM commented Feb 27, 2025

Just to reiterate, I am seeing this issue just using curl, no browser involved (I control the device through MQTT). This means no web sockets and no long-lived sockets in my case (unless the kernel is keeping the sockets around after the curl command exits, but that seems unlikely).

It's entirely possible that the ways a browser does these requests (a larger volume & the addition of web sockets) exacerbates the issue, but this does not appear to be necessary for the issue to occur.

@macmpi
Copy link

macmpi commented Feb 27, 2025

Thanks for confirmation.
BTW, do you have a BLE device associated too?

It's entirely possible that the ways a browser does these requests (a larger volume & the addition of web sockets) exacerbates the issue, but this does not appear to be necessary for the issue to occur.

Sure, but the priority at this point is to indeed exacerbate it, so that @revk can at least reproduce & observe it.

@revk
Copy link
Owner

revk commented Feb 27, 2025

Just to reiterate, I am seeing this issue just using curl, no browser involved ...

OK, how? Just curl to the main page in a loop? Or what?

And what do you "see" when it breaks, and how long does it take?
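A loop of the kind asked about might look like this sketch. The hypothetical `watch_http` helper fetches a URL at a fixed interval and stops at the first request that does not complete, reporting curl's exit code (28 means the `--max-time` limit expired). Any HTTP response counts as success, since the question is only whether the web server answers at all; the 10-second limit is an arbitrary choice.

```shell
#!/bin/sh
# watch_http URL INTERVAL MAX
# Hypothetical reproducer: fetch URL every INTERVAL seconds, at most MAX
# times, and stop at the first request that does not complete.
watch_http() {
  url=$1 interval=$2 max=$3 i=0
  while [ "$i" -lt "$max" ]; do
    i=$((i + 1))
    # --max-time caps the whole request; curl exits with 28 when it expires.
    # Without --fail, any HTTP status (even 404 or 500) counts as a response.
    curl --silent --output /dev/null --max-time 10 "$url"
    rc=$?
    if [ "$rc" -ne 0 ]; then
      if [ "$rc" -eq 28 ]; then why="timed out"; else why="curl exit $rc"; fi
      echo "request $i failed ($why) at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
      return 1
    fi
    sleep "$interval"
  done
  echo "all $max requests succeeded"
}
```

For example, `watch_http http://DEVICE-IP/ 60 100000` polls once a minute and records when (and after how many requests) the page stops answering.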

@MaienM
Copy link

MaienM commented Feb 27, 2025

BTW, do you have a BLE device associated too?

I do not, and the BLE toggle is set to off.

OK, how? Just curl to the main page in a loop? Or what?

Whenever the credentials are close to expiring, a new set of credentials is generated for a device and a single curl call is done to update them. This call actually includes a bunch of settings, but only mqttuser and mqttpass will actually have changed since the last call. I believe I currently have the TTL for these set to 15 minutes, so this call will probably happen once every 13 minutes or so.

The full curl command looks something like this:

curl http://DEVICE-IP/revk-settings \
	--data-urlencode hostname="..." \
	--data-urlencode mqtthost="..." \
	--data-urlencode mqttuser="$USERNAME" \
	--data-urlencode mqttpass="$PASSWORD" \
	--data-urlencode mqttport=8883 \
	--data-urlencode tz="CET-1CEST,M3.5.0,M10.5.0/3" \
	--data-urlencode ha=1 \
	--data-urlencode ble=0 \
	--data-urlencode dark=1 \
	--data-urlencode otaauto=0 \
	--data-urlencode webcontrol=0

And what do you "see" when it breaks, and how long does it take?

This curl call times out. I believe the default timeout that curl uses is a minute, so that means there was no response from the device within that time. The command will be retried, but it will continue to fail until I reboot the device.
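curl itself sets no limit on the total transfer time unless `--max-time` is given (only the connection phase has a built-in default limit), so explicit limits make a hung web server show up quickly and unambiguously. A hedged sketch of the same settings call with such limits: the hypothetical `update_settings` wrapper posts to `/revk-settings` on the given host, passes any `--data-urlencode` arguments straight through to curl, and reports a timeout (curl exit 28) separately from other failures. The 5- and 20-second limits are arbitrary.

```shell
#!/bin/sh
# update_settings HOST [curl form arguments...]
# Hypothetical wrapper around the revk-settings POST with explicit timeouts.
update_settings() {
  host=$1
  shift
  # --connect-timeout bounds the TCP connect, --max-time bounds the whole
  # request, and --fail turns HTTP errors (400 and up) into curl exit 22.
  curl --silent --show-error --fail --output /dev/null \
    --connect-timeout 5 --max-time 20 \
    "http://$host/revk-settings" "$@"
  rc=$?
  case $rc in
    0) echo "settings accepted" ;;
    28) echo "settings update timed out" ;;
    *) echo "settings update failed: curl exit $rc" ;;
  esac
  [ "$rc" -eq 0 ]   # function status: success only if curl succeeded
}
```

Usage would mirror the command above, for example `update_settings DEVICE-IP --data-urlencode mqttuser="$USERNAME" --data-urlencode mqttpass="$PASSWORD"` plus the remaining `--data-urlencode` settings, so a hang is logged as "timed out" within about 20 seconds.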

@revk
Copy link
Owner

revk commented Feb 27, 2025

Wait, what? This is a new one...

You are changing settings every 15 minutes, for new MQTT settings?

Why? Why do you ever need new MQTT settings?

It is interesting that it is a query string. That may be a place to look for memory leaks, etc.

@MaienM
Copy link

MaienM commented Feb 27, 2025

Why? Why do you ever need new MQTT settings?

I don't need to do this (and I'll probably end up switching to using a client certificate instead once I have time to set that up), but for now the credentials that these devices have for MQTT are set to expire after some time, so the device needs new credentials in order to remain connected to MQTT. I mentioned this in an earlier post as well, though I didn't mention the exact interval then:

I've been experimenting with using short-lived MQTT credentials for everything in my home lab, and I've been doing curl requests to the revk-settings endpoint to update these credentials. (Given that this is an HTTP request rather than HTTPS this is arguably worse from a security perspective than long-lived credentials, but I digress.)


It is interesting that it is a query string. That may be a place to look for memory leaks, etc.

The behaviour matches a memory leak, yeah. It always seems to take some time after a restart before the problem reoccurs, which also lends credence to that idea.

@revk
Copy link
Owner

revk commented Feb 28, 2025

I'll follow up the query string.

But I still have no clue why you would expire MQTT credentials for mqtts like that.

@revk
Copy link
Owner

revk commented Feb 28, 2025

No obvious leak in the code, but will test and see if free RAM goes down.

@macmpi
Copy link

macmpi commented Feb 28, 2025

I do not, and the BLE toggle is set to off.

Thinking about the BLE activation in the context of memory consumption: does having it ON significantly impact RAM usage?
If so, could be one of the worsening factors with memory leaks...

@revk
Copy link
Owner

revk commented Feb 28, 2025

Thinking about the BLE activation in the context of memory consumption: does having it ON significantly impact RAM usage? If so, could be one of the worsening factors with memory leaks...

It does use RAM as it stores all visible temperature sensors. I would not say a lot of RAM (but depends how many devices in view). Also, I expect BLE being enabled at all uses some RAM too, and I don't know how much that is. So it could be a factor.

@yalec38

This comment has been minimized.

@revk

This comment has been minimized.

@yalec38

This comment has been minimized.

@revk

This comment has been minimized.

@yalec38

This comment has been minimized.

@yalec38

This comment has been minimized.

@revk

This comment has been minimized.

@yalec38

This comment has been minimized.

@revk

This comment has been minimized.

@yalec38

This comment has been minimized.

@revk

This comment has been minimized.

Repository owner deleted a comment from macmpi Mar 3, 2025
Repository owner deleted a comment from macmpi Mar 3, 2025
Repository owner deleted a comment from macmpi Mar 3, 2025