MT7981 5GHz occasionally cannot disconnect clients that have left and causes bad performance. #922

Open
victor186 opened this issue Oct 21, 2024 · 29 comments

@victor186

victor186 commented Oct 21, 2024

I'm testing an AX3000T in a restaurant for a future network upgrade, but I've noticed poor 5 GHz speeds at random. A radio restart solves it, but when it occurs, the network effectively goes down due to low speed/high latency.

The AP is running in 80 MHz/AX mode.
OpenWrt 23.05.5.
[Screenshot: Chrome, 2024-10-19]

[Screenshot: Speedtest result, 2024-10-20]

@lukasz1992

Is the second device also connected to the network?
I see a really bad signal from it; communicating with such a device can greatly decrease performance.

@victor186
Author

> Is the second device also connected to the network? I see a really bad signal from it; communicating with such a device can greatly decrease performance.

The devices in that list are on 2.4 GHz.

@romanovj

romanovj commented Oct 27, 2024

Can you list your Wi-Fi clients (device models)?

@victor186
Author

> Can you list your Wi-Fi clients (device models)?

I can't, because this device is running as an AP in a restaurant for administrative and customer Wi-Fi.

@romanovj

romanovj commented Nov 7, 2024

It looks like a Qualcomm QCA9377 with the Windows 10 driver on 5 GHz can cause this. There are no problems on the 2.4 GHz band.

@lukasz1992

Do you have driver 10.0.0.1272 for Windows installed?

@victor186
Author

> It looks like a Qualcomm QCA9377 with the Windows 10 driver on 5 GHz can cause this. There are no problems on the 2.4 GHz band.

I don't understand: a 5 GHz Wi-Fi adapter with a QCA9377 is causing bad performance on the 5 GHz network? I don't have a QCA9377 on the network, and the router is MediaTek.

@romanovj

romanovj commented Nov 7, 2024

@victor186

> I don't have a QCA9377 on the network

How can you be sure?

> device is running as an AP in a restaurant for administrative and customer Wi-Fi

@victor186
Author

> How can you be sure?

The clients only use smartphones.
The only PC on the Wi-Fi uses a Realtek Wi-Fi adapter.

@romanovj

> The clients only use smartphones. The only PC on the Wi-Fi uses a Realtek Wi-Fi adapter.

If a QCA9377 can affect a 5 GHz AP on mt76 + mt7915 (MT7981), then maybe other clients can do the same.

I don't own a QCA9377 myself. I just helped a user isolate the problem on an OpenWrt 23.05.5 MT7981 device.

@nbd168 what do you think about this?

@nbd168
Member

nbd168 commented Nov 14, 2024

One thing you could try is copying the latest MT7981 firmware from https://github.com/openwrt/mt76/tree/master/firmware to your device. If that doesn't help, trying a recent snapshot might also be a good idea.
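A minimal sketch of the copy step, assuming the MT7981 firmware files have already been downloaded from the mt76 repository into a local directory on the router. The `mt7981_*.bin` file pattern, the default target `/lib/firmware/mediatek`, and the function name `install_mt7981_fw` are assumptions (the path is where OpenWrt images usually keep MediaTek firmware); check your own image before overwriting anything. The function keeps a one-time `.orig` backup of every file it replaces.

```shell
#!/bin/sh
# Copy locally downloaded MT7981 firmware over the installed files.
# ASSUMPTIONS: files are named mt7981_*.bin and the target directory
# defaults to /lib/firmware/mediatek; verify both on your device.
install_mt7981_fw() {
        src=$1
        dst=${2:-/lib/firmware/mediatek}
        found=0
        for f in "$src"/mt7981_*.bin; do
                # an unmatched glob stays as the literal pattern; skip it
                [ -e "$f" ] || continue
                name=$(basename "$f")
                # back up the installed file once, so reruns keep the original
                if [ -e "$dst/$name" ] && [ ! -e "$dst/$name.orig" ]; then
                        cp "$dst/$name" "$dst/$name.orig"
                fi
                cp "$f" "$dst/$name" || return 1
                echo "installed $name"
                found=1
        done
        if [ "$found" -eq 0 ]; then
                echo "no mt7981_*.bin files found in $src" >&2
                return 1
        fi
}
```

Usage would be something like `install_mt7981_fw /tmp/mt7981-fw && reboot`; rebooting is the simplest way to make sure the driver loads the new files.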

@romanovj

romanovj commented Nov 14, 2024

> One thing you could try is copying the latest MT7981 firmware from https://github.com/openwrt/mt76/tree/master/firmware to your device.

I already did this; it didn't help.

> If that doesn't help, trying a recent snapshot might also be a good idea.

That user didn't want to experiment with a snapshot. Connecting the QCA9377 to the 2.4 GHz AP solved the issue with the 5 GHz AP for him.

@lukasz1992

I'd say there are too few details for us to help you.

@IrineSistiana

IrineSistiana commented Nov 20, 2024

OpenWrt 23.05.5, H3C Magic NX30 Pro.

Same issue here; I have encountered it several times.

Almost zero speed (1 kb/s) over 5 GHz Wi-Fi. It is enough for DHCP, but everything else breaks, even ping.

I noticed that when this happens, there are 2 dead clients (which may have left the Wi-Fi range at about the same time) on the LuCI Wi-Fi page, both with an RX/TX rate of 6.0 Mbit/s, 20 MHz. If I manually click the "Disconnect" button, Wi-Fi works again immediately.

@IrineSistiana

IrineSistiana commented Nov 20, 2024

More info

Also, the log keeps showing AP-STA-POLL-OK for the two offline clients, starting when they went out of Wi-Fi range and continuing until I clicked the LuCI "Disconnect" button.

P.S. OFFLINE:MAC:1 and OFFLINE:MAC:2 are the clients that went away.

Wed Nov 20 19:33:33 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **OFFLINE:MAC:1**
Wed Nov 20 19:35:31 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **OFFLINE:MAC:2**
Wed Nov 20 19:38:44 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **OFFLINE:MAC:1**
Wed Nov 20 19:40:51 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **OFFLINE:MAC:2**
Wed Nov 20 19:44:03 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **OFFLINE:MAC:1**
...
Wed Nov 20 20:06:42 2024 daemon.notice hostapd: phy1-ap0: AP-STA-DISCONNECTED **OFFLINE:MAC:1**
Wed Nov 20 20:06:44 2024 daemon.notice hostapd: phy1-ap0: AP-STA-DISCONNECTED **OFFLINE:MAC:2**
Wed Nov 20 20:06:47 2024 daemon.info hostapd: phy1-ap0: STA **OFFLINE:MAC:1** IEEE 802.11: deauthenticated due to local deauth request
Wed Nov 20 20:06:49 2024 daemon.info hostapd: phy1-ap0: STA **OFFLINE:MAC:2** IEEE 802.11: deauthenticated due to local deauth request
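To spot this state without scrolling through the whole log, one option is to count `AP-STA-POLL-OK` events per interface and station. This is a sketch that assumes the hostapd line format shown above (interface name right before the event, MAC address as the last field); the function name `count_poll_ok` is hypothetical. A station that keeps "polling OK" long after it has physically left is a candidate for the stale state described in this issue.

```shell
#!/bin/sh
# Count hostapd AP-STA-POLL-OK events per interface and station.
# Reads syslog lines on stdin; assumes "<ifname>: AP-STA-POLL-OK <mac>"
# as in the log excerpt above. Output is sorted by count, highest first.
count_poll_ok() {
        awk '/AP-STA-POLL-OK/ {
                ifname = ""
                for (i = 1; i < NF; i++)
                        if ($i == "AP-STA-POLL-OK") ifname = $(i - 1)
                sub(/:$/, "", ifname)   # "phy1-ap0:" -> "phy1-ap0"
                n[ifname " " $NF]++
        }
        END { for (k in n) print n[k], k }' | sort -rn
}
```

On the router this would be used as `logread | count_poll_ok`.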

When I restarted the 5 GHz Wi-Fi a few minutes later, there was another suspicious log:

Wed Nov 20 20:13:06 2024 kern.warn kernel: [2135649.716364] Ignoring NSS change in VHT Operating Mode Notification from **OFFLINE:MAC:1** with invalid nss 2
Wed Nov 20 20:13:06 2024 kern.info kernel: [2143605.339316] device phy1-ap0 left promiscuous mode
Wed Nov 20 20:13:06 2024 kern.info kernel: [2143605.354371] br-lan: port 5(phy1-ap0) entered disabled state
Wed Nov 20 20:13:07 2024 daemon.notice wpa_supplicant[1538]: Set new config for phy phy1
Wed Nov 20 20:13:07 2024 daemon.notice hostapd: Set new config for phy phy1: /var/run/hostapd-phy1.conf
Wed Nov 20 20:13:07 2024 daemon.notice hostapd: Reload config for bss 'phy1-ap0' on phy 'phy1'
Wed Nov 20 20:13:07 2024 daemon.notice hostapd: phy1-ap0: AP-STA-DISCONNECTED **AN:ONLINE:CLIENT:MAC:1**
Wed Nov 20 20:13:08 2024 daemon.notice hostapd: Reloaded settings for phy phy1
Wed Nov 20 20:13:08 2024 daemon.notice netifd: Wireless device 'radio1' is now up
Wed Nov 20 20:13:08 2024 daemon.notice netifd: Network device 'phy1-ap0' link is up
Wed Nov 20 20:13:08 2024 kern.info kernel: [2143607.148600] br-lan: port 5(phy1-ap0) entered blocking state
Wed Nov 20 20:13:08 2024 kern.info kernel: [2143607.154384] br-lan: port 5(phy1-ap0) entered disabled state
Wed Nov 20 20:13:08 2024 kern.info kernel: [2143607.160337] device phy1-ap0 entered promiscuous mode
Wed Nov 20 20:13:08 2024 kern.info kernel: [2143607.165646] br-lan: port 5(phy1-ap0) entered blocking state
Wed Nov 20 20:13:08 2024 kern.info kernel: [2143607.171424] br-lan: port 5(phy1-ap0) entered forwarding state
Wed Nov 20 20:13:09 2024 daemon.info dnsmasq[1]: read /etc/hosts - 12 names
Wed Nov 20 20:13:09 2024 daemon.info dnsmasq[1]: read /tmp/hosts/dhcp.cfg01411c - 4 names
Wed Nov 20 20:13:09 2024 daemon.info dnsmasq-dhcp[1]: read /etc/ethers - 0 addresses
...

Wireless config

cat /etc/config/wireless

config wifi-device 'radio0'
        option type 'mac80211'
        option path 'platform/18000000.wifi'
        option channel '1'
        option band '2g'
        option htmode 'HT20'
        option country 'CN'
        option cell_density '0'

config wifi-iface 'default_radio0'
        option device 'radio0'
        option network 'lan'
        option mode 'ap'
        option ssid 'ssid1'
        option encryption 'psk2+ccmp'
        option key 'WIFIPASSWD'

config wifi-device 'radio1'
        option type 'mac80211'
        option path 'platform/18000000.wifi+1'
        option channel '149'
        option band '5g'
        option htmode 'HE80'
        option country 'CN'
        option cell_density '0'
        option txpower '27'

config wifi-iface 'default_radio1'
        option device 'radio1'
        option network 'lan'
        option mode 'ap'
        option ssid 'ssid2'
        option encryption 'sae-mixed'
        option key 'WIFIPASSWD'

Possibly related:
openwrt/openwrt#14415

@IrineSistiana

IrineSistiana commented Nov 21, 2024

I reproduced this bug.

If a client leaves Wi-Fi coverage, there is some probability (10%? I guess) that the above bug occurs.

It is almost the same as openwrt/openwrt#14415, but it also causes bad Wi-Fi performance. (In my case it is extremely bad, < 1 kb/s: other clients can still connect, but only well enough for DHCP to complete; everything else breaks, even ping.)

The log keeps showing AP-STA-POLL-OK after the client left. (P.S. I added option max_inactivity '60'.)

...
Thu Nov 21 09:25:38 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:26:46 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:27:56 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:29:04 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:30:24 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:31:33 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:32:39 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:33:44 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:34:51 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
...

iw shows the client still "associated".

iw dev phy1-ap0 station dump

Station **WENT:AWAY:CLIENT:MAC** (on phy1-ap0)
        inactive time:  46190 ms
        rx bytes:       7315589
        rx packets:     52352
        tx bytes:       66444699
        tx packets:     69473
        tx retries:     6987
        tx failed:      7033
        rx drop misc:   2
        signal:         -95 [-97, -99] dBm
        signal avg:     -91 [-93, -95] dBm
        tx bitrate:     6.0 MBit/s
        tx duration:    83677141 us
        rx bitrate:     6.0 MBit/s
        rx duration:    4720659 us
        last ack signal:-96 dBm
        avg ack signal: -95 dBm
        airtime weight: 256
        authorized:     yes
        authenticated:  yes
        associated:     yes
        preamble:       short
        WMM/WME:        yes
        MFP:            no
        TDLS peer:      no
        DTIM period:    2
        beacon interval:100
        short preamble: yes
        short slot time:yes
        connected time: 8708 seconds
        associated at [boottime]:       2183028.795s
        associated at:  1732143676976 ms
        current time:   1732152384528 ms
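The station dump above shows the tell-tale combination: `associated: yes` together with a signal around -95 dBm, 6.0 MBit/s rates and thousands of failed transmissions. A minimal sketch that scans `iw dev <ifname> station dump` output for such entries; the function name `stale_stations` is hypothetical, and the -90 dBm default is an assumption taken from the threshold used in the cron workaround in this thread, not a limit defined by the driver.

```shell
#!/bin/sh
# Print "<mac> <signal avg> dBm" for stations whose average signal is
# below a threshold (default -90 dBm, an arbitrary cut-off; tune it).
# Reads `iw dev <ifname> station dump` output on stdin.
stale_stations() {
        thr=${1:--90}
        awk -v thr="$thr" '
        $1 == "Station" { mac = $2; next }
        $1 == "signal" && $2 == "avg:" {
                # $3 is the combined average, e.g. "-91" in "-91 [-93, -95] dBm"
                if ($3 + 0 < thr + 0) print mac, $3 " dBm"
        }'
}
```

For example: `iw dev phy1-ap0 station dump | stale_stations -90`.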

P.S. The device above is a smartphone with a Snapdragon FastConnect 6800 (though I believe other clients can do the same). It left the Wi-Fi range an hour ago and is kilometers away from the AP.

If I manually click the "Disconnect" button in LuCI, Wi-Fi works again immediately (no restart needed).

I'm using the official, unmodified OpenWrt 23.05.5 image. openwrt/openwrt#14415 seemed to be using an OpenWrt fork with a modified driver(?) (I misunderstood: they had enabled /sys/module/mt7915e/parameters/wed_enable).

I did not set wed_enable:

cat /sys/module/mt7915e/parameters/wed_enable
N

@victor186
Author

> I reproduced this bug. If a client leaves Wi-Fi coverage, there is some probability that the above bug occurs.
>
> It is almost the same as openwrt/openwrt#14415.

That makes sense, because as a public Wi-Fi router it has clients joining and leaving the network all the time.
I also noticed in LuCI some clients with a -9x dBm signal that never disconnect; as in your example, the out-of-range client never disappears.

@rx78gp01

You can try this patch from mtk

@victor186
Author

> You can try this patch from mtk

I don't know how to use this.

@IrineSistiana

Sorry, my router is my main device, so it is hard for me to experiment with it. But I can provide logs if needed.

@victor186 I suspect this is a common bug affecting all MT7981 devices, but it happens only occasionally, so it is hard to reproduce and notice.

Maybe we could change the title to make it easier for more users to find?

"MT7981 5GHz occasionally cannot disconnect clients that have left and causes bad performance."

victor186 changed the title from "MT7981 5GHz ramdomly bad performance" to "MT7981 5GHz occasionally cannot disconnect clients that have left and causes bad performance." on Nov 22, 2024
@victor186
Author

> Maybe we could change the title to make it easier for more users to find?
>
> "MT7981 5GHz occasionally cannot disconnect clients that have left and causes bad performance."

Done

@IrineSistiana

A dirty temporary fix. Tested, and it works for me; I don't know whether there are any side effects.

Run this script every minute via cron.

It "disconnects" all clients with a very low signal strength (these should be the clients that have already left Wi-Fi coverage but are still wrongly listed as "associated").

#!/bin/sh

# threshold (dBm)
thr=-90
# add other interface names if any, e.g. "phy1-ap0 phy1-ap1 phy1-ap2"
wlanlist="phy1-ap0"

disconnect() {
        mac=$1
        wlan=$2
        rssi=$3
        echo "disconnecting client at $wlan $mac with $rssi dBm (thr=$thr)" | logger -t disconnected-client-killer
        # IEEE 802.11 reason code 5: AP is unable to handle all currently associated STAs.
        # "ban_time" prohibits the client from reassociating for the given number of milliseconds.
        ubus call "hostapd.$wlan" del_client "{\"addr\":\"$mac\", \"reason\":5, \"deauth\":true, \"ban_time\":1000}"
}

for wlan in $wlanlist; do
        # station lines in the assoclist contain "SNR"; field 1 is the MAC, field 2 the signal
        iwinfo "$wlan" assoclist | grep SNR | while read -r line; do
                mac=$(echo "$line" | awk '{ print $1 }')
                rssi=$(echo "$line" | awk '{ print $2 }')
                # skip entries where iwinfo does not report a numeric signal
                case "$rssi" in ''|*[!0-9-]*) continue ;; esac
                if [ "$rssi" -lt "$thr" ]; then
                        disconnect "$mac" "$wlan" "$rssi"
                fi
        done
done
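For the "every minute via cron" part, an entry like the following in `/etc/crontabs/root` (OpenWrt's root crontab) should be enough. The script path `/root/stale-client-killer.sh` is a hypothetical name; the script has to be executable (`chmod +x`), and the cron service should be enabled and restarted after editing the file (`/etc/init.d/cron enable && /etc/init.d/cron restart`).

```
# min hour day month weekday  command
* * * * * /root/stale-client-killer.sh
```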

@kloon15

kloon15 commented Nov 26, 2024

> You can try this patch from mtk

This patch definitely does something good: before, I had intermittent packet-loss indications every minute or less in games; that is now completely fixed with this patch.

@lukasz1992

> This patch definitely does something good: before, I had intermittent packet-loss indications every minute or less in games; that is now completely fixed with this patch.

I tried this patch, and speed dropped by half with WED inactive.

@kloon15

kloon15 commented Nov 26, 2024

> I tried this patch, and speed dropped by half with WED inactive.

I don't notice a speed difference with WED enabled.

@oxavelar

oxavelar commented Dec 7, 2024

The client below has left the house, but the MT6000 still sees/tracks it with a -92/-92 RSSI, ugh.

[Screenshot]

Using a pretty recent OpenWrt SNAPSHOT, r28242, with:

mt798x-wmac 18000000.wifi: WM Firmware Version: ____000000, Build Time: 20240823160721
mt798x-wmac 18000000.wifi: WA Firmware Version: DEV_000000, Build Time: 20240823160840

Stressing roaming with DAWN and/or disconnecting by walking out of bounds seems to trigger that odd condition.

I might try the cron job workaround, since this is affecting my mesh network: batctl ends up with nodes crawling at 0.3 link speeds.

@LearZhou

LearZhou commented Dec 8, 2024

Observed similar AP-STA-POLL-OK logs with my Flint 2 on 2.4G WiFi.

@oxavelar

oxavelar commented Dec 8, 2024

> A dirty temporary fix. Tested, and it works for me; I don't know whether there are any side effects.

I have adapted your solution and started using it to work around this in my case too.

gist:openwrt-mt76-disconnect-workaround

This version can be added under init / rc scripts since it spawns a subshell on boot that keeps checking for the condition every N seconds.

Another slight change is that there is no need to set a threshold; instead, it checks whether the signal is below the noise floor.

We understand this is just a temporary workaround while we wait for the real solution, and we also wonder whether that MTK reference about losing the ACK on AX chips is related.
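The gist's code is not reproduced in this thread, so the following is only an illustration of the noise-floor idea, not that script. It assumes `iwinfo <ifname> assoclist` prints station lines of the form `AA:BB:CC:DD:EE:FF  -95 dBm / -95 dBm (SNR 0)  10 ms ago` (signal, then noise). The function name `below_noise_floor` is hypothetical; it prints the MAC of every station whose signal is at or below the reported noise.

```shell
#!/bin/sh
# Print the MAC of each station whose signal is at or below the noise floor.
# Reads `iwinfo <ifname> assoclist` output on stdin.
# ASSUMPTION: station lines look like
#   AA:BB:CC:DD:EE:FF  -95 dBm / -95 dBm (SNR 0)  10 ms ago
below_noise_floor() {
        awk '$3 == "dBm" && $4 == "/" && $6 == "dBm" {
                sig = $2; noise = $5
                # skip entries without numeric signal/noise values
                if (sig !~ /^-?[0-9]+$/ || noise !~ /^-?[0-9]+$/) next
                if (sig + 0 <= noise + 0) print $1
        }'
}
```

Each MAC printed by `iwinfo phy1-ap0 assoclist | below_noise_floor` could then be passed to the same `ubus call hostapd.<ifname> del_client` call used in the cron script earlier in this thread.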
