
[Mellanox] Disable SSD NCQ on Mellanox platforms #17567

Merged: 1 commit merged into sonic-net:master on Jan 28, 2024

Conversation

@volodymyrsamotiy (Collaborator) commented Dec 19, 2023

Why I did it

Based on our research, some products may experience occasional IO failures in the communication between the CPU and the SSD because of NCQ.
There appears to be an incompatibility between certain kernel versions and certain SATA controllers.

Syslog error message examples:

  • Error "ata1: SError: { UnrecovData Handshk }" - "failed command: WRITE FPDMA QUEUED".
  • Error "ata1: SError: { RecovComm HostInt PHYRdyChg CommWake 10B8B DevExch }" - "failed command: READ FPDMA QUEUED".

Some vendors have already disabled NCQ on their platforms in SONiC due to a similar issue:

  • [Arista] Disable ATA NCQ for a few products #13739
  • [Arista] Disable SSD NCQ on DCS-7050CX3-32S #13964

There are also discussions on Debian/Ubuntu forums about similar issues where disabling NCQ was the suggested fix:

  • https://askubuntu.com/questions/133946/are-these-sata-errors-dangerous

Work item tracking
  • Microsoft ADO (number only): 25853968

How I did it

Add a kernel parameter that tells libata to disable NCQ.
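
The kernel's standard knob for this is the libata.force=noncq boot parameter. As an illustration only (the actual PR edits Mellanox platform files in sonic-buildimage, which are not reproduced here), on a GRUB-based Debian system such as SONiC the parameter would be appended roughly like this:

  # Illustrative sketch, not the literal diff of this PR.
  # In /etc/default/grub, keep any existing parameters and append
  # libata.force=noncq, which tells libata to turn off NCQ on all SATA ports:
  GRUB_CMDLINE_LINUX="libata.force=noncq"

  # Regenerate the GRUB configuration and reboot so the flag takes effect:
  sudo update-grub
  sudo reboot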

How to verify it

Use FIO tool - fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4
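
Note that the fio command above is abbreviated: fio also requires a --name and a target (--filename plus --size, or a block device). A hypothetical complete invocation, followed by two quick checks that NCQ is really off (the /dev/sda device name and the /tmp target are assumptions for illustration):

  # Hypothetical complete invocation; --name, --filename and --size are
  # required by fio but were elided from the command above.
  fio --name=ncq-test --filename=/tmp/fio.dat --size=256M \
      --direct=1 --rw=randrw --bs=64k --ioengine=libaio \
      --iodepth=64 --runtime=120 --numjobs=4

  # After booting with libata.force=noncq, the kernel reports
  # "NCQ (not used)" for the drive and its queue depth drops to 1.
  dmesg | grep -i ncq
  cat /sys/block/sda/device/queue_depth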

Test results with NCQ enabled:

 READ: bw=128MiB/s (135MB/s), 128MiB/s-128MiB/s (135MB/s-135MB/s), io=247MiB (259MB), run=1924-1924msec
WRITE: bw=131MiB/s (138MB/s), 131MiB/s-131MiB/s (138MB/s-138MB/s), io=253MiB (265MB), run=1924-1924msec
…
 READ: bw=130MiB/s (136MB/s), 130MiB/s-130MiB/s (136MB/s-136MB/s), io=247MiB (259MB), run=1902-1902msec
WRITE: bw=133MiB/s (139MB/s), 133MiB/s-133MiB/s (139MB/s-139MB/s), io=253MiB (265MB), run=1902-1902msec
…
 READ: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=247MiB (259MB), run=1919-1919msec
WRITE: bw=132MiB/s (138MB/s), 132MiB/s-132MiB/s (138MB/s-138MB/s), io=253MiB (265MB), run=1919-1919msec

Test results with NCQ disabled:

 READ: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=247MiB (259MB), run=2354-2354msec
WRITE: bw=107MiB/s (113MB/s), 107MiB/s-107MiB/s (113MB/s-113MB/s), io=253MiB (265MB), run=2354-2354msec
…
 READ: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=247MiB (259MB), run=2349-2349msec
WRITE: bw=108MiB/s (113MB/s), 108MiB/s-108MiB/s (113MB/s-113MB/s), io=253MiB (265MB), run=2349-2349msec
…
 READ: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=247MiB (259MB), run=2349-2349msec
WRITE: bw=108MiB/s (113MB/s), 108MiB/s-108MiB/s (113MB/s-113MB/s), io=253MiB (265MB), run=2349-2349msec
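
Comparing the two runs, disabling NCQ costs roughly 18% of random read/write bandwidth in this test (from about 128-133 MiB/s down to about 105-108 MiB/s), which is the trade-off accepted here for avoiding the queued-command IO failures described above.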

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305
  • 202311

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@mssonicbld (Collaborator):

@StormLiangMS, @liushilongbuaa PR #17567 conflicts with the MS internal repo.
Please complete the following PR by pushing a fix commit to sonicbld/conflict_prefix/17567-fix:
https://msazure.visualstudio.com/One/_git/Networking-acs-buildimage/pullrequest/9264755
Then comment "/azpw ms_conflict" to rerun the PR checker.

@bingwang-ms (Contributor):

@saiarcot895 Could you please help review? Thanks!

@bingwang-ms (Contributor):

How about the SN2700 A1?

@prgeor added the NVIDIA label Dec 19, 2023
@liushilongbuaa (Contributor):

/azpw ms_conflict

@liat-grozovik (Collaborator) left a comment

@volodymyrsamotiy you are missing the SN5600 as well as the SN2700-A1.
Please go over the list of systems. BTW, this means you will have a conflict with 202205, so please update the missing systems and create a backport PR for 202205.

@StormLiangMS (Contributor):

@liat-grozovik are we good to go?

@mssonicbld (Collaborator):

@StormLiangMS, @liushilongbuaa PR #17567 conflicts with the MS internal repo.
Please complete the following PR by pushing a fix commit to sonicbld/conflict_prefix/17567-fix:
https://msazure.visualstudio.com/One/_git/Networking-acs-buildimage/pullrequest/9303135
Then comment "/azpw ms_conflict" to rerun the PR checker.

@yxieca (Contributor) commented Jan 4, 2024

> @volodymyrsamotiy you are missing SN5600 as well as SN2700-A1. please go over the list of systems. BTW. this means you will have conflict with 2205 thus please update the missing systems and create a backport PR for 202205

@liat-grozovik can we move forward with this PR and handle 5600 and 2700-A1 with another PR?

@yxieca (Contributor) commented Jan 4, 2024

/azpw ms_conflict

@mssonicbld (Collaborator):

@StormLiangMS, @liushilongbuaa PR #17567 conflicts with the MS internal repo.
Please complete the following PR by pushing a fix commit to sonicbld/conflict_prefix/17567-fix:
https://msazure.visualstudio.com/One/_git/Networking-acs-buildimage/pullrequest/9306301
Then comment "/azpw ms_conflict" to rerun the PR checker.

@liushilongbuaa (Contributor):

/azpw ms_conflict

@volodymyrsamotiy (Collaborator, Author):

> @volodymyrsamotiy you are missing SN5600 as well as SN2700-A1. please go over the list of systems. BTW. this means you will have conflict with 2205 thus please update the missing systems and create a backport PR for 202205
>
> @liat-grozovik can we move forward with this PR and handle 5600 and 2700-A1 with another PR?

@yxieca, no need to handle 5600 and 2700-A1 with another PR; this PR was already updated and has all the changes.

@yxieca (Contributor) commented Jan 4, 2024

> @volodymyrsamotiy you are missing SN5600 as well as SN2700-A1. please go over the list of systems. BTW. this means you will have conflict with 2205 thus please update the missing systems and create a backport PR for 202205
>
> @liat-grozovik can we move forward with this PR and handle 5600 and 2700-A1 with another PR?
>
> @yxieca, no need to handle 5600 and 2700-A1 with another PR; this PR was already updated and has all the changes.

@volodymyrsamotiy what is keeping this PR in draft mode? @liat-grozovik any other blocking issues?

@volodymyrsamotiy marked this pull request as ready for review January 8, 2024 12:52
@bingwang-ms (Contributor):

@liat-grozovik Can we unblock this PR now?

@bingwang-ms (Contributor):

Removed the label for the 202205 branch, as another PR (#17662) has been opened against 202205.

@prgeor (Contributor) commented Jan 10, 2024

> @volodymyrsamotiy you are missing SN5600 as well as SN2700-A1. please go over the list of systems. BTW. this means you will have conflict with 2205 thus please update the missing systems and create a backport PR for 202205

@liat-grozovik I think this has been addressed, so please check again.

@liat-grozovik merged commit f1d6655 into sonic-net:master on Jan 28, 2024
18 checks passed
@mssonicbld (Collaborator):

@volodymyrsamotiy PR conflicts with the 202311 branch

@yxieca (Contributor) commented Jan 31, 2024

@volodymyrsamotiy can you help create the 202311 PR?

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Jan 31, 2024

@mssonicbld (Collaborator):

Cherry-pick PR to 202305: #17960

@StormLiangMS (Contributor):

ADO: 25853968

mssonicbld pushed a commit that referenced this pull request Jan 31, 2024