[FEAT] Allow a delay or different schedules between multiple disks checks #706

Open
pabsi opened this issue Oct 31, 2024 · 5 comments · May be fixed by #710

Comments

pabsi commented Oct 31, 2024

Is your feature request related to a problem? Please describe.
The issue arises when running SMART checks over multiple disks that are connected through USB-to-SATA bridges. In my specific case, I have the Quad SATA Hat for the Pi 4, meaning 4 SATA disks are connected via 2 USB 3.0 ports. Sometimes, when the SMART checks run against all 4 drives at once, the USB connection gets reset. In my case this makes the mdadm RAID array mark the devices as failed and remove them from the array. It's not a real problem, since I can --re-add them later, but it's very inconvenient, all the more so when the SMART checks run daily. See the example dmesg logs:

[Wed Oct 30 04:00:05 2024] usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
[Wed Oct 30 04:00:06 2024] usb 2-1: reset SuperSpeed USB device number 3 using xhci_hcd
[Wed Oct 30 04:00:06 2024] usb 2-1: reset SuperSpeed USB device number 3 using xhci_hcd
[Wed Oct 30 04:00:06 2024] usb 2-1: reset SuperSpeed USB device number 3 using xhci_hcd
[Wed Oct 30 04:00:06 2024] usb 2-1: reset SuperSpeed USB device number 3 using xhci_hcd
[Wed Oct 30 04:00:06 2024] usb 2-1: reset SuperSpeed USB device number 3 using xhci_hcd
[Thu Oct 31 04:00:06 2024] usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
[Thu Oct 31 04:00:06 2024] sd 0:0:0:1: [sdb] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=DRIVER_OK cmd_age=0s
[Thu Oct 31 04:00:06 2024] sd 0:0:0:1: [sdb] tag#0 CDB: opcode=0x85 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[Thu Oct 31 04:00:07 2024] usb 2-2: USB disconnect, device number 2
[Thu Oct 31 04:00:07 2024] md: super_written gets error=-5
[Thu Oct 31 04:00:07 2024] md/raid10:md0: Disk failure on sdb1, disabling device.
                           md/raid10:md0: Operation continuing on 3 devices.
[Thu Oct 31 04:00:07 2024] md: super_written gets error=-5
[Thu Oct 31 04:00:07 2024] md/raid10:md0: Disk failure on sda1, disabling device.
                           md/raid10:md0: Operation continuing on 2 devices.
[Thu Oct 31 04:00:07 2024] usb 2-1: reset SuperSpeed USB device number 3 using xhci_hcd
[Thu Oct 31 04:00:12 2024] usb 2-2: new SuperSpeed USB device number 4 using xhci_hcd
[Thu Oct 31 04:00:12 2024] usb 2-2: New USB device found, idVendor=1058, idProduct=0a10, bcdDevice=81.36
[Thu Oct 31 04:00:12 2024] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=5
[Thu Oct 31 04:00:12 2024] usb 2-2: Product: JMS56x Series
[Thu Oct 31 04:00:12 2024] usb 2-2: Manufacturer: JMicron
[Thu Oct 31 04:00:12 2024] usb 2-2: SerialNumber: 1234567890123
[Thu Oct 31 04:00:12 2024] usb 2-2: UAS is ignored for this device, using usb-storage instead
[Thu Oct 31 04:00:12 2024] usb 2-2: UAS is ignored for this device, using usb-storage instead
[Thu Oct 31 04:00:12 2024] usb-storage 2-2:1.0: USB Mass Storage device detected
[Thu Oct 31 04:00:12 2024] usb-storage 2-2:1.0: Quirks match for vid 1058 pid 0a10: 800000
[Thu Oct 31 04:00:12 2024] scsi host0: usb-storage 2-2:1.0
[Thu Oct 31 04:00:13 2024] scsi 0:0:0:0: Direct-Access     Samsung  SSD 850 EVO 2TB  8136 PQ: 0 ANSI: 6
[Thu Oct 31 04:00:13 2024] sd 0:0:0:0: Attached scsi generic sg0 type 0
[Thu Oct 31 04:00:13 2024] sd 0:0:0:0: [sda] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[Thu Oct 31 04:00:13 2024] sd 0:0:0:0: [sda] Write Protect is off
[Thu Oct 31 04:00:13 2024] sd 0:0:0:0: [sda] Mode Sense: 47 00 10 08
[Thu Oct 31 04:00:13 2024] scsi 0:0:0:1: Direct-Access     CT2000BX 500SSD1          8136 PQ: 0 ANSI: 6
[Thu Oct 31 04:00:13 2024] sd 0:0:0:0: [sda] No Caching mode page found
[Thu Oct 31 04:00:13 2024] sd 0:0:0:0: [sda] Assuming drive cache: write through
[Thu Oct 31 04:00:13 2024] sd 0:0:0:1: Attached scsi generic sg1 type 0
[Thu Oct 31 04:00:13 2024] sd 0:0:0:1: [sdb] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[Thu Oct 31 04:00:13 2024] sd 0:0:0:1: [sdb] Write Protect is off
[Thu Oct 31 04:00:13 2024] sd 0:0:0:1: [sdb] Mode Sense: 47 00 10 08
[Thu Oct 31 04:00:13 2024] sd 0:0:0:1: [sdb] No Caching mode page found
[Thu Oct 31 04:00:13 2024] sd 0:0:0:1: [sdb] Assuming drive cache: write through
[Thu Oct 31 04:00:13 2024]  sda: sda1
[Thu Oct 31 04:00:13 2024] sd 0:0:0:0: [sda] Attached SCSI disk
[Thu Oct 31 04:00:13 2024]  sdb: sdb1
[Thu Oct 31 04:00:13 2024] sd 0:0:0:1: [sdb] Attached SCSI disk

I say "sometimes" because there are times when, despite the checks running on all 4 drives at once, nothing gets disconnected. I have also seen more stability when running the SMART checks one by one, disk by disk, with a short delay in between (a few seconds normally does the job).

Describe the solution you'd like
One option would be an environment variable that sets a delay between disk checks (e.g. DELAY_BETWEEN_DISK_CHECKS or whatever, naming is hard). Another option would be to offer a schedule per drive, but I think that would be a lot more engineering for what is perhaps a very specific problem that not everyone has.

I would do it myself, but unfortunately I am not savvy enough on Go :(

Additional context
N/A

Other notes
Thank you so much for your work. Really appreciate it 🥇

pabsi (Author) commented Oct 31, 2024

It could probably just be a matter of adding a sleep of some sort, based on the ENV var I suggested, in this for loop?
https://github.com/AnalogJ/scrutiny/blob/master/collector/pkg/collector/metrics.go#L87

pabsi (Author) commented Nov 4, 2024

Revisiting the code, I just realised there's a TODO in there about this very same topic 😅:
https://github.com/AnalogJ/scrutiny/blob/master/collector/pkg/collector/metrics.go#L93-L94

AnalogJ (Owner) commented Nov 6, 2024

Hey @pabsi, I'd be happy to consider a change like this, if it's optional and configurable via the collector config yaml file.

Can you open a PR?

pabsi (Author) commented Nov 6, 2024

I can try :)

As I said in the original post:

I would do it myself, but unfortunately I am not savvy enough on Go :(

But I'll give it a go ;)

pabsi (Author) commented Nov 7, 2024

@AnalogJ I can't raise a PR. GitHub gave me an error about not being a contributor.

You can see what I did here: https://github.com/AnalogJ/scrutiny/compare/AnalogJ:master...pabsi:706-add-wait-time-between-checks?expand=1

The test for the collector (go run collector/cmd/collector-metrics/collector-metrics.go run --debug) worked fine.

Regards.

AnalogJ linked a pull request Nov 8, 2024 that will close this issue