ext-iscsi is broken #490

Closed
sbogomolov opened this issue Oct 14, 2024 · 16 comments · Fixed by #500

@sbogomolov

sbogomolov commented Oct 14, 2024

In versions 1.8.0 and 1.8.1 something broke with regard to ext-iscsi. All iSCSI PVs (that worked in 1.7.7) stopped working. The error I see (from the pod description):

Warning  FailedMount             7s (x10 over 4m17s)  kubelet                  MountVolume.WaitForAttach failed for volume "postgres-data-pv" : executable file not found in $PATH

I checked that the iscsiadm binary actually exists, but it is not available to kubelet.

Start privileged container:

$ kubectl run -i --tty --rm --privileged -n kube-system --overrides='{"apiVersion": "v1", "spec": {"hostNetwork": true, "hostPID": true}}' --image ubuntu bash

In the container: run in ext-iscsi's namespace:

# nsenter -t $(pgrep iscsid) -a iscsiadm --help
iscsiadm -m discoverydb [-hV] [-d debug_level] [-P printlevel] [-t type -p ip:port -I ifaceN ... [-Dl]] | [[-p ip:port -t type] [-o operation] [-n name] [-v value] [-lD]] 
iscsiadm -m discovery [-hV] [-d debug_level] [-P printlevel] [-t type -p ip:port -I ifaceN ... [-l]] | [[-p ip:port] [-l | -D]] [-W]
...

In the container: run in kubelet's namespace:

# nsenter -t 2272 -a iscsiadm --help
nsenter: failed to execute iscsiadm: No such file or directory
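
The comparison above can also be made without entering a namespace, by looking at a process's root through /proc. The following is a minimal sketch, assuming a privileged shell with hostPID as in the kubectl run command above. The function name find_in_root and the list of directories are assumptions (the real lookup depends on the process's own $PATH), and absolute symlinks inside the tree are resolved against the caller's root, so treat the result as a hint only.

```shell
#!/bin/sh
# Hypothetical helper: look for an executable NAME under ROOT, searching a
# fixed list of common binary directories (an assumption, not kubelet's $PATH).
# ROOT can be /proc/<pid>/root to see the filesystem as that process sees it.
find_in_root() {
    root=$1
    name=$2
    for dir in usr/local/sbin usr/local/bin usr/sbin usr/bin sbin bin; do
        if [ -x "$root/$dir/$name" ]; then
            # Print the path as seen from inside ROOT.
            printf '%s\n' "/$dir/$name"
            return 0
        fi
    done
    return 1
}

# Example (not run here): compare what iscsid and kubelet can see.
#   find_in_root "/proc/$(pgrep -o iscsid)/root" iscsiadm
#   find_in_root "/proc/2272/root" iscsiadm
```

A non-zero exit status for the kubelet PID combined with a hit for the iscsid PID would match the nsenter results shown above.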
@frezbo
Member

frezbo commented Oct 15, 2024

This has nothing to do with extensions, you might need to use the fat kubelet image: https://www.talos.dev/v1.8/introduction/what-is-new/#slim-kubelet-image
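
For readers who want to try this suggestion, here is a rough sketch of a machine config patch that pins the kubelet image. The `-fat` tag suffix, the Kubernetes version, and the function name kubelet_image_patch are assumptions made for illustration; the linked release notes are the authority on the exact image reference.

```shell
#!/bin/sh
# Emit a machine config patch that pins the kubelet image.
# ASSUMPTION: the "-fat" tag suffix and the version passed in are illustrative;
# check the linked "what is new" page for the exact image reference.
kubelet_image_patch() {
    k8s_version=$1
    cat <<EOF
machine:
  kubelet:
    image: ghcr.io/siderolabs/kubelet:${k8s_version}-fat
EOF
}

# Print a patch for an assumed Kubernetes version.
kubelet_image_patch v1.31.1
```

The printed YAML can be saved to a file and applied as a machine config patch to the affected worker nodes.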

@sbogomolov
Author

Thanks @frezbo. Using the fat kubelet image for worker nodes seems to have changed some things. For some pods I now see multi-attach errors (the iSCSI target was attached on one node and then not detached when the pod moved). Other pods seem to just hang on ContainerCreating without errors. Only containers with iSCSI PVs have issues, though. Is there something else I could try looking at?

@frezbo
Member

frezbo commented Oct 16, 2024

it could be also due to k8s removing all in-tree CSI's

@sbogomolov
Author

it could be also due to k8s removing all in-tree CSI's

Interesting suggestion. I thought that this migration was done quite some time ago and did not affect iSCSI. In Talos v1.7.7 everything works and its k8s version is not too far behind.

@sbogomolov
Author

This is the kind of error I see now:

MountVolume.WaitForAttach failed for volume "postgres-data-pv" : failed to get any path for iscsi disk, last err seen: 
  iscsi: failed to attach disk: Error: /usr/local/sbin/iscsiadm: 5: [: missing ]
  nsenter: failed to execute 5714: No such file or directory 
    (exit status 127)

@smira
Member

smira commented Oct 16, 2024

It might be that you have the wrong version of the extension installed; it's hard to guess.

@sbogomolov
Author

It might be that you have the wrong version of the extension installed; it's hard to guess.

The version of the extension is determined automatically. I have this definition:

customization:
  systemExtensions:
    officialExtensions:
    - siderolabs/qemu-guest-agent
    - siderolabs/intel-ucode
    - siderolabs/i915-ucode
    - siderolabs/iscsi-tools

and get the image from the Image Factory:

factory.talos.dev/installer/81520a5f701f32795dbb3585efa23384a94a87b26979f19ca6a7b50744a661c2:v1.8.1
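
The installer reference above combines a schematic ID with a Talos version tag. A small sketch of that composition follows. The layout is copied from the reference shown above; the function name installer_ref and the file name schematic.yaml are assumptions, and the upload command appears only as a comment.

```shell
#!/bin/sh
# Compose an installer image reference from a schematic ID and a Talos version.
# The reference layout follows the image quoted in this comment; the function
# name is hypothetical.
installer_ref() {
    schematic_id=$1
    version=$2
    printf 'factory.talos.dev/installer/%s:%s\n' "$schematic_id" "$version"
}

# The schematic ID is returned by the factory when the YAML is uploaded, e.g.
# (not run here; schematic.yaml is an assumed file name):
#   curl -sS -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics

installer_ref 81520a5f701f32795dbb3585efa23384a94a87b26979f19ca6a7b50744a661c2 v1.8.1
```

Only the version tag changes between Talos releases for the same schematic, which is why the extension version is not listed explicitly in the customization.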

@sbogomolov
Author

sbogomolov commented Oct 16, 2024

This is the kind of error I see now:

MountVolume.WaitForAttach failed for volume "postgres-data-pv" : failed to get any path for iscsi disk, last err seen: 
  iscsi: failed to attach disk: Error: /usr/local/sbin/iscsiadm: 5: [: missing ]
  nsenter: failed to execute 5714: No such file or directory 
    (exit status 127)

I have figured out what this error is. It was a cosmetic problem in this case, and I'll push a fix to the kubelet repo. After fixing it, the situation is even weirder, though: the pod now hangs on ContainerCreating without any problems in the Events. The kubelet log has some errors:

k8s-worker-03: {"ts":1729100569283.8694,"caller":"operationexecutor/operation_generator.go:1491","msg":"Controller attach succeeded for volume \"authentik-redis-pv\" (UniqueName: \"kubernetes.io/iscsi/192.168.70.20:3260:iqn.2000-01.com.synology:nas.authentik-redis.2edf3217d57:1\") pod \"authentik-redis-master-0\" (UID: \"2aa6b3fe-20a0-4519-ba19-9484266e7cc9\") device path: \"\"","v":0,"pod":{"name":"authentik-redis-master-0","namespace":"authentik"}}
k8s-worker-03: {"ts":1729100569373.826,"caller":"operationexecutor/operation_generator.go:538","msg":"MountVolume.WaitForAttach entering for volume \"authentik-redis-pv\" (UniqueName: \"kubernetes.io/iscsi/192.168.70.20:3260:iqn.2000-01.com.synology:nas.authentik-redis.2edf3217d57:1\") pod \"authentik-redis-master-0\" (UID: \"2aa6b3fe-20a0-4519-ba19-9484266e7cc9\") DevicePath \"\"","v":0,"pod":{"name":"authentik-redis-master-0","namespace":"authentik"}}
k8s-worker-03: {"ts":1729100683870.8123,"caller":"kubelet/pod_workers.go:1301","msg":"Error syncing pod, skipping","pod":{"name":"authentik-redis-master-0","namespace":"authentik"},"podUID":"2aa6b3fe-20a0-4519-ba19-9484266e7cc9","err":"unmounted volumes=[redis-data], unattached volumes=[], failed to process volumes=[]: context deadline exceeded"}
k8s-worker-03: {"ts":1729100818304.6995,"caller":"kubelet/pod_workers.go:1301","msg":"Error syncing pod, skipping","pod":{"name":"authentik-redis-master-0","namespace":"authentik"},"podUID":"2aa6b3fe-20a0-4519-ba19-9484266e7cc9","err":"unmounted volumes=[redis-data], unattached volumes=[], failed to process volumes=[]: context deadline exceeded"}

EDIT: fix for kubelet: siderolabs/kubelet#87
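
As a side note on the quoted error: the line `nsenter: failed to execute 5714` is consistent with a PID lookup returning more than one PID and the result being expanded unquoted, so that the second PID lands where nsenter expects the program name. That reading is an assumption, not a description of the actual wrapper or of the fix in siderolabs/kubelet#87. A defensive sketch, with the hypothetical helper first_pid, could look like this:

```shell
#!/bin/sh
# Hypothetical helper: reduce a PID lookup result to exactly one PID.
# The input may hold several whitespace-separated PIDs, or be empty.
first_pid() {
    # Intentionally unquoted so the shell splits the list into words.
    # shellcheck disable=SC2086
    set -- $1
    # Fail if the lookup returned nothing.
    [ "$#" -ge 1 ] || return 1
    printf '%s\n' "$1"
}

# Sketch of a wrapper body (not run here; an assumption, not the real script):
#   pid=$(first_pid "$(pgrep iscsid)") || { echo "iscsid not running" >&2; exit 1; }
#   exec nsenter -t "$pid" -a iscsiadm "$@"
```

Quoting "$pid" in the nsenter call keeps the target a single argument even if the lookup misbehaves.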

@sbogomolov
Author

sbogomolov commented Oct 16, 2024

Extra information: iscsiadm shows that we have logged in to a target, but there is no device node created for it.

EDIT:
When I try to log in to the target manually, the command hangs:

# iscsiadm --mode node --target iqn.2000-01.com.synology:nas.authentik-postgres.2edf3217d57 --portal 192.168.70.20:3260,1 --login 
Logging in to [iface: default, target: iqn.2000-01.com.synology:nas.authentik-postgres.2edf3217d57, portal: 192.168.70.20,3260]

No errors. If I press Ctrl+C, the session appears to be logged in, but no disk device is created.

EDIT2:
If I run discovery manually, logging in succeeds.

EDIT3:
This breaks it:

iscsiadm -m node -p 192.168.70.20:3260 -T iqn.2000-01.com.synology:nas.authentik-postgres.2edf3217d57 -I default -o update -n node.session.scan -v manual

If I set node.session.scan to manual for any target, login hangs. If I leave it on auto, login works.

EDIT4:
I was able to reproduce it on my desktop with the same version of open-iscsi. I'll try an older version later today.
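
To switch a node record between the two modes described in EDIT3, the same update command can be reused with a different value. The sketch below only prints the command instead of running it; the function name scan_mode_cmd and the dry-run approach are assumptions, while the flags are copied from the command in EDIT3.

```shell
#!/bin/sh
# Print (do not run) the iscsiadm command that sets node.session.scan for one
# node record. Function name and dry-run style are assumptions; the flags are
# taken from the command in EDIT3.
scan_mode_cmd() {
    portal=$1
    target=$2
    mode=$3
    # Only the two values discussed in this thread are accepted.
    case $mode in
        auto|manual) ;;
        *) echo "mode must be auto or manual" >&2; return 1 ;;
    esac
    printf 'iscsiadm -m node -p %s -T %s -I default -o update -n node.session.scan -v %s\n' \
        "$portal" "$target" "$mode"
}

scan_mode_cmd 192.168.70.20:3260 iqn.2000-01.com.synology:nas.authentik-postgres.2edf3217d57 auto
```

Setting the value back to auto this way is only a workaround for manual testing, since the value was being set to manual by something other than the manual commands shown here.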

@sbogomolov
Author

The new ext-iscsi uses open-iscsi 2.1.10. I was able to reproduce the problem on my desktop using the same version of open-iscsi. Downgrading to open-iscsi 2.1.9 fixed it for me. Is there a way to specify a custom image for extensions?

@frezbo
Member

frezbo commented Oct 17, 2024

The new ext-iscsi uses open-iscsi 2.1.10. I was able to reproduce the problem on my desktop using the same version of open-iscsi. Downgrading to open-iscsi 2.1.9 fixed it for me. Is there a way to specify a custom image for extensions?

You can use custom extensions with imager to generate a custom installer image. Is there a known fix for open-iscsi? We could then update it.
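
A rough sketch of what such an imager invocation might look like is shown below. It only prints the command. The imager tag, the output directory, the extension image reference, and the function name imager_cmd are all placeholders chosen for illustration, so check the Talos boot assets documentation before relying on any of them.

```shell
#!/bin/sh
# Print (do not run) an imager invocation that builds an installer image with a
# custom extension image included.
# ASSUMPTIONS: the imager tag, the output directory, and the extension image
# reference are placeholders, not values from this thread.
imager_cmd() {
    talos_version=$1
    extension_image=$2
    printf 'docker run --rm -t -v "$PWD/_out:/out" ghcr.io/siderolabs/imager:%s installer --system-extension-image %s\n' \
        "$talos_version" "$extension_image"
}

imager_cmd v1.8.1 registry.example.com/iscsi-tools:custom
```

The extension image here would be a build of iscsi-tools that carries the older or patched open-iscsi.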

@sbogomolov
Author

I'll check the diff between the versions and open an issue there. I'll update this issue if I find out anything else.

@sbogomolov
Author

The issue was identified and a PR with the fix is up. I've verified that it does indeed fix the issue. Will we have to wait for a new release, or can we pick this up as soon as it is merged?

open-iscsi/open-iscsi#485

@frezbo
Member

frezbo commented Oct 20, 2024

i guess we can pick for 1.8.1

@sbogomolov
Author

i guess we can pick for 1.8.1

That PR was just merged.

@smira
Member

smira commented Oct 21, 2024

@frezbo would you mind applying the patch? We can then backport it (to 1.8.2) this week.

frezbo added a commit to frezbo/extensions that referenced this issue Oct 21, 2024
Apply the upstream patch (no release yet).

Fixes: siderolabs#490

Signed-off-by: Noel Georgi <[email protected]>