cephadm-rookify-setup: Waiting for disabled original ceph-rgw host worker1 to disconnect #106

Closed
boekhorstb1 opened this issue Nov 18, 2024 · 3 comments · Fixed by #109
Comments

@boekhorstb1
Contributor

I added RGW objects (buckets with files) for a custom zone, zonegroup, and so on. The setup is as follows (see the note at the end of the YAML file):

---
service_type: host
hostname: master
addr: {{ master_ip }}
labels:
  - _admin
  - grafana
---
service_type: host
hostname: worker1
addr: {{ worker1_ip }}
labels:
  - mon
  - mgr
  - mds
  - osd
  - alertmanager
  - rgw
---
service_type: host
hostname: worker2
addr: {{ worker2_ip }}
labels:
  - mon
  - mgr
  - mds
  - osd
  - prometheus
  - rgw
---
service_type: host
hostname: worker3
addr: {{ worker3_ip }}
labels:
  - mon
  - mgr
  - mds
  - osd
  - rgw
---
service_type: alertmanager
service_name: alertmanager
placement:
  count: 1
  hosts: 
    - 'worker1'
---
service_type: prometheus
service_name: prometheus
placement:
  count: 1
  hosts: 
    - 'worker2'
---
service_type: mds
service_id: mds
service_name: mds.mds
placement:
  count: 2
  label: mds
---
service_type: mgr
service_name: mgr
placement:
  count: 2
  hosts:
    - worker1
    - worker2
    - worker3
---
service_type: mon
service_name: mon
placement:
  count: 3
  hosts:
    - worker1
    - worker2
    - worker3
---
service_type: osd
service_name: osd
placement:
  count: 3
  hosts:
    - worker1
    - worker2
    - worker3
crush_device_class: hdd
spec:
  data_devices:
    rotational: true
    size: 10GB
  filter_logic: AND
  objectstore: bluestore
  db_devices:
    rotational: true
    size: 9GB
---
service_type: node-exporter
service_name: node-exporter
placement:
  host_pattern: 'worker*'
---
service_type: crash
service_name: crash
placement:
  host_pattern: 'worker*'
---
service_type: rgw
service_id: testing-eu-east
service_name: rgw.testing-eu-east
placement:
  count: 3
  host_pattern: 'worker*'
spec:
  rgw_zonegroup: eu
  rgw_zone: eu-east
  rgw_realm: eu-east
  rgw_frontend_type: "beast"
  rgw_frontend_port: 8080

# NOTE: the rgw service gets created, fine, but the realm/zone configuration is applied elsewhere, which means the default will not be changed for the zones.
# So the current workaround is to change the defaults afterwards, manually, like this:
#  sudo radosgw-admin zonegroup create --rgw-zonegroup=eu --rgw-realm=eu-east --master --default
#  sudo radosgw-admin zone create --rgw-zonegroup=eu --rgw-zone=eu-east --master --default
#  sudo radosgw-admin zonegroup remove --rgw-zonegroup=default --rgw-zone=default
#  sudo radosgw-admin period update --commit
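
For reference, something like the following can be used to verify that the workaround actually took effect after the period commit (a sketch using standard radosgw-admin subcommands; eu and eu-east are the names from the spec above):

# the default realm/zonegroup/zone should now be the custom ones
sudo radosgw-admin realm list
sudo radosgw-admin zonegroup get --rgw-zonegroup=eu
sudo radosgw-admin zone get --rgw-zone=eu-east
# the committed period should name eu/eu-east as master zonegroup/zone
sudo radosgw-admin period get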

When running the migration with rookify -m setup, it hangs at disabling the original ceph-rgw:

2024-11-18 14:04.38 [info     ] Execution started with machine pickle file
2024-11-18 14:04.38 [info     ] Validated Ceph to expect cephx auth
2024-11-18 14:04.38 [debug    ] K8sPrerequisitesCheck started validation
2024-11-18 14:04.38 [info     ] Migrating ceph-osd host 'worker1'
2024-11-18 14:04.38 [info     ] Migrating ceph-osd host 'worker2'
2024-11-18 14:04.38 [info     ] Migrating ceph-osd host 'worker3'
2024-11-18 14:04.38 [info     ] Migrating ceph-rgw daemon at host 'worker1'
2024-11-18 14:04.38 [debug    ] Waiting for disabled original ceph-rgw host 'worker1' to disconnect

Checking the pods in k3s gives the following insights:

rook-ceph     rook-ceph-mgr-a-7bf4754786-mgdvp                    0/3     Pending     0              173m
rook-ceph     rook-ceph-mgr-b-6b8786d97f-b2bjv                    0/3     Pending     0              173m

Describing one of those pods shows this:

QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane=true:NoSchedule
                             node-role.kubernetes.io/master=true:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 5s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  3m26s (x14 over 68m)  default-scheduler  0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.

I will check whether this is due to some configuration mistake of my own, maybe not enough resources (RAM?); otherwise this might be a bug.
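
To narrow down the FailedScheduling message, comparing the pod's node affinity with the labels the k3s nodes actually carry should show which selector is unsatisfied (a sketch; pod name and namespace are taken from the listing above):

# what the scheduler requires for the mgr pod
kubectl -n rook-ceph get pod rook-ceph-mgr-a-7bf4754786-mgdvp -o jsonpath='{.spec.affinity}'
# what the nodes actually offer
kubectl get nodes --show-labels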

@boekhorstb1
Contributor Author

Do you maybe have any idea, @NotTheEvilOne?

@NotTheEvilOne
Contributor

You are basically hitting this line: https://github.com/SovereignCloudStack/rookify/blob/main/src/rookify/modules/migrate_rgws/main.py#L103. That means that as long as ceph status reports the host as having a service daemon running, it will wait indefinitely.

Please check the output of ceph -f json status manually; additionally, ensure that no RGW process is running on worker1 anymore.
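
For example, something along these lines shows which hosts Ceph still considers to be running an RGW daemon, and whether a radosgw process is left on the host (a sketch; assumes jq is available):

# on the admin host: hosts that still register an RGW daemon in the servicemap
ceph -f json status | jq '.servicemap.services.rgw.daemons | to_entries[] | select(.key != "summary") | .value.metadata.hostname'
# on worker1 itself: any leftover radosgw process?
ps aux | grep '[r]adosgw'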

Please note that you might be hitting an issue discovered recently and fixed in #101 (updated in #103). Ensure you are running the latest code :)

@boekhorstb1
Contributor Author

boekhorstb1 commented Nov 19, 2024

OK, I tested some more, running the latest code of the main branch, and yes, it runs quite indefinitely, I fear.

All modules from this list are working for me except migrate_rgws (or migrate_mgr to be more precise, or am I maybe missing something here?):

migration_modules:
- create_rook_cluster
- migrate_osds
- migrate_osd_pools
- migrate_mds
- migrate_mds_pools
#- migrate_rgws
- migrate_rgw_pools

migrate_rgws generates the error I mentioned above. So because of the line you noted, I guess it is actually migrate_mgr which is the real issue, at least for my setup.

This is the output of ceph status:

root@master:~# ceph -s
  cluster:
    id:     fc4bd1c0-a674-11ef-a6ea-6df3313396ee
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum worker3,worker2,worker1 (age 9m)
    mgr: worker3.jotbob(active, since 8m), standbys: worker1.uzvxrp
    osd: 3 osds: 3 up (since 7m), 3 in (since 89m)
    rgw: 3 daemons active (3 hosts, 1 zones)
 
  data:
    pools:   12 pools, 185 pgs
    objects: 421 objects, 465 KiB
    usage:   900 MiB used, 29 GiB / 30 GiB avail
    pgs:     185 active+clean

And the output in JSON format:

ceph -s -f json:

{"fsid":"fc4bd1c0-a674-11ef-a6ea-6df3313396ee","health":{"status":"HEALTH_OK","checks":{},"mutes":[]},"election_epoch":52,"quorum":[0,1,2],"quorum_names":["worker3","worker2","worker1"],"quorum_age":677,"monmap":{"epoch":11,"min_mon_release_name":"quincy","num_mons":3},"osdmap":{"epoch":478,"num_osds":3,"num_up_osds":3,"osd_up_since":1732025891,"num_in_osds":3,"osd_in_since":1732020978,"num_remapped_pgs":0},"pgmap":{"pgs_by_state":[{"state_name":"active+clean","count":185}],"num_pgs":185,"num_pools":12,"num_objects":421,"data_bytes":476604,"bytes_used":944103424,"bytes_avail":31255568384,"bytes_total":32199671808},"fsmap":{"epoch":3,"by_rank":[],"up:standby":2},"mgrmap":{"available":true,"num_standbys":1,"modules":["cephadm","dashboard","iostat","nfs","prometheus","restful"],"services":{"dashboard":"https://192.168.121.36:8443/","prometheus":"http://192.168.121.36:9283/"}},"servicemap":{"epoch":26,"modified":"2024-11-19T14:19:11.227514+0000","services":{"rgw":{"daemons":{"summary":"","24475":{"start_epoch":19,"start_stamp":"2024-11-19T14:04:13.667023+0000","gid":24475,"addr":"192.168.121.231:0/2595263976","metadata":{"arch":"x86_64","ceph_release":"quincy","ceph_version":"ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)","ceph_version_short":"17.2.7","container_hostname":"worker2","container_image":"quay.io/ceph/ceph@sha256:d26c11e20773704382946e34f0d3d2c0b8bb0b7b37d9017faa9dc11a0196c7d9","cpu":"AMD EPYC-Genoa Processor","distro":"centos","distro_description":"CentOS Stream 8","distro_version":"8","frontend_config#0":"beast port=8080","frontend_type#0":"beast","hostname":"worker2","id":"testing-eu-east.worker2.mtohyq","kernel_description":"#129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024","kernel_version":"5.15.0-119-generic","mem_swap_kb":"0","mem_total_kb":"6063804","num_handles":"1","os":"Linux","pid":"7","realm_id":"903dc6e5-cedb-4230-b6d6-b3435e8b759a","realm_name":"eu-east","zone_id":"8ad363a0-d24a-45b3-a8a4-18ce4bb2985e","zone_name":"eu-east","zonegroup_id":"d9e1ac32-7746-4e90-b74d-a9ed8545c8c4","zonegroup_name":"eu"},"task_status":{}},"24513":{"start_epoch":20,"start_stamp":"2024-11-19T14:04:15.610782+0000","gid":24513,"addr":"192.168.121.36:0/2160107591","metadata":{"arch":"x86_64","ceph_release":"quincy","ceph_version":"ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)","ceph_version_short":"17.2.7","container_hostname":"worker3","container_image":"quay.io/ceph/ceph@sha256:d26c11e20773704382946e34f0d3d2c0b8bb0b7b37d9017faa9dc11a0196c7d9","cpu":"AMD EPYC-Genoa Processor","distro":"centos","distro_description":"CentOS Stream 8","distro_version":"8","frontend_config#0":"beast port=8080","frontend_type#0":"beast","hostname":"worker3","id":"testing-eu-east.worker3.otjrcg","kernel_description":"#129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024","kernel_version":"5.15.0-119-generic","mem_swap_kb":"0","mem_total_kb":"6063804","num_handles":"1","os":"Linux","pid":"7","realm_id":"903dc6e5-cedb-4230-b6d6-b3435e8b759a","realm_name":"eu-east","zone_id":"8ad363a0-d24a-45b3-a8a4-18ce4bb2985e","zone_name":"eu-east","zonegroup_id":"d9e1ac32-7746-4e90-b74d-a9ed8545c8c4","zonegroup_name":"eu"},"task_status":{}},"34454":{"start_epoch":19,"start_stamp":"2024-11-19T14:04:12.023143+0000","gid":34454,"addr":"192.168.121.138:0/1091293753","metadata":{"arch":"x86_64","ceph_release":"quincy","ceph_version":"ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy 
(stable)","ceph_version_short":"17.2.7","container_hostname":"worker1","container_image":"quay.io/ceph/ceph@sha256:d26c11e20773704382946e34f0d3d2c0b8bb0b7b37d9017faa9dc11a0196c7d9","cpu":"AMD EPYC-Genoa Processor","distro":"centos","distro_description":"CentOS Stream 8","distro_version":"8","frontend_config#0":"beast port=8080","frontend_type#0":"beast","hostname":"worker1","id":"testing-eu-east.worker1.yeqfay","kernel_description":"#129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024","kernel_version":"5.15.0-119-generic","mem_swap_kb":"0","mem_total_kb":"6063804","num_handles":"1","os":"Linux","pid":"7","realm_id":"903dc6e5-cedb-4230-b6d6-b3435e8b759a","realm_name":"eu-east","zone_id":"8ad363a0-d24a-45b3-a8a4-18ce4bb2985e","zone_name":"eu-east","zonegroup_id":"d9e1ac32-7746-4e90-b74d-a9ed8545c8c4","zonegroup_name":"eu"},"task_status":{}}}}}},"progress_events":{}}

NotTheEvilOne added a commit that referenced this issue Nov 28, 2024
Fixes an issue related to #106 where the RGW systemd unit file name template was not usable at all.

Fixes: #106
Signed-off-by: Tobias Wolf <[email protected]>