
health HEALTH_WARN 64 pgs incomplete; 64 pgs stuck inactive; 64 pgs stuck unclean #187

abhishek6590 opened this issue Feb 3, 2015 · 20 comments


@abhishek6590

Hi,

I am having an issue with ceph health -
health HEALTH_WARN 64 pgs incomplete; 64 pgs stuck inactive; 64 pgs stuck unclean
Please suggest what I should check.

Thanks,
Abhishek

@hufman
Contributor

hufman commented Feb 3, 2015

That sounds like there aren't any OSD processes running and connected to the cluster. If you check the output of ceph osd tree, does it show that the cluster expects to have an OSD? If not, the ceph-disk-prepare script (which comes from the ceph::osd recipe) didn't run. If it does, then the ceph::osd recipe ran and initialized an OSD, but for some reason that OSD didn't connect to the cluster. Check the OSD server to make sure the process is running, and then look at the logs in /var/log/ceph/ceph-osd* to see why the OSD isn't connecting.
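
As a rough sketch, those checks could look like this, assuming the OSD host is reachable over SSH and the OSD id is 0 (both placeholders):

ceph osd tree                                            # does the cluster expect any OSDs?
ssh osd-host 'ps aux | grep [c]eph-osd'                  # is the OSD daemon actually running?
ssh osd-host 'tail -n 50 /var/log/ceph/ceph-osd.0.log'   # why isn't it connecting?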

@abhishek6590
Author

Hi, ceph osd tree is showing this output -
#ceph osd tree

id weight type name up/down reweight

-1 0.09 root default
-2 0.09 host server3
0 0.09 osd.0 up 1

and the logs are showing:
tail -f ceph-osd.0.log
2015-02-03 12:50:44.115354 7f0d0d1b7900 0 cls/hello/cls_hello.cc:271: loading cls_hello
2015-02-03 12:50:44.157671 7f0d0d1b7900 0 osd.0 4 crush map has features 1107558400, adjusting msgr requires for clients
2015-02-03 12:50:44.157682 7f0d0d1b7900 0 osd.0 4 crush map has features 1107558400 was 8705, adjusting msgr requires for mons
2015-02-03 12:50:44.157687 7f0d0d1b7900 0 osd.0 4 crush map has features 1107558400, adjusting msgr requires for osds
2015-02-03 12:50:44.157703 7f0d0d1b7900 0 osd.0 4 load_pgs
2015-02-03 12:50:44.201885 7f0d0d1b7900 0 osd.0 4 load_pgs opened 64 pgs
2015-02-03 12:50:44.212991 7f0d0d1b7900 -1 osd.0 4 set_disk_tp_priority(22) Invalid argument: osd_disk_thread_ioprio_class is but only the following values are allowed: idle, be or rt
2015-02-03 12:50:44.290354 7f0cfb587700 0 osd.0 4 ignoring osdmap until we have initialized
2015-02-03 12:50:44.290416 7f0cfb587700 0 osd.0 4 ignoring osdmap until we have initialized
2015-02-03 12:50:44.371616 7f0d0d1b7900 0 osd.0 4 done with init, starting boot process

Please advise.

Thanks,

@hufman
Contributor

hufman commented Feb 4, 2015

Ah yes, you'll need at least 3 OSDs for Ceph to be happy and healthy. Depending on how your CRUSH map is configured (I forget the defaults), these OSDs will have to be on separate hosts.
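
For a small test cluster, a common workaround is to lower the replica count and relax the CRUSH failure domain before creating any pools. A minimal sketch of the ceph.conf settings (the exact values here are assumptions, not the cookbook's defaults):

[global]
# two replicas instead of the default three
osd pool default size = 2
# allow I/O even when only a single replica is available
osd pool default min size = 1
# 0 = osd: let replicas land on the same host (single-node test clusters)
osd crush chooseleaf type = 0

Note these defaults only apply to pools created afterwards; an existing pool can be resized with ceph osd pool set <pool> size 2.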

@zdubery

zdubery commented Nov 9, 2016

Hi

I am a bit confused by this statement: "you'll need at least 3 OSDs to be happy and healthy". I followed the instructions (here: http://docs.ceph.com/docs/hammer/start/quick-ceph-deploy/) and once I get to the command "ceph health", the response is: "health HEALTH_ERR 64 pgs incomplete; 64 pgs stuck inactive; 64 pgs stuck unclean". That is right after installing it...

Ceph documentation clearly stated:
"Change the default number of replicas in the Ceph configuration file from 3 to 2 so that Ceph can achieve an active + clean state with just two Ceph OSDs. Add the following line under the [global] section:
osd pool default size = 2"

I have attempted this install at least 3 times now and the response is the same every time. I am running 1 admin node, 1 monitor and 2 OSDs on 4 VirtualBox Ubuntu 14.04 LTS VMs within Ubuntu 16 (the previous attempt was within Ubuntu 14).

The debug information is not very helpful at all. Ceph is also not writing to the /var/log/ceph/ location at all, even after I set ownership with
sudo chown ceph:root /var/log/ceph

ceph-deploy osd activate tells me that the OSDs are active, but ceph osd tree shows otherwise (down).

The config is read from /etc/ceph/ceph.conf all the time (even though I install everything from the my-cluster directory), which is incorrect. When I ran the install, the config was created in /home/user/my-cluster/ceph.conf, yet it is read from /etc/ceph/ceph.conf.
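
For what it's worth, ceph-deploy only uses the ceph.conf in your working directory at deploy time; the daemons read /etc/ceph/ceph.conf on each node, so local edits have to be pushed out. A minimal sketch, assuming node names mon1, osd1 and osd2:

cd ~/my-cluster
ceph-deploy --overwrite-conf config push mon1 osd1 osd2    # overwrites /etc/ceph/ceph.conf on those hosts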

So I will attempt 3 OSDs now, even though the site states otherwise...

Any suggestions would be very helpful.

Thanks,

zd

@sweetie233

Hi, I have the same problem as you, and I have reinstalled Ceph more than 3 times. I'm really frustrated. Have you figured it out? I'd appreciate any suggestions.

@zdubery

zdubery commented Dec 3, 2016 via email

@sweetie233

Hi

First, thank you so much for your suggestion!!!
My file system is ext4, and I just did the thing you suggested, but it seems to make no difference.

I reviewed the OSD's log thoroughly and found the following messages:
osd.0 0 backend (filestore) is unable to support max object name[space] len
osd.0 0 osd max object name len = 2048
osd.0 0 osd max object namespace len = 256
osd.0 0 (36) File name too long
journal close /var/lib/ceph/osd/ceph-0/journal
** ERROR: osd init failed: (36) File name too long

Then I found this page:
http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/

I just reinstalled Ceph again and placed the following lines in the [global] section of the config:
osd_max_object_name_len = 256
osd_max_object_namespace_len = 64

It works!!! I'm so happy, and I appreciate your reply very much!!!

Thanks again!
Best wishes~

@zdubery

zdubery commented Dec 4, 2016 via email

@subhashchand

If you are using an ext4 file system, you need to place this in the [global] section of the config:

vim /etc/ceph/ceph.conf

osd_max_object_name_len = 256
osd_max_object_namespace_len = 64

See http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/

Then check with:
#ceph status
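
A rough sketch of applying this without reinstalling, assuming a Jewel install on a systemd host and OSD id 0 (both assumptions):

sudo vim /etc/ceph/ceph.conf            # add the two settings under [global]
sudo systemctl restart ceph-osd@0       # restart each affected OSD (id 0 assumed)
ceph -s                                 # PGs should go active+clean once the OSDs rejoin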

@getarz4u15ster

getarz4u15ster commented Jan 27, 2017

I'm having the same problem; however, I am using the preferred XFS filesystem. Any suggestions?

[From monitor node i get the following]
HEALTH_ERR 64 pgs are stuck inactive for more than 300 seconds; 64 pgs stuck inactive; no osds

[From OSD node]
2017-01-27 07:55:28.000882 7fde7846d700 0 -- :/429908835 >> ipaddress:6789/0 pipe(0x7fde74063f30 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fde7405c5a0).fault

[From Monitor node out of /var/log/ceph/ceph.log]
2017-01-27 06:47:11.121804 mon.0 ipaddress:6789/0 1 : cluster [INF] mon.oso-node1@0 won leader election with quorum 0
2017-01-27 06:47:11.121931 mon.0 ipaddress:6789/0 2 : cluster [INF] monmap e1: 1 mons at {oso-node1=ipaddress:6789/0}
2017-01-27 06:47:11.122008 mon.0 ipaddress:6789/0 3 : cluster [INF] pgmap v2: 64 pgs: 64 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
2017-01-27 06:47:11.122090 mon.0 ipaddress:6789/0 4 : cluster [INF] fsmap e1:
2017-01-27 06:47:11.122203 mon.0 ipaddress:6789/0 5 : cluster [INF] osdmap e1: 0 osds: 0 up, 0 in
2017-01-27 06:54:50.687322 mon.0 ipaddress:6789/0 1 : cluster [INF] mon.oso-node1@0 won leader election with quorum 0
2017-01-27 06:54:50.687415 mon.0 ipaddress:6789/0 2 : cluster [INF] monmap e1: 1 mons at {oso-node1=ipaddress:6789/0}
2017-01-27 06:54:50.687497 mon.0 ipaddress:6789/0 3 : cluster [INF] pgmap v2: 64 pgs: 64 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
2017-01-27 06:54:50.687577 mon.0 ipaddress:6789/0 4 : cluster [INF] fsmap e1:
2017-01-27 06:54:50.687716 mon.0 ipaddress:6789/0 5 : cluster [INF] osdmap e1: 0 osds: 0 up, 0 in

@swq499809608

f_redirected e754) currently waiting for peered
2017-03-02 10:58:39.952422 osd.25 [WRN] 100 slow requests, 1 included below; oldest blocked for > 324.251003 secs
2017-03-02 10:58:39.952444 osd.25 [WRN] slow request 240.250943 seconds old, received at 2017-03-02 10:54:39.701431: osd_op(client.512724.0:135407 97.84ada7c9 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:40.091373 osd.27 [WRN] 100 slow requests, 1 included below; oldest blocked for > 324.389960 secs
2017-03-02 10:58:40.091378 osd.27 [WRN] slow request 240.389941 seconds old, received at 2017-03-02 10:54:39.701397: osd_op(client.512724.0:135408 97.31099063 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:40.952740 osd.25 [WRN] 100 slow requests, 1 included below; oldest blocked for > 325.251301 secs
2017-03-02 10:58:40.952791 osd.25 [WRN] slow request 240.243998 seconds old, received at 2017-03-02 10:54:40.708674: osd_op(client.36294.0:8895939 97.84ada7c9 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:41.091613 osd.27 [WRN] 100 slow requests, 1 included below; oldest blocked for > 325.390198 secs
2017-03-02 10:58:41.091619 osd.27 [WRN] slow request 240.382847 seconds old, received at 2017-03-02 10:54:40.708729: osd_op(client.36294.0:8895940 97.31099063 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:43.953496 osd.25 [WRN] 100 slow requests, 1 included below; oldest blocked for > 328.252086 secs
2017-03-02 10:58:43.953517 osd.25 [WRN] slow request 240.022847 seconds old, received at 2017-03-02 10:54:43.930609: osd_op(client.36291.0:8893352 97.84ada7c9 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:44.092310 osd.27 [WRN] 100 slow requests, 1 included below; oldest blocked for > 328.390885 secs
2017-03-02 10:58:44.092315 osd.27 [WRN] slow request 240.161657 seconds old, received at 2017-03-02 10:54:43.930605: osd_op(client.36291.0:8893353 97.31099063 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:44.953818 osd.25 [WRN] 100 slow requests, 1 included below; oldest blocked for > 329.252386 secs
2017-03-02 10:58:44.953827 osd.25 [WRN] slow request 240.251734 seconds old, received at 2017-03-02 10:54:44.702023: osd_op(client.512724.0:135415 97.84ada7c9 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:45.092587 osd.27 [WRN] 100 slow requests, 1 included below; oldest blocked for > 329.391155 secs
2017-03-02 10:58:45.092597 osd.27 [WRN] slow request 240.390484 seconds old, received at 2017-03-02 10:54:44.702049: osd_op(client.512724.0:135416 97.31099063 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:45.954085 osd.25 [WRN] 100 slow requests, 1 included below; oldest blocked for > 330.252673 secs
2017-03-02 10:58:45.954103 osd.25 [WRN] slow request 240.244915 seconds old, received at 2017-03-02 10:54:45.709129: osd_op(client.36294.0:8895947 97.84ada7c9 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:46.092838 osd.27 [WRN] 100 slow requests, 1 included below; oldest blocked for > 330.391422 secs
2017-03-02 10:58:46.092850 osd.27 [WRN] slow request 240.383640 seconds old, received at 2017-03-02 10:54:45.709160: osd_op(client.36294.0:8895948 97.31099063 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered

@ghost

ghost commented Jul 26, 2017

After adding the following lines to the /etc/ceph/ceph.conf file and rebooting the system, the issue somehow still exists.

osd_max_object_name_len = 256
osd_max_object_namespace_len = 64

ceph status

cluster b3609cba-0b6d-4311-8aa3-6968c0e66f5e
 health HEALTH_WARN
        64 pgs degraded
        64 pgs stuck degraded
        64 pgs stuck unclean
        64 pgs stuck undersized
        64 pgs undersized
 monmap e1: 1 mons at {0=10.11.108.188:6789/0}
        election epoch 3, quorum 0 0
 osdmap e15: 2 osds: 2 up, 2 in
        flags sortbitwise,require_jewel_osds
  pgmap v36: 64 pgs, 1 pools, 0 bytes data, 0 objects
        69172 kB used, 3338 GB / 3338 GB avail
              64 active+undersized+degraded
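
With 2 OSDs up and in but every PG active+undersized+degraded, the pool's replica size is most likely still the default of 3. A sketch of checking and lowering it on the existing pool, assuming the default pool name rbd:

ceph osd pool get rbd size        # probably still 3
ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1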

@mosyang

mosyang commented Jul 28, 2017

I ran into those ext4 file system issues before. I tried the settings below in ceph.conf but finally gave up.

osd_max_object_name_len = 256
osd_max_object_namespace_len = 64
osd check max object name len on startup = false

However, I then followed this helpful document to deploy Ceph Jewel 10.2.9 on Ubuntu 16.04: log in to all OSD nodes and format the /dev/sdb partition with the XFS file system. After that, I followed the official document to deploy Ceph on my Ubuntu 16.04 servers. Everything works fine now.
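
A rough sketch of that per-node step with ceph-deploy on Jewel, assuming the data disk is /dev/sdb and the node is called osd1 (both assumptions; this wipes the disk, and the prepare step will repartition it anyway):

# on each OSD node (destroys any data on /dev/sdb):
sudo mkfs.xfs -f /dev/sdb
# from the admin node's cluster directory:
ceph-deploy osd prepare osd1:/dev/sdb
ceph-deploy osd activate osd1:/dev/sdb1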

@Runomu

Runomu commented Aug 1, 2017

I have exactly the same problem with 14.04 LTS and ext4. I tried almost everything and all the suggestions above, but I'm still getting the following from ceph -s, and the output after that from ceph osd tree:

health HEALTH_ERR
64 pgs are stuck inactive for more than 300 seconds
64 pgs stuck inactive
64 pgs stuck unclean

ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0 root default
0 0 osd.0 down 0 1.00000

@mattshma

mattshma commented Sep 8, 2017

After appending those lines to the admin node's ceph.conf:

osd max object name len = 256
osd max object namespace len = 64

I think you should then run ceph-deploy --overwrite-conf admin osd1 osd2 to push the changes to the OSD nodes. You should also make sure the ceph user has read permission on /etc/ceph/ceph.client.admin.keyring on the OSD nodes.
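
A sketch of those two steps, reusing mattshma's node names (the keyring path is the default):

ceph-deploy --overwrite-conf admin osd1 osd2
ssh osd1 'sudo chmod +r /etc/ceph/ceph.client.admin.keyring'
ssh osd2 'sudo chmod +r /etc/ceph/ceph.client.admin.keyring'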

@alamintech

When my server reboots, I see errors that OSDs are down and PGs are inactive.
Please help me figure out how I can solve this. This storage is used as CloudStack primary storage.

[screenshot]

Thanks.

@alamintech

Please, can anyone help me?
[screenshot]

@zdover23

zdover23 commented Sep 7, 2020 via email

@alamintech

I looked, but I can't find a solution for this:

[screenshot]

@alamintech

After the server reboot, the OSD service won't start. Please help me, anyone.
