Be sure to check out the Rook Ceph Common Issues page and verify that all prerequisites for the storage backend of your choice are met!
Where did the rook-discover-* Pods go after a recent Rook Ceph update? A recent change in Rook Ceph has disabled the rook-discover DaemonSet by default.
This behavior is controlled by the ROOK_ENABLE_DISCOVERY_DAEMON setting in operator.yaml or, for Helm users, by enableDiscoveryDaemon in your values file. It is a boolean, so false or true.
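A minimal sketch of turning the discovery daemon back on; it assumes the default rook-ceph namespace, the default operator Deployment name, and (for Helm) a release named rook-ceph installed from the rook-release/rook-ceph chart, so adjust these to your setup:
# Set the env var directly on the operator Deployment (equivalent to editing operator.yaml)
$ kubectl -n rook-ceph set env deployment/rook-ceph-operator ROOK_ENABLE_DISCOVERY_DAEMON=true
# Or, for Helm-managed installs, flip the chart value mentioned above
$ helm upgrade rook-ceph rook-release/rook-ceph -n rook-ceph --reuse-values --set enableDiscoveryDaemon=true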
Pods stuck in Pending / ContainerCreating, or OSDs not being created on your disks? Work through the following checklist:
- Does the "rook-discover-* Pods / ROOK_ENABLE_DISCOVERY_DAEMON: true?" question above apply to you? If so, make sure the operator has the discovery daemon enabled in its (Pod) config!
- If your OSDs run on PVCs, the PVCs must use volumeMode: Block. Ceph requires block devices (Ceph's filestore has not been available through Rook for quite a few versions now, as bluestore is superior in certain ways).
- For Pods stuck in Pending / ContainerCreating, check kubectl describe pod POD_NAME.
- Check that the rook-ceph-mon-* Pods are running.
- Check the rp_filter kernel setting, see here: rp_filter (default) strict mode breaks certain load balancing cases in kube-proxy-free mode · Issue #13130 · cilium/cilium.
- Check the rook-ceph-operator logs for any warnings, errors, etc.
- Make sure the disks are clean; wipe them with shred, dd or similar, e.g.:
DISK="/dev/sdXYZ"
sgdisk --zap-all "$DISK"
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync
blkdiscard "$DISK"
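To double-check that the device is really clean afterwards (a quick sanity check, not a Rook requirement), you can look at any remaining signatures; wipefs is part of util-linux and should be available on most hosts:
lsblk -f "$DISK"
wipefs --all "$DISK"   # only needed if lsblk still shows filesystem/RAID/LVM signatures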
Check kubectl describe pod POD_NAME and the dmesg logs on the affected node. Make sure you have checked out the CSI Common Issues - Rook Docs.
If you have an unusual kernel and/or kubelet configuration, make sure Ceph CSI's config options in the Rook Ceph Operator config are set up correctly (e.g., LIB_MODULES_DIR_PATH, ROOK_CSI_KUBELET_DIR_PATH, AGENT_MOUNTS).
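A small sketch of how to inspect and override these options, assuming they live in the rook-ceph-operator-config ConfigMap (as in current Rook releases) and that your kubelet root dir is the non-default /data/kubelet (adjust both to your environment):
$ kubectl -n rook-ceph get configmap rook-ceph-operator-config -o yaml | grep -E 'CSI|LIB_MODULES'
# Point Ceph CSI at the custom kubelet root dir
$ kubectl -n rook-ceph patch configmap rook-ceph-operator-config --type merge \
    -p '{"data":{"ROOK_CSI_KUBELET_DIR_PATH":"/data/kubelet"}}'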
Ceph cluster unhealthy or mons not forming quorum? Work through the following (see the example commands after this list):
- Are the rook-ceph-mon-* Pods all in the Running state?
- Does ceph -s work (from the toolbox Pod)?
- Are the rook-ceph-mgr-* Pod(s) running as well?
- Check the rook-ceph-mon-* and rook-ceph-mgr-* logs for errors.
- If the rook-ceph-mon-* logs only talk about probing for other mons, you might need to follow the disaster recovery guide for your Rook Ceph version here: Rook v1.8 Docs - Ceph Disaster Recovery - Restoring Mon Quorum.
- The operator waits ROOK_MON_OUT_TIMEOUT, by default 600s (10 minutes), before it starts failing over an unhealthy mon.
For OSD related operations, check out the official Ceph OSD Management guide from Rook here: Rook v1.8 Docs - Ceph OSD Management.
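A few example commands for the checks above (a sketch; it assumes the default rook-ceph namespace, a mon named "a", and that the toolbox from the next paragraph is deployed):
$ kubectl -n rook-ceph get pods -l app=rook-ceph-mon
$ kubectl -n rook-ceph get pods -l app=rook-ceph-mgr
$ kubectl -n rook-ceph logs deploy/rook-ceph-mon-a        # repeat for mon-b, mon-c, ...
$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s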
To inspect OSDs and change their device class, deploy the Rook Ceph Toolbox if you have not already (use the manifest from the release- branch of your Rook release): https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/ceph/toolbox.yaml
Check kubectl describe pod POD_NAME and the ceph osd tree output. Then, from the toolbox:
- ceph osd crush rm-device-class osd.ID removes the current device class from an OSD.
- ceph osd crush set-device-class CLASS osd.ID sets a device class on an OSD (CLASS is, e.g., hdd, ssd or nvme).
HEALTH_WARN: clients are using insecure global_id reclaim / HEALTH_WARN: mons are allowing insecure global_id reclaim
Source: https://github.com/rook/rook/issues/7746
I can confirm this is happening in all clusters, whether a clean install or an upgraded cluster, running at least versions v14.2.20, v15.2.11 or v16.2.1.
According to the previously mentioned CVE, there is a security issue where clients need to be upgraded to the releases mentioned. Once all the clients are updated (e.g., the Rook daemons and the CSI driver), a new setting needs to be applied to the cluster to disable the insecure mode.
If you see both these health warnings, then either one of the rook or csi daemons has not been upgraded yet, or some other client is detected on the older version:
  health: HEALTH_WARN
          client is using insecure global_id reclaim
          mon is allowing insecure global_id reclaim
If you only see this one warning, then the insecure mode should be disabled:
  health: HEALTH_WARN
          mon is allowing insecure global_id reclaim
To disable the insecure mode, run the following from the toolbox after all the clients are upgraded. Make sure all clients really have been upgraded, or else those clients will be blocked after this is set:
ceph config set mon auth_allow_insecure_global_id_reclaim false
Rook could set this flag automatically after the clients have all been updated.
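To verify the setting took effect, you can read it back from the toolbox:
ceph config get mon auth_allow_insecure_global_id_reclaim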
To check which object store a single OSD (here OSD 0) is using, run the following in the toolbox:
$ ceph osd metadata 0 | grep osd_objectstore
"osd_objectstore": "bluestore",
To get a quick overview of the object stores in use across all OSDs (bluestore, or the legacy filestore, which you should not use):
$ ceph osd count-metadata osd_objectstore
{
"bluestore": 6
}
Volume expansion of a PVC not working? Check whether the StorageClass has allowVolumeExpansion: false:
$ kubectl get storageclasses.storage.k8s.io
NAME              PROVISIONER                     RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
rook-ceph-block   rook-ceph.rbd.csi.ceph.com      Retain          Immediate           false                  3d21h
rook-ceph-fs      rook-ceph.cephfs.csi.ceph.com   Retain          Immediate           true                   3d21h
To allow volume expansion, set allowVolumeExpansion: true in the StorageClass. Below is a StorageClass with this option set; note that it sits at the top level of the object (not under .spec or similar). A kubectl patch alternative follows the example.
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  [...]
  imageFeatures: layering
  imageFormat: "2"
  pool: replicapool
provisioner: rook-ceph.rbd.csi.ceph.com
reclaimPolicy: Retain
volumeBindingMode: Immediate
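If you do not want to edit the full YAML, a quick patch also works. This is just a sketch: the StorageClass name is the one from the example above, the PVC name my-app-data is hypothetical, and the new size is arbitrary:
# Allow expansion on the existing StorageClass
$ kubectl patch storageclass rook-ceph-block -p '{"allowVolumeExpansion": true}'
# Then grow a PVC that uses it (expansion only works upwards)
$ kubectl patch pvc my-app-data -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'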
[...] failed to retrieve servicemonitor. servicemonitors.monitoring.coreos.com "rook-ceph-mgr" is forbidden: [...]
You have the Prometheus Operator installed in your Kubernetes cluster, but have not applied the RBAC necessary for the Rook Ceph Operator to be able to create the monitoring objects.
To rectify this, you can run the following command and/or add the file to your deployment system:
kubectl apply -f https://raw.githubusercontent.com/rook/rook/master/cluster/examples/kubernetes/ceph/monitoring/rbac.yaml
(Original file located at: https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/ceph/monitoring/rbac.yaml)
[...] failed to reconcile cluster "rook-ceph": [...] failed to create servicemonitor. the server could not find the requested resource (post servicemonitors.monitoring.coreos.com)
This normally means that you don't have the Prometheus Operator installed in your Kubernetes cluster. It is required for .spec.monitoring.enabled: true in the CephCluster object to work (the Rook Ceph Operator needs it to be able to create the ServiceMonitor object that enables monitoring).
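A quick way to check whether the Prometheus Operator CRDs are present at all (a NotFound error here means the operator is missing or not fully installed):
$ kubectl get crd servicemonitors.monitoring.coreos.com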
For the Rook Ceph - Prometheus monitoring setup steps, check the Rook documentation.
If you don't want monitoring, set .spec.monitoring.enabled to false in your CephCluster object/YAML (and apply it).
If you do want to use Prometheus for monitoring your applications, and in this case also the Rook Ceph cluster, in Kubernetes, make sure to install the Prometheus Operator.
Check out the Prometheus Operator - Getting Started Guide.
unable to get monitor info from DNS SRV with service name: ceph-mon / Can't run ceph and rbd commands in the Rook Ceph XYZ Pod
You are only supposed to run ceph, rbd, radosgw-admin, etc., commands in the Rook Ceph Toolbox/Tools Pod.
Regarding the Rook Ceph Toolbox Pod, check out the Rook documentation here: Rook Ceph Docs - Ceph Toolbox.
This requires you to have the Rook Ceph Toolbox deployed; see Rook Ceph Docs - Ceph Toolbox for more information. Once the toolbox is running, you can exec into it with:
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash
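Inside the toolbox you can then run the usual Ceph commands, for example (the pool name replicapool is taken from the StorageClass example above; use your own pool names):
ceph -s              # overall cluster status
ceph osd status      # OSD overview
ceph df              # pool and raw capacity usage
rbd ls replicapool   # list RBD images in a pool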
OSD id X != my id Y - OSD Crash
Exec into the crashing OSD's Pod with kubectl exec -n rook-ceph -it OSD_POD_NAME -- bash (the ceph-bluestore-tool command is needed) and run the following commands:
- Run lsblk to see all disks of the host.
- Run ceph-bluestore-tool show-label --dev=/dev/sdX for the disks and note down the OSD ID (the whoami field in the JSON output) and which disk the OSD is on (example: OSD 11 on /dev/sda).
- The rook-ceph-osd-... Deployment then needs to be updated with the new/correct device path: the ROOK_BLOCK_PATH environment variable must have the correct device path (there are two occurrences, one in the containers: and one in the initContainers: list).
- The OSD should then come back up in the ceph osd tree output (the command can be run in the rook-ceph-tools Pod). If you have scaled down the OSD Deployment, make sure to scale it up to 1 again (kubectl scale -n rook-ceph deployment --replicas=1 rook-ceph-osd...).
_read_bdev_label failed to open /var/lib/ceph/osd/ceph-1/block: (13) Permission denied
Do you have the ceph package(s) installed on the host and/or a user/group named ceph? This can potentially mess with the owner/group of the Ceph OSD block device, as described in GitHub rook/rook Issue 7519 "OSD pod permissions broken, unable to open OSD superblock after node restart".
You can either change the user and group ID of the ceph user on the host to match the one inside the ceph/ceph image that your Rook Ceph cluster is currently running (see the CephCluster object's .spec.cephVersion.image; a sketch follows the example output below):
$ kubectl get -n rook-ceph cephclusters.ceph.rook.io rook-ceph -o yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  [...]
  name: rook-ceph
  namespace: rook-ceph
  [...]
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v16.2.6-20210927
[...]
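A rough sketch of how that can look on the host. The UID/GID 167 is what ceph/ceph images typically use, but trust the output of the first command rather than this number, and adapt the commands (docker vs. podman, image tag) to your environment:
# Look up the ceph user inside the exact image your cluster runs
$ docker run --rm --entrypoint id quay.io/ceph/ceph:v16.2.6-20210927 ceph
# Align the host's ceph user/group with it (use the IDs printed above)
$ sudo groupmod -g 167 ceph
$ sudo usermod -u 167 -g 167 ceph
# Files owned by the old IDs may need a chown afterwards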
Alternatively, depending on your hosts, you might not even need to have the ceph packages installed. If you are using Rook Ceph, you normally don't need any Ceph-related packages on the hosts.
Should this not have fixed your issue, you might be running into some other permission issue. If your hosts are running a Linux distribution that uses SELinux, you might need to follow these steps to re-configure the Rook Ceph Operator: Rook Ceph Docs - OpenShift Special Configuration Guide.
Should this page not have yielded a solution, check out the Ceph Common Issues doc as well.