Common Issues
Benchmarking Ceph Storage
Do you want to benchmark the storage of your Ceph cluster(s)? Below is a short list of tools for benchmarking storage.
Recommended tools (example invocations are sketched below):
- General storage benchmarking (e.g., for plain disks and other storage software)
- Ceph-specific benchmarking
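A minimal sketch of what such benchmark runs could look like, assuming fio is available on the host and that you have a scratch path/pool you can safely write to (the pool name testbench and the file path are placeholders, not taken from the original tool list):
# General storage benchmark with fio against a scratch file (4k random writes)
$ fio --name=randwrite --filename=/mnt/scratch/fio.tmp --size=1G \
      --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio \
      --direct=1 --runtime=60 --time_based --group_reporting

# Ceph-specific benchmark with rados bench against a dedicated test pool
$ rados bench -p testbench 60 write --no-cleanup
$ rados bench -p testbench 60 seq
$ rados -p testbench cleanup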
CephFS mount issues on Hosts
Make sure you have an (active) Linux kernel of version 4.17 or higher (recommended is 5.x or higher).
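A quick way to check which kernel a host is currently running (nothing Ceph-specific, just a sanity check):
# Should report 4.17 or higher (ideally a 5.x kernel)
$ uname -r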
HEALTH_WARN 1 large omap objects
Issue
HEALTH_WARN 1 large omap objects
# and/or
LARGE_OMAP_OBJECTS 1 large omap objects
Solution
The following command should fix the issue1:
radosgw-admin reshard stale-instances rm
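If you want to see what the warning refers to before removing anything, these read-only checks can help; a small sketch assuming the large omap objects come from stale RGW bucket-index shards, as in the mailing-list thread referenced in footnote 1:
# Show which pool the large omap warning points at
$ ceph health detail

# List the stale bucket-index shard instances that would be removed
$ radosgw-admin reshard stale-instances list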
MDSs report oversized cache
Issue
Ceph health status reports, e.g., 1 MDSs report oversized cache.
[root@rook-ceph-tools-86d54cbd8d-6ktjh /]# ceph -s
  cluster:
    id:     67e1ce27-0405-441e-ad73-724c93b7aac4
    health: HEALTH_WARN
            1 MDSs report oversized cache
[...]
Solution
You can try to increase the mds cache memory limit setting2.
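A minimal sketch of raising that limit at runtime through the Ceph config store; the 4 GiB value is only an example, and in a Rook cluster you would normally size this via the CephFilesystem CRD resources instead (see footnote 2):
# Raise the MDS cache memory limit to 4 GiB (value in bytes, example only)
$ ceph config set mds mds_cache_memory_limit 4294967296

# Verify the currently configured value
$ ceph config get mds mds_cache_memory_limit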
Find Device OSD is using
Issue
You need to find out which disk/device is used by an OSD daemon.
Scenarios: smartctl is showing that the disk should be replaced, the disk has already failed, etc.
Solution
Use the various ls* subcommands of ceph device.
$ ceph device --help
device check-health Check life expectancy of devices
device get-health-metrics <devid> [<sample>] Show stored device metrics for the device
device info <devid> Show information about a device
device light on|off <devid> [ident|fault] [--force] Enable or disable the device light. Default type is `ident`
'Usage: device
light (on|off) <devid> [ident|fault] [--force]'
device ls Show devices
device ls-by-daemon <who> Show devices associated with a daemon
device ls-by-host <host> Show devices on a host
device ls-lights List currently active device indicator lights
device monitoring off Disable device health monitoring
device monitoring on Enable device health monitoring
device predict-life-expectancy <devid> Predict life expectancy with local predictor
device query-daemon-health-metrics <who> Get device health metrics for a given daemon
device rm-life-expectancy <devid> Clear predicted device life expectancy
device scrape-daemon-health-metrics <who> Scrape and store device health metrics for a given daemon
device scrape-health-metrics [<devid>] Scrape and store device health metrics
device set-life-expectancy <devid> <from> [<to>] Set predicted device life expectancy
The ceph device subcommands allow you to do even more, e.g., turn on a disk's light in the server chassis. Enabling the light for a disk helps datacenter workers easily locate it and avoid replacing the wrong disk.
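For example, the identification light can be toggled like this; the device ID here is just the anonymized example ID used further below (use the IDs reported by ceph device ls on your cluster):
# Turn the identification light on for a given device ID
$ ceph device light on SAMSUNG_MZVL2512HCJQ-00B00_S1234567890123 ident

# ... and off again once the disk has been located/replaced
$ ceph device light off SAMSUNG_MZVL2512HCJQ-00B00_S1234567890123 ident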
Locate the disk of an OSD by its daemon ID (e.g., OSD 13):
$ ceph device ls-by-daemon osd.13
DEVICE HOST:DEV EXPECTED FAILURE
SAMSUNG_MZVL2512HCJQ-00B00_S1234567890123 HOSTNAME:nvme1n1
Show all disks by host (hostname):
$ ceph device ls-by-host HOSTNAME
DEVICE DEV DAEMONS EXPECTED FAILURE
SAMSUNG_MZVL2512HCJQ-00B00_S1234567890123 nvme1n1 osd.5
SAMSUNG_MZVL2512HCJQ-00B00_S1234567890123 nvme0n1 osd.2
SAMSUNG_MZVL2512HCJQ-00B00_S1234567890123 nvme2n1 osd.8
SAMSUNG_MZVL2512HCJQ-00B00_S1234567890123 nvme3n1 osd.13
CephOSDSlowOps Alerts
Issue
TODO
Things to Try
- Ensure the disks you are using are healthy
- Check the SMART values (see the example below). A bad disk can lock up an application (such as a Ceph OSD) or, worse, the whole server.
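A quick example of both checks; the device path /dev/nvme1n1 is a placeholder and smartctl must be installed on the host:
# SMART health, attributes, and error log for an NVMe disk (placeholder path)
$ smartctl -a /dev/nvme1n1

# Let Ceph check the health metrics it has collected for its devices
$ ceph device check-health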
Should this page not have yielded a solution, check out the Rook Ceph Common Issues doc as well.
Footnotes
- Report/source for the information regarding this issue: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-December/037633.html ↩
- Rook Ceph Docs v1.7 - Ceph Filesystem CRD - MDS Resources Configuration Settings ↩