Ceph deep scrubbing and I/O saturation

Ceph is a distributed storage system used in cloud environments. Recently I’ve been facing I/O congestion during the night period.

This I/O saturation impacts application performance on OpenStack, even though the storage system itself is quite resilient to this level of activity.

In this post I’ll explain why I ended up in such a situation on the Ceph infrastructure and which settings can be changed to reduce this I/O level.

Ceph scrubbing

Ceph uses two types of scrubbing to check storage health. The scrubbing process usually runs on a daily basis.

  • normal scrubbing – catches OSD bugs or filesystem errors. This one is usually light and does not impact I/O performance, as shown on the graph above.
  • deep scrubbing – compares the data in PG objects, bit for bit. It looks for bad sectors on the disks. This processing is I/O intensive.

The I/O load shown in the metric history above was caused by the deep scrubbing process running daily from 0:00am to 7:00am. To understand this, let’s look at the configuration I had.

Deep-Scrubbing parameters

osd scrub begin/end hour – these parameters define the scrubbing time window. In my example it is set between 0am and 7am. Deep scrubbing takes time, and if the last deep scrub starts just before 7am, it can finish about 1 hour later. This is why the I/O level decreased after 7am but stayed high until 8am.

It’s important to set the scrubbing time window according to the server activity. Nightly processing makes sense if the workload is regional and driven by human transactions. For batch or international workloads, a full-day window (from 0:00 to 24:00) looks more appropriate.
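As a minimal sketch, the window could be set through the centralized configuration. The helper name scrub_window_cmds is hypothetical, and the 0 to 23 range check is an assumption about the values accepted by your Ceph release. The function only prints the ceph config set commands so they can be reviewed before being applied.

```shell
# Sketch: print (do not run) the commands that would set the scrub window.
# Pipe the output to `sh` once you are happy with it.
scrub_window_cmds() {
    begin=$1
    end=$2
    for hour in "$begin" "$end"; do
        case "$hour" in
            ''|*[!0-9]*)
                echo "not a whole hour: '$hour'" >&2
                return 1
                ;;
        esac
        # Assumption: hours are accepted in the 0-23 range.
        if [ "$hour" -gt 23 ]; then
            echo "hour out of range 0-23: $hour" >&2
            return 1
        fi
    done
    # The "osd" target applies the setting to every OSD daemon.
    echo "ceph config set osd osd_scrub_begin_hour $begin"
    echo "ceph config set osd osd_scrub_end_hour $end"
}

# Window used in this post: from 0am to 7am.
scrub_window_cmds 0 7
```

Printing instead of executing keeps the sketch harmless on a production cluster.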

osd deep scrub interval – this parameter defines the period after which a PG must be deep scrubbed again. The default setting is once a week. It means that when deep scrubbing schedules its work, it puts every PG that has not been deep scrubbed within this interval into the batch. Basically, the level of parallel deep scrubbing depends on this parameter and on the time window. A one-week period seems a really conservative setting.

After some investigation, it appeared that even though the interval was 1 week, no PG had a last deep scrub older than 4 days. As a consequence, extending the interval or the time window would not help, since the PGs were already processed more often than the configured interval. This assumption is partially true but also wrong. The explanation for this over-processing was found in the Ceph code:

scrubber.time_for_deep = ceph_clock_now() >=
    info.history.last_deep_scrub_stamp + deep_scrub_interval;

bool deep_coin_flip = false;
// If we randomize when !allow_scrub && allow_deep_scrub, then it guarantees
// we will deep scrub because this function is called often.
if (!scrubber.time_for_deep && allow_scrub)
  deep_coin_flip = (rand() % 100) < cct->_conf->osd_deep_scrub_randomize_ratio * 100;

scrubber.time_for_deep = (scrubber.time_for_deep || deep_coin_flip);

Basically, deep scrubbing also depends on a fourth parameter, even if the documentation is not really clear about it:

osd_deep_scrub_randomize_ratio – this is the probability that a PG which is not yet due for deep scrubbing is added to the deep-scrubbing batch anyway. The value is between 0 and 1.

In this case, the initial setting was 15%. Assuming the deep-scrubbing selection is made once a day, each PG is statistically picked by the random procedure about once a week (7 × 15% ≈ 1), on top of the interval-based selection. In other words, statistically more than 50% of the PGs are processed twice a week. This is why all the PGs were processed within a 4-day period even with a 1-week interval.
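To put rough numbers on this, here is a small sketch that computes the probability for a PG to be picked by the random draw at least once in a given number of days, 1 - (1 - ratio)^days. It keeps the assumption made above of one independent draw per PG per day; the function name p_random_deep_scrub is only illustrative.

```shell
# Probability that a PG is drawn at least once in `days` daily passes,
# each pass drawing it with probability `ratio`: 1 - (1 - ratio)^days.
# Assumption: exactly one independent draw per PG per day.
p_random_deep_scrub() {
    LC_ALL=C awk -v ratio="$1" -v days="$2" \
        'BEGIN { printf "%.3f\n", 1 - (1 - ratio) ^ days }'
}

p_random_deep_scrub 0.15 4   # within 4 days
p_random_deep_scrub 0.15 7   # within one week
```

If your cluster cycles through its PGs faster than these figures suggest, the draw is probably made more than once a day, which is what the comment in the Ceph code ("this function is called often") hints at.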

If you change osd deep scrub interval without changing osd_deep_scrub_randomize_ratio, your workload will not be reduced. So you need to tune these two together.

The objective of osd_deep_scrub_randomize_ratio is to spread the deep scrub operations evenly over time; otherwise we could have big I/O peaks on the PG creation anniversary, even if this would smooth out interval after interval. When the random ratio matches the interval (basically 15% per day for a one-week interval), it creates an even distribution of PG deep scrubbing over the days. But it also creates an over-processing of about 150%. So once the PG deep-scrubbing workload has been spread over the whole period, it can be interesting to reduce this ratio and let the interval expiration do its job.

Another way to work could be to let the ratio define the period (in a statistical approach) and use the interval as a garbage collector for the PGs not yet processed (the unlucky ones). This seems to me a good setting: for example, a random ratio of 5% to process each PG statistically about every 3 weeks, and an interval of 1 month for the garbage collection.
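The 5% and 1-month example can be checked with a quick sketch. Still assuming one independent draw per PG per day, the mean time between two random deep scrubs is 1 / ratio days, and the share of PGs left for the interval to collect is (1 - ratio)^interval_days. The function name ratio_vs_interval is hypothetical.

```shell
# For a daily random ratio and an interval in days, print:
#  - mean_days: average number of days between two random deep scrubs
#  - leftover:  fraction of PGs never drawn before the interval expires
# Assumption: one independent draw per PG per day.
ratio_vs_interval() {
    LC_ALL=C awk -v ratio="$1" -v days="$2" 'BEGIN {
        if (ratio <= 0) {
            print "ratio must be greater than 0" > "/dev/stderr"
            exit 1
        }
        printf "mean_days=%.1f leftover=%.2f\n", 1 / ratio, (1 - ratio) ^ days
    }'
}

# 5% ratio with a 30-day interval, as suggested above.
ratio_vs_interval 0.05 30
```

With these values the mean is 20 days, so about 3 weeks, and roughly one PG out of five still reaches the 30-day interval and is handled by the "garbage collector".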

Analyse your scrubbing rate

Understanding your scrubbing and deep scrubbing rate is important to understand the system saturation and to compute the normal time needed to process your scrubbing activities. Looking at the history helps.

This command gives the PG states and the last scrubbing / deep scrubbing dates.

[~] ceph pg dump

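You can look for the oldest deep scrubbing date of a PG. The list is sorted newest first, so the oldest ones are at the end. The column numbers match my Ceph version; check the header line of ceph pg dump if your output differs.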

[~] ceph pg dump | awk '$1 ~/[0-9a-f]+\.[0-9a-f]+/ {print $25, $26, $1}' | sort -rn

For the light scrubbing:

[~] ceph pg dump | awk '$1 ~/[0-9a-f]+\.[0-9a-f]+/ {print $22, $23, $1}' | sort -rn

You can filter by date to get the number of PGs processed on a given day, and so your capacity to process a certain number of PGs per day for deep scrubbing.

[~] ceph pg dump | awk '$1 ~/[0-9a-f]+\.[0-9a-f]+/ {print $25, $26, $1}' | sort -rn | grep 2020-XX-YY | wc -l

You can change the column numbers to get the same thing for light scrubbing.

These elements are important to understand whether your system is saturating because there is too much work for too short a time window, or because the random ratio is too high and you have too many scrubbing operations on a daily basis.
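Rather than filtering one date at a time, the per-day counts can be produced in one pass. This is a small sketch: count_pgs_per_day is a hypothetical helper that expects the "DATE TIME PGID" lines produced by the awk filters above on its standard input.

```shell
# Count PGs per scrub date.
# Input (stdin): lines of the form "DATE TIME PGID".
# Output: one "DATE COUNT" line per day, sorted by date.
count_pgs_per_day() {
    awk '{ count[$1]++ } END { for (day in count) print day, count[day] }' | sort
}

# Example with the deep-scrub columns used above (25 and 26):
# ceph pg dump | awk '$1 ~/[0-9a-f]+\.[0-9a-f]+/ {print $25, $26, $1}' | count_pgs_per_day
```

A day with a much higher count than the others is a good candidate to compare against the I/O graphs.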

Going further with scrubbing parameters

Some other parameters have an impact on the load: once the work to do has been defined, it still needs to be scheduled and executed.

  • osd max scrubs – the maximum number of simultaneous scrub operations for a single OSD (drive). It is usually 1. But you can have many scrub operations running on the same host on different OSDs. This can impact the global I/O performance, particularly when buses are shared between drives.

Scrub processing is organized in chunks of work. The chunk size is defined by two values: osd scrub chunk min and osd scrub chunk max. It seems to be a batch of scrub operations, but currently it’s not clear to me whether a chunk contains a mix of deep scrubs and light scrubs. The processing of two consecutive chunks is separated by a potential sleep period given by osd scrub sleep. This parameter has no unit described, so it is a bit complex to understand how to use it. I’ve seen settings with a 0.1 value; is that 10% (of what?).

Priority

Ceph I/Os are queued and processed with a priority. This way scrub operations can run without impacting the user I/O too much. A higher value means a higher priority. Different priority parameters are involved:

  • osd requested scrub priority – the priority given to manually started scrub operations (light and deep). This can be important, as I’ve seen different people replacing the built-in scheduler with a script. The default value is 120, which is usually higher than the client priority, so with this setting such requests will impact performance.
  • osd scrub priority – the priority given to automatically scheduled scrub operations (light and deep). The default value is 5. Considering 4 as best-effort, 5 is low.
  • osd client op priority – the priority given to client (customer) I/O. The maximum value is 63, which is also the default.

Block size

The read block size can also impact the global scrubbing performance. The osd deep scrub stride parameter sets the block size to be read. The default value is 512 KiB (524288 bytes). According to different readings, for SSDs the block size doesn’t really matter above 16 KiB, whereas for mechanical disks 512 KiB is a good compromise, but you can improve the throughput with 1 MiB or 2 MiB blocks.

System load

The osd scrub load threshold parameter makes it possible to stop scrubbing when the system load (loadavg / number of online CPUs) is above the given limit. Potentially this helps to stop scrubbing during intensive user I/O. But scrubbing also creates load, so from my point of view it’s a bit complex to tune this parameter correctly. It makes sense for Ceph servers with undersized CPUs.
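To get a feeling for where a host stands against this limit, the same ratio can be computed by hand. This is only a sketch of the comparison described above, not Ceph's actual decision logic; the helper name load_allows_scrub and the 0.5 threshold in the example are assumptions for illustration.

```shell
# Succeed (exit 0) when the load per CPU is below the threshold.
# Arguments: load average, number of online CPUs, threshold.
# Assumption: a plain "load / cpus < threshold" test, which may be
# simpler than what the OSD really does.
load_allows_scrub() {
    LC_ALL=C awk -v avg="$1" -v cpus="$2" -v limit="$3" \
        'BEGIN { exit !(cpus > 0 && avg / cpus < limit) }'
}

# On Linux the inputs can be read from /proc/loadavg and nproc:
# load_allows_scrub "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" 0.5 && echo "below threshold"
```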

Scheduling scrub

One thing is still unclear: when is the scheduling of the scrubbing actions executed? Is it on every batch window (so once a day), or is it after every chunk, with the scheduling ending as soon as the chunk is full? In that case, how does the system ensure that all the PGs are processed over the different schedules? This information is key to understanding the impact of the random PG selection in scrub scheduling. Unfortunately, right now, I have not found the answer in the available documentation. Any link is welcome.

Useful related commands

The configuration is attached to each OSD. To list the parameter values for one OSD:

ceph config show-with-defaults osd.x   # with x an OSD number

To see one of the parameters on all the OSDs:

ceph tell osd.* config get osd_deep_scrub_randomize_ratio

Get some basic insight into OSD and disk performance. Some preliminary checks are worthwhile, as your scrubbing I/O issue could in fact be a general I/O performance issue.

hdparm -Tt /dev/sdx
ceph tell osd.x bench -f plain