Part 1: Ceph Planning Summary and Architectural Overview

Ceph is an open-source, distributed, scale-out, software-defined storage system. Through the use of the Controlled Replication Under Scalable Hashing (CRUSH) algorithm, Ceph eliminates the need for centralized metadata and can distribute the load across all nodes in the cluster.

Ceph provides three main types of storage: block storage via the RADOS Block Device (RBD), file storage via CephFS, and object storage via the RADOS Gateway, which provides S3- and Swift-compatible storage.

How Ceph works

The core storage layer in Ceph is the Reliable Autonomic Distributed Object Store (RADOS). The RADOS layer in Ceph consists of a number of Object Storage Daemons (OSDs) and Ceph Monitors (MONs).

Each OSD is completely independent and forms peer-to-peer relationships with other OSDs to form a cluster. Each OSD is typically mapped to a single disk, in contrast to the traditional approach of presenting a number of disks, combined into a single device via a RAID controller, to the OS.

The OSD serves data from its drive, or ingests data and stores it on the drive. The OSD also ensures storage redundancy by replicating data to other OSDs based on the CRUSH map.

When a drive goes down, the OSD will go down too, and the monitor nodes will distribute an updated CRUSH map so the clients are aware and know where to get the data. The OSDs also respond to this update: because redundancy has been lost, they may start to replicate the under-replicated data to make it redundant again across the remaining nodes.

Objects are grouped into placement groups, and an algorithm called CRUSH is then used to place the placement groups onto the OSDs. This reduces the task of tracking millions of objects to tracking a much more manageable number of placement groups, normally measured in the thousands.

Pools are logical partitions for storing objects. When you first deploy a cluster without creating a pool, Ceph uses the default pools for storing data. Ceph clients retrieve a cluster map from a Ceph Monitor and write objects to pools. The pool’s size (number of replicas), the CRUSH rule, and the number of placement groups determine how Ceph will place the data.

So an object lives in a pool and is associated with one placement group. Depending on the properties of the pool, the placement group is associated with as many OSDs as the replication count; for example, with a replication count of three, each placement group is associated with three OSDs: a primary OSD and two secondary OSDs. The primary OSD serves data and peers with the secondary OSDs for data redundancy. If the primary OSD goes down, a secondary OSD can be promoted to primary to serve data, allowing for high availability.

A monitor or MON node is responsible for helping reach a consensus in distributed decision making using the Paxos protocol. It’s important to keep in mind that the Ceph monitor node does not store or process any metadata. It only keeps track of the CRUSH map for both clients and individual storage nodes.

In Ceph, consistency is favored over availability. A majority of the configured monitors need to be available for the cluster to be functional. For example, if there are two monitors and one fails, only 50% of the monitors are available so the cluster would not function. But if there are three monitors, the cluster would survive one node’s failure and still be fully functional.
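The majority requirement can be sketched as a simple quorum check (an illustrative sketch only; Ceph's monitors implement this inside their Paxos machinery):

```python
def has_quorum(configured_monitors: int, available_monitors: int) -> bool:
    # A strict majority of the configured monitors must be reachable
    # for the cluster to keep functioning.
    return available_monitors > configured_monitors // 2

print(has_quorum(2, 1))  # False: 1 of 2 is only 50%, not a majority
print(has_quorum(3, 2))  # True: a 3-monitor cluster survives one failure
```

This is also why production clusters are deployed with an odd number of monitors, typically three or five.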

How to plan a successful Ceph implementation

7.2K RPM disks = 70–80 4K IOPS

10K RPM disks = 120–150 4K IOPS

15K RPM disks = you should be using SSDs

As a general rule, if you are designing a cluster that will offer active workloads rather than bulk inactive/archive storage, then you should design for required IOPS and not capacity. If your cluster will largely contain spinning disks with the intention of providing storage for an active workload, then you should prefer an increased number of smaller capacity disks rather than the use of larger disks.
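As a rough planning aid, the IOPS-first approach can be sketched like this (a back-of-the-envelope sketch; the per-disk figures above are assumptions, and it ignores caching, journaling, and any other write amplification further down the Ceph stack):

```python
import math

def disks_needed(read_iops: int, write_iops: int,
                 per_disk_iops: int, replication: int = 3) -> int:
    # Each client write becomes `replication` backend writes,
    # so the backend must absorb the amplified write load.
    backend_iops = read_iops + write_iops * replication
    return math.ceil(backend_iops / per_disk_iops)

# For example: 2,000 read + 1,000 write 4K IOPS on 7.2K RPM disks (~75 IOPS each)
print(disks_needed(2000, 1000, 75))  # 67
```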

As mentioned, pools are logical partitions for storing objects.

In Ceph, the objects belong to pools and pools are comprised of placement groups. Each placement group maps to a list of OSDs. This is the critical path you need to understand.

Pools are how Ceph divides the global storage. This division, or partition, is the abstraction used to define the resilience (number of replicas, etc.), the number of placement groups, the CRUSH ruleset, the ownership, and so on. We can consider this abstraction the right place to define your policies, so each pool handles its own number of replicas, number of placement groups, and so on.

The placement group is the abstraction used by Ceph to map objects to OSDs in a dynamic way. You can consider it as the placement or distribution unit in Ceph.

So how do we go from objects to OSDs via pools and placement groups? It is straightforward.

In Ceph, one object is stored in a specific pool, so the pool identifier (a number) and the name of the object are used to uniquely identify the object in the system.

Those two values, the pool id and the name of the object, are used to get a placement group via hashing.

When a pool is created, it is assigned a number of placement groups (PGs).

With the pool identifier and the hashed name of the object, Ceph computes the hash modulo the number of PGs to obtain the placement group; CRUSH then maps the placement group to its dynamic list of OSDs.

In detail, the steps to compute the placement group for the object named ‘tree’ in the pool ‘images’ (pool id 7), with 65536 placement groups, would be…

  1. Hash the object name: hash(‘tree’) = 0xA062B8CF
  2. Calculate the hash modulo the number of PGs: 0xA062B8CF % 65536 = 0xB8CF
  3. Get the pool id: ‘images’ = 7
  4. Prepend the pool id to 0xB8CF to get the placement group: 7.B8CF

Ceph uses this new placement group (7.B8CF) together with the cluster map and the placement rules to get the dynamic list of OSDs…

The size of this list is the number of replicas configured in the pool. The first OSD in the list is the primary, the next one is the secondary and so on.
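The four steps above can be sketched as follows (illustrative only: real Ceph uses its own rjenkins hash and a "stable mod" so that pg_num need not be a power of two; here the hash value from the example is supplied directly):

```python
def compute_pg(pool_id: int, pg_num: int, object_hash: int) -> str:
    # Take the object hash modulo the number of PGs in the pool,
    # then prepend the pool id to form the placement group id.
    pg = object_hash % pg_num
    return f"{pool_id}.{pg:X}"

# The 'tree' object in the 'images' pool (id 7) with 65536 PGs:
print(compute_pg(7, 65536, 0xA062B8CF))  # 7.B8CF
```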

RADOS Pools and Client Access

Replicated pools:

Replicated RADOS pools are the default pool type in Ceph; data is received by the primary OSD from the client and then replicated to the remaining OSDs. The logic behind the replication is fairly simple and requires minimal processing to calculate and replicate the data between OSDs. However, data has to be written multiple times across the OSDs. By default, Ceph uses a replication factor of 3x, so all data is written three times; this does not take into account any other write amplification that may be present further down in the Ceph stack. This write penalty has two main drawbacks: it puts further I/O load on your Ceph cluster, as there is more data to be written, and in the case of SSDs, these extra writes wear out the flash cells more quickly.

Erasure code pools:

Ceph’s default replication level provides excellent protection against data loss by storing three copies of your data on different OSDs. However, storing three copies of data vastly increases both the purchase cost of the hardware and the associated operational costs, such as power and cooling. Furthermore, storing copies also means that for every client write, the backend storage must write three times the amount of data.

Erasure coding allows Ceph to achieve either greater usable storage capacity or increased resilience to disk failure for the same number of disks. Erasure coding achieves this by splitting up the object into a number of parts and also calculating a type of cyclic redundancy check (CRC), the erasure code, and then storing the results in one or more extra parts. Each part is then stored on a separate OSD. These parts are referred to as K and M chunks, where K refers to the number of data shards and M refers to the number of erasure code shards. As in RAID, these can often be expressed in the form K+M, such as 4+2. A 3+1 configuration will give you 75% usable capacity but only allows for a single OSD failure, and so would not be recommended. In comparison, a three-way replica pool gives you only 33% usable capacity. A 4+2 configuration gives you 66% usable capacity and allows for two OSD failures.
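The usable-capacity figures quoted above follow directly from K and M, as this small sketch shows:

```python
def ec_usable_fraction(k: int, m: int) -> float:
    # K data shards out of K+M total shards hold real data.
    return k / (k + m)

def replica_usable_fraction(size: int) -> float:
    # One logical copy out of `size` stored copies.
    return 1 / size

print(round(ec_usable_fraction(3, 1), 2))    # 0.75: 75% usable, 1 OSD failure tolerated
print(round(ec_usable_fraction(4, 2), 2))    # 0.67: ~66% usable, 2 OSD failures tolerated
print(round(replica_usable_fraction(3), 2))  # 0.33: 33% usable with 3x replication
```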

These smaller shards generate a large amount of small I/O and cause an additional load on some clusters.

Reading back from these high-chunk-count pools is also a problem. Unlike in a replica pool, where Ceph can read just the requested data from any offset in an object, in an erasure-coded pool all shards from all OSDs have to be read before the read request can be satisfied. In the 18+2 example, this can massively amplify the number of required disk read ops, and average latency will increase as a result. A 4+2 configuration, in some instances, can see a performance gain compared to a replica pool as a result of splitting an object into shards: since data is effectively striped over a number of OSDs, each OSD has to write less data.

Erasure code pools vs. replicated pools

All of this information is summarised from, and covered in more depth in, the Mastering Ceph book and the official Ceph documentation.
