Ceph: An Overview

Ceph is open source, software-defined storage maintained by Red Hat. It is capable of block, object, and file storage. Ceph clusters are designed to run on commodity hardware, using an algorithm called CRUSH (Controlled Replication Under Scalable Hashing). This algorithm ensures that data is distributed evenly across the cluster and can be located quickly, without a central lookup table. Replication, thin provisioning, and snapshots are key features of Ceph storage.

There are good reasons for using Ceph as IaaS and PaaS storage:

  • Scale your operations and move to market faster.
  • Bridge the gaps between application development and data science.
  • Gain deeper insights into your data.
  • File, block and object storage in a single system.
  • Better transfer speed and lower latency.
  • Easily accessible storage that can quickly scale up or down.

Site: https://ceph.io/

This technology enables fast, uninterrupted operations while scaling to petabytes with high performance. This kind of hyper-scalable storage fits cloud models very well.

1. Components

1.1 Cluster components

A Ceph cluster works with the following components:

  • OSD (Object Storage Daemon): Each Ceph storage node runs one or more Ceph OSD daemons (one per disk). The OSD is the daemon that handles data storage, replication and data recovery. With the FileStore backend, the file systems commonly used are XFS, btrfs and ext4 (the newer BlueStore backend, the default since Luminous, manages raw devices directly).
  • Monitor: The Ceph Monitor is the daemon responsible for maintaining the master copy of the cluster map. A cluster should run at least three monitors so that a quorum (a majority) survives the failure of one of them, which ensures high availability.
  • RADOS Gateway: The RADOS Gateway (RGW) provides an API service that is compatible with S3 and Swift and connects directly to Ceph.
  • Metadata Server: The MDS handles all file operations and uses RADOS objects to store data and file system attributes. It can be scaled horizontally by adding more Ceph Metadata Servers to support more clients.
  • Ceph Manager: The Ceph Manager daemon (ceph-mgr) runs alongside the monitor daemons to provide additional monitoring and interfaces to external monitoring and management systems (available from the Luminous release onwards).

1.2 Storage Clients

Ceph clients connect to the cluster through a block interface (RBD), an object interface (RGW) or a file system interface (CephFS). The most commonly used interface is RBD.

  • S3 / Swift (RGW): Consumes the services of the RADOS Gateway over the network (HTTP) to store objects.
  • RBD: A reliable, fully distributed block device with cloud platform integration.
  • CephFS: The Ceph Filesystem (CephFS) is a POSIX-compatible file system that uses a Ceph cluster to store its data. To use it, the Ceph cluster must include at least one Ceph Metadata Server (MDS).
  • LIBRADOS: A library that allows applications to access RADOS directly (C, C++, Java, Python, Ruby, PHP).

2. Inside of the solution

2.1 Authentication

Ceph uses cephx, an authentication system similar to Kerberos, to authenticate users and daemons; it does not rely on SSL or TLS. Cephx uses shared secret keys for mutual authentication, which means that both the client and the monitors in the cluster hold a copy of the client’s secret key.
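
To make the idea of mutual authentication with a shared secret more concrete, here is a minimal sketch in Python. It is not the cephx protocol: it only shows a generic HMAC challenge-response in which each side proves it holds the same key without sending the key itself. The names prove, verify and shared_key are made up for this illustration.

```python
import hashlib
import hmac
import secrets


def prove(secret: bytes, challenge: bytes) -> bytes:
    """Prove knowledge of `secret` by keying an HMAC over the challenge."""
    return hmac.new(secret, challenge, hashlib.sha256).digest()


def verify(secret: bytes, challenge: bytes, proof: bytes) -> bool:
    """Check a proof in constant time against our own copy of the secret."""
    return hmac.compare_digest(prove(secret, challenge), proof)


# Hypothetical key: both the client and the monitor hold a copy.
shared_key = secrets.token_bytes(32)

# The monitor challenges the client...
monitor_challenge = secrets.token_bytes(16)
client_proof = prove(shared_key, monitor_challenge)
monitor_accepts = verify(shared_key, monitor_challenge, client_proof)

# ...and the client challenges the monitor (mutual authentication).
client_challenge = secrets.token_bytes(16)
monitor_proof = prove(shared_key, client_challenge)
client_accepts = verify(shared_key, client_challenge, monitor_proof)

# A party holding the wrong key fails the check.
impostor_accepted = verify(
    shared_key, monitor_challenge, prove(b"wrong key", monitor_challenge)
)

print(monitor_accepts, client_accepts, impostor_accepted)
```

The real cephx exchange also involves session keys and tickets, which this sketch leaves out on purpose.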

2.2 Cluster Map

Ceph tracks the full cluster topology in five maps, collectively called the “Cluster Map”:

  • Monitor Map: Contains the fsid of the cluster and the position, name, address and port of each monitor. It also indicates the current epoch, when the map was created and the last time it was changed. To view a monitor map, run ceph mon dump.
  • OSD Map: Contains the cluster fsid, when the map was created and last modified, a list of pools, replica sizes, PG numbers, and a list of OSDs and their status (for example, up, in). To view an OSD map, run ceph osd dump.
  • PG Map: Contains the PG version, its time stamp, the last OSD map epoch, the full ratios and details of each placement group, such as the PG ID, the Up Set, the Acting Set, the PG state (for example, active + clean) and data usage statistics for each pool.
  • CRUSH Map: Contains a list of storage devices, the failure domain hierarchy (for example, device, host, rack, row, room, etc.) and rules for traversing the hierarchy when storing data. To view a CRUSH map, export it with ceph osd getcrushmap -o <file> and decompile it with crushtool -d <file> -o <decompiled-file>.
    You can view the decompiled map in a text editor.
  • MDS Map (CephFS): Contains the current MDS map epoch, when the map was created and the last time it was changed. It also contains the pool for storing metadata, a list of metadata servers and which metadata servers are up and in. To view an MDS map, run ceph fs dump (ceph mds dump on older releases).

The CRUSH map process:

  1. The client contacts the monitors to obtain an up-to-date copy of the cluster map.
  2. The client receives the PG and OSD maps.
  3. The client runs CRUSH itself to map the object to a PG and the PG to OSDs, then sends the data to be written to the cluster.
  4. The write and replication process is then carried out (detailed below).

Ceph supports a cluster of monitors so that the failure of a single monitor does not cost the cluster its current state. For this reason the monitors use the Paxos algorithm to establish consensus among themselves and keep the cluster state up to date.
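
Consensus among monitors depends on a majority being reachable, which is why monitor counts are usually odd. The arithmetic can be sketched in a few lines of Python; the function names are made up for this illustration and are not part of any Ceph API.

```python
def quorum_size(monitors: int) -> int:
    """Smallest majority of `monitors`."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can fail while a majority remains."""
    return monitors - quorum_size(monitors)


for n in (1, 2, 3, 5):
    print(f"{n} monitors: quorum {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Note that two monitors tolerate no more failures than one, while three tolerate one and five tolerate two.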

NOTE: The client talks to the monitors only to refresh its copy of the cluster map (including the CRUSH map); data is written directly to the storage nodes, without going through the monitors.

2.3 Logical Data

Ceph uses three components to separate data logically:

  • Pools: Objects in Ceph are stored in pools. Each pool is divided into pg_num placement groups (PGs), and each PG holds a fragment of the pool’s overall set of objects.
  • Placement Groups: Ceph maps objects to placement groups (PGs), which are fragments of a pool’s set of objects; each PG is in turn mapped to several OSDs.
  • CRUSH Map: CRUSH is what allows Ceph to scale without performance bottlenecks, without scalability limitations and without a single point of failure. CRUSH maps provide the physical topology of the cluster to the CRUSH algorithm, which determines where an object’s data and its replicas should be stored and how to spread them across failure domains for greater data safety.
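
The chain object, then PG, then OSDs can be sketched in Python as follows. This is a simplified stand-in and not Ceph's implementation: SHA-256 replaces Ceph's rjenkins hash, a plain modulo replaces its PG mapping, and rendezvous hashing replaces CRUSH. The pool name, object name and OSD list are hypothetical. What the sketch shares with CRUSH is that placement is computed from the inputs rather than looked up in a central table.

```python
import hashlib


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash (stand-in for Ceph's rjenkins hash)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_to_pg(object_name: str, pg_num: int) -> int:
    """Map an object name to one of the pool's pg_num placement groups."""
    return stable_hash(object_name) % pg_num


def pg_to_osds(pool: str, pg: int, osds: list[str], size: int) -> list[str]:
    """Pick `size` distinct OSDs for a PG by rendezvous hashing.

    This is NOT CRUSH; it ignores weights and failure domains.
    """
    ranked = sorted(
        osds,
        key=lambda osd: stable_hash(f"{pool}.{pg}.{osd}"),
        reverse=True,
    )
    return ranked[:size]


osds = [f"osd.{i}" for i in range(6)]  # hypothetical six-OSD cluster
pg = object_to_pg("my-object", pg_num=128)
acting = pg_to_osds("poola", pg, osds, size=3)
print(f"my-object -> PG {pg} -> {acting}")
```

Because every client runs the same calculation over the same map, all of them reach the same OSDs for a given object without asking a central server.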

A big picture of logical data flow

2.4 Writing and Reading Process

The write process is carried out as follows in the cluster:

  1. The client writes the object to the identified PG on the primary OSD.
  2. The primary OSD, using its own copy of the CRUSH map, identifies the other OSDs for that PG.
  3. The object is replicated to the second OSD.
  4. The object is replicated to the third OSD.
  5. The client receives confirmation that the write succeeded.
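
The write steps above can be sketched as a toy in-memory model in Python. The OSD class and client_write function are hypothetical and only mirror the order of operations: the primary stores the object, forwards it to the other OSDs, and the client is told the write succeeded only after every copy is acknowledged.

```python
class OSD:
    """Toy OSD that stores objects in a dict (hypothetical, in-memory)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.objects: dict[str, bytes] = {}

    def store(self, key: str, data: bytes) -> bool:
        self.objects[key] = data
        return True  # acknowledge the write


def client_write(key: str, data: bytes, acting_set: list[OSD]) -> bool:
    """Write via the primary, which replicates before acknowledging."""
    primary, *replicas = acting_set
    acks = [primary.store(key, data)]
    # The primary forwards the object to every other OSD in the set.
    acks += [replica.store(key, data) for replica in replicas]
    # The client only sees success once all copies are acknowledged.
    return all(acks)


acting_set = [OSD("osd.1"), OSD("osd.4"), OSD("osd.7")]
ok = client_write("my-object", b"hello", acting_set)
copies = [osd.objects.get("my-object") for osd in acting_set]
print(ok, copies)
```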

Reads follow the same lookup: the client uses CRUSH to find the primary OSD for the object’s PG and reads the object from it directly.

2.5 Data Replication

Replicated data inside the cluster:

When a PG is replicated, every OSD that CRUSH maps the PG to holds an identical copy of the PG’s objects.

If you have a pool with 128 PGs and a replica size of 3, you will have 384 PG copies (128 × 3) spread across however many OSDs you have.
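
The arithmetic above is easy to check, and extending it gives a rough idea of the load per OSD. The sketch below assumes perfectly even placement and a hypothetical cluster of 12 OSDs; the function names are made up for this illustration.

```python
def pg_copies(pg_num: int, size: int) -> int:
    """Total PG copies in a replicated pool."""
    return pg_num * size


def average_copies_per_osd(pg_num: int, size: int, osd_count: int) -> float:
    """Average PG copies per OSD if placement is perfectly even."""
    return pg_copies(pg_num, size) / osd_count


print(pg_copies(128, 3))                   # 384
print(average_copies_per_osd(128, 3, 12))  # 32.0
```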

The CRUSH algorithm distributes all blocks, and therefore all data, throughout the cluster.

2.6 Physical Separation

CRUSH maps contain a list of OSDs, a list of buckets for aggregating devices into physical locations, and a list of rules that tell CRUSH how to replicate data in the Ceph cluster’s pools.

Following the image above, the separation can be described as:

  • Rack with SATA storage: Pool A and Pool B
  • Rack with SSD storage: Pool C and Pool D

The proposed scenario after changing the CRUSH map:

Example of creating a CRUSH rule:

ceph osd crush rule create-replicated <name> <root> <failure-domain> <class>

  • <name>: The name of the rule.
  • <root>: The root of the CRUSH hierarchy.
  • <failure-domain>: The failure domain. For example: host or rack.
  • <class>: The storage device class. For example: hdd or ssd. Ceph Luminous and later only.
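
To show what a rule with a failure domain of host and a device class is meant to achieve, here is a small Python sketch. It is not how CRUSH selects devices (real CRUSH chooses pseudo-randomly and by weight); it only demonstrates the constraint that replicas come from the requested class and land on different hosts. The Device type, the choose_replicas function and the inventory are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    osd: str
    host: str
    device_class: str  # e.g. "hdd" or "ssd"


def choose_replicas(devices: list[Device], device_class: str,
                    replicas: int) -> list[Device]:
    """Pick at most one device per host from the given class.

    Simplification: takes the first matching device on each host.
    """
    chosen: dict[str, Device] = {}
    for dev in devices:
        if dev.device_class == device_class and dev.host not in chosen:
            chosen[dev.host] = dev
    if len(chosen) < replicas:
        raise ValueError("not enough hosts with that device class")
    return list(chosen.values())[:replicas]


# Hypothetical inventory: three hosts, each with one HDD and one SSD.
inventory = [
    Device("osd.0", "host-a", "hdd"), Device("osd.1", "host-a", "ssd"),
    Device("osd.2", "host-b", "hdd"), Device("osd.3", "host-b", "ssd"),
    Device("osd.4", "host-c", "hdd"), Device("osd.5", "host-c", "ssd"),
]

ssd_set = choose_replicas(inventory, "ssd", replicas=2)
print([d.osd for d in ssd_set])  # ['osd.1', 'osd.3']
```

With host as the failure domain, losing one host can take out at most one copy of any object.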

Create the CRUSH rules for the SATA and SSD hardware:

ceph osd crush rule create-replicated ssd default host ssd
ceph osd crush rule create-replicated sata default host hdd

Command to create the pools:

ceph osd pool create <pool-name> <pg-num>

Creating the pools to be physically separated:

ceph osd pool create poola 256
ceph osd pool create poolb 256
ceph osd pool create poolc 256
ceph osd pool create poold 256

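Physically separating the pools. On Luminous and later the pool option is crush_rule and takes the rule name; older releases used crush_ruleset with the numeric ruleset ID:

ceph osd pool set poola crush_rule sata
ceph osd pool set poolb crush_rule sata
ceph osd pool set poolc crush_rule ssd
ceph osd pool set poold crush_rule ssd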

In the CRUSH map file, the SATA rule has ruleset 1 and the SSD rule has ruleset 2.

Result from ceph osd dump (crush_ruleset is the pre-Luminous name of the crush_rule field):

pool 1 'sata' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 116 flags hashpspool stripe_width 0
pool 2 'ssd' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 128 pgp_num 128 last_change 117 flags hashpspool stripe_width 0
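
To check the pool-to-rule assignment without reading the dump by eye, the pool lines can be parsed with a short Python script. This is a sketch that assumes the plain-text line layout shown above; the regular expression and the parse_pool_line helper are made up for this illustration and may need adjusting for other releases.

```python
import re

POOL_LINE = re.compile(
    r"^pool (?P<id>\d+) '(?P<name>[^']+)' replicated "
    r"size (?P<size>\d+) min_size (?P<min_size>\d+) "
    r"crush_rule(?:set)? (?P<rule>\d+) .*?pg_num (?P<pg_num>\d+)"
)


def parse_pool_line(line: str) -> dict:
    """Extract pool name, replica size, CRUSH rule ID and pg_num."""
    match = POOL_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognised pool line: {line!r}")
    fields = match.groupdict()
    name = fields.pop("name")
    return {"name": name, **{k: int(v) for k, v in fields.items()}}


dump = """\
pool 1 'sata' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 116 flags hashpspool stripe_width 0
pool 2 'ssd' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 128 pgp_num 128 last_change 117 flags hashpspool stripe_width 0
"""

pools = [parse_pool_line(line) for line in dump.splitlines()]
rule_by_pool = {p["name"]: p["rule"] for p in pools}
print(rule_by_pool)  # {'sata': 1, 'ssd': 2}
```

For scripting against a live cluster, ceph osd dump --format json is likely a sturdier input than the text layout parsed here.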

I would like to share more hands-on details in my next Ceph posts.