Controlled Replication Under Scalable Hashing (#CRUSH), Ceph's Hidden Power to Face Any Storage Challenge

What would you think if you could tailor your storage solution to the way your company really needs it? It doesn't matter how much capacity or performance you think you might need, and we are talking here about anything from a few terabytes to thousands of petabytes. A storage system able to manage data optimally from a couple of servers to thousands of them spread across several datacenters. A storage solution whose data placement rules can be customized to your underlying physical infrastructure, at any datacenter, to bring the required resiliency against power, cabinet or device failures. A solution that can be tuned to your hardware decommissioning process, the way your data moves or changes, and the way your capacity scales or even shrinks.

However, "with great power comes great responsibility" (I am not quite sure, but I think I heard that byword in Spider-Man).

Storage vendors help you so you don't have to figure out, all by yourself, how to implement the best operational practices to meet your company's data capacity, performance and management demands. Vendors have people and tons of documentation to support you. Starting to work with #Ceph, on the other hand, demands a big responsibility from you and your team: you need to start understanding how it works and how to shape it around your storage needs. I mean it, you can lose your data, or worse, your credibility within the company, if you don't take the time to design it properly and think through every possible failure scenario.

… And if you mess it up, you don't have a vendor to blame.

Anyway, you can try #Ceph at your company, but again, that means you need to move out of your comfort zone and start studying and running many tests, or find somebody, like us 😉, to support you throughout this new journey.

Also, some vendors will try to divert you from the chosen ones' path by saying things like: "#Ceph is not a solution for transactional applications" or "#Ceph doesn't bring enough redundancy". Of course #Ceph doesn't come 100% finished; you need to invest time to make it stable, but once you do, you won't regret it. You need to turn yourself into a myth killer. These statements remind me of the times we were told that block storage protocols over Ethernet were not suitable for databases, and now well-known database vendors recommend them in their converged, database-optimized infrastructure stacks. Or worse, the times we were told that hardware RAID (redundant array of independent disks) always performs much better than its software counterpart, and now the best-of-breed storage technologies are fully software based (we could say that almost every storage vendor has turned into a software company).

Well, let's stop for a while bringing up reasons to take you to #Ceph, and let me develop the topic of this note.

#CRUSH is Ceph's heart, the hidden power behind the scenes. #CRUSH stands for "Controlled Replication Under Scalable Hashing", and it is the algorithm in charge of placing data on every disk device of the cluster (remember that #Ceph is a storage system composed of several nodes and their local disks, arranged and managed as a networked cluster with replication rules for redundancy). #CRUSH's settings are defined through a compiled map file that describes the replication rules between different failure domains (failure domains are physically separated pools of disks that provide redundancy against power and/or hardware failures), the bucket types (buckets are parameters arranged in a hierarchy that correlates physical infrastructure components, such as datacenter rows, racks, chassis or nodes, with the location of storage devices and stored objects) and the OSD identifiers (object-based storage devices). CRUSH speeds up placing and retrieving data via a pseudo-random data distribution function, avoiding the need to keep the mapping between every piece of data and its storage device in a central directory.
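
To give you a feel for the idea (and only the idea: this is a toy sketch in Python, not Ceph's actual CRUSH code, and the OSD names and weights are made up), here is how a deterministic, hash-based placement function lets any client compute where an object lives from nothing more than the object name and the cluster map:

```python
# Toy sketch of CRUSH-style pseudo-random placement (NOT Ceph's real code).
# Every client computes the same replica set from nothing but the object name
# and the cluster map, so there is no central directory to query or keep in sync.
import hashlib
import math

# Hypothetical cluster map: OSD id -> weight (real maps also carry the hierarchy,
# bucket types and rules; this toy skips all of that and just picks distinct OSDs).
CLUSTER_MAP = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 1.0, "osd.3": 2.0}

def straw(obj_name: str, osd: str, weight: float) -> float:
    """Deterministic pseudo-random 'straw' for one OSD, scaled so that heavier
    OSDs win proportionally more often (loosely mimicking a straw bucket)."""
    digest = hashlib.sha256(f"{obj_name}/{osd}".encode()).digest()
    u = (int.from_bytes(digest[:8], "big") + 1) / 2**64   # uniform in (0, 1]
    return math.log(u) / weight                           # weighted draw, higher is better

def place(obj_name: str, replicas: int = 3) -> list:
    """Return the OSDs holding `obj_name`: the ones with the longest straws."""
    ranked = sorted(CLUSTER_MAP,
                    key=lambda osd: straw(obj_name, osd, CLUSTER_MAP[osd]),
                    reverse=True)
    return ranked[:replicas]

print(place("rbd_data.volume42.0000001a"))   # same input -> same replica set, on any client
```

Because every client and every OSD runs the same computation against the same map, data can be written and read without asking a central server where it lives; when the map changes, everyone recomputes and agrees on the new locations.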


#CRUSH is also responsible for distributing storage capacity and workload evenly among nodes. And CRUSH is what brings the magic that keeps this distribution even as nodes and capacity are added or removed over time, while keeping resource usage low during the data movement that results from those changes to the cluster.
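
You can see that "magic" with a tiny, self-contained experiment (again a toy hash-based scheme standing in for CRUSH, with invented OSD names): place a batch of objects, add one OSD, and count how many replicas actually have to move.

```python
# Tiny, self-contained experiment with the same toy hashing idea (again, not real
# CRUSH): place many objects on 4 OSDs, add a 5th one, and count how many replica
# placements actually change.
import hashlib
import math

def place(obj, osds, replicas=2):
    def straw(osd, weight):
        h = int.from_bytes(hashlib.sha256(f"{obj}/{osd}".encode()).digest()[:8], "big")
        return math.log((h + 1) / 2**64) / weight          # weighted pseudo-random draw
    return set(sorted(osds, key=lambda o: straw(o, osds[o]), reverse=True)[:replicas])

before = {f"osd.{i}": 1.0 for i in range(4)}
after = {**before, "osd.4": 1.0}                           # add a fifth, equally weighted OSD

objects = [f"obj-{i}" for i in range(10_000)]
moved = sum(len(place(o, before) - place(o, after)) for o in objects)
total = 2 * len(objects)                                    # replicas per object * objects
print(f"replica placements that moved: {moved}/{total} (~{moved / total:.0%})")
# Expect roughly 1/5 of the replicas to move: only the share the new OSD should
# own, instead of the wholesale reshuffle a naive "hash modulo N" scheme causes.
```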

#CRUSH allows you to customize the map settings depending on your storage cluster's components, their distribution and how they change over time. This customization can heavily improve your efficiency and performance compared with other storage solutions available in the market. You can tune your solution through parameters such as the bucket type (uniform, list, tree or straw), the weights and the hierarchy definition, and the replication and placement rules. You can find more information about how bucket types, rules and weight definitions affect performance and capacity in "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data".
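
To make those moving parts concrete, here is a rough Python model of the kind of information a CRUSH map encodes; the real map is a compiled text file with its own syntax, and every name and value below is invented for illustration:

```python
# Rough model of what a CRUSH map describes (illustration only: real CRUSH maps
# use their own compiled text format, and all names/values here are invented).

# Devices: the OSDs, each with a weight (commonly around 1.0 per TB of capacity).
devices = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 1.0}

# Bucket hierarchy: the physical layout, from a root down to racks, hosts and OSDs.
# Each bucket has a type, a selection algorithm (uniform, list, tree or straw)
# and children; a bucket's weight is the sum of the weights underneath it.
hierarchy = {
    "type": "root", "name": "default", "alg": "straw", "children": [
        {"type": "rack", "name": "rack-a", "alg": "straw", "children": [
            {"type": "host", "name": "node-1", "alg": "straw", "children": ["osd.0"]}]},
        {"type": "rack", "name": "rack-b", "alg": "straw", "children": [
            {"type": "host", "name": "node-2", "alg": "straw", "children": ["osd.1"]}]},
        {"type": "rack", "name": "rack-c", "alg": "straw", "children": [
            {"type": "host", "name": "node-3", "alg": "straw", "children": ["osd.2"]}]},
    ],
}

# Placement rule: start at the root and pick one leaf OSD in each of three
# distinct racks, so the rack is the failure domain for this pool.
rule_replicated_3x = {
    "replicas": 3,
    "steps": [("take", "default"), ("chooseleaf", "rack"), ("emit",)],
}
```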

Now, a couple of customization cases:

  • Suppose you want to offer a block storage service based exclusively on SSD devices, with an extreme level of performance and redundancy. This environment will be extremely uniform, evenly weighted, and it will never shrink. You will be adding nodes in blocks every quarter, maybe adding racks full of new nodes following the rows of cabinets at the datacenter. Then you probably need to define failure domains based on the rack location of every node, to get independence from power failures. You have to provide a high-performance, high-bandwidth network between racks to carry both user and replication traffic. You could set the buckets to the "tree" type to reduce data placement time, and use 3-way replication to get better redundancy and avoid data loss in a double failure scenario, which becomes likely with a large number of devices and nodes. All devices will get the same weight. You should define a fixed object size so that block volumes stripe optimally across objects.
  • Now imagine a storage cluster with a mix of block and object data, some of it not transactional at all. The ecosystem of nodes is heterogeneous, with different types of disks (SATA, SAS, SSD), and you don't know how the demand for capacity will change in the future. You will decommission or reuse old hardware, and there is uncertainty about how it will be replaced. You will need to distribute data among different datacenters, and the links between these sites have limited bandwidth and high latency. First of all, you should set "straw" as your bucket type and be really careful when defining the weights and the hierarchy. You could set more than one root bucket depending on the type of device. Objects could be replicated between datacenters in n-way mode ("n" being the number of sites), but block data should probably be kept inside each site, letting the client application handle cross-site replication to optimize bandwidth usage between sites. You should define different weights depending on capacity and device performance (i.e. SSDs could get a higher weight than SAS to take more of the workload), as in the sketch after this list.
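
Here is the sketch referenced in the second case: the same toy hashing idea as before (not real CRUSH, with invented device names), used to check that giving the hypothetical SSD OSDs twice the weight of the SAS ones makes their share of the data follow the weights:

```python
# Quick check of the weight idea from the second case (toy hashing, not real CRUSH):
# the hypothetical SSD OSDs get twice the weight of the SAS ones, so they should
# end up holding roughly twice the share of objects.
import hashlib
import math
from collections import Counter

osds = {"ssd.0": 2.0, "ssd.1": 2.0,   # SSD devices, weight 2.0
        "sas.0": 1.0, "sas.1": 1.0}   # SAS devices, weight 1.0

def primary(obj):
    def straw(osd, weight):
        h = int.from_bytes(hashlib.sha256(f"{obj}/{osd}".encode()).digest()[:8], "big")
        return math.log((h + 1) / 2**64) / weight
    return max(osds, key=lambda o: straw(o, osds[o]))

counts = Counter(primary(f"obj-{i}") for i in range(30_000))
for osd, n in sorted(counts.items()):
    print(f"{osd}: {n / 30_000:.1%} of objects")   # SSDs near 33% each, SAS near 17% each
```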

As you can see, CRUSH helps you face any storage challenge, from a high-performance, uniform block storage solution to a geo-replicated, multi-purpose storage system.

See you next time!
