# Ceph versus Distributed Share Nothing Storage Architectures

At the last #EMCWorld we got cool keynotes showing beautiful trends in storage technologies. Two types of storage architectures came up again and again:

  • Distributed Share Nothing: This type of architecture runs on independent controllers that share no memory resources between nodes. This sort of solution was made for non-transactional data and brings distributed data-protection features. Object storage fits this description, and you have several options such as OpenStack Swift, Ceph Object Storage, or Amazon S3.
  • Loosely Coupled Scale-Out: A similar description, but aimed at storing transactional data. The data is distributed across all nodes in blocks or pieces, and you get consistent writes and reads among the nodes. Part of the software maps the location of the pieces of data and helps you put them together for a coherent read operation. Performance and capacity scale out by adding nodes, and you can usually control the weight of each node within the whole cluster depending on its hardware features and its contribution to overall performance. Some examples are EMC ScaleIO, Ceph Block Storage, VMware Virtual SAN, Nutanix, and Pivot3.

More details about these architectures can be found in “Understanding Storage Architectures”.

Some keynote slides at EMCWorld showed Ceph as only a “Distributed Share Nothing” type of architecture. I think that was a serious mistake, because Ceph can work in both worlds simultaneously – I am using this note to reply to it.

In our case, we use Ceph purely as a block storage solution through OpenStack Cinder. Obviously, there is a cost in extra I/O across the cluster network to synchronize writes and reads, and for maintenance, but if you provide enough network resources you can get more than enough throughput to your cloud servers.
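For reference, wiring Ceph RBD into Cinder mostly comes down to a backend section like the following in cinder.conf. This is only a sketch: the pool name, user, and UUID below are illustrative placeholders, not our production settings.

```ini
# cinder.conf – RBD backend sketch (values are illustrative)
[DEFAULT]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
# Ceph pool that will hold the Cinder volumes
rbd_pool = volumes
rbd_ceph_conf = /etc/ceph/ceph.conf
# CephX user and the libvirt secret holding its key
rbd_user = cinder
rbd_secret_uuid = <libvirt-secret-uuid>
```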

(Below you have a picture showing both architectures)


In my last note I brought some results of our tests with Ceph Block Storage and OpenStack Cinder. Using local SSD disks we got numbers around 900 MB/s from a VM executing “dd if=/dev/zero bs=1M”. In this case we got results a bit lower than local disks, but we won on redundancy, data persistence, flexibility, scalability and cost – you have to sacrifice something to succeed. It could have been much worse if we had not used PCIe flash cards for the journal – a lesson we learned.
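For anyone who wants to repeat the measurement, the test was along these lines (the mount point is an assumption about your setup; oflag=direct keeps the VM page cache out of the number):

```shell
# Write 1 GiB of zeros to a file on the Cinder-backed volume
# (mounted here at /mnt/vol – adjust to your environment) and
# let dd report the resulting throughput.
dd if=/dev/zero of=/mnt/vol/testfile bs=1M count=1024 oflag=direct
```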

Anyway, we’ve been able to lower our storage costs and bring high performance and availability levels to our customers.

(Below you can see a picture of our labs results with Ceph Block Storage and OpenStack Cinder)

(Figure: Flash Storage – Ceph / OpenStack Cinder Icehouse – write performance)

See you around

4 replies

  1. Disclosure – EMCer here
    Mauricio – thanks for the blog post, and the linkage to my blog post on storage architectures. I hear your point on Ceph being in more than one of the “Phylum” (storage “tree of life branches”) – but I stick to the way I classified it (though ALWAYS open to debate!) for this reason:
    The core Ceph innovation is their RADOS layer – which is a “Type IV” in the classification I’ve proposed. On top of this, one CAN layer block and NAS presentation models (the approach the Ceph crew have taken).
    There are ups (simple, and re-uses the underlying object store for multiple purposes) and downs (the transactional stacks “inherit” some of the underlying object store behaviors – for better and for worse) to this layering approach.
    BTW – the EMC thingamajig that competes with Ceph is the ViPR data services layer (which has ViPR Block, which you blogged about, and ViPR Object/HDFS – which will eventually also add a non-transactional NAS protocol choice). The most important behaviors for most people looking at object stores are the core object engine and geo-distribution behaviors – again, speaking of moderate to large scale.
    We took a different approach than Ceph on how to do transactional on commodity hardware (our thinking was that transactional behaviors/characteristics used by things like Openstack cinder volumes used by Nova instances *tend* to diverge from object – particularly at scale), but the market will decide.
    The ViPR Controller provides a common management model not only for these, but also frankly for ANY storage model (oh, and is freely available – just google “get ViPR”)
    I’d encourage anyone interested in things like Ceph to look at the various models in the market (native SWIFT as you point out), ViPR, and others – part of the fun of software-defined storage stacks is that you can use them with commodity hardware models!

    • Chad, first of all, I am really honored to get your comment… thanks.
      I’m sure I will owe you a couple of hundred additional hits to this note.
      The bone of contention is that Ceph cannot be considered a non-transactional-ONLY type of storage. Based on our experience, that description fully fits the Object or NAS presentations, as you mentioned.
      But there are no remarkable differences between the features you can get from your ViPR Block solution and Ceph block storage. Ceph block storage works really well for transactional workloads – we’ve done tests to prove it. And as I saw in your post, you’ve placed ScaleIO in Storage Type 3. Both technologies have to map and monitor the pieces of data spread among the data servers, and both work on a block-protocol basis (iSCSI for ViPR, RBD for Ceph) tuned to lower the latency between client and disk.
      Again, I am pleased to get you here.

  2. Mauricio – thanks, and don’t be honored, I’m just a dude who puts his pants on one leg at time in the morning, just like you 🙂

    The “Architectural Type” classification is rooted in the core architecture of the IO path. For example, if you present object storage on a VNX (which we can!) – it doesn’t make it a Type 4. I don’t claim to be a Ceph expert, so I’ll focus on the stuff I do know….

    The differences really start to manifest themselves as you scale up. For example, in the ScaleIO case, the distribution of the data is extremely widespread across nodes (with a very fine degree of granularity relative to a block layer on top of an object model). This makes scaling performance and rebuilds/redistribution work well at the scales customers tend to expect.

    Conversely, object stores (at least from the customers that I talk to) have a high degree of expectation of object geo-protection (erasure coding, but with “what do I have locally” logic) and geo-replication/namespaces.

    These don’t manifest when you are looking at a small deployment, or 4 servers – but are important architectural differences.

    For fun (and I’m curious – would love to see what you find!) find 10 or more servers (particularly some SSDs) and see if ViPR Block (ScaleIO) diverges (steady state, under failure, etc) from what you find with models that layer transactional models on top of object storage engines.

    • Chad, thanks again for your comment. As you’ve just figured out, Ceph block devices (RBD images) are striped over objects, and these objects are stored by RADOS (Ceph’s distributed object store). Ceph allows you to set the size of these objects (e.g. 4 MB) and other parameters that control the stripe layout…
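As a sketch of that striping arithmetic (pool and image names here are hypothetical, the rbd command of course needs a running cluster, and note that newer rbd releases take --object-size while older ones express it as a power-of-two --order):

```shell
# Creating an image with an explicit object size would look like:
#   rbd create --pool volumes --size 10G --object-size 4M vol1
# A 10 GiB image striped into 4 MiB objects therefore maps to:
IMAGE_MB=$((10 * 1024))
OBJECT_MB=4
echo "$((IMAGE_MB / OBJECT_MB)) objects"   # prints "2560 objects"
```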

      But the most important magic comes from CRUSH. This algorithm lets clients put and get data directly from the OSDs (the OSDs are the daemons that actually take care of the stored data; you can say there is a specific OSD for every disk drive). CRUSH lets you define disk pools and replication, place OSDs across servers, racks or rows, and avoid fetching data through a broker or special server, which removes additional latency from the transaction.

      It can also give you an evenly wide-spread distribution of the data across nodes to improve performance and scalability (you can shape it as you want through CRUSH)…
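That placement policy lives in the CRUSH map as rules. A minimal sketch of a replicated rule that keeps each copy on a distinct host under a hypothetical ssd-root bucket (names are illustrative; this is the pre-Luminous crushmap syntax):

```
rule ssd_replicated {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take ssd-root
    step chooseleaf firstn 0 type host
    step emit
}
```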

      You can turn Ceph into whatever you need: a high-performance block storage system for transactional data, or an object storage cluster for non-transactional data.

      We agree that Ceph can’t do both over the same OSDs if you want to be effective (you can’t apply object geo-protection policies to objects destined to store RBD images).

      However, our Ceph storage was built to serve transactional data exclusively, over SSDs. We use objects, yes, but we arranged these objects to serve block devices.
