Monday, August 6, 2012

Is OpenStack Swift Reliable Enough for Enterprise Use? (Corrected)

CORRECTION: I had incorrectly interpreted the non-correctable error number as being the probability of bit-rot. This is not the case. I've been told that the probability of a silent bit-rot error is actually quite low: 1 (bit up to sector) in 10^21 or lower. Even with this 1 in 10^21 number, the MTTDL improves significantly! Apologies to the Swift community for representing Swift in.

In this blog, I’d like to tackle reliability of OpenStack Swift. OpenStack Swift is a very successful open-source object storage project that is suitable for public and private cloud storage. I believe reliability is a really important topic to discuss for enterprise adoption of Swift to progress, even though terms such as mean-time-to-data-loss may put even the most die-hards into a deep slumber :-)!!

DISCLAIMER: The views expressed here are my own and don't necessarily represent my employer Emulex's positions, strategies or opinions.  

Let’s take a step back. In any storage system, you have three major parameters you are trying to balance: High performance, high reliability, and low cost. You can get one or two but not all three.  Examples of where different storage types fall on this curve are shown below.

Different storage types on the cost, performance, reliability curve

If you accept this premise, then it is important to validate that object storage does indeed give us acceptable reliability. For the cheap-and-deep primary storage tier this would mean high reliability. For the cloud backup use-case, where the original copy exists, something lower might be OK. Triple mirroring for the broader use-case and double mirroring for cloud backup, intuitively, should give us what we need but let’s validate this.

There is another reason to go through this exercise. Swift is highly configurable and you have all kinds of knobs ranging from disk density per server, type of disks, type of network, processing horse-power, use (or not) of PCIe flash and SSDs (see a new blog by Zmanda on this topic), ratio of proxy servers to object servers, configuration of account & container servers etc. A user needs some tools to pick where they want to be on the reliability-performance-cost curve. A lot has been written about cost (including a blog from me). Some amount has been written on performance most notably by Zmanda. However, there’s not much on reliability. Hopefully this blog will start a debate on this topic.

Armed with tools to compute cost, performance, and reliability, users will be able to configure the above knobs more intelligently.

I’m going to use MTTDL (mean-time-to-data-loss) as a measure of reliability. We are looking for high MTTDL numbers, at least 10x of the expected hardware useful life i.e. say 30-40 years. Reliability is different from availability where loss of availability may happen for reasons such as network or power failure, but that doesn’t imply permanent data loss. I am going to compute MTTDL in three ways i) disk failure,  ii) bit-rot, and iii) storage node failure; then we can take the minimum of those to be the final MTTDL. The approach used is loosely based on the approach taken by Richard Elling in his 2007 blogs (blog#1, blog#2).

This exercise ignores natural disasters, human errors, fire etc. Swift has the notion of zones, but these zones are currently there for availability rather than disaster recovery. I’m confident that Swift will eventually add geo-replication, and at that time we can consider a fourth technique i.e. MTTDL when there is a disaster [2/15/13: Note Swift is indeed going to have geo-replication soon]. 

To start, here’s research on vendor data. Keep in mind that the desktop drives are designed for 2400 power-on hours (POH) per year. When used in a 100% up-time environment (8760 POH), the numbers for desktop drives will have to be de-rated.

[2/15/13: P(bit-rot) = 1 in 10^21]

For servers, getting the MTBF data is much harder. You typically need an NDA to get this data from the server manufacturer. I did find some public data by Intel. This is not exactly the server we would use for Swift, but it should be indicative. It has non-redundant fans & power-supplies and the MTBF data excludes disks. This is exactly what we want.

Now let’s translate the above vendor data into the following assumptions. As mentioned above, I’m derating some of the manufacturer data. I’m also using AFRs for disk based on Google’s seminal paper in the area of disk failures.

MTTDL technique#1
This technique assumes a disk failure. Data loss occurs if there are additional failures while we are trying to repair the first failure. We are assuming that the datacenter is staffed 24x7 so if a disk fails, the ring (a configuration file used by Swift to determine data placement) can be modified immediately. The disk itself does not need to be replaced quickly since we assume adequate spares. If the datacenter is staffed only 9-5, then the MTTR (mean-time-to-repair) will obviously increase since, worst case, it may be 16 hours before someone can modify the ring. These 9-5 datacenters may have to go with 4-way mirroring and therefore pay more in capital expenditures to reduce operational expenditures. A key assumption here is the disk rebuild rate. If the servers and the network are healthy, then given how Swift recovery works, the rebuild traffic flows from many nodes to one node, so we should be able to approach the limit of network or disk speed. Coincidentally, that number is 100MB/s for both the disk speed limit and the network throughput limit, assuming a GE network. For our calculations, we’ll use the following formula as per Elling:

MTTDL[1] 2-way mirror = MTBF^2 / (N * (N-1) * MTTR)
MTTDL[1] 3-way mirror = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
MTTDL[1] 4-way mirror = MTBF^4 / (N * (N-1) * (N-2) * (N-3) * MTTR^3)
MTBF = MTBF for the disk           
N = Total disks in the cluster
MTTR = Mean-time-to-repair
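
The three formulas above follow one pattern, so they can be sketched as a single function. This is only a minimal sketch: the function names are mine, and the sample inputs (a 2 TB disk, a 500,000-hour MTBF, 100 disks) are hypothetical values chosen for illustration, not the numbers behind the tables in this post. The rebuild time assumes the whole disk is re-replicated at a constant 100MB/s.

```python
from math import prod

HOURS_PER_YEAR = 8760  # 100% up-time, as used earlier in the post


def rebuild_hours(capacity_bytes, rate_bytes_per_s):
    """Hours needed to re-replicate one failed disk's data at a constant rate."""
    return capacity_bytes / rate_bytes_per_s / 3600


def mttdl_mirror(mtbf_hours, n_units, mttr_hours, copies):
    """MTTDL in hours for a `copies`-way mirror.

    Generalizes the formulas above:
        MTBF^k / (N * (N-1) * ... * (N-k+1) * MTTR^(k-1))
    """
    if copies < 2 or n_units < copies:
        raise ValueError("need copies >= 2 and at least `copies` units")
    combos = prod(n_units - i for i in range(copies))
    return mtbf_hours ** copies / (combos * mttr_hours ** (copies - 1))


# Hypothetical inputs, for illustration only.
mttr = rebuild_hours(2e12, 100e6)  # 2 TB disk rebuilt at 100 MB/s
for k in (2, 3, 4):
    years = mttdl_mirror(500_000, 100, mttr, k) / HOURS_PER_YEAR
    print(f"{k}-way mirror: {years:.3g} years")
```

Swapping in your own disk MTBF, disk count, and rebuild rate shows how strongly each extra copy moves the result.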

This gives us the below MTTDL[1]:

MTTDL technique#2
This technique assumes a silent bit-rot. Data loss occurs when other copies fail while we are trying to recover from the bit-rot incident. Since Swift will catch bit-rot using the object auditor (in the case of objects), we will use a modified version of Elling’s formula as per below (math majors, please double-check it). Assume that a silent object corruption has happened; it will take some amount of time for the object auditor to find it, quarantine the object, and then replicate it. The first two are considered time-to-diagnose (MTTD) and the latter is considered MTTR.

Here the key assumption is MTTD. Check out the object-auditor settings below. I am assuming we can tweak these numbers upwards to get about 5 MB/s of object-auditor rate per disk, i.e. 70 MB/s of aggregate object-auditor throughput for the node, assuming 14 disks.

# You can override the default log routing for this app here (don’t use set!):
# log_name = object-auditor
# log_facility = LOG_LOCAL0
# log_level = INFO
# files_per_second = 20
# bytes_per_second = 10000000
# log_time = 3600
# zero_byte_files_per_second = 50

Further I am assuming the fault is right in the middle of the disk. In reality the error may be earlier or later in our scan. This assumption makes this analysis somewhat simplistic in that it is deterministic. A full Monte Carlo simulation would be better, any volunteers ;-)?
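
To make the middle-of-the-disk assumption concrete, here is a rough sketch of the time-to-diagnose arithmetic. The function name and the 2 TB full-disk capacity are hypothetical, and the sketch assumes the auditor scans each disk sequentially at a constant rate.

```python
def time_to_diagnose_hours(disk_capacity_bytes, auditor_rate_bytes_per_s,
                           fault_position=0.5):
    """Hours until a sequential auditor scan reaches the corrupted object.

    fault_position is the fraction of the disk scanned before the fault
    is reached; 0.5 is the "right in the middle" assumption.
    """
    if not 0.0 <= fault_position <= 1.0:
        raise ValueError("fault_position must be between 0 and 1")
    scanned_bytes = disk_capacity_bytes * fault_position
    return scanned_bytes / auditor_rate_bytes_per_s / 3600


disks_per_node = 14
per_disk_rate = 5e6                          # 5 MB/s per disk, as assumed above
node_rate = disks_per_node * per_disk_rate   # aggregate auditor rate for the node

# Hypothetical 2 TB disk, assumed full.
mttd = time_to_diagnose_hours(2e12, per_disk_rate)
print(f"node auditor throughput: {node_rate / 1e6:.0f} MB/s")
print(f"MTTD with the fault mid-disk: {mttd:.1f} hours")
```

Varying fault_position between 0 and 1 gives the best and worst cases that a Monte Carlo run would sample between.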

MTTDL[2] 2-way mirror = MTBF / (P(Object-rot) * N * MTTR2)
MTTDL[2] 3-way mirror = MTBF^2 / (P(Object-rot)*N * (N-1) * MTTR2^2)
MTTDL[2] 4-way mirror = MTBF^3 / (P(Object-rot)*N * (N-1) * (N-2) * MTTR2^3)
P(Object-rot) = probability of object rot = Non-recoverable read errors per bits read * object size
MTBF = MTBF for the disk           
N = Total disks in the cluster
MTTR2 = Total repair time = MTTD (mean-time-to-diagnose) + MTTR
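
The bit-rot variant can be sketched the same way. The function names are mine and the inputs are hypothetical, apart from the 1 in 10^21 per-bit figure from the correction note. I am reading the 2-way case as MTBF divided by the whole product P(Object-rot) * N * MTTR2, which matches the 3-way and 4-way forms.

```python
from math import prod

HOURS_PER_YEAR = 8760


def object_rot_probability(per_bit_probability, object_size_bytes):
    """P(Object-rot) = per-bit error probability * object size in bits."""
    return per_bit_probability * object_size_bytes * 8


def mttdl_bitrot(mtbf_hours, n_disks, p_object_rot, mttr2_hours, copies):
    """MTTDL in hours when the first event is a silent object corruption.

    The remaining (copies - 1) replicas must all fail within MTTR2:
        MTBF^(k-1) / (P * N * ... * (N-k+2) * MTTR2^(k-1))
    """
    if copies < 2 or n_disks < copies:
        raise ValueError("need copies >= 2 and at least `copies` disks")
    others = copies - 1
    combos = prod(n_disks - i for i in range(others))
    return mtbf_hours ** others / (p_object_rot * combos * mttr2_hours ** others)


# Hypothetical inputs: 1 MB objects, 100 disks, 500,000-hour disk MTBF,
# MTTR2 = 55.6 h to diagnose + 5.6 h to repair.
p_rot = object_rot_probability(1e-21, 1e6)
mttr2 = 55.6 + 5.6
for k in (2, 3, 4):
    years = mttdl_bitrot(500_000, 100, p_rot, mttr2, k) / HOURS_PER_YEAR
    print(f"{k}-way mirror: {years:.3g} years")
```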

This gives us the below MTTDL[2]:

MTTDL technique#3
This technique analyzes a node failure. There is data loss if an additional disk or node fails (that has a copy of that object) while we are trying to recover from the first node failure. Let’s simplify and just take the case of an additional node failure rather than a disk failure. In reality, if a node fails due to fan or power-supply failure, the data is still intact. However, in a scale-out datacenter, nobody is going to repair the server or manually move disks from one server to another. These steps are expensive and error-prone. We can use the same formulas as technique#1.

MTTDL[3] 2-way mirror = MTBF^2 / (N * (N-1) * MTTR)
MTTDL[3] 3-way mirror = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
MTTDL[3] 4-way mirror = MTBF^4 / (N * (N-1) * (N-2) * (N-3) * MTTR^3)
MTBF = MTBF for a server           
N = Total servers in the cluster
MTTR = Mean-time-to-repair

We are going to assume a rebuild rate limited by the GE network i.e. 100MB/s. We are assuming that the node being rebuilt can handle this data rate. This gives us the below MTTDL[3]:

Using the worst-case MTTDL i.e. the minimum values from the above three tables, we get the following:

This matches our intuition. A two-way mirror is OK for cloud backup where there’s already a primary copy and the data-set can be recreated upon data loss. Further, a 3-way mirror seems adequate for the cheap-and-deep primary storage use-case and meets or exceeds our MTTDL goal. This conclusion is very simplistic in that it is valid only for this exact set of assumptions. What’s needed is more sophisticated analysis like Monte Carlo simulation and associated calculators for users. For example, MTTDL is very sensitive to the cluster size. What works for a 230TB usable storage cluster breaks for a 2.3PB usable storage cluster. Disk types, replication factors, network bandwidth, disk density per server, compute power for performing auditor functions etc. need to be considered very carefully to get the right reliability characteristics.
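
The cluster-size sensitivity can be illustrated with the technique#1 formula. This is a rough sketch, assuming MTBF and MTTR stay the same as the cluster grows; the unit counts of 100 and 1000 are hypothetical stand-ins for a 10x larger cluster, and the function names are mine.

```python
from math import prod


def mttdl_mirror(mtbf_hours, n_units, mttr_hours, copies):
    """MTTDL for a `copies`-way mirror, per the technique#1 formula."""
    combos = prod(n_units - i for i in range(copies))
    return mtbf_hours ** copies / (combos * mttr_hours ** (copies - 1))


def mttdl_drop(small_n, large_n, copies, mtbf_hours=500_000.0, mttr_hours=6.0):
    """Factor by which MTTDL shrinks when the cluster grows from small_n to large_n.

    MTBF and MTTR cancel out of the ratio; the defaults are hypothetical.
    """
    small = mttdl_mirror(mtbf_hours, small_n, mttr_hours, copies)
    large = mttdl_mirror(mtbf_hours, large_n, mttr_hours, copies)
    return small / large


for k in (2, 3, 4):
    print(f"{k}-way mirror: MTTDL drops about {mttdl_drop(100, 1000, k):,.0f}x")
```

Under these assumptions a 10x larger cluster costs roughly a factor of 10^k in MTTDL for a k-way mirror, which is why a configuration that meets the goal at one scale can miss it at the next.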

Having said that, bottom line, the OpenStack Swift architecture does indeed provide the reliability needed by enterprise class customers! 


  1. Excellent post! I'm curious why you say "What works for a 230TB usable storage cluster breaks for a 2.3PB usable storage cluster.". -- are you saying the MTTDL is too low at that scale?

    1. Yes the MTTDL reduces with desktop drives... the configuration potentially needs to be tuned to meet the MTTDL goal in terms of things like disk type, disk density, server density, network throughput, CPU horsepower etc.

  2. Between desktop and enterprise "NL" drives, there's a new option that I consider attractive: "personal NAS" optimized drives like WD's "Red" drives. These claim to be designed for 24x7 operation, and advertise an MTBF of 1'000'000 hours. The non-recoverable bit error rates are like desktop drives (1e-14). I'd love to see your math on those...
    Economically, they are more expensive than desktop drives but not by much. They also claim to optimize power usage (and noise, but that's maybe not that helpful in a data center).

    1. Very interesting! I ran the numbers and based on my other assumptions, the results look identical to the NL drive because now you are limited by the server MTBF rather than the disk.

  3. Amar,
    Very interesting article.
    However, the calculation of the server MTTDL did not yield the same results. With:
    MTBF = MTBF for a server = 50,000
    N = Total servers in the cluster = 25
    MTTR = Mean-time-to-repair = 83
    The formula gives 1.8 years MTTDL (with 3 copies)...
    I replaced the first MTBF, N, and MTTR with disk number and got close to your numbers but not the same.

    It seems to me that the formula assumes that every server that fails has a copy of the relevant object. I think that if that probability were added, the numbers might be closer to reality.

  4. Can you elaborate a bit more? I'd love to know which specific formula you are talking about above and what result you got. Thanks.