Friday, October 4, 2013

Swift Durability and the Mystery of 11 9s

This blog builds on my earlier blog on Swift reliability calculated via MTTDL.

A key measure of cloud storage reliability is a metric called durability. This metric was brought into vogue by Amazon, and it is interesting to note that it wasn't popular before S3. Durability is defined as 1 minus the average annual expected loss of objects, expressed as a percentage. For example, 11 9s of durability means that if you store 10,000 objects you can expect an average loss of a single object every 10,000,000 years. The product of the two, i.e. 10^4 objects and 10^7 years, gives you 10^11, which corresponds to the 11 9s.
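To make the arithmetic behind the 11 9s concrete, here is a minimal Python sketch. The function names are my own and purely illustrative; the code only restates the definition above.

```python
def annual_loss_probability(nines: int) -> float:
    """Average annual probability of losing any one object (1 - durability)."""
    return 10.0 ** -nines


def years_per_lost_object(nines: int, objects_stored: int) -> float:
    """Expected number of years between single-object losses."""
    # Expected losses per year = objects_stored / 10**nines, so invert it.
    return 10 ** nines / objects_stored


print(annual_loss_probability(11))        # roughly 1e-11
print(years_per_lost_object(11, 10_000))  # 10^11 / 10^4 = 10^7 years
```

Storing more objects at the same durability shortens the expected time between losses proportionally.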

The question is: can OpenStack Swift match the durability advertised by major cloud storage providers, which is 11 9s?

Storage purists still sneer at durability since it's not really a classic storage reliability metric. However, I think the metric is quite useful. It is definitely a lot easier to understand than, say, mean time to data loss (MTTDL). Moreover, MTTDL and durability are linked and can be derived from each other, so it's not so bad to use one metric instead of the other.

To calculate durability from MTTDL, one first needs to make an assumption about the object size. Amazon, carefully, never mentions the object size used to calculate durability. This is disingenuous; 10,000 1KB objects will clearly have very different reliability characteristics than 10,000 1TB objects. Since S3 has 2 trillion objects and is roughly at 2 Exabytes capacity, this works out to an average object size of 1MB (2 x 10^18 bytes / 2 x 10^12 objects = 10^6 bytes). That's the assumption I'm going to go with. Let me know if you disagree.

The formulas work out as follows:

Total objects = Total storage / Avg. object size
P(Object loss per year) = 1/(MTTDL * Total objects)
Durability = 1 – P(Object loss per year)
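As a rough illustration, the three formulas can be sketched in Python as follows. The function names and the MTTDL value are my own placeholders, not results from the earlier post; the total storage is the 230TB figure used in this post, and the average object size is derived from the S3 figures quoted above (2 trillion objects, roughly 2 Exabytes).

```python
import math

TB = 10 ** 12  # decimal terabyte, in bytes
EB = 10 ** 18  # decimal exabyte, in bytes


def object_loss_probability(mttdl_years, total_storage_bytes, avg_object_size_bytes):
    """P(object loss per year) = 1 / (MTTDL * total objects)."""
    total_objects = total_storage_bytes / avg_object_size_bytes
    return 1.0 / (mttdl_years * total_objects)


def count_nines(p_loss):
    """Number of 9s in the durability 1 - p_loss."""
    # Computed from p_loss directly to avoid rounding error in 1 - p_loss.
    return -math.log10(p_loss)


# Average object size derived from the S3 figures quoted above.
avg_object_size = 2 * EB / (2 * 10 ** 12)

# PLACEHOLDER: 1,000 years is a made-up MTTDL, not a result from the earlier post.
p_loss = object_loss_probability(1_000, 230 * TB, avg_object_size)
durability = 1.0 - p_loss

print(f"P(object loss per year) = {p_loss:.3e}")
print(f"Durability = {durability:.13f} ({count_nines(p_loss):.1f} nines)")
```

Note that for a fixed MTTDL, a smaller average object size means more objects in the same storage and therefore a higher durability number, which is why the object size assumption matters so much.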

The MTTDL analysis in my previous reliability blog gave the following results:

This translates to the following durability numbers assuming total storage of 230TB:

This shows that OpenStack Swift can indeed be used to create a cloud that's as durable as the commercial cloud storage offerings in the industry. Please keep in mind that this analysis DOES NOT take into account non-systematic failures such as natural disasters, human error, fires, floods, etc. Those failures will indeed reduce the MTTDL and durability, but that analysis is outside the scope of this blog.
