Thursday, July 5, 2012

The Significance of Hadoop running on OpenStack Swift

The folks at BigDataCraft are working on integrating Hadoop with OpenStack Swift; see for more. This is really exciting! Most readers might ask the obvious question: Hadoop already runs very well on HDFS, so why would running it on top of Swift be of any interest at all?

There are two ways to answer this question. One is from the end-user point of view and the other is from a Swift-enthusiast point of view. Let's explore each one.

DISCLAIMER: The views expressed here are my own and don't necessarily represent my employer Emulex's positions, strategies or opinions. 

End-user Point of View
Hadoop is getting serious traction in the enterprise (I am skipping Web 2.0s in this discussion). However, storage administrators don't use HDFS or Hadoop. Data analytics folks, on the other hand, use Hadoop. Today the two worlds don't necessarily intersect.

What if a storage administrator wanted to occasionally run analytics? I call this the "storage" use-case for Hadoop as opposed to the traditional "compute" use-case. Here the primary goal is to safely store data, with analytics being a secondary, nice-to-have goal. The traditional compute use-case is exactly the opposite: analytics is the primary goal and storage is secondary. To illustrate the storage use-case, a friend of mine who works at a major HMO told me that they archive PBs of medical data on disk and are mulling over the idea of extracting some analytics value out of it.

There are two ways to solve the storage use-case problem. The first is for the storage administrator to start using HDFS as the cheap-and-deep or tier3 storage. With this solution, whenever someone wants to run Hadoop on the stored data, it's all good to go since the underlying storage is already HDFS. The only problem is that HDFS is not getting traction as the cheap-and-deep tier. The alternative is to port Hadoop onto the cheap-and-deep storage tier. This is where Swift comes in. I know Swift is not widely used in enterprises either! But I think this is going to change very soon. Once Swift becomes commonplace in the enterprise, the ability to run Hadoop on it might be invaluable for enabling the storage use-case of Hadoop.

Imagine heaps of medical imaging, bioinformatics, long-term log data, photos, legal records, oil & gas archives etc. just sitting on tier3 storage that can now start providing business analytics value!

Swift Point of View
Let's change gears to view this development from a Swift enthusiast point of view.

For a new technology to gain serious traction, I think it has to cut cost and simultaneously provide some new value. If we go back and look at how network attached storage (NAS) gained momentum in the enterprise, it did two things - i) cut cost over a SAN and ii) provided sharing of data across users that SAN storage simply could not provide. In other words, it cut cost and provided new value.

All the discussion around Swift today is about cutting the cost of storing big data. Swift clearly cuts capital & operational expenditures! However, to get real traction in the enterprise (where the object storage category starts being viewed as an equal to SAN & NAS), I think Swift needs to clearly demonstrate new value.

The good news is that there is a bunch of new value that Swift can provide. All of these need additional development, but I'm confident they will emerge over time.

1. Running occasional analytics on stored data aka the "storage" use-case of Hadoop. If demonstrated, this will be a home-run for Swift.

2. Adding rich meta-data: Lack of Posix compliance is now a feature ;-) Seriously, object storage lets you attach rich meta-data to objects, something traditional file-systems can't do. The folks at SoftLayer get this! There's a lot of development that can still be done in this area.

3. Allowing data-sharing across applications. File-systems or NAS allow data-sharing across users but not across applications. The reason is that each application creates its own schema for how data is spread across directories (you can't put a million files into one directory). An application other than the data-owner will have no idea about the schema and therefore cannot use the data. That's why, today, a mortgage application can't use the data owned by a banking application. Swift, on the other hand, can have a flat name-space, which can enable data-sharing across applications. When coupled with the meta-data capability above, this has the potential to unlock tremendous value. Again, if demonstrated correctly, I think this is a potential home-run for Swift.
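To make points 2 and 3 a little more concrete, here is a minimal Python sketch of how an object with rich meta-data in a flat name-space might be addressed. It only builds the URL and headers and never talks to a cluster. The `X-Object-Meta-` header prefix and `X-Auth-Token` come from the Swift API; everything else (the endpoint, token, container, object name, meta-data keys and the helper functions) is hypothetical and made up for illustration.

```python
from urllib.parse import quote

# Swift stores custom object metadata in HTTP headers with this prefix.
META_PREFIX = "X-Object-Meta-"


def object_url(storage_url, container, name):
    """Build an object URL; the name is one flat key, not a directory path."""
    return f"{storage_url.rstrip('/')}/{quote(container)}/{quote(name)}"


def metadata_headers(token, metadata):
    """Turn a plain dict into the headers for a PUT or POST on an object."""
    headers = {"X-Auth-Token": token}
    for key, value in metadata.items():
        headers[META_PREFIX + key] = str(value)
    return headers


def matches(headers, **wanted):
    """Local check: do these metadata headers carry all wanted key/value pairs?"""
    return all(headers.get(META_PREFIX + k) == str(v) for k, v in wanted.items())


# Hypothetical endpoint, token, container and object name (assumptions).
url = object_url("https://swift.example.com/v1/AUTH_demo", "archive", "scan-000123.dcm")
headers = metadata_headers("demo-token", {"Modality": "MRI", "Year": 2012})

print(url)
print(headers)
print(matches(headers, Modality="MRI"))
```

The point of the sketch is that a second application needs no knowledge of a directory schema: it only needs the container name and the meta-data keys. Note that `matches` is just a local check on headers that have already been fetched; finding objects by meta-data across a whole container would still need an index or a listing pass of some kind, which is part of the additional development mentioned above.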

Below is a summary of my view of what it will take to make Swift a resounding success in the enterprise.


  1. Hi, has anyone implemented this case study, i.e. Hadoop using OpenStack Swift for storage of its inputs and outputs?

  2. The topic itself is interesting. I see a synergy between Swift and HDFS, as both replicate data blocks for reliability (to avoid loss of data). They follow a similar pattern w.r.t. data storage (configurable replication, data blocks on the same rack, outside the rack, outside the building premises, etc.). It would be good if HDFS could be implemented over Swift; otherwise data blocks get replicated in HDFS as well as in Swift because of their inherent nature.