Note (March 5, 2012): I've posted a follow-up to this blog post, "End-User Feedback on OpenStack Swift: A Deeper Look at UCSD's Implementation".
The San Diego Supercomputer Center (SDSC) at the University of California, San Diego introduced a Data Storage Cloud using OpenStack Swift last September, making it the largest educational private cloud storage implementation. Pretty awesome, but the question is whether this is a one-off event or the start of a trend. In other words, are there benefits of this implementation that will carry over to other educational institutions, government organizations, and even enterprises? Let’s first look at what SDSC implemented and why. Then we can explore the question at hand.
Table: Summary of Benefits to UCSD and Whether They Extend Beyond UCSD
DISCLAIMER: The views expressed here are my own and don't necessarily represent my employer Emulex's positions, strategies or opinions.
What is SDSC doing with OpenStack Swift?
SDSC has set up the largest academic cloud storage using OpenStack Swift with 5.5PB of raw storage (1.8PB-2.2PB usable given two-way replication). This is pretty big for private clouds! The reason, as SDSC Deputy Director Richard Moore states in the January’11 PASIG presentation, is that research data is growing exponentially and current archival solutions are simply inadequate. The cloud is offered as a service to UC researchers and affiliates.
SDSC’s Swift cluster has 16 proxy/auth nodes and 49 storage nodes. Storage nodes are not identical; in fact, they are of two types from two different companies. Networking is all 10 Gigabit Ethernet. The service promises 8-10GB/sec throughput, which is very high compared to traditional storage systems. The ratio of proxy nodes to storage nodes, the amount of memory used, and the choice of 10Gb Ethernet all suggest that a lot of thought has gone into performance tuning. Interestingly, the storage nodes have RAID controllers, which are not required and are in fact discouraged by the Swift documentation. This configuration uses Arista’s multi-chassis link aggregation (MLAG) to provide a unified network address. SDSC uses the Rocks clustering toolkit for management. Rocks is an open source UC cluster management tool with a history in HPC. I’d never heard of it, but it seems to merit further digging. The Swift cluster offers both Swift and S3 APIs. Moreover, at-rest encryption is optionally offered. Finally, the ability to create a third replica off-site will be offered shortly.
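As a rough sanity check on those numbers, here is a back-of-the-envelope sketch of what the promised aggregate throughput implies per proxy node. It assumes traffic is spread evenly across all 16 proxies and that 1 GB means 10^9 bytes; neither assumption comes from SDSC, and the function name is mine.

```python
def per_proxy_gbits(aggregate_gbytes_per_sec, proxy_nodes=16):
    """Average load per proxy in Gbit/s, assuming an even spread (my assumption)."""
    # Divide the aggregate among the proxies, then convert bytes to bits.
    return aggregate_gbytes_per_sec / proxy_nodes * 8


if __name__ == "__main__":
    for aggregate in (8, 10):  # the promised 8-10 GB/sec range
        share = per_proxy_gbits(aggregate)
        print(f"{aggregate} GB/s aggregate -> {share:.1f} Gbit/s per proxy")
```

Under those assumptions each proxy would carry about 4 to 5 Gbit/s, which fits within a single 10 Gigabit Ethernet link.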
SDSC suggests several client software packages: Cyberduck, SDSC’s home-grown cloud explorer, the swift CLI, Commvault’s cloud backup, Amanda cloud backup, and CrashPlan. This is only their suggested list; a user could presumably use anything that works with the Swift or S3 APIs.
This cloud storage has pretty sophisticated authentication. A user can keep an object private, share it with other users, or make it public. This is very interesting, because once folder synchronization becomes available in the future from one of the clients, SDSC would have effectively created a private DropBox for their internal users! Authentication is also tied to the billing system so delinquent accounts lose different abilities after 30, 60, or 90 days of non-payment. I could not figure out whether SDSC uses Keystone or some other home-grown authentication system.
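For a feel of how private/shared/public access might look at the API level, here is a minimal sketch based on stock Swift container ACLs. In stock Swift, read access is set per container (not per object) through the `X-Container-Read` header. Since I don't know which auth system SDSC runs, the exact reader format (`account:user` below) is an assumption, and the function and visibility names are made up for illustration.

```python
def container_read_acl(visibility, readers=()):
    """Build the header for a Swift container POST that sets read access.

    visibility: "private", "shared", or "public" (hypothetical labels).
    readers: entries such as "account:user"; the exact format depends on
             the auth system in use, which is an assumption here.
    """
    if visibility == "private":
        # An empty value clears any read ACL, leaving owner-only access.
        return {"X-Container-Read": ""}
    if visibility == "shared":
        if not readers:
            raise ValueError("shared access needs at least one reader")
        return {"X-Container-Read": ",".join(readers)}
    if visibility == "public":
        # .r:* allows any referrer to read objects; .rlistings allows listings.
        return {"X-Container-Read": ".r:*,.rlistings"}
    raise ValueError(f"unknown visibility: {visibility!r}")


if __name__ == "__main__":
    print(container_read_acl("shared", ["lab42:alice", "lab42:bob"]))
    print(container_read_acl("public"))
```

These headers would be sent with an authenticated POST to the container URL; the sketch only builds them and makes no network calls.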
Finally, we come to the topic of price. There are two pricing models: on-demand and micro-condo. You can check out the specifics on their site, but on-demand is basically $0.0325/GB/month for UC folks, $0.0472/GB/month for affiliates, and $0.065/GB/month for others. In my blog on costs, I projected a cost of $0.030-$0.041/GB/month for 1-5PB usable storage depending on the level of replication and type of support. Assuming SDSC is offering the storage at cost to internal folks, these numbers match pretty well. The micro-condo option is pretty radical! The customer pays for the upfront hardware and annual maintenance cost. The upfront expense is higher, but the total savings over time are worthwhile. Finally, there are no transfer costs in either model. All of this combined makes this private cloud dramatically less expensive than public clouds for internal UC users. Plus, since it’s offered by SDSC, researchers can be confident that it meets all their specific compliance needs.
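To make the on-demand rates concrete, here is a small sketch that turns the per-GB prices quoted above into monthly and yearly bills. The rates are the ones listed in this post; the tier names, function names, and the 1 TB = 1,000 GB convention are my assumptions, and the micro-condo model is not covered.

```python
# On-demand rates in dollars per GB per month, as quoted in this post.
RATES_PER_GB_MONTH = {
    "uc": 0.0325,         # UC researchers
    "affiliate": 0.0472,  # UC affiliates
    "other": 0.065,       # everyone else
}


def monthly_cost(gigabytes, tier):
    """Storage-only cost for one month; there are no transfer charges."""
    return gigabytes * RATES_PER_GB_MONTH[tier]


def annual_cost(gigabytes, tier):
    """Twelve months at a constant stored volume (a simplification)."""
    return 12 * monthly_cost(gigabytes, tier)


if __name__ == "__main__":
    stored_gb = 10 * 1000  # 10 TB, assuming 1 TB = 1,000 GB
    for tier in RATES_PER_GB_MONTH:
        print(f"{tier}: ${monthly_cost(stored_gb, tier):,.2f}/month, "
              f"${annual_cost(stored_gb, tier):,.2f}/year")
```

Because neither pricing model charges for transfers, stored volume is the only input needed here.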
Why is SDSC doing this?
SDSC’s use-case matches the third use-case from my blog post on three use-cases of OpenStack Swift. Between the press release, the PASIG presentation, and the site, you can read about all the different benefits; below is my attempt to summarize them:
- Low-cost, high-performance, highly reliable online archival: This is required as researchers need to store more and more data for long periods of time. Researchers can use different types of HPC-centric storage (other than this Swift cluster) when the data is in actual data-crunching mode. But before or after that phase, this storage is ideal.
- A UCSD-wide private “DropBox” style service for collaboration: Researchers need to collaborate routinely. With the Federal Data Sharing requirements, this becomes an even more acute need. Without this, the only options presumably would be using thumb drives, FTP sites, and emailing files around!
- Cloud backup
- S3/Swift API-compliant storage for application writers and web-site creators: A lot of researchers use web-style programming paradigms. For those programmers, UCSD would presumably find it a lot better if they were to use a private cloud controlled by UCSD's IT rather than a public cloud.
Will Others Follow, especially Enterprises?
Will others follow, especially enterprises? Corporate dollars are critical for a technology to make rapid progress. Let’s review whether the four benefits transcend UCSD:
- Archival – With rapid growth of data across a variety of segments (e.g. medical imaging, bioinformatics, photo sites, oil & gas, financials, pharmaceuticals), archival has become a big problem. Furthermore, users are not happy with offline archival (on tape) and are increasingly demanding online archival for easy access. The reality is that once something is archived on tape, it is a one-way street and nobody is going to access it again. I would say this use-case definitely transcends UCSD.
- Private “DropBox” style service – Collaboration is a huge problem that has been partially solved with Microsoft SharePoint. However, SharePoint doesn’t work for certain situations, like ad-hoc folders or when the user wants a lot more control. Enterprises are faced with corporate data being shared on public folder sharing services, which they obviously don’t like. A private cloud storage implementation with folder sharing is an elegant solution. So I would say a definite yes for this benefit also, in terms of extending beyond UCSD.
- Cloud backup – Numerous users are supplementing tape backup with cloud backup, since the backup image stays online and can be restored rapidly. Again, rather than having employees use a higher-cost public cloud with little oversight and control from IT, a private cloud backup solution seems to have a benefit across the board.
- Private S3 for application developers – The new crop of programmers graduating today work in Python, PHP, and Ruby with HTTP services for storage (e.g. AWS) rather than in C/C++ using block or file storage. These programmers are certainly going to be a lot more comfortable with private cloud storage that’s available on-demand with HTTP APIs. I agree that this is a build-it-and-they-will-come thought process, but I think this benefit extends well beyond UCSD.
I guess you can tell I am sold on this OpenStack Swift implementation being the start of a mega-trend. If an IT department can offer a private cloud that meets all internal compliance and SLA requirements at a cost that is lower than a public cloud (with no transfer costs), then why wouldn't users use it?
SDSC’s use of OpenStack Swift to create private cloud storage is revolutionary. They are providing a new storage service to internal users that is better than public cloud storage in cost, performance, and compliance. I believe it is only a matter of time before other educational institutions, government organizations, and enterprises follow suit, making the SDSC implementation the start of a mega-trend.