Cloud vendors like Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure all offer a variety of storage services, ranging from high-performance SSD-based capacity to long-latency archive storage, at prices ranging from high to relatively low. But most applications have a variety of I/O needs, from latency-sensitive metadata updates to bandwidth-hungry backups. No single cloud storage service is ideal for all of them.
Application developers know this, and often perform unnatural acts in their code to overcome cloud storage deficits. Two major issues are cost/performance tradeoffs, and inelastic deployment boundaries.
The storage hierarchy — in simpler times DRAM, disk, and tape — reflects the tradeoffs. Fast storage is expensive, and cheap storage is slow.
To accommodate varying workloads, enterprise storage arrays move data adaptively, transferring hot data to fast caches and moving cool data off to disk, or in some cases, all the way to a cloud archive. But this is hard to do with cloud storage, as the different services require explicit deployment and offer different consistency guarantees.
Cloud storage services also tend to offer elasticity along only a single metric. The AWS S3 service, for example, scales with capacity, but not with I/O demand. DynamoDB scales with I/O demand, but is prohibitively expensive in low-latency configurations.
Anna to the rescue
In a recent paper, researchers at UC Berkeley explore Anna, an advanced key-value storage system designed to overcome current cloud storage limitations. Key-value stores are essentially two-column spreadsheets, where the first column contains an access key and the second contains the data you wish to store.
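In code terms, that two-column model is just a dictionary: opaque keys mapping to arbitrary values. Here's a minimal in-memory sketch to make the idea concrete — purely illustrative, and not a reflection of Anna's actual distributed, multi-core design:

```python
# Minimal key-value store sketch: the "two-column spreadsheet" model.
# Column 1 is the access key; column 2 is the stored value.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Write (or overwrite) the value for a key."""
        self._data[key] = value

    def get(self, key):
        """Read the value for a key, or None if absent."""
        return self._data.get(key)

store = KeyValueStore()
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))  # {'name': 'Ada'}
```

Real systems layer persistence, partitioning, and replication on top of this simple interface, but the put/get contract stays the same.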
Key-value stores are already in wide use in cloud services, but Anna implements three important optimizations.
- Horizontal elasticity for scaling
- Vertical data movement to accommodate changing access patterns
- Selective replication of hot keys across multiple cores and nodes to scale access performance
These optimizations are intended to address the need for growth in aggregate throughput, the reality of hot keys, and the shifting of workload hotspots.
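To see what selective replication means in practice, here is a hedged sketch of one possible hot-key policy: keys accessed above a threshold get replicas on every node, spreading their read load, while cold keys live on a single hashed home node. The names (`NODES`, `HOT_THRESHOLD`) and the policy itself are my illustration, not Anna's actual mechanism:

```python
from collections import Counter

# Illustrative hot-key replication policy (assumed, not Anna's real protocol):
# frequently accessed keys are replicated to all nodes; cold keys get one home.
NODES = ["node-a", "node-b", "node-c", "node-d"]
HOT_THRESHOLD = 100  # accesses per interval before a key counts as "hot"

access_counts = Counter()

def replicas_for(key):
    """Return the list of nodes that should hold a replica of this key."""
    if access_counts[key] > HOT_THRESHOLD:
        return list(NODES)  # hot key: replicate everywhere to spread reads
    # Cold key: a single home node chosen by hashing the key.
    return [NODES[hash(key) % len(NODES)]]

access_counts["popular"] = 500
access_counts["rare"] = 3
print(len(replicas_for("popular")))  # 4
print(len(replicas_for("rare")))     # 1
```

A real system would also re-shard as counts change — which is exactly the "shifting workload hotspots" problem the Anna authors call out.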
There’s a lot of detail in how Anna accomplishes these goals. But the bottom line is: how well does it work compared to, say, DynamoDB?
Here’s one table comparing the two:
Adapting to hotspots is another test:
That’s quite respectable.
The Storage Bits take
If I were Dell/EMC or NetApp, I’d be worried. Large scale public cloud storage is less than a decade old, and is rapidly maturing, as the lack of growth in enterprise storage attests.
Anna is important not only for performance gains, but for its focus on cost. Cloud storage headline rates seem reasonable, but when you add in all the overhead costs for directory lookups and data networking, enterprise storage is a lot more competitive.
The cloud vendors have as many PhDs as Berkeley does — and the paper’s authors have probably received job offers already — so expect to see something like Anna productized in the near future.
Anything that makes storage more efficient at a lower cost is a win for our developing digital civilization. But perhaps not so much for enterprise storage vendors.
Courteous comments welcome, of course.