What It Is

Welcome to Aspen, a scalable, general-purpose distributed data platform for building higher-level distributed systems. To date, nearly every major and successful distributed system, including Elasticsearch, ZooKeeper, CockroachDB, and etcd, has relied on its own custom-designed approach to distributed data management. Each starts at the lowest level and must first solve the complex problems of data replication, repair, redundancy, and scalability before it can even begin to address its intended application domain. Aspen’s goal is to solve these challenges in a reusable way so that applications built on top of it can focus on the problems they’re intended to address rather than re-inventing solutions to low-level data management issues.

To this end, Aspen employs a set of trade-offs intended to maximize flexibility and run-time adaptation rather than provide optimal performance for one specific workload and runtime environment. It won’t be a perfect fit for every application domain, but the goal is to provide a solid basis for a wide variety of systems. Distributed file systems, object storage solutions, databases, and distributed indices are a few areas that are a good fit for Aspen, though ultra-high-performance streaming systems like Kafka probably wouldn’t be an ideal use case.

How It Works

All data in Aspen is contained in “objects” that are relatively small, ranging from a few bytes to tens of megabytes in size. There is no strict size limit, but an object must fit comfortably in memory. Objects are updated transactionally with the Atomic, Consistent, and Durable guarantees of the standard ACID transaction model. The missing attribute from that list, Isolated, is also achievable through Read Atomic Multi-Partition (RAMP) transactions, though this has yet to be implemented. Transactions can update multiple objects simultaneously with all-or-nothing semantics, and they do so with a single round-trip message delay in the absence of errors or contention.
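To make the model concrete, the sketch below shows roughly what an atomic multi-object update might look like from a client’s perspective. The types and method names here (Transaction, overwrite, commit, AspenClient) are assumptions made for this illustration, not Aspen’s actual API.

```scala
import scala.concurrent.Future

// Hypothetical stand-ins for client-side types; names and signatures are
// illustrative only and do not reflect Aspen's actual interfaces.
final case class ObjectPointer(bytes: Array[Byte])
final case class ObjectRevision(value: Long)

trait Transaction {
  // Stage an overwrite of an object, contingent on it still being at `expected`.
  def overwrite(ptr: ObjectPointer, expected: ObjectRevision, data: Array[Byte]): Unit
  // Attempt to commit all staged updates; they apply together or not at all.
  def commit(): Future[Unit]
}

trait AspenClient {
  def newTransaction(): Transaction
}

// Atomically update two directory objects, e.g. moving an entry between them.
// If the commit fails, neither object is modified.
def moveEntry(client: AspenClient,
              src: ObjectPointer, srcRev: ObjectRevision, srcData: Array[Byte],
              dst: ObjectPointer, dstRev: ObjectRevision, dstData: Array[Byte]): Future[Unit] = {
  val tx = client.newTransaction()
  tx.overwrite(src, srcRev, srcData)
  tx.overwrite(dst, dstRev, dstData)
  tx.commit()
}
```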

Unlike most other storage solutions, which tend to rely on consistent hashing, sharding, or replicated state machines to achieve robustness and scale, Aspen uses a model based on explicit, opaque “pointers” for storing and locating objects. These pointers are relatively small, typically 50 to 100 bytes, and contain just enough information to locate the objects within the data stores that host them. The stores themselves are logical units of storage that may be freely migrated between physical hosts and backing media to suit changing run-time needs. For example, all of the stores for an application could live on a developer’s laptop during the initial bootstrapping and testing phases and then be migrated out to a cluster of servers in the local data center for deployment. Should additional sites become available, the stores could later be migrated on-the-fly to multiple locations for protection against site outages. Similarly, stores may be placed on backing media that suits their access patterns. For instance, the inodes and directory content of a distributed file system could be placed on low-latency NVMe media to allow for fast file and directory lookups, whereas cheaper spinning disks could back file content, where higher-latency reads are less of an issue.
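As a rough illustration of the idea, a pointer might carry something like the fields below. The field names and layout are assumptions made for this sketch, not Aspen’s actual pointer format.

```scala
import java.util.UUID

// Hypothetical identifier for a logical store; stores can migrate between
// hosts and media without invalidating pointers that reference them.
final case class StoreId(poolId: UUID, storeIndex: Byte)

// Hypothetical pointer: small, opaque to applications, and containing just
// enough information to locate the object's data on its hosting stores.
final case class ObjectPointer(
    objectId: UUID,               // identity of the object
    hostingStores: Seq[StoreId],  // logical stores holding the object's replicas or segments
    sizeHint: Option[Int]         // optional size metadata to aid reads
)

object PointerDemo extends App {
  val pool = UUID.randomUUID()
  val ptr = ObjectPointer(
    objectId = UUID.randomUUID(),
    hostingStores = (0 until 3).map(i => StoreId(pool, i.toByte)),
    sizeHint = Some(4096)
  )
  // The pointer alone is enough to find the object; no global lookup table is consulted.
  println(s"Object ${ptr.objectId} resolves to stores: ${ptr.hostingStores.mkString(", ")}")
}
```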

Another point of flexibility is that the choice between replication and erasure coding is made at allocation time for each object. Replication may be used for very small objects, or for those that need to be processed locally on their storage hosts, at the cost of increased storage utilization, whereas erasure coding may be used for bulk data where storage efficiency matters more.
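A minimal sketch of how such a per-object choice might be expressed is shown below; the type and parameter names are illustrative assumptions, not Aspen’s actual allocation API.

```scala
// Illustrative encoding of a per-object data-distribution choice.
sealed trait DataDistribution
final case class Replication(copies: Int) extends DataDistribution
final case class ErasureCoding(dataSegments: Int, paritySegments: Int) extends DataDistribution

// Small objects, or those that must be processed locally on a host, favor replication;
// bulk data favors erasure coding for storage efficiency. For example, 3-way replication
// stores 3x the data, while a 4+2 erasure code stores only 1.5x.
def chooseDistribution(expectedSize: Long, needsLocalProcessing: Boolean): DataDistribution =
  if (needsLocalProcessing || expectedSize <= 64 * 1024)
    Replication(copies = 3)
  else
    ErasureCoding(dataSegments = 4, paritySegments = 2)
```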

A full writeup on Aspen’s architecture, its components, transaction model, and how they all fit together can be found on the Architecture page.

Current Status

Aspen has been in development for some time and is currently on its sixth ground-up rewrite. Various programming languages and design approaches were experimented with before landing on the current implementation, which seems to hit a sweet spot: it compartmentalizes most of the complex subsystems and provides a well-structured, maintainable codebase. The current implementation is written in Scala and makes heavy use of its strong support for futures and asynchronous programming. Scala’s concise syntax and support for implicit arguments also make the code particularly readable and minimize the cognitive load required to understand and maintain an asynchronous codebase.

At present, Aspen is in an alpha stage in which most of the critical subsystems and basic error-handling code have been implemented. As a proof-of-concept application, it also includes a distributed file system called AmoebaFS. The filesystem is implemented in terms of Aspen objects and is so named for its ability to adapt to, and optimize itself for, heterogeneous storage hardware and changing runtime environments. It also demonstrates Aspen’s distributed tasking model, which handles multi-step operations in which each step must happen exactly once, such as creating a file inode and inserting it into a directory. These multi-step operations are fully durable and will eventually succeed despite partial or full system crashes at any point in the process.
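The sketch below conveys the general shape of such a durable, resumable task: each step durably records its completion so that recovery after a crash resumes at the first unfinished step. The trait and method names are assumptions made for this example and do not describe Aspen’s actual tasking API.

```scala
import scala.concurrent.{ExecutionContext, Future}

// A durable record of which steps have completed. In a real system, a step's effects
// and its completion marker would need to commit atomically (e.g. in one transaction)
// for the step to be truly exactly-once.
trait TaskJournal {
  def isDone(step: String): Future[Boolean]
  def markDone(step: String): Future[Unit]
}

class CreateFileTask(journal: TaskJournal)(implicit ec: ExecutionContext) {

  // Run an action only if its step has not already been recorded as complete.
  private def step(name: String)(action: => Future[Unit]): Future[Unit] =
    journal.isDone(name).flatMap {
      case true  => Future.unit                                   // completed before a crash; skip
      case false => action.flatMap(_ => journal.markDone(name))   // run, then record completion
    }

  // Rerunning the task after any crash resumes at the first unfinished step.
  def run(allocateInode: () => Future[Unit],
          insertIntoDirectory: () => Future[Unit]): Future[Unit] =
    for {
      _ <- step("allocate-inode")(allocateInode())
      _ <- step("insert-into-directory")(insertIntoDirectory())
    } yield ()
}
```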

Given the relative youth of the current codebase, many of the error-handling and recovery implementations are, at present, too simplistic for real-world use. For example, read operations currently fetch the full object content from every replica rather than taking the more intelligent approach of reading the full content from just one replica and only the metadata from the rest. Fortunately, Scala makes it easy to compartmentalize such algorithms, and all of these instances are abstracted away behind easily pluggable interfaces. As more advanced implementations are created, they can be dropped in as replacements for the overly simplistic ones.
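As an illustration of that pluggability (using made-up names rather than Aspen’s actual interfaces), a read strategy might be a small trait, with the current simplistic behavior and its smarter replacement as interchangeable implementations:

```scala
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical types for this sketch.
final case class StoreId(index: Int)
final case class ReadResult(metadata: Map[String, String], content: Option[Array[Byte]])

trait ObjectReader {
  def readReplica(store: StoreId, contentNeeded: Boolean): Future[ReadResult]
}

// The pluggable interface: how to gather a consistent view of an object from its replicas.
trait ReadStrategy {
  def read(stores: Seq[StoreId], reader: ObjectReader)
          (implicit ec: ExecutionContext): Future[Seq[ReadResult]]
}

// Current, simplistic behavior: pull full content from every replica.
object ReadAllReplicas extends ReadStrategy {
  def read(stores: Seq[StoreId], reader: ObjectReader)
          (implicit ec: ExecutionContext): Future[Seq[ReadResult]] =
    Future.sequence(stores.map(reader.readReplica(_, contentNeeded = true)))
}

// Drop-in replacement: content from one replica, metadata only from the rest.
object ReadOneContent extends ReadStrategy {
  def read(stores: Seq[StoreId], reader: ObjectReader)
          (implicit ec: ExecutionContext): Future[Seq[ReadResult]] =
    Future.sequence(stores.zipWithIndex.map { case (s, i) =>
      reader.readReplica(s, contentNeeded = i == 0)
    })
}
```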

Aspen’s source code may be found in a project repository on GitHub and is distributed under the LGPL license. The intent behind this license is to encourage use in both open-source and proprietary projects; the main ask is that any enhancements made to Aspen itself in support of an application be contributed back so that other projects may benefit from the improvements.