In nature, Aspens form large, winding groves of trees. What appears to be a grove of many individual trees is really a single large organism with roots interconnected in a graph-like structure. Essentially, Aspens are nature’s version of a distributed tree, and they can span hundreds of hectares while surviving for thousands of years. Similarly, this system is based on robust distributed tree structures and may be extended to support arbitrary graphs of distributed objects.
Aspen is similar in concept to the Ceph filesystem’s Reliable Autonomic Distributed Object Store (RADOS) which provides a distributed object system upon which the Ceph filesystem was then constructed. Aspen expands upon this basic concept by providing a system with much more flexible transaction and data placement models. This allows for the creation of arbitrary distributed data structures that can be tailored to the specific needs of the application and its operating environment.
For example, a distributed file system could place the objects holding inodes and directory structures in data stores backed by SSD media to enable fast directory traversal while the file data could be placed in stores backed by spinning disks to keep costs down. Similarly, a system storing Internet of Things (IoT) sensor data could use a dispersed B-tree with the upper tiers of the tree replicated to each site to allow for fast, local navigation whereas the data itself could be erasure coded and geo-dispersed for availability and space savings.
System Components
From an architectural perspective, there are only four main components: Storage Pools, Data Stores, Objects, and Object Pointers. The logical relationship between these components is as follows: Storage Pools contain Data Stores, Data Stores contain Objects, and Objects contain Object Pointers.
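As a rough illustration of this containment hierarchy, the sketch below models the four components as simple Python structures. The names and fields here are illustrative assumptions rather than Aspen's actual types.

```python
# Illustrative containment model only; these are not Aspen's actual types or fields.
from dataclasses import dataclass, field
from typing import Dict, List
from uuid import UUID

@dataclass
class ObjectPointer:        # embedded within Objects to reference other Objects
    object_uuid: UUID
    storage_pool_uuid: UUID

@dataclass
class AspenObject:          # holds opaque data and/or serialized Object Pointers
    uuid: UUID
    pointers: List[ObjectPointer] = field(default_factory=list)

@dataclass
class DataStore:            # holds object slices/replicas
    pool_index: int
    objects: Dict[UUID, AspenObject] = field(default_factory=dict)

@dataclass
class StoragePool:          # aggregates Data Stores
    uuid: UUID
    stores: List[DataStore] = field(default_factory=list)
```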
Storage Pools
Storage pools aggregate Data Stores into logical units for object allocation, data dependency constraint, error recovery, and physical placement purposes.
Object Allocation
Storage pools must contain a number of Data Stores at least equal to the width of the widest Information Dispersal Algorithm (IDA) intended for use with the pool, where the IDA specifies replication or erasure coding and the width refers to the number of replicas or erasure-coded object slices. Ideally, pools would also contain a few Data Stores beyond this minimum in order to allow allocations to succeed at full width even if one or two Data Stores are offline at the time of the allocation. The number of Data Stores should generally be kept small, though, as adding stores beyond the maximum supported width decreases the efficiency of the rebuild procedure for failed Data Stores.
Data Dependency Constraints
Storage pools can operate in almost total isolation from other storage pools. The Data Stores within the pool share object data with each other but never with stores outside the pool. This is advantageous for a number of reasons but is particularly important for the ability to accurately determine the health of the pool based on the health of its member stores. It is also helpful for determining whether or not it is safe to take a Data Store temporarily offline for maintenance.
Error Recovery
Aspen uses a self-hosting model for error recovery in which all data needed for recovering from physical media failures and missed object update transactions is stored in distributed tree structures constructed out of Aspen objects. The Error Recovery section goes into much greater detail but, essentially, two types of trees are maintained for each pool: an Allocation Tree, which maintains Object Pointers to all objects in the pool, and Missed Transaction trees, which contain an entry for each committed transaction that one or more peers are suspected of having failed to properly process. These trees may be hosted entirely within the pool itself to maximize isolation or may be placed elsewhere for efficiency and/or durability reasons.
Physical Placement
For a geographically dispersed Storage Pool, the members of the pool may be spread out amongst the available physical locations. Alternatively, all members of a pool could be co-located at the same site in order to minimize access and update latency to the objects contained by those stores. Similarly, latency can be tuned by ensuring that all stores within the pool are backed by media with the appropriate access characteristics. Navigation of distributed trees, for example, benefits tremendously from ensuring that all of the upper branches of the tree are backed by solid state media.
For some workloads, data ages over time and is needed less and less frequently. To manage this effectively, pools may be backed by fast media while the data is fresh and then migrated down to less performant but correspondingly less expensive hardware as the data ages.
Data Stores
Data Stores are logical entities that store object slices and replicas. They do not have a fixed network endpoint or required type of storage media and they are allowed to freely migrate between physical machines and backing devices to best suit the needs of the application and operating environment. The architecture does not place any limitations on the allowable size for data stores but relatively small stores will generally be preferable since smaller stores are faster to migrate and they are more likely to fit within the size constraints commonly imposed by faster physical media.
When allocating space for an object slice or replica, the store may return a token that identifies the location of the object within the store. This value is embedded within the Object Pointer and is used every time access to the slice/replica on that store is needed. The token itself is treated as an opaque value by the rest of the system and there are no hard restrictions on its allowable size. Consequently, the data store may return whatever value best suits the underlying implementation. A store backed by a flat-file, for example, could simply return the offset of the object within the file.
Though seemingly simplistic, the ability to place offsets in object pointers makes flat-files surprisingly effective data stores. Assuming the file is stored contiguously on disk, locating and accessing the object data requires only a single head swing when the store is backed by rotating media. Also, there is virtually no memory or CPU overhead, which makes them suitable for resource-constrained hosting machines.
Another option for data store implementation is a key-value database such as LevelDB where an opaque key, rather than an offset, is stored in the object pointer. Such a store would avoid the wasted space that is inherent to contiguous, pre-allocated files when utilization is low but comes at the cost of additional lookup latency and higher resource requirements. Fortunately though, data stores are not necessarily locked in to any particular implementation. Just as stores may be migrated between machines and physical media, so too may their data be migrated between differing implementations so long as the tokens stored within the object pointers remain valid. A store could, for example, start off with a flat-file implementation and later be migrated to a LevelDB database after dropping below a certain utilization threshold. The offsets of the objects in the flat-file implementation would then become the keys used to store the objects in LevelDB.
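As a minimal sketch of the token idea, assuming a hypothetical flat-file store, the offset of each stored slice can double as its opaque token; the class and method names below are illustrative only.

```python
# Hypothetical flat-file data store sketch: the slice's file offset serves as its opaque token.
import os

class FlatFileStore:
    def __init__(self, path: str):
        self.f = open(path, "a+b")

    def allocate(self, slice_data: bytes) -> bytes:
        """Append the slice and return its offset encoded as an opaque token."""
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(len(slice_data).to_bytes(4, "big"))
        self.f.write(slice_data)
        self.f.flush()
        return offset.to_bytes(8, "big")    # this value gets embedded in the Object Pointer

    def read(self, token: bytes) -> bytes:
        """A single seek locates the slice when the token is simply an offset."""
        offset = int.from_bytes(token, "big")
        self.f.seek(offset)
        length = int.from_bytes(self.f.read(4), "big")
        return self.f.read(length)
```

Migrating such a store to a key-value database is then a matter of re-inserting each slice under its existing offset token as the key, leaving the Object Pointers untouched.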
Data stores are permanently assigned to and tightly coupled with their hosting Storage Pool. Stores are uniquely identified by their hosting pool’s UUID plus the index of the store within the pool’s array of data stores.
Objects
Objects in Aspen come in two types. First are the “normal” Data Objects, which contain opaque data that the system knows nothing about. Second are Key-Value Objects, whose content is broken out into a series of key-value pairs. Key-Value Objects are useful primarily because concurrent transactions referencing disjoint sets of keys will not conflict, which makes them well suited to avoiding update contention when designing distributed data structures.
Immutable Attributes
UUID
All objects in Aspen are assigned a random UUID to uniquely identify the object within the system.
ObjectType
Data or Key-Value
ObjectSize
Aspen object allocation may optionally resemble malloc() in that the size of the object can be fixed at the time of allocation. Objects may use less than the total allocated space but they cannot grow beyond it if the Object Size is defined.
IDA
The Information Dispersal Algorithm (IDA) describes whether the object is replicated or erasure coded and the associated parameters such as replication factor and read/write threshold requirements.
Mutable Attributes
ObjectContent
The data within an object may be modified through the use of immediately-consistent, atomic transactions that either append to the current data stored within the object or overwrite it entirely or, in the case of Key-Value objects, insert, update, or delete KV pairs.
ObjectRevision
An object’s revision is the UUID of the last transaction to successfully modify the object. The primary use of the object revision is to ensure that transactional updates are modifying the expected version of the object, though it is also needed by the transaction model to resolve certain error conditions.
ReferenceCount
Reference counts on Aspen objects are employed in the traditional manner where objects are deallocated when their reference count reaches zero. Updates to the reference counts are done in an atomic manner and are independent of the object content. Simultaneous updates to both the reference count and object content by different transactions will not conflict.
Supported Operations
Reference Count Update
Changes the reference count to the new value specified in the transaction. These will typically involve a single increment/decrement from the current value but larger changes are possible with multi-object transactions that add or remove multiple references to the object within the scope of a single transaction.
Append Data
Appends the data specified in the transaction to the content currently stored by each host.
Overwrite Data
Completely replaces the existing content of the object with the new data specified in the transaction.
Key-Value Update
Inserts, updates, or deletes Key-Value contents in an object. Each KV pair has an HLC (Hybrid Logical Clock) timestamp associated with it. Transactions may include “Key does not exist”, “Key exists”, “Timestamp is less than”, or “Timestamp equals” conditions in their commit requirements.
Version Bump
Version bumps update the Object Revision but leave the object otherwise unchanged. As the Error Recovery section explains, this can be used to ensure the failure of any potentially outstanding transactions that are attempting to modify an object.
Revision Lock
Locks the revision of an object for the duration of a transaction. This may be used to ensure that an object is not modified while the transaction is being processed.
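A compact way to picture the operation set and the Key-Value commit requirements described above is as a pair of enumerations. The sketch below is purely illustrative; the names do not reflect Aspen's actual wire format or API.

```python
# Illustrative enumerations of object operations and KV commit requirements (not Aspen's API).
from enum import Enum, auto

class ObjectOperation(Enum):
    REFERENCE_COUNT_UPDATE = auto()
    APPEND_DATA = auto()
    OVERWRITE_DATA = auto()
    KEY_VALUE_UPDATE = auto()
    VERSION_BUMP = auto()
    REVISION_LOCK = auto()

class KeyRequirement(Enum):
    KEY_DOES_NOT_EXIST = auto()    # insert-only semantics
    KEY_EXISTS = auto()            # update/delete semantics
    TIMESTAMP_LESS_THAN = auto()   # stored pair's HLC timestamp must be older than the given value
    TIMESTAMP_EQUALS = auto()      # stored pair's HLC timestamp must match exactly
```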
Object Pointers
Unlike most scalable storage systems, which typically rely on some form of consistent hashing or sharding, Aspen uses explicit, immutable pointers with embedded metadata to locate objects in the storage network.
The pointers do not have a fixed size due to the variable number of replicas/slices that may be used for an object. Additionally, each store may provide a unique token during object allocation that identifies the location of the replica/slice within the store. Use of this token is not mandatory and some data stores may choose not to use it; a RocksDB-backed store, for example, may choose to use the object UUID as the key rather than generating another unique key for the object. If provided, the token is treated as an opaque, binary string and is also allowed to vary in size. Generally, the serialized size of Object Pointers is relatively small and will usually fall in the range of 50–100 bytes.
Object Pointer Contents
ObjectUUID
Identifies the object being pointed to and is used during read/write operations to ensure that the correct object is being accessed.
StoragePoolUUID
Storage Pool the object is contained in.
ObjectType
Data or Key-Value.
IDAConfiguration
Replication/erasure-coding algorithm, width, read-threshold, and write-threshold.
ObjectSize
Optional maximum size of the object in bytes.
SliceMap
Associative array of (Data Store ID -> sliceToken). The purpose of the slice map is to specify which stores host slices/replicas of this object as well as the store-specific sliceToken that must be used to retrieve the slice/replica from each store.
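Putting these fields together, an Object Pointer might be modeled roughly as follows. The field names, types, and layout are assumptions made for illustration; only the list of contents above comes from the design itself.

```python
# Rough, illustrative model of an Object Pointer; names and layout are assumptions.
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional
from uuid import UUID

class ObjectType(Enum):
    DATA = 0
    KEY_VALUE = 1

@dataclass(frozen=True)
class IDAConfiguration:
    algorithm: str           # e.g. "replication" or an erasure-coding scheme
    width: int               # number of replicas or slices
    read_threshold: int
    write_threshold: int

@dataclass
class ObjectPointer:
    object_uuid: UUID
    storage_pool_uuid: UUID
    object_type: ObjectType
    ida: IDAConfiguration
    object_size: Optional[int]             # optional maximum size in bytes
    slice_map: Dict[int, Optional[bytes]]  # store index -> opaque slice token, if one was provided
```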
Transaction Model
Aspen’s transaction model provides immediately consistent, multi-object, atomic updates through the use of a single instance of the Paxos consensus algorithm for each transaction. In terms of concurrency control, this model could generally be described as Optimistic Concurrency Control (OCC) but there are some concepts blended in from a lock-based, 2-phase commit approach as well. Unlike many other transaction-based systems, there is no concept of “Rolling-back” aborted transactions in Aspen. Transactions are all-or-nothing operations with no intermediary results. Aborted transactions simply fail to make any visible changes.
At the basic level, transactions define a list of requirements that must be met in order for the transaction driver to attempt to resolve a Commit decision and a list of cleanup activities that must occur before the participating nodes are allowed to forget a successful transaction’s existence (delete all associated state). The resolution process uses the standard Paxos algorithm but the Prepare and Promise messages are overloaded to carry some additional data beyond what Paxos itself requires. A Transaction Description detailing the transaction requirements and slice/replica data is sent along with the Prepare message. Each participant’s vote for whether or not the transaction should be allowed to commit is sent along with the corresponding Promise message.
In order for the stores hosting slices/replicas of the target object to vote for transaction committal, they must first check their local state for all Transaction Requirements on all objects they host. Only if all of the transaction requirements are met may the store vote for committal. As part of the transaction requirements the Transaction Description may specify Object Revision and/or Object Refcount “locks”. If the object matches the lock requirements and the store votes for committal of the transaction it must deny all newer transactions seeking to modify those attributes of the objects until the locking transaction has completed. This is similar in concept to 2-phase locking but the locks only exist for as long as it takes to resolve the single Paxos instance. Once the transaction resolves to either a commit or an abort the objects are immediately unlocked.
The commit quorum may, and often will, exceed the minimum quorum requirement needed by Paxos. For an erasure-coded object, for example, the total number of slices may be twelve with eight required in order to reconstruct the object. At a minimum the commit requirement in this case would need to be set to eight to ensure that the updated object could be reconstructed. However, for reliability purposes, that number should probably be increased to ten or so to ensure that failures of one or more nodes will not bring the object below the reconstruction threshold. The IDA contained in the object pointer specifies a “write-threshold” which defines the minimum commit quorum requirement. Additionally, the commit quorum for each Transaction Requirement must be achieved independently. If a commit quorum for any of the requirements cannot be achieved, the transaction driver must attempt to resolve an Abort decision. Once the Commit/Abort decision is made, the remaining message exchange consists of standard Paxos Accept & Accepted messages.
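The quorum rule can be summarized as a small predicate: every Transaction Requirement must independently gather at least a write-threshold number of commit votes. The sketch below assumes a single write-threshold for simplicity.

```python
# Simplified quorum check: each requirement independently needs write_threshold commit votes.
from typing import Dict, List

def commit_quorum_reached(votes_by_requirement: Dict[str, List[bool]],
                          write_threshold: int) -> bool:
    """votes_by_requirement maps a requirement id to the commit votes received so far."""
    return all(sum(votes) >= write_threshold
               for votes in votes_by_requirement.values())
```

For the erasure-coded example above (width twelve, reconstruction threshold eight), write_threshold would be set to something like ten.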
For multi-object transactions, rather than attempting to use all of the hosting stores for every object in the Paxos protocol, a single object from the set of objects in the transaction is chosen to execute the transaction. To select that object, the IDAs of the objects are compared and the one with the most reliable IDA is used. Once chosen, a single online store hosting the object’s data is randomly selected to serve as the designated initial leader for the transaction. This allows the client application to send the Prepare messages directly to all of the stores, which then respond with Promise messages to both the client and the designated initial leader. When the designated initial leader sees the Promise messages for the transaction, it takes over responsibility for driving the transaction to closure. If a write-threshold number of stores voted to commit alongside their Promise messages, the designated initial leader will attempt to resolve a Commit. Otherwise it will attempt to resolve an Abort.
Contention Mitigation
As with most Optimistic Concurrency Control (OCC) systems, transactions simultaneously attempting to update an object can conflict with one another and prevent any transaction from gaining a quorum of votes for committal. The naive solution is to abort any transaction that fails to achieve the write threshold and leave it to application-level code to retry the transaction at a later time. This works but it defers contention mitigation to the application layer and does nothing to ensure forward progress. A better solution is to abort all but one of the contending transactions. The remaining transaction may continually attempt to gain commit permission from stores that were previously locked to one of the contenders. As the contending transactions abort, the continuing transaction can make progress toward gaining commit approval.
A transaction driver can decide whether to continue or abort by examining the replies it received from the stores and applying the following rules (a sketch of this decision logic follows the list):
- If success is impossible because more than width − write-threshold stores have voted to Abort, abort.
- If a write-threshold number of stores granted permission, commit.
- If no store granted commit permission, abort.
- Examine the timestamps of the other transactions for which stores granted commit permission. Continue if this transaction has the lowest timestamp. Otherwise abort.
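Under the assumption that each store's reply reports whether it granted permission, voted to abort, or is currently locked to a rival transaction, the rules above might be expressed as the following decision function. The reply structure and names are illustrative only.

```python
# Illustrative decision logic for a contending transaction driver; types are assumptions.
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional

class Decision(Enum):
    COMMIT = auto()
    ABORT = auto()
    CONTINUE = auto()

@dataclass
class StoreReply:
    granted: bool                   # this store granted commit permission to our transaction
    aborted: bool                   # this store voted to abort our transaction
    rival_timestamp: Optional[int]  # timestamp of the transaction currently holding the lock, if any

def resolve(replies: List[StoreReply], width: int,
            write_threshold: int, my_timestamp: int) -> Decision:
    aborts = sum(1 for r in replies if r.aborted)
    grants = sum(1 for r in replies if r.granted)
    if aborts > width - write_threshold:       # success is no longer possible
        return Decision.ABORT
    if grants >= write_threshold:              # a quorum of permissions was achieved
        return Decision.COMMIT
    if grants == 0:                            # no store granted commit permission
        return Decision.ABORT
    rivals = [r.rival_timestamp for r in replies if r.rival_timestamp is not None]
    if all(my_timestamp < t for t in rivals):  # lowest timestamp continues; others abort
        return Decision.CONTINUE
    return Decision.ABORT
```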
To prevent newly arriving transactions from interfering with the continuing transaction, each store should track the outstanding transactions on each object. When a transaction to which it has granted permission aborts, the store may immediately grant permission to the outstanding transaction with the lowest timestamp. Having the stores take an active role here also reduces the resolution latency that would otherwise be induced by having the continuing transaction repeatedly poll for approval. A polling mechanism is still required to ensure that a sufficient number of stores become aware of the continuing transaction, but stores that are already aware of it need not wait for the polling cycle to elapse before sending permission.
Optimistic Commits
Paxos normally requires four message delays in order to achieve consensus. Two delays in the network round trip to establish permission to propose a value and two more message delays in the second network round trip required to gain acceptance of that value. A long-standing observation of Paxos is that correctness is not threatened by locally generating and processing some of the required messages as long as the messaging restrictions are not violated. In certain circumstances this allows for a reduction in the typical four message delays required to achieve consensus.
Optimistic commits require three message delays to resolve a Commit decision by allowing the data stores to locally generate and process the Accept:Commit message. To do this, the Promise messages sent in response to the initial Prepare message are sent to all Data Stores in addition to the initial proposer. Each store then analyzes the promises from the other stores. If a Data Store sees that a write-threshold number of stores has voted for committal, it locally generates and processes the Accept:Commit message for the first round of the algorithm, thereby avoiding the message delay that would otherwise be required for the proposer to send the message over the network. In the failure-free case this allows consensus on Commit to be resolved in three message delays but comes at the cost of requiring a total of six message delays to resolve an Abort instead.
In the case where the commit threshold is not reached and the proposer wishes to abort the transaction, it must then send a new Prepare message for round 2 of the Paxos algorithm in order to gain permission to make a proposal for that round. This step may not be avoided as it is possible that one or more of the other peers on the network may have seen that the commit threshold was indeed reached and could have processed the Accept:Commit message for the first round. If that occurs, the proposer may be forced to resolve a Commit in the second round.
The additional three message delays required to resolve Abort decisions could adversely impact performance when contention is high. Applications may consequently elect to disable optimistic commits at such times.
Returning Success to the Client in a Single Round Trip
Although full resolution requires three message delays (1.5 round trips) in the optimistic commit model, success may be returned to the client in a single round trip if sufficient Promise messages are received. As with Fast Paxos, when more than two-thirds of the received Promise messages for each object in the transaction vote for committal, it becomes impossible for the transaction driver to resolve anything other than a Commit. When this occurs, success may be immediately returned to the client application.
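A sketch of that fast-path check, assuming the promises are tallied per object:

```python
# Fast-path check: when more than 2/3 of the received promises for every object vote to
# commit, no outcome other than Commit remains possible, so the client may be answered early.
from typing import Dict, List

def can_ack_client_early(promises_by_object: Dict[str, List[bool]]) -> bool:
    """promises_by_object maps an object UUID to the commit votes carried on its Promises."""
    return all(votes and sum(votes) * 3 > len(votes) * 2
               for votes in promises_by_object.values())
```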
Finalization Actions
Aspen expands upon the Paxos algorithm’s requirement of saving a small amount of state to a stable storage medium during the resolution process. In addition to saving the state required to ensure correct consensus, Aspen also requires that the Transaction Description, commit/abort vote, and pending slice/replica updates survive system crashes and power outages. This protects the data and transaction models from corruption and, equally important, it protects against the loss of any Finalization Actions contained within the Transaction Description. Finalization Actions are idempotent actions that must be completed before peers are allowed to delete the state associated with a successful transaction, and they are executed by whichever peer drives the consensus algorithm to resolution. They are particularly important tools for ensuring that all necessary error-correcting information is captured by the system but they are also general-purpose utilities with many potential uses.
At a minimum, each transaction carries an implicit Finalization Action to create an entry in the Missed Transaction tree if one or more peers are suspected of having failed to properly process the transaction. Entries made in Missed Transaction trees allow for recovery from temporary offline periods and transient network failures. Another essential Finalization Action applies to all allocation transactions and ensures that a pointer to the newly allocated object is recorded in its Storage Pool’s Allocation Tree. When a Data Store fails and must be reconstructed, every object hosted by that store may be discovered by walking the allocation tree.
Finalization Actions are also useful for maintenance activities that, while required, need not be atomic and part of the transaction. By moving non-essential object updates out of the transaction and into finalization actions, the potential for update contention is reduced so the transactions are correspondingly more likely to succeed.
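Conceptually, Finalization Actions can be pictured as idempotent callbacks attached to a Transaction Description, as in the sketch below; the interface and class names are assumptions rather than Aspen's actual API.

```python
# Illustrative model: Finalization Actions are idempotent steps that must all complete
# before a peer may discard a successful transaction's state. Names are assumptions.
from abc import ABC, abstractmethod
from typing import List

class FinalizationAction(ABC):
    @abstractmethod
    def execute(self) -> bool:
        """Perform the action; must be safe to retry until it reports success."""

class RecordMissedTransaction(FinalizationAction):
    """Implicit action: note the transaction in the Missed Transaction tree of suspected peers."""
    def __init__(self, suspected_store_ids, txn_uuid):
        self.suspected_store_ids = suspected_store_ids
        self.txn_uuid = txn_uuid

    def execute(self) -> bool:
        ...   # insert an entry into each suspected store's Missed Transaction tree
        return True

def may_forget_transaction(actions: List[FinalizationAction]) -> bool:
    """Transaction state may be deleted only once every finalization action has succeeded."""
    return all(a.execute() for a in actions)
```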
The requirement of writing slice/replica content to stable media twice for each transaction is an obvious performance bottleneck, but there are some mitigation options available. NVRAM is the obvious first choice of backing medium for transient transaction state but, even with its rapidly falling price point, it is unlikely that pure-NVRAM solutions will be cost effective in the near future. Using NVMe devices or SSDs to back a transaction log will likely be sufficient for most realistic workloads.
Peer Failure
If the peer driving the transaction to consensus fails before consensus is reached, one of the other peers must step in to complete the job. Similar to most other Paxos-based systems, Aspen uses a mechanism based on Failure Detectors to continually monitor the liveness of data stores. While a store is live, it is left alone to process all transactions for which it is the designated leader. When failure of that store is detected, each participant chooses a random timeout within a small window, after which it begins driving the transaction to closure if no other peer has already started to do so.
Peer Interference
Peer interference is largely mitigated through the failure model which attempts to ensure that only one peer is ever attempting to drive the transaction to closure. However, it is possible for multiple peers responding to the failure of the initial leader to begin driving the transaction at approximately the same time. When this occurs, peers use a simple exponential back-off mechanism to ensure that one of them eventually succeeds.
In contrast to most Multi-Paxos systems, an exponential back-off scheme makes sense for Aspen due to the spread-out nature of the objects and the comparative rarity of peer interference. Rather than using a single replicated log for transactions across all objects, Aspen uses individual objects as the targets of its transactions, and single-synod Paxos instances are sufficient for updating their state.
Read-Atomic Isolation
By default, object reads do not provide inter-object consistency guarantees with respect to transactions. That is to say, if objects X and Y are both modified by the same transaction, a subsequent read of both objects may see one with the transaction committed and the other without. RAMP Transactions may be used to provide consistent multi-object read guarantees for Aspen objects. This works by reading both the objects and all outstanding transactions modifying their content. The transactions are then analyzed to see whether a consistent set of object states was read and, if not, subsequent reads are performed until a valid set of objects is obtained.
Distributed Data Structures
Distributed data structures are formed by embedding serialized Object Pointers in Aspen objects. Similar to most programming languages with reference-counted objects, Aspen supports both strong and weak references to objects where strong references increment/decrement the reference count of their pointed-to objects when they are created/destroyed and weak references do not. There is no physical distinction between the two so it is up to the application logic to determine which type of reference each pointer is. Objects also do not have a special interface for dealing with embedded object pointers so it is up to either the Aspen system itself or the immediately superior application layer to ensure that the embedded pointers and reference counts are handled appropriately. In particular, the agent attempting to initiate a transaction that will set the reference count of an object to zero, thereby deleting it, must ensure that all of the references within that object have been properly dealt with or that provisions have been made to ensure that they eventually will be.
Singly Linked List
Singly Linked Lists in Aspen work in the expected manner. Insertion and deletion transactions simultaneously update the inserted/deleted node and the node referencing it. Of note is that Key-Value objects have built-in support for embedded Minimum and Maximum values along with explicit support for left and right pointers. These features are intended to support including Key-Value objects in linked lists in order to expand the amount of sorted data that can be stored beyond the limits of a single object.
Trees
There are many potential ways to implement trees on top of Aspen objects. A simple method with surprisingly good contention and consistency characteristics is a B-tree-like structure using tiers of singly-linked lists of Key-Value objects, where the bottom tier stores data and the upper tiers store only weak references to the nodes in the tier below them. As the tree grows in size, new tiers are added above what was previously the top tier. The notation used here is that tier-0 is the bottom tier and tiers-1+ refer to the tiers above it.
Each node in the tree has an immutable minimum attribute which specifies the minimum allowable key that may be stored in or below this node in the tree. This value is used as the key in the next tier up with the value being an object pointer that points to the node and allows for trivially simple navigation. When searching for a key from the top of the tree, simply follow the pointer with the largest minimum value that is less than the target key.
Each object also has a maximum attribute that corresponds to the minimum of the node to its immediate right. Each time a node is inserted into the linked list the maximum value of the node to the left of the insertion is updated along with the pointer to the newly-inserted object.
Because singly-linked lists are consistent in terms of navigation, navigating the data stored in a tier of the tree from left to right is guaranteed to be consistent. Navigation between the tiers, however, is not guaranteed to be consistent unless multi-object transactions are used to simultaneously update the nodes to the left of and above the node being inserted. While using multi-object transactions for this purpose is certainly possible, the stronger consistency guarantees will usually not be worth the additional potential for update contention. If the references to inserted nodes are instead placed in the upper tiers through Finalization Actions, which happen out-of-band with and shortly after completion of the transaction that performs the insertion, there is a small window of time in which the newly inserted node will be present in the lower list but not in the upper one. When navigating the tree, however, this is generally not a significant problem as the error is easily detected and may be corrected by simply scanning to the right until the target node is found. This mode of navigation is similar to the approach described in “Efficient Locking for Concurrent Operations on B-Trees” by Lehman and Yao. An additional benefit of being able to detect and easily work around such inter-tier inconsistencies is that stale, cached copies of the upper tiers of the tree will not cause lookup or insertion failures. At worst, out-of-date copies will induce a small performance penalty.
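The navigation described above, including the rightward correction, can be sketched as follows. The node fields (minimum, maximum, children, pairs, right) are assumptions chosen to mirror the description, and the sketch assumes the left-most node's minimum is below any searched key.

```python
# Sketch of downward tree navigation with the Lehman/Yao-style rightward scan described above.
from typing import Optional

class Node:
    def __init__(self, minimum, maximum=None, right=None, children=None, pairs=None):
        self.minimum = minimum           # immutable minimum key for this node
        self.maximum = maximum           # equals the minimum of the node to the immediate right
        self.right = right               # next node in this tier's singly linked list
        self.children = children or {}   # tiers-1+: minimum key -> node in the tier below
        self.pairs = pairs or {}         # tier-0: the actual key/value data

def search(node: Node, key) -> Optional[object]:
    while node.children:                 # still in an upper tier
        while node.maximum is not None and key >= node.maximum:
            node = node.right            # stale or lagging reference: scan right to correct
        node = node.children[max(k for k in node.children if k <= key)]
    while node.maximum is not None and key >= node.maximum:
        node = node.right                # same correction applies within tier-0
    return node.pairs.get(key)
```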
The use of Key-Value objects in this tree structure minimizes the chances for conflict during updates to all tiers of the tree. Insertion transactions simply require that the key being inserted does not already exist in order for the transaction to succeed. Assuming clients are attempting to insert content with differing keys, insertion failure will most often occur only when the target node is out of space, having reached its configured Object Size. When this happens, the client application simply splits the full node and inserts a new node into the tier.
Deleting large trees requires special attention. Although it would be possible for a client application to attempt to recursively delete the contents of a large tree, this wouldn’t work well in practice since the client would become a single point of failure. Instead the responsibility for the deletion process should be turned over to a Durable Task. Durable Tasks are described in greater detail later but for the moment they may be thought of as a reliable method of executing long-duration activities on Aspen objects. From a client’s perspective the deletion process would then be the simple creation of a Durable Task to delete the contents of a tree.
Leveraging Storage Pools
Storage pools are one of the primary knobs for tuning performance, reliability, and availability of distributed data structures. The following sections describe some of the primary use cases.
Geographic Placement
Storage pools can be used to ensure that slices and/or replicas are distributed between multiple sites to achieve the desired reliability, availability, and latency characteristics. Beyond simply spreading the data around for disaster recovery purposes, intelligent replication of objects at specific sites can also be a performance enhancer for certain workloads. In particular, reads of replicated objects can avoid expensive over-the-WAN messages when reading local but potentially out-of-date copies is acceptable.
Workload Separation
Pools may be used to ensure that their constituent stores are isolated to distinct sets of physical machines. A primary use case for this is workload separation. This could be used, for example, to restrict all of Aspen’s internal support structures and data processing activities to a small set of high-performance systems on an isolated network to ensure that those activities do not adversely affect end-user applications. It could also be leveraged to isolate end-user workloads from one another perhaps for quality of service or multi-tenancy reasons.
Media Placement
Storage pools may be used to isolate clusters of objects and ensure that the data stores hosting them are backed by media with the appropriate latency characteristics. Frequently accessed and/or frequently updated objects would benefit from being backed by low-latency, non-rotating media whereas objects that do not have stringent latency requirements would be better placed on less performant but correspondingly less expensive media.
Use cases that are subject to data aging may take advantage of the ability to migrate data stores between backing media and network endpoints. For example, a warehouse for sensor data may require fast access to data collected in the last few weeks, medium access to data from the last 90 days, and very infrequent, slow access to anything older. If the data is stored in tree structures, new storage pools may be created each morning to host the day’s sensor data, and pools that reach a migration threshold may be migrated down to slower storage: SSD -> HDD -> tape.
Some data structures can also leverage physical media to their advantage. Trees, in particular, benefit tremendously from having tier-1+ objects backed by non-rotating media. A five-tier tree backed entirely by disk drives with 7ms access latency incurs a media-induced overhead of 35ms per lookup, whereas using SSDs with 0.3ms latency for the same tree drops that overhead to around 1.5ms. Combining the two approaches, with SSDs for tiers-1+ and disk drives for tier-0, comes out to about 8.2ms, only marginally above the cost of a single disk access.
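The arithmetic behind those figures, assuming one media access per tier:

```python
# Media-induced lookup overhead for a five-tier tree, one access per tier.
hdd_ms, ssd_ms, tiers = 7.0, 0.3, 5
print(tiers * hdd_ms)                 # 35.0 ms: every tier on disk
print(tiers * ssd_ms)                 # 1.5 ms: every tier on SSD
print((tiers - 1) * ssd_ms + hdd_ms)  # 8.2 ms: SSD for tiers-1+, disk for tier-0
```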
Migration
As the stores within pools have no fixed location, they may be migrated to suit the changing needs of the application environment. Migrating data between geographic regions may make sense for a use case where there are more sites than replicas and it is desirable to shift an application from one location to another. In such a case the application would likely benefit from migrating its supporting data to the same location to keep latencies down.
Erasure Coding vs Replication
The primary advantages of erasure coding over replication are storage efficiency and reduced network bandwidth for writes. Whereas five-way replication requires an additional 400% of both, erasure coding at the same level of reliability requires only an additional 60%. The primary drawbacks are expensive rebuild operations and the inability to do local processing of object content, due to that content being sliced up and spread across multiple hosts. The pros and cons of replication are essentially the inverse: the storage cost and write bandwidth requirements are significantly higher, but reconstruction operations are cheap and processing of object content may be done by the stores hosting the copies.
Distributed data structures may mix and match replication with erasure coding to balance the tradeoffs. For example, tree structures will almost always benefit from using replication for tiers-1+ instead of erasure coding. The ability to locally process object content at each store can greatly cut down on the latency and network bandwidth required for navigating the tree structure.
Consider the steps required for a fully erasure-coded tree that is five levels deep and composed of 14KB objects in tier-0 and 80KB objects in tiers-1+. Reads require five round-trips and 334KB of data to be sent over the network. For a similarly configured tree that uses replicated objects in tiers-1+, that can be reduced to an effective three round-trips and 14KB sent over the network by taking advantage of local processing capabilities. Rather than navigating the tree directly, a searcher may instead request that the hosting stores do so in its stead. The request is sent to a store that hosts the object at the top of the tree. That store reads its local copy of the object, determines the next hop, and forwards the navigation request to a store hosting that node, and so on until the node in tier-0 is found. The store hosting that object may then respond directly to the originator of the request. The message pattern in this case looks like: Originator -> StoreA -> StoreB -> StoreC -> StoreD -> StoreE -> Originator
Using replication in this case reduces the network bandwidth needed for tree navigation by roughly 96% and eliminates two network round-trips, which can be particularly beneficial for geo-dispersed systems. An additional benefit of upper-tier replication for geo-dispersed tree structures is that the replication mechanism can be configured to ensure that replicas of all tier-1+ nodes are available at every site, thereby avoiding the need to pay a 90ms+ navigation penalty for each tier in the tree. The total storage requirement for the upper tiers is approximately 1% of the total tree size, so LAN-speed navigation of storage-efficient, erasure-coded, geo-dispersed tree structures can be achieved at a cost of roughly 1% of additional storage overhead per site.
Internal Support Structures
Aspen is a self-hosting system in that almost all system management and error recovery information is stored within the system itself. The following sections describe the internal data structures and how they’re used.
Master Tree
The internal support structures are built on top of distributed trees, the root of which is the Master Tree. This tree contains pointers to the left-most node of every tier of every system-level tree, including those of the Master Tree itself.
All trees are identified by a random UUID, with the single exception of the master tree, which instead uses a zeroed UUID. The keys in the master tree are formed by appending a single byte to the UUID of the tree; the byte specifies the numeric tier being pointed to and the value is the object pointer to the left-most node of that tier. The metadata associated with the tree is also inserted into the master tree with the key UUID + 255. Because all keys for a given tree share the same UUID prefix they sort adjacently, so tree lookups need to navigate at most one additional node to the right in order to find the top-most node of the target tree and its associated metadata. Using a zeroed UUID for the master tree ensures that pointers to the master tree’s upper tiers are sorted into the left-most node of its tier-0 linked list. This node, known as the Radicle, is the ultimate root of the storage system; every object stored in an Aspen system can be found by following pointers from this point of origin. Consequently, bootstrapping an Aspen system requires only an offline copy of the object pointer to the Radicle and the configuration of the initial storage pool. All other system support and management information is stored within the system itself.
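The key scheme might look roughly like the sketch below; the exact byte layout is an assumption for illustration.

```python
# Illustrative master-tree key construction: a tree's UUID plus one trailing byte selecting
# either a tier (0..N) or the tree's metadata entry (255). Byte layout is an assumption.
from uuid import UUID

MASTER_TREE_UUID = UUID(int=0)   # the zeroed UUID reserved for the master tree
METADATA_SUFFIX = 255

def tier_key(tree_uuid: UUID, tier: int) -> bytes:
    return tree_uuid.bytes + bytes([tier])

def metadata_key(tree_uuid: UUID) -> bytes:
    return tree_uuid.bytes + bytes([METADATA_SUFFIX])

# Keys for the master tree's own tiers sort to the very front of the key space, which is
# what places them in the left-most tier-0 node: the Radicle.
assert tier_key(MASTER_TREE_UUID, 0) < tier_key(UUID(int=1), 0)
```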
In addition to object pointers to tiers of trees, the Master Tree may also contain pointers to arbitrary objects that define additional data structures or simply to plain objects. The use of UUIDs as keys ensures that no accidental collisions will occur. Any data that the system needs to store in the global namespace may be inserted into the Master Tree.
Storage Pool Configuration Tree
This tree maps storage pool UUIDs to pointers to the objects that contain their current configuration. The configuration contains at least the following information:
- Network endpoints for each member store
- Maximum supported IDA width
- IDA to use for the Allocation Tree and all Missed Transaction trees
- The UUID of the Allocation Tree for the pool
- The UUIDs of the Missed Transaction trees for each member store
- Maximum slice/object size
Storage Pool Support Trees
Allocation Tree
The purpose of the Allocation Tree is to facilitate the reconstruction of lost data stores by maintaining a weak pointer to every object in the pool. Entries in the Allocation Tree are maintained by Finalization Actions and therefore the content at any given point in time may be slightly out of date. A fully up-to-date view of the allocation tree can be obtained by querying all outstanding transactions and overlaying that information on top of the current state of the tree.
Contention mitigation and buffering techniques are used to support efficient updates to the Allocation Tree when pool members are concurrently allocating objects at a high rate. Both concerns are addressed by ensuring that the keys used for the Allocation Tree carry a peer-specific prefix and sort in sequential order for objects allocated by that peer. The key and object UUID are derived via the following process:
- Allocate a random UUID
- Replace the top-most bits with the pool member index
- Replace the bits below that with the HLC timestamp of the allocation (each allocation is treated as an event from the HLC’s perspective and increments the logical counter component, ensuring there are no duplicates)
- The resulting value is used as the key into the Allocation Tree
- The UUID of the object is set to the xor of the StoragePoolUUID and this key
When attempting to look up the entry for an object in the Allocation Tree, the key may be derived by xoring the object UUID with the storage pool UUID. With this approach, there will be a short, initial window of time in which stores will experience contention in updating the Allocation Tree but this will vanish once each store has induced a branch split. All subsequent allocation updates will be contention free.
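A sketch of the derivation and the reverse lookup is shown below. The specific bit widths chosen for the member index and HLC timestamp are assumptions; only the overall layout and the xor relationship come from the description above.

```python
# Sketch of allocation-key derivation; the 1-byte index / 8-byte timestamp split is an assumption.
import os
from uuid import UUID

def allocation_key(pool_member_index: int, hlc_timestamp: int) -> UUID:
    raw = bytearray(os.urandom(16))               # start from a random UUID
    raw[0] = pool_member_index & 0xFF             # top-most bits: pool member index
    raw[1:9] = hlc_timestamp.to_bytes(8, "big")   # next bits: HLC timestamp of the allocation
    return UUID(bytes=bytes(raw))

def object_uuid_for(pool_uuid: UUID, key: UUID) -> UUID:
    return UUID(int=pool_uuid.int ^ key.int)      # object UUID = pool UUID xor key

def key_for_object(pool_uuid: UUID, obj_uuid: UUID) -> UUID:
    return UUID(int=pool_uuid.int ^ obj_uuid.int) # xor is its own inverse
```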
Missed Transaction Trees
Each Data Store maintains a separate Missed Transaction tree for each of the other stores in the pool. When a pool member drives a transaction to completion and suspects that one or more stores did not successfully process the transaction, which will generally be those peers for which it failed to receive a Promise message, a notation is made in the Missed Transaction tree for each suspected store.
Allocations always require 100% success so the only entries that will be placed in this tree are missed attempts to delete objects or update their content. Missed updates to object reference counts do not need to be logged since they are replicated along with each object slice/replica and include a monotonically increasing update number along with the reference count. Simply querying a majority of slices/replicas is sufficient as the reference count with the highest update number is guaranteed to be the correct value.
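Resolving the authoritative reference count is then a matter of taking the value paired with the highest update number seen among a majority of responses, as in this small sketch:

```python
# The reference count accompanied by the highest monotonic update number is authoritative.
from typing import List, Tuple

def resolve_refcount(responses: List[Tuple[int, int]]) -> int:
    """responses holds (update_number, reference_count) pairs from a majority of stores."""
    _, refcount = max(responses)   # tuples compare by update_number first
    return refcount
```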
The key inserted into the Missed Transaction tree is the UUID of the missed object and the value is the HLC timestamp of the transaction. For missed object deletions, the object pointer for the object is also placed in the value alongside the transaction’s HLC Timestamp. Each peer periodically scans the Missed Transaction tree maintained for it by its fellow pool members and attempts to repair any objects found within them.
The purpose of using separate Missed Transaction trees for each peer is to enable contention-free updates. This is effective for single-object transactions, where the transaction driver is guaranteed to be a peer in the pool owning the object. For multi-object transactions, however, the driving peer will often not be a member of the pool. In that case it may ask one of the peers hosting the object in question to add an entry to its Missed Transaction tree. Once the driving peer has received positive acknowledgment of all such entries, it may consider the finalization action complete.
Missed Transaction trees are normal trees in most respects, but they have a unique property in that the transactions used to update them do not themselves trigger updates to a Missed Transaction tree, as doing so would result in an infinite cycle. Instead, reliability of the Missed Transaction tree is maintained by the constant scanning done over the tree’s contents. When a scan finds an object for which a store is missing the appropriate data, an in-place rebuild operation is performed if the hosting store is online.
Durable Task Trees
Finalization Actions provide a limited form of distributed task management in that they ensure that the follow-on actions that need to be performed before a successful transaction may be forgotten are replicated across a write-threshold number of nodes. They also ensure that those nodes will detect and recover from processing failures. This level of distributed tasking is appropriate for relatively simple and short-duration activities but a more robust mechanism is needed for complex, long-running tasks such as reconstructing failed stores and deleting large data structures.
An additional requirement on the tasking model for some necessary maintenance activities is that it must support “exactly once” operations. “At most once” and “At least once” guarantees (which are provided in this system by transactions and finalization actions, respectively) are relatively easy to achieve in distributed systems but “exactly once” guarantees are much harder to come by. A third party utility could be used to provide this capability but a major advantage of implementing the tasking model within Aspen is that by storing the state for the distributed task within an Aspen object, updates to the task state may be tied directly into the transactions used to carry out the task. For example, if a task needed to decrement the reference count on an object exactly once, the task could use a multi-object transaction to simultaneously decrement the reference and update the task state to reflect that the reference has been decremented. Even if multiple processing agents were attempting to do so simultaneously, all transactions after the first one would fail to commit since the revision of the object holding the task state would not match the transaction requirements.
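The structure of such an “exactly once” transaction might be sketched as follows; the requirement/update types and the helper are hypothetical, illustrating only how the task-state revision ties the two updates together.

```python
# Illustrative "exactly once" decrement: a single multi-object transaction both decrements
# the target's reference count and advances the task-state object's revision. A second agent
# attempting the same work fails the revision requirement and therefore commits nothing.
from dataclasses import dataclass
from uuid import UUID

@dataclass
class RevisionRequirement:
    object_uuid: UUID
    expected_revision: UUID   # the transaction aborts if the object's revision differs

@dataclass
class Update:
    object_uuid: UUID
    description: str

def exactly_once_decrement(target: UUID, task_state: UUID, task_state_revision: UUID):
    requirements = [RevisionRequirement(task_state, task_state_revision)]
    updates = [Update(target, "decrement reference count"),
               Update(task_state, "record that the reference has been decremented")]
    return requirements, updates
```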
Aspen includes the basic concept for storing and executing durable tasks in a reliable manner but currently does not define a full solution to the problem. A robust task management system designed for real-world use is a complex issue in its own right and merits a separate design document.
Error Recovery
Recovering from Offline Periods and Transient Network Failures
As mentioned in the description of the Missed Transaction trees, each data store continually monitors the trees maintained for it by the other stores in its pool to discover objects modified by transactions that it missed. When an entry is discovered in one of these trees, the store looks up the Object Pointer for the object in the pool’s Allocation Tree and uses it to find the local state of the object. If the HLC Timestamp on the object is greater than that of the value stored in the Missed Transaction tree, the object does not need repair and the entry may be removed from the Missed Transaction tree.
If, on the other hand, the HLC Timestamp is less than that of the value contained in the tree, the object needs to be repaired. This is done by simply reading the current state of the object. If erasure coding is being used the local state is calculated via the erasure-coding algorithm. For replication the object data is simply used as-is. Once the updated value is written to the local data store the entry may be removed from the Missed Transaction tree.
Care must be taken with the removal transaction to ensure there is no race condition between the time the entry for the object UUID is read and the time the repair process completes. This may be done by placing an “HLC Timestamp Equals” requirement in the removal transaction, which causes the removal to fail if the entry has been modified in the intervening window. When this occurs, the repair process for the object simply needs to be restarted.
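A sketch of the repair-and-remove flow for a single entry, using hypothetical store and tree helpers (lookup, repair_from_peers, and a remove call carrying a timestamp-equals requirement):

```python
# Hypothetical helpers throughout; the point is the order of checks and the guarded removal.
def process_missed_entry(store, tree, object_uuid, entry_timestamp):
    local = store.lookup(object_uuid)
    if local is not None and local.hlc_timestamp > entry_timestamp:
        pass                                        # object already newer; only removal remains
    elif not store.repair_from_peers(object_uuid):  # read current state and rebuild local slice
        return                                      # repair failed; retry on a later scan
    # Remove the entry only if it still carries the timestamp acted upon. A failed removal
    # means the entry changed in the interim, and the repair is simply restarted later.
    tree.remove(object_uuid, require_timestamp_equals=entry_timestamp)
```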
Recovering Lost Data Stores
When the physical media backing a store is lost, the basic recovery process involves recreating the failed store from the content of the pool’s allocation tree. Tier-0 of the tree is walked left-to-right and each object pointer is inspected to see if the failed store hosted any slices/replicas of that object. If so, the required local state for the failed store is reconstructed from the state held by the other slice/replica hosts and written to the recovering store. As this will generally be a long-running operation that must be completed correctly for long-term system reliability, it should be implemented in terms of the distributed task model.
Although the basic process is almost trivially simple, there are a great many tradeoffs and optimizations that practical implementations need to consider. A few of these are:
- Should the store be reconstructed on the same physical machine or somewhere else on the network?
- Optimizations for replicated as opposed to erasure-coded objects
- Optimizations based on local vs geo-dispersed objects
- Optimizations for fixed-size flat-file backing store implementations where reconstructing objects in the proper order eliminates seeks and maximizes write throughput.
- Potentially using physical media shipped between geo-dispersed sites rather than relying on WAN bandwidth.
Transaction Crash Recovery
There are two goals for transaction recovery:
- Ensure that the transaction resolves a Commit/Abort decision
- Determine whether or not to apply the pending object updates to local state (if any)
The recovery process for a transaction begins by asking a threshold of stores from each transaction requirement for their view of the current state of the transaction as well as the current revision, reference count, and last committed transaction UUID of the object targeted by the requirement. The following rules are then applied (a sketch of this decision procedure follows the list):
- If the transaction is still in-progress, resume processing.
- If the transaction is committed/aborted and is still awaiting completion of the finalization actions, locally apply/discard the object modifications (if any) and resume processing.
- If the peers returned a cached commit/abort result, act accordingly and discard the transaction state
- If the current object revision does not match the required value for one or more of the transaction requirements, compare the transaction UUID with the UUID of the transaction that last modified the object. If they match, commit the local modifications. Otherwise, discard the local slice/replica and repair the object.
- Bump the object revision and then discard the transaction state.
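Assuming simple structures summarizing the transaction state reported by the queried peers and the local object state, the ordering of those rules might be sketched as:

```python
# Illustrative ordering of the recovery rules above; the state structures are assumptions.
from enum import Enum, auto

class Action(Enum):
    RESUME = auto()
    APPLY_AND_RESUME = auto()
    USE_CACHED_RESULT = auto()
    COMMIT_LOCAL = auto()
    DISCARD_AND_REPAIR = auto()
    BUMP_REVISION_AND_DISCARD = auto()

def recover(txn, txn_state, obj_state) -> Action:
    if txn_state.in_progress:
        return Action.RESUME
    if txn_state.resolved and not txn_state.finalized:
        return Action.APPLY_AND_RESUME                # apply/discard local updates, keep driving
    if txn_state.cached_result is not None:
        return Action.USE_CACHED_RESULT
    if obj_state.revision != txn.required_revision:
        if obj_state.last_committed_txn == txn.uuid:
            return Action.COMMIT_LOCAL                # this transaction already committed here
        return Action.DISCARD_AND_REPAIR
    return Action.BUMP_REVISION_AND_DISCARD           # fence out the outstanding transaction
```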
Various patterns of network partitions and dropped messages conspire to make it very difficult to distinguish between an aborted transaction and a transaction that is in-progress but has not yet reached a threshold number of peers. This recovery mechanism bypasses the problem with the last step of bumping the revision of the object which makes it impossible for the potentially outstanding transaction to commit. Once the object revision changes and the last committed transaction UUID does not match the transaction in question, the recovering transaction may be silently discarded.
Object Allocation Crash Recovery
Object allocations are tied to a transaction, but the stores that host slices/replicas for the newly allocated object are not, themselves, part of that transaction. Because each store must provide a local reference to the slices/replicas it hosts, the Object Pointer for the new object, and therefore the Transaction Description for the transaction used to insert the new pointer into an existing object, cannot be formed until after the initial allocation step. The normal flow for resolving allocations is for the initiating store either to immediately inform the allocating stores if the required 100% allocation threshold was not achieved, thereby abandoning the transaction before it even begins, or to include a Finalization Action in the Transaction Description that will attempt to notify all allocating stores of the success/failure of the transaction. If an allocating store goes offline after the allocation step and comes back online after the transaction has been forgotten, it must determine what to do with the local slice/replica.
Although the full Transaction Description cannot be formed until after the initial step, the transaction UUID can be determined in advance as can the object that will be updated in the allocation transaction. This information is sent to the allocating stores along with the object data to assist in the recovery process. The process begins by querying a threshold number of the stores hosting the object-to-be-updated for the current state of the transaction and the current state of the object. The recovery steps are then:
- If the transaction is still outstanding, await notification from the Finalization Action
- If a cached commit/abort decision is available, act accordingly.
- If the last committed transaction UUID on the object matches the one for the allocation transaction, commit.
- If the revision of the object does not match the required value, check the StoragePool’s AllocationTree. If a pointer to the object exists, commit. If it does not exist, abort. (The Finalization Action that updates the AllocationTree ensures that if the transaction committed a recovering store will either see the outstanding transaction state or an entry in the Tree unless the object is deleted before the store recovers, in which case throwing away the allocated state is also legitimate)
- Otherwise the object revision matches the expected value. Similar to transaction recovery, attempt to bump the object revision. If the revision changes for any reason and the last committed transaction UUID does not match the value required by the allocation transaction the local state may be discarded as it is impossible for the potentially-outstanding allocation transaction to commit.