This note describes how the disk formatting logic works in the CapROS system, and how CapROS volumes are structured. Both disk partitioning and range creation/deletion are covered.
Note that the current implementation does not provide for live creation of new volumes, so the discussion of partition creation should be taken with suitable skepticism; it snapshots what we intend to do when the online disk partitioning/formatting utility is written.
Disks can be divided into multiple, independent chunks known as partitions. Each partition constitutes a logical disk volume. CapROS, for the most part, deals in terms of disk volumes. CapROS volumes, in turn, are divided into ranges, described below.
Over the life of a disk, it may prove convenient to add or remove partitions/volumes, and to adjust the content of existing volumes.
Adding new volumes (paritions) must be done carefully for the sake of consistency, but it isn't especially difficult to do. The principle challenge is to defend against disk failure during format. The creation of new CapROS partitions proceeds in steps:
There are two cases to consider when deleting CapROS partitions. CapROS treats the case of an unplanned deletion the same as if the disk became unreachable due to removal or failure, which of course means that any data that existed solely on that disk is lost. This can cause critical software components to fail. Another problem is that data in the checkpoint log destined for the removed partition will be unmigrateable.
Planned deletion of a partition involves migrating all the data on it elsewhere. The Object IDentifiers of the data do not change. Most likely this will use mirroring/duplexing as follows:
The question of how to actually reduce the amount of data in the system, so disk space can be permanently freed, is beyond the scope of the current design of the system.
CapROS views each logical volume as further divided into ranges. This section describes the range types, and the formatting of each range type as seen by the kernel. All ranges are an exact multiple of the architecture page size; extra trailing sectors, if any, are unused.
In the kernel code, ranges are called Divisions; should we change this documentation to match?
Ranges come in a variety of types.
Range Table | The range table contains the list of all ranges present on the
volume, including itself. At least one
range table is required on every volume. There may be multiple
range tables on a given volume (for redundancy).
How to find the range table? Each range table entry contains the following information:
|
||||||||||||
Object | An object range holds pages and nodes (and other types of objects, should we implement them). There can be multiple object ranges on a volume, and all mounted object ranges act together as the data space for CapROS. Every object in CapROS has a unique home location in some Object Range. Object ranges are further described below. | ||||||||||||
Log | Log ranges hold the log areas used for checkpointing. Modified
objects are periodically written to the log, and are later
migrated to their home locations. There can be many log
ranges in a partition, and all mounted checkpoint log areas
act as a single log used for checkpointing.
Every CapROS system must have a log range containing log locations zero and one, which contain the most recent two checkpoint headers. Is this correct? |
||||||||||||
Unused | The unused range type identifies an unused entry in the range table. |
No provision is made for allocating spare space for use in the event of disk failure, because both IDE and SCSI provide transparent sparing.
There is no boot range and no kernel range. It is presumed that the disk contains another partition which contains the GRUB boot loader, the kernel, and the preloaded range(s). This partition is formatted with a boot sector and a traditional file system understood by GRUB. We also need to document somewhere how systems will be booted from ROM.
Object ranges are made up of a sequence of clusters. Every cluster begins with a disk frame known as a tag pot, followed by a fixed number of object frames. The size of a cluster is determined by the number of Object Frames a Tag Pot can keep track of. (The last cluster in a range may have as few as zero object frames.) All frames are the size of the architecture's page. So the Object Range looks like:
---------------------------------------------------------------------- | TagPot | ObjFrm |...Object Frames...| ObjFrm | TagPot | ObjFrm | ... ---------------------------------------------------------------------- \_____________________ _____________________/ \ / cluster
Each Object Frame contains one or more objects of a single type (node or page). The type is specified in the tag pot. The type of the frame determines how the Frame is laid out, as follows:
------------------ |//////////////////| |//////////////////| |//////////////////| |///////Page///////| |//////////////////| |//////////////////| |//////////////////| ------------------
------------------ |///////Node///////| |\\\\\\\Node\\\\\\\| |///////Node///////| |\\\\\\\Node\\\\\\\| |///////Node///////| |\\\\\\\Node\\\\\\\| |///////Node///////| | | ------------------
If the allocation count of an object is N, that means that there are no applicable capabilities containing an allocation count M > N, and there are no rescinded applicable capabilities containing an allocation count M = N. (There may exist rescinded applicable capabilities containing an allocation count M < N.) For nodes, applicable capabilities are all capabilities containing the node's OID. other than resume capabilities. For all other objects, applicable capabilities are all capabilities containing the object's OID.
In the simple case, rescinding capabilities to an object requires incrementing the allocation count of the object.
When the allocation count reaches its maximum possible value, the object is "worn out" and cannot be reused, because we can no longer rescind capabilities to it.
To reduce the "wear" on the allocation count, we keep track of some in-memory capabilities (they are linked to the object) and we can invalidate these capabilities without incrementing the allocation count. A bit called allocCountUsed indicates whether there may be unrescinded capabilities (other than resume capabilities) that contain the current value of the allocation count. If the bit indicates that there are no such capabilities, rescinding can be done without incrementing the allocation count. With this optimization, it would take an extremely long time for a 32-bit allocation count to wear out.
The allocCountUsed bit is not stored on the disk. When an object is fetched from disk, its allocCountUsed bit is set to one. (If we did store the bit on the disk, then we would have to consider the object dirty whenever the bit is set (or cleared). The bit is set when an in-memory capability is converted to the out-of-memory form, and for technical reasons it's not possible to dirty the target object at that time.)
Nodes contain a call count that is similar to the allocation count, but for resume capabilities. If the call count of a node is N, that means that there are no resume capabilities containing the node's OID and a call count M > N, and there are no rescinded resume capabilities containing the node's OID and a call count M = N. (There may exist rescinded resume capabilities containing the node's OID and a call count M < N.) Similarly, there is a callCountUsed bit, and it's not stored on disk either.
A tag pot has an entry for each frame in its cluster. Each entry has the following fields:
A virgin frame should be typed as a page, since it is easier to retype a page to a different type than vice versa.
To avoid wasting space on alignment, a tag pot is arranged as two arrays:
struct TagPot { ObCount allocationCount[]; uint8_t type[]; };
Using a 32-bit count, a tag pot can describe 819 object frames.
It is possible to change the base type of a frame, for example from page to node pot or vice versa. By "base type" we mean the type of the container of an object. For example, forwarders and GPTs are different types, but, as an implementation detail, they both occupy a node frame; they both are of base type node.
The space bank internally allocates a whole frame at a time to a given base type. The space bank does not always know the base type of free frames, and allocating the first object in a frame may require changing its type. (As an optimization, the space bank preferentially allocates free frames that were previously of the same base type, to reduce the amount of retyping needed.)
Retyping a frame must not cause any rescinded capabilities to become unrescinded. (When a frame is retyped, there are no allocated objects in the frame, and therefore no unrescinded capabilities to objects in the frame, so we don't need to worry about unrescinded capabilities becoming rescinded.) This requires that allocation counts and call counts never decrease. Here's how we ensure that:
Retyping a frame involves the following steps. These steps ensure that the retyping operation does not change the state of the last committed checkpoint. The steps are described in the context of a persistent (checkpointed) system, but they apply equally well to a non-persistent (embedded) system.
Note that the kernel needs to know the current type of the frame.
As with obtaining the type, the counts are obtained without requiring the contents of the object. Specifically, the kernel gets the counts (1) if the object is in memory, from the object header; (2) if the object is in the in-core log directory, from the directory entry; or (3) otherwise, from the home range tag pot (for a page) or home range node pot (for nodes).
There is a separate operation for each object, because each object might be in a different node pot in the log, and we don't want to require all the pots to be in memory at once. (This function could be optimized later to return counts for multiple objects in one operation, as long as that can be done without doing multiple fetches of pots on any one operation.)
In the case of nodes, the EROS design called for keeping in the tag pot the maximum of the allocation counts of the nodes in that home node pot. Note that this maximum only applies to the nodes whose current version is in the home node pot; some nodes from that pot might have their current version in log pots, and we still need to read those. This design could save reading the home node pot when the frame is retyped, but may require updating and writing the tag pot more frequently. Since retyping should be rare, we choose to not keep node counts in the tag pot.
Copyright 1998 by Jonathan Shapiro. Copyright 2008 by Strawberry Development Group. All rights reserved. For terms of redistribution, see the GNU General Public License This material is based upon work supported by the US Defense Advanced Research Projects Agency under Contract No. W31P4Q-07-C-0070. Approved for public release, distribution unlimited. |