This note documents the design for the checkpoint data structures and algorithms.
The CapROS checkpoint design is derived from the EROS checkpoint design. The EROS design differs from the KeyKOS design, but many of the ideas of that design still pertain. The KeyKOS design is described in The Checkpoint Mechanism in KeyKOS. The EROS design is described in The Design Evolution of the EROS Single-Level Store. The differences between CapROS and the other systems are summarized below. You can skip this section if you're not interested in the history.
KeyKOS uses two checkpoint areas of equal size (what Landau calls the checkpoint area and the working area). It fills the working area up, declares a checkpoint, swaps the checkpoint and the working areas, and then drains (migrates) from what is now the checkpoint area.
EROS and CapROS use a single checkpoint area in a circular fashion, dividing this area dynamically into several generations. A generation is permitted to occupy up to a fixed percentage of the checkpoint area (called the limit-percent, typically 65%) before a checkpoint must be declared. A new generation is then started and a previous checkpoint generation is migrated. This space utilization strategy provides greater flexibility in the face of bursts of object modifications. It also allows the checkpoint area to grow more easily.
KeyKOS was designed for the checkpoint areas to be somewhat larger than main storage. It calculated its "Checkpoint Limit" based on the number of page frames, node frames, and directories, on the assumption that all of them could need to be written into the checkpoint. EROS and CapROS use a "reserve space before changing" technique which allows the checkpoint areas to be smaller than main storage.
CapROS can have several generations of checkpoints which have not yet been migrated. While EROS and KeyKOS had logic which skipped migration of a page (or node) which had been changed since the checkpoint was taken (because it would be included in the next checkpoint), some of the frequently changed objects would be migrated before they were changed in the current generation. By not starting migration until more than one checkpoint has been taken, active pages and nodes have a greater chance to be modified before they are even considered for migration.
Although there is more than one checkpoint outstanding, this version of the checkpoint logic does not permit restart from other than the most recently stabilized checkpoint.
KeyKOS never reuses space in the checkpoint area; if an object is modified a second time it gets a new location. The reasons for this choice are to avoid seeks and head settling delays where possible, and to permit chained writes on the IBM S/370 disks which allowed actual data rates to approach the theoretical maximum.
EROS adopted a strategy of re-writing a page in the original working area location if it was paged out twice during one checkpoint interval. This strategy had the advantage of saving space and reducing seek distances.
Because of the need to support logs in flash memory, CapROS will follow the KeyKOS strategy. This strategy has the advantage of eliminating a source of "hot spots" on the disk.
Once KeyKOS begins to reuse a checkpoint area, all memory of that area's prior content is discarded.
EROS preserved a record of object locations in the checkpoint area until their storage was actually reused. This strategy reduced the number of seeks to areas outside the checkpoint area, improving disk performance. It also increased the number of locations for a page or node, allowing fetches to be scheduled on less-active devices.
Which strategy CapROS adopts will be up to the person writing the code.
EROS kept 28 bit log location IDs (lid, see below) in the Object Directory part of the checkpoint. Once you account for the 8 bits of object index, this size results in a maximum of 2^20 frames in the log, which was reasonable for the systems of the time, but now is barely enough to hold the RAM of a modern desktop processor. The CapROS lid is the same size as Object IDs (oids, see below).
EROS kept a 2 level tree for the on-disk Object Directory. This limited the size of a checkpoint to 259,080 objects. CapROS writes Object Directory and Process Directory frames contiguously in the log, so there is no artificial limit on the number of objects in a checkpoint.
EROS required sequential numbering of the log ranges. CapROS may relax that requirement depending on how easy it is in practice, but it should be noted that relaxing this requirement may be in conflict with contiguous Object and Process Directories.
The CapROS log area is made up of ranges of disk page frames that are sequentially numbered beginning at zero. These are referred to as log ranges. Just as objects in object ranges are referred to with object identifiers (oids), objects in the log are referred to with log identifiers (lids). A lid is constructed by taking the 56 bit frame number and concatenating an 8-bit object index. (8 is equal to log2(EROS_OBJECTS_PER_FRAME).)
Frame     0       1      2                                    SIZE
          +-------+------+----------------------------------------+
          | header area  |                main log                |
          +-------+------+----------------------------------------+
LID       0x0            2<<8                          ... SIZE<<8
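To make the encoding concrete, here is a minimal sketch of packing and unpacking a lid under the layout just described. The names (lid_t, OBJECT_INDEX_BITS, the helper functions) are illustrative, not the kernel's actual identifiers.

#include <stdint.h>

#define OBJECT_INDEX_BITS 8   /* log2(EROS_OBJECTS_PER_FRAME) */

typedef uint64_t lid_t;       /* 56-bit frame number, 8-bit object index */

static inline lid_t
lid_from_frame(uint64_t frameNum, unsigned objIndex)
{
  return (frameNum << OBJECT_INDEX_BITS) | (objIndex & 0xff);
}

static inline uint64_t
lid_to_frame(lid_t lid)
{
  return lid >> OBJECT_INDEX_BITS;
}

static inline unsigned
lid_to_index(lid_t lid)
{
  return (unsigned)(lid & 0xff);
}

With this encoding the header area occupies frames 0 and 1, and the first main log frame is at lid 2<<8, matching the diagram above.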
Log ranges may reside on multiple disks, but the implementation may require that sequential numbering be preserved. This requirement presents some difficulties for log area rearrangement which may, in due course, be solved with suitable user-level applications.
A generation that has been completely written to the disk is said to be stabilized. A stabilized generation has an associated on-disk directory that describes where all of the objects in that generation reside. The last step in stabilizing a generation is writing the associated checkpoint header.
Frames zero and one of the log are the header area. They hold the two most recent checkpoint headers. There are two headers so that if there is a disk error writing a header, there will still be another valid header that can be used for restart.
The checkpoint header contains an identifying header, containing the generation number of the restart generation. (This structure is called CkptRoot in the code.) The header with the highest generation number is the most recent header. Only the most recent valid header is used for restart.
The checkpoint header also contains the number of unmigrated generations, and an array of LIDs of the generation headers of all the unmigrated generations. There is a limit, currently 20, on the number of unmigrated generations.
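A minimal sketch of what CkptRoot might contain, based only on the fields described here; the actual field names and layout in the code may differ.

#define MAX_UNMIGRATED_GENERATIONS 20  /* current limit noted above */

struct CkptRoot {   /* sketch; names are illustrative */
  uint32_t versionNumber;              /* identifying header */
  GenNum   checkpointSequenceNumber;   /* generation number of the restart generation */
  uint32_t numUnmigratedGenerations;
  LID      generations[MAX_UNMIGRATED_GENERATIONS]; /* LIDs of the unmigrated generation headers */
  /* The restart description below implies an integrity check (checksum)
     at the end of the header; its form is not specified here. */
};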
The last item in the generation header is a field used to validate the correctness of the previous data. This field is included to detect frames which were only partially written due to hardware failure. We needed this field on the S/370 because certain write failures (e.g. channel failure) caused the disk controller to fill out the block with the last byte transferred from the channel, and then write a correct hardware checking block.
The root of a generation directory is the generation header. The generation header frame begins with a header structure describing the generation.
The header structure is:
struct DiskGenerationHdr {
  uint32_t versionNumber;
  GenNum sequenceNumber;
  LID firstLogLoc;
  LID lastLogLoc;
  LID firstProcessDirFrame;
  LID firstObjectDirFrame;
  uint32_t nProcessDirFrames;
  uint32_t nObjectDirFrames;
  uint32_t nProcessDescriptors;
  uint32_t nObjectDescriptors;
};
The object directory needs to be kept entirely in memory even after a checkpoint is stabilized. It makes sense to write the entire object directory into sequential log locations at the end of the checkpoint.
The process directory must be saved as of the time of the demarcation event. It makes sense to write it into contiguous log locations immediately after the pages and log pots that were written before the demarcation event.
Contiguous log locations allow easy scheduling of writes and reads during checkpoint and restart, as all the locations of the directories are known initially.
Implementor note: The old 32 bit lid was referred to as a LogLoc in the code.
Each of the fields is explained below.
The version number of the generation header.
The generation sequence number allows CapROS to determine which generation is most recent.
The firstLogLoc and lastLogLoc fields allow the restart code to determine what part of the main log contains this generation. Note that firstLogLoc will be higher than lastLogLoc when this generation wraps from the end of the main log to the beginning.
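For example, deciding whether a lid belongs to this generation has to allow for wraparound; a sketch (the helper name is illustrative):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t LID;   /* a lid: 56-bit frame number, 8-bit object index */

/* Does lid belong to the generation bounded by firstLogLoc..lastLogLoc?
   The generation may wrap from the end of the main log to the beginning. */
static bool
lid_in_generation(LID lid, LID firstLogLoc, LID lastLogLoc)
{
  if (firstLogLoc <= lastLogLoc)   /* generation does not wrap */
    return lid >= firstLogLoc && lid <= lastLogLoc;
  else                             /* generation wraps around the log */
    return lid >= firstLogLoc || lid <= lastLogLoc;
}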
The restart code checks each lid when rebuilding the in-core object and process directories. If the most recent version of all the unmigrated objects, the restart generation process directory frames, and the object directory frames for all the unmigrated generations are not all mounted, the system will not restart until they are mounted.
nProcessDirFrames is the number of process directory frames. If the entire process directory is in the checkpoint header, it will be zero. The firstProcessDirFrame is the LID of the first frame of the process directory.
nObjectDirFrames is the number of object directory frames. If the entire object directory is in the checkpoint header, it will be zero. The firstObjectDirFrame is the LID of the first frame of the object directory.
Part of the process directory may be held in the generation header frame. nProcessDescriptors is the number of process descriptors in that list. They immediately follow the header.
Similarly, part of the object directory may be held in the generation header frame. nObjectDescriptors is the number of object descriptors in the header. They immediately follow any process descriptors that may be in that header.
A process directory frame consists of a single word describing the number of following entries, and then some number of process descriptors. With 8 byte OIDs and 4 byte ObCounts, a 4K process directory frame can describe 314 processes.
struct DiskProcessDescriptor {
  OID oid;
  ObCount callCount;
  uint8_t actHazard;
};
An entry in the process directory corresponds to a process that was running at the time of the demarcation event. The process entries from the restart generation are reloaded as part of the system startup procedure.
The oid field identifies the root node of the process. actHazard identifies some situations requiring special handling:
Note that on restart CapROS loads the process list from the most recent generation header only.
The object directory frames hold ObjectDescriptors. Each object descriptor holds an object identifier, the object's allocation and call counts, the type of the object, and the lid at which the object can be found. With 8 byte OIDs, 4 byte ObCounts, 8 byte LID, and 1 byte types, each 4kB frame can hold a description of 163 objects (assuming unaligned/packed storage and allowing for a count of entries at the start of the frame).
struct ObjectDescriptor {
  OID oid;
  ObCount allocCount;
  ObCount callCount;
  LID logLoc;
  uint8_t type;
};
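The per-frame counts quoted above follow from the packed descriptor sizes. A sketch of the arithmetic, assuming an 8 byte OID, 4 byte ObCount, 8 byte LID, 1 byte type/actHazard, a 4 byte entry count at the start of each frame, and unaligned packing (sizeof() of the structs above would include padding):

#define FRAME_SIZE        4096
#define FRAME_COUNT_WORD  4                    /* entry count at start of frame */

#define PROC_DESC_PACKED  (8 + 4 + 1)          /* oid + callCount + actHazard = 13 */
#define OBJ_DESC_PACKED   (8 + 4 + 4 + 8 + 1)  /* oid + counts + lid + type = 25 */

enum {
  PROC_DESCS_PER_FRAME = (FRAME_SIZE - FRAME_COUNT_WORD) / PROC_DESC_PACKED, /* 314 */
  OBJ_DESCS_PER_FRAME  = (FRAME_SIZE - FRAME_COUNT_WORD) / OBJ_DESC_PACKED   /* 163 */
};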
For nodes, the callCount field contains the call count. For other objects, the callCount field is unused.
The allocCountUsed and callCountUsed bits are not stored on disk. They are assumed to be one for objects on disk.
If the lid field of the directory is zero, then the corresponding object is null: either a zero-filled page or a node filled with void keys and zero nodeData. Which one can be determined by the value of the type field.
If the lid field is nonzero, the corresponding log frame contains either page data or a "log pot." A log pot is simply a page-sized cluster of nodes in the main log. This is why the lid value is concatenated with an object index: the index indicates which entry in the log pot contains the relevant object. (The object index of a LID, like that of an OID, is 8 bits to allow for future types that may be small enough to fit 256 in one frame.)
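Putting the pieces together, fetching an object described by a directory entry might look roughly like this. This is only a sketch: read_log_frame, make_null_object, copy_node_from_pot, and OBTYPE_PAGE are hypothetical stand-ins for the kernel's real I/O and object-type machinery.

#include <stdint.h>

extern void read_log_frame(uint64_t frame, void * buf);                        /* hypothetical */
extern void make_null_object(void * buf, uint8_t type);                        /* hypothetical */
extern void copy_node_from_pot(const void * pot, unsigned index, void * buf);  /* hypothetical */
#define OBTYPE_PAGE 0   /* hypothetical type code for a data page */

void
fetch_object_from_log(const struct ObjectDescriptor * od, void * buf)
{
  if (od->logLoc == 0) {
    /* Null object: a zero-filled page, or a node of void keys and zero
       nodeData, as indicated by od->type. */
    make_null_object(buf, od->type);
    return;
  }

  uint64_t frame = od->logLoc >> 8;     /* log frame holding the data */
  unsigned index = od->logLoc & 0xff;   /* index within a log pot */

  if (od->type == OBTYPE_PAGE) {
    read_log_frame(frame, buf);          /* the frame is the page itself */
  } else {
    char pot[4096];
    read_log_frame(frame, pot);          /* the frame is a log pot of nodes */
    copy_node_from_pot(pot, index, buf); /* pick the indexed node out of the pot */
  }
}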
CapROS will support solid-state disks that use flash memory. These devices have a finite number of erase-write cycles (up to 1,000,000 for NAND flash). When the device is written, a block of memory (as large as 256KB) is first erased, then written.
The header area is a "hot spot" on the disk because it is written once on every checkpoint. If we assume (pessimistically) one checkpoint every ten seconds, and (optimistically) 1,000,000 write cycles, the header will wear out in only 115 days. An earlier design had a large header area in an attempt to spread out writes, but the large erase block size makes that approach impractical.
Flash memory devices intended to replace disks generally have some mechanism for wear leveling. We will rely on that mechanism to handle the header area "hot spot". The migrator will be written to avoid making node pots and tag pots "hot". The main log should be sized to hold several generations to avoid a hot spot there.
In overview, checkpointing operates as follows:
The diagram above shows the layout of the main log. The main log cursor (the next location to be written) is at the end of the bar. Note that the log is circular, so the left end and the right end are the same location. The retired generations remain in the main log until their storage is reused. There may be free space if the directory entries that described that space have been freed. The reserved space may be bigger or smaller than the free space.
Before a persistent object in RAM is dirtied, space is reserved for it in the main log. See below for details of the reservation algorithm.
If the checkpoint interval (typically 5 minutes) has passed since the last demarcation event, a new demarcation event is declared. This ensures that the system can be restarted with data that is no more than 5 minutes old. See below for other reasons to declare a demarcation event.
At the demarcation event, dirty objects in RAM are marked "copy on write", the Activity structure is saved as a process directory, the next generation is started, and execution resumes. At this time we say a checkpoint is active. Then, the process directory is written to the log, and then all of the dirty objects in the working generation are written to the log.
After the last dirty object in the working generation has been written to disk, the object directory is written. Then a new generation header is written; that generation will become the restart generation.
Then a checkpoint header is written to the header area. Alternate checkpoint headers are written to alternate locations in the header area, so the older header is overwritten, and there is always a good header.
Migration is the process of copying objects from the log to their home locations. Migration runs whenever there are any unmigrated stabilized generations, migrating the oldest such generation first. The priority of migration is adjusted to balance the desire to avoid contending with needed disk I/O, with the need to progress fast enough to avoid having to stall processes (as described below).
Conceptually, during the migration phase, all the objects in a stable generation are copied from the main log to their home locations on the disk. As an important optimization, objects that are not current need not be migrated, as described in The Checkpoint Mechanism in KeyKOS.
Migration is optimized by a non-persistent process external to the kernel. This process may choose to optimize disk arm motion in the home ranges if the disk is on rotating media; or it may minimize rewrites if the disk is on flash memory. Because it is non-persistent, the external migrator will never be stalled attempting to dirty an object.
As the log cursor and the reserved space advance around the circular main log, it may happen that all the space in retired generations is reserved. In that case, if an attempt is made to dirty another object and increase the reservation, the dirtying process must be stalled until space is freed up. The priority of migration will be adjusted to reduce the likelihood of this happening.
The main log is a circular buffer of frames, each of which is one page in size. A frame may hold a data page, a log pot of nodes, a process directory frame, an object directory frame, or a generation header.
Before any persistent object can be dirtied, space for it must be reserved in the main log. The system keeps track of the number of dirty persistent objects of each type. The object reservation is the maximum amount of space in the main log that could be needed to save those objects, taking into account:
The reservation is important because before dirtying an object we need to make sure we will be able to clean it later. When we clean an object, we can't overwrite any unretired generations, because that data will be needed if the system should restart.
We do not actually decide where to store the newly dirtied object when space for it is reserved; we decide that when the object is actually written. Indeed, the object might be null (zero) at the time it is written, and not take any space in the log at all, so the reservation is an upper bound on the space needed.
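A sketch of how a per-generation reservation might be computed as an upper bound. The names and the exact accounting are assumptions; the real code also has to cover directory frames, the generation header, and other categories not shown here.

#include <stdint.h>

struct GenReservation {
  uint32_t nDirtyPages;   /* dirty persistent pages in this generation */
  uint32_t nDirtyNodes;   /* dirty persistent nodes in this generation */
};

/* nodesPerLogPot: how many nodes fit in one page-sized log pot
   (a property of the node size; no particular value is assumed here). */
static uint32_t
reserved_frames(const struct GenReservation * r, uint32_t nodesPerLogPot)
{
  uint32_t pageFrames = r->nDirtyPages;   /* one log frame per dirty page */
  uint32_t potFrames  =                   /* nodes are packed into log pots; round up */
      (r->nDirtyNodes + nodesPerLogPot - 1) / nodesPerLogPot;
  /* Object/process directory frames and the generation header written at
     stabilization would also be counted (not shown). */
  return pageFrames + potFrames;
}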
While a checkpoint is active, there are some dirty objects in the working generation, and other dirty objects in the next generation. We track the reservation for each generation separately. We shall see below how these are used.
At some later time, we actually write the dirty object to the disk. At this point we actually allocate the log frame that the object or log pot will go to. At the same time, the object becomes clean, so we reduce the count of dirty objects, which may reduce the reservation. This is how reserved space is consumed.
An object can be dirtied (reserving space), written out (consuming space), and subsequently redirtied. When this occurs, there are two possible design options:
The first option has several disadvantages:
Therefore we adopt the second policy. Even if a frame in the log does not contain an active entry, we do not reuse it. As a result, we can use a simple cursor (the main log cursor) to allocate space in the log.
A great many of the objects written to the log will prove to be null objects. For our purposes, a null object is any of the following:
The reason these occur with such frequency is that returning storage to the space bank causes that storage to be zeroed. The reason to do this is that objects returned to the space bank are generally dirty, and handling them this way obviates the need to rewrite now-dead data back to the log. By storing them as null objects, the write bandwidth requirements can be reduced significantly.
The advantage to handling null objects specially is that they do not occupy any object storage in the log. The directory entry for a null object suffices to indicate that the object is null.
When CapROS is booted, it normally looks for a checkpoint on disk. (See Booting for other boot scenarios.) CapROS disk volumes (partitions) are identified with an IPLSysId that is unique for each system. This allows one CapROS system to format disks for another CapROS system without confusing them.
The restart sequence finds the configured disks, using a combination of configuration data and probing. It mounts all divisions of all volumes with the correct IPLSysId. Then it reads the checkpoint headers at LIDs 0 and 1 and determines which one is the most recent valid checkpoint header (the one with the highest generation number). This header identifies the restart generation and all the unmigrated generations.
If a drive failure occurred while writing a header (detected by bad CRC while reading the corresponding frame or by the checksum at the end of the header failing to verify), that header is discarded.
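A sketch of the header-selection step, using the CkptRoot sketch from earlier; the validity flags stand in for whatever CRC/checksum verification the real code performs.

#include <stdbool.h>
#include <stddef.h>

/* Pick the restart header from the two header frames. */
struct CkptRoot *
choose_restart_header(struct CkptRoot * hdr0, struct CkptRoot * hdr1,
                      bool hdr0Valid, bool hdr1Valid)
{
  if (hdr0Valid && hdr1Valid)
    return hdr0->checkpointSequenceNumber > hdr1->checkpointSequenceNumber
           ? hdr0 : hdr1;    /* the highest generation number wins */
  if (hdr0Valid) return hdr0;
  if (hdr1Valid) return hdr1;
  return NULL;               /* no restartable checkpoint */
}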
Then all the unmigrated generation headers are read. The number of unmigrated generations is given in the checkpoint header. Those generation headers are used to reload the object directory. Unmigrated generations now need to be migrated (perhaps for the second time). All the older generations have been both migrated and retired, and their space is available for allocation and reservation. Their contents must not be migrated or used, because they may have been overwritten with newer, unstabilized data.
The process directory is reloaded from the restart generation. The main log cursor is initialized to the frame following the last frame of the restart generation.
For simplicity, we consider that no part of an unmigrated generation is available for allocation, even though some frames might be migrated or might not contain current data.
To avoid deadlock, we must make sure that the disk handler and migrator can always make progress. The design of the disk handler and perhaps the migrator are such that they allocate some storage dynamically. When the system is booted, we must wait until these components have signaled that they have finished allocating dynamic storage, before allowing any persistent objects to be dirtied.
Here are some algorithms that might not be obvious.
Just before a persistent object is dirtied, the storage reservation algorithm is executed. The algorithm goes like this:
If migrator and disk handler are not initialized:
    Process stalls waiting for those to be initialized.
If a checkpoint is not active:
    Calculate tentative working reservation assuming object is dirtied.
    If size of working area + tentative reservation > size of main log * Limit Percent
       OR tentative reservation > size of retired generations:
        Declare a demarcation event (and a checkpoint is now active).
If a checkpoint is active:
    Calculate tentative reservation for next generation assuming object is dirtied.
    If size of working area + working reservation + tentative next reservation
          > size of migrated generations (retired or not)
       OR the number of unmigrated generations (including the working generation)
          >= the maximum number of generations permitted in the directory:
        Process cannot dirty now. Process stalls waiting for migration to progress.
    Else
        dirty the object and adjust next reservation.
Else
    dirty the object and adjust working reservation.
When we write an object to the disk, we do the following:
If object is null:
    update in-core directory entry accordingly
Else
    write object to disk and update in-core directory accordingly
Mark object clean and adjust the corresponding reservation downwards.
During checkpoint stabilization, we must write the process directory and the object directory. The space allocated to each in the header is variable. Since the process directory is written before the dirty objects, we treat it first.
Let S be the amount of space in the generation header frame after the header.
First, if the total number of descriptors in the process directory fits in S, then just put them directly in the header frame and reduce S by the space used.
Otherwise, put all the descriptors that will fit in the header frame and set S to zero.
After all the objects have been written, put as many object descriptors as will fit in the remaining space in the header frame, and write the rest into object directory frames at the end of the checkpoint.
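A sketch of this bookkeeping, reusing the packed descriptor sizes from the sketch above; the function and variable names are illustrative.

static void
plan_generation_header(struct DiskGenerationHdr * hdr,
                       uint32_t nProcesses,        /* descriptors in the process directory */
                       uint32_t * objDescCapacity) /* out: object descriptors that can still fit */
{
  uint32_t S = FRAME_SIZE - sizeof(struct DiskGenerationHdr);

  /* Process directory: placed first, before the dirty objects are written. */
  if (nProcesses * PROC_DESC_PACKED <= S) {
    hdr->nProcessDescriptors = nProcesses;            /* the whole process directory fits */
    S -= nProcesses * PROC_DESC_PACKED;
  } else {
    hdr->nProcessDescriptors = S / PROC_DESC_PACKED;  /* put what fits */
    S = 0;     /* the rest go into process directory frames written to the log */
  }

  /* After all the objects have been written, up to this many object
     descriptors can still be placed in the header frame; the rest go into
     object directory frames at the end of the checkpoint. */
  *objDescCapacity = S / OBJ_DESC_PACKED;
}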
The in-core directory is the data structure that records the location of objects in the log. The in-core directory is simply called the directory when context makes it clear. It is used for three purposes:
The migrator only needs to migrate the current version of an object, so the directory only needs to keep information about the current version of objects and (for pages) the restart version.
Note that the migrator will have to be able to find entries for the generation being migrated, so either a compressed generation number will need to be saved in each directory entry, or entries can be linked on a list headed in the core generation structure.
A number of data structures are possible for the directory. Here is one. Each directory entry is a node in a red-black tree with the oid as the key. Each entry has information about the object: the allocation and call counts, the alloc and call count used bits, and the type. Each entry also has the LID and generation number of the current version. To find entries by generation, the entry is linked on a doubly-linked chain for its generation. If the object is a page, the entry may also have the LID and generation number of the restart version. (We do not need to link into the chain for the generation of the restart version.)
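A sketch of such an entry, under the assumptions just listed; the field names, widths, and linkage are illustrative, and OID, ObCount, LID, and GenNum are the types used elsewhere in this note.

struct LogDirent {
  /* red-black tree linkage, keyed by oid */
  struct LogDirent * left;
  struct LogDirent * right;
  struct LogDirent * parent;
  uint8_t color;

  OID      oid;
  ObCount  allocCount;
  ObCount  callCount;
  uint8_t  allocCountUsed : 1;
  uint8_t  callCountUsed  : 1;
  uint8_t  type;

  /* current version */
  LID      logLoc;
  GenNum   generation;
  struct LogDirent * genNext;   /* doubly-linked chain for this generation */
  struct LogDirent * genPrev;

  /* restart version (pages only); unused otherwise */
  LID      restartLogLoc;
  GenNum   restartGeneration;
};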
When a checkpoint is stabilized, the current version becomes also the restart version; the old restart version is no longer the restart version. Rather than update affected directory entries at this time, we just recognize this situation from the generation numbers, and update the entry whenever we happen to visit it.
The following is another, older design:
struct CoreDirectory {
  GenNum migratorSequenceNumber;
  CoreDirent *oidTree;
};
The last completely migrated generation sequence number.
The top of the red-black tree that makes up the in-core directory.
CapROS allocates a fixed, tunable amount of space for directory entries at boot time. The restart code loads the directory with information for the restart generation and all the other unmigrated generations. (It must not load directory information from migrated generations, because those generations may have been freed and overwritten after the restart checkpoint was stabilized.)
A directory entry is allocated when a dirty persistent object is cleaned, that is, written to the log. (If that object was already in the log in the same generation, we will get its old entry back, but we do not count on this in managing directory entry allocation.) Thus the number of dirty objects is an upper bound on the number of directory entries we will need to take a checkpoint.
Before we dirty an object, we check that the number of free directory entries, plus the number of directory entries used by retired generations (these entries could easily be freed), is greater than or equal to the soft limit. The soft limit is a tunable parameter that may be the number of objects that can fit in memory, or may be larger. If we aren't going to exceed the soft limit, we go ahead and dirty the object.
If we would exceed the soft limit, and a checkpoint isn't in progress, we declare a demarcation event. If the number of free directory entries plus the number that could be freed in retired generations is greater than or equal to the number of dirty objects (including the object about to be dirtied), then we go ahead and dirty the object. Otherwise, the dirtying of the object must wait until this comparison can succeed.
At the time an object is cleaned, if there are no free directory entries, we free the oldest retired generation, freeing all its directory entries if any, until there is a free directory entry. (The above algorithm ensures that we can always get a free entry.)
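A sketch of the soft-limit test made before dirtying an object; the names and return convention are illustrative, and the demarcation and stall paths are only noted in comments.

#include <stdbool.h>
#include <stdint.h>

bool
may_dirty_for_dir_entries(uint32_t freeEntries,
                          uint32_t entriesInRetiredGens, /* could easily be freed */
                          uint32_t dirtyObjects,  /* including the object about to be dirtied */
                          uint32_t softLimit,
                          bool checkpointActive)
{
  uint32_t available = freeEntries + entriesInRetiredGens;

  if (available >= softLimit)
    return true;                 /* we won't exceed the soft limit; go ahead */

  if (!checkpointActive) {
    /* declare a demarcation event here (not shown) */
  }

  /* We may proceed only if the entries we could obtain cover every dirty
     object we might have to clean; otherwise the caller must wait. */
  return available >= dirtyObjects;
}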
I (wsf) think this structure is overkill, but it is useful documentation of what the coder will possibly need. This whole section is old and should be updated or deleted when the code is written.
Each generation has an associated in-core structure. This structure maintains book-keeping information concerning the generation and contains the root pointer for the in-core generation directory.
struct CoreGeneration {
  bool canReclaim;
  CoreDirent *oidTree;
  uint32_t nCoreDirent;
  uint32_t nDirent;
  uint32_t nReservedFrames;
  uint32_t nAllocatedFrames;
  uint32_t nDirPage;
  uint32_t nLogPot;
  uint32_t nZeroPage;
  uint32_t nZeroNode;
  uint32_t nPage;
  uint32_t nNode;
  uint32_t nProcess;

  // Record of storage we have released:
  uint32_t nReleasedNode;

#ifdef PARANOID_CKPT
  uint32_t nAllocDirPage;
  uint32_t nAllocPage;
  uint32_t nAllocLogPot;
#endif

  LID curLogPot;   /* type missing in the original; LID assumed */
  uint32_t nNodesInLogPot;
};
For each type of object (page, zero page, node, zero node, process), the core generation structure records how many of these objects have been allocated in this generation. An object is allocated the first time it is written to the checkpoint area. Nodes are packed into log pots, so several node allocations share a single frame.
The various fields are:
Indicates whether the objects in this checkpoint generation have been migrated, and can therefore be reclaimed.
Root pointer to a red-black tree of directory entries for each object in this generation.
Number of entries in the core directory tree for this generation.
Number of on-disk directory entries that have been reserved for this generation. Should always be equal to nCoreDirent.
Number of disk frames that have been reserved for this generation.
Number of disk frames that have been allocated for this generation. Should always be less than or equal to nReservedFrames.
Number of directory pages, pages, and log pots (respectively) that have been reserved in this generation. These should sum to nReservedFrames.
struct CoreDirent {
  CoreDirent *left;
  CoreDirent *right;
  CoreDirent *parent;
  OID oid;
  ObCount count;
  LogLoc logLoc : 24;
  uint8_t type;
  enum { red = 1, black = 0 };
  uint8_t color;
};
The design selected for CapROS above tries to avoid "hot spots" on disk, but still has "warm spots" in the checkpoint headers and possibly in the home locations. By having a large number of checkpoint headers, the heat in that part of the flash memory is reduced to a level which should meet the device life requirements.
The following design is based on the goal of never re-writing a location in the log until all other locations have been written (exactly once). It was rejected because:
The log range page with LogID (lid) 0 is the first checkpoint header. When the disk is formatted, it is initialized to all zeroes. LogID 0 is always a checkpoint header, and acts as the root of all the checkpoint headers on the disk. All the log ranges on the system make up a single logical checkpoint area which is written sequentially in order of increasing LogIDs. When a checkpoint reaches the end of the available log ranges, it wraps around to the beginning, starting with lid=1.
Each checkpoint header holds pointers to information about the checkpoint, more-or-less following those described in the chosen design. In addition they hold forward and back pointers to checkpoint headers for previous and subsequent checkpoints as follows:
HMACValue checksum;
LID next;
LID previous;
LID migratedSeqNum;
HMACKey nextMac;
HMACKey previousMac;
uint64_t serialNumber;
The last operation in a checkpoint is writing the checkpoint header which is usually the first block of the checkpoint. (If the checkpoint wraps the log ranges, lid=0 will be the header for that checkpoint, and the first block of the checkpoint will either contain checkpoint data or remain unused.) Note that while we can calculate the lid of the next checkpoint, and generate a HMACKey for it, the header for that checkpoint will probably not contain a checkpoint header. It will probably be the contents of some old page, a log pot of nodes or checkpoint directory data. We use the HMAC to separate a checkpoint header from the other cases.
The restart strategy is:
Read lid=0
If that frame is all zeroes, there are no checkpoints. Stop.
Do
    Read the next checkpoint header.
    If the HMAC does not check or the serialNumber is lower than the previous serialNumber
        break;
Enddo
We now have the most recent checkpoint, and can restart from it. We can use the back pointers to find the checkpoint(s) that need migration.
We don't actually need "home ranges" at all. KeyKOS had the concept of "limbo": if, when it came time to migrate a page or node, its home range wasn't mounted, the kernel simply marked it "dirty" and let it be written into the next checkpoint. This kind of logic could be used for all pages and nodes in the system, eliminating the home locations as hot spots.
Copyright 1998 by Jonathan Shapiro. Copyright 2008, 2009 by Strawberry Development Group. All rights reserved. For terms of redistribution, see the GNU General Public License.

This material is based upon work supported by the US Defense Advanced Research Projects Agency under Contract No. W31P4Q-07-C-0070. Approved for public release, distribution unlimited.