This note documents the design for the checkpoint data structures and algorithms.
The CapROS checkpoint design is derived from the EROS checkpoint design. The EROS design differs from the KeyKOS design, but many of the ideas of that design still pertain. The KeyKOS design is described in The Checkpoint Mechanism in KeyKOS. The EROS design is described in The Design Evolution of the EROS Single-Level Store. The differences between CapROS and the other systems are summarized below. You can skip this section if you're not interested in the history.
KeyKOS uses two checkpoint areas of equal size (what Landau calls the checkpoint area and the working area). It fills the working area up, declares a checkpoint, swaps the checkpoint and the working areas, and then drains (migrates) from what is now the checkpoint area.
EROS and CapROS use a single checkpoint area in a circular fashion, dividing this area dynamically into several generations. A generation is permitted to occupy up to a fixed percentage of the checkpoint area (called the limit-percent, typically 65%) before a checkpoint must be declared. A new generation is then started and a previous checkpoint generation is migrated. This space utilization strategy provides greater flexibility in the face of bursts of object modifications. It also allows the checkpoint area to grow more easily.
KeyKOS was designed for the checkpoint areas to be somewhat larger than main storage. It calculated its "Checkpoint Limit" based on the number of page frames, node frames, and directories, on the assumption that all of them could need to be written into the checkpoint. EROS and CapROS use a "reserve space before changing" technique which allows the checkpoint areas to be smaller than main storage.
CapROS can have several generations of checkpoints which have not yet been migrated. While EROS and KeyKOS had logic which skipped migration of a page (or node) which had been changed since the checkpoint was taken (because it would be included in the next checkpoint), some of the frequently changed objects would be migrated before they were changed in the current generation. By not starting migration until more than one checkpoint has been taken, active pages and nodes have a greater chance to be modified before they are even considered for migration.
Although there is more than one checkpoint outstanding, this version of the checkpoint logic does not permit restart from other than the most recently stabilized checkpoint.
KeyKOS never reuses space in the checkpoint area; if an object is modified a second time it gets a new location. The reasons for this choice are to avoid seeks and head settling delays where possible, and to permit chained writes on the IBM S/370 disks which allowed actual data rates to approach the theoretical maximum.
EROS adopted a strategy of re-writing a page in the original working area location if it was paged out twice during one checkpoint interval. This strategy had the advantage of saving space and reducing seek distances.
Because of the need to support logs in flash memory, CapROS will follow the KeyKOS strategy. This strategy has the advantage of eliminating a source of "hot spots" on the disk.
Once KeyKOS begins to reuse a checkpoint area, all memory of that area's prior content is discarded.
EROS preserved a record of object locations in the checkpoint area until their storage was actually reused. This strategy reduced the number of seeks to areas outside the checkpoint area, improving disk performance. It also increased the number of locations for a page or node, allowing fetches to be scheduled on less-active devices.
Which strategy CapROS adopts will be up to the person writing the code.
EROS kept 28 bit log location IDs (lid, see below) in the Object Directory part of the checkpoint. Once you account for the 8 bits of object index, this size results in a maximum of 2^20 frames in the log, which was reasonable for the systems of the time, but now is barely enough to hold the RAM of a modern desktop processor. The CapROS lid is the same size as Object IDs (oids, see below).
EROS kept a 2 level tree for the on-disk Object Directory. This limited the size of a checkpoint to 259,080 objects. CapROS writes Object Directory and Process Directory frames contiguously in the log, so there is no artificial limit on the number of objects in a checkpoint.
EROS required sequential numbering of the log ranges. CapROS may relax that requirement depending on how easy it is in practice, but it should be noted that relaxing this requirement may be in conflict with contiguous Object and Process Directories.
The CapROS log area is made up of ranges of disk page frames that are sequentially numbered beginning at zero. These are referred to as log ranges. Just as objects in object ranges are referred to with object identifiers (oids), objects in the log are referred to with log identifiers (lids). A lid is constructed by taking the 56 bit frame number and concatenating an 8-bit object index. (8 is equal to log2(EROS_OBJECTS_PER_FRAME).)
Frame     0       1      2                                    SIZE
          +-------+------+----------------------------------------+
          | header area  |                main log                |
          +-------+------+----------------------------------------+
LID       0x0            2<<8                          ... SIZE<<8
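To make the encoding concrete, here is a minimal sketch of packing and unpacking a lid under the layout just described. The names (lid_t, OBJECT_INDEX_BITS, the helper functions) are illustrative, not the kernel's actual identifiers.

#include <stdint.h>

#define OBJECT_INDEX_BITS 8   /* log2(EROS_OBJECTS_PER_FRAME) */

typedef uint64_t lid_t;       /* 56-bit frame number, 8-bit object index */

static inline lid_t
lid_from_frame(uint64_t frameNum, unsigned objIndex)
{
  return (frameNum << OBJECT_INDEX_BITS) | (objIndex & 0xff);
}

static inline uint64_t
lid_to_frame(lid_t lid)
{
  return lid >> OBJECT_INDEX_BITS;
}

static inline unsigned
lid_to_index(lid_t lid)
{
  return (unsigned)(lid & 0xff);
}

With this encoding the header area occupies frames 0 and 1, and the first main log frame is at lid 2<<8, matching the diagram above.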
Log ranges may reside on multiple disks, but the implementation may require that sequential numbering be preserved. This requirement presents some difficulties for log area rearrangement which may, in due course, be solved with suitable user-level applications.
A generation that has been completely written to the disk is said to be stabilized. A stabilized generation has an associated on-disk directory that describes where all of the objects in that generation reside. The last step in stabilizing a generation is writing the associated checkpoint header.
Frames zero and one of the log are the header area. They hold the two most recent checkpoint headers. There are two headers so that if there is a disk error writing a header, there will still be another valid header that can be used for restart.
The checkpoint header contains an identifying header, containing the generation number of the restart generation. (This structure is called CkptRoot in the code.) The header with the highest generation number is the most recent header. Only the most recent valid header is used for restart.
The checkpoint header also contains the number of unmigrated generations, and an array of LIDs of the generation headers of all the unmigrated generations. There is a limit, currently 20, on the number of unmigrated generations.
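A minimal sketch of what CkptRoot might contain, based only on the fields described here; the actual field names and layout in the code may differ.

#define MAX_UNMIGRATED_GENERATIONS 20  /* current limit noted above */

struct CkptRoot {   /* sketch; names are illustrative */
  uint32_t versionNumber;              /* identifying header */
  GenNum   checkpointSequenceNumber;   /* generation number of the restart generation */
  uint32_t numUnmigratedGenerations;
  LID      generations[MAX_UNMIGRATED_GENERATIONS]; /* LIDs of the unmigrated generation headers */
  /* The restart description below implies an integrity check (checksum)
     at the end of the header; its form is not specified here. */
};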
The last item in the generation header is a field used to validate the correctness of the previous data. This field is included to detect frames which were only partially written due to hardware failure. We needed this field on the S/370 because certain write failures (e.g. channel failure) caused the disk controller to fill out the block with the last byte transferred from the channel, and then write a correct hardware checking block.
The root of a generation directory is the generation header. The generation header frame begins with a header structure describing the generation.
The header structure is:
struct DiskGenerationHdr {
  uint32_t versionNumber;
  GenNum sequenceNumber;
  LID firstLogLoc;
  LID lastLogLoc;
  LID firstProcessDirFrame;
  LID firstObjectDirFrame;
  uint32_t nProcessDirFrames;
  uint32_t nObjectDirFrames;
  uint32_t nProcessDescriptors;
  uint32_t nObjectDescriptors;
};
The object directory needs to be kept entirely in memory even after a checkpoint is stabilized. It makes sense to write the entire object directory into sequential log locations at the end of the checkpoint.
The process directory must be saved as of the time of the demarcation event. It makes sense to write it into contiguous log locations immediately after the pages and log pots that were written before the demarcation event.
Contiguous log locations allow easy scheduling of writes and reads during checkpoint and restart, as all the locations of the directories are known initially.
Implementor note: The old 32 bit lid was referred to as a LogLoc in the code.
Each of the fields is explained below.
The version number of the generation header.
The generation sequence number allows CapROS to determine which generation is most recent.
The firstLogLoc and lastLogLoc fields allow the restart code to determine what part of the main log contains this generation. Note that firstLogLoc will be higher than lastLogLoc when this generation wraps from the end of the main log to the beginning.
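For example, deciding whether a lid belongs to this generation has to allow for wraparound; a sketch (the helper name is illustrative):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t LID;   /* a lid: 56-bit frame number, 8-bit object index */

/* Does lid belong to the generation bounded by firstLogLoc..lastLogLoc?
   The generation may wrap from the end of the main log to the beginning. */
static bool
lid_in_generation(LID lid, LID firstLogLoc, LID lastLogLoc)
{
  if (firstLogLoc <= lastLogLoc)   /* generation does not wrap */
    return lid >= firstLogLoc && lid <= lastLogLoc;
  else                             /* generation wraps around the log */
    return lid >= firstLogLoc || lid <= lastLogLoc;
}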
The restart code checks each lid when rebuilding the in-core object and process directories. If the most recent version of all the unmigrated objects, the restart generation process directory frames, and the object directory frames for all the unmigrated generations are not all mounted, the system will not restart until they are mounted.
nProcessDirFrames is the number of process directory frames. If the entire process directory is in the checkpoint header, it will be zero. The firstProcessDirFrame is the LID of the first frame of the process directory.
nObjectDirFrames is the number of object directory frames. If the entire object directory is in the checkpoint header, it will be zero. The firstObjectDirFrame is the LID of the first frame of the object directory.
Part of the process directory may be held in the generation header frame. nProcessDescriptors is the number of process descriptors in that list. They immediately follow the header.
Similarly, part of the object directory may be held in the generation header frame. nObjectDescriptors is the number of object descriptors in the header. They immediately follow any process descriptors that may be in that header.
A process directory frame consists of a single word describing the number of following entries, and then some number of process descriptors. With 8 byte OIDs and 4 byte ObCounts, a 4K process directory frame can describe 314 processes.
struct DiskProcessDescriptor {
  OID oid;
  ObCount callCount;
  uint8_t actHazard;
};
An entry in the process directory corresponds to a process that was running at the time of the demarcation event. The process entries from the restart generation are reloaded as part of the system startup procedure.
The oid field identifies the root node of the process. actHazard identifies some situations requiring special handling:
Note that on restart CapROS loads the process list from the most recent generation header only.
The object directory frames hold ObjectDescriptors. Each object descriptor holds an object identifier, the object's allocation and call counts, the type of the object, and the lid at which the object can be found. With 8 byte OIDs, 4 byte ObCounts, 8 byte LID, and 1 byte types, each 4kB frame can hold a description of 163 objects (assuming unaligned/packed storage and allowing for a count of entries at the start of the frame).
struct ObjectDescriptor {
  OID oid;
  ObCount allocCount;
  ObCount callCount;
  LID logLoc;
  uint8_t type;
};
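The per-frame counts quoted above follow from the packed descriptor sizes. A sketch of the arithmetic, assuming an 8 byte OID, 4 byte ObCount, 8 byte LID, 1 byte type/actHazard, a 4 byte entry count at the start of each frame, and unaligned packing (sizeof() of the structs above would include padding):

#define FRAME_SIZE        4096
#define FRAME_COUNT_WORD  4                    /* entry count at start of frame */

#define PROC_DESC_PACKED  (8 + 4 + 1)          /* oid + callCount + actHazard = 13 */
#define OBJ_DESC_PACKED   (8 + 4 + 4 + 8 + 1)  /* oid + counts + lid + type = 25 */

enum {
  PROC_DESCS_PER_FRAME = (FRAME_SIZE - FRAME_COUNT_WORD) / PROC_DESC_PACKED, /* 314 */
  OBJ_DESCS_PER_FRAME  = (FRAME_SIZE - FRAME_COUNT_WORD) / OBJ_DESC_PACKED   /* 163 */
};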
For nodes, the callCount field contains the call count. For other objects, the callCount field is unused.
The allocCountUsed and callCountUsed bits are not stored on disk. They are assumed to be one for objects on disk.
If the lid field of the directory is zero, then the corresponding object is null: either a zero-filled page or a node filled with void keys and zero nodeData. Which one can be determined by the value of the type field.
If the lid field is nonzero, the corresponding log frame contains either page data or a "log pot." A log pot is simply a page-sized cluster of nodes in the main log. This is why the lid value is concatenated with an object index: the index indicates which entry in the log pot contains the relevant object. (The object index of a LID, like that of an OID, is 8 bits to allow for future types that may be small enough to fit 256 in one frame.)
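Putting the pieces together, fetching an object described by a directory entry might look roughly like this. This is only a sketch: read_log_frame, make_null_object, copy_node_from_pot, and OBTYPE_PAGE are hypothetical stand-ins for the kernel's real I/O and object-type machinery.

#include <stdint.h>

extern void read_log_frame(uint64_t frame, void * buf);                        /* hypothetical */
extern void make_null_object(void * buf, uint8_t type);                        /* hypothetical */
extern void copy_node_from_pot(const void * pot, unsigned index, void * buf);  /* hypothetical */
#define OBTYPE_PAGE 0   /* hypothetical type code for a data page */

void
fetch_object_from_log(const struct ObjectDescriptor * od, void * buf)
{
  if (od->logLoc == 0) {
    /* Null object: a zero-filled page, or a node of void keys and zero
       nodeData, as indicated by od->type. */
    make_null_object(buf, od->type);
    return;
  }

  uint64_t frame = od->logLoc >> 8;     /* log frame holding the data */
  unsigned index = od->logLoc & 0xff;   /* index within a log pot */

  if (od->type == OBTYPE_PAGE) {
    read_log_frame(frame, buf);          /* the frame is the page itself */
  } else {
    char pot[4096];
    read_log_frame(frame, pot);          /* the frame is a log pot of nodes */
    copy_node_from_pot(pot, index, buf); /* pick the indexed node out of the pot */
  }
}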
CapROS will support solid-state disks that use flash memory. These devices have a finite number of erase-write cycles (up to 1,000,000 for NAND flash). When the device is written, a block of memory (as large as 256KB) is first erased, then written.
The header area is a "hot spot" on the disk because it is written once on every checkpoint. If we assume (pessimistically) one checkpoint every ten seconds, and (optimistically) 1,000,000 write cycles, the header will wear out in only 115 days. An earlier design had a large header area in an attempt to spread out writes, but the large erase block size makes that approach impractical.
Flash memory devices intended to replace disks generally have some mechanism for wear leveling. We will rely on that mechanism to handle the header area "hot spot". The migrator will be written to avoid making node pots and tag pots "hot". The main log should be sized to hold several generations to avoid a hot spot there.
In overview, checkpointing operates as follows:
The diagram above shows the layout of the main log. The main log cursor (the next location to be written) is at the end of the bar. Note that the log is circular, so the left end and the right end are the same location. The retired generations remain in the main log until their storage is reused. There may be free space if the directory entries that described that space have been freed. The reserved space may be bigger or smaller than the free space.
Before a persistent object in RAM is dirtied, space is reserved for it in the main log. See below for details of the reservation algorithm.
If the checkpoint interval (typically 5 minutes) has passed since the last demarcation event, a new demarcation event is declared. This ensures that the system can be restarted with data that is no more than 5 minutes old. See below for other reasons to declare a demarcation event.
At the demarcation event, dirty objects in RAM are marked "copy on write", the Activity structure is saved as a process directory, the next generation is started, and execution resumes. At this time we say a checkpoint is active. Then, the process directory is written to the log, and then all of the dirty objects in the working generation are written to the log.
After the last dirty object in the working generation has been written to disk, the object directory is written. Then a new generation header is written; that generation will become the restart generation.
Then a checkpoint header is written to the header area. Alternate checkpoint headers are written to alternate locations in the header area, so the older header is overwritten, and there is always a good header.
Migration is the process of copying objects from the log to their home locations. Migration runs whenever there are any unmigrated stabilized generations, migrating the oldest such generation first. The priority of migration is adjusted to balance the desire to avoid contending with needed disk I/O, with the need to progress fast enough to avoid having to stall processes (as described below).
Conceptually, during the migration phase, all the objects in a stable generation are copied from the main log to their home locations on the disk. As an important optimization, objects that are not current need not be migrated, as described in The Checkpoint Mechanism in KeyKOS.
Migration is optimized by a non-persistent process external to the kernel. This process may choose to optimize disk arm motion in the home ranges if the disk is on rotating media; or it may minimize rewrites if the disk is on flash memory. Because it is non-persistent, the external migrator will never be stalled attempting to dirty an object.
As the log cursor and the reserved space advance around the circular main log, it may happen that all the space in retired generations is reserved. In that case, if an attempt is made to dirty another object and increase the reservation, the dirtying process must be stalled until space is freed up. The priority of migration will be adjusted to reduce the likelihood of this happening.
The main log is a circular buffer of frames, each of which is one page in size. A frame may hold a data page, a log pot of nodes, a process directory frame, an object directory frame, or a generation header.
Before any persistent object can be dirtied, space for it must be reserved in the main log. The system keeps track of the number of dirty persistent objects of each type. The object reservation is the maximum amount of space in the main log that could be needed to save those objects, taking into account:
The reservation is important because before dirtying an object we need to make sure we will be able to clean it later. When we clean an object, we can't overwrite any unretired generations, because that data will be needed if the system should restart.
We do not actually decide where to store the newly dirtied object when space for it is reserved; we decide that when the object is actually written. Indeed, the object might be null (zero) at the time it is written, and not take any space in the log at all, so the reservation is an upper bound on the space needed.
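A sketch of how a per-generation reservation might be computed as an upper bound. The names and the exact accounting are assumptions; the real code also has to cover directory frames, the generation header, and other categories not shown here.

#include <stdint.h>

struct GenReservation {
  uint32_t nDirtyPages;   /* dirty persistent pages in this generation */
  uint32_t nDirtyNodes;   /* dirty persistent nodes in this generation */
};

/* nodesPerLogPot: how many nodes fit in one page-sized log pot
   (a property of the node size; no particular value is assumed here). */
static uint32_t
reserved_frames(const struct GenReservation * r, uint32_t nodesPerLogPot)
{
  uint32_t pageFrames = r->nDirtyPages;   /* one log frame per dirty page */
  uint32_t potFrames  =                   /* nodes are packed into log pots; round up */
      (r->nDirtyNodes + nodesPerLogPot - 1) / nodesPerLogPot;
  /* Object/process directory frames and the generation header written at
     stabilization would also be counted (not shown). */
  return pageFrames + potFrames;
}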
While a checkpoint is active, there are some dirty objects in the working generation, and other dirty objects in the next generation. We track the reservation for each generation separately. We shall see below how these are used.
At some later time, we actually write the dirty object to the disk. At this point we actually allocate the log frame that the object or log pot will go to. At the same time, the object becomes clean, so we reduce the count of dirty objects, which may reduce the reservation. This is how reserved space is consumed.
An object can be dirtied (reserving space), written out (consuming space), and subsequently redirtied. When this occurs, there are two possible design options:
The first option has several disadvantages:
Therefore we adopt the second policy. Even if a frame in the log does not contain an active entry, we do not reuse it. As a result, we can use a simple cursor (the main log cursor) to allocate space in the log.
A great many of the objects written to the log will prove to be null objects. For our purposes, a null object is any of the following:
The reason these occur with such frequency is that returning storage to the space bank causes that storage to be zeroed. The reason to do this is that objects returned to the space bank are generally dirty, and handling them this way obviates the need to rewrite now-dead data back to the log. By storing them as null objects, the write bandwidth requirements can be reduced significantly.
The advantage to handling null objects specially is that they do not occupy any object storage in the log. The directory entry for a null object suffices to indicate that the object is null.
When CapROS is booted, it normally looks for a checkpoint on disk. (See Booting for other boot scenarios.) CapROS disk volumes (partitions) are identified with an IPLSysId that is unique for each system. This allows one CapROS system to format disks for another CapROS system without confusing them.
The restart sequence finds the configured disks, using a combination of configuration data and probing. It mounts all divisions of all volumes with the correct IPLSysId. Then it reads the checkpoint headers at LIDs 0 and 1 and determines which one is the most recent valid checkpoint header (the one with the highest generation number). This header identifies the restart generation and all the unmigrated generations.
If a drive failure occurred while writing a header (detected by bad CRC while reading the corresponding frame or by the checksum at the end of the header failing to verify), that header is discarded.
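A sketch of the header-selection step, using the CkptRoot sketch from earlier; the validity flags stand in for whatever CRC/checksum verification the real code performs.

#include <stdbool.h>
#include <stddef.h>

/* Pick the restart header from the two header frames. */
struct CkptRoot *
choose_restart_header(struct CkptRoot * hdr0, struct CkptRoot * hdr1,
                      bool hdr0Valid, bool hdr1Valid)
{
  if (hdr0Valid && hdr1Valid)
    return hdr0->checkpointSequenceNumber > hdr1->checkpointSequenceNumber
           ? hdr0 : hdr1;    /* the highest generation number wins */
  if (hdr0Valid) return hdr0;
  if (hdr1Valid) return hdr1;
  return NULL;               /* no restartable checkpoint */
}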
Then all the unmigrated generation headers are read. The number of unmigrated generations is given in the checkpoint header. Those generation headers are used to reload the object directory. Unmigrated generations now need to be migrated (perhaps for the second time). All the older generations have been both migrated and retired, and their space is available for allocation and reservation. Their contents must not be migrated or used, because they may have been overwritten with newer, unstabilized data.
The process directory is reloaded from the restart generation. The main log cursor is initialized to the frame following the last frame of the restart generation.
For simplicity, we consider that no part of an unmigrated generation is available for allocation, even though some frames might be migrated or might not contain current data.
To avoid deadlock, we must make sure that the disk handler and migrator can always make progress. The design of the disk handler and perhaps the migrator are such that they allocate some storage dynamically. When the system is booted, we must wait until these components have signaled that they have finished allocating dynamic storage, before allowing any persistent objects to be dirtied.
Here are some algorithms that might not be obvious.
Just before a persistent object is dirtied, the storage reservation algorithm is executed. The algorithm goes like this:
If migrator and disk handler are not initialized:
    Process stalls waiting for those to be initialized.
If a checkpoint is not active:
    Calculate tentative working reservation assuming object is dirtied.
    If size of working area + tentative reservation > size of main log * Limit Percent
       OR tentative reservation > size of retired generations:
        Declare a demarcation event (and a checkpoint is now active).
If a checkpoint is active:
    Calculate tentative reservation for next generation assuming object is dirtied.
    If size of working area + working reservation + tentative next reservation
          > size of migrated generations (retired or not)
       OR the number of unmigrated generations (including the working generation)
          >= the maximum number of generations permitted in the directory:
        Process cannot dirty now. Process stalls waiting for migration to progress.
    Else
        dirty the object and adjust next reservation.
Else
    dirty the object and adjust working reservation.
When we write an object to the disk, we do the following:
If object is null:
    update in-core directory entry accordingly
Else
    write object to disk and update in-core directory accordingly
Mark object clean and adjust the corresponding reservation downwards.
During checkpoint stabilization, we must write the process directory and the object directory. The space allocated to each in the header is variable. Since the process directory is written before the dirty objects, we treat it first.
Let S be the amount of space in the generation header frame after the header.
First, if the total number of descriptors in the process directory fits in S, then just put them directly in the header frame and reduce S by the space used.
Otherwise, put all the descriptors that will fit in the header frame and set S to zero.
After all the objects have been written, put as many object descriptors as will fit in the remaining space in the header frame, and write the rest into object directory frames at the end of the checkpoint.
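A sketch of this bookkeeping, reusing the packed descriptor sizes from the sketch above; the function and variable names are illustrative.

static void
plan_generation_header(struct DiskGenerationHdr * hdr,
                       uint32_t nProcesses,        /* descriptors in the process directory */
                       uint32_t * objDescCapacity) /* out: object descriptors that can still fit */
{
  uint32_t S = FRAME_SIZE - sizeof(struct DiskGenerationHdr);

  /* Process directory: placed first, before the dirty objects are written. */
  if (nProcesses * PROC_DESC_PACKED <= S) {
    hdr->nProcessDescriptors = nProcesses;            /* the whole process directory fits */
    S -= nProcesses * PROC_DESC_PACKED;
  } else {
    hdr->nProcessDescriptors = S / PROC_DESC_PACKED;  /* put what fits */
    S = 0;     /* the rest go into process directory frames written to the log */
  }

  /* After all the objects have been written, up to this many object
     descriptors can still be placed in the header frame; the rest go into
     object directory frames at the end of the checkpoint. */
  *objDescCapacity = S / OBJ_DESC_PACKED;
}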
The in-core directory is the data structure that records the location of objects in the log. The in-core directory is simply called the directory when context makes it clear. It is used for three purposes:
The migrator only needs to migrate the current version of an object, so the directory only needs to keep information about the current version of objects and (for pages) the restart version.
Note that the migrator will have to be able to find entries for the generation being migrated, so either a compressed generation number will need to be saved in each directory entry, or entries can be linked on a list headed in the core generation structure.
A number of data structures are possible for the directory. Here is one. Each directory entry is a node in a red-black tree with the oid as the key. Each entry has information about the object: the allocation and call counts, the alloc and call count used bits, and the type. Each entry also has the LID and generation number of the current version. To find entries by generation, the entry is linked on a doubly-linked chain for its generation. If the object is a page, the entry may also have the LID and generation number of the restart version. (We do not need to link into the chain for the generation of the restart version.)
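A sketch of such an entry, under the assumptions just listed; the field names, widths, and linkage are illustrative, and OID, ObCount, LID, and GenNum are the types used elsewhere in this note.

struct LogDirent {
  /* red-black tree linkage, keyed by oid */
  struct LogDirent * left;
  struct LogDirent * right;
  struct LogDirent * parent;
  uint8_t color;

  OID      oid;
  ObCount  allocCount;
  ObCount  callCount;
  uint8_t  allocCountUsed : 1;
  uint8_t  callCountUsed  : 1;
  uint8_t  type;

  /* current version */
  LID      logLoc;
  GenNum   generation;
  struct LogDirent * genNext;   /* doubly-linked chain for this generation */
  struct LogDirent * genPrev;

  /* restart version (pages only); unused otherwise */
  LID      restartLogLoc;
  GenNum   restartGeneration;
};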
When a checkpoint is stabilized, the current version becomes also the restart version; the old restart version is no longer the restart version. Rather than update affected directory entries at this time, we just recognize this situation from the generation numbers, and update the entry whenever we happen to visit it.
The following is another, older design:
struct CoreDirectory {
  GenNum migratorSequenceNumber;
  CoreDirent *oidTree;
};
The last completely migrated generation sequence number.
The top of the red-black tree that makes up the in-core directory.
CapROS allocates a fixed, tunable amount of space for directory entries at boot time. The restart code loads the directory with information for the restart generation and all the other unmigrated generations. (It must not load directory information from migrated generations, because those generations may have been freed and overwritten after the restart checkpoint was stabilized.)
A directory entry is allocated when a dirty persistent object is cleaned, that is, written to the log. (If that object was already in the log in the same generation, we will get its old entry back, but we do not count on this in managing directory entry allocation.) Thus the number of dirty objects is an upper bound on the number of directory entries we will need to take a checkpoint.
Before we dirty an object, we check that the number of free directory entries, plus the number of directory entries used by retired generations (these entries could easily be freed), is greater than or equal to the soft limit. The soft limit is a tunable parameter that may be the number of objects that can fit in memory, or may be larger. If we aren't going to exceed the soft limit, we go ahead and dirty the object.
If we would exceed the soft limit, and a checkpoint isn't in progress, we declare a demarcation event. If the number of free directory entries plus the number that could be freed in retired generations is greater than or equal to the number of dirty objects (including the object about to be dirtied), then we go ahead and dirty the object. Otherwise, the dirtying of the object must wait until this comparison can succeed.
At the time an object is cleaned, if there are no free directory entries, we free the oldest retired generation, freeing all its directory entries if any, until there is a free directory entry. (The above algorithm ensures that we can always get a free entry.)
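A sketch of the soft-limit test made before dirtying an object; the names and return convention are illustrative, and the demarcation and stall paths are only noted in comments.

#include <stdbool.h>
#include <stdint.h>

bool
may_dirty_for_dir_entries(uint32_t freeEntries,
                          uint32_t entriesInRetiredGens, /* could easily be freed */
                          uint32_t dirtyObjects,  /* including the object about to be dirtied */
                          uint32_t softLimit,
                          bool checkpointActive)
{
  uint32_t available = freeEntries + entriesInRetiredGens;

  if (available >= softLimit)
    return true;                 /* we won't exceed the soft limit; go ahead */

  if (!checkpointActive) {
    /* declare a demarcation event here (not shown) */
  }

  /* We may proceed only if the entries we could obtain cover every dirty
     object we might have to clean; otherwise the caller must wait. */
  return available >= dirtyObjects;
}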
I (wsf) think this structure is overkill, but it is useful documentation of what the coder will possibly need. This whole section is old and should be updated or deleted when the code is written.
Each generation has an associated in-core structure. This structure maintains book-keeping information concerning the generation and contains the root pointer for the in-core generation directory.
struct CoreGeneration {
  bool canReclaim;
  CoreDirent *oidTree;
  uint32_t nCoreDirent;
  uint32_t nDirent;
  uint32_t nReservedFrames;
  uint32_t nAllocatedFrames;
  uint32_t nDirPage;
  uint32_t nLogPot;
  uint32_t nZeroPage;
  uint32_t nZeroNode;
  uint32_t nPage;
  uint32_t nNode;
  uint32_t nProcess;

  // Record of storage we have released:
  uint32_t nReleasedNode;

#ifdef PARANOID_CKPT
  uint32_t nAllocDirPage;
  uint32_t nAllocPage;
  uint32_t nAllocLogPot;
#endif

  LID curLogPot;   /* type missing in the original; LID assumed */
  uint32_t nNodesInLogPot;
};
For each type of object (page, zero page, node, zero node, process), the core generation structure records how many of these objects have been allocated in this generation. An object is allocated the first time it is written to the checkpoint area. Nodes are packed into log pots, so several node allocations share a single frame.
The various fields are:
Indicates whether the objects in this checkpoint generation have been migrated, and can therefore be reclaimed.
Root pointer to a red-black tree of directory entries for each object in this generation.
Number of entries in the core directory tree for this generation.
Number of on-disk directory entries that have been reserved for this generation. Should always be equal to nCoreDirent.
Number of disk frames that have been reserved for this generation.
Number of disk frames that have been allocated for this generation. Should always be less than or equal to nReservedFrames.
Number of directory pages, pages, and log pots (respectively) that have been reserved in this generation. These should sum to nReservedFrames.
struct CoreDirent {
  CoreDirent *left;
  CoreDirent *right;
  CoreDirent *parent;
  OID oid;
  ObCount count;
  LogLoc logLoc : 24;
  uint8_t type;
  enum { red = 1, black = 0 };
  uint8_t color;
};
The design selected for CapROS above tries to avoid "hot spots" on disk, but still has "warm spots" in the checkpoint headers and possibly in the home locations. By having a large number of checkpoint headers, the heat in that part of the flash memory is reduced to a level which should meet the device life requirements.
The following design is based on the goal of never re-writing a location in the log until all other locations have been written (exactly once). It was rejected because:
The log range page with LogID (lid) 0 is the first checkpoint header. When the disk is formatted, it is initialized to all zeroes. LogID 0 is always a checkpoint header, and acts as the root of all the checkpoint headers on the disk. All the log ranges on the system make up a single logical checkpoint area which is written sequentially in order of increasing LogIDs. When a checkpoint reaches the end of the available log ranges, it wraps around to the beginning, starting with lid=1.
Each checkpoint header holds pointers to information about the checkpoint, more-or-less following those described in the chosen design. In addition they hold forward and back pointers to checkpoint headers for previous and subsequent checkpoints as follows:
HMACValue checksum;
LID next;
LID previous;
LID migratedSeqNum;
HMACKey nextMac;
HMACKey previousMac;
uint64_t serialNumber;
The last operation in a checkpoint is writing the checkpoint header which is usually the first block of the checkpoint. (If the checkpoint wraps the log ranges, lid=0 will be the header for that checkpoint, and the first block of the checkpoint will either contain checkpoint data or remain unused.) Note that while we can calculate the lid of the next checkpoint, and generate a HMACKey for it, the header for that checkpoint will probably not contain a checkpoint header. It will probably be the contents of some old page, a log pot of nodes or checkpoint directory data. We use the HMAC to separate a checkpoint header from the other cases.
The restart strategy is:
Read lid=0
If that frame is all zeroes, there are no checkpoints. Stop.
Do
    Read the next checkpoint header.
    If the HMAC does not check or the serialNumber is lower than the previous serialNumber
        break;
Enddo
We now have the most recent checkpoint, and can restart from it. We can use the back pointers to find the checkpoint(s) that need migration.
We don't actually need "home ranges" at all. KeyKOS had the concept of "limbo": if, when it came time to migrate a page or node, its home range wasn't mounted, the kernel simply marked it "dirty" and let it be written into the next checkpoint. This kind of logic could be used for all pages and nodes in the system, eliminating the home locations as hot spots.
Copyright 1998 by Jonathan Shapiro. Copyright 2008, 2009 by Strawberry Development Group. All rights reserved. For terms of redistribution, see the GNU General Public License.

This material is based upon work supported by the US Defense Advanced Research Projects Agency under Contract No. W31P4Q-07-C-0070. Approved for public release, distribution unlimited.