For critical services, downtime is not an option.
Downtime can be reduced by replicating the
units that provide the service.
However, if the session state is important, it is not enough to
simply replicate units: sharing the continuously updated internal
state of the units must also be made possible.
If execution can continue on another unit after the point of failure
without any significant loss of state, the unit is said to have
a Hot Spare.
Saving the state of a unit so that it can be restored at a later
point in time and space is known as checkpointing.
For checkpointing to be a viable option in
interactive services, it must not disrupt normal program operation
in any way noticeable to the user.
The goal of this work is to present a checkpointing facility that
can be used in applications where checkpointing should not and cannot
disrupt normal program operation.
To accomplish this, the responsibility for taking a checkpoint is
left to the application.
The implications are twofold: checkpointing will be done at exactly the right
time and for exactly the right set of data, but each application
must be individually modified to support checkpointing.
A framework is provided so that the application programmer can
concentrate on the important issues when adding Hot Spare
capabilities: what to checkpoint and when to checkpoint.
Checkpointing efficiency is then further increased by introducing
kernel functionality to support incremental checkpoints.
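
As a rough illustration of the division of labor described above, the following Python sketch shows an application deciding what to checkpoint (by writing state through a tracker) and when to checkpoint (by calling checkpoint() at a point it considers safe). It is not the framework presented in this work: the names Checkpointer, HotSpare, set, checkpoint and apply are hypothetical, and the dirty-key tracking done here in user space only stands in for the kernel-supported incremental checkpoints mentioned above.

```python
import copy


class Checkpointer:
    """Hypothetical application-side tracker; all names are illustrative."""

    def __init__(self):
        self._state = {}     # the state the application chose to protect (what)
        self._dirty = set()  # keys written since the last checkpoint

    def set(self, key, value):
        # Every write goes through here so the change can be recorded.
        # Assumption: values are replaced, not mutated in place.
        self._state[key] = value
        self._dirty.add(key)

    def checkpoint(self):
        # Called by the application at a moment it considers safe (when).
        # Only entries changed since the previous call are copied out.
        delta = {k: copy.deepcopy(self._state[k]) for k in self._dirty}
        self._dirty.clear()
        return delta


class HotSpare:
    """Hypothetical replica that accumulates incremental checkpoints."""

    def __init__(self):
        self.state = {}

    def apply(self, delta):
        self.state.update(delta)


primary = Checkpointer()
spare = HotSpare()

primary.set("user", "alice")
primary.set("cart", ["book"])
spare.apply(primary.checkpoint())    # first checkpoint carries everything

primary.set("cart", ["book", "pen"])
delta = primary.checkpoint()         # second checkpoint carries only "cart"
spare.apply(delta)
```

After the second checkpoint the spare holds the same state as the primary, although only the modified entry was transferred. A real facility would also have to handle deletions, in-place mutation, and transport to another unit, all of which this sketch leaves out.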