From TheBestLinks.com
In computer science, fault tolerance is the property that allows a computer system to continue operating at an acceptable level of quality despite the unexpected occurrence of hardware or software failures.
Fundamental mechanisms for making a computer system fault-tolerant include:
- Replication: Providing multiple instances of the same system, directing tasks or requests to all of them in parallel, and choosing the correct result on the basis of a quorum;
- Redundancy: Providing multiple instances of the same system and switching to one of the remaining instances in case of a failure (fall-back or backup);
- Self-stabilization: Building systems to converge towards an error-free state automatically;
- Diversity: Providing multiple implementations of the same system and using them like replicated systems to avoid errors in a specific implementation.
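As an illustration of the redundancy mechanism in the list above, the following is a minimal sketch of falling back from a failed instance to a remaining one. The names `call_with_failover`, `primary`, `backup`, and `AllInstancesFailed` are hypothetical and chosen only for this example; a real system would also need failure detection suited to its hardware or network.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllInstancesFailed(Exception):
    """Raised when every redundant instance has failed (hypothetical name)."""


def call_with_failover(instances: Sequence[Callable[[], T]]) -> T:
    """Try each redundant instance in order and return the first result."""
    errors: List[Exception] = []
    for instance in instances:
        try:
            return instance()
        except Exception as exc:
            # A failure here triggers the fall-back to the next instance.
            errors.append(exc)
    raise AllInstancesFailed(errors)


def primary() -> str:
    # Simulated failure of the first instance.
    raise ConnectionError("primary is down")


def backup() -> str:
    return "served by backup"


result = call_with_failover([primary, backup])
print(result)
```

Unlike replication, only one instance does the work at a time here; the others are consulted only after a failure.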
A redundant array of independent disks (RAID) is an example of a fault-tolerant storage device that uses redundancy.
Recovery from errors in fault-tolerant systems can be characterised as either roll-forward or roll-back. When the system detects that it has made an error, roll-forward recovery takes the system state at that time and corrects it so that the system can move forward. Roll-back recovery reverts the system state to an earlier, correct version, for example one saved by checkpointing, and moves forward from there. Roll-back recovery requires that the operations between the checkpoint and the detected erroneous state can be made idempotent, because they are carried out again after the state is restored. Some systems use both roll-forward and roll-back recovery, for different errors or for different parts of one error.
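Roll-back recovery with checkpointing might look like the following minimal sketch. The class `CheckpointedAccount`, its `balance` and `applied` fields, and the operation identifiers are assumptions made for illustration. Idempotence is obtained here by remembering which operation identifiers have already been applied, which is only one of several possible ways to achieve it.

```python
import copy
from typing import List, Tuple


class CheckpointedAccount:
    """Hypothetical state holder that supports checkpointing and roll-back."""

    def __init__(self) -> None:
        self.state = {"balance": 0, "applied": set()}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self) -> None:
        # Save a copy of the current state, assumed to be correct.
        self._checkpoint = copy.deepcopy(self.state)

    def roll_back(self) -> None:
        # Revert to the most recently saved state.
        self.state = copy.deepcopy(self._checkpoint)

    def apply(self, op_id: str, amount: int) -> None:
        # Idempotent: applying the same operation twice has no further effect.
        if op_id in self.state["applied"]:
            return
        self.state["balance"] += amount
        self.state["applied"].add(op_id)


system = CheckpointedAccount()
system.apply("op-1", 10)
system.checkpoint()  # known-good state

log: List[Tuple[str, int]] = [("op-2", 5), ("op-3", 7)]
for op_id, amount in log:
    system.apply(op_id, amount)

system.state["balance"] = -999  # simulated corruption, detected as an error

system.roll_back()  # revert to the checkpoint
for op_id, amount in log:  # redo the operations made since the checkpoint
    system.apply(op_id, amount)
system.apply("op-3", 7)  # an accidental duplicate replay is harmless

print(system.state["balance"])
```

The log of operations since the checkpoint is kept outside the checkpointed state so that it survives the roll-back.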
A lockstep fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replications of each element should be in the same state. The same inputs are provided to each replication, and the same outputs are expected. The outputs of the replications are compared using a voting circuit. A machine with two replications of each element is termed dual modular redundant (DMR). With only two replications, the voting circuit can detect a mismatch but cannot tell which replication is in error, so recovery relies on other methods. A machine with three replications of each element is termed triple modular redundant (TMR). The voting circuit can determine which replication is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result and discard the erroneous version. After this, the internal state of the erroneous replication is assumed to be different from that of the other two, and the voting circuit can switch to a DMR mode. This model can be applied to any larger number of replications.
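The voting behaviour described above can be sketched in software as follows. This is only an illustration under simplifying assumptions: real lockstep voters are hardware circuits, and the function `vote` and the example elements `good` and `stuck` are hypothetical names. The sketch expects a non-empty list of outputs.

```python
from collections import Counter
from typing import Callable, Hashable, List, Optional, Sequence, Tuple


def vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Return (majority output, indices of dissenting replications).

    The output is None when there is no strict majority. With two
    replications (DMR) this is the mismatch case: an error is detected
    but cannot be attributed to either replication.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


def good(x: int) -> int:
    return x * x


def stuck(x: int) -> int:
    return 0  # simulated faulty replication


# TMR: three replications receive the same input.
replications: List[Callable[[int], int]] = [good, stuck, good]
result, faulty = vote([r(3) for r in replications])

# Discard the erroneous replication and continue in DMR mode.
active = [r for i, r in enumerate(replications) if i not in faulty]
dmr_result, dmr_faulty = vote([r(4) for r in active])
print(result, faulty, dmr_result, dmr_faulty)
```

The same function handles larger numbers of replications, since it only requires a strict majority.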
Lockstep fault-tolerant machines are most easily made fully synchronous, with each gate of each replication making the same state transition on the same edge of the clock, and the clocks to the replications being exactly in phase. However, it is possible to build lockstep systems without this requirement.
Bringing the replications into synchrony requires making their internal stored states the same. They can be started from a fixed initial state, such as the reset state. Alternatively, the internal state of one replication can be copied to another.
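Both ways of reaching synchrony can be illustrated with a small sketch. The class `Replication`, the `RESET_STATE` dictionary, and the `pc` and `acc` fields are hypothetical stand-ins for the stored state of a real element.

```python
import copy

# Hypothetical fixed initial state shared by all replications.
RESET_STATE = {"pc": 0, "acc": 0}


class Replication:
    """Hypothetical replicated element with a small internal state."""

    def __init__(self) -> None:
        self.state = copy.deepcopy(RESET_STATE)

    def reset(self) -> None:
        # Option 1: start from the fixed reset state.
        self.state = copy.deepcopy(RESET_STATE)

    def copy_state_from(self, other: "Replication") -> None:
        # Option 2: copy the internal state of another replication.
        self.state = copy.deepcopy(other.state)

    def step(self, value: int) -> None:
        self.state["acc"] += value
        self.state["pc"] += 1


a, b = Replication(), Replication()
a.step(5)
a.step(2)  # a has run ahead, so a and b are out of synchrony

b.copy_state_from(a)  # bring b into the same state as a
a.step(1)
b.step(1)  # from here on, identical inputs give identical states
print(a.state == b.state)
```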
One variant of DMR is pair-and-spare. Two replicated elements operate in lockstep as a pair, with a voting circuit that detects any mismatch between their operations and outputs a signal indicating that there is an error. Another pair operates in exactly the same way. A final circuit selects the output of the pair that does not signal an error. Pair-and-spare requires four replications rather than the three of TMR, but has been used commercially.
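A minimal software sketch of the pair-and-spare arrangement follows. The functions `run_pair` and `pair_and_spare` and the example elements `good` and `bad` are hypothetical; in hardware the two pairs run concurrently, whereas this sketch evaluates them one after the other.

```python
from typing import Callable, Optional, Tuple

Element = Callable[[int], int]
Pair = Tuple[Element, Element]


def run_pair(a: Element, b: Element, x: int) -> Tuple[Optional[int], bool]:
    """Run two replicated elements on the same input.

    Returns (output, error_flag); the flag is set when the outputs differ.
    """
    out_a, out_b = a(x), b(x)
    if out_a != out_b:
        return None, True
    return out_a, False


def pair_and_spare(pair1: Pair, pair2: Pair, x: int) -> int:
    """Select the output of a pair that does not signal an error."""
    results = [run_pair(a, b, x) for a, b in (pair1, pair2)]
    for output, error in results:
        if not error:
            return output
    raise RuntimeError("both pairs signal an error")


def good(x: int) -> int:
    return x * 2


def bad(x: int) -> int:
    return x * 2 + 1  # simulated faulty element


# The first pair disagrees internally, so the second pair's output is used.
selected = pair_and_spare((good, bad), (good, good), 5)
print(selected)
```

As in DMR, each pair can only detect its own mismatch; the spare pair is what allows operation to continue.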