Implementing highly available, highly reliable virtual processors

Date

1994

Authors

Macdonald, Robert Noël

Abstract

A fault-tolerant distributed facility called a Halt on Failure Processor (HFP) and its performance in a network of workstations are described. Process replication and n-modular redundancy are used to achieve fault tolerance in a general-purpose workstation environment. A blacklisting mechanism is used to differentiate between slow and crashed workstations. The system achieves high availability by keeping a list of healthy workstations. The HFP will halt rather than deliver the results of an erroneous calculation to its users. The design of the HFP is presented along with the type and number of errors it is capable of handling. The implementation using the existing Remote Execution Manager is discussed. Extensive performance studies were carried out within a network of Sun SPARC workstations running UNIX. Performance results are presented and the costs of performing fault management at various levels are exposed. Flaws in the way UNIX reports load information and their implications for load balancing are pointed out. It is shown that HFPs can achieve high availability and fault tolerance using the idle cycles of workstations in a local area network with little performance degradation.
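
The halt-on-failure behaviour described in the abstract can be illustrated with a minimal sketch of n-modular redundancy. This is not the thesis's implementation: the strict-majority voting rule and the names `HaltOnFailure`, `vote`, and `run_replicated` are assumptions introduced here for illustration, and the replicas are plain local functions standing in for processes on separate workstations.

```python
from collections import Counter


class HaltOnFailure(Exception):
    """Raised instead of returning a result the replicas do not agree on."""


def vote(results):
    """Return the value held by a strict majority of replicas, or halt.

    Assumption: results are hashable and compared for exact equality.
    """
    if not results:
        raise HaltOnFailure("no replica returned a result")
    value, count = Counter(results).most_common(1)[0]
    # A strict majority is required; a tie or plurality is treated as failure.
    if count * 2 <= len(results):
        raise HaltOnFailure("no strict majority among replicas")
    return value


def run_replicated(task, replicas):
    """Run the same task on every replica and vote on the outcomes.

    Hypothetical helper: each replica is a callable taking the task input.
    """
    results = [replica(task) for replica in replicas]
    return vote(results)


if __name__ == "__main__":
    healthy = lambda x: x * x
    faulty = lambda x: x * x + 1  # simulates an erroneous calculation
    print(run_replicated(3, [healthy, healthy, faulty]))
```

With three replicas, one erroneous result is outvoted and the agreed value is returned; if no strict majority exists, the sketch raises `HaltOnFailure` rather than handing a possibly wrong result to the caller.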
