Group communication is an essential building block for the development of
fault-tolerant distributed applications. The AspectIX group communication
system (AGC) is a reconfigurable, totally ordered group communication
system based on distributed consensus algorithms.
Our novel design uses a policy-based mechanism for dynamic
reconfiguration of the system at runtime without service
interruption. Reconfigurations may optimize for the most efficient
``best-case'' operation or for minimal delays in failure
situations, may select different failure models such as crash-stop,
crash-recovery, or Byzantine, and may adjust internal parameters
such as timeout values for failure detection.
An instance of a distributed consensus algorithm is used to provide a
total order of messages. Various algorithm instances, such as the classic and
Byzantine variants of Paxos, can be configured for a specific group communication
instance. Mechanisms are provided to safely adjust the set of participating
nodes at runtime.
The group communication system is the core element of the
fault-tolerance support in AspectIX.