Contingency Handling

A Contingency Manager is a constituent of PalCom Middleware Management. It is responsible for administering fault and problem conditions that occur in active PalCom Assemblies and 2nd Order Resources by applying a variable set of contingency tools and mechanisms.

 

Every PalCom Assembly will nominally have one active Contingency Manager residing on one PalCom Node participating in the assembly. Other inactive Contingency Managers may also be present within the assembly, especially to support distributed replication and failover redundancy. Such an inactive Contingency Manager would then be activated on failure of a previously active Contingency Manager.
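
The activation of a standby can be sketched as follows. This is only an illustrative sketch: the class names, the `alive` flag and the `fail_over` method are hypothetical and are not part of any PalCom API, and the rule of promoting the first live standby is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ContingencyManager:
    """Hypothetical stand-in for a Contingency Manager hosted on one node."""
    node_id: str
    active: bool = False
    alive: bool = True


@dataclass
class Assembly:
    """Hypothetical assembly with one active manager and inactive standbys."""
    managers: list = field(default_factory=list)

    def active_manager(self):
        # Return the manager that is both flagged active and still alive.
        for m in self.managers:
            if m.active and m.alive:
                return m
        return None

    def fail_over(self):
        """Activate the first live standby if no live active manager remains."""
        current = self.active_manager()
        if current is not None:
            return current
        for m in self.managers:
            m.active = False  # clear the flag on the failed manager
        for m in self.managers:
            if m.alive:
                m.active = True
                return m
        return None  # no live manager is left in the assembly


assembly = Assembly([
    ContingencyManager("node-a", active=True),
    ContingencyManager("node-b"),
])
assembly.managers[0].alive = False   # the node hosting the active manager fails
promoted = assembly.fail_over()
print(promoted.node_id)              # node-b
```

How the failure of the active manager is detected (for example by heartbeat) is left open here, as the text does not specify it.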

 

Contingency management addresses the non-availability of resources in PalCom Assemblies, nominally caused by starvation or failure. Contingency implies the ability of a palpable system to automatically identify problem conditions, determine suitable means to resolve them, and then apply appropriate mechanisms to prevent future error conditions. For example, a temporarily lost network connection does not necessarily lead to an error condition, because system design has to treat it as a legal operating state. This makes a system not only more resilient to failure, but also capable of adapting to ambient conditions such as resource starvation. Resource management is therefore a required step towards the establishment of a contingency paradigm for PalCom systems.

 

For the purposes of contingency management a distinction is drawn between errors, faults and failures. An error is an exception condition resulting from some deviation from expected behaviour leading to a fault or failure. A fault is a non-catastrophic breakdown from which recovery is expected, and a failure is a serious condition from which recovery may not be readily possible.
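
The three terms can be written down as a small type, shown here as a hedged sketch. The names `Condition` and `classify`, and the two boolean observations used to tell the cases apart, are assumptions made for illustration; the text defines the terms but not how a manager decides between them.

```python
from enum import Enum


class Condition(Enum):
    ERROR = "error"      # exception condition from a deviation in behaviour
    FAULT = "fault"      # non-catastrophic breakdown, recovery is expected
    FAILURE = "failure"  # serious condition, recovery may not be readily possible


def classify(breakdown: bool, recovery_expected: bool) -> Condition:
    """Map two hypothetical observations onto the three condition types."""
    if not breakdown:
        # A deviation without a breakdown is treated as an error (assumption).
        return Condition.ERROR
    return Condition.FAULT if recovery_expected else Condition.FAILURE


print(classify(breakdown=True, recovery_expected=True).value)   # fault
```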

 

The primary operations of a Contingency Manager are triggered by incoming events sourced from Assembly and Resource Managers. In some cases, Contingency Managers can offer reduced but specialised functionality and operate in coordination with other specialised Contingency Managers to offer a complete, although distributed, service.

 

Contingency Managers are expected to provide at least a set of reactive contingency actions (i.e., compensations) that respond to errors, faults and failures when an event indicating their occurrence is received. Events are typically sourced from Assembly and Resource Managers.

 

Based on the reception of events, the following primary operations characterise the reactive behaviour of a PalCom Contingency Manager:

  • Monitor the performance thresholds of 1st Order Resources on the devices specified by events. If a threshold is exceeded, a contingent action can attempt to trigger the re-balancing of resource load across additional devices.
  • Compensate for an error/fault/failure with an Assembly-specific resource by attempting to locate an equivalent replacement resource.
  • Compensate for an error/fault/failure with an Assembly-specific resource by attempting to reconfigure an assembly in coordination with the Assembly Manager.
  • If replacement and reconfiguration fail, compensate for the error/fault/failure with an Assembly-specific resource by attempting to gracefully degrade the operation of an assembly in coordination with the Assembly Manager.
  • Resolve dependencies in accordance with the mode and subject of a compensation. For example, a replaced service must be configured to match any residual dependencies remaining from its predecessor.
  • Describe its own behaviour when inspected.
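
The compensation operations above suggest an escalation order, which can be sketched as follows. This is a minimal sketch under stated assumptions: `compensate` and the three strategy callbacks are hypothetical names, and trying replacement before reconfiguration is an assumed order. The text only states that degradation follows once both have failed.

```python
from typing import Callable, Optional

# A strategy takes a resource name and returns True if it resolved the problem.
Strategy = Callable[[str], bool]


def compensate(resource: str,
               replace: Strategy,
               reconfigure: Strategy,
               degrade: Strategy) -> Optional[str]:
    """Try replacement, then reconfiguration, then graceful degradation."""
    steps = (("replace", replace),
             ("reconfigure", reconfigure),
             ("degrade", degrade))
    for name, strategy in steps:
        if strategy(resource):
            return name   # name of the first strategy that succeeded
    return None           # every compensation failed


# Example: no equivalent resource is found, but reconfiguration succeeds.
outcome = compensate(
    "display-service",
    replace=lambda r: False,
    reconfigure=lambda r: True,
    degrade=lambda r: True,
)
print(outcome)   # reconfigure
```

In a real manager the reconfigure and degrade callbacks would be carried out in coordination with the Assembly Manager; here they are plain functions so that the ordering is easy to see.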

 

More advanced Contingency Managers may also provide proactive contingency actions, which attempt to plan strategies for dealing with errors, faults and failures before they occur. Strategies can be proposed to actors, who then have the opportunity to override them if preferred. Some of the optional operations include:

  • Establishing precautionary error avoidance strategies that attempt to mitigate potential error conditions (e.g., the failure of a service, a device or a network) by anticipating them and putting contingent actions in place. These actions may simply raise exception warnings, or may attempt to divert from an anticipated error condition into a safe state of operation.
  • Immunising against known problem conditions by recording events and learning from past experience. Using this knowledge, structural (e.g., assembly formation) or behavioural (e.g., task formulation) adaptation can be applied to eliminate or at least reduce the chances of an error condition occurring. By these means a system optimises its own operation.
  • Preparation for the graceful degradation of the integrity of active PalCom Assemblies by containment of detected problem or failure conditions. This is achieved by attempting to isolate potential sources of errors, faults or failures and determining means of allowing systems to continue running in states of limited functionality while safely degrading if necessary.
  • Restitution of system integrity through self-healing adaptation and external intervention (if necessary) to recover lost functionality and state.
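
The recording of events for immunisation can be illustrated with a simple recurrence counter. This is a deliberately naive sketch and not a learning mechanism defined by PalCom: the `EventHistory` class, its `recurrence_limit` parameter and the rule that a condition becomes "known" after a fixed number of occurrences are all assumptions.

```python
from collections import Counter


class EventHistory:
    """Hypothetical record of past problem events, keyed by resource and condition."""

    def __init__(self, recurrence_limit: int = 3):
        self.recurrence_limit = recurrence_limit
        self.counts = Counter()

    def record(self, resource: str, condition: str) -> bool:
        """Record one event; return True once the condition counts as known."""
        key = (resource, condition)
        self.counts[key] += 1
        return self.counts[key] >= self.recurrence_limit


history = EventHistory(recurrence_limit=3)
results = [history.record("sensor-link", "timeout") for _ in range(3)]
print(results)   # [False, False, True]
```

Once `record` returns True, a manager could propose a structural or behavioural adaptation to the actor rather than only reacting to the next occurrence.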

 

The Contingency Manager is an optional middleware manager, although its use is highly recommended. Also, not all Contingency Managers need be implemented in the same way, or provide the same functionality.

Last Modified: 12 March 2007, PalCom