Towards an OS for ControlGrids

Marko Fabiunke, Sergio Montenegro, Friedrich Schön ({marko.fabinke,sergio.montenegro,friedrich.schoen}@first.fraunhofer.de)
Fraunhofer FIRST, 12489 Berlin, Germany

Background & Expertise

FhG FIRST has extensive experience in developing reliable real-time control systems that survive for very long periods without maintenance and guarantee continuity of service in real time despite failures. Such control systems can be applied to devices in inaccessible regions (e.g. aerospace) or in application domains where failure of service is a major risk or cost factor (e.g. air traffic control, mining industry). We employ redundancy with smart redundancy management to support service and task replication, distribution, activation/deactivation, reallocation and degradation in real time without service interruption.

Expectation & Interest

For our applications and customers we are interested in extending our technology to a network-centric operating system of dependable controllers (Controller Grid). Internally, a Controller Grid is a network of controllers that appears as a single controller to the outside world. An integrated middleware shares the resources of the participating nodes to support fault tolerance of the Controller Grid and its services as a whole, as described in the Approach section below. The individual services of a Controller Grid shall be executed by dependable real-time operating systems, optimized for almost irreducible code complexity and ultra-fast recovery.

The controller grid, in contrast to ambient intelligence networks, is a closed system: it should be commissioned by an integrator rather than formed dynamically as in ad hoc networks.

Approach

A controller array arranged in a grid configuration can execute very sophisticated control tasks with a high degree of dynamism and fault tolerance. A controller grid can degrade its services smoothly and gracefully, so that the system survives for a long time despite multiple failures of its components.

Such a system can provide full functionality as long as enough computing resources are available. When resources become scarce, for example due to failures or power constraints, the system can adapt its configuration to provide a degraded but still useful functionality. This allows the system to operate for long periods without maintenance and to adapt itself to changing environments.

An important property of control systems is their real-time capability, which should not be interrupted even in case of failures. The system has to guarantee continuity of services in real time. The transition from normal to degraded functionality and back should be smooth and invisible from outside the system.

To create a controller grid effectively, we need a simple and compact middleware that supports fault-tolerant communication protocols and structures.

The services in a controller grid should be provided by tasks distributed across the nodes. In the simplest configuration there is one task per service, but for fault tolerance each service can be backed by several identical tasks running on different nodes.

To provide continuity of services despite failures, the controller grid shall have the following features:

1. Hot redundancy at service and at node level

Each service should be implemented by at least two tasks running on two or more different nodes. If computing power is too limited, the normal service should run on one node and a degraded version on another. If computing power is sufficient, the same “normal” task can run on several nodes.
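The placement rule above can be illustrated with a minimal sketch. The names `Node`, `place_service`, the abstract "capacity" unit and the greedy ordering are assumptions made for illustration only; they are not part of any existing Controller Grid implementation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capacity: int  # free computing units on this node (hypothetical unit)


def place_service(nodes, normal_cost, degraded_cost, replicas=2):
    """Return (node_name, variant) pairs for one service.

    Each node hosts at most one task of the service, so the replicas
    always run on different nodes. A node gets the normal task if it has
    enough free capacity, otherwise the degraded task if that still fits.
    """
    placement = []
    # Visit the nodes with the most free capacity first (illustrative choice).
    for node in sorted(nodes, key=lambda n: n.capacity, reverse=True):
        if len(placement) == replicas:
            break
        if node.capacity >= normal_cost:
            placement.append((node.name, "normal"))
        elif node.capacity >= degraded_cost:
            placement.append((node.name, "degraded"))
    return placement


# Hypothetical grid: only node "a" can host the normal task.
grid = [Node("a", 10), Node("b", 4)]
print(place_service(grid, normal_cost=8, degraded_cost=3))
```

If fewer than `replicas` entries are returned, the hot-redundancy requirement cannot be met with the available resources, and a configuration manager would have to react, for instance by degrading other services.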

2. Communication protocols to support fault tolerance

The controller grid shall implement a producer/consumer protocol, where a service produces results and makes them available under a certain name. Producer/consumer protocols are the best (known) way to support fault tolerance, task migration, task reallocation, redundancy of services, and dynamic replication of message transmitters/receivers, and to cope with the abrupt disappearance of communication partners (crashes).

Using named services, a task can subscribe to several services (much like subscribing to newspapers) in order to get a copy of their results. A task should not send to another specific task or mailbox; it should just provide a service and make it public. The middleware propagates and distributes these results to all interested (subscribed) tasks.

Neither sender nor receiver has to care where its communication partner is running, or how many partners there are. A sender need not even care whether any receiver uses its results at all. The configuration manager service may shut down sender tasks and services whose results will not be needed in the near future. This communication architecture simplifies redistribution, replication, elimination, etc. of services.
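The named producer/consumer scheme described above can be sketched as follows. This is a single-process illustration under stated assumptions: the class `Middleware` and its methods `subscribe` and `publish` are hypothetical names, and a real middleware would additionally handle network transport and timing.

```python
from collections import defaultdict


class Middleware:
    """Routes results by service name; senders never address receivers."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # service name -> callbacks

    def subscribe(self, service_name, callback):
        """Register interest in every result published under service_name."""
        self._subscribers[service_name].append(callback)

    def publish(self, service_name, result):
        """Deliver result to all current subscribers of service_name.

        Returns the number of receivers; zero is a valid outcome, because
        a sender need not care whether anyone uses its results.
        """
        receivers = list(self._subscribers.get(service_name, []))
        for callback in receivers:
            callback(result)
        return len(receivers)


mw = Middleware()
log = []
mw.subscribe("attitude", log.append)  # consumer task
mw.subscribe("attitude", print)       # a second, independent consumer

# Two redundant producer tasks publish under the same (hypothetical) name.
mw.publish("attitude", {"source": "node-1", "roll": 0.1})
mw.publish("attitude", {"source": "node-2", "roll": 0.1})
```

Because consumers are bound to the name "attitude" rather than to a task or node, either producer can crash, migrate or be replicated without any change on the consumer side. The receiver count returned by `publish` is the kind of information a configuration manager could use to shut down services nobody subscribes to.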

3. Monitoring without intruding

Monitors help detect errors and faults early, and are an instrument for system analysis and visualisation. It is important to be able to insert monitors into the service network without modifying any application. A monitor is just another task: it listens for messages of a given type, and no other change has to be made to any application in the system. If the message is missing for a given period, the monitor reports this anomaly to the configuration manager, which, in turn, may take reconfiguration measures.
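A non-intrusive monitor of this kind can be sketched as a plain listener with a timeout. The class name `Monitor`, the `report` callback standing in for the configuration manager, and the explicit `now` timestamps (used instead of a real clock to keep the example deterministic) are illustrative assumptions.

```python
class Monitor:
    """Watches one message type and reports when it goes missing."""

    def __init__(self, message_type, timeout, report, start=0.0):
        self.message_type = message_type
        self.timeout = timeout    # longest tolerated gap, in seconds
        self.report = report      # stands in for the configuration manager
        self.last_seen = start    # time the watched message was last seen

    def on_message(self, message_type, now):
        """Called for every observed message; producers are unaware of it."""
        if message_type == self.message_type:
            self.last_seen = now

    def check(self, now):
        """Report an anomaly if the gap exceeds the timeout.

        Returns True if an anomaly was reported, False otherwise.
        """
        gap = now - self.last_seen
        if gap > self.timeout:
            self.report(self.message_type, gap)
            return True
        return False


anomalies = []
monitor = Monitor("heartbeat", timeout=2.0,
                  report=lambda kind, gap: anomalies.append((kind, gap)))
monitor.on_message("heartbeat", now=1.0)
monitor.check(now=2.5)  # gap of 1.5 s: within the timeout, nothing reported
monitor.check(now=3.5)  # gap of 2.5 s: anomaly reported
print(anomalies)
```

The monitored application is never touched: the monitor only consumes messages that are published anyway, so it can be added or removed at any time.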

4. Ultra fast recovery and reintegration of services

To minimise the probability of two failures being active at the same time, and the probability that a backup function crashes before the normal function has recovered, it is important to minimise the time to recover. This time is the boot time of the operating system plus the time for the selection and initialisation of tasks and for their reintegration into the service network.
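The relation between recovery time and the risk of a second failure can be made concrete with a small calculation. The decomposition of the recovery time follows the text; the constant failure rate (exponential failure model) and all numeric values are assumptions introduced purely for illustration.

```python
import math


def recovery_time(boot, selection, initialisation, reintegration):
    """Total time to recover: sum of the phases named in the text."""
    return boot + selection + initialisation + reintegration


def second_failure_probability(failure_rate, recovery_seconds):
    """Probability that the backup fails while the primary is recovering.

    Assumes a constant failure rate (exponential model): 1 - exp(-rate * t).
    This model is an illustrative assumption, not a claim from the text.
    """
    return -math.expm1(-failure_rate * recovery_seconds)


# Hypothetical phase durations in milliseconds.
fast_ms = recovery_time(boot=50, selection=10, initialisation=20, reintegration=20)
slow_ms = recovery_time(boot=30000, selection=10, initialisation=20, reintegration=20)

rate = 1e-6  # hypothetical failures per second
p_fast = second_failure_probability(rate, fast_ms / 1000)
p_slow = second_failure_probability(rate, slow_ms / 1000)
print(fast_ms, slow_ms, p_slow / p_fast)
```

For small values of rate × time the probability is approximately proportional to the recovery time, which is why the boot time of the operating system, usually the largest term, is the main target for optimisation.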
