Reliability Modelling of Complex Systems

Timm Grams
University of Applied Sciences, Fulda, Germany

This page reports on the ideas underlying the German activities aimed at a common framework for reliability and safety that covers both hardware and software. The efforts to unify the terminology are based on a generic reliability model (Grams, 1997). This page gives an outline of the generic model and some of its applications. Some of the results are: 1. The malfunction rate of a constant complex system is not constant. 2. The assumption of a constant malfunction rate is not justified for complex systems (in contradiction to the so-called X-ware reliability). 3. Redundant systems can well be treated within the generic model.

The material of this page was presented at the European Safety and Reliability Conference in Garching, 1999 (ESREL '99). Publications of the German committees working on this subject are VDI/VDE 3542/4 (1995), VDI-GIS (1993), and DGQ-Band 17-01 (1998).

Introduction

Complex Systems. A complex system is understood to be a digital computer system together with the programs installed in it. A complex system therefore includes hardware and software.

Fields of reliability modelling. The term reliability has come to have two meanings in technical usage. Firstly, reliability has been taken to mean quite precisely survival probability, and in this sense its scope is restricted to hardware. Secondly, reliability is used in a broader sense, and in this sense it applies to complex systems and includes probabilistic safety modelling. For this broader sense the term dependability is sometimes used, and it is described by a set of characteristic variables.

Standards and guidelines for reliability and safety have hitherto referred primarily to hardware. There is as yet no widely accepted system of definitions for complex systems. The debate on reliability and safety also suffers because some of the basic terms in guidelines and standards are either contradictory or only vaguely defined. The following table shows the scope of probabilistic reliability modelling.

Fields of reliability modelling
	Reliability	Safety
Hardware	Hardware reliability is the classical field of reliability methods. The theory and practice were developed mainly in the early 1950s. The theory comprises well-tried probabilistic models. The practice is founded on a huge database of reliability data for hardware components (Shooman, 1990).	Hardware safety is traditionally a deterministic and qualitative approach. Around 1980 the first guidelines for probabilistic models on the basis of classical hardware reliability modelling were issued (in Germany: VDI/VDE 3542/2).
Software	Software reliability modelling evolved in the early 1970s by transferring the hardware theory to software: "Because of the relatively advanced state of hardware reliability, it is natural to try to apply this theory to software reliability" (Myers, 1976, p. 329). This concept worked well for non-redundant systems.	It is currently a point of discussion whether the probabilistic reliability approach can be applied to software safety. Difficulties encountered are: 1. The reliability modelling of redundant software turned out to be error-prone. 2. The question of "what is meant by acceptable risk?" still remains unanswered (Leveson, 1995, p. 181).

Some faults in current reliability modelling

A "simple" model of X-ware reliability. In Lyu's book on software reliability engineering (1996) we can find the following statements given by Laprie and Kanoun: "... the classical reliability theory can be extended in order to be interpreted from both hardware and software viewpoints. This is referred to as X-ware ... It will be shown that, even though the action mechanisms of various classes of faults may be different from a physical viewpoint according to their causes, a single formulation can be used from the reliability modelling and statistical estimation viewpoints."

The postulated "single formulation" is a special form of the reliability function, namely Z(t) = exp(-λt). The expression results from an exponentially distributed time to the first malfunction. It is thought to be valid for an initially correct system where a random hardware failure could result in a fault that is brought to the fore through a randomly occurring fault-revealing input.

This theory is overly simplistic. The derivation of this special reliability function starts with a wrong expression of discrete-time reliability (Lyu, 1996, p. 37), and the subsequent derivations and conclusions are left to the reader. The presented result is not plausible. In the following it will be shown by means of a two-stage Markoff model that the superposition of an exponentially distributed time to a hardware fault and a subsequent exponentially distributed time to failure of execution (malfunction) results in an exponentially distributed overall time only in special cases. These special cases are identical with either the hardware reliability model or the software reliability model, thus excluding the general X-ware reliability model!

Fault tree techniques gone astray. The proposal "to apply this [hardware reliability] theory to software reliability" (Myers, 1976) has been fruitful in some cases, but it proved disastrous in others. Errors in the reliability modelling of redundant systems have been committed by various authors. Scott, Gault, and McAllister (1987) suggest applying fault/event tree techniques to N-Version Programming. Their approach - like those of various other authors - results in an overly simplistic model.

This can be demonstrated for a diverse 2-version system. The diverse system consists of two program versions developed by two separate programming teams on the basis of one common specification. The system is composed such that each of the versions compares its own results with those of the other version. It is understood that discrepancies can be detected by a correct version, such that an erroneous result from only one of the versions has no serious effects (fail-safe design).

In analogy to the safety modelling of hardware the diverse system has been modelled by an event tree, fig. 1. This model is valid as long as the letter p in this event tree is interpreted to be the incorrectness probability of either version. An upper bound for the incorrectness probability is then given by p².

But mostly - as can be seen from the literature - p is interpreted in the sense of the mean malfunction probability. The malfunction probability of the diverse system will be denoted by p_div. From the event tree the formula p_div = p² can be derived. This formula is based on the independence assumption which seems to be justified by the fact that the programming teams of the two versions are working independently.

Under this interpretation the formula is wrong. The derivation is not conclusive. The independence of the programming teams does not imply the independence of malfunctions. This is because the input process is common for both versions. This will normally lead to a correlation between the malfunction processes of the two versions.

A fresh start

Terms and definitions. The terms used on these pages are based on the terminology used in structured programming. Starting point is the definition of specification. This is the approach followed by the German guideline VDI/VDE 3542/4. The translation of this guideline is in progress. Compatibility with this standard is maintained as far as possible.

The reliability function. The reliability function Z(t) is the probability that no errors (malfunctions) will occur from time 0 to time t. Z(t) is equal to the probability that the time to the first malfunction is greater than t.

The distinction between defects and malfunctions is crucial. A malfunction in this sense occurs during operation when an existing defect comes to the fore due to a defect-revealing input. The causes of defects are 1. design, production and programming faults, and 2. hardware failures. For an initially correct system the chain of events until the first malfunction can be seen from fig. 2.

The malfunction rate λ(t) can well be used to illustrate the malfunction behaviour of an object. The malfunction rate has the following meaning: Under the condition that no malfunction has occurred up to time t, the probability of a malfunction within the time interval from t to t+Δt is approximately equal to Δt·λ(t). The smaller Δt, the better the approximation. The system malfunction rate can be calculated from the reliability function by the formula λ(t) = -Z'(t)/Z(t), where Z'(t) denotes the derivative of Z(t) with respect to t.
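The meaning of the malfunction rate stated above can be turned directly into a numerical approximation. The following is a minimal sketch, not part of the original material: the function name `malfunction_rate`, the step size `dt` and the exponential example with rate 0.5 are assumptions chosen for illustration only.

```python
import math

def malfunction_rate(Z, t, dt=1e-6):
    """Approximate the malfunction rate lambda(t) of a reliability function Z.

    Uses the definition: probability of a malfunction in (t, t+dt],
    given that none has occurred up to t, divided by dt.
    """
    survive_t = Z(t)
    if survive_t <= 0.0:
        raise ValueError("Z(t) must be positive")
    return (survive_t - Z(t + dt)) / (dt * survive_t)

# Hypothetical example: exponential reliability function with rate 0.5.
# Its malfunction rate is constant, so the result is close to 0.5 for any t.
rate = malfunction_rate(lambda t: math.exp(-0.5 * t), t=3.0)
```

For a small `dt` this agrees with the quotient -Z'(t)/Z(t); for reliability functions that are not exponential the returned value depends on t.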

The Generic Reliability Model. The state of a complex system can primarily be classified as correct or defective depending on whether the system satisfies the specification or not. It may be necessary to make a distinction between various degrees of non-fulfilment of the specification. This is done by introducing the system states S_k (k = 0, 1, ..., n). These states are abstract descriptions of the system structure and condition - including the programs - as far as relevant to the reliability calculations. A correct system is in the state S₀. Transitions between the states are due to failures or to repair.

The states are defined such that a malfunction rate can be attributed to each of them. This is done under the condition of a certain input statistic (operational profile).

Under the assumption that the system is in state S_k the time to the first malfunction can be conceived to be exponentially distributed. The malfunction rate of this state is equal to the parameter λ_k∞ of this distribution.

The reliability function takes account of the actual behaviour with respect to some given input. We are interested in the time to the first malfunction. Therefore the states S_k will be restricted to the time where no malfunction has yet occurred. If a malfunction happens the system state changes to the fictitious state S_∞. Fig. 3 shows an abstract state transition graph of the Generic Reliability Model.

Let P_k(t) denote the probability of state S_k at time t. The reliability function is given by the equation Z(t) = 1 - P_∞(t).
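The generic model can be evaluated numerically once the transition rates are known. The following is a minimal sketch under stated assumptions: the function `reliability`, its parameter names and the Runge-Kutta integration are illustrative choices of this sketch and are not taken from the guideline or from the cited paper. Since the state probabilities sum to one, Z(t) = 1 - P_∞(t) equals the sum of the P_k(t) over the states S₀, ..., S_n.

```python
import math

def reliability(rates, to_inf, p0, t_end, steps=10000):
    """Z(t_end) for a Markoff model with states S_0..S_n and absorbing S_inf.

    rates[j][k]: transition rate S_j -> S_k (failure or repair), j != k
    to_inf[k]:   malfunction rate of state S_k (transition S_k -> S_inf)
    p0[k]:       initial probability of state S_k
    """
    n = len(p0)

    def deriv(p):
        # Forward equations: inflow from other states minus total outflow.
        d = [0.0] * n
        for k in range(n):
            out = to_inf[k] + sum(rates[k][j] for j in range(n) if j != k)
            inflow = sum(p[j] * rates[j][k] for j in range(n) if j != k)
            d[k] = inflow - out * p[k]
        return d

    h = t_end / steps
    p = list(p0)
    for _ in range(steps):  # classical fourth-order Runge-Kutta steps
        k1 = deriv(p)
        k2 = deriv([p[i] + 0.5 * h * k1[i] for i in range(n)])
        k3 = deriv([p[i] + 0.5 * h * k2[i] for i in range(n)])
        k4 = deriv([p[i] + h * k3[i] for i in range(n)])
        p = [p[i] + h / 6.0 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
             for i in range(n)]
    return sum(p)  # equals 1 - P_inf(t_end)

# Hypothetical check: one state with malfunction rate 0.5 gives exp(-0.5 t).
z_numeric = reliability([[0.0]], [0.5], [1.0], 2.0)
z_closed = math.exp(-0.5 * 2.0)
```

The single-state case reproduces the exponential reliability function; models with several states, failures and repair only differ in the rate tables passed in.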

Examples of Reliability Modelling

Constant systems. A constant system is a system that does not change in structure or condition. Nevertheless it can be defective from the beginning. But its degree of defectiveness does not change in time. There are no failures and no repair.

The reliability model of such a constant system takes into consideration the quality of the production process. In my paper (Grams, 1997) I have given an example where the initial probabilities of the states S₀, S₁, S₂, ..., S_n could be assessed. In this special case it turned out that the malfunction rates of all defective states were approximately equal. Thus only one defective state had to be taken into consideration. The reliability model can be conceived to be a simple Markoff process, fig. 4. Remarkably, the malfunction rate of this system is not constant, fig. 5.
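Why the malfunction rate of a constant system is not constant can be seen from a small calculation. The sketch below assumes that the correct state S₀ has malfunction rate zero and that there is exactly one defective state S₁; the parameter names `p_defective` (initial probability of S₁) and `lam` (malfunction rate of S₁) and the numerical values are hypothetical and not taken from the cited example.

```python
import math

def constant_system(p_defective, lam, t):
    """Return (Z(t), lambda(t)) for a constant system with one defective state.

    Assumption of this sketch: the correct state never malfunctions, so
    Z(t) = (1 - p_defective) + p_defective * exp(-lam * t).
    """
    decay = math.exp(-lam * t)
    z = (1.0 - p_defective) + p_defective * decay
    rate = p_defective * lam * decay / z  # -Z'(t) / Z(t)
    return z, rate

# Hypothetical numbers: 10 % of the produced systems are defective.
z_start, rate_start = constant_system(0.1, 0.01, 0.0)
z_later, rate_later = constant_system(0.1, 0.01, 100.0)
```

The longer the system has run without malfunction, the more likely it is to be one of the correct ones, so the system malfunction rate decreases with time although no individual system changes.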

Parameter evaluation by Trivial Reliability Prediction (TRP). Let us take a different view of constant systems. We assume a fixed state S_k of some given constant object and want to know its malfunction rate λ = λ_k∞. The Trivial Reliability Prediction (TRP) is based upon the N most recently observed times between malfunctions t₁, t₂, ..., t_N. These values are conceived to be a sample of values of the random variables T₁, T₂, ..., T_N. The TRP is meant to give estimates for the random variables T_{N+1}, T_{N+2}, ..., T_{N+K}, i.e. estimates of their parameter(s) as well as statements on their accuracy. The assumptions and preselections of the model are:

The times between successive failures T₁, T₂, ... are exponentially distributed with parameter λ. The expectation value of the T_i is given by 1/λ.
The expected time between failures can be assessed by the arithmetic mean of the N most recently observed times between failures: (t₁ + t₂ + ... + t_N)/N. This results in the estimate λ* = N/(t₁ + t₂ + ... + t_N) of λ.

The accuracy of the estimate λ* can be derived by means of the Poisson distribution. The function L_N(λ·t) is equal to the probability that the accumulated time of the N observed times between failures is less than t, i.e. the cumulative distribution function (c.d.f.) of the time to the N-th failure: F(t) = P(T₁ + T₂ + ... + T_N < t). This function is equal to the distribution function of the N-stage Erlang distribution with expectation value N/λ (Kleinrock, 1975, p. 119-126).
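The estimate and the Erlang distribution function can be computed in a few lines. The following is a minimal sketch; the function names `trp_estimate` and `erlang_cdf` and the sample values are assumptions made for illustration. The c.d.f. uses the well-known Poisson sum F(t) = 1 - exp(-λt)·(1 + λt + ... + (λt)^(N-1)/(N-1)!).

```python
import math

def trp_estimate(times):
    """Estimate of the malfunction rate: N divided by the accumulated time."""
    return len(times) / sum(times)

def erlang_cdf(n, lam, t):
    """P(T_1 + ... + T_n < t) for independent exponential T_i with rate lam."""
    x = lam * t
    term = 1.0   # current Poisson term x**k / k!
    total = 1.0  # running sum for k = 0 .. n-1
    for k in range(1, n):
        term *= x / k
        total += term
    return 1.0 - math.exp(-x) * total

# Hypothetical sample of observed times between malfunctions (in hours).
observed = [10.0, 20.0, 30.0]
lam_star = trp_estimate(observed)  # 3 / 60 = 0.05 per hour
# Probability that three further malfunctions occur within 60 hours,
# if the true rate were equal to the estimate.
prob = erlang_cdf(len(observed), lam_star, 60.0)
```

Evaluating `erlang_cdf` for a range of candidate rates gives a picture of how well N observations pin down λ; small N leaves a wide range of plausible values.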

Variable Systems: Reliability Growth Models (RGM). Let a program be given that contains some faults. Under the assumption of a constant operational profile a constant malfunction rate λ can be assumed. This is the situation considered in the above two paragraphs. Now we assume some effort of fault removal. This results in a change of state and in a new and unknown malfunction rate. The assessment of malfunction rates from past experience therefore seems to be impossible from the outset.

What can be done? We can try to pull ourselves out of these difficulties by our own bootstraps. Every so-called Reliability Growth Model (RGM) is based on specific assumptions concerning the change of malfunction rates through fault removal.

Such an assumption is the core of the respective model. It is meant to represent the empirical content of the RGM.

One of the typical assumptions is the one of the Jelinski-Moranda model (Lyu, 1996): Through fault removal the malfunction rate will be reduced by a certain value which is constant for all faults. The central question is, whether such assumptions can be corroborated. In Lyu's book you will find many such RGMs and methods of parameter estimation.
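The Jelinski-Moranda assumption can be illustrated as follows. This is a sketch only: the names `n_faults` (initial number of faults) and `phi` (the constant contribution of each fault to the malfunction rate) and the numbers are hypothetical, and no parameter estimation is attempted.

```python
def jm_rates(n_faults, phi):
    """Malfunction rates after 0, 1, ..., n_faults fault removals.

    Each removed fault reduces the malfunction rate by the same value phi.
    """
    return [phi * (n_faults - removed) for removed in range(n_faults + 1)]

# Hypothetical program with 3 faults, each contributing 0.5 malfunctions
# per unit of time.
rates = jm_rates(3, 0.5)
```

In practice neither the number of faults nor their contribution is known, and both would have to be estimated from the observed times between malfunctions; this is where the question of corroboration arises.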

On my page "Reliability Growth Models Criticized" you can find a criticism of these models. My conclusion is that even in the case of reliability growth the simple TRP should be used. It is not (much) worse than the sophisticated RGMs, but it can be well understood and reliably handled by any engineer acquainted with the field of reliability.

Variable system: X-Ware Reliability Models are not simple. Now we are assuming a simple variable system. This system may fail. And the failure can only result in one defective state with a given malfunction rate. There is no repair.

The reliability model is built of three states only: S₀, S₁, and S_∞. The reliability model looks like that of the constant system, fig. 4; there is only one additional transition from S₀ to S₁ with transition rate λ₀₁ (failure rate). The analysis of this simple death process yields with

λ_min = min{λ₀₁, λ_1∞} and λ_max = max{λ₀₁, λ_1∞}

the reliability function

Z(t) = (λ_max·exp(-λ_min·t) - λ_min·exp(-λ_max·t))/(λ_max - λ_min)

The result is given in fig. 6. It can be seen that a constant malfunction rate can only be assumed in the limit, i.e. for λ_max/λ_min → ∞. This is in contradiction to the above-mentioned simple X-ware reliability theory.
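The behaviour of the malfunction rate in the three-state model can be checked numerically. The sketch below evaluates the reliability function of the two-stage process (an exponential time to failure with rate λ₀₁ followed by an exponential time to malfunction with rate λ_1∞) and the rate -Z'(t)/Z(t) derived from it. The function names and the numerical rates are assumptions for illustration; the case of equal rates is treated as the limiting two-stage Erlang distribution.

```python
import math

def xware_reliability(lam01, lam1inf, t):
    """Z(t) of the three-state model S_0 -> S_1 -> S_inf, starting in S_0."""
    lo, hi = min(lam01, lam1inf), max(lam01, lam1inf)
    if math.isclose(lo, hi):  # limiting case: two-stage Erlang distribution
        return (1.0 + lo * t) * math.exp(-lo * t)
    return (hi * math.exp(-lo * t) - lo * math.exp(-hi * t)) / (hi - lo)

def xware_rate(lam01, lam1inf, t):
    """Malfunction rate -Z'(t)/Z(t) of the same model."""
    lo, hi = min(lam01, lam1inf), max(lam01, lam1inf)
    if math.isclose(lo, hi):
        return lo * lo * t / (1.0 + lo * t)
    density = lo * hi * (math.exp(-lo * t) - math.exp(-hi * t)) / (hi - lo)
    return density / xware_reliability(lam01, lam1inf, t)

# Hypothetical rates: failure rate 0.1, malfunction rate of the defective state 1.0.
rate_at_start = xware_rate(0.1, 1.0, 0.0)    # the system starts out correct
rate_late = xware_rate(0.1, 1.0, 100.0)      # approaches the smaller rate
```

The rate starts at zero and rises towards λ_min; it is approximately constant only when one of the two rates dominates the other by a large factor.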

Redundant systems. Let a constant two-version diverse system be given. The model of fig. 1 can be interpreted in terms of incorrectness probabilities, but not in terms of mean malfunction probabilities - as we have seen above. But even in the latter case the event tree can be turned into a correct model. What we have to do is to replace p by the data-dependent malfunction probability p(x).

The data-dependent malfunction probability of the diverse system will be denoted by p_div(x), and its mean value by p_div. From the event tree the formula p_div(x) = (p(x))² can be derived.

The expectation value taken over all possible input values x yields the mean malfunction probability of each of the versions: p = E[p(x)]. (The various states S₁, S₂, ... S_n of one version of the system are now selected such that p(x) can easily be calculated by taking the expectation values over all states.)

The variance σ² of the data-dependent malfunction probability of each of the versions - taken over all possible values of the input - can be calculated as follows: σ² = E[(p(x)-p)²] = E[(p(x))²] - p² = E[p_div(x)] - p² = p_div - p².

From this results the Eckhardt-Lee formula p_div = p² + σ². In most cases the value of σ² is much greater than p² (Grams, 1997). This is in contrast to the often used formula p_div = p².
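The effect of the variance term can be made tangible with a small calculation. The following sketch assumes a discrete operational profile given as pairs of (probability of an input class, malfunction probability p(x) for that class); the function name `eckhardt_lee` and the profile values are hypothetical.

```python
def eckhardt_lee(profile):
    """Return (p, variance, p_div) for a list of (weight, p_x) pairs.

    weight: probability of the input class under the operational profile
    p_x:    data-dependent malfunction probability of one version
    """
    p = sum(w * px for w, px in profile)            # mean malfunction probability
    p_div = sum(w * px * px for w, px in profile)   # E[(p(x))**2]
    variance = p_div - p * p
    return p, variance, p_div

# Hypothetical profile: 99 % of the inputs are handled correctly by every
# version, 1 % are "difficult" inputs where a version fails with probability 0.1.
p, variance, p_div = eckhardt_lee([(0.99, 0.0), (0.01, 0.1)])
naive = p * p
```

With these numbers p is 0.001, so the independence formula gives about 10⁻⁶, whereas p_div is about 10⁻⁴: the malfunctions of the two versions cluster on the same difficult inputs, and the variance term dominates.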