You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
884 lines
44 KiB
884 lines
44 KiB
|
|
|
|
Failure propagation models are defined by safety engineers and are |
|
usually obtained through manual assessment of the safety of the |
|
system. This is a complicated task since failures can depend on more |
|
than one element of the system; be the result of the interaction |
|
between many faults; be the consequence of the missed detection of |
|
another fault (e.g. a fault inside an element tasked with detecting |
|
faults); etc. To cope with the complexity of the systems and the |
|
scenarios that need to be analysed, several model-based approaches |
|
have been proposed such as AltaRica~\cite{Altarica2.3,arnold1999altarica}, Figaro~\cite{bouissou05}, etc. each with their |
|
associated tooling. |
|
|
|
|
|
Our work is concerned with the modelling and analysis of failures |
|
propagation in the presence of time constraints. We concentrate on a |
|
particular safety property, called \emph{loss detection convergence}, |
|
meaning that the system applies an appropriate and timely response |
|
to the occurrence of a fault before the failure is |
|
propagated and produces unwanted system behaviours. |
|
% |
|
Similar problems |
|
were addressed in~\cite{thomas2013}, where the authors describe a |
|
process to model Failure Detection Isolation and Reconfiguration architecture (for use on-board satellites) that |
|
requires to take into account failure propagation, |
|
detection, and recovery times. However, these needs are not matched by |
|
an effective way to express or check the safety constraints of the system. |
|
|
|
|
|
Our contribution is as follows. We define a lightweight extension of |
|
AltaRica, meaning that timing constraints are declared separately from |
|
the behaviour of a system (Sect.~\ref{sec:sect2}). Therefore it is easy to reuse a prior |
|
safety model and to define its temporal behaviour afterwards. We |
|
illustrate our method with an example inspired by safety architectures |
|
found in avionic systems. This example illustrate the impact of time |
|
when reasoning about failure propagation, and we use it to |
|
show that taking into accounts timing constraints---in particular |
|
propagation delays---can help finding new failure modes that cannot be |
|
detected in the untimed models currently in use. In the process, we |
|
define two safety properties: \emph{loss detection}; and its temporal |
|
version, \emph{loss detection convergence}, meaning that a system |
|
applies an appropriate and timely response to the occurrence of a |
|
fault before the failure is propagated and produces unwanted system |
|
behaviours. We show that these two properties, which are of interest |
|
in a much broader context, can be reduced to effective model-checking |
|
problems. |
|
|
|
In the next chapter we extend this approach, by applying it to an industrial |
|
satellite study case, and analysing the scalability of the methodology to more |
|
complex system models. |
|
|
|
|
|
\section{Model-Based Safety Analysis with AltaRica}\label{sec:sect2} |
|
|
|
|
|
\subsection{AltaRica language and versions} |
|
|
|
|
|
AltaRica is a high level modelling language dedicated to Safety |
|
Analysis. It has been defined to ease the modelling and analysis of |
|
failure propagation in systems. The goal is to identify the |
|
possible failure modes of the system and, for each mode, the chain of |
|
events that lead to an unwanted situation. |
|
|
|
|
|
In AltaRica, a system is expressed in terms of variables constrained |
|
by formulas and transitions. Several versions of the language have |
|
been defined (see for |
|
instance~\cite{arn00,prosvirnova2013altarica}). |
|
|
|
|
|
|
|
Three main versions of AltaRica have been designed so far. These three versions differ essentially with |
|
two respects: the way variables are updated after each transition firing and the way hierarchies of |
|
components are managed. In the first version, designed at the Computer Science Laboratory of |
|
University of Bordeaux (LaBRI), variables were updated by solving constraint systems. This model, |
|
although very powerful, was too resource consuming for industrial scale applications. A radical turn |
|
was therefore taken with the design of a data-flow version of the language. In AltaRica 2.0 Dataflow, |
|
variables are updated by propagating values in a fixed order. This order is determined at compile time. |
|
At the time this document is written, most of AltaRica tools rely on this Dataflow version. Ongoing investigation aim |
|
at improving AltaRica Dataflow in several ways, like propagating variable values by determining a fixpoint at execution time, |
|
hence justifying the third version of |
|
the language currently being developped: AltaRica 3.0 or OpenAltaRica\cite{prosvirnova2013altarica}. |
|
|
|
\subsection{AltaRica modelling} |
|
In this work, we use the AltaRica 2.0 Dataflow language which is a fragment of other |
|
versions and which is sufficient to analyse the behaviour of computer |
|
based systems. (The approach discussed here could be applied to other |
|
AltaRica dialects.) The models used in this work have been edited and |
|
analysed using Cecilia OCAS, a graphical interactive simulator |
|
developed by Dassault Aviation~\cite{bieber2004safety}. |
|
|
|
An AltaRica model is made of interconnected \emph{nodes}. A node can |
|
be essentially viewed as a mode automaton~\cite{rauzy02} extended with |
|
guards and actions on data variables. A node comprises three parts: a |
|
declaration of variables and events, the definition of transitions, |
|
and the definition of assertions. We illustrate these concepts with |
|
the example of a simple node called \FUNC. We give the code (textual |
|
definition) of \FUNC in Listing~\ref{ls:altarica} and a schematic |
|
representation in Fig.~\ref{fig:funct2}. The node \FUNC has one input, |
|
\code{I}, and one output, \code{O}. |
|
|
|
\lstinputlisting[language=Altarica,float=bht,captionpos=b,caption=\protect{Example |
|
of AltaRica code for the node \FUNC.},frame=single,label=ls:altarica]{algos/function2.alt} |
|
|
|
\begin{figure}[thb] |
|
\centering |
|
\begin{minipage}{0.25\textwidth} |
|
\centering |
|
\includegraphics[width=\textwidth]{altarica-function.pdf} |
|
\end{minipage} |
|
\begin{minipage}{0.7\textwidth} |
|
\centering |
|
\vspace{1em} |
|
\includegraphics[width=0.8\textwidth]{altarica-diagrams.pdf} |
|
\end{minipage} |
|
\caption{Graphical representation of node \FUNC (left) and its |
|
associated failure mode automaton (right).\label{fig:funct2}}\vspace{1em} |
|
\end{figure} |
|
|
|
\enlargethispage{\baselineskip} |
|
% |
|
Nodes can have an internal state stored in a set of \emph{state |
|
variables}, declared in a heading called \texttt{state}. In its |
|
nominal state (when \code{S = NOMINAL}), the \FUNC node acts as a |
|
perfect relay: it copies on its output the value provided by its input |
|
(we have \code{O = I}); this is expressed in the \code{assert} |
|
block. On the opposite, its output is set to \LOSS or \ERR when |
|
\code{S} equals \code{LOST} or \code{ERROR}, respectively. |
|
% |
|
The assert directive is used to express constraints on the values of |
|
the input and output variables of a node, also called \emph{flow |
|
variables}, establishing a link between the node and its |
|
environment, i.e. the other interconnected nodes. It distinguishes |
|
between input (\texttt{in}) and output (\texttt{out}) variables. An |
|
assertion defines a rule to update the value of output flows according |
|
to the state of the component and the value of input flows. |
|
|
|
The state variables of a node can only change when an |
|
event is triggered. The code of \FUNC declares two events: \code{fail\_err}, that |
|
changes the state from \code{NOMINAL} to \code{ERROR}, and |
|
\code{fail\_loss}, that can transition from any state to |
|
\code{LOST}. This behaviour is the one displayed on the mode |
|
automaton of Fig.~\ref{fig:funct2}. Transitions are listed in the |
|
\texttt{trans} block of the node. Each transition has an (event) |
|
name and a definition of the form |
|
\lstinline[language=Altarica]{g |-evt-> e}, where the guard \code{g} |
|
is a Boolean condition that can refer to state and flow |
|
variables. The event \code{evt} can be triggered when the guard is |
|
satisfied. In this case we apply the effect, \code{e}, that is an |
|
expression that modifies the values of state variables. |
|
|
|
Events are useful to model the occurrence of failures or the reaction |
|
to conditions on the flow variables. |
|
% It is possible to indicate |
|
% whether an event is triggered internally by a node---as in the case of |
|
% a reconfiguration---or externally by its environment---as in the case |
|
% of the occurrence of a failure. In the latter case, |
|
We can assign a |
|
law of probability on the occurrence of the failure using the heading |
|
\code{extern}. For instance we could assert that event |
|
\code{fail\_loss} follows an exponential distribution with the |
|
declaration: |
|
%% |
|
{\lstinline[language=Altarica]{extern law (<event fail_loss>)="exp 1e-4";}} |
|
%% |
|
In the next section, we propose a way to enrich this syntax to express |
|
timing constraints on events instead of probability distributions. At |
|
the moment, it is not possible to use stochastic events in addition to |
|
time events. |
|
|
|
\enlargethispage{\baselineskip} |
|
|
|
In the general case, an AltaRica model is composed of several |
|
interconnected node instances, following a component-based |
|
approach. Global assertions relate the input flows of a component to |
|
the output flows of other components. For the sake of brevity, we do |
|
not describe component synchronisation here and we refer the reader to |
|
\cite{Altarica2.3} for further details. More importantly, a |
|
hierarchical AltaRica model can always be ``flattened'', i.e. |
|
represented by a single node containing all the variables, events, |
|
assertions, and transitions from the composite system. We use this |
|
property in our interpretation of AltaRica in Fiacre. |
|
|
|
%% |
|
\subsection{Time AltaRica: Adding Timing Constraints to Events}%%\label{sec:adding-timing-constr} |
|
%% |
|
There already is a limited mechanism for declaring timing constraints |
|
in AltaRica. It relies on the use of external law associated with a |
|
\emph{Dirac distribution}. An event with Dirac$(0)$ law denotes an |
|
instantaneous transition, that should be triggered with the highest |
|
priority. Likewise, an event with Dirac($d$) (where $d$ is a |
|
positive constant) models a transition that should be triggered with a |
|
delay of $d$ units of time. |
|
% |
|
In practice, Dirac laws are rather a way to encode priorities between |
|
events than an actual mean to express duration. Moreover, while Dirac |
|
laws are used during simulation, they are not taken into account by |
|
the other analysis tools. Finally, the use of Dirac laws is not |
|
expressive enough to capture non-deterministic transitions that can |
|
occur within time intervals of the form $[a, b]$, where $a \neq |
|
b$. These constraints are useful to reason about failure propagation |
|
delays with different best and worst case traversal time. For this |
|
reason, we propose to extend event properties with \emph{temporal |
|
laws} of the form: |
|
%% |
|
{\lstinline[language=Altarica]{extern law (evt) = "[a,b]";}} |
|
%% |
|
It is also possible to use open and/or unbounded time intervals, such |
|
as \code{]a,}$\infty$\code{[}.%%\scriptstyle |
|
|
|
With such a declaration, the transition |
|
\lstinline[language=Altarica]{g |-evt-> e} can be triggered only if |
|
the guard \code{g} is satisfied for a duration (or \emph{waiting |
|
time}) $\delta$, with $\delta \in [a, b]$. A main difference with |
|
the original semantics of AltaRica is that the timing constraint of an event is |
|
not reinitialised unless its guard is set to false. |
|
%% |
|
% Hence the waiting time of an event decreases when time elapse whereas, |
|
% while with a Dirac($d$) law the waiting time is reset to $d$ each time |
|
% the event fires. Finally, |
|
Moreover, our semantics naturally entails a notion of \emph{urgency}, |
|
meaning that it is not possible to miss a deadline: when $\delta$ |
|
equals $b$, then either \code{evt} is triggered or another transition |
|
should (instantaneously) change the value of the guard \code{g} to |
|
false. |
|
|
|
We can illustrate the use of temporal laws with the following example of a new |
|
node, \PRE; see Listing.~\ref{ls:pre}. This node encodes a buffer that |
|
delays the propagation of its input. When the input changes, event |
|
\code{pre\_read} has to be triggered instantaneously. Then, only after |
|
a duration $\delta \in [a, b]$, the value stored by \PRE (in the state |
|
variable \code{Stored}) is propagated to its output. |
|
% |
|
\lstinputlisting[language=Altarica,float=t,captionpos=b,caption=\protect{Example |
|
of Time AltaRica code: the basic delay.},frame=single,label=ls:pre]{algos/pre.alt} |
|
|
|
|
|
\section{A Definition of Fiacre Using Examples}\label{sec:sect3} |
|
|
|
Fiacre~\cite{berthomieu2008fiacre} is a high-level, formal |
|
specification language designed to represent both the behavioural and |
|
timing aspects of reactive systems. Fiacre programs are stratified in |
|
two main notions: \emph{processes}, which are well-suited for modelling |
|
structured activities (like for example simple state machines), and |
|
\emph{components}, which describes a system as a composition of |
|
processes. In the following, we base our presentation of Fiacre on |
|
code examples used in our interpretation of Time AltaRica. |
|
% \subsection{A Definition of Fiacre by Examples} |
|
% \label{sec:defin-fiacre-exampl} |
|
We give a simple example of Fiacre specification in |
|
Listing~\ref{lst:fiacre}. This code defines a process, |
|
\code{Function}, that simulates the behaviour of the AltaRica node |
|
given in Listing~\ref{ls:altarica}. |
|
|
|
\lstinputlisting[language=Fiacre,float=hbt,captionpos=b,caption=\protect{Example |
|
of Fiacre code: type, functions and processes},label=lst:fiacre,frame=single]{algos/function.fcr} |
|
|
|
Fiacre is a strongly typed language, meaning that type annotations are |
|
exploited in order to avoid unchecked run-time errors. Our example |
|
defines two enumeration types, \code{FState} and \code{FailureType}, |
|
that are the equivalent of the namesake AltaRica domains. We also |
|
define a record type, \code{Flows}, that models the environment of the |
|
node \code{Function}, that is an association from flow variables to |
|
values. Fiacre provides more complex data types, such as arrays, |
|
tagged union or FIFO queues. Fiacre also supports native |
|
\emph{functions} that provide a simple way to compute on values. In |
|
our example, function \code{update} is used to compute the state of |
|
the environment after an event is triggered; that is to model the |
|
effect of assertions in AltaRica. It uses two ternary (conditional) |
|
operators to mimic the \code{case}-expression found in the |
|
\code{assert} heading of Listing~\ref{ls:altarica}. |
|
|
|
% Functions are evaluated applicatively (there are no side effects) and |
|
% can be recursively defined. |
|
|
|
A Fiacre \emph{process} is defined by a set of parameters and {control |
|
states}, each associated with a set of \emph{complex transitions} |
|
(introduced by the keyword \code{from}). Our example defines a process |
|
with two {shared variables}---symbol \code{\&} denotes variables |
|
passed by reference---that can be updated concurrently by other |
|
processes. In our case, variable \code{S} models the (unique) state |
|
variable of node \FUNC. |
|
|
|
Complex transitions are expressions that declares how variables are |
|
updated and which transitions may fire. They are built from constructs |
|
available in imperative programming languages (assignments, |
|
conditionals, sequential composition, \dots); non-deterministic |
|
constructs (such as external choice, with the \code{select} operator); |
|
communication on ports; and jump to a state (with the \code{to} or |
|
\code{loop} operators). In Listing~\ref{lst:fiacre}, the \code{select} |
|
statement defines two possible transitions, separated by the symbol |
|
\code{[]}, that loop back to \code{s0}. Each transition maps exactly |
|
to one of the AltaRica events, \code{fail\_loss} and \code{fail\_err}, |
|
that we want to translate. Transitions are triggered |
|
non-deterministically and their effects are atomic (they have an ``all |
|
or nothing'' semantics). A transition can also be guarded by a Boolean |
|
condition, using the operator \code{on} or another conditional |
|
construct. |
|
|
|
It is possible to associate a time constraint to a transition using |
|
the operator \code{wait}. Actually, the ability to express directly |
|
timing constraints in programs is a distinguishing feature of |
|
Fiacre. We illustrate this mechanism in the code below, that |
|
corresponds to the interpretation of the node \PRE of |
|
Listing~\ref{ls:pre}. Basically, a transition constrained by a (time) |
|
interval $I$ can be triggered after a time $\delta$, with |
|
$\delta \in I$, only if its guard stayed continuously valid during |
|
this time. It is this behaviour that inspired our choice of semantics |
|
for the temporal law. |
|
|
|
The last notion that we need to introduce is \emph{components}. |
|
Components allow the hierarchical composition of processes, in the |
|
same way than AltaRica nodes can be composed into classes in order to |
|
build more complex behaviors. |
|
Thus, a Fiacre component defines a parallel composition of components and/or |
|
processes using statements of the form \code{par} \vars{P}$_0$ |
|
$\parallel \dots \parallel$ \vars{P}$_n$ \code{end}. It can also be |
|
used to restrict the visibility of variables and ports and to define |
|
priorities between communication events. |
|
% Priorities are explicitly declared as constraints between |
|
% communication ports, as in \code{go1 > go2}. |
|
We give an example of Fiacre component in Listing~\ref{lst:pre2}, |
|
where we define an upgraded version of the delay operator \PRE. |
|
% |
|
\lstinputlisting[language=Fiacre,float=bt,captionpos=b,caption=\protect{Wait statement, components and synchronisation on ports.},label=lst:pre2,frame=single]{algos/pre2.fcr} |
|
|
|
% % |
|
% \lstinputlisting[language=Fiacre,float=bt,captionpos=b,caption=\protect{Example |
|
% of Fiacre code: declaring timing |
|
% constraints},label=lst:pre1,frame=single]{algos/pre.fcr} |
|
% % |
|
|
|
|
|
A problem with the implementation of \PRE is that at most one failure |
|
mode can be delayed at a time. Indeed, if the input of \PRE changes |
|
while the state is \code{Full}, then the updated value is not taken |
|
into account until after event \code{pre\_wait} triggers. |
|
% |
|
It is not possible to implement a version that can delay an unbounded |
|
number of events in a bounded time; this would require an unbounded |
|
amount of memory to store the intermediate values. More fundamentally, |
|
this would give rise to undecidable verification problems (see |
|
e.g.~\cite{bouyer2004updatable}). To fix this problem, we can define a |
|
family of operators, \PRE\code{\_k}, that can delay up-to $k$ |
|
simultaneous different inputs. Our implementation relies on a |
|
component that uses three process instances: one instance of |
|
\code{front}, that reads messages from the input (variable \code{I}), |
|
and two instances of \code{delay}, that act as buffers for the values |
|
that need to be delayed. Process \code{front} uses the local ports |
|
\code{go1} and \code{go2} to dispatch values to the buffers. |
|
Component \PRE\code{\_2} is enough to model the use case defined in |
|
Sect.~\ref{sec:example-fail-detect-1}. Indeed, any element in the FDI |
|
system may propagate at most two different status, one from \OK to |
|
\LOSS and then from \LOSS to \ERR. |
|
|
|
|
|
|
|
\section{Example of a Failure Detection and Isolation System} |
|
\label{sec:example-fail-detect-1} |
|
|
|
We study the example of a safety critical function that illustrates |
|
standard failure propagation problems. We use this example to show the |
|
adverse effects of temporal failure propagation even in the presence |
|
of % system safety improvements. In our case the addition of |
|
Failure Detection and Isolation (FDI) capabilities. |
|
% |
|
This example is inspired by the avionic functions that provide |
|
parameters for Primary Flight Display (PFD), which is located in the |
|
aircraft cockpit. The system of interest is the computer that acquires |
|
sensors measurements and computes the aircraft \emph{calibrated airspeed} (CAS) parameter. |
|
Airspeed is crucial for pilots: it is taken into |
|
account to adjust aircraft engines thrust and it plays a main role in |
|
the prevention of over speed and stall. |
|
|
|
%% |
|
\begin{figure}[thb] |
|
\centering |
|
\includegraphics[width=1\textwidth]{figures/AirSpeed_Computation.jpg} |
|
\caption{Functional and physical views of the airspeed computation function.\label{fig:cockpit}} |
|
\end{figure} |
|
%% |
|
|
|
CAS is not directly measured by a dedicated sensor, but is computed as |
|
a function of two auxiliary pressure measurements, the static pressure |
|
(Ps) and total pressure (Pt); that is |
|
$\mathrm{CAS} = f(\mathrm{Pt}, \mathrm{Ps})$. |
|
% \begin{equation*} |
|
% CAS = C_{SO}\sqrt{5 \cdot \left( \left(\frac{Pt - Ps}{P_0} + 1\right)^{2/7} - 1\right)} |
|
% \end{equation*} |
|
% with $C_{SO}$ and $P_0$ two |
|
% constants. % the speed of sound under standard day sea level conditions |
|
These two measurements come from sensors located on the aircraft nose, |
|
a pressure probe and a pitot tube. |
|
% Our proposed functional view shows: |
|
% \begin{itemize} |
|
% \item Two external functions \I{1} and \I{2} that measure static and total pressure. They represent the functions allocated to the sensors. |
|
% \item Three inner functions of the system: \F{1} and \F{2} for sensor measurements acquisition by the onboard computer and \F{3} for airspeed computation. |
|
% \item The PFD functions have not been modelled in the sole purpose of simplification. |
|
% \end{itemize} |
|
% |
|
%% |
|
|
|
Our proposed functional view is given in Fig.~\ref{fig:cockpit}. It |
|
consists in two external input functions \I{1} and \I{2} that measure static |
|
and total pressure; and three inner functions of the system, \F{1} and |
|
\F{2} for sensor measurements acquisition by the on-board computer and |
|
\F{3} for airspeed computation. For simplification purposes, the |
|
PFD functions have not been modelled. |
|
|
|
Next, we propose a first failure propagation view aiming at |
|
identifying the scenarios leading to an erroneous airspeed computation |
|
and display to the pilot (denoted \ERR). Such failure can only be |
|
detected if a failure detector is implemented, for instance by |
|
comparing the outputs of different functions. Undetected, it could |
|
mislead the pilot and, consequently, lead to an inappropriate engine |
|
thrust setting. We also want to identify the scenarios leading to |
|
the loss of the measure (denoted \LOSS). In such a case, the pilot can |
|
easily assess that the measure is missing or false and consequently |
|
rely upon another measure to control the aircraft (note that such |
|
redundancy is not modelled). For example, airspeed out of |
|
bound---indicating that an airliner has crossed the sonic barrier---is |
|
considered to be of kind \LOSS. It can be understood that scenarios leading to the |
|
loss of the airspeed are less critical than the ones leading erroneous |
|
values. |
|
|
|
% We consider two possible kinds of fail status: the ``silent'' failure of |
|
%an element (denoted \ERR); or a communication problem with one of the |
|
% functions (denoted \LOSS). |
|
|
|
A failure is typically a loss of connectivity or the |
|
detection of (clearly) erroneous transmitted values, for instance a |
|
variable out of bounds. A \LOSS fail is less critical than an \ERR |
|
since it can be detected immediately and isolated. On the other hand, |
|
an \ERR fail can only be detected if a failure detector is |
|
implemented, for instance by comparing the outputs of different |
|
functions. In the following, we consider that the status |
|
are ordered by their level of severity, that is \ERR $<$ \LOSS $<$\OK. |
|
|
|
|
|
\subsubsection{Safety model of the architecture without FDI} |
|
%%\label{sec:error-prop-simple} |
|
For the sake of understandability, we choose to study a simplified version of |
|
FDI system. While simple, this system is representative of failure |
|
propagation mechanisms found in actual onboard systems. |
|
%% |
|
|
|
We provide an AltaRica model corresponding to the functional view |
|
of the CAS function in Fig.~\ref{fig:example0}. This model, tailored |
|
to study failure propagation, is comprised of: |
|
% ( All the functions are modelled according to the node function |
|
% introduced in the Sect.~\ref{sec:sect2}. |
|
% |
|
two external functions, \I{1} and \I{2}, that have no input (so, in |
|
their nominal state, the output is set to \OK); two inner functions, |
|
\F{1} and \F{2}, which are instances of the node \FUNC described in |
|
Sect.~\ref{sec:sect2}; and a function, \F{3}, that is the composition |
|
of two basic elements: a multiplexer, \MIN, representing the |
|
dependence of the output of \F{3} from its two inputs, and a computing |
|
element \F{3Processing} that represents the computation of the |
|
airspeed. \F{3Processing} is also an instance of node \FUNC. |
|
|
|
\begin{figure}[t] |
|
\centering |
|
\begin{tikzpicture} |
|
\node[anchor=south west,inner sep=0] at (0,0) { \includegraphics[width=0.75\textwidth]{figures/Modele0}}; |
|
\draw[red!50,ultra thick, dashed] (7,2.85) rectangle (2.6,0.5); |
|
\node[red] at (6.5,2.5) {\texttt{\large F3}}; |
|
\end{tikzpicture} |
|
\caption{A simple example of failure propagation.\label{fig:example0}} |
|
\end{figure} |
|
% |
|
In case of single failure scenario, \MIN propagates the failure coming |
|
either from one input or the other. In case of multiple failures, |
|
%when both inputs have identical failure, the faillure propagates; |
|
%For instance, |
|
% both inputs to \LOSS lead to an output of \LOSS. |
|
% But when it receives an erroneous fail then the multiplexer will |
|
% propagate an \ERR. The remaining failure combination drew our |
|
% attention during \F{3} modelling. |
|
%on the other hand, |
|
when different failures propagate, one being \LOSS |
|
and the other being \ERR,---and without appropriate FDI---the system |
|
outcome is uncertain. |
|
% |
|
Solving this uncertainty would require a detailed behavioural model of |
|
the on-board computer and a model for all the possible failure modes, |
|
which is rarely feasible with a sufficient level of confidence, except |
|
for time-tested technology. Given this uncertainty, it is usual to |
|
retain the state with the most critical effect, that is to say: |
|
% in this case, \MIN can be thought of as a simple logic operator that |
|
% computes the minimum of its two inputs, and |
|
the output of \F{3} is \ERR. |
|
|
|
|
|
% Since a \LOSS can be ``easily'' detected, we want to avoid a situation |
|
% where the whole system propagates \ERR while one of its |
|
%internal element propagates a \LOSS. The rationale is that we should be |
|
% able to isolate (or quarantine) the system when we know for sure that |
|
% one of its element does not reliably respond to commands. This can be |
|
% expressed by the following safety property. |
|
%% |
|
|
|
% In the next section, we study the system with the addition of FDI |
|
% capabilities. |
|
|
|
Our goal is to prevent the computation of an erroneous airspeed while |
|
one of \F{3} input signals is lost. The rationale is that the system |
|
should be able to passivate automatically the airspeed when it detects |
|
that one of its input signals is not reliable. This behaviour can be |
|
expressed with the following property: |
|
|
|
%% |
|
\begin{prop}[Loss Detection and Instantaneous Propagation]\label{prop:1} |
|
A function is \emph{loss detection safe} if, when in nominal mode, it |
|
propagates a \LOSS whenever one of its input nodes propagates a |
|
\LOSS. |
|
\end{prop} |
|
|
|
% This is a simple system used to detect non-nominal behaviours |
|
% and to trigger observables in order to isolate the failure mode. |
|
% ({The figures are actual screen capture of models edited with Cecilia |
|
% OCAS.}) |
|
% \subsubsection{Analysis} |
|
|
|
|
|
We can show that our example of Fig.~\ref{fig:example0} does not meet |
|
this property using the \emph{Sequence Generation} tool available in |
|
Cecilia OCAS. |
|
%% |
|
% This system has several sources of unreliability. Both its sources and |
|
% the computing elements (\F{1}, \F{2} and \F{3Processing}) can |
|
% experience spontaneous failures, meaning that their outputs can |
|
% transition instantaneously to \LOSS or \ERR. Errors are irrecoverable; |
|
% once a function becomes faulty, its output will always stay equal to |
|
% \LOSS or \ERR. Likewise, if both inputs are ``nominal'' then the |
|
% output of \MIN is \OK. We assume that the multiplexer cannot undergo |
|
% spontaneous failures. |
|
%% |
|
% This can be tested using a perfect detectors, \FF{3}{Loss}, placed at |
|
% the output of the system. |
|
%% |
|
To this end, we compute the minimal cuts for the target equation |
|
$((\text{\F{1.O.Loss}} \vee \text{\F{2.O.Loss}}) \wedge \neg |
|
\text{\F{3.O.Loss}})$, meaning the scenario where \F{3} does not |
|
propagates \LOSS when one of \F{1} or \F{2} does. Hence function \F{3} |
|
is {loss detection safe} if and only if the set is empty. |
|
\footnote{To check the safety of \F{3}, we need to perform the same |
|
analysis for all the elements that can propagate a loss fail, not |
|
only \F{1}.}. |
|
|
|
In our example, once we eliminate the cases where \F{3} is not nominal |
|
(that is when \F{3Processing} is in an error state), we find eight |
|
minimal cuts, all of order $2$. In the following section, we correct |
|
the behaviour of \F{3} by considering a new architecture based on |
|
detectors and a switch to isolate the output of \F{3} when faulty: |
|
|
|
%% |
|
{%\centering |
|
\begin{small} |
|
\begin{tabular}{c@{\quad\quad}c} |
|
\begin{minipage}[t]{0.42\linewidth} |
|
\begin{verbatim} |
|
{'F1.fail_err', 'F2.fail_loss'} |
|
{'F1.fail_err', 'I2.fail_loss'} |
|
{'F1.fail_loss', 'F2.fail_err'} |
|
{'F1.fail_loss', 'I2.fail_err'} |
|
\end{verbatim} |
|
\end{minipage} |
|
& |
|
\begin{minipage}[t]{0.42\linewidth} |
|
\begin{verbatim} |
|
{'F2.fail_err', 'I1.fail_loss'} |
|
{'F2.fail_loss', 'I1.fail_err'} |
|
{'I1.fail_err', 'I2.fail_loss'} |
|
{'I1.fail_loss', 'I2.fail_err'} |
|
\end{verbatim} |
|
\end{minipage}\\ |
|
\end{tabular}\\ |
|
\end{small} |
|
} |
|
% |
|
Each of these cuts lead to trivial violations of our safety |
|
property. For instance, we obtain the cut \code{\{'F1.fail\_err', |
|
'F2.fail\_loss'\}} that corresponds to the case where \F{1} |
|
propagates \ERR and \F{2} propagates \LOSS, in which case \F{3} |
|
propagates \ERR. This is exactly the situation that we want to |
|
avoid. |
|
|
|
|
|
\begin{figure}[bt] |
|
\centering |
|
\begin{tikzpicture} |
|
\node[anchor=south west,inner sep=0] at (0,0) { \includegraphics[width=\textwidth]{figures/Modele2.png}}; |
|
\draw[red!50,ultra thick, dashed] (10.8,3.4) rectangle (2,0); |
|
\node[red] at (10.4,3.1) {\texttt{\large F3}}; |
|
\end{tikzpicture} |
|
\caption{Model of a FDI function with a switch and an alarm.\label{fig:example1}} |
|
\end{figure}%\vspace*{-1cm} |
|
%\enlargethispage{\baselineskip} |
|
|
|
\subsubsection{Safety model of the architecture with FDI} |
|
%%\label{sec:simple-fail-detect} |
|
%% |
|
%In order to satisfy the Prop.~\ref{prop:1}, |
|
The updated implementation of \F{3} (see Fig.~\ref{fig:example1}) |
|
uses two perfect detectors, \F{1Loss} and \F{2Loss}, that can detect |
|
a loss failure event on the inputs of the function. The (Boolean) outputs of |
|
these detectors are linked to an OR gate ({\ttfamily AtLeastOneLoss}) |
|
which triggers an \ALARM when at least one of the detectors outputs |
|
true. The alarm commands a \SWITCH; the output of \SWITCH is the same |
|
as \MIN, unless \ALARM is activated, in which case it propagates a |
|
\LOSS failure. The alarm can fail in two modes, either |
|
continuously signaling a \LOSS or never being activated. The |
|
schema in Fig.~\ref{fig:example1} also includes two delays operators, \D{1} and \D{2}, |
|
that model delay propagation at the input of the detectors; |
|
we will not consider them in the following lines, but come back to these timing constraints at the end of the section. |
|
|
|
% \TODO{why do we need Alarm ! why does the alarm can fail ! we could |
|
% plug directly the or gate to the switch}- Permanent failure of the alarm. |
|
|
|
The FDI function---with a switch and an alarm---is a stable scheme for |
|
failure propagation: when in nominal mode, it detects all the failures |
|
of the system and it is able to disambiguate the case where its inputs |
|
contains both \ERR and \LOSS. Once again, this can be confirmed using |
|
the Sequence Generation tool. If we repeat the same analysis than |
|
before---and if we abstract away the delay nodes---we find $56$ |
|
minimal cuts, all involving a failure of either \ALARM or |
|
\F{3Processing}, i.e. a non-nominal mode. This means that, in an untimed model, our new |
|
implementation of \F{3} satisfies the loss detection property, as |
|
desired. |
|
% \begin{figure}[hbt] |
|
% \centering |
|
% \begin{tikzpicture} |
|
% \node[anchor=south west,inner sep=0] at (0,0) { \includegraphics[width=\textwidth]{figures/Modele2.png}}; |
|
% \draw[red,ultra thick, dashed] (10.6,3.6) rectangle (2.2,0); |
|
% \node[red] at (10,3) {\texttt{\huge F3}}; |
|
% \end{tikzpicture} |
|
% % \includegraphics[width=0.95\textwidth]{figures/Modele2} |
|
% \caption{Taking into consideration propagation delays in the FDI |
|
% system.\label{fig:example2}} |
|
% \end{figure} |
|
% Unfortunately, this model makes unrealistic assumptions about the |
|
% instantaneous failure propagation of detection. Indeed, it may be the |
|
% case that the outputs of \F{1} and \F{2} are delayed as they are |
|
% propagated to the observers \F{1Loss} and \F{2Loss}. Next, we study |
|
% new failure modes that may arise from this situation and how we can |
|
% detect them. |
|
%% |
|
% \subsection{Timed Safety model of the architecture with FDI} |
|
% \label{sec:timed-safety-model} |
|
%% |
|
% In this section, we consider a new AltaRica model that takes into |
|
% account propagation delays. To this end, we may insert two instances |
|
% of the delay node (the element \PRE defined in Sect.~\ref{sec:sect2}) |
|
% before the detectors \F{1Loss} and \F{2Loss}. |
|
Even so, it is easy to find a timed scenario where the safety property |
|
is violated. |
|
|
|
Assume now that \F{1} and \F{2} propagate respectively the status \LOSS and \ERR, |
|
at the same date. In such a case and considering possible latencies, |
|
while \ERR reaches \MIN instantaneously, |
|
the output of \F{1} might reach \F{1Loss} at successive date. |
|
% of the output of \F{2} reaching \F{2Loss},. |
|
This leads to a transient state where the alarm is not activated whereas |
|
the output of \MIN is set to \ERR. This brings us back to the same |
|
dreaded scenario than in our initial model. |
|
%% |
|
In particular, this scenario corresponds to the cut |
|
\code{\{'F1.fail\_loss', 'F2.fail\_err'\}}, that is not admissible in |
|
an untimed semantics. |
|
%% |
|
|
|
This example suggests that we need a more powerful method to compute |
|
the set of cuts in the presence of temporal constraints. On the other |
|
hand, we may also advocate that our safety property is too limiting in |
|
this context, where perfect synchronicity of events is rare. Actually, |
|
it can be proven that the output of \F{3} will eventually converge to a |
|
loss detection and isolation mode (assuming that \F{3} stays nominal and |
|
that its inputs stay stable). |
|
% |
|
To reflect |
|
this si\-tua\-tion, we pro\-pose an improved safety property that takes into |
|
account temporal properties of the system: |
|
%% |
|
\begin{prop}[Loss Detection Convergent]\label{prop:2} A function is |
|
\emph{loss detection convergent} if (when in nominal mode) there |
|
exists a duration $\Delta$ such that it continuously outputs a \LOSS |
|
after the date $\delta_0 + \Delta$ if at least one of its input |
|
nodes continuously propagates a \LOSS starting from $\delta_0$ |
|
onward. The smallest possible value for $\Delta$ is called the |
|
\emph{convergence latency} of the function. |
|
\end{prop} |
|
Hence, if the latency |
|
needed to detect the loss failure can be bound, and if the bound is sufficiently |
|
small safety-wise, we can still deem our system as safe. |
|
In the example in Fig.~\ref{fig:cockpit}, this property can indicate for how long an erroneous airspeed is shown on the PFD to the pilot, |
|
before the failure is isolated. |
|
|
|
In the next section, we use our approach to generate a list of ``timed |
|
cuts'' (as model-checking counterexamples) that would have exposed the |
|
aforedescribed problems. We also use model-checking to compute |
|
the {convergence latency} for the node \F{3}. In this simple example, |
|
we can show that the latency is equal to the maximal propagation delay |
|
at the input of the detectors. The value of the latency could be much |
|
harder to compute in a more sophisticated scenario, where delays can |
|
be chained and/or depends on the internal state of a component. |
|
|
|
|
|
\section{Compilation of AltaRica and Experimental evaluation}\label{sec:sect4} |
|
|
|
We have implemented the transformation outlined in |
|
Sect.~\ref{sec:sect3}; the result is a compiler that automatically |
|
generates Fiacre code from an AltaRica model. The compilation process |
|
relies on the fact that it is possible to ``flatten'' a composition of |
|
interconnected nodes into an intermediate representation, called a |
|
\emph{Guarded Transition System} (GTS)~\cite{rau08}. A GTS is very |
|
similar to a (single) AltaRica node and can therefore be encoded in a |
|
similar way. Our tool is built using the codebase of the model-checker |
|
EPOCH~\cite{epoch}, which provides the functionalities for the |
|
syntactic analysis and the linking of AltaRica code. After |
|
compilation, the Fiacre code can be checked using |
|
Tina~\cite{BRV04}. As seen in Sect.~\ref{sec:tina}, the core of the Tina toolset is an |
|
exploration engine that can be exploited by dedicated model-checking |
|
and transition analyser tools. Tina offers several abstract state |
|
space constructions that preserve specific classes of properties like |
|
absence of deadlocks, reachability of markings, or linear and |
|
branching time temporal properties. These state space abstractions are |
|
vital when dealing with timed systems that generally have an infinite |
|
state space (due to the use of a dense time model). In our |
|
experiments, most of the requirements can be reduced to reachability |
|
properties, so we can use on-the-fly model-checking techniques. |
|
% and very ``aggressive'' abstractions. |
|
|
|
\begin{figure}[thb] |
|
\centering |
|
\includegraphics[width=\textwidth]{figures/passerelle.png} |
|
\caption{Graphical representation of the AltaRica/Fiacre toolchain.\label{fig:toolchain}} |
|
\end{figure} |
|
|
|
We interpret a GTS by a Fiacre process whose parameters consist of all |
|
its state and flow variables. Each transition |
|
\lstinline[language=Altarica]{g |-evt-> e} in the GTS is (bijectively) |
|
encoded by a transition that matches the guard \code{g} and updates |
|
the variables to reflect the effect of \code{e} plus the assertions. Each transition can |
|
be labelled with time and priorities constraints to take |
|
into account the \code{extern} declarations of the node. This |
|
translation is straightforward since all the operators available in |
|
AltaRica have a direct equivalent in Fiacre. Hence every |
|
state/transition in the GTS corresponds to a unique state/transition |
|
in Fiacre. This means that the state (reachability) graph of a GTS and |
|
its associated Fiacre model are isomorphic. This is a very strong and |
|
useful property for formal verification, since we can very easily |
|
transfer verification artefacts (such as counterexamples) from one |
|
model back to the other. |
|
|
|
The close proximity between AltaRica and Fiacre is not really |
|
surprising. First of all, both languages have similar roots in process |
|
algebra theory and share very similar synchronisation mechanisms. More |
|
deeply, they share formal models that are very close: AltaRica |
|
semantics is based on the product of ``communicating automata'', |
|
whereas the semantics of Fiacre can be expressed using (a time |
|
extension of) one-safe Petri nets. The main difference is that |
|
AltaRica provide support for defining probabilities on events, whereas |
|
Fiacre is targeted towards the definition of timing aspects. This |
|
proximity in both syntax and semantics is an advantage for the |
|
validation of our tool, because it means that our translation should |
|
preserve the semantics of AltaRica on models that do not use extern |
|
laws to define probabilities and time. We have used this property to |
|
validate our translation by comparing the behaviours of the models |
|
obtained using Cecilia OCAS simulation tool and their translation. For |
|
instance, in the case of the CAS system of |
|
Sect.~\ref{sec:example-fail-detect-1}, we can compute the set of cuts |
|
corresponding to Safety Property~\ref{prop:1} (loss detection) by |
|
checking an invariant of the form |
|
$((\text{\F{1.O}} = \text{\code{Loss}}) \vee (\text{\F{2.O}} = |
|
\text{\code{Loss}}) \Rightarrow (\text{\F{3.O}} = |
|
\text{\code{Loss}}))$. |
|
% |
|
%%\TODO{Check if it is necessary to mention the tool, as opposed to counterexamples} |
|
In both cases---with and without FDI---we are able to compute the |
|
exact same set of cuts than Cecilia OCAS. This is done using the |
|
model-checker for modal mu-calculus provided with Tina, which can list |
|
all the counterexamples for a (reachability) formula as a graph. More |
|
importantly, we can use our approach to compute the timed |
|
counterexample described at the end of |
|
Sect.~\ref{sec:example-fail-detect-1}. All these computations can be |
|
done in less than a second on our test machine. |
|
|
|
\subsection{Empirical evaluation} |
|
We have used our toolchain to generate the reachable state space of |
|
several AltaRica models~\footnote{All the benchmarks tested in this paper are available at |
|
https://w3.onera.fr/ifa-esa/content/model-checking-temporal-failure-propagation-altarica}: |
|
RUDDER describes a control system for the |
|
rudder of an A340 aircraft~\cite{bernard2007experiments}; ELEC refers |
|
to three simplified electrical generation and power distribution |
|
systems for a hypothetical twin jet aircraft; the HYDRAU model |
|
describes a hydraulic system similar to the one of the A320 |
|
aircraft~\cite{bieber2002combination}. The results are reported in |
|
Table~\ref{table:2}. In each case, we indicate the time needed to |
|
generate the whole state space (in seconds) and the number of states |
|
and transitions explored. For information, we also give the number of |
|
state variables as reported by Cecilia OCAS. All tests were run on an |
|
Intel 2.50GHz CPU with 8GB of RAM running Linux. In the case of model |
|
HYDRAU we stopped the exploration after $30$ minutes and more than |
|
$9.10^{9}$ generated states. |
|
The state space is large in this benchmark because it models the physical a-causal propagation of a leak, so a leak can impact both upward and backward components and trigger a reconfiguration, multiplying the number of reachable states. |
|
Moreover, the topology of the system shall be reconfigurated after the detection of some failures and this dynamic reconfiguration combined with a-causal propagation increases again more the size of the state space. |
|
In all cases, the time needed to generate |
|
the Fiacre code is negligible, in the order of 10~ms. |
|
|
|
\begin{table}[tb] |
|
\centering |
|
\begin{tabular}{|l|c|c|c|c|c|} |
|
\hline |
|
\textbf{Model} & \textbf{time} (s) & \textbf{\# states} & |
|
\textbf{\# |
|
trans.} |
|
& \textbf{\# state vars}\\ \hline |
|
RUDDER~~~ & 0.85 & $3.3\,10^4$ & $2.5\,10^5$ & 15 \\ \hline |
|
ELEC 01 & 0.40 & 512 & $2.3\,10^3$ & 9 \\ \hline |
|
ELEC 02 & 0.40 & 512 & $2.3\,10^3$ & 9 \\ \hline |
|
ELEC 03 & 101 & $4.2\,10^6$ & $4.6\,10^7$ & 22 \\ \hline |
|
HYDRAU & 1800 & --- & --- & 59 \\ \hline |
|
CAS~ & 0.40 & 729 & $2.9\,10^3$ & 6 \\ \hline |
|
CAS with \PRE~~~ & 46 & $9.7\, 10^5$ & $4.3\, 10^6$ & 10 \\ \hline |
|
\end{tabular} |
|
\vspace*{1em} |
|
\caption{State space size and generation time for several use cases.\label{table:2}} |
|
\end{table} |
|
|
|
Our models also include two versions of the complete CAS system |
|
(including the detectors, the alarm and the switch); both with and |
|
without the delay functions \D{1} and \D{2}. The ``CAS with \PRE'' model |
|
is our only example that contains timing constraints. In this case, we |
|
give the size of the state class graph generated by Tina, that is an |
|
abstract version of the state space that preserves LTL properties. We |
|
can use Tina to check temporal properties on this |
|
example. More precisely, we can check that \F{3} has the \emph{loss |
|
detection convergence} property. To this end, a solution is to add a |
|
Time Observer to check the maximal duration between two events: first, a \code{obs\_start} event is triggered when the output of \F{1} or \F{2} changes to \LOSS; |
|
then an \code{obs\_end} event is triggered when the output of \F{3} changes to \LOSS. The |
|
observer has also a third transition (\code{obs\_err}) that acts as a |
|
timeout and is associated with a time interval $I$ and is enabled |
|
concurrently with \code{obs\_end}. Hence, |
|
Time Observer ends up in the state yield by \code{obs\_err} |
|
when the output of \F{3} deviates from its |
|
expected value for more than $d$ units of time, with $d \in I$. We |
|
have used this observer to check that the \emph{convergence latency} |
|
of the CAS system equals $3$, when |
|
we assume that the |
|
delays are in the time interval $[1, 3]$. |
|
The result is that |
|
\code{obs\_err} is fireable for any value of $d$ in the interval |
|
$[0,3]$, while \code{obs\_err} is not fireable if $I = ]3, \infty[$. |
|
% It is enough to indicate to choose $I = [0,3]$. |
|
These two safety properties can be checked on the system (plus the |
|
observer) in less than $0.6$~s. |
|
|
|
|
|
% \begin{wraptable}{r}{0.35\textwidth} |
|
% %\begin{table}[bth] |
|
% \centering |
|
% \vspace*{-7mm} |
|
% \begin{tabular}{|c|c|l} |
|
% \cline{1-2} |
|
% \textbf{delay}$\mathbf\tau$ & \textbf{safety property} & \\ \cline{1-2} |
|
% % {$\quad[4,4]\quad$} & ok & \\ \cline{1-2} |
|
% {$]3, \infty[$} & holds & \\ \cline{1-2} |
|
% {[}3,3{]} & not holds & \\ \cline{1-2} |
|
% {[}0,3{]} & not holds & \\ \cline{1-2} |
|
% \end{tabular} |
|
% \caption{Latencies check on Eq.~\eqref{eq:mc1} |
|
% with $\delta = [1,3]$.\label{table:1}} |
|
% %\end{table} |
|
% \end{wraptable} |
|
|
|
|
|
%%% Local Variables: |
|
%%% mode: latex |
|
%%% TeX-master: "main" |
|
%%% End: |
|
|
|
%% LocalWords: Fiacre
|
|
|