Failure propagation models are defined by safety engineers and are
usually obtained through manual assessment of the safety of the
system. This is a complicated task since failures can depend on more
than one element of the system; be the result of the interaction
between many faults; be the consequence of the missed detection of
another fault (e.g. a fault inside an element tasked with detecting
faults); etc. To cope with the complexity of the systems and the
scenarios that need to be analysed, several model-based approaches
have been proposed, such as AltaRica~\cite{Altarica2.3,arnold1999altarica} and Figaro~\cite{bouissou05}, each with its
associated tooling.
Our work is concerned with the modelling and analysis of failure
propagation in the presence of time constraints. We concentrate on a
particular safety property, called \emph{loss detection convergence},
meaning that the system applies an appropriate and timely response
to the occurrence of a fault before the failure is
propagated and produces unwanted system behaviours.
%
Similar problems
were addressed in~\cite{thomas2013}, where the authors describe a
process to model a Failure Detection, Isolation and Reconfiguration architecture (for use on board satellites) that
requires taking into account failure propagation,
detection, and recovery times. However, these needs are not matched by
an effective way to express or check the safety constraints of the system.
Our contribution is as follows. We define a lightweight extension of
AltaRica, meaning that timing constraints are declared separately from
the behaviour of a system (Sect.~\ref{sec:sect2}). Therefore it is easy to reuse a prior
safety model and to define its temporal behaviour afterwards. We
illustrate our method with an example inspired by safety architectures
found in avionic systems. This example illustrates the impact of time
when reasoning about failure propagation, and we use it to
show that taking into account timing constraints---in particular
propagation delays---can help find new failure modes that cannot be
detected in the untimed models currently in use. In the process, we
define two safety properties: \emph{loss detection}; and its temporal
version, \emph{loss detection convergence}, meaning that a system
applies an appropriate and timely response to the occurrence of a
fault before the failure is propagated and produces unwanted system
behaviours. We show that these two properties, which are of interest
in a much broader context, can be reduced to effective model-checking
problems.
In the next chapter, we extend this approach by applying it to an industrial
satellite case study and analysing the scalability of the methodology to more
complex system models.
\section{Model-Based Safety Analysis with AltaRica}\label{sec:sect2}
\subsection{AltaRica language and versions}
AltaRica is a high-level modelling language dedicated to safety
analysis. It has been defined to ease the modelling and analysis of
failure propagation in systems. The goal is to identify the
possible failure modes of the system and, for each mode, the chain of
events that leads to an unwanted situation.
In AltaRica, a system is expressed in terms of variables constrained
by formulas and transitions. Several versions of the language have
been defined (see for
instance~\cite{arn00,prosvirnova2013altarica}).
Three main versions of AltaRica have been designed so far. They differ essentially in
two respects: the way variables are updated after each transition firing and the way hierarchies of
components are managed. In the first version, designed at the Computer Science Laboratory of
the University of Bordeaux (LaBRI), variables were updated by solving constraint systems. This model,
although very powerful, was too resource-consuming for industrial-scale applications. A radical turn
was therefore taken with the design of a data-flow version of the language. In AltaRica 2.0 Dataflow,
variables are updated by propagating values in a fixed order, determined at compile time.
At the time of writing, most AltaRica tools rely on this Dataflow version. Ongoing investigations aim
at improving AltaRica Dataflow in several ways, for instance by propagating variable values through a fixpoint computation at execution time,
hence motivating the third version of
the language currently being developed: AltaRica 3.0, or OpenAltaRica~\cite{prosvirnova2013altarica}.
\subsection{AltaRica modelling}
In this work, we use the AltaRica 2.0 Dataflow language, which is a fragment of the other
versions and is sufficient to analyse the behaviour of computer-based
systems. (The approach discussed here could be applied to other
AltaRica dialects.) The models used in this work have been edited and
analysed using Cecilia OCAS, a graphical interactive simulator
developed by Dassault Aviation~\cite{bieber2004safety}.
An AltaRica model is made of interconnected \emph{nodes}. A node can
be essentially viewed as a mode automaton~\cite{rauzy02} extended with
guards and actions on data variables. A node comprises three parts: a
declaration of variables and events, the definition of transitions,
and the definition of assertions. We illustrate these concepts with
the example of a simple node called \FUNC. We give the code (textual
definition) of \FUNC in Listing~\ref{ls:altarica} and a schematic
representation in Fig.~\ref{fig:funct2}. The node \FUNC has one input,
\code{I}, and one output, \code{O}.
\lstinputlisting[language=Altarica,float=bht,captionpos=b,caption=\protect{Example
of AltaRica code for the node \FUNC.},frame=single,label=ls:altarica]{algos/function2.alt}
\begin{figure}[thb]
\centering
\begin{minipage}{0.25\textwidth}
\centering
\includegraphics[width=\textwidth]{altarica-function.pdf}
\end{minipage}
\begin{minipage}{0.7\textwidth}
\centering
\vspace{1em}
\includegraphics[width=0.8\textwidth]{altarica-diagrams.pdf}
\end{minipage}
\caption{Graphical representation of node \FUNC (left) and its
associated failure mode automaton (right).\label{fig:funct2}}\vspace{1em}
\end{figure}
\enlargethispage{\baselineskip}
%
Nodes can have an internal state stored in a set of \emph{state
  variables}, declared in a heading called \texttt{state}. In its
nominal state (when \code{S = NOMINAL}), the \FUNC node acts as a
perfect relay: it copies to its output the value provided by its input
(we have \code{O = I}); this is expressed in the \code{assert}
block. Conversely, its output is set to \LOSS or \ERR when
\code{S} equals \code{LOST} or \code{ERROR}, respectively.
%
The assert directive is used to express constraints on the values of
the input and output variables of a node, also called \emph{flow
variables}, establishing a link between the node and its
environment, i.e. the other interconnected nodes. It distinguishes
between input (\texttt{in}) and output (\texttt{out}) variables. An
assertion defines a rule to update the value of output flows according
to the state of the component and the value of input flows.
The state variables of a node can only change when an
event is triggered. The code of \FUNC declares two events: \code{fail\_err}, that
changes the state from \code{NOMINAL} to \code{ERROR}, and
\code{fail\_loss}, that can transition from any state to
\code{LOST}. This behaviour is the one displayed on the mode
automaton of Fig.~\ref{fig:funct2}. Transitions are listed in the
\texttt{trans} block of the node. Each transition has an (event)
name and a definition of the form
\lstinline[language=Altarica]{g |-evt-> e}, where the guard \code{g}
is a Boolean condition that can refer to state and flow
variables. The event \code{evt} can be triggered when the guard is
satisfied. In this case we apply the effect, \code{e}, which is an
expression that modifies the values of state variables.
Events are useful to model the occurrence of failures or the reaction
to conditions on the flow variables.
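As an illustration, the behaviour described above can be sketched in Python. This is a hedged analogue of the node \FUNC, not the AltaRica semantics itself; the names mirror the prose rather than the actual listing.

```python
# Hypothetical Python sketch of the FUNC node: one state variable S,
# two events (fail_err, fail_loss), and an assertion relating O to I.
NOMINAL, ERROR, LOST = "NOMINAL", "ERROR", "LOST"
OK, ERR, LOSS = "Ok", "Err", "Loss"

class Func:
    def __init__(self):
        self.S = NOMINAL                  # state variable

    # Transitions `g |-evt-> e`: guard on the left, effect on the right.
    def fail_err(self):
        if self.S == NOMINAL:             # guard
            self.S = ERROR                # effect
    def fail_loss(self):                  # guard `true`: fires from any state
        self.S = LOST

    # Assertion: the output flow is a function of the state and input flow.
    def output(self, I):
        if self.S == NOMINAL:
            return I                      # perfect relay: O = I
        return ERR if self.S == ERROR else LOSS
```

The mode automaton of Fig.~\ref{fig:funct2} is recovered by the two event methods: \code{fail\_err} only fires from the nominal state, whereas \code{fail\_loss} may fire from any state.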
% It is possible to indicate
% whether an event is triggered internally by a node---as in the case of
% a reconfiguration---or externally by its environment---as in the case
% of the occurrence of a failure. In the latter case,
We can assign a
probability law to the occurrence of a failure using the heading
\code{extern}. For instance, we could assert that event
\code{fail\_loss} follows an exponential distribution with the
declaration:
%%
{\lstinline[language=Altarica]{extern law (<event fail_loss>)="exp 1e-4";}}
%%
In the next section, we propose a way to enrich this syntax to express
timing constraints on events instead of probability distributions. At
the moment, it is not possible to combine stochastic events with
timed events.
\enlargethispage{\baselineskip}
In the general case, an AltaRica model is composed of several
interconnected node instances, following a component-based
approach. Global assertions relate the input flows of a component to
the output flows of other components. For the sake of brevity, we do
not describe component synchronisation here and we refer the reader to
\cite{Altarica2.3} for further details. More importantly, a
hierarchical AltaRica model can always be ``flattened'', i.e.
represented by a single node containing all the variables, events,
assertions, and transitions from the composite system. We use this
property in our interpretation of AltaRica in Fiacre.
%%
\subsection{Time AltaRica: Adding Timing Constraints to Events}%%\label{sec:adding-timing-constr}
%%
There already is a limited mechanism for declaring timing constraints
in AltaRica. It relies on the use of an external law associated with a
\emph{Dirac distribution}. An event with a Dirac$(0)$ law denotes an
instantaneous transition, which should be triggered with the highest
priority. Likewise, an event with a Dirac($d$) law (where $d$ is a
positive constant) models a transition that should be triggered with a
delay of $d$ units of time.
%
In practice, Dirac laws are more a way to encode priorities between
events than an actual means of expressing durations. Moreover, while Dirac
laws are used during simulation, they are not taken into account by
the other analysis tools. Finally, Dirac laws are not
expressive enough to capture non-deterministic transitions that can
occur within time intervals of the form $[a, b]$, where $a \neq
b$. Such constraints are useful to reason about failure propagation
delays with different best- and worst-case traversal times. For this
reason, we propose to extend event properties with \emph{temporal
laws} of the form:
%%
{\lstinline[language=Altarica]{extern law (evt) = "[a,b]";}}
%%
It is also possible to use open and/or unbounded time intervals, such
as \code{]a,}$\infty$\code{[}.%%\scriptstyle
With such a declaration, the transition
\lstinline[language=Altarica]{g |-evt-> e} can be triggered only if
the guard \code{g} is satisfied for a duration (or \emph{waiting
time}) $\delta$, with $\delta \in [a, b]$. A main difference from
the original semantics of AltaRica is that the timing constraint of an event is
not reinitialised unless its guard becomes false.
%%
% Hence the waiting time of an event decreases when time elapse whereas,
% while with a Dirac($d$) law the waiting time is reset to $d$ each time
% the event fires. Finally,
Moreover, our semantics naturally entails a notion of \emph{urgency},
meaning that it is not possible to miss a deadline: when $\delta$
equals $b$, then either \code{evt} is triggered or another transition
should (instantaneously) change the value of the guard \code{g} to
false.
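The informal semantics above can be sketched as a small simulation. This is a hedged Python illustration; the class name and the step-based API are ours, not part of AltaRica.

```python
class TemporalLaw:
    """Sketch of the `extern law (evt) = "[a,b]"` semantics: evt may fire
    once its guard has held continuously for a duration delta in [a, b].
    The clock is reinitialised only when the guard becomes false, and
    letting the elapsed time pass b without firing or disabling the
    guard violates urgency."""
    def __init__(self, a, b):
        self.a, self.b = a, b
        self.elapsed = None               # None = guard currently false

    def step(self, guard, dt):
        """Advance time by dt with the given guard value; return whether
        evt may fire at the end of the step."""
        if not guard:
            self.elapsed = None           # timing constraint reinitialised
            return False
        self.elapsed = dt if self.elapsed is None else self.elapsed + dt
        if self.elapsed > self.b:
            raise RuntimeError("urgency violated: deadline b missed")
        return self.a <= self.elapsed <= self.b
```

For instance, with a law $[2, 4]$, the event becomes fireable only after the guard has held for two time units, and any falsification of the guard restarts the clock.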
We can illustrate the use of temporal laws with the following example of a new
node, \PRE; see Listing~\ref{ls:pre}. This node encodes a buffer that
delays the propagation of its input. When the input changes, event
\code{pre\_read} has to be triggered instantaneously. Then, only after
a duration $\delta \in [a, b]$, the value stored by \PRE (in the state
variable \code{Stored}) is propagated to its output.
%
\lstinputlisting[language=Altarica,float=t,captionpos=b,caption=\protect{Example
of Time AltaRica code: the basic delay.},frame=single,label=ls:pre]{algos/pre.alt}
\section{A Definition of Fiacre Using Examples}\label{sec:sect3}
Fiacre~\cite{berthomieu2008fiacre} is a high-level, formal
specification language designed to represent both the behavioural and
timing aspects of reactive systems. Fiacre programs are structured around
two main notions: \emph{processes}, which are well-suited for modelling
structured activities (like, for example, simple state machines), and
\emph{components}, which describe a system as a composition of
processes. In the following, we base our presentation of Fiacre on
code examples used in our interpretation of Time AltaRica.
% \subsection{A Definition of Fiacre by Examples}
% \label{sec:defin-fiacre-exampl}
We give a simple example of Fiacre specification in
Listing~\ref{lst:fiacre}. This code defines a process,
\code{Function}, that simulates the behaviour of the AltaRica node
given in Listing~\ref{ls:altarica}.
\lstinputlisting[language=Fiacre,float=hbt,captionpos=b,caption=\protect{Example
of Fiacre code: type, functions and processes},label=lst:fiacre,frame=single]{algos/function.fcr}
Fiacre is a strongly typed language, meaning that type annotations are
exploited in order to avoid unchecked run-time errors. Our example
defines two enumeration types, \code{FState} and \code{FailureType},
that are the equivalent of the namesake AltaRica domains. We also
define a record type, \code{Flows}, that models the environment of the
node \code{Function}, that is, an association from flow variables to
values. Fiacre provides more complex data types, such as arrays,
tagged unions or FIFO queues. Fiacre also supports native
\emph{functions} that provide a simple way to compute on values. In
our example, function \code{update} is used to compute the state of
the environment after an event is triggered, that is, to model the
effect of assertions in AltaRica. It uses two ternary (conditional)
operators to mimic the \code{case}-expression found in the
\code{assert} heading of Listing~\ref{ls:altarica}.
% Functions are evaluated applicatively (there are no side effects) and
% can be recursively defined.
A Fiacre \emph{process} is defined by a set of parameters and {control
states}, each associated with a set of \emph{complex transitions}
(introduced by the keyword \code{from}). Our example defines a process
with two {shared variables}---symbol \code{\&} denotes variables
passed by reference---that can be updated concurrently by other
processes. In our case, variable \code{S} models the (unique) state
variable of node \FUNC.
Complex transitions are expressions that declare how variables are
updated and which transitions may fire. They are built from constructs
available in imperative programming languages (assignments,
conditionals, sequential composition, \dots); non-deterministic
constructs (such as external choice, with the \code{select} operator);
communication on ports; and jump to a state (with the \code{to} or
\code{loop} operators). In Listing~\ref{lst:fiacre}, the \code{select}
statement defines two possible transitions, separated by the symbol
\code{[]}, that loop back to \code{s0}. Each transition maps exactly
to one of the AltaRica events, \code{fail\_loss} and \code{fail\_err},
that we want to translate. Transitions are triggered
non-deterministically and their effects are atomic (they have an ``all
or nothing'' semantics). A transition can also be guarded by a Boolean
condition, using the operator \code{on} or another conditional
construct.
It is possible to associate a time constraint with a transition using
the operator \code{wait}. Indeed, the ability to express timing
constraints directly in programs is a distinguishing feature of
Fiacre. We illustrate this mechanism in the code below, which
corresponds to the interpretation of the node \PRE of
Listing~\ref{ls:pre}. Basically, a transition constrained by a (time)
interval $I$ can be triggered after a time $\delta$, with
$\delta \in I$, only if its guard has stayed continuously valid during
this time. It is this behaviour that inspired our choice of semantics
for the temporal law.
The last notion that we need to introduce is \emph{components}.
Components allow the hierarchical composition of processes, in the
same way as AltaRica nodes can be composed into classes in order to
build more complex behaviours.
Thus, a Fiacre component defines a parallel composition of components and/or
processes using statements of the form \code{par} \vars{P}$_0$
$\parallel \dots \parallel$ \vars{P}$_n$ \code{end}. It can also be
used to restrict the visibility of variables and ports and to define
priorities between communication events.
% Priorities are explicitly declared as constraints between
% communication ports, as in \code{go1 > go2}.
We give an example of Fiacre component in Listing~\ref{lst:pre2},
where we define an upgraded version of the delay operator \PRE.
%
\lstinputlisting[language=Fiacre,float=bt,captionpos=b,caption=\protect{Wait statement, components and synchronisation on ports.},label=lst:pre2,frame=single]{algos/pre2.fcr}
% %
% \lstinputlisting[language=Fiacre,float=bt,captionpos=b,caption=\protect{Example
% of Fiacre code: declaring timing
% constraints},label=lst:pre1,frame=single]{algos/pre.fcr}
% %
A problem with the implementation of \PRE is that at most one failure
mode can be delayed at a time. Indeed, if the input of \PRE changes
while the state is \code{Full}, then the updated value is not taken
into account until after event \code{pre\_wait} triggers.
%
It is not possible to implement a version that can delay an unbounded
number of events in a bounded time; this would require an unbounded
amount of memory to store the intermediate values. More fundamentally,
this would give rise to undecidable verification problems (see
e.g.~\cite{bouyer2004updatable}). To mitigate this problem, we can define a
family of operators, \PRE\code{\_k}, that can delay up to $k$
different inputs simultaneously. Our implementation relies on a
component that uses three process instances: one instance of
\code{front}, that reads messages from the input (variable \code{I}),
and two instances of \code{delay}, that act as buffers for the values
that need to be delayed. Process \code{front} uses the local ports
\code{go1} and \code{go2} to dispatch values to the buffers.
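The intended behaviour can be sketched as follows. This is a hedged Python model of a \PRE\code{\_k}-like operator; the constant delay \code{d} and the treatment of overflow (raising an error when all buffers are busy) are our simplifying assumptions.

```python
import heapq

class PreK:
    """Sketch of a PRE_k-style delay: at most k values in flight at
    once, each propagated to the output d time units after being read."""
    def __init__(self, k, d, initial):
        self.k, self.d = k, d
        self.out = initial
        self.pending = []                 # min-heap of (release_time, value)

    def write(self, t, value):
        """front: read an input change at time t, dispatch it to a free buffer."""
        self._advance(t)
        if len(self.pending) >= self.k:
            raise RuntimeError("all delay buffers busy: input change lost")
        heapq.heappush(self.pending, (t + self.d, value))

    def _advance(self, t):
        # delay: release every buffered value whose deadline has passed.
        while self.pending and self.pending[0][0] <= t:
            _, self.out = heapq.heappop(self.pending)

    def read(self, t):
        """The output observed at time t."""
        self._advance(t)
        return self.out
```

With $k = 2$, two successive status changes (say \OK to \LOSS at $t = 0$, then \LOSS to \ERR at $t = 1$) can both be in flight, which is the situation \PRE\code{\_2} is designed for.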
Component \PRE\code{\_2} is enough to model the use case defined in
Sect.~\ref{sec:example-fail-detect-1}. Indeed, any element in the FDI
system may propagate at most two successive status changes: one from \OK to
\LOSS, and then one from \LOSS to \ERR.
\section{Example of a Failure Detection and Isolation System}
\label{sec:example-fail-detect-1}
We study the example of a safety critical function that illustrates
standard failure propagation problems. We use this example to show the
adverse effects of temporal failure propagation even in the presence
of % system safety improvements. In our case the addition of
Failure Detection and Isolation (FDI) capabilities.
%
This example is inspired by the avionic functions that provide
parameters for the Primary Flight Display (PFD) located in the
aircraft cockpit. The system of interest is the computer that acquires
sensor measurements and computes the aircraft \emph{calibrated airspeed} (CAS) parameter.
Airspeed is crucial for pilots: it is taken into
account to adjust aircraft engine thrust and it plays a main role in
the prevention of overspeed and stall.
%%
\begin{figure}[thb]
\centering
\includegraphics[width=1\textwidth]{figures/AirSpeed_Computation.jpg}
\caption{Functional and physical views of the airspeed computation function.\label{fig:cockpit}}
\end{figure}
%%
CAS is not directly measured by a dedicated sensor, but is computed as
a function of two auxiliary pressure measurements, the static pressure
(Ps) and total pressure (Pt); that is
$\mathrm{CAS} = f(\mathrm{Pt}, \mathrm{Ps})$.
% \begin{equation*}
% CAS = C_{SO}\sqrt{5 \cdot \left( \left(\frac{Pt - Ps}{P_0} + 1\right)^{2/7} - 1\right)}
% \end{equation*}
% with $C_{SO}$ and $P_0$ two
% constants. % the speed of sound under standard day sea level conditions
These two measurements come from sensors located on the aircraft nose,
a pressure probe and a pitot tube.
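As a point of reference, the subsonic calibrated-airspeed formula that $f$ abstracts can be sketched as follows (a hedged illustration using standard sea-level atmosphere constants; the function name is ours, and this level of detail plays no role in the safety model).

```python
import math

# ICAO standard sea-level constants.
A0 = 661.47        # speed of sound at sea level, in knots
P0 = 101325.0      # static pressure at sea level, in Pa

def cas(pt, ps):
    """Calibrated airspeed CAS = f(Pt, Ps) for subsonic flight:
    CAS = a0 * sqrt(5 * (((Pt - Ps) / P0 + 1)**(2/7) - 1))."""
    qc = pt - ps                          # impact (dynamic) pressure
    return A0 * math.sqrt(5.0 * ((qc / P0 + 1.0) ** (2.0 / 7.0) - 1.0))
```

When the probes agree ($\mathrm{Pt} = \mathrm{Ps}$), the computed airspeed is zero; an erroneous pressure on either input silently corrupts the CAS value, which is precisely the failure mode studied below.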
% Our proposed functional view shows:
% \begin{itemize}
% \item Two external functions \I{1} and \I{2} that measure static and total pressure. They represent the functions allocated to the sensors.
% \item Three inner functions of the system: \F{1} and \F{2} for sensor measurements acquisition by the onboard computer and \F{3} for airspeed computation.
% \item The PFD functions have not been modelled in the sole purpose of simplification.
% \end{itemize}
%
%%
Our proposed functional view is given in Fig.~\ref{fig:cockpit}. It
consists of two external input functions, \I{1} and \I{2}, that measure static
and total pressure; and three inner functions of the system: \F{1} and
\F{2} for sensor measurement acquisition by the on-board computer, and
\F{3} for airspeed computation. For simplification purposes, the
PFD functions have not been modelled.
Next, we propose a first failure propagation view aiming at
identifying the scenarios leading to an erroneous airspeed computation
and display to the pilot (denoted \ERR). Such a failure can only be
detected if a failure detector is implemented, for instance by
comparing the outputs of different functions. Undetected, it could
mislead the pilot and, consequently, lead to an inappropriate engine
thrust setting. We also want to identify the scenarios leading to
the loss of the measure (denoted \LOSS). In such a case, the pilot can
easily assess that the measure is missing or false and consequently
rely upon another measure to control the aircraft (note that such
redundancy is not modelled). For example, airspeed out of
bound---indicating that an airliner has crossed the sonic barrier---is
considered to be of kind \LOSS. It can be understood that scenarios leading to the
loss of the airspeed are less critical than those leading to erroneous
values.
% We consider two possible kinds of fail status: the ``silent'' failure of
%an element (denoted \ERR); or a communication problem with one of the
% functions (denoted \LOSS).
A failure is typically a loss of connectivity or the
detection of (clearly) erroneous transmitted values, for instance a
variable out of bounds. A \LOSS fail is less critical than an \ERR
since it can be detected immediately and isolated. On the other hand,
an \ERR fail can only be detected if a failure detector is
implemented, for instance by comparing the outputs of different
functions. In the following, we consider that the statuses
are ordered by their level of severity, that is \ERR $<$ \LOSS $<$ \OK.
\subsubsection{Safety model of the architecture without FDI}
%%\label{sec:error-prop-simple}
For the sake of understandability, we choose to study a simplified version of
an FDI system. While simple, this system is representative of failure
propagation mechanisms found in actual on-board systems.
%%
We provide an AltaRica model corresponding to the functional view
of the CAS function in Fig.~\ref{fig:example0}. This model, tailored
to the study of failure propagation, comprises:
% ( All the functions are modelled according to the node function
% introduced in the Sect.~\ref{sec:sect2}.
%
two external functions, \I{1} and \I{2}, that have no input (so, in
their nominal state, the output is set to \OK); two inner functions,
\F{1} and \F{2}, which are instances of the node \FUNC described in
Sect.~\ref{sec:sect2}; and a function, \F{3}, that is the composition
of two basic elements: a multiplexer, \MIN, representing the
dependence of the output of \F{3} from its two inputs, and a computing
element \F{3Processing} that represents the computation of the
airspeed. \F{3Processing} is also an instance of node \FUNC.
\begin{figure}[t]
\centering
\begin{tikzpicture}
\node[anchor=south west,inner sep=0] at (0,0) { \includegraphics[width=0.75\textwidth]{figures/Modele0}};
\draw[red!50,ultra thick, dashed] (7,2.85) rectangle (2.6,0.5);
\node[red] at (6.5,2.5) {\texttt{\large F3}};
\end{tikzpicture}
\caption{A simple example of failure propagation.\label{fig:example0}}
\end{figure}
%
In a single-failure scenario, \MIN propagates the failure coming
from one input or the other. In the case of multiple failures,
%when both inputs have identical failure, the faillure propagates;
%For instance,
% both inputs to \LOSS lead to an output of \LOSS.
% But when it receives an erroneous fail then the multiplexer will
% propagate an \ERR. The remaining failure combination drew our
% attention during \F{3} modelling.
%on the other hand,
when different failures propagate---one being \LOSS
and the other being \ERR---and without appropriate FDI, the system
outcome is uncertain.
%
Solving this uncertainty would require a detailed behavioural model of
the on-board computer and a model for all the possible failure modes,
which is rarely feasible with a sufficient level of confidence, except
for time-tested technology. Given this uncertainty, it is usual to
retain the state with the most critical effect, that is to say:
% in this case, \MIN can be thought of as a simple logic operator that
% computes the minimum of its two inputs, and
the output of \F{3} is \ERR.
% Since a \LOSS can be ``easily'' detected, we want to avoid a situation
% where the whole system propagates \ERR while one of its
%internal element propagates a \LOSS. The rationale is that we should be
% able to isolate (or quarantine) the system when we know for sure that
% one of its element does not reliably respond to commands. This can be
% expressed by the following safety property.
%%
% In the next section, we study the system with the addition of FDI
% capabilities.
Our goal is to prevent the computation of an erroneous airspeed while
one of \F{3}'s input signals is lost. The rationale is that the system
should be able to automatically passivate the airspeed when it detects
that one of its input signals is not reliable. This behaviour can be
expressed with the following property:
%%
\begin{prop}[Loss Detection and Instantaneous Propagation]\label{prop:1}
A function is \emph{loss detection safe} if, when in nominal mode, it
propagates a \LOSS whenever one of its input nodes propagates a
\LOSS.
\end{prop}
% This is a simple system used to detect non-nominal behaviours
% and to trigger observables in order to isolate the failure mode.
% ({The figures are actual screen capture of models edited with Cecilia
% OCAS.})
% \subsubsection{Analysis}
We can show that our example of Fig.~\ref{fig:example0} does not meet
this property using the \emph{Sequence Generation} tool available in
Cecilia OCAS.
%%
% This system has several sources of unreliability. Both its sources and
% the computing elements (\F{1}, \F{2} and \F{3Processing}) can
% experience spontaneous failures, meaning that their outputs can
% transition instantaneously to \LOSS or \ERR. Errors are irrecoverable;
% once a function becomes faulty, its output will always stay equal to
% \LOSS or \ERR. Likewise, if both inputs are ``nominal'' then the
% output of \MIN is \OK. We assume that the multiplexer cannot undergo
% spontaneous failures.
%%
% This can be tested using a perfect detectors, \FF{3}{Loss}, placed at
% the output of the system.
%%
To this end, we compute the minimal cuts for the target equation
$((\text{\F{1.O.Loss}} \vee \text{\F{2.O.Loss}}) \wedge \neg
\text{\F{3.O.Loss}})$, meaning the scenarios where \F{3} does not
propagate \LOSS when one of \F{1} or \F{2} does. Hence function \F{3}
is {loss detection safe} if and only if this set is
empty\footnote{To check the safety of \F{3}, we need to perform the same
  analysis for all the elements that can propagate a loss fail, not
  only \F{1}.}.
In our example, once we eliminate the cases where \F{3} is not nominal
(that is, when \F{3Processing} is in an error state), we find eight
minimal cuts, all of order $2$. In the following section, we correct
the behaviour of \F{3} by considering a new architecture based on
detectors and a switch that isolates the output of \F{3} when faulty.
The minimal cuts are:
%%
{%\centering
\begin{small}
\begin{tabular}{c@{\quad\quad}c}
\begin{minipage}[t]{0.42\linewidth}
\begin{verbatim}
{'F1.fail_err', 'F2.fail_loss'}
{'F1.fail_err', 'I2.fail_loss'}
{'F1.fail_loss', 'F2.fail_err'}
{'F1.fail_loss', 'I2.fail_err'}
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{0.42\linewidth}
\begin{verbatim}
{'F2.fail_err', 'I1.fail_loss'}
{'F2.fail_loss', 'I1.fail_err'}
{'I1.fail_err', 'I2.fail_loss'}
{'I1.fail_loss', 'I2.fail_err'}
\end{verbatim}
\end{minipage}\\
\end{tabular}\\
\end{small}
}
%
Each of these cuts leads to a trivial violation of our safety
property. For instance, we obtain the cut \code{\{'F1.fail\_err',
'F2.fail\_loss'\}} that corresponds to the case where \F{1}
propagates \ERR and \F{2} propagates \LOSS, in which case \F{3}
propagates \ERR. This is exactly the situation that we want to
avoid.
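These eight cuts can be cross-checked by a brute-force enumeration of the untimed failure logic described above. This is a hedged Python sketch, not the Cecilia OCAS algorithm; component and event names follow the figure, and \F{3Processing} is kept nominal as in the analysis.

```python
from itertools import combinations

ERR, LOSS, OK = 0, 1, 2        # severity ordering: ERR < LOSS < OK

def out(mode, upstream):
    """Output of a node: a failed node forces its status, a nominal one
    relays its input (sources I1, I2 have upstream = OK)."""
    if mode == "fail_loss":
        return LOSS
    if mode == "fail_err":
        return ERR
    return upstream

def f3(modes):
    """Outputs (F3.O, F1.O, F2.O), with F3Processing kept nominal."""
    o_f1 = out(modes.get("F1"), out(modes.get("I1"), OK))
    o_f2 = out(modes.get("F2"), out(modes.get("I2"), OK))
    return min(o_f1, o_f2), o_f1, o_f2    # MIN keeps the most severe status

def target(modes):
    # (F1.O.Loss or F2.O.Loss) and not F3.O.Loss
    o_f3, o_f1, o_f2 = f3(modes)
    return (o_f1 == LOSS or o_f2 == LOSS) and o_f3 != LOSS

events = [(c, e) for c in ("I1", "I2", "F1", "F2")
                 for e in ("fail_loss", "fail_err")]
cuts = []
for k in (1, 2):                          # enumerate cuts up to order 2
    for combo in combinations(events, k):
        if len({c for c, _ in combo}) < k:
            continue                      # one failure mode per component
        if target(dict(combo)) and not any(c < set(combo) for c in cuts):
            cuts.append(set(combo))
```

The enumeration yields exactly the eight order-2 cuts listed above: a \LOSS on one chain combined with an \ERR on the other.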
\begin{figure}[bt]
\centering
\begin{tikzpicture}
\node[anchor=south west,inner sep=0] at (0,0) { \includegraphics[width=\textwidth]{figures/Modele2.png}};
\draw[red!50,ultra thick, dashed] (10.8,3.4) rectangle (2,0);
\node[red] at (10.4,3.1) {\texttt{\large F3}};
\end{tikzpicture}
\caption{Model of a FDI function with a switch and an alarm.\label{fig:example1}}
\end{figure}%\vspace*{-1cm}
%\enlargethispage{\baselineskip}
\subsubsection{Safety model of the architecture with FDI}
%%\label{sec:simple-fail-detect}
%%
%In order to satisfy the Prop.~\ref{prop:1},
The updated implementation of \F{3} (see Fig.~\ref{fig:example1})
uses two perfect detectors, \F{1Loss} and \F{2Loss}, that can detect
a loss failure event on the inputs of the function. The (Boolean) outputs of
these detectors are linked to an OR gate ({\ttfamily AtLeastOneLoss})
which triggers an \ALARM when at least one of the detectors outputs
true. The alarm commands a \SWITCH; the output of \SWITCH is the same
as \MIN, unless \ALARM is activated, in which case it propagates a
\LOSS failure. The alarm can fail in two modes, either
continuously signaling a \LOSS or never being activated. The
schema in Fig.~\ref{fig:example1} also includes two delays operators, \D{1} and \D{2},
that model delay propagation at the input of the detectors;
we will not consider them in the following lines, but come back to these timing constraints at the end of the section.
% \TODO{why do we need Alarm ! why does the alarm can fail ! we could
% plug directly the or gate to the switch}- Permanent failure of the alarm.
The FDI function---with a switch and an alarm---is a stable scheme for
failure propagation: when in nominal mode, it detects all the failures
of the system and it is able to disambiguate the case where its inputs
contain both \ERR and \LOSS. Once again, this can be confirmed using
the Sequence Generation tool. If we repeat the same analysis as
before---and if we abstract away the delay nodes---we find $56$
minimal cuts, all involving a failure of either \ALARM or
\F{3Processing}, i.e. a non-nominal mode. This means that, in an untimed model, our new
implementation of \F{3} satisfies the loss detection property, as
desired.
% \begin{figure}[hbt]
% \centering
% \begin{tikzpicture}
% \node[anchor=south west,inner sep=0] at (0,0) { \includegraphics[width=\textwidth]{figures/Modele2.png}};
% \draw[red,ultra thick, dashed] (10.6,3.6) rectangle (2.2,0);
% \node[red] at (10,3) {\texttt{\huge F3}};
% \end{tikzpicture}
% % \includegraphics[width=0.95\textwidth]{figures/Modele2}
% \caption{Taking into consideration propagation delays in the FDI
% system.\label{fig:example2}}
% \end{figure}
% Unfortunately, this model makes unrealistic assumptions about the
% instantaneous failure propagation of detection. Indeed, it may be the
% case that the outputs of \F{1} and \F{2} are delayed as they are
% propagated to the observers \F{1Loss} and \F{2Loss}. Next, we study
% new failure modes that may arise from this situation and how we can
% detect them.
%%
% \subsection{Timed Safety model of the architecture with FDI}
% \label{sec:timed-safety-model}
%%
% In this section, we consider a new AltaRica model that takes into
% account propagation delays. To this end, we may insert two instances
% of the delay node (the element \PRE defined in Sect.~\ref{sec:sect2})
% before the detectors \F{1Loss} and \F{2Loss}.
Even so, it is easy to find a timed scenario where the safety property
is violated.
Assume now that \F{1} and \F{2} propagate the statuses \LOSS and \ERR,
respectively, at the same date. In such a case, and considering
possible latencies, the \ERR status reaches \MIN instantaneously,
while the output of \F{1} might reach \F{1Loss} only at a later date.
This leads to a transient state where the alarm is not activated even though
the output of \MIN is set to \ERR. This brings us back to the same
dreaded scenario as in our initial model.
%%
In particular, this scenario corresponds to the cut
\code{\{'F1.fail\_loss', 'F2.fail\_err'\}}, that is not admissible in
an untimed semantics.
%%
This example suggests that we need a more powerful method to compute
the set of cuts in the presence of temporal constraints. On the other
hand, we may also argue that our safety property is too restrictive in
this context, where perfect synchronicity of events is rare. Indeed,
it can be proved that the output of \F{3} will eventually converge to a
loss detection and isolation mode (assuming that \F{3} stays nominal and
that its inputs stay stable).
%
To reflect this situation, we propose an improved safety property that
takes into account temporal properties of the system:
%%
\begin{prop}[Loss Detection Convergent]\label{prop:2} A function is
\emph{loss detection convergent} if (when in nominal mode) there
exists a duration $\Delta$ such that it continuously outputs a \LOSS
after the date $\delta_0 + \Delta$ if at least one of its input
nodes continuously propagates a \LOSS starting from $\delta_0$
onward. The smallest possible value for $\Delta$ is called the
\emph{convergence latency} of the function.
\end{prop}
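This property can be sketched, in an MTL-like notation, as a
bounded-response formula; the symbols $\mathit{in}$ and
$\mathit{out}$, which stand for an input node and the output of the
function, are our notation and not part of the model:
\[
  \Box\,\big(\, \Box\,(\mathit{in} = \text{\code{Loss}})
  \;\Rightarrow\; \Diamond_{[0,\Delta]}\,\Box\,(\mathit{out} = \text{\code{Loss}}) \,\big)
\]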
Hence, if the latency needed to detect the loss failure can be
bounded, and if the bound is sufficiently small safety-wise, we can
still deem our system safe.
In the example of Fig.~\ref{fig:cockpit}, this property can indicate
how long an erroneous airspeed may be displayed to the pilot on the
PFD before the failure is isolated.
In the next section, we use our approach to generate a list of ``timed
cuts'' (as model-checking counterexamples) that would have exposed the
problems described above. We also use model-checking to compute
the {convergence latency} for the node \F{3}. In this simple example,
we can show that the latency is equal to the maximal propagation delay
at the input of the detectors. The value of the latency could be much
harder to compute in a more sophisticated scenario, where delays can
be chained and/or depend on the internal state of a component.
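For the model at hand, this equality can be justified by a short
derivation; below, $d_1$ denotes the delay introduced by \D{1} (our
notation), and we use the fact that the OR gate, the alarm and the
switch react instantaneously when nominal:
\begin{align*}
  \text{\F{1.O}} = \text{\code{Loss}} \text{ from } \delta_0
    &\;\Rightarrow\; \text{\F{1Loss.O}} = \text{\code{true}}
      \text{ from } \delta_0 + d_1, \quad d_1 \in [1,3]\\
    &\;\Rightarrow\; \text{\F{3.O}} = \text{\code{Loss}}
      \text{ from } \delta_0 + d_1 \leq \delta_0 + 3
\end{align*}
Hence the output converges at the latest $3$ time units after
$\delta_0$, and this bound is reached when $d_1 = 3$.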
\section{Compilation of AltaRica and Experimental Evaluation}\label{sec:sect4}
We have implemented the transformation outlined in
Sect.~\ref{sec:sect3}; the result is a compiler that automatically
generates Fiacre code from an AltaRica model. The compilation process
relies on the fact that it is possible to ``flatten'' a composition of
interconnected nodes into an intermediate representation, called a
\emph{Guarded Transition System} (GTS)~\cite{rau08}. A GTS is very
similar to a (single) AltaRica node and can therefore be encoded in a
similar way. Our tool is built using the codebase of the model-checker
EPOCH~\cite{epoch}, which provides the functionalities for the
syntactic analysis and the linking of AltaRica code. After
compilation, the Fiacre code can be checked using
Tina~\cite{BRV04}. As seen in Sect.~\ref{sec:tina}, the core of the Tina toolset is an
exploration engine that can be exploited by dedicated model-checking
and transition analyser tools. Tina offers several abstract state
space constructions that preserve specific classes of properties like
absence of deadlocks, reachability of markings, or linear and
branching time temporal properties. These state space abstractions are
vital when dealing with timed systems that generally have an infinite
state space (due to the use of a dense time model). In our
experiments, most of the requirements can be reduced to reachability
properties, so we can use on-the-fly model-checking techniques.
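For instance, the loss detection property of
Sect.~\ref{sec:example-fail-detect-1} reduces to (the negation of) a
reachability question: is there a reachable state in which an input
propagates a \LOSS while the output of \F{3} does not? In a CTL-like
notation (our rendering, not the concrete syntax accepted by Tina),
this reads:
\[
  \mathrm{EF}\,\big( (\text{\F{1.O}} = \text{\code{Loss}} \,\vee\,
  \text{\F{2.O}} = \text{\code{Loss}}) \,\wedge\,
  \text{\F{3.O}} \neq \text{\code{Loss}} \big)
\]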
% and very ``aggressive'' abstractions.
\begin{figure}[thb]
\centering
\includegraphics[width=\textwidth]{figures/passerelle.png}
\caption{Graphical representation of the AltaRica/Fiacre toolchain.\label{fig:toolchain}}
\end{figure}
We interpret a GTS by a Fiacre process whose parameters consist of all
its state and flow variables. Each transition
\lstinline[language=Altarica]{g |-evt-> e} in the GTS is (bijectively)
encoded by a Fiacre transition that tests the guard \code{g} and updates
the variables to reflect the effect of \code{e} plus the assertions. Each transition can
be labelled with time and priority constraints to take
into account the \code{extern} declarations of the node. This
translation is straightforward since all the operators available in
AltaRica have a direct equivalent in Fiacre. Hence every
state/transition in the GTS corresponds to a unique state/transition
in Fiacre. This means that the state (reachability) graph of a GTS and
its associated Fiacre model are isomorphic. This is a very strong and
useful property for formal verification, since we can very easily
transfer verification artefacts (such as counterexamples) from one
model back to the other.
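As an illustration, a GTS transition
\lstinline[language=Altarica]{g |-evt-> e} could be rendered by a
Fiacre process of the following shape; this is a hand-written sketch
with hypothetical names (\code{effect\_e}, \code{assertions}, and the
types \code{State} and \code{Flow}), not the exact output of our
compiler:
\begin{lstlisting}
process Node (&s : State, &f : Flow) is
  states main
  from main
    on g;                 /* guard of the GTS transition */
    s := effect_e (s);    /* effect of the event evt */
    f := assertions (s);  /* re-establish the assertions */
    to main
\end{lstlisting}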
The close proximity between AltaRica and Fiacre is not really
surprising. First of all, both languages have similar roots in process
algebra theory and share very similar synchronisation mechanisms. More
deeply, they share formal models that are very close: AltaRica
semantics is based on the product of ``communicating automata'',
whereas the semantics of Fiacre can be expressed using (a time
extension of) one-safe Petri nets. The main difference is that
AltaRica provides support for defining probabilities on events, whereas
Fiacre is targeted towards the definition of timing aspects. This
proximity in both syntax and semantics is an advantage for the
validation of our tool, because it means that our translation should
preserve the semantics of AltaRica on models that do not use extern
laws to define probabilities and time. We have used this property to
validate our translation by comparing the behaviours of the models
obtained using the Cecilia OCAS simulation tool with those of their translation. For
instance, in the case of the CAS system of
Sect.~\ref{sec:example-fail-detect-1}, we can compute the set of cuts
corresponding to Safety Property~\ref{prop:1} (loss detection) by
checking an invariant of the form
$(((\text{\F{1.O}} = \text{\code{Loss}}) \vee (\text{\F{2.O}} =
\text{\code{Loss}})) \Rightarrow (\text{\F{3.O}} =
\text{\code{Loss}}))$.
%
In both cases---with and without FDI---we are able to compute the
exact same set of cuts as Cecilia OCAS. This is done using the
model-checker for modal mu-calculus provided with Tina, which can list
all the counterexamples for a (reachability) formula as a graph. More
importantly, we can use our approach to compute the timed
counterexample described at the end of
Sect.~\ref{sec:example-fail-detect-1}. All these computations can be
done in less than a second on our test machine.
\subsection{Empirical evaluation}
We have used our toolchain to generate the reachable state space of
several AltaRica models\footnote{All the benchmarks tested in this paper are available at
https://w3.onera.fr/ifa-esa/content/model-checking-temporal-failure-propagation-altarica}:
RUDDER describes a control system for the
rudder of an A340 aircraft~\cite{bernard2007experiments}; ELEC refers
to three simplified electrical generation and power distribution
systems for a hypothetical twin jet aircraft; the HYDRAU model
describes a hydraulic system similar to the one of the A320
aircraft~\cite{bieber2002combination}. The results are reported in
Table~\ref{table:2}. In each case, we indicate the time needed to
generate the whole state space (in seconds) and the number of states
and transitions explored. For information, we also give the number of
state variables as reported by Cecilia OCAS. All tests were run on an
Intel 2.50GHz CPU with 8GB of RAM running Linux. In the case of the
HYDRAU model, we stopped the exploration after $30$ minutes and more
than $9\,10^{9}$ generated states.
The state space is large in this benchmark because the model captures the physical, a-causal propagation of a leak: a leak can impact components both upstream and downstream and trigger a reconfiguration, multiplying the number of reachable states.
Moreover, the topology of the system must be reconfigured after the detection of some failures; this dynamic reconfiguration, combined with a-causal propagation, further increases the size of the state space.
In all cases, the time needed to generate
the Fiacre code is negligible, in the order of 10~ms.
\begin{table}[tb]
\centering
\begin{tabular}{|l|c|c|c|c|}
\hline
\textbf{Model} & \textbf{time} (s) & \textbf{\# states} &
\textbf{\# trans.} & \textbf{\# state vars}\\ \hline
RUDDER~~~ & 0.85 & $3.3\,10^4$ & $2.5\,10^5$ & 15 \\ \hline
ELEC 01 & 0.40 & 512 & $2.3\,10^3$ & 9 \\ \hline
ELEC 02 & 0.40 & 512 & $2.3\,10^3$ & 9 \\ \hline
ELEC 03 & 101 & $4.2\,10^6$ & $4.6\,10^7$ & 22 \\ \hline
HYDRAU & 1800 & --- & --- & 59 \\ \hline
CAS~ & 0.40 & 729 & $2.9\,10^3$ & 6 \\ \hline
CAS with \PRE~~~ & 46 & $9.7\, 10^5$ & $4.3\, 10^6$ & 10 \\ \hline
\end{tabular}
\vspace*{1em}
\caption{State space size and generation time for several use cases.\label{table:2}}
\end{table}
Our models also include two versions of the complete CAS system
(including the detectors, the alarm and the switch); both with and
without the delay functions \D{1} and \D{2}. The ``CAS with \PRE'' model
is our only example that contains timing constraints. In this case, we
give the size of the state class graph generated by Tina, that is an
abstract version of the state space that preserves LTL properties. We
can use Tina to check temporal properties on this
example. More precisely, we can check that \F{3} has the \emph{loss
detection convergence} property. To this end, a solution is to add a
time observer that checks the maximal duration between two events: an
\code{obs\_start} event is triggered when the output of \F{1} or \F{2}
changes to \LOSS; then an \code{obs\_end} event is triggered when the
output of \F{3} changes to \LOSS. The observer also has a third
transition (\code{obs\_err}) that acts as a timeout: it is associated
with a time interval $I$ and is enabled concurrently with
\code{obs\_end}. Hence, the observer ends up in the state reached by
\code{obs\_err} when the output of \F{3} deviates from its expected
value for more than $d$ units of time, with $d \in I$. We have used
this observer to check that the \emph{convergence latency} of the CAS
system equals $3$ when we assume that the delays are in the time
interval $[1, 3]$.
Indeed, \code{obs\_err} is fireable for any value of $d$ in the
interval $[0,3]$, while it is not fireable if $I = ]3, \infty[$.
Both checks can be performed on the system (extended with the
observer) in less than $0.6$~s.
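Such an observer can be sketched as the Fiacre process below, where
the timeout interval $I$ would be attached to the declaration of the
port \code{obs\_err} in the enclosing component; the state names are
ours and this is not the exact code used in our experiments:
\begin{lstlisting}
process Observer [obs_start, obs_end, obs_err : none] is
  states idle, watch, error
  from idle
    obs_start;          /* an input switches to Loss */
    to watch
  from watch
    select
      obs_end; to idle  /* output of F3 switches to Loss in time */
    []
      obs_err; to error /* timeout: convergence latency exceeded */
    end
\end{lstlisting}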
% \begin{wraptable}{r}{0.35\textwidth}
% %\begin{table}[bth]
% \centering
% \vspace*{-7mm}
% \begin{tabular}{|c|c|l}
% \cline{1-2}
% \textbf{delay}$\mathbf\tau$ & \textbf{safety property} & \\ \cline{1-2}
% % {$\quad[4,4]\quad$} & ok & \\ \cline{1-2}
% {$]3, \infty[$} & holds & \\ \cline{1-2}
% {[}3,3{]} & not holds & \\ \cline{1-2}
% {[}0,3{]} & not holds & \\ \cline{1-2}
% \end{tabular}
% \caption{Latencies check on Eq.~\eqref{eq:mc1}
% with $\delta = [1,3]$.\label{table:1}}
% %\end{table}
% \end{wraptable}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End:
%% LocalWords: Fiacre